Systems and methods for content extraction from mark-up language text accessible at an internet domain
First Claim
1. A method performed by a computer processor for automatically classifying a markup language text that is accessible at an Internet domain comprising:
retrieving from one or more data repositories, data associated with the Internet domain;
computing a first identifier for the Internet domain based on at least the data associated with the Internet domain and the markup language text;
computing a measure of similarity between the computed first identifier and each of a first plurality of previously classified identifiers; and
assigning the markup language text a classification based on the computed measure of similarity between the computed first identifier and each of the first plurality of previously classified identifiers,
wherein computing the first identifier comprises computing, for each of a plurality of words in a predetermined set of words, a frequency of each word in the markup language text and the search result, and
wherein the predetermined set of words are generated by:
retrieving the markup language text from the Internet domain;
retrieving search results associated with the Internet domain from one or more search engines;
computing a frequency for each of a plurality of words in the search results and the markup language text; and
adding to the predetermined set of words each of the plurality of words whose frequency is greater than a threshold.
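The steps recited in claim 1 can be illustrated with a minimal sketch. It is not the patented implementation: the claim does not name a similarity measure or a decision rule, so cosine similarity and a nearest-neighbour assignment are assumptions made here for illustration. The function names (`tokenize`, `build_vocabulary`, `compute_identifier`, `classify`), the crude tag stripping, and the use of raw counts as "frequency" are likewise hypothetical, and the search results are simply passed in as strings rather than fetched from a search engine.

```python
import re
from collections import Counter
from math import sqrt


def tokenize(text):
    # Assumption: strip tags with a crude regex and keep lowercase ASCII words.
    text = re.sub(r"<[^>]+>", " ", text)
    return re.findall(r"[a-z]+", text.lower())


def word_counts(markup_text, search_results):
    # Count words across the markup language text and all search results.
    counts = Counter(tokenize(markup_text))
    for result in search_results:
        counts.update(tokenize(result))
    return counts


def build_vocabulary(markup_text, search_results, threshold):
    # Keep each word whose frequency is greater than the threshold.
    counts = word_counts(markup_text, search_results)
    return sorted(w for w, c in counts.items() if c > threshold)


def compute_identifier(vocabulary, markup_text, search_results):
    # The identifier is a vector of frequencies, one entry per vocabulary word.
    counts = word_counts(markup_text, search_results)
    return [counts[w] for w in vocabulary]


def cosine_similarity(a, b):
    # Assumed similarity measure; the claim only requires "a measure of similarity".
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(identifier, classified):
    # classified: list of (identifier, label) pairs that were labelled earlier.
    # Assumed rule: take the label of the most similar classified identifier.
    best = max(classified, key=lambda pair: cosine_similarity(identifier, pair[0]))
    return best[1]


markup = "<html><body><p>Buy shoes. Cheap shoes and boots.</p></body></html>"
results = ["shoes store online", "boots and shoes sale"]

vocabulary = build_vocabulary(markup, results, threshold=1)
identifier = compute_identifier(vocabulary, markup, results)
label = classify(identifier, [([1, 1, 3], "retail"), ([5, 0, 0], "other")])
```

In this sketch the vocabulary is built from the same page that is then classified, purely to keep the example short; in practice the predetermined set of words would presumably be generated ahead of time and reused across domains.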
Abstract
Systems and methods are presented for content extraction from markup language text. The content extraction process may parse markup language text into a hierarchical data model and then apply one or more filters. Output filters may be used to make the process more versatile. The operation of the content extraction process and the one or more filters may be controlled by one or more settings set by a user, or automatically by a classifier. The classifier may automatically enter settings by classifying markup language text and entering settings based on this classification. Automatic classification may be performed by clustering unclassified markup language texts with previously classified markup language texts.
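The extraction pipeline outlined in the abstract (parse into a hierarchical data model, apply filters, produce output under the control of settings) might look roughly like the following sketch. It is an illustration under stated assumptions, not the method of the specification: the `Node` class, the `drop_tags` filter, the `to_text` output filter, and the `settings` dictionary with its `drop_tags` key are all hypothetical names, and the tree is built with Python's standard `html.parser` module rather than any parser named in the source.

```python
from html.parser import HTMLParser


class Node:
    # One element in the hierarchical data model; children are Nodes or strings.
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []


class TreeBuilder(HTMLParser):
    VOID = {"br", "img", "hr", "meta", "link", "input"}  # elements with no end tag

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in self.VOID:
            self.current = node

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this tag, if there is one.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            self.current.children.append(data.strip())


def parse(markup):
    builder = TreeBuilder()
    builder.feed(markup)
    builder.close()
    return builder.root


def drop_tags(node, tags):
    # Filter: remove every element (and its subtree) whose tag is in `tags`.
    node.children = [c for c in node.children
                     if isinstance(c, str) or c.tag not in tags]
    for child in node.children:
        if not isinstance(child, str):
            drop_tags(child, tags)
    return node


def to_text(node):
    # Output filter: flatten the remaining tree to plain text.
    parts = [c if isinstance(c, str) else to_text(c) for c in node.children]
    return " ".join(p for p in parts if p)


def extract(markup, settings):
    # `settings` stands in for values entered by a user or by a classifier.
    tree = drop_tags(parse(markup), settings.get("drop_tags", set()))
    return to_text(tree)


page = ("<html><head><script>var x = 1;</script></head>"
        "<body><nav>Home</nav><p>Hello <b>world</b></p></body></html>")
text = extract(page, {"drop_tags": {"script", "nav"}})
```

A classifier of the kind described in the abstract could supply a different `settings` dictionary per class of page, for example dropping navigation elements on news sites but keeping them elsewhere.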
8 Claims
Claim 1 is independent; claims 2 through 8 depend on it.
Specification