SYSTEMS AND METHODS FOR CONTENT EXTRACTION FROM A MARK-UP LANGUAGE TEXT ACCESSIBLE AT AN INTERNET DOMAIN
First Claim
1. A method for automatically classifying a markup language text that is accessible at an Internet domain comprising:
(a) retrieving from one or more data repositories, data associated with the Internet domain;
(b) computing a first identifier for the Internet domain based on at least the data associated with the Internet domain and the markup language text;
(c) computing a measure of similarity between the computed first identifier and each of a first plurality of previously classified identifiers; and
(d) assigning the markup language text a classification based on the computed measure of similarity between the computed first identifier and each of the first plurality of previously classified identifiers.
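Steps (a) through (d) can be illustrated with a minimal sketch. The claim does not say how the identifier or the similarity measure is computed, so everything concrete here is an assumption made for illustration: the in-memory `DOMAIN_REPOSITORY`, the token-set identifier built from tag names and domain metadata, the Jaccard similarity, and the nearest-match assignment rule are all hypothetical choices, not the patented method.

```python
import re
from typing import Dict, FrozenSet, List, Tuple

# Hypothetical stand-in for the "one or more data repositories" of step (a).
DOMAIN_REPOSITORY: Dict[str, Dict[str, str]] = {
    "news.example": {"category_hint": "news", "cms": "wordpress"},
    "shop.example": {"category_hint": "retail", "cms": "magento"},
}


def retrieve_domain_data(domain: str) -> Dict[str, str]:
    """Step (a): look up data associated with the Internet domain."""
    return DOMAIN_REPOSITORY.get(domain, {})


def compute_identifier(domain_data: Dict[str, str], markup: str) -> FrozenSet[str]:
    """Step (b): build an identifier from the domain data and the markup text.

    Assumption: the identifier is a set of tokens, namely the opening tag
    names found in the markup plus key:value pairs from the domain data.
    """
    tags = {
        f"tag:{name.lower()}"
        for name in re.findall(r"<\s*([A-Za-z][A-Za-z0-9]*)", markup)
    }
    meta = {f"{key}:{value}" for key, value in domain_data.items()}
    return frozenset(tags | meta)


def jaccard(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """Step (c): one possible similarity measure between two identifiers."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def classify(
    identifier: FrozenSet[str],
    classified: List[Tuple[FrozenSet[str], str]],
) -> str:
    """Step (d): assign the label of the most similar classified identifier."""
    if not classified:
        raise ValueError("need at least one previously classified identifier")
    best_identifier, best_label = max(
        classified, key=lambda pair: jaccard(identifier, pair[0])
    )
    return best_label


# Usage: classify one page against two previously classified identifiers.
markup = "<html><body><article><h1>Title</h1><p>text</p></article></body></html>"
identifier = compute_identifier(retrieve_domain_data("news.example"), markup)
previously_classified = [
    (frozenset({"tag:html", "tag:body", "tag:article", "tag:h1", "tag:p",
                "category_hint:news"}), "article"),
    (frozenset({"tag:html", "tag:body", "tag:form", "tag:input",
                "category_hint:retail"}), "checkout"),
]
label = classify(identifier, previously_classified)
```

A nearest-match rule is only one way to base the classification on the similarity scores; a threshold or a vote among several close matches would fit the claim language equally well.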
Abstract
Systems and methods are presented for content extraction from markup language text. The content extraction process may parse markup language text into a hierarchical data model and then apply one or more filters. Output filters may be used to make the process more versatile. The operation of the content extraction process and the one or more filters may be controlled by one or more settings set by a user, or automatically by a classifier. The classifier may automatically enter settings by classifying markup language text and entering settings based on this classification. Automatic classification may be performed by clustering unclassified markup language texts with previously classified markup language texts.
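The pipeline described in the abstract (parse into a hierarchical data model, apply filters, then an output filter, all driven by settings) might look roughly like the following sketch. It assumes HTML as the markup language and uses Python's standard `html.parser`; the `Node` class, `drop_tags`, `to_text`, and the shape of the `settings` dictionary are hypothetical names introduced here and do not come from the patent.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser
from typing import Callable, Dict, List, Set

# Elements that never have a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


@dataclass
class Node:
    """One node of the hierarchical data model (assumed structure)."""
    tag: str
    text: str = ""
    children: List["Node"] = field(default_factory=list)


class TreeBuilder(HTMLParser):
    """Builds a Node tree from markup text."""

    def __init__(self) -> None:
        super().__init__()
        self.root = Node("root")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the matching open element; ignore unmatched end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if data.strip():
            self.stack[-1].children.append(Node("#text", data.strip()))


def parse(markup: str) -> Node:
    builder = TreeBuilder()
    builder.feed(markup)
    builder.close()
    return builder.root


def drop_tags(names: Set[str]) -> Callable[[Node], Node]:
    """Filter: return a copy of the tree without the named elements."""
    def apply(node: Node) -> Node:
        kept = [apply(child) for child in node.children if child.tag not in names]
        return Node(node.tag, node.text, kept)
    return apply


def to_text(node: Node) -> str:
    """Output filter: flatten the tree to space-separated plain text."""
    if node.tag == "#text":
        return node.text
    parts = (to_text(child) for child in node.children)
    return " ".join(part for part in parts if part)


def extract(markup: str, settings: Dict) -> str:
    """Run the pipeline under the given settings."""
    tree = parse(markup)
    for tree_filter in settings["filters"]:
        tree = tree_filter(tree)
    return settings["output"](tree)


# Usage: settings entered by hand; a classifier could supply them instead.
page = ("<html><body><script>var x=1;</script><nav>Home</nav>"
        "<p>Hello <b>world</b></p></body></html>")
settings = {"filters": [drop_tags({"script", "style", "nav"})], "output": to_text}
result = extract(page, settings)
```

Keeping the filters and the output step as plain callables in a settings dictionary means a user or an automatic classifier can swap them per document class without touching the parser.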
10 Claims
Claim 1 is independent; claims 2 through 10 depend on claim 1.
Specification