Systems and methods for content extraction
Abstract
Systems and methods are presented for content extraction from markup language text. The content extraction process may parse markup language text into a hierarchical data model and then apply one or more filters. Output filters may be used to make the process more versatile. The operation of the content extraction process and the one or more filters may be controlled by one or more settings set by a user, or automatically by a classifier. The classifier may automatically enter settings by classifying markup language text and entering settings based on this classification. Automatic classification may be performed by clustering unclassified markup language texts with previously classified markup language texts.
33 Claims
1. A method for extracting content from input markup language text comprising:
(a) parsing the input markup language text into a first hierarchical data model;
(b) generating a second hierarchical data model based on the first hierarchical data model using one or more filters to remove content from the first hierarchical data model; and
(c) generating output markup language text from the second hierarchical data model.
Dependent claims: 2-15
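Steps (a) through (c) of claim 1 can be illustrated with a minimal sketch. It is not the patented implementation. It assumes well-formed (XHTML-style) input so that Python's standard `xml.etree.ElementTree` can act as the hierarchical data model, and the names `drop_elements`, `is_script_or_style`, `is_advertisement`, and `extract_content` are hypothetical, as is the convention that advertisements carry the class `ad`.

```python
import copy
import xml.etree.ElementTree as ET


def drop_elements(root, should_drop):
    """Remove every descendant of `root` for which should_drop(el) is true.

    Text that follows a removed element (its tail) is kept by attaching it
    to the previous sibling or, if there is none, to the parent.
    """
    for parent in list(root.iter()):
        previous = None
        for child in list(parent):
            if should_drop(child):
                tail = child.tail or ""
                if previous is None:
                    parent.text = (parent.text or "") + tail
                else:
                    previous.tail = (previous.tail or "") + tail
                parent.remove(child)
            else:
                previous = child


# Two illustrative filters (assumptions, not taken from the patent text).
def is_script_or_style(el):
    return el.tag in {"script", "style"}


def is_advertisement(el):
    return "ad" in el.get("class", "").split()


def extract_content(markup, filters):
    first_model = ET.fromstring(markup)          # (a) parse into a first model
    second_model = copy.deepcopy(first_model)    # (b) derive a second model...
    for should_drop in filters:
        drop_elements(second_model, should_drop)  # ...by filtering content out
    return ET.tostring(second_model, encoding="unicode")  # (c) serialize


SAMPLE = (
    "<html><body><script>track()</script>"
    '<div class="ad">Buy now</div><p>Story text</p></body></html>'
)
print(extract_content(SAMPLE, [is_script_or_style, is_advertisement]))
```

Copying the first tree before filtering keeps the parsed model intact, which mirrors the claim's distinction between a first and a second hierarchical data model. Real-world HTML is often not well-formed, so a tolerant parser would be needed in practice.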
16. A method for automatically classifying a markup language text that is accessible at an Internet domain comprising:
(a) retrieving from one or more data repositories, data associated with the Internet domain;
(b) computing a first identifier for the Internet domain based on at least the data associated with the Internet domain and the markup language text;
(c) computing a measure of similarity between the computed first identifier and each of a first plurality of previously classified identifiers; and
(d) assigning the markup language text a classification based on the computed measure of similarity between the computed first identifier and each of the first plurality of previously classified identifiers.
Dependent claims: 17-25
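The classification steps of claim 16 can be sketched as follows. The claim does not specify how the identifier or the similarity measure is computed, so this example makes several assumptions: the identifier is a bag-of-words count, similarity is cosine similarity, and the classification is taken from the single most similar previously classified identifier (a nearest-neighbor rule). The repository `DOMAIN_DATA`, the domain `example-news.test`, and all function names are hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical data repository: descriptive data keyed by Internet domain.
DOMAIN_DATA = {"example-news.test": "news headlines politics sports"}


def compute_identifier(domain_data, markup_text):
    """(b) Build an identifier from domain data plus the markup's visible text."""
    text = re.sub(r"<[^>]+>", " ", markup_text)  # crude tag stripping
    words = re.findall(r"[a-z]+", (domain_data + " " + text).lower())
    return Counter(words)


def cosine_similarity(a, b):
    """(c) Similarity between two bag-of-words identifiers, in [0, 1]."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(identifier, classified):
    """(d) Assign the label of the most similar classified identifier."""
    best = max(classified, key=lambda pair: cosine_similarity(identifier, pair[0]))
    return best[1]


# Previously classified identifiers (made-up illustration data).
KNOWN = [
    (Counter("news headlines sports politics".split()), "news"),
    (Counter("shop cart price checkout".split()), "shopping"),
]

page = "<html><body><h1>Sports headlines</h1><p>Local team wins</p></body></html>"
data = DOMAIN_DATA["example-news.test"]           # (a) retrieve domain data
identifier = compute_identifier(data, page)
print(classify(identifier, KNOWN))
```

A nearest-neighbor rule is only one way to turn similarity scores into a classification. The abstract mentions clustering with previously classified texts, which this sketch does not attempt.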
26. A system for extracting content from markup language text comprising:
(a) a parser for converting the markup language text into a hierarchical data model;
(b) one or more filters for removing content from the hierarchical data model; and
(c) a data repository containing settings for each of the one or more filters.
Dependent claims: 27-33
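The three components of claim 26 can be pictured with a small sketch in which a settings repository decides which filters run. The JSON layout, the filter names `script` and `img`, and the function `run_system` are all assumptions made for illustration. The parser is again Python's `xml.etree.ElementTree`, which requires well-formed input.

```python
import json
import xml.etree.ElementTree as ET

# (c) Hypothetical data repository: one settings entry per filter.
SETTINGS_JSON = '{"script": {"enabled": true}, "img": {"enabled": false}}'

# (b) Hypothetical filters, each defined by the set of tags it removes.
FILTER_TAGS = {
    "script": {"script", "style"},
    "img": {"img"},
}


def run_system(markup, settings):
    root = ET.fromstring(markup)  # (a) parser builds the hierarchical model
    for name, tags in FILTER_TAGS.items():
        # A filter runs only if its settings entry enables it.
        if not settings.get(name, {}).get("enabled", False):
            continue
        for parent in list(root.iter()):
            for child in list(parent):
                if child.tag in tags:
                    # Note: text trailing the removed element is dropped
                    # in this simplified sketch.
                    parent.remove(child)
    return ET.tostring(root, encoding="unicode")


settings = json.loads(SETTINGS_JSON)
MARKUP = '<div><script>x()</script><img src="a.png" /><p>Text</p></div>'
print(run_system(MARKUP, settings))
```

Keeping the settings outside the filter code means the same parser and filters can behave differently per user or per page class, which is the role the abstract assigns to user-supplied or classifier-supplied settings.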
Specification