System and method for automatically detecting and extracting semantically significant text from a HTML document associated with a plurality of HTML documents
Abstract
A system and method for automatically detecting and extracting semantically significant text from a HTML document associated with a plurality of HTML documents is disclosed. The method may include receiving a HTML document, parsing the HTML document into a parse tree, segmenting the parse tree into one or more segments of one or more unique paths, processing the one or more segments based at least on the HTML document, and extracting one or more processed segments from at least the HTML document based on a predetermined number.
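The parsing and segmenting steps named in the abstract can be illustrated with a minimal sketch. It assumes Python's standard html.parser module as the parser and represents each segment as a (path, text) pair, where the path is the chain of tag names from the root down to a text node. The names PathSegmenter, segment_paths, and VOID_TAGS are hypothetical and do not come from the patent, which does not prescribe a parser or a path encoding.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class PathSegmenter(HTMLParser):
    """Collects one (path, text) segment per non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []      # open tags from the root to the current position
        self.segments = []   # list of (path, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Tolerate sloppy markup: close everything up to the matching open tag.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.segments.append(("/".join(self.stack), text))


def segment_paths(html):
    """Return the root-to-text-node segments of an HTML string."""
    parser = PathSegmenter()
    parser.feed(html)
    parser.close()
    return parser.segments


segments = segment_paths(
    "<html><body><h1>Title</h1><p>Hello <b>world</b></p></body></html>"
)
for path, text in segments:
    print(path, "->", text)
```

Encoding a path as a slash-joined string of tag names is only one possible choice; a fuller implementation might also record attributes or sibling positions to distinguish paths more finely.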
16 Claims
1. A computer-implemented method for automatically detecting and extracting semantically significant text from a hypertext markup language (HTML) document comprising:
receiving a HTML document, wherein the HTML document is in a set of HTML documents;
parsing the HTML document into a parse tree comprising a plurality of sub-trees;
creating a trimmed parse tree by removing at least one of the plurality of sub-trees from the parse tree based on the at least one of the plurality of sub-trees containing an amount of link text that exceeds a predetermined threshold, wherein the amount of link text is associated with a ratio of a total length of link text amount to a total length of text amount, and wherein the ratio is greater than zero and less than one;
segmenting the trimmed parse tree into a plurality of segments of unique paths, wherein each unique path begins at a root node and ends at a text node;
processing the plurality of segments based on the set of HTML documents to determine a document frequency for each of the plurality of segments; and
removing each processed segment that is associated with a corresponding document frequency that exceeds a predetermined number from the set of HTML documents.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9
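The trimming step of claim 1 can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: link text is taken to be any text inside an a element, the threshold of 0.5 is a hypothetical value chosen for the example, and a elements themselves are never tested so that an inline link inside a paragraph is not discarded on its own. The names Node, TreeBuilder, text_lengths, and trim are likewise hypothetical.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser

VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


@dataclass
class Node:
    tag: str
    children: list = field(default_factory=list)  # Node objects or str text


class TreeBuilder(HTMLParser):
    """Builds a simple parse tree under a synthetic '#root' node."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close the nearest matching open element; never pop the root.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if data.strip():
            self.stack[-1].children.append(data.strip())


def text_lengths(node, in_link=False):
    """Return (link_text_length, total_text_length) for a sub-tree."""
    in_link = in_link or node.tag == "a"
    link = total = 0
    for child in node.children:
        if isinstance(child, str):
            total += len(child)
            link += len(child) if in_link else 0
        else:
            sub_link, sub_total = text_lengths(child, in_link)
            link += sub_link
            total += sub_total
    return link, total


def trim(node, threshold=0.5):
    """Drop child sub-trees whose link-text ratio exceeds the threshold."""
    kept = []
    for child in node.children:
        if isinstance(child, Node):
            link, total = text_lengths(child)
            # Assumption: 'a' elements are not candidates for removal.
            if child.tag != "a" and total and link / total > threshold:
                continue
            trim(child, threshold)
        kept.append(child)
    node.children = kept
    return node


builder = TreeBuilder()
builder.feed('<body><div><a href="/a">Home</a> <a href="/b">News</a> |</div>'
             '<p>Long article text with one <a href="/c">link</a> inside.</p></body>')
builder.close()
tree = trim(builder.root)
body = tree.children[0]
print([child.tag for child in body.children])
```

In this example the navigation-style div holds 8 characters of link text out of 9, so its ratio exceeds the assumed threshold and the sub-tree is removed, while the paragraph, with 4 link characters out of 37, is kept.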
10. A system for automatically detecting and extracting semantically significant text from a hypertext markup language (HTML) document comprising:
a parser computing apparatus to receive a HTML document, parse the HTML document into a parse tree comprising a plurality of sub-trees, and create a trimmed parse tree by removing at least one of the plurality of sub-trees from the parse tree based on the at least one of the plurality of sub-trees containing an amount of link text that exceeds a predetermined threshold, wherein the HTML document is in a set of HTML documents, wherein the amount of link text is associated with a ratio of a total length of link text amount to a total length of text amount, and wherein the ratio is greater than zero and less than one;
a segmenter computing apparatus to segment the trimmed parse tree into a plurality of segments of unique paths, wherein each unique path begins at a root node and ends at a text node;
a processor computing apparatus to process the plurality of segments based on the set of HTML documents to determine a document frequency for each of the plurality of segments; and
an extractor computing apparatus to remove each processed segment that is associated with a corresponding document frequency that exceeds a predetermined number from the set of HTML documents.
Dependent claims: 11, 12, 13, 14, 15
16. A computer-accessible medium encoded with computer program code effective to perform the following:
receive a HTML document, wherein the HTML document is in a set of HTML documents;
parse the HTML document into a parse tree comprising a plurality of sub-trees;
create a trimmed parse tree by removing at least one of the plurality of sub-trees from the parse tree based on the at least one of the plurality of sub-trees containing an amount of link text that exceeds a predetermined threshold, wherein the amount of link text is associated with a ratio of a total length of link text amount to a total length of text amount, and wherein the ratio is greater than zero and less than one;
segment the trimmed parse tree into a plurality of segments of unique paths, wherein each unique path begins at a root node and ends at a text node;
process the plurality of segments based on the set of HTML documents to determine a document frequency for each of the plurality of segments; and
remove each processed segment that is associated with a corresponding document frequency that exceeds a predetermined number from the set of HTML documents.
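The final two steps recited in the independent claims, determining a document frequency for each segment and removing segments whose frequency exceeds a predetermined number, can be sketched as below. The sketch assumes each document has already been reduced to a list of (path, text) segments and that two segments are the same when both path and text match; the claims do not state how segments are compared, so that equality rule, the function names, and the example value of max_df are all hypothetical.

```python
from collections import Counter


def document_frequency(docs_segments):
    """Count, for every segment, how many documents contain it."""
    df = Counter()
    for segments in docs_segments:
        df.update(set(segments))  # each segment counts once per document
    return df


def remove_frequent(docs_segments, max_df):
    """Keep only segments that occur in at most max_df documents."""
    df = document_frequency(docs_segments)
    return [[s for s in segments if df[s] <= max_df]
            for segments in docs_segments]


# Three documents from the same hypothetical site: the footer repeats,
# the article text does not.
docs = [
    [("html/body/p", "First story."), ("html/body/div", "Copyright Example")],
    [("html/body/p", "Second story."), ("html/body/div", "Copyright Example")],
    [("html/body/p", "Third story."), ("html/body/div", "Copyright Example")],
]
cleaned = remove_frequent(docs, max_df=1)
print(cleaned[0])
```

Segments repeated across many documents of a set tend to be templates such as headers, footers, and navigation, which is why a high document frequency is treated here as a signal that the text is not semantically significant for any single document.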
Specification