System and method for automatically detecting and extracting semantically significant text from a HTML document associated with a plurality of HTML documents

US 8,051,372 B1
Filed: 04/12/2007
Issued: 11/01/2011
Est. Priority Date: 04/12/2007
Status: Active Grant

Abstract

A system and method for automatically detecting and extracting semantically significant text from a HTML document associated with a plurality of HTML documents is disclosed. The method may include receiving a HTML document, parsing the HTML document into a parse tree, segmenting the parse tree into one or more segments of one or more unique paths, processing the one or more segments based at least on the HTML document, and extracting one or more processed segments from at least the HTML document based on a predetermined number.
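
The segmenting step described in the abstract can be illustrated with a minimal sketch, assuming Python's standard `html.parser` module stands in for the parser. The names `PathSegmenter` and `segment` are hypothetical and do not come from the patent. The source does not say whether a segment is identified by its path alone or by its path together with its text, so this sketch simply records both for each text node.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class PathSegmenter(HTMLParser):
    """Collects one (path, text) pair per non-empty text node (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.stack = []     # open tags from the root down to the current node
        self.segments = []  # list of (path, text) tuples

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            # The path starts at the root element and ends at this text node.
            self.segments.append(("/".join(self.stack), text))


def segment(html):
    """Return the root-to-text-node paths of an HTML string, in document order."""
    parser = PathSegmenter()
    parser.feed(html)
    parser.close()
    return parser.segments
```

Calling `segment` on a small page yields one entry per text node, each labeled with the chain of tags leading to it. A production parser would also need to handle malformed markup, scripts, and style blocks, which this sketch ignores.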

16 Claims

1. A computer-implemented method for automatically detecting and extracting semantically significant text from a hypertext markup language (HTML) document comprising:
- receiving a HTML document, wherein the HTML document is in a set of HTML documents;
  
  parsing the HTML document into a parse tree comprising a plurality of sub-trees;
  
  creating a trimmed parse tree by removing at least one of the plurality of sub-trees from the parse tree based on the at least one of the plurality of sub-trees containing an amount of link text that exceeds a predetermined threshold, wherein the amount of link text is associated with a ratio of a total length of link text amount to a total length of text amount, and wherein the ratio is greater than zero and less than one;
  
  segmenting the trimmed parse tree into a plurality of segments of unique paths, wherein each unique path begins at a root node and ends at a text node;
  
  processing the plurality of segments based on the set of HTML documents to determine a document frequency for each of the plurality of segments; and
  
  removing each processed segment that is associated with a corresponding document frequency that exceeds a predetermined number from the set of HTML documents.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9):
  - 2. The method according to claim 1, wherein the HTML document comprises a web page.
  - 3. The method according to claim 1, wherein parsing the HTML document into the parse tree comprises parsing source text associated with the HTML document.
  - 4. The method according to claim 3, wherein source text comprises HTML tags.
  - 5. The method according to claim 1, wherein the parse tree comprises one or more root nodes, one or more branch nodes, and one or more leaf nodes.
  - 6. The method according to claim 1, wherein the at least one of the plurality of sub-trees removed is associated with at least one of a division HTML tag and a table HTML tag.
  - 7. The method according to claim 1, wherein removing each processed segment comprises extracting template text.
  - 8. The method according to claim 7, wherein template text comprises one or more headers, one or more footers, one or more navigations, or one or more advertisements.
  - 9. The method according to claim 1, wherein the predetermined number comprises a specified minimum frequency occurrence of the respective processed segments.
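
The trimming step of claim 1, together with the division and table tags named in claim 6, can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the `Node`, `TreeBuilder`, `parse`, `text_lengths`, `trim`, and `all_text` names are hypothetical, the 0.5 threshold is an arbitrary example value, and measuring text length in characters is one possible reading of "total length".

```python
from html.parser import HTMLParser

VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []  # child Node objects and str text nodes


class TreeBuilder(HTMLParser):
    """Builds a very small parse tree (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Walk up to the nearest open element with this tag, if any.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            self.current.children.append(data.strip())


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def text_lengths(node, in_link=False):
    """Return (link_text_length, total_text_length) in characters for a sub-tree."""
    in_link = in_link or node.tag == "a"
    link = total = 0
    for child in node.children:
        if isinstance(child, str):
            total += len(child)
            link += len(child) if in_link else 0
        else:
            sub_link, sub_total = text_lengths(child, in_link)
            link, total = link + sub_link, total + sub_total
    return link, total


def trim(node, threshold=0.5, candidates=("div", "table")):
    """Drop candidate sub-trees whose link-text ratio exceeds the threshold."""
    kept = []
    for child in node.children:
        if isinstance(child, Node):
            link, total = text_lengths(child)
            if child.tag in candidates and total and link / total > threshold:
                continue  # mostly link text: treat as navigation and remove
            trim(child, threshold, candidates)
        kept.append(child)
    node.children = kept
    return node


def all_text(node):
    """Join the text nodes that remain under `node`, in document order."""
    parts = [c if isinstance(c, str) else all_text(c) for c in node.children]
    return " ".join(p for p in parts if p)
```

With these assumptions, a `div` holding only navigation links has a ratio of 1.0 and is removed, while an article `div` with a single inline link has a low ratio and is kept.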

10. A system for automatically detecting and extracting semantically significant text from a hypertext markup language (HTML) document comprising:
- a parser computing apparatus to receive a HTML document, parse the HTML document into a parse tree comprising a plurality of sub-trees, and create a trimmed parse tree by removing at least one of the plurality of sub-trees from the parse tree based on the at least one of the plurality of sub-trees containing an amount of link text that exceeds a predetermined threshold, wherein the HTML document is in a set of HTML documents, wherein the amount of link text is associated with a ratio of a total length of link text amount to a total length of text amount, and wherein the ratio is greater than zero and less than one;
  
  a segmenter computing apparatus to segment the trimmed parse tree into a plurality of segments of unique paths, wherein each unique path begins at a root node and ends at a text node;
  
  a processor computing apparatus to process the plurality of segments based on the set of HTML documents to determine a document frequency for each of the plurality of segments; and
  
  an extractor computing apparatus to remove each processed segment that is associated with a corresponding document frequency that exceeds a predetermined number from the set of HTML documents.
- Dependent claims (11, 12, 13, 14, 15):
  - 11. The system according to claim 10, wherein to parse the HTML document into the parse tree, the parse module parses source text associated with the HTML document.
  - 12. The system according to claim 10, wherein the parse tree comprises one or more root nodes, one or more branch nodes, and one or more leaf nodes.
  - 13. The system according to claim 10, wherein the at least one of the plurality of sub-trees removed is associated with at least one of a division HTML tag and a table HTML tag.
  - 14. The system according to claim 10, wherein the extractor computing apparatus is further configured to remove each processed segment by removing template text.
  - 15. The system according to claim 10, wherein the predetermined number comprises a specified minimum frequency occurrence of the respective processed segments.
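
The work attributed to the processor and extractor apparatus, counting a document frequency per segment and removing segments whose frequency exceeds a predetermined number, might look like the sketch below. It assumes each document has already been reduced to a list of hashable segments, for example (path, text) pairs. The function names and the `max_df` parameter are hypothetical labels for the "predetermined number" in the claims.

```python
from collections import Counter


def document_frequency(docs):
    """Count, for each segment, how many documents contain it at least once."""
    df = Counter()
    for segments in docs:
        df.update(set(segments))  # set(): count a segment once per document
    return df


def remove_frequent(docs, max_df):
    """Keep only segments whose document frequency does not exceed max_df."""
    df = document_frequency(docs)
    return [[s for s in segments if df[s] <= max_df] for segments in docs]
```

Segments that recur across many pages of a site, such as a masthead or footer, accumulate a high document frequency and are filtered out, while text that appears on only one page survives.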

16. A computer-accessible medium encoded with computer program code effective to perform the following:
- receive a HTML document, wherein the HTML document is in a set of HTML documents;
  
  parse the HTML document into a parse tree comprising a plurality of sub-trees;
  
  create a trimmed parse tree by removing at least one of the plurality of sub-trees from the parse tree based on the at least one of the plurality of sub-trees containing an amount of link text that exceeds a predetermined threshold, wherein the amount of link text is associated with a ratio of a total length of link text amount to a total length of text amount, and wherein the ratio is greater than zero and less than one;
  
  segment the trimmed parse tree into a plurality of segments of unique paths, wherein each unique path begins at a root node and ends at a text node;
  
  process the plurality of segments based on the set of HTML documents to determine a document frequency for each of the plurality of segments; and
  
  remove each processed segment that is associated with a corresponding document frequency that exceeds a predetermined number from the set of HTML documents.
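
The two numeric tests recited in the independent claims can be written compactly. The notation below is an assumption introduced for illustration and does not appear in the patent: $s$ is a sub-tree, $\ell_{\text{link}}(s)$ and $\ell_{\text{text}}(s)$ are its link-text length and total text length, $\tau$ is the predetermined threshold, $D$ is the set of HTML documents, $P(d)$ is the set of segments of document $d$, and $n$ is the predetermined number.

```latex
\[
r(s) = \frac{\ell_{\text{link}}(s)}{\ell_{\text{text}}(s)}, \qquad
0 < r(s) < 1, \qquad
\text{remove } s \text{ if } r(s) > \tau
\]
\[
\mathrm{df}(p) = \bigl|\{\, d \in D : p \in P(d) \,\}\bigr|, \qquad
\text{remove } p \text{ if } \mathrm{df}(p) > n
\]
```

As a purely illustrative calculation, a sub-tree with 120 characters of link text out of 150 characters in total gives $r = 120/150 = 0.8$, which would exceed an example threshold of $\tau = 0.5$ and lead to removal.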


Current Assignee: New York Times Company
Original Assignee: New York Times Company
Inventors: Sandhaus, Evan Stapleton
Primary Examiner(s): Hong; Stephen S.
Assistant Examiner(s): Nazar; Ahamed I
Application Number: US11/734,467
Time in Patent Office: 1,664 Days
Field of Search: 715/234, 715/853, 715/243, 707/999.104, 707/E17.123
US Class Current: 715/234
CPC Class Codes:
G06F 16/81 Indexing, e.g. XML tags; Da...
Y10S 707/99945 Object-oriented database st...
