
System and method for automatically detecting and extracting semantically significant text from a HTML document associated with a plurality of HTML documents

  • US 8,051,372 B1
  • Filed: 04/12/2007
  • Issued: 11/01/2011
  • Est. Priority Date: 04/12/2007
  • Status: Active Grant
First Claim

1. A computer-implemented method for automatically detecting and extracting semantically significant text from a hypertext markup language (HTML) document comprising:

  • receiving a HTML document, wherein the HTML document is in a set of HTML documents;

  • parsing the HTML document into a parse tree comprising a plurality of sub-trees;

  • creating a trimmed parse tree by removing at least one of the plurality of sub-trees from the parse tree based on the at least one of the plurality of sub-trees containing an amount of link text that exceeds a predetermined threshold, wherein the amount of link text is associated with a ratio of a total length of link text amount to a total length of text amount, and wherein the ratio is greater than zero and less than one;

  • segmenting the trimmed parse tree into a plurality of segments of unique paths, wherein each unique path begins at a root node and ends at a text node;

  • processing the plurality of segments based on the set of HTML documents to determine a document frequency for each of the plurality of segments; and

  • removing each processed segment that is associated with a corresponding document frequency that exceeds a predetermined number from the set of HTML documents.
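
The steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation; it is one possible reading of the claim under stated assumptions. All names (`Node`, `TreeBuilder`, `text_lengths`, `trim`, `segments`, `extract`) and the default values `link_threshold=0.5` and `max_df=1` are hypothetical. The sketch also assumes that a segment is identified by its tag path plus its text, and that the bound "greater than zero and less than one" applies to the link-text ratio of a removed sub-tree.

```python
from collections import Counter
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not become "current".
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class Node:
    def __init__(self, tag, parent=None, text=""):
        self.tag, self.parent, self.text, self.children = tag, parent, text, []


class TreeBuilder(HTMLParser):
    """Builds a simple parse tree; text nodes use the pseudo-tag '#text'."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Climb to the nearest matching open element; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            self.current.children.append(Node("#text", self.current, data.strip()))


def text_lengths(node, in_link=False):
    """Return (length of text inside <a> elements, length of all text)."""
    if node.tag == "#text":
        n = len(node.text)
        return (n if in_link else 0), n
    in_link = in_link or node.tag == "a"
    link = total = 0
    for child in node.children:
        child_link, child_total = text_lengths(child, in_link)
        link += child_link
        total += child_total
    return link, total


def trim(node, threshold):
    """Drop element sub-trees whose link-text ratio exceeds the threshold."""
    kept = []
    for child in node.children:
        link, total = text_lengths(child)
        ratio = link / total if total else 0.0
        # Assumed reading of the claim: threshold < ratio < 1.
        if child.tag != "#text" and threshold < ratio < 1.0:
            continue
        trim(child, threshold)
        kept.append(child)
    node.children = kept


def segments(node, path=()):
    """List (tag path from root, text) pairs, one per text node."""
    if node.tag == "#text":
        return [("/".join(path), node.text)]
    found = []
    for child in node.children:
        found.extend(segments(child, path + (node.tag,)))
    return found


def extract(documents, link_threshold=0.5, max_df=1):
    """Return, per document, the texts of segments seen in at most max_df documents."""
    per_doc = []
    for html in documents:
        builder = TreeBuilder()
        builder.feed(html)
        builder.close()
        trim(builder.root, link_threshold)
        # dict.fromkeys keeps document order while making segments unique.
        per_doc.append(list(dict.fromkeys(segments(builder.root))))
    doc_freq = Counter(seg for segs in per_doc for seg in segs)
    return [[text for path, text in segs if doc_freq[(path, text)] <= max_df]
            for segs in per_doc]


if __name__ == "__main__":
    nav = "<div>Menu <a href='/'>Home page</a> <a href='/about'>About this site</a></div>"
    footer = "<p>Copyright Example Corp</p>"
    pages = [
        f"<html><body>{nav}<p>Parse trees make boilerplate removal tractable.</p>{footer}</body></html>",
        f"<html><body>{nav}<p>Document frequency separates templates from content.</p>{footer}</body></html>",
    ]
    for texts in extract(pages):
        print(texts)
```

In this sketch the navigation block is dropped by the link-ratio trim, and the footer, which appears under the same path in both pages, is dropped by the document-frequency filter. Both thresholds are illustrative values chosen for the example, not values taken from the patent.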
