Method and system for classifying display pages using summaries
First Claim
1. A method in a computer system for classifying web pages, the method comprising:
retrieving a web page;
automatically generating a summary of the retrieved web page by
identifying objects of the web page, the objects having sentences;
building a term frequency by inverted document frequency index for each object;
calculating similarity between pairs of objects based on the term frequency by inverted document frequency indexes of the objects;
when the calculated similarity between a pair of objects satisfies a similarity threshold, linking the pair of objects to indicate that the objects satisfy the threshold;
selecting as a core object of the web page the object that has the most links;
assigning high scores to sentences of the core object and to objects with links to the core object and low scores to all other sentences;
selecting sentences to form the summary of the web page based on the assigned scores; and
determining a classification for the retrieved web page based on the automatically generated summary.
Abstract
A method and system for classifying display pages based on automatically generated summaries of display pages. A web page classification system uses a web page summarization system to generate summaries of web pages. The summary of a web page may include the sentences of the web page that are most closely related to the primary topic of the web page. The summarization system may combine the benefits of multiple summarization techniques to identify the sentences of a web page that represent the primary topic of the web page. Once the summary is generated, the classification system may apply conventional classification techniques to the summary to classify the web page. The classification system may use conventional classification techniques such as a Naïve Bayesian classifier or a support vector machine to identify the classifications of a web page based on the summary generated by the summarization system.
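The classification step described in the abstract can be illustrated with a minimal sketch of a multinomial Naive Bayes classifier applied to summaries rather than full pages. Everything concrete here is an assumption made for illustration: the `NaiveBayes` class, the tokenizer, the add-one smoothing, and the toy training summaries and labels are hypothetical and do not come from the patent.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms (assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing (illustrative only)."""

    def __init__(self):
        self.class_docs = Counter()              # number of training summaries per class
        self.term_counts = defaultdict(Counter)  # term counts per class
        self.vocab = set()

    def fit(self, summaries, labels):
        for text, label in zip(summaries, labels):
            self.class_docs[label] += 1
            for term in tokenize(text):
                self.term_counts[label][term] += 1
                self.vocab.add(term)
        return self

    def predict(self, summary):
        total_docs = sum(self.class_docs.values())
        best_label, best_logp = None, -math.inf
        for label, n_docs in self.class_docs.items():
            logp = math.log(n_docs / total_docs)  # class prior
            denom = sum(self.term_counts[label].values()) + len(self.vocab)
            for term in tokenize(summary):
                if term in self.vocab:  # terms never seen in training are ignored
                    logp += math.log((self.term_counts[label][term] + 1) / denom)
            if logp > best_logp:
                best_label, best_logp = label, logp
        return best_label


# Hypothetical training summaries and classifications.
train_summaries = [
    "the team won the match with a late goal",
    "stock prices fell as the market closed",
    "the striker scored twice in the final",
    "investors sold shares after the earnings report",
]
train_labels = ["sports", "finance", "sports", "finance"]

model = NaiveBayes().fit(train_summaries, train_labels)
print(model.predict("the goal decided the match"))
```

A support vector machine, the other classifier the abstract names, could be substituted at the same point; only the summary text is passed to the classifier in either case.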
42 Claims
1. A method in a computer system for classifying web pages, the method comprising:
retrieving a web page;
automatically generating a summary of the retrieved web page by
  identifying objects of the web page, the objects having sentences;
  building a term frequency by inverted document frequency index for each object;
  calculating similarity between pairs of objects based on the term frequency by inverted document frequency indexes of the objects;
  when the calculated similarity between a pair of objects satisfies a similarity threshold, linking the pair of objects to indicate that the objects satisfy the threshold;
  selecting as a core object of the web page the object that has the most links;
  assigning high scores to sentences of the core object and to objects with links to the core object and low scores to all other sentences;
  selecting sentences to form the summary of the web page based on the assigned scores; and
determining a classification for the retrieved web page based on the automatically generated summary.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
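The summarization steps recited in claim 1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the claim does not fix a similarity measure, an IDF formula, a threshold value, or concrete score values, so cosine similarity, `ln(N / df)` computed over the objects of a single page, a threshold of 0.1, and scores of 1.0 and 0.0 are all choices made here. The function names and the sample page are hypothetical.

```python
import math
import re
from collections import Counter
from itertools import combinations


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms (assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(objects):
    """Build one {term: tf * idf} mapping per object; each object is a list of sentences."""
    counts = [Counter(t for sentence in obj for t in tokenize(sentence)) for obj in objects]
    n = len(objects)
    doc_freq = Counter(term for c in counts for term in c)  # objects containing each term
    return [{t: tf * math.log(n / doc_freq[t]) for t, tf in c.items()} for c in counts]


def cosine(u, v):
    """Cosine similarity of two sparse vectors (assumed similarity measure)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0


def summarize(objects, threshold=0.1, high=1.0, low=0.0):
    """Return (index of the core object, sentences that received the high score)."""
    vectors = tfidf_vectors(objects)
    links = {i: set() for i in range(len(objects))}
    for i, j in combinations(range(len(objects)), 2):
        if cosine(vectors[i], vectors[j]) >= threshold:  # similarity satisfies the threshold
            links[i].add(j)
            links[j].add(i)
    # Core object: the object with the most links (ties go to the earliest object).
    core = max(links, key=lambda i: len(links[i]))
    body = {core} | links[core]
    scored = [(s, high if i in body else low) for i, obj in enumerate(objects) for s in obj]
    return core, [s for s, score in scored if score == high]


# Hypothetical page, already split into objects (blocks) of sentences.
page = [
    ["Home News Contact"],
    ["Solar panels convert sunlight into electricity.",
     "Batteries store the electricity overnight."],
    ["Solar panels need sunlight."],
    ["Batteries store power overnight."],
    ["Buy cheap shoes today."],
]

core, summary = summarize(page)
print(core)
print(summary)
```

In this sample the second object shares terms with both the third and the fourth, so it collects the most links and becomes the core object; the navigation and advertisement objects share no terms with it and their sentences receive the low score.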
13. A method in a computer system for summarizing a web page, the method comprising:
retrieving the web page;
for each sentence of the retrieved web page,
  assigning a score to the sentence based on multiple summarization techniques wherein one of the summarization techniques is
    identifying objects of the web page, the objects having sentences;
    building a term frequency by inverted document frequency index for each object;
    calculating similarity between pairs of objects based on the term frequency by inverted document frequency indexes of the objects;
    when the calculated similarity between a pair of objects satisfies a similarity threshold, linking the pair of objects to indicate that the objects satisfy the threshold;
    selecting as a core object of the web page the object that has the most links; and
    assigning a high score to sentences of the core object and to objects with links to the core object and a low score to all other sentences; and
  combining the scores assigned to the sentence to generate a combined score for the sentence; and
selecting the sentences with the highest combined scores to form a summary of the retrieved web page.
Dependent claims: 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25
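The combining and selecting steps of claim 13 can be sketched as below. The claim does not state how the per-technique scores are combined, so the weighted sum used here is an assumption, as are the function names, the weights, and the second technique labelled `position`, which is purely hypothetical and stands in for any other summarization technique.

```python
def combine_scores(per_technique_scores, weights=None):
    """Combine per-sentence scores from several techniques with a weighted sum (assumed rule).

    per_technique_scores maps a technique name to a list holding one score per sentence.
    """
    names = list(per_technique_scores)
    if weights is None:
        weights = {name: 1.0 for name in names}  # equal weights unless told otherwise
    n = len(per_technique_scores[names[0]])
    combined = [0.0] * n
    for name in names:
        scores = per_technique_scores[name]
        if len(scores) != n:
            raise ValueError("every technique must score every sentence")
        for i, score in enumerate(scores):
            combined[i] += weights[name] * score
    return combined


def top_sentences(sentences, combined, k):
    """Pick the k sentences with the highest combined scores, kept in page order."""
    ranked = sorted(range(len(sentences)), key=lambda i: combined[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(ranked)]


# Hypothetical sentences and scores.
sentences = ["Sentence A.", "Sentence B.", "Sentence C.", "Sentence D."]
scores = {
    "core_object": [1.0, 1.0, 0.0, 0.0],  # high/low scores from the core-object technique
    "position": [0.9, 0.2, 0.8, 0.1],     # hypothetical second technique
}

combined = combine_scores(scores)
print(top_sentences(sentences, combined, k=2))
```

Passing a `weights` mapping lets one technique count for more than another; with equal weights the combined score is simply the sum of the individual scores.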
26. A computer-readable storage medium containing instructions for causing a computer system to generate a summary for a display page by a method comprising:
for each sentence of the display page, generating a score that is based on multiple summarization techniques wherein one of the summarization techniques is
  calculating similarity between pairs of objects of the display page, the objects having sentences;
  when the calculated similarity between a pair of objects satisfies a similarity threshold, linking the pair of objects to indicate that the objects satisfy the threshold;
  selecting as a core object of the display page the object that has the most links; and
  assigning a high score to sentences of the core object and to objects with links to the core object and a low score to all other sentences; and
selecting the sentences with the highest generated scores to form a summary of the display page.
Dependent claims: 27, 28, 29, 30, 31, 32, 33, 34, 35, 36
37. A computer system embodied on a computer-readable storage medium for classifying display pages, comprising:
means for automatically generating a summary of the display page by
  calculating similarity between pairs of objects of the display page, the objects having sentences;
  when the calculated similarity between a pair of objects satisfies a similarity threshold, linking the pair of objects to indicate that the objects satisfy the threshold;
  selecting as a core object of the display page the object that has the most links; and
  selecting sentences of the core object and objects with links to the core object to form the summary of the display page; and
means for identifying a classification for the display page based on the automatically generated summary.
Dependent claims: 38, 39, 40, 41, 42
Specification