METHOD AND SYSTEM FOR CLASSIFYING DISPLAY PAGES USING SUMMARIES
Abstract
A method and system for classifying display pages based on automatically generated summaries of display pages. A web page classification system uses a web page summarization system to generate summaries of web pages. The summary of a web page may include the sentences of the web page that are most closely related to the primary topic of the web page. The summarization system may combine the benefits of multiple summarization techniques to identify the sentences of a web page that represent the primary topic of the web page. Once the summary is generated, the classification system may apply conventional classification techniques to the summary to classify the web page. The classification system may use conventional classification techniques such as a Naïve Bayesian classifier or a support vector machine to identify the classifications of a web page based on the summary generated by the summarization system.
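The classification step described above can be illustrated with a minimal sketch of a multinomial Naive Bayes classifier that is trained on, and applied to, page summaries rather than full pages. This is only an illustration under stated assumptions: the class name `NaiveBayes`, whitespace tokenization, add-one smoothing, and the example labels are hypothetical and are not taken from the patent, and the summaries are assumed to have been produced already by a summarization system.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Hypothetical multinomial Naive Bayes over summary words (illustrative only)."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word occurrence counts
        self.doc_counts = Counter()              # label -> number of training summaries
        self.vocab = set()                       # all words seen during training

    def train(self, summary, label):
        # Assumption: the summary was already generated by a summarization system.
        words = summary.lower().split()
        self.word_counts[label].update(words)
        self.doc_counts[label] += 1
        self.vocab.update(words)

    def classify(self, summary):
        words = summary.lower().split()
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label in self.doc_counts:
            counts = self.word_counts[label]
            # Add-one (Laplace) smoothing so unseen words do not zero out a class.
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(self.doc_counts[label] / total_docs)
            for word in words:
                score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

Because the classifier only sees the summary, words from navigation bars, advertisements, and other areas unrelated to the primary topic never enter the word counts, which is the benefit the abstract attributes to classifying summaries instead of whole pages.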
62 Claims
1-42. (canceled)
43. A method in a computer system for identifying a core object of a web page, the method comprising:
identifying objects of the web page, an object representing an information area of the web page and having content comprising words;
for each pair of identified objects,
  calculating similarity between the pair of identified objects based on similarity between words of the identified objects;
  determining whether the calculated similarity between the pair of identified objects satisfies a threshold of similarity; and
  when it is determined that the calculated similarity between the pair of identified objects satisfies a threshold of similarity, indicating that the pair of identified objects are similar; and
selecting as the core object of the web page the identified object that has been indicated as being similar to the most other identified objects, wherein the content of the core object represents a primary topic of the web page.
Dependent claims: 44, 45, 46, 47, 48, 49, 50
51. A computer-readable storage medium storing computer-executable instructions for controlling a computing device to identify a core object of a document, by a method comprising:
identifying objects of the document, an object representing an information area of the document, having content comprising words, and being a basic object or a composite object, a basic object representing an information area that cannot be further divided, a composite object representing basic objects or other composite objects that combined perform a function;
for each pair of identified objects,
  calculating similarity between the pair of identified objects based on similarity between words of the identified objects;
  determining whether the calculated similarity between the pair of identified objects satisfies a threshold of similarity; and
  when it is determined that the calculated similarity between the pair of identified objects satisfies a threshold of similarity, indicating that the pair of identified objects are similar; and
selecting as the core object of the document the identified object that has been indicated as being similar to the most other identified objects.
Dependent claims: 52, 53, 54, 55, 56, 57, 58
59. A computing device with a processor and memory for identifying a core object of a web page, comprising:
a component that identifies objects of the web page, an object representing an information area of the web page, having content comprising words, and being a basic object or a composite object, a basic object representing an information area of the web page that cannot be further divided, a composite object representing basic objects or other composite objects that combined perform a function;
a component that, for each pair of identified objects,
  calculates similarity between the pair of identified objects based on similarity between words of the identified objects determined based on term frequency by inverse document frequency and cosine similarity;
  determines whether the calculated similarity between the pair of identified objects satisfies a threshold of similarity; and
  when it is determined that the calculated similarity between the pair of identified objects satisfies a threshold of similarity, establishes a link between the identified objects indicating that identified objects of the pair are similar; and
a component that selects as the core object of the web page the identified object that has the most links to other identified objects.
Dependent claims: 60, 61, 62
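The core-object selection recited in the independent claims can be sketched as follows: weight the words of each object by term frequency times inverse document frequency, compare every pair of objects by cosine similarity, link the pairs that meet a threshold, and pick the object with the most links. This is a minimal sketch and not the claimed implementation. The function names, whitespace tokenization, the smoothed IDF formula, the default threshold of 0.2, and the first-wins tie-breaking are all assumptions made for illustration, and each object is assumed to be available already as a plain string of its words.

```python
import math
from collections import Counter
from itertools import combinations


def tfidf_vectors(objects):
    """Return one TF-IDF vector (dict: word -> weight) per object's text."""
    tokenized = [text.lower().split() for text in objects]
    n = len(tokenized)
    # Document frequency: in how many objects each word appears.
    df = Counter(word for words in tokenized for word in set(words))
    vectors = []
    for words in tokenized:
        tf = Counter(words)
        # Assumption: smoothed IDF, so words shared by every object keep a nonzero weight.
        vectors.append({w: tf[w] * (math.log((1 + n) / (1 + df[w])) + 1) for w in tf})
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(word, 0.0) for word, weight in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_core_object(objects, threshold=0.2):
    """Return the index of the object linked to the most other objects."""
    vectors = tfidf_vectors(objects)
    links = [0] * len(objects)
    for i, j in combinations(range(len(objects)), 2):
        # A link is established when the pair's similarity satisfies the threshold.
        if cosine(vectors[i], vectors[j]) >= threshold:
            links[i] += 1
            links[j] += 1
    # Assumption: ties are broken in favor of the earliest object.
    return max(range(len(objects)), key=links.__getitem__)
```

In this sketch the link counts play the role of the "most links to other identified objects" criterion of claim 59; claims 43 and 51 do not name a particular similarity measure, so TF-IDF with cosine similarity is used here only as one possible choice.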
Specification