LARGE SCALE CONCEPT DISCOVERY FOR WEBPAGE AUGMENTATION USING SEARCH ENGINE INDEXERS
Abstract
Disclosed is a method and system for retrieving data; extracting information from the data; learning to disambiguate the extracted information such that a particular sense of each phrase within the extracted information is determined; generating a disambiguation classifier from the learning to disambiguate step, the disambiguation classifier configured to determine a sense of a phrase within a document; learning to select a portion of the information as being relevant to a theme of the data; generating a selection classifier from the learning to select step, the selection classifier configured to select a topic in a document that is relevant to a theme of the document; and using the disambiguation classifier and the selection classifier by an indexing computer to determine a set of topics from a web document retrieved by the indexing computer.
32 Claims
1. A method comprising:
retrieving, by a training computer, training data comprising a plurality of web documents;
extracting, by the training computer, information from the training data, the extracted information comprising a plurality of phrases extracted from each document of said plurality of web documents;
learning, by the training computer, to disambiguate the extracted information by analysis of a context derived from words proximate each phrase such that a particular sense of each phrase of the plurality of phrases is determined for each web document;
generating, by the training computer as a result of the learning to disambiguate step, a disambiguation classifier capable of determining a sense of a phrase within a document to be analyzed;
learning, by the training computer using the disambiguated extracted information from each web document, to select a portion of the extracted information of each web document as being relevant to a theme of the each web document;
generating, by the training computer as a result of the learning to select step, a selection classifier capable of selecting a topic in a document that is relevant to the theme of the document; and
using, by an indexing computer, the disambiguation classifier and the selection classifier to determine a set of topics from a new web document that is not a part of the training data.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
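As an illustration of the "context derived from words proximate each phrase" element of claim 1, the following is a minimal sketch only, not the patented implementation. It assumes that sense-labeled training examples are already available and uses a simple word-overlap score; the window size, the function names, and the scoring rule are all hypothetical.

```python
from collections import Counter, defaultdict

WINDOW = 3  # hypothetical: number of words on each side treated as context


def context_words(tokens, index, window=WINDOW):
    """Return the words proximate to tokens[index], excluding the phrase itself."""
    left = tokens[max(0, index - window):index]
    right = tokens[index + 1:index + 1 + window]
    return left + right


def train_disambiguator(examples):
    """Build {phrase: {sense: Counter(context words)}}.

    `examples` is an iterable of (tokens, index, sense) triples; the sense
    labels are assumed to be given, which the claim itself does not specify.
    """
    model = defaultdict(lambda: defaultdict(Counter))
    for tokens, index, sense in examples:
        model[tokens[index]][sense].update(context_words(tokens, index))
    return model


def disambiguate(model, tokens, index):
    """Pick the sense whose training contexts overlap most with the new context."""
    senses = model.get(tokens[index])
    if not senses:
        return None  # phrase never seen during training
    context = context_words(tokens, index)
    # sorted() makes tie-breaking deterministic (alphabetical by sense name)
    return max(sorted(senses), key=lambda s: sum(senses[s][w] for w in context))


examples = [
    ("the jaguar hunts prey in the jungle".split(), 1, "animal"),
    ("the jaguar engine and sedan design".split(), 1, "car"),
]
model = train_disambiguator(examples)
print(disambiguate(model, "a jaguar sedan with a new engine".split(), 1))
```

A production classifier would presumably use richer features and a learned model; the overlap count here only shows how nearby words can separate two senses of the same phrase.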
13. A computer readable storage medium storing computer program instructions capable of being executed by a computer processor on a computing device, the computer program instructions defining the steps of:
extracting, by the training computer, information from retrieved training data comprising a plurality of web documents, the extracted information comprising a plurality of phrases extracted from each document of said plurality of web documents;
learning, by the training computer, to disambiguate the extracted information by analysis of a context derived from words proximate each phrase such that a particular sense of each phrase of the plurality of phrases is determined for each web document;
generating, by the training computer as a result of the learning to disambiguate step, a disambiguation classifier capable of determining a sense of a phrase within a document to be analyzed;
learning, by the training computer using the disambiguated extracted information from each web document, to select a portion of the extracted information of each web document as being relevant to a theme of the each web document;
generating, by the training computer as a result of the learning to select step, a selection classifier capable of selecting a topic in a document that is relevant to the theme of the document; and
using, by an indexing computer, the disambiguation classifier and the selection classifier to determine a set of topics from a new web document that is not a part of the training data.
Dependent claims: 14
15. A method comprising:
retrieving, by an indexing computer, a web document;
tokenizing, by the indexing computer, the web document to determine phrases in the web document that correspond with phrases in stored data;
applying, by the indexing computer, a disambiguation classifier on each determined phrase in the web document to obtain a sense for the each determined phrase; and
applying, by the indexing computer, a selection classifier on the sense for the each determined phrase to obtain a set of topics for the web document.
Dependent claims: 16, 17, 18, 19
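The three indexing steps of claim 15 (tokenize against stored phrases, disambiguate each match, select topics) can be sketched as follows. This is an illustrative sketch under stated assumptions: the greedy longest-match tokenizer, the `max_len` limit, and the two toy classifier functions are hypothetical stand-ins, since the claim does not prescribe how either classifier works internally.

```python
def find_known_phrases(text, known_phrases, max_len=3):
    """Greedy longest-match scan for phrases that appear in the stored data.

    Returns the lowercased word list and a list of (start_index, phrase) pairs.
    Splitting on whitespace is a simplification; punctuation is not handled.
    """
    words = text.lower().split()
    found, i = [], 0
    while i < len(words):
        for n in range(min(max_len, len(words) - i), 0, -1):
            candidate = " ".join(words[i:i + n])
            if candidate in known_phrases:
                found.append((i, candidate))
                i += n  # skip past the matched phrase
                break
        else:
            i += 1  # no stored phrase starts at this word
    return words, found


def index_document(text, known_phrases, disambiguate, select):
    """Apply the disambiguation classifier, then the selection classifier."""
    words, found = find_known_phrases(text, known_phrases)
    senses = [disambiguate(words, start, phrase) for start, phrase in found]
    senses = [s for s in senses if s is not None]
    return sorted({s for s in senses if select(s, senses)})


# Hypothetical stand-ins for the two trained classifiers.
def toy_disambiguate(words, start, phrase):
    if phrase == "jaguar":
        return "Jaguar (car)" if "engine" in words else "Jaguar (animal)"
    return "New York City" if phrase == "new york" else None


def toy_select(sense, all_senses):
    # Treat a sense as on-theme when it occurs at least twice in the document.
    return all_senses.count(sense) >= 2


known = {"jaguar", "new york"}
page = "the jaguar engine was shown in new york and the jaguar sold out"
print(index_document(page, known, toy_disambiguate, toy_select))
```

Passing the classifiers in as callables mirrors the separation in the claims: they are produced by the training computer and only applied by the indexing computer.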
20. A computer readable storage medium storing computer program instructions capable of being executed by a computer processor on a computing device, the computer program instructions defining the steps of:
tokenizing, by an indexing computer, a retrieved web document to determine phrases in the web document that correspond with phrases in stored data;
applying, by the indexing computer, a disambiguation classifier on each determined phrase in the web document to obtain a sense for the each determined phrase; and
applying, by the indexing computer, a selection classifier on the sense for the each determined phrase to obtain a set of topics for the web document.
Dependent claims: 21
22. A method comprising:
retrieving, by a server computer over a network from a computing device, a Uniform Resource Locator of a web document navigated to by a user of the computing device;
determining, by the server computer, a set of topics associated with the web document, the determining of the set of topics being based on results of a classifier previously applied to stored data; and
transmitting, from the server computer to the computing device, an item associated with the set of topics for display by the computing device.
Dependent claims: 23, 24, 25, 26
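The run-time flow of claim 22 amounts to a lookup: the server receives a URL, finds the topics computed for that page at indexing time, and returns an associated item. A minimal sketch, assuming an in-memory dictionary as the topic store and a simple URL normalization step (both hypothetical; the claim names neither):

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url):
    """Lowercase scheme and host, drop the fragment and any trailing slash."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/") or "/"
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, "")
    )


def items_for_url(url, topic_index, items_by_topic):
    """Return the items associated with the topics previously stored for `url`.

    `topic_index` maps normalized URLs to topic lists produced at indexing
    time; `items_by_topic` maps a topic to the item to transmit. Both are
    hypothetical stores used only for this illustration.
    """
    topics = topic_index.get(normalize_url(url), [])
    return [items_by_topic[t] for t in topics if t in items_by_topic]


topic_index = {"http://example.com/cars": ["Jaguar (car)"]}
items_by_topic = {"Jaguar (car)": "Related: Jaguar model overview"}
print(items_for_url("HTTP://Example.com/cars/#top", topic_index, items_by_topic))
```

Because the classifiers run ahead of time, the run-time path does no classification at all, which is what allows the item to be returned while the user is still viewing the page.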
27. A computer readable storage medium storing computer program instructions capable of being executed by a computer processor on a computing device, the computer program instructions defining the steps of:
determining, by a server computer, a set of topics associated with a web document to which a user using a computing device has navigated, the determining of the set of topics being based on results of a classifier previously applied to stored data; and
transmitting, from the server computer to the computing device, an item associated with the set of topics.
Dependent claims: 28
29. A system comprising:
a training module executing on a server computer and configured to generate a disambiguation classifier, the disambiguation classifier configured to determine a sense of a phrase within a document, and a selection classifier, the selection classifier configured to select a topic in the document that is relevant to a theme of the document;
an indexing module executing on the server computer and configured to apply the disambiguation classifier and the selection classifier on a web document to determine a set of topics from the web document; and
a run-time module executing on the server computer and configured to transmit the set of topics determined by the indexing module for the web document to a computing device when the computing device has navigated to the web document.
Dependent claims: 30
31. A system comprising:
a run-time module executing on a server computer and configured to transmit a topic previously determined for a particular web document over a network to a computing device when the computing device has navigated to the web document.
Dependent claims: 32
Specification