Large scale concept discovery for webpage augmentation using search engine indexers
First Claim
1. A method comprising:
retrieving, by a training computer, training data comprising a plurality of web documents;
extracting, by the training computer, information from the training data, the extracted information comprising a plurality of phrases extracted from each document of said plurality of web documents;
learning, by the training computer, to disambiguate the extracted information by analysis of a context derived from words proximate each phrase such that a particular sense of each phrase of the plurality of phrases is determined for each web document;
generating, by the training computer as a result of the learning to disambiguate step, a disambiguation classifier capable of determining a sense of a phrase within a document to be analyzed;
learning, by the training computer using the disambiguated extracted information from each web document, to select a portion of the extracted information of each web document as being relevant to a theme of the each web document;
generating, by the training computer as a result of the learning to select step, a selection classifier capable of selecting a topic in a document that is relevant to the theme of the document;
using, by an indexing computer, the disambiguation classifier and the selection classifier to determine a set of topics from a new web document that is not a part of the training data and a set of categories from the new web document;
determining, by the indexing computer, one or more entities associated with the set of topics, the one or more entities selected from a group of entities consisting of text, a graphic, an icon, a video, and a link; and
transmitting, by the indexing computer, topic and category information to a client computer for display, the topic and category information obtained from a group of topic and category information consisting of the set of topics, the set of categories, and the one or more entities.
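The disambiguation step recited above, in which the sense of a phrase is determined from the words proximate to it, can be illustrated with a minimal sketch. The claim does not specify a learning algorithm, so the context-word overlap scoring used here is an assumption, as are the names `context_words`, `DisambiguationClassifier`, `learn`, and `sense_of`, the window size of 3, and the toy training sentences.

```python
from collections import Counter, defaultdict


def context_words(tokens, index, window=3):
    """Return the words within `window` positions of tokens[index]."""
    left = tokens[max(0, index - window):index]
    right = tokens[index + 1:index + 1 + window]
    return left + right


class DisambiguationClassifier:
    """Toy sense classifier (hypothetical): scores senses by context overlap."""

    def __init__(self):
        # sense_counts[phrase][sense] counts context words seen in training.
        self.sense_counts = defaultdict(lambda: defaultdict(Counter))

    def learn(self, tokens, index, sense):
        """Record the context of one labeled occurrence of a phrase."""
        phrase = tokens[index]
        self.sense_counts[phrase][sense].update(context_words(tokens, index))

    def sense_of(self, tokens, index):
        """Pick the sense whose training contexts best match this context."""
        senses = self.sense_counts.get(tokens[index])
        if not senses:
            return None  # phrase never seen during training
        context = context_words(tokens, index)
        return max(senses, key=lambda s: sum(senses[s][w] for w in context))


# Illustrative training data (assumed, not taken from the patent).
clf = DisambiguationClassifier()
clf.learn("wild jaguar hunting in jungle".split(), 1, "animal")
clf.learn("new jaguar sedan car dealership".split(), 1, "car")

print(clf.sense_of("a jaguar sedan for sale".split(), 1))
print(clf.sense_of("the jaguar hunting at night".split(), 1))
```

A production system would presumably train a statistical model over far more context features; the sketch only shows how a classifier generated during training can later be applied to a document that was not in the training data.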
Abstract
Disclosed is a method and system for retrieving data; extracting information from the data; learning to disambiguate the extracted information such that a particular sense of each phrase within the extracted information is determined; generating a disambiguation classifier from the learning to disambiguate step, the disambiguation classifier configured to determine a sense of a phrase within a document; learning to select a portion of the information as being relevant to a theme of the data; generating a selection classifier from the learning to select step, the selection classifier configured to select a topic in a document that is relevant to a theme of the document; and using the disambiguation classifier and the selection classifier by an indexing computer to determine a set of topics from a web document retrieved by the indexing computer.
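The "learning to select" step summarized in the abstract can be sketched as a small binary classifier that decides whether a candidate phrase is relevant to a document's theme. The source does not name the features or the model, so everything here is an assumption made for illustration: the two features (relative frequency in the body and presence in the title), the nearest-centroid decision rule, and the names `features`, `SelectionClassifier`, and `select_topics`. The sketch also works on raw phrases rather than disambiguated senses, to keep it short.

```python
def features(phrase, title, body):
    """Hypothetical features: body frequency ratio and title membership."""
    body_tokens = body.lower().split()
    title_tokens = title.lower().split()
    frequency = body_tokens.count(phrase) / max(len(body_tokens), 1)
    in_title = 1.0 if phrase in title_tokens else 0.0
    return (frequency, in_title)


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    return tuple(sum(column) / len(vectors) for column in zip(*vectors))


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


class SelectionClassifier:
    """Nearest-centroid classifier over (features, is_relevant) examples."""

    def fit(self, examples):
        relevant = [f for f, label in examples if label]
        irrelevant = [f for f, label in examples if not label]
        self.relevant_centroid = centroid(relevant)
        self.irrelevant_centroid = centroid(irrelevant)
        return self

    def is_relevant(self, feats):
        return (squared_distance(feats, self.relevant_centroid)
                < squared_distance(feats, self.irrelevant_centroid))


def select_topics(phrases, title, body, classifier):
    """Keep only the phrases the classifier judges relevant to the theme."""
    return [p for p in phrases
            if classifier.is_relevant(features(p, title, body))]


# Illustrative labeled training documents (assumed, not from the patent).
doc1 = ("jaguar habitat",
        "the jaguar lives in the rainforest and the jaguar hunts at night")
doc2 = ("python tutorial",
        "python is a language and python runs on many systems")
training = [
    (features("jaguar", *doc1), True),
    (features("night", *doc1), False),
    (features("python", *doc2), True),
    (features("systems", *doc2), False),
]
selector = SelectionClassifier().fit(training)

new_title = "solar panels"
new_body = "solar panels convert light and solar output varies by weather"
print(select_topics(["solar", "weather", "light"], new_title, new_body, selector))
```

The nearest-centroid rule was chosen only because it is easy to follow by hand; any supervised classifier trained on labeled relevance judgments would fit the same role.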
14 Claims
12. A non-transitory computer readable storage medium storing computer program instructions capable of being executed by a computer processor on a computing device, the computer program instructions defining the steps of:
extracting, by the training computer, information from retrieved training data comprising a plurality of web documents, the extracted information comprising a plurality of phrases extracted from each document of said plurality of web documents;
learning, by the training computer, to disambiguate the extracted information by analysis of a context derived from words proximate each phrase such that a particular sense of each phrase of the plurality of phrases is determined for each web document;
generating, by the training computer as a result of the learning to disambiguate step, a disambiguation classifier capable of determining a sense of a phrase within a document to be analyzed;
learning, by the training computer using the disambiguated extracted information from each web document, to select a portion of the extracted information of each web document as being relevant to a theme of the each web document;
generating, by the training computer as a result of the learning to select step, a selection classifier capable of selecting a topic in a document that is relevant to the theme of the document;
using, by an indexing computer, the disambiguation classifier and the selection classifier to determine a set of topics from a new web document that is not a part of the training data and a set of categories from the new web document;
determining, by the indexing computer, one or more entities associated with the set of topics, the one or more entities selected from a group of entities consisting of text, a graphic, an icon, a video, and a link; and
transmitting, by the indexing computer, topic and category information to a client computer for display, the topic and category information obtained from a group of topic and category information consisting of the set of topics, the set of categories, and the one or more entities.
14. A system comprising:
a processor;
a storage medium for tangibly storing thereon program logic for execution by the processor, the program logic comprising:
a training module configured to:
retrieve training data comprising a plurality of web documents,
extract information from the training data, the extracted information comprising a plurality of phrases extracted from each document of the plurality of web documents,
learning to disambiguate the extracted information by analysis of a context derived from words proximate each phrase such that a particular sense of each phrase of the plurality of phrases is determined for each web document,
generating, as a result of the learning to disambiguate step, a disambiguation classifier capable of determining a sense of a phrase within a document to be analyzed,
learning, using the disambiguated extracted information from each web document, to select a portion of the extracted information of each web document as being relevant to a theme of the each web document,
generating, as a result of the learning to select step, a selection classifier capable of selecting a topic in a document that is relevant to the theme of the document, and
an indexing module executing on the processor and configured to:
use the disambiguation classifier and the selection classifier to determine a set of topics from a new web document that is not a part of the training data and a set of categories from the new web document;
determine one or more entities associated with the set of topics, the one or more entities selected from a group of entities consisting of text, a graphic, an icon, a video, and a link; and
transmit topic and category information to a client computer for display, the topic and category information obtained from a group of topic and category information consisting of the set of topics, the set of categories, and the one or more entities.
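The indexing module's three duties (determine topics and categories, determine associated entities, transmit the result) can be sketched as a single function that assembles a payload for the client. This is a rough illustration under stated assumptions: the two classifiers are passed in as plain callables, categories are looked up from a topic-to-category table (the claim does not say how categories are derived), the entity catalog and its `kind`/`value` fields are hypothetical, and JSON is assumed as the wire format.

```python
import json

# Entity kinds recited in the claim.
ENTITY_KINDS = {"text", "graphic", "icon", "video", "link"}


def index_document(tokens, sense_of, is_relevant, category_of, entity_catalog):
    """Build the topic and category payload for one new web document.

    sense_of(tokens, i) stands in for the disambiguation classifier and
    is_relevant(sense) for the selection classifier; both are assumed
    interfaces, not the patent's.
    """
    topics = []
    for i in range(len(tokens)):
        sense = sense_of(tokens, i)
        if sense is not None and is_relevant(sense) and sense not in topics:
            topics.append(sense)
    # Assumption: each topic maps to at most one category via a lookup table.
    categories = sorted({category_of[t] for t in topics if t in category_of})
    # Keep only entities whose kind is one of the recited kinds.
    entities = [entity
                for topic in topics
                for entity in entity_catalog.get(topic, [])
                if entity["kind"] in ENTITY_KINDS]
    return {"topics": topics, "categories": categories, "entities": entities}


# Stub classifiers and lookup tables, for illustration only.
senses = {"jaguar": "Jaguar (animal)"}
payload = index_document(
    "the jaguar hunts".split(),
    sense_of=lambda tokens, i: senses.get(tokens[i]),
    is_relevant=lambda sense: True,
    category_of={"Jaguar (animal)": "Wildlife"},
    entity_catalog={"Jaguar (animal)": [
        {"kind": "link", "value": "https://example.org/jaguar"},
        {"kind": "audio", "value": "roar.mp3"},  # filtered out: not a recited kind
    ]},
)
message = json.dumps(payload)  # serialized form sent to the client computer
print(message)
```

Passing the classifiers in as callables keeps the training module and the indexing module separate, which mirrors the claim's split between the two.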
Specification