Large scale concept discovery for webpage augmentation using search engine indexers

US 8,886,623 B2
Filed: 04/07/2010
Issued: 11/11/2014
Est. Priority Date: 04/07/2010
Status: Expired due to Fees

9 Assignments

0 Petitions

Abstract

Disclosed is a method and system for retrieving data; extracting information from the data; learning to disambiguate the extracted information such that a particular sense of each phrase within the extracted information is determined; generating a disambiguation classifier from the learning to disambiguate step, the disambiguation classifier configured to determine a sense of a phrase within a document; learning to select a portion of the information as being relevant to a theme of the data; generating a selection classifier from the learning to select step, the selection classifier configured to select a topic in a document that is relevant to a theme of the document; and using the disambiguation classifier and the selection classifier by an indexing computer to determine a set of topics from a web document retrieved by the indexing computer.
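
As an illustration only, the disambiguation step can be pictured as a classifier that scores each candidate sense of a phrase by how much the words near the phrase overlap with words seen near that sense during training. The patent text here does not name a model, so the count-based scorer below is an assumption, as are the names `context_window`, `DisambiguationClassifier`, `learn`, and `sense_of`, the window width, and the example sentences.

```python
from collections import Counter, defaultdict


def context_window(tokens, index, width=5):
    """Return the lowercased words within `width` positions of tokens[index]."""
    start = max(0, index - width)
    neighbours = tokens[start:index] + tokens[index + 1:index + 1 + width]
    return [word.lower() for word in neighbours]


class DisambiguationClassifier:
    """Toy count-based model: one bag of context words per (phrase, sense)."""

    def __init__(self):
        # counts[phrase][sense] is a Counter of context words seen in training.
        self.counts = defaultdict(lambda: defaultdict(Counter))

    def learn(self, phrase, sense, context):
        """Record the context words observed around one labelled occurrence."""
        self.counts[phrase][sense].update(context)

    def sense_of(self, phrase, context):
        """Return the sense whose training context overlaps most, or None."""
        senses = self.counts.get(phrase)
        if not senses:
            return None
        # Sorting first makes ties resolve to the alphabetically first sense.
        return max(sorted(senses),
                   key=lambda s: sum(senses[s][w] for w in context))


# Hypothetical training sentences with hand-labelled senses.
animal_doc = "the jaguar prowled through the rainforest hunting prey".split()
car_doc = "the jaguar sedan has a powerful engine".split()

clf = DisambiguationClassifier()
clf.learn("jaguar", "animal", context_window(animal_doc, 1))
clf.learn("jaguar", "car", context_window(car_doc, 1))

new_doc = "this jaguar has a loud engine".split()
print(clf.sense_of("jaguar", context_window(new_doc, 1)))
```

In this toy run the new sentence shares "has", "a", and "engine" with the car training context and nothing with the animal context, so the car sense is chosen. A practical system would at least filter stop words and train on far more data.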

14 Claims

1. A method comprising:
- retrieving, by a training computer, training data comprising a plurality of web documents;
  
  extracting, by the training computer, information from the training data, the extracted information comprising a plurality of phrases extracted from each document of said plurality of web documents;
  
  learning, by the training computer, to disambiguate the extracted information by analysis of a context derived from words proximate each phrase such that a particular sense of each phrase of the plurality of phrases is determined for each web document;
  
  generating, by the training computer as a result of the learning to disambiguate step, a disambiguation classifier capable of determining a sense of a phrase within a document to be analyzed;
  
  learning, by the training computer using the disambiguated extracted information from each web document, to select a portion of the extracted information of each web document as being relevant to a theme of the each web document;
  
  generating, by the training computer as a result of the learning to select step, a selection classifier capable of selecting a topic in a document that is relevant to the theme of the document;
  
  using, by an indexing computer, the disambiguation classifier and the selection classifier to determine a set of topics from a new web document that is not a part of the training data and a set of categories from the new web document;
  
  determining, by the indexing computer, one or more entities associated with the set of topics, the one or more entities selected from a group of entities consisting of text, a graphic, an icon, a video, and a link; and
  
  transmitting, by the indexing computer, topic and category information to a client computer for display, the topic and category information obtained from a group of topic and category information consisting of the set of topics, the set of categories, and the one or more entities.
- Dependent claims:
  - 2. The method of claim 1 further comprising determining, by the training computer, link data associated with the extracted information.
  - 3. The method of claim 2 wherein the learning to disambiguate the extracted information further comprises learning, from the link data, to disambiguate the extracted information.
  - 4. The method of claim 1 wherein the retrieving the training data further comprises retrieving a plurality of articles from a knowledge collection website.
  - 5. The method of claim 2 wherein the determining link data further comprises determining inlinks.
  - 6. The method of claim 2 wherein the determining link data further comprises determining outlinks.
  - 7. The method of claim 2 wherein the determining link data further comprises determining redirects.
  - 8. The method of claim 2 wherein the determining link data further comprises determining category hierarchy.
  - 9. The method of claim 1 wherein the training computer and the indexing computer are the same computer.
  - 10. The method of claim 1 wherein the training computer comprises a plurality of computers.
  - 11. The method of claim 1 wherein the indexing computer comprises a plurality of computers.
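
Claims 2 through 8 name four kinds of link data (inlinks, outlinks, redirects, and category hierarchy) without fixing how they are stored or used. The sketch below is one hypothetical representation, with a shared-inlink score as an example of a feature a disambiguation learner might draw from it. The `LinkData` record, both helper functions, the Jaccard-style score, and all article titles are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class LinkData:
    """Link data for one article in a knowledge collection (hypothetical layout)."""
    inlinks: set = field(default_factory=set)     # titles of articles linking here
    outlinks: set = field(default_factory=set)    # titles this article links to
    redirects: set = field(default_factory=set)   # alternate titles that redirect here
    category_path: tuple = ()                     # category hierarchy, root first


def resolve_redirect(title, articles):
    """Map a title or one of its redirects to the canonical article title."""
    for canonical, data in articles.items():
        if title == canonical or title in data.redirects:
            return canonical
    return None


def inlink_relatedness(a, b):
    """Jaccard overlap of inlinks: 1.0 for identical inlink sets, 0.0 for disjoint."""
    union = a.inlinks | b.inlinks
    return len(a.inlinks & b.inlinks) / len(union) if union else 0.0


# Hypothetical articles; the titles and links are invented for the example.
articles = {
    "Jaguar (animal)": LinkData(
        inlinks={"Big cat", "Amazon rainforest", "Panthera"},
        outlinks={"Panthera"},
        redirects={"Panthera onca"},
        category_path=("Animals", "Mammals", "Felines"),
    ),
    "Jaguar Cars": LinkData(
        inlinks={"Luxury vehicle", "Coventry"},
        outlinks={"Coventry"},
        redirects={"Jaguar (car)"},
        category_path=("Companies", "Automotive"),
    ),
    "Leopard": LinkData(
        inlinks={"Big cat", "Panthera", "Africa"},
        category_path=("Animals", "Mammals", "Felines"),
    ),
}

leopard = articles["Leopard"]
for title in ("Jaguar (animal)", "Jaguar Cars"):
    print(title, inlink_relatedness(articles[title], leopard))
```

If a document also mentions leopards, the animal sense of "jaguar" scores higher than the car sense under this feature, which is the kind of signal that link data could add to the context words.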

12. A non-transitory computer readable storage medium storing computer program instructions capable of being executed by a computer processor on a computing device, the computer program instructions defining the steps of:
- extracting, by the training computer, information from retrieved training data comprising a plurality of web documents, the extracted information comprising a plurality of phrases extracted from each document of said plurality of web documents;
  
  learning, by the training computer, to disambiguate the extracted information by analysis of a context derived from words proximate each phrase such that a particular sense of each phrase of the plurality of phrases is determined for each web document;
  
  generating, by the training computer as a result of the learning to disambiguate step, a disambiguation classifier capable of determining a sense of a phrase within a document to be analyzed;
  
  learning, by the training computer using the disambiguated extracted information from each web document, to select a portion of the extracted information of each web document as being relevant to a theme of the each web document;
  
  generating, by the training computer as a result of the learning to select step, a selection classifier capable of selecting a topic in a document that is relevant to the theme of the document;
  
  using, by an indexing computer, the disambiguation classifier and the selection classifier to determine a set of topics from a new web document that is not a part of the training data and a set of categories from the new web document;
  
  determining, by the indexing computer, one or more entities associated with the set of topics, the one or more entities selected from a group of entities consisting of text, a graphic, an icon, a video, and a link; and
  
  transmitting, by the indexing computer, topic and category information to a client computer for display, the topic and category information obtained from a group of topic and category information consisting of the set of topics, the set of categories, and the one or more entities.
- Dependent claim:
  - 13. The non-transitory computer readable storage medium of claim 12 wherein the step of using the disambiguation classifier and the selection classifier to determine a set of topics from a web document further comprises using the disambiguation classifier and the selection classifier to determine a set of topics and a set of categories from the web document.
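
The selection step recited in claims 1 and 12 picks, from the disambiguated phrases of a document, those relevant to the document's theme. The claims do not say which features or model are used, so the following is only a minimal sketch under a strong assumption: relevance is judged from a single feature, the share of a document's topic mentions that belong to a given topic, and the learned model is a threshold halfway between the average share of relevant and non-relevant training examples. `SelectionClassifier`, `learn`, `select`, and the sample numbers are all hypothetical.

```python
from statistics import mean


class SelectionClassifier:
    """Toy one-feature selector: keep topics whose mention share clears a threshold."""

    def __init__(self):
        self.threshold = None

    def learn(self, examples):
        """Fit the threshold from a list of (mention_share, is_relevant) pairs.

        The list must contain at least one relevant and one non-relevant pair.
        """
        relevant = [share for share, label in examples if label]
        other = [share for share, label in examples if not label]
        # Midpoint between the two class means.
        self.threshold = (mean(relevant) + mean(other)) / 2

    def select(self, topic_counts):
        """Return the topics whose share of all mentions meets the threshold."""
        total = sum(topic_counts.values())
        if total == 0:
            return []
        return [topic for topic, count in topic_counts.items()
                if count / total >= self.threshold]


# Hypothetical labelled training data: (share of mentions, relevant to theme?).
training = [(0.5, True), (0.25, True), (0.125, False), (0.125, False)]

selector = SelectionClassifier()
selector.learn(training)

# Disambiguated topic mentions counted in a new, unseen document.
mentions = {"Jaguar Cars": 6, "Coventry": 3, "Weather": 1}
print(selector.select(mentions))
```

With this training data the threshold is 0.25, so the two topics that account for 60% and 30% of the mentions are kept and the incidental one is dropped.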

14. A system comprising:
- a processor;
  
  a storage medium for tangibly storing thereon program logic for execution by the processor, the program logic comprising:
  
  a training module configured to:
  
  retrieve training data comprising a plurality of web documents,
  
  extract information from the training data, the extracted information comprising a plurality of phrases extracted from each document of the plurality of web documents,
  
  learning to disambiguate the extracted information by analysis of a context derived from words proximate each phrase such that a particular sense of each phrase of the plurality of phrases is determined for each web document,
  
  generating, as a result of the learning to disambiguate step, a disambiguation classifier capable of determining a sense of a phrase within a document to be analyzed,
  
  learning, using the disambiguated extracted information from each web document, to select a portion of the extracted information of each web document as being relevant to a theme of the each web document,
  
  generating, as a result of the learning to select step, a selection classifier capable of selecting a topic in a document that is relevant to the theme of the document, and
  
  an indexing module executing on the processor and configured to:
  
  use the disambiguation classifier and the selection classifier to determine a set of topics from a new web document that is not a part of the training data and a set of categories from the new web document;
  
  determine one or more entities associated with the set of topics, the one or more entities selected from a group of entities consisting of text, a graphic, an icon, a video, and a link; and
  
  transmit topic and category information to a client computer for display, the topic and category information obtained from a group of topic and category information consisting of the set of topics, the set of categories, and the one or more entities.
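
The last two elements of the indexing module, determining entities for the selected topics and transmitting topic and category information to a client, can be pictured as building a small serialized payload. The claims list the permitted entity types (text, graphic, icon, video, link) but no wire format, so the JSON layout, the `Entity` record, the `build_payload` helper, and the example URL below are assumptions for illustration only.

```python
import json
from dataclasses import asdict, dataclass

# The five entity types enumerated in the claims.
ENTITY_KINDS = ("text", "graphic", "icon", "video", "link")


@dataclass
class Entity:
    """One item associated with a topic, such as a link or a short text."""
    kind: str
    value: str

    def __post_init__(self):
        if self.kind not in ENTITY_KINDS:
            raise ValueError(f"unsupported entity kind: {self.kind!r}")


def build_payload(topics, categories, entity_catalog):
    """Serialize topics, categories, and per-topic entities as a JSON string."""
    entities = {
        topic: [asdict(entity) for entity in entity_catalog.get(topic, [])]
        for topic in topics
    }
    return json.dumps(
        {"topics": topics, "categories": categories, "entities": entities},
        sort_keys=True,
    )


# Hypothetical catalog mapping topics to entities a client could display.
catalog = {
    "Jaguar Cars": [
        Entity("link", "https://example.com/jaguar-cars"),
        Entity("text", "British maker of luxury vehicles"),
    ],
}

print(build_payload(["Jaguar Cars", "Coventry"], ["Automotive"], catalog))
```

A topic with no catalog entry, such as "Coventry" here, is sent with an empty entity list, so the client can still display the topic itself.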

Current Assignee: R2 Solutions LLC (Acacia Research Corporation)
Original Assignee: Yahoo! Inc. (Apollo Global Management, Inc.)
Inventors: Garg, Priyank Shanker; Monga, Rohan; Sambrani, Hemanth; Vasudevan, Sudharsan
Primary Examiner(s): Mahmoudi, Tony
Assistant Examiner(s): Gurmu, Muluemebet
Application Number: US12/755,652
Publication Number: US 20110252045A1
Time in Patent Office: 1,679 Days
Field of Search: 707/706, 707/709, 707/715
US Class Current: 707/706
CPC Class Codes:
G06F 16/35 Clustering; Classification
G06F 16/951 Indexing; Web crawling tech...
