Identifying the Unifying Subject of a Set of Facts

US 20110047153A1
Filed: 11/04/2010
Published: 02/24/2011
Est. Priority Date: 05/31/2005
Status: Active Grant

Abstract

A method and system for identifying a subject of a document and facts included within are described. A source document that includes facts and linking documents that include hyperlinks to the source document are identified. The anchor texts of the hyperlinks are identified and candidate labels are generated based on the anchor texts. One of the candidate labels is selected based on first predefined criteria and associated with the source document and/or the facts included within the source document.
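The flow described in the abstract can be illustrated with a minimal sketch. It is not the patented implementation: the parser class, the function names, the whitespace and case normalization, and the use of simple frequency as the "first predefined criteria" are all assumptions made for illustration.

```python
from collections import Counter
from html.parser import HTMLParser


class AnchorTextParser(HTMLParser):
    """Collects the anchor text of every <a> whose href equals target_url."""

    def __init__(self, target_url):
        super().__init__()
        self.target_url = target_url
        self.anchor_texts = []
        self._buffer = None  # text chunks gathered while inside a matching <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a" and dict(attrs).get("href") == self.target_url:
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._buffer is not None:
            self.anchor_texts.append("".join(self._buffer))
            self._buffer = None


def candidate_labels(linking_docs, source_url):
    """Return normalized anchor texts of links that point at source_url."""
    labels = []
    for html in linking_docs:
        parser = AnchorTextParser(source_url)
        parser.feed(html)
        parser.close()
        for text in parser.anchor_texts:
            normalized = " ".join(text.lower().split())  # assumed normalization
            if normalized:
                labels.append(normalized)
    return labels


def select_label(labels):
    """Assumed criterion: the most frequent candidate label wins."""
    if not labels:
        return None
    return Counter(labels).most_common(1)[0][0]
```

Links that point elsewhere are ignored, so only anchor text aimed at the source document contributes candidates. A real system would also need to resolve relative and redirected URLs, which this sketch does not attempt.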

9 Claims

1. A computer-implemented method of processing a set of documents for generating a facts database, comprising:
- at a system having one or more processors and memory storing one or more modules to be executed by the one or more processors:
  
  accessing a source document from a document host;
  
  extracting one or more facts from the source document, each fact including an attribute-value pair and a list of documents that include the fact;
  
  identifying a set of linking documents that have one or more links to the source document, wherein a respective link contains anchor text;
  
  generating a set of candidate labels from the anchor text of the linking documents;
  
  assigning a score to each candidate label;
  
  selecting the candidate label with a highest score as a unifying subject of the one or more facts; and
  
  for the unifying subject, storing in the facts database an information set distinct from the source document, wherein the object includes the unifying subject, one or more entries corresponding to the one or more facts extracted from the source document, and information associating the source document with the information set.
- - 2. The method of claim 1, further comprising:
    - selecting one or more second labels of the candidate labels according to second predefined criteria; and
      
      associating the selected second labels with the source document and the one or more facts extracted from the source document.
  - 3. The method of claim 1, wherein selecting the candidate label comprises:
    - for each of the set of candidate labels:
      
      determining a set of frequencies of one or more substrings of the respective candidate label;
      
      generating a frequency vector associated with the respective candidate label based on the set of frequencies;
      
      determining a centroid vector based on the frequency vectors of the candidate labels, wherein the selected candidate label is associated with the respective frequency vector having a shortest distance to the centroid vector.
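The centroid-based selection recited in claim 3 can be sketched as follows. The claim does not say what the substrings are or how distance is measured; this sketch assumes character trigrams and Euclidean distance, and the function names are hypothetical.

```python
import math
from collections import Counter


def ngram_counts(label, n=3):
    """Count character n-grams (the assumed 'substrings') of a label."""
    text = label.lower()
    if len(text) < n:
        return Counter([text]) if text else Counter()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def select_by_centroid(labels, n=3):
    """Return the label whose frequency vector lies closest to the centroid."""
    if not labels:
        return None
    counts = [ngram_counts(label, n) for label in labels]
    # Shared vocabulary so that every label maps to a vector of equal length.
    vocab = sorted(set().union(*counts))
    vectors = [[c[gram] for gram in vocab] for c in counts]
    # Centroid: component-wise mean of all frequency vectors.
    centroid = [sum(column) / len(vectors) for column in zip(*vectors)]
    distances = [math.dist(vector, centroid) for vector in vectors]
    return labels[distances.index(min(distances))]
```

With candidates such as "eiffel tower", "eiffel tower", "the eiffel tower", and "click here", the centroid is pulled toward the substrings most labels share, so an outlier like "click here" ends up farthest away. Ties resolve to the earliest label in the input list.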

4. A server system for processing a set of documents for generating a facts database, comprising:
- one or more processors;
  
  memory storing one or more programs to be executed by the one or more processors, the one or more programs including:
  
  a document identification module to access a source document from a document host;
  
  an extraction module to extract one or more facts from the source document, each fact including an attribute-value pair and a list of documents that include the fact;
  
  a linking document module to identify a set of linking documents that have one or more links to the source document, wherein a respective link contains anchor text; and
  
  a label module having instructions to:
  
  generate a set of candidate labels from the anchor text of the linking documents;
  
  assign a score to each candidate label;
  
  select the candidate label with a highest score as a unifying subject of the one or more facts; and
  
  for the unifying subject, store in the facts database an information set distinct from the source document, wherein the object includes the unifying subject, one or more entries corresponding to the one or more facts extracted from the source document, and information associating the source document with the information set.
- - 5. The system of claim 4, wherein the label selection instructions further include instructions to select one or more second labels of the candidate labels according to second predefined criteria;
    - and wherein the label association instructions further include instructions to associate the selected second labels with the source document and the one or more facts extracted from the source document.
  - 6. The system of claim 4, wherein the label selection instructions include instructions to:
    - for each of the set of candidate labels:
      
      determine a set of frequencies of one or more substrings of the respective candidate label;
      
      generate a frequency vector associated with the respective candidate label based on the set of frequencies; and
      
      determine a centroid vector based on the frequency vectors of the candidate labels, wherein the selected candidate label is associated with the respective frequency vector having a shortest distance to the centroid vector.
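The records that the claimed label module stores might be laid out as in the sketch below. The class names, field names, and the plain dictionary standing in for the facts database are illustrative assumptions; the claims specify only that each fact carries an attribute-value pair plus a list of documents, and that the stored information set holds the unifying subject, the fact entries, and an association with the source document.

```python
from dataclasses import dataclass, field


@dataclass
class Fact:
    attribute: str
    value: str
    documents: list = field(default_factory=list)  # documents that include the fact


@dataclass
class InformationSet:
    unifying_subject: str
    facts: list
    source_document: str  # associates the set with its source document


def store_information_set(database, label_scores, facts, source_url):
    """Pick the highest-scoring candidate label and store a record under it.

    database is a plain dict used here as a stand-in for the facts database;
    label_scores maps each candidate label to its already-assigned score.
    """
    subject = max(label_scores, key=label_scores.get)
    record = InformationSet(subject, list(facts), source_url)
    database[subject] = record
    return record
```

Keying the dictionary by subject is a simplification: two different source documents that resolve to the same subject would overwrite each other, so a fuller design would need to merge or disambiguate such records.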

7. A non-transitory computer readable storage medium storing one or more computer programs executed by a computerized server system, the one or more computer programs comprising instructions to generate a facts database, the instructions including:
- instructions to access a source document from a document host;
  
  instructions to extract one or more facts from the source document, each fact including an attribute-value pair and a list of documents that include the fact;
  
  instructions to identify a set of linking documents that have one or more links to the source document, wherein a respective link contains anchor text;
  
  instructions to generate a set of candidate labels from the anchor text of the linking documents;
  
  instructions to assign a score to each candidate label;
  
  instructions to select the candidate label with a highest score as a unifying subject of the one or more facts; and
  
  instructions to, for the unifying subject, store in the facts database an information set distinct from the source document, wherein the object includes the unifying subject, one or more entries corresponding to the one or more facts extracted from the source document, and information associating the source document with the information set.
- - 8. The computer readable storage medium of claim 7, further comprising instructions to:
    - select one or more second labels of the candidate labels according to second predefined criteria; and
      
      associate the selected second labels with the source document and the one or more facts extracted from the source document.
  - 9. The computer readable storage medium of claim 7, wherein the instructions for selecting the label comprise instructions to:
    - for each of the set of candidate labels:
      
      determine a set of frequencies of one or more substrings of the respective candidate label;
      
      generate a frequency vector associated with the respective candidate label based on the set of frequencies; and
      
      determine a centroid vector based on the frequency vectors of the candidate labels, wherein the selected label is associated with the respective frequency vector having a shortest distance to the centroid vector.

Current Assignee: Google LLC (Alphabet Inc.)
Original Assignee: Jonathan T. Betz
Inventors: Betz, Jonathan T.
Granted Patent: US 8,078,573 B2
US Class Current: 707/723
CPC Class Codes:
G06F 16/313 Selection or weighting of t...
Y10S 707/96 Object-relational
