Phrase extraction using subphrase scoring

US 9,355,169 B1
Filed: 09/13/2012
Issued: 05/31/2016
Est. Priority Date: 03/30/2007
Status: Active Grant

Abstract

An information retrieval system uses phrases to index, retrieve, organize and describe documents. Phrases are extracted from the document collection. Documents are then indexed according to their included phrases, using phrase posting lists. The phrase posting lists are stored in a cluster of index servers. The phrase posting lists can be tiered into groups, and sharded into partitions. Phrases in a query are identified based on possible phrasifications. A query schedule is created from the phrases, and then optimized to reduce query processing and communication costs. The execution of the query schedule is managed to further reduce or eliminate query processing operations at various ones of the index servers.
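
The abstract does not define "phrasification" further. As a rough illustration only, the sketch below assumes a phrasification is one way of segmenting the query words into known phrases and single words, and enumerates all such segmentations. The function name `phrasifications` and the `known_phrases` set are hypothetical and do not come from the patent.

```python
def phrasifications(words, known_phrases):
    """Return every way to split `words` into known phrases or single words.

    Assumption: a single word is always an acceptable unit, and a
    multi-word unit is acceptable only if it appears in `known_phrases`.
    """
    if not words:
        return [[]]
    results = []
    for end in range(1, len(words) + 1):
        head = " ".join(words[:end])
        if end == 1 or head in known_phrases:
            # Recurse on the remaining words and prepend the chosen unit.
            for rest in phrasifications(words[end:], known_phrases):
                results.append([head] + rest)
    return results


print(phrasifications(["new", "york", "restaurants"], {"new york"}))
# [['new', 'york', 'restaurants'], ['new york', 'restaurants']]
```

Each returned segmentation would then be a candidate input for building a query schedule; how the system ranks or prunes them is not stated in the abstract.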

21 Claims

1. A computer implemented method of extracting a set of phrases from a plurality of documents, the method comprising:
- for each document:
  
  identifying a plurality of candidate phrases occurring within the document, wherein a candidate phrase includes two or more consecutive words that are determined to occur in the document, and
  
  scoring candidate phrases in the document to produce respective document phrase scores for the candidate phrases for the document, the document phrase score for a candidate phrase being based on attributes of individual occurrences of the candidate phrase in the document, with at least some candidate phrases appearing repeatedly having a higher document phrase score than candidate phrases appearing once;
  
  for a candidate phrase of the plurality of the candidate phrases:
  
  creating, via a processor, a combined score for the candidate phrase based on a plurality of different document phrase scores for the candidate phrase for respective different documents; and
  
  selecting the candidate phrase for inclusion in the extracted set based on the combined score for the candidate phrase and based on the document phrase scores for the candidate phrase.
- - 2. The method of claim 1, wherein identifying a plurality of candidate phrases occurring in the document includes:
    - identifying as a candidate phrase a sequence of words in the document terminated by a semantic marker.
  - 3. The method of claim 2, wherein the semantic marker is any one of a group consisting of a linguistic, grammatical, structural, or typographic indicator.
  - 4. The method of claim 1, wherein scoring a candidate phrase in the document to produce a document phrase score includes:
    - scoring the occurrences of the candidate phrase in the document to produce respective instance phrase scores for the candidate phrase for the document, an instance phrase score being based on a location attribute of the occurrence of the candidate phrase within the document, and based on a position attribute of the occurrence of the candidate phrase relative to a sequence of words containing the occurrence of the candidate phrase; and
      
      combining the respective instance phrase scores of the candidate phrase in the document into a document phrase score.
  - 5. The method of claim 4, wherein scoring a candidate phrase in the document to produce a document phrase score for the candidate phrase includes:
    - determining for the candidate phrase two or more subphrases within the candidate phrase, wherein a subphrase contains two or more words; and
      
      scoring each determined subphrase in the document as a function of the position of the subphrase relative to the sequence of words containing the candidate phrase, and the document phrase score of the candidate phrase.
  - 6. The method of claim 4, wherein the instance phrase score is further based on typeface attributes of the occurrence of the candidate phrase within the document.
  - 7. The method of claim 1, wherein selecting the candidate phrase for inclusion in the extracted set based on the combined score and based on the document phrase scores includes:
    - selecting the candidate phrase when a maximum value of the document phrase scores exceeds a first threshold.
  - 8. The method of claim 1, wherein selecting the candidate phrase for inclusion in the extracted set based on the combined score and based on the document phrase scores includes:
    - selecting the candidate phrase when the combined score exceeds a second threshold.
  - 9. The method of claim 1, wherein selecting the candidate phrase for inclusion in the extracted set based on the combined score and based on the document phrase scores includes:
    - selecting the candidate phrase when the number of documents for which the candidate phrase had at least a minimum document phrase score exceeds a third threshold.
  - 10. The method of claim 1, wherein identifying a plurality of candidate phrases occurring within the document includes:
    - identifying as a candidate phrase every sequence of N words, where N is an integer greater than 2.
  - 11. The method of claim 1, wherein selecting the candidate phrase for inclusion in the extracted set based on the combined score and based on the document phrase scores includes:
    - selecting the candidate phrase when a maximum value of the document phrase scores exceeds a first threshold, or when the combined score exceeds a second threshold, or when the number of documents for which the candidate phrase had at least a minimum document phrase score exceeds a third threshold.
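
The flow recited in claims 1 and 11 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patented implementation: every occurrence gets a constant instance score, document phrase scores are summed into the combined score, and all function names, parameters, and threshold values are hypothetical.

```python
from collections import defaultdict


def candidate_phrases(words, n_max=3):
    """Yield every run of 2..n_max consecutive words, once per occurrence."""
    for n in range(2, n_max + 1):
        for i in range(len(words) - n + 1):
            yield " ".join(words[i:i + n])


def document_phrase_scores(words, n_max=3, instance_score=1.0):
    """Sum a per-occurrence score so repeated phrases outscore single ones.

    Assumption: a constant instance score. The claims instead base it on
    attributes of each occurrence (location, position, typeface).
    """
    scores = defaultdict(float)
    for phrase in candidate_phrases(words, n_max):
        scores[phrase] += instance_score
    return dict(scores)


def extract_phrases(documents, max_threshold=3.0, combined_threshold=4.0,
                    min_doc_score=1.0, doc_count_threshold=2):
    """Select phrases using the three alternative tests of claim 11."""
    per_phrase = defaultdict(list)  # phrase -> list of document phrase scores
    for text in documents:
        words = text.lower().split()
        for phrase, score in document_phrase_scores(words).items():
            per_phrase[phrase].append(score)

    selected = set()
    for phrase, doc_scores in per_phrase.items():
        combined = sum(doc_scores)  # assumed combination rule: plain sum
        supporting_docs = sum(1 for s in doc_scores if s >= min_doc_score)
        if (max(doc_scores) > max_threshold
                or combined > combined_threshold
                or supporting_docs > doc_count_threshold):
            selected.add(phrase)
    return selected


docs = ["new york city is big", "i love new york city",
        "new york city never sleeps", "a red car"]
print(sorted(extract_phrases(docs)))
# ['new york', 'new york city', 'york city']
```

In this toy run the three selected phrases pass only the third test: each appears in three documents, which exceeds the assumed document-count threshold of 2.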

12. A system for extracting a set of phrases from a plurality of documents, the system comprising:
- one or more computer readable media comprising executable instructions; and
  
  one or more processors configured to execute the instructions, wherein execution of the instructions causes the system to:
  
  for each document:
  
  identify a plurality of candidate phrases occurring in the document, wherein a candidate phrase includes two or more consecutive words that are determined to occur in the document;
  
  score candidate phrases in the document to produce respective document phrase scores for the candidate phrases for the document, the document phrase score for a candidate phrase being based on attributes of individual occurrences of the candidate phrase in the document, with at least some candidate phrases appearing repeatedly having a higher document phrase score than candidate phrases appearing once, and
  
  for a candidate phrase of the plurality of the candidate phrases:
  
  create, via the one or more processors, a combined score for the candidate phrase based on a plurality of different document phrase scores for the candidate phrase for respective different documents; and
  
  selecting the candidate phrase for inclusion in the extracted set based on the combined score for the candidate phrase and based on the document phrase scores for the candidate phrase.
- - 13. The system of claim 12, wherein identifying a plurality of candidate phrases contained in the document further includes:
    - identifying as a candidate phrase a sequence of words in the document terminated by a semantic marker.
  - 14. The system of claim 13, wherein the semantic marker is any one of a group consisting of a linguistic, grammatical, structural, or typographic indicator.
  - 15. The system of claim 12, wherein scoring a candidate phrase in the document to produce a document phrase score for the candidate phrase further includes:
    - scoring occurrences of the candidate phrase in the document to produce respective instance phrase scores, the instance phrase score being based on typeface attributes of the occurrence of the candidate phrase within the document.
  - 16. The system of claim 12, wherein scoring a candidate phrase in the document to produce a document phrase score for the candidate phrase includes:
    - determining for the candidate phrase two or more subphrases within the candidate phrase, wherein a subphrase contains two or more words; and
      
      scoring each determined subphrase in the document as a function of the position of the subphrase relative to a sequence of words containing the candidate phrase.
  - 17. The system of claim 12, wherein selecting the candidate phrase for inclusion in the extracted set based on the combined score and based on the document phrase scores includes:
    - selecting the candidate phrase when a maximum value of the document phrase scores exceeds a first threshold.
  - 18. The system of claim 12, wherein selecting the candidate phrase for inclusion in the extracted set based on the combined score and based on the document phrase scores includes:
    - selecting the candidate phrase when the combined score exceeds a second threshold.
  - 19. The system of claim 12, wherein selecting the candidate phrase for inclusion in the extracted set based on the combined score and based on the document phrase scores includes:
    - selecting the candidate phrase when the number of documents for which the candidate phrase had at least a minimum document phrase score exceeds a third threshold.
  - 20. The system of claim 12, wherein identifying a plurality of candidate phrases contained in the document further includes:
    - identifying as a candidate phrase every sequence of N words, where N is an integer greater than 2.
  - 21. The system of claim 12, wherein selecting the candidate phrase for inclusion in the extracted set based on the combined score and based on the document phrase scores includes:
    - selecting the candidate phrase when a maximum value of the document phrase scores exceeds a first threshold, or when the combined score exceeds a second threshold, or when the number of documents for which the candidate phrase had at least a minimum document phrase score exceeds a third threshold.
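
Claims 13 and 16 (and their method counterparts, claims 2 and 5) describe candidates terminated by a semantic marker and subphrases scored by position. The sketch below is one possible reading, offered under assumptions: the marker set (punctuation plus a few stop words), the rule that subphrases nearer the end of the candidate keep more of the parent score, the `decay` factor, and all names are hypothetical choices for illustration, not details taken from the patent.

```python
import re

# Hypothetical semantic markers: some punctuation and a few stop words.
MARKERS = re.compile(r"[.,;:!?()]|\b(?:the|a|an|of|and|or|is|are|in|on)\b",
                     re.IGNORECASE)


def marker_terminated_candidates(text):
    """Split text at semantic markers; keep word runs of two or more words."""
    candidates = []
    for segment in MARKERS.split(text):
        words = segment.split()
        if len(words) >= 2:
            candidates.append(words)
    return candidates


def score_subphrases(words, phrase_score, decay=0.5):
    """Score each proper subphrase of two or more words.

    Assumed rule: the score is the candidate's document phrase score
    multiplied by decay ** (number of words between the subphrase and
    the end of the candidate).
    """
    n = len(words)
    scores = {}
    for length in range(2, n):          # proper subphrases only
        for start in range(n - length + 1):
            distance_from_end = n - (start + length)
            sub = " ".join(words[start:start + length])
            scores[sub] = phrase_score * decay ** distance_from_end
    return scores


print(marker_terminated_candidates("Read about the quick brown fox, then sleep."))
# [['Read', 'about'], ['quick', 'brown', 'fox'], ['then', 'sleep']]
print(score_subphrases(["quick", "brown", "fox"], 4.0))
# {'quick brown': 2.0, 'brown fox': 4.0}
```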

Current Assignee: Google LLC (Alphabet Inc.)
Original Assignee: Google Inc. (Alphabet Inc.)
Inventors: Mazumdar, Soham; Przebinda, Viktor; Zunger, Yonatan
Primary Examiner(s): Singh, Amresh
Application Number: US13/615,541
Time in Patent Office: 1,356 Days
Field of Search: 707/713
US Class Current: 1/1
CPC Class Codes:
G06F 16/313 Selection or weighting of t...
G06F 16/951 Indexing; Web crawling tech...
