Discriminating search results by phrase analysis

US 8,396,850 B2
Filed: 02/27/2009
Issued: 03/12/2013
Est. Priority Date: 02/27/2009
Status: Active Grant

Abstract

A statistical analysis parses documents for phrases in the documents. Each document is analyzed with a phrase analysis engine to determine key phrases that frequently occur throughout each document. One or more documents are grouped together based on a corresponding statistically improbable phrase.

53 Citations

20 Claims

1. A computer-implemented method comprising:
- parsing, by a server computing device, each document of a corpus of documents to determine phrases found in each of the documents;
- analyzing, by the server computing device, each determined phrase with respect to each document to determine a frequency of occurrence of the phrase in the document relative to a frequency of occurrence of the phrase in the corpus;
- identifying, by the server computing device, documents that comprise a same statistically improbable phrase, wherein the statistically improbable phrase is one of the determined phrases having both of:
  - a probability of occurrence in a document of the corpus of documents that is higher than probability of occurrence of other phrases in the document; and
  - a probability of occurrence in the corpus of documents that is lower than probability of occurrence of other phrases in the corpus of documents; and
- grouping, by the server computing device, the identified documents that comprise the statistically improbable phrase into a single group of documents.

Dependent claims (2, 3, 4, 5, 6, 7):
- 2. The computer-implemented method of claim 1 further comprising:
  - adding to the single group of documents, those documents with a similar statistically improbable phrase.
- 3. The computer-implemented method of claim 2 wherein one or more of the documents with the similar statistically improbable phrase are not associated with the key phrase.
- 4. The computer-implemented method of claim 1 wherein the statistically improbable phrase is determined by:
  - performing a statistical analysis to determine a frequency of occurrence of a phrase in a document relative to a set of documents; and
  - associating the phrase with one or more documents that have a lower probability of occurrence of the phrase in the set of documents.
- 5. The computer-implemented method of claim 1 further comprising:
  - sorting one or more documents based on one or more statistically improbable phrases.
- 6. The computer-implemented method of claim 1 further comprising:
  - adjusting a degree of correlation between documents in a group based on the number of shared key phrases.
- 7. The computer-implemented method of claim 1 further comprising:
  - displaying the different groups of the one or more documents in response to a search query phrase.
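
The four steps of claim 1 (parse, analyze, identify, group) can be illustrated with a small Python sketch. It is not the patented implementation. Several choices are assumptions made only for illustration: a phrase is taken to be a word bigram, "statistically improbable" is approximated by a ratio of in-document frequency to corpus frequency, and the names `parse_phrases`, `find_improbable_phrases`, `group_by_phrase`, `min_count`, and `min_ratio` are hypothetical.

```python
from collections import Counter, defaultdict


def parse_phrases(text, n=2):
    """Split text into lowercase word n-grams (treating a phrase as an n-gram is an assumption)."""
    words = [w.strip(".,;:!?\"'()").lower() for w in text.split()]
    words = [w for w in words if w]
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def find_improbable_phrases(corpus, n=2, min_count=2, min_ratio=1.5):
    """Map each document id to the phrases that are frequent in it but rare in the corpus."""
    doc_counts = {doc_id: Counter(parse_phrases(text, n)) for doc_id, text in corpus.items()}
    corpus_counts = Counter()
    for counts in doc_counts.values():
        corpus_counts.update(counts)
    corpus_total = sum(corpus_counts.values())

    flagged = {}
    for doc_id, counts in doc_counts.items():
        doc_total = sum(counts.values())
        flagged[doc_id] = {
            phrase
            for phrase, count in counts.items()
            if count >= min_count
            # Same test as (count / doc_total) / (corpus_count / corpus_total) >= min_ratio,
            # cross-multiplied to avoid floating-point division.
            and count * corpus_total >= min_ratio * corpus_counts[phrase] * doc_total
        }
    return flagged


def group_by_phrase(flagged):
    """Group documents that were flagged with the same phrase; keep groups of two or more."""
    groups = defaultdict(list)
    for doc_id, phrases in flagged.items():
        for phrase in phrases:
            groups[phrase].append(doc_id)
    return {phrase: sorted(ids) for phrase, ids in groups.items() if len(ids) > 1}


# Hypothetical four-document corpus.
corpus = {
    "d1": "quantum foam drives quantum foam models",
    "d2": "quantum foam appears when quantum foam cools",
    "d3": "the river runs to the sea",
    "d4": "the sea meets the river",
}
groups = group_by_phrase(find_improbable_phrases(corpus))
print(groups)
```

In this toy corpus the two documents that repeat "quantum foam" land in one group, while the river documents share phrases that never repeat within a single document and so are not flagged. The thresholds are arbitrary; a real corpus would need them tuned.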

8. A server comprising:
- a processing device;
- a memory coupled to the processing device, the memory storing a corpus of documents; and
- a phrase analysis engine executable from the memory by the processing device, the phrase analysis engine comprising:
  - a parser configured to parse each document of the corpus of documents to determine phrases found in each of the documents;
  - an analyzer configured to:
    - analyze each determined phrase with respect to each document to determine a frequency of occurrence of the phrase in the document relative to a frequency of occurrence of the phrase in the corpus; and
    - identify documents that comprise a same statistically improbable phrase, wherein the statistically improbable phrase is one of the determined phrases having both of:
      - a probability of occurrence in a document of the corpus of documents that is higher than probability of occurrence of other phrases in the document; and
      - a probability of occurrence in the corpus of documents that is lower than probability of occurrence of other phrases in the corpus of documents; and
  - a categorizer configured to group the identified documents that comprise the statistically improbable phrase into a single group of documents.

Dependent claims (9, 10, 11, 12, 13, 14):
- 9. The server of claim 8 wherein the phrase analysis engine further comprises a correlater configured to add to the single group of documents, those documents with a similar statistically improbable phrase.
- 10. The server of claim 9 wherein one or more of the documents with the similar statistically improbable phrase are not associated with the key phrase.
- 11. The server of claim 8 wherein the statistically improbable phrase is determined by performing a statistical analysis to determine a frequency of occurrence of a phrase in a document relative to a set of documents, and associating the phrase with one or more documents that have a lower probability of occurrence of the phrase in the set of documents.
- 12. The server of claim 8 wherein the categorizer is further configured to sort one or more documents based on one or more statistically improbable phrases.
- 13. The server of claim 8 wherein the categorizer is further configured to adjust a degree of correlation between documents in a group based on the number of shared key phrases.
- 14. The server of claim 8 wherein the server is coupled to a display, the display configured to display the different groups of the one or more documents in response to a search query phrase.
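
Claim 14 (like claim 7) describes showing the different document groups in response to a search query phrase. One way this might look, as a rough sketch only: the query is matched by plain substring search, the phrase groups are assumed to have been computed already, and `discriminate_results` together with the sample data is hypothetical.

```python
def discriminate_results(query, documents, groups):
    """Split the documents matching `query` according to the phrase group they belong to."""
    needle = query.lower()
    # Substring matching stands in for a real search backend (an assumption).
    hits = [doc_id for doc_id, text in documents.items() if needle in text.lower()]

    by_group = {}
    grouped = set()
    for phrase, members in groups.items():
        matched = [doc_id for doc_id in hits if doc_id in members]
        if matched:
            by_group[phrase] = matched
            grouped.update(matched)

    # Hits that belong to no phrase group are reported separately.
    ungrouped = [doc_id for doc_id in hits if doc_id not in grouped]
    return by_group, ungrouped


# Hypothetical documents and precomputed phrase groups.
documents = {
    "d1": "jaguar top speed and engine displacement",
    "d2": "jaguar engine displacement specs",
    "d3": "jaguar rainforest habitat in brazil",
    "d4": "rainforest habitat loss threatens the jaguar",
    "d5": "jaguar photo gallery",
    "d6": "engine displacement explained",
}
groups = {
    "engine displacement": ["d1", "d2", "d6"],
    "rainforest habitat": ["d3", "d4"],
}
by_group, ungrouped = discriminate_results("jaguar", documents, groups)
for phrase, doc_ids in by_group.items():
    print(phrase, doc_ids)
print("ungrouped", ungrouped)
```

Here an ambiguous query is separated into a car-related cluster and an animal-related cluster, which is the kind of discrimination the patent title refers to.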

15. A non-transitory computer-accessible storage medium including data that, when accessed by a computer system, cause the computer system to perform a method comprising:
- parsing, by a server computing device, each document of a corpus of documents to determine phrases found in each of the documents;
- analyzing, by the server computing device, each determined phrase with respect to each document to determine a frequency of occurrence of the phrase in the document relative to a frequency of occurrence of the phrase in the corpus;
- identifying, by the server computing device, documents that comprise a same statistically improbable phrase, wherein the statistically improbable phrase is one of the determined phrases having both of:
  - a probability of occurrence in a document of the corpus of documents that is higher than probability of occurrence of other phrases in the document; and
  - a probability of occurrence in the corpus of documents that is lower than probability of occurrence of other phrases in the corpus of documents; and
- grouping, by the server computing device, the identified documents that comprise the statistically improbable phrase into a single group of documents.

Dependent claims (16, 17, 18, 19, 20):
- 16. The non-transitory computer-accessible storage medium of claim 15 wherein the method further comprises:
  - adding to the single group of documents, those documents with a similar statistically improbable phrase.
- 17. The non-transitory computer-accessible storage medium of claim 16 wherein one or more of the documents with the similar statistically improbable phrase are not associated with the key phrase.
- 18. The non-transitory computer-accessible storage medium of claim 15 wherein the statistically improbable phrase is determined by:
  - performing a statistical analysis to determine a frequency of occurrence of a phrase in a document relative to a set of documents; and
  - associating the phrase with one or more documents that have a lower probability of occurrence of the phrase in the set of documents.
- 19. The non-transitory computer-accessible storage medium of claim 15 wherein the method further comprises:
  - sorting one or more documents based on one or more statistically improbable phrases.
- 20. The non-transitory computer-accessible storage medium of claim 15 wherein the method further comprises:
  - adjusting a degree of correlation between documents in a group based on the number of shared key phrases.
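
Claims 6, 13, and 20 tie the degree of correlation between documents to the number of key phrases they share, and claims 5, 12, and 19 mention sorting. The claims do not say how the degree is computed, so the sketch below simply uses the raw count of shared phrases as the score, which is an assumption. The function names and the sample phrase sets are hypothetical.

```python
from itertools import combinations


def shared_phrase_correlation(key_phrases):
    """Return {(doc_a, doc_b): shared phrase count} for every pair sharing at least one phrase."""
    scores = {}
    for a, b in combinations(sorted(key_phrases), 2):
        shared = len(key_phrases[a] & key_phrases[b])
        if shared:
            scores[(a, b)] = shared
    return scores


def rank_pairs(scores):
    """Sort document pairs from most to least correlated, breaking ties by document id."""
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical key phrases per document.
key_phrases = {
    "d1": {"quantum foam", "planck scale", "vacuum energy"},
    "d2": {"quantum foam", "planck scale"},
    "d3": {"vacuum energy"},
    "d4": {"river delta"},
}
scores = shared_phrase_correlation(key_phrases)
ranked = rank_pairs(scores)
for pair, degree in ranked:
    print(pair, degree)
```

A normalized measure such as Jaccard similarity would be a reasonable alternative when documents differ widely in how many key phrases they carry.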

Current Assignee: Red Hat, Inc. (International Business Machines Corporation)
Original Assignee: Red Hat, Inc. (International Business Machines Corporation)
Inventors: Schneider, James Paul
Primary Examiner(s): Beausoliel, Jr., Robert
Assistant Examiner(s): Allen, Nicholas E
Application Number: US12/395,507
Publication Number: US 20100223273A1
Time in Patent Office: 1,474 Days
Field of Search: 707/706, 707/999.002
US Class Current: 707/706
CPC Class Codes:
G06F 16/355 Class or cluster creation o...
G06F 40/289 Phrasal analysis, e.g. fini...
