DETECTING SPAM DOCUMENTS IN A PHRASE BASED INFORMATION RETRIEVAL SYSTEM

US 20110131223A1
Filed: 10/13/2009
Published: 06/02/2011
Est. Priority Date: 07/26/2004
Status: Active Grant

Abstract

An information retrieval system uses phrases to index, retrieve, organize and describe documents. Phrases are identified that predict the presence of other phrases in documents. Documents are then indexed according to their included phrases. A spam document is identified based on the number of related phrases included in a document.

19 Claims

1. (canceled)

2. A computer program product stored on one or more non-transitory computer readable storage media and comprising instructions that, when executed, cause an apparatus to:
- determine, for a document that contains a first phrase, a number of related phrases related to the first phrase expected to be present in the document;
  
  determine for the document, and for the first phrase in the document, an actual number of related phrases present in the document; and
  
  identify the document as a spam document by comparing the actual number of related phrases present in the document with the expected number of related phrases, wherein determining the number of related phrases expected to be present in the document includes:
  
  traversing an index of a plurality of documents;
  
  for each of the indexed documents, determining a set of phrases in the document, and for each phrase in the set, determining a number of related phrases also in the document; and
  
  determining the expected number of related phrases based on the determined number of related phrases across the traversed documents.
- Dependent claims (3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16)
  - 3. The computer program product of claim 2, wherein determining the expected number of related phrases based on the determined number of related phrases across the traversed documents includes determining the expected number of related phrases as a median of the determined number of related phrases across the traversed documents.
  - 4. The computer program product of claim 2, wherein identifying the document as a spam document, further comprises:
    - determining, for a second phrase contained in the document, a number of the related phrases related to a second phrase expected to be present in the document;
      
      determining for the document, and for the second phrase in the document, an actual number of related phrases present in the document;
      
      determining, for a third phrase contained in the document, a number of the related phrases related to a third phrase expected to be present in the document;
      
      determining for the document, and for the third phrase in the document, an actual number of related phrases present in the document; and
      
      identifying the document as a spam document when the actual number of related phrases present in the document for any of the first phrase, the second phrase, or the third phrase exceeds the expected number of related phrases.
  - 5. The computer program product of claim 2, wherein the determination of the number of the related phrases expected to be present in the document is based on a statistical analysis of a plurality of documents that include the first phrase and related phrases, and wherein identifying the document as a spam document, further comprises:
    - determining a standard deviation of the expected number of related phrases; and
      
      responsive to the actual number of related phrases present in the document exceeding the expected number of related phrases by at least a multiple of a standard deviation of the expected number of related phrases, identifying the document as a spam document.
  - 6. The computer program product of claim 2, wherein identifying the document as a spam document, further comprises:
    - responsive to the actual number of related phrases present in the document exceeding the expected number of related phrases by at least a multiple of the expected number of related phrases, identifying the document as a spam document.
  - 7. The computer program product of claim 2, wherein identifying the document as a spam document, further comprises:
    - determining, for a second phrase contained in the document, a number of the related phrases related to a second phrase expected to be present in the document;
      
      determining for the document, and for the second phrase in the document, an actual number of related phrases present in the document;
      
      determining, for a third phrase contained in the document, a number of the related phrases related to a third phrase expected to be present in the document;
      
      determining for the document, and for the third phrase in the document, an actual number of related phrases present in the document;
      
      identifying the document as a spam document where, for each of the first phrase, the second phrase, and the third phrase, the actual number of related phrases present in the document exceeds the expected number of related phrases based on a threshold.
  - 8. The computer program product of claim 2, wherein identifying the document as a spam document, further comprises:
    - identifying the document as a spam document when the actual number of related phrases present in the document exceeds a predetermined maximum expected number of related phrases.
  - 9. The computer program product of claim 2, wherein identifying the document as a spam document, further comprises:
    - determining for a document, first, second, and third most significant phrases present in the document, wherein the first phrase is the most significant phrase;
      
      determining, for the second most significant phrase, a number of the related phrases related to a second most significant phrase expected to be present in the document;
      
      determining, for the third most significant phrase, a number of the related phrases related to a third most significant phrase expected to be present in the document;
      
      for each of the first, second, and third most significant phrases, determining an actual number of related phrases present in the document; and
      
      responsive to the actual number of related phrases exceeding the expected number of related phrases for each of the first, second, and third most significant phrases based on a threshold, identifying the document as a spam document.
  - 10. The computer program product of claim 2, wherein identifying the document as a spam document, further comprises:
    - determining for a document, first, second, and third most significant phrases present in the document, wherein the first phrase is the most significant phrase;
      
      determining, for the second most significant phrase, a number of the related phrases related to a second most significant phrase expected to be present in the document;
      
      determining, for the third most significant phrase, a number of the related phrases related to a third most significant phrase expected to be present in the document;
      
      for each of the first, second, and third most significant phrases, determining an actual number of related phrases present in the document; and
      
      responsive to the actual number of related phrases exceeding the expected number of related phrases for any of the first, second, or third most significant phrases based on a threshold, identifying the document as a spam document.
  - 11. The computer program product of claim 2, further comprising instructions that, when executed, cause the apparatus to:
    - add the identified document to a spam list that includes a list of spam documents.
  - 12. The computer program product of claim 10, further comprising instructions that, when executed, cause the apparatus to:
    - add the identified document to a spam list that includes a list of spam documents associated with the most significant phrase; and
      
      for each related phrase of the most significant phrase, add the identified document to a list of spam documents associated with the related phrase.
  - 13. The computer program product of claim 2, further comprising instructions that, when executed, cause the apparatus to add the identified document to a spam list of documents based on the comparison indicating that the actual number of related phrases exceeds the expected number of related phrases.
  - 14. The computer program product of claim 2, wherein the determination of the number of the related phrases expected to be present in the document is based on a statistical analysis of a plurality of documents that include the phrase and related phrases.
  - 15. The computer program product of claim 2, wherein identifying the document as a spam document is based on the comparison of the actual number of related phrases present in the document with the expected number of related phrases indicating that the actual number of related phrases present in the document exceeds the expected number of related phrases.
  - 16. The computer program product of claim 2, further comprising instructions that, when executed, cause the apparatus to:
    - receive a search query;
      
      determine a set of documents that match the search query; and
      
      determine a relevance of each document of the set to the search query, wherein the relevance depends on whether the document is listed on a spam list.
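
The steps recited in claims 2 through 7 can be illustrated with a short sketch. This is not the patented implementation, only one reading of the claim language under stated assumptions: the index is modeled as a dictionary from document id to the set of phrases in that document, the related-phrase table `RELATED` is given in advance, the expected count is taken as the median across indexed documents (claim 3), and a document is flagged when its actual count exceeds the expected count by `k` standard deviations (claim 5). All function names, the sample data, and the default `k = 2.0` are hypothetical.

```python
from statistics import median, pstdev


def related_counts(index, related):
    """Traverse the index; for each phrase in each document, record how many
    of that phrase's related phrases also appear in the same document."""
    counts = {}
    for doc_phrases in index.values():
        for phrase in doc_phrases:
            if phrase not in related:
                continue  # no related-phrase data for this phrase
            n = len(related[phrase] & doc_phrases)
            counts.setdefault(phrase, []).append(n)
    return counts


def expected_stats(counts):
    """Expected count per phrase as the median (claim 3), plus the
    population standard deviation used for the threshold (claim 5)."""
    return {p: (median(c), pstdev(c)) for p, c in counts.items()}


def exceeds(doc_phrases, phrase, related, stats, k=2.0):
    """True when the actual related-phrase count for `phrase` exceeds the
    expected count by more than k standard deviations (k is an assumption)."""
    expected, sd = stats[phrase]
    actual = len(related[phrase] & doc_phrases)
    return actual > expected + k * sd


def is_spam(doc_phrases, phrases, related, stats, k=2.0, mode="any"):
    """Check several phrases: mode="any" mirrors claim 4, mode="each" mirrors claim 7."""
    checks = [exceeds(doc_phrases, p, related, stats, k)
              for p in phrases if p in stats]
    if mode == "any":
        return any(checks)
    return bool(checks) and all(checks)


# Hypothetical sample data, for illustration only.
RELATED = {"dog": {"puppy", "leash", "kennel", "bone", "vet"}}
INDEX = {
    "d1": {"dog", "puppy"},
    "d2": {"dog", "leash"},
    "d3": {"dog", "puppy", "bone"},
    "d4": {"dog"},
    "d5": {"dog", "vet", "leash"},
}
STATS = expected_stats(related_counts(INDEX, RELATED))

stuffed = {"dog", "puppy", "leash", "kennel", "bone", "vet"}
ordinary = {"dog", "puppy"}
print(is_spam(stuffed, ["dog"], RELATED, STATS))
print(is_spam(ordinary, ["dog"], RELATED, STATS))
```

Swapping the threshold expression for `actual > m * expected` would correspond to the multiple-of-expected test in claim 6, and comparing against a fixed constant would correspond to the predetermined maximum in claim 8.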

17. A computer program product stored on one or more non-transitory computer readable storage media and comprising instructions that, when executed, cause an apparatus to:
- receive a search query;
  
  retrieve a set of documents relevant to the search query, each document having a relevance score;
  
  determine, for each document in the set of documents, whether the document has been identified as a spam document;
  
  down-weight the relevance score of the document in response to a document being identified as a spam document; and
  
  organize the set of documents by their relevance scores, wherein the relevance scores by which the documents are organized include down-weighted relevance scores for documents that have been identified as spam documents, wherein whether the document has been identified as a spam document is based on:
  
  determining, for a document that contains a first phrase, a number of related phrases related to the first phrase expected to be present in the document;
  
  determining for the document, and for the first phrase in the document, an actual number of related phrases present in the document; and
  
  identifying the document as a spam document by comparing the actual number of related phrases present in the document with the expected number of related phrases, wherein determining the number of related phrases expected to be present in the document includes:
  
  traversing an index of a plurality of documents;
  
  for each of the indexed documents, determining a set of phrases in the document, and for each phrase in the set, determining a number of related phrases also in the document; and
  
  determining the expected number of related phrases based on the determined number of related phrases across the traversed documents.
- Dependent claims (18, 19)
  - 18. The computer program product of claim 17, wherein whether the document has been identified as a spam document is further based on:
    - determining, for a second phrase contained in the document, a number of the related phrases related to a second phrase expected to be present in the document;
      
      determining for the document, and for the second phrase in the document, an actual number of related phrases present in the document;
      
      determining, for a third phrase contained in the document, a number of the related phrases related to a third phrase expected to be present in the document;
      
      determining for the document, and for the third phrase in the document, an actual number of related phrases present in the document; and
      
      identifying the document as a spam document when the actual number of related phrases present in the document for any of the first phrase, the second phrase, or the third phrase exceeds the expected number of related phrases for the respective first, second, or third phrase.
  - 19. The computer program product of claim 17, wherein whether the document has been identified as a spam document is further based on:
    - determining, for a second phrase contained in the document, a number of the related phrases related to a second phrase expected to be present in the document;
      
      determining for the document, and for the second phrase in the document, an actual number of related phrases present in the document;
      
      determining, for a third phrase contained in the document, a number of the related phrases related to a third phrase expected to be present in the document;
      
      determining for the document, and for the third phrase in the document, an actual number of related phrases present in the document;
      
      identifying the document as a spam document where, for each of the first phrase, the second phrase, and the third phrase, the actual number of related phrases present in the document exceeds the expected number of related phrases based on a threshold.
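
The query-time behavior recited in claim 17 can be sketched as follows. The claim says only that the relevance score of a spam document is down-weighted; the multiplicative `penalty` factor, the function name `rank_results`, and the sample scores below are assumptions made for illustration, not details taken from the patent.

```python
def rank_results(results, spam_list, penalty=0.1):
    """Down-weight the score of every document on the spam list, then
    order all documents by their (possibly down-weighted) scores.

    results   -- list of (doc_id, relevance_score) pairs
    spam_list -- set of doc ids previously identified as spam
    penalty   -- hypothetical multiplicative down-weighting factor
    """
    adjusted = [
        (doc_id, score * penalty if doc_id in spam_list else score)
        for doc_id, score in results
    ]
    # Highest score first; spam documents stay in the list but sink.
    return sorted(adjusted, key=lambda pair: pair[1], reverse=True)


# Hypothetical retrieved set: "a" has the best raw score but is on the spam list.
retrieved = [("a", 0.9), ("b", 0.8), ("c", 0.5)]
ranked = rank_results(retrieved, spam_list={"a"})
print([doc_id for doc_id, _ in ranked])
```

Keeping spam documents in the result set with a reduced score, rather than dropping them, follows the claim wording, which organizes the documents by scores that include the down-weighted ones.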

Current Assignee: Google LLC (Alphabet Inc.)
Original Assignee: Google Inc. (Alphabet Inc.)
Inventors: Patterson, Anna Lynn
Granted Patent: US 8,078,629 B2
US Class Current: 707/758
CPC Class Codes

G06F 16/313   Selection or weighting of t...

Y10S 707/99931   Database or file accessing

Y10S 707/99932   Access augmentation or opti...

Y10S 707/99933   Query processing, i.e. sear...

Y10S 707/99937   Sorting
