Detecting spam documents in a phrase based information retrieval system
First Claim
1. A computer implemented method for identifying spam documents in an information retrieval system, the method comprising:
maintaining a list of phrases in a memory, each phrase associated with a list of related phrases;
determining, for a document that contains a first phrase from the list of phrases, a number of the related phrases related to the first phrase expected to be present in the document;
determining for the document, and for the first phrase in the document, an actual number of related phrases present in the document; and
identifying the document as a spam document by comparing the actual number of related phrases present in the document with the expected number of related phrases, wherein determining the number of related phrases related to the first phrase expected to be present in the document includes;
traversing an index of a plurality of documents;
for each of the indexed documents;
determining a set of phrases in the indexed document from the list of phrases, and for each phrase in the set, determining a number of related phrases also in the indexed document; and
determining the expected number of related phrases based on the determined number of related phrases, related to the first phrase, in the indexed documents.
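The steps recited in claim 1 can be illustrated with a minimal Python sketch. This is not the patented implementation: the claim only says the expected number is "based on" the per-document counts and that the two numbers are compared, so the use of an average, the direction of the comparison (flagging a document with far more related phrases than expected), the `ratio` threshold, and all names and sample data (`RELATED`, `related_counts_by_phrase`, `expected_related_count`, `is_spam`) are assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical phrase list: each phrase maps to its list of related phrases.
RELATED = {
    "australian shepherd": {"blue merle", "red merle", "agility training", "herding dog"},
    "blue merle": {"australian shepherd", "red merle"},
}

def related_counts_by_phrase(index):
    """Traverse the index; for each listed phrase found in a document,
    record how many of its related phrases also appear in that document."""
    counts = defaultdict(list)
    for doc_phrases in index:
        for phrase in RELATED.keys() & doc_phrases:
            counts[phrase].append(len(RELATED[phrase] & doc_phrases))
    return counts

def expected_related_count(phrase, index):
    """Expected number of related phrases, taken here (as an assumption)
    to be the average count over indexed documents containing the phrase."""
    counts = related_counts_by_phrase(index).get(phrase, [])
    return mean(counts) if counts else 0.0

def is_spam(doc_phrases, phrase, index, ratio=2.0):
    """Compare actual and expected counts; the ratio test is an assumed rule."""
    expected = expected_related_count(phrase, index)
    actual = len(RELATED.get(phrase, set()) & doc_phrases)
    return expected > 0 and actual > ratio * expected

# Each indexed document is represented as the set of phrases it contains.
index = [
    {"australian shepherd", "blue merle"},
    {"australian shepherd", "red merle", "herding dog"},
    {"blue merle"},
]
stuffed = {"australian shepherd", "blue merle", "red merle",
           "agility training", "herding dog"}
print(is_spam(stuffed, "australian shepherd", index))
```

Representing each document as a set of phrases keeps the counting step to a set intersection; a production index would store these counts at indexing time rather than recomputing them per query.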
Abstract
An information retrieval system uses phrases to index, retrieve, organize and describe documents. Phrases are identified that predict the presence of other phrases in documents. Documents are then indexed according to their included phrases. A spam document is identified based on the number of related phrases included in a document.
15 Claims
Specification