DETECTING SPAM DOCUMENTS IN A PHRASE BASED INFORMATION RETRIEVAL SYSTEM
Abstract
An information retrieval system uses phrases to index, retrieve, organize and describe documents. Phrases are identified that predict the presence of other phrases in documents. Documents are then indexed according to their included phrases. A spam document is identified based on the number of related phrases included in a document.
19 Claims
1. (canceled)
2. A computer program product stored on one or more non-transitory computer readable storage media and comprising instructions that, when executed, cause an apparatus to:
determine, for a document that contains a first phrase, a number of related phrases related to the first phrase expected to be present in the document;
determine for the document, and for the first phrase in the document, an actual number of related phrases present in the document; and
identify the document as a spam document by comparing the actual number of related phrases present in the document with the expected number of related phrases,
wherein determining the number of related phrases expected to be present in the document includes:
    traversing an index of a plurality of documents;
    for each of the indexed documents, determining a set of phrases in the document, and for each phrase in the set, determining a number of related phrases also in the document; and
    determining the expected number of related phrases based on the determined number of related phrases across the traversed documents.
Dependent claims: 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
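The steps recited in claim 2 can be illustrated with a minimal Python sketch. It is not the claimed implementation. The claim says only that the expected number is "based on" the counts across traversed documents and that the actual and expected numbers are compared, so the use of a per-phrase mean and of a multiplicative threshold (`factor`) are assumptions made here for illustration. The function names, the `related` mapping, and the sample data are likewise hypothetical.

```python
from collections import defaultdict
from statistics import mean


def related_counts(doc_phrases, related):
    """For each phrase in a document, count its related phrases also present."""
    present = set(doc_phrases)
    return {p: len(related.get(p, set()) & present) for p in present}


def expected_related(index, related):
    """Traverse the index and average, per phrase, the observed counts.

    Averaging is an assumption; the claim only requires that the expected
    number be based on the counts across the traversed documents.
    """
    observed = defaultdict(list)
    for phrases in index.values():
        for phrase, n in related_counts(phrases, related).items():
            observed[phrase].append(n)
    return {phrase: mean(ns) for phrase, ns in observed.items()}


def is_spam(doc_phrases, phrase, expected, related, factor=3.0):
    """Compare actual vs. expected related-phrase counts for one phrase.

    The threshold rule (actual > factor * expected) is hypothetical.
    """
    if phrase not in expected or phrase not in set(doc_phrases):
        return False
    actual = related_counts(doc_phrases, related)[phrase]
    return actual > factor * expected[phrase]


# Hypothetical sample data: phrase -> set of related phrases, and a tiny index.
related = {"dog": {"puppy", "leash", "kennel", "bark", "vet"}}
index = {
    "d1": ["dog", "puppy"],
    "d2": ["dog", "leash"],
    "d3": ["dog"],
    "d4": ["dog", "bark"],
}
expected = expected_related(index, related)
stuffed = ["dog", "puppy", "leash", "kennel", "bark", "vet"]
print(is_spam(stuffed, "dog", expected, related))
```

In this sketch a document is flagged only when it contains far more related phrases than documents in the index typically do, which matches the abstract's statement that spam is identified from the number of related phrases a document includes.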
17. A computer program product stored on one or more non-transitory computer readable storage media and comprising instructions that, when executed, cause an apparatus to:
receive a search query;
retrieve a set of documents relevant to the search query, each document having a relevance score;
determine, for each document in the set of documents, whether the document has been identified as a spam document;
down-weight the relevance score of the document in response to a document being identified as a spam document; and
organize the set of documents by their relevance scores, wherein the relevance scores by which the documents are organized include down-weighted relevance scores for documents that have been identified as spam documents,
wherein whether the document has been identified as a spam document is based on:
    determining, for a document that contains a first phrase, a number of related phrases related to the first phrase expected to be present in the document;
    determining for the document, and for the first phrase in the document, an actual number of related phrases present in the document; and
    identifying the document as a spam document by comparing the actual number of related phrases present in the document with the expected number of related phrases,
wherein determining the number of related phrases expected to be present in the document includes:
    traversing an index of a plurality of documents;
    for each of the indexed documents, determining a set of phrases in the document, and for each phrase in the set, determining a number of related phrases also in the document; and
    determining the expected number of related phrases based on the determined number of related phrases across the traversed documents.
Dependent claims: 18, 19
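The query-time portion of claim 17 (down-weighting and then organizing by score) can be sketched in Python as follows. This is a hedged illustration only: the claim does not say how much a score is reduced, so the multiplicative `penalty` value is an assumption, and the `Hit` type, the function name, and the set of spam identifiers are hypothetical stand-ins for whatever the retrieval system actually provides.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """A retrieved document and its relevance score (hypothetical type)."""
    doc_id: str
    score: float


def rank_with_spam_penalty(hits, spam_ids, penalty=0.1):
    """Down-weight hits identified as spam, then order all hits by score.

    `penalty` is an assumed multiplicative factor; the claim only requires
    that the relevance score of a spam document be down-weighted.
    """
    adjusted = [
        Hit(h.doc_id, h.score * penalty if h.doc_id in spam_ids else h.score)
        for h in hits
    ]
    # Organize by relevance score, highest first, using the adjusted scores.
    return sorted(adjusted, key=lambda h: h.score, reverse=True)


# Hypothetical usage: document "a" was previously identified as spam.
hits = [Hit("a", 0.9), Hit("b", 0.8), Hit("c", 0.5)]
ranked = rank_with_spam_penalty(hits, spam_ids={"a"})
print([h.doc_id for h in ranked])
```

Multiplying rather than discarding keeps spam documents in the result set, consistent with the claim's requirement that the organized scores include the down-weighted scores.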
Specification