Method and system for detection of authors
Abstract
A method and system are provided for detection of authors across different types of information sources such as across documents on the Web. The method includes obtaining a compression signature for a document, and determining the similarity between compression signatures of two or more documents. If the similarity is greater than a threshold measure, the two or more documents are considered to be by the same author. Scored pairs of documents are clustered to provide a group of documents by the same author.
The group of documents by the same author can be used for user profiling, noise reduction, contribution sizing, detecting fraudulent contributions, obtaining other search results by the same author, or matching a document of undisclosed authorship to a document by a known author.
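The clustering of scored pairs described in the abstract can be illustrated with a short Python sketch. The abstract does not name a clustering algorithm, so the connected-components grouping (union-find) used here is an assumption, as are the function name `cluster_scored_pairs`, the `(i, j, distance)` input shape, and the choice to link a pair when its distance falls below the threshold (the form used in the claims, equivalent to a similarity above a threshold).

```python
from typing import Dict, Iterable, List, Tuple


def cluster_scored_pairs(
    n_docs: int,
    scored_pairs: Iterable[Tuple[int, int, float]],
    threshold: float,
) -> List[List[int]]:
    """Group document indices linked by below-threshold distances.

    Hypothetical helper: the clustering rule (connected components)
    is an assumption, not taken from the patent text.
    """
    parent = list(range(n_docs))

    def find(i: int) -> int:
        # Follow parent links to the root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j, distance in scored_pairs:
        if distance < threshold:
            parent[find(i)] = find(j)  # merge the two groups

    groups: Dict[int, List[int]] = {}
    for doc in range(n_docs):
        groups.setdefault(find(doc), []).append(doc)

    # Only groups with at least two documents suggest a common author.
    return [members for members in groups.values() if len(members) > 1]
```

With four documents and scored pairs `(0, 1)`, `(1, 2)` below the threshold and `(2, 3)` above it, documents 0, 1, and 2 end up in one group and document 3 is left ungrouped.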
14 Claims
1. A method for processing information, comprising:
- calculating, using a computer, a compression distance between a pair of different documents that do not contain duplicated content, comprising measuring how much a respective compression of each of the documents is improved by using information included in the other of the documents, wherein said measuring comprises:
  - compressing each of the documents to create first and second compressed files;
  - concatenating the documents to generate a concatenated document and compressing the concatenated document to create a third compressed file;
  - finding respective first and second differences in size between the first and second compressed files and the third compressed file; and
  - computing a product of the first and second differences; and
- responsively to the compression distance, identifying the pair of documents as having a common author, wherein identifying the pair comprises grouping at least two of the documents between which the compression distance is below a specified threshold as belonging to the common author.
Dependent claims: 2, 3, 4, 5, 6, 7
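The measuring steps recited in claim 1 can be sketched in Python as follows. This is only an illustration under stated assumptions: the claim does not name a compressor, so `zlib` is swapped in here; the function names and the caller-supplied `threshold` are hypothetical; and each difference is taken as the concatenated size minus the individual size (the product is the same whichever way both differences are signed). Any normalization the full specification may apply is not shown.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Size in bytes of the zlib-compressed data (compressor is an assumption)."""
    return len(zlib.compress(data, 9))


def compression_distance(doc_a: bytes, doc_b: bytes) -> int:
    """Product of the two size differences described in claim 1."""
    size_a = compressed_size(doc_a)            # first compressed file
    size_b = compressed_size(doc_b)            # second compressed file
    size_ab = compressed_size(doc_a + doc_b)   # third compressed file

    # How much the concatenation adds beyond each document alone.
    diff_a = size_ab - size_a
    diff_b = size_ab - size_b
    return diff_a * diff_b


def same_author(doc_a: bytes, doc_b: bytes, threshold: int) -> bool:
    """True when the distance is below the (hypothetical) threshold."""
    return compression_distance(doc_a, doc_b) < threshold
```

When two documents share vocabulary and phrasing, the concatenation compresses to little more than either document alone, so both differences are small and the product stays low. A suitable threshold would have to be chosen empirically for a given compressor and corpus.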
8. A computer program product comprising a computer-readable storage medium in which computer readable program code is stored, which program code, when read by a computer, causes the computer to:
- calculate a compression distance between a pair of different documents that do not contain duplicated content, by measuring how much a respective compression of each of the documents is improved by using information included in the other of the documents, wherein said measuring comprises:
  - compressing each of the documents to create first and second compressed files;
  - concatenating the documents to generate a concatenated document and compressing the concatenated document to create a third compressed file;
  - finding respective first and second differences in size between the first and second compressed files and the third compressed file; and
  - computing a product of the first and second differences; and
- identify the pair of documents as having a common author responsively to the compression distance, wherein the instructions cause the computer to compute distances between multiple documents of unknown authorship, and to group at least two of the documents between which the compression distance is below a specified threshold as belonging to the common author.
Dependent claims: 9, 10, 11, 12, 13
14. A computer system, comprising:
- a memory, which is configured to store program code; and
- a processor, which is coupled to read and execute the program code so as to calculate a compression distance between a pair of different documents that do not contain duplicated content, by measuring how much a respective compression of each of the documents is improved by using information included in the other of the documents, wherein said measuring comprises:
  - compressing each of the documents to create first and second compressed files;
  - concatenating the documents to generate a concatenated document and compressing the concatenated document to create a third compressed file;
  - finding respective first and second differences in size between the first and second compressed files and the third compressed file; and
  - computing a product of the first and second differences; and
- identify the pair of documents as having a common author responsively to the compression distance, wherein identifying the pair comprises grouping at least two of the documents between which the compression distance is below a specified threshold as belonging to the common author.
Specification