Method and system for detection of authors

US 7,752,208 B2
Filed: 04/11/2007
Issued: 07/06/2010
Est. Priority Date: 04/11/2007
Status: Active Grant

Abstract

A method and system are provided for detection of authors across different types of information sources such as across documents on the Web. The method includes obtaining a compression signature for a document, and determining the similarity between compression signatures of two or more documents. If the similarity is greater than a threshold measure, the two or more documents are considered to be by the same author. Scored pairs of documents are clustered to provide a group of documents by the same author.

The group of documents by the same author can be used for user profiling, noise reduction, contribution sizing, detecting fraudulent contributions, obtaining other search results by the same author, or matching a document with undisclosed authorship to a document of known author.
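
As an illustration of matching a document of undisclosed authorship to a known author, the following is a minimal sketch and not the patented implementation. It assumes a caller-supplied `distance` function and follows the claims in treating a smaller distance as a stronger match (the abstract words the same test as a similarity above a threshold). The names `attribute_author`, `known_docs`, and `threshold` are hypothetical.

```python
from typing import Callable, Dict, List, Optional


def attribute_author(
    unknown_doc: str,
    known_docs: Dict[str, List[str]],
    distance: Callable[[str, str], float],
    threshold: float,
) -> Optional[str]:
    """Return the author whose known document is closest to unknown_doc.

    Only distances strictly below `threshold` count as a match; if no
    known document is that close, None is returned.
    """
    best_author: Optional[str] = None
    best_distance = threshold
    for author, docs in known_docs.items():
        for doc in docs:
            d = distance(unknown_doc, doc)
            if d < best_distance:
                best_author, best_distance = author, d
    return best_author
```

Any pairwise distance can be plugged in here, including a compression-based one; the threshold value is application-specific and is not given in this record.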

14 Claims

1. A method for processing information, comprising:
- calculating, using a computer, a compression distance between a pair of different documents that do not contain duplicated content, comprising measuring how much a respective compression of each of the documents is improved by using information included in the other of the documents, wherein said measuring comprises:
  - compressing each of the documents to create first and second compressed files;
  - concatenating the documents to generate a concatenated document and compressing the concatenated document to create a third compressed file;
  - finding respective first and second differences in size between the first and second compressed files and the third compressed file; and
  - computing a product of the first and second differences; and
- responsively to the compression distance, identifying the pair of documents as having a common author, wherein identifying the pair comprises grouping at least two of the documents between which the compression distance is below a specified threshold as belonging to the common author.
- Dependent claims (2, 3, 4, 5, 6, 7):
  - 2. The method according to claim 1, wherein calculating the compression distance comprises computing the compression distance between a first document by a given author and a second document of unknown authorship, and wherein identifying the pair comprises identifying the second document as belonging to the given author.
  - 3. The method according to claim 1, wherein calculating the compression difference comprises:
    - compressing at least one of the documents to create a first compressed file;
    - concatenating the documents to generate a concatenated document and compressing the concatenated document to create a second compressed file; and
    - finding a difference in size between the first and second compressed files.
  - 4. The method according to claim 1, wherein grouping the set of the documents comprises chaining together pairs of the documents between which the respective compression distances are below the specified threshold.
  - 5. The method according to claim 1, wherein the documents have respective uniform resource locators (URLs), and wherein grouping the set of the documents comprises identifying a cluster of the documents that share a common feature in the respective URLs, and selecting the documents in the cluster for which the compression distances are below the specified threshold.
  - 6. The method according to claim 1, and comprising deriving a user profile of the common author from the grouping of the set of the documents.
  - 7. The method according to claim 1, comprising adjusting results provided by a certain engine responsively to the grouping of the set of the documents as belonging to the common author.
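
The measuring steps recited in claim 1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the claims do not name a compressor, so `zlib` is an assumption; each difference is taken as the size of the third (concatenated) compressed file minus the size of the respective individual compressed file, which is one reading of the claim; and no normalization is applied. The function names are hypothetical.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Size in bytes of the zlib-compressed data (compressor is an assumption)."""
    return len(zlib.compress(data, 9))


def compression_distance(doc_a: bytes, doc_b: bytes) -> int:
    """Product of the two size differences described in claim 1.

    Small values mean that each document compresses well given the
    other, which the claims treat as evidence of a common author.
    """
    size_a = compressed_size(doc_a)           # first compressed file
    size_b = compressed_size(doc_b)           # second compressed file
    size_ab = compressed_size(doc_a + doc_b)  # third compressed file

    # Assumed direction: concatenated size minus each individual size.
    diff_a = size_ab - size_a
    diff_b = size_ab - size_b
    return diff_a * diff_b
```

Because the product is not normalized in this sketch, its scale depends on document length, so any threshold would need to be chosen for documents of comparable size.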

8. A computer program product comprising a computer-readable storage medium in which computer readable program code is stored, which program code, when read by a computer, causes the computer to:
- calculate a compression distance between a pair of different documents that do not contain duplicated content, by measuring how much a respective compression of each of the documents is improved by using information included in the other of the documents, wherein said measuring comprises:
  - compressing each of the documents to create first and second compressed files;
  - concatenating the documents to generate a concatenated document and compressing the concatenated document to create a third compressed file;
  - finding respective first and second differences in size between the first and second compressed files and the third compressed file; and
  - computing a product of the first and second differences; and
- identify the pair of documents as having a common author responsively to the compression distance, wherein the instructions cause the computer to compute distances between multiple documents of unknown authorship, and to group at least two of the documents between which the compression distance is below a specified threshold as belonging to the common author.
- Dependent claims (9, 10, 11, 12, 13):
  - 9. The product according to claim 8, wherein the instructions cause the computer to compute the compression distance between a first document by a given author and a second document of unknown authorship, and to identify the second document as belonging to the given author responsively to the compression distance.
  - 10. The product according to claim 8, wherein the instructions cause the computer to calculate the compression difference by:
    - compressing at least one of the documents to create a first compressed file;
    - concatenating the documents to generate a concatenated document and compressing the concatenated document to create a second compressed file; and
    - finding a difference in size between the first and second compressed files.
  - 11. The product according to claim 8, wherein the instructions cause the computer to chain together pairs of the documents between which the respective compression distances are below the specified threshold.
  - 12. The product according to claim 8, wherein the documents have respective uniform resource locators (URLs), and wherein the instructions cause the computer to identify a cluster of the documents that share a common feature in the respective URLs, and to select the documents in the cluster for which the compression distances are below the specified threshold.
  - 13. The product according to claim 8, wherein the instructions cause the computer to derive a user profile of the common author from the grouping of the set of the documents.
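
The chaining of document pairs recited in claims 4 and 11 can be illustrated with a small union-find sketch. This is an assumption about one way to realize "chaining together pairs", not a description of the patented implementation. It takes precomputed pairwise distances as input; the names `chain_groups`, `distances`, and `threshold` are hypothetical.

```python
from typing import Dict, List, Tuple


def chain_groups(
    distances: Dict[Tuple[str, str], float], threshold: float
) -> List[List[str]]:
    """Group documents by chaining pairs whose distance is below threshold.

    If A-B and B-C are both below the threshold, A, B, and C end up in
    one group even when A-C is not. Only groups of two or more
    documents are returned.
    """
    parent: Dict[str, str] = {}

    def find(doc: str) -> str:
        # Register unseen documents, then walk to the root with path halving.
        parent.setdefault(doc, doc)
        while parent[doc] != doc:
            parent[doc] = parent[parent[doc]]
            doc = parent[doc]
        return doc

    for (doc_a, doc_b), d in distances.items():
        root_a, root_b = find(doc_a), find(doc_b)
        if d < threshold and root_a != root_b:
            parent[root_b] = root_a  # link the two chains

    groups: Dict[str, List[str]] = {}
    for doc in parent:
        groups.setdefault(find(doc), []).append(doc)
    return sorted(sorted(g) for g in groups.values() if len(g) >= 2)
```

Chaining is transitive by construction, so a single spurious low-distance pair can merge two otherwise separate groups; a stricter threshold is one way to limit that effect.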

14. A computer system, comprising:
- a memory, which is configured to store program code; and
- a processor, which is coupled to read and execute the program code so as to:
  - calculate a compression distance between a pair of different documents that do not contain duplicated content, by measuring how much a respective compression of each of the documents is improved by using information included in the other of the documents, wherein said measuring comprises:
    - compressing each of the documents to create first and second compressed files;
    - concatenating the documents to generate a concatenated document and compressing the concatenated document to create a third compressed file;
    - finding respective first and second differences in size between the first and second compressed files and the third compressed file; and
    - computing a product of the first and second differences; and
  - identify the pair of documents as having a common author responsively to the compression distance, wherein identifying the pair comprises grouping at least two of the documents between which the compression distance is below a specified threshold as belonging to the common author.

Current Assignee: X Corp. (f/k/a Twitter, Inc.) (X Holdings Corp.)
Original Assignee: International Business Machines Corporation
Inventors: Sivan Yogev, Elad Yom-Tov, Einat Amitay
Primary Examiner(s): Miranda Le
Application Number: US11/733,808
Publication Number: US 20080256093A1
Time in Patent Office: 1,182 Days
Field of Search: 707/5, 707/9, 707/101, 707/758, 707/749, 707/999.101, 705/50
US Class Current: 707/749
CPC Class Codes:
G06F 16/334   Query execution G06F16/335 ...
G06F 16/35   Clustering; Classification
Y10S 707/99942   Manipulating data structure...
