Determining the relationship between source code bases

US 8,290,962 B1
Filed: 09/28/2005
Issued: 10/16/2012
Est. Priority Date: 09/28/2005
Status: Active Grant

Abstract

An automated technique compares two sets of documents (such as two source codebases) to automatically determine documents within each set that are similar to one another. The technique constructs a matrix relating pairs of documents from the first and second sets of documents to lines that occur in both documents in each of the pairs of documents. A similarity score is calculated for each of the pairs of documents based on the lines from the matrix.
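
As a rough illustration of the matrix described in the abstract, the sketch below counts the distinct lines shared by every cross-set pair of documents. The function name `common_line_matrix`, the dict-of-lists input format, and the choice to store only pairs with at least one shared line are assumptions made for this example, not details taken from the patent.

```python
from itertools import product


def common_line_matrix(first_set, second_set):
    """Map each (doc_a, doc_b) pair to the number of distinct lines they share.

    first_set and second_set are dicts of document name -> list of lines
    (a hypothetical in-memory representation chosen for this sketch).
    """
    matrix = {}
    pairs = product(first_set.items(), second_set.items())
    for (name_a, lines_a), (name_b, lines_b) in pairs:
        shared = len(set(lines_a) & set(lines_b))
        if shared:  # sparse storage: skip pairs with no common line
            matrix[(name_a, name_b)] = shared
    return matrix


codebase_a = {"a/util.c": ["int x;", "return x;", "}"]}
codebase_b = {"b/util.c": ["int x;", "return x;"], "b/main.c": ["main();"]}
print(common_line_matrix(codebase_a, codebase_b))
```

This brute-force version compares every pair directly; the claims below describe building the same matrix from an index instead.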

26 Claims

1. A computer-implemented method for comparing a first set of documents to a second set of documents, the method comprising:
- identifying, using a computer-implemented device, the first set of documents based on a first criterion;
  
  identifying, using the computer-implemented device, the second set of documents based on a second criterion, where the second criterion is different from the first criterion;
  
  constructing a matrix using the computer-implemented device, the matrix containing information regarding pairs of documents from the first and second sets of documents, where constructing the matrix further comprises:
  
  mapping, using the computer-implemented device and to each of the pairs of documents, a value representative of a number of lines that are common to both documents in each of the pairs of documents, where each of the pairs of documents comprises a document from the first set of documents and a document from the second set of documents, and
  
  excluding, using the computer-implemented device, a pair of documents from the matrix, when the documents, in the pair of documents, differ in size by at least a particular amount;
  
  calculating similarity scores, using the computer-implemented device, for each of the pairs of documents based on the matrix, where the similarity score is calculated for a pair of documents, of the pairs of documents, by:
  
  determining a first ratio of the number of lines that are common to the pair of documents and a number of lines of a first document of the pair of documents,
  
  determining a second ratio of the number of lines that are common to the pair of documents and a number of lines of a second document of the pair of documents, where the first ratio is different than the second ratio,
  
  selecting the first ratio or the second ratio as a selected ratio, and
  
  determining the similarity score based on the selected ratio; and
  
  outputting the similarity scores, using the computer-implemented device.
- - 2. The method of claim 1, further comprising:
    - constructing an index, using the computer-implemented device, relating each line in the first and second sets of documents to documents in the first and second sets of documents in which each of the lines occurs; and
      
      using the index to construct the matrix.
  - 3. The method of claim 2, where constructing the index further comprises:
    - sequentially reading each of the lines, using the computer-implemented device, in the documents from the first and second sets into computer memory.
  - 4. The method of claim 1, where the first and second sets of documents include text files.
  - 5. The method of claim 4, where the text files are files in a program codebase.
  - 6. The method of claim 1, where the lines include lines of programming code.
  - 7. The method of claim 1, further comprising:
    - sorting the pairs of documents, using the computer-implemented device, based on the similarity scores.
  - 8. The method of claim 1, further comprising:
    - locating documents, using the computer-implemented device, in the first and second sets of documents that are identical; and
      
      removing the documents, using the computer-implemented device, in the first and second sets of documents that are identical from the first and second sets of documents before constructing the matrix.
  - 9. The method of claim 8, where locating documents in the first and second sets of documents that are identical includes:
    - locating the identical documents, using the computer-implemented device, based on checksum values calculated for each of the documents in the first and second sets of documents, where the checksum values of the identical documents are the same.
  - 10. The method of claim 8, where locating documents in the first and second sets of documents that are identical include:
    - locating the identical documents based on filenames of the documents in the first and second sets of documents, where the filenames of the identical documents are the same.
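
The scoring, size-exclusion, and sorting steps of claims 1 and 7 can be sketched as follows. The claims do not say which of the two ratios is selected or how large a size difference triggers exclusion, so this example assumes the smaller ratio is selected and that a pair is dropped when one document has more than `max_size_ratio` times the lines of the other. Both choices, and the names `similarity_score` and `score_pairs`, are hypothetical.

```python
def similarity_score(common, lines_first, lines_second):
    """Score one pair from its common-line count and per-document line counts."""
    first_ratio = common / lines_first
    second_ratio = common / lines_second
    # Assumption: select the smaller ratio; the claims leave the rule open.
    return min(first_ratio, second_ratio)


def score_pairs(matrix, line_counts_a, line_counts_b, max_size_ratio=4.0):
    """Return (score, doc_a, doc_b) tuples sorted from most to least similar."""
    scores = []
    for (doc_a, doc_b), common in matrix.items():
        n_a, n_b = line_counts_a[doc_a], line_counts_b[doc_b]
        if n_a == 0 or n_b == 0:
            continue  # an empty document cannot be scored
        # Assumption: line count stands in for document size.
        if max(n_a, n_b) > max_size_ratio * min(n_a, n_b):
            continue  # sizes differ too much; exclude the pair
        scores.append((similarity_score(common, n_a, n_b), doc_a, doc_b))
    return sorted(scores, reverse=True)


matrix = {("a.c", "b.c"): 8, ("a.c", "big.c"): 5}
print(score_pairs(matrix, {"a.c": 10}, {"b.c": 16, "big.c": 100}))
```

Selecting the smaller ratio keeps a short file that is fully contained in a much longer one from scoring as a perfect match; selecting the larger ratio would favor exactly that case.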

11. A device comprising:
- one or more processors to:
  
  identify a first set of documents based on a first criterion;
  
  identify a second set of documents based on a second criterion, where the second criterion is different from the first criterion;
  
  construct a matrix containing information regarding pairs of documents selected from the first and second sets of documents, where, when constructing the matrix, the one or more processors are further to:
  
  map a tag identifying a number of lines that are common to both documents in each of the pairs of documents, where each of the pairs of documents comprises a document from the first set of documents and a document from the second set of documents, and
  
  exclude a pair of documents from the matrix, when the documents, in the pair of documents, differ in size by at least a particular amount;
  
  store the matrix;
  
  calculate similarity scores for each of the pairs of documents based on the matrix, where, when calculating the similarity score for a pair of documents, of the pairs of documents, the one or more processors are further to:
  
  determine a first ratio of the number of lines that are common to the pair of documents and a number of lines of a first document of the pair of documents,
  
  determine a second ratio of the number of lines that are common to the pair of documents and a number of lines of a second document of the pair of documents, where the first ratio is different than the second ratio,
  
  select the first ratio or the second ratio as a selected ratio, and
  
  determine the similarity score based on the selected ratio; and
  
  output the similarity scores.
- - 12. The device of claim 11, where the one or more processors are further to:
    - exclude from the matrix, when constructing the matrix, pairs of documents in which the document from the first set of documents and the document from the second set of documents do not have a same file extension.
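
The file-extension filter of claim 12 might look like the sketch below. Comparing extensions case-insensitively and treating two extensionless files as a match are assumptions of this example; `same_extension` and `filter_pairs_by_extension` are hypothetical names.

```python
import os


def same_extension(path_a, path_b):
    """True when both paths end in the same extension (case-insensitive)."""
    ext_a = os.path.splitext(path_a)[1].lower()
    ext_b = os.path.splitext(path_b)[1].lower()
    return ext_a == ext_b


def filter_pairs_by_extension(pairs):
    """Keep only the candidate pairs whose documents share a file extension."""
    return [(a, b) for a, b in pairs if same_extension(a, b)]


candidates = [("x/a.c", "y/b.c"), ("x/a.c", "y/b.h"), ("x/A.C", "y/b.c")]
print(filter_pairs_by_extension(candidates))
```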

13. A computer-implemented method comprising:
- receiving, using a computer-implemented device, an identification of a first set of documents;
  
  receiving, using the computer-implemented device, an identification of a second set of documents, the identification of the second set of documents being based on a different criterion than a criterion used in the identification of the first set of documents;
  
  processing document segments, using the computer-implemented device, which represent portions of the documents in the first and second sets of documents, to construct an index that relates each of the segments to the documents in which the segment occurs;
  
  constructing a matrix, using the computer-implemented device, based on the index, the matrix mapping pairs of documents from the first and second sets of documents to a numeric value representative of a number of segments that are common to both documents of a particular pair of documents, where each pair of documents comprises a document from the first set of documents and a document from the second set of documents;
  
  excluding from the matrix, using the computer-implemented device, pairs of documents, when the documents in a pair of documents differ in size by at least a particular amount;
  
  calculating, using the computer-implemented device, similarity scores for pairs of documents in the first and second sets of documents based on the matrix, where the similarity score is calculated for a pair of documents, of the pairs of documents, by:
  
  determining a first ratio of the number of segments that are common to the pair of documents and a number of segments of a first document of the pair of documents,
  
  determining a second ratio of the number of segments that are common to the pair of documents and a number of segments of a second document of the pair of documents, where the first ratio is different than the second ratio,
  
  selecting the first ratio or the second ratio as a selected ratio, and
  
  determining the similarity score based on the selected ratio; and
  
  outputting, using the computer-implemented device, the similarity scores.
- - 14. The method of claim 13, where processing document segments in the first and second sets of documents includes:
    - sequentially processing the segments, using the computer-implemented device, to construct the index, where previously processed segments are not stored in computer memory.
  - 15. The method of claim 13, where the first and second sets of documents include text files.
  - 16. The method of claim 15, where the text files are files in a program codebase.
  - 17. The method of claim 16, where the segments each represents a line of programming code.
  - 18. The method of claim 13, further comprising:
    - sorting the pairs of documents, using the computer-implemented device, based on the similarity scores.
  - 19. The method of claim 13, further comprising:
    - locating documents, using the computer-implemented device, in the first and second sets of documents that are identical; and
      
      removing the documents, using the computer-implemented device, in the first and second sets of documents that are identical from the first and second sets of documents before constructing the index.
  - 20. The method of claim 19, where locating the documents in the first and second sets of documents that are identical includes:
    - locating the documents, using the computer-implemented device, in the first and second sets of documents that are identical based on checksum values calculated for each of the documents in the first and second sets of documents.
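
A minimal sketch of the index-then-matrix flow in claims 13 and 14 follows. It reads each document one line at a time and keys the index on a hash of each segment, which is one possible way to avoid retaining previously processed segment text; the claims do not specify this mechanism. The names `build_index` and `matrix_from_index`, the whitespace stripping, and the use of file-like objects as input are assumptions.

```python
import hashlib
import io
from collections import defaultdict


def build_index(first_set, second_set):
    """Map segment hash -> (docs in first set, docs in second set).

    Each set is a dict of document name -> iterable of lines (for example
    an open file), so only the current line is held while indexing.
    """
    index = defaultdict(lambda: (set(), set()))
    for side, docs in enumerate((first_set, second_set)):
        for name, stream in docs.items():
            for line in stream:
                segment = line.strip()
                if not segment:
                    continue  # assumption: blank lines are ignored
                key = hashlib.sha1(segment.encode("utf-8")).digest()
                index[key][side].add(name)
    return index


def matrix_from_index(index):
    """Count, for every cross-set pair, the segments both documents contain."""
    matrix = defaultdict(int)
    for docs_a, docs_b in index.values():
        for doc_a in docs_a:
            for doc_b in docs_b:
                matrix[(doc_a, doc_b)] += 1
    return dict(matrix)


first = {"a.py": io.StringIO("import os\nx = 1\n")}
second = {
    "b.py": io.StringIO("import os\ny = 2\n"),
    "c.py": io.StringIO("x = 1\nimport os\n"),
}
print(matrix_from_index(build_index(first, second)))
```

Because the matrix is filled by walking the index, the work is proportional to the number of co-occurrences rather than to the number of all possible document pairs.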

21. A computer-implemented system comprising:
- a memory to store a matrix relating pairs of documents; and
  
  a processor, coupled to the memory, to:
  
  identify a first document set based on a first criterion;
  
  identify a second document set based on a second criterion, where the second criterion is different from the first criterion;
  
  locate pairs of documents, in the first and second document sets, that are identical;
  
  remove the identical documents from the first and second document sets;
  
  construct the matrix, after removing the identical documents, by relating pairs of documents from the first document set and the second document set to generate the pairs of documents, and by mapping, to each of the pairs of documents, an indicator representing a number of lines that are common to both documents, in each of the pairs of documents, and by excluding a pair of documents from the matrix, when the documents, in the pair of documents, differ in size by at least a particular amount;
  
  add the identical documents into the matrix, after constructing the matrix; and
  
  calculate similarity scores for the pairs of documents, based on the matrix, after adding the identical documents, where, when calculating the similarity score for a pair of documents, of the pairs of documents, the processor is further to:
  
  determine a first ratio of the number of lines that are common to the pair of documents and a number of lines of a first document of the pair of documents,
  
  determine a second ratio of the number of lines that are common to the pair of documents and a number of lines of a second document of the pair of documents, where the first ratio is different than the second ratio,
  
  select the first ratio or the second ratio as a selected ratio, and
  
  determine the similarity score based on the selected ratio; and
  
  output the similarity scores.
- - 22. The computer-implemented system of claim 21, where the processor is further to:
    - construct an index relating each line in the first and second document sets to documents in the first and second document sets in which each of the lines occurs; and
      
      use the index to construct the matrix.
  - 23. The computer-implemented system of claim 21, where pairs of identical documents having a similarity score below a predetermined threshold value are removed from the matrix.
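
The remove-then-restore handling of identical documents in claim 21 could be sketched as below. Claim 21 does not say how identical documents are located; this example borrows the checksum approach of claims 9 and 20 and assumes a SHA-256 match means the documents are identical. The helper names, and the decision to record an identical pair with its number of distinct lines as the common-line count, are assumptions.

```python
import hashlib


def checksum(lines):
    """SHA-256 over the document's lines, each terminated by a newline."""
    digest = hashlib.sha256()
    for line in lines:
        digest.update(line.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


def split_identical(first_set, second_set):
    """Return (identical_pairs, remaining_first, remaining_second)."""
    by_sum = {}
    for name, lines in first_set.items():
        by_sum.setdefault(checksum(lines), []).append(name)
    identical, matched_a, matched_b = [], set(), set()
    for name_b, lines in second_set.items():
        for name_a in by_sum.get(checksum(lines), []):
            identical.append((name_a, name_b))
            matched_a.add(name_a)
            matched_b.add(name_b)
    rest_a = {n: l for n, l in first_set.items() if n not in matched_a}
    rest_b = {n: l for n, l in second_set.items() if n not in matched_b}
    return identical, rest_a, rest_b


def add_identical(matrix, identical, first_set):
    """Re-insert identical pairs after the matrix has been constructed."""
    for name_a, name_b in identical:
        # Every line is shared, so the common count is the distinct-line count.
        matrix[(name_a, name_b)] = len(set(first_set[name_a]))
    return matrix


first = {"a.c": ["x", "y"], "b.c": ["z"]}
second = {"a2.c": ["x", "y"], "d.c": ["w"]}
identical, rest_a, rest_b = split_identical(first, second)
print(identical, sorted(rest_a), sorted(rest_b))
print(add_identical({}, identical, first))
```

Removing exact copies before matrix construction keeps the comparatively expensive line-level comparison for documents that actually differ.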

24. A non-transitory computer-readable memory medium comprising:
- one or more instructions which, when executed by at least one processor, cause the at least one processor to receive an identification of a first set of documents;
  
  one or more instructions which, when executed by the at least one processor, cause the at least one processor to receive an identification of a second set of documents, the identification of the second set of documents being based on a different criterion than a criterion used in the identification of the first set of documents;
  
  one or more instructions which, when executed by the at least one processor, cause the at least one processor to process document segments that represent portions of the documents in the first and second sets of documents to construct an index that relates each of the segments to the documents in which the segment occurs;
  
  one or more instructions which, when executed by the at least one processor, cause the at least one processor to construct a matrix, based on the index, containing information regarding pairs of documents from the first and second sets of documents, where the one or more instructions to construct the matrix further include one or more instructions to:
  
  map a numeric indicator representing a number of segments that are common to both documents of a particular pair of documents, where each of the pairs of documents comprises a document from the first set of documents and a document from the second set of documents; and
  
  exclude from the matrix, when constructing the matrix, a pair of documents, where the documents in the pair differ in size by at least a particular amount;
  
  one or more instructions which, when executed by the at least one processor, cause the at least one processor to calculate similarity scores for pairs of documents in the first and second sets of documents, based on the matrix, where the similarity score is calculated for a pair of documents, of the pairs of documents, and the one or more instructions to calculate the similarity score include:
  
  one or more instructions to determine a first ratio of the number of lines that are common to the pair of documents and a number of lines of a first document of the pair of documents,
  
  one or more instructions to determine a second ratio of the number of lines that are common to the pair of documents and a number of lines of a second document of the pair of documents, where the first ratio is different than the second ratio,
  
  one or more instructions to select the first ratio or the second ratio as a selected ratio, and
  
  one or more instructions to determine the similarity score based on the selected ratio; and
  
  one or more instructions which, when executed by the at least one processor, cause the at least one processor to output the similarity scores.
- - 25. The medium of claim 24, further comprising:
    - one or more instructions to generate a first histogram of file sizes for the first set of documents and a second histogram of file sizes for the second set of documents; and
      
      one or more instructions to include in the matrix, when constructing the matrix, only pairs of documents where both documents for a particular pair of documents have a file size within a particular range, based on the first histogram and the second histogram.
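
Claim 25 does not state how the "particular range" is derived from the two histograms. The sketch below is one guess: it buckets file sizes into fixed-width bins, takes the bins populated in both sets, and keeps only pairs whose sizes both fall between the lowest and highest such bin. The bucket width, the range rule, and the function names are all assumptions for illustration.

```python
from collections import Counter


def size_histogram(sizes, bucket=1024):
    """Count documents per size bucket; sizes maps document name -> bytes."""
    return Counter(size // bucket for size in sizes.values())


def pairs_in_shared_range(sizes_a, sizes_b, bucket=1024):
    """List cross-set pairs whose sizes both lie in the shared size range."""
    hist_a = size_histogram(sizes_a, bucket)
    hist_b = size_histogram(sizes_b, bucket)
    shared = set(hist_a) & set(hist_b)  # buckets populated in both sets
    if not shared:
        return []
    low = min(shared) * bucket
    high = (max(shared) + 1) * bucket
    keep_a = [name for name, size in sizes_a.items() if low <= size < high]
    keep_b = [name for name, size in sizes_b.items() if low <= size < high]
    return [(a, b) for a in keep_a for b in keep_b]


sizes_a = {"a.c": 500, "big.c": 900000}
sizes_b = {"b.c": 800, "c.c": 3000}
print(pairs_in_shared_range(sizes_a, sizes_b))
```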

26. A computer-implemented method, for comparing a first set of documents to a second set of documents, the method comprising:
- receiving, using a computer-implemented device, an identification of a first set of documents;
  
  receiving, using the computer-implemented device, an identification of a second set of documents;
  
  processing lines from the documents, using the computer-implemented device, in the first and second sets of documents to construct an index that relates each line in the first and second sets of documents to documents in the first and second sets of documents in which the line occurs, each of the lines representing a line of programming code;
  
  removing, from the index and using the computer-implemented device, lines based on one or more of:
  
  a line appearing in only one document in the first set of documents or in only one document in the second set of documents, or
  
  a line appearing in more than a threshold number of documents;
  
  constructing, based on the index and using the computer-implemented device, a matrix containing information regarding pairs of documents from the first and second sets of documents, where constructing the matrix further comprises:
  
  mapping a numeric tag representative of a number of lines that are common to both documents in each of the pairs of documents;
  
  calculating, using the computer-implemented device, similarity scores for the pairs of documents based on the matrix, where the similarity score is calculated for a pair of documents, of the pairs of documents, by:
  
  determining a first ratio of the number of lines that are common to the pair of documents and a number of lines of a first document of the pair of documents,
  
  determining a second ratio of the number of lines that are common to the pair of documents and a number of lines of a second document of the pair of documents, where the first ratio is different than the second ratio,
  
  selecting the first ratio or the second ratio as a selected ratio, and
  
  determining the similarity score based on the selected ratio; and
  
  outputting, using the computer-implemented device, the similarity scores.
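
The index pruning step of claim 26 can be illustrated as follows. This sketch reads the first condition as "the line occurs in only one document overall" (such a line cannot be common to any pair) and the second as "the line occurs in more than `max_docs` documents" (such a line is likely boilerplate, such as a lone closing brace). That reading, the threshold value, and the name `prune_index` are assumptions.

```python
def prune_index(index, max_docs=3):
    """Drop lines that are too rare or too common to be informative.

    index maps line text -> (docs in first set, docs in second set).
    """
    pruned = {}
    for line, (docs_a, docs_b) in index.items():
        total = len(docs_a) + len(docs_b)
        if total <= 1:
            continue  # appears in a single document: never shared
        if total > max_docs:
            continue  # appears in too many documents: likely boilerplate
        pruned[line] = (docs_a, docs_b)
    return pruned


index = {
    "}": ({"a.c", "b.c"}, {"x.c", "y.c"}),
    "int secret = 42;": ({"a.c"}, {"x.c"}),
    "only_here();": ({"a.c"}, set()),
}
print(list(prune_index(index)))
```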

Current Assignee: Google LLC (Alphabet Inc.)
Original Assignee: Google Inc. (Alphabet Inc.)
Inventors: Chu, Andy
Primary Examiner(s): Meng, Jau-Shya
Application Number: US11/236,859
Time in Patent Office: 2,575 Days
Field of Search: None
US Class Current: 707/749
CPC Class Codes: G06F 16/319 Inverted lists
