Determining the relationship between source code bases
Abstract
An automated technique compares two sets of documents (such as two source codebases) to automatically determine documents within each set that are similar to one another. The technique constructs a matrix relating pairs of documents from the first and second sets of documents to lines that occur in both documents in each of the pairs of documents. A similarity score is calculated for each of the pairs of documents based on the lines from the matrix.
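As a rough illustration of the matrix described in the abstract, the following Python sketch counts the lines shared by each cross-set pair of documents. It is not taken from the patent: the function name `common_line_matrix`, the representation of a document set as a dict from file name to text, and the choice to compare lines exactly (with duplicates inside a document counted once) are all assumptions made for the example.

```python
from itertools import product


def common_line_matrix(first_set, second_set):
    """Map each (doc_a, doc_b) pair to the number of distinct lines they share.

    Both arguments are dicts of {document name: document text} (an assumed
    representation). Pairs with no shared line are left out, so the result
    behaves like a sparse matrix.
    """
    lines_a = {name: set(text.splitlines()) for name, text in first_set.items()}
    lines_b = {name: set(text.splitlines()) for name, text in second_set.items()}

    matrix = {}
    for (a, la), (b, lb) in product(lines_a.items(), lines_b.items()):
        shared = len(la & lb)  # lines occurring in both documents
        if shared:
            matrix[(a, b)] = shared
    return matrix


first = {"a.c": "int x;\nint y;\nreturn x;"}
second = {"b.c": "int x;\nreturn x;\nfoo();", "c.c": "bar();"}
print(common_line_matrix(first, second))
```

This brute-force version compares every pair directly, which is only practical for small sets; the index-based construction recited in some of the claims avoids that cost.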
26 Claims
1. A computer-implemented method for comparing a first set of documents to a second set of documents, the method comprising:
- identifying, using a computer-implemented device, the first set of documents based on a first criterion;
- identifying, using the computer-implemented device, the second set of documents based on a second criterion, where the second criterion is different from the first criterion;
- constructing a matrix using the computer-implemented device, the matrix containing information regarding pairs of documents from the first and second sets of documents, where constructing the matrix further comprises:
  - mapping, using the computer-implemented device and to each of the pairs of documents, a value representative of a number of lines that are common to both documents in each of the pairs of documents, where each of the pairs of documents comprises a document from the first set of documents and a document from the second set of documents, and
  - excluding, using the computer-implemented device, a pair of documents from the matrix, when the documents, in the pair of documents, differ in size by at least a particular amount;
- calculating similarity scores, using the computer-implemented device, for each of the pairs of documents based on the matrix, where the similarity score is calculated for a pair of documents, of the pairs of documents, by:
  - determining a first ratio of the number of lines that are common to the pair of documents and a number of lines of a first document of the pair of documents,
  - determining a second ratio of the number of lines that are common to the pair of documents and a number of lines of a second document of the pair of documents, where the first ratio is different than the second ratio,
  - selecting the first ratio or the second ratio as a selected ratio, and
  - determining the similarity score based on the selected ratio; and
- outputting the similarity scores, using the computer-implemented device.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10.
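The size-based exclusion and the two-ratio score recited in claim 1 might be sketched as follows. Several details here are assumptions, since the claim does not fix them: size is measured in lines, "a particular amount" is modelled as a hypothetical ratio threshold `max_size_ratio`, the exclusion is applied as a filter over an already built matrix rather than during construction, and the smaller of the two ratios is the one selected.

```python
def similarity_scores(matrix, line_counts_a, line_counts_b, max_size_ratio=4.0):
    """Score each document pair from a matrix of common-line counts.

    matrix:         {(doc_a, doc_b): number of common lines}
    line_counts_a:  {doc_a: number of lines} for the first set
    line_counts_b:  {doc_b: number of lines} for the second set
    max_size_ratio: hypothetical threshold; pairs whose sizes differ by at
                    least this factor are excluded
    """
    scores = {}
    for (a, b), common in matrix.items():
        n_a, n_b = line_counts_a[a], line_counts_b[b]
        if n_a == 0 or n_b == 0:
            continue  # an empty document cannot be scored
        # Exclude pairs that differ in size by at least the chosen amount.
        if max(n_a, n_b) / min(n_a, n_b) >= max_size_ratio:
            continue
        first_ratio = common / n_a   # share of the first document that is common
        second_ratio = common / n_b  # share of the second document that is common
        # Assumption: select the smaller ratio, so that a small file embedded
        # in a much larger one does not score as a near-duplicate.
        scores[(a, b)] = min(first_ratio, second_ratio)
    return scores
```

Selecting the larger ratio instead would favour detecting a small document contained in a larger one; the claim covers either choice.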
11. A device comprising:
- one or more processors to:
  - identify a first set of documents based on a first criterion;
  - identify a second set of documents based on a second criterion, where the second criterion is different from the first criterion;
  - construct a matrix containing information regarding pairs of documents selected from the first and second sets of documents, where, when constructing the matrix, the one or more processors are further to:
    - map a tag identifying a number of lines that are common to both documents in each of the pairs of documents, where each of the pairs of documents comprises a document from the first set of documents and a document from the second set of documents, and
    - exclude a pair of documents from the matrix, when the documents, in the pair of documents, differ in size by at least a particular amount;
  - store the matrix;
  - calculate similarity scores for each of the pairs of documents based on the matrix, where, when calculating the similarity score for a pair of documents, of the pairs of documents, the one or more processors are further to:
    - determine a first ratio of the number of lines that are common to the pair of documents and a number of lines of a first document of the pair of documents,
    - determine a second ratio of the number of lines that are common to the pair of documents and a number of lines of a second document of the pair of documents, where the first ratio is different than the second ratio,
    - select the first ratio or the second ratio as a selected ratio, and
    - determine the similarity score based on the selected ratio; and
  - output the similarity scores.
Dependent claims: 12.
13. A computer-implemented method comprising:
- receiving, using a computer-implemented device, an identification of a first set of documents;
- receiving, using the computer-implemented device, an identification of a second set of documents, the identification of the second set of documents being based on a different criterion than a criterion used in the identification of the first set of documents;
- processing document segments, using the computer-implemented device, which represent portions of the documents in the first and second sets of documents, to construct an index that relates each of the segments to the documents in which the segment occurs;
- constructing a matrix, using the computer-implemented device, based on the index, the matrix mapping pairs of documents from the first and second sets of documents to a numeric value representative of a number of segments that are common to both documents of a particular pair of documents, where each pair of documents comprises a document from the first set of documents and a document from the second set of documents;
- excluding from the matrix, using the computer-implemented device, pairs of documents, when the documents in a pair of documents differ in size by at least a particular amount;
- calculating, using the computer-implemented device, similarity scores for pairs of documents in the first and second sets of documents based on the matrix, where the similarity score is calculated for a pair of documents, of the pairs of documents, by:
  - determining a first ratio of the number of segments that are common to the pair of documents and a number of segments of a first document of the pair of documents,
  - determining a second ratio of the number of segments that are common to the pair of documents and a number of segments of a second document of the pair of documents, where the first ratio is different than the second ratio,
  - selecting the first ratio or the second ratio as a selected ratio, and
  - determining the similarity score based on the selected ratio; and
- outputting, using the computer-implemented device, the similarity scores.
Dependent claims: 14, 15, 16, 17, 18, 19, 20.
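Claim 13 builds the matrix from an index that relates each segment to the documents containing it. One way this could look is an inverted index, sketched below. The names `build_index` and `matrix_from_index` are hypothetical, and the sketch assumes that a segment is a single line and that a segment repeated within one document counts once; the claim allows other kinds of segment.

```python
from collections import defaultdict
from itertools import product


def build_index(first_set, second_set):
    """Relate each segment to the documents in which it occurs.

    Returns {segment: (names of first-set docs, names of second-set docs)}.
    Document sets are dicts of {name: text}; a segment is one line (assumed).
    """
    index = defaultdict(lambda: (set(), set()))
    for side, docs in enumerate((first_set, second_set)):
        for name, text in docs.items():
            for segment in text.splitlines():
                index[segment][side].add(name)
    return index


def matrix_from_index(index):
    """Count, for every cross-set pair of documents, the segments they share."""
    matrix = defaultdict(int)
    for docs_a, docs_b in index.values():
        # Each segment adds one to every pair of documents that contain it.
        for pair in product(docs_a, docs_b):
            matrix[pair] += 1
    return dict(matrix)
```

Walking the index touches only pairs that share at least one segment, so document pairs with nothing in common are never visited.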
21. A computer-implemented system comprising:
- a memory to store a matrix relating pairs of documents; and
- a processor, coupled to the memory, to:
  - identify a first document set based on a first criterion;
  - identify a second document set based on a second criterion, where the second criterion is different from the first criterion;
  - locate pairs of documents, in the first and second document sets, that are identical;
  - remove the identical documents from the first and second document sets;
  - construct the matrix, after removing the identical documents, by relating pairs of documents from the first document set and the second document set to generate the pairs of documents, and by mapping, to each of the pairs of documents, an indicator representing a number of lines that are common to both documents, in each of the pairs of documents, and by excluding a pair of documents from the matrix, when the documents, in the pair of documents, differ in size by at least a particular amount;
  - add the identical documents into the matrix, after constructing the matrix; and
  - calculate similarity scores for the pairs of documents, based on the matrix, after adding the identical documents, where, when calculating the similarity score for a pair of documents, of the pairs of documents, the processor is further to:
    - determine a first ratio of the number of lines that are common to the pair of documents and a number of lines of a first document of the pair of documents,
    - determine a second ratio of the number of lines that are common to the pair of documents and a number of lines of a second document of the pair of documents, where the first ratio is different than the second ratio,
    - select the first ratio or the second ratio as a selected ratio, and
    - determine the similarity score based on the selected ratio; and
  - output the similarity scores.
Dependent claims: 22, 23.
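Claim 21 sets identical documents aside before the matrix is built and adds them back afterwards. The sketch below shows one plausible way to do that, under assumptions the claim does not state: identical documents are detected by comparing SHA-256 digests of their text, and a re-added pair is given a common-line count equal to its number of distinct lines. The function names are hypothetical.

```python
import hashlib


def _digest(text):
    """Content hash used to detect byte-identical documents (assumed method)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def split_identical(first_set, second_set):
    """Return (identical_pairs, remaining_first, remaining_second)."""
    by_hash = {}
    for name, text in second_set.items():
        by_hash.setdefault(_digest(text), []).append(name)

    identical = [
        (name, other)
        for name, text in first_set.items()
        for other in by_hash.get(_digest(text), [])
    ]
    matched_a = {a for a, _ in identical}
    matched_b = {b for _, b in identical}
    rest_a = {n: t for n, t in first_set.items() if n not in matched_a}
    rest_b = {n: t for n, t in second_set.items() if n not in matched_b}
    return identical, rest_a, rest_b


def add_identical(matrix, identical, first_set):
    """Re-insert identical pairs, counting every distinct line as common."""
    for a, b in identical:
        matrix[(a, b)] = len(set(first_set[a].splitlines()))
    return matrix
```

Removing exact copies first keeps them out of the more expensive line-by-line comparison, which is presumably the motivation for the ordering in the claim.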
24. A non-transitory computer-readable memory medium comprising:
- one or more instructions which, when executed by at least one processor, cause the at least one processor to receive an identification of a first set of documents;
- one or more instructions which, when executed by the at least one processor, cause the at least one processor to receive an identification of a second set of documents, the identification of the second set of documents being based on a different criterion than a criterion used in the identification of the first set of documents;
- one or more instructions which, when executed by the at least one processor, cause the at least one processor to process document segments that represent portions of the documents in the first and second sets of documents to construct an index that relates each of the segments to the documents in which the segment occurs;
- one or more instructions which, when executed by the at least one processor, cause the at least one processor to construct a matrix, based on the index, containing information regarding pairs of documents from the first and second sets of documents, where the one or more instructions to construct the matrix further include one or more instructions to:
  - map a numeric indicator representing a number of segments that are common to both documents of a particular pair of documents, where each of the pairs of documents comprises a document from the first set of documents and a document from the second set of documents; and
  - exclude from the matrix, when constructing the matrix, a pair of documents, where the documents in the pair differ in size by at least a particular amount;
- one or more instructions which, when executed by the at least one processor, cause the at least one processor to calculate similarity scores for pairs of documents in the first and second sets of documents, based on the matrix, where the similarity score is calculated for a pair of documents, of the pairs of documents, and the one or more instructions to calculate the similarity score include:
  - one or more instructions to determine a first ratio of the number of lines that are common to the pair of documents and a number of lines of a first document of the pair of documents,
  - one or more instructions to determine a second ratio of the number of lines that are common to the pair of documents and a number of lines of a second document of the pair of documents, where the first ratio is different than the second ratio,
  - one or more instructions to select the first ratio or the second ratio as a selected ratio, and
  - one or more instructions to determine the similarity score based on the selected ratio; and
- one or more instructions which, when executed by the at least one processor, cause the at least one processor to output the similarity scores.
Dependent claims: 25.
26. A computer-implemented method, for comparing a first set of documents to a second set of documents, the method comprising:
- receiving, using a computer-implemented device, an identification of a first set of documents;
- receiving, using the computer-implemented device, an identification of a second set of documents;
- processing lines from the documents, using the computer-implemented device, in the first and second sets of documents to construct an index that relates each line in the first and second sets of documents to documents in the first and second sets of documents in which the line occurs, each of the lines representing a line of programming code;
- removing, from the index and using the computer-implemented device, lines based on one or more of:
  - a line appearing in only one document in the first set of documents or in only one document in the second set of documents, or
  - a line appearing in more than a threshold number of documents;
- constructing, based on the index and using the computer-implemented device, a matrix containing information regarding pairs of documents from the first and second sets of documents, where constructing the matrix further comprises:
  - mapping a numeric tag representative of a number of lines that are common to both documents in each of the pairs of documents;
- calculating, using the computer-implemented device, similarity scores for the pairs of documents based on the matrix, where the similarity score is calculated for a pair of documents, of the pairs of documents, by:
  - determining a first ratio of the number of lines that are common to the pair of documents and a number of lines of a first document of the pair of documents,
  - determining a second ratio of the number of lines that are common to the pair of documents and a number of lines of a second document of the pair of documents, where the first ratio is different than the second ratio,
  - selecting the first ratio or the second ratio as a selected ratio, and
  - determining the similarity score based on the selected ratio; and
- outputting, using the computer-implemented device, the similarity scores.
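The index pruning step of claim 26 could be sketched as below. The wording of the first removal criterion is open to more than one reading; this example assumes it means a line that occurs on only one side of the comparison and so cannot contribute to any cross-set pair. The index layout (line mapped to a pair of document-name sets), the function name `prune_index`, and the default threshold `max_docs` are likewise assumptions.

```python
def prune_index(index, max_docs=100):
    """Drop index entries that cannot or should not contribute to the matrix.

    index:    {line: (names of first-set docs, names of second-set docs)}
    max_docs: hypothetical threshold on how many documents may contain a line
    """
    pruned = {}
    for line, (docs_a, docs_b) in index.items():
        # Assumed reading of the first criterion: a line seen on only one
        # side can never be common to a cross-set pair of documents.
        if not docs_a or not docs_b:
            continue
        # Lines found in very many documents (for example a lone closing
        # brace) say little about whether two files are related.
        if len(docs_a) + len(docs_b) > max_docs:
            continue
        pruned[line] = (docs_a, docs_b)
    return pruned
```

Pruning before matrix construction shrinks the index and keeps ubiquitous lines from inflating the common-line counts of unrelated files.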
Specification