Clustering documents using citation patterns
Abstract
Systems and methods are disclosed for clustering documents, such as scientific documents, taking into account the citation patterns of the documents. In one embodiment, the method includes locating citations to other documents, e.g., search result documents; comparing each pair of documents to be clustered for overlapping citations at a first level of generality, at a more specific second level, and optionally at an even more specific third level; and determining clusters of related documents based on the comparisons. The levels of generality may be, for example, document-level, paragraph-level, and/or citation-level. The locating may locate only citations to the other documents to be clustered. The clusters may be determined based on a weighted score of the amount of overlapping citations at the various levels of generality and/or by performing factor analysis using the comparison results. The clusters may be ranked to determine the dominant clusters.
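The pairwise comparison described in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration, not taken from the patent: each document is modeled as a list of paragraphs, each paragraph as a set of cited reference IDs; paragraph-level overlap is read as "two or more references cited together in one paragraph of each document"; and the weights `W_DOC` and `W_PAR` are hypothetical.

```python
# Hypothetical weights: the more specific level counts for more.
W_DOC = 1.0
W_PAR = 2.0


def document_overlap(doc_a, doc_b):
    """Count references cited anywhere in both documents (first, general level)."""
    refs_a = set().union(*doc_a)
    refs_b = set().union(*doc_b)
    return len(refs_a & refs_b)


def paragraph_overlap(doc_a, doc_b):
    """Count references shared as a group of two or more by some paragraph
    of doc_a and some paragraph of doc_b (second, more specific level).
    This reading of "paragraph-level" is an assumption."""
    shared = set()
    for par_a in doc_a:
        for par_b in doc_b:
            common = par_a & par_b
            if len(common) >= 2:
                shared |= common
    return len(shared)


def citation_overlap_score(doc_a, doc_b):
    """Combine both overlap levels into one weighted score for the pair."""
    return W_DOC * document_overlap(doc_a, doc_b) + W_PAR * paragraph_overlap(doc_a, doc_b)


# Example documents: lists of paragraphs, each a set of cited reference IDs.
doc_a = [{"R1", "R2"}, {"R3"}]
doc_b = [{"R1", "R2", "R4"}, {"R5"}]
doc_c = [{"R1"}, {"R2"}, {"R3"}]

print(citation_overlap_score(doc_a, doc_b))  # 6.0: overlap at both levels
print(citation_overlap_score(doc_a, doc_c))  # 3.0: document-level overlap only
```

In this sketch, `doc_a` and `doc_c` share more references overall, yet `doc_a` and `doc_b` score higher because their shared references appear together in the same paragraph.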
30 Claims
1. A method for clustering documents using one or more processors, the method comprising:
- in each document in a collection of documents, locating citations, the citations including references to other documents;
- for each pair of one or more pairs of documents in the collection of documents, performing actions comprising:
  - comparing the pair of documents based on overlapping citations at a first document structure level to generate a first citation overlap level, each of the overlapping citations at the first citation overlap level including common references to a same other document, the common references occurring at the first document structure level in each document of the pair;
  - comparing the pair of documents based on overlapping citations at a second document structure level to generate a second citation overlap level, each of the overlapping citations at the second citation overlap level including common references to a same other document, the common references occurring at the second document structure level in each document of the pair, where the second document structure level is more specific than the first document structure level; and
  - combining the first citation overlap level and the second citation overlap level to generate a citation overlap score for the pair of documents;
- determining a plurality of clusters of related documents in the collection of documents according to the citation overlap scores;
- ranking the plurality of clusters including:
  - generating a weighted cluster citation overlap score for each pair of documents in each cluster of the plurality of clusters, and
  - penalizing the weighted cluster citation overlap score for one or more clusters based on whether each cluster contains documents that contain only overlapping citations with other documents at the first document structure level; and
- providing a listing of the plurality of clusters of related documents based on the ranking.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10
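The clustering and ranking steps recited in claim 1 might be sketched as follows. This is an illustration under stated assumptions, not the claimed implementation: clusters are formed as connected components of a thresholded score graph (a simple stand-in technique; the claim does not name a clustering algorithm), the cluster score is the mean pair score, and the constants `W_DOC`, `W_PAR`, `THRESHOLD`, and `PENALTY` are all hypothetical. The input `overlaps` maps an alphabetically ordered document pair to its (first-level, second-level) overlap counts.

```python
from itertools import combinations

W_DOC, W_PAR = 1.0, 2.0  # hypothetical level weights
THRESHOLD = 2.0          # hypothetical minimum pair score to link two documents
PENALTY = 0.5            # hypothetical factor applied to penalized clusters


def pair_score(levels):
    """Weighted combination of the two overlap levels for one pair."""
    doc_level, par_level = levels
    return W_DOC * doc_level + W_PAR * par_level


def cluster(doc_ids, overlaps):
    """Group documents into connected components of the thresholded graph."""
    parent = {d: d for d in doc_ids}

    def find(d):
        while parent[d] != d:
            parent[d] = parent[parent[d]]  # path halving
            d = parent[d]
        return d

    for (a, b), levels in overlaps.items():
        if pair_score(levels) >= THRESHOLD:
            parent[find(a)] = find(b)

    groups = {}
    for d in doc_ids:
        groups.setdefault(find(d), []).append(d)
    # Singletons are not treated as clusters in this sketch.
    return [sorted(g) for g in groups.values() if len(g) > 1]


def cluster_score(members, overlaps):
    """Mean pair score, penalized if any member overlaps with the rest
    of the cluster at the first (general) level only."""
    pairs = list(combinations(members, 2))
    levels = [overlaps.get(p, (0, 0)) for p in pairs]
    score = sum(pair_score(l) for l in levels) / len(pairs)
    for m in members:
        own = [l for p, l in zip(pairs, levels) if m in p]
        if all(par_level == 0 for _, par_level in own):
            return score * PENALTY
    return score


def rank(doc_ids, overlaps):
    """Return clusters ordered from highest to lowest (penalized) score."""
    clusters = cluster(doc_ids, overlaps)
    return sorted(clusters, key=lambda c: cluster_score(c, overlaps), reverse=True)


# Example input: (first-level overlap, second-level overlap) per document pair.
doc_ids = ["A", "B", "C", "D", "E", "F"]
overlaps = {
    ("A", "B"): (2, 2),
    ("A", "C"): (3, 0),
    ("B", "C"): (1, 0),
    ("D", "E"): (1, 1),
}

print(rank(doc_ids, overlaps))  # [['D', 'E'], ['A', 'B', 'C']]
```

Without the penalty, the cluster A, B, C would rank first with a mean score of about 3.33 against 3.0 for D, E. Because C overlaps with A and B only at the first level, the sketch halves that cluster's score and D, E ranks first.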
11. A system comprising:
- one or more computers configured to perform operations comprising:
  - in each document in a collection of documents, locating citations, the citations including references to other documents;
  - for each pair of one or more pairs of documents in the collection of documents, performing actions comprising:
    - comparing the pair of documents based on overlapping citations at a first document structure level to generate a first citation overlap level, each of the overlapping citations at the first citation overlap level including common references to a same other document, the common references occurring at the first document structure level in each document of the pair;
    - comparing the pair of documents based on overlapping citations at a second document structure level to generate a second citation overlap level, each of the overlapping citations at the second citation overlap level including common references to a same other document, the common references occurring at the second document structure level in each document of the pair, where the second document structure level is more specific than the first document structure level; and
    - combining the first citation overlap level and the second citation overlap level to generate a citation overlap score for the pair of documents;
  - determining a plurality of clusters of related documents in the collection of documents according to the citation overlap scores;
  - ranking the plurality of clusters including:
    - generating a weighted cluster citation overlap score for each pair of documents in each cluster of the plurality of clusters, and
    - penalizing the weighted cluster citation overlap score for one or more clusters based on whether each cluster contains documents that contain only overlapping citations with other documents at the first document structure level; and
  - providing a listing of the plurality of clusters of related documents based on the ranking.
Dependent claims: 12, 13, 14, 15, 16, 17, 18, 19, 20
21. A computer program product embodied on a non-transitory computer-readable medium, the computer program product including instructions, which when executed by a computer system, are configured to cause the computer system to perform operations comprising:
- in each document in a collection of documents, locating citations, the citations including references to other documents;
- for each pair of one or more pairs of documents in the collection of documents, performing actions comprising:
  - comparing the pair of documents based on overlapping citations at a first document structure level to generate a first citation overlap level, each of the overlapping citations at the first citation overlap level including common references to a same other document, the common references occurring at the first document structure level in each document of the pair;
  - comparing the pair of documents based on overlapping citations at a second document structure level to generate a second citation overlap level, each of the overlapping citations at the second citation overlap level including common references to a same other document, the common references occurring at the second document structure level in each document of the pair, where the second document structure level is more specific than the first document structure level; and
  - combining the first citation overlap level and the second citation overlap level to generate a citation overlap score for the pair of documents;
- determining a plurality of clusters of related documents in the collection of documents according to the citation overlap scores;
- ranking the plurality of clusters including:
  - generating a weighted cluster citation overlap score for each pair of documents in each cluster of the plurality of clusters, and
  - penalizing the weighted cluster citation overlap score for one or more clusters based on whether each cluster contains documents that contain only overlapping citations with other documents at the first document structure level; and
- providing a listing of the plurality of clusters of related documents based on the ranking.
Dependent claims: 22, 23, 24, 25, 26, 27, 28, 29, 30
Specification