Method and apparatus for clustering a collection of linked documents using co-citation analysis
First Claim
1. A method for clustering documents contained in a collection of linked documents, said method comprising the steps of:
a) specifying a link frequency threshold, said link frequency threshold indicating a number of times a document is linked to from another document in said collection;
b) for each document in said collection, determining an associated link frequency;
c) discarding each document in said collection whose link frequency is lower than said link frequency threshold;
d) creating a co-citation list, said co-citation list comprised of pairs of documents that are linked to by the same document in said collection; and
e) performing a suitable clustering operation on said co-citation list to generate document clusters.
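Steps a) through c) can be illustrated with a minimal Python sketch. It assumes the collection is available as a mapping from each document to the set of documents it links to. The function names `link_frequencies` and `filter_by_threshold`, the choice to ignore self-links, and the sample data are hypothetical and are not taken from the patent.

```python
from collections import Counter

def link_frequencies(links):
    """Count, for each document, how many other documents link to it."""
    # Start every known document at zero so unlinked documents are reported too.
    freq = Counter({doc: 0 for doc in links})
    for source, targets in links.items():
        for target in set(targets):
            if target != source:  # assumption: self-links are ignored
                freq[target] += 1
    return freq

def filter_by_threshold(links, threshold):
    """Keep only documents whose link frequency meets the threshold."""
    freq = link_frequencies(links)
    return {doc for doc, n in freq.items() if n >= threshold}

# Hypothetical four-document collection: document -> documents it links to.
links = {
    "A": {"C", "D"},
    "B": {"C", "D"},
    "C": {"D"},
    "D": set(),
}
kept = filter_by_threshold(links, threshold=2)
print(sorted(kept))  # documents A and B are discarded
```

Documents below the threshold are dropped before any pairs are formed, which keeps the later co-citation list small.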
Abstract
The method and apparatus of the present invention generate clusters of documents in a collection of linked documents based on co-citation analysis. The frequency linkage is determined for each document in the collection. In other words, the number of times each document is linked to by another document in the collection is determined. Further, a minimum frequency linkage (link frequency threshold) is specified based on a predetermined minimum frequency of document linkage. Additionally, a list of pairs of documents that are linked to by the same document is created, so that each pair of documents has a count of the number of times (co-citation frequency) that both documents are linked to by another document. Pairs of linked documents are clustered using a suitable co-citation technique.
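The co-citation list and its per-pair co-citation frequency might be built as in the following sketch. It assumes the same document-to-link-targets mapping as input and a set `kept` of documents that already met the link frequency threshold; `cocitation_counts` and the sample data are hypothetical names chosen for illustration.

```python
from collections import Counter
from itertools import combinations

def cocitation_counts(links, kept):
    """Count how often each pair of kept documents is linked to by the same document."""
    counts = Counter()
    for source, targets in links.items():
        # Only documents that survived the threshold step can form pairs.
        cited = sorted(t for t in set(targets) if t in kept and t != source)
        for pair in combinations(cited, 2):
            counts[pair] += 1  # one more document links to both members
    return counts

# Hypothetical collection: P1..P4 are linking pages, A..C are linked-to documents.
links = {
    "P1": {"A", "B", "C"},
    "P2": {"A", "B"},
    "P3": {"B", "C"},
    "P4": {"A"},
}
kept = {"A", "B", "C"}
counts = cocitation_counts(links, kept)
for pair, n in sorted(counts.items()):
    print(pair, n)
```

Sorting the cited documents before pairing gives each pair a single canonical order, so ("A", "B") and ("B", "A") are never counted separately.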
15 Claims
1. A method for clustering documents contained in a collection of linked documents, said method comprising the steps of:
a) specifying a link frequency threshold, said link frequency threshold indicating a number of times a document is linked to from another document in said collection;
b) for each document in said collection, determining an associated link frequency;
c) discarding each document in said collection whose link frequency is lower than said link frequency threshold;
d) creating a co-citation list, said co-citation list comprised of pairs of documents that are linked to by the same document in said collection; and
e) performing a suitable clustering operation on said co-citation list to generate document clusters.
Dependent claims: 2, 3, 4, 5, 6, 7, 8
9. A system for clustering documents in a collection of linked documents, said system comprising:
means for accessing said documents in said collection of linked documents;
means for specifying a link frequency threshold;
link analysis means for analyzing the links contained in said documents in said collection of linked documents, said link analysis means further comprising:
  means for determining the number of times a document has been linked to; and
  means for generating a co-citation list, said co-citation list indicating pairs of documents that have been linked to by the same document;
clustering means for generating document clusters from said co-citation list.
Dependent claims: 10, 11, 12, 13
14. A program storage device readable by a machine, tangibly embodying a program of instructions executable by the machine to perform method steps for the clustering of documents in a collection of linked documents, said method steps comprising:
a) specifying a link frequency threshold, said link frequency threshold indicating a number of times a document is linked to from another document in said collection;
b) for each document in said collection, determining an associated link frequency;
c) discarding each document in said collection whose link frequency is lower than said link frequency threshold;
d) creating a co-citation list, said co-citation list comprised of pairs of documents that are linked to by the same document in said collection; and
e) performing a suitable clustering operation on said co-citation list to generate document clusters.
Dependent claims: 15
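The claims leave the clustering operation open ("a suitable clustering operation"). As one possible choice, and purely as an assumption for illustration, the sketch below groups documents into connected components of the co-citation list using a union-find structure. The function `cluster_pairs`, the `min_cocitations` parameter, and the sample pair counts are hypothetical and do not come from the claims.

```python
def cluster_pairs(pair_counts, min_cocitations=1):
    """Group documents into connected components of the co-citation list."""
    parent = {}

    def find(x):
        # Return the representative of x's cluster, registering x if new.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (a, b), n in pair_counts.items():
        if n >= min_cocitations:
            parent[find(a)] = find(b)  # merge the two clusters

    # Documents that appear only in pairs below the cutoff are left out.
    clusters = {}
    for doc in list(parent):
        clusters.setdefault(find(doc), set()).add(doc)
    return sorted(sorted(members) for members in clusters.values())

# Hypothetical co-citation list: (document, document) -> co-citation frequency.
pair_counts = {
    ("A", "B"): 2,
    ("B", "C"): 2,
    ("X", "Y"): 1,
    ("C", "X"): 0,
}
clusters = cluster_pairs(pair_counts)
print(clusters)
```

Raising `min_cocitations` makes the grouping stricter: only pairs co-cited at least that many times are allowed to join two documents into the same cluster.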
Specification