System And Methods For Clustering Large Database of Documents
First Claim
1. A method of organizing a plurality of documents for later access and retrieval within a computerized system, wherein the plurality of documents are contained within a dataset and wherein a class of documents contained in the dataset include one or more citations to one or more other documents, comprising the steps of:
creating a set of fingerprints for each respective document in the class, wherein each fingerprint comprises one or more citations contained in the respective document;
creating a plurality of clusters for the dataset based on the sets of fingerprints for the documents in the class;
assigning each respective document in the class to zero or more of the clusters based on the set of fingerprints for said respective document and wherein each respective cluster has documents assigned thereto based on a statistical similarity between the sets of fingerprints of said assigned documents;
for each remaining document in the dataset that has not yet been assigned to at least one cluster, assigning each said remaining document to one or more of the clusters based on a natural language processing comparison of each said remaining document with documents already assigned to each respective cluster;
creating a descriptive label for each respective cluster based on key terms contained in the documents assigned to the respective cluster; and
presenting one or more of the labeled clusters to a user of the computerized system.
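The fingerprinting and similarity-based assignment steps of the claim can be illustrated with a minimal sketch. The claim does not fix how citations are grouped into fingerprints or which statistical similarity is used, so the choices here are assumptions: each fingerprint is a pair of citations (the hypothetical `size=2` parameter), similarity is Jaccard overlap between fingerprint sets, and `threshold` is an arbitrary cutoff. All function names are illustrative.

```python
from itertools import combinations


def make_fingerprints(citations, size=2):
    """Return a set of fingerprints; each fingerprint is a frozenset of citations.

    Assumption: a fingerprint is any combination of `size` distinct citations.
    """
    unique = sorted(set(citations))
    if len(unique) < size:
        # Too few citations for a full-size fingerprint: use one holding them all.
        return {frozenset(unique)} if unique else set()
    return {frozenset(combo) for combo in combinations(unique, size)}


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def assign_by_fingerprints(doc_fps, cluster_fps, threshold=0.3):
    """Map each document to the clusters (possibly none) it is similar enough to."""
    return {
        doc_id: [cid for cid, cfps in cluster_fps.items()
                 if jaccard(fps, cfps) >= threshold]
        for doc_id, fps in doc_fps.items()
    }


# Illustrative data: documents A and B share citations; C cites something unrelated.
docs = {"A": ["c1", "c2", "c3"], "B": ["c1", "c2", "c4"], "C": ["c9"]}
doc_fps = {doc_id: make_fingerprints(cites) for doc_id, cites in docs.items()}
cluster_fps = {"k1": doc_fps["A"] | doc_fps["B"]}
assignments = assign_by_fingerprints(doc_fps, cluster_fps)
```

In this sketch document C ends up with an empty list, which corresponds to the "zero or more of the clusters" wording: such a document would be left for the natural language processing comparison step.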
Abstract
In a computerized system, a method of organizing a plurality of documents within a dataset of documents, wherein a plurality of documents within a class of the dataset each includes one or more citations to one or more other documents, comprising creating a set of fingerprints for each respective document in the class, wherein each fingerprint comprises one or more citations contained in the respective document, creating a plurality of clusters for the dataset based on the sets of fingerprints for the documents in the class, assigning each respective document in the dataset to one or more of the clusters, creating a descriptive label for each respective cluster, and presenting one or more of the labeled clusters to a user of the computerized system or providing the user with access to documents in at least one cluster.
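The labeling step can be sketched as follows. The abstract only says that a descriptive label is created for each cluster, and claim 1 bases it on key terms in the assigned documents; how key terms are chosen is not stated here. This sketch assumes a simple score that favors terms appearing in many cluster documents and in few documents elsewhere. The stopword list, the scoring formula, and the names `tokenize` and `label_cluster` are all hypothetical.

```python
import re
from collections import Counter

# Assumption: a tiny illustrative stopword list, not a complete one.
STOPWORDS = {"a", "an", "and", "the", "of", "in", "for", "to", "is", "on"}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens that are not stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def label_cluster(cluster_texts, all_texts, top_n=3):
    """Build a label from the top_n terms most over-represented in the cluster.

    Assumption: cluster_texts is a subset of all_texts, so every cluster term
    has a nonzero document frequency in the whole dataset.
    """
    cluster_df = Counter(t for text in cluster_texts for t in set(tokenize(text)))
    global_df = Counter(t for text in all_texts for t in set(tokenize(text)))

    def score(term):
        # Share of the term's documents that fall in the cluster, weighted by
        # how many cluster documents contain it.
        return cluster_df[term] * cluster_df[term] / global_df[term]

    # Sort by descending score, breaking ties alphabetically for stable output.
    ranked = sorted(cluster_df, key=lambda term: (-score(term), term))
    return " / ".join(ranked[:top_n])


all_texts = [
    "patent infringement damages",
    "patent claim construction",
    "contract breach damages",
    "contract formation offer",
]
label = label_cluster(all_texts[:2], all_texts, top_n=2)
```

A production system would likely use a fuller term-weighting scheme; the point of the sketch is only that the label is derived from the cluster's own documents.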
64 Claims
1. A method of organizing a plurality of documents for later access and retrieval within a computerized system, wherein the plurality of documents are contained within a dataset and wherein a class of documents contained in the dataset include one or more citations to one or more other documents, comprising the steps of:
creating a set of fingerprints for each respective document in the class, wherein each fingerprint comprises one or more citations contained in the respective document;
creating a plurality of clusters for the dataset based on the sets of fingerprints for the documents in the class;
assigning each respective document in the class to zero or more of the clusters based on the set of fingerprints for said respective document and wherein each respective cluster has documents assigned thereto based on a statistical similarity between the sets of fingerprints of said assigned documents;
for each remaining document in the dataset that has not yet been assigned to at least one cluster, assigning each said remaining document to one or more of the clusters based on a natural language processing comparison of each said remaining document with documents already assigned to each respective cluster;
creating a descriptive label for each respective cluster based on key terms contained in the documents assigned to the respective cluster; and
presenting one or more of the labeled clusters to a user of the computerized system.
Dependent claims: 2-32
33. In a computerized system, a method of organizing documents in a dataset of a plurality of documents, wherein a class of documents contained in the dataset include one or more citations to one or more other documents, comprising the steps of:
for each document in the class, creating a set of fingerprints, wherein each fingerprint identifies one or more citations contained in the respective document;
based on the sets of fingerprints for the documents in the class, creating a plurality of clusters for the dataset, wherein each cluster is defined as an overlap of fingerprints from two or more documents in the class;
assigning documents in the class to zero or more of the clusters based on the citations contained in each respective document;
assigning all remaining documents in the dataset, that have not yet been assigned to at least one cluster, to one or more clusters based on a natural language processing comparison of each said remaining document with documents already assigned to each respective cluster;
creating a label for each respective cluster based on key terms contained in the documents assigned to the respective cluster; and
providing to a user of the computerized system access to documents assigned to one or more clusters in response to a request by the user.
Dependent claims: 34-63
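Two steps of claim 33 lend themselves to a short sketch: defining clusters as an overlap of fingerprints from two or more documents, and placing the remaining documents by a natural language processing comparison. The claim names neither a data structure nor a comparison technique, so this sketch assumes an inverted index from fingerprint to documents and swaps in bag-of-words cosine similarity as a stand-in for the comparison. It also assigns each remaining document to a single best cluster, although the claim allows more than one. All names are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def overlap_clusters(doc_fps):
    """Group documents that share a fingerprint; keep groups of two or more."""
    by_fingerprint = defaultdict(set)
    for doc_id, fingerprints in doc_fps.items():
        for fp in fingerprints:
            by_fingerprint[fp].add(doc_id)
    return {fp: members for fp, members in by_fingerprint.items() if len(members) >= 2}


def bag_of_words(text):
    """Count lowercase alphabetic tokens in the text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity of two term-count vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def assign_remaining(texts, clusters):
    """Assign each unclustered document to the cluster whose members' text is closest."""
    assigned = set().union(*clusters.values())
    # One combined term-count profile per cluster, built from its member documents.
    profiles = {
        cid: sum((bag_of_words(texts[d]) for d in members), Counter())
        for cid, members in clusters.items()
    }
    result = {}
    for doc_id, text in texts.items():
        if doc_id not in assigned and profiles:
            vec = bag_of_words(text)
            result[doc_id] = max(profiles, key=lambda cid: cosine(vec, profiles[cid]))
    return result


# Illustrative data: E has no citations, so it can only be placed by its text.
patent_fp, contract_fp = frozenset({"c1", "c2"}), frozenset({"c7", "c8"})
doc_fps = {"A": {patent_fp}, "B": {patent_fp}, "C": {contract_fp}, "D": {contract_fp}, "E": set()}
texts = {
    "A": "patent infringement damages",
    "B": "patent claim construction",
    "C": "contract breach remedy",
    "D": "contract formation offer",
    "E": "breach of contract damages",
}
clusters = overlap_clusters(doc_fps)
fallback = assign_remaining(texts, clusters)
```

Here document E shares more vocabulary with C and D than with A and B, so the fallback places it in the cluster keyed by `contract_fp`.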
64. In a computerized system, a method of organizing a plurality of documents for later access and retrieval within the computerized system, wherein the plurality of documents are contained within a dataset and wherein a class of documents contained in the dataset include one or more citations to one or more other documents, comprising the steps of:
identifying spurious citations contained in documents in the class;
creating a set of fingerprints for each document in the class, wherein each fingerprint identifies one or more citations, other than spurious citations, contained in the respective document;
creating an initial plurality of low-level clusters for the dataset based on the sets of fingerprints for the documents in the class, wherein each cluster is defined as an overlap of fingerprints from two or more documents in the class;
creating a reduced plurality of high-level clusters by progressively merging pairs of low-level clusters to define a respective high-level cluster;
assigning documents in the dataset to one or more of the clusters;
creating a label for each respective cluster based on key terms contained in the documents assigned to the respective cluster; and
selectively presenting one or more of the low-level and high-level clusters to a user of the computerized system.
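The two steps that distinguish claim 64, filtering spurious citations and progressively merging pairs of low-level clusters, can be sketched as below. The claim does not say how a citation is judged spurious or which pairs are merged first, so both rules are assumptions: a citation is treated as spurious when more than a `max_share` fraction of documents cite it, and the most similar pair of clusters (by Jaccard overlap of their members) is merged until no pair reaches `min_similarity`. Function names and thresholds are hypothetical.

```python
from collections import Counter
from itertools import combinations


def find_spurious(doc_citations, max_share=0.5):
    """Return citations that appear in more than max_share of the documents.

    Assumption: a citation nearly every document contains carries little
    information about topic, so it is excluded from fingerprints.
    """
    doc_freq = Counter(c for cites in doc_citations.values() for c in set(cites))
    limit = max_share * len(doc_citations)
    return {c for c, n in doc_freq.items() if n > limit}


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def merge_clusters(low_level, min_similarity=0.5):
    """Merge the most similar pair of clusters until none is similar enough."""
    merged = [set(members) for members in low_level]
    while len(merged) > 1:
        # Pick the pair of cluster indices (i < j) with the highest member overlap.
        i, j = max(
            combinations(range(len(merged)), 2),
            key=lambda pair: jaccard(merged[pair[0]], merged[pair[1]]),
        )
        if jaccard(merged[i], merged[j]) < min_similarity:
            break
        merged[i] |= merged[j]
        del merged[j]
    return merged


# Illustrative data: every document cites c0, so c0 is treated as spurious.
doc_citations = {"A": ["c0", "c1"], "B": ["c0", "c1"], "C": ["c0", "c2"], "D": ["c0", "c3"]}
spurious = find_spurious(doc_citations)
cleaned = {d: [c for c in cites if c not in spurious] for d, cites in doc_citations.items()}

low_level = [{"A", "B"}, {"A", "B", "C"}, {"X", "Y"}]
high_level = merge_clusters(low_level)
```

Keeping both `low_level` and `high_level` around matches the final step of the claim, which presents clusters from either level.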
Specification