AUTHOR DISAMBIGUATION
Abstract
The techniques described herein automatically generate high precision clusters and high recall clusters for a set of documents having an author with a same or similar name. The high precision clusters and the high recall clusters can then be used in a labeling process so that efficient and accurate author disambiguation is realized.
20 Claims
1. A method comprising:
under control of one or more processors of a computing device:
accessing a plurality of documents, each document having one or more authors;
generating a plurality of first clusters of one or more documents based at least in part on an author name and similarities between one or more document features;
generating a plurality of second clusters each including one or more of the plurality of first clusters;
for each second cluster, determining a starting point first cluster; and
ranking the one or more first clusters within each second cluster based at least in part on a similarity with the respective starting point first cluster.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
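The two-level clustering recited in claim 1 can be illustrated with a minimal sketch. It is not the patented implementation. The document fields (`author`, `coauthors`, `venue`), the strict rule (compatible name plus a shared co-author), and the loose rule (a shared venue) are all hypothetical choices made for illustration, and connected-component grouping stands in for whatever clustering technique the specification actually describes.

```python
from itertools import combinations


def group(items, related):
    """Group items into connected components under the `related` predicate."""
    parent = list(range(len(items)))

    def find(i):
        # Follow parent links to the root, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(items)), 2):
        if related(items[i], items[j]):
            parent[find(j)] = find(i)

    groups = {}
    for i, item in enumerate(items):
        groups.setdefault(find(i), []).append(item)
    return list(groups.values())


def name_key(name):
    """Hypothetical name normalization: (first initial, last name)."""
    parts = name.replace(".", "").split()
    return (parts[0][0].lower(), parts[-1].lower())


def strict(a, b):
    # Assumed high-precision rule: compatible name AND a shared co-author.
    return name_key(a["author"]) == name_key(b["author"]) and bool(
        a["coauthors"] & b["coauthors"]
    )


def loose(c1, c2):
    # Assumed high-recall rule: any two documents share a venue.
    return bool({d["venue"] for d in c1} & {d["venue"] for d in c2})


docs = [
    {"id": "d1", "author": "J. Smith", "coauthors": {"A. Lee"}, "venue": "KDD"},
    {"id": "d2", "author": "J. Smith", "coauthors": {"A. Lee", "B. Chan"}, "venue": "KDD"},
    {"id": "d3", "author": "John Smith", "coauthors": {"C. Diaz"}, "venue": "KDD"},
    {"id": "d4", "author": "J. Smith", "coauthors": {"E. Park"}, "venue": "Nature"},
]

first_clusters = group(docs, strict)             # clusters of documents
second_clusters = group(first_clusters, loose)   # clusters of first clusters

second_ids = [[[d["id"] for d in fc] for fc in sc] for sc in second_clusters]
print(second_ids)
```

In this sketch each second cluster is a list of first clusters, which mirrors the claim language that each second cluster includes one or more first clusters.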
13. A system comprising:
one or more processors;
a memory, coupled to the one or more processors, storing:
a document access module, operable by the one or more processors, that accesses a set of documents, each document having an author with a same or similar name;
a document feature obtaining module, operable by the one or more processors, that obtains document features from the set of documents;
a high precision clustering module, operable by the one or more processors, that creates one or more high precision clusters based at least in part on the document features;
a high recall clustering module, operable by the one or more processors, that creates one or more high recall clusters based at least in part on the document features, each high recall cluster including one or more high precision clusters; and
a ranking module, operable by the one or more processors, that identifies a starting point high precision cluster within each high recall cluster and ranks the one or more high precision clusters within each high recall cluster based at least in part on a degree of similarity to the respective starting point high precision cluster.
Dependent claims: 14, 15, 16, 17
18. One or more computer storage media, comprising computer-executable instructions that configure a computing device to:
access a set of documents, each document having at least one author with a same or similar name;
obtain document features from the set of documents;
generate a plurality of high precision clusters for the set of documents based at least in part on a similarity comparison of a first group of document features signals;
generate a plurality of high recall clusters for the set of documents based at least in part on a similarity comparison of a second group of document features signals, each high recall cluster containing one or more high precision clusters;
within each high recall cluster, identify a high precision cluster that has the highest overall inner similarity between documents; and
rank the one or more high precision clusters within each high recall cluster based at least in part on a similarity with the identified high precision cluster.
Dependent claims: 19, 20
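The last two steps of claim 18, selecting the high precision cluster with the highest inner similarity and ranking the remaining clusters against it, can be sketched as follows. This is an illustration under stated assumptions rather than the claimed implementation: each document is reduced to a hypothetical set of feature tokens, similarity is Jaccard overlap, inner similarity is the mean pairwise similarity within a cluster, and a single-document cluster is assigned an inner similarity of 0.0.

```python
from itertools import combinations, product


def jaccard(a, b):
    """Jaccard overlap between two feature sets (an assumed similarity measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def inner_similarity(cluster):
    """Mean pairwise similarity between documents in one cluster."""
    pairs = list(combinations(cluster, 2))
    if not pairs:
        return 0.0  # assumption: a singleton has no measurable inner similarity
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


def cross_similarity(c1, c2):
    """Mean similarity over all document pairs drawn from two clusters."""
    pairs = list(product(c1, c2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


def rank_clusters(high_recall_cluster):
    """Return the starting cluster first, then the rest by similarity to it."""
    start = max(high_recall_cluster, key=inner_similarity)
    others = [c for c in high_recall_cluster if c is not start]
    others.sort(key=lambda c: cross_similarity(start, c), reverse=True)
    return [start] + others


# Hypothetical high precision clusters; each document is a set of feature tokens.
A = [{"kdd", "clustering", "lee"}, {"kdd", "clustering", "lee", "chan"}]
B = [{"kdd", "clustering"}, {"icml", "ranking"}]
C = [{"nature", "genomics"}]
D = [{"kdd", "clustering", "diaz"}]

ranked = rank_clusters([B, C, A, D])
```

Here `A` becomes the starting point because its two documents overlap most strongly, and the remaining clusters follow in decreasing similarity to `A`, so a person labeling the clusters would review the most likely matches first.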
Specification