AUTHOR DISAMBIGUATION

US 20130198192A1
Filed: 01/26/2012
Published: 08/01/2013
Est. Priority Date: 01/26/2012
Status: Active Grant

Abstract

The techniques described herein automatically generate high precision clusters and high recall clusters for a set of documents having an author with a same or similar name. The high precision clusters and the high recall clusters can then be used in a labeling process so that efficient and accurate author disambiguation is realized.

20 Claims

1. A method comprising:
- under control of one or more processors of a computing device;
  
  accessing a plurality of documents, each document having one or more authors;
  
  generating a plurality of first clusters of one or more documents based at least in part on an author name and similarities between one or more document features;
  
  generating a plurality of second clusters each including one or more of the plurality of first clusters;
  
  for each second cluster, determining a starting point first cluster; and
  
  ranking the one or more first clusters within each second cluster based at least in part on a similarity with the respective starting point first cluster.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12)
  - 2. The method as recited in claim 1, further comprising providing one or more of the plurality of second clusters to one or more human labelers to implement a author disambiguation labeling process.
  - 3. The method as recited in claim 1, wherein each of the documents in the plurality of first clusters have an author with a same or similar name.
  - 4. The method as recited in claim 1, the plurality of first clusters are high precision clusters, and each of the one or more documents in each high precision cluster are associated with a same author within a predetermined confidence level.
  - 5. The method as recited in claim 4, wherein the plurality of second clusters are high recall clusters, wherein each high precision cluster associated with the same author are included in a single high recall cluster within a predetermined confidence level.
  - 6. The method as recited in claim 1, wherein the document features are associated with an author name.
  - 7. The method as recited in claim 1, wherein the document features are indicative of an author name.
  - 8. The method as recited in claim 1, wherein the plurality of first clusters are generated in accordance with one or more classifier model parameters learned in a training environment.
  - 9. The method as recited in claim 1, wherein the plurality of second clusters are generated in accordance with one or more classifier model parameters learned in a training environment.
  - 10. The method as recited in claim 1, wherein the starting point first cluster has the highest overall inner similarity between documents.
  - 11. The method as recited in claim 1, wherein the document features are selected from a group comprising:
    - a focus author name;
      
      a focus author email;
      
      a focus author affiliation;
      
      a focus author homepage;
      
      a coauthor name;
      
      a coauthor email;
      
      a coauthor affiliation;
      
      a coauthor homepage;
      
      a document title;
      
      a document reference;
      
      a document citation;
      
      a download Uniform Resource Locator; and
      
      publication year.
  - 12. The method as recited in claim 1, wherein the generating the plurality of second clusters comprises comparing one or more of the document features.
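
The two clustering stages recited in claims 1, 4, and 5 can be illustrated with a minimal sketch. It is not the patented implementation: the claims refer to trained classifier parameters, whereas this example substitutes a plain Jaccard similarity over coauthor names, single-link grouping, and hand-picked thresholds (0.5 and 0.2). The names `cluster`, `doc_similarity`, `cluster_similarity`, and the sample documents are all hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity between two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster(items, similarity, threshold):
    """Single-link grouping: join any two items whose similarity >= threshold."""
    parent = list(range(len(items)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i, j in combinations(range(len(items)), 2):
        if similarity(items[i], items[j]) >= threshold:
            parent[find(j)] = find(i)

    groups = {}
    for i, item in enumerate(items):
        groups.setdefault(find(i), []).append(item)
    return list(groups.values())


def doc_similarity(d1, d2):
    """Assumed stand-in for a trained classifier: same name, then coauthor overlap."""
    if d1["author"] != d2["author"]:
        return 0.0
    return jaccard(d1["coauthors"], d2["coauthors"])


def cluster_similarity(c1, c2):
    """Compare two first clusters by the pooled coauthors of their documents."""
    pooled1 = set().union(*(d["coauthors"] for d in c1))
    pooled2 = set().union(*(d["coauthors"] for d in c2))
    return jaccard(pooled1, pooled2)


# Hypothetical documents that all list an author named "J. Smith".
docs = [
    {"id": "d1", "author": "J. Smith", "coauthors": {"A. Lee", "B. Kim"}},
    {"id": "d2", "author": "J. Smith", "coauthors": {"A. Lee", "B. Kim", "C. Roy"}},
    {"id": "d3", "author": "J. Smith", "coauthors": {"C. Roy", "D. Sun"}},
    {"id": "d4", "author": "J. Smith", "coauthors": {"X. Wu"}},
]

# Stage 1: strict threshold gives small, high precision first clusters.
first_clusters = cluster(docs, doc_similarity, threshold=0.5)

# Stage 2: relaxed threshold groups first clusters into high recall second clusters.
second_clusters = cluster(first_clusters, cluster_similarity, threshold=0.2)
```

With these sample documents, `d1` and `d2` form one first cluster, while `d3` and `d4` each stand alone; the relaxed second stage then places the `d3` cluster in the same second cluster as the `d1`/`d2` cluster because they share one coauthor.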

13. A system comprising:
- one or more processors;
  
  a memory, coupled to the one or more processors, storing;
  
  a document access module, operable by the one or more processors, that accesses a set of documents, each document having an author with a same or similar name;
  
  a document feature obtaining module, operable by the one or more processors, that obtains document features from the set of documents;
  
  a high precision clustering module, operable by the one or more processors, that creates one or more high precision clusters based at least in part on the document features;
  
  a high recall clustering module, operable by the one or more processors, that creates one or more high recall clusters based at least in part on the document features, each high recall cluster including one or more high precision clusters; and
  
  a ranking module, operable by the one or more processors, that identifies a starting point high precision cluster within each high recall cluster and ranks the one or more high precision clusters within each high recall cluster based at least in part on a degree of similarity to the respective starting point high precision cluster.
- Dependent claims (14, 15, 16, 17)
  - 14. The system of claim 13, wherein the high precision clustering module employs a high precision similarity algorithm including first trained classifiers to create the one or more high precision clusters.
  - 15. The system of claim 14, wherein the high recall clustering module employs a high recall similarity algorithm including second trained classifiers to create the one or more high recall clusters, wherein the second trained classifiers are more relaxed compared to the first trained classifiers.
  - 16. The system of claim 13, wherein the high precision clustering module creates the one or more high precision clusters based on a first group of signals determined from the document features and the high recall clustering module creates the one or more high recall clusters based on a second group of signals determined from the document features, wherein the first group of signals and the second group of signals are different.
  - 17. The system of claim 13, further comprising a label association module, operable by the one or more processors, that receives human labeling information applied to the ranked high precision clusters and associates the human labeling information with the set of documents to improve author-based search functionality.
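
The label association step in claim 17 could look roughly like the following sketch. The claim does not specify data structures, so the cluster identifiers, the `associate_labels` function, and the dictionary shapes are assumptions made purely for illustration.

```python
def associate_labels(ranked_clusters, cluster_labels):
    """Map each document ID to the author label a reviewer gave its cluster.

    ranked_clusters: list of (cluster_id, [doc_id, ...]) in ranked order.
    cluster_labels:  dict of cluster_id -> author label from human labelers.
    """
    doc_labels = {}
    for cluster_id, doc_ids in ranked_clusters:
        label = cluster_labels.get(cluster_id)
        if label is None:
            continue  # clusters the labeler skipped stay unassigned
        for doc_id in doc_ids:
            doc_labels[doc_id] = label
    return doc_labels


# Hypothetical ranked high precision clusters and reviewer decisions.
ranked = [("hp1", ["d1", "d2"]), ("hp2", ["d3"]), ("hp3", ["d4"])]
human_labels = {"hp1": "author-001", "hp2": "author-001"}

doc_to_author = associate_labels(ranked, human_labels)
```

Because one label applies to every document in a high precision cluster, a reviewer makes one decision per cluster rather than one per document.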

18. One or more computer storage media, comprising computer-executable instructions that configure a computing device to:
- access a set of documents, each document having at least one author with a same or similar name;
  
  obtain document features from the set of documents;
  
  generate a plurality of high precision clusters for the set of documents based at least in part on a similarity comparison of a first group of document features signals;
  
  generate a plurality of high recall clusters for the set of documents based at least in part on a similarity comparison of a second group of document features signals, each high recall cluster containing one or more high precision clusters;
  
  within each high recall cluster, identify a high precision cluster that has the highest overall inner similarity between documents; and
  
  rank the one or more high precision clusters within each high recall cluster based at least in part on a similarity with the identified high precision cluster.
- Dependent claims (19, 20)
  - 19. The one or more computer storage media of claim 18, wherein the document features are selected from a group comprising:
    - a focus author name;
      
      a focus author email;
      
      a focus author affiliation;
      
      a focus author homepage;
      
      a coauthor name;
      
      a coauthor email;
      
      a coauthor affiliation;
      
      a coauthor homepage;
      
      a document title;
      
      a document reference;
      
      a document citation;
      
      a download Uniform Resource Locator; and
      
      publication year.
  - 20. The one or more computer storage media of claim 18, wherein the plurality of high precision clusters are generated in accordance with a high precision similarity function including first trained classifiers, and the plurality of high recall clusters are generated in accordance with a high recall similarity function including second trained classifiers.
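
The starting point selection and ranking steps of claim 18 (and claims 1 and 10) can be sketched as follows. The claims do not define "overall inner similarity", so this example assumes it is the mean pairwise Jaccard similarity between a cluster's documents, with each document reduced to a set of feature tokens. The functions `inner_similarity`, `cross_similarity`, and `rank_clusters` are hypothetical names.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity between two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def inner_similarity(cluster):
    """Assumed definition: mean pairwise similarity of documents in a cluster."""
    pairs = list(combinations(cluster, 2))
    if not pairs:
        return 0.0  # assumption: a one-document cluster has no pairwise evidence
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


def cross_similarity(c1, c2):
    """Mean similarity over all document pairs drawn from two clusters."""
    return sum(jaccard(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))


def rank_clusters(high_recall_cluster):
    """Put the starting point first, then the rest by similarity to it."""
    start = max(high_recall_cluster, key=inner_similarity)
    others = [c for c in high_recall_cluster if c is not start]
    others.sort(key=lambda c: cross_similarity(start, c), reverse=True)
    return [start] + others


# Hypothetical high precision clusters; each document is a set of feature tokens.
cluster_a = [{"lee", "kim"}, {"lee", "kim", "roy"}]
cluster_b = [{"roy", "sun"}, {"roy", "lee"}]
cluster_c = [{"wu"}]

ranked = rank_clusters([cluster_c, cluster_b, cluster_a])
```

Here `cluster_a` has the highest inner similarity and becomes the starting point, `cluster_b` follows because it shares tokens with it, and `cluster_c` comes last. Presenting clusters in this order would let a labeler confirm the most likely matches first.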

Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Microsoft Corporation
Inventors: Hu, Yunhua; Nie, Zaiqing

Granted Patent: US 9,305,083 B2
US Class Current: 707/738
CPC Class Codes:
- G06F 16/35 Clustering; Classification
- G06F 16/951 Indexing; Web crawling tech...
