Person disambiguation using name entity extraction-based clustering

US 7,685,201 B2
Filed: 04/30/2007
Issued: 03/23/2010
Est. Priority Date: 09/08/2006
Status: Active Grant

Abstract

Described is a technology for disambiguating data corresponding to persons that are located from search results, so that different persons having the same name can be clearly distinguished. Name entity extraction locates words (terms) that are within a certain distance of persons' names in the search results. The terms are used in disambiguating search results that correspond to different persons having the same name, such as location information, organization information, career information, and/or partner information. In one example, each person is represented as a vector, and similarity among vectors is calculated based on weighting that corresponds to nearness of the terms to a person, and/or the types of terms. Based on the similarity data, the person vectors that represent the same person are then merged into one cluster, so that each cluster represents (to a high probability) only one distinct person.
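
The weighted vector similarity described in the abstract can be illustrated with a minimal Python sketch. It assumes that each feature of a person vector carries an entity weight (by type of term) and a nearness weight (by distance to the person's name), and that the two are combined by multiplication and aggregated by summation over shared features. The source only says the weights are combined and aggregated, so the product-and-sum form, the type aliases, and the function name `similarity` are illustrative assumptions.

```python
from typing import Dict, Tuple

Feature = Tuple[str, str]        # (entity type, entity name), e.g. ("location", "Seattle")
Weights = Tuple[float, float]    # (entity weight, nearness weight)
PersonVector = Dict[Feature, Weights]


def similarity(a: PersonVector, b: PersonVector) -> float:
    """Aggregate combined weights over the features two person vectors share.

    Assumption: "combining" is a product and "aggregating" is a sum;
    the source does not fix either operation.
    """
    total = 0.0
    for feature in a.keys() & b.keys():   # only features present in both vectors
        ew_a, nw_a = a[feature]
        ew_b, nw_b = b[feature]
        total += (ew_a * nw_a) * (ew_b * nw_b)
    return total
```

Under these assumptions, two vectors that share no entity feature score 0, and a shared feature contributes more when it is of a heavily weighted type and sits close to the person's name in both results.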

17 Claims

1. A method comprised of steps that are each performed by one or more computers, the steps manipulating data and information stored by the one or more computers, the steps of the method comprising:
- disambiguating person data located from one or more sets of search results, including extracting information about a person based on name entity extraction, and calculating similarity data, wherein the calculating similarity data comprises using a vector space model, wherein using the vector space model comprises determining a vector for a person, the vector comprising a plurality of entity features including one or more entity locations related to the person, one or more entity organizations related to the person, and one or more entities that the person has been associated with the person, wherein calculating similarity data comprises using a calculation in which each entity feature of the person vector has an entity weight and a nearness weight, and wherein the calculation comprises, for each entity feature, combining the corresponding entity weight and nearness weight with an entity weight and nearness weight of a same entity feature of another person vector and aggregating the combined weights of the entity features.
  - 2. The method of claim 1 wherein extracting the information about a person comprises locating at least one word that is within a word distance in the search results.
  - 3. The method of claim 1 wherein the using the vector space model further comprises using at least one data item of a set of data items, the set including location information, organization information, career information, or partner information.
  - 4. The method of claim 3 wherein at least one data item of the set has a different weight from at least one other data item in the set.
  - 5. The method of claim 1 further comprising clustering person vectors that are similar into clusters, based on the similarity data.
  - 6. The method of claim 5 wherein clustering the person vectors comprises performing an initial cluster merging operation to obtain high quality clusters, and merging clusters that share common data.
  - 7. The method of claim 6 wherein merging clusters comprises selecting a cluster, and determining data that appear more in one cluster and less in other clusters for the selected cluster.
  - 8. The method of claim 7 wherein determining the data comprises using a calculation based on a term and a number of clusters in which the term appears, in which TF and tf represent term frequency, ICF represents inverted cluster frequency, and cf represents cluster frequency:
    - $TFICF = tf(term) \cdot \log \frac{|Cluster|}{cf(term)}$
  - 9. The method of claim 1 further comprising, determining an industry for a person based on the person data.
  - 10. The method of claim 9 wherein determining the industry comprises using at least one mechanism of a set of mechanisms, the set including person name detection, location name identification, affiliation detection, and keyword extraction.
  - 11. The method of claim 10 wherein determining the industry comprises using a relational classification mechanism to output the industry.
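
The TFICF score of claim 8 can be sketched in Python as follows. This is a hedged illustration: it reads the numerator of the logarithm as the total number of clusters, uses the natural logarithm, and represents each cluster as a flat list of terms. All three are assumptions, as is the function name `tficf`.

```python
import math
from collections import Counter
from typing import Dict, List


def tficf(clusters: List[List[str]]) -> List[Dict[str, float]]:
    """Score each term per cluster as tf(term) * log(N / cf(term)).

    N is the number of clusters (assumed meaning of the numerator) and
    cf(term) is the number of clusters in which the term appears.
    """
    n_clusters = len(clusters)
    cf: Counter = Counter()
    for terms in clusters:
        cf.update(set(terms))            # count each term once per cluster
    scores = []
    for terms in clusters:
        tf = Counter(terms)              # term frequency within this cluster
        scores.append({t: tf[t] * math.log(n_clusters / cf[t]) for t in tf})
    return scores
```

A term that appears in every cluster scores 0, while a term that is frequent in one cluster and absent elsewhere scores highest, which matches the claim 7 goal of finding data that appear more in one cluster and less in others.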

12. A computer-readable medium having computer executable instructions, which when executed perform steps, comprising:
- disambiguating person data located from one or more text snippets, including:
  
  receiving the text snippets from a search engine in response to a query comprising a person name, the snippets including the person name;
  
  for each snippet extracting therefrom entity names of entities related to the person name, computing weights of the names of the entities according to their respective text distances in the snippet from the person name in the snippet, and constructing a person feature vector comprised of features that correspond to the names of the entities and each feature having the computed weight of its corresponding entity name;
  
  calculating similarity measures between the person feature vectors, each similarity measure representing similarity between two different person feature vectors, where for a given first person feature vector and a given second person feature vector, weights of features of the first person feature vector are combined with weights of the same features from the second person feature vector to compute the similarity measure between the first and second person feature vectors; and
  
  clustering the person feature vectors into clusters of similar feature vectors based on the similarity measures.
  - 13. The computer-readable medium of claim 12 wherein extracting the information comprises locating terms within a word distance in the snippets, wherein the terms correspond to at least one data item of a set of data items, the set including location information, organization information, career information, or partner information.
  - 14. The computer-readable medium of claim 13 wherein at least one data item of the set has a different weight from at least one other data item in the set, and wherein the weight is based on nearness to the person's name or on a type of data item, or both on nearness and the type of data item.
  - 15. A computer-readable medium according to claim 12, wherein the disambiguating is performed by a web browser.
  - 16. The computer-readable media of claim 15 wherein the names of entities comprise names of locations, names of organizations, or partner information.
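
The per-snippet step of claim 12 (extract entity names, weight each by its text distance from the person name, build a feature vector) might look like the sketch below. It makes several assumptions not stated in the source: entity names come from a caller-supplied list rather than a real name entity extractor, distance is counted in tokens between the start of the entity and the start of the nearest person-name occurrence, and the weight is `1 / (1 + distance)`. The helper names are hypothetical.

```python
import re
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    """Lowercase alphanumeric tokens; a deliberately simple tokenizer."""
    return re.findall(r"[a-z0-9]+", text.lower())


def find_phrase(tokens: List[str], phrase: List[str]) -> List[int]:
    """Start positions at which the token sequence `phrase` occurs."""
    n = len(phrase)
    if n == 0:
        return []
    return [i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == phrase]


def person_feature_vector(snippet: str, person_name: str,
                          entity_names: List[str]) -> Dict[str, float]:
    """Map each entity found in the snippet to a distance-based weight."""
    tokens = tokenize(snippet)
    person_positions = find_phrase(tokens, tokenize(person_name))
    vector: Dict[str, float] = {}
    if not person_positions:
        return vector
    for entity in entity_names:
        positions = find_phrase(tokens, tokenize(entity))
        if not positions:
            continue
        # Token distance to the nearest occurrence of the person name.
        distance = min(abs(p - q) for p in positions for q in person_positions)
        vector[entity] = 1.0 / (1.0 + distance)   # assumed decay function
    return vector
```

For the snippet "John Smith joined Microsoft in Seattle last year." with person name "John Smith", this sketch weights "Microsoft" (3 tokens away) at 0.25 and "Seattle" (5 tokens away) at about 0.167, so nearer entities contribute more to the vector.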

17. A method of disambiguating names performed by one or more computers, the method comprising the following steps performed by the one or more computers:
- receiving from a search engine text snippets, the text snippets have been found by the search engine in response to a query comprising a person name, the snippets including the person name;
  
  storing, by the one or more computers, the received text snippets;
  
  for each stored snippet, finding therein the person name and names of entities that are related to a person having the person name, computing weights of the names of the entities according to their respective text distances in the snippet from the person name in the snippet, and constructing a person feature vector comprised of features that correspond to the names of the entities and each feature having the computed weight of its corresponding entity name;
  
  calculating, by processing of the one or more computers, similarity measures between the person feature vectors, each similarity measure representing similarity between two different person feature vectors, where for a given first person feature vector and a given second person feature vector, weights of features of the first person feature vector are combined with weights of the same features from the second person feature vector to compute the similarity measure between the first and second person feature vectors;
  
  executing, by the one or more computers, a clustering algorithm to form clusters of the person feature vectors based on the similarity measures;
  
  merging clusters based on their having in-common same names of entities, each merged cluster representing the same person name; and
  
  disambiguating the person name by treating each cluster as representing a different person having the same person name.
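
The clustering and merging steps of claim 17 can be sketched in Python as a two-stage process. The claim names no specific algorithm, so everything concrete here is an assumption: a dot product as the similarity measure, single-link clustering with a similarity `threshold`, and a hypothetical `min_shared` parameter that sets how many entity names two clusters must have in common before they are merged.

```python
from typing import Dict, List, Set

Vector = Dict[str, float]   # entity name -> weight


def dot(a: Vector, b: Vector) -> float:
    """Combine weights of the features two vectors share (assumed: dot product)."""
    return sum(w * b[k] for k, w in a.items() if k in b)


def cluster(vectors: List[Vector], threshold: float) -> List[Set[int]]:
    """Single-link clustering: link vectors whose similarity reaches the threshold."""
    parent = list(range(len(vectors)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if dot(vectors[i], vectors[j]) >= threshold:
                parent[find(i)] = find(j)

    groups: Dict[int, Set[int]] = {}
    for i in range(len(vectors)):
        groups.setdefault(find(i), set()).add(i)
    return list(groups.values())


def merge_sharing_entities(clusters: List[Set[int]], vectors: List[Vector],
                           min_shared: int) -> List[Set[int]]:
    """Repeatedly merge clusters that have at least `min_shared` entity names in common."""
    merged = [(set(c), {name for i in c for name in vectors[i]}) for c in clusters]
    changed = True
    while changed:
        changed = False
        for a in range(len(merged)):
            for b in range(a + 1, len(merged)):
                if len(merged[a][1] & merged[b][1]) >= min_shared:
                    merged[a] = (merged[a][0] | merged[b][0],
                                 merged[a][1] | merged[b][1])
                    del merged[b]
                    changed = True
                    break
            if changed:
                break
    return [members for members, _ in merged]
```

Each set returned by `merge_sharing_entities` holds the indices of snippets treated as describing one distinct person. A strict first stage followed by a looser merge is one way to get high-precision initial clusters before recall is improved, but the thresholds would need tuning on real data.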

Current Assignee
Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee
Microsoft Corporation
Inventors
Huang, Shen; Chen, Zheng; Wang, Jian; Zeng, Hua-Jun
Primary Examiner(s)
Breene, John E
Assistant Examiner(s)
Nguyen, Phong H

Application Number
US11/796,818
Publication Number
US 20080065623A1
Time in Patent Office
1,058 Days
Field of Search
707/3, 707/4, 707/5, 707/10, 707/101, 707/999.003, 707/999.004, 707/999.005, 707/999.01, 707/999.101, 704/9
US Class Current
707/748
CPC Class Codes
G06F 16/338 Presentation of query results
G06F 16/355 Class or cluster creation o...
