Identifying synonyms of entities using a document collection

US 8,533,203 B2
Filed: 06/04/2009
Issued: 09/10/2013
Est. Priority Date: 06/04/2009
Status: Active Grant

2 Assignments

0 Petitions

Abstract

Identifying synonyms of entities using a collection of documents is disclosed herein. In some aspects, a document from a collection of documents may be analyzed to identify hit sequences that include one or more tokens (e.g., words, number, etc.). The hit sequences may then be used to generate discriminating token sets (DTS's) that are subsets of both the hit sequences and the entity names. The DTS's are matched with corresponding entity names, and then used to create DTS phrases by selecting adjacent text in the document that is proximate to the DTS. The DTS phrases may be analyzed to determine whether the corresponding DTS is synonyms of the entity name. In various aspects, the tokens of an associated entity name that are present in the DTS phrases are used to generate a score for the DTS. When the score at least reaches a threshold, the DTS may be designated as a synonym. A list of synonyms may be generated for each entity name.
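
The first steps described in the abstract can be illustrated with a minimal Python sketch. It assumes simple alphanumeric tokenization, treats a hit sequence as a maximal contiguous run of document tokens that occur in some entity name, and takes a fixed-size token window around a DTS as the DTS phrase. The function names, the window size, and the camera example are hypothetical and do not come from the patent.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def hit_sequences(doc_tokens, entity_names):
    """Return (start, tokens) for each maximal contiguous run of entity-name tokens."""
    vocab = {tok for name in entity_names for tok in tokenize(name)}
    hits, run_start = [], None
    # The trailing None is a sentinel that flushes a run ending at the last token.
    for i, tok in enumerate(doc_tokens + [None]):
        if tok in vocab:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            hits.append((run_start, doc_tokens[run_start:i]))
            run_start = None
    return hits


def dts_phrase(doc_tokens, start, length, window=5):
    """Return the DTS plus up to `window` tokens of adjacent text on each side."""
    lo = max(0, start - window)
    return doc_tokens[lo:start + length + window]


doc = tokenize("I bought the Canon EOS 350D camera yesterday and the 350D is great")
names = ["Canon EOS 350D Digital Camera", "Sony Cybershot"]
hits = hit_sequences(doc, names)
phrase = dts_phrase(doc, start=10, length=1, window=2)
```

Here `hits` holds two runs, one starting at token 3 and one at token 10, and `phrase` is the two-token context on either side of the second "350d".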

48 Citations

13 Claims

1. A method of efficiently selecting synonyms of an entity name, the method comprising:
- selecting a hit sequence from a document that is stored on a computing device, the hit sequence includes a contiguous string of tokens from a plurality of entity names in an entity name list;
  
  arranging the tokens of the hit sequence into a suffix tree as linked groups of tokens of the hit sequence, wherein the suffix tree contains a suffix link identifier to (i) identify a discriminating token set (DTS) that is a sub-sequence of the hit sequence, (ii) manage and generate the suffix tree, and (iii) efficiently batch process the hit sequences;
  
  generating a combination token index from the entity name list identifying a position of each of the tokens and indexes for one or more combinations of each of the tokens;
  
  determining a discriminating token set map from the combination token index and the suffix tree, the DTS map including a matching of the entity name and the DTS;
  
  storing a portion of adjacent text surrounding the DTS from the document as a DTS phrase;
  
  identifying token pairs that are common between the entity name and the DTS phrase associated with the entity name, the token pairs being tokens that are a subset of both the entity name and the DTS phrase;
  
  generating a score for the DTS based on an occurrence of the token pairs in the DTS phrase, wherein the score is an aggregate score for the DTS across a document collection and the score is generated by counting unique instances of the token pairs and assigning a numerical value to the DTS based on a count of the unique instances of the identified tokens; and
  
  storing the DTS as a synonym of the entity name on the computing device when the generated score at least reaches the threshold value.
- Dependent Claims (2, 3, 4, 5)
  - 2. The method as recited in claim 1, further comprising using the synonyms in a synonym list to retrieve documents associated with the entity name.
  - 3. The method as recited in claim 1, wherein the hit sequences are generated using a token table that contains at least one of unique tokens sets or core token sets.
  - 4. The method as recited in claim 1, further comprising utilizing a map-reduce framework to enable processing of a large collection of documents.
  - 5. The method as recited in claim 4, further comprising:
    - generating a suffix tree of the hit sequence to map unique token combinations of the hit sequence as the DTS to the entity name; and
      
      exploiting suffix links in the suffix tree to identify the DTS.
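
The combination token index and DTS map recited in claim 1 can be approximated with a short Python sketch. This is one possible reading, not the patented implementation: it enumerates ordered token combinations by brute force instead of using a suffix tree with suffix links, it does not record token positions, and it assumes a combination is "discriminating" when it occurs in exactly one entity name. The names `build_combination_index`, `dts_map`, and `max_size` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_combination_index(entity_names, max_size=2):
    """Map each ordered token combination to the set of entity names containing it."""
    index = defaultdict(set)
    for name in entity_names:
        tokens = name.lower().split()
        for size in range(1, max_size + 1):
            # combinations() preserves order, so each combo is a sub-sequence.
            for combo in combinations(tokens, size):
                index[combo].add(name)
    return index


def dts_map(hit_tokens, index, max_size=2):
    """Return {combination: entity name} for combinations that match exactly one name."""
    result = {}
    for size in range(1, max_size + 1):
        for combo in combinations(hit_tokens, size):
            owners = index.get(combo, set())
            if len(owners) == 1:  # assumption: unique owner means discriminating
                result[combo] = next(iter(owners))
    return result


names = ["canon eos 350d", "canon eos 400d"]
index = build_combination_index(names)
mapping = dts_map(["canon", "350d"], index)
```

In this example `("canon",)` is shared by both names and is skipped, while `("350d",)` and `("canon", "350d")` both map to "canon eos 350d". Brute-force enumeration grows quickly with `max_size`, which is presumably why the claims rely on a suffix tree for batch processing.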

6. A computer-readable memory storing computer-executable instructions that, when executed, cause one or more processors to perform acts comprising:
- selecting a hit sequence from a document that includes a contiguous string of tokens from an entity name, the entity name being a label assigned to an entity by a producer of the entity or an authority on the entity;
  
  generating a suffix tree of the hit sequence to map unique token combinations of the hit sequence as a discriminating token set (DTS) to the entity name;
  
  exploiting suffix links in the suffix tree to identify the DTS, the suffix link used to efficiently batch process the hit sequences;
  
  selecting the DTS from the hit sequence, the DTS including tokens from the entity name that is associated with the DTS;
  
  storing a portion of adjacent text surrounding the DTS from the document as a DTS phrase;
  
  identifying token pairs that are common between the entity name and the DTS phrase associated with the entity name, the token pairs being tokens in the adjacent text surrounding the DTS that are a subset of both the entity name and the DTS phrase;
  
  generating a score for the DTS based on an occurrence of the token pairs in the DTS phrase, wherein the score is an aggregate score for the DTS across a document collection and the score is generated by counting unique instances of the token pairs and assigning a numerical value to the DTS based on a count of the unique instances of the identified tokens; and
  
  storing the DTS as a synonym of the entity name when the generated score at least reaches the threshold value.
- Dependent Claims (7, 8, 9)
  - 7. The computer-readable memory of claim 6, further comprising using the synonyms in a synonym list to retrieve documents associated with the entity name.
  - 8. The computer-readable memory of claim 6, wherein the hit sequence is selected from a plurality of entity names in an entity name list, and wherein the hit sequences are generated using a token table that contains at least one of unique tokens sets or core token sets.
  - 9. The computer-readable memory of claim 6, further comprising utilizing a map-reduce framework to enable processing of a large collection of documents.
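
The aggregate score of claim 6 and the map-reduce framework of claim 9 can be sketched together in plain Python. The sketch simulates the two phases in memory and makes several assumptions that the claims do not settle: a "unique instance" is taken to be a distinct (document, token) pair, only context tokens outside the DTS itself are counted, and the score is the raw count. All function names and the sample data are hypothetical.

```python
from collections import defaultdict


def map_phase(doc_id, phrases, entity_tokens):
    """Emit ((entity, dts), (doc_id, token)) for entity-name tokens found near a DTS.

    phrases: list of (entity, dts, phrase_tokens) tuples found in one document.
    entity_tokens: dict mapping each entity name to its set of tokens.
    """
    for entity, dts, phrase_tokens in phrases:
        dts_tokens = set(dts.split())
        for tok in set(phrase_tokens) & entity_tokens[entity]:
            if tok not in dts_tokens:  # count adjacent text only, not the DTS
                yield (entity, dts), (doc_id, tok)


def reduce_phase(pairs):
    """Group emitted pairs by key and count unique (doc_id, token) instances."""
    grouped = defaultdict(set)
    for key, value in pairs:
        grouped[key].add(value)
    return {key: len(values) for key, values in grouped.items()}


def select_synonyms(scores, threshold):
    """Keep (entity, dts) keys whose score at least reaches the threshold."""
    return sorted(key for key, score in scores.items() if score >= threshold)


entity_tokens = {"canon eos 350d": {"canon", "eos", "350d"}}
docs = {
    1: [("canon eos 350d", "350d", ["the", "canon", "350d", "is"])],
    2: [("canon eos 350d", "350d", ["new", "eos", "350d", "canon"])],
}
emitted = [pair for doc_id, phrases in docs.items()
           for pair in map_phase(doc_id, phrases, entity_tokens)]
scores = reduce_phase(emitted)
synonyms = select_synonyms(scores, threshold=3)
```

Because the map phase depends only on one document and the reduce phase only on one (entity, DTS) key, both could be distributed across workers in a real map-reduce system.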

10. A method of selecting synonyms of an entity name, the method comprising:
- selecting a hit sequence from a document that is stored on a computing device, the hit sequence including tokens from an entity name and are generated using a token table that contains at least one of unique tokens sets or core token sets;
  
  generating a suffix tree of the hit sequence to map unique token combinations of the hit sequence as a discriminating token set (DTS) to the entity name;
  
  exploiting suffix links in the suffix tree to identify the DTS, the suffix link used to efficiently batch process the hit sequences;
  
  selecting the DTS from the hit sequence, the DTS including tokens from the entity name that is associated with the DTS;
  
  storing a portion of adjacent text surrounding the DTS from the document as a DTS phrase;
  
  identifying token pairs that are common between the entity name and the DTS phrase associated with the entity name, the token pairs being tokens that are a subset of both the entity name and the DTS phrase;
  
  generating a score for the DTS based on an occurrence of the token pairs in the DTS phrase, the score being based on a percentage of tokens in the DTS phrase that match one or more tokens from the entity name;
  
  storing the DTS as a synonym of the entity name on the computing device when the generated score at least reaches the threshold value; and
  
  utilizing a map-reduce framework to enable processing of a large collection of documents.
- Dependent Claims (11, 12, 13)
  - 11. The method as recited in claim 10, further comprising using the synonyms in a synonym list to retrieve documents associated with the entity name.
  - 12. The method as recited in claim 10, wherein the score is an aggregate score for the DTS across a document collection, and wherein the generating the score includes counting unique instances of the token pairs and assigning a numerical value to the DTS based on a count of the unique instances of the identified tokens.
  - 13. The method as recited in claim 10, wherein the hit sequence includes a contiguous string of tokens that are selected from a plurality of entity names in an entity name list.
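
Claim 10 bases the score on the percentage of DTS-phrase tokens that match tokens from the entity name. A minimal Python sketch of that calculation follows; the case-insensitive matching, the whitespace tokenization of the entity name, and the default threshold of 0.5 are assumptions made for illustration.

```python
def percentage_score(phrase_tokens, entity_name):
    """Return the fraction of DTS-phrase tokens that also occur in the entity name."""
    if not phrase_tokens:
        return 0.0
    entity_tokens = set(entity_name.lower().split())
    matches = sum(1 for tok in phrase_tokens if tok.lower() in entity_tokens)
    return matches / len(phrase_tokens)


def is_synonym(phrase_tokens, entity_name, threshold=0.5):
    """True when the score at least reaches the (hypothetical) threshold."""
    return percentage_score(phrase_tokens, entity_name) >= threshold


score = percentage_score(["canon", "350d", "is", "great"], "Canon EOS 350D")
accepted = is_synonym(["canon", "350d", "is", "great"], "Canon EOS 350D")
```

Two of the four phrase tokens match the entity name, so `score` is 0.5 and the DTS is accepted at the default threshold.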

Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Microsoft Corporation
Inventors: Chaudhuri, Surajit; Ganti, Venkatesh; Xin, Dong
Primary Examiner(s): PEACH, POLINA G
Application Number: US12/478,120
Publication Number: US 20100313258A1
Time in Patent Office: 1,559 Days
Field of Search: None
US Class Current: 707/749
CPC Class Codes:
G06F 40/247 Thesauruses; Synonyms
G06F 40/295 Named entity recognition
