Named entity transliteration using comparable corpora

US 8,560,298 B2
Filed: 10/21/2008
Issued: 10/15/2013
Est. Priority Date: 10/21/2008
Status: Active Grant

Abstract

A document in a first language and an additional document in a second language may be reviewed. It may be determined if the additional document is sufficiently similar to the document. If the additional document is determined sufficiently similar to the document, a named entity in the document may be selected. The method may search for a similar named entity by comparing the named entity to a word in the additional document and determining if the named entity and word are sufficiently similar. If a similar word to the named entity is located, the named entity and the similar word may be stored as named entity transliterations.
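
The control flow described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: `mine_pairs`, the `docs_similar` flag, the `word_similarity` callback, and `min_word_score` are hypothetical names, and `toy_similarity` uses `difflib.SequenceMatcher` on same-script strings purely as a stand-in for a real cross-language comparison.

```python
from difflib import SequenceMatcher
from typing import Callable, List, Tuple


def mine_pairs(
    source_entities: List[str],
    target_words: List[str],
    docs_similar: bool,
    word_similarity: Callable[[str, str], float],
    min_word_score: float,
) -> List[Tuple[str, str]]:
    """Pair each source named entity with its best-matching target word."""
    # Only mine document pairs that were already judged sufficiently similar.
    if not docs_similar or not target_words:
        return []
    pairs = []
    for entity in source_entities:
        # Pick the target word that scores highest against this entity.
        best = max(target_words, key=lambda w: word_similarity(entity, w))
        # Store the pair only if the best candidate is similar enough.
        if word_similarity(entity, best) >= min_word_score:
            pairs.append((entity, best))
    return pairs


def toy_similarity(a: str, b: str) -> float:
    # Stand-in only (assumption): a real system would compare across scripts.
    return SequenceMatcher(None, a, b).ratio()
```

The document-level decision and the word-level score are passed in as parameters here because the abstract leaves both open; any scoring function with the same shape could be substituted.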

20 Claims

1. A method of mining multilingual named entity transliteration comprising:
- obtaining a document in a first language;
  
  obtaining a plurality of additional documents, each additional document being in a second language that is different than the first language;
  
  calculating a first probability distribution of the document based on words in the document in the first language;
  
  for each additional document of the plurality of additional documents, calculating a second probability distribution of the additional document based on words in the additional document in the second language; and
  
  calculating a cross language similarity score based on the first probability distribution of the document in the first language and the second probability distribution of the additional document in the second language;
  
  selecting at least one of the additional documents based on a comparison of the cross language similarity scores calculated for the plurality of additional documents;
  
  selecting a named entity in the document;
  
  searching the selected additional document to identify a word in the selected additional document as a corresponding named entity by comparing the named entity to one or more words in the selected additional document; and
  
  storing the named entity and the identified word as named entity transliterations.
  - 2. The method of claim 1, wherein selecting the additional document comprises selecting the document pair with the highest cross language similarity score.
  - 3. The method of claim 1, wherein searching for a similar named entity comprises calculating a cross language similarity score for the word and the named entity.
  - 4. The method of claim 3, wherein the cross language similarity score for the word and the named entity measures the degree of transliteration equivalence between the named entity and the word.
  - 5. The method of claim 4, wherein the cross language similarity score is calculated for a plurality of named entity pairs wherein named entity pairs comprise the named entity and the word in the additional document.
  - 6. The method of claim 5, wherein searching the selected additional document to identify the word comprises:
    - generating a group of words from the selected additional document by removing prepositions, verbs and adjectives from the selected additional document; and
      
      sequentially selecting words from the group of words and comparing features of each word to the named entity.
  - 7. The method of claim 6, wherein the named entity pair that has the cross language score at a maximum is selected as transliterations of each other.
  - 8. The method of claim 1, wherein calculating the first probability distribution comprises:
    - determining a probability of a word in the first language being in the document.
  - 9. The method of claim 8, wherein calculating the second probability distribution comprises:
    - determining a probability of a word in the second language being in the additional document.
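
The per-document word probabilities of claims 8 and 9 and the highest-score selection of claim 2 might look like the sketch below. It assumes whitespace tokenization and maximum-likelihood relative frequencies, neither of which the claims specify. `word_distribution`, `select_best_document`, and the caller-supplied `score` function are hypothetical names; any mapping between the two languages' vocabularies would have to live inside `score`.

```python
from collections import Counter
from typing import Callable, Dict, List

Distribution = Dict[str, float]


def word_distribution(text: str) -> Distribution:
    """Relative frequency of each whitespace-separated token in a document."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()} if total else {}


def select_best_document(
    source_text: str,
    candidate_texts: List[str],
    score: Callable[[Distribution, Distribution], float],
) -> int:
    """Index of the candidate whose score against the source is highest."""
    source_dist = word_distribution(source_text)
    # Score every candidate document against the source distribution.
    scores = [score(source_dist, word_distribution(t)) for t in candidate_texts]
    # Raises ValueError if candidate_texts is empty.
    return max(range(len(scores)), key=scores.__getitem__)
```

Returning an index rather than the text keeps the selection step independent of how documents are stored.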

10. A computer readable hardware storage medium storing computer executable instructions, which, when executed using a computer, perform a method of mining multilingual named entity transliteration, the method comprising:
- reviewing a document in a first language;
  
  reviewing an additional document in a second language that is different than the first language;
  
  calculating a cross language similarity score between the document and the additional document;
  
  comparing the cross language similarity score to a threshold;
  
  selecting a named entity in the document;
  
  searching for a sufficiently similar named entity in the additional document, comprising:
  
  obtaining a group of words from the additional document by scanning the additional document to identify words of a given type, wherein the words of the given type are omitted from the group of words;
  
  for each word in the group of words, calculating transliteration equivalence between the named entity and the word based on a feature vector for the named entity and the word in the additional document; and
  
  selecting a word from the group of words based on the calculated transliteration equivalence; and
  
  storing the named entity and the selected word as named entity transliterations.
  - 11. The computer storage medium of claim 10, wherein the cross language similarity score is calculated using a Kullback-Leibler divergence.
  - 12. The computer storage medium of claim 11, wherein the cross language similarity score is calculated for a plurality of document and additional document pairs and selecting the document pair with the highest similarity score.
  - 13. The computer storage medium of claim 10, wherein a cross language similarity score is calculated for a plurality of named entity pairs wherein named entity pairs comprise the named entity and the word in the additional document.
  - 14. The computer storage medium of claim 10, wherein the named entity pair that has the cross language score at a maximum is selected as transliterations of each other.
  - 15. The computer storage medium of claim 10, wherein searching for a sufficiently similar named entity in the additional document does not require a named entity recognizer for the second language.
  - 16. The computer storage medium of claim 10, wherein the given type comprises at least one of prepositions, verbs, and adjectives.
  - 17. The computer storage medium of claim 16, wherein obtaining the group of words comprises omitting all prepositions, verbs and adjectives from the additional document.
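
The word filtering and feature-based comparison recited in claims 10, 16, and 17 can be illustrated with the sketch below. Several things here are assumptions rather than anything the claims state: the input is already tagged with word types, the tag names in `EXCLUDED_TAGS` are hypothetical, candidate words are already romanized, and cosine similarity over character-bigram counts is swapped in as one possible feature-vector comparison.

```python
import math
from collections import Counter
from typing import List, Tuple

EXCLUDED_TAGS = {"PREP", "VERB", "ADJ"}  # hypothetical tag names


def candidate_words(tagged: List[Tuple[str, str]]) -> List[str]:
    """Keep only words whose tag is not one of the excluded types."""
    return [word for word, tag in tagged if tag not in EXCLUDED_TAGS]


def bigram_vector(word: str) -> Counter:
    """Character-bigram counts, with ^ and $ marking the word boundaries."""
    padded = f"^{word.lower()}$"
    return Counter(padded[i:i + 2] for i in range(len(padded) - 1))


def equivalence(entity: str, word: str) -> float:
    """Cosine similarity of bigram vectors (illustrative stand-in)."""
    a, b = bigram_vector(entity), bigram_vector(word)
    dot = sum(a[k] * b[k] for k in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def best_match(entity: str, tagged: List[Tuple[str, str]]) -> str:
    """Candidate word with the highest equivalence to the entity."""
    # Raises ValueError if every word was filtered out.
    return max(candidate_words(tagged), key=lambda w: equivalence(entity, w))
```

Because candidates are simply the words left after filtering, this sketch needs no named entity recognizer on the second-language side, in line with claim 15.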

18. A computer system comprising:
- a processor;
  
  one or more computer storage media storing executable instructions, which, when executed by the processor, configure the computer system to:
  
  review a document in a first language;
  
  review an additional document in a second language;
  
  calculate a probability distribution of the document based on words in the document in the first language;
  
  calculate a probability distribution of the additional document based on words in the additional document in the second language;
  
  determine if the additional document is sufficiently similar to the document by calculating a cross language similarity score using a Kullback-Leibler divergence between the probability distributions of the document and the additional document, and comparing the cross language similarity score to a threshold;
  
  if the additional document is determined to be sufficiently similar to the document:
  
  select a named entity in the document;
  
  search for a sufficiently similar named entity comprising comparing the named entity to a word in the additional document;
  
  if a sufficiently similar word to the named entity is located, store the named entity and the similar word as named entity transliterations.
  - 19. The computer system of claim 18, wherein the cross language similarity score is calculated for a plurality of document and additional document pairs, the computing system being further configured to:
    - select the document and additional document pair with the highest similarity score.
  - 20. The computer system of claim 18, wherein the computing system is configured to search for a sufficiently similar named entity by calculating a cross language similarity score for the word and the named entity wherein:
    - the cross language similarity score measures the degree of transliteration equivalence between the named entity and the word;
      
      the word in the additional document is sequentially selected from a group of words in the additional document wherein the group does not include prepositions, verbs or adjectives in the additional document;
      
      wherein the cross language similarity score is calculated for a plurality of named entity pairs wherein named entity pairs comprise the named entity and the word in the additional document.
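
A Kullback-Leibler comparison with a threshold, as recited in claim 18, could be sketched as below. The assumptions are significant: both distributions are taken to be already expressed over a shared vocabulary (for instance after mapping one language's words through a bilingual dictionary, which the claim does not describe), additive smoothing with `eps` is used to avoid division by zero, the score is defined as the negated divergence so that higher means more similar, and the default `threshold` is an arbitrary illustrative value.

```python
import math
from typing import Dict


def kl_divergence(p: Dict[str, float], q: Dict[str, float], eps: float = 1e-6) -> float:
    """KL(p || q) over the union vocabulary, with additive smoothing."""
    vocab = set(p) | set(q)
    # Renormalize after adding eps to every vocabulary entry.
    p_total = sum(p.values()) + eps * len(vocab)
    q_total = sum(q.values()) + eps * len(vocab)
    divergence = 0.0
    for word in vocab:
        p_w = (p.get(word, 0.0) + eps) / p_total
        q_w = (q.get(word, 0.0) + eps) / q_total
        divergence += p_w * math.log(p_w / q_w)
    return divergence


def similarity_score(p: Dict[str, float], q: Dict[str, float]) -> float:
    """Negated divergence, so identical distributions score 0 (the maximum)."""
    return -kl_divergence(p, q)


def sufficiently_similar(
    p: Dict[str, float], q: Dict[str, float], threshold: float = -1.0
) -> bool:
    """True when the score meets the (hypothetical) threshold."""
    return similarity_score(p, q) >= threshold
```

Note that KL divergence is asymmetric, so `kl_divergence(p, q)` and `kl_divergence(q, p)` generally differ; which direction a real system uses is a design choice this sketch does not settle.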

Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Microsoft Corporation
Inventors: Kumaran, Arumugam; U, Raghavendra Udupa; Krishnan, Saravanan
Primary Examiner(s): Godbold, Douglas
Application Number: US12/255,372
Publication Number: US 20100106484A1
Time in Patent Office: 1,820 Days
Field of Search: 704/2-8
US Class Current: 704/4
CPC Class Codes:
G06F 40/129   Handling non-Latin characte...
G06F 40/295   Named entity recognition
G06F 40/45   Example-based machine trans...
