Named entity transliteration using comparable corpora
Abstract
A document in a first language and an additional document in a second language may be reviewed. It may be determined whether the additional document is sufficiently similar to the document. If it is, a named entity in the document may be selected. The method may search for a similar named entity by comparing the named entity to a word in the additional document and determining whether the named entity and the word are sufficiently similar. If a word similar to the named entity is located, the named entity and the similar word may be stored as named entity transliterations.
20 Claims
1. A method of mining multilingual named entity transliteration comprising:
obtaining a document in a first language;
obtaining a plurality of additional documents, each additional document being in a second language that is different than the first language;
calculating a first probability distribution of the document based on words in the document in the first language;
for each additional document of the plurality of additional documents, calculating a second probability distribution of the additional document based on words in the additional document in the second language; and
calculating a cross language similarity score based on the first probability distribution of the document in the first language and the second probability distribution of the additional document in the second language;
selecting at least one of the additional documents based on a comparison of the cross language similarity scores calculated for the plurality of additional documents;
selecting a named entity in the document;
searching the selected additional document to identify a word in the selected additional document as a corresponding named entity by comparing the named entity to one or more words in the selected additional document; and
storing the named entity and the identified word as named entity transliterations.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9
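The document-selection steps of claim 1 can be sketched roughly as follows. This is an illustration only, not the claimed implementation: the claim does not say how the two languages are bridged or which score is used, so the bilingual `dictionary`, the whitespace tokenizer, and the histogram-overlap `similarity` function are all assumptions introduced here.

```python
from collections import Counter


def word_distribution(text):
    """Relative frequency of each lowercased, whitespace-separated token."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def project(dist, dictionary):
    """Map a second-language distribution onto first-language words.

    `dictionary` is a hypothetical second-to-first-language word map;
    words it does not cover are simply dropped.
    """
    projected = Counter()
    for word, prob in dist.items():
        if word in dictionary:
            projected[dictionary[word]] += prob
    return dict(projected)


def similarity(p, q):
    """Histogram overlap in [0, 1]; a stand-in for the claimed score."""
    return sum(min(prob, q.get(word, 0.0)) for word, prob in p.items())


def select_most_similar(document, additional_documents, dictionary):
    """Score every additional document and return (best index, all scores)."""
    p = word_distribution(document)
    scores = [
        similarity(p, project(word_distribution(doc), dictionary))
        for doc in additional_documents
    ]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


# Toy data (assumed): an English document and two Spanish candidates.
toy_dictionary = {"el": "the", "la": "the", "presidente": "president",
                  "visita": "visits", "ciudad": "city"}
english = "the president visits the city"
spanish = ["el presidente visita la ciudad", "el mercado cierra"]
best_index, all_scores = select_most_similar(english, spanish, toy_dictionary)
print(best_index, all_scores)
```

With the toy data, the first Spanish document is selected because its projected word distribution overlaps the English one almost completely, while the second shares only the article.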
10. A computer readable hardware storage medium storing computer executable instructions, which, when executed using a computer, perform a method of mining multilingual named entity transliteration, the method comprising:
reviewing a document in a first language;
reviewing an additional document in a second language that is different than the first language;
calculating a cross language similarity score between the document and the additional document;
comparing the cross language similarity score to a threshold;
selecting a named entity in the document;
searching for a sufficiently similar named entity in the additional document, comprising:
obtaining a group of words from the additional document by scanning the additional document to identify words of a given type, wherein the words of the given type are omitted from the group of words;
for each word in the group of words, calculating transliteration equivalence between the named entity and the word based on a feature vector for the named entity and the word in the additional document; and
selecting a word from the group of words based on the calculated transliteration equivalence; and
storing the named entity and the selected word as named entity transliterations.
Dependent claims: 11, 12, 13, 14, 15, 16, 17
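The candidate search in claim 10 might look something like the sketch below. The claim does not specify the features, the scoring model, or the word type to omit, so everything concrete here is assumed for illustration: both strings are taken to be already romanized, `STOP_WORDS` stands in for the "given type", the three features are simple string measures, and `WEIGHTS` are hand-picked rather than learned.

```python
from difflib import SequenceMatcher

# Hypothetical "given type" of word to omit from the candidate group.
STOP_WORDS = {"the", "of", "and", "in"}

# Hand-picked weights (assumed); a real system would likely learn these.
WEIGHTS = [0.2, 0.4, 0.4]


def candidate_words(text, omit=STOP_WORDS):
    """Scan the text and keep every word that is not of the omitted type."""
    return [w for w in text.lower().split() if w not in omit]


def bigrams(s):
    """Set of adjacent character pairs in s."""
    return {s[i:i + 2] for i in range(len(s) - 1)}


def feature_vector(entity, word):
    """Three illustrative features for an (entity, word) pair."""
    e, w = entity.lower(), word.lower()
    longest = max(len(e), len(w))
    length_ratio = min(len(e), len(w)) / longest if longest else 0.0
    be, bw = bigrams(e), bigrams(w)
    union = be | bw
    bigram_jaccard = len(be & bw) / len(union) if union else 0.0
    sequence_ratio = SequenceMatcher(None, e, w).ratio()
    return [length_ratio, bigram_jaccard, sequence_ratio]


def equivalence(entity, word):
    """Weighted sum of the features; higher means more alike."""
    return sum(wt * f for wt, f in zip(WEIGHTS, feature_vector(entity, word)))


def best_transliteration(entity, text):
    """Return the highest-scoring candidate word, or None if there is none."""
    words = candidate_words(text)
    if not words:
        return None
    return max(words, key=lambda w: equivalence(entity, w))


print(best_transliteration("Moscow", "the mayor of moskva spoke in public"))
```

In practice the selected word's score would also be compared against a threshold before the pair is stored, so that a document with no true counterpart does not yield a spurious match.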
18. A computer system comprising:
a processor;
one or more computer storage media storing executable instructions, which, when executed by the processor, configure the computer system to:
review a document in a first language;
review an additional document in a second language;
calculate a probability distribution of the document based on words in the document in the first language;
calculate a probability distribution of the additional document based on words in the additional document in the second language;
determine if the additional document is sufficiently similar to the document by calculating a cross language similarity score using a Kullback-Leibler divergence between the probability distributions of the document and the additional document, and comparing the cross language similarity score to a threshold;
if the additional document is determined to be sufficiently similar to the document:
select a named entity in the document;
search for a sufficiently similar named entity comprising comparing the named entity to a word in the additional document;
if a sufficiently similar word to the named entity is located, store the named entity and the similar word as named entity transliterations.
Dependent claims: 19, 20
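The Kullback-Leibler test in claim 18 can be sketched as below, under several assumptions that the claim itself does not state: both distributions are taken to be already expressed over a shared vocabulary, a small `epsilon` is added so that words missing from one document do not produce a logarithm of zero, and "sufficiently similar" is read as the divergence falling at or below a hypothetical `threshold`.

```python
import math


def kl_divergence(p, q, epsilon=1e-9):
    """D(p || q) over the union vocabulary, with additive smoothing.

    p and q map words to probabilities. Both are smoothed by `epsilon`
    and renormalized so every word has non-zero mass on each side.
    """
    vocab = set(p) | set(q)
    ps = {w: p.get(w, 0.0) + epsilon for w in vocab}
    qs = {w: q.get(w, 0.0) + epsilon for w in vocab}
    zp, zq = sum(ps.values()), sum(qs.values())
    return sum(
        (ps[w] / zp) * math.log((ps[w] / zp) / (qs[w] / zq))
        for w in vocab
    )


def sufficiently_similar(p, q, threshold=0.5):
    """Assumed reading: lower divergence means more similar documents."""
    return kl_divergence(p, q) <= threshold


# Toy distributions (assumed) over a shared vocabulary.
doc = {"election": 0.5, "president": 0.5}
close = {"election": 0.6, "president": 0.4}
unrelated = {"football": 1.0}

print(sufficiently_similar(doc, close))
print(sufficiently_similar(doc, unrelated))
```

Because the divergence is not symmetric, the order of the arguments matters; a symmetric variant could be used instead if the comparison should not depend on which document is treated as the reference.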
Specification