Named entity recognition on chat data
First Claim
1. A method comprising performing by one or more computers:
training a statistical classifier to identify named entities using training data comprising a plurality of features, wherein one of the features is a word shape feature that comprises a respective token for each letter of a respective word, the respective token indicating that each letter of the respective word is one of an upper case letter, a lower case letter, and a digit;
receiving a plurality of word strings in a first language, each received word string comprising a plurality of words;
identifying at least one named entity in each received word string using the trained statistical classifier; and
translating the received word strings from the first language to a second language, wherein translating comprises preserving the identified at least one named entity in the first language.
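The word shape feature recited in the claim can be illustrated with a minimal Python sketch. The function name `word_shape` and the token alphabet (`X` for an upper case letter, `x` for a lower case letter, `d` for a digit) are assumptions chosen for illustration; the claim only requires one token per character indicating which of the three classes it belongs to. Passing other characters through unchanged is likewise an assumption.

```python
def word_shape(word: str) -> str:
    """Return one shape token per character of `word`.

    Hypothetical token alphabet (not taken from the patent text):
      'X' = upper case letter, 'x' = lower case letter, 'd' = digit.
    Any other character (punctuation, symbols) is kept as-is, which is
    an assumption made for this sketch.
    """
    tokens = []
    for ch in word:
        if ch.isdigit():
            tokens.append("d")
        elif ch.isupper():
            tokens.append("X")
        elif ch.islower():
            tokens.append("x")
        else:
            tokens.append(ch)
    return "".join(tokens)


# Example shapes that a classifier could consume as one feature among several.
shape_mixed = word_shape("iPhone5")
shape_caps = word_shape("NYC")
```

In this sketch the shape string would be supplied to the statistical classifier alongside the other features of the training data; how the features are combined is not specified here.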
Abstract
Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for receiving a plurality of word strings in a first language, each received word string comprising a plurality of words, identifying one or more named entities in each received word string using a statistical classifier that was trained using training data comprising a plurality of features, wherein one of the features is a word shape feature that comprises a respective token for each letter of a respective word, wherein each token signifies a case of the letter or whether the letter is a digit, and translating the received word strings from the first language to a second language, including preserving the respective identified named entities in each received word string during translation.
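The preservation step can be sketched as follows, assuming the named entity positions have already been identified by the classifier. Everything here is hypothetical and for illustration only: the function `translate_preserving_entities`, the word-by-word toy translator `toy_translate`, and the tiny English-to-Spanish table stand in for a real translation system, and this is not presented as the method used in the patent.

```python
from typing import Callable, List, Set


def translate_preserving_entities(
    words: List[str],
    entity_positions: Set[int],
    translate_word: Callable[[str], str],
) -> List[str]:
    """Translate each word, leaving words at entity positions untouched.

    `entity_positions` holds the indices of words identified as named
    entities (assumed to come from the trained classifier).
    """
    return [
        word if index in entity_positions else translate_word(word)
        for index, word in enumerate(words)
    ]


# Toy stand-in for a translator (hypothetical data, English to Spanish).
TOY_EN_ES = {"hello": "hola", "my": "mi", "friend": "amigo", "rose": "rosa"}


def toy_translate(word: str) -> str:
    """Look the word up in the toy table; unknown words pass through."""
    return TOY_EN_ES.get(word.lower(), word)


chat_line = ["hello", "my", "friend", "Rose"]

# Without entity information the name is translated as a common noun.
naive = translate_preserving_entities(chat_line, set(), toy_translate)

# With position 3 marked as a named entity, "Rose" stays in the first language.
preserved = translate_preserving_entities(chat_line, {3}, toy_translate)
```

The comparison between `naive` and `preserved` shows why identification precedes translation: a name that collides with a dictionary word would otherwise be rendered as that word.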
30 Claims
1. A method comprising performing by one or more computers:
training a statistical classifier to identify named entities using training data comprising a plurality of features, wherein one of the features is a word shape feature that comprises a respective token for each letter of a respective word, the respective token indicating that each letter of the respective word is one of an upper case letter, a lower case letter, and a digit;
receiving a plurality of word strings in a first language, each received word string comprising a plurality of words;
identifying at least one named entity in each received word string using the trained statistical classifier; and
translating the received word strings from the first language to a second language, wherein translating comprises preserving the identified at least one named entity in the first language.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 22
12. A system comprising one or more computers programmed to perform operations comprising:
training a statistical classifier to identify named entities using training data comprising a plurality of features, wherein one of the features is a word shape feature that comprises a respective token for each letter of a respective word, the respective token indicating that each letter of the respective word is one of an upper case letter, a lower case letter, and a digit;
receiving a plurality of word strings in a first language, each received word string comprising a plurality of words;
identifying at least one named entity in each received word string using the trained statistical classifier; and
translating the received word strings from the first language to a second language, wherein translating comprises preserving the identified at least one named entity in the first language.
Dependent claims: 13, 14, 15, 16, 17, 18, 19, 20, 21
23. A storage device having instructions stored thereon that when executed by one or more computers perform operations comprising:
training a statistical classifier to identify named entities using training data comprising a plurality of features, wherein one of the features is a word shape feature that comprises a respective token for each letter of a respective word, the respective token indicating that each letter of the respective word is one of an upper case letter, a lower case letter, and a digit;
receiving a plurality of word strings in a first language, each received word string comprising a plurality of words;
identifying at least one named entity in each received word string using the trained statistical classifier; and
translating the received word strings from the first language to a second language, wherein translating comprises preserving the identified at least one named entity in the first language.
Dependent claims: 24, 25, 26, 27, 28, 29, 30
Specification