Systematic mass normalization of international titles
Abstract
A system for generating a database of labeled foreign canonical titles includes an interface and a processor. The interface is to receive a title in a second language. The processor is to 1) store a set of n-grams in a first language in a first database; 2) sanitize the title into a sanitized title in the second language; 3) translate the sanitized title into a translated title in the first language; 4) break the translated title into n-grams; 5) determine labels for the n-grams using the first database; and 6) determine a label to associate with the title.
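Steps 2 through 4 of the abstract can be illustrated with a minimal Python sketch. The abstract does not say how sanitizing is performed, which translation service is used, or whether n-grams are word-level or character-level, so everything below is an assumption: `sanitize`, `translate`, `word_ngrams`, and the toy translation table are hypothetical names, and the example title is invented.

```python
import re
import unicodedata


def sanitize(title: str) -> str:
    """Assumed sanitizing: Unicode-normalize, lowercase, drop punctuation, collapse spaces."""
    text = unicodedata.normalize("NFKC", title).lower()
    text = re.sub(r"[^\w\s]", " ", text)  # punctuation -> space (Unicode-aware)
    return re.sub(r"\s+", " ", text).strip()


# Hypothetical stand-in for a machine-translation call (second -> first language).
TOY_TRANSLATIONS = {"ingeniero de software senior": "senior software engineer"}


def translate(sanitized_title: str) -> str:
    """Look up the sanitized title; fall back to the input when no entry exists."""
    return TOY_TRANSLATIONS.get(sanitized_title, sanitized_title)


def word_ngrams(text: str) -> list[str]:
    """Return every contiguous word n-gram of the text, longest first."""
    words = text.split()
    return [
        " ".join(words[i:i + n])
        for n in range(len(words), 0, -1)
        for i in range(len(words) - n + 1)
    ]


translated = translate(sanitize("  Ingeniero de Software, SENIOR!! "))
intermediary_ngrams = word_ngrams(translated)
```

Word-level n-grams are used here only because they keep the example readable; a character-level breakdown would fit the same description.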
20 Claims
1. A system for generating a database of labeled foreign titles comprising:
an interface to receive a title in a second language; and
a processor to:
store n-grams each with associated labels in a first language in a first database, wherein the first language and the second language are different;
sanitize the title in the second language into a sanitized title in the second language;
translate the sanitized title in the second language into a translated title in the first language;
break the translated title in the first language into intermediary n-grams in the first language;
determine parent n-grams of an intermediary n-gram of the intermediary n-grams, wherein the intermediary n-gram is a sub-string of the parent n-grams;
retrieve a set of labels associated with the parent n-grams using the first database, wherein the set of labels are in the first language; and
in response to determining that a matching threshold of a label of the set of labels is met, assign the label to the intermediary n-gram, wherein the matching threshold is a frequency the label occurs in the set of labels associated with the parent n-grams, wherein assigning the label comprises storing the label in the first language in a second database, and wherein the second database stores the title in the second language and the intermediary n-gram in the first language.
Dependent claims: 2-18
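The parent n-gram lookup and frequency-based matching threshold recited in claim 1 can be sketched as follows. This is an illustration under stated assumptions, not the claimed implementation: the first database is modeled as a plain dictionary, `parent_ngrams` and `assign_labels` are hypothetical helper names, the sample entries are invented, and the threshold is interpreted as the fraction of parent labels equal to a given label (0.5 is an arbitrary choice).

```python
from collections import Counter

# Hypothetical "first database": first-language n-grams with their labels.
FIRST_DB = {
    "senior software engineer": ["engineering"],
    "software engineer": ["engineering"],
    "software sales manager": ["sales"],
    "software": [],
}


def parent_ngrams(ngram: str, db: dict[str, list[str]]) -> list[str]:
    """Stored n-grams that strictly contain `ngram` as a sub-string."""
    return [stored for stored in db if ngram in stored and stored != ngram]


def assign_labels(ngram: str, db: dict[str, list[str]], threshold: float = 0.5) -> list[str]:
    """Labels whose share of all parent labels meets the (assumed) threshold."""
    labels = [label for parent in parent_ngrams(ngram, db) for label in db[parent]]
    if not labels:
        return []  # no parents or no labels: nothing to assign
    counts = Counter(labels)
    return [label for label, count in counts.items() if count / len(labels) >= threshold]


assigned = assign_labels("software", FIRST_DB)
```

In this toy data, "software" has three parents carrying the labels engineering, engineering, and sales, so only "engineering" reaches a two-thirds share and is assigned. Plain character-level sub-string matching is used; a stricter word-boundary match would be an equally plausible reading.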

19. A method for generating a database of labeled foreign titles comprising:
receiving a title in a second language;
storing n-grams each with associated labels in a first language in a first database, wherein the first language and the second language are different;
sanitizing, using a processor, the title in the second language into a sanitized title in the second language;
translating the sanitized title in the second language into a translated title in the first language;
breaking the translated title in the first language into intermediary n-grams in the first language;
determining parent n-grams of an intermediary n-gram of the intermediary n-grams, wherein the intermediary n-gram is a sub-string of the parent n-grams;
retrieving a set of labels associated with the parent n-grams using the first database, wherein the set of labels are in the first language; and
in response to determining that a matching threshold of a label of the set of labels is met, assigning the label to the intermediary n-gram, wherein the matching threshold is a frequency the label occurs in the set of labels associated with the parent n-grams, wherein assigning the label comprises storing the label in the first language in a second database, and wherein the second database stores the title in the second language and the intermediary n-gram in the first language.
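The final storing step, in which the second database holds the second-language title alongside the first-language intermediary n-gram and its label, might look like the sketch below. The claims do not name a database engine or schema, so the in-memory SQLite store, the `labeled_titles` table, its column names, and the helper functions are all assumptions made for illustration.

```python
import sqlite3


def open_second_db() -> sqlite3.Connection:
    """Create an in-memory stand-in for the 'second database'."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE labeled_titles (
               foreign_title TEXT NOT NULL,  -- title in the second language
               ngram         TEXT NOT NULL,  -- intermediary n-gram, first language
               label         TEXT NOT NULL,  -- assigned label, first language
               PRIMARY KEY (foreign_title, ngram, label)
           )"""
    )
    return conn


def store_label(conn: sqlite3.Connection, foreign_title: str, ngram: str, label: str) -> None:
    """Record one assignment; repeated assignments are ignored."""
    conn.execute(
        "INSERT OR IGNORE INTO labeled_titles VALUES (?, ?, ?)",
        (foreign_title, ngram, label),
    )
    conn.commit()


def labels_for(conn: sqlite3.Connection, foreign_title: str) -> list[tuple[str, str]]:
    """Return (n-gram, label) pairs stored for a second-language title."""
    rows = conn.execute(
        "SELECT ngram, label FROM labeled_titles "
        "WHERE foreign_title = ? ORDER BY ngram, label",
        (foreign_title,),
    )
    return rows.fetchall()
```

Keying each row on the title, n-gram, and label together lets one foreign title carry several labeled n-grams without duplicate rows.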

20. A computer program product for generating a database of labeled foreign titles, the computer program product being embodied in a non-transitory computer readable storage medium and comprising computer instructions for:
receiving a title in a second language;
storing n-grams each with associated labels in a first language in a first database, wherein the first language and the second language are different;
sanitizing, using a processor, the title in the second language into a sanitized title in the second language;
translating the sanitized title in the second language into a translated title in the first language;
breaking the translated title in the first language into intermediary n-grams in the first language;
determining parent n-grams of an intermediary n-gram of the intermediary n-grams, wherein the intermediary n-gram is a sub-string of the parent n-grams;
retrieving a set of labels associated with the parent n-grams using the first database, wherein the set of labels are in the first language; and
in response to determining that a matching threshold of a label of the set of labels is met, assigning the label to the intermediary n-gram, wherein the matching threshold is a frequency the label occurs in the set of labels associated with the parent n-grams, wherein assigning the label comprises storing the label in the first language in a second database, and wherein the second database stores the title in the second language and the intermediary n-gram in the first language.
Specification