Identifying language of origin for words using estimates of normalized appearance frequency
Abstract
The language of origin of a word or named entity is predicted using estimates of the frequency of occurrence of the word or named entity in different languages. In one embodiment, the normalized frequency of occurrence of the word or named entity in a variety of different languages is estimated, and the values are used as features in a feature vector, which is scored and used to identify the language of origin.
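The feature-vector step described in the abstract can be sketched as follows. This is a minimal illustration, not the patented scoring method: the abstract does not say how the vector is scored, so a linear scorer with one weight vector per candidate language is assumed here. The names `LANGUAGES`, `WEIGHTS`, `feature_vector`, `score`, and `predict_origin`, and all numeric values, are hypothetical.

```python
from typing import Dict, List, Sequence

# Fixed feature order (illustrative language codes, not from the patent).
LANGUAGES: List[str] = ["en", "de", "ja"]


def feature_vector(frequencies: Dict[str, float]) -> List[float]:
    """Arrange per-language normalized frequencies in a fixed order."""
    return [frequencies.get(language, 0.0) for language in LANGUAGES]


def score(features: Sequence[float], weights: Sequence[float]) -> float:
    """Dot product of the feature vector with one language's weights."""
    return sum(f * w for f, w in zip(features, weights))


def predict_origin(frequencies: Dict[str, float],
                   weights_by_language: Dict[str, Sequence[float]]) -> str:
    """Return the candidate language whose weights give the highest score."""
    features = feature_vector(frequencies)
    return max(weights_by_language,
               key=lambda lang: score(features, weights_by_language[lang]))


# Hypothetical weights: each language rewards its own frequency feature
# and penalizes the others. A real system would learn these from data.
WEIGHTS = {
    "en": [1.0, -0.5, -0.5],
    "de": [-0.5, 1.0, -0.5],
    "ja": [-0.5, -0.5, 1.0],
}

# Made-up normalized frequencies for a single input word.
example_frequencies = {"en": 0.012, "de": 0.04, "ja": 0.001}
predicted = predict_origin(example_frequencies, WEIGHTS)
```

With these made-up values the word is assigned to `"de"`, since that is where its normalized frequency is highest relative to the other languages.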
11 Claims
1. A method of identifying a language of origin of an input word, using a computer with a processor, comprising:
generating a wide area network query based on the input word to obtain, with the processor, search results, comprising web pages, in a plurality of different languages;
estimating, with the processor, a normalized frequency of occurrence of the input word in each of the different languages based on the search results;
identifying, with the processor, the language of origin of the input word based on the estimated frequencies of occurrence, and outputting an indication of the language of origin;
wherein the search results comprise web pages and wherein estimating a normalized frequency of occurrence in a selected language comprises:
obtaining a count of a number of web pages in the selected language in the search results that contain the input word; and
estimating a total number of web pages in the selected language by generating a wide area network query based on one or more function words in the selected language to obtain function word search results, and estimating the total number of web pages based on the function word search result.
Dependent claims: 2, 3, 4, 5, 6, 7
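The estimation steps recited in claim 1 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the claimed implementation: `hit_count` is a hypothetical callable standing in for the wide area network query, the counts in `FAKE_COUNTS` are invented, and taking the largest function-word hit count as the total page estimate is one plausible reading of the claim rather than something the claim specifies.

```python
from typing import Callable, Dict, List

# Hypothetical interface: (query, language) -> number of matching web pages
# in that language. A real system would call a search engine here.
HitCounter = Callable[[str, str], int]


def estimate_total_pages(function_words: List[str], language: str,
                         hit_count: HitCounter) -> int:
    """Estimate how many web pages exist in a language.

    Assumption: almost every page in a language contains its most common
    function words, so the largest function-word hit count is used.
    """
    return max(hit_count(word, language) for word in function_words)


def normalized_frequencies(word: str,
                           function_words: Dict[str, List[str]],
                           hit_count: HitCounter) -> Dict[str, float]:
    """Pages containing the word divided by estimated total pages, per language."""
    frequencies = {}
    for language, words in function_words.items():
        total = estimate_total_pages(words, language, hit_count)
        pages_with_word = hit_count(word, language)
        frequencies[language] = pages_with_word / total if total else 0.0
    return frequencies


def identify_language(word: str,
                      function_words: Dict[str, List[str]],
                      hit_count: HitCounter) -> str:
    """Pick the language with the highest normalized frequency."""
    frequencies = normalized_frequencies(word, function_words, hit_count)
    return max(frequencies, key=frequencies.get)


# Invented page counts standing in for live search results.
FAKE_COUNTS = {
    ("the", "en"): 1_000_000, ("of", "en"): 900_000,
    ("der", "de"): 200_000, ("und", "de"): 180_000,
    ("schmidt", "en"): 12_000, ("schmidt", "de"): 8_000,
}


def fake_hit_count(query: str, language: str) -> int:
    return FAKE_COUNTS.get((query, language), 0)


FUNCTION_WORDS = {"en": ["the", "of"], "de": ["der", "und"]}
frequencies = normalized_frequencies("schmidt", FUNCTION_WORDS, fake_hit_count)
origin = identify_language("schmidt", FUNCTION_WORDS, fake_hit_count)
```

In this invented data the word appears on more English pages than German ones in absolute terms, yet its normalized frequency is higher for German because the estimated German page total is much smaller. That is the reason for dividing by a per-language total rather than comparing raw counts.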
8. A system for identifying a language of origin of an input word, comprising:
a feature extraction system comprising a frequency of occurrence estimation system estimating a frequency of occurrence of the input word in each of a plurality of different languages;
a language identifier identifying the language of origin of the input word based on the frequency of occurrence estimated;
a search engine coupled to the feature extraction system, the feature extraction system generating a wide area network query based on one or more function words in a selected language to obtain function word search results and extracting features from the function word search results, the features being indicative of the frequency of occurrence of the input word;
the feature extraction system extracting normalized frequency of occurrence features based on a number of pages in the function word search results, in the selected language, that contain the one or more function words and an estimate of a total number of pages in the language; and
a computer processor, activated by the frequency of occurrence estimation system, to facilitate estimating the frequency of occurrence.
Dependent claims: 9, 10, 11
Specification