Word detection
Abstract
Methods, systems, and apparatus, including computer program products, in which data from web documents are partitioned into a training corpus and a development corpus are provided. First word probabilities for words are determined for the training corpus, and second word probabilities for the words are determined for the development corpus. Uncertainty values based on the word probabilities for the training corpus and the development corpus are compared, and new words are identified based on the comparison.
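As a rough illustration of the partitioning and probability steps summarized in the abstract, the following Python sketch splits a list of documents into a training corpus and a development corpus and estimates relative word frequencies for each. The function names (`partition`, `word_probabilities`), the 80/20 split, the fixed seed, and the whitespace tokenization are assumptions made for this example and are not taken from the patent.

```python
import random
from collections import Counter

def partition(documents, train_fraction=0.8, seed=0):
    """Shuffle documents and split them into (training, development) lists.

    The split ratio and seed are illustrative assumptions.
    """
    docs = list(documents)
    random.Random(seed).shuffle(docs)
    cut = int(len(docs) * train_fraction)
    return docs[:cut], docs[cut:]

def word_probabilities(documents):
    """Relative frequency of each whitespace-delimited token in the documents."""
    counts = Counter(tok for doc in documents for tok in doc.split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}

# Toy stand-in for a collection of web documents.
docs = ["a b a", "b c", "a c c", "b b a", "c a"]
training, development = partition(docs)
first_probs = word_probabilities(training)      # "first" word probabilities
second_probs = word_probabilities(development)  # "second" word probabilities
```

A real system would likely segment text with the existing dictionary rather than split on whitespace; the sketch only shows how the two sets of probabilities relate to the two corpora.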
85 Citations
28 Claims
1. A computer-implemented method, comprising:
determining, by one or more computers, first word frequencies for existing words and a candidate word in a training corpus, the candidate word defined by a sequence of constituent words, each constituent word being an existing word in a dictionary;
determining, by the one or more computers, second word frequencies for the constituent words and the candidate word in a development corpus;
determining, by the one or more computers, a candidate word entropy-related measure based on the second word frequency of the candidate word and the first word frequencies of the constituent words and the candidate word;
determining, by the one or more computers, an existing word entropy-related measure based on the second word frequencies of the constituent words and the first word frequencies of the constituent words and the candidate word; and
determining, by the one or more computers, that the candidate word is a new word if the candidate word entropy-related measure exceeds the existing word entropy-related measure.
Dependent claims: 2, 3, 4, 5, 6, 7, 8
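The claim names which quantities feed each entropy-related measure but does not give a formula. The Python sketch below shows one plausible form, offered only as an assumption: the candidate measure weights the log-ratio of the candidate's training probability to the product of its constituents' training probabilities by the candidate's development frequency, and the existing-word measure weights, for each constituent, the log-ratio of its training probability to that probability with the candidate's share removed. The function names and the example numbers are hypothetical.

```python
import math

def candidate_measure(f2_candidate, p1_candidate, p1_constituents):
    """Assumed form: f2(candidate) * log(p1(candidate) / prod(p1(constituent)))."""
    return f2_candidate * math.log(p1_candidate / math.prod(p1_constituents))

def existing_measure(f2_constituents, p1_candidate, p1_constituents):
    """Assumed form: sum of f2(w) * log(p1(w) / (p1(w) - p1(candidate))).

    Requires p1(w) > p1(candidate) for every constituent w.
    """
    return sum(
        f2 * math.log(p1 / (p1 - p1_candidate))
        for f2, p1 in zip(f2_constituents, p1_constituents)
    )

def is_new_word(f2_candidate, f2_constituents, p1_candidate, p1_constituents):
    """Apply the claimed comparison: new word if candidate measure exceeds existing measure."""
    v1 = candidate_measure(f2_candidate, p1_candidate, p1_constituents)
    v2 = existing_measure(f2_constituents, p1_candidate, p1_constituents)
    return v1 > v2

# Hypothetical figures: constituents co-occur far more often than chance.
strong = is_new_word(0.006, [0.02, 0.01], 0.005, [0.02, 0.01])
# Hypothetical figures: constituents co-occur only slightly more often than chance.
weak = is_new_word(0.0003, [0.02, 0.01], 0.0003, [0.02, 0.01])
```

Under these assumed formulas, the first example satisfies the comparison and the second does not; other entropy-related measures that use the same inputs would fit the claim language equally well.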
9. A computer-implemented method, comprising:
determining, by one or more computers, first word probabilities for existing words and a candidate word in a first corpus, the candidate word defined by a sequence of constituent words, each constituent word being an existing word in a dictionary;
determining, by the one or more computers, second word probabilities for the constituent words and the candidate word in a second corpus;
determining, by the one or more computers, a first entropy-related value based on the second candidate word probability and the first word probabilities of the candidate word and the constituent words;
determining, by the one or more computers, a second entropy-related value based on the second constituent word probabilities and the first word probabilities of the candidate word and the constituent words; and
determining, by the one or more computers, that the candidate word is a new word if the first entropy-related value exceeds the second entropy-related value.
Dependent claims: 10, 11, 12, 13
14. A computer-implemented method, comprising:
partitioning, by the one or more computers, a collection of web documents into a training corpus and a development corpus;
training, by the one or more computers, a language model on the training corpus for first word probabilities of words in the training corpus, wherein the words in the training corpus include a candidate word defined by a sequence of two or more corresponding words in the training corpus, the two or more corresponding words existing words in a dictionary;
counting, by the one or more computers, occurrences of the candidate word and the two or more corresponding words in the development corpus;
determining, by the one or more computers, a first value based on the occurrences of the candidate word in the development corpus and the first word probabilities;
determining, by the one or more computers, a second value based on the occurrences of the two or more corresponding words in the development corpus and the first word probabilities;
comparing, by the one or more computers, the first value to the second value; and
determining, by the one or more computers, whether the candidate word is a new word based on the comparison.
Dependent claims: 15, 16, 17, 18
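The counting step in claim 14 can be pictured with a short Python sketch that counts how often a candidate word (a contiguous sequence of existing words) and each of its corresponding words occur in a tokenized development corpus. The helper `count_occurrences`, the sample sentence, and the candidate ("new", "york") are hypothetical and chosen only for illustration.

```python
def count_occurrences(tokens, sequence):
    """Count positions where `sequence` appears as a contiguous run in `tokens`."""
    seq = list(sequence)
    n = len(seq)
    return sum(1 for i in range(len(tokens) - n + 1) if tokens[i:i + n] == seq)

# Toy development corpus, already split into existing dictionary words.
development_tokens = (
    "new york is big and new york is old but new cars are new".split()
)
candidate = ("new", "york")

# Occurrences of the candidate word as a whole.
candidate_count = count_occurrences(development_tokens, candidate)
# Occurrences of each corresponding word on its own.
constituent_counts = {
    word: count_occurrences(development_tokens, (word,)) for word in candidate
}
```

Note that in this sketch each occurrence of the candidate is also counted toward its corresponding words; whether those counts should overlap is a design choice the claim text does not settle.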
19. A system, comprising:
a word processing module comprising computer instructions stored in a computer readable medium, and upon execution by a computer device configured to access and partition a word corpus into a training corpus and a development corpus, and to generate:
first word probabilities for words stored in the training corpus, the words including a candidate word comprising two or more corresponding words;
second word probabilities for the words in the development corpus;
a new word analyzer module comprising computer instructions stored in a computer readable medium, and upon execution by a computer device configured to process the first and second word probabilities and generate:
a first value based on the first word probabilities for the candidate word and the two or more corresponding words and the second word probability for the candidate word; and
a second value based on the first word probabilities for the candidate word and the two or more corresponding words and the second word probabilities for the two or more corresponding words;
and further configured to compare the first value to the second value and determine whether the candidate word is a new word based on the comparison.
Dependent claims: 20, 21, 22, 23, 24, 25
26. An apparatus comprising software stored in a computer readable medium, the software comprising computer readable instructions executable by a computer processing device and that upon such execution cause the computer processing device to:
determine first word frequencies for existing words and a candidate word in a training corpus, the candidate word defined by a sequence of constituent words, each constituent word an existing word, and each existing word a word existing in a dictionary;
determine second word frequencies for the constituent words and the candidate word in a development corpus;
determine a candidate word entropy-related measure based on the second word frequency of the candidate word and the first word frequencies of the constituent words and the candidate word;
determine an existing word entropy-related measure based on the second word frequencies of the constituent words and the first word frequencies of the constituent words and the candidate word; and
determine that the candidate word is a new word if the candidate word entropy-related measure exceeds the existing word entropy-related measure.
27. A system, comprising:
a data processing apparatus; and
a computer memory device storing instructions executable by the data processing apparatus and upon such execution cause the data processing apparatus to perform operations comprising:
determining first word probabilities for existing words and a candidate word in a first corpus, the candidate word defined by a sequence of constituent words, each constituent word being an existing word, and each existing word being a word existing in a dictionary;
determining second word probabilities for the constituent words and the candidate word in a second corpus;
determining a first entropy-related value based on the second word probability of the candidate word and the first word probabilities of the candidate word and the constituent words;
determining a second entropy-related value based on the second word probabilities of the constituent words and the first word probabilities of the candidate word and the constituent words; and
determining whether the candidate word is a new word based on a comparison between the first entropy-related value and the second entropy-related value.
28. A system, comprising:
a data processing apparatus; and
a computer memory device storing instructions executable by the data processing apparatus and upon such execution cause the data processing apparatus to perform operations comprising:
access and partition a word corpus into a training corpus and a development corpus;
generate first word probabilities for words stored in the training corpus, the words including a candidate word comprising two or more corresponding words;
generate second word probabilities for the words in the development corpus;
generate a first value based on the first word probabilities for the candidate word and the two or more corresponding words and the second word probability for the candidate word;
generate a second value based on the first word probabilities for the candidate word and the two or more corresponding words and the second word probabilities for the two or more corresponding words; and
compare the first value to the second value and determine whether the candidate word is a new word based on the comparison.
Specification