Word Detection

US 20090055168A1
Filed: 08/23/2007
Published: 02/26/2009
Est. Priority Date: 08/23/2007
Status: Active Grant

Abstract

Methods, systems, and apparatus, including computer program products, are provided in which data from web documents are partitioned into a training corpus and a development corpus. First word probabilities for words are determined for the training corpus, and second word probabilities for the words are determined for the development corpus. Uncertainty values based on the word probabilities for the training corpus and the development corpus are compared, and new words are identified based on the comparison.
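
As a rough illustration of the partitioning step the abstract describes, the sketch below splits a list of tokenized documents into a training corpus and a development corpus and estimates word probabilities by relative frequency. The function names (`partition_corpus`, `word_probabilities`), the 80/20 split, and the use of plain relative frequencies in place of a trained language model are assumptions made for illustration; the abstract specifies none of them.

```python
import random
from collections import Counter

def partition_corpus(documents, train_fraction=0.8, seed=0):
    """Shuffle documents and split them into (training, development) lists.

    The split ratio and the fixed seed are illustrative assumptions.
    """
    docs = list(documents)
    random.Random(seed).shuffle(docs)
    cut = int(len(docs) * train_fraction)
    return docs[:cut], docs[cut:]

def word_probabilities(corpus):
    """Relative-frequency estimate of p(word) over a list of token lists."""
    counts = Counter(token for doc in corpus for token in doc)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: n / total for word, n in counts.items()}
```

Each document here is assumed to be already segmented into a list of dictionary words; segmentation itself is outside the scope of the sketch.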

28 Claims

1. A computer-implemented method, comprising:
- determining first word frequencies for existing words and a candidate word in a training corpus, the candidate word defined by a sequence of constituent words, each constituent word being an existing word in a dictionary;
  
  determining second word frequencies for the constituent words and the candidate word in a development corpus;
  
  determining a candidate word entropy-related measure based on the second word frequency of the candidate word and the first word frequencies of the constituent words and the candidate word;
  
  determining an existing word entropy-related measure based on the second word frequencies of the constituent words and the first word frequencies of the constituent words and the candidate word; and
  
  determining that the candidate word is a new word if the candidate word entropy-related measure exceeds the existing word entropy-related measure.
  - 2. The method of claim 1, wherein the training corpus and the development corpus comprise web documents.
  - 3. The method of claim 1, further comprising adding the candidate word to a dictionary of existing words if the candidate word is determined to be a new word.
  - 4. The method of claim 1, wherein:
    - determining first word frequencies comprises training a language model for probabilities of the existing words and the candidate words in the training corpus; and
      
      wherein determining second word frequencies comprises determining a word count value for each of the constituent words and the candidate word in the development corpus.
  - 5. The method of claim 4, wherein:
    - determining a candidate word entropy-related measure comprises:
      
      determining a first logarithmic value based on the probabilities of the candidate word and the constituent words; and
      
      determining the candidate word entropy-related measure based on the word count value of the candidate word and the first logarithmic value; and
      
      determining an existing word entropy-related measure comprises:
      
      determining second logarithmic values based on the probabilities of the candidate word and the constituent words; and
      
      determining the existing word entropy-related measure based on the word counts of the constituent words and the second logarithmic values.
  - 6. The method of claim 1, wherein the words each comprise one or more Hanzi characters.
  - 7. The method of claim 1, wherein the words each comprise one or more logographic characters.
  - 8. The method of claim 1, further comprising:
    - updating the dictionary with the candidate word if the candidate word is determined to be a new word.
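
The comparison in claims 1, 4, and 5 can be sketched as follows. The claims say only that each measure combines development-corpus word counts with logarithmic values derived from training-corpus probabilities; the specific log ratios used below are one plausible instantiation, assumed for illustration, and `is_new_word` and its parameter names are hypothetical.

```python
import math

def is_new_word(p_train, dev_counts, constituents, candidate):
    """Return True if the candidate measure exceeds the existing-word measure.

    p_train    : training-corpus probabilities (first word frequencies)
    dev_counts : development-corpus counts (second word frequencies)
    The exact log ratios are assumptions; the claims do not fix them.
    """
    p_cand = p_train[candidate]
    p_parts = [p_train[w] for w in constituents]

    # First logarithmic value (assumed form): candidate probability relative
    # to the constituents co-occurring independently.
    first_log = math.log(p_cand / math.prod(p_parts))
    candidate_measure = dev_counts.get(candidate, 0) * first_log

    # Second logarithmic values (assumed form), one per constituent:
    # constituent probability relative to its probability outside the candidate.
    existing_measure = 0.0
    for w, p_w in zip(constituents, p_parts):
        if p_w <= p_cand:
            raise ValueError(f"p({w!r}) must exceed p(candidate) in this sketch")
        second_log = math.log(p_w / (p_w - p_cand))
        existing_measure += dev_counts.get(w, 0) * second_log

    return candidate_measure > existing_measure
```

A candidate that passes this test would then be added to the dictionary, as claims 3 and 8 recite.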

9. A computer-implemented method, comprising:
- determining first word probabilities for existing words and a candidate word in a first corpus, the candidate word defined by a sequence of constituent words, each constituent word being an existing word in a dictionary;
  
  determining second word probabilities for the constituent words and the candidate word in a second corpus;
  
  determining a first entropy-related value based on the second candidate word probability and the first word probabilities of the candidate word and the constituent words;
  
  determining a second entropy-related value based on the second constituent word probabilities and the first word probabilities of the candidate word and the constituent words; and
  
  determining that the candidate word is a new word if the first entropy-related value exceeds the second entropy-related value.
  - 10. The method of claim 9, wherein identifying a word corpus comprises identifying web documents.
  - 11. The method of claim 9, wherein:
    - determining first word probabilities comprises training a language model on the first corpus for word probabilities of the existing words and the candidate word in the first corpus; and
      
      determining second word probabilities comprises determining word count values for each of the constituent words and candidate word.
  - 12. The method of claim 11, wherein:
    - determining a first entropy-related value comprises:
      
      determining a first logarithmic value based on the first word probabilities of the candidate word and the constituent words; and
      
      determining the first entropy-related value based on the word count value of the candidate word and the first logarithmic value; and
      
      determining the second entropy-related value comprises:
      
      determining second logarithmic values based on the first word probabilities of the candidate word and the constituent words; and
      
      determining the second entropy-related value based on the word counts of the constituent words and the second logarithmic values.
  - 13. The method of claim 9, wherein the words each comprise one or more Hanzi characters.

14. A computer-implemented method, comprising:
- partitioning a collection of web documents into a training corpus and a development corpus;
  
  training a language model on the training corpus for first word probabilities of words in the training corpus, wherein the words in the training corpus include a candidate word defined by a sequence of two or more corresponding words in the training corpus, the two or more corresponding words being existing words in a dictionary;
  
  counting occurrences of the candidate word and the two or more corresponding words in the development corpus;
  
  determining a first value based on the occurrences of the candidate word in the development corpus and the first word probabilities;
  
  determining a second value based on the occurrences of the two or more corresponding words in the development corpus and the first word probabilities;
  
  comparing the first value to the second value; and
  
  determining whether the candidate word is a new word based on the comparison.
  - 15. The method of claim 14, further comprising adding the candidate word to the dictionary if the candidate word is determined to be a new word.
  - 16. The method of claim 14, wherein training a language model on the training corpus for first word probabilities of words in the training corpus comprises training an n-gram language model.
  - 17. The method of claim 16, wherein:
    - determining a first value based on the occurrences of the candidate word in the development corpus and the first word probabilities comprises:
      
      determining a first logarithmic value based on the first word probability for the candidate word and the first word probabilities of the two or more corresponding words; and
      
      multiplying the first logarithmic value by the counted occurrences of the candidate word; and
      
      wherein determining a second value based on the occurrences of the two or more corresponding words in the development corpus and the first word probabilities comprises:
      
      determining second logarithmic values based on the first word probability of the candidate word and the first word probabilities of the two or more corresponding words; and
      
      multiplying the second logarithmic values by the counted occurrences of the two or more corresponding words.
  - 18. The method of claim 17, wherein the words each comprise one or more Hanzi characters.
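
The counting step of claim 14 can be illustrated with a short sketch. It treats the candidate word as a contiguous run of its corresponding words and tallies both the candidate and each corresponding word across the development corpus. The name `count_occurrences` is hypothetical, and counting a word's occurrences inside the candidate toward that word's own total is an assumption the claim does not settle.

```python
def count_occurrences(dev_corpus, constituents):
    """Count the candidate (a contiguous constituent sequence) and each constituent.

    dev_corpus   : list of documents, each a list of segmented words
    constituents : the two or more existing words that form the candidate
    Returns (candidate_count, {constituent: count}).
    """
    seq = tuple(constituents)
    n = len(seq)
    candidate_count = 0
    word_counts = {w: 0 for w in seq}
    for doc in dev_corpus:
        for i, token in enumerate(doc):
            # Assumption: occurrences inside the candidate still count
            # toward the individual word totals.
            if token in word_counts:
                word_counts[token] += 1
            if tuple(doc[i:i + n]) == seq:
                candidate_count += 1
    return candidate_count, word_counts
```

Per claim 17, these counts would then be multiplied by the logarithmic values derived from the training-corpus probabilities to produce the two values that are compared.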

19. A system, comprising:
- a word processing module comprising computer instructions stored in a computer readable medium, and upon execution by a computer device configured to access and partition a word corpus into a training corpus and a development corpus, and to generate:
  
  first word probabilities for words stored in the training corpus, the words including a candidate word comprising two or more corresponding words;
  
  second word probabilities for the words in the development corpus;
  
  a new word analyzer module comprising computer instructions stored in a computer readable medium, and upon execution by a computer device configured to process the first and second word probabilities and generate:
  
  a first value based on the first word probabilities for the candidate word and the two or more corresponding words and the second word probability for the candidate word; and
  
  a second value based on the first word probabilities for the candidate word and the two or more corresponding words and the second word probabilities for the two or more corresponding words;
  
  and further configured to compare the first value to the second value and determine whether the candidate word is a new word based on the comparison.
  - 20. The system of claim 19, further comprising a dictionary updater module comprising computer instructions stored in a computer readable medium, and upon execution by a computer device configured to update a dictionary with identified new words.
  - 21. The system of claim 19, wherein the word processing module comprises an n-gram language model.
  - 22. The system of claim 19, wherein the first and second values are entropy-related values.
  - 23. The system of claim 19, wherein the word corpus comprises web documents.
  - 24. The system of claim 19, wherein the word processing module comprises a Hanzi character processing module.
  - 25. The system of claim 24, wherein each word comprises one or more Hanzi characters.

26. An apparatus comprising software stored in a computer readable medium, the software comprising computer readable instructions executable by a computer processing device and that upon such execution cause the computer processing device to:
- determine first word frequencies for existing words and a candidate word in a training corpus, the candidate word defined by a sequence of constituent words, each constituent word an existing word, and each existing word a word existing in a dictionary;
  
  determine second word frequencies for the constituent words and the candidate word in a development corpus;
  
  determine a candidate word entropy-related measure based on the second word frequency of the candidate word and the first word frequencies of the constituent words and the candidate word;
  
  determine an existing word entropy-related measure based on the second word frequencies of the constituent words and the first word frequencies of the constituent words and the candidate word; and
  
  determine that the candidate word is a new word if the candidate word entropy-related measure exceeds the existing word entropy-related measure.

27. A system, comprising:
- means for determining first word probabilities for existing words and a candidate word in a first corpus, the candidate word defined by a sequence of constituent words, each constituent word being an existing word, and each existing word being a word existing in a dictionary;
  
  means for determining second word probabilities for the constituent words and the candidate word in a second corpus;
  
  means for determining a first entropy-related value based on the second word probability of the candidate word and the first word probabilities of the candidate word and the constituent words;
  
  means for determining a second entropy-related value based on the second word probabilities of the constituent words and the first word probabilities of the candidate word and the constituent words; and
  
  means for determining whether the candidate word is a new word based on a comparison between the first entropy-related value and the second entropy-related value.

28. A system, comprising:
- a word processing means configured to access and partition a word corpus into a training corpus and a development corpus, and to generate:
  
  first word probabilities for words stored in the training corpus, the words including a candidate word comprising two or more corresponding words;
  
  second word probabilities for the words in the development corpus;
  
  a new word analyzer means configured to receive the first and second word probabilities and generate:
  
  a first value based on the first word probabilities for the candidate word and the two or more corresponding words and the second word probability for the candidate word; and
  
  a second value based on the first word probabilities for the candidate word and the two or more corresponding words and the second word probabilities for the two or more corresponding words;
  
  and further configured to compare the first value to the second value and determine whether the candidate word is a new word based on the comparison.

Current Assignee: Google LLC (Alphabet Inc.)
Original Assignee: Google Inc. (Alphabet Inc.)
Inventors: Wang, Yonggang; Liu, Tang Xi; Yang, Bo; Wu, Jun; Zhang, Lei; Hong, Feng
Granted Patent: US 7,917,355 B2
US Class Current: 704/10
CPC Class Codes:
G06F 40/129   Handling non-Latin characte...
G06F 40/216   using statistical methods
G06F 40/242   Dictionaries
G06F 40/53   Processing of non-Latin tex...
