CJK NAME DETECTION

US 20100306139A1
Filed: 12/06/2007
Published: 12/02/2010
Est. Priority Date: 12/06/2007
Status: Active Grant

Abstract

Aspects directed to name detection are provided. A method includes generating a raw name detection model using a collection of family names and an annotated corpus including a collection of n-grams, each n-gram having a corresponding probability of occurring. The method includes applying the raw name detection model to a collection of semi-structured data to form annotated semi-structured data identifying n-grams identifying names and n-grams not identifying names and applying the raw name detection model to a large unannotated corpus to form a large annotated corpus data identifying n-grams of the large unannotated corpus identifying names and n-grams not identifying names. The method includes generating a name detection model, including deriving a name model using the annotated semi-structured data identifying names and the large annotated corpus data identifying names, deriving a not-name model using the semi-structured data not identifying names, and deriving a language model using the large annotated corpus.
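
The training flow described in the abstract can be illustrated with a minimal sketch. It is not the patented implementation: the function names, the 0.5 decision threshold, the single-character family name, and the use of plain relative frequencies for all three sub-models are assumptions made only for illustration.

```python
from collections import Counter


def relative_frequencies(counts):
    """Turn raw counts into relative frequencies (assumed estimator)."""
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()} if total else {}


def train_raw_model(family_names, annotated_corpus):
    """annotated_corpus: iterable of (ngram, is_name) pairs.

    Keeps only n-grams whose left character is a known family name and
    scores each by how often it was annotated as a name.
    """
    name_counts, all_counts = Counter(), Counter()
    for ngram, is_name in annotated_corpus:
        if ngram and ngram[0] in family_names:
            all_counts[ngram] += 1
            if is_name:
                name_counts[ngram] += 1
    return {ng: name_counts[ng] / all_counts[ng] for ng in all_counts}


def annotate(raw_model, ngrams, threshold=0.5):
    """Split n-grams into (names, not_names); threshold is hypothetical."""
    names, not_names = [], []
    for ng in ngrams:
        (names if raw_model.get(ng, 0.0) >= threshold else not_names).append(ng)
    return names, not_names


def train_name_detection_model(raw_model, semi_structured, unannotated):
    """Derive the name, not-name, and language models from annotated data."""
    semi_names, semi_not = annotate(raw_model, semi_structured)
    corpus_names, _ = annotate(raw_model, unannotated)
    return {
        # name model: names from both annotated sources
        "name": relative_frequencies(Counter(semi_names + corpus_names)),
        # not-name model: non-names from the semi-structured data only
        "not_name": relative_frequencies(Counter(semi_not)),
        # language model: here a simple unigram model over the large corpus
        "language": relative_frequencies(Counter(unannotated)),
    }
```

Calling `train_raw_model` on a small hand-annotated corpus and passing its result to `train_name_detection_model` yields a dictionary holding the three sub-models. The claims also describe repeating the annotation step with the newly trained model to obtain a refined model, which this sketch does not cover.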

38 Claims

1. A method comprising:
- generating a raw name detection model using a collection of family names and an annotated corpus including a collection of n-grams, each n-gram having a corresponding probability of occurring as a name in the annotated corpus;
  
  applying the raw name detection model to a collection of semi-structured data to form annotated semi-structured data, the annotated semi-structured data identifying n-grams identifying names and n-grams not identifying names;
  
  applying the raw name detection model to a large unannotated corpus to form a large annotated corpus data identifying n-grams of the large unannotated corpus identifying names and n-grams not identifying names; and
  
  generating a name detection model including;
  
  deriving a name model using the annotated semi-structured data identifying names and the large annotated corpus data identifying names,
  
  deriving a not-name model using the semi-structured data not identifying names, and
  
  deriving a language model using the large annotated corpus.
  - 2. The method of claim 1, further comprising:
    - applying the name detection model to the collection of semi-structured data to form the annotated semi-structured data, the annotated semi-structured data identifying n-grams identifying names and n-grams not identifying names;
      
      applying the name detection model to the large unannotated corpus to form the large annotated corpus data identifying n-grams of the large unannotated corpus identifying names and n-grams not identifying names; and
      
      generating a refined name detection model including;
      
      deriving a refined name model using the annotated semi-structured data identifying names and the large annotated corpus data identifying names,
      
      deriving a refined not-name model using the semi-structured data not identifying names, and
      
      deriving a refined language model using the large annotated corpus.
  - 3. The method of claim 1, wherein the name model includes:
    - a collection of n-grams from the annotated semi-structured data identifying names and the large annotated corpus identifying names, where each n-gram includes a family name as a left character and a given name as right context, and each n-gram has a corresponding probability of identifying a name.
  - 4. The method of claim 1, wherein the not-name model includes:
    - a collection of n-grams from the annotated semi-structured data not identifying names, where each n-gram includes a family name as a left character and a given name as right context, and each n-gram has a corresponding probability of not identifying a name.
  - 5. The method of claim 1, wherein the raw name detection model includes:
    - a collection of n-grams from the annotated corpus, where each n-gram includes a left character that is a family name from the collection of family names, and each n-gram has a corresponding probability of identifying a name according to a relative frequency of the name in the annotated corpus.
  - 6. The method of claim 1, wherein the raw name model is generated using a collection of foreign family names.
  - 7. The method of claim 1, wherein:
    - the collection of family names includes a plurality of sparse family names; and
      
      the raw name detection model uses a single probability of all sparse family names in place of a calculated probability of a specific sparse family name of the plurality of sparse family names to identify probabilities of each n-gram, that includes a left character that is a sparse family name, identifying a name.
  - 8. The method of claim 1, wherein the collection of family names includes a plurality of foreign family names.
  - 11. The method of claim 1 further comprising:
    - receiving a string including a plurality of characters; and
      
      calculating a probability that a particular sequence of the string identifies a name, the name includes a family name and a given name, including;
      
      when the frequency of the particular sequence in a corpus is less than a threshold value, determining the probability that the particular sequence identifies a name as a function of a relative frequency that the portion of the sequence representing a given name occurs with any family name and the relative frequency of the portion of the sequence representing the family name.
  - 12. The method of claim 1 further comprising:
    - receiving user input data; and
      
      applying the raw name detection model to the user input data to form annotated user input data, the annotated user input data identifying n-grams identifying names and n-grams not identifying names;
      
      where generating the name detection model further includes;
      
      deriving the name model using the annotated user input data identifying names,
      
      deriving the not-name model using the annotated user input data not identifying names, and
      
      deriving a language model using the annotated user input data.
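
Claims 7 and 11 describe two ways of handling data sparsity: sharing one probability across all sparse family names, and backing off to separate family-name and given-name frequencies when a full sequence is rare. The sketch below is one possible reading. The claims do not specify the cutoffs or the combining function, so the `sparse_cutoff` and `threshold` parameters, the pooled-mass estimate, and the product used in the backoff are all assumptions, as is the single-character family name.

```python
from collections import Counter


def family_probabilities(family_counts, sparse_cutoff=2):
    """Give every sparse family name one shared probability (claim 7).

    Assumption: the shared value is the pooled relative frequency of all
    family names seen fewer than `sparse_cutoff` times.
    """
    total = sum(family_counts.values())
    if total == 0:
        return {}
    sparse = {f for f, c in family_counts.items() if c < sparse_cutoff}
    pooled = sum(family_counts[f] for f in sparse) / total
    return {
        f: (pooled if f in sparse else c / total)
        for f, c in family_counts.items()
    }


def name_probability(sequence, name_counts, threshold=2):
    """Probability that `sequence` is a name, with backoff (claim 11).

    name_counts: Counter of full names (family character + given name).
    Assumption: below `threshold`, the probability is the product of the
    family-name relative frequency and the relative frequency of the
    given name occurring with any family name.
    """
    total = sum(name_counts.values())
    if total == 0 or len(sequence) < 2:
        return 0.0
    if name_counts[sequence] >= threshold:
        return name_counts[sequence] / total
    family, given = sequence[0], sequence[1:]
    family_freq = sum(c for n, c in name_counts.items() if n[0] == family) / total
    given_freq = sum(c for n, c in name_counts.items() if n[1:] == given) / total
    return family_freq * given_freq
```

With counts of 3 for 王伟 and 1 for 李娜, the unseen sequence 李伟 still receives a non-zero score because 李 has been seen as a family name and 伟 as a given name.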

9. A method comprising:
- receiving an input string of characters; and
  
  applying a name detection model to the input string having a plurality of characters, including;
  
  identifying a most likely segmentation of the plurality of characters where the plurality of characters do not include one or more names,
  
  detecting one or more sequences of characters of the plurality of characters as potentially identifying one or more names,
  
  identifying a segmentation of the plurality of characters where the plurality of characters include the one or more potential names, and
  
  segmenting the plurality of characters as including the one or more names when the likelihood of the segmentation including the potential one or more names is greater than the most likely segmentation not including one or more names.
  - 10. The method of claim 9 further comprising:
    - detecting one or more names when the plurality of characters is segmented as including one or more names.
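
The comparison in claims 9 and 10 can be sketched as follows: find the best segmentation without names, propose candidate names, and accept a candidate only if the segmentation containing it scores higher. This is an illustrative sketch, not the patented method. It assumes a unigram word model with log-probabilities, a Viterbi-style search, single-character family names, names of two or three characters, and at most one accepted name per string. All identifiers and the `unknown_logp` penalty are hypothetical.

```python
import math


def best_segmentation(text, word_logp, unknown_logp=-20.0, max_len=4):
    """Viterbi search over a unigram model; returns (log-likelihood, tokens)."""
    best = [(0.0, [])] + [(-math.inf, [])] * len(text)
    for end in range(1, len(text) + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            if word in word_logp:
                score = word_logp[word]
            elif len(word) == 1:
                score = unknown_logp  # penalty for an unseen single character
            else:
                continue
            candidate = best[start][0] + score
            if candidate > best[end][0]:
                best[end] = (candidate, best[start][1] + [word])
    return best[len(text)]


def detect_names(text, word_logp, name_logp, family_names):
    """Return (tokens, names), keeping a name only if it raises the score."""
    best_score, best_tokens = best_segmentation(text, word_logp)
    names = []
    for i, ch in enumerate(text):
        if ch not in family_names:
            continue
        for length in (2, 3):
            cand = text[i:i + length]
            if len(cand) < length or cand not in name_logp:
                continue
            left = best_segmentation(text[:i], word_logp)
            right = best_segmentation(text[i + length:], word_logp)
            score = left[0] + name_logp[cand] + right[0]
            if score > best_score:
                best_score = score
                best_tokens = left[1] + [cand] + right[1]
                names = [cand]
    return best_tokens, names
```

A candidate such as 王来 inside 国王来了 is rejected here because splitting off 国 as an unknown character makes the name-bearing segmentation less likely than the ordinary one.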

13. (canceled)

14. (canceled)

15. A computer program product, encoded on a tangible program carrier, operable to cause data processing apparatus to perform operations comprising:
- generating a raw name detection model using a collection of family names and an annotated corpus including a collection of n-grams, each n-gram having a corresponding probability of occurring as a name in the annotated corpus;
  
  applying the raw name detection model to a collection of semi-structured data to form annotated semi-structured data, the annotated semi-structured data identifying n-grams identifying names and n-grams not identifying names;
  
  applying the raw name detection model to a large unannotated corpus to form a large annotated corpus data identifying n-grams of the large unannotated corpus identifying names and n-grams not identifying names; and
  
  generating a name detection model including;
  
  deriving a name model using the annotated semi-structured data identifying names and the large annotated corpus data identifying names,
  
  deriving a not-name model using the semi-structured data not identifying names, and
  
  deriving a language model using the large annotated corpus.
  - 16. The computer program product of claim 15, operable to cause data processing apparatus to perform operations further comprising:
    - applying the name detection model to the collection of semi-structured data to form the annotated semi-structured data, the annotated semi-structured data identifying n-grams identifying names and n-grams not identifying names;
      
      applying the name detection model to the large unannotated corpus to form the large annotated corpus data identifying n-grams of the large unannotated corpus identifying names and n-grams not identifying names; and
      
      generating a refined name detection model including;
      
      deriving a refined name model using the annotated semi-structured data identifying names and the large annotated corpus data identifying names,
      
      deriving a refined not-name model using the semi-structured data not identifying names, and
      
      deriving a refined language model using the large annotated corpus.
  - 17. The computer program product of claim 15, wherein the name model includes:
    - a collection of n-grams from the annotated semi-structured data identifying names and the large annotated corpus identifying names, where each n-gram includes a family name as a left character and a given name as right context, and each n-gram has a corresponding probability of identifying a name.
  - 18. The computer program product of claim 15, wherein the not-name model includes:
    - a collection of n-grams from the annotated semi-structured data not identifying names, where each n-gram includes a family name as a left character and a given name as right context, and each n-gram has a corresponding probability of not identifying a name.
  - 19. The computer program product of claim 15, wherein the raw name detection model includes:
    - a collection of n-grams from the annotated corpus, where each n-gram includes a left character that is a family name from the collection of family names, and each n-gram has a corresponding probability of identifying a name according to a relative frequency of the name in the annotated corpus.
  - 20. The computer program product of claim 15, wherein the raw name model is generated using a collection of foreign family names.
  - 21. The computer program product of claim 15, wherein:
    - the collection of family names includes a plurality of sparse family names; and
      
      the raw name detection model uses a single probability of all sparse family names in place of a calculated probability of a specific sparse family name of the plurality of sparse family names to identify probabilities of each n-gram, that includes a left character that is a sparse family name, identifying a name.
  - 22. The computer program product of claim 15, wherein the collection of family names includes a plurality of foreign family names.
  - 25. The computer program product of claim 15, operable to cause data processing apparatus to perform operations further comprising:
    - receiving a string including a plurality of characters; and
      
      calculating a probability that a particular sequence of the string identifies a name, the name includes a family name and a given name, including;
      
      when the frequency of the particular sequence in a corpus is less than a threshold value, determining the probability that the particular sequence identifies a name as a function of a relative frequency that the portion of the sequence representing a given name occurs with any family name and the relative frequency of the portion of the sequence representing the family name.
  - 26. The computer program product of claim 15, operable to cause data processing apparatus to perform operations further comprising:
    - receiving user input data; and
      
      applying the raw name detection model to the user input data to form annotated user input data, the annotated user input data identifying n-grams identifying names and n-grams not identifying names;
      
      wherein generating the name detection model further includes;
      
      deriving the name model using the annotated user input data identifying names,
      
      deriving the not-name model using the annotated user input data not identifying names, and
      
      deriving a language model using the annotated user input data.

23. A computer program product, encoded on a tangible program carrier, operable to cause data processing apparatus to perform operations comprising:
- receiving an input string of characters; and
  
  applying a name detection model to the input string having a plurality of characters, including;
  
  identifying a most likely segmentation of the plurality of characters where the plurality of characters do not include one or more names,
  
  detecting one or more sequences of characters of the plurality of characters as potentially identifying one or more names,
  
  identifying a segmentation of the plurality of characters where the plurality of characters include the one or more potential names, and
  
  segmenting the plurality of characters as including the one or more names when the likelihood of the segmentation including the potential one or more names is greater than the most likely segmentation not including one or more names.
  - 24. The computer program product of claim 23, operable to cause data processing apparatus to perform operations further comprising:
    - detecting one or more names when the plurality of characters is segmented as including one or more names.

27. A system comprising:
- a raw name model including a collection of family names and an annotated corpus including a collection of n-grams, each n-gram having a corresponding probability of occurring as a name in the annotated corpus;
  
  annotated semi-structured data formed by applying the raw name detection model to a collection of semi-structured data to form, the annotated semi-structured data identifying n-grams identifying names and n-grams not identifying names;
  
  large annotated corpus data formed by applying the raw name detection model to a collection of a large unannotated corpus, the large annotated corpus data identifying n-grams of the large unannotated corpus identifying names and n-grams not identifying names by applying the raw name detection model; and
  
  a name detection model including;
  
  a name model derived from the annotated semi-structured data identifying names and the large annotated corpus data identifying names,
  
  a not-name model derived from the semi-structured data not identifying names, and
  
  a language model derived from the large annotated corpus.
  - 28. The system of claim 27, wherein:
    - the name detection model is applied to the collection of semi-structured data to form the annotated semi-structured data, the annotated semi-structured data identifying n-grams identifying names and n-grams not identifying names;
      
      the name detection model is applied to the large unannotated corpus to form the large annotated corpus data identifying n-grams of the large unannotated corpus identifying names and n-grams not identifying names; and
      
      the system further comprises a refined name detection model including;
      
      a refined name model derived from the annotated semi-structured data identifying names and the large annotated corpus data identifying names,
      
      a refined not-name model derived from the semi-structured data not identifying names, and
      
      a refined language model derived from the large annotated corpus.
  - 29. The system of claim 27, wherein the name model includes:
    - a collection of n-grams from the annotated semi-structured data identifying names and the large annotated corpus identifying names, where each n-gram includes a family name as a left character and a given name as right context, and each n-gram has a corresponding probability of identifying a name.
  - 30. The system of claim 27, wherein the not-name model includes:
    - a collection of n-grams from the annotated semi-structured data not identifying names, where each n-gram includes a family name as a left character and a given name as right context, and each n-gram has a corresponding probability of not identifying a name.
  - 31. The system of claim 27, wherein the raw name detection model includes:
    - a collection of n-grams from the annotated corpus, where each n-gram includes a left character that is a family name from the collection of family names, and each n-gram has a corresponding probability of identifying a name according to a relative frequency of the name in the annotated corpus.
  - 32. The system of claim 27, wherein the raw name model is generated using a collection of foreign family names.
  - 33. The system of claim 27, wherein:
    - the collection of family names includes a plurality of sparse family names; and
      
      the raw name detection model uses a single probability of all sparse family names in place of a calculated probability of a specific sparse family name of the plurality of sparse family names to identify probabilities of each n-gram, that includes a left character that is a sparse family name, identifying a name.
  - 34. The system of claim 27, wherein the collection of family names includes a plurality of foreign family names.
  - 37. The system of claim 27 further comprising one or more computers operable to perform operations including:
    - receiving a string including a plurality of characters;
      
      calculating a probability that a particular sequence of the string identifies a name, the name includes a family name and a given name, including;
      
      when the frequency of the particular sequence in a corpus is less than a threshold value, determining the probability that the particular sequence identifies a name as a function of a relative frequency that the portion of the sequence representing a given name occurs with any family name and the relative frequency of the portion of the sequence representing the family name.
  - 38. The system of claim 27 further comprising one or more computers operable to perform operations including:
    - receiving user input data; and
      
      applying the raw name detection model to the user input data to form annotated user input data, the annotated user input data identifying n-grams identifying names and n-grams not identifying names;
      
      wherein generating the name detection model further includes;
      
      deriving the name model using the annotated user input data identifying names,
      
      deriving the not-name model using the annotated user input data not identifying names, and
      
      deriving a language model using the annotated user input data.

35. A system comprising one or more computers operable to perform operations including:
- receiving an input string of characters; and
  
  applying a name detection model to the input string having a plurality of characters, including;
  
  identifying a most likely segmentation of the plurality of characters where the plurality of characters do not include one or more names,
  
  detecting one or more sequences of characters of the plurality of characters as potentially identifying one or more names,
  
  identifying a segmentation of the plurality of characters where the plurality of characters include the one or more potential names, and
  
  segmenting the plurality of characters as including the one or more names when the likelihood of the segmentation including the potential one or more names is greater than the most likely segmentation not including one or more names.
  - 36. The system of claim 35 comprising one or more computers operable to perform operations further including:
    - detecting one or more names when the plurality of characters is segmented as including one or more names.

Current Assignee: Google LLC (Alphabet Inc.)
Original Assignee: Google Inc. (Alphabet Inc.)
Inventors: Wu, Jun; Zhang, Yifei; Xu, Hui
Granted Patent: US 8,478,787 B2
US Class Current: 706/12
CPC Class Codes: G06F 40/295 (Named entity recognition)
