CJK NAME DETECTION
First Claim
1. A method comprising:
- generating a raw name detection model using a collection of family names and an annotated corpus including a collection of n-grams, each n-gram having a corresponding probability of occurring as a name in the annotated corpus;
- applying the raw name detection model to a collection of semi-structured data to form annotated semi-structured data, the annotated semi-structured data identifying n-grams identifying names and n-grams not identifying names;
- applying the raw name detection model to a large unannotated corpus to form a large annotated corpus data identifying n-grams of the large unannotated corpus identifying names and n-grams not identifying names; and
- generating a name detection model including:
  - deriving a name model using the annotated semi-structured data identifying names and the large annotated corpus data identifying names,
  - deriving a not-name model using the semi-structured data not identifying names, and
  - deriving a language model using the large annotated corpus.
Abstract
Aspects directed to name detection are provided. A method includes generating a raw name detection model using a collection of family names and an annotated corpus including a collection of n-grams, each n-gram having a corresponding probability of occurring. The method includes applying the raw name detection model to a collection of semi-structured data to form annotated semi-structured data identifying n-grams identifying names and n-grams not identifying names and applying the raw name detection model to a large unannotated corpus to form a large annotated corpus data identifying n-grams of the large unannotated corpus identifying names and n-grams not identifying names. The method includes generating a name detection model, including deriving a name model using the annotated semi-structured data identifying names and the large annotated corpus data identifying names, deriving a not-name model using the semi-structured data not identifying names, and deriving a language model using the large annotated corpus.
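The training flow summarized in the abstract can be sketched roughly as follows. This is a minimal illustration and not the patented implementation: the function names, the 0.5 threshold, the assumption that a family name is a single leading character, and the use of plain counts as "models" are all hypothetical simplifications introduced here.

```python
from collections import Counter


def build_raw_model(family_names, annotated_corpus):
    """Estimate P(name | n-gram) from (ngram, is_name) pairs.

    Assumption: only n-grams whose first character is a known
    family name are kept in the raw model.
    """
    name_counts, total_counts = Counter(), Counter()
    for ngram, is_name in annotated_corpus:
        total_counts[ngram] += 1
        if is_name:
            name_counts[ngram] += 1
    return {
        ngram: name_counts[ngram] / total
        for ngram, total in total_counts.items()
        if ngram[0] in family_names
    }


def annotate(raw_model, ngrams, threshold=0.5):
    """Split n-grams into (names, not_names) using the raw model.

    The threshold is a hypothetical cut-off, not taken from the source.
    """
    names, not_names = [], []
    for ngram in ngrams:
        if raw_model.get(ngram, 0.0) >= threshold:
            names.append(ngram)
        else:
            not_names.append(ngram)
    return names, not_names


def build_name_detection_model(raw_model, semi_structured, large_corpus):
    """Derive the three component models as simple count tables."""
    semi_names, semi_not_names = annotate(raw_model, semi_structured)
    corpus_names, _ = annotate(raw_model, large_corpus)
    return {
        # Names found in both annotated data sets.
        "name_model": Counter(semi_names + corpus_names),
        # Non-names come from the semi-structured data only.
        "not_name_model": Counter(semi_not_names),
        # Stand-in language model: unigram counts over the large corpus.
        "language_model": Counter(large_corpus),
    }
```

A production system would presumably replace the count tables with smoothed probability estimates; the sketch only shows which data feeds which component.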
30 Citations
38 Claims
1. A method comprising:
- generating a raw name detection model using a collection of family names and an annotated corpus including a collection of n-grams, each n-gram having a corresponding probability of occurring as a name in the annotated corpus;
- applying the raw name detection model to a collection of semi-structured data to form annotated semi-structured data, the annotated semi-structured data identifying n-grams identifying names and n-grams not identifying names;
- applying the raw name detection model to a large unannotated corpus to form a large annotated corpus data identifying n-grams of the large unannotated corpus identifying names and n-grams not identifying names; and
- generating a name detection model including:
  - deriving a name model using the annotated semi-structured data identifying names and the large annotated corpus data identifying names,
  - deriving a not-name model using the semi-structured data not identifying names, and
  - deriving a language model using the large annotated corpus.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 11, 12
9. A method comprising:
- receiving an input string of characters; and
- applying a name detection model to the input string having a plurality of characters, including:
  - identifying a most likely segmentation of the plurality of characters where the plurality of characters do not include one or more names,
  - detecting one or more sequences of characters of the plurality of characters as potentially identifying one or more names,
  - identifying a segmentation of the plurality of characters where the plurality of characters include the one or more potential names, and
  - segmenting the plurality of characters as including the one or more names when the likelihood of the segmentation including the potential one or more names is greater than the most likely segmentation not including one or more names.
Dependent claims: 10
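The comparison recited in claim 9 (best segmentation without names versus a segmentation containing a candidate name) can be illustrated with a small dynamic-programming sketch. Everything here is an assumption made for illustration: the unigram log-probability lexicon, the fixed penalty for unknown single characters, the rule that candidates are 2 or 3 characters starting with a family-name character, and the restriction to at most one name per input. The claim itself does not specify any of these details.

```python
import math


def best_segmentation(text, word_logprob, max_len=4):
    """Most likely segmentation under a unigram model (Viterbi-style DP).

    Returns (log-likelihood, list of words).
    """
    n = len(text)
    best = [(-math.inf, [])] * (n + 1)
    best[0] = (0.0, [])
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            score = best[start][0] + word_logprob(word)
            if score > best[end][0]:
                best[end] = (score, best[start][1] + [word])
    return best[n]


def detect_names(text, lexicon, family_names, name_logprob, unknown=-20.0):
    """Return (words, names), keeping a name only if it raises the likelihood."""

    def word_lp(word):
        if word in lexicon:
            return lexicon[word]
        # Hypothetical back-off: unknown single characters get a flat penalty.
        return unknown if len(word) == 1 else -math.inf

    # Step 1: most likely segmentation that contains no names.
    best_score, best_words = best_segmentation(text, word_lp)
    names = []

    # Step 2: candidate names start with a family-name character.
    for i, ch in enumerate(text):
        if ch not in family_names:
            continue
        for j in range(i + 2, min(i + 3, len(text)) + 1):
            candidate = text[i:j]
            # Step 3: segmentation that treats the candidate as a name.
            left_score, left_words = best_segmentation(text[:i], word_lp)
            right_score, right_words = best_segmentation(text[j:], word_lp)
            score = left_score + name_logprob(candidate) + right_score
            # Step 4: accept the name only if it is strictly more likely.
            if score > best_score:
                best_score = score
                best_words = left_words + [candidate] + right_words
                names = [candidate]
    return best_words, names
```

With a hypothetical name model that scores "张伟" highly, the input "张伟来了" would be segmented as a name followed by "来了", whereas "张开门" would keep the ordinary word "张开" because the no-name segmentation remains more likely.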
13. (canceled)
14. (canceled)
15. A computer program product, encoded on a tangible program carrier, operable to cause data processing apparatus to perform operations comprising:
- generating a raw name detection model using a collection of family names and an annotated corpus including a collection of n-grams, each n-gram having a corresponding probability of occurring as a name in the annotated corpus;
- applying the raw name detection model to a collection of semi-structured data to form annotated semi-structured data, the annotated semi-structured data identifying n-grams identifying names and n-grams not identifying names;
- applying the raw name detection model to a large unannotated corpus to form a large annotated corpus data identifying n-grams of the large unannotated corpus identifying names and n-grams not identifying names; and
- generating a name detection model including:
  - deriving a name model using the annotated semi-structured data identifying names and the large annotated corpus data identifying names,
  - deriving a not-name model using the semi-structured data not identifying names, and
  - deriving a language model using the large annotated corpus.
Dependent claims: 16, 17, 18, 19, 20, 21, 22, 25, 26
23. A computer program product, encoded on a tangible program carrier, operable to cause data processing apparatus to perform operations comprising:
- receiving an input string of characters; and
- applying a name detection model to the input string having a plurality of characters, including:
  - identifying a most likely segmentation of the plurality of characters where the plurality of characters do not include one or more names,
  - detecting one or more sequences of characters of the plurality of characters as potentially identifying one or more names,
  - identifying a segmentation of the plurality of characters where the plurality of characters include the one or more potential names, and
  - segmenting the plurality of characters as including the one or more names when the likelihood of the segmentation including the potential one or more names is greater than the most likely segmentation not including one or more names.
Dependent claims: 24
27. A system comprising:
- a raw name model including a collection of family names and an annotated corpus including a collection of n-grams, each n-gram having a corresponding probability of occurring as a name in the annotated corpus;
- annotated semi-structured data formed by applying the raw name detection model to a collection of semi-structured data, the annotated semi-structured data identifying n-grams identifying names and n-grams not identifying names;
- large annotated corpus data formed by applying the raw name detection model to a collection of a large unannotated corpus, the large annotated corpus data identifying n-grams of the large unannotated corpus identifying names and n-grams not identifying names by applying the raw name detection model; and
- a name detection model including:
  - a name model derived from the annotated semi-structured data identifying names and the large annotated corpus data identifying names,
  - a not-name model derived from the semi-structured data not identifying names, and
  - a language model derived from the large annotated corpus.
Dependent claims: 28, 29, 30, 31, 32, 33, 34, 37, 38
35. A system comprising one or more computers operable to perform operations including:
- receiving an input string of characters; and
- applying a name detection model to the input string having a plurality of characters, including:
  - identifying a most likely segmentation of the plurality of characters where the plurality of characters do not include one or more names,
  - detecting one or more sequences of characters of the plurality of characters as potentially identifying one or more names,
  - identifying a segmentation of the plurality of characters where the plurality of characters include the one or more potential names, and
  - segmenting the plurality of characters as including the one or more names when the likelihood of the segmentation including the potential one or more names is greater than the most likely segmentation not including one or more names.
Dependent claims: 36
Specification