Proper name identification in Chinese
Abstract
A word segmentation method to identify proper names in input text includes locating a sequence of single-characters in the input text not forming part of a multiple-character word. The method further includes comparing the sequence of single-characters to a lexical knowledge base to identify if a first portion of the sequence corresponds to stored identifiable portions of a proper name, and comparing the sequence of single-characters to the lexical knowledge base to identify if a second portion of the sequence proximate the first portion includes characters known to comprise a second portion of a proper name. Instructions can be provided on a computer readable medium to implement the method.
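As an illustration only, the steps summarized in the abstract might be sketched in Python as follows. The word list, family-name list and given-name characters are hypothetical toy data, greedy longest-match is an assumed way of locating multiple-character words, and the two-character cap on the given name is likewise an assumption; none of these details is taken from the patent text.

```python
# Hypothetical toy data; a real lexical knowledge base would be far larger.
WORDS = {"北京", "工作"}              # known multiple-character words
FAMILY_NAMES = {"王", "李"}            # assumed "first portion" entries
GIVEN_NAME_CHARS = {"小", "明", "华"}  # assumed "second portion" characters


def segment(text, words=WORDS, max_len=4):
    """Greedy longest-match segmentation; unmatched characters stay single."""
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first, down to two characters.
        for n in range(min(max_len, len(text) - i), 1, -1):
            if text[i:i + n] in words:
                tokens.append(text[i:i + n])
                i += n
                break
        else:
            tokens.append(text[i])
            i += 1
    return tokens


def find_names(text):
    """Return candidate full names found among the single characters."""
    tokens = segment(text)
    names, i = [], 0
    while i < len(tokens):
        if tokens[i] in FAMILY_NAMES:
            j = i + 1
            # Collect up to two adjacent single characters known from given names.
            while j < len(tokens) and j - i <= 2 and tokens[j] in GIVEN_NAME_CHARS:
                j += 1
            if j > i + 1:
                names.append("".join(tokens[i:j]))
                i = j
                continue
        i += 1
    return names


print(find_names("王小明在北京工作"))  # ['王小明']
```

In this sketch the multiple-character words 北京 and 工作 are located first, so only the leftover single characters are tested against the name lists.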
67 Citations
45 Claims
1. A computer readable medium including instructions readable by a computer which, when implemented, cause the computer to identify proper names in input text by performing steps comprising:
locating a sequence of single-characters in the input text not forming part of a multiple-character word;
comparing the sequence of single-characters to a lexical knowledge base to identify if a first portion of the sequence corresponds to stored identifiable portions of a proper name; and
comparing the sequence of single-characters to the lexical knowledge base to identify if a second portion of the sequence proximate the first portion includes characters known to comprise a second portion of a proper name;
locating non-proper name multiple-character words in the input text; and
identifying characters of the input text comprising each of the multiple-character words.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13
submitting the first portion and the second portion to a syntactic parser with other words; and
parsing the input text with the first portion, the second portion and the other words.
-
3. The computer readable medium of claim 1 including instructions readable by a computer which, when implemented, cause the computer to perform a step comprising:
assigning an indication of probability that the first portion and the second portion correspond to a proper name.
-
4. The computer readable medium of claim 3 including instructions readable by a computer which, when implemented, cause the computer to assign the indication of probability as a function of words proximate the first and second portions.
-
5. The computer readable medium of claim 1 including instructions readable by a computer which, when implemented, cause the computer to perform a step comprising:
locating single-character words in the input text.
-
6. The computer readable medium of claim 1 including instructions readable by a computer which, when implemented, cause the computer in the step of comparing the sequence of single-characters to the lexical knowledge base to identify if the second portion of the sequence includes characters known to comprise the second portion of a proper name to do so as a function of character position.
-
7. The computer readable medium of claim 6 including instructions readable by a computer which, when implemented, cause the computer to include in the step of comparing the sequence of single-characters to the lexical knowledge base to identify if the second portion of the sequence includes characters known to comprise the second portion of a proper name, a step comprising:
forming a record if adjacent single-characters are known to comprise the second portion as a function of character position.
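A minimal sketch of the positional comparison in claims 6 and 7, assuming the knowledge base maps each character to the zero-based positions at which it has been seen in a given name. The table contents, the record layout and the two-character limit are hypothetical and chosen only for illustration.

```python
# Hypothetical positional data: character -> positions (0-based) in a given name.
GIVEN_POSITIONS = {"小": {0}, "明": {0, 1}, "华": {1}}


def given_name_record(chars, positions=GIVEN_POSITIONS):
    """Return a record if every adjacent character fits its position, else None."""
    if not chars or len(chars) > 2:
        return None
    for pos, ch in enumerate(chars):
        if pos not in positions.get(ch, set()):
            return None
    return {"text": "".join(chars), "type": "given_name"}


print(given_name_record("小明"))  # {'text': '小明', 'type': 'given_name'}
print(given_name_record("华小"))  # None
```

Here 华小 is rejected because, in the toy table, 华 is only known in the second position and 小 only in the first.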
-
8. The computer readable medium of claim 1 including instructions readable by a computer which, when implemented, cause the computer to identify proper names in unsegmented input text.
-
9. The computer readable medium of claim 1 including instructions readable by a computer which, when implemented, cause the computer to identify proper names in Chinese text.
-
10. The computer readable medium of claim 1 including instructions readable by a computer which, when implemented, cause the computer to identify full names.
-
11. The computer readable medium of claim 1 including instructions readable by a computer which, when implemented, cause the computer to identify full names wherein the first portion comprises a family name and the second portion forms at least a part of a given name.
-
12. The computer readable medium of claim 1 including instructions readable by a computer which, when implemented, cause the computer to identify institutional names.
-
13. The computer readable medium of claim 1 including instructions readable by a computer which, when implemented, cause the computer to identify geographical names.
-
14. A computer readable medium including instructions readable by a computer which, when implemented, cause the computer to identify non-Chinese originated names written with Chinese characters in Chinese text by performing steps comprising:
locating a sequence of five or more single, Chinese characters in the Chinese text not forming part of a multiple-character word; and
comparing the sequence of single, Chinese characters to a lexical knowledge base to identify if the Chinese characters contained in the sequence of characters correspond to Chinese characters used in non-Chinese originated names.
Dependent claims: 15, 16, 17, 18
locating non-proper name multiple-character words in the Chinese text; and
identifying characters of the input text comprising each of the multiple-character words.
-
16. The computer readable medium of claim 15 including instructions readable by a computer which, when implemented, cause the computer to perform a step comprising:
locating single-character words in the Chinese text.
-
17. The computer readable medium of claim 16 including instructions readable by a computer which, when implemented, cause the computer to perform steps comprising:
submitting the non-Chinese originated name, the single-character words and the multiple-character words to a syntactic parser; and
parsing the Chinese text with the non-Chinese originated name, the single-character words and the multiple-character words.
-
18. The computer readable medium of claim 17 including instructions readable by a computer which, when implemented, cause the computer to parse the Chinese text as a function of at least some of the words assigned a probability, and perform a step comprising:
assigning a higher probability to the non-Chinese originated name than at least some of the other words.
-
19. A computer readable medium comprising a lexical knowledge base for use in identifying proper names in input text, the lexical knowledge base comprising:
for each of a plurality of words, an indication that the word corresponds to a first portion of a proper name;
for each of a plurality of characters, an indication that the character is part of a second portion of a proper name;
for each of a plurality of sequences of characters, an indication that the sequence of characters corresponds to all portions of a proper name; and
for each of a plurality of entries comprising a character sequence, an indication that a first character is a part of the first portion of a proper name and a second character is a first character of the second portion of a proper name.
Dependent claims: 20, 21, 22, 23, 24, 25, 26
for each of a plurality of words, an indication that the word corresponds to a proper name.
-
22. The computer readable medium of claim 19 wherein the proper name is a full name and the first portion comprises a family name and the second portion comprises a given name.
-
23. The computer readable medium of claim 22 wherein the indication that the character is part of a given name includes positional information of the character in the given name.
-
24. The computer readable medium of claim 23 wherein the knowledge base further comprises:
for each of a plurality of entries comprising a character sequence, an indication that a first character is a family name and a second character is a first character of a given name.
-
25. The computer readable medium of claim 19 wherein the proper name comprises an institutional name.
-
26. The computer readable medium of claim 19 wherein the proper name comprises a geographical location.
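One way to picture the lexical knowledge base of claims 19 to 26 is as a small container holding the four kinds of indication, with positional information attached to second-portion characters as in claim 23. The class name, field names and sample entries below are assumptions made for illustration and do not describe the patent's actual storage format.

```python
from dataclasses import dataclass, field


@dataclass
class LexicalKnowledgeBase:
    """Toy container for the four kinds of indication listed in claim 19."""
    first_portions: set = field(default_factory=set)         # e.g. family names
    second_portion_chars: dict = field(default_factory=dict)  # char -> positions
    full_names: set = field(default_factory=set)              # whole proper names
    boundary_pairs: set = field(default_factory=set)          # (first, second) chars

    def is_first_portion(self, word):
        return word in self.first_portions

    def second_portion_positions(self, char):
        """Positions (0-based) at which the character may occur."""
        return self.second_portion_chars.get(char, set())

    def is_boundary(self, a, b):
        """True if a ends a first portion and b starts a second portion."""
        return (a, b) in self.boundary_pairs


# Hypothetical sample entries.
kb = LexicalKnowledgeBase(
    first_portions={"王"},
    second_portion_chars={"小": {0}, "明": {1}},
    full_names={"王小明"},
    boundary_pairs={("王", "小")},
)
```

Keeping the boundary pairs as their own set mirrors the separate entry type recited in the last element of claim 19.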
-
27. A computer readable medium comprising a lexical knowledge base for use in identifying non-Chinese originated names in Chinese text, the lexical knowledge base comprising:
a first field including a character; and
a second field associated with the first field and including an indication that the character is part of a non-Chinese originated name.
-
28. A computer readable medium including instructions readable by a computer which, when implemented, cause the computer to create a lexical knowledge base for identifying proper names in input text by performing steps comprising:
comparing a list of full proper names to be identified and a list of known portions of the full proper names and removing from each of the proper names any known portions contained therein to obtain a list comprising remaining portions of the full proper names; and
storing indications in the lexical knowledge base for the list of full proper names, for the list of known portions of the full proper names, for the list of remaining portions of the full proper names, for character sequences that a first character is a part of the known portion of a full proper name and a second character is a first character of the remaining portion of a full proper name, and for positional information of characters in each of the remaining portions of the full proper names, wherein the positional information is separate from the list of remaining portions of full proper names.
Dependent claims: 29, 30, 31
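The two steps of claim 28 could be sketched roughly as below. The sketch assumes that the known portion is a leading family name, so that removing it amounts to stripping a prefix; the claim itself is not limited to prefixes. The function name and dictionary keys are hypothetical.

```python
def build_knowledge_base(full_names, known_portions):
    """Strip known portions from full names and store the resulting indications."""
    kb = {
        "full_names": set(full_names),
        "known_portions": set(known_portions),
        "remaining_portions": set(),
        "boundary_pairs": set(),   # (last char of known, first char of remaining)
        "positions": {},           # kept separate from remaining_portions
    }
    # Try longer known portions first so a two-character family name wins.
    ordered = sorted(known_portions, key=len, reverse=True)
    for name in full_names:
        for portion in ordered:
            if name.startswith(portion) and len(name) > len(portion):
                rest = name[len(portion):]
                kb["remaining_portions"].add(rest)
                kb["boundary_pairs"].add((portion[-1], rest[0]))
                for pos, ch in enumerate(rest):
                    kb["positions"].setdefault(ch, set()).add(pos)
                break
    return kb


kb = build_knowledge_base(["王小明", "欧阳明"], ["王", "欧", "欧阳"])
print(sorted(kb["remaining_portions"]))  # ['小明', '明']
```

The positional table is stored under its own key, in line with the claim's requirement that positional information be separate from the list of remaining portions.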
-
32. A word segmentation method to identify proper names in input text, the method comprising:
locating a sequence of single-characters in the input text not forming part of a multiple-character word;
comparing the sequence of single-characters to a lexical knowledge base to identify if a first portion of the sequence corresponds to stored identifiable portions of a proper name; and
comparing the sequence of single-characters to the lexical knowledge base to identify if a second portion of the sequence proximate the first portion includes characters known to comprise a second portion of a proper name;
locating non-proper name multiple-character words in the input text; and
identifying characters of the input text comprising each of the multiple-character words.
Dependent claims: 33, 34, 35, 36, 37, 38, 39, 40
submitting the first portion and the second portion to a syntactic parser with other words; and
parsing the input text with the first portion, the second portion and the other words.
-
34. The word segmentation method of claim 32 and further comprising:
assigning an indication of probability that the first portion and the second portion correspond to a proper name.
-
35. The word segmentation method of claim 34 wherein assigning the indication of probability is a function of words proximate the first and second portions.
-
36. The word segmentation method of claim 32 wherein the step of comparing the sequence of single-characters to the lexical knowledge base to identify if the second portion of the sequence includes characters known to comprise the second portion of a proper name includes comparing as a function of character position.
-
37. The word segmentation method of claim 36 wherein the step of comparing the sequence of single-characters to the lexical knowledge base to identify if the second portion of the sequence includes characters known to comprise the second portion of a proper name, includes:
forming a record if adjacent single-characters are known to comprise the second portion as a function of character position.
-
38. The word segmentation method of claim 32 wherein the method identifies proper names in unsegmented input text.
-
39. The word segmentation method of claim 32 wherein the method identifies proper names in Chinese text.
-
40. The word segmentation method of claim 32 wherein the method identifies at least one of full names, institutional names and geographical names.
-
41. A word segmentation method to identify non-Chinese originated names contained in Chinese text, the method comprising:
locating a sequence of three or more single, Chinese characters in the Chinese text not forming part of a multiple-character word; and
comparing the sequence of single, Chinese characters to a lexical knowledge base to identify if the Chinese characters contained in the sequence of characters correspond to Chinese characters used in non-Chinese originated names.
Dependent claims: 42, 43, 44, 45
locating non-proper name multiple-character words in the Chinese text; and
identifying characters of the Chinese text comprising each of the multiple-character words.
-
43. The word segmentation method of claim 42 and further comprising:
locating single-character words in the Chinese text.
-
44. The word segmentation method of claim 43 and further comprising:
submitting the non-Chinese originated name, the single-character words and the multiple-character words to a syntactic parser; and
parsing the Chinese text with the non-Chinese originated name, the single-character words and the multiple-character words.
-
45. The word segmentation method of claim 44 and further comprising:
assigning a higher probability to the non-Chinese originated name than at least some of the other words.
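The detection of non-Chinese originated names recited in claims 14 and 41 might look roughly like the following sketch, which scans already segmented tokens for runs of single characters that all appear in a transliteration character list. The character sample is a small hypothetical stand-in for the knowledge base, and the minimum run length is a parameter because claim 14 recites five or more characters while claim 41 recites three or more.

```python
# Hypothetical sample of characters often used to transliterate foreign names.
TRANSLIT_CHARS = set("克林顿斯尔巴布马德")


def find_foreign_names(tokens, min_len=3, translit=TRANSLIT_CHARS):
    """Return runs of at least min_len single transliteration characters."""
    names, run = [], []
    for tok in list(tokens) + [""]:  # empty sentinel flushes the final run
        if len(tok) == 1 and tok in translit:
            run.append(tok)
            continue
        if len(run) >= min_len:
            names.append("".join(run))
        run = []
    return names


tokens = ["克", "林", "顿", "访问", "北京"]
print(find_foreign_names(tokens))             # ['克林顿']
print(find_foreign_names(tokens, min_len=5))  # []
```

A multiple-character word such as 访问 ends the current run, which matches the requirement that the characters not form part of a multiple-character word.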
Specification