Proper name identification in Chinese

US 6,694,055 B2
Filed: 07/15/1998
Issued: 02/17/2004
Est. Priority Date: 07/15/1998
Status: Expired due to Term

Abstract

A word segmentation method to identify proper names in input text includes locating a sequence of single-characters in the input text not forming part of a multiple-character word. The method further includes comparing the sequence of single-characters to a lexical knowledge base to identify if a first portion of the sequence corresponds to stored identifiable portions of a proper name, and comparing the sequence of single-characters to the lexical knowledge base to identify if a second portion of the sequence proximate the first portion includes characters known to comprise a second portion of a proper name. Instructions can be provided on a computer readable medium to implement the method.
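
The steps in the abstract can be sketched in Python as follows. This is a minimal illustration and not the patented implementation: the greedy longest-match pass, the names `lexicon`, `family_names`, and `given_chars`, and the limit of two given-name characters are all assumptions made for the example.

```python
def find_single_char_runs(text, lexicon, max_word_len=4):
    """Return (start, end) spans of characters that are not part of any
    multiple-character word found by a greedy longest-match pass (assumed)."""
    covered = [False] * len(text)
    i = 0
    while i < len(text):
        for n in range(min(max_word_len, len(text) - i), 1, -1):
            if text[i:i + n] in lexicon:
                for k in range(i, i + n):
                    covered[k] = True
                i += n
                break
        else:
            i += 1  # no multiple-character word starts here
    runs, start = [], None
    for idx, is_covered in enumerate(covered):
        if not is_covered and start is None:
            start = idx
        elif is_covered and start is not None:
            runs.append((start, idx))
            start = None
    if start is not None:
        runs.append((start, len(text)))
    return runs


def find_names(text, lexicon, family_names, given_chars):
    """Within each run of single characters, look for a first portion
    (family name) followed by one or two known given-name characters."""
    names = []
    for start, end in find_single_char_runs(text, lexicon):
        i = start
        while i < end:
            if text[i] in family_names:
                j = i + 1
                while j < end and j - i <= 2 and text[j] in given_chars:
                    j += 1
                if j > i + 1:  # at least one given-name character matched
                    names.append(text[i:j])
                    i = j
                    continue
            i += 1
    return names


print(find_names("王小明喜欢北京", {"喜欢", "北京"}, {"王"}, {"小", "明"}))
```

Here the multiple-character words 喜欢 and 北京 are located first, which leaves the run 王小明 to be checked against the family-name and given-name entries.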

67 Citations

45 Claims

1. A computer readable medium including instructions readable by a computer which, when implemented, cause the computer to identify proper names in input text by performing steps comprising:
- locating a sequence of single-characters in the input text not forming part of a multiple-character word;
  
  comparing the sequence of single-characters to a lexical knowledge base to identify if a first portion of the sequence corresponds to stored identifiable portions of a proper name; and
  
  comparing the sequence of single-characters to the lexical knowledge base to identify if a second portion of the sequence proximate the first portion includes characters known to comprise a second portion of a proper name;
  
  locating non-proper name multiple-character words in the input text; and
  
  identifying characters of the input text comprising each of the multiple-character words.
2. The computer readable medium of claim 1 including instructions readable by a computer which, when implemented, cause the computer to perform steps comprising:
3. The computer readable medium of claim 1 including instructions readable by a computer which, when implemented, cause the computer to perform a step comprising:
- assigning an indication of probability that the first portion and the second portion corresponding to a proper name.
4. The computer readable medium of claim 3 including instructions readable by a computer which, when implemented, cause the computer to assign the indication of probability as a function of words proximate the first and second portions.
5. The computer readable medium of claim 1 including instructions readable by a computer which, when implemented, cause the computer to perform a step comprising:
- locating single-character words in the input text.
6. The computer readable medium of claim 1 including instructions readable by a computer which, when implemented, cause the computer in the step of comparing the sequence of single-characters to the lexical knowledge base to identify if the second portion of the sequence includes characters known to comprise the second portion of a proper name to do so as a function of character position.
7. The computer readable medium of claim 6 including instructions readable by a computer which, when implemented, cause the computer to include in the step of comparing the sequence of single-characters to the lexical knowledge base to identify if the second portion of the sequence includes characters known to comprise the second portion of a proper name, a step comprising:
- forming a record if adjacent single-characters are known to comprise the second portion as a function of character position.
8. The computer readable medium of claim 1 including instructions readable by a computer which, when implemented, cause the computer to identify proper names in unsegmented input text.
9. The computer readable medium of claim 1 including instructions readable by a computer which, when implemented, cause the computer to identify proper names in Chinese text.
10. The computer readable medium of claim 1 including instructions readable by a computer which, when implemented, cause the computer to identify full names.
11. The computer readable medium of claim 1 including instructions readable by a computer which, when implemented, cause the computer to identify full names wherein the first portion comprises a family name and the second portion forms at least a part of a given name.
12. The computer readable medium of claim 1 including instructions readable by a computer which, when implemented, cause the computer to identify institutional names.
13. The computer readable medium of claim 1 including instructions readable by a computer which, when implemented, cause the computer to identify geographical names.
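
Claims 6 and 7 describe comparing the second portion as a function of character position and forming a record for adjacent single characters. A minimal sketch of that idea, assuming a hypothetical `positions` table that maps each character to the given-name positions (1 or 2) in which it is known to occur:

```python
def given_name_records(run, positions):
    """Form a record for each adjacent pair of single characters in `run`
    where the first is known in position 1 and the second in position 2.
    `positions` is a hypothetical table: character -> set of positions."""
    records = []
    for i in range(len(run) - 1):
        first, second = run[i], run[i + 1]
        if 1 in positions.get(first, ()) and 2 in positions.get(second, ()):
            records.append((i, first + second))
    return records


positions = {"小": {1}, "明": {1, 2}, "华": {2}}
print(given_name_records("小明华", positions))
```

The record format, an offset paired with the matched characters, is chosen only for illustration.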

14. A computer readable medium including instructions readable by a computer which, when implemented, cause the computer to identify non-Chinese originated names written with Chinese characters in Chinese text by performing steps comprising:
- locating a sequence of five or more single, Chinese characters in the Chinese text not forming part of a multiple-character word; and
  
  comparing the sequence of single, Chinese characters to a lexical knowledge base to identify if the Chinese characters contained in the sequence of characters correspond to Chinese characters used in non-Chinese originated names.
15. The computer readable medium of claim 14 including instructions readable by a computer which, when implemented, cause the computer to perform steps comprising:
16. The computer readable medium of claim 15 including instructions readable by a computer which, when implemented, cause the computer to perform a step comprising:
- locating single-character words in the Chinese text.
17. The computer readable medium of claim 16 including instructions readable by a computer which, when implemented, cause the computer to perform steps comprising:
- submitting the non-Chinese originated name, the single-character words and the multiple-character words to a syntactic parser; and
  
  parsing the Chinese text with the non-Chinese originated name, the single-character words and the multiple-character words.
18. The computer readable medium of claim 17 including instructions readable by a computer which, when implemented, cause the computer to parse the Chinese text as a function of at least some of the words assigned a probability, and perform a step comprising:
- assigning a higher probability to the non-Chinese originated name than at least some of the other words.
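
The test in claim 14 can be sketched as a scan over runs of single characters. This is an illustration under stated assumptions: `translit_chars` is a hypothetical set of characters marked in the knowledge base as used in non-Chinese originated names, the scan accepts sub-runs inside a longer run, and `min_len` defaults to the five characters recited in claim 14.

```python
def find_transliterated_names(runs, translit_chars, min_len=5):
    """Return maximal stretches, at least `min_len` long, whose characters
    all appear in `translit_chars` (a hypothetical knowledge-base set)."""
    found = []
    for run in runs:
        i = 0
        while i < len(run):
            j = i
            while j < len(run) and run[j] in translit_chars:
                j += 1
            if j - i >= min_len:
                found.append(run[i:j])
            i = max(j, i + 1)  # always advance, even when nothing matched
    return found


chars = set("斯皮尔伯格克林顿")
print(find_transliterated_names(["斯皮尔伯格", "克林顿"], chars))
print(find_transliterated_names(["说克林顿"], chars, min_len=3))
```

Lowering `min_len` trades precision for recall, since short runs of these characters are more likely to be ordinary single-character words.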

19. A computer readable medium comprising a lexical knowledge base for use in identifying proper names in input text, the lexical knowledge base comprising:
- for each of a plurality of words, an indication that the word corresponds to a first portion of a proper name;
  
  for each of a plurality of characters, an indication that the character is part of a second portion of a proper name;
  
  for each of a plurality of sequences of characters, an indication that the sequence of characters corresponds to all portions of a proper name; and
  
  for each of a plurality of entries comprising a character sequence, an indication that a first character is a part of the first portion of a proper name and a second character is a first character of the second portion of a proper name.
20. The computer readable medium of claim 19 wherein the indication that the character is part of a second portion includes position information of the character in the proper name.
21. The computer readable medium of claim 19 wherein the knowledge base further comprises:
22. The computer readable medium of claim 19 wherein the proper name is a full name and the first portion comprises a family name and the second portion comprises a given name.
23. The computer readable medium of claim 22 wherein the indication that the character is part of a given name includes positional information of the character in the given name.
24. The computer readable medium of claim 23 wherein the knowledge base further comprises:
- for each of a plurality of entries comprising a character sequence, an indication that a first character is a family name and a second character is a first character of a given name.
25. The computer readable medium of claim 19 wherein the proper name comprises an institutional name.
26. The computer readable medium of claim 19 wherein the proper name comprises a geographical location.
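
The four kinds of indication recited in claim 19 could be held in a structure like the one below. This is a sketch only: the class name `NameLexicon`, its field names, and the `indications` helper are hypothetical and do not come from the patent.

```python
from dataclasses import dataclass, field


@dataclass
class NameLexicon:
    # words marked as a first portion of a proper name (e.g. family names)
    first_portions: set = field(default_factory=set)
    # character -> positions it may take in the second portion (claim 20)
    second_chars: dict = field(default_factory=dict)
    # character sequences marked as complete proper names
    full_names: set = field(default_factory=set)
    # (character of first portion, first character of second portion)
    boundary_pairs: set = field(default_factory=set)

    def indications(self, entry):
        """List which indications the knowledge base holds for `entry`."""
        labels = []
        if entry in self.first_portions:
            labels.append("first_portion")
        if entry in self.second_chars:
            labels.append("second_portion_char")
        if entry in self.full_names:
            labels.append("full_name")
        if len(entry) == 2 and (entry[0], entry[1]) in self.boundary_pairs:
            labels.append("boundary_pair")
        return labels


lex = NameLexicon(
    first_portions={"王"},
    second_chars={"小": {1}, "明": {1, 2}},
    full_names={"王小明"},
    boundary_pairs={("王", "小")},
)
print(lex.indications("王"), lex.indications("王小"))
```

Keeping the position sets in `second_chars` separate from the other fields mirrors the way claim 20 treats position information as part of the character's indication.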

27. A computer readable medium comprising a lexical knowledge base for use in identifying non-Chinese originated names in Chinese text, the lexical knowledge base comprising:
- a first field including a character; and
  
  a second field associated with the first field and including an indication that the character is part of a non-Chinese originated name.

28. A computer readable medium including instructions readable by a computer which, when implemented, cause the computer to create a lexical knowledge base for identifying proper names in input text by performing steps comprising:
- comparing a list of full proper names to be identified and a list of known portions of the full proper names and removing from each of the proper names any known portions contained therein to obtain a list comprising remaining portions of the full proper names; and
  
  storing indications in the lexical knowledge base for the list of full proper names, for the list of known portions of the full proper names, for the list of remaining portions of the full proper names, for character sequences that a first character is a part of the known portion of a full proper name and a second character is a first character of the remaining portion of a full proper name, and for positional information of characters in each of the remaining portions of the full proper names, wherein the positional information is separate from the list of remaining portions of full proper names.
29. The computer readable medium of claim 28 including instructions readable by a computer which, when implemented, cause the computer to create a lexical knowledge base wherein the known portions comprise family names and the remaining portions comprise given names.
30. The computer readable medium of claim 28 including instructions readable by a computer which, when implemented, cause the computer to create a lexical knowledge base to identify institutional names.
31. The computer readable medium of claim 28 including instructions readable by a computer which, when implemented, cause the computer to create a lexical knowledge base to identify geographical names.
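
The construction steps of claim 28 can be sketched as below. The sketch assumes that the known portion is a prefix of the full name (as with a family name written first) and that longer known portions are preferred; the function name and dictionary keys are hypothetical.

```python
def build_knowledge_base(full_names, known_portions):
    """Strip a known leading portion from each full name and store the
    indications listed in claim 28. Prefix matching is an assumption."""
    kb = {
        "full_names": set(full_names),
        "known_portions": set(known_portions),
        "remaining_portions": set(),
        "boundary_pairs": set(),
        "positions": {},  # kept separate from the remaining portions
    }
    # Try longer known portions first so a two-character family name
    # is preferred over its one-character prefix.
    ordered = sorted(known_portions, key=len, reverse=True)
    for name in full_names:
        prefix = next(
            (p for p in ordered if name.startswith(p) and len(name) > len(p)),
            None,
        )
        if prefix is None:
            continue  # no known portion found; nothing to remove
        rest = name[len(prefix):]
        kb["remaining_portions"].add(rest)
        kb["boundary_pairs"].add((prefix[-1], rest[0]))
        for pos, ch in enumerate(rest, start=1):
            kb["positions"].setdefault(ch, set()).add(pos)
    return kb


kb = build_knowledge_base(["王小明", "欧阳明", "李明"], ["王", "李", "欧", "欧阳"])
print(kb["remaining_portions"], kb["positions"])
```

The `positions` table built here has the same shape as the position information that claims 20 and 23 attach to given-name characters.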

32. A word segmentation method to identify proper names in input text, the method comprising:
- locating a sequence of single-characters in the input text not forming part of a multiple-character word;
  
  comparing the sequence of single-characters to a lexical knowledge base to identify if a first portion of the sequence corresponds to stored identifiable portions of a proper name; and
  
  comparing the sequence of single-characters to the lexical knowledge base to identify if a second portion of the sequence proximate the first portion includes characters known to comprise a second portion of a proper name;
  
  locating non-proper name multiple-character words in the input text; and
  
  identifying characters of the input text comprising each of the multiple-character words.
33. The word segmentation method of claim 32 and further comprising:
34. The word segmentation method of claim 32 and further comprising:
- assigning an indication of probability that the first portion and the second portion corresponding to a proper name.
35. The word segmentation method of claim 34 wherein assigning the indication of probability is a function of words proximate the first and second portions.
36. The word segmentation method of claim 32 wherein the step of comparing the sequence of single-characters to the lexical knowledge base to identify if the second portion of the sequence includes characters known to comprise the second portion of a proper name includes comparing as a function of character position.
37. The word segmentation method of claim 36 wherein the step of comparing the sequence of single-characters to the lexical knowledge base to identify if the second portion of the sequence includes characters known to comprise the second portion of a proper name, includes:
- forming a record if adjacent single-characters are known to comprise the second portion as a function of character position.
38. The word segmentation method of claim 32 wherein the method identifies proper names in unsegmented input text.
39. The word segmentation method of claim 32 wherein the method identifies proper names in Chinese text.
40. The word segmentation method of claim 32 wherein the method identifies at least one of full names, institutional names and geographical names.

41. A word segmentation method to identify non-Chinese originated names contained in Chinese text, the method comprising:
- locating a sequence of three or more single, Chinese characters in the Chinese text not forming part of a multiple-character word; and
  
  comparing the sequence of single, Chinese characters to a lexical knowledge base to identify if the Chinese characters contained in the sequence of characters correspond to Chinese characters used in non-Chinese originated names.
42. The word segmentation method of claim 41 and further comprising:
43. The word segmentation method of claim 42 and further comprising:
- locating single-character words in the Chinese text.
44. The word segmentation method of claim 43 and further comprising:
- submitting the non-Chinese originated name, the single-character words and the multiple-character words to a syntactic parser; and
  
  parsing the Chinese text with the non-Chinese originated name, the single-character words and the multiple-character words.
45. The word segmentation method of claim 44 and further comprising:
- assigning a higher probability to the non-Chinese originated name than at least some of the other words.
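
Claims 44 and 45 describe handing the name and word candidates to a syntactic parser, with the name given a higher probability than at least some other words. One way to prepare such candidates is sketched here; the record layout, the function name, and the probability values 0.9 and 0.5 are placeholders chosen for illustration, and no parser is implemented.

```python
def build_candidates(text, name_spans, word_spans, name_prob=0.9, word_prob=0.5):
    """Build candidate records for a parser. Probabilities are hypothetical
    placeholders; only their relative order matters in this sketch."""
    records = []
    for start, end in word_spans:
        records.append({"start": start, "end": end, "text": text[start:end],
                        "kind": "word", "prob": word_prob})
    for start, end in name_spans:
        records.append({"start": start, "end": end, "text": text[start:end],
                        "kind": "foreign_name", "prob": name_prob})
    # Order by position, most probable candidate first at each position.
    return sorted(records, key=lambda r: (r["start"], -r["prob"]))


records = build_candidates("克林顿说", [(0, 3)], [(0, 1), (1, 2), (2, 3), (3, 4)])
print(records[0]["text"], records[0]["kind"])
```

A parser consuming these records would then see the three-character name before the competing single-character readings that start at the same position.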

Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Microsoft Corporation
Inventors: Wu, Andi
Primary Examiner(s): MARIAM, DANIEL G
Application Number: US09/116,560
Publication Number: US 20020003898A1
Time in Patent Office: 2,043 Days
Field of Search: 382/185, 382/187, 382/198, 382/218, 382/228, 382/310, 382/181, 382/186, 382/229, 382/231, 382/305, 707/2, 707/6, 707/5, 704/1, 704/2, 704/4-5, 704/7-10
US Class Current: 382/185
CPC Class Codes:
G06F 40/295   Named entity recognition
G06F 40/53   Processing of non-Latin tex...
Y10S 707/99932   Access augmentation or opti...
