PROPER NAME IDENTIFICATION IN CHINESE

US 20020003898A1
Filed: 07/15/1998
Published: 01/10/2002
Est. Priority Date: 07/15/1998
Status: Active Grant

Abstract

A word segmentation method to identify proper names in input text includes locating a sequence of single-characters in the input text not forming part of a multiple-character word. The method further includes comparing the sequence of single-characters to a lexical knowledge base to identify if a first portion of the sequence corresponds to stored identifiable portions of a proper name, and comparing the sequence of single-characters to the lexical knowledge base to identify if a second portion of the sequence proximate the first portion includes characters known to comprise a second portion of a proper name. Instructions can be provided on a computer readable medium to implement the method.
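
The two comparison steps in the abstract can be illustrated with a minimal sketch. It assumes the input has already been split into tokens, so that multiple-character words are longer than one character, and it treats the "first portion" as a family name and the "second portion" as given-name characters. The names `FAMILY_NAMES`, `GIVEN_NAME_CHARS`, `single_char_runs`, `find_names` and the `max_given` limit are hypothetical and do not come from the patent, and the toy entries stand in for a real lexical knowledge base.

```python
from typing import List, Set, Tuple

# Hypothetical toy lexical knowledge base (assumption, not from the patent).
FAMILY_NAMES: Set[str] = {"王", "李", "张"}
GIVEN_NAME_CHARS: Set[str] = {"小", "明", "伟", "芳"}


def single_char_runs(tokens: List[str]) -> List[Tuple[int, int]]:
    """Return (start, end) pairs for maximal runs of one-character tokens."""
    runs: List[Tuple[int, int]] = []
    start = None
    for i, tok in enumerate(tokens):
        if len(tok) == 1:
            if start is None:
                start = i
        elif start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(tokens)))
    return runs


def find_names(tokens: List[str], max_given: int = 2) -> List[str]:
    """Find a family name followed by 1..max_given given-name characters."""
    names: List[str] = []
    for start, end in single_char_runs(tokens):
        i = start
        while i < end:
            if tokens[i] in FAMILY_NAMES:          # first portion
                j = i + 1
                while (j < end and j - i <= max_given
                       and tokens[j] in GIVEN_NAME_CHARS):  # second portion
                    j += 1
                if j > i + 1:                      # at least one given-name char
                    names.append("".join(tokens[i:j]))
                    i = j
                    continue
            i += 1
    return names


print(find_names(["今天", "王", "小", "明", "去", "学校"]))
```

Restricting the search to runs of single characters means that characters already absorbed into dictionary words are never reconsidered as name parts, which is the point of the first step in the abstract.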

48 Claims

1. A computer readable medium including instructions readable by a computer which, when implemented, cause the computer to identify proper names in input text by performing steps comprising:
- locating a sequence of single-characters in the input text not forming part of a multiple-character word;
  
  comparing the sequence of single-characters to a lexical knowledge base to identify if a first portion of the sequence corresponds to stored identifiable portions of a proper name; and
  
  comparing the sequence of single-characters to the lexical knowledge base to identify if a second portion of the sequence proximate the first portion includes characters known to comprise a second portion of a proper name.
  - 2. The computer readable medium of claim 1 including instructions readable by a computer which, when implemented, cause the computer to perform steps comprising:
    - submitting the first portion and the second portion to a syntactic parser with other words; and
      
      parsing the input text with the first portion, the second portion and the other words.
  - 3. The computer readable medium of claim 1 including instructions readable by a computer which, when implemented, cause the computer to perform a step comprising:
    - assigning an indication of probability that the first portion and the second portion correspond to a proper name.
  - 4. The computer readable medium of claim 3 including instructions readable by a computer which, when implemented, cause the computer to assign the indication of probability as a function of words proximate the first and second portions.
  - 5. The computer readable medium of claim 1 including instructions readable by a computer which, when implemented, cause the computer to perform steps comprising:
    - locating non-proper name multiple-character words in the input text; and
      
      identifying characters of the input text comprising each of the multiple-character words.
  - 7. The computer readable medium of claim 1 including instructions readable by a computer which, when implemented, cause the computer in the step of comparing the sequence of single-characters to the lexical knowledge base to identify if the second portion of the sequence includes characters known to comprise the second portion of a proper name to do so as a function of character position.
  - 8. The computer readable medium of claim 7 including instructions readable by a computer which, when implemented, cause the computer to include in the step of comparing the sequence of single-characters to the lexical knowledge base to identify if the second portion of the sequence includes characters known to comprise the second portion of a proper name, a step comprising:
    - forming a record if adjacent single-characters are known to comprise the second portion as a function of character position.
  - 9. The computer readable medium of claim 1 including instructions readable by a computer which, when implemented, cause the computer to identify proper names in unsegmented input text.
  - 10. The computer readable medium of claim 1 including instructions readable by a computer which, when implemented, cause the computer to identify proper names in Chinese text.
  - 11. The computer readable medium of claim 1 including instructions readable by a computer which, when implemented, cause the computer to identify full names.
  - 12. The computer readable medium of claim 1 including instructions readable by a computer which, when implemented, cause the computer to identify full names wherein the first portion comprises a family name and the second portion forms at least a part of a given name.
  - 13. The computer readable medium of claim 1 including instructions readable by a computer which, when implemented, cause the computer to identify institutional names.
  - 14. The computer readable medium of claim 1 including instructions readable by a computer which, when implemented, cause the computer to identify geographical names.
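
Claims 3, 4, 7 and 8 add two refinements: the second portion is matched as a function of character position, and the resulting record carries an indication of probability that can depend on nearby words. The following is a rough sketch of one way to read those claims. The tables `GIVEN_POSITIONS` and `TITLE_WORDS`, the `NameRecord` type, and the probability values 0.5 and 0.9 are all illustrative assumptions; the source does not specify how the probability is computed.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set

# Hypothetical data: positions (1 = first, 2 = second) that each character
# may occupy within a given name. Not taken from the patent.
GIVEN_POSITIONS: Dict[str, Set[int]] = {
    "小": {1},
    "明": {1, 2},
    "伟": {2},
}
# Hypothetical context words that make a name reading more likely.
TITLE_WORDS: Set[str] = {"先生", "教授"}


@dataclass
class NameRecord:
    family: str
    given: str
    probability: float


def given_name_at(chars: List[str]) -> str:
    """Longest prefix (up to 2 chars) whose k-th character allows position k."""
    given = ""
    for pos in (1, 2):
        idx = pos - 1
        if idx < len(chars) and pos in GIVEN_POSITIONS.get(chars[idx], set()):
            given += chars[idx]
        else:
            break
    return given


def make_record(family: str, chars_after: List[str],
                next_word: Optional[str]) -> Optional[NameRecord]:
    """Form a record only if the adjacent characters fit their positions."""
    given = given_name_at(chars_after)
    if not given:
        return None
    # Placeholder scores (assumption): boost when a title word follows.
    probability = 0.9 if next_word in TITLE_WORDS else 0.5
    return NameRecord(family, given, probability)


print(make_record("王", ["小", "明"], "先生"))
```

A downstream parser, as in claim 2, could then weigh the record against competing analyses of the same characters instead of treating the name as certain.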

6. The computer readable medium of claim 5 including instructions readable by a computer which, when implemented, cause the computer to perform a step comprising:
- locating single-character words in the input text.

15. A computer readable medium including instructions readable by a computer which, when implemented, cause the computer to identify non-Chinese originated names contained in Chinese text by performing steps comprising:
- locating a sequence of five or more single-characters in the input text not forming part of a multiple-character word; and
  
  comparing the sequence of single-characters to a lexical knowledge base to identify if characters contained in the sequence of characters correspond to characters used in non-Chinese originated names.
  - 16. The computer readable medium of claim 15 including instructions readable by a computer which, when implemented, cause the computer to perform steps comprising:
    - locating non-proper name multiple-character words in the input text; and
      
      identifying characters of the input text comprising each of the multiple-character words.
  - 17. The computer readable medium of claim 16 including instructions readable by a computer which, when implemented, cause the computer to perform a step comprising:
    - locating single-character words in the input text.
  - 18. The computer readable medium of claim 17 including instructions readable by a computer which, when implemented, cause the computer to perform steps comprising:
    - submitting the non-Chinese originated name, the single-character words and the multiple-character words to a syntactic parser; and
      
      parsing the input text with the non-Chinese originated name, the single-character words and the multiple-character words.
  - 19. The computer readable medium of claim 18 including instructions readable by a computer which, when implemented, cause the computer to parse the input text as a function of at least some of the words assigned a probability, and perform a step comprising:
    - assigning a higher probability to the non-Chinese originated name than at least some of the other words.
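
The approach of claim 15 can be sketched as a scan for sufficiently long runs of single characters that are all known transliteration characters. This is only one possible reading of the claim. The `TRANSLIT_CHARS` set and the function name are hypothetical, and the minimum run length is left as a parameter because the claims in this listing state different thresholds (five or more in claim 15, three or more in claim 44).

```python
from typing import List, Set

# Hypothetical set of characters commonly used to transliterate foreign names.
TRANSLIT_CHARS: Set[str] = set("克林顿斯基尔德")


def foreign_name_candidates(tokens: List[str], min_len: int) -> List[str]:
    """Return runs of >= min_len single-character tokens found in TRANSLIT_CHARS."""
    found: List[str] = []
    run: List[str] = []
    for tok in tokens + [""]:          # empty sentinel flushes the final run
        if len(tok) == 1 and tok in TRANSLIT_CHARS:
            run.append(tok)
        else:
            if len(run) >= min_len:
                found.append("".join(run))
            run = []
    return found


tokens = ["总统", "克", "林", "顿", "说"]
print(foreign_name_candidates(tokens, min_len=3))
```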

20. A computer readable medium comprising a lexical knowledge base for use in identifying proper names in input text, the lexical knowledge base comprising:
- for each of a plurality of words, an indication that the word corresponds to a first portion of a proper name; and
  
  for each of a plurality of characters, an indication that the character is part of a second portion of a proper name.
  - 21. The computer readable medium of claim 20 wherein the indication that the character is part of a second portion includes position information of the character in the proper name.
  - 22. The computer readable medium of claim 20 wherein the knowledge base further comprises:
    - for each of a plurality of words, an indication that the word corresponds to a proper name.
  - 23. The computer readable medium of claim 20 wherein the proper name is a full name and the first portion comprises a family name and the second portion comprises a given name.
  - 24. The computer readable medium of claim 23 wherein the indication that the character is part of a given name includes positional information of the character in the given name.
  - 25. The computer readable medium of claim 24 wherein the knowledge base further comprises:
    - for each of a plurality of entries comprising a character sequence, an indication that a first character is a family name and a second character is a first character of a given name.
  - 26. The computer readable medium of claim 20 wherein the proper name comprises an institutional name.
  - 27. The computer readable medium of claim 20 wherein the proper name comprises a geographical location.
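
The contents of the lexical knowledge base described in claims 20 to 25 could be modelled roughly as below. This is a sketch of a possible in-memory layout, not the storage format used by the patent; the class name, field names and helper methods are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Set, Tuple


@dataclass
class LexicalKnowledgeBase:
    # Words flagged as the first portion of a proper name (claim 20),
    # for example family names (claim 23).
    first_portions: Set[str] = field(default_factory=set)
    # Characters flagged as part of a second portion, with the positions
    # they may occupy in it (claims 20, 21 and 24).
    second_portion_positions: Dict[str, Set[int]] = field(default_factory=dict)
    # Words flagged as complete proper names (claim 22).
    proper_names: Set[str] = field(default_factory=set)
    # Two-character entries read as (family name, first given-name char)
    # (claim 25).
    family_given_pairs: Set[Tuple[str, str]] = field(default_factory=set)

    def is_first_portion(self, word: str) -> bool:
        return word in self.first_portions

    def may_occupy(self, char: str, position: int) -> bool:
        """True if char is known at this 1-based position of a second portion."""
        return position in self.second_portion_positions.get(char, set())


kb = LexicalKnowledgeBase()
kb.first_portions.add("王")
kb.second_portion_positions["明"] = {1, 2}
print(kb.is_first_portion("王"), kb.may_occupy("明", 2))
```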

28. A computer readable medium comprising a lexical knowledge base for use in identifying non-Chinese originated names in Chinese text, the lexical knowledge base comprising:
- for each of a plurality of characters, an indication that the character is part of a non-Chinese originated name.

29. A computer readable medium including instructions readable by a computer which, when implemented, cause the computer to create a lexical knowledge base for identifying proper names in input text by performing steps comprising:
- comparing a list of full proper names to be identified and a list of known portions of the full proper names and removing from each of the proper names any known portions contained therein to obtain a list comprising remaining portions of the full proper names; and
  
  storing indications in the lexical knowledge base for the list of full proper names, for the list of known portions of the full proper names, for the list of remaining portions of the full proper names, and for positional information of characters in each of the remaining portions of the full proper names.
  - 30. The computer readable medium of claim 29 including instructions readable by a computer which, when implemented, cause the computer to create a lexical knowledge base wherein the known portions comprise family names and the remaining portions comprise given names.
  - 31. The computer readable medium of claim 29 including instructions readable by a computer which, when implemented, cause the computer to create a lexical knowledge base to identify institutional names.
  - 32. The computer readable medium of claim 29 including instructions readable by a computer which, when implemented, cause the computer to create a lexical knowledge base to identify geographical names.
  - 35. The word segmentation method of claim 33 and further comprising:
    - assigning an indication of probability that the first portion and the second portion correspond to a proper name.
  - 36. The word segmentation method of claim 35 wherein assigning the indication of probability is a function of words proximate the first and second portions.
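
The knowledge-base construction steps of claim 29 can be sketched as follows: strip a known portion from each full name, keep what remains, and record the position of each remaining character. The sketch assumes the known portion is a prefix of the full name (as with a family name preceding a given name); the claim itself only says the known portion is "contained" in the name. The function name and the returned dictionary layout are hypothetical.

```python
from typing import Dict, List, Set


def build_knowledge_base(full_names: List[str],
                         known_portions: List[str]) -> Dict[str, object]:
    """Derive remaining portions and per-character positions from full names."""
    # Try longer known portions first so a two-character family name is
    # preferred over a one-character name that happens to be its prefix.
    ordered = sorted(known_portions, key=len, reverse=True)
    remaining: Set[str] = set()
    positions: Dict[str, Set[int]] = {}
    for name in full_names:
        for portion in ordered:
            if name.startswith(portion) and len(name) > len(portion):
                rest = name[len(portion):]
                remaining.add(rest)
                # Record the 1-based position of each remaining character.
                for pos, ch in enumerate(rest, start=1):
                    positions.setdefault(ch, set()).add(pos)
                break
    return {
        "full_names": set(full_names),
        "known_portions": set(known_portions),
        "remaining_portions": remaining,
        "positions": positions,
    }


kb = build_knowledge_base(["王小明", "李明", "欧阳伟"], ["王", "李", "欧", "欧阳"])
print(kb["remaining_portions"], kb["positions"])
```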

33. A word segmentation method to identify proper names in input text, the method comprising:
- locating a sequence of single-characters in the input text not forming part of a multiple-character word;
  
  comparing the sequence of single-characters to a lexical knowledge base to identify if a first portion of the sequence corresponds to stored identifiable portions of a proper name; and
  
  comparing the sequence of single-characters to the lexical knowledge base to identify if a second portion of the sequence proximate the first portion includes characters known to comprise a second portion of a proper name.
  - 34. The word segmentation method of claim 33 and further comprising:
    - submitting the first portion and the second portion to a syntactic parser with other words; and
      
      parsing the input text with the first portion, the second portion and the other words.
  - 37. The word segmentation method of claim 33 and further comprising:
    - locating non-proper name multiple-character words in the input text; and
      
      identifying characters of the input text comprising each of the multiple-character words.
  - 38. The word segmentation method of claim 33 wherein the step of comparing the sequence of single-characters to the lexical knowledge base to identify if the second portion of the sequence includes characters known to comprise the second portion of a proper name includes comparing as a function of character position.
  - 39. The word segmentation method of claim 38 wherein the step of comparing the sequence of single-characters to the lexical knowledge base to identify if the second portion of the sequence includes characters known to comprise the second portion of a proper name, includes:
    - forming a record if adjacent single-characters are known to comprise the second portion as a function of character position.
  - 40. The word segmentation method of claim 33 wherein the method identifies proper names in unsegmented input text.
  - 41. The word segmentation method of claim 33 wherein the method identifies proper names in Chinese text.
  - 43. The word segmentation method of claim 33 wherein the method identifies at least one of full names, institutional names and geographical names.

44. A word segmentation method to identify non-Chinese originated names contained in Chinese text, the method comprising:
- locating a sequence of three or more single-characters in the input text not forming part of a multiple-character word; and
  
  comparing the sequence of single-characters to a lexical knowledge base to identify if characters contained in the sequence of characters correspond to characters used in non-Chinese originated names.
  - 45. The word segmentation method of claim 44 and further comprising:
    - locating non-proper name multiple-character words in the input text; and
      
      identifying characters of the input text comprising each of the multiple-character words.
  - 46. The word segmentation method of claim 45 and further comprising:
    - locating single-character words in the input text.
  - 47. The word segmentation method of claim 46 and further comprising:
    - submitting the non-Chinese originated name, the single-character words and the multiple-character words to a syntactic parser; and
      
      parsing the input text with the non-Chinese originated name, the single-character words and the multiple-character words.
  - 48. The word segmentation method of claim 47 and further comprising:
    - assigning a higher probability to the non-Chinese originated name than at least some of the other words.

Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Microsoft Corporation
Inventors: WU, ANDI
Granted Patent: US 6,694,055 B2
US Class Current: 382/187
CPC Class Codes:
G06F 40/295   Named entity recognition
G06F 40/53   Processing of non-Latin tex...
Y10S 707/99932   Access augmentation or opti...
