Using source-channel models for word segmentation

US 7,493,251 B2
Filed: 05/30/2003
Issued: 02/17/2009
Est. Priority Date: 05/30/2003
Status: Expired due to Fees

Abstract

A method and apparatus for segmenting text is provided that identifies a sequence of entity types from a sequence of characters and thereby identifies a segmentation for the sequence of characters. Under the invention, the sequence of entity types is identified using probabilistic models that describe the likelihood of a sequence of entities and the likelihood of sequences of characters given particular entities. Under one aspect of the invention, organization name entities are identified from a first sequence of identified entities to form a final sequence of identified entities.
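
The combination described in the abstract can be sketched as a dynamic program over character positions. This is a minimal sketch under stated assumptions, not the patented implementation: the toy lexicon, the entity type labels (`LW` for lexicon word, `LN` for location name), every probability value, and the function names `candidates` and `segment_text` are hypothetical, and the context model is assumed to be a bigram over entity types.

```python
import math

# Hypothetical toy data; none of these values come from the patent.
LEXICON = {"new", "york", "is", "big"}
LOCATIONS = {"newyork"}

# Assumed context model: bigram probabilities over entity types.
CONTEXT = {
    ("<s>", "LW"): 0.6, ("<s>", "LN"): 0.4,
    ("LW", "LW"): 0.5, ("LW", "LN"): 0.5,
    ("LN", "LW"): 0.9, ("LN", "LN"): 0.1,
}

def candidates(segment):
    """Yield (entity_type, class_model_probability) pairs for a candidate segment."""
    if segment in LEXICON:
        yield "LW", 1.0  # lexicon match: class model probability set to one
    if segment in LOCATIONS:
        yield "LN", 0.8  # assumed class model probability for a location name

def segment_text(text, max_len=8):
    """Return the highest-scoring list of (segment, entity_type), or [] if none exists."""
    # best[i][e] = (log score, backpointer) for the best path covering text[:i]
    # whose last entity type is e.
    best = [dict() for _ in range(len(text) + 1)]
    best[0]["<s>"] = (0.0, None)
    for end in range(1, len(text) + 1):
        for start in range(max(0, end - max_len), end):
            seg = text[start:end]
            for ent, p_class in candidates(seg):
                for prev, (score, _) in best[start].items():
                    p_ctx = CONTEXT.get((prev, ent), 1e-6)  # assumed floor for unseen bigrams
                    total = score + math.log(p_class) + math.log(p_ctx)
                    if ent not in best[end] or total > best[end][ent][0]:
                        best[end][ent] = (total, (start, prev, seg))
    if not best[-1]:
        return []
    # Follow backpointers from the best final state.
    ent = max(best[-1], key=lambda e: best[-1][e][0])
    pos, path = len(text), []
    while pos > 0:
        _, (start, prev, seg) = best[pos][ent]
        path.append((seg, ent))
        pos, ent = start, prev
    return path[::-1]

print(segment_text("newyorkisbig"))
```

With these toy numbers the location reading of `newyork` outscores the two-word reading `new` + `york`, so selecting the entity sequence also selects the segmentation.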

51 Claims

1. A method of segmenting text formed of a sequence of characters, the method comprising:
- determining a class model probability of an entity given a candidate segment of the sequence of characters;
  
  determining a context probability of a sequence of entities; and
  
  combining the class model probability and the context model probability to select a sequence of entities and thereby select a sequence of candidate segments as a segmentation of the text.
- Dependent claims (2 to 29):
  - 2. The method of claim 1 wherein determining a class model probability of an entity comprises determining that the candidate segment is a possible lexicon entity and determining a class model probability of the lexicon entity.
  - 3. The method of claim 2 wherein determining that the candidate segment is a possible lexicon entity comprises finding the candidate segment in a lexicon.
  - 4. The method of claim 3 wherein determining the class model probability comprises setting the class model probability equal to one.
  - 5. The method of claim 1 wherein determining a class model probability of an entity comprises determining that the candidate segment is a possible morphological lexicon entity and determining a class model probability of the morphological lexicon entity.
  - 6. The method of claim 5 wherein determining that the candidate segment is a possible morphological lexicon entity comprises finding the candidate segment in a morphological lexicon.
  - 7. The method of claim 6 wherein determining the class model probability comprises setting the class model probability equal to one.
  - 8. The method of claim 6 further comprising retrieving a morphological pattern from the morphological lexicon based on the candidate segment and tagging the candidate segment with the morphological pattern.
  - 9. The method of claim 1 wherein determining a class model probability of an entity comprises determining that the candidate segment is a possible name entity and determining a class model probability of the name entity.
  - 10. The method of claim 9 wherein determining that the candidate segment is a possible name entity comprises determining that the candidate segment is a possible person name entity.
  - 11. The method of claim 10 wherein determining that the candidate segment is a possible person name entity comprises matching at least one name in a list of names to at least one character in the candidate segment.
  - 12. The method of claim 11 wherein the list of names comprises a list of family names.
  - 13. The method of claim 9 wherein determining that the candidate segment is a possible name entity comprises determining that the candidate segment is a possible location name entity.
  - 14. The method of claim 13 wherein determining that the candidate segment is a possible location name entity comprises matching a location name in a list of location names to the entire candidate segment.
  - 15. The method of claim 13 wherein determining that the candidate segment is a possible location name entity comprises matching a location keyword in a list of location keywords to at least one character in the candidate segment.
  - 16. The method of claim 9 wherein determining that the candidate segment is a possible name entity comprises determining that the candidate segment is a possible transliteration name entity.
  - 17. The method of claim 16 wherein determining that the candidate segment is a possible transliteration name entity comprises matching each character in the candidate segment to a respective character in a list of characters associated with transliteration.
  - 18. The method of claim 9 wherein determining a class model probability of the name entity comprises forming the class model probability from a set of character bigram probabilities.
  - 19. The method of claim 1 wherein determining a class model probability of an entity comprises determining that the candidate segment is a possible factoid entity and determining a class model probability of the factoid entity.
  - 20. The method of claim 19 wherein determining that the candidate segment is a possible factoid entity comprises applying the candidate segment to a finite state transducer.
  - 21. The method of claim 19 wherein determining a class model probability of the factoid entity comprises setting the class model probability equal to one.
  - 22. The method of claim 1 wherein selecting a sequence of entities comprises selecting one sequence from a plurality of possible sequences of entities.
  - 23. The method of claim 22 further comprising determining a class model probability for each entity in the plurality of possible sequences of entities.
  - 24. The method of claim 1 further comprising identifying possible organization names in the selected sequence of entities.
  - 25. The method of claim 24 wherein identifying a possible organization name comprises finding a word that is in the selected sequence of candidate segments in an organization name keyword list.
  - 26. The method of claim 25 wherein identifying possible organization names further comprises identifying each of a plurality of sequences of candidate segments that end in a candidate segment that is found in the organization name keyword list as possible organization names.
  - 27. The method of claim 24 further comprising determining a class model probability for each possible organization name.
  - 28. The method of claim 27 further comprising using the class model probabilities for the possible organization names to select a second sequence of entities.
  - 29. The method of claim 28 wherein using the class model probabilities for each possible organization name to select a sequence of entities comprises using the class model probabilities for each possible organization name and the class model probabilities of the entities in the selected sequence of entities to select the second sequence of entities.
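
Claims 10 to 12 and 18 describe flagging a candidate segment as a possible person name through a list of family names and scoring it with character bigram probabilities. The sketch below is one hedged reading of that idea: the family-name list, the bigram table, the `FLOOR` back-off value, and the choice to test only the first character against the list are all assumptions made for illustration.

```python
import math

# Hypothetical family-name list and character bigram table (illustrative values only).
FAMILY_NAMES = {"李", "王", "张"}
BIGRAMS = {
    ("<s>", "李"): 0.3,
    ("李", "小"): 0.2,
    ("小", "龙"): 0.5,
}
FLOOR = 1e-4  # assumed probability for bigrams missing from the table

def is_possible_person_name(segment):
    """Assumption: a candidate qualifies when its first character is a listed family name."""
    return len(segment) >= 2 and segment[0] in FAMILY_NAMES

def person_name_class_prob(segment):
    """Class model probability formed as a product of character bigram probabilities."""
    if not is_possible_person_name(segment):
        return 0.0
    log_p, prev = 0.0, "<s>"
    for ch in segment:
        log_p += math.log(BIGRAMS.get((prev, ch), FLOOR))
        prev = ch
    return math.exp(log_p)

print(person_name_class_prob("李小龙"))
```

Working in log space avoids underflow on longer candidates; the final `exp` is only there so the result can be read as a probability.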

30. A computer-readable storage medium having encoded thereon computer-executable instructions for performing steps comprising:
- determining a class model probability for a segment of a text given a first entity;
  
  determining a class model probability for a segment of the text given a second entity; and
  
  using the class model probabilities for the first entity and the second entity to select a sequence of entities that is represented by the text and thereby segment the text.
- Dependent claims (31 to 45):
  - 31. The computer-readable storage medium of claim 30 wherein the first entity and the second entity are different from each other.
  - 32. The computer-readable storage medium of claim 30 wherein the class model probability for the first entity is determined in a different manner from the class model probability for the second entity.
  - 33. The computer-readable storage medium of claim 30 wherein the first entity is a lexicon word entity and determining a class model probability for the lexicon word entity comprises finding the segment of text in a lexicon and setting the class model probability for the lexicon word entity equal to one.
  - 34. The computer-readable storage medium of claim 30 wherein the first entity is a morphological lexicon word entity and determining a class model probability for the morphological lexicon word entity comprises finding the segment of text in a morphological lexicon and setting the class model probability for the morphological lexicon word entity equal to one.
  - 35. The computer-readable storage medium of claim 30 wherein the first entity is a name entity.
  - 36. The computer-readable storage medium of claim 35 wherein the name entity is a person name entity and determining a class model probability for the person name entity comprises finding a portion of the segment of text in a list of names and setting the class model probability for the person name entity using character bigram probabilities.
  - 37. The computer-readable storage medium of claim 35 wherein the name entity is a location name entity and determining a class model probability for the location name entity comprises finding at least a portion of the segment of text in a list of location names and setting the class model probability for the location name entity using character bigram probabilities.
  - 38. The computer-readable storage medium of claim 35 wherein the name entity is a transliteration name entity and determining a class model probability for the transliteration name entity comprises finding each character in the segment of text in a list of characters and setting the class model probability for the transliteration name entity using character bigram probabilities.
  - 39. The computer-readable storage medium of claim 38 further comprising determining class model probabilities for a person name entity, a location name entity and an organization name entity for the same segment associated with the transliteration name entity by setting the class model probabilities equal to the class model probability of the transliteration name entity.
  - 40. The computer-readable storage medium of claim 30 wherein the first entity is a factoid entity and determining a class model probability for the factoid entity comprises applying the segment of text to a finite state transducer, having the finite state transducer end in a success state and setting the class model probability for the factoid equal to one.
  - 41. The computer-readable storage medium of claim 40 wherein separate finite state transducers are provided for separate types of factoid entities.
  - 42. The computer-readable storage medium of claim 41 further comprising tagging the segment with the type of factoid entity associated with the finite state transducer.
  - 43. The computer-readable storage medium of claim 30 further comprising identifying an organization name entity from the sequence of entities.
  - 44. The computer-readable storage medium of claim 43 wherein identifying an organization name comprises finding a segment in a list of organization name keywords.
  - 45. The computer-readable storage medium of claim 43 further comprising determining a class model probability for the organization name entity.
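
Claims 40 to 42 describe running a segment through a finite state transducer per factoid type, tagging the segment with that type, and setting the class model probability to one when the machine ends in a success state. The sketch below uses plain finite-state acceptors as a simplified stand-in for transducers. The factoid types (`date`, `time`, `integer`), the symbol classes, and all function names are hypothetical.

```python
def char_class(ch):
    """Map a character to a symbol class: 'd' for any digit, otherwise the character itself."""
    return "d" if ch.isdigit() else ch

def linear_machine(pattern):
    """Build a machine that accepts exactly the given sequence of symbol classes."""
    transitions = {(i, sym): i + 1 for i, sym in enumerate(pattern)}
    return transitions, 0, {len(pattern)}

# One machine per factoid type: (transitions, start state, success states).
MACHINES = {
    "date": linear_machine("dddd-dd-dd"),
    "time": linear_machine("dd:dd"),
    "integer": ({(0, "d"): 1, (1, "d"): 1}, 0, {1}),
}

def accepts(machine, segment):
    """Return True if the machine ends in a success state after consuming the segment."""
    transitions, state, success = machine
    for ch in segment:
        state = transitions.get((state, char_class(ch)))
        if state is None:  # no transition defined: reject
            return False
    return state in success

def factoid_candidates(segment):
    """Tag the segment with every factoid type whose machine accepts it, with probability one."""
    return [(ftype, 1.0) for ftype, m in MACHINES.items() if accepts(m, segment)]

print(factoid_candidates("2003-05-30"))
print(factoid_candidates("12:30"))
```

Keeping one machine per type makes the tag a by-product of which machine succeeded, which is the relationship claims 41 and 42 describe.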

46. A method of identifying organization names in an unsegmented text, the method comprising:
- identifying a sequence of entities in the unsegmented text to thereby segment the text;
  
  identifying a possible organization name from a portion of the segmented text;
  
  determining a probability for the possible organization name based on at least a portion of the sequence of entities; and
  
  using the probability to determine whether to accept the possible organization name as an organization name.
- Dependent claims (47 to 51):
  - 47. The method of claim 46 wherein the sequence of entities comprises at least one name entity.
  - 48. The method of claim 46 wherein identifying a possible organization name comprises finding a segment of the segmented text in a list of organization name keywords.
  - 49. The method of claim 46 wherein determining a probability for the possible organization name comprises utilizing class entity bigram probabilities.
  - 50. The method of claim 49 wherein one of the class entity bigram probabilities provides the probability of a name entity given a lexicon word entity.
  - 51. The method of claim 49 wherein determining a probability for the possible organization name further comprises utilizing class model probabilities for each entity incorporated in the organization name, wherein each class model probability provides the probability of an entity given a sequence of characters associated with the entity.
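
Claims 46 to 51 can be read as a second pass over already-segmented text: find a keyword segment, propose spans that end at it, score each span with class entity bigram probabilities and the per-entity class model probabilities, and accept or reject. The sketch below follows that reading under several assumptions: the keyword list, the bigram table, the fixed acceptance `threshold`, and the span length limit are hypothetical, and the claims do not state that acceptance is a simple threshold test.

```python
# Hypothetical organization name keyword list.
ORG_KEYWORDS = {"corporation", "university"}

# Hypothetical class entity bigram probabilities inside an organization name:
# P(entity type | previous entity type), with "<s>" marking the start of the name.
ORG_BIGRAMS = {
    ("<s>", "LN"): 0.5, ("<s>", "LW"): 0.3,
    ("LN", "LW"): 0.6, ("LW", "LW"): 0.3,
    ("LW", "LN"): 0.1,  # a name entity given a lexicon word entity
}

def org_probability(span):
    """Multiply entity bigram probabilities by each entity's class model probability."""
    p, prev = 1.0, "<s>"
    for _segment, ent, p_class in span:
        p *= ORG_BIGRAMS.get((prev, ent), 1e-4) * p_class
        prev = ent
    return p

def find_org_names(entities, max_len=4, threshold=0.05):
    """entities: list of (segment, entity_type, class_model_probability) from the first pass."""
    accepted = []
    for end, (segment, _ent, _p) in enumerate(entities):
        if segment not in ORG_KEYWORDS:
            continue
        # Propose every multi-segment span that ends at the keyword.
        for start in range(max(0, end - max_len + 1), end):
            span = entities[start:end + 1]
            p = org_probability(span)
            if p >= threshold:  # assumed acceptance rule
                accepted.append((" ".join(s for s, _, _ in span), p))
    return accepted

first_pass = [
    ("visit", "LW", 1.0),
    ("redmond", "LN", 0.8),
    ("software", "LW", 1.0),
    ("corporation", "LW", 1.0),
]
for name, p in find_org_names(first_pass):
    print(name, round(p, 3))
```

With these toy values the span starting at `visit` is rejected, while the two shorter spans ending at `corporation` pass the threshold; choosing between overlapping accepted spans is left out of the sketch.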

Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Microsoft Corporation
Inventors: Zhou, Ming; Sun, Jian; Zhang, Lei; Gao, Jianfeng; Li, Mu; Huang, Chang-Ning
Primary Examiner(s): Knepper, David D
Application Number: US10/448,644
Publication Number: US 20040243408A1
Time in Patent Office: 2,090 Days
Field of Search: None
US Class Current: 704/8
CPC Class Codes:
G06F 40/268 Morphological analysis
G06F 40/284 Lexical analysis, e.g. toke...
