Automatic segmentation of a text

US 6,374,210 B1
Filed: 11/24/1999
Issued: 04/16/2002
Est. Priority Date: 11/30/1998
Status: Expired due to Fees

First Claim

Patent Images

1. A method of segmenting a connected text into words, including the steps of:

reading an input string representing the connected text;

identifying at least one sequence of isolated words in the input string by comparing the input string to words in a dictionary; and

outputting at least one of the identified word sequences;

characterized in that the step of identifying at least one word sequence includes building a tree structure representing word sequences) in the input string in an iterative manner by;

taking the input string as a working string;

for each word of a dictionary;

comparing the word with a beginning of the working string; and

if the word matches the beginning of the working string;

forming a node in the tree representing the word;

associating with the node a part of the input string which starts at a position immediately adjacent an end position of the word; and

forming a sub-tree, linked to the node, representing word sequence(s) in the part of the input string associated with the node by using the associated part as the working string;

wherein in dependence on a predetermined criterion deciding whether new words are to be added to the tree structure;

if new words are to be added;

selecting at least one node in the tree whose associated word is to be followed by new words;

forming a plurality of new words;

each of the new words matching a beginning of the input string part associated with the selected node and consisting of a different number of characters;

for each formed new word forming a respective sub-tree linked to the selected node;

each sub-tree representing word sequence(s) starting with the respective new word in the input string part associated with the selected node.

View all claims

1 Assignment

Timeline View

Assignment View

0 Petitions

Accused Products

Abstract

A system 100 is capable of segmenting a connected text, such as Japanese or Chinese sentence, into words. The system includes means 110 for reading an input string representing the connected text. Segmentation means 120 identifies at least one word sequence in the connected text by building a tree structure representing word sequence(s) in the input string in an iterative manner. Initially the input string is taken as a working string. Each word of a dictionary 122 is compared with the beginning of the working string. A match is represented by a node in the tree, and the process is continued with the remaining part of the input string. The system further includes means 130 for outputting at least one of the identified word sequences. A language model may be used to select between candidate sequences. Preferably the system is used in a speech recognition system to update the lexicon based on representative texts.

148 Citations

12 Claims

1. A method of segmenting a connected text into words, including the steps of:
- reading an input string representing the connected text;
  
  identifying at least one sequence of isolated words in the input string by comparing the input string to words in a dictionary; and
  
  outputting at least one of the identified word sequences;
  
  characterized in that the step of identifying at least one word sequence includes building a tree structure representing word sequences) in the input string in an iterative manner by;
  
  taking the input string as a working string;
  
  for each word of a dictionary;
  
  comparing the word with a beginning of the working string; and
  
  if the word matches the beginning of the working string;
  
  forming a node in the tree representing the word;
  
  associating with the node a part of the input string which starts at a position immediately adjacent an end position of the word; and
  
  forming a sub-tree, linked to the node, representing word sequence(s) in the part of the input string associated with the node by using the associated part as the working string;
  
  wherein in dependence on a predetermined criterion deciding whether new words are to be added to the tree structure;
  
  if new words are to be added;
  
  selecting at least one node in the tree whose associated word is to be followed by new words;
  
  forming a plurality of new words;
  
  each of the new words matching a beginning of the input string part associated with the selected node and consisting of a different number of characters;
  
  for each formed new word forming a respective sub-tree linked to the selected node;
  
  each sub-tree representing word sequence(s) starting with the respective new word in the input string part associated with the selected node.
- View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10)
- - 2. A method as claimed in claim 1, wherein the method includes checking whether at least one path through the tree structure represents a word sequence which matches the entire input string, where the word sequence includes only words of the dictionary;
    - and deciding to add new words if the outcome is negative.
  - 3. A method as claimed in claim 2, wherein the step of selecting at least one node in the tree whose associated word is to be followed by new words includes identifying at least one path through the tree representing a word sequence not matching the entire input string and using an end node of the identified path as the selected node.
  - 4. A method as claimed in claim 1, wherein the method includes, for each working string:
5. A method as claimed in claim 4, wherein the threshold is one.
6. A method as claimed in claim 1, wherein the method includes counting for each word sequence the number of new words in the sequence while building the tree structure and terminating extending the tree structure along a path representing the word sequence if the counted number of new words exceeds a predetermined threshold.
7. A method as claimed in claim 1, wherein the method includes calculating a likelihood for each word sequence while building the tree structure and terminating extending the tree structure along a path representing the word sequence if the likelihood of the corresponding word sequence is lower than a predetermined threshold.
8. A method as claimed in claim 7, wherein the likelihood of a word sequence decreases as a function of the number of new words in the word sequence.
9. A method as claimed in claim 1, characterized in that the step of forming new words comprises forming up to a K words, K>
- 1, each word starting with the beginning character of the working string and, respectively, including one to K of the beginning characters of the working string.
10. A method as claimed in claim 1, characterized in that the step of outputting at least one of the word sequences represented by the tree includes selecting one of the paths through the tree, where only paths are considered of which a word represented by an end node of the path matches the end of the input string.

11. A method as claimed in 4, characterized in that the step of selecting one of the paths through the tree includes calculating a likelihood of each candidate path based on a statistical N-gram language model, where N≧
- 2 and selecting a most likely path.

12. A system for segmenting a connected text into words;
- the system including;
  
  means for reading an input string representing the connected text;
  
  means for identifying at least one sequence of isolated words in the input string by comparing the input string to words in a dictionary; and
  
  means for outputting at least one of the identified word sequences;
  
  characterized in that the means for identifying at least one word sequence is operative to build a tree structure representing word sequence(s) in the input string in an iterative manner by;
  
  taking the input string as a working string;
  
  for each word of a dictionary;
  
  comparing the word with a beginning of the working string; and
  
  if the word matches the beginning of the working string;
  
  forming a node in the tree representing the word;
  
  associating with the node a part of the input string which starts at a position immediately adjacent an end position of the word; and
  
  forming a sub-tree, linked to the node, representing word sequence(s) in the part of the input string associated with the node by using the associated part as the working string, wherein in dependence on a predetermined criterion deciding whether new words are to be added to the tree structure;
  
  if new words are to be added;
  
  selecting at least one node in the tree whose associated word is to be followed by new words;
  
  forming a plurality of new words;
  
  each of the new words matching a beginning of the input string part associated with the selected node and consisting of a different number of characters;
  
  for each formed new word forming a respective sub-tree linked to the selected node;
  
  each sub-tree representing word sequence(s) starting with the respective new word in the input string part associated with the selected node.

Specification

Resources

Litigation Campaign Assessment

Current Assignee
US Philips Corporation (Koninklijke Philips N.V.)
Original Assignee
US Philips Corporation (Koninklijke Philips N.V.)
Inventors
Chu, Ya-Cherng
Primary Examiner(s)
EDOUARD, PATRICK NESTOR

Application Number

US09/449,231
Time in Patent Office

874 Days
Field of Search

704/1, 704/9-10, 704/257, 704/255, 704/256, 707/530, 707/531, 707/534, 707/535
US Class Current

704/9
CPC Class Codes

G06F 40/53 Processing of non-Latin tex...

Automatic segmentation of a text

First Claim

1 Assignment

0 Petitions

Accused Products

Abstract

148 Citations

12 Claims

Specification

Solutions

Use Cases

Quick Links

Automatic segmentation of a text

First Claim

1 Assignment

Subscription Required

Subscription Required

0 Petitions

Subscription Required

Accused Products

Subscription Required

Abstract

148 Citations

12 Claims

Specification

Subscription Required

Solutions

Use Cases

Quick Links