Automatic segmentation of a text
First Claim
1. A method of segmenting a connected text into words, including the steps of:
- reading an input string representing the connected text;
identifying at least one sequence of isolated words in the input string by comparing the input string to words in a dictionary; and
outputting at least one of the identified word sequences;
characterized in that the step of identifying at least one word sequence includes building a tree structure representing word sequences) in the input string in an iterative manner by;
taking the input string as a working string;
for each word of a dictionary;
comparing the word with a beginning of the working string; and
if the word matches the beginning of the working string;
forming a node in the tree representing the word;
associating with the node a part of the input string which starts at a position immediately adjacent an end position of the word; and
forming a sub-tree, linked to the node, representing word sequence(s) in the part of the input string associated with the node by using the associated part as the working string;
wherein in dependence on a predetermined criterion deciding whether new words are to be added to the tree structure;
if new words are to be added;
selecting at least one node in the tree whose associated word is to be followed by new words;
forming a plurality of new words;
each of the new words matching a beginning of the input string part associated with the selected node and consisting of a different number of characters;
for each formed new word forming a respective sub-tree linked to the selected node;
each sub-tree representing word sequence(s) starting with the respective new word in the input string part associated with the selected node.
1 Assignment
0 Petitions
Accused Products
Abstract
A system 100 is capable of segmenting a connected text, such as Japanese or Chinese sentence, into words. The system includes means 110 for reading an input string representing the connected text. Segmentation means 120 identifies at least one word sequence in the connected text by building a tree structure representing word sequence(s) in the input string in an iterative manner. Initially the input string is taken as a working string. Each word of a dictionary 122 is compared with the beginning of the working string. A match is represented by a node in the tree, and the process is continued with the remaining part of the input string. The system further includes means 130 for outputting at least one of the identified word sequences. A language model may be used to select between candidate sequences. Preferably the system is used in a speech recognition system to update the lexicon based on representative texts.
148 Citations
12 Claims
-
1. A method of segmenting a connected text into words, including the steps of:
-
reading an input string representing the connected text;
identifying at least one sequence of isolated words in the input string by comparing the input string to words in a dictionary; and
outputting at least one of the identified word sequences;
characterized in that the step of identifying at least one word sequence includes building a tree structure representing word sequences) in the input string in an iterative manner by;
taking the input string as a working string;
for each word of a dictionary;
comparing the word with a beginning of the working string; and
if the word matches the beginning of the working string;
forming a node in the tree representing the word;
associating with the node a part of the input string which starts at a position immediately adjacent an end position of the word; and
forming a sub-tree, linked to the node, representing word sequence(s) in the part of the input string associated with the node by using the associated part as the working string;
wherein in dependence on a predetermined criterion deciding whether new words are to be added to the tree structure;
if new words are to be added;
selecting at least one node in the tree whose associated word is to be followed by new words;
forming a plurality of new words;
each of the new words matching a beginning of the input string part associated with the selected node and consisting of a different number of characters;
for each formed new word forming a respective sub-tree linked to the selected node;
each sub-tree representing word sequence(s) starting with the respective new word in the input string part associated with the selected node.- View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10)
determining how many words of the dictionary match the beginning of the working string;
deciding to add new words if the number of words of the dictionary matching the beginning of the working string is lower than a predetermined threshold; and
selecting as the node in the tree whose associated word is to be followed by new words the node associated with the working string.
-
-
5. A method as claimed in claim 4, wherein the threshold is one.
-
6. A method as claimed in claim 1, wherein the method includes counting for each word sequence the number of new words in the sequence while building the tree structure and terminating extending the tree structure along a path representing the word sequence if the counted number of new words exceeds a predetermined threshold.
-
7. A method as claimed in claim 1, wherein the method includes calculating a likelihood for each word sequence while building the tree structure and terminating extending the tree structure along a path representing the word sequence if the likelihood of the corresponding word sequence is lower than a predetermined threshold.
-
8. A method as claimed in claim 7, wherein the likelihood of a word sequence decreases as a function of the number of new words in the word sequence.
-
9. A method as claimed in claim 1, characterized in that the step of forming new words comprises forming up to a K words, K>
- 1, each word starting with the beginning character of the working string and, respectively, including one to K of the beginning characters of the working string.
-
10. A method as claimed in claim 1, characterized in that the step of outputting at least one of the word sequences represented by the tree includes selecting one of the paths through the tree, where only paths are considered of which a word represented by an end node of the path matches the end of the input string.
-
11. A method as claimed in 4, characterized in that the step of selecting one of the paths through the tree includes calculating a likelihood of each candidate path based on a statistical N-gram language model, where N≧
- 2 and selecting a most likely path.
-
12. A system for segmenting a connected text into words;
- the system including;
means for reading an input string representing the connected text;
means for identifying at least one sequence of isolated words in the input string by comparing the input string to words in a dictionary; and
means for outputting at least one of the identified word sequences;
characterized in that the means for identifying at least one word sequence is operative to build a tree structure representing word sequence(s) in the input string in an iterative manner by;
taking the input string as a working string;
for each word of a dictionary;
comparing the word with a beginning of the working string; and
if the word matches the beginning of the working string;
forming a node in the tree representing the word;
associating with the node a part of the input string which starts at a position immediately adjacent an end position of the word; and
forming a sub-tree, linked to the node, representing word sequence(s) in the part of the input string associated with the node by using the associated part as the working string, wherein in dependence on a predetermined criterion deciding whether new words are to be added to the tree structure;
if new words are to be added;
selecting at least one node in the tree whose associated word is to be followed by new words;
forming a plurality of new words;
each of the new words matching a beginning of the input string part associated with the selected node and consisting of a different number of characters;
for each formed new word forming a respective sub-tree linked to the selected node;
each sub-tree representing word sequence(s) starting with the respective new word in the input string part associated with the selected node.
- the system including;
Specification