Automatic segmentation of texts comprising chunks without separators
Abstract
Syntagms of a text including individual elements written without separators are segmented into chunks having strings including at least one individual element, such as an ideogram of the Mandarin Chinese language. A lexicon is defined including a set of strings, each string having at least one of the individual elements. The syntagm being segmented is searched in order, element by element, by looking up in the lexicon strings corresponding to any of the chunks. In the case of a positive search result, the chunk located is stored with an associated cost. A check is made as to whether the chunk located was already present in the lexicon. If so, the cost associated therewith is reduced. A plurality of candidate segmentation sequences is thus generated, each corresponding to a respective segmentation pattern and having a corresponding accrued cost. The candidate sequence having the lowest accrued cost is selected as the final result of segmentation.
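The procedure summarized in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation. The names `BASE_COST`, `LEXICON_DISCOUNT`, `candidate_segmentations` and `segment` are hypothetical, as are the numeric cost values. The sketch also assumes that the accrued cost of a sequence is the sum of its chunk costs, and that a single element absent from the lexicon may still be stored as a chunk at the undiscounted cost; the text above does not state either point.

```python
BASE_COST = 1.0         # assumed cost of any stored chunk (illustrative value)
LEXICON_DISCOUNT = 0.5  # assumed reduction for chunks present in the lexicon


def candidate_segmentations(syntagm, lexicon):
    """Return every candidate segmentation as a (chunks, accrued_cost) pair."""
    # No chunk needs to be longer than the longest lexicon string.
    max_len = max((len(s) for s in lexicon), default=1)

    def expand(start):
        # Reaching the end of the syntagm closes one candidate sequence.
        if start == len(syntagm):
            return [([], 0.0)]
        results = []
        # Search element by element: try each chunk beginning at `start`.
        for end in range(start + 1, min(len(syntagm), start + max_len) + 1):
            chunk = syntagm[start:end]
            in_lexicon = chunk in lexicon
            # Assumption: multi-element chunks must be lexicon strings,
            # while single elements are always allowed as a fallback.
            if not in_lexicon and len(chunk) > 1:
                continue
            # Store the chunk with a cost, reduced if it is in the lexicon.
            cost = BASE_COST - (LEXICON_DISCOUNT if in_lexicon else 0.0)
            for rest, rest_cost in expand(end):
                results.append(([chunk] + rest, cost + rest_cost))
        return results

    return expand(0)


def segment(syntagm, lexicon):
    """Select the candidate sequence with the lowest accrued cost."""
    return min(candidate_segmentations(syntagm, lexicon), key=lambda c: c[1])


if __name__ == "__main__":
    lexicon = {"北京", "大学", "北京大学"}
    chunks, cost = segment("北京大学", lexicon)
    print(chunks, cost)
```

The sketch enumerates all candidate sequences explicitly to mirror the wording of the abstract, which makes it exponential in the worst case. A dynamic-programming variant that keeps only the cheapest sequence ending at each position would select the same result under the additive-cost assumption.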
46 Claims
1-23. (canceled)
24. A method of segmenting into chunks syntagms of a text including individual elements written without separators, said chunks comprising strings including at least one of said individual elements, including the steps of:
defining a lexicon including a set of strings, each string comprising at least one of said individual elements, wherein the strings in said lexicon are at least partly representative of said chunks;
orderly searching the syntagm being segmented on an element-by-element basis by searching within said lexicon strings corresponding to any of said chunks, wherein, in the case of a positive search result, the corresponding chunk located is stored with an associated cost;
checking whether the chunk located was already present in the lexicon and, in the case where the chunk located was already present, reducing the cost associated therewith;
storing, as a result of said orderly searching, a plurality of candidate segmentation sequences, each corresponding to a respective segmentation pattern and having associated a corresponding accrued cost; and
selecting as the final result of segmentation the candidate sequence having the lowest associated accrued cost.
Specification