Automatic segmentation of texts comprising chunks without separators
First Claim
1. A method of segmenting into chunks syntagms of a text including individual elements written without separators, said chunks comprising strings including at least one of said individual elements, including the steps of:
- defining a lexicon including a set of strings, each string comprising at least one of said individual elements, wherein the strings in said lexicon are at least partly representative of said chunks, the lexicon also including a dynamic lexicon and a static lexicon;
- orderly searching the syntagm being segmented on an element-by-element basis by searching, within said lexicon, strings corresponding to any of said chunks, wherein, in the case of a positive search result, the corresponding chunk located is stored with an associated cost;
- checking whether the chunk located was already present in at least the dynamic lexicon and, in the case where the chunk located was already present, reducing the cost associated therewith;
- storing in a computer memory, as a result of said orderly searching, a plurality of candidate segmentation sequences, each corresponding to a respective segmentation pattern and having an associated corresponding accrued cost;
- selecting as the final result of segmentation the candidate sequence having the lowest associated accrued cost, and
- increasing said associated cost by a constant value at each new step in said searching on an element-by-element basis.
Abstract
Syntagms of a text including individual elements written without separators are segmented into chunks, each chunk being a string of at least one individual element, such as an ideogram of the Mandarin Chinese language. A lexicon is defined including a set of strings, each string having at least one of the individual elements. The syntagm being segmented is searched in order, element by element, by looking up in the lexicon strings corresponding to any of the chunks. In the case of a positive search result, the chunk located is stored with an associated cost. A check is made as to whether the chunk located was already present in the lexicon. If it was, the cost associated with it is reduced. A plurality of candidate segmentation sequences is thus generated, each corresponding to a respective segmentation pattern and having a corresponding accrued cost. The candidate sequence having the lowest accrued cost is selected as the final result of segmentation.
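The procedure summarized above can be illustrated with a minimal sketch. It is not the patented implementation, only one way to read the claim and abstract. The function name `segment`, the constants `STEP_COST`, `DYNAMIC_DISCOUNT` and `UNKNOWN_COST`, the integer cost values, and the single-element fallback for elements found in no lexicon are all assumptions made for illustration. The sketch treats `lexicon` as the full set of known strings with a base cost each, and `dynamic` as the subset already encountered. It also reads "increasing said associated cost by a constant value at each new step" as a fixed penalty per chunk appended, which is an interpretation. Latin letters stand in for ideograms.

```python
from typing import Dict, List, Optional, Set, Tuple

STEP_COST = 10        # assumed constant added each time a chunk is appended
DYNAMIC_DISCOUNT = 5  # assumed reduction for chunks already in the dynamic lexicon
UNKNOWN_COST = 50     # assumed cost of a single element found in no lexicon


def segment(syntagm: str, lexicon: Dict[str, int],
            dynamic: Set[str]) -> Tuple[List[str], int]:
    """Return the lowest-cost chunk sequence for `syntagm` and its accrued cost."""
    n = len(syntagm)
    max_len = max([len(s) for s in lexicon] + [1])
    # best[i] holds the cheapest candidate sequence covering syntagm[:i].
    best: List[Optional[Tuple[int, List[str]]]] = [None] * (n + 1)
    best[0] = (0, [])
    for start in range(n):
        # Never None: the single-element fallback always reaches the next position.
        accrued, chunks = best[start]
        for end in range(start + 1, min(n, start + max_len) + 1):
            chunk = syntagm[start:end]
            if chunk in lexicon:
                cost = lexicon[chunk]
                if chunk in dynamic:
                    # Chunk was seen before: reduce its cost.
                    cost = max(0, cost - DYNAMIC_DISCOUNT)
            elif end == start + 1:
                # Assumed fallback so that every syntagm has a segmentation.
                cost = UNKNOWN_COST
            else:
                continue
            total = accrued + cost + STEP_COST
            if best[end] is None or total < best[end][0]:
                best[end] = (total, chunks + [chunk])
    return best[n][1], best[n][0]


lexicon = {"no": 10, "now": 10, "here": 10, "where": 12}
print(segment("nowhere", lexicon, set()))
print(segment("nowhere", lexicon, {"no", "where"}))
```

With an empty dynamic lexicon the cheaper pattern is `now` + `here`. Once `no` and `where` are marked as previously seen, their reduced costs make `no` + `where` the winner. For brevity the sketch keeps only the cheapest candidate at each position rather than storing every candidate sequence, which yields the same final selection.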
Specification