Method and apparatus for automatic identification of word boundaries in continuous text and computation of word boundary scores
Abstract
A method and device for identifying word boundaries in continuous text compares the continuous text to a set of varying length strings to identify candidate word-initial boundaries and candidate word-final boundaries in the continuous text. Each candidate word-initial boundary and candidate word-final boundary has an associated probability value. Each candidate word boundary in the continuous text is identified by calculating a word boundary score for such candidate word boundary using the probability values associated with the candidate word-initial boundaries and candidate word-final boundaries. The set of varying length strings may include words and n-grams.
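As an illustration only, the scoring described in the abstract can be sketched in Python. The abstract does not say how the two probability values are combined, so this sketch assumes that the score for the gap between two characters is the product of the best word-final probability of a string ending just before the gap and the best word-initial probability of a string beginning just after it. The tables P_INITIAL and P_FINAL, the limit MAX_LEN, the threshold, and all function names are hypothetical and are not taken from the patent.

```python
from typing import List

# Hypothetical toy tables (not from the patent):
# P_INITIAL[s] = assumed probability that string s begins a word,
# P_FINAL[s]   = assumed probability that string s ends a word.
P_INITIAL = {"th": 0.9, "ca": 0.8, "sa": 0.7, "t": 0.3}
P_FINAL = {"he": 0.9, "at": 0.8, "t": 0.6, "e": 0.5}
MAX_LEN = 2  # longest string compared against the text in this sketch


def initial_prob(text: str, i: int) -> float:
    """Best P_INITIAL over strings of length 1..MAX_LEN starting at index i."""
    return max(
        (P_INITIAL.get(text[i:i + n], 0.0)
         for n in range(1, MAX_LEN + 1) if i + n <= len(text)),
        default=0.0,
    )


def final_prob(text: str, i: int) -> float:
    """Best P_FINAL over strings of length 1..MAX_LEN ending at index i."""
    return max(
        (P_FINAL.get(text[i - n + 1:i + 1], 0.0)
         for n in range(1, MAX_LEN + 1) if i - n + 1 >= 0),
        default=0.0,
    )


def boundary_scores(text: str) -> List[float]:
    """Assumed score for the gap between text[k] and text[k + 1]."""
    return [final_prob(text, k) * initial_prob(text, k + 1)
            for k in range(len(text) - 1)]


def candidate_boundaries(text: str, threshold: float = 0.5) -> List[int]:
    """Indices at which a new segment would start (score >= threshold)."""
    return [k + 1 for k, s in enumerate(boundary_scores(text)) if s >= threshold]
```

With these toy tables, candidate_boundaries("thecat") yields [3], which splits the input into "the" and "cat". The product rule and the fixed threshold are design choices made for the sketch; other ways of combining the two probabilities would fit the abstract equally well.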
20 Claims
-
1. A computerized method for identifying word boundaries in a continuous text input, the method comprising the following digital processes:
-
(a) comparing the continuous text to a set of varying length strings to identify candidate word-initial boundaries and candidate word-final boundaries in the continuous text, each candidate word-initial boundary and candidate word-final boundary being a character in the continuous text and having an associated probability value;
(b) identifying each candidate word boundary in the continuous text by calculating a word boundary score for such candidate word boundary using the probability values associated with the candidate word-initial boundaries and candidate word-final boundaries identified in step (a), the candidate word boundaries defining segments of the continuous text; and
(c) verifying each segment defined by the candidate word boundaries identified in step (b) against a string database.
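Step (c) can be illustrated with a minimal Python sketch. It assumes the candidate boundaries are given as character indices at which a new segment starts, and it models the string database as a plain set of strings. The device claims name a chart parser for this step; the sketch substitutes a simple membership test and is not a chart parser. The function names and the sample data are hypothetical.

```python
from typing import List, Set, Tuple


def split_at(text: str, boundaries: List[int]) -> List[str]:
    """Split text into segments at the given start-of-segment indices."""
    # Ignore out-of-range or duplicate indices so no empty segment is produced.
    inner = sorted({b for b in boundaries if 0 < b < len(text)})
    cuts = [0] + inner + [len(text)]
    return [text[a:b] for a, b in zip(cuts, cuts[1:])]


def verify_segments(
    text: str, boundaries: List[int], string_db: Set[str]
) -> List[Tuple[str, bool]]:
    """Pair each segment with whether it is found in the string database."""
    return [(seg, seg in string_db) for seg in split_at(text, boundaries)]
```

For example, verify_segments("thecat", [3], {"the", "cat"}) reports both segments as found, while a boundary at index 2 produces the segments "th" and "ecat", neither of which is in that database.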
-
-
7. A computerized data processing device for identifying word boundaries in a continuous text input, the device comprising:
-
a string comparator, to identify candidate word-initial boundaries and candidate word-final boundaries in the continuous text by comparing the continuous text to a set of varying length strings, each candidate word-initial boundary and candidate word-final boundary being a character in the continuous text and having an associated probability value;
a boundary checker, coupled to the string comparator, to identify each candidate word boundary in the continuous text by calculating a word boundary score for such candidate word boundary using the probability values associated with the candidate word-initial boundaries and candidate word-final boundaries identified by the string comparator, the candidate word boundaries defining segments of the continuous text.
-
8. A device according to claim 7, further comprising:
-
a string database; and
a chart parser, coupled to the boundary checker, to verify each segment defined by the candidate word boundaries identified by the boundary checker against the string database.
-
-
9. A device according to claim 7, wherein the set of varying length strings includes words.
-
10. A device according to claim 7, wherein the set of varying length strings includes words and n-grams.
-
11. A device according to claim 10, wherein the words are one and two character words and the n-grams are trigrams.
-
12. A device according to claim 10, wherein the probability value associated with a candidate word-initial boundary is the probability that the string, beginning with the candidate word-initial boundary, begins a word.
-
13. A device according to claim 12, wherein the probability value associated with a candidate word-final boundary is the probability that the string, ending with the candidate word-final boundary, ends a word.
-
14. A digital storage medium encoded with instructions which, when loaded into a computer, establishes a device for identifying word boundaries in continuous text, the device including:
-
a string comparator, to identify candidate word-initial boundaries and candidate word-final boundaries in the continuous text by comparing the continuous text to a set of varying length strings, each candidate word-initial boundary and candidate word-final boundary being a character in the continuous text and having an associated probability value; and
a boundary checker, coupled to the string comparator, to identify each candidate word boundary in the continuous text by calculating a word boundary score for such candidate word boundary using the probability values associated with the candidate word-initial boundaries and candidate word-final boundaries identified by the string comparator, the candidate word boundaries defining segments of the continuous text.
-
15. A storage medium according to claim 14, the device further including:
-
a string database; and
a chart parser, coupled to the boundary checker, to verify each segment defined by the candidate word boundaries identified by the boundary checker against the string database.
-
-
16. A digital storage medium according to claim 14, wherein the set of varying length strings includes words.
-
17. A storage medium according to claim 14, wherein the set of varying length strings includes words and n-grams.
-
18. A storage medium according to claim 17, wherein the words are one and two character words and the n-grams are trigrams.
-
19. A storage medium according to claim 17, wherein the probability value associated with a candidate word-initial boundary is the probability that the string, beginning with the candidate word-initial boundary, begins a word.
-
20. A storage medium according to claim 19, wherein the probability value associated with a candidate word-final boundary is the probability that the string, ending with the candidate word-final boundary, ends a word.
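Claims 12, 13, 19 and 20 define what the two probability values mean but not how they are obtained. One possible way to estimate them, shown here purely as an assumption, is to count n-grams in a list of already segmented words: the word-initial probability of an n-gram is taken as the share of its occurrences in the unsegmented text that fall at the start of a word, and the word-final probability as the share that fall at the end of a word. The function name estimate_probs and the sample word list are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple


def estimate_probs(
    words: List[str], n: int
) -> Tuple[Dict[str, float], Dict[str, float]]:
    """Estimate (P_initial, P_final) for n-grams from segmented words.

    Assumed estimator, not taken from the patent:
    P_initial[g] = (# times g starts a word) / (# times g occurs in the text)
    P_final[g]   = (# times g ends a word)   / (# times g occurs in the text)
    """
    text = "".join(words)  # the same words with the boundaries removed
    total = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    starts = Counter(w[:n] for w in words if len(w) >= n)
    ends = Counter(w[-n:] for w in words if len(w) >= n)
    p_initial = {g: starts[g] / total[g] for g in total}
    p_final = {g: ends[g] / total[g] for g in total}
    return p_initial, p_final
```

For the word list ["at", "cat"] and n = 2, the bigram "at" occurs twice in "atcat", starts a word once and ends a word twice, so its estimated word-initial probability is 0.5 and its word-final probability is 1.0.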
Specification