Method and apparatus for automatic identification of word boundaries in continuous text and computation of word boundary scores

US 6,185,524 B1
Filed: 12/31/1998
Issued: 02/06/2001
Est. Priority Date: 12/31/1998
Status: Expired due to Fees

3 Assignments

0 Petitions

Abstract

A method and device for identifying word boundaries in continuous text compares the continuous text to a set of varying length strings to identify candidate word-initial boundaries and candidate word-final boundaries in the continuous text. Each candidate word-initial boundary and candidate word-final boundary has an associated probability value. Each candidate word boundary in the continuous text is identified by calculating a word boundary score for such candidate word boundary using the probability values associated with the candidate word-initial boundaries and candidate word-final boundaries. The set of varying length strings may include words and n-grams.
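The scoring idea in the abstract can be sketched in Python as follows. This is a minimal illustration, not the patented implementation. The tables `FINAL_PROB` and `INITIAL_PROB`, their values, the function names, and the threshold are all hypothetical. The combination rule (the product of the best word-final probability ending at a position and the best word-initial probability starting there) is an assumption, since the abstract does not state how the score is calculated. For simplicity the sketch scores the gap between two characters rather than a character itself.

```python
# Hypothetical probability tables; the strings and values are invented for illustration.
FINAL_PROB = {"he": 0.6, "the": 0.9, "at": 0.7, "cat": 0.8}    # P(string ends a word)
INITIAL_PROB = {"ca": 0.5, "cat": 0.8, "sa": 0.5, "sat": 0.7}  # P(string begins a word)
MAX_LEN = 3  # longest string in either table


def best_prob(table, candidates):
    """Highest table probability among the candidate strings (0.0 if none match)."""
    return max((table.get(s, 0.0) for s in candidates), default=0.0)


def boundary_scores(text):
    """Score every gap i (between text[i-1] and text[i]) as a possible word boundary."""
    scores = {}
    for i in range(1, len(text)):
        # Strings of length 1..MAX_LEN that end at the gap.
        ending = [text[i - n:i] for n in range(1, min(MAX_LEN, i) + 1)]
        # Strings of length 1..MAX_LEN that start at the gap.
        starting = [text[i:i + n] for n in range(1, min(MAX_LEN, len(text) - i) + 1)]
        # Assumed combination rule: product of the two best probabilities.
        scores[i] = best_prob(FINAL_PROB, ending) * best_prob(INITIAL_PROB, starting)
    return scores


def segment(text, threshold=0.5):
    """Cut the text at every gap whose score exceeds the threshold."""
    scores = boundary_scores(text)
    cuts = [0] + [i for i in sorted(scores) if scores[i] > threshold] + [len(text)]
    return [text[a:b] for a, b in zip(cuts, cuts[1:])]
```

With these invented tables, `segment("thecatsat")` returns `["the", "cat", "sat"]`: only the gaps after "the" and after "cat" have both a matching word-final string and a matching word-initial string.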

91 Citations

20 Claims

1. A computerized method for identifying word boundaries in a continuous text input, the method comprising the following digital processes:
- (a) comparing the continuous text to a set of varying length strings to identify candidate word-initial boundaries and candidate word-final boundaries in the continuous text, each candidate word-initial boundary and candidate word-final boundary being a character in the continuous text and having an associated probability value;
  
  (b) identifying each candidate word boundary in the continuous text by calculating a word boundary score for such candidate word boundary using the probability values associated with the candidate word-initial boundaries and candidate word-final boundaries identified in step (a), the candidate word boundaries defining segments of the continuous text; and
  
  (c) verifying each segment defined by the candidate word boundaries identified in step (b) against a string database.
- Dependent claims (2, 3, 4, 5, 6)
  - 2. A method according to claim 1, wherein the set of varying length strings includes words.
  - 3. A method according to claim 1, wherein the set of varying length strings includes words and n-grams.
  - 4. A method according to claim 3, wherein the words are one and two character words and the n-grams are trigrams.
  - 5. A method according to claim 3, wherein the probability value associated with a candidate word-initial boundary is the probability that the string, beginning with the candidate word-initial boundary, begins a word.
  - 6. A method according to claim 5, wherein the probability value associated with a candidate word-final boundary is the probability that the string, ending with the candidate word-final boundary, ends a word.
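Step (c) of claim 1, verifying each segment against a string database, might look like the following Python sketch. The function name, the use of a plain `set` as the "string database", and the choice to simply report a pass/fail flag per segment are assumptions. The claim does not say how the database is stored or what happens to a segment that fails verification.

```python
def verify_segments(text, boundaries, string_db):
    """Split `text` at the candidate boundary offsets and look up each segment.

    `boundaries` holds offsets into `text` where a cut is proposed;
    `string_db` is a hypothetical stand-in for the string database.
    Returns a list of (segment, found_in_database) pairs.
    """
    cuts = [0] + sorted(boundaries) + [len(text)]
    results = []
    for start, end in zip(cuts, cuts[1:]):
        segment = text[start:end]
        results.append((segment, segment in string_db))
    return results
```

For example, with `string_db = {"the", "cat", "sat"}`, the boundaries `[3, 6]` in `"thecatsat"` verify for all three segments, while `[3, 5]` yields the unverified segments `"ca"` and `"tsat"`.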

7. A computerized data processing device for identifying word boundaries in a continuous text input, the device comprising:
- a string comparator, to identify candidate word-initial boundaries and candidate word-final boundaries in the continuous text by comparing the continuous text to a set of varying length strings, each candidate word-initial boundary and candidate word-final boundary being a character in the continuous text and having an associated probability value;
  
  a boundary checker, coupled to the string comparator, to identify each candidate word boundary in the continuous text by calculating a word boundary score for such candidate word boundary using the probability values associated with the candidate word-initial boundaries and candidate word-final boundaries identified by the string comparator, the candidate word boundaries defining segments of the continuous text.
- Dependent claims (8, 9, 10, 11, 12, 13)
  - 8. A device according to claim 7, further comprising:
  - 9. A device according to claim 7, wherein the set of varying length strings includes words.
  - 10. A device according to claim 7, wherein the set of varying length strings includes words and n-grams.
  - 11. A device according to claim 10, wherein the words are one and two character words and the n-grams are trigrams.
  - 12. A device according to claim 10, wherein the probability value associated with a candidate word-initial boundary is the probability that the string, beginning with the candidate word-initial boundary, begins a word.
  - 13. A device according to claim 12, wherein the probability value associated with a candidate word-final boundary is the probability that the string, ending with the candidate word-final boundary, ends a word.
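Claims 12 and 13 define the two probability values but do not say where they come from. One plausible way to obtain them, shown here purely as an assumption, is relative-frequency estimation over a pre-segmented training text: count how often an n-gram starts on a word-initial character (or ends on a word-final character) and divide by how often it occurs at all. The function name and the training data format are hypothetical.

```python
from collections import Counter


def estimate_boundary_probabilities(words, n=3):
    """Estimate, for every n-gram, how often it begins or ends a word.

    `words` is a hypothetical pre-segmented training text (a list of words).
    Returns two dicts: P(n-gram begins a word) and P(n-gram ends a word).
    """
    text = "".join(words)  # the same text with its word boundaries removed
    total = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    begins, ends = Counter(), Counter()
    pos = 0
    for word in words:
        end = pos + len(word)
        if pos + n <= len(text):
            begins[text[pos:pos + n]] += 1  # n-gram starting on a word-initial character
        if end - n >= 0:
            ends[text[end - n:end]] += 1    # n-gram ending on a word-final character
        pos = end
    p_initial = {g: begins[g] / total[g] for g in total}
    p_final = {g: ends[g] / total[g] for g in total}
    return p_initial, p_final
```

Note that an n-gram starting on a word-initial character may run past the end of a short word into the next one, which is consistent with matching strings against continuous, unsegmented text. Setting `n=3` corresponds to the trigrams named in claim 11.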

14. A digital storage medium encoded with instructions which, when loaded into a computer, establishes a device for identifying word boundaries in continuous text, the device including:
- a string comparator, to identify candidate word-initial boundaries and candidate word-final boundaries in the continuous text by comparing the continuous text to a set of varying length strings, each candidate word-initial boundary and candidate word-final boundary being a character in the continuous text and having an associated probability value; and
  
  a boundary checker, coupled to the string comparator, to identify each candidate word boundary in the continuous text by calculating a word boundary score for such candidate word boundary using the probability values associated with the candidate word-initial boundaries and candidate word-final boundaries identified by the string comparator, the candidate word boundaries defining segments of the continuous text.
- Dependent claims (15, 16, 17, 18, 19, 20)
  - 15. A storage medium according to claim 14, the device further including:
  - 16. A digital storage medium according to claim 14, wherein the set of varying length strings includes words.
  - 17. A storage medium according to claim 14, wherein the set of varying length strings includes words and n-grams.
  - 18. A storage medium according to claim 17, wherein the words are one and two character words and the n-grams are trigrams.
  - 19. A storage medium according to claim 17, wherein the probability value associated with a candidate word-initial boundary is the probability that the string, beginning with the candidate word-initial boundary, begins a word.
  - 20. A storage medium according to claim 19, wherein the probability value associated with a candidate word-final boundary is the probability that the string, ending with the candidate word-final boundary, ends a word.

Current Assignee: Vantage Technology Holdings LLC
Original Assignee: Lernout & Hauspie Speech Products NV (Intel Corporation)
Inventors: Carus, Alwin B.; Good, Kathleen
Primary Examiner(s): Thomas, Joseph
Application Number: US09/223,959
Time in Patent Office: 768 Days
Field of Search: 704/1, 704/9, 704/10, 704/8, 704/251, 704/252, 704/253, 704/254, 704/255, 704/256, 704/257, 707/530, 707/531, 707/532, 707/533, 341/28, 382/185
US Class Current: 704/9
CPC Class Codes: G06F 40/284 Lexical analysis, e.g. toke...
