Automatic segmentation of continuous text using statistical approaches

US 5,806,021 A
Filed: 09/04/1996
Issued: 09/08/1998
Est. Priority Date: 10/30/1995
Status: Expired due to Term

Abstract

An automatic segmenter for continuous text segments such text in a rapid, consistent and semantically accurate manner. Two statistical methods for segmentation of continuous text are used. The first method, called "forward-backward matching", is easy and fast but can produce occasional errors in long phrases. The second method, called "statistical stack search segmenter", utilizes statistical language models to generate more accurate segmentation output at an expense of two times more execution time than the "forward-backward matching" method. In some applications where speed is a major concern, "forward-backward matching" can be used, while in other applications where highly accurate output is desired, "statistical stack search segmenter" is ideal.

178 Citations

9 Claims

1. A computer implemented method of segmenting continuous text comprising the steps of:
- a) determining a phrase from a string of characters in a first direction;
  
  b) determining from a beginning of the phrase a longest possible word beginning at the beginning of the phrase;
  
  c) repeating steps a) and b) until the phrase is completed;
  
  d) repeating steps a), b) and c) in a direction opposite said first direction, beginning with the end of the phrase and working backwards; and
  
  e) choosing a result having a higher likelihood than other possible results.
  - 2. The computer implemented method of segmenting continuous text as recited in claim 1, further comprising after step e) the steps of:
    - f) searching for every possible word that begins with a first character in the phrase and putting the words in a stack in order of language model likelihood;
      
      g) expanding a word at a top of the stack with words from a vocabulary by
      
      g1) starting with a highest likelihood result, searching for every possible word beginning with the character immediately following that word;
      
      g2) for each next word, computing a probability of a word stream containing that word and preceding words, and putting that word and the preceding words in the stack;
      
      g3) sorting and pruning the stack based upon the computed probability;
      
      g4) repeating steps g1), g2) and g3) until a top entry in the stack matches an input string; and
      
      h) outputting the top of the stack entry as a result.
  - 3. The computer implemented method of segmenting continuous text as recited in claim 2 wherein the step g2) of computing a probability is performed using a statistical language model.
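
The forward and backward passes of claim 1 can be sketched as follows. This is a minimal illustration, not the patented implementation: the toy vocabulary, its probabilities, the `UNKNOWN_PROB` floor, and the function names are all assumptions, and a unigram log-likelihood stands in for whatever likelihood measure is actually used to choose between the two results.

```python
import math

# Hypothetical toy vocabulary: word -> unigram probability (assumed values).
VOCAB = {"the": 0.05, "them": 0.005, "end": 0.002, "mend": 0.0001}
MAX_WORD_LEN = max(len(w) for w in VOCAB)
UNKNOWN_PROB = 1e-8  # assumed floor for characters not in the vocabulary


def forward_match(phrase):
    """Steps a) to c): take the longest vocabulary word at each position, left to right."""
    words, pos = [], 0
    while pos < len(phrase):
        for size in range(min(MAX_WORD_LEN, len(phrase) - pos), 0, -1):
            candidate = phrase[pos:pos + size]
            # Fall back to a single character when nothing matches.
            if candidate in VOCAB or size == 1:
                words.append(candidate)
                pos += size
                break
    return words


def backward_match(phrase):
    """Step d): the same longest-match rule, working backwards from the end."""
    words, end = [], len(phrase)
    while end > 0:
        for size in range(min(MAX_WORD_LEN, end), 0, -1):
            candidate = phrase[end - size:end]
            if candidate in VOCAB or size == 1:
                words.append(candidate)
                end -= size
                break
    words.reverse()
    return words


def log_likelihood(words):
    """Unigram log-likelihood of a segmentation (an assumed scoring rule)."""
    return sum(math.log(VOCAB.get(w, UNKNOWN_PROB)) for w in words)


def segment(phrase):
    """Step e): keep whichever pass yields the more likely result."""
    candidates = [forward_match(phrase), backward_match(phrase)]
    return max(candidates, key=log_likelihood)
```

With this vocabulary, the unspaced string `"themend"` splits as `them | end` in the forward pass and `the | mend` in the backward pass, so the two passes disagree and the likelihood comparison decides between them.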

4. A computer implemented method of segmenting continuous text comprising the steps of:
- searching for every possible word that begins with a first character in the phrase and putting the words in a stack in order of language model likelihood;
  
  expanding a word at a top of the stack with words from a vocabulary by
  
  a) starting with a highest likelihood result, searching for every possible word beginning with the character immediately following that word;
  
  b) for each next word, computing a probability of a word stream containing that word and preceding words, and putting that word and the preceding words in the stack;
  
  c) sorting and pruning the stack based upon the computed probability;
  
  d) repeating steps a), b) and c) until a top entry in the stack matches an input string; and
  
  outputting the top of the stack entry as a result.
  - 5. The computer implemented method of segmenting continuous text as recited in claim 4 further comprising, prior to the first step of searching, the initial steps of:
    - e) determining a phrase from a string of characters in a first direction;
      
      f) determining from a beginning of the phrase a longest possible word beginning at the beginning of the phrase; and
      
      g) repeating steps e) and f) until the phrase is completed.
  - 6. The computer implemented method of segmenting continuous text as recited in claim 4 wherein the step b) of computing a probability is performed using a statistical language model.
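
The stack search of claim 4 might look roughly like the sketch below. It is illustrative only: the vocabulary, the probabilities, the `max_stack` pruning width, and the single-character fallback for unknown text are assumptions, and a simple unigram model stands in for the statistical language model named in claim 6.

```python
import math

# Hypothetical toy vocabulary: word -> unigram probability (assumed values).
VOCAB = {"the": 0.05, "them": 0.005, "end": 0.002, "mend": 0.0001, "men": 0.003}
MAX_WORD_LEN = max(len(w) for w in VOCAB)
UNKNOWN_PROB = 1e-8  # assumed floor so the search can always advance


def words_starting_at(text, pos):
    """Every vocabulary word that begins at `pos`; else the single character there."""
    limit = min(MAX_WORD_LEN, len(text) - pos)
    found = [text[pos:pos + n] for n in range(1, limit + 1) if text[pos:pos + n] in VOCAB]
    return found or [text[pos]]


def word_log_prob(word):
    return math.log(VOCAB.get(word, UNKNOWN_PROB))


def stack_search(text, max_stack=10):
    """Return the first complete hypothesis that reaches the top of the stack."""
    if not text:
        return []
    # Seed the stack with every word beginning at the first character.
    stack = [(word_log_prob(w), [w]) for w in words_starting_at(text, 0)]
    stack.sort(key=lambda hyp: hyp[0], reverse=True)
    while True:
        score, words = stack[0]
        consumed = sum(len(w) for w in words)
        if consumed == len(text):
            return words  # the top entry matches the input string
        stack.pop(0)
        # Expand the top entry with each word that starts right after it.
        for w in words_starting_at(text, consumed):
            stack.append((score + word_log_prob(w), words + [w]))
        # Sort by probability and prune to the assumed stack width.
        stack.sort(key=lambda hyp: hyp[0], reverse=True)
        del stack[max_stack:]
```

Because log-probabilities are never positive, extending a hypothesis can only lower its score, so a complete hypothesis reaching the top is the best one still on the stack. Pruning with a small `max_stack` trades some of that guarantee for speed.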

7. A computer implemented method of segmenting continuous text comprising the steps of:
- a) inputting unsegmented text;
  
  b) inputting an initial vocabulary and an initial language model;
  
  c) segmenting the input unsegmented text;
  
  d) testing the segmented text to determine if a satisfactory result has been obtained;
  
  e) outputting the segmented text if the result is satisfactory;
  
  f) otherwise, refining the vocabulary and rebuilding the language model; and
  
  g) repeating steps a) and c) using the refined vocabulary and rebuilt language model and again repeating steps d) and f) until the segmented text result is satisfactory.
  - 8. The computer implemented method of segmenting continuous text recited in claim 7 wherein step c) comprises the steps of:
    - h) determining a phrase from a string of characters in a first direction;
      
      i) determining from a beginning of the phrase a longest possible word beginning at the beginning of the phrase;
      
      j) repeating steps h) and i) until the phrase is completed;
      
      k) repeating steps h), i) and j) in a direction opposite said first direction, beginning with the end of the phrase and working backwards; and
      
      l) choosing a result having a higher likelihood than other possible results.
  - 9. The computer implemented method of segmenting continuous text recited in claim 7 wherein step c) comprises the steps of:
    - m) searching for every possible word that begins with a first character in the phrase and putting the words in a stack in order of language model likelihood;
      
      n) expanding a word at a top of the stack with words from a vocabulary by
      
      n1) starting with a highest likelihood result, searching for every possible word beginning with the character immediately following that word;
      
      n2) for each next word, computing a probability of a word stream containing that word and preceding words, and putting that word and the preceding words in the stack;
      
      n3) sorting and pruning the stack based upon the computed probability;
      
      n4) repeating steps n1), n2) and n3) until a top entry in the stack matches an input string; and
      
      o) outputting the top of the stack entry as a result.
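
The segment, test, refine loop of claim 7 can be sketched as below. The claim does not say how a result is judged satisfactory or how the vocabulary is refined, so both rules here are purely hypothetical: the test is a cap on the share of out-of-vocabulary tokens, and the refinement merges the most frequent pair of adjacent unknown tokens into a new word. The toy segmenter, the unigram model, and all names are likewise assumptions.

```python
from collections import Counter


def greedy_segment(text, vocab, model):
    """Forward longest match; this toy segmenter ignores `model` for brevity."""
    max_len = max((len(w) for w in vocab), default=1)
    words, pos = [], 0
    while pos < len(text):
        for size in range(min(max_len, len(text) - pos), 0, -1):
            piece = text[pos:pos + size]
            if piece in vocab or size == 1:
                words.append(piece)
                pos += size
                break
    return words


def is_satisfactory(words, vocab, max_unknown_ratio=0.2):
    """Hypothetical test: few enough tokens fall outside the vocabulary."""
    if not words:
        return True
    unknown = sum(1 for w in words if w not in vocab)
    return unknown / len(words) <= max_unknown_ratio


def refine_vocabulary(words, vocab):
    """Hypothetical refinement: add the most frequent adjacent unknown pair."""
    pairs = Counter(a + b for a, b in zip(words, words[1:])
                    if a not in vocab and b not in vocab)
    if not pairs:
        return set(vocab)
    new_word, _ = pairs.most_common(1)[0]
    return set(vocab) | {new_word}


def build_unigram_model(words):
    """Relative-frequency unigram model, re-estimated from the previous pass."""
    counts = Counter(words)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def segment_until_satisfactory(text, vocab, model, max_rounds=10):
    vocab = set(vocab)
    words = greedy_segment(text, vocab, model)        # step c)
    for _ in range(max_rounds):
        if is_satisfactory(words, vocab):             # step d)
            break                                     # step e)
        vocab = refine_vocabulary(words, vocab)       # step f)
        model = build_unigram_model(words)
        words = greedy_segment(text, vocab, model)    # step g)
    return words, vocab, model
```

Starting from the vocabulary `{"ab"}`, the input `"abxyabxy"` first yields `ab | x | y | ab | x | y`, fails the test, gains the word `xy`, and then segments as `ab | xy | ab | xy`. The `max_rounds` cap is an added safeguard against a loop that never becomes satisfactory.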

Current Assignee: International Business Machines Corporation
Original Assignee: International Business Machines Corporation
Inventors: Liu, Fu-Hua; Picheny, Michael Alan; Chen, Chengjun Julian
Primary Examiner(s): Hudspeth, David R.
Assistant Examiner(s): Lestina, Matthew J.
Application Number: US08/700,823
Time in Patent Office: 734 Days
Field of Search: 704/1, 704/9, 704/10, 704/241, 704/242, 704/254, 704/255, 704/256, 704/257, 707/530, 707/531, 707/534, 707/535, 707/536
US Class Current: 704/9
CPC Class Codes: G06F 40/284 Lexical analysis, e.g. toke...
