Method for dividing sentences into phrases using entropy calculations of word combinations based on adjacent words

US 6,505,151 B1
Filed: 03/15/2000
Issued: 01/07/2003
Est. Priority Date: 03/15/2000
Status: Expired due to Fees

Abstract

The process of dividing sentences into phrases is automated. The sentence is first divided into sub-sentences using statistical analysis. The sub-sentences are then divided into phrases, also using statistical analysis. For example, for each pair of adjacent words in the sentence, a metric is calculated which represents a strength of disconnection between the adjacent words. The sentence is divided into sub-sentences at locations in the sentence where the metric exceeds a first threshold.
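
The thresholding step described in the abstract can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `split_at_gaps`, the callable `disconnection`, and the toy scores are all hypothetical, and the metric itself is supplied by the caller.

```python
from typing import Callable, List, Sequence


def split_at_gaps(words: Sequence[str],
                  disconnection: Callable[[str, str], float],
                  threshold: float) -> List[List[str]]:
    """Break `words` wherever the metric for an adjacent pair exceeds `threshold`."""
    if not words:
        return []
    pieces = [[words[0]]]
    for left, right in zip(words, words[1:]):
        if disconnection(left, right) > threshold:
            pieces.append([right])      # metric is high: start a new sub-sentence
        else:
            pieces[-1].append(right)    # metric is low: stay in the current one
    return pieces


# Toy metric (hypothetical values): only the gap after "rain" scores high.
toy_scores = {("rain", "we"): 2.5}


def toy_metric(left: str, right: str) -> float:
    return toy_scores.get((left, right), 0.0)


result = split_at_gaps(["after", "rain", "we", "walk"], toy_metric, threshold=1.0)
print(result)
```

With these toy scores the sentence is broken once, between "rain" and "we".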

10 Claims

1. A method for automating the process of dividing sentences into phrases, comprising the following steps:
- (a) dividing a sentence into sub-sentences using statistical analysis, including the following substeps:
  
  (a.1) for each pair of adjacent words in the sentence, calculating a metric which represents a strength of disconnection between the adjacent words, and
  
  (a.2) breaking the sentence into sub-sentences at locations in the sentence where the metric exceeds a first threshold; and
  
  (b) dividing the sub-sentences into phrases, using statistical analysis;
  
  wherein in substep (a.1), the metric is a cutability measure that is calculated as a sum of backward entropy, forward entropy and mutual information; and
  
  wherein in substep (a.1), forward entropy (FE) of a character C_i which immediately precedes a character C_j in a sentence is calculated using the following equation:
  
  $FE (C_{i}) = - \sum_{C_{j}} P_{F} (C_{j} | C_{i}) \log P_{F} (C_{j} | C_{i})$
  
  where P_F(C_j|C_i) is the probability of C_j following C_i.
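
One way to estimate the forward entropy of claim 1 from a corpus is sketched below. It is an illustration under stated assumptions: P_F(C_j|C_i) is estimated by relative frequency of adjacent character pairs, the logarithm is natural (the claim does not fix a base), and the name `forward_entropy` and the three-string toy corpus are hypothetical.

```python
import math
from collections import Counter, defaultdict


def forward_entropy(corpus):
    """Return {c_i: FE(c_i)}, estimated from counts of adjacent characters."""
    followers = defaultdict(Counter)
    for sentence in corpus:
        for c_i, c_j in zip(sentence, sentence[1:]):
            followers[c_i][c_j] += 1          # c_j was seen right after c_i
    fe = {}
    for c_i, counts in followers.items():
        total = sum(counts.values())
        # -sum over c_j of P_F(c_j | c_i) * log P_F(c_j | c_i)
        fe[c_i] = sum(-(n / total) * math.log(n / total) for n in counts.values())
    return fe


# Toy corpus (hypothetical): "a" is followed by "b" or "c"; "d" only by "b".
fe = forward_entropy(["ab", "ac", "db"])
print(fe["a"], fe["d"])
```

A character with many equally likely successors gets a high FE, which suggests a likely break after it; a character with a single observed successor gets an FE of zero.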

2. A method for automating the process of dividing sentences into phrases, comprising the following steps:
- (a) dividing a sentence into sub-sentences using statistical analysis, including the following substeps:
  
  (a.1) for each pair of adjacent words in the sentence, calculating a metric which represents a strength of disconnection between the adjacent words, and
  
  (a.2) breaking the sentence into sub-sentences at locations in the sentence where the metric exceeds a first threshold; and
  
  (b) dividing the sub-sentences into phrases, using statistical analysis;
  
  wherein in substep (a.1), the metric is a cutability measure that is calculated as a sum of backward entropy, forward entropy and mutual information; and
  
  wherein in substep (a.1), backward entropy (BE) of a character C_i which immediately follows a character C_j in a sentence is calculated using the following equation:
  
  $BE (C_{i}) = - \sum_{C_{j}} P_{B} (C_{j} | C_{i}) \log P_{B} (C_{j} | C_{i})$
  
  where P_B(C_j|C_i) is the probability of C_j being ahead of C_i.
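
The backward entropy of claim 2 can be estimated the same way, counting predecessors instead of successors. As before, this is only a sketch: relative-frequency estimates, natural logarithm, and the names `backward_entropy` and the toy corpus are assumptions, not taken from the patent.

```python
import math
from collections import Counter, defaultdict


def backward_entropy(corpus):
    """Return {c_i: BE(c_i)}, estimated from counts of preceding characters."""
    predecessors = defaultdict(Counter)
    for sentence in corpus:
        for c_j, c_i in zip(sentence, sentence[1:]):
            predecessors[c_i][c_j] += 1       # c_j was seen right before c_i
    be = {}
    for c_i, counts in predecessors.items():
        total = sum(counts.values())
        # -sum over c_j of P_B(c_j | c_i) * log P_B(c_j | c_i)
        be[c_i] = sum(-(n / total) * math.log(n / total) for n in counts.values())
    return be


# Toy corpus (hypothetical): "b" is preceded by "a" or "c"; "d" only by "a".
be = backward_entropy(["ab", "cb", "ad"])
print(be["b"], be["d"])
```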

3. A method for automating the process of dividing sentences into phrases, comprising the following steps:
- (a) dividing a sentence into sub-sentences using statistical analysis, including the following substeps:
  
  (a.1) for each pair of adjacent words in the sentence, calculating a metric which represents a strength of disconnection between the adjacent words, and
  
  (a.2) breaking the sentence into sub-sentences at locations in the sentence where the metric exceeds a first threshold; and
  
  (b) dividing the sub-sentences into phrases, using statistical analysis;
  
  wherein in substep (a.1), the metric is a cutability measure that is calculated as a sum of backward entropy, forward entropy and mutual information; and
  
  wherein in substep (a.1), mutual information (MI) of a character C_i which immediately precedes a character C_j in a sentence is calculated using the following equation:
  
  $MI (C_{i}, C_{j}) = \log \frac{P (C_{i} C_{j})}{P (C_{i}) P (C_{j})}$
  
  where P(C_iC_j) is the probability that C_j comes exactly after C_i, where P(C_i) is the probability that any character chosen at random in the corpus is C_i, and where P(C_j) is the probability that any character chosen at random in the corpus is C_j.
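
The mutual information of claim 3 might be estimated as in the sketch below. The assumptions here are mine: P(C_iC_j) is taken as the share of all adjacent pairs in the corpus that equal (C_i, C_j), P(C_i) and P(C_j) as shares of all characters, the logarithm is natural, and an unseen pair returns negative infinity rather than raising an error. The name `mutual_information` and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def mutual_information(corpus, c_i, c_j):
    """Estimate MI(c_i, c_j) = log( P(c_i c_j) / (P(c_i) * P(c_j)) )."""
    unigrams = Counter()
    bigrams = Counter()
    for sentence in corpus:
        unigrams.update(sentence)                      # single characters
        bigrams.update(zip(sentence, sentence[1:]))    # adjacent pairs
    if bigrams[(c_i, c_j)] == 0:
        return float("-inf")                           # pair never observed
    p_ij = bigrams[(c_i, c_j)] / sum(bigrams.values())
    p_i = unigrams[c_i] / sum(unigrams.values())
    p_j = unigrams[c_j] / sum(unigrams.values())
    return math.log(p_ij / (p_i * p_j))


# Toy corpus (hypothetical): 6 characters and 4 adjacent pairs in total.
toy_corpus = ["abab", "cd"]
mi_ab = mutual_information(toy_corpus, "a", "b")
mi_ad = mutual_information(toy_corpus, "a", "d")
print(mi_ab, mi_ad)
```

In the toy corpus the pair ("a", "b") accounts for 2 of 4 adjacent pairs while "a" and "b" each account for 2 of 6 characters, so the estimate is log(0.5 / (1/9)) = log 4.5.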

4. A method for automating the process of dividing sentences into phrases, comprising the following steps:
- (a) dividing a sentence into sub-sentences using statistical analysis; and
  
  (b) dividing the sub-sentences into phrases, using statistical analysis, including the following substeps:
  
  (b.1) for a first word in the sub-sentence, determining an occurrence, in a corpus, of word combinations of the sub-sentence beginning with the first word,
  
  (b.2) for a word immediately following the first word in the sub-sentence, determining an occurrence, in the corpus, of word combinations of the sub-sentence beginning with the word immediately following the first word, and
  
  (b.3) for the first word in the sub-sentence, selecting a word combination of a first number of words starting with the first word to be used as a phrase when a ratio of the occurrence of the word combination of the first number of words starting with the first word and continuing with adjacent words in the sub-sentence to occurrence of a word combination of the first number of words starting with the word immediately following the first word and continuing with adjacent words in the sub-sentence is greater than that for any but the first number, provided the first number is less than a predetermined threshold.
- 5. A method as in claim 4 wherein step (b) additionally includes the following substeps:

6. A method for automating the process of dividing sentences into sub-sentences, comprising the following steps:
- (a) for each pair of adjacent words in the sentence, calculating a metric which represents a strength of disconnection between the adjacent words; and
  
  (b) breaking the sentence into sub-sentences at locations in the sentence where the metric exceeds a first threshold;
  
  wherein in step (a), the metric is a cutability measure that is calculated as a sum of backward entropy, forward entropy and mutual information; and
  
  wherein in step (a), forward entropy (FE) of a character C_i which immediately precedes a character C_j in a sentence is calculated using the following equation:
  
  $FE (C_{i}) = - \sum_{C_{j}} P_{F} (C_{j} | C_{i}) \log P_{F} (C_{j} | C_{i})$
  
  where P_F(C_j|C_i) is the probability of C_j following C_i.

7. A method for automating the process of dividing sentences into sub-sentences, comprising the following steps:
- (a) for each pair of adjacent words in the sentence, calculating a metric which represents a strength of disconnection between the adjacent words; and
  
  (b) breaking the sentence into sub-sentences at locations in the sentence where the metric exceeds a first threshold;
  
  wherein in step (a), the metric is a cutability measure that is calculated as a sum of backward entropy, forward entropy and mutual information; and
  
  wherein in step (a), backward entropy (BE) of a character C_i which immediately follows a character C_j in a sentence is calculated using the following equation:
  
  $BE (C_{i}) = - \sum_{C_{j}} P_{B} (C_{j} | C_{i}) \log P_{B} (C_{j} | C_{i})$
  
  where P_B(C_j|C_i) is the probability of C_j being ahead of C_i.

8. A method for automating the process of dividing sentences into sub-sentences, comprising the following steps:
- (a) for each pair of adjacent words in the sentence, calculating a metric which represents a strength of disconnection between the adjacent words; and
  
  (b) breaking the sentence into sub-sentences at locations in the sentence where the metric exceeds a first threshold;
  
  wherein in step (a), the metric is a cutability measure that is calculated as a sum of backward entropy, forward entropy and mutual information; and
  
  wherein in step (a), mutual information (MI) of a character C_i which immediately precedes a character C_j in a sentence is calculated using the following equation:
  
  $MI (C_{i}, C_{j}) = \log \frac{P (C_{i} C_{j})}{P (C_{i}) P (C_{j})}$
  
  where P(C_iC_j) is the probability that C_j comes exactly after C_i, where P(C_i) is the probability that any character chosen at random in the corpus is C_i, and where P(C_j) is the probability that any character chosen at random in the corpus is C_j.
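
Claims 6 to 8 describe the cutability measure as a sum of backward entropy, forward entropy and mutual information. The sketch below takes that wording literally. Which character each entropy is evaluated on is an assumption here (FE on the left character of the gap, BE on the right one), the lookup tables hold illustrative placeholder numbers rather than corpus statistics, and the name `cut_points` is hypothetical.

```python
def cut_points(chars, fe, be, mi, threshold):
    """Return each index i where a break falls between chars[i] and chars[i + 1]."""
    cuts = []
    for i in range(len(chars) - 1):
        left, right = chars[i], chars[i + 1]
        # Literal reading of the claim: cutability is the plain sum of the three terms.
        score = fe[left] + be[right] + mi[(left, right)]
        if score > threshold:
            cuts.append(i)
    return cuts


# Placeholder tables (illustrative numbers only, not measured from any corpus).
fe = {"a": 0.2, "b": 1.5, "c": 0.1}
be = {"b": 0.3, "c": 1.0, "d": 0.2}
mi = {("a", "b"): 0.5, ("b", "c"): 0.4, ("c", "d"): 0.1}

cuts = cut_points("abcd", fe, be, mi, threshold=2.0)
print(cuts)
```

With these placeholder values only the gap between "b" and "c" scores above the threshold, so a single break is reported at index 1.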

9. A method for automating the process of dividing a sentence portion into phrases, comprising the following steps:
- (a) for a first word in the sentence portion, determining an occurrence, in a corpus, of word combinations of the sentence portion beginning with the first word;
  
  (b) for a word immediately following the first word in the sentence portion, determining an occurrence, in the corpus, of word combinations of the sentence portion beginning with the word immediately following the first word; and
  
  (c) for the first word in the sentence portion, selecting a word combination of a first number of words starting with the first word to be used as a phrase when a ratio of the occurrence of the word combination of the first number of words starting with the first word and continuing with adjacent words in the sentence portion to occurrence of a word combination of the first number of words starting with the word immediately following the first word and continuing with adjacent words in the sentence portion is greater than that for any but the first number, provided the first number is less than a predetermined threshold.
- 10. A method as in claim 9 additionally comprising the following steps:

Current Assignee: Bridgewell Incorporated
Original Assignee: Bridgewell Incorporated
Inventors: Cheng, Chih-Yuan; Chen, Kuang-Hua; Sung, Tien-Hsiung; Oyang, Yen-Jen; Chou, Peilin
Primary Examiner(s): EDOUARD, PATRICK NESTOR
Application Number: US09/525,692
Time in Patent Office: 1,028 Days
Field of Search: 704/9, 704/10, 707/3, 707/4, 707/5, 707/6, 707/530, 707/531, 707/532, 707/533, 434/169
US Class Current: 704/9
CPC Class Codes: G06F 40/289 Phrasal analysis, e.g. fini...
