Method for dividing sentences into phrases using entropy calculations of word combinations based on adjacent words
Abstract
The process of dividing sentences into phrases is automated. The sentence is divided into sub-sentences using statistical analysis. Then, the sub-sentences are divided into phrases, using statistical analysis. For example, for each pair of adjacent words in the sentence, a metric is calculated which represents a strength of disconnection between the adjacent words. The sentence is divided into sub-sentences at locations in the sentence where the metric exceeds a first threshold.
10 Claims

1. A method for automating the process of dividing sentences into phrases, comprising the following steps:
(a) dividing a sentence into sub-sentences using statistical analysis, including the following substeps:
(a.1) for each pair of adjacent words in the sentence, calculating a metric which represents a strength of disconnection between the adjacent words, and
(a.2) breaking the sentence into sub-sentences at locations in the sentence where the metric exceeds a first threshold; and
(b) dividing the sub-sentences into phrases, using statistical analysis;
wherein in substep (a.1), the metric is a cutability measure that is calculated as a sum of backward entropy, forward entropy and mutual information; and
wherein in substep (a.1), forward entropy (FE) of a character Ci which immediately precedes a character Cj in a sentence is calculated using the following equation:
[equation not reproduced in this text]
where PF(Cj|Ci) is the probability of Cj following Ci.
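
The forward-entropy equation itself does not survive in this text. Assuming the term is the usual Shannon entropy of the distribution of characters that follow Ci, it would take roughly the form below. The summation range (every Cj observed after Ci in the corpus) and the base of the logarithm are assumptions, not taken from the claim.

```latex
\[
FE(C_i) = -\sum_{C_j} P_F(C_j \mid C_i)\,\log P_F(C_j \mid C_i)
\]
```

Under this reading, FE is zero when Ci is always followed by the same character and grows as the set of possible followers becomes more varied.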

2. A method for automating the process of dividing sentences into phrases, comprising the following steps:
(a) dividing a sentence into sub-sentences using statistical analysis, including the following substeps:
(a.1) for each pair of adjacent words in the sentence, calculating a metric which represents a strength of disconnection between the adjacent words, and
(a.2) breaking the sentence into sub-sentences at locations in the sentence where the metric exceeds a first threshold; and
(b) dividing the sub-sentences into phrases, using statistical analysis;
wherein in substep (a.1), the metric is a cutability measure that is calculated as a sum of backward entropy, forward entropy and mutual information; and
wherein in substep (a.1), backward entropy (BE) of a character Ci which immediately follows a character Cj in a sentence is calculated using the following equation:
[equation not reproduced in this text]
where PB(Cj|Ci) is the probability of Cj being ahead of Ci.
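
The backward-entropy equation is likewise missing from this text. Assuming it is the Shannon entropy of the distribution of characters that appear immediately ahead of Ci, a form consistent with the definition of PB would be the following. The summation range and the logarithm base are assumptions.

```latex
\[
BE(C_i) = -\sum_{C_j} P_B(C_j \mid C_i)\,\log P_B(C_j \mid C_i)
\]
```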

3. A method for automating the process of dividing sentences into phrases, comprising the following steps:
(a) dividing a sentence into sub-sentences using statistical analysis, including the following substeps:
(a.1) for each pair of adjacent words in the sentence, calculating a metric which represents a strength of disconnection between the adjacent words, and
(a.2) breaking the sentence into sub-sentences at locations in the sentence where the metric exceeds a first threshold; and
(b) dividing the sub-sentences into phrases, using statistical analysis;
wherein in substep (a.1), the metric is a cutability measure that is calculated as a sum of backward entropy, forward entropy and mutual information; and
wherein in substep (a.1), mutual information (MI) of a character Ci which immediately precedes a character Cj in a sentence is calculated using the following equation:
[equation not reproduced in this text]
where P(CiCj) is the probability that Cj exactly comes after Ci, where P(Ci) is the probability that any character chosen at random in the corpus is Ci, and where P(Cj) is the probability that any character chosen at random in the corpus is Cj.
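
The mutual-information equation is not reproduced here either. The three probabilities named in the claim are exactly the ingredients of pointwise mutual information, so one plausible reading, offered as an assumption rather than as the claimed formula, is:

```latex
\[
MI(C_i C_j) = \log \frac{P(C_i C_j)}{P(C_i)\,P(C_j)}
\]
```

With this form, MI is positive when the pair CiCj occurs more often than the two characters would by chance, and negative when it occurs less often.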

4. A method for automating the process of dividing sentences into phrases, comprising the following steps:
(a) dividing a sentence into sub-sentences using statistical analysis; and
(b) dividing the sub-sentences into phrases, using statistical analysis, including the following substeps:
(b.1) for a first word in the sub-sentence, determining an occurrence, in a corpus, of word combinations of the sub-sentence beginning with the first word,
(b.2) for a word immediately following the first word in the sub-sentence, determining an occurrence, in the corpus, of word combinations of the sub-sentence beginning with the word immediately following the first word, and
(b.3) for the first word in the sub-sentence, selecting a word combination of a first number of words starting with the first word to be used as a phrase when a ratio of the occurrence of the word combination of the first number of words starting with the first word and continuing with adjacent words in the sub-sentence to occurrence of a word combination of the first number of words starting with the word immediately following the first word and continuing with adjacent words in the sub-sentence is greater than that for any but the first number, provided the first number is less than a predetermined threshold.

5. (Dependent on claim 4)
(b.4) for a next word in the sub-sentence not included in the word combination of the first number of words starting with the first word, determining an occurrence, in the corpus, of word combinations of the sub-sentence beginning with the next word;
(b.5) for a word immediately following the next word in the sub-sentence, determining an occurrence, in the corpus, of word combinations of the sub-sentence beginning with the word immediately following the next word; and
(b.6) for the next word, selecting a word combination of a second number of words starting with the next word to be used as a phrase when a ratio of the occurrence of the word combination of the second number of words starting with the next word and continuing with adjacent words in the sub-sentence to occurrence of a word combination of the second number of words starting with the word immediately following the next word and continuing with adjacent words in the sub-sentence is greater than that for any but the second number, provided the second number is less than the predetermined threshold.
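
Substeps (b.1) through (b.6) can be read as a greedy left-to-right scan over a sub-sentence. The following is a minimal sketch of that reading, not the patentee's implementation. The names `ngram_counts`, `pick_phrase_length`, `split_into_phrases` and `max_len` are hypothetical. It assumes that "occurrence" means a raw n-gram count in the corpus, that a shifted window running past the end of the sub-sentence counts as zero, that a zero denominator is floored to 1, and that ties go to the shortest candidate.

```python
from collections import Counter


def ngram_counts(corpus, max_len):
    """Count every contiguous word combination of length 1..max_len."""
    counts = Counter()
    for sentence in corpus:
        for n in range(1, max_len + 1):
            for i in range(len(sentence) - n + 1):
                counts[tuple(sentence[i:i + n])] += 1
    return counts


def pick_phrase_length(words, start, counts, max_len):
    """Return the candidate length n (n < max_len) with the largest ratio
    count(n words from start) / count(n words from start + 1)."""
    best_n, best_ratio = 1, float("-inf")
    for n in range(1, max_len):  # the length must stay below the threshold
        if start + n > len(words):
            break
        numerator = counts[tuple(words[start:start + n])]
        if start + 1 + n <= len(words):
            denominator = counts[tuple(words[start + 1:start + 1 + n])]
        else:
            denominator = 0  # assumption: window past the end counts as zero
        ratio = numerator / max(denominator, 1)  # assumption: floor at 1
        if ratio > best_ratio:  # strict '>' keeps the shortest length on ties
            best_n, best_ratio = n, ratio
    return best_n


def split_into_phrases(words, counts, max_len):
    """Pick a phrase at the first word, then restart at the next word
    not yet covered, until the sub-sentence is exhausted."""
    phrases = []
    i = 0
    while i < len(words):
        n = pick_phrase_length(words, i, counts, max_len)
        phrases.append(words[i:i + n])
        i += n
    return phrases


if __name__ == "__main__":
    corpus = [["new", "york", "city"], ["new", "york", "city"], ["big", "city"]]
    counts = ngram_counts(corpus, 4)
    print(split_into_phrases(["big", "new", "york", "city"], counts, 4))
```

The intuition behind the ratio in this sketch is that a combination which is frequent when anchored at the current word, but rare when shifted one word to the right, is more likely to be a self-contained phrase.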

6. A method for automating the process of dividing sentences into sub-sentences, comprising the following steps:
(a) for each pair of adjacent words in the sentence, calculating a metric which represents a strength of disconnection between the adjacent words; and
(b) breaking the sentence into sub-sentences at locations in the sentence where the metric exceeds a first threshold;
wherein in step (a), the metric is a cutability measure that is calculated as a sum of backward entropy, forward entropy and mutual information; and
wherein in step (a), forward entropy (FE) of a character Ci which immediately precedes a character Cj in a sentence is calculated using the following equation:
[equation not reproduced in this text]
where PF(Cj|Ci) is the probability of Cj following Ci.

7. A method for automating the process of dividing sentences into sub-sentences, comprising the following steps:
(a) for each pair of adjacent words in the sentence, calculating a metric which represents a strength of disconnection between the adjacent words; and
(b) breaking the sentence into sub-sentences at locations in the sentence where the metric exceeds a first threshold;
wherein in step (a), the metric is a cutability measure that is calculated as a sum of backward entropy, forward entropy and mutual information; and
wherein in step (a), backward entropy (BE) of a character Ci which immediately follows a character Cj in a sentence is calculated using the following equation:
[equation not reproduced in this text]
where PB(Cj|Ci) is the probability of Cj being ahead of Ci.

8. A method for automating the process of dividing sentences into sub-sentences, comprising the following steps:
(a) for each pair of adjacent words in the sentence, calculating a metric which represents a strength of disconnection between the adjacent words; and
(b) breaking the sentence into sub-sentences at locations in the sentence where the metric exceeds a first threshold;
wherein in step (a), the metric is a cutability measure that is calculated as a sum of backward entropy, forward entropy and mutual information; and
wherein in step (a), mutual information (MI) of a character Ci which immediately precedes a character Cj in a sentence is calculated using the following equation:
[equation not reproduced in this text]
where P(CiCj) is the probability that Cj exactly comes after Ci, where P(Ci) is the probability that any character chosen at random in the corpus is Ci, and where P(Cj) is the probability that any character chosen at random in the corpus is Cj.
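
Steps (a) and (b) of claims 6 to 8 can be illustrated with a small bigram model. This is a hedged sketch under several assumptions, none of which come from the claims: FE and BE are taken to be Shannon entropies over bigram counts, MI is taken to be pointwise mutual information, unseen counts are floored at 0.5, and natural logarithms are used. The class `BigramStats`, the function `break_sentence` and the parameter `mi_weight` are hypothetical names. The claims say only that the three terms are summed, so `mi_weight` defaults to +1.0; the sign convention of the MI term is not recoverable from this text.

```python
import math
from collections import Counter


class BigramStats:
    """Unigram and adjacent-pair counts gathered from a tokenized corpus."""

    def __init__(self, corpus):
        self.unigrams = Counter()
        self.bigrams = Counter()
        for sentence in corpus:
            self.unigrams.update(sentence)
            self.bigrams.update(zip(sentence, sentence[1:]))
        self.n_tokens = sum(self.unigrams.values())
        self.n_pairs = sum(self.bigrams.values())

    @staticmethod
    def _entropy(counts):
        # Shannon entropy (natural log) of a list of positive counts.
        total = sum(counts)
        if total == 0:
            return 0.0
        return -sum(c / total * math.log(c / total) for c in counts)

    def forward_entropy(self, left):
        # Entropy of the tokens observed immediately after `left`.
        return self._entropy(
            [c for (a, _), c in self.bigrams.items() if a == left])

    def backward_entropy(self, right):
        # Entropy of the tokens observed immediately before `right`.
        return self._entropy(
            [c for (_, b), c in self.bigrams.items() if b == right])

    def mutual_information(self, left, right, floor=0.5):
        # Pointwise MI; `floor` is an assumed guard against log(0).
        p_pair = max(self.bigrams[(left, right)], floor) / self.n_pairs
        p_left = max(self.unigrams[left], floor) / self.n_tokens
        p_right = max(self.unigrams[right], floor) / self.n_tokens
        return math.log(p_pair / (p_left * p_right))

    def cutability(self, left, right, mi_weight=1.0):
        # Sum of the three terms, as the claim wording states.
        return (self.forward_entropy(left)
                + self.backward_entropy(right)
                + mi_weight * self.mutual_information(left, right))


def break_sentence(sentence, stats, threshold, mi_weight=1.0):
    """Cut between adjacent tokens wherever cutability exceeds threshold."""
    if not sentence:
        return []
    pieces = [[sentence[0]]]
    for left, right in zip(sentence, sentence[1:]):
        if stats.cutability(left, right, mi_weight) > threshold:
            pieces.append([right])
        else:
            pieces[-1].append(right)
    return pieces
```

Passing a negative `mi_weight` makes strongly associated pairs harder to cut, which may be closer to the intent of a "strength of disconnection" metric. That choice, like the threshold value, would need to be checked against the specification.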

9. A method for automating the process of dividing a sentence portion into phrases, comprising the following steps:
(a) for a first word in the sentence portion, determining an occurrence, in a corpus, of word combinations of the sentence portion beginning with the first word;
(b) for a word immediately following the first word in the sentence portion, determining an occurrence, in the corpus, of word combinations of the sentence portion beginning with the word immediately following the first word; and
(c) for the first word in the sentence portion, selecting a word combination of a first number of words starting with the first word to be used as a phrase when a ratio of the occurrence of the word combination of the first number of words starting with the first word and continuing with adjacent words in the sentence portion to occurrence of a word combination of the first number of words starting with the word immediately following the first word and continuing with adjacent words in the sentence portion is greater than that for any but the first number, provided the first number is less than a predetermined threshold.

10. (Dependent on claim 9)
(d) for a next word in the sentence portion not included in the word combination of the first number of words starting with the first word, determining an occurrence, in the corpus, of word combinations of the sentence portion beginning with the next word;
(e) for a word immediately following the next word in the sentence portion, determining an occurrence, in the corpus, of word combinations of the sentence portion beginning with the word immediately following the next word; and
(f) for the next word, selecting a word combination of a second number of words starting with the next word to be used as a phrase when a ratio of the occurrence of the word combination of the second number of words starting with the next word and continuing with adjacent words in the sentence portion to occurrence of a word combination of the second number of words starting with the word immediately following the next word and continuing with adjacent words in the sentence portion is greater than that for any but the second number, provided the second number is less than the predetermined threshold.
-
Specification