System and method for identifying compounds through iterative analysis

US 7,555,428 B1
Filed: 08/21/2003
Issued: 06/30/2009
Est. Priority Date: 08/21/2003
Status: Active Grant

2 Assignments

0 Petitions

Abstract

A system and method for identifying compounds through iterative analysis of a measure of association is disclosed. A limit on the number of tokens per compound is specified. Compounds within a text corpus are iteratively evaluated. The number of occurrences of one or more n-grams within the text corpus is determined. Each n-gram includes up to a maximum number of tokens, each of which is provided in a vocabulary for the text corpus. At least one n-gram including a number of tokens equal to the limit is identified based on the number of occurrences. A measure of association between the tokens in the identified n-gram is determined. Each identified n-gram with a sufficient measure of association is added to the vocabulary as a compound token, and the limit is adjusted.
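
The loop described in the abstract can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: pointwise mutual information stands in for the unspecified measure of association, and the names `find_compounds`, `merge`, `pmi`, `threshold`, and `min_count` are hypothetical. The sketch also assumes that an accepted n-gram is rewritten as a single token in the token stream before the limit is lowered. The removal of constituent tokens from the vocabulary, which the claims recite, is left out.

```python
from collections import Counter
from math import log


def ngrams(tokens, n):
    """Yield every run of n sequential tokens as a tuple."""
    return (tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def pmi(left, right, counts, total):
    """Pointwise mutual information of two adjacent segments.

    Assumption: PMI is only a stand-in for the measure of association.
    """
    joint = counts[left + right] / total
    return log(joint / ((counts[left] / total) * (counts[right] / total)))


def merge(tokens, compounds, n):
    """Rewrite the token stream so each accepted n-gram becomes one token."""
    out, i = [], 0
    while i < len(tokens):
        gram = tuple(tokens[i:i + n])
        if gram in compounds:
            out.append(" ".join(gram))
            i += n
        else:
            out.append(tokens[i])
            i += 1
    return out


def find_compounds(tokens, max_n=3, threshold=1.0, min_count=2):
    """Iterate from the token limit down to 2, promoting strong n-grams."""
    vocab = set(tokens)
    found = []
    for n in range(max_n, 1, -1):          # the limit is lowered each pass
        counts = Counter()
        for k in range(1, n + 1):          # counts for every segment length
            counts.update(ngrams(tokens, k))
        total = len(tokens)
        accepted = set()
        for gram, count in counts.items():
            if len(gram) != n or count < min_count:
                continue
            # Score an n-gram by its weakest split into two adjacent segments.
            score = min(pmi(gram[:k], gram[k:], counts, total)
                        for k in range(1, n))
            if score > threshold:
                accepted.add(gram)
        vocab.update(" ".join(gram) for gram in accepted)
        tokens = merge(tokens, accepted, n)
        found.extend(sorted(" ".join(gram) for gram in accepted))
    return found, vocab


corpus = ("we visited new york city and ate ice cream "
          "but new york city has better ice cream").split()
found, vocab = find_compounds(corpus)
print(found)  # ['new york city', 'ice cream']
```

Working from the longest n-grams downward means "new york city" is merged before the bigram pass, so "new york" and "york city" are never scored as separate candidates.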

44 Citations

15 Claims

1. A computer-implemented method for identifying compounds in text, comprising:
- extracting a vocabulary of tokens from text;
- iterating from n>2 down to n=2 where n decreases by one each iteration and in each iteration performing the actions of:
  - identifying a plurality of unique n-grams in the text, each n-gram being an occurrence in the text of n sequential tokens, each token being found in the vocabulary;
  - dividing each n-gram into n−1 pairs of two adjacent segments, where each segment consists of at least one token;
  - for each n-gram, calculating a likelihood of collocation for each pair of the n−1 pairs of two adjacent segments of the n-gram and determining a score for the n-gram based on a lowest calculated likelihood of collocation for each of the n−1 pairs;
  - identifying a set of n-grams having scores above a threshold; and
  - adding the identified set of n-grams as compound tokens to the vocabulary and removing constituent tokens that occur in the added compound tokens from the vocabulary, wherein the iterating is performed by one or more processors.
- Dependent claims (2, 3, 4, 5):
  - 2. The method of claim 1 where calculating a likelihood of collocation for each pair of segments of the n-gram comprises determining a likelihood ratio λ for each pair of segments that is computed in accordance with the formula:

    $λ = \frac{L(H_{i})}{L(H_{c})}$

    where L(H_i) is a likelihood of observing H_i under an independence hypothesis, L(H_c) is a likelihood of observing H_c under a collocation hypothesis, and H is a pair of segments.
  - 3. The method of claim 2 where the L(H_c) is computed for each pair of segments, t₁, t₂, in each n-gram in accordance with the formula:

    $\underset{L(H_{i})}{\arg\max} \frac{L(t_{1}, t_{2} \text{ form compound})}{L(\text{n-gram does not form compound})}$.
  - 4. The method of claim 2 where, for each pair of segments, t₁, t₂, in each n-gram, the independence hypothesis comprises $P(t_{2} \mid t_{1}) = P(t_{2} \mid \bar{t}_{1})$ and the collocation hypothesis comprises $P(t_{2} \mid t_{1}) > P(t_{2} \mid \bar{t}_{1})$.
  - 5. The method of claim 1 where identifying a plurality of unique n-grams in the text comprises skipping n-grams appearing in a list of known compounds.
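
The likelihood ratio in claims 2 to 4 can be illustrated with a short Python sketch. It rests on assumptions the claims do not state: both hypotheses are modeled with binomial likelihoods, as in the standard likelihood-ratio test for collocations; the collocation hypothesis is read as t₂ being more likely after t₁ than elsewhere; and −2 log λ is used as the "likelihood of collocation" so that larger values mean stronger evidence. The function names and the count arguments `c1`, `c2`, `c12`, and `total` are hypothetical.

```python
from math import log


def log_binom(k, n, x):
    """log of x**k * (1 - x)**(n - k), treating 0 * log(0) as 0."""
    def term(count, p):
        return 0.0 if count == 0 else count * log(p)
    return term(k, x) + term(n - k, 1.0 - x)


def log_lambda(c1, c2, c12, total):
    """log of λ = L(H_i) / L(H_c) for adjacent segments t1, t2.

    c1, c2: occurrences of t1 and t2; c12: occurrences of t1 followed by t2.
    Assumes 0 < c1 < total and 0 < c2 < total.
    """
    p = c2 / total                      # H_i: P(t2|t1) = P(t2|not t1) = p
    p1 = c12 / c1                       # H_c: P(t2|t1)
    p2 = (c2 - c12) / (total - c1)      # H_c: P(t2|not t1)
    log_l_indep = (log_binom(c12, c1, p)
                   + log_binom(c2 - c12, total - c1, p))
    log_l_colloc = (log_binom(c12, c1, p1)
                    + log_binom(c2 - c12, total - c1, p2))
    return log_l_indep - log_l_colloc


def collocation_likelihood(c1, c2, c12, total):
    """-2 log λ when t2 is more likely after t1 than elsewhere, else 0."""
    if c12 / c1 <= (c2 - c12) / (total - c1):
        return 0.0
    return -2.0 * log_lambda(c1, c2, c12, total)


def ngram_score(gram, counts, total):
    """Lowest collocation likelihood over the n-1 two-segment splits."""
    return min(
        collocation_likelihood(counts[gram[:k]], counts[gram[k:]],
                               counts[gram], total)
        for k in range(1, len(gram))
    )


# Hypothetical counts: every occurrence of each segment is inside the trigram.
counts = {
    ("new",): 2, ("york",): 2, ("city",): 2,
    ("new", "york"): 2, ("york", "city"): 2,
    ("new", "york", "city"): 2,
}
print(ngram_score(("new", "york", "city"), counts, total=17))
```

Taking the minimum over splits means a single weak boundary inside an n-gram is enough to keep it below the threshold. Using one `total` for segments of different lengths is a simplification made for the sketch.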

6. A computer readable storage medium on which program code is stored, which program code, when executed by a processor, causes the processor to perform operations comprising:
- extracting a vocabulary of tokens from text;
- iterating from n>2 down to n=2 where n decreases by one each iteration and in each iteration performing the actions of:
  - identifying a plurality of unique n-grams in the text, each n-gram being an occurrence in the text of n sequential tokens, each token being found in the vocabulary;
  - dividing each n-gram into n−1 pairs of two adjacent segments, where each segment consists of at least one token;
  - for each n-gram, calculating a likelihood of collocation for each of the n−1 pairs of two adjacent segments of the n-gram and determining a score for the n-gram based on a lowest calculated likelihood of collocation for each of the n−1 pairs;
  - identifying a set of n-grams having scores above a threshold; and
  - adding the identified set of n-grams as compound tokens to the vocabulary and removing constituent tokens that occur in the added compound tokens from the vocabulary.
- Dependent claims (7, 8, 9, 10):
  - 7. The computer-readable storage medium of claim 6 where calculating a likelihood of collocation for each pair of segments of the n-gram comprises determining a likelihood ratio λ for each pair of segments that is computed in accordance with the formula:

    $λ = \frac{L(H_{i})}{L(H_{c})}$

    where L(H_i) is a likelihood of observing H_i under an independence hypothesis, L(H_c) is a likelihood of observing H_c under a collocation hypothesis, and H is a pair of segments.
  - 8. The computer-readable storage medium of claim 7 where the L(H_c) is computed for each pair of segments, t₁, t₂, in each n-gram in accordance with the formula:

    $\underset{L(H_{i})}{\arg\max} \frac{L(t_{1}, t_{2} \text{ form compound})}{L(\text{n-gram does not form compound})}$.
  - 9. The computer-readable storage medium of claim 7 where, for each pair of segments, t₁, t₂, in each n-gram, the independence hypothesis comprises $P(t_{2} \mid t_{1}) = P(t_{2} \mid \bar{t}_{1})$ and the collocation hypothesis comprises $P(t_{2} \mid t_{1}) > P(t_{2} \mid \bar{t}_{1})$.
  - 10. The computer-readable storage medium of claim 6 where identifying a plurality of unique n-grams in the text comprises skipping n-grams appearing in a list of known compounds.

11. A system comprising:
- a computer readable storage medium on which a program product is stored; and
- one or more processors configured to execute the program product and perform operations comprising:
  - extracting a vocabulary of tokens from text;
  - iterating from n>2 down to n=2 where n decreases by one each iteration and in each iteration performing the actions of:
    - identifying a plurality of unique n-grams in the text, each n-gram being an occurrence in the text of n sequential tokens, each token being found in the vocabulary;
    - dividing each n-gram into n−1 pairs of two adjacent segments, where each segment consists of at least one token;
    - for each n-gram, calculating a likelihood of collocation for each of the n−1 pairs of two adjacent segments of the n-gram and determining a score for the n-gram based on a lowest calculated likelihood of collocation for each of the n−1 pairs;
    - identifying a set of n-grams having scores above a threshold; and
    - adding the identified set of n-grams as compound tokens to the vocabulary and removing constituent tokens that occur in the added compound tokens from the vocabulary.
- Dependent claims (12, 13, 14, 15):
  - 12. The system of claim 11 where calculating a likelihood of collocation for each pair of segments of the n-gram comprises determining a likelihood ratio λ for each pair of segments that is computed in accordance with the formula:

    $λ = \frac{L(H_{i})}{L(H_{c})}$

    where L(H_i) is a likelihood of observing H_i under an independence hypothesis, L(H_c) is a likelihood of observing H_c under a collocation hypothesis, and H is a pair of segments.
  - 13. The system of claim 12 where the L(H_c) is computed for each pair of segments, t₁, t₂, in each n-gram in accordance with the formula:

    $\underset{L(H_{i})}{\arg\max} \frac{L(t_{1}, t_{2} \text{ form compound})}{L(\text{n-gram does not form compound})}$.
  - 14. The system of claim 12 where, for each pair of segments, t₁, t₂, in each n-gram, the independence hypothesis comprises $P(t_{2} \mid t_{1}) = P(t_{2} \mid \bar{t}_{1})$ and the collocation hypothesis comprises $P(t_{2} \mid t_{1}) > P(t_{2} \mid \bar{t}_{1})$.
  - 15. The system of claim 11 where identifying a plurality of unique n-grams in the text comprises skipping n-grams appearing in a list of known compounds.

Current Assignee: Google LLC (Alphabet Inc.)
Original Assignee: Google Inc. (Alphabet Inc.)
Inventors: Franz, Alexander; Milch, Brian
Primary Examiner(s): SHAH, PARAS D
Application Number: US10/647,203
Time in Patent Office: 2,140 Days
Field of Search: 704/7, 704/10, 704/9
US Class Current: 704/10
CPC Class Codes: G06F 40/284 Lexical analysis, e.g. toke...
