
System and method for identifying compounds through iterative analysis

  • US 7,555,428 B1
  • Filed: 08/21/2003
  • Issued: 06/30/2009
  • Est. Priority Date: 08/21/2003
  • Status: Active Grant
First Claim

1. A computer-implemented method for identifying compounds in text, comprising:

  • extracting a vocabulary of tokens from text;

    iterating from n>2 down to n=2 where n decreases by one each iteration and in each iteration performing the actions of;

    identifying a plurality of unique n-grams in the text, each n-gram being an occurrence in the text of n sequential tokens, each token being found in the vocabulary;

    dividing each n-gram into n−1 pairs of two adjacent segments, where each segment consists of at least one token;

    for each n-gram, calculating a likelihood of collocation for each pair of the n−1 pairs of two adjacent segments of the n-gram and determining a score for the n-gram based on a lowest calculated likelihood of collocation for the each of the n−1 pairs;

    identifying a set of n-grams having scores above a threshold; and

    adding the identified set of n-grams as compound tokens to the vocabulary and removing constituent tokens that occur in the added compound tokens from the vocabulary, wherein the iterating is performed by one or more processors.
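
The steps of claim 1 can be illustrated with a minimal Python sketch. It is one possible reading of the claim, not the patented implementation. The claim does not say how the likelihood of collocation is computed, so the discounted co-occurrence ratio used here is an assumption chosen for illustration. The whitespace tokenizer, the function names `count_ngrams` and `find_compounds`, and the parameters `max_n`, `threshold`, and `delta` are likewise hypothetical.

```python
from collections import Counter


def count_ngrams(tokens, n):
    """Count every run of n sequential tokens in the token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def find_compounds(text, max_n=3, threshold=0.1, delta=1.0):
    # Assumption: lowercase whitespace tokenization stands in for vocabulary extraction.
    tokens = text.lower().split()
    vocabulary = set(tokens)
    # Counts for every segment length that can appear in a split (1..max_n).
    counts = {k: count_ngrams(tokens, k) for k in range(1, max_n + 1)}

    def likelihood(left, right):
        # Assumed measure (not from the claim): discounted co-occurrence ratio.
        joint = counts[len(left) + len(right)][left + right]
        return (joint - delta) / (counts[len(left)][left] * counts[len(right)][right])

    compounds = []
    for n in range(max_n, 1, -1):  # iterate from n = max_n down to n = 2
        selected = []
        for ngram in counts[n]:
            # Every token of the n-gram must still be in the vocabulary.
            if not all(tok in vocabulary for tok in ngram):
                continue
            # n - 1 ways to split into two adjacent segments; score is the lowest.
            score = min(likelihood(ngram[:i], ngram[i:]) for i in range(1, n))
            if score > threshold:
                selected.append(ngram)
        for ngram in selected:
            compound = " ".join(ngram)
            vocabulary.add(compound)             # add the compound token
            vocabulary.difference_update(ngram)  # remove its constituent tokens
            compounds.append(compound)
    return compounds, vocabulary


text = "we flew to new york city and then we left new york city to see the sea"
compounds, vocabulary = find_compounds(text)
print(compounds)  # ['new york city']
```

In this example the trigram "new york city" occurs twice and both of its splits score 0.25, so it is added at n=3. Its constituents "new", "york", and "city" are then removed from the vocabulary, which is why the bigrams "new york" and "york city" are no longer candidates when the loop reaches n=2.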
