System and method for identifying compounds through iterative analysis
First Claim
1. A computer-implemented method for identifying compounds in text, comprising:
- extracting a vocabulary of tokens from text;
iterating from n>2 down to n=2 where n decreases by one each iteration and in each iteration performing the actions of:
identifying a plurality of unique n-grams in the text, each n-gram being an occurrence in the text of n sequential tokens, each token being found in the vocabulary;
dividing each n-gram into n−1 pairs of two adjacent segments, where each segment consists of at least one token;
for each n-gram, calculating a likelihood of collocation for each pair of the n−1 pairs of two adjacent segments of the n-gram and determining a score for the n-gram based on a lowest calculated likelihood of collocation for the each of the n−1 pairs;
identifying a set of n-grams having scores above a threshold; and
adding the identified set of n-grams as compound tokens to the vocabulary and removing constituent tokens that occur in the added compound tokens from the vocabulary, wherein the iterating is performed by one or more processors.
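The splitting and scoring steps of claim 1 can be illustrated with a minimal Python sketch. The claim does not name a particular likelihood-of-collocation measure, so pointwise mutual information is used here purely as an assumption; the function names (`count_ngrams`, `split_pairs`, `pmi`, `ngram_score`) and the choice to normalize every count by the corpus length are likewise hypothetical and are not taken from the patent.

```python
import math
from collections import Counter


def count_ngrams(tokens, max_n):
    """Count every run of 1..max_n sequential tokens in the text."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def split_pairs(ngram):
    """Return the n-1 ways to cut an n-gram into two adjacent segments."""
    return [(ngram[:k], ngram[k:]) for k in range(1, len(ngram))]


def pmi(left, right, counts, total):
    """Assumed collocation measure: pointwise mutual information."""
    joint = counts[left + right] / total
    return math.log(joint / ((counts[left] / total) * (counts[right] / total)))


def ngram_score(ngram, counts, total):
    """Score an n-gram by the lowest likelihood among its n-1 splits."""
    return min(pmi(left, right, counts, total)
               for left, right in split_pairs(ngram))


tokens = "new york city is big new york city is old".split()
counts = count_ngrams(tokens, 3)
print(split_pairs(("new", "york", "city")))
print(ngram_score(("new", "york", "city"), counts, len(tokens)))
```

Taking the minimum over the splits means a candidate is only as strong as its weakest join, so a trigram is not promoted to a compound merely because two of its three tokens collocate strongly.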
Abstract
A system and method for identifying compounds through iterative analysis of a measure of association is disclosed. A limit on a number of tokens per compound is specified. Compounds within a text corpus are iteratively evaluated. A number of occurrences of one or more n-grams within the text corpus is determined. Each n-gram includes up to a maximum number of tokens, which are each provided in a vocabulary for the text corpus. At least one n-gram including a number of tokens equal to the limit based on the number of occurrences is identified. A measure of association between the tokens in the identified n-gram is determined. Each identified n-gram with a sufficient measure of association is added to the vocabulary as a compound token and the limit is adjusted.
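The iteration described in the abstract can be sketched in Python as follows. This is an illustration under stated assumptions, not the patented implementation: pointwise mutual information stands in for the unspecified measure of association, the weakest-split scoring follows the wording of claim 1, and the names `find_compounds`, `min_count`, and `threshold`, as well as the greedy left-to-right merging of accepted n-grams, are hypothetical choices.

```python
import math
from collections import Counter


def find_compounds(tokens, max_n, threshold, min_count=2):
    """Iterate the token limit from max_n down to 2, promoting n-grams."""
    tokens = list(tokens)
    compounds = []
    for n in range(max_n, 1, -1):  # the limit is lowered by one each pass
        total = len(tokens)
        counts = Counter()
        for size in range(1, n + 1):
            for i in range(total - size + 1):
                counts[tuple(tokens[i:i + size])] += 1

        def assoc(left, right):
            # Assumed measure of association: pointwise mutual information.
            return math.log(counts[left + right] * total
                            / (counts[left] * counts[right]))

        accepted = set()
        for gram, count in counts.items():
            # Assumed occurrence filter: skip n-grams seen fewer than min_count times.
            if len(gram) != n or count < min_count:
                continue
            # Score by the weakest of the n-1 two-segment splits.
            score = min(assoc(gram[:k], gram[k:]) for k in range(1, n))
            if score > threshold:
                accepted.add(gram)

        # Fuse each accepted n-gram into one compound token, so its
        # constituents no longer appear separately at those positions.
        merged, i = [], 0
        while i < len(tokens):
            window = tuple(tokens[i:i + n])
            if window in accepted:
                merged.append(" ".join(window))
                i += n
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
        compounds.extend(" ".join(gram) for gram in sorted(accepted))
    return compounds, tokens


text = ("new york city is big it is old "
        "new york city was fun he is here").split()
compounds, tokens = find_compounds(text, max_n=3, threshold=1.0)
print(compounds)
print(sorted(set(tokens)))  # the vocabulary after compounds are added
```

Starting from the longest candidates and working downward lets a long compound be claimed as a unit before its shorter sub-sequences are considered, which is one plausible reading of why the limit is adjusted between passes.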
44 Citations
15 Claims
-
1. A computer-implemented method for identifying compounds in text, comprising:
- extracting a vocabulary of tokens from text;
iterating from n>2 down to n=2 where n decreases by one each iteration and in each iteration performing the actions of:
identifying a plurality of unique n-grams in the text, each n-gram being an occurrence in the text of n sequential tokens, each token being found in the vocabulary;
dividing each n-gram into n−1 pairs of two adjacent segments, where each segment consists of at least one token;
for each n-gram, calculating a likelihood of collocation for each pair of the n−1 pairs of two adjacent segments of the n-gram and determining a score for the n-gram based on a lowest calculated likelihood of collocation for the each of the n−1 pairs;
identifying a set of n-grams having scores above a threshold; and
adding the identified set of n-grams as compound tokens to the vocabulary and removing constituent tokens that occur in the added compound tokens from the vocabulary, wherein the iterating is performed by one or more processors. - View Dependent Claims (2, 3, 4, 5)
-
6. A computer readable storage medium on which program code is stored, which program code, when executed by a processor, causes the processor to perform operations comprising:
- extracting a vocabulary of tokens from text;
iterating from n>2 down to n=2 where n decreases by one each iteration and in each iteration performing the actions of:
identifying a plurality of unique n-grams in the text, each n-gram being an occurrence in the text of n sequential tokens, each token being found in the vocabulary;
dividing each n-gram into n−1 pairs of two adjacent segments, where each segment consists of at least one token;
for each n-gram, calculating a likelihood of collocation for each of the n−1 pairs of two adjacent segments of the n-gram and determining a score for the n-gram based on a lowest calculated likelihood of collocation for the each of the n−1 pairs;
identifying a set of n-grams having scores above a threshold; and
adding the identified set of n-grams as compound tokens to the vocabulary and removing constituent tokens that occur in the added compound tokens from the vocabulary. - View Dependent Claims (7, 8, 9, 10)
-
11. A system comprising:
- a computer readable storage medium on which a program product is stored; and
one or more processors configured to execute the program product and perform operations comprising:
extracting a vocabulary of tokens from text;
iterating from n>2 down to n=2 where n decreases by one each iteration and in each iteration performing the actions of:
identifying a plurality of unique n-grams in the text, each n-gram being an occurrence in the text of n sequential tokens, each token being found in the vocabulary;
dividing each n-gram into n−1 pairs of two adjacent segments, where each segment consists of at least one token;
for each n-gram, calculating a likelihood of collocation for each of the n−1 pairs of two adjacent segments of the n-gram and determining a score for the n-gram based on a lowest calculated likelihood of collocation for the each of the n−1 pairs;
identifying a set of n-grams having scores above a threshold; and
adding the identified set of n-grams as compound tokens to the vocabulary and removing constituent tokens that occur in the added compound tokens from the vocabulary. - View Dependent Claims (12, 13, 14, 15)
Specification