Identifying common co-occurring elements in lists
First Claim
1. A computer-implemented method comprising:
traversing a corpus of documents to identify a plurality of lists within the documents, wherein each list comprises structured data delimited from other data in a document, and wherein each list specifies an enumeration of elements;
selecting a pair of terms based on determining that both terms of the pair are contained in a first quantity of lists that are included in the documents in the corpus, wherein the first quantity is more than a first predetermined quantity, and wherein each list in the first quantity of lists includes more than a second predetermined quantity of terms;
determining a first value that represents a quantity of documents in the corpus that include a list that contains both terms of the pair;
determining a second value that represents a quantity of the documents in the set corpus that include a list that contains at least one of the terms of the pair;
when both terms of the pair are contained in the first quantity of lists that are included in the documents in the corpus, determining a correlation value from the first value and the second value;
determining that the correlation value satisfies a threshold; and
designating, by one or more computers, the pair of terms as potentially non-synonymous terms by adding the pair of terms to a blacklist, based on determining that the correlation value satisfies the threshold, wherein the blacklist is accessed for synonym determination.
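The steps of the claim can be illustrated with a minimal Python sketch. The claim does not fix a formula for the correlation value, so the ratio of the first value to the second value used here is an assumption, as are the function names, the parameter names (`min_lists`, `min_list_len`, `threshold`), their default values, and the representation of the corpus as a dictionary of already-extracted lists.

```python
from collections import defaultdict
from itertools import combinations


def find_non_synonym_pairs(corpus, min_lists=2, min_list_len=2, threshold=0.5):
    """Return a blacklist (set of sorted term pairs) of potentially non-synonymous terms.

    corpus: dict mapping a document id to the lists found in that document,
            where each list is a sequence of term strings.
    All thresholds are hypothetical stand-ins for the "predetermined
    quantities" and the "threshold" named in the claim.
    """
    pair_list_count = defaultdict(int)  # pair -> number of long-enough lists containing both terms
    docs_with_both = defaultdict(set)   # pair -> documents with a list containing both terms
    docs_with_term = defaultdict(set)   # term -> documents with a list containing the term

    for doc_id, lists in corpus.items():
        for items in lists:
            terms = sorted(set(items))  # distinct terms, in a canonical order
            for term in terms:
                docs_with_term[term].add(doc_id)
            for pair in combinations(terms, 2):
                docs_with_both[pair].add(doc_id)
                # Only lists with more than min_list_len terms count toward selection.
                if len(terms) > min_list_len:
                    pair_list_count[pair] += 1

    blacklist = set()
    for pair, n_lists in pair_list_count.items():
        if n_lists <= min_lists:
            continue  # the pair does not co-occur in enough lists
        a, b = pair
        first_value = len(docs_with_both[pair])
        second_value = len(docs_with_term[a] | docs_with_term[b])
        # Assumed correlation measure: share of documents listing either term
        # that list both terms together.
        correlation = first_value / second_value
        if correlation >= threshold:
            blacklist.add(pair)
    return blacklist


def is_synonym_candidate(term_a, term_b, blacklist):
    """Consult the blacklist during synonym determination (hypothetical helper)."""
    return tuple(sorted((term_a, term_b))) not in blacklist
```

In this sketch, terms such as "red" and "green" that repeatedly appear side by side in the same lists end up on the blacklist, so a later synonym-determination step can skip them; a pair that only co-occurs in short lists is never selected.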
Abstract
One embodiment of the present invention provides a system for detecting correlations between terms. During operation, the system identifies one or more lists contained in one or more documents and identifies two terms co-occurring in the lists. The system further determines a correlation between the co-occurring terms, and places the co-occurring terms in a correlated-pair list based on the correlation.
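As an illustration of the list-identification step, the following sketch pulls list items out of an HTML document using Python's standard `html.parser` module. Treating `<ul>` and `<ol>` elements as the lists is an assumption made for the example; the class name `ListExtractor`, the function `extract_lists`, and the lowercasing of items are likewise hypothetical choices rather than anything the abstract specifies.

```python
from html.parser import HTMLParser


class ListExtractor(HTMLParser):
    """Collect the text of <li> items, grouped by their enclosing <ul>/<ol>."""

    def __init__(self):
        super().__init__()
        self.lists = []     # completed lists, each a list of item strings
        self._stack = []    # lists that are currently open (supports nesting)
        self._item = None   # text fragments of the current <li>, or None

    def handle_starttag(self, tag, attrs):
        if tag in ("ul", "ol"):
            self._flush_item()
            self._stack.append([])
        elif tag == "li" and self._stack:
            self._flush_item()  # tolerate a missing </li>
            self._item = []

    def handle_endtag(self, tag):
        if tag == "li":
            self._flush_item()
        elif tag in ("ul", "ol") and self._stack:
            self._flush_item()
            finished = self._stack.pop()
            if finished:
                self.lists.append(finished)

    def handle_data(self, data):
        # Text outside of a list item is ignored.
        if self._item is not None:
            self._item.append(data)

    def _flush_item(self):
        if self._item is not None and self._stack:
            # Collapse whitespace and lowercase the item text (assumed normalization).
            text = " ".join("".join(self._item).split()).lower()
            if text:
                self._stack[-1].append(text)
        self._item = None


def extract_lists(html):
    """Return every non-empty <ul>/<ol> in the HTML as a list of item strings."""
    parser = ListExtractor()
    parser.feed(html)
    parser.close()
    return parser.lists
```

For example, `extract_lists("<p>Colors:</p><ul><li>Red</li><li>Green</li></ul>")` yields one list holding the two items, while the surrounding paragraph text is ignored.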
28 Claims
1. A computer-implemented method, recited in full under "First Claim" above.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
13. A system comprising:
one or more computers; and
a computer-readable medium coupled to the one or more computers having instructions stored thereon which, when executed by the one or more computers, cause the one or more computers to perform operations comprising:
traversing a corpus of documents to identify a plurality of lists within the documents, wherein each list comprises structured data delimited from other data in a document, and wherein each list specifies an enumeration of elements;
selecting a pair of terms based on determining that both terms of the pair are contained in a first quantity of lists that are included in the documents in the corpus, wherein the first quantity is more than a first predetermined quantity, and wherein each list in the first quantity of lists includes more than a second predetermined quantity of terms;
determining a first value that represents a quantity of documents in the corpus that include a list that contains both terms of the pair;
determining a second value that represents a quantity of the documents in the set corpus that include a list that contains at least one of the terms of the pair;
when both terms of the pair are contained in the first quantity of lists that are included in the documents in the corpus, determining a correlation value from the first value and the second value;
determining that the correlation value satisfies a threshold; and
designating the pair of terms as potentially non-synonymous terms by adding the pair of terms to a blacklist, based on determining that the correlation value satisfies the threshold, wherein the blacklist is accessed for synonym determination.
Dependent claims: 14, 15, 16, 17, 18, 19, 20, 28
21. A non-transitory computer-readable medium encoded with a computer program comprising instructions that, when executed, operate to cause a computer to perform operations comprising:
traversing a corpus of documents to identify a plurality of lists within the documents, wherein each list comprises structured data delimited from other data in a document, and wherein each list specifies an enumeration of elements;
selecting a pair of terms based on determining that both terms of the pair are contained in a first quantity of lists that are included in the documents in the corpus, wherein the first quantity is more than a first predetermined quantity, and wherein each list in the first quantity of lists includes more than a second predetermined quantity of terms;
determining a first value that represents a quantity of documents in the corpus that include a list that contains both terms of the pair;
determining a second value that represents a quantity of the documents in the set corpus that include a list that contains at least one of the terms of the pair;
when both terms of the pair are contained in the first quantity of lists that are included in the documents in the corpus, determining a correlation value from the first value and the second value;
determining that the correlation value satisfies a threshold; and
designating the pair of terms as potentially non-synonymous terms by adding the pair of terms to a blacklist, based on determining that the correlation value satisfies the threshold, wherein the blacklist is accessed for synonym determination.
Dependent claims: 22, 23, 24, 25, 26, 27
Specification