Establishing “is a” relationships for a taxonomy
Abstract
Disclosed are methods for returning to a user an answer to the question “what is <string>.” Concepts and the classes to which the concepts belong are determined from a corpus, such as a taxonomy. The concepts are mapped to categories according to the structure of the taxonomy. Homonyms for words are collected and scored according to likelihood of use. Concept vectors are assembled for the identified concepts based on articles in the corpus and social media usage. Words are evaluated for generic-ness, and a generic score is associated with each word. In responding to a query, the generic-ness of the terms of the query is evaluated, and additional context is solicited if the terms are generic. Candidate homonym concepts for a string in the query are selected according to context vectors for the homonym concepts. One or more homonym concepts are selected, and the one or more categories corresponding to these concepts are returned.
15 Claims
1. A method for identifying generic terms, the method comprising:
- evaluating, by a computer system, word usage in an article for a string in reference corpus articles of a reference corpus by generating for the string a corpus vector having a list of co-occurring corpus words that occur in the reference corpus articles that also include the string and calculating a corpus ratio for each co-occurring corpus word as a number of times the each co-occurring corpus word occurs in the reference corpus articles containing the string divided by a number of times the co-occurring corpus word occurs in the reference corpus articles, the corpus vector including the list of co-occurring corpus words and the corpus ratio for the each co-occurring corpus words;
- evaluating, by the computer system, word usage in current social media documents containing the string by generating for the string a social media vector having a list of co-occurring social media words that occur in social media documents that also include the string and calculating a social media ratio for each co-occurring social media word as a number of times the each co-occurring social media word occurs in the current social media documents containing the string divided by a number of times the co-occurring social media word occurs in the current social media documents, the social media vector including the list of co-occurring social media words and the social media ratio for the each co-occurring social media words;
- causing, by the computer system, a generic score for the string to indicate greater generic-ness for a greater difference between the corpus vector and the social media vector; and
- storing, by the computer system, the generic score for the string.
Dependent claims: 2, 3, 4, 5
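The steps recited in claim 1 can be illustrated with a short sketch. This is not the patented implementation, only one reading of the claim language under stated assumptions: the string is a single word, documents are plain text split by a simple regular-expression tokenizer, and the "difference" between the two vectors is measured as cosine distance (the claim does not name a measure). The function names `tokenize`, `cooccurrence_vector`, `cosine_distance`, and `generic_score` are hypothetical.

```python
from collections import Counter
import math
import re


def tokenize(text):
    """Lowercase word tokenizer (an illustrative assumption)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def cooccurrence_vector(string, documents):
    """Map each co-occurring word to its ratio: occurrences in documents
    containing `string`, divided by occurrences in all documents."""
    target = string.lower()
    total = Counter()        # word counts over every document
    with_string = Counter()  # word counts over documents containing the string
    for doc in documents:
        tokens = tokenize(doc)
        total.update(tokens)
        if target in tokens:
            with_string.update(t for t in tokens if t != target)
    return {word: count / total[word] for word, count in with_string.items()}


def cosine_distance(a, b):
    """1 - cosine similarity of two sparse vectors (assumed difference measure)."""
    dot = sum(value * b.get(word, 0.0) for word, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # no co-occurrence data on one side: treat as maximally different
    return 1.0 - dot / (norm_a * norm_b)


def generic_score(string, corpus_articles, social_documents):
    """Larger score means a larger gap between corpus usage and social media usage."""
    corpus_vector = cooccurrence_vector(string, corpus_articles)
    social_vector = cooccurrence_vector(string, social_documents)
    return cosine_distance(corpus_vector, social_vector)


# Usage: compute a score and store it, mirroring the final "storing" step.
corpus_articles = ["the jaguar is a large cat", "a cat sleeps"]
social_documents = ["my jaguar car is fast", "fast car"]
stored_scores = {"jaguar": generic_score("jaguar", corpus_articles, social_documents)}
```

Under these assumptions the score lies between 0 and 1 because all ratios are non-negative. A dictionary stands in for whatever storage the computer system would actually use.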
6. A system for identifying generic terms, the system comprising one or more processors and one or more memory devices storing executable and operational data effective to cause the one or more processors to:
- evaluate word usage in an article for a string in reference corpus articles of a reference corpus by generating for the string a corpus vector having a list of co-occurring corpus words that occur in the reference corpus articles that also include the string and calculating a corpus ratio for each co-occurring corpus word as a number of times the each co-occurring corpus word occurs in the reference corpus articles containing the string divided by a number of times the co-occurring corpus word occurs in the reference corpus articles, the corpus vector including the list of co-occurring corpus words and the corpus ratio for the each co-occurring corpus words;
- evaluate word usage in current social media documents containing the string by generating for the string a social media vector having a list of co-occurring social media words that occur in social media documents that also include the string and calculating a social media ratio for each co-occurring social media word as a number of times the each co-occurring social media word occurs in the current social media documents containing the string divided by a number of times the co-occurring social media word occurs in the current social media documents, the social media vector including the list of co-occurring social media words and the social media ratio for the each co-occurring social media words;
- cause a generic score for the string to indicate greater generic-ness for a greater difference between the corpus vector and the social media vector; and
- store the generic score for the string.
Dependent claims: 7, 8, 9, 10
11. A computer program product for identifying generic terms, the computer program product being embodied in a non-transitory computer readable storage medium and comprising computer instructions for:
- evaluating word usage in an article for a string in reference corpus articles of a reference corpus by generating for the string a corpus vector having a list of co-occurring corpus words that occur in the reference corpus articles that also include the string and calculating a corpus ratio for each co-occurring corpus word as a number of times the each co-occurring corpus word occurs in the reference corpus articles containing the string divided by a number of times the co-occurring corpus word occurs in the reference corpus articles, the corpus vector including the list of co-occurring corpus words and the corpus ratio for the each co-occurring corpus words;
- evaluating word usage in current social media documents containing the string by generating for the string a social media vector having a list of co-occurring social media words that occur in social media documents that also include the string and calculating a social media ratio for each co-occurring social media word as a number of times the each co-occurring social media word occurs in the current social media documents containing the string divided by a number of times the co-occurring social media word occurs in the current social media documents, the social media vector including the list of co-occurring social media words and the social media ratio for the each co-occurring social media words;
- causing a generic score for the string to indicate greater generic-ness for a greater difference between the corpus vector and social media vector; and
- storing the generic score for the string.
Dependent claims: 12, 13, 14, 15
Specification