Text mining apparatus and associated methods
Abstract
A method for extracting key terms and associated key terms for use in text mining is provided. The method includes receiving unstructured text documents, such as emails over a customer service system. Term candidates are extracted based on identifying consecutive word strings satisfying a context independency threshold. Term candidates are weighted using mutual information to generate a list of weighted terms. The weighted terms are then recounted. Terms are associated based on Chi-square values. Associated terms can then be used for information retrieval. A user interface can be personalized with individual user profiles.
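The context independency test mentioned above is recited in the claims as being based on the entropy of the left-context and right-context words surrounding a consecutive word string. A minimal sketch of that idea follows. The function names, the boundary markers `<s>` and `</s>`, the rule of taking the smaller of the two entropies, and the threshold value of 1.0 bit are all assumptions made for illustration; none of them is taken from the patent.

```python
import math
from collections import Counter


def context_entropy(neighbors):
    """Shannon entropy, in bits, of a list of neighboring words."""
    counts = Counter(neighbors)
    total = sum(counts.values())
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def context_independency(tokens, candidate):
    """Return (left_entropy, right_entropy) for a candidate word string.

    tokens is a list of words; candidate is a tuple of consecutive words.
    """
    candidate = tuple(candidate)
    n = len(candidate)
    left, right = [], []
    for i in range(len(tokens) - n + 1):
        if tuple(tokens[i:i + n]) == candidate:
            # "<s>" and "</s>" are hypothetical boundary markers.
            left.append(tokens[i - 1] if i > 0 else "<s>")
            right.append(tokens[i + n] if i + n < len(tokens) else "</s>")
    if not left:
        return 0.0, 0.0
    return context_entropy(left), context_entropy(right)


def is_term_candidate(tokens, candidate, threshold=1.0):
    """Assumed rule: both context entropies must reach the threshold."""
    left_h, right_h = context_independency(tokens, candidate)
    return min(left_h, right_h) >= threshold


tokens = ("please reset my password now . i forgot my password again . "
          "change my password please").split()
```

With this sample text, `("my", "password")` is preceded and followed by three different words, so both entropies are high and it passes the test. The single word `("my",)` is always followed by "password", so its right-context entropy is zero and it is rejected as an incomplete fragment.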
17 Claims
1. A method of performing text mining comprising:
- identifying consecutive word strings in unstructured text documents;
- generating a list of term candidates based on context independency values calculated based on entropy of left context and right context word strings surrounding the consecutive word strings;
- generating a list of key terms from among the list of term candidates;
- receiving a query over a user interface;
- calculating Chi-square values wherein Chi-square values are calculated between at least some terms of the query and at least some of the key terms to identify the associated terms from among the key terms using a Chi-square expression based on count information of at least some query terms and at least some key terms in the text documents, wherein the count information includes a number of documents where both query terms and key terms appear, a number of documents where query terms appear but key terms do not appear, a number of documents where query terms appear, a number of documents where at least some query terms do not appear and key terms appear, a number of documents where neither at least some query terms nor key terms appear, a number where at least some query terms do not appear, a number where key terms appear, and a number where key terms do not appear; and
- providing content in the unstructured text documents over the user interface based on the query.
Dependent claims: 2, 3, 4, 5, 6, 7, 8
9. A computer readable storage medium including instructions which, when implemented, cause a computer to perform a method comprising:
- identifying a list of term candidates in unstructured text comprising calculating context independency values based on entropy of left context and right context word strings surrounding consecutive word strings in the unstructured text to generate term candidates;
- generating a list of key terms from among the list of term candidates;
- receiving a query over a user interface;
- calculating Chi-square values between at least some terms of the query and at least some of the key terms to identify associated terms from among the key terms using a Chi-square expression of the form
$$\chi^2(x,y) = \frac{N\,\bigl(f(x,y)\,f(\bar{x},\bar{y}) - f(x,\bar{y})\,f(\bar{x},y)\bigr)^2}{f(x)\,f(\bar{x})\,f(y)\,f(\bar{y})}$$
where $x$ represents at least some of the query terms, $y$ represents at least some of the key terms, $N$ represents the number of documents, $f(x,y)$ is the number of documents where both terms $x$ and $y$ appear, $f(x,\bar{y})$ is the number of documents where $x$ appears but $y$ does not appear, $f(x)$ is the number of documents where $x$ appears, $f(\bar{x},y)$ is the number of documents where $x$ does not appear and $y$ appears, $f(\bar{x},\bar{y})$ is the number of documents where neither $x$ nor $y$ appears, $f(\bar{x})$ is the number where $x$ does not appear, $f(y)$ is the number where $y$ appears, $f(\bar{y})$ is the number where $y$ does not appear; and
- retrieving information in the unstructured text documents associated with at least one of the listed key terms.
Dependent claims: 10, 11, 12, 13
14. A computer readable storage medium including instructions which, when implemented, cause a computer to perform text mining, the instructions comprising:
- a key term extraction module adapted to identify a list of key terms in documents of unstructured text; and
- a text mining module adapted to receive a query and associate at least a portion of the query with some of the key terms based on Chi-square values to generate associated terms,
wherein Chi-square values are calculated between at least some terms of the query and at least some of the key terms to identify the associated terms from among the key terms using a Chi-square expression based on count information of at least some query terms and at least some key terms in the text documents, wherein the count information includes a number of documents where both query terms and key terms appear, a number of documents where query terms appear but key terms do not appear, a number of documents where query terms appear, a number of documents where at least some query terms do not appear and key terms appear, a number of documents where neither at least some query terms nor key terms appear, a number where at least some query terms do not appear, a number where key terms appear, and a number where key terms do not appear,
wherein the key term extraction module identifies consecutive word strings in the unstructured text using a suffix array and generates a list of term candidates based on context independency values,
wherein the key term extraction module calculates context independency values based on entropy of left context and right context word strings surrounding the consecutive word strings to generate term candidates, and
wherein the text mining module retrieves information from the unstructured text documents based on the query.
Dependent claims: 15, 16, 17
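The Chi-square association step recited in the independent claims can be illustrated with a short sketch. It assumes that the claimed expression is the standard Chi-square statistic for a 2x2 contingency table, computed from the document counts the claims enumerate. Representing each document as a set of terms, the function names, and the `top_k` parameter are hypothetical choices for this example and do not come from the patent.

```python
def chi_square(docs, x, y):
    """Chi-square association between terms x and y.

    docs is a list of sets, one set of terms per document.
    Assumes the standard 2x2 contingency-table statistic.
    """
    n = len(docs)
    f_xy = sum(1 for d in docs if x in d and y in d)               # both appear
    f_x_noty = sum(1 for d in docs if x in d and y not in d)       # x only
    f_notx_y = sum(1 for d in docs if x not in d and y in d)       # y only
    f_notx_noty = n - f_xy - f_x_noty - f_notx_y                   # neither
    f_x = f_xy + f_x_noty
    f_notx = n - f_x
    f_y = f_xy + f_notx_y
    f_noty = n - f_y
    denom = f_x * f_notx * f_y * f_noty
    if denom == 0:
        # A term that appears in every document or in none carries no signal.
        return 0.0
    return n * (f_xy * f_notx_noty - f_x_noty * f_notx_y) ** 2 / denom


def associated_terms(docs, query_term, key_terms, top_k=5):
    """Rank key terms by their Chi-square value against one query term."""
    scored = [(chi_square(docs, query_term, k), k) for k in key_terms]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [term for _, term in scored[:top_k]]


docs = [
    {"password", "reset"},
    {"password", "reset"},
    {"invoice"},
    {"login"},
]
ranked = associated_terms(docs, "password", ["login", "reset"])
```

In this toy collection "reset" co-occurs with "password" in every document where either appears, so it ranks ahead of "login". The statistic is also large for strong negative association, so a practical system might additionally check that the observed co-occurrence count exceeds the expected one.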
Specification