Text mining apparatus and associated methods

US 7,461,056 B2
Filed: 02/09/2005
Issued: 12/02/2008
Est. Priority Date: 02/09/2005
Status: Expired due to Fees

Abstract

A method for extracting key terms and associated key terms for use in text mining is provided. The method includes receiving unstructured text documents, such as emails over a customer service system. Term candidates are extracted based on identifying consecutive word strings satisfying a context independency threshold. Term candidates are weighted using mutual information to generate a list of weighted terms. The weighted terms are then recounted. Terms are associated based on Chi-square values. Associated terms can then be used for information retrieval. A user interface can be personalized with individual user profiles.
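
The context independency test mentioned above can be illustrated with a small Python sketch. It is only an approximation under stated assumptions: the claims say the values are based on the entropy of left and right context word strings, but combining the two entropies with `min`, using single neighboring words as context, the whitespace tokenizer, the `<s>` boundary marker, and the names `context_independency` and `term_candidates` are all hypothetical choices made for this example.

```python
import math
from collections import Counter, defaultdict

BOUNDARY = "<s>"  # hypothetical marker for the start or end of a document


def entropy(counts):
    """Shannon entropy (in bits) of a Counter of context words."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def context_independency(docs, max_len=3):
    """Score every consecutive word string of up to max_len words.

    Assumption: the score is min(left entropy, right entropy), so a string
    scores high only if it is followed AND preceded by varied words.
    """
    left = defaultdict(Counter)
    right = defaultdict(Counter)
    for doc in docs:
        words = doc.lower().split()
        for n in range(1, max_len + 1):
            for i in range(len(words) - n + 1):
                gram = tuple(words[i:i + n])
                left[gram][words[i - 1] if i > 0 else BOUNDARY] += 1
                right[gram][words[i + n] if i + n < len(words) else BOUNDARY] += 1
    return {g: min(entropy(left[g]), entropy(right[g])) for g in left}


def term_candidates(docs, threshold=1.0, max_len=3):
    """Keep the word strings whose score meets the (assumed) threshold."""
    scores = context_independency(docs, max_len)
    return sorted(g for g, s in scores.items() if s >= threshold)
```

With four short emails that all contain "reset my password" in different surroundings, the full phrase has varied neighbors on both sides and passes the threshold, while fragments such as "reset my" are always followed by the same word and are rejected.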

17 Claims

1. A method of performing text mining comprising:
- identifying consecutive word strings in unstructured text documents;
  
  generating a list of term candidates based on context independency values calculated based on entropy of left context and right context word strings surrounding the consecutive word strings;
  
  generating a list of key terms from among the list of term candidates;
  
  receiving a query over a user interface;
  
  calculating Chi-square values, wherein Chi-square values are calculated between at least some terms of the query and at least some of the key terms to identify the associated terms from among the key terms using a Chi-square expression based on count information of at least some query terms and at least some key terms in the text documents, wherein the count information includes a number of documents where both query terms and key terms appear, a number of documents where query terms appear but key terms do not appear, a number of documents where query terms appear, a number of documents where at least some query terms do not appear and key terms appear, a number of documents where neither at least some query terms nor key terms appear, a number where at least some query terms do not appear, a number where key terms appear, and a number where key terms do not appear; and
  
  providing content in the unstructured text documents over the user interface based on the query.
- Dependent Claims (2, 3, 4, 5, 6, 7, 8)
- - 2. The method of claim 1, and further comprising retrieving content from the unstructured text documents, wherein the content is associated with the associated terms.
  - 3. The method of claim 2, and further comprising retrieving content from the unstructured text documents using at least one of clusters and synonyms associated with the unstructured text documents.
  - 4. The method of claim 1, and further comprising maintaining a user profile from each user, wherein the user profile comprises at least one of notable terms, stop terms, synonyms, clusters, and categories.
  - 5. The method of claim 1, wherein generating a list of term candidates comprises identifying consecutive word strings in the unstructured text.
  - 6. The method of claim 5, wherein identifying consecutive word strings comprises using at least one of a suffix array or a longest common prefix (LCP) array.
  - 7. The method of claim 5, wherein generating a list of key terms from among the list of term candidates comprises calculating mutual information values of constituent word strings in term candidates.
  - 8. The method of claim 1, wherein generating a list of key terms from among the list of term candidates comprises providing weight and count information for each of the listed key terms.
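
Claim 6 names a suffix array and a longest common prefix (LCP) array as ways to identify consecutive word strings. The following Python sketch shows one way those two structures can expose repeated word strings. It is illustrative only: the naive sort-based construction, the word-level (rather than character-level) suffixes, the `min_len` parameter, and the function names are assumptions, and the record does not describe how the patented method builds or uses these arrays.

```python
def suffix_array(words):
    """Start indices of all suffixes of `words`, sorted lexicographically.

    Naive construction by sorting the suffixes directly; fine for a sketch,
    too slow for a large corpus.
    """
    return sorted(range(len(words)), key=lambda i: words[i:])


def lcp_array(words, sa):
    """lcp[k] = number of leading words shared by suffixes sa[k-1] and sa[k]."""
    lcp = [0] * len(sa)
    for k in range(1, len(sa)):
        a, b = sa[k - 1], sa[k]
        n = 0
        while a + n < len(words) and b + n < len(words) and words[a + n] == words[b + n]:
            n += 1
        lcp[k] = n
    return lcp


def repeated_word_strings(words, min_len=2):
    """Word strings of at least min_len words that occur two or more times.

    Any string occurring twice is a shared prefix of two suffixes that are
    adjacent somewhere in the sorted order, so scanning the LCP array suffices.
    """
    sa = suffix_array(words)
    lcp = lcp_array(words, sa)
    found = set()
    for k in range(1, len(sa)):
        for n in range(min_len, lcp[k] + 1):
            found.add(tuple(words[sa[k]:sa[k] + n]))
    return sorted(found)
```

For the word sequence "reset my password then reset my password again", the sketch reports "my password", "reset my", and "reset my password" as repeated strings.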

9. A computer readable storage medium including instructions which, when implemented, cause a computer to perform a method comprising:
- identifying a list of term candidates in unstructured text comprising calculating context independency values based on entropy of left context and right context word strings surrounding consecutive word strings in the unstructured text to generate term candidates;
  
  generating a list of key terms from among the list of term candidates;
  
  receiving a query over a user interface;
  
  calculating Chi-square values between at least some terms of the query and at least some of the key terms to identify associated terms from among the key terms using a Chi-square expression of the form $\chi^{2} = \frac{\left(f(x, y) - \frac{f(x) \cdot f(y)}{N}\right)^{2}}{f(x) \cdot f(y) / N} + \frac{\left(f(x, \overline{y}) - \frac{f(x) \cdot f(\overline{y})}{N}\right)^{2}}{f(x) \cdot f(\overline{y}) / N} + \frac{\left(f(\overline{x}, y) - \frac{f(\overline{x}) \cdot f(y)}{N}\right)^{2}}{f(\overline{x}) \cdot f(y) / N} + \frac{\left(f(\overline{x}, \overline{y}) - \frac{f(\overline{x}) \cdot f(\overline{y})}{N}\right)^{2}}{f(\overline{x}) \cdot f(\overline{y}) / N},$
  
  where $x$ represents at least some of the query terms, $y$ represents at least some of the key terms, $N$ represents the number of documents, $f(x, y)$ is the number of documents where both terms $x$ and $y$ appear, $f(x, \overline{y})$ is the number of documents where $x$ appears but $y$ does not appear, $f(x)$ is the number of documents where $x$ appears, $f(\overline{x}, y)$ is the number of documents where $x$ does not appear and $y$ appears, $f(\overline{x}, \overline{y})$ is the number of documents where neither $x$ nor $y$ appears, $f(\overline{x})$ is the number where $x$ does not appear, $f(y)$ is the number where $y$ appears, $f(\overline{y})$ is the number where $y$ does not appear; and
  
  retrieving information in the unstructured text documents associated with at least one of the listed key terms.
- Dependent Claims (10, 11, 12, 13)
- - 10. The computer readable storage medium of claim 9, wherein identifying term candidates comprises:
    - identifying the consecutive word strings in the unstructured text using at least one of a suffix array and a longest common prefix (LCP) array.
  - 11. The computer readable storage medium of claim 9, wherein identifying a list of term candidates comprises accessing at least one of synonyms and clusters to retrieve information from the unstructured documents associated with the synonyms or clusters.
  - 12. The computer readable storage medium of claim 9, wherein generating a list of key terms from among the term candidates comprises identifying consecutive word strings in the unstructured text using at least one of a suffix array and an LCP array.
  - 13. The computer readable storage medium of claim 12, wherein generating a list of key terms from among the list of term candidates comprises calculating mutual information values of constituent word strings in term candidates.
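
The Chi-square expression in claim 9 can be evaluated directly from document counts. The Python sketch below computes the four terms of the sum from the sets of documents containing a query term x and a key term y. The set-based inputs, the rule of skipping a cell whose expected count is zero, and the helper `associated_terms` (which simply ranks key terms by their score) are assumptions made for illustration, not details taken from the patent.

```python
def chi_square(docs_with_x, docs_with_y, n_docs):
    """Chi-square between terms x and y over n_docs documents.

    docs_with_x / docs_with_y are collections of ids of documents that
    contain x / y. Each cell is (observed, row marginal, column marginal).
    """
    x, y = set(docs_with_x), set(docs_with_y)
    f_x, f_y = len(x), len(y)
    f_not_x, f_not_y = n_docs - f_x, n_docs - f_y
    f_xy = len(x & y)
    cells = [
        (f_xy, f_x, f_y),                               # x and y
        (f_x - f_xy, f_x, f_not_y),                     # x, not y
        (f_y - f_xy, f_not_x, f_y),                     # not x, y
        (n_docs - f_x - f_y + f_xy, f_not_x, f_not_y),  # neither
    ]
    total = 0.0
    for observed, row_total, col_total in cells:
        expected = row_total * col_total / n_docs
        if expected > 0:  # assumption: ignore cells with no expected mass
            total += (observed - expected) ** 2 / expected
    return total


def associated_terms(query_docs, key_term_docs, n_docs, top_k=5):
    """Rank key terms by Chi-square against the query term (hypothetical helper)."""
    scored = [(chi_square(query_docs, docs, n_docs), term)
              for term, docs in key_term_docs.items()]
    scored.sort(reverse=True)
    return [term for _, term in scored[:top_k]]
```

As a worked case with N = 10, f(x) = 4, f(y) = 5, and f(x, y) = 3, the four terms are 0.5, 0.5, 1/3, and 1/3, for a total of 5/3.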

14. A computer readable storage medium including instructions which, when implemented, cause a computer to perform text mining, the instructions comprising:
- a key term extraction module adapted to identify a list of key terms in documents of unstructured text; and
  
  a text mining module adapted to receive a query and associate at least a portion of the query with some of the key terms based on Chi-square values to generate associated terms, wherein Chi-square values are calculated between at least some terms of the query and at least some of the key terms to identify the associated terms from among the key terms using a Chi-square expression based on count information of at least some query terms and at least some key terms in the text documents, wherein the count information includes a number of documents where both query terms and key terms appear, a number of documents where query terms appear but key terms do not appear, a number of documents where query terms appear, a number of documents where at least some query terms do not appear and key terms appear, a number of documents where neither at least some query terms nor key terms appear, a number where at least some query terms do not appear, a number where key terms appear, and a number where key terms do not appear, wherein the key term extraction module identifies consecutive word strings in the unstructured text using a suffix array and generates a list of term candidates based on context independency values, wherein the key term extraction module calculates context independency values based on entropy of left context and right context word strings surrounding the consecutive word strings to generate term candidates, and wherein the text mining module retrieves information from the unstructured text documents based on the query.
- Dependent Claims (15, 16, 17)
- - 15. The computer readable storage medium of claim 14, wherein the term extraction module calculates mutual information values of constituent word strings in term candidates to generate a list of key terms, and wherein the term extraction module re-counts at least some of the weighted terms in the unstructured text.
  - 16. The computer readable storage medium of claim 14, and further comprising a synonym construction module adapted to generate all morphological forms of key terms or their constituent word strings.
  - 17. The computer readable storage medium of claim 14, wherein the text mining module retrieves information from the unstructured text that is associated with the associated terms.
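
Claims 7, 13, and 15 refer to mutual information values of the constituent word strings in a term candidate. The record does not give the formula, so the Python sketch below is a guess at one common reading: pointwise mutual information between the two parts of a candidate, taking the weakest two-way split as the weight. The probability estimates (n-gram count divided by total word count), the use of `min` over splits, and the function names are all assumptions.

```python
import math
from collections import Counter


def ngram_counts(docs, max_len=3):
    """Count every consecutive word string of up to max_len words."""
    counts = Counter()
    total_words = 0
    for doc in docs:
        words = doc.lower().split()
        total_words += len(words)
        for n in range(1, max_len + 1):
            for i in range(len(words) - n + 1):
                counts[tuple(words[i:i + n])] += 1
    return counts, total_words


def mutual_information(term, counts, total_words):
    """Lowest pointwise mutual information (bits) over all two-way splits.

    `term` must be a tuple of at least two words that occurs in the corpus.
    Assumption: probabilities are counts divided by the total word count.
    """
    p_term = counts[term] / total_words
    scores = []
    for k in range(1, len(term)):
        p_left = counts[term[:k]] / total_words
        p_right = counts[term[k:]] / total_words
        scores.append(math.log2(p_term / (p_left * p_right)))
    return min(scores)
```

A candidate whose parts rarely occur apart receives a higher weight than one built from words that also appear freely in other combinations.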

Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Microsoft Corporation
Inventors: Li, Hang; Cao, Yunbo; Martin, Benjamin; Ribet, Olivier
Primary Examiner(s): Vy, Hung T
Application Number: US11/054,113
Publication Number: US 20060206306A1
Time in Patent Office: 1,392 Days
Field of Search: 707/3-5, 707/103.R, 715/500, 704/4, 704/3
US Class Current: 1/1
CPC Class Codes:
G06F 16/313   Selection or weighting of t...
G06F 16/3338   Query expansion
G06F 40/289   Phrasal analysis, e.g. fini...
Y10S 707/99934   Query formulation, input pr...
