Method and apparatus for automatically discovering features in free form heterogeneous data

US 8,108,413 B2
Filed: 02/15/2007
Issued: 01/31/2012
Est. Priority Date: 02/15/2007
Status: Active Grant

Abstract

Techniques are provided for automatically discovering one or more features in free form heterogeneous data. In one aspect of the invention, the techniques include obtaining free form heterogeneous data, wherein the data comprises one or more data items, applying a label to each data item, using the labeled data to build a language model, wherein a word distribution associated with each label can be derived from the model, and using the word distribution associated with each label to discover one or more features in the data, wherein discovering one or more features in the data facilitates one or more operations that use at least a portion of the labeled data.
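
The language-model step described above can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes whitespace tokenization and lowercasing, and the function name `build_language_model`, the sample tickets, and the labels `problem` and `action` are hypothetical. Each word's probability is its frequency within a label's cluster divided by the total number of words in that cluster, as the claims state.

```python
from collections import Counter, defaultdict


def build_language_model(labeled_items):
    """Return {label: {word: P(word | label)}} from (text, label) pairs.

    P(word | label) = count of word in the label's cluster
                      / total number of words in that cluster.
    """
    counts = defaultdict(Counter)
    for text, label in labeled_items:
        # Assumption: simple lowercased whitespace tokenization.
        counts[label].update(text.lower().split())

    model = {}
    for label, counter in counts.items():
        total = sum(counter.values())
        model[label] = {word: n / total for word, n in counter.items()}
    return model


# Hypothetical call-center tickets with hand-applied labels.
tickets = [
    ("server down please restart server", "problem"),
    ("restart completed", "action"),
]
model = build_language_model(tickets)
```

Here `model["problem"]` is the word distribution for the `problem` label; the later feature-discovery steps operate on distributions of this shape.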

35 Claims

1. A method, performed using a data processing system, of automatically discovering one or more features in free form heterogeneous data, the method comprising the steps of:
- obtaining free form heterogeneous data, wherein the data comprises one or more data items, and wherein at least a portion of the data is textual data representing one or more inquiries received by a call center;
  
  applying a label to each data item;
  
  using the labeled data to build a language model, wherein a word distribution associated with each label is derived from the model, and wherein the language model comprises a probability of a word occurring in a cluster of words, the probability comprising a frequency of the word within the cluster of words divided by a total number of words within the cluster of words; and
  
  automatically discovering one or more features in the data using the word distribution associated with each label, wherein discovering one or more features in the data facilitates one or more operations that use at least a portion of the labeled data, and wherein the one or more features include text related features that are used for recognizing an information type of a particular unit of text;
  
  wherein the data processing system comprises a memory and a processor coupled to the memory; and
  
  wherein the obtaining step, the applying step, the labeled data using step, and the word distribution using step are performed, at least in part, on the data processing system.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9):
  - 2. The method of claim 1, wherein the step of using the word distribution associated with each label to discover one or more features in the data comprises ranking a word probability for one or more features.
  - 3. The method of claim 1, wherein the step of using the word distribution associated with each label to discover one or more features in the data comprises discovering one or more keyword features in the data.
  - 4. The method of claim 3, wherein the one or more keyword features comprise a verb.
  - 5. The method of claim 1, wherein the step of using the word distribution associated with each label to discover one or more features in the data comprises discovering one or more distance features in the data.
  - 6. The method of claim 1, further comprising the step of generating a dictionary vector, wherein each word of the data appears as a keyword feature.
  - 7. The method of claim 1, wherein the step of using the word distribution associated with each label to discover one or more features in the data comprises discovering one or more formatting features in the data.
  - 8. The method of claim 1, wherein the step of using the word distribution associated with each label to discover one or more features in the data comprises discovering one or more character features in the data.
  - 9. The method of claim 1, wherein the step of using the word distribution associated with each label to discover one or more features in the data comprises discovering one or more percentage features in the data.
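
Claims 2 and 6 mention ranking word probabilities and generating a dictionary vector. The sketch below shows one plausible reading, under stated assumptions: the helper names `rank_words` and `dictionary_vector` are hypothetical, the sample distribution is invented for illustration, and the vector is assumed to hold one count per vocabulary word so that every word acts as a keyword feature.

```python
def rank_words(word_distribution, top_n=3):
    """Sort words by probability, highest first; ties are broken alphabetically."""
    ranked = sorted(word_distribution.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_n]


def dictionary_vector(vocabulary, text):
    """Return one count per vocabulary word for the given text.

    Assumption: lowercased whitespace tokenization, matching the vocabulary.
    """
    tokens = text.lower().split()
    return [tokens.count(word) for word in vocabulary]


# Hypothetical word distribution for a single label.
distribution = {"server": 0.4, "down": 0.2, "please": 0.2, "restart": 0.2}

top = rank_words(distribution, top_n=2)
vocabulary = sorted(distribution)  # ['down', 'please', 'restart', 'server']
vector = dictionary_vector(vocabulary, "Server restart restart")
```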

10. A method, performed on a data processing system, of automatically discovering one or more features in free form problem ticket data to facilitate one or more information technology (IT) operations, the method comprising the steps of:
- obtaining free form problem ticket data, wherein at least a portion of the data is textual data representing one or more inquiries received by a call center;
  
  labeling a portion of the data;
  
  grouping the labeled data into one or more groups, wherein each group is associated with a label;
  
  generating a language model for each group, wherein the model computes a word distribution for each group; and
  
  wherein the language model for each group comprises a probability of a word occurring in the group, the probability comprising a frequency of the word within the group divided by a total number of words within the group;
  
  automatically discovering one or more features in the data using the word distribution, wherein the one or more features include text related features that are used for recognizing an information type of a particular unit of text;
  
  expanding the one or more discovered features; and
  
  using the one or more expanded features to facilitate one or more information technology (IT) operations;
  
  wherein the data processing system comprises a memory and a processor coupled to the memory; and
  
  wherein the obtaining step, the labeling step, the generating step, the word distribution using step, the expanding step, and the expanded feature step are performed, at least in part, on the data processing system.
- Dependent claims (11, 12, 13, 14, 15, 16, 17, 18, 19, 20):
  - 11. The method of claim 10, wherein the step of expanding the one or more discovered features comprises identifying one or more correlations with one or more features from an external resource.
  - 12. The method of claim 11, wherein the external resource comprises at least one of a thesaurus and a set of rules used to identify data in a different database.
  - 13. The method of claim 10, wherein the step of expanding the one or more discovered features comprises computing a probability indicating a relationship between two or more features, and wherein the probability indicating a relationship comprises a frequency of a co-occurrence of a first feature of the two or more features and a second feature of the two or more features in a group of features divided by a total number of co-occurrences of features within the group of features.
  - 14. The method of claim 10, wherein the step of using the word distribution to discover one or more features in the data comprises using the word distribution to discover a keyword feature.
  - 15. The method of claim 14, wherein the step of using the word distribution to discover a keyword feature comprises at least one of computing a variance of a set of numbers, computing an inverse group frequency, and manually selecting a result from at least one of computing a variance of a set of one or more numbers and computing an inverse group frequency, wherein the computing of a variance of a set of numbers comprises subtracting a mathematical mean of the set of numbers from at least one number of the set of numbers, and wherein the computing of an inverse group frequency comprises dividing a number of a plurality of clusters of words by a number of clusters of words that contain a specific word, the plurality of clusters of words comprising each of the clusters of words that contain the specific word.
  - 16. The method of claim 10, wherein the step of using the word distribution to discover one or more features in the data comprises using the word distribution to discover a distance feature.
  - 17. The method of claim 16, wherein using the word distribution to discover a distance feature comprises at least one of computing a Kullback-Leibler divergence (K-L divergence) between a word distribution of a query data and the word distribution of a group, and computing a smallest K-L divergence between a word distribution of a query data and the word distribution of each data in each group.
  - 18. The method of claim 10, wherein the one or more features comprise at least one of a formatting feature, a character feature and a percentage feature.
  - 19. The method of claim 10, further comprising the step of generating a dictionary vector, wherein each word of the data appears as a keyword feature.
  - 20. The method of claim 10, wherein the step of expanding the one or more discovered features comprises incorporating a portion of unlabeled free form problem ticket data.
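
The keyword scoring in claim 15 can be illustrated roughly as below. This is a sketch under assumptions, not the patent's procedure: variance is computed in the usual population sense (mean of squared deviations from the mean) over a word's probabilities across groups, and the inverse group frequency is the plain ratio named in the claim, with no logarithm applied. The function names and the toy `model` are hypothetical.

```python
def variance(values):
    """Population variance: mean of squared deviations from the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def inverse_group_frequency(word, model):
    """Number of groups divided by the number of groups containing the word."""
    containing = sum(1 for dist in model.values() if word in dist)
    if containing == 0:
        return 0.0
    return len(model) / containing


def keyword_scores(model):
    """Map each word to (variance of its probability across groups, IGF)."""
    vocabulary = set().union(*model.values())
    scores = {}
    for word in vocabulary:
        probs = [dist.get(word, 0.0) for dist in model.values()]
        scores[word] = (variance(probs), inverse_group_frequency(word, model))
    return scores


# Hypothetical per-group word distributions.
model = {
    "problem": {"server": 0.5, "the": 0.5},
    "action": {"restart": 0.5, "the": 0.5},
}
scores = keyword_scores(model)
```

In this toy data a word that is equally likely in every group, such as "the", scores zero variance and the minimum inverse group frequency, while "server" scores higher on both, which is the intuition for treating it as a candidate keyword.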

21. An apparatus for automatically discovering one or more features in free form heterogeneous data, the apparatus comprising:
- a memory; and
  
  at least one processor coupled to the memory and operative to:
  
  obtain free form heterogeneous data, wherein the data comprises one or more data items, and wherein at least a portion of the data is textual data representing one or more inquiries received by a call center;
  
  apply a label to each data item;
  
  use the labeled data to build a language model, wherein a word distribution associated with each label is derived from the model, and wherein the language model comprises a probability of a word occurring in a cluster of words, the probability comprising a frequency of the word within the cluster of words divided by a total number of words within the cluster of words; and
  
  automatically discover one or more features in the data using the word distribution associated with each label, wherein discovering one or more features in the data facilitates one or more operations that use at least a portion of the labeled data, and wherein the one or more features include text related features that are used for recognizing an information type of a particular unit of text.
- Dependent claims (22, 23):
  - 22. The apparatus of claim 21, wherein the at least one processor is operative to use the word distribution associated with each label to discover at least one of a keyword feature, a distance feature, a formatting feature, a character feature and a percentage feature.
  - 23. The apparatus of claim 21, wherein the at least one processor is further operative to generate a dictionary vector, wherein each word of the data appears as a keyword feature.

24. A computer program product comprising a computer readable storage medium having computer useable program code for automatically discovering one or more features in free form heterogeneous data in order to, at least in part, resolve problems customers experience with commercial products, the computer program product including:
- computer useable program code for obtaining free form heterogeneous data, wherein the data comprises one or more data items, and wherein at least a portion of the data is textual data representing one or more inquiries received by a call center;
  
  computer useable program code for applying a label to each data item;
  
  computer useable program code for using the labeled data to build a language model, wherein a word distribution associated with each label is derived from the model, and wherein the language model comprises a probability of a word occurring in a cluster of words, the probability comprising a frequency of the word within the cluster of words divided by a total number of words within the cluster of words, the cluster of words associated with one of the applied labels; and
  
  computer useable program code for automatically discovering one or more features in the data using the word distribution associated with each label, wherein discovering one or more features in the data facilitates one or more operations that use at least a portion of the labeled data, and wherein the one or more features include text related features that are used for recognizing an information type of a particular unit of text.
- Dependent claims (25, 26):
  - 25. The computer program product of claim 24, wherein the computer usable program code for using the word distribution associated with each label to discover one or more features in the data comprises computer usable program code for using the word distribution associated with each label to discover at least one of a keyword feature, a distance feature, a formatting feature, a character feature and a percentage feature.
  - 26. The computer program product of claim 24, further including computer usable program code for generating a dictionary vector, wherein each word of the data appears as a keyword feature.

27. A computer program product comprising a computer readable storage medium having computer useable program code for automatically discovering one or more features in free form problem ticket data to facilitate one or more information technology (IT) operations data in order to, at least in part, resolve problems customers experience with commercial products, the computer program product including:
- computer useable program code for obtaining free form problem ticket data, wherein at least a portion of the data is textual data representing one or more inquiries received by a call center;
  
  computer useable program code for labeling a portion of the data;
  
  computer useable program code for grouping the labeled data into one or more groups, wherein each group is associated with a label;
  
  computer useable program code for generating a language model for each group, wherein the model computes a word distribution for each group, and wherein the language model comprises a probability of a word occurring in the group, the probability comprising a frequency of the word within the group divided by a total number of words within the group;
  
  computer useable program code for automatically discovering one or more features in the data using the word distribution, wherein the one or more features include text related features that are used for recognizing an information type of a particular unit of text;
  
  computer useable program code for expanding the one or more discovered features; and
  
  computer useable program code for using the one or more expanded features to facilitate one or more information technology (IT) operations.
- Dependent claims (28, 29, 30, 31, 32, 33, 34, 35):
  - 28. The computer program product of claim 27, wherein the computer usable program code for expanding the one or more discovered features comprises computer usable program code for identifying one or more correlations with one or more features from an external resource.
  - 29. The computer program product of claim 28, wherein the computer usable program code for identifying one or more correlations with one or more features from an external resource comprises computer usable program code for identifying one or more correlations with at least one of a thesaurus and a set of rules used to identify data in a different database.
  - 30. The computer program product of claim 27, wherein the computer usable program code for expanding the one or more discovered features comprises computer usable program code for computing a probability indicating a relationship between two or more features, and wherein the probability indicator comprises a frequency of a co-occurrence of a first feature of the two or more features and a second feature of the two or more features in a group of features divided by a total number of co-occurrences of features within the group of features.
  - 31. The computer program product of claim 27, wherein the computer usable program code for using the word distribution to discover one or more features in the data comprises computer usable program code for using the word distribution to discover a keyword feature.
  - 32. The computer program product of claim 31, wherein the computer usable program code for using the word distribution to discover a keyword feature comprises computer usable program code for at least one of computing a variance of a set of numbers, computing an inverse group frequency, and manually selecting a result from at least one of computing a variance of a set of one or more numbers and computing an inverse group frequency, wherein the computing of a variance of a set of numbers comprises subtracting a mathematical mean of the set of numbers from at least one number of the set of numbers, and wherein the computing of an inverse group frequency comprises dividing a number of a plurality of clusters of words by a number of clusters of words that contain a specific word, the plurality of clusters of words comprising each of the clusters of words that contain the specific word.
  - 33. The computer program product of claim 27, wherein the computer usable program code for using the word distribution to discover one or more features in the data comprises computer usable program code for using the word distribution to discover a distance feature.
  - 34. The computer program product of claim 33, wherein the computer usable program code for using the word distribution to discover a distance feature comprises computer usable program code for at least one of computing a Kullback-Leibler divergence (K-L divergence) between a word distribution of a query data and the word distribution of a group, and computing a smallest K-L divergence between a word distribution of a query data and the word distribution of each data in each group.
  - 35. The computer program product of claim 27, wherein the computer usable program code for using the word distribution to discover one or more features from the data comprises computer usable program code for using the word distribution to discover at least one of a formatting feature, a character feature and a percentage feature.
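
Two of the computations named in these dependent claims, the K-L divergence distance feature (claims 17 and 34) and the co-occurrence probability used for feature expansion (claims 13 and 30), can be sketched as follows. Everything here is illustrative: the function names, the `epsilon` floor for words missing from the second distribution, the treatment of co-occurrence as unordered feature pairs, and the sample data are all assumptions that the claims do not specify.

```python
import math
from collections import Counter
from itertools import combinations


def kl_divergence(p, q, epsilon=1e-9):
    """D(p || q) = sum over words of p(w) * log(p(w) / q(w)).

    Assumption: epsilon stands in for words that q has never seen.
    """
    return sum(
        pw * math.log(pw / max(q.get(w, 0.0), epsilon))
        for w, pw in p.items()
        if pw > 0
    )


def smallest_item_divergence(query, group_items):
    """Smallest divergence between the query and any single item in a group."""
    return min(kl_divergence(query, item) for item in group_items)


def cooccurrence_probability(feature_sets, first, second):
    """Co-occurrences of (first, second) divided by all pair co-occurrences."""
    pair_counts = Counter()
    for features in feature_sets:
        # Count each unordered pair once per feature set.
        pair_counts.update(combinations(sorted(set(features)), 2))
    total = sum(pair_counts.values())
    if total == 0:
        return 0.0
    return pair_counts[tuple(sorted((first, second)))] / total


# Hypothetical query and item word distributions.
query = {"server": 0.5, "down": 0.5}
items = [{"restart": 1.0}, {"server": 0.5, "down": 0.5}]
nearest = smallest_item_divergence(query, items)

# Hypothetical feature sets extracted from three tickets.
feature_sets = [
    ["server", "down"],
    ["server", "down", "restart"],
    ["restart", "done"],
]
p_pair = cooccurrence_probability(feature_sets, "server", "down")
```

A smaller divergence suggests that the query's wording is closer to that group or item, so the value can serve as a distance feature; the co-occurrence ratio gives a rough measure of how strongly two features are related when expanding a discovered feature set.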

Current Assignee: International Business Machines Corporation
Original Assignee: International Business Machines Corporation
Inventors: Kar, Gautam; Mahindru, Ruchi; Sailer, Anca; Wei, Xing
Primary Examiner(s): HARPER, ELIYAH STONE
Application Number: US11/675,396
Publication Number: US 20080201131A1
Time in Patent Office: 1,811 Days
Field of Search: 707/104.1, 707/100, 707/10, 707/2, 707/999.102, 707/999, 707/758, 707/769, 707/802, 707/999.104
US Class Current: 707/758
CPC Class Codes:
G06F 16/35 Clustering; Classification
G06Q 10/10 Office automation; Time man...
