Method and apparatus for automatically discovering features in free form heterogeneous data
First Claim
1. A method, performed using a data processing system, of automatically discovering one or more features in free form heterogeneous data, the method comprising the steps of:
obtaining free form heterogeneous data, wherein the data comprises one or more data items, and wherein at least a portion of the data is textual data representing one or more inquiries received by a call center;
applying a label to each data item;
using the labeled data to build a language model, wherein a word distribution associated with each label is derived from the model, and wherein the language model comprises a probability of a word occurring in a cluster of words, the probability comprising a frequency of the word within the cluster of words divided by a total number of words within the cluster of words; and
automatically discovering one or more features in the data using the word distribution associated with each label, wherein discovering one or more features in the data facilitates one or more operations that use at least a portion of the labeled data, and wherein the one or more features include text related features that are used for recognizing an information type of a particular unit of text;
wherein the data processing system comprises a memory and a processor coupled to the memory; and
wherein the obtaining step, the applying step, the labeled data using step, and the word distribution using step are performed, at least in part, on the data processing system.
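The probability recited in the claim can be written compactly as follows. The notation is an assumption introduced here for illustration and does not appear in the claim: $w$ is a word, $c$ is a cluster of words, and $n(w, c)$ is the number of times $w$ occurs in $c$.

```latex
P(w \mid c) = \frac{n(w, c)}{\sum_{w'} n(w', c)}
```

The denominator is the total number of words in the cluster, so the probabilities of all words in a given cluster sum to 1.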
Abstract
Techniques are provided for automatically discovering one or more features in free form heterogeneous data. In one aspect of the invention, the techniques include obtaining free form heterogeneous data, wherein the data comprises one or more data items, applying a label to each data item, using the labeled data to build a language model, wherein a word distribution associated with each label can be derived from the model, and using the word distribution associated with each label to discover one or more features in the data, wherein discovering one or more features in the data facilitates one or more operations that use at least a portion of the labeled data.
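As an illustration of the language model step, the following minimal Python sketch groups labeled data items by label and estimates, for each label, the probability of a word as its frequency within that label's cluster of words divided by the total number of words in the cluster. The sample items, the label names, the whitespace tokenization, and the function name `build_word_distributions` are assumptions made for illustration; the source does not specify them.

```python
from collections import Counter, defaultdict


def build_word_distributions(labeled_items):
    """Return {label: {word: probability}} for (label, text) pairs.

    The probability of a word is its count within the label's cluster of
    words divided by the total number of words in that cluster.
    """
    counts = defaultdict(Counter)
    for label, text in labeled_items:
        # Assumption: lowercase whitespace tokenization stands in for
        # whatever tokenizer a real system would use.
        counts[label].update(text.lower().split())

    distributions = {}
    for label, counter in counts.items():
        total = sum(counter.values())
        distributions[label] = {w: c / total for w, c in counter.items()}
    return distributions


# Hypothetical labeled call center data.
items = [
    ("problem", "printer will not print"),
    ("problem", "printer shows error"),
    ("resolution", "restarted printer spooler"),
]
dists = build_word_distributions(items)
```

In this hypothetical data the "problem" cluster holds seven words, two of which are "printer", so `dists["problem"]["printer"]` is 2/7.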
35 Claims
1. A method, performed using a data processing system, of automatically discovering one or more features in free form heterogeneous data, the method comprising the steps of:
obtaining free form heterogeneous data, wherein the data comprises one or more data items, and wherein at least a portion of the data is textual data representing one or more inquiries received by a call center;
applying a label to each data item;
using the labeled data to build a language model, wherein a word distribution associated with each label is derived from the model, and wherein the language model comprises a probability of a word occurring in a cluster of words, the probability comprising a frequency of the word within the cluster of words divided by a total number of words within the cluster of words; and
automatically discovering one or more features in the data using the word distribution associated with each label, wherein discovering one or more features in the data facilitates one or more operations that use at least a portion of the labeled data, and wherein the one or more features include text related features that are used for recognizing an information type of a particular unit of text;
wherein the data processing system comprises a memory and a processor coupled to the memory; and
wherein the obtaining step, the applying step, the labeled data using step, and the word distribution using step are performed, at least in part, on the data processing system.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9
10. A method, performed on a data processing system, of automatically discovering one or more features in free form problem ticket data to facilitate one or more information technology (IT) operations, the method comprising the steps of:
obtaining free form problem ticket data, wherein at least a portion of the data is textual data representing one or more inquiries received by a call center;
labeling a portion of the data;
grouping the labeled data into one or more groups, wherein each group is associated with a label;
generating a language model for each group, wherein the model computes a word distribution for each group; and wherein the language model for each group comprises a probability of a word occurring in the group, the probability comprising a frequency of the word within the group divided by a total number of words within the group;
automatically discovering one or more features in the data using the word distribution, wherein the one or more features include text related features that are used for recognizing an information type of a particular unit of text;
expanding the one or more discovered features; and
using the one or more expanded features to facilitate one or more information technology (IT) operations;
wherein the data processing system comprises a memory and a processor coupled to the memory; and
wherein the obtaining step, the labeling step, the generating step, the word distribution using step, the expanding step, and the expanded feature step are performed, at least in part, on the data processing system.
Dependent claims: 11, 12, 13, 14, 15, 16, 17, 18, 19, 20
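The grouping, per-group language model, feature discovery, and feature expansion steps of this claim can be sketched in Python as below. The claim does not say how features are selected from the word distribution or how they are expanded, so both heuristics here are assumptions chosen only for illustration: a word is treated as a feature of a group when it is at least `min_ratio` times more probable in that group than in any other group, and expansion adds terms from a hand-written `related_terms` dictionary. The ticket texts, labels, and all function names are hypothetical.

```python
from collections import Counter, defaultdict


def word_distributions(labeled_items):
    """Per-group unigram model: count of a word / total words in the group."""
    counts = defaultdict(Counter)
    for label, text in labeled_items:
        counts[label].update(text.lower().split())
    distributions = {}
    for label, counter in counts.items():
        total = sum(counter.values())
        distributions[label] = {w: c / total for w, c in counter.items()}
    return distributions


def discover_features(distributions, min_ratio=3.0):
    """Assumed heuristic: keep words that are at least min_ratio times
    more probable in one group than in every other group."""
    features = {}
    for label, dist in distributions.items():
        selected = set()
        for word, p in dist.items():
            p_other = max(
                (other.get(word, 0.0)
                 for name, other in distributions.items() if name != label),
                default=0.0,
            )
            if p >= min_ratio * p_other:
                selected.add(word)
        features[label] = selected
    return features


def expand_features(features, related_terms):
    """Assumed expansion: add hand-listed related terms for each feature."""
    return {
        label: words | {r for w in words for r in related_terms.get(w, ())}
        for label, words in features.items()
    }


# Hypothetical problem ticket fragments, already labeled.
tickets = [
    ("problem", "printer error printer offline"),
    ("problem", "error on login"),
    ("resolution", "restarted printer"),
    ("resolution", "reset password and restarted service"),
]
features = discover_features(word_distributions(tickets))
expanded = expand_features(features, {"error": ["failure", "fault"]})
```

With this data, "printer" appears in both groups at comparable rates and is therefore not selected for either one, while "error" is selected for the "problem" group and then expanded with "failure" and "fault".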
21. An apparatus for automatically discovering one or more features in free form heterogeneous data, the apparatus comprising:
a memory; and
at least one processor coupled to the memory and operative to:
obtain free form heterogeneous data, wherein the data comprises one or more data items, and wherein at least a portion of the data is textual data representing one or more inquiries received by a call center;
apply a label to each data item;
use the labeled data to build a language model, wherein a word distribution associated with each label is derived from the model, and wherein the language model comprises a probability of a word occurring in a cluster of words, the probability comprising a frequency of the word within the cluster of words divided by a total number of words within the cluster of words; and
automatically discover one or more features in the data using the word distribution associated with each label, wherein discovering one or more features in the data facilitates one or more operations that use at least a portion of the labeled data, and wherein the one or more features include text related features that are used for recognizing an information type of a particular unit of text.
Dependent claims: 22, 23
24. A computer program product comprising a computer readable storage medium having computer useable program code for automatically discovering one or more features in free form heterogeneous data in order to, at least in part, resolve problems customers experience with commercial products, the computer program product including:
computer useable program code for obtaining free form heterogeneous data, wherein the data comprises one or more data items, and wherein at least a portion of the data is textual data representing one or more inquiries received by a call center;
computer useable program code for applying a label to each data item;
computer useable program code for using the labeled data to build a language model, wherein a word distribution associated with each label is derived from the model, and wherein the language model comprises a probability of a word occurring in a cluster of words, the probability comprising a frequency of the word within the cluster of words divided by a total number of words within the cluster of words, the cluster of words associated with one of the applied labels; and
computer useable program code for automatically discovering one or more features in the data using the word distribution associated with each label, wherein discovering one or more features in the data facilitates one or more operations that use at least a portion of the labeled data, and wherein the one or more features include text related features that are used for recognizing an information type of a particular unit of text.
Dependent claims: 25, 26
27. A computer program product comprising a computer readable storage medium having computer useable program code for automatically discovering one or more features in free form problem ticket data to facilitate one or more information technology (IT) operations data in order to, at least in part, resolve problems customers experience with commercial products, the computer program product including:
computer useable program code for obtaining free form problem ticket data, wherein at least a portion of the data is textual data representing one or more inquiries received by a call center;
computer useable program code for labeling a portion of the data;
computer useable program code for grouping the labeled data into one or more groups, wherein each group is associated with a label;
computer useable program code for generating a language model for each group, wherein the model computes a word distribution for each group, and wherein the language model comprises a probability of a word occurring in the group, the probability comprising a frequency of the word within the group divided by a total number of words within the group;
computer useable program code for automatically discovering one or more features in the data using the word distribution, wherein the one or more features include text related features that are used for recognizing an information type of a particular unit of text;
computer useable program code for expanding the one or more discovered features; and
computer useable program code for using the one or more expanded features to facilitate one or more information technology (IT) operations.
Dependent claims: 28, 29, 30, 31, 32, 33, 34, 35
Specification