Joint approach to feature and document labeling
First Claim
1. A system operating on a set of documents comprising text, the system comprising:
a computer;
a user interfacing device including a display and at least one user input device; and
a non-transitory storage medium storing instructions programming the computer to perform operations including:
(i) receiving, via the user interfacing device, L labeled topics each labeled with a word list comprising words representative of the labeled topic;
(ii) performing probabilistic classification of the documents of the set of documents to generate for each labeled topic of the L labeled topics a document vector whose elements store scores of the documents for the labeled topic and a word vector whose elements store scores of words of a vocabulary for the labeled topic;
(iii) performing non-negative matrix factorization (NMF) to generate a NMF model that clusters the set of documents into k topics where k > L and the performing NMF includes initializing NMF factors representing L topics of the k topics to the document and word vectors for the L labeled topics generated in the operation (ii); and
performing data mining using the NMF model.
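Operation (ii) can be illustrated with a minimal sketch. The claim does not say which probabilistic classifier is used, so the single-pass scoring below is an assumption chosen for brevity: a document's score is the share of its tokens that appear in the topic's word list, and a word's score is its count weighted by those document scores and normalised to sum to 1. The function name `classify_topic` and its arguments are hypothetical.

```python
def classify_topic(bow, vocab, word_list):
    """Score documents and vocabulary words for one labeled topic.

    bow:       list of D rows, each a list of W non-negative word counts
    vocab:     list of the W vocabulary words, in column order
    word_list: words supplied by the annotator for this topic
    Returns (doc_scores of length D, word_scores of length W).
    NOTE: a stand-in heuristic, not the classifier described in the patent.
    """
    seed = set(word_list)
    in_seed = [w in seed for w in vocab]
    if not any(in_seed):
        raise ValueError("word list shares no words with the vocabulary")

    # Document score: fraction of the document's tokens that are seed words.
    doc_scores = []
    for row in bow:
        total = sum(row)
        hits = sum(count for count, is_seed in zip(row, in_seed) if is_seed)
        doc_scores.append(hits / total if total else 0.0)

    # Word score: counts weighted by document scores, normalised to sum to 1.
    raw = [sum(row[w] * score for row, score in zip(bow, doc_scores))
           for w in range(len(vocab))]
    norm = sum(raw)
    word_scores = [r / norm for r in raw] if norm else raw
    return doc_scores, word_scores
```

Calling the function once per labeled topic yields the L document vectors and L word vectors that operation (iii) uses to initialize the NMF factors.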
Abstract
Documents of a set of documents are represented by bag-of-words (BOW) vectors. L labeled topics are provided, each labeled with a word list comprising words of a vocabulary that are representative of the labeled topic and possibly a list of relevant documents. Probabilistic classification of the documents generates for each labeled topic a document vector whose elements store scores of the documents for the labeled topic and a word vector whose elements store scores of the words of the vocabulary for the labeled topic. Non-negative matrix factorization (NMF) is performed to generate a document-topic model that clusters the documents into k topics where k>L. NMF factors representing L topics of the k topics are initialized to the document and word vectors for the L labeled topics. In some embodiments the NMF factors representing the L topics initialized to the document and word vectors are frozen, that is, are not updated by the NMF after the initialization.
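The seeded factorization described in the abstract can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: it uses the standard Lee-Seung multiplicative updates for the squared-error objective, which the abstract does not specify, and the names `seeded_nmf`, `doc_vecs`, `word_vecs`, and `freeze` are hypothetical. The D x W matrix X of BOW vectors is approximated by A (D x k, document-topic) times B (k x W, topic-word). The first L columns of A and the first L rows of B are initialized from the labeled-topic vectors, and with `freeze=True` they are never updated.

```python
import random

def matmul(P, Q):
    """Plain-Python matrix product of two lists of rows."""
    cols = list(zip(*Q))
    return [[sum(p * q for p, q in zip(row, col)) for col in cols] for row in P]

def transpose(M):
    return [list(col) for col in zip(*M)]

def seeded_nmf(X, doc_vecs, word_vecs, k, iters=200, freeze=True, seed=0, eps=1e-9):
    """Factor X ~ A @ B with L of the k topics seeded from labeled topics.

    doc_vecs:  L lists of length D (document scores per labeled topic)
    word_vecs: L lists of length W (word scores per labeled topic)
    """
    D, W, L = len(X), len(X[0]), len(doc_vecs)
    if k <= L:
        raise ValueError("k must be greater than L")
    rng = random.Random(seed)
    # Seeded factors come first; the remaining k - L factors start random.
    A = [[doc_vecs[t][d] if t < L else rng.random() for t in range(k)]
         for d in range(D)]
    B = [list(word_vecs[t]) if t < L else [rng.random() for _ in range(W)]
         for t in range(k)]
    first = L if freeze else 0  # index of the first factor that may change
    for _ in range(iters):
        # Update topic-word rows: B <- B * (A^T X) / (A^T A B)
        At = transpose(A)
        num, den = matmul(At, X), matmul(matmul(At, A), B)
        for t in range(first, k):
            B[t] = [B[t][w] * num[t][w] / (den[t][w] + eps) for w in range(W)]
        # Update document-topic columns: A <- A * (X B^T) / (A B B^T)
        Bt = transpose(B)
        num, den = matmul(X, Bt), matmul(A, matmul(B, Bt))
        for d in range(D):
            for t in range(first, k):
                A[d][t] *= num[d][t] / (den[d][t] + eps)
    return A, B
```

With `freeze=False` the seeded factors serve only as a starting point and drift during the updates. Multiplicative updates keep any entry that starts at exactly zero at zero, so a word or document scored 0 for a labeled topic stays excluded from that topic in either mode.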
20 Claims
1. A system operating on a set of documents comprising text, the system comprising:
a computer;
a user interfacing device including a display and at least one user input device; and
a non-transitory storage medium storing instructions programming the computer to perform operations including:
(i) receiving, via the user interfacing device, L labeled topics each labeled with a word list comprising words representative of the labeled topic;
(ii) performing probabilistic classification of the documents of the set of documents to generate for each labeled topic of the L labeled topics a document vector whose elements store scores of the documents for the labeled topic and a word vector whose elements store scores of words of a vocabulary for the labeled topic;
(iii) performing non-negative matrix factorization (NMF) to generate a NMF model that clusters the set of documents into k topics where k > L and the performing NMF includes initializing NMF factors representing L topics of the k topics to the document and word vectors for the L labeled topics generated in the operation (ii); and
performing data mining using the NMF model.
Dependent claims: 2, 3, 4, 5, 6, 7, 8
9. A method comprising:
receiving, at a computer, D documents each represented by a bag-of-words (BOW) vector of length W over a vocabulary;
receiving, at the computer and from a human annotator, L labeled topics wherein each labeled topic is labeled with a word list comprising words of the vocabulary that are representative of the labeled topic;
performing, using the computer, probabilistic classification of the D documents to generate for each labeled topic a document vector of length D and a word vector of length W;
performing, using the computer, non-negative matrix factorization (NMF) to generate a NMF model that clusters the D documents into k topics where k > L and the performing NMF includes initializing NMF factors representing L topics of the k topics to the document and word vectors for the L labeled topics generated by the probabilistic classification of the D documents; and
performing data mining using the NMF model including at least one of:
displaying a list of documents of the D documents indicated by the NMF model as belonging to a topic of interest of the k topics, and
receiving from a human user a selection of a document of the set of documents and displaying a list of other documents of the set of documents having most similar topic scores indicated by the NMF model.
Dependent claims: 10, 11, 12, 13, 14, 15, 16
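The two data-mining operations recited in claim 9 can be sketched on top of a document-topic matrix A with D rows and k columns. The claim does not define when a document belongs to a topic or how similarity of topic scores is measured, so both choices below are assumptions: a document is assigned to its highest-scoring topic, and similarity is the cosine between rows of A. The function names are hypothetical.

```python
import math

def documents_for_topic(A, topic):
    """Indices of documents whose highest-scoring topic is `topic`,
    ordered from highest to lowest score on that topic."""
    members = [d for d, row in enumerate(A)
               if max(range(len(row)), key=row.__getitem__) == topic]
    return sorted(members, key=lambda d: A[d][topic], reverse=True)

def most_similar(A, doc, n=3):
    """Indices of the n other documents whose topic scores are closest
    to those of document `doc`, by cosine similarity (an assumed measure)."""
    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm_u = math.sqrt(sum(a * a for a in u))
        norm_v = math.sqrt(sum(b * b for b in v))
        return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

    others = [d for d in range(len(A)) if d != doc]
    return sorted(others, key=lambda d: cosine(A[doc], A[d]), reverse=True)[:n]
```

A display layer would map the returned indices back to document titles. A threshold on the topic score could replace the highest-score rule if documents are allowed to belong to several topics.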
17. A non-transitory storage medium storing instructions executable by a computer to operate on documents of a set of D documents each represented by a bag-of-words (BOW) vector of length W over a vocabulary by operations including:
(i) receiving L labeled topics each labeled with a word list comprising words of the vocabulary that are representative of the labeled topic;
(ii) performing probabilistic classification of the D documents to generate for each labeled topic of the L labeled topics a document vector of length D whose elements store scores of the D documents for the labeled topic and a word vector of length W whose W elements store scores of the W words of the vocabulary for the labeled topic;
(iii) generating a document-topic model by clustering the D documents into k clusters where k > L and the clustering includes initializing L clusters of the k clusters to the L labeled topics generated in the operation (ii); and
performing data mining using the document-topic model.
Dependent claims: 18, 19, 20
Specification