JOINT APPROACH TO FEATURE AND DOCUMENT LABELING
Abstract
Documents of a set of documents are represented by bag-of-words (BOW) vectors. L labeled topics are provided, each labeled with a word list comprising words of a vocabulary that are representative of the labeled topic and possibly a list of relevant documents. Probabilistic classification of the documents generates for each labeled topic a document vector whose elements store scores of the documents for the labeled topic and a word vector whose elements store scores of the words of the vocabulary for the labeled topic. Non-negative matrix factorization (NMF) is performed to generate a document-topic model that clusters the documents into k topics where k>L. NMF factors representing L topics of the k topics are initialized to the document and word vectors for the L labeled topics. In some embodiments the NMF factors representing the L topics initialized to the document and word vectors are frozen, that is, are not updated by the NMF after the initialization.
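As an illustration only, the seeded factorization described in the abstract might be sketched as follows. The sketch assumes a Frobenius-norm objective with multiplicative updates, which the abstract does not specify. The function name `seeded_nmf`, its parameters, and the layout of the factors (`A` as documents by topics, `B` as topics by words) are hypothetical choices made for this example.

```python
import numpy as np

def seeded_nmf(X, doc_vecs, word_vecs, k, n_iter=200, freeze=False, eps=1e-9, seed=0):
    """Factor X (D x W, non-negative BOW matrix) as A @ B with k topics.

    doc_vecs:  (D, L) document scores for the L labeled topics.
    word_vecs: (L, W) word scores for the L labeled topics.
    Assumption: multiplicative updates on the Frobenius objective.
    """
    D, W = X.shape
    L = doc_vecs.shape[1]
    if k <= L:
        raise ValueError("k must exceed the number of labeled topics L")
    rng = np.random.default_rng(seed)
    # Random non-negative start for all k topics ...
    A = rng.random((D, k))
    B = rng.random((k, W))
    # ... then overwrite the first L topics with the labeled-topic vectors.
    A[:, :L] = doc_vecs
    B[:L, :] = word_vecs
    for _ in range(n_iter):
        A_new = A * (X @ B.T) / (A @ B @ B.T + eps)
        if freeze:
            A_new[:, :L] = A[:, :L]  # labeled document factors stay fixed
        A = A_new
        B_new = B * (A.T @ X) / (A.T @ A @ B + eps)
        if freeze:
            B_new[:L, :] = B[:L, :]  # labeled word factors stay fixed
        B = B_new
    return A, B
```

With `freeze=True` the first L columns of `A` and the first L rows of `B` keep their initial values, mirroring the frozen embodiment, and only the remaining k-L topics are learned. Note that under multiplicative updates an entry initialized to exactly zero remains zero, so unfrozen seed vectors may need a small positive floor. A cluster assignment for each document can be read off as `A.argmax(axis=1)`.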
20 Claims
1. A document labeling system comprising:
an electronic data processing device configured to label documents comprising text of a set of documents by operations including:
(i) receiving L labeled topics each labeled with a word list comprising words representative of the labeled topic;
(ii) performing probabilistic classification of the documents of the set of documents to generate for each labeled topic of the L labeled topics a document vector whose elements store scores of the documents for the labeled topic and a word vector whose elements store scores of words of a vocabulary for the labeled topic; and
(iii) performing non-negative matrix factorization (NMF) to generate a NMF model that clusters the set of documents into k topics where k>L and the performing NMF includes initializing NMF factors representing L topics of the k topics to the document and word vectors for the L labeled topics generated in the operation (ii).
Dependent claims: 2, 3, 4, 5, 6, 7, 8
9. A document labeling method comprising:
receiving, at an electronic data processing device, D documents each represented by a bag-of-words (BOW) vector of length W over a vocabulary;
receiving, at the electronic data processing device, L labeled topics wherein each labeled topic is labeled with a word list comprising words of the vocabulary that are representative of the labeled topic;
performing, using the electronic data processing device, probabilistic classification of the D documents to generate for each labeled topic a document vector of length D and a word vector of length W; and
performing, using the electronic data processing device, non-negative matrix factorization (NMF) to generate a NMF model that clusters the D documents into k topics where k>L and the performing NMF includes initializing NMF factors representing L topics of the k topics to the document and word vectors for the L labeled topics generated by the probabilistic classification of the D documents.
Dependent claims: 10, 11, 12, 13, 14, 15, 16
17. A non-transitory storage medium storing instructions executable by an electronic data processing device to label documents of a set of D documents each represented by a bag-of-words (BOW) vector of length W over a vocabulary by operations including:
(i) receiving L labeled topics each labeled with a word list comprising words of the vocabulary that are representative of the labeled topic;
(ii) performing probabilistic classification of the D documents to generate for each labeled topic of the L labeled topics a document vector of length D whose elements store scores of the D documents for the labeled topic and a word vector of length W whose W elements store scores of the W words of the vocabulary for the labeled topic; and
(iii) clustering the D documents into k clusters where k>L and the clustering includes initializing L clusters of the k clusters to the L labeled topics generated in the operation (ii).
Dependent claims: 18, 19, 20
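To make the shapes in operation (ii) concrete, the following sketch produces, for each labeled topic, a document vector of length D and a word vector of length W from the topic's word list. It is not the probabilistic classifier of the claims, which this section does not detail. It is a simple seed-word scoring stand-in, and the function name `seed_scores` and the scoring rule are assumptions made for illustration.

```python
import numpy as np

def seed_scores(X, vocab, word_lists):
    """Hypothetical stand-in for operation (ii).

    X:          (D, W) BOW count matrix.
    vocab:      list of the W vocabulary words, in column order.
    word_lists: L lists of seed words, one list per labeled topic.
    Returns doc_vecs of shape (D, L) and word_vecs of shape (L, W).
    """
    index = {w: j for j, w in enumerate(vocab)}
    D, W = X.shape
    L = len(word_lists)
    word_vecs = np.zeros((L, W))
    for t, words in enumerate(word_lists):
        cols = np.array([index[w] for w in words if w in index], dtype=int)
        # Uniform score over the topic's seed words found in the vocabulary.
        word_vecs[t, cols] = 1.0 / max(len(cols), 1)
    # Document score: fraction of the document's tokens that are seed words.
    totals = np.maximum(X.sum(axis=1, keepdims=True), 1)
    doc_vecs = (X @ (word_vecs > 0).T.astype(float)) / totals
    return doc_vecs, word_vecs
```

For a vocabulary `["ball", "goal", "vote", "law"]` and word lists `[["ball", "goal"], ["vote", "law"]]`, a document whose counts fall entirely on the first two words scores 1 for the first topic and 0 for the second. Column t of `doc_vecs` and row t of `word_vecs` are the document and word vectors that the claims use to initialize the corresponding NMF factors.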
Specification