Document representation for machine-learning document classification
First Claim
1. A computer-implemented method for providing weighted vector representations of documents, the method being executed by one or more processors and comprising:
receiving, by the one or more processors, text data, the text data comprising a plurality of documents, each document comprising a plurality of words;
processing, by the one or more processors, the text data to provide a plurality of word-vectors, each word-vector being based on a respective word of the plurality of words;
determining, by the one or more processors, a plurality of similarity scores based on the plurality of word-vectors, each similarity score representing a degree of similarity between word-vectors;
grouping, by the one or more processors, words of the plurality of words into clusters based on the plurality of similarity scores, each cluster comprising two or more words of the plurality of words; and
providing, by the one or more processors, a document representation for each document in the plurality of documents, each document representation comprising a feature vector, each feature in the feature vector comprising a cluster, each feature having a weight assigned thereto that represents a relative importance of a respective cluster to a respective document based on weights of constituent words in the cluster, each weight being determined based on a sum of frequency values of words in the respective cluster of the respective document and a sum of document frequency values of words in the respective cluster of the respective document across the plurality of documents.
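The weighting step at the end of the claim can be illustrated with a minimal sketch. The claim states only that each cluster weight is based on a sum of frequency values and a sum of document frequency values; the TF-IDF-like combination used here (summed term frequency multiplied by a logarithmic inverse of summed document frequency) is an assumption for illustration, as are the function name `cluster_weights` and the toy data.

```python
import math
from collections import Counter


def cluster_weights(docs, clusters):
    """Return one feature vector per document, with one weight per cluster.

    docs: list of token lists; clusters: list of sets of words.
    Hypothetical helper, not taken from the patent text.
    """
    n_docs = len(docs)
    # Document frequency: number of documents in which each word occurs.
    df = Counter(word for doc in docs for word in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # term frequency within this document
        vector = []
        for cluster in clusters:
            # Words of this cluster that occur in this document.
            present = [w for w in cluster if w in tf]
            tf_sum = sum(tf[w] for w in present)
            df_sum = sum(df[w] for w in present)
            # ASSUMPTION: TF-IDF-like combination of the two sums.
            weight = tf_sum * math.log(1 + n_docs / df_sum) if df_sum else 0.0
            vector.append(weight)
        vectors.append(vector)
    return vectors


docs = [["cat", "dog", "cat"], ["car", "truck"], ["dog", "car"]]
clusters = [{"cat", "dog"}, {"car", "truck"}]
for vector in cluster_weights(docs, clusters):
    print(vector)
```

Each document is reduced to a vector whose length equals the number of clusters rather than the vocabulary size, which is the practical effect of using clusters as features.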
Abstract
Methods, systems, and computer-readable storage media for providing weighted vector representations of documents, with actions including receiving text data, the text data including a plurality of documents, each document including a plurality of words, processing the text data to provide a plurality of word-vectors, each word-vector being based on a respective word of the plurality of words, determining a plurality of similarity scores based on the plurality of word-vectors, each similarity score representing a degree of similarity between word-vectors, grouping words of the plurality of words into clusters based on the plurality of similarity scores, each cluster including two or more words of the plurality of words, and providing a document representation for each document in the plurality of documents, each document representation including a feature vector, each feature corresponding to a cluster.
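The similarity and grouping steps named in the abstract can be sketched as follows. The abstract does not specify a similarity measure or a clustering algorithm, so cosine similarity, the 0.8 threshold, and the single-link (connected components) grouping are all assumptions chosen for brevity. The two-dimensional word-vectors are hand-made stand-ins for vectors that would normally come from a trained embedding model, and the names `cosine` and `cluster_words` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def cluster_words(word_vectors, threshold=0.8):
    """Group words linked by similarity >= threshold (assumed single-link rule)."""
    words = list(word_vectors)
    parent = {w: w for w in words}

    def find(w):
        # Union-find lookup with path halving.
        while parent[w] != w:
            parent[w] = parent[parent[w]]
            w = parent[w]
        return w

    # Score every pair of words and merge the pairs that are similar enough.
    for i, a in enumerate(words):
        for b in words[i + 1:]:
            if cosine(word_vectors[a], word_vectors[b]) >= threshold:
                parent[find(a)] = find(b)

    groups = {}
    for w in words:
        groups.setdefault(find(w), set()).add(w)
    # Keep only clusters of two or more words, as the abstract requires.
    return [g for g in groups.values() if len(g) >= 2]


# Toy word-vectors (illustrative values only).
vectors = {
    "cat": [1.0, 0.1],
    "dog": [0.9, 0.2],
    "car": [0.1, 1.0],
    "truck": [0.2, 0.9],
    "the": [-1.0, -1.0],
}
print(cluster_words(vectors))
```

With these toy vectors, "cat" groups with "dog" and "car" with "truck", while "the" is similar to nothing and is left out because a cluster needs at least two words.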
10 Citations
20 Claims
1. A computer-implemented method for providing weighted vector representations of documents, the method being executed by one or more processors and comprising:
receiving, by the one or more processors, text data, the text data comprising a plurality of documents, each document comprising a plurality of words;
processing, by the one or more processors, the text data to provide a plurality of word-vectors, each word-vector being based on a respective word of the plurality of words;
determining, by the one or more processors, a plurality of similarity scores based on the plurality of word-vectors, each similarity score representing a degree of similarity between word-vectors;
grouping, by the one or more processors, words of the plurality of words into clusters based on the plurality of similarity scores, each cluster comprising two or more words of the plurality of words; and
providing, by the one or more processors, a document representation for each document in the plurality of documents, each document representation comprising a feature vector, each feature in the feature vector comprising a cluster, each feature having a weight assigned thereto that represents a relative importance of a respective cluster to a respective document based on weights of constituent words in the cluster, each weight being determined based on a sum of frequency values of words in the respective cluster of the respective document and a sum of document frequency values of words in the respective cluster of the respective document across the plurality of documents.
Dependent Claims (2, 3, 4, 5, 6, 7)
8. A non-transitory computer-readable storage medium coupled to one or more processors and having instructions stored thereon which, when executed by the one or more processors, cause the one or more processors to perform operations for providing weighted vector representations of documents, the operations comprising:
receiving text data, the text data comprising a plurality of documents, each document comprising a plurality of words;
processing the text data to provide a plurality of word-vectors, each word-vector being based on a respective word of the plurality of words;
determining a plurality of similarity scores based on the plurality of word-vectors, each similarity score representing a degree of similarity between word-vectors;
grouping words of the plurality of words into clusters based on the plurality of similarity scores, each cluster comprising two or more words of the plurality of words; and
providing a document representation for each document in the plurality of documents, each feature in the feature vector comprising a cluster, each feature having a weight assigned thereto that represents a relative importance of a respective cluster to a respective document based on weights of constituent words in the cluster, each weight being determined based on a sum of frequency values of words in the respective cluster of the respective document and a sum of document frequency values of words in the respective cluster of the respective document across the plurality of documents.
Dependent Claims (9, 10, 11, 12, 13, 14)
15. A system, comprising:
a computing device; and
a computer-readable storage device coupled to the computing device and having instructions stored thereon which, when executed by the computing device, cause the computing device to perform operations for unsupervised aspect extraction from raw data, the operations comprising:
receiving text data, the text data comprising a plurality of documents, each document comprising a plurality of words;
processing the text data to provide a plurality of word-vectors, each word-vector being based on a respective word of the plurality of words;
determining a plurality of similarity scores based on the plurality of word-vectors, each similarity score representing a degree of similarity between word-vectors;
grouping words of the plurality of words into clusters based on the plurality of similarity scores, each cluster comprising two or more words of the plurality of words; and
providing a document representation for each document in the plurality of documents, each feature in the feature vector comprising a cluster, each feature having a weight assigned thereto that represents a relative importance of a respective cluster to a respective document based on weights of constituent words in the cluster, each weight being determined based on a sum of frequency values of words in the respective cluster of the respective document and a sum of document frequency values of words in the respective cluster of the respective document across the plurality of documents.
Dependent Claims (16, 17, 18, 19, 20)
Specification