Document representation for machine-learning document classification

US 10,482,118 B2
Filed: 06/14/2017
Issued: 11/19/2019
Est. Priority Date: 06/14/2017
Status: Active Grant

Abstract

Methods, systems, and computer-readable storage media for providing weighted vector representations of documents, with actions including receiving text data, the text data including a plurality of documents, each document including a plurality of words, processing the text data to provide a plurality of word-vectors, each word-vector being based on a respective word of the plurality of words, determining a plurality of similarity scores based on the plurality of word-vectors, each similarity score representing a degree of similarity between word-vectors, grouping words of the plurality of words into clusters based on the plurality of similarity scores, each cluster including two or more words of the plurality of words, and providing a document representation for each document in the plurality of documents, each document representation including a feature vector, each feature corresponding to a cluster.
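
The similarity-scoring and grouping actions named in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the toy vectors, the `threshold` value of 0.8, and the choice to form clusters as connected components of a "similar enough" graph. The abstract does not prescribe a particular clustering algorithm or threshold.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def group_words(word_vectors, threshold=0.8):
    """Group words whose vectors are linked by a similarity score >= threshold.

    Assumption: clusters are connected components found with union-find.
    Only groups of two or more words are returned, matching the
    "two or more words" wording of the abstract.
    """
    parent = {word: word for word in word_vectors}

    def find(word):
        # Follow parent links to the representative, compressing the path.
        while parent[word] != word:
            parent[word] = parent[parent[word]]
            word = parent[word]
        return word

    for a, b in combinations(word_vectors, 2):
        if cosine_similarity(word_vectors[a], word_vectors[b]) >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for word in word_vectors:
        groups.setdefault(find(word), []).append(word)
    return [sorted(group) for group in groups.values() if len(group) >= 2]


# Hypothetical hand-made word-vectors; a real system would learn these from text.
TOY_VECTORS = {
    "car": [0.9, 0.1, 0.0],
    "automobile": [0.85, 0.15, 0.05],
    "banana": [0.0, 0.9, 0.3],
    "fruit": [0.05, 0.8, 0.4],
    "tax": [0.1, 0.0, 0.95],
}

clusters = sorted(group_words(TOY_VECTORS))
print(clusters)
```

With these toy vectors, "tax" is not similar enough to any other word and therefore forms no cluster. Comparing every pair of words is quadratic in vocabulary size, so a larger vocabulary would likely need an approximate nearest-neighbor search instead.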

20 Claims

1. A computer-implemented method for providing weighted vector representations of documents, the method being executed by one or more processors and comprising:
- receiving, by the one or more processors, text data, the text data comprising a plurality of documents, each document comprising a plurality of words;
  
  processing, by the one or more processors, the text data to provide a plurality of word-vectors, each word-vector being based on a respective word of the plurality of words;
  
  determining, by the one or more processors, a plurality of similarity scores based on the plurality of word-vectors, each similarity score representing a degree of similarity between word-vectors;
  
  grouping, by the one or more processors, words of the plurality of words into clusters based on the plurality of similarity scores, each cluster comprising two or more words of the plurality of words; and
  
  providing, by the one or more processors, a document representation for each document in the plurality of documents, each document representation comprising a feature vector, each feature in the feature vector comprising a cluster, each feature having a weight assigned thereto that represents a relative importance of a respective cluster to a respective document based on weights of constituent words in the cluster, each weight being determined based on a sum of frequency values of words in the respective cluster of the respective document and a sum of document frequency values of words in the respective cluster of the respective document across the plurality of documents.
- Dependent claims (2, 3, 4, 5, 6, 7):
  - 2. The method of claim 1, wherein each feature of the document representation comprises a feature value based on the weight determined for a respective cluster.
  - 3. The method of claim 2, wherein the weight comprises a term frequency and inverse document frequency (TF-IDF) weight.
  - 4. The method of claim 1, wherein words are included in a cluster in response to determining that their respective word-vectors are sufficiently similar.
  - 5. The method of claim 1, wherein each similarity score of the plurality of similarity scores is determined as a cosine similarity score between multiple word-vectors.
  - 6. The method of claim 1, wherein processing the text data to provide a plurality of word-vectors comprises processing at least a portion of the text data using Word2vec.
  - 7. The method of claim 1, further comprising providing the document representations to a document classification system for one or more of natural language processing (NLP) and information retrieval (IR) based on the document representations.
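
The cluster weighting recited in claims 1 to 3 can be sketched as follows. The claims say only that each weight is based on a sum of frequency values and a sum of document frequency values for the cluster's words in the document. The specific combination used below, a length-normalized term-frequency sum multiplied by `log(1 + N / df_sum)`, is an assumption chosen for illustration and is not taken from the patent. The function name `cluster_weights` and the toy data are likewise hypothetical.

```python
import math
from collections import Counter


def cluster_weights(documents, clusters):
    """Return one feature vector per document, with one feature per cluster.

    documents: list of token lists.
    clusters:  list of word lists (each cluster is one feature).
    """
    n_docs = len(documents)
    # Document frequency: number of documents in which each word appears.
    doc_freq = Counter(word for doc in documents for word in set(doc))

    vectors = []
    for doc in documents:
        counts = Counter(doc)
        features = []
        for cluster in clusters:
            # Only cluster words that actually occur in this document contribute.
            present = [word for word in cluster if counts[word] > 0]
            tf_sum = sum(counts[word] for word in present)
            df_sum = sum(doc_freq[word] for word in present)
            if tf_sum == 0:
                features.append(0.0)
                continue
            # ASSUMED formula (not from the source): TF-IDF-style product
            # of the two sums named in the claim.
            tf = tf_sum / len(doc)
            idf = math.log(1.0 + n_docs / df_sum)
            features.append(tf * idf)
        vectors.append(features)
    return vectors


# Hypothetical toy corpus and clusters.
DOCS = [
    ["car", "automobile", "tax"],
    ["banana", "fruit", "banana"],
    ["car", "fruit"],
]
CLUSTERS = [["automobile", "car"], ["banana", "fruit"]]

vectors = cluster_weights(DOCS, CLUSTERS)
for vec in vectors:
    print([round(x, 4) for x in vec])
```

Because the feature is the cluster rather than the individual word, "car" and "automobile" both contribute to the same dimension, so the vector length equals the number of clusters instead of the vocabulary size.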

8. A non-transitory computer-readable storage medium coupled to one or more processors and having instructions stored thereon which, when executed by the one or more processors, cause the one or more processors to perform operations for providing weighted vector representations of documents, the operations comprising:
- receiving text data, the text data comprising a plurality of documents, each document comprising a plurality of words;
  
  processing the text data to provide a plurality of word-vectors, each word-vector being based on a respective word of the plurality of words;
  
  determining a plurality of similarity scores based on the plurality of word-vectors, each similarity score representing a degree of similarity between word-vectors;
  
  grouping words of the plurality of words into clusters based on the plurality of similarity scores, each cluster comprising two or more words of the plurality of words; and
  
  providing a document representation for each document in the plurality of documents, each feature in the feature vector comprising a cluster, each feature having a weight assigned thereto that represents a relative importance of a respective cluster to a respective document based on weights of constituent words in the cluster, each weight being determined based on a sum of frequency values of words in the respective cluster of the respective document and a sum of document frequency values of words in the respective cluster of the respective document across the plurality of documents.
- Dependent claims (9, 10, 11, 12, 13, 14):
  - 9. The computer-readable storage medium of claim 8, wherein each feature of the document representation comprises a feature value based on the weight determined for a respective cluster.
  - 10. The computer-readable storage medium of claim 9, wherein the weight comprises a term frequency and inverse document frequency (TF-IDF) weight.
  - 11. The computer-readable storage medium of claim 8, wherein words are included in a cluster in response to determining that their respective word-vectors are sufficiently similar.
  - 12. The computer-readable storage medium of claim 8, wherein each similarity score of the plurality of similarity scores is determined as a cosine similarity score between multiple word-vectors.
  - 13. The computer-readable storage medium of claim 8, wherein processing the text data to provide a plurality of word-vectors comprises processing at least a portion of the text data using Word2vec.
  - 14. The computer-readable storage medium of claim 8, wherein operations further comprise providing the document representations to a document classification system for one or more of natural language processing (NLP) and information retrieval (IR) based on the document representations.
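
Claims 7 and 14 describe handing the document representations to a document classification system without naming a classifier. As a minimal sketch of that hand-off, the example below trains a nearest-centroid classifier on cluster-weight vectors. The classifier choice, the labels, and the numeric feature values are all hypothetical and serve only to show the data flow.

```python
from collections import defaultdict


def train_centroids(vectors, labels):
    """Average the feature vectors of each label into one centroid per label."""
    sums = {}
    counts = defaultdict(int)
    for vec, label in zip(vectors, labels):
        if label not in sums:
            sums[label] = [0.0] * len(vec)
        sums[label] = [s + x for s, x in zip(sums[label], vec)]
        counts[label] += 1
    return {
        label: [s / counts[label] for s in total]
        for label, total in sums.items()
    }


def classify(vector, centroids):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    def sq_dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))

    return min(centroids, key=lambda label: sq_dist(vector, centroids[label]))


# Hypothetical cluster-weight vectors (one feature per word cluster) and labels.
TRAIN_VECTORS = [[0.46, 0.0], [0.40, 0.05], [0.0, 0.69], [0.1, 0.5]]
TRAIN_LABELS = ["vehicles", "vehicles", "food", "food"]

centroids = train_centroids(TRAIN_VECTORS, TRAIN_LABELS)
predicted = classify([0.35, 0.02], centroids)
print(predicted)
```

Any classifier that accepts fixed-length numeric vectors could take the place of the nearest-centroid model, since the document representation is a plain feature vector.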

15. A system, comprising:
- a computing device; and
  
  a computer-readable storage device coupled to the computing device and having instructions stored thereon which, when executed by the computing device, cause the computing device to perform operations for unsupervised aspect extraction from raw data, the operations comprising:
  
  receiving text data, the text data comprising a plurality of documents, each document comprising a plurality of words;
  
  processing the text data to provide a plurality of word-vectors, each word-vector being based on a respective word of the plurality of words;
  
  determining a plurality of similarity scores based on the plurality of word-vectors, each similarity score representing a degree of similarity between word-vectors;
  
  grouping words of the plurality of words into clusters based on the plurality of similarity scores, each cluster comprising two or more words of the plurality of words; and
  
  providing a document representation for each document in the plurality of documents, each feature in the feature vector comprising a cluster, each feature having a weight assigned thereto that represents a relative importance of a respective cluster to a respective document based on weights of constituent words in the cluster, each weight being determined based on a sum of frequency values of words in the respective cluster of the respective document and a sum of document frequency values of words in the respective cluster of the respective document across the plurality of documents.
- Dependent claims (16, 17, 18, 19, 20):
  - 16. The system of claim 15, wherein each feature of the document representation comprises a feature value based on the weight determined for a respective cluster.
  - 17. The system of claim 16, wherein the weight comprises a term frequency and inverse document frequency (TF-IDF) weight.
  - 18. The system of claim 15, wherein words are included in a cluster in response to determining that their respective word-vectors are sufficiently similar.
  - 19. The system of claim 15, wherein each similarity score of the plurality of similarity scores is determined as a cosine similarity score between multiple word-vectors.
  - 20. The system of claim 15, wherein processing the text data to provide a plurality of word-vectors comprises processing at least a portion of the text data using Word2vec.

Current Assignee: SAP SE
Original Assignee: SAP SE
Inventors: Zheng, Xin
Primary Examiner(s): Pham, Khanh B
Application Number: US15/623,071
Publication Number: US 20180365248A1
Time in Patent Office: 888 Days
CPC Class Codes

G06F 16/355   Class or cluster creation o...

G06F 16/358   Browsing; Visualisation the...

G06F 40/216   using statistical methods

G06F 40/284   Lexical analysis, e.g. toke...

G06F 40/30   Semantic analysis

G06F 40/40   Processing or translation o...

G06N 20/00   Machine learning
