Document representation for machine-learning document classification

  • US 10,482,118 B2
  • Filed: 06/14/2017
  • Issued: 11/19/2019
  • Est. Priority Date: 06/14/2017
  • Status: Active Grant
First Claim

1. A computer-implemented method for providing weighted vector representations of documents, the method being executed by one or more processors and comprising:

    receiving, by the one or more processors, text data, the text data comprising a plurality of documents, each document comprising a plurality of words;

    processing, by the one or more processors, the text data to provide a plurality of word-vectors, each word-vector being based on a respective word of the plurality of words;

    determining, by the one or more processors, a plurality of similarity scores based on the plurality of word-vectors, each similarity score representing a degree of similarity between word-vectors;

    grouping, by the one or more processors, words of the plurality of words into clusters based on the plurality of similarity scores, each cluster comprising two or more words of the plurality of words; and

    providing, by the one or more processors, a document representation for each document in the plurality of documents, each document representation comprising a feature vector, each feature in the feature vector comprising a cluster, each feature having a weight assigned thereto that represents a relative importance of a respective cluster to a respective document based on weights of constituent words in the cluster, each weight being determined based on a sum of frequency values of words in the respective cluster of the respective document and a sum of document frequency values of words in the respective cluster of the respective document across the plurality of documents.
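
The steps recited in claim 1 can be illustrated with a minimal sketch. It is not the patented implementation, and several details are assumptions the claim does not specify: the toy two-dimensional word-vectors, cosine similarity as the similarity score, a threshold of 0.9, single-link grouping through a union-find structure, and the TF-IDF-style combination `tf_sum * log(1 + N / df_sum)` of the two sums. The function names `cosine`, `cluster_words`, and `cluster_weights` are hypothetical.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (assumed similarity score)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def cluster_words(word_vectors, threshold=0.9):
    """Group words whose vectors are similar; the threshold value is an assumption."""
    words = sorted(word_vectors)
    parent = {w: w for w in words}

    def find(w):
        # Follow parent links to the cluster root, compressing the path as we go.
        while parent[w] != w:
            parent[w] = parent[parent[w]]
            w = parent[w]
        return w

    for i, a in enumerate(words):
        for b in words[i + 1:]:
            if cosine(word_vectors[a], word_vectors[b]) >= threshold:
                parent[find(b)] = find(a)

    groups = {}
    for w in words:
        groups.setdefault(find(w), []).append(w)
    # The claim recites clusters of two or more words, so singletons are dropped.
    return [sorted(g) for g in groups.values() if len(g) >= 2]


def cluster_weights(docs, clusters):
    """Return one feature vector per document, with one weight per cluster."""
    n_docs = len(docs)
    df = Counter()  # document frequency: number of documents containing each word
    for doc in docs:
        df.update(set(doc))

    features = []
    for doc in docs:
        tf = Counter(doc)  # frequency of each word within this document
        row = []
        for cluster in clusters:
            present = [w for w in cluster if tf[w] > 0]
            tf_sum = sum(tf[w] for w in present)
            df_sum = sum(df[w] for w in present)
            # Assumed combination of the two sums; the claim only says "based on".
            row.append(tf_sum * math.log(1 + n_docs / df_sum) if df_sum else 0.0)
        features.append(row)
    return features


# Toy inputs, invented for illustration only.
word_vectors = {
    "car": [1.0, 0.0],
    "auto": [0.99, 0.1],
    "apple": [0.0, 1.0],
    "pear": [0.1, 0.99],
    "the": [0.7, 0.7],
}
docs = [["car", "auto", "car"], ["apple", "pear"], ["car", "apple"]]

clusters = cluster_words(word_vectors)
features = cluster_weights(docs, clusters)
```

In this sketch each document is reduced to a vector with one entry per cluster rather than one per word, so near-synonyms such as "car" and "auto" contribute to the same feature. Only cluster words that occur in the document enter the two sums, which is one reading of "words in the respective cluster of the respective document".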
