Document characterization using a tensor space model
First Claim
1. A computer-readable medium having computer-executable instructions for controlling a processor of a computer system to categorize a document by a method comprising:
- for each of a plurality of categories, providing documents within that category, each document having words with characters;
- for each document,
  - generating a high-order tensor having an order of at least three, each order represented by a coordinate with characters as dimensions of the coordinate, each element of the high-order tensor representing a sequence of at least three characters and being set to a weight based on number of occurrences of that sequence of at least three characters within the document, the weight being based on term frequency by inverse document frequency; and
  - generating a core tensor by reducing dimensionality of the generated high-order tensor using high-order singular value decomposition;
- training a support vector machine (“SVM”) classifier using the generated core tensors for the documents and the categories of the documents; and
- categorizing a document by generating a high-order tensor for the document, generating a core tensor for the generated high-order tensor for the document, and applying the SVM classifier to the generated core tensor for the document to determine a category for the document.
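The tensor-generation step of this claim can be illustrated with a minimal sketch for the order-three case. Everything specific here is an assumption made for illustration and is not taken from the patent: the 27-symbol alphabet (26 letters plus `_` for any other character), the sparse dictionary representation, the unsmoothed `log(N / df)` form of inverse document frequency, and the function names.

```python
import math
from collections import Counter

# Assumption: 26 lowercase letters plus "_" standing in for every other character.
ALPHABET = "abcdefghijklmnopqrstuvwxyz_"
INDEX = {ch: i for i, ch in enumerate(ALPHABET)}

def normalize(text):
    """Lowercase the text and map every character outside the alphabet to '_'."""
    return "".join(ch if ch in INDEX else "_" for ch in text.lower())

def trigram_counts(text):
    """Count each sequence of three consecutive characters in the text."""
    s = normalize(text)
    return Counter(s[i:i + 3] for i in range(len(s) - 2))

def tfidf_tensors(documents):
    """Return one sparse 3rd-order tensor per document as {(i, j, k): weight}.

    Each index triple addresses one character trigram; the weight is the
    trigram's count in the document times log(N / document frequency).
    """
    counts = [trigram_counts(doc) for doc in documents]
    # Number of documents in which each trigram occurs at least once.
    doc_freq = Counter(gram for c in counts for gram in c)
    n_docs = len(documents)
    tensors = []
    for c in counts:
        tensor = {}
        for gram, tf in c.items():
            idf = math.log(n_docs / doc_freq[gram])
            tensor[tuple(INDEX[ch] for ch in gram)] = tf * idf
        tensors.append(tensor)
    return tensors
```

With this alphabet each tensor has shape 27 × 27 × 27, and a trigram that occurs in every document receives weight zero under the unsmoothed formula.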
Abstract
Computer-readable media having computer-executable instructions, and apparatuses, categorize a document or a corpus of documents. A Tensor Space Model (TSM), which models text as a higher-order tensor, represents the document or corpus. Supported by techniques of multilinear algebra, TSM provides a framework for analyzing multifactor structures. TSM is further supported by tensor operations and tools such as High-Order Singular Value Decomposition (HOSVD), which reduces the dimensions of the higher-order tensor. The dimensionally reduced tensor is compared with tensors that represent possible categories, and a category is selected for the document or corpus of documents. Experimental results on the 20 Newsgroups dataset suggest that TSM has advantages over a Vector Space Model (VSM) for text classification.
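The dimension reduction mentioned in the abstract can be sketched as a truncated HOSVD using NumPy. This is an illustration under stated assumptions, not the patent's implementation: the function names `unfold` and `hosvd_core` are hypothetical, the target ranks are chosen by the caller, and the factor matrices are computed from a single dense tensor (this excerpt does not say whether the projection is derived per document or from the whole corpus).

```python
import numpy as np

def unfold(tensor, mode):
    """Matricize a tensor: rows index the chosen mode, columns all other modes."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def hosvd_core(tensor, ranks):
    """Truncated HOSVD: return the core tensor and one factor matrix per mode."""
    factors = []
    for mode, rank in enumerate(ranks):
        # Left singular vectors of the mode-n unfolding span that mode's subspace.
        u, _, _ = np.linalg.svd(unfold(tensor, mode), full_matrices=False)
        factors.append(u[:, :rank])
    core = tensor
    for mode, u in enumerate(factors):
        # Mode-n product with u transposed shrinks axis `mode` to `rank` entries.
        core = np.moveaxis(np.tensordot(u.T, core, axes=(1, mode)), 0, mode)
    return core, factors
```

For a 27 × 27 × 27 document tensor, `hosvd_core(tensor, (10, 10, 10))` would yield a 10 × 10 × 10 core that can be flattened into a feature vector for a classifier. The rank of 10 is an arbitrary example value.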
15 Claims
1. A computer-readable medium having computer-executable instructions for controlling a processor of a computer system to categorize a document by a method comprising:
- for each of a plurality of categories, providing documents within that category, each document having words with characters;
- for each document,
  - generating a high-order tensor having an order of at least three, each order represented by a coordinate with characters as dimensions of the coordinate, each element of the high-order tensor representing a sequence of at least three characters and being set to a weight based on number of occurrences of that sequence of at least three characters within the document, the weight being based on term frequency by inverse document frequency; and
  - generating a core tensor by reducing dimensionality of the generated high-order tensor using high-order singular value decomposition;
- training a support vector machine (“SVM”) classifier using the generated core tensors for the documents and the categories of the documents; and
- categorizing a document by generating a high-order tensor for the document, generating a core tensor for the generated high-order tensor for the document, and applying the SVM classifier to the generated core tensor for the document to determine a category for the document.
Dependent claims: 2, 3
4. A method performed by a computer system to categorize a document, the method performed by a processor of the computer system comprising:
- for each of a plurality of categories, storing documents within that category, each document having words with characters;
- for each document,
  - generating a high-order tensor having an order of at least three, each order represented by a coordinate with characters as dimensions of the coordinate, each element of the high-order tensor representing a sequence of at least three characters and being set to a weight based on number of occurrences of that sequence of at least three characters within the document; and
  - generating a core tensor by reducing dimensionality of the generated high-order tensor;
- training a classifier using the generated core tensors for the documents and the categories of the documents; and
- categorizing a document by generating a high-order tensor for the document, generating a core tensor for the generated high-order tensor for the document, and applying the classifier to the generated core tensor for the document to determine a category for the document.
Dependent claims: 5, 6, 7, 8, 9
10. A computer system that categorizes a document, comprising:
- a processor; and
- a memory storing:
  - a corpus of documents, each document having words and a category;
  - a tensor space model module that generates a high-order tensor having an order of at least three, each order represented by a coordinate with characters as dimensions of the coordinate, each element of the high-order tensor representing a sequence of at least three characters and being set to a weight based on number of occurrences of that sequence of at least three characters within the document;
  - an analyzing module that generates a core tensor by reducing dimensionality of the generated high-order tensor;
  - a training module that trains a classifier using the generated core tensors for the documents and the categories of the documents; and
  - a categorization module that categorizes a document by generating a high-order tensor for the document, generates a core tensor for the generated high-order tensor for the document, and applies the classifier to the generated core tensor for the document to determine a category for the document.
Dependent claims: 11, 12, 13, 14, 15
Specification