Method of vector analysis for a document
Abstract
The invention provides a document representation method and a document analysis method including extraction of important sentences from a given document and/or determination of similarity between two documents.
The inventive method detects terms that occur in the input document, segments the input document into appropriately sized document segments, and generates document segment vectors whose elements take values determined by the occurrence frequencies of the terms in the respective document segments. The method then calculates the eigenvalues and eigenvectors of a square sum matrix of the document segment vectors, the rank of which is denoted by R, and selects from the eigenvectors a plurality (L) of eigenvectors to be used for determining importance. Finally, the method calculates, for each document segment vector, a weighted sum of its squared projections onto the selected eigenvectors and selects the document segments of significant importance based on that weighted sum.
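The first steps of the abstract (term detection, segmentation, vector generation, and the square sum matrix) can be illustrated with a minimal sketch. Several choices here are assumptions, not taken from the text: a segment is a sentence, a term is any lowercase alphabetic word, and the "square sum matrix" is read as the sum of the outer products of the segment vectors with themselves. The function names `segment_vectors` and `square_sum_matrix` are hypothetical.

```python
import re
from collections import Counter

def segment_vectors(text):
    """Return (vocabulary, vectors): one term-frequency vector per segment."""
    # Assumption: a segment is a sentence ending in '.', '!' or '?'.
    segments = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Assumption: a term is a run of alphabetic characters, lowercased.
    tokenized = [re.findall(r"[a-z]+", s.lower()) for s in segments]
    vocabulary = sorted({t for tokens in tokenized for t in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts[t] for t in vocabulary])
    return vocabulary, vectors

def square_sum_matrix(vectors):
    """Assumed definition: sum over segments d of the outer product d * d^T."""
    k = len(vectors[0])
    return [[sum(d[i] * d[j] for d in vectors) for j in range(k)]
            for i in range(k)]

vocab, vecs = segment_vectors("Cats chase mice. Mice eat cheese. Cats sleep.")
matrix = square_sum_matrix(vecs)
```

Under this reading, a diagonal entry of `matrix` accumulates the squared frequency of one term over all segments, and an off-diagonal entry accumulates the co-occurrence of two terms within the same segment.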
11 Claims
1. A method for representing an input document with vectors, comprising:
detecting terms that occur in said input document;
segmenting said input document into document segments, each segment being an appropriately sized chunk; and
generating document segment vectors, each vector including as its element values according to occurrence frequencies of said terms occurring in said document segments, wherein a set of said document segment vectors are represented by eigenvalues and eigenvectors of a square sum matrix of said document segments.
3. A method for extracting important document segments from an input document, comprising:
detecting terms that occur in said input document;
segmenting said input document into document segments, each segment being an appropriately sized chunk;
generating document segment vectors, each vector including as its element values according to occurrence frequencies of said terms occurring in said document segments;
determining eigenvalues and eigenvectors of a square sum matrix in which a rank of said document segment vector is represented by R;
selecting, from said eigenvectors, a plural of (L) eigenvectors to be used for determining importance;
calculating a weighted sum of squared projections of said document segment vectors onto the selected eigenvectors; and
selecting document segments having large importance based on said calculated weighted sum of squared projections of the document segment vectors.
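The importance calculation of claim 3 can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the square sum matrix is taken to be the sum of outer products of the segment vectors, the weights are taken to be the eigenvalues themselves (the claim does not fix them), and the eigen-decomposition uses a plain cyclic Jacobi routine chosen only to keep the example dependency-free. The names `jacobi_eigen` and `segment_importance` are hypothetical.

```python
import math

def jacobi_eigen(matrix, sweeps=50, tol=1e-18):
    """Eigen-decompose a symmetric matrix with cyclic Jacobi rotations.

    Returns (eigenvalues, v); column i of v is the eigenvector
    belonging to eigenvalues[i].
    """
    n = len(matrix)
    a = [[float(x) for x in row] for row in matrix]
    v = [[float(i == j) for j in range(n)] for i in range(n)]
    for _ in range(sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                y, x = 2.0 * a[p][q], a[q][q] - a[p][p]
                if y == 0.0:
                    continue
                if x < 0.0:  # keep the rotation angle within [-pi/4, pi/4]
                    x, y = -x, -y
                theta = 0.5 * math.atan2(y, x)
                c, s = math.cos(theta), math.sin(theta)
                for m in (a, v):  # rotate columns p and q of a, then of v
                    for k in range(n):
                        mp, mq = m[k][p], m[k][q]
                        m[k][p], m[k][q] = c * mp - s * mq, s * mp + c * mq
                for k in range(n):  # rotate rows p and q of a
                    ap, aq = a[p][k], a[q][k]
                    a[p][k], a[q][k] = c * ap - s * aq, s * ap + c * aq
    return [a[i][i] for i in range(n)], v

def segment_importance(vectors, num_eigen):
    """Weighted sum of squared projections onto the top num_eigen eigenvectors."""
    k = len(vectors[0])
    square_sum = [[sum(d[i] * d[j] for d in vectors) for j in range(k)]
                  for i in range(k)]
    values, vecs = jacobi_eigen(square_sum)
    top = sorted(range(k), key=lambda i: values[i], reverse=True)[:num_eigen]
    scores = []
    for d in vectors:
        score = 0.0
        for i in top:
            projection = sum(vecs[r][i] * d[r] for r in range(k))
            score += values[i] * projection ** 2  # assumed weight: eigenvalue
        scores.append(score)
    return scores

vectors = [[2, 1, 0], [1, 2, 0], [0, 0, 1]]
scores = segment_importance(vectors, num_eigen=1)
ranked = sorted(range(len(vectors)), key=lambda n: scores[n], reverse=True)
```

In this toy input the first two segments share the dominant direction of the matrix, so they rank ahead of the third segment, which has no component along the single selected eigenvector.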
6. A method for retrieving, from an input document, a document segment having relatedness with a query, comprising:
detecting terms that occur in said input document;
segmenting said input document into document segments, each segment being an appropriately sized chunk;
generating document segment vectors, each vector including as its element values according to occurrence frequencies of said terms occurring in said document segments;
determining eigenvalues and eigenvectors of square sum matrices of said document segment vectors to define a subspace;
detecting terms that occur in said query;
generating query vectors, each vector including as its element values according to occurrence frequency of said terms to project said query vectors onto said subspace;
projecting each of said document segment vectors onto said subspace to calculate relatedness of said query with said document segments.
9. A method for determining similarity between given two input documents, comprising:
detecting terms that occur in each of said input documents;
segmenting each of said input document into document segments, each segment being an appropriately sized chunk;
generating document segment vectors, each vector including as its element values according to occurrence frequencies of said terms occurring in said respective document segments;
calculating, for each of the two input documents, a squared inner product for all combinations of said document segment vectors contained in each input document; and
determining said similarity between the two input documents based on a sum of said squared inner products.
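The similarity of claim 9 can be sketched in a few lines. Two points are assumptions made for illustration: both documents are vectorized over the same term vocabulary, and the cross-document sum is normalized by the two within-document sums so that the result lies between 0 and 1. The claim itself only says the similarity is "based on" the sum. The names `squared_inner_product_sum` and `similarity` are hypothetical.

```python
import math

def squared_inner_product_sum(vectors_a, vectors_b):
    """Sum of squared inner products over all pairs (a, b) of segment vectors."""
    return sum(
        sum(x * y for x, y in zip(a, b)) ** 2
        for a in vectors_a
        for b in vectors_b
    )

def similarity(vectors_a, vectors_b):
    """Cross-document sum, normalized by the within-document sums (assumed)."""
    norm = math.sqrt(
        squared_inner_product_sum(vectors_a, vectors_a)
        * squared_inner_product_sum(vectors_b, vectors_b)
    )
    if norm == 0.0:
        return 0.0  # at least one document has only empty segment vectors
    return squared_inner_product_sum(vectors_a, vectors_b) / norm

doc_a = [[1, 0, 1, 0], [0, 1, 0, 0]]
doc_b = [[0, 1, 0, 0]]
doc_c = [[0, 0, 0, 2]]

same = similarity(doc_a, doc_a)
partial = similarity(doc_a, doc_b)
disjoint = similarity(doc_a, doc_c)
```

With this normalization a document compared with itself scores 1, and two documents that share no terms score 0.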
10. A method for determining similarity between given two input documents, comprising:
calculating eigenvalues and eigenvectors of square sum matrices of document segment vectors for one of said two input documents;
selecting, as the basis vectors, eigenvectors corresponding to the larger eigenvalues from said calculated eigenvectors;
calculating weighted sum of squared inner products between said basis vector and document segment vectors of the other document of said two input documents and summing the squared inner products with weights; and
determining the similarity between said two input documents based on said weighted sum of squared inner products.
11. A method for determining similarity between given two input documents, comprising:
calculating eigenvalues and eigenvectors of square sum matrices of document segment vectors for each of said two input documents;
selecting, as basis vectors, eigenvectors corresponding to the larger eigenvalues from said calculated eigenvectors for each of said two input documents;
calculating weighted sum of squared inner products for combination of said selected basis vectors and summing the squared inner product with weights; and
determining the similarity between said two input documents based on said weighted sum of squared inner products.
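One way to see how the eigenvector-based similarity relates to the segment-level similarity of claim 9 is the identity sketched below. The notation is assumed for illustration and does not appear in the claims: a_n and b_m are the document segment vectors of the two documents, S_A and S_B are their square sum matrices (read here as sums of outer products), and (λ_i, φ_i) and (μ_j, ψ_j) are their eigenvalue and unit-eigenvector pairs. The weights λ_i μ_j are likewise an assumption, since the claims do not fix them.

```latex
\begin{aligned}
S_A &= \sum_{n} a_n a_n^{\mathsf T} = \sum_{i} \lambda_i \varphi_i \varphi_i^{\mathsf T},
\qquad
S_B = \sum_{m} b_m b_m^{\mathsf T} = \sum_{j} \mu_j \psi_j \psi_j^{\mathsf T}, \\
\sum_{n}\sum_{m} \left(a_n^{\mathsf T} b_m\right)^2
  &= \sum_{n} a_n^{\mathsf T} S_B\, a_n
   = \operatorname{tr}\left(S_A S_B\right)
   = \sum_{i}\sum_{j} \lambda_i \mu_j \left(\varphi_i^{\mathsf T} \psi_j\right)^2 .
\end{aligned}
```

Under these assumptions, restricting i and j to the larger eigenvalues drops only nonnegative terms, so the truncated weighted sum is a lower bound on the full sum over all pairs of segments.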
Specification