Method of vector analysis for a document
Abstract
The invention provides a document representation method and a document analysis method including extraction of important sentences from a given document and/or determination of similarity between two documents.
The method detects terms that occur in the input document, segments the input document into document segments, each an appropriately sized chunk, and generates document segment vectors whose elements are values based on the occurrence frequencies of the terms in the corresponding segment. The method then calculates the eigenvalues and eigenvectors of a square sum matrix formed from the document segment vectors, the rank of which is denoted R, and selects from the eigenvectors a plurality (L) of eigenvectors to be used for determining importance. Finally, the method calculates a weighted sum of the squared projections of each document segment vector onto the selected eigenvectors and selects the document segments of significant importance based on that weighted sum.
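The first steps of the abstract (term detection, segmentation, segment vectors, square sum matrix) can be sketched as follows. This is only an illustrative sketch under stated assumptions: segments are taken to be sentences, terms are lowercase alphanumeric tokens, and the square sum matrix is assumed to be the sum of outer products of the segment vectors, which is consistent with its description as a term co-occurrence measure. The function names `split_segments`, `tokenize`, `build_segment_vectors`, and `square_sum_matrix` are hypothetical and do not come from the patent.

```python
import re
from collections import Counter


def split_segments(text):
    """Split text into segments (assumption: one segment per sentence)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def tokenize(segment):
    """Detect terms (assumption: lowercase alphanumeric tokens)."""
    return re.findall(r"[a-z0-9]+", segment.lower())


def build_segment_vectors(text):
    """Return (segments, terms, vectors); vectors[n][i] is the count of
    term i in segment n, i.e. the element d_ni of the abstract."""
    segments = split_segments(text)
    counts = [Counter(tokenize(s)) for s in segments]
    terms = sorted(set().union(*counts)) if counts else []
    vectors = [[c[t] for t in terms] for c in counts]
    return segments, terms, vectors


def square_sum_matrix(vectors):
    """k x k matrix with A[a][b] = sum over segments n of d_na * d_nb
    (assumed form of the square sum matrix)."""
    k = len(vectors[0]) if vectors else 0
    A = [[0] * k for _ in range(k)]
    for d in vectors:
        for a in range(k):
            if d[a] == 0:
                continue  # a zero count contributes nothing to row a
            for b in range(k):
                A[a][b] += d[a] * d[b]
    return A
```

Calling `build_segment_vectors` on a short text and passing the resulting vectors to `square_sum_matrix` yields a symmetric matrix whose diagonal holds each term's summed squared counts and whose off-diagonal entries grow when two terms appear in the same segments.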
3 Claims
1. A computer-implemented method for extracting important document segments from an input document, comprising:
- detecting k terms that occur in said input document;
- segmenting said input document into N document segments, each segment being a predetermined part of said document;
- generating document segment vectors, each vector including as its element values occurrence of frequencies of said terms occurring in said document segments, wherein a n-th document vector $d_n$ (n=1, . . . , N) is represented by $(d_{n1}, d_{n2}, \ldots, d_{nk})$ and $d_{ni}$ represents the occurrence frequency of an i-th term among a total k terms in a n-th document segment;
- determining eigenvalues and eigenvectors of a square sum matrix A, where said square sum matrix A is a k×k matrix, k>1, and each component $A_{ab}$ of said square sum matrix A indicates a degree of co-occurrence of a-th and b-th terms (a,b=1, . . . , k) in said input document and is calculated by
  $A_{ab} = \sum_{n=1}^{N} d_{na} d_{nb}$
  and a rank of said square sum matrix is represented by R;
- selecting, from said eigenvectors, a plural of (L) eigenvectors to be used for determining importance;
- calculating a weighted sum of squared projections of said document segment vectors onto the selected eigenvectors; and
- selecting documents having significant importance based on said calculated weighted sum of square projections of the document segment vectors;
- wherein the vector $d_n$ after the projection is represented by $z_n = (z_{n1}, z_{n2}, \ldots, z_{nL})$, a projection value of $d_n$ to a m-th eigenvector is given by $z_{nm} = \phi_m^t d_n$, where $\phi_m$ represents the m-th eigenvector and t represents transpose;
- a sum of squared projections onto a L dimensional subspace being given by
  $\sum_{m=1}^{L} z_{nm}^2$ or $\sum_{m=1}^{L} \lambda_m z_{nm}^2$,
  where $\lambda_m$ represents the eigenvalue of the m-th eigenvector.
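The eigen-decomposition and scoring steps of claim 1 can be illustrated with a small standard-library sketch. It is not the patented implementation: the eigen-solver is a textbook cyclic Jacobi rotation method chosen here only because it needs no external library, and the names `jacobi_eigh` and `importance_scores`, the choice of taking the L largest eigenvalues, and the `weighted` flag are all assumptions made for illustration.

```python
import math


def jacobi_eigh(A, sweeps=50, tol=1e-18):
    """Eigen-decomposition of a symmetric matrix by cyclic Jacobi rotations.

    Returns (eigenvalues, eigenvectors) sorted by descending eigenvalue;
    eigenvectors[m] is the unit eigenvector for eigenvalues[m].
    """
    k = len(A)
    S = [[float(x) for x in row] for row in A]
    V = [[1.0 if i == j else 0.0 for j in range(k)] for i in range(k)]
    for _ in range(sweeps):
        off = sum(S[p][q] ** 2 for p in range(k) for q in range(p + 1, k))
        if off < tol:
            break  # off-diagonal mass is negligible: S is diagonal
        for p in range(k - 1):
            for q in range(p + 1, k):
                if S[p][q] == 0.0:
                    continue
                diff = S[q][q] - S[p][p]
                # Angle that zeroes S[p][q]; kept within [-pi/4, pi/4].
                theta = (math.pi / 4 if diff == 0.0
                         else 0.5 * math.atan(2.0 * S[p][q] / diff))
                c, s = math.cos(theta), math.sin(theta)
                for i in range(k):  # S <- S J (columns p and q)
                    sip, siq = S[i][p], S[i][q]
                    S[i][p] = c * sip - s * siq
                    S[i][q] = s * sip + c * siq
                for j in range(k):  # S <- J^T S (rows p and q)
                    spj, sqj = S[p][j], S[q][j]
                    S[p][j] = c * spj - s * sqj
                    S[q][j] = s * spj + c * sqj
                for i in range(k):  # V <- V J accumulates eigenvectors
                    vip, viq = V[i][p], V[i][q]
                    V[i][p] = c * vip - s * viq
                    V[i][q] = s * vip + c * viq
    order = sorted(range(k), key=lambda m: S[m][m], reverse=True)
    eigenvalues = [S[m][m] for m in order]
    eigenvectors = [[V[i][m] for i in range(k)] for m in order]
    return eigenvalues, eigenvectors


def importance_scores(vectors, eigenvalues, eigenvectors, L, weighted=True):
    """Score each segment vector d_n by sum_m w_m * z_nm**2 over the first
    L eigenvectors, where z_nm = phi_m^t d_n and w_m is lambda_m if
    weighted, else 1."""
    scores = []
    for d in vectors:
        total = 0.0
        for m in range(L):
            z = sum(phi * x for phi, x in zip(eigenvectors[m], d))
            total += (eigenvalues[m] if weighted else 1.0) * z * z
        scores.append(total)
    return scores
```

Given a square sum matrix `A` and the segment vectors it was built from, `jacobi_eigh(A)` followed by `importance_scores(vectors, vals, vecs, L)` produces one score per segment; the segments with the largest scores would be the candidates for extraction. A production system would more likely rely on a numerical library's symmetric eigen-solver.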
3. A system for extracting sentences from a document, said system comprising a storage medium and a processor for executing the steps comprising:
- detecting k terms that occur in said input document;
- segmenting said input document into N document segments, each segment being a predetermined part of said document;
- generating document segment vectors, each vector including as its element values occurrence of frequencies of said terms occurring in said document segments, wherein a n-th document vector $d_n$ (n=1, . . . , N) is represented by $(d_{n1}, d_{n2}, \ldots, d_{nk})$ and $d_{ni}$ represents the occurrence frequency of an i-th term among a total k terms in a n-th document segment;
- determining eigenvalues and eigenvectors of a square sum matrix A, where said square sum matrix A is a k×k matrix, k>1, and each component $A_{ab}$ of said square sum matrix A indicates a degree of co-occurrence of a-th and b-th terms (a,b=1, . . . , k) in said input document and is calculated by
  $A_{ab} = \sum_{n=1}^{N} d_{na} d_{nb}$
  and a rank of said square sum matrix is represented by R;
- selecting, from said eigenvectors, a plural of (L) eigenvectors to be used for determining importance;
- calculating a weighted sum of squared projections of said document segment vectors onto the selected eigenvectors; and
- selecting documents having significant importance based on said calculated weighted sum of square projections of the document segment vectors;
- wherein the vector $d_n$ after the projection is represented by $z_n = (z_{n1}, z_{n2}, \ldots, z_{nL})$, a projection value of $d_n$ to a m-th eigenvector is given by $z_{nm} = \phi_m^t d_n$, where $\phi_m$ represents the m-th eigenvector and t represents transpose;
- a sum of squared projections onto a L dimensional subspace being given by
  $\sum_{m=1}^{L} z_{nm}^2$ or $\sum_{m=1}^{L} \lambda_m z_{nm}^2$,
  where $\lambda_m$ represents the eigenvalue of the m-th eigenvector.
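A short worked identity may help explain why the claims offer an eigenvalue-weighted variant of the sum. It assumes that the eigenvectors $\phi_m$ are normalized to unit length and that the square sum matrix is the sum of outer products of the segment vectors, $A = \sum_{n=1}^{N} d_n d_n^t$, which is consistent with $A_{ab}$ being a co-occurrence degree; neither assumption is stated verbatim in the claim text above.

$$
\sum_{n=1}^{N} z_{nm}^2
= \sum_{n=1}^{N} \phi_m^t d_n d_n^t \phi_m
= \phi_m^t A \phi_m
= \lambda_m
$$

Under these assumptions, $\lambda_m$ equals the total squared projection of all N segment vectors onto $\phi_m$. Weighting $z_{nm}^2$ by $\lambda_m$ therefore gives more influence to directions along which the document as a whole is concentrated, whereas the unweighted sum treats all L selected directions equally.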
Specification