Method of vector analysis for a document
Abstract
The invention provides a document representation method and a document analysis method including extraction of important sentences from a given document and/or determination of similarity between two documents.
The inventive method detects terms that occur in the input document, segments the input document into document segments, each segment being an appropriately sized chunk, and generates document segment vectors, each vector having as its elements values according to the occurrence frequencies of the terms occurring in the respective document segment. The method further calculates eigenvalues and eigenvectors of a square sum matrix of the document segment vectors, the rank of which is denoted R, and selects from the eigenvectors a plurality (L) of eigenvectors to be used for determining importance. The method then calculates, for each document segment vector, a weighted sum of its squared projections onto the selected eigenvectors, and selects the document segments having significant importance based on that weighted sum.
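The importance computation summarized in the abstract can be written out as a worked equation. This is only a sketch under stated assumptions: the symbols $A$, $s_n$, $\phi_l$, $\lambda_l$, $w_l$ and $\mathrm{imp}(n)$ are introduced here for illustration and are not named in the abstract, and the abstract does not say how the weights $w_l$ are chosen.

$$A=\sum_{n=1}^{N}s_n s_n^{T},\qquad A\phi_l=\lambda_l\phi_l,\qquad \lambda_1\ge\lambda_2\ge\dots\ge\lambda_R>0,$$
$$\mathrm{imp}(n)=\sum_{l=1}^{L}w_l\,(\phi_l^{T}s_n)^2,\qquad L\le R.$$

As a sanity check on this reading, taking $w_l=1$ and $L=R$ gives $\mathrm{imp}(n)=\lVert s_n\rVert^2$, because every segment vector $s_n$ lies in the span of the eigenvectors of $A$ with nonzero eigenvalues. Choosing $L<R$ keeps only the projections onto the dominant directions.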
12 Claims
1. A method for determining similarity between two input documents, comprising:
detecting terms that occur in each of said input documents;
segmenting each of said input documents into document segments, each segment being a predetermined part of said input document;
generating document segment vectors, each vector including as its element values according to occurrence frequencies of said terms occurring in said respective document segments, where an n-th document segment vector for a first one of said input documents $S_n$ $(n=1,\dots,N)$ is represented by $(s_{n1}, s_{n2}, s_{n3}, \dots, s_{nk})$ and an m-th document segment vector for a second one of said input documents $T_m$ $(m=1,\dots,M)$ is represented by $(t_{m1}, t_{m2}, t_{m3}, \dots, t_{mk})$, where $s_{ni}$ represents the occurrence frequency of an i-th term in an n-th document segment, and $t_{mi}$ represents the occurrence frequency of an i-th term in an m-th document segment;
calculating by a processing device, for each of the two input documents, a squared inner product for all combinations of said document segment vectors contained in each input document, where the squared inner product is represented by
$$S_n^{t}T_m=\sum_{k=1}^{K}s_{nk}t_{mk};$$
and determining said similarity between the two input documents based on a sum of said squared inner products.
Dependent claims: 2, 3.
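The steps recited in claim 1 can be illustrated with a short Python sketch. It is a minimal example under stated assumptions, not the patented implementation: sentences stand in for document segments, the tokenizer is a simple regular expression, and the normalization in `similarity` (dividing by the geometric mean of each document's own sum) is an assumption, since the claim only says the similarity is based on the sum of squared inner products. All function names and the toy documents are hypothetical.

```python
import math
import re
from collections import Counter


def segment_vectors(segments, vocabulary):
    """Return one term-frequency vector per segment, ordered by `vocabulary`."""
    vectors = []
    for text in segments:
        counts = Counter(re.findall(r"[a-z]+", text.lower()))
        vectors.append([counts[term] for term in vocabulary])
    return vectors


def sum_squared_inner_products(first, second):
    """Sum (S_n . T_m)^2 over every pair of segment vectors."""
    total = 0
    for s in first:
        for t in second:
            dot = sum(a * b for a, b in zip(s, t))
            total += dot * dot
    return total


def similarity(first, second):
    """Assumed normalization: cross sum over geometric mean of the self sums."""
    cross = sum_squared_inner_products(first, second)
    norm = math.sqrt(sum_squared_inner_products(first, first)
                     * sum_squared_inner_products(second, second))
    return cross / norm if norm else 0.0


# Hypothetical toy input: each sentence is treated as one document segment.
doc_a = ["the cat sat", "the cat ran"]
doc_b = ["a cat sat", "dogs ran far"]
vocabulary = sorted(set(re.findall(r"[a-z]+", " ".join(doc_a + doc_b).lower())))
S = segment_vectors(doc_a, vocabulary)
T = segment_vectors(doc_b, vocabulary)
print(sum_squared_inner_products(S, T))   # 6
print(round(similarity(S, T), 3))         # 0.277
```

With this normalization a document compared with itself scores 1.0, which makes scores comparable across documents of different lengths.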
4. A method for determining similarity between two input documents, comprising:
detecting k terms that occur in each of said input documents;
segmenting each of said input documents into document segments, each segment being a predetermined part of said input document;
generating document segment vectors, each vector including as its element values according to occurrence frequencies of said terms occurring in said respective document segments, where an n-th document segment vector for a first one of said input documents $S_n$ $(n=1,\dots,N)$ is represented by $(s_{n1}, s_{n2}, s_{n3}, \dots, s_{nk})$ and an m-th document segment vector for a second one of said input documents $T_m$ $(m=1,\dots,M)$ is represented by $(t_{m1}, t_{m2}, t_{m3}, \dots, t_{mk})$, where $s_{ni}$ represents the occurrence frequency of an i-th term among a total of k terms in an n-th document segment of said first input document, and $t_{mi}$ represents the occurrence frequency of the i-th term in an m-th document segment of said second input document;
calculating, by a processing device, eigenvalues and eigenvectors of square sum matrices of document segment vectors for said two input documents, where the square sum matrix B of the first input document is a $k\times k$ matrix, $k>1$, and each component $B_{ab}$ of said square sum matrix B indicates a degree of co-occurrence of a-th and b-th terms $(a, b=1,\dots,k)$ in said first input document and is calculated by
$$B_{ab}=\sum_{n=1}^{N}s_{na}s_{nb},$$
and the square sum matrix C of the second input document is a $k\times k$ matrix, $k>1$, and each component $C_{ab}$ of said square sum matrix C indicates a degree of co-occurrence of a-th and b-th terms $(a, b=1,\dots,k)$ in said second input document and is calculated by
$$C_{ab}=\sum_{m=1}^{M}t_{ma}t_{mb};$$
selecting, as the basis vectors, eigenvectors corresponding to the larger eigenvalues from said calculated eigenvectors;
calculating weighted sum of squared inner products between said basis vectors of one of said two input documents and document segment vectors of the other document of said two input documents and summing the squared inner products with weights; and
determining the similarity between said two input documents based on said weighted sum of squared inner products.
Dependent claims: 5, 6.
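The basis-vector comparison recited in claim 4 can be sketched in plain Python as follows. This is an illustrative sketch only: the claim does not specify the weights, so using each eigenvalue as the weight of its eigenvector is an assumption, as is the use of power iteration with deflation to obtain the largest eigenpairs. The function names and the toy segment vectors are hypothetical.

```python
import math
import random


def square_sum_matrix(segments):
    """B[a][b] = sum over segments of s[a] * s[b] (a k x k symmetric matrix)."""
    k = len(segments[0])
    return [[sum(s[a] * s[b] for s in segments) for b in range(k)]
            for a in range(k)]


def top_eigenpairs(matrix, count, iterations=200, seed=0):
    """Largest eigenpairs of a symmetric positive semi-definite matrix,
    found by power iteration with deflation."""
    k = len(matrix)
    work = [row[:] for row in matrix]
    rng = random.Random(seed)
    pairs = []
    for _ in range(count):
        vec = [rng.random() + 0.1 for _ in range(k)]
        for _ in range(iterations):
            nxt = [sum(work[a][b] * vec[b] for b in range(k)) for a in range(k)]
            norm = math.sqrt(sum(x * x for x in nxt))
            if norm < 1e-12:  # remaining eigenvalues are numerically zero
                return pairs
            vec = [x / norm for x in nxt]
        value = sum(vec[a] * work[a][b] * vec[b]
                    for a in range(k) for b in range(k))
        pairs.append((value, vec))
        # Deflate: subtract the component just found so the next pass
        # converges to the next-largest eigenpair.
        for a in range(k):
            for b in range(k):
                work[a][b] -= value * vec[a] * vec[b]
    return pairs


def basis_similarity(basis_doc, other_doc, num_basis):
    """Sum of eigenvalue * (phi . T_m)^2 over selected basis vectors and
    the other document's segment vectors (eigenvalue weights are assumed)."""
    total = 0.0
    for value, phi in top_eigenpairs(square_sum_matrix(basis_doc), num_basis):
        for t in other_doc:
            dot = sum(p * x for p, x in zip(phi, t))
            total += value * dot * dot
    return total


S = [[2, 0, 0], [0, 1, 0]]  # hypothetical segment vectors, first document
T = [[1, 1, 0]]             # hypothetical segment vector, second document
print(round(basis_similarity(S, T, num_basis=2), 6))  # 5.0
```

In this toy case the square sum matrix of the first document is diagonal with entries 4, 1 and 0, so keeping both nonzero eigenpairs reproduces the plain sum of squared inner products between segment vectors, $(2)^2 + (1)^2 = 5$, while keeping only the largest one yields 4.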
7. A method for determining similarity between two given input documents, comprising:
detecting k terms that occur in each of said input documents;
segmenting each of said input documents into document segments, each segment being a predetermined part of said input document;
generating document segment vectors, each vector including as its element values according to occurrence frequencies of said terms occurring in said respective document segments, where an n-th document segment vector for a first one of said input documents $S_n$ $(n=1,\dots,N)$ is represented by $(s_{n1}, s_{n2}, s_{n3}, \dots, s_{nk})$ and an m-th document segment vector for a second one of said input documents $T_m$ $(m=1,\dots,M)$ is represented by $(t_{m1}, t_{m2}, t_{m3}, \dots, t_{mk})$, where $s_{ni}$ represents the occurrence frequency of an i-th term among a total of k terms in an n-th document segment of said first input document, and $t_{mi}$ represents the occurrence frequency of the i-th term in an m-th document segment of said second input document;
calculating, by a processing device, eigenvalues and eigenvectors of square sum matrices of document segment vectors for said two input documents, where the square sum matrix B of the first input document is a $k\times k$ matrix, $k>1$, and each component $B_{ab}$ of said square sum matrix B indicates a degree of co-occurrence of a-th and b-th terms $(a, b=1,\dots,k)$ in said first input document and is calculated by
$$B_{ab}=\sum_{n=1}^{N}s_{na}s_{nb},$$
and the square sum matrix C of the second input document is a $k\times k$ matrix, $k>1$, and each component $C_{ab}$ of said square sum matrix C indicates a degree of co-occurrence of a-th and b-th terms $(a, b=1,\dots,k)$ in said second input document and is calculated by
$$C_{ab}=\sum_{m=1}^{M}t_{ma}t_{mb};$$
selecting, as basis vectors, eigenvectors corresponding to the larger eigenvalues from said calculated eigenvectors for each of said two input documents;
calculating weighted sum of squared inner products for combination of said selected basis vectors and summing the squared inner product with weights; and
determining the similarity between said two input documents based on said weighted sum of squared inner products.
Dependent claims: 8, 9.
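One way to read the weighted sum over combinations of basis vectors in claim 7 is through the following identity. It is a sketch under assumptions that the claim itself does not state: $\phi_i$, $\lambda_i$ denote the eigenvectors and eigenvalues of $B$, $\psi_j$, $\mu_j$ those of $C$, the weight of each pair is taken to be the product $\lambda_i\mu_j$, and all eigenvectors with nonzero eigenvalues are retained.

$$\begin{aligned}
\sum_{i}\sum_{j}\lambda_i\mu_j\,(\phi_i^{T}\psi_j)^2
&=\operatorname{tr}(BC)\\
&=\sum_{a=1}^{k}\sum_{b=1}^{k}B_{ab}C_{ab}\\
&=\sum_{n=1}^{N}\sum_{m=1}^{M}\Bigl(\sum_{a=1}^{k}s_{na}t_{ma}\Bigr)^{2}.
\end{aligned}$$

The first step uses the spectral decompositions $B=\sum_i\lambda_i\phi_i\phi_i^{T}$ and $C=\sum_j\mu_j\psi_j\psi_j^{T}$, the second uses the symmetry of $C$, and the third substitutes the definitions of $B_{ab}$ and $C_{ab}$. Under these assumptions the quantity equals the sum of squared inner products over all pairs of segment vectors, as in claim 1; keeping only the eigenvectors with the larger eigenvalues then gives an approximation of that sum.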
10. A non-transitory computer readable medium storing instructions executable by a processor, the instructions, when executed, causing the processor to perform a method for determining similarity between two input documents, the method comprising:
detecting k terms that occur in each of said input documents;
segmenting each of said input documents into document segments, each segment being a predetermined part of said input document;
generating document segment vectors, each vector including as its element values according to occurrence frequencies of said terms occurring in said respective document segments, where an n-th document segment vector for a first one of said input documents $S_n$ $(n=1,\dots,N)$ is represented by $(s_{n1}, s_{n2}, s_{n3}, \dots, s_{nk})$ and an m-th document segment vector for a second one of said input documents $T_m$ $(m=1,\dots,M)$ is represented by $(t_{m1}, t_{m2}, t_{m3}, \dots, t_{mk})$, where $s_{ni}$ represents the occurrence frequency of an i-th term among a total of k terms in an n-th document segment of said first input document, and $t_{mi}$ represents the occurrence frequency of the i-th term in an m-th document segment of said second input document;
calculating eigenvalues and eigenvectors of square sum matrices of document segment vectors for said two input documents, where the square sum matrix B of the first input document is a $k\times k$ matrix, $k>1$, and each component $B_{ab}$ of said square sum matrix B indicates a degree of co-occurrence of a-th and b-th terms $(a, b=1,\dots,k)$ in said first input document and is calculated by
$$B_{ab}=\sum_{n=1}^{N}s_{na}s_{nb},$$
and the square sum matrix C of the second input document is a $k\times k$ matrix, $k>1$, and each component $C_{ab}$ of said square sum matrix C indicates a degree of co-occurrence of a-th and b-th terms $(a, b=1,\dots,k)$ in said second input document and is calculated by
$$C_{ab}=\sum_{m=1}^{M}t_{ma}t_{mb};$$
selecting, as the basis vectors, eigenvectors corresponding to the larger eigenvalues from said calculated eigenvectors;
calculating weighted sum of squared inner products between said basis vectors of one of said two input documents and document segment vectors of the other document of said two input documents and summing the squared inner products with weights; and
determining the similarity between said two input documents based on said weighted sum of squared inner products.
Dependent claims: 11, 12.
Specification