Method and vector analysis for a document

US 8,171,026 B2
Filed: 04/16/2009
Issued: 05/01/2012
Est. Priority Date: 11/20/2000
Status: Expired due to Fees


Abstract

The invention provides a document representation method and a document analysis method including extraction of important sentences from a given document and/or determination of similarity between two documents.

The inventive method detects terms that occur in the input document, segments the input document into document segments, each segment being an appropriately sized chunk, and generates document segment vectors whose element values are set according to the occurrence frequencies of the terms in the respective document segments. The method further calculates eigenvalues and eigenvectors of a square sum matrix of the document segment vectors, the rank of which is represented by R, and selects from the eigenvectors a plurality (L) of eigenvectors to be used for determining importance. The method then calculates, for each document segment vector, a weighted sum of its squared projections onto the selected eigenvectors, and selects the document segments of significant importance based on that weighted sum.
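
The importance calculation described in the abstract can be written out as a short worked form. This is only a sketch: the symbols s_n (the n-th document segment vector), B (the square sum matrix), phi_l and lambda_l (its orthonormal eigenvectors and eigenvalues) and w_l (the weights) are notation assumed here for illustration, and the abstract does not say how the weights are chosen.

```latex
B = \sum_{n=1}^{N} s_n s_n^{\mathsf{T}}, \qquad
B \phi_l = \lambda_l \phi_l, \qquad
\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_R > 0

\mathrm{importance}(s_n) = \sum_{l=1}^{L} w_l \left( \phi_l^{\mathsf{T}} s_n \right)^2,
\qquad L \le R
```

Segments with the largest importance values would then be the ones selected. Taking w_l = lambda_l is one plausible choice of weights, stated here as an assumption rather than as the patented weighting.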

12 Claims

1. A method for determining similarity between two input documents, comprising:
- detecting terms that occur in each of said input documents;
  
  segmenting each of said input documents into document segments, each segment being a predetermined part of said input document;
  
  generating document segment vectors, each vector including as its element values according to occurrence frequencies of said terms occurring in said respective document segments, where a n-th document segment vector for a first one of said input documents S_n (n=1, . . . , N) is represented by (s_n1, s_n2, s_n3, . . . , s_nk) and a m-th document segment vector for a second one of said input documents T_m (m=1, . . . , M) is represented by (t_m1, t_m2, t_m3, . . . , t_mk), where s_ni represents the occurrence frequency of an i-th term in a n-th document segment, and t_mi represents the occurrence frequency of an i-th term in a m-th document segment;
  
  calculating by a processing device, for each of the two input documents, a squared inner product for all combinations of said document segment vectors contained in each input document, where the squared inner product is represented by
  $S_n^t T_m = \sum_{k=1}^{K} s_{nk} t_{mk}$; and
  
  determining said similarity between the two input documents based on a sum of said squared inner products.
- Dependent claims (2, 3)
  - 2. The method of claim 1, further comprising, prior to segmenting each of said input documents into document segments, performing a morphological analysis for the detected terms, including assigning a part of speech to each of the detected terms.
  - 3. The method of claim 1, wherein each of the document segments contains an equal number of terms.
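
The steps of claim 1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patented implementation: the function names are hypothetical, segments are taken as fixed-size runs of tokens (as in claim 3), the vocabulary is the union of terms in both documents, and the normalisation in `similarity` is an added assumption, since the claim only says the similarity is "based on" the sum.

```python
from collections import Counter
from math import sqrt


def segment_vectors(tokens, vocabulary, segment_size):
    """Split tokens into fixed-size segments and count each vocabulary term per segment."""
    vectors = []
    for start in range(0, len(tokens), segment_size):
        counts = Counter(tokens[start:start + segment_size])
        vectors.append([counts[term] for term in vocabulary])
    return vectors


def sum_squared_inner_products(first, second):
    """Sum (S_n . T_m)^2 over every pair of segment vectors from the two documents."""
    total = 0
    for s in first:
        for t in second:
            inner = sum(a * b for a, b in zip(s, t))
            total += inner * inner
    return total


def similarity(first, second):
    """Score in [0, 1]. The normalisation is an assumption, not part of claim 1."""
    denom = sqrt(sum_squared_inner_products(first, first)
                 * sum_squared_inner_products(second, second))
    return sum_squared_inner_products(first, second) / denom if denom else 0.0


# Toy documents, already tokenised (term detection is reduced to whitespace splitting).
doc_a = "the cat sat on the mat the cat ate".split()
doc_b = "the dog sat on the log the dog ran".split()
vocabulary = sorted(set(doc_a) | set(doc_b))
seg_a = segment_vectors(doc_a, vocabulary, segment_size=3)
seg_b = segment_vectors(doc_b, vocabulary, segment_size=3)
print(sum_squared_inner_products(seg_a, seg_b))
print(round(similarity(seg_a, seg_b), 3))
```

With this normalisation a document compared with itself scores 1, which makes scores for documents of different lengths easier to compare.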

4. A method for determining similarity between two input documents, comprising:
- detecting k terms that occur in each of said input documents;
  
  segmenting each of said input documents into document segments, each segment being a predetermined part of said input document;
  
  generating document segment vectors, each vector including as its element values according to occurrence frequencies of said terms occurring in said respective document segments, where a n-th document segment vector for a first one of said input documents S_n (n=1, . . . , N) is represented by (s_n1, s_n2, s_n3, . . . , s_nk) and a m-th document segment vector for a second one of said input documents T_m (m=1, . . . , M) is represented by (t_m1, t_m2, t_m3, . . . , t_mk), where s_ni represents the occurrence frequency of an i-th term among a total of k terms in a n-th document segment of said first input document, and t_mi represents the occurrence frequency of the i-th term in a m-th document segment of said second input document;
  
  calculating, by a processing device, eigenvalues and eigenvectors of square sum matrices of document segment vectors for said two input documents, where the square sum matrix B of the first input document is a k × k matrix, k > 1, and each component B_ab of said square sum matrix B indicates a degree of co-occurrence of a-th and b-th terms (a, b=1, . . . , k) in said first input document and is calculated by
  $B_{ab} = \sum_{n=1}^{N} s_{na} s_{nb}$,

  and the square sum matrix C of the second input document is a k × k matrix, k > 1, and each component C_ab of said square sum matrix C indicates a degree of co-occurrence of a-th and b-th terms (a, b=1, . . . , k) in said second input document and is calculated by
  $C_{ab} = \sum_{m=1}^{M} t_{ma} t_{mb}$;
  
  selecting, as the basis vectors, eigenvectors corresponding to the larger eigenvalues from said calculated eigenvectors;
  
  calculating weighted sum of squared inner products between said basis vectors of one of said two input documents and document segment vectors of the other document of said two input documents and summing the squared inner products with weights; and
  
  determining the similarity between said two input documents based on said weighted sum of squared inner products.
- Dependent claims (5, 6)
  - 5. The method of claim 4, further comprising, prior to segmenting each of said input documents into document segments, performing a morphological analysis for the detected terms, including assigning a part of speech to each of the detected terms.
  - 6. The method of claim 4, wherein each of the document segments contains an equal number of terms.
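
The eigenvector-based comparison of claim 4 can be illustrated with a small, dependency-free Python sketch. Several things here are assumptions for illustration only: all function names are hypothetical, the weights are taken to be the eigenvalues (the claim only says "with weights"), and a plain cyclic Jacobi rotation routine stands in for whatever symmetric eigensolver a real system would use (for example `numpy.linalg.eigh`).

```python
import math


def square_sum_matrix(vectors):
    """B[a][b] = sum over segments n of s_na * s_nb (term co-occurrence)."""
    k = len(vectors[0])
    return [[float(sum(v[a] * v[b] for v in vectors)) for b in range(k)]
            for a in range(k)]


def jacobi_eigen(matrix, sweeps=50):
    """Eigenpairs of a symmetric matrix via cyclic Jacobi rotations.

    Returns (eigenvalue, eigenvector) tuples sorted by descending eigenvalue.
    """
    k = len(matrix)
    a = [row[:] for row in matrix]
    v = [[float(i == j) for j in range(k)] for i in range(k)]
    for _ in range(sweeps):
        for p in range(k - 1):
            for q in range(p + 1, k):
                if abs(a[p][q]) < 1e-15:
                    continue
                diff = a[q][q] - a[p][p]
                # Rotation angle that zeroes the (p, q) entry.
                theta = math.pi / 4 if diff == 0 else 0.5 * math.atan(2 * a[p][q] / diff)
                c, s = math.cos(theta), math.sin(theta)
                for i in range(k):  # rotate columns p and q of A and of V
                    a[i][p], a[i][q] = c * a[i][p] - s * a[i][q], s * a[i][p] + c * a[i][q]
                    v[i][p], v[i][q] = c * v[i][p] - s * v[i][q], s * v[i][p] + c * v[i][q]
                for j in range(k):  # rotate rows p and q of A
                    a[p][j], a[q][j] = c * a[p][j] - s * a[q][j], s * a[p][j] + c * a[q][j]
    pairs = [(a[i][i], [v[r][i] for r in range(k)]) for i in range(k)]
    return sorted(pairs, key=lambda pair: pair[0], reverse=True)


def weighted_projection_similarity(basis_doc, other_doc, num_basis):
    """Sum of eigenvalue * (basis vector . segment vector)^2.

    Basis vectors come from basis_doc; segment vectors come from other_doc.
    Using the eigenvalue as the weight is an assumption.
    """
    pairs = jacobi_eigen(square_sum_matrix(basis_doc))[:num_basis]
    total = 0.0
    for eigenvalue, vector in pairs:
        for segment in other_doc:
            inner = sum(x * y for x, y in zip(vector, segment))
            total += eigenvalue * inner * inner
    return total


first = [[2, 1, 0], [1, 1, 0], [0, 1, 1]]   # segment vectors S_n, k = 3 terms
second = [[1, 0, 0], [1, 2, 0], [0, 0, 3]]  # segment vectors T_m
print(weighted_projection_similarity(first, second, num_basis=2))
```

Under these assumed weights, keeping all k eigenvectors reproduces the plain sum of squared inner products between segment vectors, so truncating to the larger eigenvalues acts as an approximation of that sum.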

7. A method for determining similarity between given two input documents, comprising:
- detecting k terms that occur in each of said input documents;
  
  segmenting each of said input documents into document segments, each segment being a predetermined part of said input document;
  
  generating document segment vectors, each vector including as its element values according to occurrence frequencies of said terms occurring in said respective document segments, where a n-th document segment vector for a first one of said input documents S_n (n=1, . . . , N) is represented by (s_n1, s_n2, s_n3, . . . , s_nk) and a m-th document segment vector for a second one of said input documents T_m (m=1, . . . , M) is represented by (t_m1, t_m2, t_m3, . . . , t_mk), where s_ni represents the occurrence frequency of an i-th term among a total of k terms in a n-th document segment of said first input document, and t_mi represents the occurrence frequency of the i-th term in a m-th document segment of said second input document;
  
  calculating, by a processing device, eigenvalues and eigenvectors of square sum matrices of document segment vectors for said two input documents, where the square sum matrix B of the first input document is a k × k matrix, k > 1, and each component B_ab of said square sum matrix B indicates a degree of co-occurrence of a-th and b-th terms (a, b=1, . . . , k) in said first input document and is calculated by
  $B_{ab} = \sum_{n=1}^{N} s_{na} s_{nb}$,

  and the square sum matrix C of the second input document is a k × k matrix, k > 1, and each component C_ab of said square sum matrix C indicates a degree of co-occurrence of a-th and b-th terms (a, b=1, . . . , k) in said second input document and is calculated by
  $C_{ab} = \sum_{m=1}^{M} t_{ma} t_{mb}$;
  
  selecting, as basis vectors, eigenvectors corresponding to the larger eigenvalues from said calculated eigenvectors for each of said two input documents;
  
  calculating weighted sum of squared inner products for combination of said selected basis vectors and summing the squared inner product with weights; and
  
  determining the similarity between said two input documents based on said weighted sum of squared inner products.
- Dependent claims (8, 9)
  - 8. The method of claim 7, further comprising, prior to segmenting each of said input documents into document segments, performing a morphological analysis for the detected terms, including assigning a part of speech to each of the detected terms.
  - 9. The method of claim 7, wherein each of the document segments contains an equal number of terms.
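
The basis-to-basis comparison of claim 7 can be written out as a worked equation. The notation is assumed for illustration: phi_i and lambda_i are the orthonormal eigenvectors and eigenvalues of B, psi_j and mu_j those of C, and L_B and L_C are the numbers of basis vectors kept for each document. The product weights lambda_i mu_j are likewise an assumption, since the claim only requires that the squared inner products be summed "with weights".

```latex
B = \sum_{i=1}^{k} \lambda_i \phi_i \phi_i^{\mathsf{T}}, \qquad
C = \sum_{j=1}^{k} \mu_j \psi_j \psi_j^{\mathsf{T}}

\mathrm{sim} = \sum_{i=1}^{L_B} \sum_{j=1}^{L_C}
  \lambda_i \mu_j \left( \phi_i^{\mathsf{T}} \psi_j \right)^2

% With every eigenvector kept (L_B = L_C = k), the weighted sum collapses to
\sum_{i=1}^{k} \sum_{j=1}^{k} \lambda_i \mu_j \left( \phi_i^{\mathsf{T}} \psi_j \right)^2
  = \operatorname{tr}(BC)
  = \sum_{a=1}^{k} \sum_{b=1}^{k} B_{ab} C_{ab}
  = \sum_{n=1}^{N} \sum_{m=1}^{M} \left( S_n^{\mathsf{T}} T_m \right)^2
```

Under these assumed weights, the untruncated case equals the sum of squared inner products between all segment vector pairs. Keeping only the eigenvectors with the larger eigenvalues therefore approximates that sum with far fewer inner products.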

10. A non-transitory computer readable medium storing instructions executable by a processor, the instructions, when executed, causing the processor to perform a method for determining similarity between two input documents, the method comprising:
- detecting k terms that occur in each of said input documents;
  
  segmenting each of said input documents into document segments, each segment being a predetermined part of said input document;
  
  generating document segment vectors, each vector including as its element values according to occurrence frequencies of said terms occurring in said respective document segments, where a n-th document segment vector for a first one of said input documents S_n (n=1, . . . , N) is represented by (s_n1, s_n2, s_n3, . . . , s_nk) and a m-th document segment vector for a second one of said input documents T_m (m=1, . . . , M) is represented by (t_m1, t_m2, t_m3, . . . , t_mk), where s_ni represents the occurrence frequency of an i-th term among a total of k terms in a n-th document segment of said first input document, and t_mi represents the occurrence frequency of the i-th term in a m-th document segment of said second input document;
  
  calculating eigenvalues and eigenvectors of square sum matrices of document segment vectors for said two input documents, where the square sum matrix B of the first input document is a k × k matrix, k > 1, and each component B_ab of said square sum matrix B indicates a degree of co-occurrence of a-th and b-th terms (a, b=1, . . . , k) in said first input document and is calculated by
  $B_{ab} = \sum_{n=1}^{N} s_{na} s_{nb}$,

  and the square sum matrix C of the second input document is a k × k matrix, k > 1, and each component C_ab of said square sum matrix C indicates a degree of co-occurrence of a-th and b-th terms (a, b=1, . . . , k) in said second input document and is calculated by
  $C_{ab} = \sum_{m=1}^{M} t_{ma} t_{mb}$;
  
  selecting, as the basis vectors, eigenvectors corresponding to the larger eigenvalues from said calculated eigenvectors;
  
  calculating weighted sum of squared inner products between said basis vectors of one of said two input documents and document segment vectors of the other document of said two input documents and summing the squared inner products with weights; and
  
  determining the similarity between said two input documents based on said weighted sum of squared inner products.
- Dependent claims (11, 12)
  - 11. The non-transitory computer readable medium of claim 10, further comprising, prior to segmenting each of said input documents into document segments, performing a morphological analysis for the detected terms, including assigning a part of speech to each of the detected terms.
  - 12. The non-transitory computer readable medium of claim 10, wherein each of the document segments contains an equal number of terms.

Current Assignee: Micro Focus LLC (Open Text Corporation)
Original Assignee: Hewlett-Packard Development Company, L.P. (HP Inc.)
Inventors: Kawatani, Takahiko
Primary Examiner(s): Breene, John E
Assistant Examiner(s): Myint, Dennis
Application Number: US12/424,801
Publication Number: US 20090216759A1
Time in Patent Office: 1,111 Days
Field of Search: 707/749, 707/750, 707/737
US Class Current: 707/737
CPC Class Codes:
G06F 16/3347   using vector based model
Y10S 707/99932   Access augmentation or opti...
Y10S 707/99933   Query processing, i.e. sear...
