Method of vector analysis for a document

US 20040068396A1
Filed: 09/22/2003
Published: 04/08/2004
Est. Priority Date: 11/20/2000
Status: Active Grant

8 Assignments

0 Petitions

Abstract

The invention provides a document representation method and a document analysis method including extraction of important sentences from a given document and/or determination of similarity between two documents.

The method detects terms that occur in the input document, segments the input document into document segments, each being an appropriately sized chunk, and generates document segment vectors whose element values are set according to the occurrence frequencies of the terms in the respective document segments. The method further calculates eigenvalues and eigenvectors of a square sum matrix of the document segment vectors, the rank of which is denoted by R, and selects from the eigenvectors a plurality (L) of eigenvectors to be used for determining importance. The method then calculates, for each document segment vector, a weighted sum of its squared projections onto the selected eigenvectors, and selects the document segments of significant importance based on that weighted sum.

86 Citations

11 Claims

1. A method for representing an input document with vectors, comprising:
- detecting terms that occur in said input document;
  
  segmenting said input document into document segments, each segment being an appropriately sized chunk; and
  
  generating document segment vectors, each vector including as its element values according to occurrence frequencies of said terms occurring in said document segments, wherein a set of said document segment vectors are represented by eigenvalues and eigenvectors of a square sum matrix of said document segments.
  - 2. A method as claimed in claim 1, wherein in said input document K terms occur and said input document is divided into N document segments, said square sum matrix $A = (A_{ab})$ being calculated by:
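
The formula referenced by claim 2 does not appear in this listing. As a rough illustration of claims 1 and 2 only, the sketch below assumes that a segment is a sentence ending in a period, that a term is a lowercase whitespace-separated word, and that the square sum matrix is the K×K matrix with entries $A_{ab} = \sum_{n} d_{na} d_{nb}$. None of these choices is confirmed by the text above, and the helper names `segment_vectors` and `square_sum_matrix` are hypothetical.

```python
from collections import Counter


def segment_vectors(text):
    """Return (terms, vectors): one term-frequency vector per segment.

    Assumption: a segment is a sentence ended by '.', and a term is a
    lowercase whitespace-separated word. The claims leave both open.
    """
    segments = [s.split() for s in text.lower().split(".") if s.strip()]
    terms = sorted({w for seg in segments for w in seg})
    vectors = []
    for seg in segments:
        counts = Counter(seg)
        # Counter returns 0 for terms that do not occur in this segment.
        vectors.append([counts[t] for t in terms])
    return terms, vectors


def square_sum_matrix(vectors):
    """Assumed form: A[a][b] = sum over segments n of d_n[a] * d_n[b]."""
    k = len(vectors[0])
    return [[sum(d[a] * d[b] for d in vectors) for b in range(k)]
            for a in range(k)]


terms, vectors = segment_vectors("the cat sat. the cat ran. a dog ran.")
A = square_sum_matrix(vectors)
```

Under this assumed form, each diagonal entry of `A` sums the squared counts of one term over all segments, and each off-diagonal entry sums the products of two terms' counts within the same segments.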

3. A method for extracting important document segments from an input document, comprising:
- detecting terms that occur in said input document;
  
  segmenting said input document into document segments, each segment being an appropriately sized chunk;
  
  generating document segment vectors, each vector including as its element values according to occurrence frequencies of said terms occurring in said document segments;
  
  determining eigenvalues and eigenvectors of a square sum matrix in which a rank of said document segment vector is represented by R;
  
  selecting, from said eigenvectors, a plural of (L) eigenvectors to be used for determining importance;
  
  calculating a weighted sum of squared projections of said document segment vectors onto the selected eigenvectors; and
  
  selecting document segments having large importance based on said calculated weighted sum of squared projections of the document segment vectors.
  - 4. A method as claimed in claim 3, wherein in said input document K terms occur and said input document is divided into N document segments;
    - wherein $d_{ni}$ represents occurrence frequency of an i-th term in a n-th document segment and a n-th document segment vector $d_n$ $(n = 1, \ldots, N)$ is represented by $(d_{n1}, d_{n2}, \ldots, d_{nK})$ and the vector $d_n$ after the projection is represented by $z_n = (z_{n1}, z_{n2}, \ldots, z_{nL})$, a projection value of $d_n$ to a m-th eigenvector is given by $z_{nm} = \Phi_m^t d_n$, where $\Phi_m$ represents the m-th eigenvector and t represents transpose;

      a sum of squared projections onto a L dimensional subspace being given by:

      $\sum_{m = 1}^{L} z_{n m}^{2}$ or $\sum_{m = 1}^{L} \lambda_{m} z_{n m}^{2},$ where $\lambda_m$ represents the eigenvalue of the m-th eigenvector.
  - 5. A method as claimed in claim 3, wherein said eigenvalue and eigenvector are calculated by the following square sum matrix:
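
The matrix formula referenced by claim 5 does not appear in this listing. The following minimal sketch of the scoring in claims 3 and 4 therefore assumes the square sum matrix is $A = D^t D$, where the rows of `D` are the document segment vectors, and uses NumPy's `numpy.linalg.eigh` for the symmetric eigendecomposition. The function name `segment_importance` and the `weighted` flag are hypothetical.

```python
import numpy as np


def segment_importance(D, L, weighted=True):
    """Score each row of D (N segments x K terms).

    Each segment vector is projected onto the L eigenvectors with the
    largest eigenvalues of the assumed square sum matrix A = D^T D.
    """
    A = D.T @ D                              # assumed square sum matrix (K x K)
    eigvals, eigvecs = np.linalg.eigh(A)     # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:L]    # indices of the L largest
    lam, Phi = eigvals[order], eigvecs[:, order]
    Z = D @ Phi                              # Z[n, m] = Phi_m^t d_n
    weights = lam if weighted else np.ones(L)
    return (Z ** 2) @ weights                # sum over m of w_m * z_nm^2


D = np.array([[2.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
scores = segment_importance(D, L=1)
plain = segment_importance(D, L=1, weighted=False)
ranking = np.argsort(scores)[::-1]           # most important segment first
```

Squaring the projections makes the scores independent of the arbitrary sign of each eigenvector returned by the solver.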

6. A method for retrieving, from an input document, a document segment having relatedness with a query, comprising:
- detecting terms that occur in said input document;
  
  segmenting said input document into document segments, each segment being an appropriately sized chunk;
  
  generating document segment vectors, each vector including as its element values according to occurrence frequencies of said terms occurring in said document segments;
  
  determining eigenvalues and eigenvectors of square sum matrices of said document segment vectors to define a subspace;
  
  detecting terms that occur in said query;
  
  generating query vectors, each vector including as its element values according to occurrence frequency of said terms to project said query vectors onto said subspace;
  
  projecting each of said document segment vectors onto said subspace to calculate relatedness of said query with said document segments.
  - 7. A method as claimed in claim 6, wherein when $z_n$ represents a projection vector of a projection vector $d_n$ of said document segment on said subspace and y represents a projected vector of the said query vector onto said subspace, relatedness $g_n$ between a n-th document segment and the query is obtained based on an inner product of y and $z_n$, that is $y^t z_n$ where t represents transpose.
  - 8. A method as claimed in claim 6, wherein a weight $s_m$ for a m-th eigenvector is defined by a function of $(\Phi_m^t q)^2$ and the relatedness $g_n$ with the document segment n is obtained by;
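
The formula referenced by claim 8 does not appear in this listing, so the sketch below covers only the inner-product relatedness of claims 6 and 7. It assumes the subspace is spanned by the L leading eigenvectors of $D^t D$ (rows of `D` being segment vectors) and that the query vector `q` uses the same term ordering as `D`. The function name `relatedness` is hypothetical.

```python
import numpy as np


def relatedness(D, q, L):
    """Return g_n = y^t z_n for every segment n.

    D: N x K matrix of segment vectors; q: length-K query vector.
    Assumption: the subspace is spanned by the L eigenvectors of
    D^T D with the largest eigenvalues.
    """
    eigvals, eigvecs = np.linalg.eigh(D.T @ D)
    Phi = eigvecs[:, np.argsort(eigvals)[::-1][:L]]   # K x L basis
    Z = D @ Phi        # rows are the projected segment vectors z_n
    y = Phi.T @ q      # projected query vector
    return Z @ y


D = np.array([[2.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
q = np.array([1.0, 0.0, 3.0])   # third term never occurs in the document
g = relatedness(D, q, L=2)
```

In this example the query's third term lies outside the subspace and contributes nothing, so only the first segment is related to the query.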

9. A method for determining similarity between given two input documents, comprising:
- detecting terms that occur in each of said input documents;
  
  segmenting each of said input document into document segments, each segment being an appropriately sized chunk;
  
  generating document segment vectors, each vector including as its element values according to occurrence frequencies of said terms occurring in said respective document segments;
  
  calculating, for each of the two input documents, a squared inner product for all combinations of said document segment vectors contained in each input document; and
  
  determining said similarity between the two input documents based on a sum of said squared inner products.
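
As an illustration of claim 9, the following minimal sketch sums the squared inner products over all pairs of segment vectors. The normalization by the two within-document sums is an assumption added here so that identical documents score 1; the claim only says the similarity is "based on" the sum. The helper names are hypothetical, and both documents are assumed to share one term ordering.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def sum_sq_inner(S, T):
    """Sum of squared inner products over all pairs (s, t)."""
    return sum(dot(s, t) ** 2 for s in S for t in T)


def similarity(S, T):
    """Cross-document sum, normalized by the within-document sums
    (the normalization is an assumption, not stated in the claim)."""
    return sum_sq_inner(S, T) / math.sqrt(
        sum_sq_inner(S, S) * sum_sq_inner(T, T))


doc_a = [[1, 1, 0], [0, 1, 0]]
doc_b = [[0, 0, 1]]
same = similarity(doc_a, doc_a)
disjoint = similarity(doc_a, doc_b)
```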

10. A method for determining similarity between given two input documents, comprising:
- calculating eigenvalues and eigenvectors of square sum matrices of document segment vectors for one of said two input documents;
  
  selecting, as the basis vectors, eigenvectors corresponding to the larger eigenvalues from said calculated eigenvectors;
  
  calculating weighted sum of squared inner products between said basis vector and document segment vectors of the other document of said two input documents and summing the squared inner products with weights; and
  
  determining the similarity between said two input documents based on said weighted sum of squared inner products.
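
Claim 10 can be sketched as follows, under stated assumptions: the square sum matrix of the first document is taken as $D_1^t D_1$, the weights are taken to be the corresponding eigenvalues, and no normalization is applied. The claim does not specify the weights or any normalization, and the function name `basis_to_segments_score` is hypothetical.

```python
import numpy as np


def basis_to_segments_score(D1, D2, L):
    """Weighted sum of squared inner products between the L leading
    eigenvectors of document 1 and the segment vectors of document 2.

    Assumption: each basis vector is weighted by its eigenvalue.
    """
    eigvals, eigvecs = np.linalg.eigh(D1.T @ D1)
    order = np.argsort(eigvals)[::-1][:L]
    lam, Phi = eigvals[order], eigvecs[:, order]
    proj = D2 @ Phi                       # proj[n, m] = Phi_m^t t_n
    return float(np.sum((proj ** 2) @ lam))


D1 = np.array([[2.0, 0.0], [1.0, 0.0]])
overlap = basis_to_segments_score(D1, np.array([[1.0, 0.0]]), L=1)
no_overlap = basis_to_segments_score(D1, np.array([[0.0, 1.0]]), L=1)
```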

11. A method for determining similarity between given two input documents, comprising:
- calculating eigenvalues and eigenvectors of square sum matrices of document segment vectors for each of said two input documents;
  
  selecting, as basis vectors, eigenvectors corresponding to the larger eigenvalues from said calculated eigenvectors for each of said two input documents;
  
  calculating weighted sum of squared inner products for combination of said selected basis vectors and summing the squared inner product with weights; and
  
  determining the similarity between said two input documents based on said weighted sum of squared inner products.
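
For claim 11, a minimal sketch compares the two sets of basis vectors directly. It assumes each pair of basis vectors is weighted by the product of its two eigenvalues, which the claim does not state; the names `leading_basis` and `basis_to_basis_score` are hypothetical.

```python
import numpy as np


def leading_basis(D, L):
    """Return the L largest eigenvalues of D^T D and their eigenvectors."""
    eigvals, eigvecs = np.linalg.eigh(D.T @ D)
    order = np.argsort(eigvals)[::-1][:L]
    return eigvals[order], eigvecs[:, order]


def basis_to_basis_score(D1, D2, L1, L2):
    """Sum over m, n of lam_m * mu_n * (Phi_m^t Psi_n)^2.

    Assumption: the weight of each pair is the product of eigenvalues.
    """
    lam, Phi = leading_basis(D1, L1)
    mu, Psi = leading_basis(D2, L2)
    G = Phi.T @ Psi                 # inner products between basis vectors
    return float(lam @ (G ** 2) @ mu)


D1 = np.array([[1.0, 1.0, 0.0], [0.0, 1.0, 0.0]])
D2 = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]])
full = basis_to_basis_score(D1, D2, 3, 3)
pairwise = float(np.sum((D1 @ D2.T) ** 2))   # all segment-pair products
```

With these assumed weights and all eigenvectors retained, the score equals the sum of squared inner products over all segment pairs, so truncating to the leading eigenvectors approximates that sum with fewer terms.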

Current Assignee: Micro Focus LLC (Open Text Corporation)
Original Assignee: Hewlett-Packard Development Company, L.P. (HP Inc.)
Inventors: Kawatani, Takahiko
Granted Patent: US 7,562,066 B2

Field of Search
US Class Current: 703/2
CPC Class Codes:
G06F 16/3347   using vector based model
Y10S 707/99932   Access augmentation or opti...
Y10S 707/99933   Query processing, i.e. sear...
