Techniques for computing similarity measurements between segments representative of documents

US 8,166,049 B2
Filed: 05/28/2009
Issued: 04/24/2012
Est. Priority Date: 05/29/2008
Status: Active Grant

Abstract

Keyword frequency data for a plurality of document-derived segments is represented in a matrix form in which each segment is represented as a vector of dimensionality equal to the number of keywords. The matrix may be subdivided into a plurality of sub-matrices, each preferably corresponding to a non-overlapping portion of the plurality of keywords. When determining a similarity measurement between any pair of segments, at least a portion of the keyword frequency data for each sub-matrix's non-overlapping keywords is used to determine a sub-matrix dot product for the pair of segments. The resulting plurality of sub-matrix dot products are then summed together in order to provide the similarity measurement. Keywords that are synonyms of each other may be accommodated through the modification of keyword frequency data. Where the keyword frequency data in the matrix representation is relatively sparse, compressed views of the matrix representation may be provided.
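
As a rough illustration of the summation described in the abstract, the following Python sketch splits two keyword-frequency vectors into non-overlapping keyword ranges, computes one dot product per range, and sums the results. The function names, the fixed-width partitioning, and the sample frequencies are assumptions made for illustration only; the text above does not prescribe how the keywords are partitioned.

```python
from typing import List, Sequence, Tuple


def keyword_partitions(num_keywords: int, width: int) -> List[Tuple[int, int]]:
    """Split keyword indices into non-overlapping [start, end) ranges."""
    return [(start, min(start + width, num_keywords))
            for start in range(0, num_keywords, width)]


def sub_matrix_dot(a: Sequence[float], b: Sequence[float],
                   start: int, end: int) -> float:
    """Dot product of two segment vectors restricted to one keyword range."""
    return sum(a[k] * b[k] for k in range(start, end))


def similarity(a: Sequence[float], b: Sequence[float],
               partitions: List[Tuple[int, int]]) -> float:
    """Sum the per-range dot products to obtain the similarity measurement."""
    return sum(sub_matrix_dot(a, b, s, e) for s, e in partitions)


# Hypothetical keyword frequencies for two segments over six keywords.
segment_a = [2, 0, 1, 0, 3, 1]
segment_b = [1, 4, 0, 0, 2, 2]

parts = keyword_partitions(num_keywords=6, width=2)
partial = [sub_matrix_dot(segment_a, segment_b, s, e) for s, e in parts]
total = similarity(segment_a, segment_b, parts)
```

Because the ranges are disjoint and together cover every keyword, the summed value equals the dot product over the full vectors, while each per-range product can be computed independently of the others.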

6 Claims

1. In a system for navigating a document repository in which each document in the document repository comprises at least one segment, a method for computing similarity measurements between various ones of a plurality of segments comprising:
- populating a matrix representative of the plurality of segments in which each segment of the plurality of segments is represented by keyword frequency data spanning a plurality of keywords, the matrix comprising a plurality of sub-matrices in which each sub-matrix of the plurality of sub-matrices corresponds to a non-overlapping portion of the plurality of keywords;
  
  identifying a first keyword of the plurality of keywords as being a synonym of a second keyword of the plurality of keywords; and
  
  adding first keyword frequency data corresponding to the first keyword to second keyword frequency data corresponding to the second keyword to provide modified second keyword frequency data;
  
  for each sub-matrix of the plurality of sub-matrices, calculating a sub-matrix dot product between a first segment of the plurality of segments and a second segment of the plurality of segments, the sub-matrix dot product spanning at least a portion of the non-overlapping portion of the plurality of keywords, to provide a plurality of sub-matrix dot products, wherein the plurality of sub-matrix dot products comprise dot products of the modified second keyword frequency data and the first keyword frequency data; and
  
  summing the plurality of sub-matrix dot products to provide a similarity measurement between the first segment and the second segment.

2. The method of claim 1, further comprising:
    - identifying an additional segment corresponding to a newly added document in the document repository;
      
      populating the matrix with additional keyword frequency data spanning the plurality of keywords and corresponding to the additional segment;
      
      for each sub-matrix of the plurality of sub-matrices, calculating an additional sub-matrix dot product between the additional segment and another segment of the plurality of segments, the additional sub-matrix dot product spanning at least a portion of the non-overlapping portion of the plurality of keywords, to provide a plurality of additional sub-matrix dot products; and
      
      summing the plurality of additional sub-matrix dot products to provide an additional similarity measurement between the additional segment and the other segment.
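
The steps of claims 1 and 2 that deal with a newly added segment can be sketched as follows. This is a minimal sketch under stated assumptions: the storage layout (one dictionary per sub-matrix, keyed by a segment identifier), the helper names `add_segment` and `similarity`, and the sample data are all hypothetical and are not taken from the claims.

```python
from typing import Dict, List

# Assumed layout: one sub-matrix per keyword range; each maps a segment id
# to that segment's frequencies for the keywords in that range only.
SubMatrix = Dict[str, List[int]]


def add_segment(sub_matrices: List[SubMatrix], widths: List[int],
                segment_id: str, frequencies: List[int]) -> None:
    """Slice a new segment's keyword frequencies across the sub-matrices."""
    if len(frequencies) != sum(widths):
        raise ValueError("frequency vector must span every keyword")
    start = 0
    for sub_matrix, width in zip(sub_matrices, widths):
        sub_matrix[segment_id] = frequencies[start:start + width]
        start += width


def similarity(sub_matrices: List[SubMatrix], first: str, second: str) -> int:
    """Sum of per-sub-matrix dot products between two stored segments."""
    return sum(
        sum(x * y for x, y in zip(sm[first], sm[second]))
        for sm in sub_matrices
    )


widths = [2, 3]  # keywords 0-1 in the first sub-matrix, 2-4 in the second
sub_matrices: List[SubMatrix] = [{}, {}]
add_segment(sub_matrices, widths, "seg-1", [1, 0, 2, 0, 1])
add_segment(sub_matrices, widths, "seg-2", [0, 3, 1, 1, 0])

# A newly added document contributes an additional segment.
add_segment(sub_matrices, widths, "seg-new", [2, 1, 0, 4, 1])
additional = {other: similarity(sub_matrices, "seg-new", other)
              for other in ("seg-1", "seg-2")}
```

In this sketch only the pairs involving the additional segment are computed; a similarity already held for `seg-1` and `seg-2` is unaffected by the new row.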

3. In a system for navigating a document repository in which each document in the document repository comprises at least one segment, a method for computing similarity measurements between various ones of a plurality of segments comprising:
- populating a matrix representative of the plurality of segments in which each segment of the plurality of segments is represented by keyword frequency data spanning a plurality of keywords;
  
  identifying a first keyword of the plurality of keywords as being a synonym of a second keyword of the plurality of keywords;
  
  adding first keyword frequency data corresponding to the first keyword to second keyword frequency data corresponding to the second keyword to provide modified second keyword frequency data; and
  
  calculating a dot product between a first segment of the plurality of segments and a second segment of the plurality of segments, the dot product spanning at least a portion of the plurality of keywords, including the modified second keyword data, to provide a similarity measurement between the first segment and the second segment, wherein the dot product is a dot product of the modified second keyword data and the first keyword frequency data.

4. The method of claim 3, further comprising:
    - identifying an additional segment corresponding to a newly added document in the document repository;
      
      populating the matrix with additional keyword frequency data spanning the plurality of keywords and corresponding to the additional segment;
      
      calculating an additional dot product between the additional segment and another segment of the plurality of segments, the additional dot product spanning at least a portion of the plurality of keywords, including the modified second keyword data, to provide an additional similarity measurement between the additional segment and the other segment.
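
One possible reading of the synonym handling in claims 3 and 4 is sketched below: the first keyword's frequency is added to the second keyword's frequency in every segment, and the dot product is then taken over the modified data. The keyword list, the segment data, and the function names are hypothetical. The sketch also assumes that the first keyword's own column is left unchanged, because the claim text does not say whether it is cleared.

```python
from typing import Dict, List, Sequence


def merge_synonym(matrix: Dict[str, List[int]],
                  first_kw: int, second_kw: int) -> None:
    """Add the first keyword's frequency to the second keyword's, per segment.

    Assumption: the first keyword's column is kept as-is after the merge.
    """
    for frequencies in matrix.values():
        frequencies[second_kw] += frequencies[first_kw]


def dot(a: Sequence[int], b: Sequence[int]) -> int:
    """Plain dot product over all keywords."""
    return sum(x * y for x, y in zip(a, b))


keywords = ["car", "automobile", "engine"]  # hypothetical keyword list
matrix = {
    "seg-1": [2, 0, 1],  # mentions "car" twice, never "automobile"
    "seg-2": [0, 3, 1],  # mentions "automobile" three times, never "car"
}

before = dot(matrix["seg-1"], matrix["seg-2"])
merge_synonym(matrix,
              first_kw=keywords.index("car"),
              second_kw=keywords.index("automobile"))
after = dot(matrix["seg-1"], matrix["seg-2"])
```

Before the merge, the two segments share only the "engine" keyword, so the synonymous terms contribute nothing to the similarity; after the merge, the modified "automobile" frequencies of both segments contribute to the dot product.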

5. An apparatus for computing similarity measurements between various ones of a plurality of segments, comprising:
- a matrix creation component operable to populate a matrix representative of the plurality of segments in which each segment of the plurality of segments is represented by keyword frequency data spanning a plurality of keywords, the matrix comprising a plurality of sub-matrices in which each sub-matrix of the plurality of sub-matrices corresponds to a non-overlapping portion of the plurality of keywords;
  
  at least one storage device, operably coupled to the matrix creation component, operable to store the matrix;
  
  a similarity computation component operably coupled to the at least one storage component and operable to calculate, for each sub-matrix of the plurality of sub-matrices, a sub-matrix dot product between a first segment of the plurality of segments and a second segment of the plurality of segments, the sub-matrix dot product spanning at least a portion of the non-overlapping portion of the plurality of keywords, to provide a plurality of sub-matrix dot products and sum the plurality of sub-matrix dot products to provide a similarity measurement between the first segment and the second segment; and
  
  a synonym determination component, operably coupled to the matrix creation component, operable to identify a first keyword of the plurality of keywords as being a synonym of a second keyword of the plurality of keywords and add first keyword frequency data corresponding to the first keyword to second keyword frequency data corresponding to the second keyword to provide modified second keyword frequency data, wherein the plurality of sub-matrix dot products comprise dot products of the modified second keyword data and the first keyword frequency data.

6. An apparatus for computing similarity measurements between various ones of a plurality of segments, comprising:
- a processor;
  
  a storage device operably coupled to the processor, the storage device comprising executable instructions that when executed by the processor cause the processor to:
  
  populate a matrix representative of the plurality of segments in which each segment of the plurality of segments is represented by keyword frequency data spanning a plurality of keywords, the matrix comprising a plurality of sub-matrices in which each sub-matrix of the plurality of sub-matrices corresponds to a non-overlapping portion of the plurality of keywords;
  
  identify a first keyword of the plurality of keywords as being a synonym of a second keyword of the plurality of keywords;
  
  add first keyword frequency data corresponding to the first keyword to second keyword frequency data corresponding to the second keyword to provide modified second keyword frequency data;
  
  for each sub-matrix of the plurality of sub-matrices, calculate a sub-matrix dot product between a first segment of the plurality of segments and a second segment of the plurality of segments, the sub-matrix dot product spanning at least a portion of the non-overlapping portion of the plurality of keywords, to provide a plurality of sub-matrix dot products, wherein the plurality of sub-matrix dot products comprise dot products of the modified second keyword frequency data and the first keyword frequency data; and
  
  sum the plurality of sub-matrix dot products to provide a similarity measurement between the first segment and the second segment.

Current Assignee: Accenture Global Services Limited (Accenture PLC)
Original Assignee: Accenture Global Services Limited (Accenture PLC)
Inventors: Bose Rantham Prabhakara, Jagadeesh Chandra; Nayak, Ashwin; Chandran, Anitha
Primary Examiner(s): Channavajjala, Srirama
Assistant Examiner(s): Park, Grace A
Application Number: US12/473,347
Publication Number: US 20090300006A1
Time in Patent Office: 1,062 Days
Field of Search: 707/748-750
US Class Current: 707/748
CPC Class Codes: G06F 16/316 Indexing structures
