Techniques for computing similarity measurements between segments representative of documents
Abstract
Keyword frequency data for a plurality of document-derived segments is represented in a matrix form in which each segment is represented as a vector of dimensionality equal to the number of keywords. The matrix may be subdivided into a plurality of sub-matrices, each preferably corresponding to a non-overlapping portion of the plurality of keywords. When determining a similarity measurement between any pair of segments, at least a portion of the keyword frequency data for each sub-matrix's non-overlapping keywords is used to determine a sub-matrix dot product for the pair of segments. The resulting plurality of sub-matrix dot products are then summed together in order to provide the similarity measurement. Keywords that are synonyms of each other may be accommodated through the modification of keyword frequency data. Where the keyword frequency data in the matrix representation is relatively sparse, compressed views of the matrix representation may be provided.
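The summation of sub-matrix dot products described in the abstract can be illustrated with a minimal Python sketch. It is not taken from the patent: the function names, the fixed `block_size` partitioning scheme, and the sample frequencies are all assumptions made for illustration. Each row of `matrix` is one segment, and each column is one keyword.

```python
from typing import List, Sequence, Tuple


def partition_keywords(num_keywords: int, block_size: int) -> List[Tuple[int, int]]:
    """Split keyword indices into non-overlapping [start, end) ranges.

    Assumption: equal-width blocks; the source does not say how the
    keywords are divided among sub-matrices.
    """
    return [
        (start, min(start + block_size, num_keywords))
        for start in range(0, num_keywords, block_size)
    ]


def sub_matrix_dot(seg_a: Sequence[int], seg_b: Sequence[int], start: int, end: int) -> int:
    """Dot product of two segments restricted to one block of keywords."""
    return sum(seg_a[k] * seg_b[k] for k in range(start, end))


def similarity(matrix: Sequence[Sequence[int]], i: int, j: int, block_size: int) -> int:
    """Sum the per-block dot products for segments i and j."""
    blocks = partition_keywords(len(matrix[i]), block_size)
    return sum(sub_matrix_dot(matrix[i], matrix[j], s, e) for s, e in blocks)


# Hypothetical keyword frequencies: 3 segments, 5 keywords.
matrix = [
    [2, 0, 1, 0, 3],
    [1, 1, 0, 0, 2],
    [0, 4, 0, 1, 0],
]

print(similarity(matrix, 0, 1, block_size=2))  # 8
```

Because the blocks do not overlap and together cover every keyword, the sum of the per-block dot products equals the ordinary dot product of the two full segment vectors, whatever block size is chosen. In this sketch every block uses all of its keywords, although the abstract also allows a block to use only a portion of them.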
6 Claims
1. In a system for navigating a document repository in which each document in the document repository comprises at least one segment, a method for computing similarity measurements between various ones of a plurality of segments comprising:
populating a matrix representative of the plurality of segments in which each segment of the plurality of segments is represented by keyword frequency data spanning a plurality of keywords, the matrix comprising a plurality of sub-matrices in which each sub-matrix of the plurality of sub-matrices corresponds to a non-overlapping portion of the plurality of keywords;
identifying a first keyword of the plurality of keywords as being a synonym of a second keyword of the plurality of keywords; and
adding first keyword frequency data corresponding to the first keyword to second keyword frequency data corresponding to the second keyword to provide modified second keyword frequency data;
for each sub-matrix of the plurality of sub-matrices, calculating a sub-matrix dot product between a first segment of the plurality of segments and a second segment of the plurality of segments, the sub-matrix dot product spanning at least a portion of the non-overlapping portion of the plurality of keywords, to provide a plurality of sub-matrix dot products, wherein the plurality of sub-matrix dot products comprise dot products of the modified second keyword frequency data and the first keyword frequency data; and
summing the plurality of sub-matrix dot products to provide a similarity measurement between the first segment and the second segment.
3. In a system for navigating a document repository in which each document in the document repository comprises at least one segment, a method for computing similarity measurements between various ones of a plurality of segments comprising:
populating a matrix representative of the plurality of segments in which each segment of the plurality of segments is represented by keyword frequency data spanning a plurality of keywords;
identifying a first keyword of the plurality of keywords as being a synonym of a second keyword of the plurality of keywords;
adding first keyword frequency data corresponding to the first keyword to second keyword frequency data corresponding to the second keyword to provide modified second keyword frequency data; and
calculating a dot product between a first segment of the plurality of segments and a second segment of the plurality of segments, the dot product spanning at least a portion of the plurality of keywords, including the modified second keyword data, to provide a similarity measurement between the first segment and the second segment, wherein the dot product is a dot product of the modified second keyword data and the first keyword frequency data.
5. An apparatus for computing similarity measurements between various ones of a plurality of segments, comprising:
a matrix creation component operable to populate a matrix representative of the plurality of segments in which each segment of the plurality of segments is represented by keyword frequency data spanning a plurality of keywords, the matrix comprising a plurality of sub-matrices in which each sub-matrix of the plurality of sub-matrices corresponds to a non-overlapping portion of the plurality of keywords;
at least one storage device, operably coupled to the matrix creation component, operable to store the matrix;
a similarity computation component operably coupled to the at least one storage component and operable to calculate, for each sub-matrix of the plurality of sub-matrices, a sub-matrix dot product between a first segment of the plurality of segments and a second segment of the plurality of segments, the sub-matrix dot product spanning at least a portion of the non-overlapping portion of the plurality of keywords, to provide a plurality of sub-matrix dot products and sum the plurality of sub-matrix dot products to provide a similarity measurement between the first segment and the second segment; and
a synonym determination component, operably coupled to the matrix creation component, operable to identify a first keyword of the plurality of keywords as being a synonym of a second keyword of the plurality of keywords and add first keyword frequency data corresponding to the first keyword to second keyword frequency data corresponding to the second keyword to provide modified second keyword frequency data, wherein the plurality of sub-matrix dot products comprise dot products of the modified second keyword data and the first keyword frequency data.
6. An apparatus for computing similarity measurements between various ones of a plurality of segments, comprising:
a processor;
a storage device operably coupled to the processor, the storage device comprising executable instructions that when executed by the processor cause the processor to:
populate a matrix representative of the plurality of segments in which each segment of the plurality of segments is represented by keyword frequency data spanning a plurality of keywords, the matrix comprising a plurality of sub-matrices in which each sub-matrix of the plurality of sub-matrices corresponds to a non-overlapping portion of the plurality of keywords;
identify a first keyword of the plurality of keywords as being a synonym of a second keyword of the plurality of keywords;
add first keyword frequency data corresponding to the first keyword to second keyword frequency data corresponding to the second keyword to provide modified second keyword frequency data;
for each sub-matrix of the plurality of sub-matrices, calculate a sub-matrix dot product between a first segment of the plurality of segments and a second segment of the plurality of segments, the sub-matrix dot product spanning at least a portion of the non-overlapping portion of the plurality of keywords, to provide a plurality of sub-matrix dot products, wherein the plurality of sub-matrix dot products comprise dot products of the modified second keyword frequency data and the first keyword frequency data; and
sum the plurality of sub-matrix dot products to provide a similarity measurement between the first segment and the second segment.
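The synonym step recited in the claims (adding the first keyword's frequency data to the second keyword's frequency data before taking dot products) might look roughly like the following Python sketch. The helper names `merge_synonym` and `dot`, the keyword list, and the sample frequencies are hypothetical. The sketch also assumes that the first keyword's own column is left unchanged after the addition, since the claims do not say whether it is cleared.

```python
from typing import List, Sequence


def merge_synonym(
    matrix: Sequence[Sequence[int]],
    keywords: Sequence[str],
    first: str,
    second: str,
) -> List[List[int]]:
    """Return a copy of `matrix` with `first`'s frequencies added to `second`'s.

    Assumption: the `first` column itself is not zeroed afterwards.
    """
    column = {kw: idx for idx, kw in enumerate(keywords)}
    src, dst = column[first], column[second]
    merged = [list(row) for row in matrix]  # copy so the input is not mutated
    for row in merged:
        row[dst] += row[src]
    return merged


def dot(a: Sequence[int], b: Sequence[int]) -> int:
    """Plain dot product over all keywords."""
    return sum(x * y for x, y in zip(a, b))


# Hypothetical data: two segments that use different words for the same thing.
keywords = ["car", "automobile", "engine"]
matrix = [
    [3, 0, 1],  # segment 0 mentions "car" three times
    [0, 2, 1],  # segment 1 mentions "automobile" twice
]

merged = merge_synonym(matrix, keywords, first="automobile", second="car")

print(dot(matrix[0], matrix[1]))  # 1  (synonyms not recognized)
print(dot(merged[0], merged[1]))  # 7  (synonym counts folded into "car")
```

Without the merge, the two segments overlap only on "engine". After the merge, the shared concept contributes to the similarity measurement even though the segments use different words for it.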
Specification