Method and apparatus for fundamental operations on token sequences: computing similarity, extracting term values, and searching efficiently

US 7,421,418 B2
Filed: 02/18/2004
Issued: 09/02/2008
Est. Priority Date: 02/19/2003
Status: Active Grant

Abstract

A method and apparatus for fundamental operations on token sequences: computing similarity, extracting term values, and searching efficiently have been disclosed.

67 Citations

53 Claims

1. An apparatus for deriving a similarity measure comprising:
- means for inputting an eigenspace analysis of a reference;
  
  means for inputting a transition probability model of a target;
  
  means for operating on said eigenspace analysis and said transition probability model; and
  
  means for displaying said similarity measure to a user.
- Dependent claims (2):
  - 2. A machine-readable medium having stored thereon information representing the apparatus of claim 1.

3. A hardware based apparatus comprising:
- a first block having an input and an output, said first block input capable of receiving a vector space;
  
  a second block having an input and an output, said second block input capable of receiving a probability space; and
  
  a third block having a first input, a second input, and an output, said third block first input coupled to receive said first block output, said third block second input coupled to receive said second block output, and said third block output capable of communicating a similarity space; and
  
  a display capable of presenting to a user said similarity space.
- Dependent claims (4):
  - 4. A machine-readable medium having stored thereon information representing the apparatus of claim 3.

5. A computer implemented method comprising:
- generating a similarity metric based upon an eigenspace analysis and an n-gram model, said similarity metric capable of being stored in hardware on said computer and capable of being displayed to a user.

6. A computer implemented method comprising:
- receiving a profile;
  
  receiving a matrix; and
  
  generating a similarity indication between said profile and said matrix, said similarity indication capable of being stored in hardware on said computer and capable of being displayed to a user.
- Dependent claims (7, 8, 9, 10, 11):
  - 7. The method of claim 6 wherein said profile is an eigenspace.
  - 8. The method of claim 6 wherein said matrix is a transition probability matrix.
  - 9. The method of claim 6 wherein said profile is derived from tokens.
  - 10. The method of claim 6 wherein said matrix is derived from tokens.
  - 11. The method of claim 6 wherein said profile is derived from an n-gram analysis of text.

12. A computer implemented method for generating a similarity space, the method comprising:
- combining a vector space with a transition probability space, said similarity space capable of being stored in hardware on said computer and capable of being displayed to a user.

13. A computer implemented method comprising performing a mathematical operation using an eigenspace and transition probability matrix to generate a similarity index, said similarity index capable of being stored in hardware on said computer and capable of being displayed to a user.

14. A computer implemented method for generating a similarity score comprising:
- receiving a profile having eigenvalues (w.i);
  
  receiving a transition matrix (a);
  
  generating a left eigenvector (u.i) for each said w.i;
  
  generating a right eigenvector (v.i) for each said w.i;
  
  generating a complex conjugate of said u.i;
  
  generating said similarity score according to a formula:
  
  similarity score = sum(i, ∥u.i.conjugate*a*v.i∥^2),
  
  where each term of said summation is said transition matrix premultiplied by said complex conjugate of i-th said left eigenvector, and postmultiplied by i-th said right eigenvector, and wherein norm squared ∥.∥^2 is a square of magnitude of a complex number resulting from said i-th term in said sum;
  
  storing said similarity score in hardware on said computer; and
  
  presenting to a user said similarity score.
- Dependent claims (15, 16, 17):
  - 15. The method of claim 14 wherein said profile has parameters selected from the group consisting of tuples, tokens, eigenvalues, left eigenvectors, and right eigenvectors.
  - 16. The method of claim 14 wherein said transition matrix is a probability transition matrix.
  - 17. The method of claim 14 wherein said profile represents a reference text and said transition matrix represents a target text, and a lower similarity score indicates that said target text has little or nothing in common with said reference text versus a higher similarity score indicating that there are common tuple-token combinations that appear in said target text that also appear in said reference text.
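
The formula of claim 14 can be illustrated with a minimal Python sketch. It assumes the left and right eigenvectors of the reference profile have already been computed and are supplied as plain lists of numbers; the function name `similarity_score` and its argument layout are hypothetical and are not taken from the patent.

```python
def similarity_score(left_vecs, right_vecs, a):
    """Sum over i of |conj(u_i) * a * v_i|^2 (sketch of the claim 14 formula).

    left_vecs  -- list of left eigenvectors u_i (assumed precomputed)
    right_vecs -- list of right eigenvectors v_i (assumed precomputed)
    a          -- square transition matrix of the target, as a list of rows
    """
    total = 0.0
    for u, v in zip(left_vecs, right_vecs):
        # Postmultiply the transition matrix by the right eigenvector: a * v_i.
        av = [sum(row[c] * v[c] for c in range(len(v))) for row in a]
        # Premultiply by the complex conjugate of the left eigenvector.
        term = sum(complex(u[r]).conjugate() * av[r] for r in range(len(u)))
        # Norm squared: squared magnitude of the resulting complex number.
        total += abs(term) ** 2
    return total
```

When the reference matrix is symmetric and its eigenvectors have unit length, the left and right eigenvectors coincide, so scoring the reference matrix against its own profile gives the sum of the squared magnitudes of its eigenvalues.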

18. An apparatus for generating a similarity measure comprising:
- means for performing a computation on a target input and a profile; and
  
  means for presenting to a user said similarity measure.
- Dependent claims (19, 20, 21):
  - 19. The apparatus in claim 18 wherein said computation is substantially linear in order of magnitude with respect to a plurality of target inputs against said profile.
  - 20. The apparatus of claim 18 wherein said target input comprises a transition probability matrix of said target input tokenized.
  - 21. The apparatus of claim 18 wherein said profile comprises an Eigenspace.

22. A means for computing a similarity measure wherein said computing means is substantially O(log(n)) for n target inputs against a pre-computed profile, and means for presenting to a user said similarity measure.
- Dependent claims (23, 24):
  - 23. The means of claim 22 wherein said pre-computed profile is an eigenspace and said one or more target inputs are one or more transition probability matrices.
  - 24. The means of claim 23 wherein said one or more transition probability matrices are derived from one or more sets of shuffled tokens from one or more target inputs.

25. A computer implemented method comprising:
- tokenizing a target input;
  
  generating a transition probability matrix for said tokens;
  
  operating on a profile and said transition probability matrix; and
  
  generating a measure of similarity between said profile and said matrix, said measure of similarity capable of being stored in hardware on said computer and capable of being displayed to a user.
- Dependent claims (26, 27, 28):
  - 26. The method of claim 25 wherein said profile comprises:
    - tokenizing a reference input;
      
      generating a transition probability matrix for said reference tokens; and
      
      generating an eigenspace for said transition probability matrix for said reference tokens.
  - 27. The method of claim 25 wherein said target input is selected from the group consisting of letters, groups of letters, words, phrases, sentences, paragraphs, sections, spaces, punctuation, one or more documents, XML, textual input, HTML, SGML, and sets of text.
  - 28. The method of claim 26 wherein said reference input is selected from the group consisting of letters, groups of letters, words, phrases, sentences, paragraphs, sections, spaces, punctuation, one or more documents, XML, textual input, HTML, SGML, and sets of text.
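
The first two steps of claim 25 (tokenizing an input and generating a transition probability matrix for the tokens) can be sketched as follows. This is an illustration under stated assumptions: tokens are lower-cased, whitespace-separated words, and the model is first order (each token conditioned on the single preceding token). The names `tokenize` and `transition_matrix` are hypothetical. The eigenspace step of claim 26 is omitted; it would typically be delegated to a numerical library.

```python
def tokenize(text):
    # Assumption: tokens are lower-cased, whitespace-separated words.
    return text.lower().split()


def transition_matrix(tokens):
    """Return (vocab, matrix), where matrix[i][j] estimates the probability
    that vocab[j] immediately follows vocab[i] in the token sequence."""
    vocab = sorted(set(tokens))
    index = {tok: i for i, tok in enumerate(vocab)}
    counts = [[0] * len(vocab) for _ in vocab]
    # Count each adjacent (previous, next) token pair.
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[index[prev]][index[nxt]] += 1
    # Normalize each row so that its entries sum to 1 (rows with no
    # outgoing transitions are left as all zeros).
    matrix = []
    for row in counts:
        total = sum(row)
        matrix.append([c / total if total else 0.0 for c in row])
    return vocab, matrix


vocab, matrix = transition_matrix(tokenize("the cat sat on the mat"))
```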

29. A computer implemented method for modeling comprising:
- using a history window of h tokens to compose a tuple; and
  
  tallying all words that fall within r tokens of said tuple wherein r is between r=1 (which is a Markov n-gram model), and r substantially approaching infinity (which is a word frequency model), said tallying all words capable of being stored in hardware on said computer and capable of being displayed to a user.
- Dependent claims (30, 31, 32, 33):
  - 30. The method of claim 29 wherein said r is a step function token transition window of width r.
  - 31. The method of claim 29 wherein said r is a non-step function.
  - 32. The method of claim 31 where said non-step function gives greater weight to nearby tokens and lesser weight to farther away tokens.
  - 33. The method of claim 29 wherein said r is a transition weight function s(i), where 0<=s(i)<=1, for i=1, . . . ,r, and normalized so that sum(i=1, . . . ,r; s(i))=1.
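
The windowed tallying of claims 29 to 33 can be sketched in Python as below. The sketch assumes that "within r tokens of said tuple" means the r tokens immediately following the tuple, and that the step function of claim 30 corresponds to equal weights of 1/r. The name `tally_window` and its parameters are hypothetical.

```python
from collections import defaultdict


def tally_window(tokens, h, r, weights=None):
    """Tally the tokens that follow each h-token tuple within a window of r.

    weights -- transition weight function s(1..r); defaults to the step
               function s(i) = 1/r. Must sum to 1, as in claim 33.
    """
    if weights is None:
        weights = [1.0 / r] * r
    if len(weights) != r or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must have length r and sum to 1")
    tallies = defaultdict(lambda: defaultdict(float))
    for start in range(len(tokens) - h + 1):
        # The history window of h tokens composes the tuple.
        tup = tuple(tokens[start:start + h])
        for i in range(1, r + 1):
            pos = start + h - 1 + i  # i-th token after the tuple
            if pos >= len(tokens):
                break
            tallies[tup][tokens[pos]] += weights[i - 1]
    return {tup: dict(words) for tup, words in tallies.items()}
```

With `r=1` only the token directly after each tuple is tallied, which is the n-gram case named in claim 29. Passing decreasing weights such as `[0.75, 0.25]` for `r=2` gives the nearer token more weight, in the manner of claim 32.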

34. A computer implemented method for generating a similarity measure between a reference profile and a target input by performing an operation on an eigenvalue space representation of said reference profile and a transition probability model of said target input, said similarity measure capable of being stored in hardware on said computer and capable of being displayed to a user.
- Dependent claims (35, 36, 37):
  - 35. The method of claim 34 wherein said transition probability model represents a tokenized representation of said target input.
  - 36. The method of claim 35 wherein said tokenized representation is further generated by shuffling of tokens representing said target input.
  - 37. The method of claim 36 wherein a plurality of similarity measures is generated based on said reference profile and one or more said shuffled tokenized representations as said transition probability model.

38. A computer implemented method for determining a high similarity measure, the method comprising:
- (a) pre-generating a fixed set of eigenspace profiles representing known references;
  
  (b) generating a series of tokens representing clauses from a target input;
  
  (c) dividing said series of tokens into two groups, group A and group B;
  
  (d) generating a transition probability model for group A and group B;
  
  (e) generating a similarity measure for group A versus said profiles, and for group B versus said profiles;
  
  (f) retaining group A if it has a similarity measure equal to or higher than group B from (e), otherwise retaining group B;
  
  (g) defining the retained group as said series of tokens and repeating (c) to (g) for a predetermined number of times;
  
  storing said high similarity measure in hardware on said computer; and
  
  presenting to a user said high similarity measure.
- Dependent claims (39, 40, 41, 42):
  - 39. The method of claim 38 wherein said (c) dividing said series of tokens into two groups, group A and group B results in group A and group B being substantially the same size.
  - 40. The method of claim 38 wherein said predetermined number of times is based upon a factor selected from the group consisting of a relationship to the number of said tokens representing clauses from said target input, and a predetermined minimum similarity measure.
  - 41. The method of claim 38 wherein said dividing further comprises shuffling said tokens.
  - 42. The method of claim 38 wherein said known references comprises N text blocks.
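
Steps (c) to (g) of claim 38 amount to a repeated halving of the token series. A minimal Python sketch follows, assuming the similarity computation of steps (d) and (e) is supplied by the caller as a function `score`; both `bisect_search` and `score` are hypothetical names, and the even split reflects claim 39 rather than a requirement of claim 38.

```python
def bisect_search(tokens, score, rounds):
    """Repeatedly split the token series in two and keep the better half.

    score  -- caller-supplied function standing in for steps (d) and (e):
              it maps a list of tokens to a similarity measure
    rounds -- the predetermined number of repetitions from step (g)
    """
    current = list(tokens)
    for _ in range(rounds):
        if len(current) < 2:
            break  # nothing left to divide
        mid = len(current) // 2
        group_a, group_b = current[:mid], current[mid:]   # step (c)
        # Step (f): retain group A on a tie or a higher score, else group B.
        current = group_a if score(group_a) >= score(group_b) else group_b
    return current, score(current)
```

Because each round discards roughly half of the remaining tokens, the number of rounds needed to isolate a single clause grows with the logarithm of the series length.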

43. A computer implemented method comprising:
- receiving N text blocks;
  
  building a binary tree representing indexes of said N text blocks;
  
  receiving a T text block;
  
  computing a transition probability matrix for said T text block;
  
  traversing said binary tree; and
  
  finding a closest matching N text block for said T text block, said closest matching N text block capable of being stored in hardware on said computer and capable of being displayed to a user.
- Dependent claims (44):
  - 44. The method of claim 43 wherein said building further comprises:
    - concatenating said N text blocks;
      
      computing a profile of said concatenated N text blocks; and
      
      computing partitioning eigenvectors of said N text blocks.

45. A computer implemented method comprising:
- receiving a T text block;
  
  computing a profile for said T text block;
  
  receiving N text blocks;
  
  (a) shuffling randomly said N text blocks;
  
  (b) dividing said shuffled randomly N text blocks into set A and set B;
  
  (c) concatenating the text blocks in set A to form group A;
  
  (d) concatenating the text blocks in set B to form group B;
  
  (e) computing a transition probability matrix for said group A and for said group B;
  
  (f) generating a similarity measure between said T text block and said group A and said group B;
  
  (g) determining if group A or group B has a higher similarity measure;
  
  (h) tallying an additional count for text blocks that are members of group A or group B having said determined higher similarity measure;
  
  (i) repeating (a) through (h) R times;
  
  (j) picking group A or group B with a highest count as a remaining group;
  
  (k) using the remaining group now as said N text blocks;
  
  (l) repeating (a) through (k) K times;
  
  storing said similarity measure in hardware on said computer; and
  
  presenting to a user said similarity measure.
- Dependent claims (46, 47, 48, 49):
  - 46. The method of claim 45 wherein group A and group B are substantially a same size.
  - 47. The method of claim 45 wherein R is less than 1025.
  - 48. The method of claim 45 wherein R is determined dynamically.
  - 49. The method of claim 45 wherein K is determined from the group consisting of dynamically, and a fixed count.
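
The shuffle-and-tally loop of claim 45, steps (a) through (l), can be sketched in Python as below. Several points are assumptions made for illustration only: the similarity computation of steps (e) and (f) is replaced by a caller-supplied function, with a simple shared-word count (`shared_words`) as a stand-in that does not build a transition probability matrix; step (j) is read as keeping the half of the blocks with the highest tallies; and the names `vote_search`, `shared_words`, and `seed` are hypothetical.

```python
import random


def shared_words(target, group_text):
    # Stand-in similarity (assumption): number of distinct shared words.
    return len(set(target.split()) & set(group_text.split()))


def vote_search(target, blocks, similarity=shared_words, R=20, K=2, seed=0):
    rng = random.Random(seed)
    remaining = list(blocks)
    for _ in range(K):                                   # step (l)
        if len(remaining) < 2:
            break
        counts = [0] * len(remaining)
        for _ in range(R):                               # step (i)
            order = list(range(len(remaining)))
            rng.shuffle(order)                           # step (a)
            half = len(order) // 2
            set_a, set_b = order[:half], order[half:]    # step (b)
            group_a = " ".join(remaining[i] for i in set_a)   # step (c)
            group_b = " ".join(remaining[i] for i in set_b)   # step (d)
            # Steps (e)-(g): score both groups and pick the higher one.
            a_wins = similarity(target, group_a) >= similarity(target, group_b)
            for i in (set_a if a_wins else set_b):       # step (h)
                counts[i] += 1
        # Steps (j)-(k), as interpreted here: keep the top-tallied half.
        keep = max(1, len(remaining) // 2)
        ranked = sorted(range(len(remaining)),
                        key=lambda i: counts[i], reverse=True)
        remaining = [remaining[i] for i in ranked[:keep]]
    return remaining
```

Repeating the shuffle R times lets a block that genuinely matches the target accumulate tallies regardless of which other blocks it happens to be grouped with in any single round.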

50. A method for determining a high similarity measure, the method comprising:
- pre-generating one or more eigenspace profiles representing clauses from a known reference;
  
  generating a series of tokens representing clauses from a target input;
  
  (a) setting a counter n=0;
  
  (b) setting counter n=n+1;
  
  (c) dividing said series of tokens into two groups, group A(n) and group B(n);
  
  (d) generating a transition probability model for group A(n) and group B(n);
  
  (e) generating a similarity measure for group A(n) versus said profiles, and for group B(n) versus said profiles;
  
  (f) awarding group A(n) a point if it has a similarity measure equal to or higher than group B(n) from (e), otherwise awarding group B(n) a point;
  
  (g) shuffling said series of tokens representing clauses from said target input in substantially random order and repeating (b) to (g) for a predetermined number of times;
  
  (h) picking those groups having a point and retaining tokens associated with said picked groups and defining said retained tokens as said series of tokens;
  
  (i) repeating (c) to (h) until said high similarity measure is determined;
  
  storing said high similarity measure in hardware on said computer; and
  
  presenting to a user said high similarity measure.
- Dependent claims (51, 52, 53):
  - 51. The method of claim 50 wherein said (c) dividing said series of tokens into two groups, group A and group B results in group A and group B being substantially the same size.
  - 52. The method of claim 50 wherein said predetermined number of times is based upon a factor selected from the group consisting of a relationship to the number of said tokens representing clauses from said target input, and a predetermined minimum similarity measure.
  - 53. The method of claim 50 wherein said method of claim 50 computation is substantially M*log(N), where N represents said clauses from said known reference, and where M represents said clauses from said target input.

Current Assignee: Nahava, Inc.
Original Assignee: Nahava, Inc.
Inventors: Nakano, Russell Toshio
Primary Examiner(s): Holmes; Michael B
Application Number: US10/781,580
Publication Number: US 20040162827A1
Time in Patent Office: 1,658 Days
Field of Search: 706/52
US Class Current: 706/52
CPC Class Codes: G06F 16/334 Query execution G06F16/335 ...
