METHOD FOR COMPUTING SIMILARITY BETWEEN TEXT SPANS USING FACTORED WORD SEQUENCE KERNELS

US 20090175545A1
Filed: 01/04/2008
Published: 07/09/2009
Est. Priority Date: 01/04/2008
Status: Active Grant

Abstract

A computer implemented method and an apparatus for comparing spans of text are disclosed. The method includes computing a similarity measure between a first sequence of symbols representing a first text span and a second sequence of symbols representing a second text span as a function of the occurrences of optionally noncontiguous subsequences of symbols shared by the two sequences of symbols. Each of the symbols comprises at least one consecutive word and is defined according to a set of linguistic factors. Pairs of symbols in the first and second sequences that form a shared subsequence of symbols are each matched according to at least one of the factors.
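The matching rule in the abstract can be illustrated with a minimal sketch. It assumes each symbol is a tuple of three factors (surface form, lemma, part of speech) and treats two symbols as matching when they agree on at least one factor. The factor names, the example sentences, and the helper functions are hypothetical and are not taken from the patent record.

```python
from itertools import combinations

# Hypothetical factor set; the record does not fix one.
FACTORS = ("surface", "lemma", "pos")


def matching_factors(u, v):
    """Return the names of the factors on which symbols u and v agree."""
    return [name for name, a, b in zip(FACTORS, u, v) if a == b]


def shared_subsequences(s, t, n):
    """Yield (I, J, per_pair) for index tuples of size n whose symbols match.

    I and J are increasing index tuples into s and t (gaps allowed).
    per_pair lists, for each aligned pair, the factors that matched.
    """
    for I in combinations(range(len(s)), n):
        for J in combinations(range(len(t)), n):
            per_pair = [matching_factors(s[i], t[j]) for i, j in zip(I, J)]
            if all(per_pair):  # every pair matched on at least one factor
                yield I, J, per_pair


# Toy enriched sentences: (surface form, lemma, part of speech).
s = [("the", "the", "DET"), ("cats", "cat", "NOUN"), ("slept", "sleep", "VERB")]
t = [("a", "a", "DET"), ("cat", "cat", "NOUN"),
     ("often", "often", "ADV"), ("sleeps", "sleep", "VERB")]

for I, J, per_pair in shared_subsequences(s, t, 3):
    print(I, J, per_pair)
```

In this toy input the only shared subsequence of length 3 skips "often" in the second sentence, and its first pair matches on part of speech alone while the other two pairs also match on lemma, so the pairs do not all rely on the same factor.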

24 Claims

1. A method of comparing spans of text comprising:
- computing a similarity measure between a first sequence of symbols representing a first text span and a second sequence of symbols representing a second text span as a function of the occurrences of optionally noncontiguous subsequences of symbols shared by the two sequences of symbols, wherein each of the symbols comprises at least one consecutive word, the words being enriched with linguistic information allowing them to be defined according to a set of linguistic factors, whereby pairs of symbols in the first and second sequences forming a shared subsequence of symbols are each matched according to at least one of the factors and wherein all pairs of matching symbols in a shared subsequence need not match according to the same factor.
  - 2. The method of claim 1, further comprising, prior to the computing, processing at least one of the first and second text spans to generate a sequence of tokens and enriching the tokens with linguistic information to form the first and second sequences of symbols.
  - 3. The method of claim 1, wherein the computing a similarity measure comprises computing at least one factored word sequence kernel.
  - 4. The method of claim 1, wherein the factored word sequence kernel weights at least one of the factors differently from another of the factors.
  - 5. The method of claim 3, wherein in the factored word sequence kernel, gaps in the shared subsequences are weighted with a decay factor.
  - 6. The method of claim 3, wherein the factored word sequence kernel has the general form:
    - $K_{n}(s, t) = \sum_{I, J} \lambda^{l(I) + l(J)} \prod_{k = 1}^{n} A(s_{i_{k}}, t_{j_{k}})$ where:
      
      s and t represent the first and second sequences;
      
      n represents a length of the shared subsequences to be used in the computation;
      
      I and J represent two of the subsequences of size n of indices ranging on the positions of the symbols in s and t respectively;
      
      $\lambda$ represents an optional decay factor which weights gaps in non-contiguous sequences;
      
      l(I) represents the length, as the number of symbols plus any gaps, spanned by the subsequence I in s;
      
      l(J) represents the length, as the number of symbols plus any gaps, spanned by the subsequence J in t, respectively; and
      
      $A(s_{i_{k}}, t_{j_{k}})$ is a function of two symbols $s_{i_{k}}$, $t_{j_{k}}$ in the sequences s and t respectively, the function quantifying the similarity between the two symbols according to the set of factors.
  - 7. The method of claim 6, wherein the function $A(s_{i_{k}}, t_{j_{k}})$ is a weighted sum over factors of the similarity between factors in each pair of symbols according to the expression:
    - $A(u, v) = \sum_{h = 1}^{p} w_{h}^{2} k_{h}(u[h], v[h])$ where:
      
      u and v are two symbols to be compared, represented each by p factors;
      
      $w_{h}^{2}$, h = 1, ..., p, are the weights of the individual terms in the linear combination;
      
      $k_{h}(u[h], v[h])$ is a function measuring the similarity between the $h^{\text{th}}$ factor of u and the $h^{\text{th}}$ factor of v.
  - 8. The method of claim 3, wherein the factored word sequence kernel has the general form:
    - $K_{n}(s, t) = \sum_{I, J} \lambda^{l(I) + l(J)} \prod_{k = 1}^{n} \left( \sum_{h = 1}^{p} w_{h}^{2} k_{h}(s_{i_{k}}[h], t_{j_{k}}[h]) \right)$ where:
      
      s and t represent the first and second sequences;
      
      n represents a length of the shared subsequences to be used in the computation;
      
      I and J represent two of the subsequences of size n of indices ranging on the positions of the symbols in s and t respectively;
      
      $\lambda$ represents an optional decay factor which weights gaps in non-contiguous sequences;
      
      l(I) represents the length, as the number of symbols plus any gaps, spanned by the subsequence I in s;
      
      l(J) represents the length, as the number of symbols plus any gaps, spanned by the subsequence J in t, respectively;
      
      $w_{h}$ represents the weight of each factor in a set of h factors; and
      
      $k_{h}$ represents the similarity score between two symbols $s_{i_{k}}$ and $t_{j_{k}}$ of sequences s and t, respectively, for a given factor h.
  - 9. The method of claim 1, wherein a plurality of the factors are selected from the group consisting of surface forms, lemmas, parts-of-speech, and morphological tags.
  - 10. The method of claim 1 wherein there are at least three factors.
  - 11. The method of claim 1, wherein at least one factor comprises elements of a continuous inner-product space and the similarity is computed as the inner product between the factor elements.
  - 12. The method of claim 1, wherein at least one factor comprises real vectors forming latent-semantic representations of symbols.
  - 13. The method of claim 1, wherein the computing of the similarity measure comprises computing a plurality of similarity measures each of the similarity measures being computed for a different value of n and computing a combined similarity measure as a function of the computed similarity measures, where n represents a length of the subsequences of symbols shared by the two sequences of symbols.
  - 14. The method of claim 13, wherein the combined similarity measure is computed as a linear combination of the plurality of similarity measures according to the expression:
    - $K_{N}(s, t) = \sum_{n = 1}^{N} \mu_{n} K_{n}(s, t)$ where:
      
      $\mu_{n}$, n = 1, ..., N, is a weight of the $n^{\text{th}}$ single-subsequence-length measure.
  - 15. The method of claim 1, wherein at least one of the factors comprises elements of a countable set and the similarity of two symbols in a pair of symbols, one in each subsequence, is computed using the Kronecker delta function.
  - 16. The method of claim 1, wherein at least one of the factors comprises strings of characters and their similarity is computed as a decreasing function of the edit distance between the two strings.
  - 17. The method of claim 1, wherein at least one of the factors comprises sets of symbols and their similarity is computed using a set-similarity measure such as the Dice coefficient or the Jaccard coefficient.
  - 18. The method of claim 1, wherein at least one of the factors comprises sets of synonym sets from a thesaurus.
  - 19. The method of claim 1, wherein the second sequence comprises a set of second sequences and the method comprises, for each of the second sequences, computing of the similarity measure between the first sequence of symbols and the second sequence of symbols as a function of the occurrences of optionally noncontiguous subsequences of symbols shared by the two sequences of symbols, the method further comprising evaluating a fluency of the first text span in a natural language shared by the text sequences based on the computed similarity measures.
  - 20. The method of claim 19, wherein the set of second sequences comprises a set of good sequences determined to have good fluency and a set of bad sequences determined to have a poor fluency.
  - 21. The method of claim 1 wherein the text spans comprise sentences in the same natural language.
  - 22. A computer program product which encodes instructions which when executed by a computer, performs the method of claim 1.
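
The expressions in claims 7, 8 and 14, together with the factor similarities named in claims 15 to 17, can be sketched by direct enumeration. The following is a minimal illustration under stated assumptions, not an implementation from the patent record: the function names, the choice of 1/(1 + d) as the decreasing function of edit distance d, the empty-set convention for the Jaccard coefficient, the weights, and the toy symbols (including the synonym-set identifiers) are all hypothetical. Enumeration is exponential in n and is only meant to mirror the formulas term by term.

```python
from itertools import combinations
from math import prod


def delta(a, b):
    """Kronecker delta for factors drawn from a countable set (claim 15)."""
    return 1.0 if a == b else 0.0


def edit_distance(a, b):
    """Levenshtein distance between two strings, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def edit_sim(a, b):
    """One possible decreasing function of edit distance (an assumption)."""
    return 1.0 / (1.0 + edit_distance(a, b))


def jaccard(a, b):
    """Jaccard coefficient of two sets; 0.0 if both are empty (a convention chosen here)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def symbol_sim(u, v, weights, sims):
    """A(u, v) = sum over factors h of w_h^2 * k_h(u[h], v[h])."""
    return sum(w * w * k(uh, vh) for w, k, uh, vh in zip(weights, sims, u, v))


def span(indices):
    """l(I): positions covered from the first to the last index, gaps included."""
    return indices[-1] - indices[0] + 1


def kernel_n(s, t, n, lam, weights, sims):
    """K_n(s, t) by explicit enumeration of all index tuples I and J of size n."""
    total = 0.0
    for I in combinations(range(len(s)), n):
        for J in combinations(range(len(t)), n):
            match = prod(symbol_sim(s[i], t[j], weights, sims) for i, j in zip(I, J))
            total += lam ** (span(I) + span(J)) * match
    return total


def kernel_combined(s, t, mus, lam, weights, sims):
    """K_N(s, t) = sum_n mu_n * K_n(s, t), with mus[0] holding mu_1."""
    return sum(mu * kernel_n(s, t, n, lam, weights, sims) for n, mu in enumerate(mus, 1))


# Hypothetical symbols: (surface form, lemma, part of speech, synonym-set ids).
s = [("cats", "cat", "NOUN", frozenset({"feline.n.01"})),
     ("slept", "sleep", "VERB", frozenset({"sleep.v.01"}))]
t = [("kittens", "kitten", "NOUN", frozenset({"feline.n.01", "young.n.02"})),
     ("sleep", "sleep", "VERB", frozenset({"sleep.v.01"}))]
sims = [edit_sim, delta, delta, jaccard]
weights = [0.5, 1.0, 0.5, 1.0]  # illustrative values only

print(kernel_combined(s, t, mus=[1.0, 1.0], lam=0.5, weights=weights, sims=sims))
```

Because each factor has its own similarity function, a pair such as "cats" and "kittens" contributes through part of speech and overlapping synonym sets even though the lemmas differ. The sketch does not check whether a given choice of factor similarities yields a valid (positive semi-definite) kernel.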

23. An apparatus for computing a similarity measure between sequences of symbols in which each symbol is defined in terms of a plurality of factors comprising:
- a linguistic processor which takes as input a text span and enriches tokens of the text span with linguistic information to form a first sequence of symbols, each of the symbols comprising an enriched token; and
  
  a sequence kernel computation unit which takes as input the first sequence of symbols and computes a similarity measure between the first sequence of symbols and a second sequence of symbols as a function of the occurrences of optionally noncontiguous subsequences of symbols shared by the first and second sequences of symbols, whereby pairs of symbols in shared subsequences of symbols shared by the first and second sequences are independently each matched according to any one or more of the factors based on the linguistic information.

24. A method of computing a similarity measure between sequences of symbols in which each symbol is defined in terms of a plurality of factors comprising:
- computing a sequence kernel in which optionally non-contiguous subsequences of the first and second sequences are compared, the kernel having the general form:
  
  $K_{n}(s, t) = \sum_{I, J} \lambda^{l(I) + l(J)} \prod_{k = 1}^{n} A(s_{i_{k}}, t_{j_{k}})$ where:
  
  s and t represent the first and second sequences;
  
  n represents a length of the shared subsequences to be used in the computation;
  
  I and J represent two of the subsequences of size n of indices ranging on the positions of the symbols in s and t respectively;
  
  $\lambda$ represents an optional decay factor which weights gaps in non-contiguous sequences;
  
  l(I) represents the length, as the number of symbols plus any gaps, spanned by the subsequence I in s;
  
  l(J) represents the number of symbols plus any gaps, spanned by the subsequence J in t, respectively;
  
  $A(s_{i_{k}}, t_{j_{k}})$ is a function of two symbols $s_{i_{k}}$, $t_{j_{k}}$ in the sequences s and t respectively, the function quantifying the similarity between the two symbols according to the set of factors.
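
A sum over all index tuples I and J grows combinatorially, so in practice a kernel of this form is usually evaluated by dynamic programming. The sketch below adapts the well-known recursion for gap-weighted subsequence kernels to an arbitrary symbol similarity A. The recursion is not stated in the claims; the function name `kernel_dp`, the auxiliary table `kp`, and the equality-based example similarity are assumptions made for illustration. The table entry `kp[a][b]` holds an auxiliary quantity over the prefixes `s[:a]` and `t[:b]` in which each subsequence is weighted by its length measured from its first index to the end of the prefix.

```python
def kernel_dp(s, t, n, lam, sim):
    """Evaluate K_n(s, t) = sum_{I,J} lam^(l(I)+l(J)) * prod_k sim(s[i_k], t[j_k]).

    sim(u, v) plays the role of A(u, v). Requires n >= 1.
    Runs in O(n * len(s) * len(t)^2) time in this unoptimized form.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    ls, lt = len(s), len(t)

    # Auxiliary table for subsequence length 0: equal to 1 on every prefix pair.
    kp = [[1.0] * (lt + 1) for _ in range(ls + 1)]

    # Build the auxiliary tables for lengths 1 .. n-1.
    for _ in range(1, n):
        nxt = [[0.0] * (lt + 1) for _ in range(ls + 1)]
        for a in range(1, ls + 1):
            for b in range(1, lt + 1):
                # Case 1: position a of s is unused; the prefix grew by one symbol.
                total = lam * nxt[a - 1][b]
                # Case 2: position a of s is aligned with position j of t.
                for j in range(1, b + 1):
                    total += sim(s[a - 1], t[j - 1]) * kp[a - 1][j - 1] * lam ** (b - j + 2)
                nxt[a][b] = total
        kp = nxt

    # Close each subsequence at its last aligned pair (a, j).
    result = 0.0
    for a in range(1, ls + 1):
        for j in range(1, lt + 1):
            result += sim(s[a - 1], t[j - 1]) * kp[a - 1][j - 1] * lam ** 2
    return result


def exact(u, v):
    """Example similarity: 1 for identical symbols, 0 otherwise."""
    return 1.0 if u == v else 0.0


print(kernel_dp(list("abc"), list("abc"), n=2, lam=0.5, sim=exact))
```

For the example call, the three shared subsequences "ab", "bc" and "ac" contribute 0.5^4, 0.5^4 and 0.5^6 respectively. Any symbol similarity, such as a weighted sum over linguistic factors, can be passed as `sim` without changing the recursion.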

Current Assignee
Xerox Corporation (Xerox Holdings Corp.)
Original Assignee
Xerox Corporation (Xerox Holdings Corp.)
Inventors
Mahe, Pierre; Cancedda, Nicola

Granted Patent

US 8,077,984 B2
US Class Current

382/229
CPC Class Codes

G06F 18/2413   based on distances to train...

G06F 40/268   Morphological analysis

G06F 40/30   Semantic analysis

G06V 30/274   Syntactic or semantic conte...