METHOD FOR COMPUTING SIMILARITY BETWEEN TEXT SPANS USING FACTORED WORD SEQUENCE KERNELS
Abstract
A computer implemented method and an apparatus for comparing spans of text are disclosed. The method includes computing a similarity measure between a first sequence of symbols representing a first text span and a second sequence of symbols representing a second text span as a function of the occurrences of optionally noncontiguous subsequences of symbols shared by the two sequences of symbols. Each of the symbols comprises at least one consecutive word and is defined according to a set of linguistic factors. Pairs of symbols in the first and second sequences that form a shared subsequence of symbols are each matched according to at least one of the factors.
24 Claims
1. A method of comparing spans of text comprising:
computing a similarity measure between a first sequence of symbols representing a first text span and a second sequence of symbols representing a second text span as a function of the occurrences of optionally noncontiguous subsequences of symbols shared by the two sequences of symbols, wherein each of the symbols comprises at least one consecutive word, the words being enriched with linguistic information allowing them to be defined according to a set of linguistic factors, whereby pairs of symbols in the first and second sequences forming a shared subsequence of symbols are each matched according to at least one of the factors and wherein all pairs of matching symbols in a shared subsequence need not match according to the same factor.
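The factor-wise matching recited in claim 1 can be illustrated with a small brute-force sketch. It is not the patented implementation: the three factors (surface form, lemma, part of speech), the example sentences, and the function names are all assumptions chosen for illustration. The sketch enumerates every pair of length-n index tuples and keeps those in which each pair of symbols agrees on at least one factor, recording which factors matched.

```python
from itertools import combinations

# Assumed factor set for illustration: surface form, lemma, part of speech.
FACTORS = ("surface", "lemma", "pos")

def matched_factors(u, v):
    """Return the names of the factors on which symbols u and v agree."""
    return [name for name, a, b in zip(FACTORS, u, v) if a == b]

def shared_subsequences(s, t, n):
    """Yield (I, J, per_pair) for every pair of increasing index tuples of
    length n whose symbols match pairwise on at least one factor.
    Indices need not be adjacent, so gaps are allowed."""
    for I in combinations(range(len(s)), n):
        for J in combinations(range(len(t)), n):
            per_pair = [matched_factors(s[i], t[j]) for i, j in zip(I, J)]
            if all(per_pair):  # every pair matched on some factor
                yield I, J, per_pair

# Hypothetical enriched text spans.
s = [("cats", "cat", "NOUN"), ("chase", "chase", "VERB"), ("mice", "mouse", "NOUN")]
t = [("cat", "cat", "NOUN"), ("chased", "chase", "VERB"), ("dogs", "dog", "NOUN")]

pairs = list(shared_subsequences(s, t, 2))
for I, J, per_pair in pairs:
    print(I, J, per_pair)
```

In the shared subsequence at positions (1, 2) of both spans, "chase"/"chased" agree on lemma and part of speech while "mice"/"dogs" agree on part of speech only, which shows that the pairs in one subsequence need not match on the same factor.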
23. An apparatus for computing a similarity measure between sequences of symbols in which each symbol is defined in terms of a plurality of factors comprising:
a linguistic processor which takes as input a text span and enriches tokens of the text span with linguistic information to form a first sequence of symbols, each of the symbols comprising an enriched token; and
a sequence kernel computation unit which takes as input the first sequence of symbols and computes a similarity measure between the first sequence of symbols and a second sequence of symbols as a function of the occurrences of optionally noncontiguous subsequences of symbols shared by the first and second sequences of symbols, whereby pairs of symbols in shared subsequences of symbols shared by the first and second sequences are independently each matched according to any one or more of the factors based on the linguistic information.
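As a rough illustration of the first component of claim 23, the sketch below shows a toy linguistic processor that turns a text span into a sequence of enriched tokens. The `Symbol` type, the `TOY_LEXICON` table, and the whitespace tokenization are hypothetical stand-ins; a real system would presumably use a proper tokenizer, lemmatizer, and part-of-speech tagger.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Symbol:
    """An enriched token: one value per linguistic factor (assumed set)."""
    surface: str
    lemma: str
    pos: str

# Hypothetical hand-built lexicon standing in for real linguistic analysis.
TOY_LEXICON = {
    "cats": ("cat", "NOUN"),
    "chase": ("chase", "VERB"),
    "chased": ("chase", "VERB"),
    "mice": ("mouse", "NOUN"),
}

def enrich(text_span):
    """Tokenize on whitespace and attach a lemma and part of speech to each token.
    Unknown tokens keep their surface form as lemma and get the tag 'UNK'."""
    symbols = []
    for token in text_span.lower().split():
        lemma, pos = TOY_LEXICON.get(token, (token, "UNK"))
        symbols.append(Symbol(token, lemma, pos))
    return symbols

first_sequence = enrich("Cats chase mice")
print(first_sequence)
```

The resulting sequence of symbols is what a sequence kernel computation unit would take as input, comparing it factor by factor against a second sequence produced the same way.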
24. A method of computing a similarity measure between sequences of symbols in which each symbol is defined in terms of a plurality of factors comprising:
computing a sequence kernel in which optionally non-contiguous subsequences of the first and second sequences are compared, the kernel having the general form:
$$K_n(s,t) = \sum_{I,J} \lambda^{l(I)+l(J)} \prod_{k=1}^{n} A(s_{i_k}, t_{j_k})$$
where:
s and t represent the first and second sequences;
n represents a length of the shared subsequences to be used in the computation;
I and J represent two of the subsequences of size n of indices ranging on the positions of the symbols in s and t respectively;
λ represents an optional decay factor which weights gaps in non-contiguous sequences;
l(I) represents the length, as the number of symbols plus any gaps, spanned by the subsequence I in s;
l(J) represents the number of symbols plus any gaps, spanned by the subsequence J in t; and
$A(s_{i_k}, t_{j_k})$ is a function of two symbols $s_{i_k}$, $t_{j_k}$ in the sequences s and t respectively, the function quantifying the similarity between the two symbols according to the set of factors.
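The kernel of claim 24 can be sketched by direct enumeration, assuming the general form is the usual gap-weighted sum: over all index tuples I and J of size n, the decay factor λ raised to l(I) + l(J), multiplied by the product of the symbol similarities A. The form of A used here (a weighted count of agreeing factors) and all names are assumptions; the claim only requires that A quantify similarity according to the set of factors. This brute-force version is exponential in n and is meant for checking small cases, not for practical use.

```python
from itertools import combinations
from math import prod

def symbol_similarity(u, v, weights):
    """A(u, v): sum of the weights of the factors on which u and v agree.
    Assumed form, chosen only for illustration."""
    return sum(w for w, a, b in zip(weights, u, v) if a == b)

def span(indices):
    """l(I): positions covered from the first to the last index, gaps included."""
    return indices[-1] - indices[0] + 1

def factored_kernel(s, t, n, lam, weights):
    """Sum over all index tuples I in s and J in t of size n (n >= 1) of
    lam ** (l(I) + l(J)) times the product of A over the aligned symbols."""
    total = 0.0
    for I in combinations(range(len(s)), n):
        for J in combinations(range(len(t)), n):
            match = prod(symbol_similarity(s[i], t[j], weights) for i, j in zip(I, J))
            total += lam ** (span(I) + span(J)) * match
    return total

# Single-factor toy sequences: the gap in s costs one extra power of lam.
contiguous = factored_kernel([("a",), ("b",)], [("a",), ("b",)], 2, 0.5, (1.0,))
gapped = factored_kernel([("a",), ("x",), ("b",)], [("a",), ("b",)], 2, 0.5, (1.0,))
print(contiguous, gapped)
```

With a single factor and a 0/1 similarity this reduces to a plain gap-weighted subsequence count; supplying several factors with separate weights lets a pair of symbols contribute even when it agrees only on, say, part of speech.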
Specification