System and method for handling the confounding effect of document length on vector-based similarity scores
Abstract
A computer-implemented method, system, and computer program product for generating vector-based similarity scores in text document comparisons considering confounding effects of document length. Vector-based methods for comparing the semantic similarity between texts (such as Content Vector Analysis and Random Indexing) have a characteristic which may reduce their usefulness for some applications: the similarity estimates they produce are strongly correlated with the lengths of the texts compared. The statistical basis for this confound is described, and suggests the application of a pivoted normalization method from information retrieval to correct for the effect of document length. In two text categorization experiments, Random Indexing similarity scores using pivoted normalization are shown to perform significantly better than standard vector-based similarity estimation methods.
6 Claims
1. A computer-implemented method of generating vector-based similarity scores in text document comparisons considering document length, comprising:
computing a mean of a number of word types of two text documents to be compared;
determining a similarity score with a vector-based similarity model, wherein the vector-based similarity model is a Random Indexing model and wherein a normalization slope parameter has a value of 10;
performing pivoted document length normalization on the similarity score using the mean of the number of word types of the two text documents as a normalization affected by both text documents and using the normalization slope parameter; and
outputting a normalized similarity score.
2. A computer-implemented method of generating vector-based similarity scores in text document comparisons considering document length, comprising:
computing a mean of a number of word types of two text documents to be compared;
determining a similarity score with a vector-based similarity model, wherein the vector-based similarity model is a Content-Vector Analysis model and wherein a normalization slope parameter has a value of 5;
performing pivoted document length normalization on the similarity score using the mean of the number of word types of the two text documents as a normalization affected by both text documents and using the normalization slope parameter; and
outputting a normalized similarity score.
3. A computer system for generating vector-based similarity scores in text document comparisons considering document length, comprising:
a computer programmed with instructions that, when executed, cause the computer to execute steps comprising:
computing a mean of a number of word types of two text documents;
determining a similarity score with a vector-based similarity model, wherein the vector-based similarity model is a Random Indexing model and wherein a normalization slope parameter has a value of 10;
performing pivoted document length normalization on the similarity score using the mean of the number of word types of the two text documents as a normalization affected by both text documents and using the normalization slope parameter; and
outputting a normalized similarity score.
4. A computer system for generating vector-based similarity scores in text document comparisons considering document length, comprising:
a computer programmed with instructions that, when executed, cause the computer to execute steps comprising:
computing a mean of a number of word types of two text documents to be compared;
determining a similarity score with a vector-based similarity model, wherein the vector-based similarity model is a Content-Vector Analysis model and wherein a normalization slope parameter has a value of 5;
performing pivoted document length normalization on the similarity score using the mean of the number of word types of the two text documents as a normalization affected by both text documents and using the normalization slope parameter; and
outputting a normalized similarity score.
5. An article of manufacture comprising a non-transitory computer-readable storage medium for causing a computer to generate vector-based similarity scores in text document comparisons considering document length, said computer readable medium including programming instructions that, when executed, cause the computer to execute steps comprising:
computing a mean of a number of word types of two text documents;
determining a similarity score with a vector-based similarity model, wherein the vector-based similarity model is a Random Indexing model and wherein a normalization slope parameter has a value of 10;
performing pivoted document length normalization on the similarity score using the mean of the number of word types of the two text documents as a normalization affected by both text documents and using the normalization slope parameter; and
outputting a normalized similarity score.
6. An article of manufacture comprising a non-transitory computer-readable storage medium for causing a computer to generate vector-based similarity scores in text document comparisons considering document length, said computer readable medium including programming instructions that, when executed, cause the computer to execute steps comprising:
computing a mean of a number of word types of two text documents to be compared;
determining a similarity score with a vector-based similarity model, wherein the vector-based similarity model is a Content-Vector Analysis model and wherein a normalization slope parameter has a value of 5;
performing pivoted document length normalization on the similarity score using the mean of the number of word types of the two text documents as a normalization affected by both text documents and using the normalization slope parameter; and
outputting a normalized similarity score.
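The four steps recited in the claims above (mean word-type count, similarity score, pivoted normalization, output) can be sketched roughly as follows. This is an illustrative sketch only, not the patented implementation. It assumes the classic information-retrieval form of pivoted normalization, in which a score is divided by `(1 - slope) * pivot + slope * norm`, and it substitutes a plain term-frequency cosine for the Random Indexing or Content-Vector Analysis model. The function names, the `pivot` argument, and the slope of 0.25 used in the example are hypothetical. The claims recite slope values of 10 and 5, which suggests a parameterization that this section does not spell out.

```python
import math
from collections import Counter


def tokens(text):
    # Naive whitespace tokenizer; a real system would use a proper one.
    return text.lower().split()


def mean_word_types(text_a, text_b):
    # Word types are distinct words; the mean depends on both documents.
    return (len(set(tokens(text_a))) + len(set(tokens(text_b)))) / 2


def cosine_similarity(text_a, text_b):
    # Stand-in for the vector-based model (assumption: raw term-frequency
    # vectors instead of Random Indexing or Content-Vector Analysis).
    a, b = Counter(tokens(text_a)), Counter(tokens(text_b))
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def pivoted_normalize(score, norm, pivot, slope):
    # Assumed classic pivoted form: when norm equals pivot the divisor is
    # just pivot; larger norm values are penalized in proportion to slope.
    return score / ((1.0 - slope) * pivot + slope * norm)


def normalized_similarity(text_a, text_b, pivot, slope):
    raw = cosine_similarity(text_a, text_b)
    norm = mean_word_types(text_a, text_b)
    return pivoted_normalize(raw, norm, pivot, slope)


if __name__ == "__main__":
    doc_a = "the cat sat on the mat"
    doc_b = "the dog sat on the log"
    # pivot=5.0 and slope=0.25 are made-up illustration values.
    print(normalized_similarity(doc_a, doc_b, pivot=5.0, slope=0.25))
```

In this formulation the pivot would typically be set to an average of the normalization quantity over a reference collection, so that pairs of typical length are left roughly unchanged relative to one another while unusually long pairs are scaled down.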
Specification