System and method for handling the confounding effect of document length on vector-based similarity scores

US 9,311,390 B2
Filed: 01/29/2009
Issued: 04/12/2016
Est. Priority Date: 01/29/2008
Status: Active Grant

Abstract

A computer-implemented method, system, and computer program product for generating vector-based similarity scores in text document comparisons considering confounding effects of document length. Vector-based methods for comparing the semantic similarity between texts (such as Content Vector Analysis and Random Indexing) have a characteristic that may reduce their usefulness for some applications: the similarity estimates they produce are strongly correlated with the lengths of the texts compared. The statistical basis for this confound is described; it suggests applying pivoted normalization, a method from information retrieval, to correct for the effect of document length. In two text categorization experiments, Random Indexing similarity scores using pivoted normalization are shown to perform significantly better than standard vector-based similarity estimation methods.
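To make concrete what a vector-based similarity estimate is, the following minimal sketch computes a cosine similarity between two texts represented as raw term-count vectors. It is an illustration only: Content Vector Analysis systems typically apply term weighting, and Random Indexing builds its vectors differently, so this is a simplified stand-in and not the method of the patent. The function names `term_vector` and `cosine_similarity` are hypothetical.

```python
import math
from collections import Counter


def term_vector(text):
    """Represent a text as raw term counts (assumption: whitespace tokens, lowercased)."""
    return Counter(text.lower().split())


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the two term-count vectors."""
    vec_a, vec_b = term_vector(text_a), term_vector(text_b)
    # Only terms present in both texts contribute to the dot product.
    dot = sum(vec_a[t] * vec_b[t] for t in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm_a * norm_b)
```

For example, `cosine_similarity("the cat", "the dog")` shares one of two terms in each text and evaluates to approximately 0.5.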

24 Citations

6 Claims

1. A computer-implemented method of generating vector-based similarity scores in text document comparisons considering document length, comprising:
- computing a mean of a number of word types of two text documents to be compared;
  
  determining a similarity score with a vector-based similarity model, wherein the vector-based similarity model is a Random Indexing model and wherein a normalization slope parameter has a value of 10;
  
  performing pivoted document length normalization on the similarity score using the mean of the number of word types of the two text documents as a normalization affected by both text documents and using the normalization slope parameter; and
  
  outputting a normalized similarity score.

2. A computer-implemented method of generating vector-based similarity scores in text document comparisons considering document length, comprising:
- computing a mean of a number of word types of two text documents to be compared;
  
  determining a similarity score with a vector-based similarity model, wherein the vector-based similarity model is a Content-Vector Analysis model and wherein a normalization slope parameter has a value of 5;
  
  performing pivoted document length normalization on the similarity score using the mean of the number of word types of the two text documents as a normalization affected by both text documents and using the normalization slope parameter; and
  
  outputting a normalized similarity score.

3. A computer system for generating vector-based similarity scores in text document comparisons considering document length, comprising:
- a computer programmed with instructions that, when executed, cause the computer to execute steps comprising:
  
  computing a mean of a number of word types of two text documents;
  
  determining a similarity score with a vector-based similarity model, wherein the vector-based similarity model is a Random Indexing model and wherein a normalization slope parameter has a value of 10;
  
  performing pivoted document length normalization on the similarity score using the mean of the number of word types of the two text documents as a normalization affected by both text documents and using the normalization slope parameter; and
  
  outputting a normalized similarity score.

4. A computer system for generating vector-based similarity scores in text document comparisons considering document length, comprising:
- a computer programmed with instructions that, when executed, cause the computer to execute steps comprising:
  
  computing a mean of a number of word types of two text documents to be compared;
  
  determining a similarity score with a vector-based similarity model, wherein the vector-based similarity model is a Content-Vector Analysis model and wherein a normalization slope parameter has a value of 5;
  
  performing pivoted document length normalization on the similarity score using the mean of the number of word types of the two text documents as a normalization affected by both text documents and using the normalization slope parameter; and
  
  outputting a normalized similarity score.

5. An article of manufacture comprising a non-transitory computer-readable storage medium for causing a computer to generate vector-based similarity scores in text document comparisons considering document length, said computer readable medium including programming instructions that, when executed, cause the computer to execute steps comprising:
- computing a mean of a number of word types of two text documents;
  
  determining a similarity score with a vector-based similarity model, wherein the vector-based similarity model is a Random Indexing model and wherein a normalization slope parameter has a value of 10;
  
  performing pivoted document length normalization on the similarity score using the mean of the number of word types of the two text documents as a normalization affected by both text documents and using the normalization slope parameter; and
  
  outputting a normalized similarity score.

6. An article of manufacture comprising a non-transitory computer-readable storage medium for causing a computer to generate vector-based similarity scores in text document comparisons considering document length, said computer readable medium including programming instructions that, when executed, cause the computer to execute steps comprising:
- computing a mean of a number of word types of two text documents to be compared;
  
  determining a similarity score with a vector-based similarity model, wherein the vector-based similarity model is a Content-Vector Analysis model and wherein a normalization slope parameter has a value of 5;
  
  performing pivoted document length normalization on the similarity score using the mean of the number of word types of the two text documents as a normalization affected by both text documents and using the normalization slope parameter; and
  
  outputting a normalized similarity score.
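The steps recited in the claims (mean number of word types, then pivoted normalization of a similarity score) can be sketched as follows. The claims do not state the normalization formula, so this sketch assumes the standard pivoted form from information retrieval, in which the score is divided by `(1 - slope) * pivot + slope * length`. Here `length` is the mean number of word types of the two documents, and `pivot` is assumed to be a collection-level average of that quantity. All function and parameter names are hypothetical, and the tokenization (lowercased whitespace splitting) is a simplification.

```python
def count_word_types(text):
    """Number of distinct words (types) in a text; tokenization is an assumption."""
    return len(set(text.lower().split()))


def mean_word_types(text_a, text_b):
    """Mean number of word types of the two documents being compared."""
    return (count_word_types(text_a) + count_word_types(text_b)) / 2


def pivoted_normalize(score, length, pivot, slope):
    """Divide a similarity score by an assumed pivoted length normalization factor.

    Assumed formula (not taken from the claims):
        factor = (1 - slope) * pivot + slope * length
    """
    factor = (1.0 - slope) * pivot + slope * length
    if factor <= 0:
        # Under this assumed formula, large slopes and short documents
        # can drive the factor to zero or below.
        raise ValueError("normalization factor must be positive")
    return score / factor
```

When `length` equals `pivot`, the factor reduces to `pivot` whatever the slope, so `pivoted_normalize(0.5, 100, 100, 10)` is approximately 0.005. Under the assumed formula, a slope of 10 yields a non-positive factor once `length` falls to 90% of `pivot` or below, so the claimed slope values of 10 and 5 may apply to a differently scaled formulation than the one sketched here.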

Current Assignee: Educational Testing Service
Original Assignee: Educational Testing Service
Inventors: Higgins, Derrick C.
Primary Examiner(s): Reyes, Mariela
Assistant Examiner(s): Black, Linh
Application Number: US12/362,380
Publication Number: US 20090190839A1
Time in Patent Office: 2,630 Days
Field of Search: 707/749-750
US Class Current: 1/1
CPC Class Codes: G06F 16/3347 using vector based model
