Multiple correlation measures for measuring query similarity

US 8,825,571 B1
Filed: 06/04/2013
Issued: 09/02/2014
Est. Priority Date: 04/30/2010
Status: Active Grant

Abstract

Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for determining query suggestions from multiple correlation measures. In one aspect, a method includes receiving a first query and second queries, each of the first and second queries including one or more terms; for each second query and a linear model, receiving correlation scores measuring the correlation between the first query and the respective second query, each correlation score received from a respective correlation process, and each respective correlation process being different from the other respective correlation processes, and applying the linear model to the plurality of correlation scores to determine a combined correlation score that quantifies a combined correlation between the first query and the respective second query based on the plurality of correlation scores. The second queries are ranked in an order according to their respective combined correlation scores.
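
The combination step described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names, the score names (`temporal`, `distributional`, `query_log`), and the weights are hypothetical, chosen only to show a linear model applied to several correlation scores followed by ranking.

```python
from typing import Dict, List, Tuple


def combined_score(scores: Dict[str, float],
                   weights: Dict[str, float],
                   bias: float = 0.0) -> float:
    """Apply a linear model: bias plus the weighted sum of the correlation scores."""
    return bias + sum(weights[name] * value for name, value in scores.items())


def rank_candidates(candidates: Dict[str, Dict[str, float]],
                    weights: Dict[str, float],
                    bias: float = 0.0) -> List[Tuple[str, float]]:
    """Return (second query, combined score) pairs, highest score first."""
    scored = [(query, combined_score(scores, weights, bias))
              for query, scores in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical weights and per-process scores, for illustration only.
WEIGHTS = {"temporal": 0.5, "distributional": 0.3, "query_log": 0.2}
CANDIDATES = {
    "a": {"temporal": 0.9, "distributional": 0.8, "query_log": 0.7},
    "b": {"temporal": 0.1, "distributional": 0.2, "query_log": 0.3},
}
RANKED = rank_candidates(CANDIDATES, WEIGHTS)
```

Each correlation process contributes one named score per candidate, so adding another process in this sketch only requires another key in the score and weight dictionaries.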

20 Claims

1. A computer-implemented method performed by data processing apparatus, the method comprising:
- receiving a first query and a plurality of second queries;
  
  determining a temporal correlation score between the first query and each second query based on a comparison of a temporal series of occurrences of elements of the first query in a first corpus comprising a first document of a first document type and a temporal series of occurrences of elements of the second query in a second different textual corpus comprising a second document of a second document type that differs from the first document type, wherein the comparison is based on the first document and the second document having timestamps in a same time period;
  
  computing a similarity score for the first query and a second query, the similarity score between the first query and a second query being computed based on the temporal correlation score between the first document and the second document; and
  
  ranking the second query according to the similarity score.
- - 2. The method of claim 1, further comprising:
    - determining a distributional similarity score between the first query and the second query, the determination of the distributional similarity score between the first query and a second query being based on a comparison of first frequencies of terms that co-occur in text with terms of the first query and second frequencies of terms that co-occur in text with terms of the second query, wherein computing a similarity score for the first query and a second query comprises providing, as input to a trained model, the distributional similarity score between the first query and each second query and the temporal correlation score between the first query and the second query.
  - 3. The method of claim 2, further comprising determining the distributional similarity score between the first query and each second query comprising:
    - for the first query and each respective second query:
      
      selecting, for each query term in the respective query, context terms from terms included in a third corpus, wherein each context term is selected based on a distance metric between the query term included in the respective query and the terms included in the third corpus;
      
      generating, for each query term in the respective query, a context vector associated with the query term, the context vector having a plurality of context vector elements, each context vector element corresponding to a term included in the third corpus;
      
      for each context vector, determining, for each context vector element corresponding to a selected context term in the context vector, a frequency value based on a measure of occurrence of the selected context term in the context terms selected for the query term associated with the context vector; and
      
      generating a query vector for the respective query based on the context vectors for each query term in the respective query, the query vector having a plurality of query vector elements, each query vector element corresponding to a term in the third corpus, and each query vector element having a value based on the values of the corresponding vector elements in the context vectors; and
      
      determining, from the query vectors for the first query and each respective second query, the distributional similarity score.
  - 4. The method of claim 3, wherein the distributional similarity score is based on a cosine similarity, a dot-product, a mutual information, a Jensen-Shannon divergence, or a dice coefficient.
  - 5. The method of claim 2, further comprising:
    - determining a query correlation score between the first query and the second query based on a comparison of a temporal series of occurrences of elements of the first query in a query log and a temporal series of occurrences of elements of the second query in the query log, wherein computing a similarity score for the first query and a second query comprises providing, as input to the trained model, the query correlation score between the first query and the second query, the distributional similarity score between the first query and the second query, and the temporal correlation score between the first query and the second query.
  - 6. The method of claim 5, further comprising:
    - training the linear model based on annotated queries using a machine learning process.
  - 7. The method of claim 1, wherein ranking the plurality of second queries in an order according to their respective similarity scores comprises determining a Boolean classification of each second query with respect to the first query indicative of a semantic relevance of the second query with respect to the first query.
  - 8. The method of claim 1, wherein the temporal correlation score is based on one of a cosine similarity, a dot-product, a mutual information, a Jensen-Shannon divergence, or a dice coefficient.
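
As a rough illustration of the temporal correlation score of claims 1 and 8, the following Python sketch counts, per time period, the documents in two different corpora that mention each query, then compares the two series with cosine similarity (one of the measures claim 8 lists). It rests on simplifying assumptions not found in the claims: each query is a single term, documents are `(period, text)` pairs, and all function names are hypothetical.

```python
import math
from collections import Counter
from typing import Iterable, List, Sequence, Tuple

Doc = Tuple[int, str]  # (time period of the timestamp, document text)


def occurrence_series(term: str, docs: Iterable[Doc],
                      periods: Sequence[int]) -> List[int]:
    """Count, for each period, the documents whose text contains the term."""
    counts: Counter = Counter()
    for period, text in docs:
        if term.lower() in text.lower().split():
            counts[period] += 1
    return [counts[p] for p in periods]


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length series; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def temporal_correlation(first_query: str, second_query: str,
                         first_corpus: Iterable[Doc],
                         second_corpus: Iterable[Doc],
                         periods: Sequence[int]) -> float:
    """Compare occurrences of each query in its own corpus over the same periods."""
    series_1 = occurrence_series(first_query, first_corpus, periods)
    series_2 = occurrence_series(second_query, second_corpus, periods)
    return cosine(series_1, series_2)


# Made-up toy corpora of two document types, for illustration only.
NEWS = [(1, "olympics opening"), (1, "olympics medals"), (2, "election day")]
BLOGS = [(1, "medals table"), (1, "medals count"), (2, "weather")]
```

Aligning both series on the same list of periods is what restricts the comparison to documents with timestamps in the same time period.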

9. A system comprising:
- one or more computers and one or more storage devices storing instructions that are operable, when executed by the one or more computers, to cause the one or more computers to perform operations comprising:
  
  receiving a first query and a plurality of second queries;
  
  determining a temporal correlation score between the first query and each second query based on a comparison of a temporal series of occurrences of elements of the first query in a first corpus comprising a first document of a first document type and a temporal series of occurrences of elements of the second query in a second different textual corpus comprising a second document of a second document type that differs from the first document type, wherein the comparison is based on the first document and the second document having timestamps in a same time period;
  
  computing a similarity score for the first query and a second query, the similarity score between the first query and a second query being computed based on the temporal correlation score between the first document and the second document; and
  
  ranking the second query according to the similarity score.
- - 10. The system of claim 9, wherein the operations further comprise:
    - determining a distributional similarity score between the first query and the second query, the determination of the distributional similarity score between the first query and a second query being based on a comparison of first frequencies of terms that co-occur in text with terms of the first query and second frequencies of terms that co-occur in text with terms of the second query, wherein computing a similarity score for the first query and a second query comprises providing, as input to a trained model, the distributional similarity score between the first query and each second query and the temporal correlation score between the first query and the second query.
  - 11. The system of claim 10, wherein the operations further comprise determining the distributional similarity score between the first query and each second query comprising:
    - for the first query and each respective second query:
      
      selecting, for each query term in the respective query, context terms from terms included in a third corpus, wherein each context term is selected based on a distance metric between the query term included in the respective query and the terms included in the third corpus;
      
      generating, for each query term in the respective query, a context vector associated with the query term, the context vector having a plurality of context vector elements, each context vector element corresponding to a term included in the third corpus;
      
      for each context vector, determining, for each context vector element corresponding to a selected context term in the context vector, a frequency value based on a measure of occurrence of the selected context term in the context terms selected for the query term associated with the context vector; and
      
      generating a query vector for the respective query based on the context vectors for each query term in the respective query, the query vector having a plurality of query vector elements, each query vector element corresponding to a term in the third corpus, and each query vector element having a value based on the values of the corresponding vector elements in the context vectors; and
      
      determining, from the query vectors for the first query and each respective second query, the distributional similarity score.
  - 12. The system of claim 11, wherein the distributional similarity score is based on a cosine similarity, a dot-product, a mutual information, a Jensen-Shannon divergence, or a dice coefficient.
  - 13. The system of claim 10, wherein the operations further comprise:
    - determining a query correlation score between the first query and the second query based on a comparison of a temporal series of occurrences of elements of the first query in a query log and a temporal series of occurrences of elements of the second query in the query log, wherein computing a similarity score for the first query and a second query comprises providing, as input to the trained model, the query correlation score between the first query and the second query, the distributional similarity score between the first query and the second query, and the temporal correlation score between the first query and the second query.
  - 14. The system of claim 13, wherein the operations further comprise training the linear model based on annotated queries using a machine learning process.
  - 15. The system of claim 9, wherein ranking the plurality of second queries in an order according to their respective similarity scores comprises determining a Boolean classification of each second query with respect to the first query indicative of a semantic relevance of the second query with respect to the first query.
  - 16. The system of claim 9, wherein the temporal correlation score is based on one of a cosine similarity, a dot-product, a mutual information, a Jensen-Shannon divergence, or a dice coefficient.
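
The context-vector procedure recited in claims 3 and 11 can be sketched in Python as follows. This is one possible reading under stated assumptions, not the claimed implementation: the distance metric is taken to be a fixed token window, the query vector is taken to be the sum of the per-term context vectors, the score is cosine similarity, and every name (`context_vector`, `query_vector`, `window`) is hypothetical.

```python
import math
from collections import Counter
from typing import List

Corpus = List[List[str]]  # each document is a list of tokens


def context_vector(term: str, corpus: Corpus, window: int = 2) -> Counter:
    """Count terms occurring within `window` tokens of each occurrence of `term`."""
    vec: Counter = Counter()
    for tokens in corpus:
        for i, tok in enumerate(tokens):
            if tok != term:
                continue
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vec[tokens[j]] += 1
    return vec


def query_vector(query: str, corpus: Corpus, window: int = 2) -> Counter:
    """Combine the context vectors of all query terms by element-wise addition."""
    total: Counter = Counter()
    for term in query.split():
        total.update(context_vector(term, corpus, window))
    return total


def cosine_sparse(u: Counter, v: Counter) -> float:
    """Cosine similarity of two sparse count vectors; 0.0 if either is empty."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0


def distributional_similarity(q1: str, q2: str, corpus: Corpus,
                              window: int = 2) -> float:
    """Score two queries by the similarity of the contexts their terms appear in."""
    return cosine_sparse(query_vector(q1, corpus, window),
                         query_vector(q2, corpus, window))


# Made-up toy corpus standing in for the third corpus, for illustration only.
THIRD_CORPUS = [
    ["cheap", "flights", "to", "paris"],
    ["cheap", "tickets", "to", "paris"],
    ["stock", "market", "news"],
]
```

In this toy corpus "flights" and "tickets" never co-occur, yet they score as highly similar because they share the same neighbouring terms, which is the point of a distributional measure.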

17. A computer program product, encoded on one or more non-transitory computer storage media, comprising instructions that when executed by one or more computers cause the one or more computers to perform operations comprising:
- receiving a first query and a plurality of second queries;
  
  determining a temporal correlation score between the first query and each second query based on a comparison of a temporal series of occurrences of elements of the first query in a first corpus comprising a first document of a first document type and a temporal series of occurrences of elements of the second query in a second different textual corpus comprising a second document of a second document type that differs from the first document type, wherein the comparison is based on the first document and the second document having timestamps in a same time period;
  
  computing a similarity score for the first query and a second query, the similarity score between the first query and a second query being computed based on the temporal correlation score between the first document and the second document; and
  
  ranking the second query according to the similarity score.
- - 18. The computer program product of claim 17, wherein the operations further comprise:
    - determining a distributional similarity score between the first query and the second query, the determination of the distributional similarity score between the first query and a second query being based on a comparison of first frequencies of terms that co-occur in text with terms of the first query and second frequencies of terms that co-occur in text with terms of the second query, wherein computing a similarity score for the first query and a second query comprises providing, as input to a trained model, the distributional similarity score between the first query and each second query and the temporal correlation score between the first query and the second query.
  - 19. The computer program product of claim 18, wherein the operations further comprise:
    - determining a query correlation score between the first query and the second query based on a comparison of a temporal series of occurrences of elements of the first query in a query log and a temporal series of occurrences of elements of the second query in the query log, wherein computing a similarity score for the first query and a second query comprises providing, as input to the trained model, the query correlation score between the first query and the second query, the distributional similarity score between the first query and the second query, and the temporal correlation score between the first query and the second query.
  - 20. The computer program product of claim 19, wherein the operations further comprise training the linear model based on annotated queries using a machine learning process.
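
Claims 6, 14, and 20 recite training the linear model on annotated queries with a machine learning process, and claims 7 and 15 recite a Boolean classification of each second query. The claims do not name a learning algorithm; the Python sketch below swaps in a plain perceptron purely as an example. The feature order (temporal, distributional, query-log score), the annotated examples, and all names are hypothetical.

```python
from typing import List, Sequence, Tuple

# (correlation scores for a query pair, annotation: related or not)
Example = Tuple[Sequence[float], bool]


def train_perceptron(examples: List[Example], epochs: int = 20,
                     lr: float = 0.1) -> Tuple[List[float], float]:
    """Learn weights and bias of a linear model with the perceptron update rule."""
    weights = [0.0] * len(examples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for features, label in examples:
            target = 1 if label else -1
            activation = bias + sum(w * x for w, x in zip(weights, features))
            predicted = 1 if activation > 0 else -1
            if predicted != target:
                # Move the decision boundary toward the misclassified example.
                weights = [w + lr * target * x
                           for w, x in zip(weights, features)]
                bias += lr * target
    return weights, bias


def is_related(features: Sequence[float], weights: Sequence[float],
               bias: float) -> bool:
    """Boolean classification: True if the linear score is positive."""
    return bias + sum(w * x for w, x in zip(weights, features)) > 0


# Made-up annotated query pairs, for illustration only.
ANNOTATED = [
    ([0.9, 0.8, 0.7], True),
    ([0.8, 0.9, 0.6], True),
    ([0.1, 0.2, 0.1], False),
    ([0.2, 0.1, 0.3], False),
]
WEIGHTS, BIAS = train_perceptron(ANNOTATED)
```

The learned linear score can also be used directly for ranking; the Boolean output is simply that score thresholded at zero.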

Current Assignee: Google LLC (Alphabet Inc.)
Original Assignee: Google Inc. (Alphabet Inc.)
Inventors: Alfonseca, Enrique; Ciaramita, Massimiliano; Hall, Keith B.
Primary Examiner(s): Gaffin, Jeffrey A
Assistant Examiner(s): Chubb, Mikayla
Application Number: US13/909,715
Time in Patent Office: 455 Days
Field of Search: None
US Class Current: 706/12
CPC Class Codes:
G06F 16/3322   using system suggestions G0...
G06F 16/951   Indexing; Web crawling tech...
G06N 5/022   Knowledge engineering; Know...
