Utilizing information redundancy to improve text searches

US 7,051,014 B2
Filed: 06/18/2003
Issued: 05/23/2006
Est. Priority Date: 06/18/2003
Status: Expired due to Fees

Abstract

Architecture for improving text searches using information redundancy. A search component is coupled with an analysis component to rerank documents returned in a search according to redundancy values. Each returned document is used to develop a corresponding word probability distribution that is further used to rerank the returned documents according to the associated redundancy values. In another aspect thereof, the query component is coupled with a projection component to project answer redundancy from one document search to another. This includes obtaining the benefit of considerable answer redundancy from a second data source by projecting the success of the search of the second data source against a first data source.
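The two building blocks named in the abstract and claims, a per-document word probability distribution and a cosine-based similarity measure, can be sketched as follows. This is an illustrative sketch only, not the patented implementation: the function names `word_distribution` and `cosine`, the whitespace tokenizer, and the unigram model are assumptions, and the claims' "cosine distance" is read here as the cosine of the angle between two distributions.

```python
import math
from collections import Counter


def word_distribution(text):
    """Return a unigram distribution {word: probability} for one document.

    Tokenization by lowercasing and whitespace splitting is an assumption.
    """
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}


def cosine(p, q):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(v * q.get(w, 0.0) for w, v in p.items())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    if norm_p == 0.0 or norm_q == 0.0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_p * norm_q)
```

Two documents with the same word frequencies score close to 1, and documents with no words in common score 0.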

61 Claims

1. A machine implemented system that facilitates data retrieval, comprising:
- a query component that receives a query to a first dataset, and a projection component that executes the query across a second dataset, and analyzes properties of results of the query on the first dataset and results of the second dataset to generate a refined version of the query to run on the first dataset to facilitate responding to the query across the first dataset, the projection component analyzes the properties of the results by determining a similarity measure that is a cosine distance for each result.
- Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19)
  - 2. The system of claim 1, the projection component executes the query across the second dataset in response to determining that projection is required on the first dataset.
  - 3. The system of claim 1, the projection component automatically executes the query across the second dataset substantially simultaneously with execution of the query across the first dataset.
  - 4. The system of claim 1, the second dataset having higher redundancy than the first dataset.
  - 5. The system of claim 1, the projection component analyzes the properties of the results by generating word probability distributions for each result.
  - 6. The system of claim 1, the results of the first and second datasets returned in the form of documents.
  - 7. The system of claim 1, the projection component determines a word probability distribution for one of the results.
  - 8. The system of claim 1, the projection component evaluates the results of the second dataset for pairwise information redundancy with the results of the first dataset.
  - 9. The system of claim 1, the query component ranks the results of the first dataset for output.
  - 10. The system of claim 1, the query component reranks the results of the first dataset for output according to projection information received from the projection component.
  - 11. The system of claim 1, the projection component generates the refined version of the query in accordance with properties of a good answer document of the second dataset.
  - 12. The system of claim 1, the projection component determines the average pairwise redundancy of the results of the second dataset with the results of the first dataset.
  - 13. The system of claim 1, the projection component automatically determines the number of results of the second dataset query to use to generate the refined version of the query.
  - 14. The system of claim 13, the number of results determined according to a classification scheme that classifies the results according to predetermined criteria.
  - 15. The system of claim 13, the number of results determined according to a classification scheme that selects the results according to a redundancy value.
  - 16. The system of claim 1, further comprising a classifier that determines a number of results of the second dataset to be used for generating the refined version of the query.
  - 17. The system of claim 16, the classifier is a support vector machine.
  - 18. A computer readable medium having stored thereon the components of claim 1.
  - 19. The system of claim 1, the properties of the results related to at least one of textual content, image content, and audio content.
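One way to picture the "refined version of the query" in claims 1 and 11 is term expansion: words that carry the most probability mass across the second dataset's results are appended to the original query. The claims do not say how the refined query is built, so the sketch below is an assumption throughout; `refine_query` and `extra_terms` are hypothetical names, and no stop-word filtering is applied.

```python
from collections import Counter


def refine_query(query, second_results, extra_terms=3):
    """Append the highest-mass words from second-dataset results to the query.

    Hypothetical sketch: each result contributes its unigram probabilities,
    and words already present in the query are skipped.
    """
    query_terms = query.lower().split()
    seen = set(query_terms)
    mass = Counter()
    for doc in second_results:
        words = doc.lower().split()
        if not words:
            continue
        for word, count in Counter(words).items():
            mass[word] += count / len(words)
    candidates = [(m, w) for w, m in mass.items() if w not in seen]
    # Highest mass first; ties broken alphabetically for determinism.
    candidates.sort(key=lambda mw: (-mw[0], mw[1]))
    return query_terms + [w for _, w in candidates[:extra_terms]]
```

For example, `refine_query("capital france", ["paris capital france", "paris france", "paris"], extra_terms=1)` appends `paris`, the word that recurs across the redundant second-dataset results.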

20. A machine implemented system that facilitates data retrieval, comprising:
- a query component that receives a query to a first dataset; and
  
  a projection component that executes the query across a second dataset, and generates a result set that is employed in connection with the first dataset to facilitate responding to the query, the projection component analyzes the properties of the result set by determining a similarity measure that is a cosine distance for each result.
- Dependent Claims (21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36)
  - 21. The system of claim 20, the projection component executes the query across the second dataset in response to determining that projection is required on the first dataset.
  - 22. The system of claim 20, the projection component automatically executes the query across the second dataset substantially simultaneously with execution of the query across the first dataset.
  - 23. The system of claim 20, the second dataset having higher redundancy than the first dataset.
  - 24. The system of claim 20, the projection component generates a word probability distribution for at least one result of the result set.
  - 25. The system of claim 20, the result set in the form of documents.
  - 26. The system of claim 20, the result set generated from the query across the second dataset.
  - 27. The system of claim 20, the projection component evaluates the result set for pairwise information redundancy with query results of the first dataset.
  - 28. The system of claim 20, the query component ranks results of the first dataset for output.
  - 29. The system of claim 20, the query component re-ranks results of the first dataset for output according to projection information received from the projection component.
  - 30. The system of claim 20, the projection component generates a refined version of the query in accordance with properties of a good answer document of the second dataset.
  - 31. The system of claim 20, the projection component determines the average pairwise redundancy of the result set of the second dataset with a result set of the first dataset.
  - 32. The system of claim 20, the projection component automatically determines the number of results to use to generate the result set of the query.
  - 33. The system of claim 32, the number of results determined according to a classification scheme that classifies the results according to predetermined criteria.
  - 34. The system of claim 33, the number of results determined according to a classification scheme that selects the results according to a redundancy value.
  - 35. The system of claim 20, further comprising a support vector machine that determines a number of results of the second dataset to be used for generating the refined version of the query.
  - 36. A computer according to the system of claim 20.
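Claims 32 to 35 leave open how the number of second-dataset results is chosen, apart from naming a classification scheme, a redundancy value, and (in claim 35) a support vector machine. As a deliberately simpler stand-in for a trained classifier, the sketch below keeps results whose redundancy value clears a fixed cutoff. The function name, the `threshold` of 0.3, and the `max_results` cap are all hypothetical.

```python
def select_by_redundancy(scored_results, threshold=0.3, max_results=10):
    """Keep (document, redundancy) pairs whose redundancy meets the cutoff.

    A fixed threshold stands in for the classifier named in the claims;
    both the threshold and the cap are illustrative assumptions.
    """
    kept = [(doc, score) for doc, score in scored_results if score >= threshold]
    kept.sort(key=lambda pair: -pair[1])  # most redundant first (stable sort)
    return kept[:max_results]
```

The length of the returned list is then the number of results used downstream, so it varies per query rather than being fixed in advance.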

37. A machine implemented system that facilitates data retrieval, comprising:
- a search component that executes a query and returns a dataset; and
  
  an analysis component that determines relevance of a subset of the returned dataset as a function of similarity properties thereof with respect to the entire returned dataset, the similarity properties determined according to a similarity measure that is a cosine distance measure.
- Dependent Claims (38, 39, 40, 41, 42, 43)
  - 38. The system of claim 37, the analysis component generates a word probability distribution for at least one result of the returned dataset.
  - 39. The system of claim 37, the subset in the form of documents.
  - 40. The system of claim 37, the analysis component evaluates the returned dataset for average pairwise information redundancy.
  - 41. The system of claim 37, the search component reranks results of the query according to relevance of the subset as determined by an information redundancy value.
  - 42. The system of claim 41, the information redundancy value determined by the analysis component as the average pairwise information redundancy between one result and the remaining results.
  - 43. A computer according to the system of claim 37.
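Claims 41 and 42 describe reranking by an information redundancy value computed as the average pairwise redundancy between one result and the remaining results. A minimal sketch of that idea follows, assuming unigram distributions over whitespace tokens and cosine similarity as the pairwise measure; the helper names are hypothetical.

```python
import math
from collections import Counter


def _dist(text):
    """Unigram probability distribution of a document (assumed tokenizer)."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}


def _cos(p, q):
    """Cosine similarity between two sparse distributions."""
    dot = sum(v * q.get(w, 0.0) for w, v in p.items())
    denom = math.sqrt(sum(v * v for v in p.values())) * math.sqrt(
        sum(v * v for v in q.values())
    )
    return dot / denom if denom else 0.0


def rerank_by_redundancy(docs):
    """Return (document, score) pairs, most redundant first.

    The score of a document is its mean cosine similarity to every other
    document in the same result list.
    """
    dists = [_dist(d) for d in docs]
    scored = []
    for i, p in enumerate(dists):
        sims = [_cos(p, q) for j, q in enumerate(dists) if j != i]
        score = sum(sims) / len(sims) if sims else 0.0
        scored.append((score, i))
    scored.sort(key=lambda s: (-s[0], s[1]))  # ties keep original order
    return [(docs[i], score) for score, i in scored]
```

A result that repeats what most other results say rises to the top, while an outlier that shares no vocabulary with the rest falls to the bottom.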

44. A machine implemented method of facilitating data retrieval, comprising:
- receiving a query for processing by a search engine against a first dataset;
  
  executing the query against the first dataset and a second dataset;
  
  analyzing properties of results of the first dataset query and of results of the second dataset query to determine a refined version of the query by determining a similarity measure that is a cosine distance for each result;
  
  transmitting the refined version of the query to the search engine; and
  
  reranking the results of the first dataset query according to the refined version of the query.
- Dependent Claims (45, 46, 47, 48, 49, 50, 51, 52, 53)
  - 45. The method of claim 44, the second dataset remote from the first dataset.
  - 46. The method of claim 44, further comprising reapplying the refined version of the query against the first dataset in order to obtain the re-ranked results.
  - 47. The method of claim 44, the first dataset having lower data redundancy than the second dataset.
  - 48. The method of claim 44, the query executed against the first and second dataset substantially simultaneously.
  - 49. The method of claim 44, the query executed against the second dataset only in response to the execution of the query against the first dataset returning a minimum number of results.
  - 50. The method of claim 44, the results in the form of documents, the properties of which relate to at least one of textual content, image content, audio content and hyperlink content.
  - 51. The method of claim 44, the documents are web pages returned from a website.
  - 52. The method of claim 44, the properties analyzed by a projection component that is remotely disposed from the search engine, and in operative communication with the search engine.
  - 53. The method of claim 44, the properties analyzed by:
    - generating word probability distributions for each of the results; and
      
      determining an average pairwise information redundancy value of the results of the second dataset with the results of the first dataset using a similarity measure.
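The two steps of claim 53 can be sketched as a projection: each first-dataset result is scored by its average similarity to the second-dataset results, and the first-dataset results are then reordered by that score. This is a sketch under assumptions, not the claimed method itself; `project_rerank` and its helpers are hypothetical names, and cosine similarity over unigram distributions is the assumed similarity measure.

```python
import math
from collections import Counter


def _dist(text):
    """Unigram probability distribution of a document (assumed tokenizer)."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}


def _cos(p, q):
    """Cosine similarity between two sparse distributions."""
    dot = sum(v * q.get(w, 0.0) for w, v in p.items())
    denom = math.sqrt(sum(v * v for v in p.values())) * math.sqrt(
        sum(v * v for v in q.values())
    )
    return dot / denom if denom else 0.0


def project_rerank(first_results, second_results):
    """Reorder first-dataset results by mean similarity to second-dataset results."""
    second = [_dist(d) for d in second_results]
    scored = []
    for idx, doc in enumerate(first_results):
        p = _dist(doc)
        sims = [_cos(p, q) for q in second]
        score = sum(sims) / len(sims) if sims else 0.0
        scored.append((score, idx, doc))
    scored.sort(key=lambda s: (-s[0], s[1]))  # ties keep original order
    return [doc for _, _, doc in scored]
```

The intent is that a low-redundancy first dataset (claim 47) borrows the agreement found in a larger, more redundant second dataset.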

54. A machine implemented method of facilitating data retrieval, comprising:
- receiving a query for processing by a search engine against a first dataset;
  
  executing the query against the first dataset and a second dataset, the query against the second dataset used to characterize likely properties of a good answer to the query;
  
  generating a result set from the query of the second dataset by determining the average pairwise information redundancy between results of the first dataset query and the second dataset query, the average pairwise information redundancy is based at least upon a cosine distance measurement for each result;
  
  applying the result set in a subsequent query of the first dataset; and
  
  providing a ranked output according to the result set query.
- Dependent Claims (55, 56, 57)
  - 55. The method of claim 54, the result set generated by:
    - computing a word probability distribution for each result of the first dataset query and the second dataset query; and
      
      comparing the word probability distributions to determine a level of redundancy sufficient to improve the query of the first dataset.
  - 56. The method of claim 54, further comprising classifying with a classification component the results of the second dataset query according to predetermined classification criteria.
  - 57. The method of claim 56, further comprising training the classification component according to at least one of the number of results returned, the type and/or importance of query terms used in the query, time of the query, properties of the results included in the result set, and properties of results not included in the result set.

58. A machine implemented method of facilitating data retrieval, comprising:
- processing a query against a plurality of documents;
  
  measuring information redundancy of a returned document of a return set by determining an average pairwise information redundancy value between the returned document and the remaining documents of the return set, the average pairwise information redundancy is based at least upon a cosine distance measurement for each document; and
  
  providing a ranked output of documents according to corresponding pairwise information redundancy values.
- Dependent Claims (59)
  - 59. The method of claim 58, further comprising selecting the documents associated with the higher average pairwise information redundancy values for the ranked output.
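One plausible reading of the average pairwise information redundancy in claims 58 and 59 can be written out as a formula. The claims only state that the value is "based at least upon a cosine distance measurement", so the exact form below is an assumption. Here $p_i(w)$ denotes the probability of word $w$ in returned document $d_i$, and $n$ is the size of the return set (the notation is introduced for illustration and requires `amsmath`).

```latex
\[
\operatorname{sim}(d_i, d_j)
  = \frac{\sum_{w} p_i(w)\, p_j(w)}
         {\sqrt{\sum_{w} p_i(w)^2}\,\sqrt{\sum_{w} p_j(w)^2}},
\qquad
R(d_i) = \frac{1}{n-1} \sum_{\substack{j=1 \\ j \neq i}}^{n} \operatorname{sim}(d_i, d_j)
\]
```

With three returned documents where $\operatorname{sim}(d_1, d_2) = 1$ and $d_3$ shares no words with the others, this gives $R(d_1) = R(d_2) = 0.5$ and $R(d_3) = 0$, so $d_1$ and $d_2$ would be selected ahead of $d_3$ under claim 59.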

60. A machine implemented system that facilitates data retrieval, comprising:
- means for processing a query against a plurality of documents;
  
  means for measuring information redundancy of a returned document of a return set by determining an average pairwise information redundancy value between the returned document and the remaining documents of the return set, the average pairwise information redundancy is based at least upon a cosine distance measurement for each document; and
  
  means for providing a ranked output of documents according to corresponding pairwise information redundancy values.

61. A machine implemented system that facilitates data retrieval, comprising:
- means for receiving a query for processing by a search engine against a first dataset;
  
  means for executing the query against the first dataset and a second dataset, the query against the second dataset used to characterize likely properties of a good answer to the query;
  
  means for generating a result set from the query of the second dataset by determining the average pairwise information redundancy between results of the first dataset query and the second dataset query, the average pairwise information redundancy is based at least upon a cosine distance measurement for each result;
  
  means for applying the result set in a subsequent query of the first dataset; and
  
  means for providing a ranked output according to the result set query.

Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Microsoft Corporation
Inventors: Dumais, Susan T.; Brill, Eric D.
Primary Examiner(s): Wassum, Luke S.
Assistant Examiner(s): Lewis, Cheryl Renea
Application Number: US10/464,081
Publication Number: US 20040260692A1
Time in Patent Office: 1,070 Days
Field of Search: 707/2-6, 707/100, 707/104.1, 707/203, 715/500, 715/511, 715/516
US Class Current: 1/1
CPC Class Codes:
G06F 16/3347   using vector based model
Y10S 707/99932   Access augmentation or opti...
Y10S 707/99935   Query augmenting and refini...
