Utilizing information redundancy to improve text searches

US 20040260692A1
Filed: 06/18/2003
Published: 12/23/2004
Est. Priority Date: 06/18/2003
Status: Active Grant

Abstract

Architecture for improving text searches using information redundancy. A search component is coupled with an analysis component to rerank documents returned in a search according to redundancy values. Each returned document is used to develop a corresponding word probability distribution that is further used to rerank the returned documents according to the associated redundancy values. In another aspect thereof, the query component is coupled with a projection component to project answer redundancy from one document search to another. This includes obtaining the benefit of considerable answer redundancy from a second data source by projecting the success of the search of the second data source against a first data source.
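The reranking described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the whitespace tokenizer, the function names, and the use of cosine similarity between word probability vectors as the redundancy measure (the claims refer to a "cosine distance") are all assumptions made for illustration.

```python
import math
from collections import Counter


def word_distribution(text):
    """Return a word -> probability mapping for one document (assumed tokenizer: whitespace)."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}


def cosine_similarity(p, q):
    """Cosine of the angle between two sparse word-probability vectors."""
    dot = sum(p[w] * q[w] for w in p.keys() & q.keys())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    if norm_p == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_p * norm_q)


def redundancy_values(documents):
    """Average pairwise similarity between each document and the remaining documents."""
    dists = [word_distribution(d) for d in documents]
    values = []
    for i, p in enumerate(dists):
        others = [cosine_similarity(p, q) for j, q in enumerate(dists) if j != i]
        values.append(sum(others) / len(others) if others else 0.0)
    return values


def rerank_by_redundancy(documents):
    """Order documents so those with the highest redundancy value come first."""
    values = redundancy_values(documents)
    order = sorted(range(len(documents)), key=lambda i: values[i], reverse=True)
    return [documents[i] for i in order]
```

In this sketch a document that shares no vocabulary with the rest of the return set receives a redundancy value of 0.0 and sinks to the bottom of the ranking, while documents that corroborate one another rise.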

67 Claims

1. A system that facilitates data retrieval, comprising:
- a query component that receives a query to a first dataset; and
  
  a projection component that executes the query across a second dataset, and analyzes properties of results of the query on the first dataset and results of the second dataset to generate a refined version of the query to run on the first dataset to facilitate responding to the query across the first dataset.
- View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21)
- - 2. The system of claim 1, the projection component executes the query across the second dataset in response to determining that projection is required on the first dataset.
  - 3. The system of claim 1, the projection component automatically executes the query across the second dataset substantially simultaneously with execution of the query across the first dataset.
  - 4. The system of claim 1, the second dataset having higher redundancy than the first dataset.
  - 5. The system of claim 1, the projection component analyzes the properties of the results by generating word probability distributions for each result.
  - 6. The system of claim 1, the projection component analyzes the properties of the results by determining a similarity measure for each result.
  - 7. The system of claim 6, the similarity property is a cosine distance.
  - 8. The system of claim 1, the results of the first and second datasets returned in the form of documents.
  - 9. The system of claim 1, the projection component determines a word probability distribution for one of the results.
  - 10. The system of claim 1, the projection component evaluates the results of the second dataset for pairwise information redundancy with the results of the first dataset.
  - 11. The system of claim 1, the query component ranks the results of the first dataset for output.
  - 12. The system of claim 1, the query component reranks the results of the first dataset for output according to projection information received from the projection component.
  - 13. The system of claim 1, the projection component generates the refined version of the query in accordance with properties of a good answer document of the second dataset.
  - 14. The system of claim 1, the projection component determines the average pairwise redundancy of the results of the second dataset with the results of the first dataset.
  - 15. The system of claim 1, the projection component automatically determines the number of results of the second dataset query to use to generate the refined version of the query.
  - 16. The system of claim 15, the number of results determined according to a classification scheme that classifies the results according to predetermined criteria.
  - 17. The system of claim 15, the number of results determined according to a classification scheme that selects the results according to a redundancy value.
  - 18. The system of claim 1, further comprising a classifier that determines a number of results of the second dataset to be used for generating the refined version of the query.
  - 19. The system of claim 18, the classifier is a support vector machine.
  - 20. A computer readable medium having stored thereon the components of claim 1.
  - 21. The system of claim 1, the properties of the results related to at least one of textual content, image content, and audio content.
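As a rough illustration of the projection recited in claims 1, 10, and 14, the Python sketch below scores each result from the first dataset by its average similarity to results from the second, more redundant dataset, then orders the first-dataset results by that score. The function names, the whitespace tokenizer, and the fixed `top_k` parameter are hypothetical; claims 15 to 19 have a classifier determine the number of second-dataset results, which this sketch does not attempt.

```python
import math
from collections import Counter


def word_distribution(text):
    """Return a word -> probability mapping for one document (assumed tokenizer: whitespace)."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}


def cosine_similarity(p, q):
    """Cosine of the angle between two sparse word-probability vectors."""
    dot = sum(p[w] * q[w] for w in p.keys() & q.keys())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    if norm_p == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_p * norm_q)


def project_redundancy(first_results, second_results, top_k=None):
    """Score first-dataset results by mean similarity to second-dataset results.

    top_k is a hypothetical stand-in for the automatically determined
    number of second-dataset results; None means use all of them.
    Returns (score, document) pairs, highest score first.
    """
    reference = second_results if top_k is None else second_results[:top_k]
    ref_dists = [word_distribution(d) for d in reference]
    scored = []
    for doc in first_results:
        p = word_distribution(doc)
        sims = [cosine_similarity(p, q) for q in ref_dists]
        scored.append((sum(sims) / len(sims) if sims else 0.0, doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored
```

The design choice here is to leave the first dataset's own results untouched and only reorder them, which corresponds most closely to the reranking of claim 12; generating a refined query text from the second dataset's results is outside the scope of this sketch.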

22. A system that facilitates data retrieval, comprising:
- a query component that receives a query to a first dataset; and
  
  a projection component that executes the query across a second dataset, and generates a result set that is employed in connection with the first dataset to facilitate responding to the query.
- View Dependent Claims (23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40)
- - 23. The system of claim 22, the projection component executes the query across the second dataset in response to determining that projection is required on the first dataset.
  - 24. The system of claim 22, the projection component automatically executes the query across the second dataset substantially simultaneously with execution of the query across the first dataset.
  - 25. The system of claim 22, the second dataset having higher redundancy than the first dataset.
  - 26. The system of claim 22, the projection component generates a word probability distribution for at least one result of the result set.
  - 27. The system of claim 22, the projection component determines a similarity measure for at least one result of the result set.
  - 28. The system of claim 27, the similarity property is the cosine distance.
  - 29. The system of claim 22, the result set in the form of documents.
  - 30. The system of claim 22, the result set generated from the query across the second dataset.
  - 31. The system of claim 22, the projection component evaluates the result set for pairwise information redundancy with query results of the first dataset.
  - 32. The system of claim 22, the query component ranks results of the first dataset for output.
  - 33. The system of claim 22, the query component reranks results of the first dataset for output according to projection information received from the projection component.
  - 34. The system of claim 22, the projection component generates a refined version of the query in accordance with properties of a good answer document of the second dataset.
  - 35. The system of claim 22, the projection component determines the average pairwise redundancy of the result set of the second dataset with a result set of the first dataset.
  - 36. The system of claim 22, the projection component automatically determines the number of results to use to generate the result set of the query.
  - 37. The system of claim 36, the number of results determined according to a classification scheme that classifies the results according to predetermined criteria.
  - 38. The system of claim 37, the number of results determined according to a classification scheme that selects the results according to a redundancy value.
  - 39. The system of claim 22, further comprising a support vector machine that determines a number of results of the second dataset to be used for generating the refined version of the query.
  - 40. A computer according to the system of claim 22.

41. A system that facilitates data retrieval, comprising:
- a search component that executes a query and returns a dataset; and
  
  an analysis component that determines relevance of a subset of the returned dataset as a function of similarity properties thereof with respect to the entire returned dataset.
- View Dependent Claims (42, 43, 44, 45, 46, 47, 48, 49)
- - 42. The system of claim 41, the analysis component generates a word probability distribution for at least one result of the returned dataset.
  - 43. The system of claim 41, the similarity properties determined according to a similarity measure.
  - 44. The system of claim 43, the similarity measure is a cosine distance measure.
  - 45. The system of claim 41, the subset in the form of documents.
  - 46. The system of claim 41, the analysis component evaluates the returned dataset for average pairwise information redundancy.
  - 47. The system of claim 41, the search component reranks results of the query according to relevance of the subset as determined by an information redundancy value.
  - 48. The system of claim 47, the information redundancy value determined by the analysis component as the average pairwise information redundancy between one result and the remaining results.
  - 49. A computer according to the system of claim 41.

50. A method of facilitating data retrieval, comprising:
- receiving a query for processing by a search engine against a first dataset;
  
  executing the query against the first dataset and a second dataset;
  
  analyzing properties of results of the second dataset query to determine a refined version of the query;
  
  transmitting the refined version of the query to the search engine; and
  
  reranking the results of the first dataset query according to the refined version of the query.
- View Dependent Claims (51, 52, 53, 54, 55, 56, 57, 58, 59)
- - 51. The method of claim 50, the second dataset remote from the first dataset.
  - 52. The method of claim 50, further comprising reapplying the refined version of the query against the first dataset in order to obtain the reranked results.
  - 53. The method of claim 50, the first dataset having lower data redundancy than the second dataset.
  - 54. The method of claim 50, the query executed against the first and second dataset substantially simultaneously.
  - 55. The method of claim 50, the query executed against the second dataset only in response to the execution of the query against the first dataset returning a minimum number of results.
  - 56. The method of claim 50, the results in the form of documents, the properties of which relate to at least one of textual content, image content, audio content, and hyperlink content.
  - 57. The method of claim 56, the documents are web pages returned from a website.
  - 58. The method of claim 50, the properties analyzed by a projection component that is remotely disposed from the search engine, and in operative communication with the search engine.
  - 59. The method of claim 50, the properties analyzed by:
    - generating word probability distributions for each of the results; and
      
      determining an average pairwise information redundancy value of the results of the second dataset with the results of the first dataset using a similarity measure.

60. A method of facilitating data retrieval, comprising:
- receiving a query for processing by a search engine against a first dataset;
  
  executing the query against the first dataset and a second dataset, the query against the second dataset used to characterize likely properties of a good answer to the query;
  
  generating a result set from the query of the second dataset by determining the average pairwise information redundancy between results of the first dataset query and the second dataset query;
  
  applying the result set in a subsequent query of the first dataset; and
  
  providing a ranked output according to the result set query.
- View Dependent Claims (61, 62, 63)
- - 61. The method of claim 60, the result set generated by:
    - computing a word probability distribution for each result of the first dataset query and the second dataset query; and
      
      comparing the word probability distributions to determine a level of redundancy sufficient to improve the query of the first dataset.
  - 62. The method of claim 60, further comprising classifying with a classification component the results of the second dataset query according to predetermined classification criteria.
  - 63. The method of claim 62, further comprising training the classification component according to at least one of the number of results returned, the type and/or importance of query terms used in the query, time of the query, properties of the results included in the result set, and properties of results not included in the result set.

64. A method of facilitating data retrieval, comprising:
- processing a query against a plurality of documents;
  
  measuring information redundancy of a returned document of a return set by determining an average pairwise information redundancy value between the returned document and the remaining documents of the return set; and
  
  providing a ranked output of documents according to corresponding pairwise information redundancy values.
- View Dependent Claims (65)
- - 65. The method of claim 64, further comprising selecting the documents associated with the higher average pairwise information redundancy values for the ranked output.

66. A system that facilitates data retrieval, comprising:
- means for processing a query against a plurality of documents;
  
  means for measuring information redundancy of a returned document of a return set by determining an average pairwise information redundancy value between the returned document and the remaining documents of the return set; and
  
  means for providing a ranked output of documents according to corresponding pairwise information redundancy values.

67. A system that facilitates data retrieval, comprising:
- means for receiving a query for processing by a search engine against a first dataset;
  
  means for executing the query against the first dataset and a second dataset, the query against the second dataset used to characterize likely properties of a good answer to the query;
  
  means for generating a result set from the query of the second dataset by determining the average pairwise information redundancy between results of the first dataset query and the second dataset query;
  
  means for applying the result set in a subsequent query of the first dataset; and
  
  means for providing a ranked output according to the result set query.

Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Microsoft Corporation
Inventors: Dumais, Susan T.; Brill, Eric D.
Granted Patent: US 7,051,014 B2
US Class Current: 1/1
CPC Class Codes:
- G06F 16/3347 using vector based model
- Y10S 707/99932 Access augmentation or opti...
- Y10S 707/99935 Query augmenting and refini...
