Systems, methods, and computer program products for fast and scalable proximal search for search queries

US 8,745,062 B2
Filed: 08/16/2012
Issued: 06/03/2014
Est. Priority Date: 05/24/2012
Status: Expired due to Fees

First Claim

1. A method of information retrieval from multiple documents, comprising:

splitting each document into multiple snippets of words;

generating a separate index for each snippet;

receiving an input search query including at least one sentence; and

processing the search query against each separate index of each snippet of the multiple snippets by searching query terms over each of the multiple snippets to implicitly introduce term proximity information in the information retrieval, wherein processing the search query further comprises:

creating an OR-Query of all non-stopwords in each sentence;

returning a fit value for each OR-Query, wherein a fit value represents a similarity metric that measures the amount of word content overlap between two text units; and

aggregating the fit values to provide a score for every document returned by the OR-Queries.
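
The OR-Query and fit value steps of the claim can be illustrated with a minimal sketch. The claim only says that a fit value measures word content overlap between two text units; the specific formula below (fraction of OR-Query terms found in the snippet), the tiny stopword list, and the names `or_query_terms` and `fit_value` are assumptions made for illustration, not the patented implementation.

```python
import re

# Hypothetical, deliberately tiny stopword list; a real system would use a fuller one.
STOPWORDS = {"a", "an", "and", "in", "is", "of", "on", "the", "to"}

def or_query_terms(sentence):
    """Return the distinct non-stopwords of one query sentence (the OR-Query terms)."""
    words = re.findall(r"[a-z0-9]+", sentence.lower())
    return {w for w in words if w not in STOPWORDS}

def fit_value(query_terms, snippet_text):
    """Assumed fit value: fraction of OR-Query terms that occur in the snippet."""
    if not query_terms:
        return 0.0
    snippet_words = set(re.findall(r"[a-z0-9]+", snippet_text.lower()))
    return len(query_terms & snippet_words) / len(query_terms)

terms = or_query_terms("The index is built on snippets of the document.")
score = fit_value(terms, "Each document is split into snippets before indexing.")
print(sorted(terms), score)
```

Because any single term can match (OR semantics), a snippet that contains only some of the query words still receives a non-zero fit value, and snippets containing more of them score higher.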

Abstract

Embodiments of the invention provide a method and computer program products for information retrieval from multiple documents by proximity searching for search queries. A method includes generating an index for the multiple documents, wherein the index includes words in snippets in the documents. An input search query is processed against the index by searching query terms over the snippets to introduce term proximity information implicitly in the information retrieval. Results of multiple sentence level search operations are combined as output.
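
As a rough illustration of the flow described in the abstract, the sketch below splits documents into snippets, indexes the snippet words, runs one OR-style lookup per query sentence, and sums the per-snippet scores into a document score. It is a simplified sketch under stated assumptions: snippets are fixed groups of two sentences, a single in-memory inverted index keyed by a `(doc_id, snippet_no)` pair stands in for the snippet indexes, and all function names, the stopword list, and the scoring formula are hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical stopword list, for illustration only.
STOPWORDS = {"a", "an", "and", "are", "by", "for", "in", "is", "of", "the", "to"}

def sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokens(text):
    """Lowercased non-stopword tokens of a text."""
    return [w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS]

def build_snippet_index(docs, sentences_per_snippet=2):
    """Map each term to the set of (doc_id, snippet_no) snippets that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        sents = sentences(text)
        for start in range(0, len(sents), sentences_per_snippet):
            snippet_id = (doc_id, start // sentences_per_snippet)
            snippet_text = " ".join(sents[start:start + sentences_per_snippet])
            for term in tokens(snippet_text):
                index[term].add(snippet_id)
    return index

def search(index, query):
    """Score documents by summing per-snippet overlap over all query sentences."""
    doc_scores = defaultdict(float)
    for sentence in sentences(query):
        terms = set(tokens(sentence))
        if not terms:
            continue
        hits = defaultdict(int)  # snippet_id -> number of distinct query terms matched
        for term in terms:
            for snippet_id in index.get(term, ()):
                hits[snippet_id] += 1
        for (doc_id, _snippet_no), matched in hits.items():
            doc_scores[doc_id] += matched / len(terms)
    return sorted(doc_scores.items(), key=lambda kv: kv[1], reverse=True)

docs = {
    "d1": "Proximity search ranks nearby terms higher. Snippets keep terms close "
          "together. Indexes are built per snippet. Queries run sentence by sentence.",
    "d2": "Cooking pasta requires boiling water. Search for a good sauce recipe.",
}
results = search(build_snippet_index(docs), "Proximity search over snippets.")
print(results)
```

Term proximity enters only implicitly here: query words raise a document's score together only when they fall inside the same snippet, so no positional distances between terms are ever computed.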

11 Claims

1. A method of information retrieval from multiple documents, comprising:
- splitting each document into multiple snippets of words;
  
  generating a separate index for each snippet;
  
  receiving an input search query including at least one sentence; and
  
  processing the search query against each separate index of each snippet of the multiple snippets by searching query terms over each of the multiple snippets to implicitly introduce term proximity information in the information retrieval, wherein processing the search query further comprises:
  
  creating an OR-Query of all non-stopwords in each sentence;
  
  returning a fit value for each OR-Query, wherein a fit value represents a similarity metric that measures the amount of word content overlap between two text units; and
  
  aggregating the fit values to provide a score for every document returned by the OR-Queries.
- Dependent claims (2 to 11):
  - 2. The method of claim 1, wherein the search query includes natural text input of multiple sentences, the multiple snippets are split into separate files.
  - 3. The method of claim 1, wherein processing the search query comprises:
    - processing the search query against the indexes of each of the multiple snippets, sentence by sentence, wherein creating an OR-Query comprises using all words in each sentence to create the OR-Query of all non-stopwords in each sentence.
  - 4. The method of claim 3, wherein for each document, rank scores of all snippets are summed, and a resulting score is assigned to each document, wherein higher scores are assigned for a given input query based on multiple matching snippets.
  - 5. The method of claim 4, wherein the similarity is based on measuring similarity between a snippet in the documents and the original input text of the query.
  - 6. The method of claim 1, comprising:
    - determining the frequency of each non-stopword in the documents; and
      
      removing from the search those words having a frequency that exceeds a threshold.
  - 7. The method of claim 1, comprising:
    - decomposing the search query into sub-queries;
      
      processing each sub-query against the indexes of each of the multiple snippets for searching sub-query terms over the multiple snippets, thereby implicitly introducing term proximity information in the information retrieval.
  - 8. The method of claim 7, comprising:
    - processing each sub-query against the indexes of each of the multiple snippets, sentence by sentence, using all words in each sentence of the sub-query to create an OR-Query of all non-stopwords in the sentence;
      
      returning a fit value for each OR-Query; and
      
      aggregating the fit values to provide a score for every document returned by the OR-Queries.
  - 9. The method of claim 1, wherein a snippet comprises multiple consecutive sentences of an original document.
  - 10. The method of claim 1, wherein a paragraph long query is split into constituent sentences, and each constituent sentence is used as a separate query, wherein results of all constituent sentence queries are combined for constructing a final search result set for the paragraph long query.
  - 11. The method of claim 1, wherein no additional computations are required to compute distances between terms in a document.
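
The frequency-based pruning of claim 6 might look like the following sketch. The claim does not say how frequency is measured or what the threshold is, so the choices here are assumptions: frequency is taken as document frequency (the number of documents containing the word), the threshold is a hypothetical `max_fraction` of the collection, and stopwords are presumed to have been removed already.

```python
import re
from collections import Counter

def document_frequencies(docs):
    """Count, for each word, how many documents it appears in."""
    df = Counter()
    for text in docs:
        df.update(set(re.findall(r"[a-z0-9]+", text.lower())))
    return df

def prune_frequent_terms(query_terms, df, num_docs, max_fraction=0.5):
    """Drop query terms that occur in more than max_fraction of the documents."""
    return [t for t in query_terms if df[t] / num_docs <= max_fraction]

docs = ["patent search index", "patent claim text", "patent snippet index"]
df = document_frequencies(docs)
kept = prune_frequent_terms(["patent", "snippet", "index"], df, len(docs))
print(kept)
```

Very common words match almost every snippet in an OR-Query, so removing them before the search shrinks the candidate set while costing little in discrimination between documents.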

Current Assignee: International Business Machines Corporation
Original Assignee: International Business Machines Corporation
Inventors: Bhatia, Sumit; He, Bin; He, Qi; Spangler, William S.
Primary Examiner(s): Corrielus, Jean M
Application Number: US13/587,413
Publication Number: US 20130318091A1
Time in Patent Office: 656 Days
Field of Search: 707/710, 707/737, 707/741, 707/729, 707/728, 707/706, 707/751
US Class Current: 707/741
CPC Class Codes:
G06F 16/31 Indexing; Data structures t...
G06F 16/33 Querying
