Performing automated document collection and selection by providing a meta-index with meta-index values identifying corresponding document collections

US 5,983,216 A
Filed: 09/12/1997
Issued: 11/09/1999
Est. Priority Date: 09/12/1997
Status: Expired due to Term

Abstract

A method of performing automated collection selection relative to a plurality of document collections, each including one or more documents, using a list of qualified terms developed from an input query text. The method comprises the steps of: (a) parsing the input query text to select single-word terms and multiple-word phrase terms from the query text by exclusion of predetermined context-free single-word terms and punctuation; (b) applying each such selected term against a meta-index descriptive of the document collections; (c) determining cumulative rankings for the document collections relative to each such selected term normalized against the plurality of document collections; and (d) selecting a set of the document collections having the highest relative cumulative rankings.

448 Citations

18 Claims

1. A method of performing automated document collection selection and document selection relative to a plurality of independently maintained document collections, each including a plurality of documents, using a list of qualified terms developed from an input query text, said method comprising the steps of:
- providing a meta-index having meta-index values identifying corresponding ones of the document collections and information about documents in the corresponding ones of the document collections;
  
  parsing said input query text to select single-word terms and multiple-word phrase terms from said query text by exclusion of predetermined context-free single-word terms and punctuation;
  
  applying each such selected term against the meta-index values in said meta-index to determine correlation between the selected terms and the meta-index values;
  
  determining cumulative rankings for said document collections based upon said correlation relative to each such selected term normalized against said plurality of document collections; and
  
  selecting a subset of said document collections having the highest relative cumulative rankings whereby said subset of said document collections is established to be the most appropriate subset of said plurality of document collections to search using said input query text, searching each of said subset of document collections with said input query text to select documents correlating to said query text.
  - 2. The method of claim 1 wherein said step of parsing is performed to select single-word terms and multiple-word phrase terms from said input query text by exclusion of predetermined context-free single-word terms and punctuation where said selected single-word terms and multiple-word phrase terms are identified as any of:
    - a) a noun phrase;
    - b) a word tuple;
    - c) a capitalized word phrase; and
    - d) a proper name;

      where each of said single-word terms and multiple-word phrase terms is delimited by any of a capitalized word, a space, a period, and a predetermined phrase word-length.
  - 3. The method of claim 2 wherein said selected single-word terms and multiple-word phrase terms are further identified as any of:
    - a) a hyphenated phrase expanded to include, as separate terms, the single-word terms of the hyphenated phrase, a single-word term that is the combined form of the single-word terms of the hyphenated phrase, and sequential pairs of the single-word terms of the hyphenated phrase; and
    - b) a proper name, having more than two single-word terms, expanded to include, as separate terms, the single-word terms of said proper name;

      where each of said single-word terms and multiple-word phrase terms is delimited by any of a capitalized word, a space, a period, and a predetermined phrase word-length.
  - 4. The method of claim 1 wherein said meta-index is provided with statistical data and fielded data descriptive of said document collections, said statistical data including frequency-of-occurrence statistics on multiple-word phrase terms to provide single-word term proximity information, and said fielded data including information respectively characterizing each of said document collections.
  - 5. The method of claim 4 wherein said frequency-of-occurrence statistical data describes the number of occurrences of unstemmed single-word and multiple-word phrase terms within a corresponding one of said document collections, and wherein said step of applying includes execution of a stemming algorithm.
  - 6. The method of claim 1 wherein each of said document collections is represented as a record in said meta-index, wherein a predetermined record stores fielded data descriptive of said document collection represented by said predetermined record, and a predetermined term with corresponding statistical data, said predetermined term being qualified for entry into said predetermined record where:
    - a) the number of occurrences of said predetermined term within a predetermined document collection is in excess of a first predetermined number;
    - b) the number of occurrences of said predetermined term within at least one document within said predetermined document collection is in excess of a second predetermined number; or
    - c) said predetermined term occurs within a number of documents within said predetermined document collection in excess of a third predetermined number.
  - 7. The method of claim 6 wherein at least one of said first, second, and third predetermined numbers is in excess of one.
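
The term expansions recited in claim 3 can be illustrated with a short example. What follows is a minimal sketch, not the patented implementation: the function names `expand_hyphenated` and `expand_proper_name` are hypothetical, and splitting on `-` and on whitespace is an assumption about how the single-word terms are recovered.

```python
from typing import List


def expand_hyphenated(phrase: str) -> List[str]:
    """Expand a hyphenated phrase as described in claim 3(a).

    Returns the single-word terms, then the combined single-word form,
    then each sequential pair of single-word terms.
    """
    parts = [p for p in phrase.split("-") if p]
    if len(parts) < 2:
        # Not actually hyphenated: nothing to expand.
        return [phrase]
    combined = "".join(parts)
    pairs = [f"{a} {b}" for a, b in zip(parts, parts[1:])]
    return parts + [combined] + pairs


def expand_proper_name(name: str) -> List[str]:
    """Expand a proper name as described in claim 3(b).

    Only names with more than two single-word terms are expanded; the
    full name is kept and the single words are added as separate terms.
    """
    words = name.split()
    if len(words) <= 2:
        return [name]
    return [name] + words
```

Under these assumptions, `expand_hyphenated("meta-index")` yields the terms `meta`, `index`, `metaindex`, and `meta index`.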

8. A method of performing automated collection selection relative to a plurality of document collections, each including one or more documents, using a list of qualified terms developed from an input query text, said method comprising the steps of:
- a) parsing said input query text to select single-word terms and multiple-word phrase terms from said query text by exclusion of predetermined context-free single-word terms and punctuation;
  
  b) applying each such selected term against a meta-index wherein said document collections are represented as respective collection records in said meta-index, wherein a predetermined collection record stores fielded data descriptive of said document collection represented by said predetermined collection record, and a predetermined term with corresponding statistical data, said predetermined term being qualified for entry into said predetermined collection record where:

  i) the number of occurrences of said predetermined term within a predetermined document collection is in excess of a first predetermined number;

  ii) the number of occurrences of said predetermined term within at least one document within said predetermined document collection is in excess of a second predetermined number; or

  iii) said predetermined term occurs within a number of documents within said predetermined document collection in excess of a third predetermined number, wherein at least one of said first, second, and third predetermined numbers is in excess of one, and wherein said statistical data includes the number of occurrences of said predetermined term within the documents of said predetermined document collection, the number of documents within said predetermined document collection containing said predetermined term, the number of qualified terms that occur in said predetermined document collection, and the number of documents within said predetermined document collection;
  
  c) determining cumulative rankings for said document collections relative to each such selected term normalized against said plurality of document collections; and
  
  d) selecting a set of said document collections having the highest relative cumulative rankings.
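
The qualification test and the per-collection statistics recited in step (b) of claim 8 can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the names `Thresholds` and `build_collection_record` are hypothetical, the threshold values 3, 2, and 1 are arbitrary placeholders for the first, second, and third predetermined numbers, and each document is assumed to be already tokenized into a list of terms.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Thresholds:
    # Hypothetical values; the claim only requires at least one to exceed one.
    first: int = 3   # total occurrences in the collection
    second: int = 2  # occurrences within a single document
    third: int = 1   # number of documents containing the term


def build_collection_record(docs: List[List[str]],
                            th: Thresholds = Thresholds()) -> Dict:
    total = Counter()       # occurrences of each term across the collection
    max_in_doc = Counter()  # highest count of each term in any one document
    doc_count = Counter()   # number of documents containing each term
    for doc in docs:
        for term, n in Counter(doc).items():
            total[term] += n
            max_in_doc[term] = max(max_in_doc[term], n)
            doc_count[term] += 1

    # "In excess of" is read as strictly greater than; any one test suffices.
    qualified = {
        term: {"occurrences": total[term], "docs_with_term": doc_count[term]}
        for term in total
        if total[term] > th.first
        or max_in_doc[term] > th.second
        or doc_count[term] > th.third
    }
    return {
        "terms": qualified,
        "num_qualified_terms": len(qualified),
        "num_docs": len(docs),
    }
```

The returned dictionary holds the four statistics the claim lists: occurrences per term, documents containing each term, the number of qualified terms, and the number of documents in the collection.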

9. A method of performing automated collection selection relative to a plurality of document collections, each including one or more documents, using a list of qualified terms developed from an input query text, said method comprising the steps of:
- a) parsing said input query text to select single-word terms and multiple-word phrase terms from said query text by exclusion of predetermined context-free single-word terms and punctuation;
  
  b) applying each such selected term against a meta-index descriptive of said document collections;
  
  c) determining cumulative rankings for said document collections relative to each such selected term normalized against said plurality of document collections by, for each of said document collections and for each of said terms relative to a respective one of said document collections, performing the steps of:
  
  i) calculating an initial term ranking for a predetermined term relative to a predetermined document collection based on a ratio of the number of documents having a qualified number of occurrences of said predetermined term in said predetermined document collection and a qualified number of documents within said predetermined document collection;
  
  ii) scaling said initial term ranking;
  
  iii) calculating a normalizing factor based on the ratio of the total number of documents in said document collections and the total number of documents in said document collections having a qualified number of occurrences of said predetermined term;
  
  iv) scaling said normalizing factor;
  
  v) calculating a product of said scaled initial term ranking and said scaled normalizing factor to provide a term ranking for said predetermined term; and
  
  vi) summing said products corresponding to each of said terms relative to said predetermined document collection to provide said cumulative term ranking for said predetermined document collection; and
  
  d) selecting a set of said document collections having the highest relative cumulative rankings.
  - 10. The method of claim 9 wherein said meta-index stores a plurality of records representing respective said document collections, each said record including a term list including term frequency of occurrence data and fielded data descriptive of said respective document collection.
  - 11. The method of claim 10 wherein said step of parsing includes the identification of pre-search qualifiers from said query text, said method further including, before the step of applying, a step of pre-qualifying said document collections to select a set of document collections for use that meet the conditions of said pre-search qualifiers, said set of document collections being said document collections for subsequent steps.
  - 12. The method of claim 11 wherein said step of parsing includes the identification of post-search qualifiers from said query text, said step of selecting including qualifying said document collections as selected to include only those that meet the conditions of said post-search qualifiers.
  - 13. The method of claim 12 wherein said step of scaling said normalizing factor provides for storing said scaled normalizing factor and wherein said step of calculating a product provides for storing said term rankings, said stored scaled normalizing factors and said stored term rankings being useable subsequently in said method, said method further including an optional step of permitting modification of said query text and resubmitting said modified query text as said query text in said step of parsing, said stored scaled normalizing factors and said stored term rankings being used as appropriate where common terms and document collections are present in said query text and in said query text as modified to reduce time delays due to redundant calculations.
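
Steps (i) through (vi) of claim 9 describe a ranking computation but leave both scaling operations unspecified. The following is a minimal sketch, assuming logarithmic scaling via `math.log1p` for both; that choice, the `CollectionRecord` class, its field names, and the `select_top` helper are hypothetical and introduced only for illustration. The `doc_freq` counts are assumed to include only documents with a qualified number of occurrences of the term.

```python
import math
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class CollectionRecord:
    name: str
    num_docs: int
    # term -> number of documents in this collection containing the term
    doc_freq: Dict[str, int] = field(default_factory=dict)


def scale(x: float) -> float:
    # Assumed scaling function; the claim does not say which one is used.
    return math.log1p(x)


def cumulative_rankings(terms: List[str],
                        records: List[CollectionRecord]) -> Dict[str, float]:
    total_docs = sum(r.num_docs for r in records)
    rankings = {}
    for rec in records:
        score = 0.0
        for term in terms:
            df_all = sum(r.doc_freq.get(term, 0) for r in records)
            if df_all == 0 or rec.num_docs == 0:
                continue  # term unknown everywhere, or empty collection
            # (i) ratio of documents with the term to documents in the collection
            initial = rec.doc_freq.get(term, 0) / rec.num_docs
            # (iii) ratio of all documents to all documents with the term
            norm = total_docs / df_all
            # (ii), (iv), (v) scale both values and multiply; (vi) sum over terms
            score += scale(initial) * scale(norm)
        rankings[rec.name] = score
    return rankings


def select_top(rankings: Dict[str, float], k: int) -> List[str]:
    # Step (d): keep the k collections with the highest cumulative rankings.
    return sorted(rankings, key=rankings.get, reverse=True)[:k]
```

With two equally sized collections, the one in which a larger share of documents contains the query term receives the higher cumulative ranking, and a term found in no collection contributes nothing.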

14. A method of selecting a subset of a set of document collections to search based on an input query text from a query source in advance of selecting a plurality of documents from said subset to identify to said query source in response to said input query text, said method comprising the steps of:
- a) parsing said input query text to select predetermined single-word terms and multiple-word phrase terms from said query text by exclusion of predetermined context-free single-word terms and punctuation and determining each remaining word to be a single-word term and each set of two successive remaining words being a multiple-word phrase term;
  
  b) applying said predetermined single-word terms and multiple-word phrase terms against a meta-index including a plurality of collection records wherein each of said collection records is descriptive of a corresponding one of said document collections;
  
  c) determining cumulative rankings for each of said document collections relative to the set of said predetermined single-word terms and multiple-word phrase terms, wherein the ranking of each of said predetermined single-word terms and multiple-word phrase terms for each of said document collections is normalized against said plurality of document collections; and
  
  d) selecting said subset of said document collections based on the respective cumulative rankings of said document collections.
  - 15. The method of claim 14 wherein each of said collection records includes statistical data for use in determining cumulative rankings and wherein the statistical data stored by each of said collection records includes:
    - a) the number of occurrences of each single-word term and multiple-word phrase term within the documents of a respective document collection;
    - b) the number of documents within said respective document collection that contain each said single-word term and multiple-word phrase term;
    - c) the number of said single-word terms and multiple-word phrase terms that occur in said predetermined document collection; and
    - d) the number of documents within said predetermined document collection.
  - 16. The method of claim 15 wherein each of said document collections is separate and described by a respective collection record containing respective statistical data.
  - 17. The method of any of claims 14, 15, and 16 wherein said steps of parsing, applying, determining, and selecting are performed automatically in response to said input query text.
  - 18. The method of claim 17 further comprising a step of identifying a predetermined set of documents from said subset of documents with respect to said input query text automatically following said selecting of said subset of said document collections.
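
The parsing rule in step (a) of claim 14 (drop context-free words and punctuation, treat each remaining word as a single-word term and each pair of successive remaining words as a phrase term) can be sketched as follows. This is an illustration only: the `STOP_WORDS` list, the tokenizing regular expression, the lowercasing, and the name `parse_query` are assumptions that do not come from the patent.

```python
import re
from typing import FrozenSet, List, Tuple

# Hypothetical list of context-free single-word terms to exclude.
STOP_WORDS = frozenset(
    {"a", "an", "and", "the", "of", "in", "on", "for", "to", "is"}
)


def parse_query(query: str,
                stop_words: FrozenSet[str] = STOP_WORDS
                ) -> Tuple[List[str], List[str]]:
    # Keep alphanumeric runs (hyphenated words stay whole); punctuation is dropped.
    words = re.findall(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*", query)
    # Exclude the context-free words; every remaining word is a single-word term.
    remaining = [w.lower() for w in words if w.lower() not in stop_words]
    # Each pair of successive remaining words becomes a multiple-word phrase term.
    phrases = [f"{a} {b}" for a, b in zip(remaining, remaining[1:])]
    return remaining, phrases
```

Pairs are formed after exclusion, so two words separated only by an excluded word or by punctuation are treated as successive in this sketch.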

Current Assignee: Google LLC (Alphabet Inc.)
Original Assignee: Infoseek Corporation (The Walt Disney Company)
Inventors: Chang, William I.; Kirsch, Steven T.
Primary Examiner(s): Lintz, Paul R.
Assistant Examiner(s): Colbert, Ella
Application Number: US08/928,294
Time in Patent Office: 788 Days
Field of Search: 707/2, 707/3, 707/4, 707/5, 704/256
US Class Current: 1/1
CPC Class Codes:
- G06F 16/30   of unstructured textual dat...
- Y10S 707/99932   Access augmentation or opti...
- Y10S 707/99933   Query processing, i.e. sear...
- Y10S 707/99934   Query formulation, input pr...
- Y10S 707/99935   Query augmenting and refini...
