Expanded inverted index

US 7,634,468 B2
Filed: 11/29/2006
Issued: 12/15/2009
Est. Priority Date: 05/06/2003
Status: Active Grant

First Claim

Patent Images

1. A computer implemented method comprising:

parsing, by at least one of one or more processors, each document in a collection of documents to create a vocabulary comprising a plurality of index terms that occur in the collection of documents, the plurality of index terms comprising commonly occurring terms and infrequently occurring terms, the commonly occurring terms differing from the infrequently occurring terms;

identifying, by at least one of the one or more processors, the commonly occurring terms and assigning a unique common term identifier value to each of the commonly occurring terms;

generating, by at least one of the one or more processors, a plurality of expanded inverted lists, each expanded inverted list of the plurality of expanded inverted lists corresponding to one of the plurality of index terms and comprising a posting for each document in the collection of documents that includes the corresponding one of the plurality of index terms, each posting comprising;

an identification of the document of the collection of documents in which the corresponding one of the plurality of index terms appears,an indication of whether the corresponding one of the plurality of index terms in the one of the collection of documents is one of the infrequently occurring terms and whether the corresponding one of the plurality of index terms occurs immediately adjacent to one of the commonly occurring terms,a number of times the corresponding one of the plurality of index terms occurs in the one of the collection of documents,a location offset value specifying where in the document the corresponding one of the plurality of index terms occurs relative to a reference location in the document, andthe unique common term identifier assigned to the commonly occurring one of the plurality of index terms if the corresponding one of the plurality of index terms occurs immediately adjacent to one of the commonly occurring terms;

creating, by at least one of the one or more processors, an expanded inverted index that comprises the vocabulary and the expanded inverted list corresponding to each of the plurality of index terms;

parsing, by at least one of the one or more processors, a search query that comprises a phrase query of two or more search terms that must appear in a specified order, the parsing comprising identifying a sequence of the search terms that includes a first one of the commonly occurring terms that is immediately followed or immediately preceded by one of the infrequently occurring terms and also identifying a second one of the search terms that is a second of the infrequently occurring terms and that is not immediately adjacent to one of the commonly occurring terms;

retrieving, by at least one of the one or more processors, two or more of the plurality of expanded inverted lists from the expanded inverted index, the two or more of the plurality of expanded inverted lists comprising expanded inverted lists corresponding to the first and the second infrequently occurring index terms identified in the parsing; and

returning, by at least one of the one or more processors, one or more documents of the collections of documents that appear in all of the retrieved two or more of the plurality of expanded inverted lists after evaluating the search query using only the retrieved two or more of the plurality of expanded inverted lists.

View all claims

2 Assignments

Timeline View

Assignment View

0 Petitions

Accused Products

Abstract

Indexing documents is accomplished by generating an inverted index for a collection of one or more documents. The inverted index includes an inverted list for an index term appearing in one or more of the documents in the collection, and one or more postings. A posting includes a document identifier identifying a document in the collection of documents, a position identifier identifying a position of the index term in the document, and proximity information specifying whether the index term is positioned in a predefined proximal relationship between the index term and another a second index term in the document.

58 Citations

View as Search Results

10 Claims

1. A computer implemented method comprising:
- parsing, by at least one of one or more processors, each document in a collection of documents to create a vocabulary comprising a plurality of index terms that occur in the collection of documents, the plurality of index terms comprising commonly occurring terms and infrequently occurring terms, the commonly occurring terms differing from the infrequently occurring terms;
  
  identifying, by at least one of the one or more processors, the commonly occurring terms and assigning a unique common term identifier value to each of the commonly occurring terms;
  
  generating, by at least one of the one or more processors, a plurality of expanded inverted lists, each expanded inverted list of the plurality of expanded inverted lists corresponding to one of the plurality of index terms and comprising a posting for each document in the collection of documents that includes the corresponding one of the plurality of index terms, each posting comprising;
  
  an identification of the document of the collection of documents in which the corresponding one of the plurality of index terms appears,an indication of whether the corresponding one of the plurality of index terms in the one of the collection of documents is one of the infrequently occurring terms and whether the corresponding one of the plurality of index terms occurs immediately adjacent to one of the commonly occurring terms,a number of times the corresponding one of the plurality of index terms occurs in the one of the collection of documents,a location offset value specifying where in the document the corresponding one of the plurality of index terms occurs relative to a reference location in the document, andthe unique common term identifier assigned to the commonly occurring one of the plurality of index terms if the corresponding one of the plurality of index terms occurs immediately adjacent to one of the commonly occurring terms;
  
  creating, by at least one of the one or more processors, an expanded inverted index that comprises the vocabulary and the expanded inverted list corresponding to each of the plurality of index terms;
  
  parsing, by at least one of the one or more processors, a search query that comprises a phrase query of two or more search terms that must appear in a specified order, the parsing comprising identifying a sequence of the search terms that includes a first one of the commonly occurring terms that is immediately followed or immediately preceded by one of the infrequently occurring terms and also identifying a second one of the search terms that is a second of the infrequently occurring terms and that is not immediately adjacent to one of the commonly occurring terms;
  
  retrieving, by at least one of the one or more processors, two or more of the plurality of expanded inverted lists from the expanded inverted index, the two or more of the plurality of expanded inverted lists comprising expanded inverted lists corresponding to the first and the second infrequently occurring index terms identified in the parsing; and
  
  returning, by at least one of the one or more processors, one or more documents of the collections of documents that appear in all of the retrieved two or more of the plurality of expanded inverted lists after evaluating the search query using only the retrieved two or more of the plurality of expanded inverted lists.
- View Dependent Claims (2, 3, 4, 5, 6, 7, 8)
- - 2. A method as in claim 1, wherein the indication of whether the corresponding one of the plurality of index terms in the one of the collection of documents is one of the infrequently occurring terms and that the corresponding one of the plurality of index terms occurs immediately adjacent to one of the commonly occurring terms indicates that the corresponding one of the plurality of index terms occurs immediately after one of the commonly occurring terms.
  - 3. A method as in claim 1, wherein the indication of whether the corresponding one of the plurality of index terms in the one of the collection of documents is one of the infrequently occurring terms and that the corresponding one of the plurality of index terms occurs immediately adjacent to one of the commonly occurring terms indicates that the corresponding one of the plurality of index terms occurs immediately before one of the commonly occurring terms.
  - 4. A method as in claim 1, wherein each posting further comprises information about whether the corresponding one of the plurality of index terms in the one of the collection of documents is one of the infrequently occurring terms and that the corresponding one of the plurality of index terms occurs both immediately after one of the commonly occurring terms and immediately before another one of the commonly occurring terms;
    - andwherein the parsing further comprises identifying an additional sequence of the search terms that includes one of the commonly occurring terms that is both immediately followed and immediately preceded by one of the infrequently occurring terms.
  - 5. A method as in claim 1, wherein the documents in the collections of documents comprise one or more of source code, binary files, tables of genetic code, text documents, structured documents, and unstructured documents.
  - 6. A method as in claim 1, further comprising compressing the expanded inverted index using an integer compression scheme.
  - 7. A method as in claim 1, wherein the vocabulary comprises all index terms that occur in the collection of documents.
  - 8. A method as in claim 1, wherein the reference location is a beginning of the document and wherein the location offset value indicates how many terms from the beginning of the document the corresponding term is.

9. A computer-implemented method comprising:
- parsing, by at least one of one or more processors, a search query that comprises a phrase query of two or more search terms that must appear in a specified order, the parsing comprising identifying a sequence of the search terms that includes a first one of a plurality of commonly occurring terms that is immediately adjacent to one of a plurality of infrequently occurring terms and also identifying a second of the search terms that is a second one of the plurality of infrequently occurring terms and that is not immediately adjacent to any of the commonly occurring terms;
  
  retrieving, by at least one of the one or more processors, two or more expanded inverted lists of a plurality of expanded inverted lists that comprise an expanded inverted index, the expanded inverted index comprising a vocabulary that comprises a plurality of index terms that occur in a collection of documents that are searched in response to the search query, each of the plurality of expanded inverted lists corresponding to one of the plurality of index terms, the plurality of index terms comprising commonly occurring terms and infrequently occurring terms, the commonly occurring terms and infrequently occurring terms differing form each other, each expanded inverted list comprising a posting for each document in the collection of documents that includes the corresponding index term, each posting in each expanded inverted list comprising;
  
  an identification of the document of the collection of documents in which the corresponding index term appears,an indication of whether the corresponding index term in the one of the collection of documents is one of the infrequently occurring terms and whether the corresponding index term occurs immediately adjacent to one of the commonly occurring terms,a number of times the corresponding index term occurs in the one of the collection of documents,a location offset value specifying where in the document the corresponding term occurs relative to a reference location in the document, anda unique common term identifier assigned to the commonly occurring term if the corresponding index term occurs immediately adjacent to one of the commonly occurring terms,the retrieved two or more of the plurality of expanded inverted lists corresponding to the first and the second infrequently occurring index terms identified in the parsing;
  
  returning, by at least one of the one or more processors, one or more documents of the collections of documents that appear in the all of the two or more of the plurality of expanded inverted lists after evaluating the search query using only the retrieved two or more of the plurality of expanded inverted lists.

10. An apparatus comprising:
- one or more processors that perform functions comprising;
  
  parsing each document in a collection of documents to create a vocabulary comprising a plurality of index terms that occur in the collection of documents, the plurality of index terms comprising commonly occurring terms and infrequently occurring terms, the commonly occurring terms differing from the infrequently occurring terms;
  
  identifying the commonly occurring terms and assigning a unique common term identifier value to each of the commonly occurring terms;
  
  generating a plurality of expanded inverted lists, each expanded inverted list of the plurality of expanded inverted lists corresponding to one of the plurality of index terms and comprising a posting for each document in the collection of documents that includes the corresponding one of the plurality of index terms, each posting comprising;
  
  an identification of the document of the collection of documents in which the corresponding one of the plurality of index terms appears,an indication of whether the corresponding one of the plurality of index terms in the one of the collection of documents is one of the infrequently occurring terms and whether the corresponding one of the plurality of index terms occurs immediately adjacent to one of the commonly occurring terms,a number of times the corresponding one of the plurality of index terms occurs in the one of the collection of documents,a location offset value specifying where in the document the corresponding one of the plurality of index terms occurs relative to a reference location in the document, andthe unique common term identifier assigned to the commonly occurring one of the plurality of index terms if the corresponding one of the plurality of index terms occurs immediately adjacent to one of the commonly occurring terms;
  
  creating an expanded inverted index that comprises the vocabulary and the expanded inverted list corresponding to each of the plurality of index terms;
  
  parsing a search query that comprises a phrase query of two or more search terms that must appear in a specified order, the parsing comprising identifying a sequence of the search terms that includes a first one of the commonly occurring terms that is immediately followed or immediately preceded by one of the infrequently occurring terms and also identifying a second one of the search terms that is a second of the infrequently occurring terms and that is not immediately adjacent to one of the commonly occurring terms;
  
  retrieving of the plurality of expanded inverted lists from the expanded inverted index, the two or more of the plurality of expanded inverted lists comprising expanded inverted lists corresponding to the first and the second infrequently occurring index terms identified in the parsing; and
  
  returning one or more documents of the collections of documents that appear in all of the retrieved two or more of the plurality of expanded inverted lists after evaluating the search query using only the retrieved two or more of the plurality of expanded inverted lists.

Specification

Resources

Litigation Campaign Assessment

Current Assignee
SAP SE
Original Assignee
SAP AG (SAP SE)
Inventors
Stephan, Wolfgang
Primary Examiner(s)
Robinson; Greta L
Assistant Examiner(s)
Weinrich; Brian E

Application Number

US11/606,804
Publication Number

US 20070073668A1
Time in Patent Office

1,112 Days
Field of Search

707/3
US Class Current

1/1
CPC Class Codes

G06F 16/319   Inverted lists

Y10S 707/917   Text

Y10S 707/954   Relational

Y10S 707/99931   Database or file accessing

Y10S 707/99932   Access augmentation or opti...

Y10S 707/99933   Query processing, i.e. sear...

Y10S 707/99943   Generating database or data...

Y10S 707/99945   Object-oriented database st...

Expanded inverted index

First Claim

2 Assignments

0 Petitions

Accused Products

Abstract

58 Citations

10 Claims

Specification

Solutions

Use Cases

Quick Links

Expanded inverted index

First Claim

2 Assignments

Subscription Required

Subscription Required

0 Petitions

Subscription Required

Accused Products

Subscription Required

Abstract

58 Citations

10 Claims

Specification

Subscription Required

Solutions

Use Cases

Quick Links