Expanded inverted index
First Claim
Patent Images
1. A computer implemented method comprising:
- parsing, by at least one of one or more processors, each document in a collection of documents to create a vocabulary comprising a plurality of index terms that occur in the collection of documents, the plurality of index terms comprising commonly occurring terms and infrequently occurring terms, the commonly occurring terms differing from the infrequently occurring terms;
identifying, by at least one of the one or more processors, the commonly occurring terms and assigning a unique common term identifier value to each of the commonly occurring terms;
generating, by at least one of the one or more processors, a plurality of expanded inverted lists, each expanded inverted list of the plurality of expanded inverted lists corresponding to one of the plurality of index terms and comprising a posting for each document in the collection of documents that includes the corresponding one of the plurality of index terms, each posting comprising;
an identification of the document of the collection of documents in which the corresponding one of the plurality of index terms appears,an indication of whether the corresponding one of the plurality of index terms in the one of the collection of documents is one of the infrequently occurring terms and whether the corresponding one of the plurality of index terms occurs immediately adjacent to one of the commonly occurring terms,a number of times the corresponding one of the plurality of index terms occurs in the one of the collection of documents,a location offset value specifying where in the document the corresponding one of the plurality of index terms occurs relative to a reference location in the document, andthe unique common term identifier assigned to the commonly occurring one of the plurality of index terms if the corresponding one of the plurality of index terms occurs immediately adjacent to one of the commonly occurring terms;
creating, by at least one of the one or more processors, an expanded inverted index that comprises the vocabulary and the expanded inverted list corresponding to each of the plurality of index terms;
parsing, by at least one of the one or more processors, a search query that comprises a phrase query of two or more search terms that must appear in a specified order, the parsing comprising identifying a sequence of the search terms that includes a first one of the commonly occurring terms that is immediately followed or immediately preceded by one of the infrequently occurring terms and also identifying a second one of the search terms that is a second of the infrequently occurring terms and that is not immediately adjacent to one of the commonly occurring terms;
retrieving, by at least one of the one or more processors, two or more of the plurality of expanded inverted lists from the expanded inverted index, the two or more of the plurality of expanded inverted lists comprising expanded inverted lists corresponding to the first and the second infrequently occurring index terms identified in the parsing; and
returning, by at least one of the one or more processors, one or more documents of the collections of documents that appear in all of the retrieved two or more of the plurality of expanded inverted lists after evaluating the search query using only the retrieved two or more of the plurality of expanded inverted lists.
2 Assignments
0 Petitions
Accused Products
Abstract
Indexing documents is accomplished by generating an inverted index for a collection of one or more documents. The inverted index includes an inverted list for an index term appearing in one or more of the documents in the collection, and one or more postings. A posting includes a document identifier identifying a document in the collection of documents, a position identifier identifying a position of the index term in the document, and proximity information specifying whether the index term is positioned in a predefined proximal relationship between the index term and another a second index term in the document.
58 Citations
10 Claims
-
1. A computer implemented method comprising:
-
parsing, by at least one of one or more processors, each document in a collection of documents to create a vocabulary comprising a plurality of index terms that occur in the collection of documents, the plurality of index terms comprising commonly occurring terms and infrequently occurring terms, the commonly occurring terms differing from the infrequently occurring terms; identifying, by at least one of the one or more processors, the commonly occurring terms and assigning a unique common term identifier value to each of the commonly occurring terms; generating, by at least one of the one or more processors, a plurality of expanded inverted lists, each expanded inverted list of the plurality of expanded inverted lists corresponding to one of the plurality of index terms and comprising a posting for each document in the collection of documents that includes the corresponding one of the plurality of index terms, each posting comprising; an identification of the document of the collection of documents in which the corresponding one of the plurality of index terms appears, an indication of whether the corresponding one of the plurality of index terms in the one of the collection of documents is one of the infrequently occurring terms and whether the corresponding one of the plurality of index terms occurs immediately adjacent to one of the commonly occurring terms, a number of times the corresponding one of the plurality of index terms occurs in the one of the collection of documents, a location offset value specifying where in the document the corresponding one of the plurality of index terms occurs relative to a reference location in the document, and the unique common term identifier assigned to the commonly occurring one of the plurality of index terms if the corresponding one of the plurality of index terms occurs immediately adjacent to one of the commonly occurring terms; creating, by at least one of the one or more processors, an expanded inverted index that comprises the vocabulary and the expanded inverted list corresponding to each of the plurality of index terms; parsing, by at least one of the one or more processors, a search query that comprises a phrase query of two or more search terms that must appear in a specified order, the parsing comprising identifying a sequence of the search terms that includes a first one of the commonly occurring terms that is immediately followed or immediately preceded by one of the infrequently occurring terms and also identifying a second one of the search terms that is a second of the infrequently occurring terms and that is not immediately adjacent to one of the commonly occurring terms; retrieving, by at least one of the one or more processors, two or more of the plurality of expanded inverted lists from the expanded inverted index, the two or more of the plurality of expanded inverted lists comprising expanded inverted lists corresponding to the first and the second infrequently occurring index terms identified in the parsing; and returning, by at least one of the one or more processors, one or more documents of the collections of documents that appear in all of the retrieved two or more of the plurality of expanded inverted lists after evaluating the search query using only the retrieved two or more of the plurality of expanded inverted lists. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8)
-
-
9. A computer-implemented method comprising:
-
parsing, by at least one of one or more processors, a search query that comprises a phrase query of two or more search terms that must appear in a specified order, the parsing comprising identifying a sequence of the search terms that includes a first one of a plurality of commonly occurring terms that is immediately adjacent to one of a plurality of infrequently occurring terms and also identifying a second of the search terms that is a second one of the plurality of infrequently occurring terms and that is not immediately adjacent to any of the commonly occurring terms; retrieving, by at least one of the one or more processors, two or more expanded inverted lists of a plurality of expanded inverted lists that comprise an expanded inverted index, the expanded inverted index comprising a vocabulary that comprises a plurality of index terms that occur in a collection of documents that are searched in response to the search query, each of the plurality of expanded inverted lists corresponding to one of the plurality of index terms, the plurality of index terms comprising commonly occurring terms and infrequently occurring terms, the commonly occurring terms and infrequently occurring terms differing form each other, each expanded inverted list comprising a posting for each document in the collection of documents that includes the corresponding index term, each posting in each expanded inverted list comprising; an identification of the document of the collection of documents in which the corresponding index term appears, an indication of whether the corresponding index term in the one of the collection of documents is one of the infrequently occurring terms and whether the corresponding index term occurs immediately adjacent to one of the commonly occurring terms, a number of times the corresponding index term occurs in the one of the collection of documents, a location offset value specifying where in the document the corresponding term occurs relative to a reference location in the document, and a unique common term identifier assigned to the commonly occurring term if the corresponding index term occurs immediately adjacent to one of the commonly occurring terms, the retrieved two or more of the plurality of expanded inverted lists corresponding to the first and the second infrequently occurring index terms identified in the parsing; returning, by at least one of the one or more processors, one or more documents of the collections of documents that appear in the all of the two or more of the plurality of expanded inverted lists after evaluating the search query using only the retrieved two or more of the plurality of expanded inverted lists.
-
-
10. An apparatus comprising:
-
one or more processors that perform functions comprising; parsing each document in a collection of documents to create a vocabulary comprising a plurality of index terms that occur in the collection of documents, the plurality of index terms comprising commonly occurring terms and infrequently occurring terms, the commonly occurring terms differing from the infrequently occurring terms; identifying the commonly occurring terms and assigning a unique common term identifier value to each of the commonly occurring terms; generating a plurality of expanded inverted lists, each expanded inverted list of the plurality of expanded inverted lists corresponding to one of the plurality of index terms and comprising a posting for each document in the collection of documents that includes the corresponding one of the plurality of index terms, each posting comprising; an identification of the document of the collection of documents in which the corresponding one of the plurality of index terms appears, an indication of whether the corresponding one of the plurality of index terms in the one of the collection of documents is one of the infrequently occurring terms and whether the corresponding one of the plurality of index terms occurs immediately adjacent to one of the commonly occurring terms, a number of times the corresponding one of the plurality of index terms occurs in the one of the collection of documents, a location offset value specifying where in the document the corresponding one of the plurality of index terms occurs relative to a reference location in the document, and the unique common term identifier assigned to the commonly occurring one of the plurality of index terms if the corresponding one of the plurality of index terms occurs immediately adjacent to one of the commonly occurring terms; creating an expanded inverted index that comprises the vocabulary and the expanded inverted list corresponding to each of the plurality of index terms; parsing a search query that comprises a phrase query of two or more search terms that must appear in a specified order, the parsing comprising identifying a sequence of the search terms that includes a first one of the commonly occurring terms that is immediately followed or immediately preceded by one of the infrequently occurring terms and also identifying a second one of the search terms that is a second of the infrequently occurring terms and that is not immediately adjacent to one of the commonly occurring terms; retrieving of the plurality of expanded inverted lists from the expanded inverted index, the two or more of the plurality of expanded inverted lists comprising expanded inverted lists corresponding to the first and the second infrequently occurring index terms identified in the parsing; and returning one or more documents of the collections of documents that appear in all of the retrieved two or more of the plurality of expanded inverted lists after evaluating the search query using only the retrieved two or more of the plurality of expanded inverted lists.
-
Specification