Method and apparatus for representation of unstructured data
First Claim
1. A system for representing and searching a document including unstructured data, the system comprising:
- a data store storing a plurality of documents;
a processor executing program instructions, the program instructions including generating a binary representation of the unstructured data in the plurality of documents and searching the binary representation in response to a search request, the processor generating an output based on the search; and
a memory storing the binary representation of the unstructured data in a plurality of data structures, the data structures including;
a first binary bit vector identifying a plurality of unstructured data included in the plurality of documents;
a plurality of second binary bit vectors, wherein for each of the plurality of unstructured data identified in the first binary bit vector, a corresponding second binary bit vector sets one or more bits for one or more position identifiers assigned to one or more instances of the associated unstructured data appearing in one or more of the plurality of documents, wherein the instance of an unstructured data appearing at the end of a first one of the plurality of documents is assigned a position identifier of n, and the instance of an unstructured data appearing at the beginning of a second one of the plurality of documents is assigned a position identifier of n+1, wherein n is an integer greater than 0; and
a positional ID vector indicating a start position identifier of each word appearing at the beginning of each of the plurality of documents, wherein the program instructions for searching the binary representation include;
determining if a particular search term provided with the search request is identified in the first binary bit vector;
if the particular search term is identified in the first binary bit vector, retrieving the corresponding second binary bit vector;
identifying from the positional ID vector the start position identifier of the word at the beginning of a particular one of the plurality of documents to be searched;
deducing from the positional ID vector an end position identifier of a word at the end of the particular one of the plurality of documents to be searched; and
identifying one or more bits set for one or more of the position identifiers in the retrieved secondary binary bit vector between the start position identifier and the end position identifier for identifying all instances of the search term occurring in the particular document.
1 Assignment
0 Petitions
Accused Products
Abstract
Method and apparatus providing a binary representation of a document storing unstructured data. A unique word identifier is obtained for each word included in the document. A word select vector includes positions identified by different word identifiers. A 1-bit value is stored at positions identified by the word identifiers of the words included in the document. A unique position identifier is further assigned to each word appearing in the document. A word use set includes vectors for each unique word identifier for which a 1-bit is stored in the word select vector. Each vector in the word use set indicates the position identifiers of the instances of a particular word included in the document. Once the binary representation is generated, it may be efficiently searched to determine whether particular words appear in the document.
71 Citations
22 Claims
-
1. A system for representing and searching a document including unstructured data, the system comprising:
-
a data store storing a plurality of documents; a processor executing program instructions, the program instructions including generating a binary representation of the unstructured data in the plurality of documents and searching the binary representation in response to a search request, the processor generating an output based on the search; and a memory storing the binary representation of the unstructured data in a plurality of data structures, the data structures including; a first binary bit vector identifying a plurality of unstructured data included in the plurality of documents; a plurality of second binary bit vectors, wherein for each of the plurality of unstructured data identified in the first binary bit vector, a corresponding second binary bit vector sets one or more bits for one or more position identifiers assigned to one or more instances of the associated unstructured data appearing in one or more of the plurality of documents, wherein the instance of an unstructured data appearing at the end of a first one of the plurality of documents is assigned a position identifier of n, and the instance of an unstructured data appearing at the beginning of a second one of the plurality of documents is assigned a position identifier of n+1, wherein n is an integer greater than 0; and a positional ID vector indicating a start position identifier of each word appearing at the beginning of each of the plurality of documents, wherein the program instructions for searching the binary representation include; determining if a particular search term provided with the search request is identified in the first binary bit vector; if the particular search term is identified in the first binary bit vector, retrieving the corresponding second binary bit vector; identifying from the positional ID vector the start position identifier of the word at the beginning of a particular one of the plurality of documents to be searched; deducing from the positional ID vector an end position identifier of a word at the end of the particular one of the plurality of documents to be searched; and identifying one or more bits set for one or more of the position identifiers in the retrieved secondary binary bit vector between the start position identifier and the end position identifier for identifying all instances of the search term occurring in the particular document. - View Dependent Claims (2, 3, 4, 5, 6, 7)
-
-
8. A computer-implemented method for representing and searching a document including unstructured data, the method comprising:
-
generating, under control of the computer, a binary representation of the unstructured data in a plurality of documents; storing the binary representation of the unstructured data in a plurality of data structures, the data structures including; a first binary bit vector identifying a plurality of unstructured data stored in the plurality of documents; a plurality of second binary bit vectors, wherein for each of the plurality of unstructured data identified in the first binary bit vector, a corresponding second binary bit vector sets one or more bits for one or more position identifiers assigned to one or more instances of the associated unstructured data appearing in one or more of the plurality of documents, wherein the instance of an unstructured data appearing at the end of a first one of the plurality of documents is assigned a position identifier of n, and the instance of an unstructured data appearing at the beginning of a second one of the plurality of documents is assigned a position identifier of n+1, wherein n is an integer greater than 0; and a positional ID vector indicating a start position identifier of each word appearing at the beginning of each of the plurality of documents; receiving a search request including a search term; determining if a particular search term provided with the search request is identified in the first binary bit vector; if the particular search term is identified in the first binary bit vector, retrieving the corresponding second binary bit vector; identifying from the positional ID vector the start position identifier of the word at the beginning of a particular one of the plurality of documents to be searched; deducing from the positional ID vector an end position identifier of a word at the end of the particular one of the plurality of documents to be searched; identifying one or more bits set for one or more of the position identifiers in the retrieved secondary binary bit vector between the start position identifier and the end position identifier for identifying all instances of the search term occurring in the particular document; and generating, under control of the computer, an output based on the search. - View Dependent Claims (9, 10, 11, 12, 13, 14)
-
-
15. A method for representing unstructured data included in a plurality of documents, the method comprising:
-
parsing the plurality of documents; obtaining a unique word identifier for each unstructured data included in the plurality of documents; storing a first bit-value at each position of a first binary bit vector identified by each obtained unique word identifier; assigning a unique position identifier to each unstructured data included in the plurality of documents, wherein the unstructured data appearing at the end of a first one of the plurality of documents is assigned a position identifier of n, and the unstructured data appearing at the beginning of a second one of the plurality of documents is assigned a position identifier of n+1, wherein n is an integer greater than 0; retrieving a second binary bit vector for each of the unique word identifiers for which the first bit-value is set in the first binary bit vector; storing a second bit-value at each position of the retrieved second binary bit vector identified by one or more of the position identifiers assigned to one or more instances of the corresponding unstructured data identified by the unique word identifier; and setting a positional ID vector indicating a start position identifier of each word appearing at the beginning of each of the plurality of documents, and a word appearing at the beginning of a future document to the added to the plurality of documents. - View Dependent Claims (16)
-
-
17. A computer-implemented method for representing and searching a document including unstructured data, the method comprising:
-
providing access to a plurality of document collections, each document collection storing a plurality of documents; representing the plurality of documents in a particular one of the document collections via a single text object, the single text object including a plurality of structures providing a binary representation of the plurality of the documents, the plurality of structures including; an inverted index including a word select vector and a word use set for indexing a plurality of words appearing in the plurality of documents, the word select vector setting a bit for each of the plurality of words, and the word use set including a position vector for each of the words having a bit set in the word select vector, the position vector setting a bit for each position identifier assigned to each instance of the corresponding word appearing in one or more of the plurality of documents, wherein the instance of a word appearing at the end of a first one of the plurality of documents is assigned a position identifier of n, and the instance of a word appearing at the beginning of a second one of the plurality of documents is assigned a position identifier of n+1, wherein n is an integer greater than 0; a positional ID vector indicating a start position identifier of each indexed word appearing at the beginning of each of the plurality of documents, and an indexed word appearing at the beginning of a future document to the added to the plurality of documents; and a document positions vector indicating, for each of the plurality of documents, an actual position in the corresponding document in which an instance of the indexed word occurs; receiving a search term; determining whether the bit corresponding to the search term is set in the word select vector; if the bit corresponding to the search term is set in the word select vector, retrieving the corresponding position vector from the word use set; selecting a first document to be searched from the plurality of documents in the particular document collection; retrieving the start position identifier of the indexed word at the beginning of the first document to be searched from the positional ID vector; deducing an end position identifier corresponding to an indexed word at the end of the first document to be searched from the positional ID vector; setting bits in a mask vector from a start position identified by the start position identifier to an end position identified by the end position identifier; performing a logical AND operation with the mask vector and the retrieved position vector for identifying all position identifiers for all instances of the search term occurring in the first document; and identifying actual locations of the identified instances of the search term in the first document from the document positions vector. - View Dependent Claims (18, 19, 20, 21, 22)
-
Specification