System and method for word indexing in a capture system and querying thereof
Abstract
Searching of objects captured by a capture system can be improved by eliminating irrelevant objects from a query. In one embodiment, the present invention includes receiving such a query for objects captured by a capture system, the query including at least one search term. This search term is then hashed to a term bit position using a hash function. Then objects can be eliminated if, in a word index associated with the object, the term bit position is not set.
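The query-time elimination described above can be illustrated with a minimal sketch. Several details are assumptions that do not come from the source: the word index is modeled as a Python integer used as a bit vector, the index width `INDEX_BITS` is arbitrary, CRC-32 stands in for the unspecified hash function, and a multi-term query is treated as requiring every term. The names `term_bit_position` and `filter_candidates` are hypothetical.

```python
import zlib

# Assumed index width; the source does not state how many bits a word index has.
INDEX_BITS = 1024


def term_bit_position(term: str) -> int:
    """Hash a search term to one bit position (CRC-32 is an assumed stand-in)."""
    return zlib.crc32(term.lower().encode("utf-8")) % INDEX_BITS


def filter_candidates(word_indexes: dict[str, int], terms: list[str]) -> list[str]:
    """Return ids of objects that survive elimination.

    An object is eliminated as soon as the bit for any search term is not
    set in its word index (assumed AND semantics across terms).
    """
    positions = [term_bit_position(t) for t in terms]
    return [
        obj_id
        for obj_id, index in word_indexes.items()
        if all((index >> p) & 1 for p in positions)
    ]


# Usage: object "a" has the bits for two words set, object "b" has none.
indexes = {
    "a": (1 << term_bit_position("invoice")) | (1 << term_bit_position("payment")),
    "b": 0,
}
survivors = filter_candidates(indexes, ["invoice"])
```

Because different words can hash to the same bit position, a set bit only means the object may contain the term. An unset bit is conclusive, which is why the check is used to eliminate objects rather than to confirm matches.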
22 Claims
1. A method to be performed by a processor in an electronic environment, comprising:
receiving a query for an object captured by a capture system, the query including at least one search term, wherein a content type of the object is determined and the content type is positioned in a content field of a tag associated with the object;
hashing the at least one search term to a term bit position using a hash function; and
eliminating at least one other object from the query using a word index associated with the at least one other object, the word index comprising a plurality of bits, if a bit in the term bit position is not set, wherein eliminating at least one other object from the query comprises performing a search over a tag database containing tags associated with the captured objects.
Dependent claims: 2, 3, 4
5. A method to be performed by a processor in an electronic environment, the method comprising:
capturing an object being transmitted over a network, wherein a content type of the object is determined and the content type is positioned in a content field of a tag associated with the object;
extracting text contained in the captured object;
generating a plurality of tokens from the extracted text;
hashing the plurality of tokens to a plurality of bit position values; and
creating a word index to be associated with the captured object by setting bits of a bit vector corresponding to the plurality of bit position values, wherein generating the plurality of tokens comprises tokenizing the extracted text using a context-aware parser, and wherein the context-aware parser uses a list of patterns associated with a content type of the captured object to sub-tokenize at least one of the tokens generated from the extracted text.
Dependent claims: 6, 7, 8, 9
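The indexing steps recited in claim 5 (tokenize, sub-tokenize by content type, hash, set bits) can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the pattern table `SUB_TOKEN_PATTERNS`, the `"email"` content type label, whitespace tokenization, the index width, and CRC-32 as the hash are all hypothetical choices, since the claim names none of them.

```python
import re
import zlib

# Assumed index width; not specified in the source.
INDEX_BITS = 1024

# Hypothetical pattern lists keyed by content type. Here, tokens in "email"
# content are additionally split on "@" and "." so address parts are indexed.
SUB_TOKEN_PATTERNS = {
    "email": [re.compile(r"[^@.]+")],
}


def tokenize(text: str, content_type: str) -> list[str]:
    """Split text on whitespace, then sub-tokenize using content-type patterns."""
    tokens = []
    for token in text.split():
        tokens.append(token)
        for pattern in SUB_TOKEN_PATTERNS.get(content_type, []):
            parts = pattern.findall(token)
            if len(parts) > 1:  # keep parts only when the pattern split the token
                tokens.extend(parts)
    return tokens


def bit_position(token: str) -> int:
    """Hash a token to a bit position (CRC-32 is an assumed stand-in)."""
    return zlib.crc32(token.lower().encode("utf-8")) % INDEX_BITS


def create_word_index(text: str, content_type: str) -> int:
    """Build the word index as an integer bit vector with one bit set per token."""
    index = 0
    for token in tokenize(text, content_type):
        index |= 1 << bit_position(token)
    return index


word_index = create_word_index("mail bob@example.com", "email")
```

In this sketch, sub-tokenizing lets a later query for `example` pass the bit test for an object whose text contained only the full address `bob@example.com`.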
10. A capture system comprising:
one or more capture modules including a network interface card to capture an object being transmitted over a network, wherein a content type of the object is determined and the content type is positioned in a content field of a tag associated with the object;
a text extractor to extract text contained in the captured object;
a tokenizer to generate a plurality of tokens from the extracted text;
a sub-tokenizer to sub-tokenize the plurality of tokens comprises using a context-aware parser; and
an index generator to create a word index by hashing the plurality of tokens to a plurality of bit position values and setting bits of a bit vector corresponding to the plurality of bit position values, wherein the context-aware parser uses a list of patterns associated with a content type of the captured object to sub-tokenize at least one of the tokens generated from the extracted text.
Dependent claims: 11, 12, 13
14. A machine-readable medium having stored thereon data representing instructions that, when executed by a processor, cause the processor to perform operations comprising:
receiving a query for an object captured by a capture system, the query including at least one search term, wherein a content type of the object is determined and the content type is positioned in a content field of a tag associated with the object;
hashing the at least one search term to a term bit position using a hash function; and
eliminating at least one other object from the query using a word index associated with the at least one other object, the word index comprising a plurality of bits, if a bit in the term bit position is not set, wherein eliminating the at least one other object from the query comprises performing a search over a tag database containing tags associated with the capture objects.
Dependent claims: 15, 16, 17
18. A machine-readable medium having stored thereon data representing instructions that, when executed by a processor of a capture system, cause the processor to perform operations comprising:
extracting text contained in a captured object, wherein a content type of the object is determined and the content type is positioned in a content field of a tag associated with the object;
generating a plurality of tokens from the extracted text;
hashing the plurality of tokens to a plurality of bit position values; and
creating a word index to be associated with the captured object by setting bits of a bit vector corresponding to the plurality of bit position values, wherein generating the plurality of tokens comprises tokenizing the extracted text using a context-aware parser, and wherein the context-aware parser uses a list of patterns associated with a content type of the captured object to sub-tokenize at least one of the tokens generated from the extracted text.
Dependent claims: 19, 20, 21, 22
Specification