Apparatus and method for generating data useful in indexing and searching
Abstract
Processing of source documents to generate data for indexing, and of queries to generate data for searching, is done in accordance with retrieved tokenization rules and, if desired, retrieved normalization rules. Tokenization rules are used to define exactly what characters (letters, numbers, punctuation characters, etc.) and exactly what patterns of those characters (one or more contiguous characters, every individual character, etc.) comprise indexable and searchable units of data. Normalization rules are used to (potentially) modify the tokens created by the tokenizer in indexing and/or searching operations. Normalization accounts for things such as case-insensitive searching and language-specific nuances in which document authors can use accepted variations in the spelling of words. Query processing must employ the same tokenization and normalization rules as source processing in order for queries to accurately search the databases, and must also employ another set of concordable characters for use in the query language. This set of “reserved” characters includes characters for wildcard searching, quoted strings, field-qualified searching, range searching, and so forth.
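The distinction the abstract draws between tokenization rules and normalization rules can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration and is not taken from the patent: the `TokenizationRules` record, the `single_character_tokens` flag, and the choice of case folding plus accent stripping as the normalization step.

```python
import string
import unicodedata
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenizationRules:
    """Hypothetical rule record: which characters are concordable, and
    which pattern of those characters forms a token."""
    concordable: frozenset
    # False: each run of contiguous concordable characters is one token.
    # True: every individual concordable character is its own token.
    single_character_tokens: bool = False


def tokenize(text, rules):
    """Split text into tokens made only of concordable characters."""
    tokens, run = [], []
    for ch in text:
        if ch in rules.concordable:
            if rules.single_character_tokens:
                tokens.append(ch)
            else:
                run.append(ch)
        elif run:
            # A non-concordable character closes the current run.
            tokens.append("".join(run))
            run = []
    if run:
        tokens.append("".join(run))
    return tokens


def normalize(token):
    """Assumed normalization rule: case-fold, then drop combining accents
    so that accepted spelling variants map to one form."""
    decomposed = unicodedata.normalize("NFKD", token.casefold())
    return "".join(c for c in decomposed if not unicodedata.combining(c))


# Letters, digits, and e-acute (upper and lower) are concordable here.
LETTERS_AND_DIGITS = frozenset(
    string.ascii_letters + string.digits + "\u00e9\u00c9"
)
word_rules = TokenizationRules(LETTERS_AND_DIGITS)
char_rules = TokenizationRules(LETTERS_AND_DIGITS, single_character_tokens=True)

raw_tokens = tokenize("R\u00e9sum\u00e9, v2.0!", word_rules)
index_terms = [normalize(t) for t in raw_tokens]
```

Applying the same `tokenize` and `normalize` functions, with the same rule record, to both documents and queries is what keeps the two sides comparable in this sketch.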
17 Claims
1. A method for computerized processing of document data, comprising:
receiving the document data;
retrieving tokenization rules for the document data;
applying the tokenization rules to the document data to generate a plurality of tokens, each having one or more concordable characters from the document data;
receiving a user query;
re-retrieving the tokenization rules;
applying the re-retrieved tokenization rules to the user query to generate one or more search tokens, each having one or more concordable characters from the user query and wherein the user query is derived from a query language and comprises at least one reserved character from a set of concordable characters for use in the query language; and
searching an index comprising the tokens with the one or more search tokens.
(Dependent claims: 2, 3, 4, 5, 6, 7, 8)

9. A method for computerized query processing comprising:
receiving a user query;
retrieving tokenization rules;
applying the tokenization rules to the user query to generate one or more search tokens, each having one or more concordable characters from the user query wherein the user query is derived from a query language and comprises at least one reserved character from a set of concordable characters for use in the query language; and
searching an index with the one or more search tokens, the index comprising tokens based on application of the tokenization rules to document data.
(Dependent claim: 10)

11. A method for computerized searching of document data, comprising:
receiving the document data;
retrieving tokenization rules for the document data;
applying the tokenization rules to the document data to generate a plurality of tokens, each having one or more concordable characters from the document data;
storing the tokens in a data base;
receiving a user query derived from a query language and comprising at least one reserved character from a set of concordable characters for use in the query language;
re-retrieving the tokenization rules;
applying the re-retrieved tokenization rules to the user query to generate one or more search tokens, each having one or more concordable characters from the user query; and
searching the tokens in the data base with the one or more search tokens.

12. An apparatus comprising a subsystem for computerized processing of search queries, the query processing subsystem comprising:
a programmed component for receiving a user query;
a programmed component for retrieving tokenization rules;
a programmed component for applying the tokenization rules to the user query to generate one or more search tokens, each having one or more concordable characters from the user query wherein the user query is derived from a query language and comprises at least one reserved character from a set of concordable characters for use in the query language; and
a programmed component for searching an index with the one or more search tokens, the index comprising tokens based on application of the tokenization rules to document data.
(Dependent claim: 13)

14. An apparatus for computerized searching of document data comprising a source processing subsystem and a query processing subsystem, wherein:
the source processing subsystem comprises:
  a programmed component for receiving the document data;
  a programmed component for retrieving tokenization rules for the document data; and
  a programmed component for applying the tokenization rules to the document data to generate a plurality of tokens for storage in a data base, each of the tokens having one or more concordable characters from the document data; and
the query processing subsystem comprises:
  a programmed component for receiving a user query derived from a query language and comprising at least one reserved character from a set of concordable characters for use in the query language;
  a programmed component for re-retrieving the tokenization rules; and
  a programmed component for applying the re-retrieved tokenization rules to the user query to generate one or more search tokens for searching the tokens in the data base, each of the search tokens having one or more concordable characters from the user query.

15. A computer-readable medium carrying a subsystem program for computerized processing of search queries, the subsystem program executed on a computer comprising:
instructions for receiving a user query;
instructions for retrieving tokenization rules;
instructions for applying the tokenization rules to the user query to generate one or more search tokens, each having one or more concordable characters from the user query wherein the user query is derived from a query language and comprises at least one reserved character from a set of concordable characters for use in the query language; and
instructions for searching an index with the one or more search tokens, the index comprising tokens based on application of the tokenization rules to document data.
(Dependent claim: 16)

17. A computer-readable medium carrying a subsystem program for computerized processing of source documents and a subsystem program for computerized processing of search queries executed on a computer, wherein:
the source processing subsystem program comprises:
  instructions for receiving the document data;
  instructions for retrieving tokenization rules for the document data; and
  instructions for applying the tokenization rules to the document data to generate a plurality of tokens for storage in a data base, each of the tokens having one or more concordable characters from the document data; and
the query processing subsystem program comprises:
  instructions for receiving a user query derived from a query language and comprising at least one reserved character from a set of concordable characters for use in the query language;
  instructions for re-retrieving the tokenization rules; and
  instructions for applying the re-retrieved tokenization rules to the user query to generate one or more search tokens for searching the tokens in the data base, each of the search tokens having one or more concordable characters from the user query.

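The flow recited in the claims (tokenize documents into an index, re-retrieve the same rules at query time, tokenize a query that contains reserved characters, then search) can be sketched as follows. This is one possible reading under stated assumptions, not the patented implementation: `retrieve_rules`, `build_index`, and `search` are hypothetical names, the reserved set is limited to the wildcards `*` and `?`, lowercasing stands in for normalization, and multiple query terms are combined with AND.

```python
import fnmatch
from collections import defaultdict

# Assumed character sets, for illustration only.
CONCORDABLE = frozenset("abcdefghijklmnopqrstuvwxyz0123456789")
RESERVED = frozenset("*?")  # hypothetical wildcard characters


def retrieve_rules():
    """Stand-in for fetching stored tokenization rules. It is called once
    for indexing and again for every query, mirroring 're-retrieving'."""
    return {"concordable": CONCORDABLE}


def tokenize(text, concordable):
    """Emit each run of contiguous concordable characters as one token.
    Lowercasing is an assumed normalization step."""
    tokens, run = [], []
    for ch in text.lower():
        if ch in concordable:
            run.append(ch)
        elif run:
            tokens.append("".join(run))
            run = []
    if run:
        tokens.append("".join(run))
    return tokens


def build_index(documents):
    """Source processing: map each token to the ids of documents containing it."""
    rules = retrieve_rules()
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in tokenize(text, rules["concordable"]):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Query processing: same rules, plus reserved characters treated as
    concordable so that wildcards survive tokenization of the query."""
    rules = retrieve_rules()
    query_chars = rules["concordable"] | RESERVED
    hits = None
    for term in tokenize(query, query_chars):
        if RESERVED & set(term):
            # Wildcard term: match it against every indexed token.
            matched = set()
            for token, doc_ids in index.items():
                if fnmatch.fnmatchcase(token, term):
                    matched |= doc_ids
        else:
            matched = set(index.get(term, ()))
        hits = matched if hits is None else hits & matched
    return hits or set()


documents = {
    1: "Tokenization rules",
    2: "Token normalization",
    3: "Query rules",
}
index = build_index(documents)
```

With these three sample documents, `search(index, "token*")` matches the indexed tokens `tokenization` and `token`, while `search(index, "token* rules")` narrows the result to the one document that contains both.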
Specification