Apparatus and method for generating data useful in indexing and searching
Abstract
Processing of source documents to generate data for indexing, and of queries to generate data for searching, is done in accordance with retrieved tokenization rules and, if desired, retrieved normalization rules. Tokenization rules are used to define exactly what characters (letters, numbers, punctuation characters, etc.) and exactly what patterns of those characters (one or more contiguous characters, every individual character, etc.) comprise indexable and searchable units of data. Normalization rules are used to (potentially) modify the tokens created by the tokenizer in indexing and/or searching operations. Normalization accounts for things such as case-insensitive searching and language-specific nuances in which document authors can use accepted variations in the spelling of words. Query processing must employ the same tokenization and normalization rules as source processing in order for queries to accurately search the databases, and must also employ another set of concordable characters for use in the query language. This set of “reserved” characters includes characters for wildcard searching, quoted strings, field-qualified searching, range searching, and so forth.
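The rule-driven tokenization and normalization described in the abstract can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and does not come from the patent: the `TokenizationRules` class, the `retrieve_rules` lookup, the choice of ASCII letters and digits as the concordable set, and case folding plus accent stripping as the normalization rule.

```python
from __future__ import annotations

import unicodedata
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenizationRules:
    """Hypothetical rule set: which characters are concordable and how they group."""
    concordable: frozenset             # characters allowed inside a token
    single_char_tokens: bool = False   # True: each concordable character is its own token


def retrieve_rules(language: str) -> TokenizationRules:
    """Stand-in for rule retrieval; a real system would load stored rules."""
    letters_digits = frozenset(
        "abcdefghijklmnopqrstuvwxyz"
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"
    )
    if language == "en":
        return TokenizationRules(concordable=letters_digits)
    raise KeyError(f"no tokenization rules for {language!r}")


def tokenize(text: str, rules: TokenizationRules) -> list[str]:
    """Split text into tokens made only of concordable characters."""
    tokens, current = [], []
    for ch in text:
        if ch in rules.concordable:
            if rules.single_char_tokens:
                tokens.append(ch)
            else:
                current.append(ch)
        elif current:
            # A non-concordable character ends the token being built.
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens


def normalize(token: str) -> str:
    """One possible normalization rule: fold case and strip accents."""
    decomposed = unicodedata.normalize("NFKD", token.casefold())
    return "".join(c for c in decomposed if not unicodedata.combining(c))
```

With these assumed rules, `tokenize("Hello, World-2024!", retrieve_rules("en"))` yields `["Hello", "World", "2024"]`, and `normalize("Résumé")` yields `"resume"`. Changing the concordable set or the grouping flag changes what counts as an indexable unit without touching the tokenizer itself, which is the point of keeping the rules as retrievable data.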
26 Claims
1. A method for computerized processing of document data, comprising:
receiving the document data;
retrieving tokenization rules for the document data; and
applying the tokenization rules to the document data to generate a plurality of tokens, each having one or more concordable characters from the document data.
11. A method for query processing comprising:
receiving a user query;
retrieving tokenization rules;
applying the tokenization rules to the user query to generate one or more search tokens, each having one or more concordable characters from the user query; and
searching an index with the one or more search tokens, the index comprising tokens based on application of the tokenization rules to document data.
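The steps of claim 11 can be sketched as follows, under stated assumptions. A regular expression stands in for the retrieved tokenization rules, lowercasing stands in for normalization, the index is an in-memory inverted index, and multi-token queries use AND semantics. None of these choices is specified by the claim; the names `RULES`, `retrieve_rules`, `build_index`, and `search` are hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical rule store: a regex plays the role of "retrieved tokenization rules".
RULES = {"default": r"[A-Za-z0-9]+"}


def retrieve_rules(name="default"):
    """Look up and compile the named rule set."""
    return re.compile(RULES[name])


def tokenize(text, rules):
    """Apply the rules, then lowercase each token (an assumed normalization)."""
    return [m.group(0).lower() for m in rules.finditer(text)]


def build_index(documents, rules):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in tokenize(text, rules):
            index[token].add(doc_id)
    return index


def search(query, index, rules):
    """Return ids of documents containing every search token (AND is an assumption)."""
    search_tokens = tokenize(query, rules)
    if not search_tokens:
        return set()
    result = set(index.get(search_tokens[0], set()))
    for token in search_tokens[1:]:
        result &= index.get(token, set())
    return result
```

Because the query passes through the same `tokenize` function as the documents, a query such as `E-MAIL` is split into `e` and `mail`, exactly as `e-mail` was at indexing time, so the two sides agree on what a searchable unit is.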
14. A method for computerized searching of document data, comprising:
receiving the document data;
retrieving tokenization rules for the document data;
applying the tokenization rules to the document data to generate a plurality of tokens, each having one or more concordable characters from the document data;
storing the tokens in a data base;
receiving a user query derived from a query language and comprising at least one reserved character from a set of concordable characters for use in the query language;
re-retrieving the tokenization rules;
applying the re-retrieved tokenization rules to the user query to generate one or more search tokens, each having one or more concordable characters from the user query; and
searching the tokens in the data base with the one or more search tokens.
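As an illustration of how reserved characters in a query language can coexist with the tokenization rules, the sketch below parses a query before tokenizing it. The specific reserved characters (`*` for a trailing wildcard, `"` for quoted strings, `:` for field qualification), the `SearchTerm` structure, and the regex used as the tokenization rule are all assumptions chosen for the example. Range searching is omitted.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Stand-in for the re-retrieved tokenization rules (an assumption).
TOKEN_RULE = re.compile(r"[A-Za-z0-9]+")

# One query term: optional field prefix, then a quoted string or a bare word.
TERM = re.compile(r'(?:(\w+):)?(?:"([^"]*)"|(\S+))')


@dataclass
class SearchTerm:
    tokens: list                  # search tokens produced by the tokenization rule
    field: Optional[str] = None   # set when the term was written as field:term
    phrase: bool = False          # True when the term was a quoted string
    prefix: bool = False          # True when a bare term ended with the wildcard '*'


def parse_query(query):
    """Interpret reserved characters, then tokenize what remains of each term."""
    terms = []
    for field, quoted, bare in TERM.findall(query):
        text = quoted if quoted else bare
        prefix = (not quoted) and bare.endswith("*")
        tokens = [t.lower() for t in TOKEN_RULE.findall(text)]
        if tokens:
            terms.append(SearchTerm(tokens, field or None, bool(quoted), prefix))
    return terms
```

For the query `title:"Data Base" token* e-mail`, this produces three terms: a phrase on the `title` field with tokens `data` and `base`, a prefix term with token `token`, and a plain term with tokens `e` and `mail`. The reserved characters shape how each term is searched, while the token text itself still comes from the same rule applied to the documents.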
15. An apparatus comprising a subsystem for computerized processing of source documents, the source processing subsystem comprising:
a programmed component for receiving the document data;
a programmed component for retrieving tokenization rules for the document data; and
a programmed component for applying the tokenization rules to the document data to generate a plurality of tokens, each having one or more concordable characters from the document data.
18. An apparatus comprising a subsystem for computerized processing of search queries, the query processing subsystem comprising:
a programmed component for receiving a user query;
a programmed component for retrieving tokenization rules;
a programmed component for applying the tokenization rules to the user query to generate one or more search tokens, each having one or more concordable characters from the user query; and
a programmed component for searching an index with the one or more search tokens, the index comprising tokens based on application of the tokenization rules to document data.
20. An apparatus for computerized searching of document data comprising a source processing subsystem and a query processing subsystem, wherein:
the source processing subsystem comprises:
a programmed component for receiving the document data;
a programmed component for retrieving tokenization rules for the document data; and
a programmed component for applying the tokenization rules to the document data to generate a plurality of tokens for storage in a data base, each of the tokens having one or more concordable characters from the document data; and
the query processing subsystem comprises:
a programmed component for receiving a user query derived from a query language and comprising at least one reserved character from a set of concordable characters for use in the query language;
a programmed component for re-retrieving the tokenization rules; and
a programmed component for applying the re-retrieved tokenization rules to the user query to generate one or more search tokens for searching the tokens in the data base, each of the search tokens having one or more concordable characters from the user query.
21. A computer-readable medium carrying a subsystem program for computerized processing of source documents, the subsystem program comprising:
instructions for receiving the document data;
instructions for retrieving tokenization rules for the document data; and
instructions for applying the tokenization rules to the document data to generate a plurality of tokens, each having one or more concordable characters from the document data.
24. A computer-readable medium carrying a subsystem program for computerized processing of search queries, the subsystem program comprising:
instructions for receiving a user query;
instructions for retrieving tokenization rules;
instructions for applying the tokenization rules to the user query to generate one or more search tokens, each having one or more concordable characters from the user query; and
instructions for searching an index with the one or more search tokens, the index comprising tokens based on application of the tokenization rules to document data.
26. A computer-readable medium carrying a subsystem program for computerized processing of source documents and a subsystem program for computerized processing of search queries, wherein:
the source processing subsystem program comprises:
instructions for receiving the document data;
instructions for retrieving tokenization rules for the document data; and
instructions for applying the tokenization rules to the document data to generate a plurality of tokens for storage in a data base, each of the tokens having one or more concordable characters from the document data; and
the query processing subsystem program comprises:
instructions for receiving a user query derived from a query language and comprising at least one reserved character from a set of concordable characters for use in the query language;
instructions for re-retrieving the tokenization rules; and
instructions for applying the re-retrieved tokenization rules to the user query to generate one or more search tokens for searching the tokens in the data base, each of the search tokens having one or more concordable characters from the user query.
Specification