Apparatus and method for generating data useful in indexing and searching

US 7,152,056 B2
Filed: 04/16/2003
Issued: 12/19/2006
Est. Priority Date: 04/19/2002
Status: Active Grant

First Claim

1. A method for computerized processing of document data, comprising:

receiving the document data;

retrieving tokenization rules for the document data;

applying the tokenization rules to the document data to generate a plurality of tokens, each having one or more concordable characters from the document data;

receiving a user query;

re-retrieving the tokenization rules;

applying the re-retrieved tokenization rules to the user query to generate one or more search tokens, each having one or more concordable characters from the user query and wherein the user query is derived from a query language and comprises at least one reserved character from a set of concordable characters for use in the query language; and

searching an index comprising the tokens with the one or more search tokens.

5 Assignments

0 Petitions

Abstract

Processing of source documents to generate data for indexing, and of queries to generate data for searching, is done in accordance with retrieved tokenization rules and, if desired, retrieved normalization rules. Tokenization rules are used to define exactly what characters (letters, numbers, punctuation characters, etc.) and exactly what patterns of those characters (one or more contiguous characters, every individual character, etc.) comprise indexable and searchable units of data. Normalization rules are used to (potentially) modify the tokens created by the tokenizer in indexing and/or searching operations. Normalization accounts for things such as case-insensitive searching and language-specific nuances in which document authors can use accepted variations in the spelling of words. Query processing must employ the same tokenization and normalization rules as source processing in order for queries to accurately search the databases, and must also employ another set of concordable characters for use in the query language. This set of “reserved” characters includes characters for wildcard searching, quoted strings, field-qualified searching, range searching, and so forth.
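
The flow described in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: the rule store `RULES`, the regular expression standing in for a tokenization rule, and the `variants` table standing in for a normalization rule are all assumptions made for the example. What it does show is the point the abstract stresses, namely that document text and query text pass through the same retrieved rules.

```python
import re
from collections import defaultdict

# Hypothetical rule store; names and structure are assumptions for illustration.
RULES = {
    # Tokenization: one or more contiguous letters or digits form a token.
    "tokenization": {"pattern": r"[A-Za-z0-9]+"},
    # Normalization: case folding plus one accepted spelling variation.
    "normalization": {"lowercase": True, "variants": {"colour": "color"}},
}

def retrieve_rules():
    """Return the (tokenization, normalization) rules from the rule store."""
    return RULES["tokenization"], RULES["normalization"]

def tokenize(text, tok_rules):
    """Split text into tokens according to the tokenization pattern."""
    return re.findall(tok_rules["pattern"], text)

def normalize(token, norm_rules):
    """Apply case folding and spelling-variant mapping to one token."""
    if norm_rules["lowercase"]:
        token = token.lower()
    return norm_rules["variants"].get(token, token)

def process(text):
    """Shared pipeline used for both source documents and queries."""
    tok_rules, norm_rules = retrieve_rules()
    return [normalize(t, norm_rules) for t in tokenize(text, tok_rules)]

def build_index(docs):
    """Map each normalized token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in process(text):
            index[token].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents containing every token of the query."""
    token_sets = [index.get(t, set()) for t in process(query)]
    return set.intersection(*token_sets) if token_sets else set()
```

Because `search` calls the same `process` function as `build_index`, a query for `COLOR` in this sketch finds a document that contains `Colour`.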

18 Citations

17 Claims

1. A method for computerized processing of document data, comprising:
- receiving the document data;
- retrieving tokenization rules for the document data;
- applying the tokenization rules to the document data to generate a plurality of tokens, each having one or more concordable characters from the document data;
- receiving a user query;
- re-retrieving the tokenization rules;
- applying the re-retrieved tokenization rules to the user query to generate one or more search tokens, each having one or more concordable characters from the user query and wherein the user query is derived from a query language and comprises at least one reserved character from a set of concordable characters for use in the query language; and
- searching an index comprising the tokens with the one or more search tokens.

2. The method of claim 1 further comprising:
- identifying characters and character patterns in textual data expected in the document data; and
- establishing the tokenization rules based upon the identified characters and character patterns;
- wherein the tokens are defined by the identified characters and character patterns.

3. The method of claim 2 further comprising:
- identifying new characters or new character patterns for new conventions in textual data expected in the document data; and
- modifying the tokenization rules based upon the new identified characters and the new character patterns;
- wherein the tokens are defined by the identified characters, the new identified characters, the identified character patterns, and the new identified character patterns.

4. The method of claim 1 wherein the receiving step comprises receiving one or more letters, numbers, punctuation characters, or any combination thereof.

5. The method of claim 1 wherein the tokenization rules comprise pattern definitions and concordable character definitions.

6. The method of claim 5 wherein the concordable character definitions comprise a non-spaceless character set, a punctuation character set, and a spaceless character set.

7. The method of claim 1 further comprising:
- retrieving normalization rules; and
- applying the normalization rules with the tokenization rules to the document data to generate the tokens.

8. The method of claim 1 further comprising:
- retrieving normalization rules;
- applying the normalization rules with the tokenization rules to the document data to generate the tokens;
- re-retrieving the normalization rules; and
- applying the re-retrieved normalization rules with the re-retrieved tokenization rules to the user query to generate the search tokens.
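
Claims 5 and 6 name three concordable character sets, and the abstract mentions two token patterns (one or more contiguous characters, and every individual character). One way these could fit together is sketched below. The mapping is an assumption made for illustration: CJK ideographs stand in for the spaceless set and each becomes its own token, alphanumeric characters outside that range stand in for the non-spaceless set and are grouped into contiguous runs, and the hypothetical `PUNCTUATION` set holds punctuation characters treated as concordable. The claims themselves do not define the sets this way.

```python
# Assumed punctuation characters that are indexable (concordable).
PUNCTUATION = set("&@")

def is_spaceless(ch):
    """Assumption: the CJK Unified Ideographs block stands in for the spaceless set."""
    return "\u4e00" <= ch <= "\u9fff"

def is_non_spaceless(ch):
    """Letters and digits from scripts that separate words with spaces."""
    return ch.isalnum() and not is_spaceless(ch)

def tokenize(text):
    """Apply two patterns: contiguous runs for non-spaceless characters,
    single-character tokens for spaceless and punctuation characters."""
    tokens, run = [], []

    def flush():
        # Emit the pending run of non-spaceless characters, if any.
        if run:
            tokens.append("".join(run))
            run.clear()

    for ch in text:
        if is_spaceless(ch):
            flush()
            tokens.append(ch)
        elif is_non_spaceless(ch):
            run.append(ch)
        elif ch in PUNCTUATION:
            flush()
            tokens.append(ch)
        else:
            # Non-concordable character (space, comma, ...): ends the run.
            flush()
    flush()
    return tokens
```

Under these assumptions, changing what counts as a token means editing the character sets rather than the tokenizer loop, which is the kind of rule modification claim 3 describes.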

9. A method for computerized query processing comprising:
- receiving a user query;
- retrieving tokenization rules;
- applying the tokenization rules to the user query to generate one or more search tokens, each having one or more concordable characters from the user query wherein the user query is derived from a query language and comprises at least one reserved character from a set of concordable characters for use in the query language; and
- searching an index with the one or more search tokens, the index comprising tokens based on application of the tokenization rules to document data.

10. The method of claim 9 further comprising:
- retrieving normalization rules; and
- applying the normalization rules with the tokenization rules to the user query to generate the search tokens, the tokens in the index further being based on application of the normalization rules to the document data.
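
The query-side claims add reserved characters that are concordable only in the query language; the abstract lists wildcard searching as one use. The sketch below illustrates that idea under stated assumptions: `*` and `?` are chosen as the reserved wildcard characters, the query pattern is the document pattern extended with those two characters, lowercasing stands in for a normalization rule, and Python's standard `fnmatch.fnmatchcase` does the wildcard matching. None of these choices come from the patent text.

```python
import re
from fnmatch import fnmatchcase

RESERVED = "*?"                      # assumed reserved wildcard characters
DOC_PATTERN = r"[A-Za-z0-9]+"        # concordable characters for documents
QUERY_PATTERN = r"[A-Za-z0-9*?]+"    # same set plus the reserved characters

def tokenize(text, pattern):
    """Tokenize with the given pattern and lowercase (assumed normalization)."""
    return [t.lower() for t in re.findall(pattern, text)]

def search(index, query):
    """Return ids of documents matching every search token in the query.

    `index` maps each indexed token to a set of document ids.
    """
    results = None
    for term in tokenize(query, QUERY_PATTERN):
        if any(c in RESERVED for c in term):
            # Wildcard term: collect documents for every indexed token it matches.
            docs = set()
            for token, ids in index.items():
                if fnmatchcase(token, term):
                    docs |= ids
        else:
            docs = set(index.get(term, set()))
        results = docs if results is None else results & docs
    return results or set()
```

Keeping the reserved characters out of `DOC_PATTERN` means a literal `*` in a source document is never indexed as part of a token in this sketch, so it cannot be confused with a wildcard at query time.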

11. A method for computerized searching of document data, comprising:
- receiving the document data;
- retrieving tokenization rules for the document data;
- applying the tokenization rules to the document data to generate a plurality of tokens, each having one or more concordable characters from the document data;
- storing the tokens in a data base;
- receiving a user query derived from a query language and comprising at least one reserved character from a set of concordable characters for use in the query language;
- re-retrieving the tokenization rules;
- applying the re-retrieved tokenization rules to the user query to generate one or more search tokens, each having one or more concordable characters from the user query; and
- searching the tokens in the data base with the one or more search tokens.

12. An apparatus comprising a subsystem for computerized processing of search queries, the query processing subsystem comprising:
- a programmed component for receiving a user query;
- a programmed component for retrieving tokenization rules;
- a programmed component for applying the tokenization rules to the user query to generate one or more search tokens, each having one or more concordable characters from the user query wherein the user query is derived from a query language and comprises at least one reserved character from a set of concordable characters for use in the query language; and
- a programmed component for searching an index with the one or more search tokens, the index comprising tokens based on application of the tokenization rules to document data.

13. The apparatus of claim 12 further comprising:
- a programmed component for retrieving normalization rules; and
- a programmed component for applying the normalization rules with the tokenization rules to the user query to generate the search tokens, the tokens in the index further being based on application of the normalization rules to the document data.

14. An apparatus for computerized searching of document data comprising a source processing subsystem and a query processing subsystem, wherein:
- the source processing subsystem comprises:
  - a programmed component for receiving the document data;
  - a programmed component for retrieving tokenization rules for the document data; and
  - a programmed component for applying the tokenization rules to the document data to generate a plurality of tokens for storage in a data base, each of the tokens having one or more concordable characters from the document data; and
- the query processing subsystem comprises:
  - a programmed component for receiving a user query derived from a query language and comprising at least one reserved character from a set of concordable characters for use in the query language;
  - a programmed component for re-retrieving the tokenization rules; and
  - a programmed component for applying the re-retrieved tokenization rules to the user query to generate one or more search tokens for searching the tokens in the data base, each of the search tokens having one or more concordable characters from the user query.

15. A computer-readable medium carrying a subsystem program for computerized processing of search queries, the subsystem program executed on a computer comprising:
- instructions for receiving a user query;
- instructions for retrieving tokenization rules;
- instructions for applying the tokenization rules to the user query to generate one or more search tokens, each having one or more concordable characters from the user query wherein the user query is derived from a query language and comprises at least one reserved character from a set of concordable characters for use in the query language; and
- instructions for searching an index with the one or more search tokens, the index comprising tokens based on application of the tokenization rules to document data.

16. The computer-readable medium of claim 15 further comprising:
- instructions for retrieving normalization rules; and
- instructions for applying the normalization rules with the tokenization rules to the user query to generate the search tokens, the tokens in the index further being based on application of the normalization rules to the document data.

17. A computer-readable medium carrying a subsystem program for computerized processing of source documents and a subsystem program for computerized processing of search queries executed on a computer, wherein:
- the source processing subsystem program comprises:
  - instructions for receiving the document data;
  - instructions for retrieving tokenization rules for the document data; and
  - instructions for applying the tokenization rules to the document data to generate a plurality of tokens for storage in a data base, each of the tokens having one or more concordable characters from the document data; and
- the query processing subsystem program comprises:
  - instructions for receiving a user query derived from a query language and comprising at least one reserved character from a set of concordable characters for use in the query language;
  - instructions for re-retrieving the tokenization rules; and
  - instructions for applying the re-retrieved tokenization rules to the user query to generate one or more search tokens for searching the tokens in the data base, each of the search tokens having one or more concordable characters from the user query.

Current Assignee: Factiva Incorporated, Dow Jones Reuters Business Interactive LLC (News Corporation)
Original Assignee: Dow Jones Reuters Business Interactive LLC (News Corporation)
Inventors: Snyder, James D.
Primary Examiner(s): Robinson, Greta
Assistant Examiner(s): LEWIS, CHERYL RENEA
Application Number: US10/417,548
Publication Number: US 20030200199A1
Time in Patent Office: 1,343 Days
Field of Search: 707/1-6, 707/100, 715/500, 715/531
US Class Current: 1/1
CPC Class Codes:
G06F 40/284   Lexical analysis, e.g. toke...
Y10S 707/99932   Access augmentation or opti...
Y10S 707/99936   Pattern matching access
