Apparatus and method for generating data useful in indexing and searching

US 20030200199A1
Filed: 04/16/2003
Published: 10/23/2003
Est. Priority Date: 04/19/2002
Status: Active Grant

Abstract

Processing of source documents to generate data for indexing, and of queries to generate data for searching, is done in accordance with retrieved tokenization rules and, if desired, retrieved normalization rules. Tokenization rules are used to define exactly what characters (letters, numbers, punctuation characters, etc.) and exactly what patterns of those characters (one or more contiguous characters, every individual character, etc.) comprise indexable and searchable units of data. Normalization rules are used to (potentially) modify the tokens created by the tokenizer in indexing and/or searching operations. Normalization accounts for things such as case-insensitive searching and language-specific nuances in which document authors can use accepted variations in the spelling of words. Query processing must employ the same tokenization and normalization rules as source processing in order for queries to accurately search the databases, and must also employ another set of concordable characters for use in the query language. This set of “reserved” characters includes characters for wildcard searching, quoted strings, field-qualified searching, range searching, and so forth.
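
The interplay between tokenization and normalization described in the abstract can be illustrated with a minimal sketch. The rule tables below (`TOKEN_PATTERN`, `SPELLING_VARIANTS`) and the function names are hypothetical; the source does not specify concrete rules. The sketch only assumes that the same two rule sets are applied to both documents and queries.

```python
import re

# Hypothetical rule tables; the source names no concrete rules.
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+")  # tokenization rule: runs of letters/digits
SPELLING_VARIANTS = {"colour": "color", "centre": "center"}  # normalization rule


def tokenize(text):
    """Split text into tokens using the tokenization rule."""
    return TOKEN_PATTERN.findall(text)


def normalize(token):
    """Case-fold a token, then map accepted spelling variants to one form."""
    folded = token.casefold()
    return SPELLING_VARIANTS.get(folded, folded)


def process(text):
    """Tokenize, then normalize; used for documents and queries alike."""
    return [normalize(t) for t in tokenize(text)]


doc_tokens = process("The Colour Centre opens in 2003.")
query_tokens = process("color CENTER")
```

Because both sides pass through `process`, the query tokens `color` and `center` match the document even though the document used different casing and spelling.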

26 Claims

1. A method for computerized processing of document data, comprising:
- receiving the document data;
  
  retrieving tokenization rules for the document data; and
  
  applying the tokenization rules to the document data to generate a plurality of tokens, each having one or more concordable characters from the document data.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9, 10):
  - 2. The method of claim 1 further comprising:
    - identifying characters and character patterns in textual data expected in the document data; and
      
      establishing the tokenization rules based upon the identified characters and character patterns;
      
      wherein the tokens are defined by the identified characters and character patterns.
  - 3. The method of claim 2 further comprising:
    - identifying new characters or new character patterns for new conventions in textual data expected in the document data; and
      
      modifying the tokenization rules based upon the new identified characters and the new character patterns;
      
      wherein the tokens are defined by the identified characters, the new identified characters, the identified character patterns, and the new identified character patterns.
  - 4. The method of claim 1 wherein the receiving step comprises receiving one or more letters, numbers, punctuation characters, or any combination thereof.
  - 5. The method of claim 1 wherein the tokenization rules comprise pattern definitions and concordable character definitions.
  - 6. The method of claim 5 wherein the concordable character definitions comprise a non-spaceless character set, a punctuation character set, and a spaceless character set.
  - 7. The method of claim 1 further comprising:
    - retrieving normalization rules; and
      
      applying the normalization rules with the tokenization rules to the document data to generate the tokens.
  - 8. The method of claim 1 further comprising:
    - receiving a user query;
      
      re-retrieving the tokenization rules;
      
      applying the re-retrieved tokenization rules to the user query to generate one or more search tokens, each having one or more concordable characters from the user query; and
      
      searching an index comprising the tokens with the one or more search tokens.
  - 9. The method of claim 8 further comprising:
    - retrieving normalization rules;
      
      applying the normalization rules with the tokenization rules to the document data to generate the tokens;
      
      re-retrieving the normalization rules; and
      
      applying the re-retrieved normalization rules with the re-retrieved tokenization rules to the user query to generate the search tokens.
  - 10. The method of claim 8 wherein the user query is derived from a query language and comprises at least one reserved character from a set of concordable characters for use in the query language.
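
Claims 3, 5, and 6 can be read as describing rules that are data rather than hard-coded logic. The following is a rough sketch under stated assumptions: the `TokenizationRules` class and its method names are invented for illustration, and the behavior assigned to each character set (contiguous runs for non-spaceless characters, one token per spaceless character, concordable punctuation kept inside runs) is one plausible interpretation, not the patent's stated implementation.

```python
import string
from dataclasses import dataclass, field


@dataclass
class TokenizationRules:
    """Hypothetical container for concordable character definitions."""
    non_spaceless: set = field(default_factory=set)  # grouped into contiguous runs
    punctuation: set = field(default_factory=set)    # assumed to be kept inside runs
    spaceless: set = field(default_factory=set)      # each character is its own token

    def add_characters(self, kind, chars):
        """Modify the rules to cover a new textual convention."""
        getattr(self, kind).update(chars)


def tokenize(text, rules):
    tokens, run = [], ""
    for ch in text:
        if ch in rules.non_spaceless or ch in rules.punctuation:
            run += ch
            continue
        if run:  # any other character closes the open run
            tokens.append(run)
            run = ""
        if ch in rules.spaceless:
            tokens.append(ch)
    if run:
        tokens.append(run)
    return tokens


rules = TokenizationRules(
    non_spaceless=set(string.ascii_letters + string.digits),
    spaceless=set("東京"),
)
before = tokenize("AT&T in 東京", rules)
rules.add_characters("punctuation", "&")  # new convention: '&' becomes concordable
after = tokenize("AT&T in 東京", rules)
```

In this sketch `before` splits `AT&T` into two tokens, while `after` keeps it whole once `&` has been added to the punctuation set; no tokenizer code changes, only the rules.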

11. A method for query processing comprising:
- receiving a user query;
  
  retrieving tokenization rules;
  
  applying the tokenization rules to the user query to generate one or more search tokens, each having one or more concordable characters from the user query; and
  
  searching an index with the one or more search tokens, the index comprising tokens based on application of the tokenization rules to document data.
- Dependent claims (12, 13):
  - 12. The method of claim 11 further comprising:
    - retrieving normalization rules; and
      
      applying the normalization rules with the tokenization rules to the user query to generate the search tokens, the tokens in the index further being based on application of the normalization rules to the document data.
  - 13. The method of claim 11 wherein the user query is derived from a query language and comprises at least one reserved character from a set of concordable characters for use in the query language.
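
As an illustration of claims 11 to 13, the sketch below tokenizes a query with the same concordable characters used for documents plus a small set of reserved characters. The choice of `*` and `?` as wildcards, the AND semantics across search tokens, and all function names are assumptions made for the example; lower-casing stands in for a normalization rule.

```python
import re
from fnmatch import fnmatchcase

RESERVED = "*?"     # assumed reserved query-language characters (wildcards)
WORD = r"[a-z0-9]"  # concordable characters under the assumed tokenization rule


def tokenize_document(text):
    return re.findall(rf"{WORD}+", text.lower())


def tokenize_query(query):
    # Same concordable set as documents, plus the reserved characters.
    return re.findall(rf"(?:{WORD}|[{re.escape(RESERVED)}])+", query.lower())


def build_index(docs):
    """Map each token to the set of document ids containing it."""
    index = {}
    for doc_id, text in docs.items():
        for tok in tokenize_document(text):
            index.setdefault(tok, set()).add(doc_id)
    return index


def search(index, query):
    """Return ids of documents matching every search token (AND, assumed)."""
    result = None
    for pattern in tokenize_query(query):
        hits = set()
        for tok, ids in index.items():
            if fnmatchcase(tok, pattern):
                hits |= ids
        result = hits if result is None else result & hits
    return result or set()


index = build_index({
    1: "Tokenization rules",
    2: "Normalization rules",
    3: "Token index",
})
```

With this index, `search(index, "token*")` finds documents 1 and 3, and adding a second search token as in `search(index, "token* rules")` narrows the result to document 1.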

14. A method for computerized searching of document data, comprising:
- receiving the document data;
  
  retrieving tokenization rules for the document data;
  
  applying the tokenization rules to the document data to generate a plurality of tokens, each having one or more concordable characters from the document data;
  
  storing the tokens in a data base;
  
  receiving a user query derived from a query language and comprising at least one reserved character from a set of concordable characters for use in the query language;
  
  re-retrieving the tokenization rules;
  
  applying the re-retrieved tokenization rules to the user query to generate one or more search tokens, each having one or more concordable characters from the user query; and
  
  searching the tokens in the data base with the one or more search tokens.
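
The end-to-end flow of claim 14 might look roughly like the sketch below. It assumes the rules are stored as a JSON string, that tokens are stored with their positions in an in-memory SQLite table, and that the double quote is the reserved character marking a quoted string; none of these specifics come from the source, and `retrieve_rules`, `index_document`, and `search_phrase` are hypothetical names.

```python
import json
import re
import sqlite3

# Hypothetical stored rule set; a real system would persist this elsewhere.
STORED_RULES = json.dumps({"token_pattern": "[A-Za-z0-9]+", "quote": "\""})


def retrieve_rules():
    """Stand-in for retrieving (or re-retrieving) the tokenization rules."""
    return json.loads(STORED_RULES)


def index_document(db, doc_id, text):
    rules = retrieve_rules()
    tokens = re.findall(rules["token_pattern"], text)
    db.executemany(
        "INSERT INTO tokens (doc_id, pos, token) VALUES (?, ?, ?)",
        [(doc_id, pos, tok) for pos, tok in enumerate(tokens)],
    )


def search_phrase(db, query):
    """Treat the query as a quoted string: tokens must appear consecutively."""
    rules = retrieve_rules()  # re-retrieved at query time
    tokens = re.findall(rules["token_pattern"], query.strip(rules["quote"]))
    if not tokens:
        return set()
    candidates = db.execute(
        "SELECT doc_id, pos FROM tokens WHERE token = ?", (tokens[0],)
    ).fetchall()
    matches = set()
    for doc_id, pos in candidates:
        if all(
            db.execute(
                "SELECT 1 FROM tokens WHERE doc_id = ? AND pos = ? AND token = ?",
                (doc_id, pos + offset, tok),
            ).fetchone()
            for offset, tok in enumerate(tokens)
        ):
            matches.add(doc_id)
    return matches


db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE tokens (doc_id INTEGER, pos INTEGER, token TEXT)")
index_document(db, 1, "search tokens in the data base")
index_document(db, 2, "the base data")
```

Here `search_phrase(db, '"data base"')` matches only document 1, because both the indexing step and the query step split text with the same retrieved pattern and the phrase tokens must be adjacent.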

15. An apparatus comprising a subsystem for computerized processing of source documents, the source processing subsystem comprising:
- a programmed component for receiving the document data;
  
  a programmed component for retrieving tokenization rules for the document data; and
  
  a programmed component for applying the tokenization rules to the document data to generate a plurality of tokens, each having one or more concordable characters from the document data.
- Dependent claims (16, 17):
  - 16. The apparatus of claim 15 wherein:
    - the tokenization rules are based at least in part upon characters and character patterns;
      
      the source processing subsystem further comprises a programmed component for modifying the tokenization rules based upon new characters, or new character patterns, or any combination thereof, for new conventions in textual data expected in the document data; and
      
      the tokens are defined by the characters, the new characters, the character patterns, and the new character patterns.
  - 17. The apparatus of claim 15 wherein the source processing subsystem further comprises:
    - a programmed component for retrieving normalization rules; and
      
      a programmed component for applying the normalization rules with the tokenization rules to the document data to generate the tokens.

18. An apparatus comprising a subsystem for computerized processing of search queries, the query processing subsystem comprising:
- a programmed component for receiving a user query;
  
  a programmed component for retrieving tokenization rules;
  
  a programmed component for applying the tokenization rules to the user query to generate one or more search tokens, each having one or more concordable characters from the user query; and
  
  a programmed component for searching an index with the one or more search tokens, the index comprising tokens based on application of the tokenization rules to document data.
- Dependent claims (19):
  - 19. The apparatus of claim 18 further comprising:
    - a programmed component for retrieving normalization rules; and
      
      a programmed component for applying the normalization rules with the tokenization rules to the user query to generate the search tokens, the tokens in the index further being based on application of the normalization rules to the document data.

20. An apparatus for computerized searching of document data comprising a source processing subsystem and a query processing subsystem, wherein:
- the source processing subsystem comprises:
  
  a programmed component for receiving the document data;
  
  a programmed component for retrieving tokenization rules for the document data; and
  
  a programmed component for applying the tokenization rules to the document data to generate a plurality of tokens for storage in a data base, each of the tokens having one or more concordable characters from the document data; and
  
  the query processing subsystem comprises:
  
  a programmed component for receiving a user query derived from a query language and comprising at least one reserved character from a set of concordable characters for use in the query language;
  
  a programmed component for re-retrieving the tokenization rules; and
  
  a programmed component for applying the re-retrieved tokenization rules to the user query to generate one or more search tokens for searching the tokens in the data base, each of the search tokens having one or more concordable characters from the user query.

21. A computer-readable medium carrying a subsystem program for computerized processing of source documents, the subsystem program comprising:
- instructions for receiving the document data;
  
  instructions for retrieving tokenization rules for the document data; and
  
  instructions for applying the tokenization rules to the document data to generate a plurality of tokens, each having one or more concordable characters from the document data.
- Dependent claims (22, 23):
  - 22. The computer-readable medium of claim 21 wherein:
    - the tokenization rules are based at least in part upon characters and character patterns;
      
      the subsystem program further comprises instructions for modifying the tokenization rules based upon new characters, or new character patterns, or any combination thereof, for new conventions in textual data expected in the document data; and
      
      the tokens are defined by the characters, the new characters, the character patterns, and the new character patterns.
  - 23. The computer-readable medium of claim 22 wherein the subsystem program further comprises:
    - instructions for retrieving normalization rules; and
      
      instructions for applying the normalization rules with the tokenization rules to the document data to generate the tokens.

24. A computer-readable medium carrying a subsystem program for computerized processing of search queries, the subsystem program comprising:
- instructions for receiving a user query;
  
  instructions for retrieving tokenization rules;
  
  instructions for applying the tokenization rules to the user query to generate one or more search tokens, each having one or more concordable characters from the user query; and
  
  instructions for searching an index with the one or more search tokens, the index comprising tokens based on application of the tokenization rules to document data.
- Dependent claims (25):
  - 25. The computer-readable medium of claim 24 further comprising:
    - instructions for retrieving normalization rules; and
      
      instructions for applying the normalization rules with the tokenization rules to the user query to generate the search tokens, the tokens in the index further being based on application of the normalization rules to the document data.

26. A computer-readable medium carrying a subsystem program for computerized processing of source documents and a subsystem program for computerized processing of search queries, wherein:
- the source processing subsystem program comprises:
  
  instructions for receiving the document data;
  
  instructions for retrieving tokenization rules for the document data; and
  
  instructions for applying the tokenization rules to the document data to generate a plurality of tokens for storage in a data base, each of the tokens having one or more concordable characters from the document data; and
  
  the query processing subsystem program comprises:
  
  instructions for receiving a user query derived from a query language and comprising at least one reserved character from a set of concordable characters for use in the query language;
  
  instructions for re-retrieving the tokenization rules; and
  
  instructions for applying the re-retrieved tokenization rules to the user query to generate one or more search tokens for searching the tokens in the data base, each of the search tokens having one or more concordable characters from the user query.

Current Assignee: Factiva Incorporated, Dow Jones Reuters Business Interactive LLC (News Corporation)
Original Assignee: Dow Jones Reuters Business Interactive LLC (News Corporation)
Inventors: Snyder, James D.
Granted Patent: US 7,152,056 B2
US Class Current: 707/2
CPC Class Codes:
- G06F 40/284   Lexical analysis, e.g. toke...
- Y10S 707/99932   Access augmentation or opti...
- Y10S 707/99936   Pattern matching access
