System and method for managing a textual archive using semantic units
Abstract
A system and method for indexing and searching textual archives using semantic units such as syllables and morphemes. In one aspect, a system for indexing a textual archive comprises an AHR (automatic handwriting recognition) system and/or OCR (optical character recognition) system for transcribing (decoding) textual input data (handwritten or typed text) into a string of semantic units (e.g., syllables or morphemes) using a statistical language model and a vocabulary based on those semantic units. The string of semantic units that results from the decoding process is stored in a semantic unit database and indexed with pointers to the corresponding textual data in the textual archive. In another aspect, a system for searching a textual archive is provided, wherein a word (or words) to be searched is rendered into a string of semantic units (e.g., syllables or morphemes, depending on the application). A search engine then compares the string of semantic units resulting from the input query against the decoded semantic unit database, and identifies textual data stored in the textual archive using the indexes that were generated during the semantic unit-based indexing process.
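The indexing and search flow described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patented implementation: the `TOY_UNITS` table, the `to_units` helper, and the `SemanticUnitIndex` class are hypothetical names, and a hand-written syllable table stands in for the AHR/OCR decoder and its statistical language model.

```python
from collections import defaultdict

# Hypothetical toy vocabulary mapping words to syllable-like semantic units.
# A real system would obtain these units from an AHR/OCR decoder instead.
TOY_UNITS = {
    "archive": ["ar", "chive"],
    "textual": ["tex", "tu", "al"],
    "index": ["in", "dex"],
    "search": ["search"],
}


def to_units(text):
    """Render text into a flat list of semantic units (unknown words stay whole)."""
    units = []
    for word in text.lower().split():
        units.extend(TOY_UNITS.get(word, [word]))
    return units


class SemanticUnitIndex:
    """Semantic unit database with pointers back to documents in the archive."""

    def __init__(self):
        # unit -> set of (doc_id, position) pointers
        self.postings = defaultdict(set)

    def add(self, doc_id, text):
        """Decode a document into units and record where each unit occurs."""
        for pos, unit in enumerate(to_units(text)):
            self.postings[unit].add((doc_id, pos))

    def search(self, query):
        """Return sorted doc ids that contain the query's units as a contiguous run."""
        units = to_units(query)
        if not units:
            return []
        hits = set()
        for doc_id, start in self.postings.get(units[0], ()):
            if all((doc_id, start + i) in self.postings.get(u, ())
                   for i, u in enumerate(units)):
                hits.add(doc_id)
        return sorted(hits)


index = SemanticUnitIndex()
index.add("doc1", "textual archive")
index.add("doc2", "search index")
print(index.search("archive"))
print(index.search("tex"))
```

Because the comparison is made on units rather than whole words, the query `tex` in this sketch still locates `doc1`, which is the kind of sub-word matching a unit-based index makes possible.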
23 Claims
1. A method for managing a textual database, comprising the steps of:
receiving textual data;
identifying a data type of the textual data;
transcribing the textual data into corresponding semantic units of words using a recognition system for the identified data type, wherein the recognition system performs transcription by decoding the textual data using a language model and phonetic dictionary of semantic units;
storing the textual data in a textual database; and
generating an index based on semantic units of words for indexing the stored textual data with the corresponding semantic units.
Dependent claims: 2-18.
19. A program storage device readable by a machine, tangibly embodying a program of instructions executable by the machine to perform method steps for managing a textual database, the method comprising the steps of:
receiving textual data;
identifying a data type of the textual data;
transcribing the textual data into corresponding semantic units of words using a recognition system for the identified data type, wherein the recognition system performs transcription by decoding the textual data using a language model and phonetic dictionary of semantic units;
storing the textual data in a textual database; and
generating an index based on semantic units of words for indexing the stored textual data with the corresponding semantic units.
20. A system for managing a textual database, comprising:
a data type identification system for identifying a data type of textual data;
a recognition system for transcribing the textual data into corresponding semantic units of words based on the identified data type, wherein the text recognition system performs transcription by decoding the textual data using a language model and phonetic dictionary of semantic units, wherein the recognition system comprises an OCR (optical character recognition) system for transcribing typed text, and an AHR (automatic handwriting recognition) system for transcribing handwritten text;
a textual database for storing the textual data; and
an index generator adapted to generate an index based on semantic units of words, wherein the textual data stored in the textual database is indexed with the corresponding semantic units.
Dependent claims: 21-23.
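As a rough illustration of how the components recited in claim 20 might fit together, the following Python sketch routes each item to a recognizer chosen by its identified data type, stores the item, and updates the index. Everything here is an assumption made for illustration: the `TextualItem` and `TextualArchiveManager` names, the `source` hint used to identify the data type, and the fixed-width `split_units` chunking that stands in for real OCR/AHR decoding with a language model and dictionary.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class TextualItem:
    item_id: str
    payload: str  # stand-in for scanned image or pen-stroke data
    source: str   # hypothetical hint: "scanner" or "tablet"


def identify_data_type(item: TextualItem) -> str:
    """Hypothetical rule: scanner input is typed text, anything else is handwriting."""
    return "typed" if item.source == "scanner" else "handwritten"


def split_units(text: str, size: int = 3) -> List[str]:
    """Crude fixed-width chunks standing in for syllables or morphemes."""
    units: List[str] = []
    for word in text.lower().split():
        units.extend(word[i:i + size] for i in range(0, len(word), size))
    return units


def ocr_transcribe(payload: str) -> List[str]:
    """Placeholder for an OCR decoder; here it only chunks the payload."""
    return split_units(payload)


def ahr_transcribe(payload: str) -> List[str]:
    """Placeholder for an AHR decoder; here it only chunks the payload."""
    return split_units(payload)


@dataclass
class TextualArchiveManager:
    recognizers: Dict[str, Callable[[str], List[str]]]
    database: Dict[str, TextualItem] = field(default_factory=dict)
    index: Dict[str, List[str]] = field(default_factory=dict)  # unit -> item ids

    def ingest(self, item: TextualItem) -> str:
        """Identify the data type, transcribe, store the item, and index its units."""
        data_type = identify_data_type(item)
        units = self.recognizers[data_type](item.payload)
        self.database[item.item_id] = item
        for unit in units:
            ids = self.index.setdefault(unit, [])
            if item.item_id not in ids:
                ids.append(item.item_id)
        return data_type


manager = TextualArchiveManager(
    recognizers={"typed": ocr_transcribe, "handwritten": ahr_transcribe}
)
typed_kind = manager.ingest(TextualItem("a", "archive", "scanner"))
hand_kind = manager.ingest(TextualItem("b", "arc", "tablet"))
print(typed_kind, hand_kind, manager.index)
```

Keeping the recognizers in a lookup table keyed by data type means another recognizer could be added without changing `ingest`; that is a design choice of this sketch rather than something the claims require.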
Specification