SYSTEMS AND METHODS FOR BUILDING AN ELECTRONIC DICTIONARY OF MULTI-WORD NAMES AND FOR PERFORMING FUZZY SEARCHES IN THE DICTIONARY
First Claim
1. A method for automatically building a contracted dictionary from a given list of multi-word units, comprising:
- receiving an input list of original multi-word units;
transforming the original multi-word units into single word elements;
associating an identifier with each single word element to obtain a collection of unique identifiers, each identifier being associated with a single word element;
storing the collection of identifiers and associated single word elements in a letter dictionary based on a trie, wherein;
each entry is a single words element;
letters of the single word elements are nodes of the trie; and
identifiers are glosses attached to the terminal nodes of the trie;
encoding each original multi-word unit in the input list by replacing each single word element within each multi-word unit by the associated identifier;
storing the encoded multi-word units in an identifier trie-based dictionary, wherein each entry is a set of identifiers representing a multi-word unit and each node is an identifier; and
building a contracted dictionary by contracting the letter trie-based dictionary and the identifier trie-based dictionary;
the contracting comprising, for the letter trie-based dictionary and the identifier trie-based dictionary, merging trie nodes while preserving each entry of the letter trie-based dictionary and each entry of the identifier trie-based dictionary.
1 Assignment
0 Petitions
Accused Products
Abstract
The present invention automatically builds a contracted dictionary from a given list of multi-word proper names and performs fuzzy searches in the contracted dictionary. The contracted dictionary of proper names includes two linked trie-based dictionaries: a first dictionary is used to store single word names, each word name having an ID number; and a second dictionary is used to store multi-word names encoded with ID numbers. Information related to the multi-word names is also stored as a gloss to the terminal node of the multi-word entry of the trie-based dictionary. An approximate lookup for a multi-word name is conducted first for each word of the multi-word name using an approximate matching technique such as a phonetic proximity or a simple edit distance. Accordingly, N suggestions is determined for each word of the multi-word name under consideration. Then, multi-word candidates are assembled in ID notation. Finally, an approximate search for each assembled candidate is performed based on an edit distance or a n-grams approximate string matching. Edit distances and N-grams are used to measure how similar two strings are. The result is a set of multi-word suggestions in an ID notation. This ID notation is encoded back to the original form using the first trie-based dictionary.
-
Citations
16 Claims
-
1. A method for automatically building a contracted dictionary from a given list of multi-word units, comprising:
-
receiving an input list of original multi-word units; transforming the original multi-word units into single word elements; associating an identifier with each single word element to obtain a collection of unique identifiers, each identifier being associated with a single word element; storing the collection of identifiers and associated single word elements in a letter dictionary based on a trie, wherein; each entry is a single words element; letters of the single word elements are nodes of the trie; and identifiers are glosses attached to the terminal nodes of the trie; encoding each original multi-word unit in the input list by replacing each single word element within each multi-word unit by the associated identifier; storing the encoded multi-word units in an identifier trie-based dictionary, wherein each entry is a set of identifiers representing a multi-word unit and each node is an identifier; and building a contracted dictionary by contracting the letter trie-based dictionary and the identifier trie-based dictionary;
the contracting comprising, for the letter trie-based dictionary and the identifier trie-based dictionary, merging trie nodes while preserving each entry of the letter trie-based dictionary and each entry of the identifier trie-based dictionary. - View Dependent Claims (2, 3, 4, 5, 6, 7)
-
-
8. A system for automatically building a contracted dictionary from a given list of multi-word units, comprising:
-
a system for receiving an input list of original multi-word units; a system for transforming the original multi-word units into single word elements; a system for associating an identifier with each single word element to obtain a collection of unique identifiers, each identifier being associated with a single word element; a system for storing the collection of identifiers and associated single word elements in a letter dictionary based on a trie, wherein; each entry is a single words element; letters of the single word elements are nodes of the trie; and identifiers are glosses attached to the terminal nodes of the trie; a system for encoding each original multi-word unit in the input list by replacing each single word element within each multi-word unit by the associated identifier; a system for storing the encoded multi-word units in an identifier trie-based dictionary, wherein each entry is a set of identifiers representing a multi-word unit and each node is an identifier; and a system for building a contracted dictionary by contracting the letter trie-based dictionary and the identifier trie-based dictionary, the system for building a contracted dictionary further comprising, for the letter trie-based dictionary and the identifier trie-based dictionary, a system for merging trie nodes while preserving each entry of the letter trie-based dictionary and each entry of the identifier trie-based dictionary.
-
-
9. A program product stored on a computer readable medium, which when executed, automatically builds a contracted dictionary from a given list of multi-word units, the computer readable medium comprising program code for:
-
receiving an input list of original multi-word units; transforming the original multi-word units into single word elements; associating an identifier with each single word element to obtain a collection of unique identifiers, each identifier being associated with a single word element; storing the collection of identifiers and associated single word elements in a letter dictionary based on a trie, wherein; each entry is a single words element; letters of the single word elements are nodes of the trie; and identifiers are glosses attached to the terminal nodes of the trie; encoding each original multi-word unit in the input list by replacing each single word element within each multi-word unit by the associated identifier; storing the encoded multi-word units in an identifier trie-based dictionary, wherein each entry is a set of identifiers representing a multi-word unit and each node is an identifier; and building a contracted dictionary by contracting the letter trie-based dictionary and the identifier trie-based dictionary;
the contracting comprising, for the letter trie-based dictionary and the identifier trie-based dictionary, merging trie nodes while preserving each entry of the letter trie-based dictionary and each entry of the identifier trie-based dictionary.
-
-
10. A method for performing fuzzy searches in a contracted dictionary of multi-word units, the contracted dictionary comprising two contracted trie-based dictionaries:
-
a letter trie-based dictionary comprising single word elements, each word element having an identifier, each identifier being associated with a single word element, wherein each entry is a single word element, letters of the single word elements are nodes of the trie, and identifiers are glosses attached to the terminal nodes of the trie; and an identifier trie-based dictionary comprising multi-word units encoded with the identifiers wherein each entry is a set of identifiers representing a multi-word unit and each node is an identifier; the method comprising; receiving a request comprising a multi-word unit; transforming the multi-word unit into single word elements; for each single word element, identifying in the letter trie-based dictionary a suggested word element with an identical string of letters and if none exists, applying on the letter trie-based dictionary an approximate matching technique for generating at least one suggested single word element with a letter string as similar as possible to the single word element; for each suggested word element, extracting from the letter trie-based dictionary the suggested single word element encoded with an identifier, each identifier being associated with a single word element; generating at least one multi-word unit candidate encoded with identifiers by assembling identifiers of encoded existing and suggested single word elements; for each encoded multi-word unit candidate, applying on the identifier trie-based dictionary, an approximate matching technique for obtaining at least one suggested multi-word unit encoded with identifiers as similar as possible to the encoded multi-word unit candidate; and for each encoded suggested multi-word unit, replacing each identifier by the associated single word element. - View Dependent Claims (11, 12, 13, 14, 15, 16)
-
Specification