Method and system for searching words in documents written in a source language as transcript of words in an origin language
First Claim
1. A computer-implemented method for searching for words in documents written in a source language, which words are not meaningful in said source language but are transcripts of meaningful words in an origin language, the method comprising two processes:
- a) a preparation process executed for each new document, the preparation process comprising the following steps:
i) reading the document;
ii) extracting unrecognized words in the source language;
iii) updating search indexes in the corpus for all document words;
iv) for each new unrecognized word in the source language:
1) removing prefixes and suffixes;
2) performing phonetic conversion;
3) checking the frequency of the unrecognized word's spelling in the System Hebraized Medical Lexicon (SHML);
4) defining the most frequent spelling of the unrecognized word as the central term and connecting it to other allowable and close spellings of that term;
5) updating the System Hebraized Medical Lexicon (SHML);
b) a search process comprising the following steps:
i) reading the search request and performing auto-complete for terms from the System Hebraized Medical Lexicon (SHML);
ii) generating a phonetic conversion for all words in the query;
iii) for each word:
1) searching for similar phonetics in the corpus and finding central terms;
2) calculating the distance to the found similar words and ordering them in ascending order; and
3) displaying relevant documents according to the distance.
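The preparation steps of the claim (affix removal, phonetic conversion, spelling-frequency check, and choice of a central term) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the patent: the affix lists, the toy phonetic rules, the `Lexicon` class, and the Latin-letter stand-ins used in place of real source-language spellings.

```python
from collections import Counter, defaultdict

# Hypothetical affix lists; a real system would use source-language morphology.
PREFIXES = ("ha", "ve", "be")
SUFFIXES = ("im", "ot")


def strip_affixes(word, min_stem=4):
    """Remove at most one known prefix and one known suffix, keeping a minimum stem."""
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= min_stem:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            word = word[:-len(suffix)]
            break
    return word


def phonetic_key(word):
    """Toy phonetic normalization: merge similar consonants, drop vowels, collapse repeats."""
    table = str.maketrans({"k": "c", "q": "c", "z": "s", "v": "b", "f": "p"})
    out = []
    for ch in word.lower().translate(table):
        if ch in "aeiouy":
            continue
        if not out or out[-1] != ch:
            out.append(ch)
    return "".join(out)


class Lexicon:
    """Stand-in for the lexicon of unrecognized words: phonetic key -> spelling counts."""

    def __init__(self):
        self.spellings = defaultdict(Counter)

    def add_document(self, words, known_vocabulary):
        """Record every word that is not in the source-language vocabulary."""
        for word in words:
            word = word.lower()
            if word in known_vocabulary:
                continue
            stem = strip_affixes(word)
            self.spellings[phonetic_key(stem)][stem] += 1

    def central_term(self, key):
        """Most frequent spelling recorded for a phonetic key, or None if unseen."""
        counts = self.spellings.get(key)
        if not counts:
            return None
        return counts.most_common(1)[0][0]

    def variants(self, key):
        """All spellings connected to the same phonetic key, sorted alphabetically."""
        return sorted(self.spellings.get(key, ()))


lexicon = Lexicon()
lexicon.add_document(
    ["antibiotika", "antibiotica", "antibiotika", "haantibiotika", "shalom"],
    known_vocabulary={"shalom"},
)
key = phonetic_key("antibiotika")
print(key, lexicon.central_term(key), lexicon.variants(key))
```

In this sketch the spellings that share a phonetic key are grouped together, and the most frequent one acts as the central term. Updating the search indexes for all document words (step iii) is left out to keep the example short.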
Abstract
The invention relates to a computer-implemented method for searching for words in documents written in a source language, where the words are not in the vocabulary of said source language but are transcripts of meaningful words in an origin language. The method comprises a preparation process and a search process. During the preparation process, a database of unrecognized words in the source language is maintained. Among other data, it contains the normalized phonetic conversion of each unrecognized word. A corpus of all words of the documents in the search domain is also maintained, together with indexes for efficient search. During search, the search word is phonetically converted and normalized, and the distance to phonetically similar words in the corpus is calculated. The words found in the corpus are arranged in ascending order of distance, and the relevant documents are displayed.
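The search side described in the abstract can be sketched as follows. The abstract does not say which distance measure is used, so Levenshtein edit distance over phonetic keys is an assumption here, as are the `search` function, the `max_distance` threshold, the index layout, and the sample data. The phonetic keys are assumed to have been produced beforehand by the normalization step.

```python
def edit_distance(a, b):
    """Levenshtein distance: insertions, deletions, and substitutions each cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, free when characters match
            ))
        prev = curr
    return prev[-1]


def search(query_key, index, max_distance=1):
    """Return (distance, central term, document ids) tuples, closest first.

    `index` is a hypothetical mapping: phonetic key -> (central term, set of doc ids).
    """
    hits = []
    for key, (term, doc_ids) in index.items():
        distance = edit_distance(query_key, key)
        if distance <= max_distance:
            hits.append((distance, term, sorted(doc_ids)))
    hits.sort()  # ascending distance, ties broken by term
    return hits


# Illustrative index with Latin-letter stand-ins for transliterated terms.
index = {
    "ntbtc": ("antibiotika", {"doc1", "doc3"}),
    "ntbd": ("antibodi", {"doc2"}),
    "crdlg": ("kardiologia", {"doc4"}),
}

for distance, term, docs in search("ntbtc", index, max_distance=2):
    print(distance, term, docs)
```

A linear scan over all keys is used only for clarity. A larger corpus would need an index structure that narrows the candidate keys before distances are computed.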
8 Citations
4 Claims
Specification