MULTILINGUAL SEARCH FOR TRANSLITERATED CONTENT
First Claim
1. A computer-implemented process for searching for transliterated content, comprising:
collecting transliterated data in a foreign script and associated possible native forms for the transliterated data;
extracting textual content from the collected transliterated data and associated possible native forms and segmenting the extracted textual data into meaningful units;
creating a cross index in native script by indexing the textual units in a native script to related foreign script transliterated units from the collected transliterated data;
inputting a query to search the transliterated data and data in native forms;
searching the transliterated data and data in native forms using the cross index; and
returning transliterated data and data in native script in response to the input query.
Abstract
The multilingual search technique for transliterated content described herein lets a user submit a search query in either a native script or its foreign-script (e.g., Roman script) transliteration and returns relevant results in both scripts while accounting for spelling variations in the transliterated forms. The technique crawls the World Wide Web for data in both native-script and foreign-script transliterated forms. It uses a transliteration engine to generate native-script equivalents of the foreign-script transliterated data and, whenever possible, disambiguates the data in native script. The unique native-script word forms are then used to jointly index the data in both scripts. If the query is in native script, it is searched for directly in the index; otherwise, the transliterated query is first converted into its native-script form(s) and then searched for in the indexed database. In either case, results in both scripts are retrieved and ranked.
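The query flow in the abstract can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration: the `TRANSLIT` lookup table stands in for the transliteration engine, Hindi in Devanagari stands in for the native script, and the non-ASCII test in `is_native` is a crude placeholder for real script detection. Crawling, disambiguation, and ranking are omitted.

```python
from collections import defaultdict

# Hypothetical toy "transliteration engine": Roman spelling variants mapped to
# candidate native-script (here Devanagari) forms. A real engine would
# generate these rather than look them up.
TRANSLIT = {
    "dilli": ["दिल्ली"],
    "dehli": ["दिल्ली"],
    "delhi": ["दिल्ली"],
    "pyar": ["प्यार"],
    "pyaar": ["प्यार"],
}

def is_native(token):
    """Crude script check (assumption): any non-ASCII character means native script."""
    return any(ord(ch) > 127 for ch in token)

def to_native(token):
    """Return candidate native-script forms for a token in either script."""
    if is_native(token):
        return [token]
    return TRANSLIT.get(token.lower(), [])

def build_cross_index(docs):
    """Index every document under the native-script form of each of its tokens."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            for native in to_native(token):
                index[native].add(doc_id)
    return index

def search(index, query):
    """Convert each query token to native form(s), then intersect the postings."""
    results = None
    for token in query.split():
        hits = set()
        for native in to_native(token):
            hits |= index.get(native, set())
        results = hits if results is None else results & hits
    return sorted(results or [])

docs = {1: "dilli mein pyaar", 2: "दिल्ली", 3: "dehli"}
index = build_cross_index(docs)
print(search(index, "delhi"))    # Roman-script query
print(search(index, "दिल्ली"))     # native-script query
```

Because both scripts are keyed by the same native-script form, the Roman query and the native query reach the same postings, and the spelling variants "dilli", "dehli", and "delhi" collapse onto one index entry.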
20 Claims
1. A computer-implemented process for searching for transliterated content, comprising:
collecting transliterated data in a foreign script and associated possible native forms for the transliterated data;
extracting textual content from the collected transliterated data and associated possible native forms and segmenting the extracted textual data into meaningful units;
creating a cross index in native script by indexing the textual units in a native script to related foreign script transliterated units from the collected transliterated data;
inputting a query to search the transliterated data and data in native forms;
searching the transliterated data and data in native forms using the cross index; and
returning transliterated data and data in native script in response to the input query.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11
12. A computer-implemented process for creating a database indexed to be used for searching for transliterated content, comprising:
collecting transliterated data and associated possible native forms of the transliterated data;
extracting textual content from the collected transliterated data and segmenting the extracted textual content into meaningful units;
creating a cross index by indexing the textual units in a native script to related foreign script transliterated units and if textual units in the native script cannot be cross-indexed to related transliterated units, generating equivalent native script forms for the foreign script transliterated unit which are indexed in the cross index.
Dependent claims: 13, 14, 15
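One way to read the fallback step in claim 12 is sketched below in Python. This is an illustrative interpretation, not the patented implementation: `collected_pairs` (native forms gathered alongside the transliterated data), `generate_native` (a stand-in for a transliteration engine), and the Hindi sample words are all assumptions introduced for the example.

```python
from collections import defaultdict

def build_cross_index(translit_units, collected_pairs, generate_native):
    """Map each native-script form to the transliterated units related to it.

    translit_units  -- segmented foreign-script (e.g., Roman) units
    collected_pairs -- hypothetical dict: unit -> native forms found during collection
    generate_native -- hypothetical callable: unit -> generated native forms
    """
    index = defaultdict(set)
    for unit in translit_units:
        natives = collected_pairs.get(unit)
        if not natives:
            # No collected native form to cross-index against:
            # fall back to generating equivalent native-script forms.
            natives = generate_native(unit)
        for native in natives:
            index[native].add(unit)
    return dict(index)

# Toy data (assumed): one collected pair, and a stub generator backed by a table.
collected = {"namaste": ["नमस्ते"]}
generated = {"namastey": ["नमस्ते"], "dost": ["दोस्त"]}

index = build_cross_index(
    ["namaste", "namastey", "dost", "zzz"],
    collected,
    lambda unit: generated.get(unit, []),
)
print(index)
```

In this sketch a unit for which no native form can be found or generated (here "zzz") is simply left out of the index; the claim does not say how such units are handled.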
16. A system for searching for transliterated content, comprising:
a general purpose computing device;
a computer program comprising program modules executable by the general purpose computing device, wherein the computing device is directed by the program modules of the computer program to,
collect multi-lingual transliterated data and associated native script forms for the transliterated data;
create a cross index in native script by indexing textual data units of the collected multi-lingual transliterated data in a native script to related foreign script transliterated units from the collected multi-lingual transliterated data;
input a query to search the collected transliterated data and associated data in native forms;
search the multi-lingual transliterated data and data in native forms using the cross index; and
return transliterated data and data in native script in response to the input query.
Dependent claims: 17, 18, 19, 20
Specification