Identifying parallel bilingual data over a network
Abstract
A set of candidate documents, each of which may be part of a bilingual, parallel set of documents, is identified. The candidate documents illustratively include textual material in a source language. It is then determined whether parallel text can be identified. For each document in the set, it is first determined whether the parallel text resides within the document itself. If not, the document is examined for links to other documents, and those linked documents are examined for bilingual parallelism with the selected document. If no parallel text is found among the linked documents, named entities are extracted from the document and translated into the target language. The translations are used to query search engines to retrieve the parallel correspondent for the selected document.
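The fallback order described in the abstract (within the document, then linked documents, then a search built from translated named entities) can be sketched as a simple cascade. This is only an illustration under stated assumptions: `find_parallel`, the stage names, and the stub finders are hypothetical and are not taken from the patent, and each real stage would need its own retrieval and verification logic.

```python
from typing import Callable, Iterable, Optional, Tuple

# Hypothetical type: a stage takes a document and returns parallel text or None.
Finder = Callable[[str], Optional[str]]


def find_parallel(
    document: str, stages: Iterable[Tuple[str, Finder]]
) -> Optional[Tuple[str, str]]:
    """Try each stage in order and return (stage_name, parallel_text)
    for the first stage that finds something, or None if all fail."""
    for name, finder in stages:
        match = finder(document)
        if match is not None:
            return name, match
    return None


# Stub stages standing in for the three steps of the abstract (assumptions).
example_stages = [
    ("within_document", lambda doc: None),
    ("linked_documents", lambda doc: "linked text" if "href" in doc else None),
    ("named_entity_search", lambda doc: "search hit"),
]
```

Ordering the stages from cheapest to most expensive means a later stage runs only when every earlier one has failed, which matches the "if not" chain in the abstract.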
19 Claims
1. A method of identifying parallel, bilingual texts, comprising:
- executing a query against a data store over a network to identify a candidate text in a source language;
- determining whether the candidate text includes a link to a linked text in a target language;
- if the candidate text includes a link to a linked text, retrieving the linked text;
- determining whether the candidate text and the linked text are parallel, bilingual texts; and
- if so, providing an output identifying the parallel, bilingual texts.
Dependent claims: 2, 3, 4, 5, 6, 12.
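The link-detection step of claim 1 could be approximated as follows. This is a minimal sketch, not the claimed method: it assumes the candidate text is HTML, and it treats a link as pointing to target-language text when its `hreflang` attribute matches the target language or its anchor text contains a caller-supplied cue word. `LinkCollector`, `candidate_links`, and the cue list are hypothetical names chosen for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, hreflang, anchor_text) for every <a href=...> element."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._current = None  # anchor being read, or None outside an anchor

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            attr_map = dict(attrs)
            if attr_map.get("href"):
                self._current = [attr_map["href"], attr_map.get("hreflang"), ""]

    def handle_data(self, data):
        if self._current is not None:
            self._current[2] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            href, lang, text = self._current
            self.links.append((href, lang, text.strip()))
            self._current = None


def candidate_links(html, target_lang, cues):
    """Return hrefs whose hreflang or anchor text suggests the target language."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    hits = []
    for href, lang, text in parser.links:
        lang_match = lang is not None and lang.lower().startswith(target_lang.lower())
        cue_match = any(cue.lower() in text.lower() for cue in cues)
        if lang_match or cue_match:
            hits.append(href)
    return hits
```

Each returned link is only a candidate; the retrieved page would still have to pass the parallelism check in the following claim element.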
7. A method of generating parallel, bi-lingual corpora, comprising:
- executing a query against a data store over a wide area network to identify a set of candidate web pages in a first language;
- executing a hierarchical set of content-based processing steps on a selected candidate web page to identify a possibly parallel web page in a second language that is possibly parallel to the selected candidate web page in the first language, wherein executing a hierarchical set of content-based processing steps to identify a possible parallel web page comprises:
  - determining whether the selected candidate web page itself includes parallel bi-lingual texts, and
  - following hyperlinks in the selected candidate web page to linked web pages and determining whether the linked web pages are parallel to the selected candidate web page;
- verifying that the candidate web page and the possibly parallel web page are sufficiently parallel; and
- if so, outputting an indication that the candidate web page and the possibly parallel web page are parallel, bi-lingual web pages.
Dependent claims: 8, 11, 13.
9-10. (canceled)
14. A system for identifying parallel, bilingual documents stored in one or more data stores accessible over a network, the system comprising:
- a search engine configured to execute a query over the network against the one or more data stores having a plurality of documents each defining a plurality of sentences, the search engine being further configured to retrieve a candidate document in a first language from the plurality of documents based on the query; and
- a parallelism verification system that comprises:
  - a document scanning component configured to scan the candidate document to determine whether the candidate document includes words in a second language, the second language being different than the first language; and
  - a parallelism verification component configured to determine whether the words in the second language comprise parallel text, that is parallel to text, in the first language, in the candidate document.
Dependent claims: 15, 16, 17, 18, 19.
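One way to picture the document scanning component of claim 14 is a scan for runs of characters written in a different script. The sketch below assumes the two languages use different scripts (for example Chinese and English) and classifies characters by their Unicode names; this is an assumption for illustration, not the technique the claim specifies, and it would not separate two languages that share a script. `script_of` and `words_in_script` are hypothetical helpers.

```python
import unicodedata


def script_of(ch):
    """Coarse script label for a letter, or None for non-letters."""
    if not ch.isalpha():
        return None
    name = unicodedata.name(ch, "")
    if name.startswith("CJK"):
        return "cjk"
    if name.startswith("LATIN"):
        return "latin"
    return "other"


def words_in_script(text, script):
    """Return maximal runs of consecutive letters belonging to `script`."""
    words, current = [], []
    for ch in text:
        if script_of(ch) == script:
            current.append(ch)
        elif current:
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words
```

A non-empty result for the second language's script only shows that such words are present; whether they form text parallel to the first-language text is the job of the separate parallelism verification component.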
Specification