Automatic acquisition of a parallel corpus from a network
First Claim
1. A method comprising:
identifying network pages based on whether the pages include image alternative text that indicates that the network pages contain links to pages that are translations of each other;
retrieving a plurality of pages and a plurality of respective uniform resource locators from a server associated with the domain name of the identified network pages;
using the uniform resource locators to identify a set of candidate parallel page pairs;
creating a set of features for each candidate parallel page pair; and
using the sets of features to identify parallel page pairs, wherein the pages in a parallel page pair are translations of each other.
Abstract
Network pages are identified based on whether the pages include image alternative text that indicates that the network pages contain links to pages that are translations of each other. A plurality of pages and a plurality of respective uniform resource locators are downloaded from a server associated with the domain name of the identified network pages. The uniform resource locators are used to identify a set of candidate parallel page pairs and a set of features are created for each candidate parallel page pair. The sets of features are used to identify parallel page pairs, wherein the pages in a parallel page pair are translations of each other.
89 Citations
20 Claims
1. A method comprising:
identifying network pages based on whether the pages include image alternative text that indicates that the network pages contain links to pages that are translations of each other;
retrieving a plurality of pages and a plurality of respective uniform resource locators from a server associated with the domain name of the identified network pages;
using the uniform resource locators to identify a set of candidate parallel page pairs;
creating a set of features for each candidate parallel page pair; and
using the sets of features to identify parallel page pairs, wherein the pages in a parallel page pair are translations of each other.
Dependent claims: 2, 3, 4, 5, 6, 7
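
The first step of claim 1, spotting pages whose image alternative text suggests a link to a translated version, might be sketched roughly as follows. This is only an illustration, not the patented implementation: the `LANGUAGE_HINTS` set, the class name `AltTextLinkFinder`, and the helper `find_language_links` are all hypothetical, and the claim does not enumerate which alt strings count as indicators.

```python
from html.parser import HTMLParser

# Hypothetical hint list (assumption): the claim does not say which
# alt strings indicate a link to a translated page.
LANGUAGE_HINTS = {"english", "english version", "chinese", "chinese version"}


class AltTextLinkFinder(HTMLParser):
    """Collects links whose enclosed image carries a language-like alt text."""

    def __init__(self):
        super().__init__()
        self._open_hrefs = []      # hrefs of the <a> tags currently open
        self.language_links = []   # (alt text, href) pairs found so far

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a":
            self._open_hrefs.append(attrs.get("href"))
        elif tag == "img" and self._open_hrefs:
            alt = (attrs.get("alt") or "").strip().lower()
            href = self._open_hrefs[-1]
            if href and alt in LANGUAGE_HINTS:
                self.language_links.append((alt, href))

    def handle_endtag(self, tag):
        if tag == "a" and self._open_hrefs:
            self._open_hrefs.pop()


def find_language_links(html):
    """Return (alt, href) pairs for image links that look like language switches."""
    finder = AltTextLinkFinder()
    finder.feed(html)
    return finder.language_links
```

A page for which `find_language_links` returns a non-empty list would, under these assumptions, be treated as a seed whose domain is worth crawling for parallel pages.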
8. A computer-readable medium having computer-executable instructions for performing steps comprising:
receiving a set of uniform resource locators;
locating a first uniform resource locator that contains a base pattern in the set of uniform resource locators;
modifying the first uniform resource locator by replacing the base pattern with an alternative pattern to form a modified uniform resource locator;
locating at least one uniform resource locator in the set of uniform resource locators that is different from the modified uniform resource locator but is within an edit distance threshold of the modified uniform resource locator to identify a second uniform resource locator; and
indicating that a page associated with the first uniform resource locator and a page associated with the second uniform resource locator are candidate parallel pages that are likely to represent the same content in two different languages.
Dependent claims: 9, 10, 11, 12, 13, 14
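
The URL-matching steps of claim 8 can be illustrated with a short sketch. It assumes a plain Levenshtein edit distance and a single base/alternative pattern pair; the function names, the example patterns, and the default threshold are hypothetical, and the claim itself does not fix a particular distance measure. Skipping the first URL itself is an added assumption so that a page is never paired with itself.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def candidate_pairs(urls, base_pattern, alt_pattern, threshold=2):
    """Pair URLs that nearly match after swapping base_pattern for alt_pattern."""
    pairs = []
    for first in urls:
        if base_pattern not in first:
            continue
        modified = first.replace(base_pattern, alt_pattern, 1)
        for second in urls:
            # As in the claim, the second URL differs from the modified URL;
            # skipping the first URL itself is an assumption of this sketch.
            if second == first or second == modified:
                continue
            if edit_distance(second, modified) <= threshold:
                pairs.append((first, second))
    return pairs
```

For example, with the hypothetical patterns `/en/` and `/zh/`, the URL `http://example.com/en/news1.html` is rewritten to `http://example.com/zh/news1.html`, and `http://example.com/zh/news_1.html` is then accepted as a candidate because it lies one edit away from the rewritten URL.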
15. A method comprising:
determining a feature vector for a pair of documents comprising a document in a first language and a document in a second language;
applying the feature vector to a k-nearest neighbor classifier to classify the pair of documents as either containing the same content in different languages or not containing the same content.
Dependent claims: 16, 17, 18, 19, 20
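
The classification step of claim 15 could look roughly like the sketch below. It assumes Euclidean distance and a simple majority vote, neither of which is specified by the claim, and the two-component feature vectors in the usage note are invented purely for illustration.

```python
import math
from collections import Counter


def knn_classify(training, query, k=3):
    """Classify a document pair by majority vote among its k nearest neighbors.

    training: list of (feature_vector, label) tuples, where label is True
              for pairs known to be parallel and False otherwise.
    query:    feature vector of the pair to classify.
    """
    # Euclidean distance is an assumption of this sketch.
    nearest = sorted(training, key=lambda ex: math.dist(ex[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

With labeled examples such as `((0.95, 0.8), True)` and `((0.3, 0.1), False)`, a call like `knn_classify(training, (0.92, 0.8), k=3)` returns the label held by most of the three closest training pairs. An odd `k` avoids ties when there are only two classes.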
Specification