System and method for building diverse language models
First Claim
1. A method comprising:
identifying vocabulary gaps in a current language model;
establishing a visitation policy based on a previous crawling cycle and the vocabulary gaps, wherein the visitation policy identifies web pages likely to have information capable of filling the vocabulary gaps in the current language model, and wherein the visitation policy comprises a crawling schedule based on predicted perplexity of the web pages with respect to the current language model;
crawling, via a crawler operating on a computing device, the web-pages according to the crawling schedule, to yield new vocabulary words; and
generating a diverse language model based on the current language model and the new vocabulary words.
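The scheduling and crawling steps of the claim can be illustrated with a minimal sketch. It assumes that a predicted perplexity is already available for each candidate page (for example, estimated from pages seen in the previous crawling cycle) and simply visits the highest-scoring pages first. The names `build_schedule`, `crawl_cycle`, `fetch_tokens`, and `budget` are hypothetical and do not come from the patent; this is one possible reading, not the claimed implementation.

```python
import heapq


def build_schedule(predicted_perplexity):
    """Order URLs from highest to lowest predicted perplexity.

    predicted_perplexity: dict mapping URL -> predicted perplexity of that
    page under the current language model (assumed to be given).
    """
    # heapq is a min-heap, so negate the score to pop the largest first.
    heap = [(-ppl, url) for url, ppl in predicted_perplexity.items()]
    heapq.heapify(heap)
    schedule = []
    while heap:
        _, url = heapq.heappop(heap)
        schedule.append(url)
    return schedule


def crawl_cycle(schedule, fetch_tokens, known_vocab, budget):
    """Visit up to `budget` pages in schedule order and collect new words.

    fetch_tokens: hypothetical callable returning the tokens of a page.
    known_vocab: set of words already in the current language model.
    """
    new_words = set()
    for url in schedule[:budget]:
        new_words |= set(fetch_tokens(url)) - known_vocab
    return new_words
```

The new words returned by `crawl_cycle` would then be merged with the current model's vocabulary to build the next model; that step is omitted here because the claim does not specify how the merge is done.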
Abstract
Disclosed herein are systems, methods, and non-transitory computer-readable storage media for collecting web data in order to create diverse language models. A system configured to practice the method first crawls, such as via a crawler operating on a computing device, a set of documents in a network of interconnected devices according to a visitation policy, wherein the visitation policy is configured to focus on novelty regions for a current language model built from previous crawling cycles by crawling documents whose vocabulary is considered likely to fill gaps in the current language model. A language model from a previous cycle can be used to guide the creation of a language model in the following cycle. The novelty regions can include documents with high perplexity values over the current language model.
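To make the notion of a high-perplexity "novelty region" concrete, the sketch below scores a document against a toy unigram language model with add-one smoothing and lists the document's words that the model has never seen. The unigram model, the smoothing scheme, and the function names are assumptions chosen for brevity; the abstract does not state which kind of language model or perplexity estimate is used.

```python
import math
from collections import Counter


def train_unigram(tokens):
    """Return (word counts, total token count) for a toy unigram model."""
    counts = Counter(tokens)
    return counts, sum(counts.values())


def perplexity(doc_tokens, counts, total):
    """Perplexity of a document under an add-one-smoothed unigram model.

    One extra vocabulary slot is reserved for unseen words, so every
    out-of-vocabulary token receives a small non-zero probability.
    """
    if not doc_tokens:
        raise ValueError("document must contain at least one token")
    vocab_size = len(counts) + 1
    log_prob = 0.0
    for tok in doc_tokens:
        p = (counts.get(tok, 0) + 1) / (total + vocab_size)
        log_prob += math.log(p)
    return math.exp(-log_prob / len(doc_tokens))


def vocabulary_gaps(doc_tokens, counts):
    """Words in the document that the current model has never seen."""
    return {tok for tok in doc_tokens if tok not in counts}
```

Under this sketch, a document made up mostly of unseen words scores a higher perplexity than one that reuses the model's existing vocabulary, which is the property the visitation policy relies on when choosing what to crawl.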
19 Claims
1. A method comprising:
identifying vocabulary gaps in a current language model;
establishing a visitation policy based on a previous crawling cycle and the vocabulary gaps, wherein the visitation policy identifies web pages likely to have information capable of filling the vocabulary gaps in the current language model, and wherein the visitation policy comprises a crawling schedule based on predicted perplexity of the web pages with respect to the current language model;
crawling, via a crawler operating on a computing device, the web-pages according to the crawling schedule, to yield new vocabulary words; and
generating a diverse language model based on the current language model and the new vocabulary words.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9
10. A system comprising:
a processor; and
a computer-readable storage medium having instructions stored which, when executed by the processor, cause the processor to perform operations comprising:
identifying vocabulary gaps in a current language model;
establishing a visitation policy based on a previous crawling cycle and the vocabulary gaps, wherein the visitation policy identifies web pages likely to have information capable of filling the vocabulary gaps in the current language model, and wherein the visitation policy comprises a crawling schedule based on predicted perplexity of the web pages with respect to the current language model;
crawling, via a crawler operating on a computing device, web-pages according to the crawling schedule, to yield new vocabulary words; and
generating a diverse language model based on the current language model and the new vocabulary words.
Dependent claims: 11, 12, 13, 14
15. A computer-readable storage device having instructions stored which, when executed by a computing device, cause the computing device to perform operations comprising:
identifying vocabulary gaps in a current language model;
establishing a visitation policy based on a previous crawling cycle and the vocabulary gaps, wherein the visitation policy identifies web pages likely to have information capable of filling the vocabulary gaps in the current language model, and wherein the visitation policy comprises a crawling schedule based on predicted perplexity of the web pages with respect to the current language model;
crawling, via a crawler operating on a computing device, web-pages according to the crawling schedule, to yield new vocabulary words; and
generating a diverse language model based on the current language model and the new vocabulary words.
Dependent claims: 16, 17, 18, 19
Specification