System and method for building diverse language models
First Claim
1. A method comprising:
- establishing a website visitation policy according to a previous crawling cycle and vocabulary gaps in a language model, wherein the website visitation policy identifies, according to a pattern of links, a likelihood of web pages to have information capable of filling the vocabulary gaps, and wherein the website visitation policy comprises a crawling schedule according to perplexity of the web pages with respect to the language model;
- crawling, via a processor, the web-pages according to the crawling schedule, to yield new vocabulary words; and
- generating a diverse language model according to the language model and the new vocabulary words.
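The three steps of the claim (schedule, crawl, generate) can be illustrated with a minimal Python sketch. It is not the patented implementation: the function `crawl_cycle`, the callable `fetch_page`, the precomputed `page_perplexity` scores, and the toy URLs are all hypothetical, the language model is reduced to a bag of word counts, and the link-pattern likelihood recited in the claim is omitted for brevity.

```python
import heapq
from collections import Counter


def crawl_cycle(lm_counts, page_perplexity, fetch_page):
    """One crawling cycle: visit pages in order of decreasing perplexity,
    collect words missing from the current model, and merge them in."""
    # heapq is a min-heap, so negate perplexity to pop the highest first.
    schedule = [(-ppl, url) for url, ppl in page_perplexity.items()]
    heapq.heapify(schedule)

    new_words = Counter()
    while schedule:
        _, url = heapq.heappop(schedule)
        for word in fetch_page(url).lower().split():
            if word not in lm_counts:  # this word fills a vocabulary gap
                new_words[word] += 1

    # Assumption: the "diverse" model is the old counts plus the new words.
    diverse_counts = Counter(lm_counts)
    diverse_counts.update(new_words)
    return diverse_counts, new_words


# Toy usage with an in-memory "web" (hypothetical data).
pages = {
    "a.example/news": "the market rallied today",
    "b.example/bio": "the ribosome translates mrna",
}
lm = Counter({"the": 5, "market": 2, "today": 1})
scores = {"a.example/news": 12.0, "b.example/bio": 95.0}
model, gained = crawl_cycle(lm, scores, pages.__getitem__)
print(sorted(gained))
```

In this toy run the biology page is visited first because its assumed perplexity is higher, and the words gained across both pages are the ones absent from the starting counts.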
Abstract
Disclosed herein are systems, methods, and non-transitory computer-readable storage media for collecting web data in order to create diverse language models. A system configured to practice the method first crawls, such as via a crawler operating on a computing device, a set of documents in a network of interconnected devices according to a visitation policy. The visitation policy is configured to focus on novelty regions for a current language model built from previous crawling cycles, by crawling documents whose vocabulary is considered likely to fill gaps in the current language model. A language model from a previous cycle can be used to guide the creation of a language model in the following cycle. The novelty regions can include documents with high perplexity values over the current language model.
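To make "high perplexity values over the current language model" concrete, the sketch below scores a document against a unigram model with add-one smoothing. The abstract does not specify the model order or the smoothing scheme, so both are assumptions chosen for brevity, and the function name `perplexity` is illustrative.

```python
import math
from collections import Counter


def perplexity(text, lm_counts):
    """Perplexity of `text` under an add-one-smoothed unigram model.

    Assumption: all unseen words share one extra vocabulary slot.
    """
    tokens = text.lower().split()
    if not tokens:
        return 0.0  # an empty page has nothing to teach; rank it last
    total = sum(lm_counts.values())
    vocab = len(lm_counts) + 1  # +1 slot for unseen words
    log_prob = 0.0
    for tok in tokens:
        p = (lm_counts.get(tok, 0) + 1) / (total + vocab)
        log_prob += math.log(p)
    # Perplexity is the exponential of the average negative log-probability.
    return math.exp(-log_prob / len(tokens))


lm = Counter("the cat sat on the mat".split())
familiar = perplexity("the cat sat", lm)
novel = perplexity("quantum chromodynamics lattice", lm)
print(familiar < novel)
```

A document made up entirely of unseen words scores higher than one built from words the model already knows, which is the property a visitation policy could use to rank candidate novelty regions.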
20 Claims
1. A method comprising:
- establishing a website visitation policy according to a previous crawling cycle and vocabulary gaps in a language model, wherein the website visitation policy identifies, according to a pattern of links, a likelihood of web pages to have information capable of filling the vocabulary gaps, and wherein the website visitation policy comprises a crawling schedule according to perplexity of the web pages with respect to the language model;
- crawling, via a processor, the web-pages according to the crawling schedule, to yield new vocabulary words; and
- generating a diverse language model according to the language model and the new vocabulary words.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9
10. A system comprising:
- a processor; and
- a computer-readable storage medium having instructions stored which, when executed by the processor, cause the processor to perform instructions comprising:
  - establishing a website visitation policy according to a previous crawling cycle and vocabulary gaps in a language model, wherein the website visitation policy identifies, according to a pattern of links, a likelihood of web pages to have information capable of filling the vocabulary gaps, and wherein the website visitation policy comprises a crawling schedule according to perplexity of the web pages with respect to the language model;
  - crawling the web-pages according to the crawling schedule, to yield new vocabulary words; and
  - generating a diverse language model according to the language model and the new vocabulary words.
Dependent claims: 11, 12, 13, 14, 15, 16, 17, 18
19. A computer-readable storage device having instructions stored which, when executed by a computing device, cause the computing device to perform operations comprising:
- establishing a website visitation policy according to a previous crawling cycle and vocabulary gaps in a language model, wherein the website visitation policy identifies, according to a pattern of links, a likelihood of web pages to have information capable of filling the vocabulary gaps, and wherein the website visitation policy comprises a crawling schedule according to perplexity of the web pages with respect to the language model;
- crawling the web-pages according to the crawling schedule, to yield new vocabulary words; and
- generating a diverse language model according to the language model and the new vocabulary words.
Dependent claims: 20
Specification