System and method for building diverse language models
First Claim
1. A method comprising:
establishing a crawling schedule configured to identify, according to a pattern of links, a likelihood of web pages to have information capable of filling vocabulary gaps, and wherein a website visitation policy comprises the crawling schedule according to perplexity of the web pages with respect to a language model;
crawling, via a processor, the web-pages based on the crawling schedule, to yield new vocabulary words; and
generating a new language model according to the language model and the new vocabulary words.
Abstract
Disclosed herein are systems, methods, and non-transitory computer-readable storage media for collecting web data in order to create diverse language models. A system configured to practice the method first crawls, for example via a crawler operating on a computing device, a set of documents in a network of interconnected devices according to a visitation policy. The visitation policy is configured to focus on novelty regions for a current language model built from previous crawling cycles, by crawling documents whose vocabulary is considered likely to fill gaps in the current language model. A language model from a previous cycle can be used to guide the creation of a language model in the following cycle. The novelty regions can include documents with high perplexity values over the current language model.
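The crawl cycle described in the abstract can be illustrated with a minimal sketch. It is not the patented implementation: it assumes an add-one smoothed unigram model as the "current language model", and it assumes document text is already available when perplexity is scored, whereas the claims estimate a page's likelihood of filling vocabulary gaps from a pattern of links. The names `UnigramModel`, `crawl_cycle`, and `budget` are hypothetical.

```python
import heapq
import math
from collections import Counter


class UnigramModel:
    """Add-one smoothed unigram model, a stand-in for the current language model."""

    def __init__(self, tokens=()):
        self.counts = Counter(tokens)
        self.total = sum(self.counts.values())

    def prob(self, word):
        # One extra vocabulary slot gives unseen words non-zero probability.
        vocab_size = len(self.counts) + 1
        return (self.counts[word] + 1) / (self.total + vocab_size)

    def perplexity(self, tokens):
        # Assumption: an empty document carries no novelty, so score it 0.
        if not tokens:
            return 0.0
        log_prob = sum(math.log(self.prob(w)) for w in tokens)
        return math.exp(-log_prob / len(tokens))

    def updated_with(self, tokens):
        # Build the next-cycle model from the old counts plus the new text.
        merged = UnigramModel()
        merged.counts = self.counts + Counter(tokens)
        merged.total = sum(merged.counts.values())
        return merged


def crawl_cycle(model, documents, budget):
    """Visit the `budget` highest-perplexity documents and build a new model.

    `documents` maps a URL to its token list and stands in for fetched pages.
    """
    # heapq is a min-heap, so perplexity is negated to pop the highest first.
    frontier = [(-model.perplexity(toks), url) for url, toks in documents.items()]
    heapq.heapify(frontier)

    visited, new_words, harvested = [], set(), []
    for _ in range(min(budget, len(frontier))):
        _, url = heapq.heappop(frontier)
        tokens = documents[url]
        visited.append(url)
        # Words absent from the current model are the filled vocabulary gaps.
        new_words.update(w for w in tokens if w not in model.counts)
        harvested.extend(tokens)
    return model.updated_with(harvested), visited, new_words
```

In this sketch a document made up entirely of unseen words scores a higher perplexity than one the model already covers, so it is visited first. Once its text is folded into the model, the same document scores lower in the next cycle, which moves the crawl toward other novelty regions.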
20 Claims
1. A method comprising:
establishing a crawling schedule configured to identify, according to a pattern of links, a likelihood of web pages to have information capable of filling vocabulary gaps, and wherein a website visitation policy comprises the crawling schedule according to perplexity of the web pages with respect to a language model;
crawling, via a processor, the web-pages based on the crawling schedule, to yield new vocabulary words; and
generating a new language model according to the language model and the new vocabulary words.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9
10. A system comprising:
a processor; and
a computer-readable storage medium having instructions stored which, when executed by the processor, cause the processor to perform instructions comprising:
establishing a crawling schedule configured to identify, according to a pattern of links, a likelihood of web pages to have information capable of filling vocabulary gaps, and wherein a website visitation policy comprises the crawling schedule according to perplexity of the web pages with respect to a language model;
crawling, via a processor, the web-pages based on the crawling schedule, to yield new vocabulary words; and
generating a new language model according to the language model and the new vocabulary words.
Dependent claims: 11, 12, 13, 14, 15, 16, 17, 18
19. A non-transitory computer-readable storage device having instructions stored which, when executed by a computing device, cause the computing device to perform operations comprising:
establishing a crawling schedule configured to identify, according to a pattern of links, a likelihood of web pages to have information capable of filling vocabulary gaps, and wherein a website visitation policy comprises the crawling schedule according to perplexity of the web pages with respect to a language model;
crawling, via a processor, the web-pages based on the crawling schedule, to yield new vocabulary words; and
generating a new language model according to the language model and the new vocabulary words.
Dependent claims: 20
Specification