SYSTEM AND METHOD FOR BUILDING DIVERSE LANGUAGE MODELS
Abstract
Disclosed herein are systems, methods, and non-transitory computer-readable storage media for collecting web data in order to create diverse language models. A system configured to practice the method first crawls, such as via a crawler operating on a computing device, a set of documents in a network of interconnected devices according to a visitation policy. The visitation policy is configured to focus on novelty regions for a current language model built from previous crawling cycles by crawling documents whose vocabulary is considered likely to fill gaps in the current language model. A language model from a previous cycle can be used to guide the creation of a language model in the following cycle. The novelty regions can include documents with high perplexity values over the current language model.
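The perplexity criterion mentioned in the abstract can be illustrated with a small sketch. The abstract does not specify the model type or the smoothing, so the add-one smoothed unigram model below, along with the hypothetical helper names `train_unigram`, `perplexity`, and `rank_by_novelty`, is an assumption chosen only to keep the example short and runnable.

```python
import math
from collections import Counter


def train_unigram(docs):
    """Count word frequencies over whitespace-tokenized documents.

    Assumption: these counts stand in for the "current language model".
    """
    counts = Counter()
    for doc in docs:
        counts.update(doc.lower().split())
    return counts


def perplexity(counts, doc):
    """Perplexity of `doc` under an add-one smoothed unigram model."""
    tokens = doc.lower().split()
    if not tokens:
        return 0.0
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot reserved for unseen words
    log_prob = 0.0
    for tok in tokens:
        # Counter returns 0 for unseen words, so they get probability 1/(total + vocab).
        log_prob += math.log((counts[tok] + 1) / (total + vocab))
    return math.exp(-log_prob / len(tokens))


def rank_by_novelty(counts, candidates):
    """Order candidate documents from highest to lowest perplexity."""
    return sorted(candidates, key=lambda d: perplexity(counts, d), reverse=True)
```

Under this sketch, a document made up entirely of unseen words receives the highest perplexity and is ranked first, which is one plausible reading of a "novelty region".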
20 Claims
1. A method of generating a diverse language model, the method comprising:
- crawling, via a crawler operating on a computing device, a plurality of documents in a network of interconnected documents according to a visitation policy, wherein the visitation policy is configured to focus on novelty regions for a current language model built from at least one previous crawling cycle, and wherein the visitation policy is based on a vocabulary considered likely to fill gaps in the current language model; and
- generating a diverse language model based on the current language model and the plurality of documents.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10
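The crawling step of claim 1 can be sketched as a prioritized frontier. The claim does not say how the visitation policy is computed, so everything below is an assumption made for illustration: the hypothetical `gap_score` measures the fraction of a page's words missing from the current model's vocabulary, outgoing links inherit that score as their priority, and the "network" is an in-memory dictionary rather than real HTTP fetching. The model-generation step of the claim is not shown.

```python
import heapq


def gap_score(text, known_vocab):
    """Fraction of tokens that are absent from the current model's vocabulary."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t not in known_vocab) / len(tokens)


def crawl(pages, links, seeds, known_vocab, budget):
    """Visit up to `budget` pages, preferring links found on high-gap pages.

    pages: url -> page text; links: url -> list of linked urls (both hypothetical).
    """
    # heapq is a min-heap, so priorities are stored negated.
    frontier = [(0.0, url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    collected = []
    while frontier and len(collected) < budget:
        _, url = heapq.heappop(frontier)
        collected.append(url)
        score = gap_score(pages.get(url, ""), known_vocab)
        for nxt in links.get(url, []):
            if nxt not in seen:
                seen.add(nxt)
                # Assumption: pages linked from a novel page are likely novel too.
                heapq.heappush(frontier, (-score, nxt))
    return collected
```

For example, with a seed page that links to one familiar page and one page full of unknown words, links discovered on the unknown-word page are visited before links discovered on the familiar page, even if the latter were found first.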
11. A system for recognizing speech, the system comprising:
- a processor;
- a first module configured to control the processor to receive speech;
- a second module configured to control the processor to load a language model, wherein the language model is generated by crawling a plurality of documents in a network of interconnected documents according to a visitation policy, wherein the visitation policy is configured to focus on novelty regions for a current language model built from at least one previous crawling cycle, and wherein the visitation policy is based on a vocabulary considered likely to fill gaps in the current language model; and
- a third module configured to control the processor to recognize the speech using the language model.
Dependent claims: 12, 13, 14, 15
16. A non-transitory computer-readable storage medium storing instructions which, when executed by a computing device, cause the computing device to generate a diverse language model, the instructions comprising:
- crawling a plurality of documents in a network of interconnected documents according to a visitation policy, wherein the visitation policy is configured to focus on novelty regions for a current language model built from at least one previous crawling cycle, and wherein the visitation policy is based on a vocabulary considered likely to fill gaps in the current language model; and
- generating a new language model based on the current language model and the plurality of documents.
Dependent claims: 17, 18, 19, 20
Specification