Domain-specific unstructured text retrieval
Abstract
Retrieving from the Internet unstructured text related to a specified domain is described. Training data is accessed; the training data comprises unstructured text related to the specified domain. A first classifier is trained using features of the training data. It is used to classify unstructured text having a plurality of features, to obtain unstructured text examples related to the domain. The unstructured text examples are used to retrieve from the Internet similar examples which do not have at least some of the plurality of features. Optionally, a second classifier is trained using the similar examples. Additional unstructured text is retrieved from the Internet and the second classifier is used to label the additional unstructured text for domain relevance.
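The flow summarized in the abstract can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the patented implementation: the "plurality of features" is modeled as hypothetical topic tags attached to a page, both classifiers are a tiny naive Bayes model, similarity is Jaccard overlap of words, and small in-memory lists stand in for the Internet. Using untagged pages that were not retrieved as negative examples for the second classifier is likewise an assumption; the abstract does not say where negatives come from.

```python
import math
from collections import Counter

def tokens(text):
    return text.lower().split()

def with_tags(page):
    # Words plus the hypothetical metadata features (topic tags).
    return tokens(page["text"]) + ["tag:" + t for t in page["tags"]]

class NaiveBayes:
    """Tiny multinomial naive Bayes over token lists (add-one smoothing)."""
    def fit(self, docs, labels):
        self.counts = {True: Counter(), False: Counter()}
        self.n_docs = Counter(labels)
        for doc, label in zip(docs, labels):
            self.counts[label].update(doc)
        self.vocab = set(self.counts[True]) | set(self.counts[False])
        return self

    def score(self, doc, label):
        total = sum(self.counts[label].values()) + len(self.vocab)
        logp = math.log(self.n_docs[label] / sum(self.n_docs.values()))
        for tok in doc:
            if tok in self.vocab:  # ignore words never seen in training
                logp += math.log((self.counts[label][tok] + 1) / total)
        return logp

    def predict(self, doc):
        return self.score(doc, True) > self.score(doc, False)

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def retrieve_similar(examples, pool, threshold=0.2):
    """Keep only pages that lack the tag features and resemble an example."""
    return [
        page for page in pool
        if not page["tags"] and any(
            jaccard(tokens(page["text"]), tokens(e["text"])) >= threshold
            for e in examples
        )
    ]

# Stage 1: train the first classifier on seed data that carries the features.
seed = [
    ({"text": "striker scores late goal in cup final", "tags": ["football"]}, True),
    ({"text": "keeper saves penalty in league match", "tags": ["football"]}, True),
    ({"text": "central bank raises interest rates", "tags": ["finance"]}, False),
    ({"text": "new phone released with faster chip", "tags": ["tech"]}, False),
]
first = NaiveBayes().fit([with_tags(p) for p, _ in seed], [y for _, y in seed])
tagged_pool = [
    {"text": "midfielder scores goal from free kick", "tags": ["football"]},
    {"text": "shares fall as bank rates rise", "tags": ["finance"]},
]
examples = [p for p in tagged_pool if first.predict(with_tags(p))]

# Stage 2: retrieve similar pages that do NOT have the tag features.
web_pool = [
    {"text": "late free kick goal wins the match", "tags": []},
    {"text": "goal from free kick seals win", "tags": ["football"]},
    {"text": "bank announces new savings account", "tags": []},
]
similar = retrieve_similar(examples, web_pool)
negatives = [p for p in web_pool if not p["tags"] and p not in similar]

# Stage 3: train the second classifier on plain text only, then label new text.
second = NaiveBayes().fit(
    [tokens(p["text"]) for p in similar + negatives],
    [True] * len(similar) + [False] * len(negatives),
)
new_texts = ["keeper concedes late goal", "bank opens new account"]
labels = {t: second.predict(tokens(t)) for t in new_texts}
```

The point of the second stage is that the second classifier never sees the tag features, so it can label text drawn from sources where those features are unavailable.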
20 Claims
1. An apparatus for retrieving unstructured text from the Internet related to a specified domain, the apparatus comprising:
one or more processors; and
a memory having instructions stored therein, the instructions executable by the one or more processors to perform operations as
a first classifier having been trained using training data comprising unstructured text related to the specified domain, the training data having a plurality of features, the unstructured text being separated from structured data and semi-structured data;
a similar web page retriever configured to retrieve, from the Internet, only web pages that include text that is unstructured and do not have at least some of the plurality of features of the training data, and where the retrieved web pages are similar to web pages classified by the first classifier; and
a second classifier having been trained using unstructured text examples which do not have at least one of the plurality of features;
wherein the second classifier is configured to label web pages retrieved by the similar web page retriever to select web pages which are relevant to the specified domain.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15
16. A computer-implemented method of retrieving unstructured text from the Internet related to a specified domain, the method comprising:
accessing training data comprising unstructured text related to the specified domain, the training data having a plurality of features, the unstructured text being separated from structured data and semi-structured data;
training a first classifier using the training data including the unstructured text related to the specified domain and the plurality of features;
using the trained first classifier to classify unstructured text having the plurality of features, to obtain unstructured text examples related to the domain;
using the unstructured text examples to retrieve, from the Internet, only similar examples that include text that is unstructured and do not have at least one of the plurality of features of the training data;
training a second classifier using at least some of the similar examples,
retrieving additional unstructured text from the Internet; and
using the second classifier to classify the additional unstructured text as being related to the specified domain or not.
Dependent claims: 17, 18, 19
20. An apparatus for retrieving unstructured text from the Internet related to a specified domain, the apparatus comprising:
one or more processors; and
a memory having instructions stored therein, the instructions executable by the one or more processors to perform operations as
a first classifier having been trained using training data comprising unstructured text related to the specified domain, the training data having a plurality of features, the unstructured text being separated from structured data and semi-structured data; and
a similar web page retriever configured to retrieve from the Internet, only web pages that include text that is unstructured and do not have at least some of the plurality of features of the training data, and where the retrieved web pages are similar to web pages classified by the first classifier, the similar web page retriever being configured to assign confidence values to the similar web pages to indicate likelihood of being relevant to the specified domain.
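Claim 20 has the similar web page retriever assign confidence values to the pages it returns. One possible reading, sketched below, takes a page's confidence to be its highest cosine similarity to any example accepted by the first classifier. The function names, the term-frequency representation, and the choice of cosine similarity are all assumptions made for illustration; the claim does not specify how the values are computed.

```python
import math
from collections import Counter

def tf_vector(text):
    # Term-frequency vector over lowercase whitespace tokens.
    return Counter(text.lower().split())

def cosine(u, v):
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = (math.sqrt(sum(c * c for c in u.values()))
            * math.sqrt(sum(c * c for c in v.values())))
    return dot / norm if norm else 0.0

def score_candidates(classified_examples, candidates):
    """Return (page, confidence) pairs, most confident first.

    Confidence is the best cosine similarity between the candidate page
    and any example accepted by the first classifier (an assumed proxy
    for likelihood of domain relevance).
    """
    example_vecs = [tf_vector(e) for e in classified_examples]
    scored = []
    for page in candidates:
        vec = tf_vector(page)
        confidence = max((cosine(vec, ev) for ev in example_vecs), default=0.0)
        scored.append((page, confidence))
    return sorted(scored, key=lambda pc: pc[1], reverse=True)

classified_examples = ["striker scores late goal in cup final"]
candidates = ["bank announces new savings account", "late goal decides cup final"]
ranked = score_candidates(classified_examples, candidates)
```

Here `ranked` lists the football page first and gives the banking page a confidence of zero, since it shares no words with the example. A downstream step could keep only pages above a chosen threshold.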
Specification