Domain-specific unstructured text retrieval

US 10,318,564 B2
Filed: 09/28/2015
Issued: 06/11/2019
Est. Priority Date: 09/28/2015
Status: Active Grant

Abstract

Retrieving from the Internet unstructured text related to a specified domain is described. Training data is accessed; the training data comprises unstructured text related to the specified domain. A first classifier is trained using features of the training data. The first classifier is used to classify unstructured text having a plurality of features, to obtain unstructured text examples related to the domain. The unstructured text examples are used to retrieve from the Internet similar examples which do not have at least some of the plurality of features. Optionally, a second classifier is trained using the similar examples. Additional unstructured text is retrieved from the Internet and the second classifier is used to label the additional unstructured text for domain relevance.
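
The flow described in the abstract can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the `Page` record, the category-based first classifier, the token-overlap second classifier, the 0.5 threshold and the `fake_retriever` function are all hypothetical stand-ins chosen for brevity, and the retrieval step is passed in as a function rather than querying the Internet.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Set


@dataclass
class Page:
    url: str
    text: str                       # unstructured body text
    category: Optional[str] = None  # a "feature"; absent on ordinary web pages


def tokens(text: str) -> Set[str]:
    return set(text.lower().split())


def train_first_classifier(training: List[Page]) -> Set[str]:
    """Toy first classifier: remember categories seen in the domain training data."""
    return {p.category for p in training if p.category is not None}


def train_second_classifier(examples: List[Page]) -> Set[str]:
    """Toy second classifier: a vocabulary built from body text only."""
    vocab: Set[str] = set()
    for p in examples:
        vocab |= tokens(p.text)
    return vocab


def is_relevant(vocab: Set[str], page: Page, threshold: float = 0.5) -> bool:
    """Label a page relevant if enough of its tokens appear in the vocabulary."""
    toks = tokens(page.text)
    return bool(toks) and len(toks & vocab) / len(toks) >= threshold


def run_pipeline(
    training: List[Page],
    feature_pages: List[Page],
    retrieve_similar: Callable[[List[Page]], List[Page]],
    additional: List[Page],
) -> List[str]:
    domain_categories = train_first_classifier(training)
    # Step 1: classify feature-bearing pages to obtain domain examples.
    examples = [p for p in feature_pages if p.category in domain_categories]
    # Step 2: retrieve similar pages, keeping only those that lack the feature.
    similar = [p for p in retrieve_similar(examples) if p.category is None]
    # Step 3: train the second classifier on text alone.
    vocab = train_second_classifier(similar)
    # Step 4: label additional unstructured text for domain relevance.
    return [p.url for p in additional if is_relevant(vocab, p)]


# Hypothetical data for illustration only.
training = [Page("enc/a", "the striker scored a goal", category="Football")]
feature_pages = [
    Page("enc/b", "the goalkeeper saved a penalty", category="Football"),
    Page("enc/c", "the senate passed a bill", category="Politics"),
]


def fake_retriever(examples: List[Page]) -> List[Page]:
    # Stand-in for the similar web page retriever; returns a fixed page.
    return [Page("blog/1", "a late goal won the match for the striker")]


additional = [
    Page("news/1", "the striker won the match"),
    Page("news/2", "parliament debated the budget"),
]
relevant = run_pipeline(training, feature_pages, fake_retriever, additional)
print(relevant)
```

The point of the two stages is that the second classifier never sees the `category` feature, so it can label pages that carry body text only.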

20 Claims

1. An apparatus for retrieving unstructured text from the Internet related to a specified domain, the apparatus comprising:
- one or more processors; and
  
  a memory having instructions stored therein, the instructions executable by the one or more processors to perform operations as:
  
  a first classifier having been trained using training data comprising unstructured text related to the specified domain, the training data having a plurality of features, the unstructured text being separated from structured data and semi-structured data;
  
  a similar web page retriever configured to retrieve, from the Internet, only web pages that include text that is unstructured and do not have at least some of the plurality of features of the training data, and where the retrieved web pages are similar to web pages classified by the first classifier; and
  
  a second classifier having been trained using unstructured text examples which do not have at least one of the plurality of features;
  
  wherein the second classifier is configured to label web pages retrieved by the similar web page retriever to select web pages which are relevant to the specified domain.
- Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15)
  - 2. The apparatus of claim 1 wherein the similar web page retriever is configured to assign confidence values to the similar web pages to indicate likelihood of being relevant to the specified domain and wherein the second classifier has been trained using unstructured text examples from web pages retrieved by the similar web page retriever and selected according to confidence values.
  - 3. The apparatus of claim 1 wherein the similar web page retriever identifies the similar web pages from inbound links of web pages classified by the first classifier.
  - 4. The apparatus of claim 1 wherein the similar web page retriever identifies first similar web pages from inbound links of web pages classified by the first classifier and from inbound links of first similar web pages.
  - 5. The apparatus of claim 1 wherein the similar web page retriever accesses an index of web pages, the index comprising inbound link data of the indexed web pages.
  - 6. The apparatus of claim 1 wherein the similar web page retriever accesses a click log comprising a record of web pages observed as having been selected by different users in connection with a same query.
  - 7. The apparatus of claim 1 wherein the similar web page retriever accesses an impression log comprising a record of web pages occurring in results lists returned by an information retrieval system in response to a same query.
  - 8. The apparatus of claim 1 wherein the similar web page retriever is configured to assign a confidence value to a similar web page on the basis of a number of inbound links of the similar web page.
  - 9. The apparatus of claim 1 wherein the first classifier is configured to use features comprising one or more of:
    - a category of a web page, a title of a web page, metatags of a web page, an information box of a web page.
  - 10. The apparatus of claim 1 wherein the first classifier is configured to classify web pages of a public online encyclopedia as being relevant to the specified domain or not.
  - 11. The apparatus of claim 1 wherein the first classifier has been trained using training data retrieved from a source known to comprise web pages having the plurality of features and using queries comprising seed examples.
  - 12. The apparatus of claim 1 comprising a communications interface configured to enable the apparatus to be accessed as a web service.
  - 13. The apparatus of claim 1 further comprising a feature extractor configured to extract sentences from the web pages retained by the second classifier, where the extracted sentences are likely to comprise facts.
  - 14. The apparatus of claim 13 further comprising a clustering component configured to cluster the extracted facts into relation clusters and assign confidence values to the clusters' facts.
  - 15. The apparatus of claim 14 further comprising a mapping component configured to map the relation clusters of extracted facts to an ontology of a knowledge store.
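
Claims 3, 4, 5 and 8 describe finding similar pages through inbound links and scoring them by link count. A minimal sketch follows, assuming a hypothetical in-memory `inbound_index` that maps each URL to the URLs of pages linking to it; the claims do not specify the index format. The confidence used here is the number of distinct classified pages a candidate links to, which is only one possible reading of claim 8, and the function names and sample URLs are illustrative.

```python
from collections import Counter
from typing import Dict, Iterable, List


def similar_pages_by_inbound_links(
    seeds: Iterable[str],
    inbound_index: Dict[str, List[str]],
) -> Counter:
    """Return candidate URLs with a confidence equal to the number of
    distinct seed pages each candidate links to."""
    seed_set = set(seeds)
    confidence: Counter = Counter()
    for seed in seed_set:
        for linker in set(inbound_index.get(seed, [])):
            if linker not in seed_set:  # skip pages that are already seeds
                confidence[linker] += 1
    return confidence


def expand_one_more_hop(
    first_hop: Counter,
    inbound_index: Dict[str, List[str]],
) -> Counter:
    """Second hop in the style of claim 4: follow the inbound links of the
    first similar pages."""
    return similar_pages_by_inbound_links(first_hop.keys(), inbound_index)


# Hypothetical link data: key = page, value = pages that link to it.
inbound_index = {
    "enc/football": ["blog/a", "blog/b"],
    "enc/goalkeeper": ["blog/a", "enc/football"],
    "blog/a": ["forum/x"],
}
seeds = ["enc/football", "enc/goalkeeper"]

first_hop = similar_pages_by_inbound_links(seeds, inbound_index)
second_hop = expand_one_more_hop(first_hop, inbound_index)
print(first_hop.most_common())
print(dict(second_hop))
```

In this toy data `blog/a` links to both seed pages and so receives the highest confidence, while `forum/x` is reached only on the second hop.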

16. A computer-implemented method of retrieving unstructured text from the Internet related to a specified domain, the method comprising:
- accessing training data comprising unstructured text related to the specified domain, the training data having a plurality of features, the unstructured text being separated from structured data and semi-structured data;
  
  training a first classifier using the training data including the unstructured text related to the specified domain and the plurality of features;
  
  using the trained first classifier to classify unstructured text having the plurality of features, to obtain unstructured text examples related to the domain;
  
  using the unstructured text examples to retrieve, from the Internet, only similar examples that include text that is unstructured and do not have at least one of the plurality of features of the training data;
  
  training a second classifier using at least some of the similar examples;
  
  retrieving additional unstructured text from the Internet; and
  
  using the second classifier to classify the additional unstructured text as being related to the specified domain or not.
- Dependent Claims (17, 18, 19)
  - 17. The method of claim 16 comprising assigning confidence values to the similar web pages to indicate likelihood of being relevant to the specified domain and training the second classifier using web pages selected according to the confidence values.
  - 18. The method of claim 16 comprising identifying the similar web pages from inbound links of web pages classified by the first classifier.
  - 19. The method of claim 16 comprising identifying the similar web pages from a cascade of inbound links of web pages classified by both the first and the second classifiers.
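
Claims 17 and 19 combine confidence-based selection with a cascade over inbound links. One way such a loop could look is sketched below. The names `inbound_index`, `classify`, `min_confidence` and `max_rounds` are hypothetical, `classify` stands in for the second classifier, and the stopping rule is an assumption because the claims do not state one.

```python
from typing import Callable, Dict, List, Set


def cascade(
    seeds: Set[str],
    inbound_index: Dict[str, List[str]],
    classify: Callable[[str], bool],
    min_confidence: int = 1,
    max_rounds: int = 3,
) -> Set[str]:
    """Repeatedly follow inbound links, starting from pages accepted by the
    first classifier (seeds) and continuing from pages accepted by `classify`."""
    accepted = set(seeds)
    frontier = set(seeds)
    for _ in range(max_rounds):
        # Count, for each unseen page, how many frontier pages it links to.
        counts: Dict[str, int] = {}
        for page in frontier:
            for linker in set(inbound_index.get(page, [])):
                if linker not in accepted:
                    counts[linker] = counts.get(linker, 0) + 1
        # Keep confident candidates that the classifier also labels relevant.
        frontier = {
            url for url, count in counts.items()
            if count >= min_confidence and classify(url)
        }
        if not frontier:
            break
        accepted |= frontier
    return accepted - set(seeds)


# Hypothetical link data: key = page, value = pages that link to it.
links = {"s1": ["a", "b"], "s2": ["a", "c"], "a": ["d"], "d": ["e"]}


def not_c(url: str) -> bool:
    # Toy stand-in for the second classifier: rejects only page "c".
    return url != "c"


loose = cascade({"s1", "s2"}, links, not_c, min_confidence=1)
strict = cascade({"s1", "s2"}, links, not_c, min_confidence=2)
print(sorted(loose), sorted(strict))
```

Raising `min_confidence` trades recall for precision: with a threshold of 2 only the page linking to both seeds survives, and the cascade stops after the first round.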

20. An apparatus for retrieving unstructured text from the Internet related to a specified domain, the apparatus comprising:
- one or more processors; and
  
  a memory having instructions stored therein, the instructions executable by the one or more processors to perform operations as:
  
  a first classifier having been trained using training data comprising unstructured text related to the specified domain, the training data having a plurality of features, the unstructured text being separated from structured data and semi-structured data; and
  
  a similar web page retriever configured to retrieve, from the Internet, only web pages that include text that is unstructured and do not have at least some of the plurality of features of the training data, and where the retrieved web pages are similar to web pages classified by the first classifier, the similar web page retriever being configured to assign confidence values to the similar web pages to indicate likelihood of being relevant to the specified domain.

Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Inventors: Chalabi, Achraf Abdel Moneim Tawfik; Abdel-Reheem, Eslam Kamal Abdel-Aal; Abdelaziz, Sayed Hassan Sayed; Marton, Yuval Yehezkel; Gerguis, Michel Naim Naguib
Primary Examiner(s): Le, Debbie M
Application Number: US14/867,620
Publication Number: US 20170091313A1
Time in Patent Office: 1,352 Days
Field of Search: None
US Class Current
CPC Class Codes

G06F 16/334   Query execution G06F16/335 ...

G06F 16/35   Clustering; Classification

G06F 16/951   Indexing; Web crawling tech...

G06F 16/958   Organisation or management ...

G06N 20/00   Machine learning
