SOURCE EXPANSION FOR INFORMATION RETRIEVAL AND INFORMATION EXTRACTION
Abstract
A system, method, and computer program product for:
1) Preparing queries for retrieving related content based on existing data content. For instance, titles of existing documents or entities extracted from documents can be used as queries.
2) Retrieving content from other repositories of unstructured, semi-structured, or structured data. For instance, web pages can be retrieved using existing search engines.
3) Extracting smaller units of text from the retrieved content. For instance, web pages can be split into coherent paragraphs of text.
4) Judging the quality of the smaller units of text and their relatedness to existing data. For instance, paragraphs can be scored using a statistical model based on lexico-syntactic features and topic models.
5) Synthesizing new sources from high-quality related text. For instance, paragraphs that score above a threshold can be concatenated into a new document.
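The five steps of the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the patent: the function names are hypothetical, an in-memory list of pages stands in for a search engine, bag-of-words cosine similarity stands in for the statistical model based on lexico-syntactic features and topic models, and the threshold of 0.3 is arbitrary.

```python
import re
from collections import Counter
from math import sqrt


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def generate_queries(seed_docs):
    """Step 1: use the title of each existing document as a query."""
    return [doc["title"] for doc in seed_docs]


def retrieve(query, repository):
    """Step 2: hypothetical stand-in for a search engine.

    Returns every page that shares at least one token with the query.
    """
    query_tokens = set(tokenize(query))
    return [page for page in repository if query_tokens & set(tokenize(page))]


def extract_paragraphs(page):
    """Step 3: split a page into paragraphs at blank lines."""
    return [p.strip() for p in re.split(r"\n\s*\n", page) if p.strip()]


def relatedness(paragraph, seed_text):
    """Step 4: bag-of-words cosine similarity (an assumed, simplified scorer)."""
    a, b = Counter(tokenize(paragraph)), Counter(tokenize(seed_text))
    dot = sum(a[t] * b[t] for t in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def expand_source(seed_doc, repository, threshold=0.3):
    """Step 5: concatenate paragraphs scoring at or above the threshold."""
    selected = []
    for query in generate_queries([seed_doc]):
        for page in retrieve(query, repository):
            for paragraph in extract_paragraphs(page):
                score = relatedness(paragraph, seed_doc["text"])
                if score >= threshold and paragraph not in selected:
                    selected.append(paragraph)
    return "\n\n".join(selected)
```

Calling `expand_source` with a seed document (a dict with `title` and `text` keys) and a list of page strings returns one new document built from the related paragraphs. A realistic scorer would also judge text quality, which this similarity measure does not attempt.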
29 Claims
1. A method for automatically expanding existing data content that is included in a corpus comprising:
automatically generating search queries to search for content related to existing data, the queries being generated based on existing data content;
automatically retrieving content from one or more data repositories;
automatically extracting units of text from the retrieved content;
automatically determining a relevance of the extracted units of text and their relatedness to the existing data; and
automatically selecting new sources of content and including them in the corpus based on the determined relevance.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14
15. A system for automatically expanding existing data content that is included in a corpus comprising:
a memory storage device;
a processor in communication with said memory storage device configured to:
automatically generate search queries to search for content related to existing data, the queries being generated based on existing data content;
automatically retrieve content from one or more data repositories;
automatically extract units of text from the retrieved content;
automatically determine a relevance of the extracted units of text and their relatedness to the existing data; and
automatically select new sources of content and include them in the corpus based on the determined relevance.
Dependent claims: 16, 17, 18, 19, 20, 21, 22, 23, 24, 25
26. A computer program device for automatically expanding existing data content that is included in a corpus, the computer program device comprising a storage medium readable by a processing circuit and storing instructions run by the processing circuit for performing a method, the method comprising:
automatically generating search queries to search for content related to existing data, the queries being generated based on existing data content;
automatically retrieving content from one or more data repositories;
automatically extracting units of text from the retrieved content;
automatically determining a relevance of the extracted units of text and their relatedness to the existing data; and
automatically selecting new sources of content and including them in the corpus based on the determined relevance.
Dependent claims: 27, 28, 29
Specification