Source expansion for information retrieval and information extraction
Abstract
System, method and computer program product for expanding existing data content by:
1) Preparing queries for retrieving related content based on existing data content. For instance, titles of existing documents or entities extracted from documents can be used as queries.
2) Retrieving content from other repositories of unstructured, semi-structured, or structured data. For instance, web pages can be retrieved using existing search engines.
3) Extracting smaller units of text from the retrieved content. For instance, web pages can be split into coherent paragraphs of text.
4) Judging the quality of the smaller units of text and their relatedness to existing data. For instance, paragraphs can be scored using a statistical model based on lexico-syntactic features and topic models.
5) Synthesizing new sources from high-quality related text. For instance, paragraphs that score above a threshold can be concatenated into a new document.
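The five steps of the abstract can be sketched as a small pipeline. This is only an illustration under stated assumptions, not the patented implementation: `search` and `score` are hypothetical callables supplied by the caller, the `threshold` default is invented, the toy helpers exist only to make the example runnable, and paragraphs are split on blank lines as a stand-in for real markup-based splitting.

```python
from typing import Callable, Iterable, List


def expand_source(
    title: str,
    search: Callable[[str], Iterable[str]],  # hypothetical: query -> retrieved page texts
    score: Callable[[str, str], float],      # hypothetical: (title, paragraph) -> relevance
    threshold: float = 0.5,                  # assumed cut-off, not taken from the source
) -> str:
    """Build a new pseudo-document from paragraphs related to `title`."""
    # 1) Prepare a query from existing content (here: the document title).
    query = title
    # 2) Retrieve content from another repository.
    pages = search(query)
    # 3) Extract smaller units of text (blank-line-separated paragraphs).
    paragraphs: List[str] = []
    for page in pages:
        paragraphs.extend(p.strip() for p in page.split("\n\n") if p.strip())
    # 4) Judge each unit's relatedness to the existing data.
    scored = [(score(title, p), p) for p in paragraphs]
    # 5) Concatenate the units that score above the threshold, best first.
    kept = sorted((sp for sp in scored if sp[0] > threshold), key=lambda sp: -sp[0])
    return "\n\n".join(p for _, p in kept)


def toy_search(query: str) -> List[str]:
    """Stand-in for a search engine: returns two fixed 'pages'."""
    return [
        "Jazz is a music genre.\n\nBuy cheap tickets now!",
        "Jazz originated in New Orleans.",
    ]


def toy_score(title: str, paragraph: str) -> float:
    """Stand-in scorer: fraction of title words that occur in the paragraph."""
    words = title.lower().split()
    text = paragraph.lower()
    return sum(w in text for w in words) / len(words)


if __name__ == "__main__":
    print(expand_source("Jazz", toy_search, toy_score))
```

In a real system the scorer would be the statistical model mentioned in step 4 rather than a word-overlap heuristic.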
27 Claims
1. A method for automatically expanding existing data content that is included in a corpus comprising:
automatically identifying a topic from existing data in said corpus;
automatically generating search queries to search for content related to said topic identified from said existing data, the queries being generated based on said topic identified from existing data content in said corpus;
using said generated search queries for automatically conducting a search in and retrieving content from one or more other data repositories not including said corpus;
automatically extracting units of text from the retrieved content;
automatically determining a relevance of the extracted units of text and their relatedness to the topic identified from the existing data;
automatically selecting new sources of content and including them in the corpus based on the determined relevance to said identified topic including compiling a new document from the most relevant extracted text units, said new document being searchable with said existing data content, wherein the existing data content includes one or more seed documents, said automatically identifying a topic comprising:
generating from said one or more seed documents, a topic name and a topic descriptor corresponding to units extracted from said one or more documents, said generated search queries including one or more:
said topic name or words and phrases extracted from said topic descriptor, and wherein said retrieving content includes:
running, using one or more search engines, said search queries against the one or more external data repositories, said content retrieved including one or more text passages or documents;
said extracting units of text comprising:
splitting the retrieved text passages or documents into smaller text units, said splitting using structural markup for demarcating text unit boundaries; and
said determining the relevance of the extracted text units from said retrieved passages or documents including:
scoring each said text unit using a statistical model based on a lexico-syntactic feature, said lexico-syntactic feature includes a topicality feature, a search feature and a surface feature;
wherein said automatically determining a relevance of the extracted units includes fitting a logistic regression (LR) model using said topicality, search and surface features and a generation level to estimate a relevance score of each independent text unit based on their relevance to said topic of the seed document; and
said scoring further including:
computing a likelihood ratio of a text unit estimated with a topic model and a background language model, said topic model being estimated from the seed document, and said background language model being estimated from a sample of documents from said corpus, wherein one or more processor units in communication with a memory storage device performs said generating, retrieving, extracting, relevance determining and selecting.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9
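The likelihood-ratio element of claim 1 compares how well a topic model and a background language model explain a text unit. A minimal sketch follows, assuming unigram models with add-one smoothing and a length-normalized log ratio. Those modeling choices, the whitespace tokenizer, and all names below are assumptions for illustration; the claim does not specify them.

```python
import math
from collections import Counter
from typing import Iterable, List


def tokenize(text: str) -> List[str]:
    """Lowercase whitespace tokenizer (an assumed simplification)."""
    return text.lower().split()


class UnigramModel:
    """Add-one-smoothed unigram language model (assumed, not from the claim)."""

    def __init__(self, texts: Iterable[str], vocab_size: int):
        self.counts = Counter(tok for text in texts for tok in tokenize(text))
        self.total = sum(self.counts.values())
        self.vocab_size = vocab_size

    def log_prob(self, token: str) -> float:
        # Unseen tokens get a small non-zero probability through smoothing.
        return math.log((self.counts[token] + 1) / (self.total + self.vocab_size))


def log_likelihood_ratio(unit: str, topic: UnigramModel, background: UnigramModel) -> float:
    """Per-token log of P(unit | topic model) / P(unit | background model)."""
    tokens = tokenize(unit)
    if not tokens:
        return 0.0
    diff = sum(topic.log_prob(t) - background.log_prob(t) for t in tokens)
    return diff / len(tokens)


# Toy data: the topic model comes from a seed document, the background
# model from a sample of other documents in the corpus.
seed = ["jazz is a music genre that originated in new orleans"]
sample = ["the stock market fell today", "the weather is sunny in the city"]
vocab = {tok for text in seed + sample for tok in tokenize(text)}
V = len(vocab) + 1  # one extra slot reserved for unseen tokens

topic_model = UnigramModel(seed, V)
background_model = UnigramModel(sample, V)

on_topic = log_likelihood_ratio("jazz music in new orleans", topic_model, background_model)
off_topic = log_likelihood_ratio("the stock market", topic_model, background_model)
```

A positive value means the topic model explains the unit better than the background model, so `on_topic` is positive and `off_topic` is negative on this toy data.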
10. A system for automatically expanding existing data content that is included in a corpus comprising:
a memory storage device;
a processor in communication with said memory storage device configured to:
automatically identify a topic from existing data in said corpus, said existing data comprising one or more seed documents;
automatically generate search queries to search for content related to said topic identified from said existing data, the queries being generated based on said topic identified from a seed document in said corpus;
using said generated search queries to automatically conduct a search in and retrieve content from one or more other data repositories not including said corpus;
automatically extract units of text from the retrieved content;
automatically determine a relevance of the extracted units of text and their relatedness to the topic identified from the existing data; and
automatically select new sources of content and include them in the corpus based on the determined relevance to said identified topic including compiling a new document from the most relevant extracted text units, said new document being searchable with said existing data content, wherein to automatically identify said topic, said processor is further configured to:
generate from said one or more seed documents, a topic name and a topic descriptor corresponding to units extracted from said one or more documents, said generated search queries including one or more:
said topic name or words and phrases extracted from said topic descriptor, and wherein to retrieve content, said processor is further configured to:
use search engines to run said search queries against the one or more external data repositories, said content retrieved including one or more text passages or documents;
extract units of text by splitting the retrieved text passages or documents into smaller text units using structural markup for demarcating text unit boundaries; and
said processor further configured to:
determine the relevance of the text units from said retrieved passages or documents by scoring each said text unit using a statistical model based on a lexico-syntactic feature, said lexico-syntactic feature includes a topicality feature, a search feature and a surface feature;
wherein to automatically determine a relevance of the extracted units includes fitting a logistic regression (LR) model using said topicality, search and surface features and a generation level to estimate a relevance score of each independent text unit based on their relevance to said topic of the seed document; and
compute a score based further on a topicality feature by one of:
computing a likelihood ratio of a text unit estimated with a topic model and a background language model, said topic model being estimated from the seed document, and said background language model being estimated from a sample of documents from said corpus.
Dependent claims: 11, 12, 13, 14, 15, 16, 17, 18
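Claim 10 recites fitting a logistic regression (LR) model on topicality, search and surface features plus a generation level. The sketch below fits such a model with plain batch gradient descent. The feature values, labels, learning rate, and the numeric encoding of the generation level are all made-up assumptions chosen only to show the mechanics; the claim does not define them.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic_regression(
    X: Sequence[Sequence[float]],
    y: Sequence[int],
    lr: float = 0.5,      # assumed learning rate
    epochs: int = 2000,   # assumed number of passes over the data
) -> Tuple[List[float], float]:
    """Fit weights and bias by batch gradient descent on the log loss."""
    n_features = len(X[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for features, label in zip(X, y):
            z = bias + sum(w * f for w, f in zip(weights, features))
            error = sigmoid(z) - label
            for j, f in enumerate(features):
                grad_w[j] += error * f
            grad_b += error
        weights = [w - lr * g / len(X) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(X)
    return weights, bias


def relevance_score(features: Sequence[float], weights: Sequence[float], bias: float) -> float:
    """Estimated probability that a text unit is relevant to the topic."""
    return sigmoid(bias + sum(w * f for w, f in zip(weights, features)))


# Hypothetical training rows: [topicality, search, surface, generation_level]
X_train = [
    [0.9, 0.8, 0.7, 0.0],
    [0.8, 0.9, 0.6, 1.0],
    [0.7, 0.6, 0.8, 0.0],
    [0.2, 0.3, 0.4, 1.0],
    [0.1, 0.2, 0.5, 0.0],
    [0.3, 0.1, 0.3, 1.0],
]
y_train = [1, 1, 1, 0, 0, 0]  # 1 = judged relevant, 0 = not relevant

weights, bias = fit_logistic_regression(X_train, y_train)
relevant = relevance_score([0.85, 0.85, 0.7, 0.0], weights, bias)
irrelevant = relevance_score([0.15, 0.2, 0.4, 1.0], weights, bias)
```

Each text unit is scored independently, which matches the claim's wording of estimating a relevance score "of each independent text unit".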
19. A computer program device for automatically expanding existing data content that is included in a corpus, the computer program device comprising a non-transitory storage medium readable by a processing circuit and storing instructions run by the processing circuit for performing a method, the method comprising:
automatically identifying a topic from existing data in said corpus;
automatically generating search queries to search for content related to said topic identified from said existing data, the queries being generated based on said topic identified from said existing data content in said corpus;
using said generated search queries for automatically conducting a search in and retrieving content from one or more other data repositories not including said corpus;
automatically extracting units of text from the retrieved content;
automatically determining a relevance of the extracted units of text and their relatedness to the topic identified from the existing data; and
automatically selecting new sources of content and including them in the corpus based on the determined relevance to the identified topic including compiling a new document from the most relevant extracted text units, said new document being searchable with said existing data content, wherein the existing data content includes one or more seed documents, said automatically identifying a topic comprising:
generating from said one or more seed documents, a topic name and a topic descriptor corresponding to units extracted from said one or more documents, said generated search queries including:
said topic name or words and phrases extracted from said topic descriptor, and wherein said retrieving content includes:
running, using one or more search engines, said search queries against the one or more external data repositories, said content retrieved including one or more text passages or documents;
said extracting units of text comprising:
splitting the retrieved text passages or documents into smaller text units, said splitting using structural markup for demarcating text unit boundaries; and
said determining the relevance of the text units from said retrieved passages or documents including:
scoring each said text unit using a statistical model based on a lexico-syntactic feature, said lexico-syntactic feature includes a topicality feature, a search feature and a surface feature;
wherein said automatically determining a relevance of the extracted units includes fitting a logistic regression (LR) model using said topicality, search and surface features and a generation level to estimate a relevance score of each independent text unit based on their relevance to said topic of the seed document; and
said scoring is based further on a topicality feature including:
computing a likelihood ratio of a text unit estimated with a topic model and a background language model, said topic model being estimated from a seed document, and said background language model being estimated from a sample of documents in said corpus.
Dependent claims: 20, 21, 22
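The splitting step of claim 19 uses structural markup to mark text unit boundaries. One way to sketch this for HTML input is with Python's standard `html.parser` module, treating block-level tags as boundaries. The choice of HTML and the particular tag set in `BLOCK_TAGS` are assumptions for illustration; the claim does not name a markup language or specific tags.

```python
from html.parser import HTMLParser
from typing import List

# Assumed set of tags that delimit text units; not specified by the claim.
BLOCK_TAGS = {"p", "div", "li", "tr", "br", "h1", "h2", "h3", "h4", "h5", "h6"}


class TextUnitSplitter(HTMLParser):
    """Collects the text between block-level tags as separate units."""

    def __init__(self) -> None:
        super().__init__()
        self.units: List[str] = []
        self._buffer: List[str] = []

    def flush(self) -> None:
        # Join buffered fragments and normalize whitespace.
        text = " ".join("".join(self._buffer).split())
        if text:
            self.units.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        # Inline tags such as <b> do not interrupt the current unit.
        self._buffer.append(data)


def split_into_units(html: str) -> List[str]:
    parser = TextUnitSplitter()
    parser.feed(html)
    parser.close()
    parser.flush()  # emit any trailing text after the last tag
    return parser.units


page = "<h1>Jazz</h1><p>Jazz is a <b>music</b> genre.</p><p>It began in New Orleans.</p>"
units = split_into_units(page)
```

Here `units` holds three entries: the heading and the two paragraphs, with the inline `<b>` tag left inside its paragraph.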
23. A method for automatically expanding existing data content that is included in a corpus comprising:
automatically identifying a topic from existing data in said corpus;
automatically generating search queries to search for content related to said topic identified from said existing data, the queries being generated based on said identified topic;
using said generated search queries for automatically conducting a search in and retrieving content from one or more other data repositories not including said corpus;
automatically extracting units of text from the retrieved content;
automatically determining a relevance of the extracted units of text and their relatedness to the topic identified from said existing data; and
automatically selecting new sources of content and including them in the corpus based on the determined relevance to said identified topic including compiling a new document from the most relevant extracted text units, said new document being searchable with said existing data content, wherein the existing data content includes one or more seed documents, said automatically identifying a topic comprising:
generating from said one or more documents, a topic name and a topic descriptor corresponding to units extracted from said one or more documents, said generated search queries including:
said topic name or words and phrases extracted from said topic descriptor, and wherein said retrieving content includes:
running, using one or more search engines, said search queries against the one or more external data repositories, said content retrieved including one or more text passages or documents;
said extracting units of text comprising:
splitting the retrieved text passages or documents into smaller text units, said splitting using structural markup for demarcating text unit boundaries; and
said determining the relevance of the text units from said retrieved passages or documents including:
scoring each said text unit using a statistical model based on a lexico-syntactic feature, said lexico-syntactic feature includes a topicality feature, a search feature and a surface feature;
wherein said automatically determining a relevance of the extracted units includes fitting a logistic regression (LR) model using said topicality, search and surface features and a generation level to estimate a relevance score of each independent text unit based on their relevance to said topic of the seed document; and
said scoring is based further on a topicality feature including:
computing a likelihood ratio of a text unit estimated with a topic model and a background language model, said topic model being estimated from text units retrieved for a given topic, and said background language model being estimated from a sample of text units retrieved for different topics identified in documents of said corpus, wherein one or more processor units in communication with a memory storage device performs said generating, retrieving, extracting, relevance determining and selecting.
Dependent claims: 24, 25, 26, 27
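Claim 23 differs from the other independent claims in where the two models come from: the topic model is estimated from text units retrieved for the given topic, and the background model from a sample of text units retrieved for different topics. As a hedged illustration, the ratio for a text unit $u$ could be written as below. The symbols $\Lambda$, $\theta_T$ and $\theta_B$ are notation introduced here, and the factorization over words assumes unigram models, which the claim does not require.

```latex
\[
\Lambda(u) \;=\; \frac{P(u \mid \theta_T)}{P(u \mid \theta_B)}
\;=\; \prod_{i=1}^{n} \frac{P(w_i \mid \theta_T)}{P(w_i \mid \theta_B)},
\qquad u = w_1 w_2 \cdots w_n
\]
% \theta_T: topic model estimated from text units retrieved for the given topic
% \theta_B: background model estimated from a sample of text units
%           retrieved for different topics
```

Values of $\Lambda(u)$ above 1 indicate that the unit looks more like text retrieved for this topic than like text retrieved for other topics.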
Specification