
Source expansion for information retrieval and information extraction

  • US 8,892,550 B2
  • Filed: 09/24/2010
  • Issued: 11/18/2014
  • Est. Priority Date: 09/24/2010
  • Status: Active Grant
First Claim

1. A method for automatically expanding existing data content that is included in a corpus comprising:

  • automatically identifying a topic from existing data in said corpus;

    automatically generating search queries to search for content related to said topic identified from said existing data, the queries being generated based on said topic identified from existing data content in said corpus;

    using said generated search queries for automatically conducting a search in and retrieving content from one or more other data repositories not including said corpus;

    automatically extracting units of text from the retrieved content;

    automatically determining a relevance of the extracted units of text and their relatedness to the topic identified from the existing data;

    automatically selecting new sources of content and including them in the corpus based on the determined relevance to said identified topic including compiling a new document from the most relevant extracted text units, said new document being searchable with said existing data content, wherein the existing data content includes one or more seed documents, said automatically identifying a topic comprising;

    generating from said one or more seed documents, a topic name and a topic descriptor corresponding to units extracted from said one or more documents, said generated search queries including one or more;

    said topic name or words and phrases extracted from said topic descriptor, and wherein said retrieving content includes;

    running, using one or more search engines, said search queries against the one or more external data repositories, said content retrieved including one or more text passages or documents;

    said extracting units of text comprising;

    splitting the retrieved text passages or documents into smaller text units, said splitting using structural markup for demarcating text unit boundaries; and

    said determining the relevance of the extracted text units from said retrieved passages or documents including;

    scoring each said text unit using a statistical model based on a lexico-syntactic feature, said lexico-syntactic feature includes a topicality feature, a search feature and a surface feature;

    wherein said automatically determining a relevance of the extracted units includes fitting a logistic regression (LR) model using said topicality, search and surface features and a generation level to estimate a relevance score of each independent text unit based on their relevance to said topic of the seed document; and

    said scoring further including;

    computing a likelihood ratio of a text unit estimated with a topic model and a background language model, said topic model being estimated from the seed document, and said background language model being estimated from a sample of documents from said corpus, wherein one or more processor units in communication with a memory storage device performs said generating, retrieving, extracting, relevance determining and selecting.
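
Two of the steps recited in the claim, splitting retrieved content at structural markup and computing a likelihood ratio between a topic model and a background language model, can be illustrated with a minimal Python sketch. It is not the patented implementation. The function names, the list of HTML tags treated as unit boundaries, the unigram models, the add-one smoothing, and the length-normalized log form of the ratio are all assumptions chosen for illustration. In the claim, this ratio is one input to a logistic regression model, which is omitted here.

```python
import math
import re
from collections import Counter

# Assumption: these HTML tags stand in for the "structural markup" of the claim.
BOUNDARY = re.compile(r"</?(?:p|li|h[1-6]|br|div)\b[^>]*>", re.I)


def split_units(html):
    """Split retrieved markup into plain-text units at block-level tags."""
    units = []
    for part in BOUNDARY.split(html):
        # Drop any remaining inline tags and normalize whitespace.
        text = " ".join(re.sub(r"<[^>]+>", " ", part).split())
        if text:
            units.append(text)
    return units


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def unigram_model(tokens, vocab, alpha=1.0):
    """Unigram language model with add-alpha smoothing over a shared vocabulary."""
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


def log_likelihood_ratio(unit, topic_lm, background_lm):
    """Average per-token log ratio of topic to background probability.

    Tokens outside the shared vocabulary are skipped (an assumption).
    """
    tokens = [t for t in tokenize(unit) if t in topic_lm]
    if not tokens:
        return 0.0
    total = sum(math.log(topic_lm[t] / background_lm[t]) for t in tokens)
    return total / len(tokens)


# Toy data (hypothetical): one seed document and a small background sample.
seed = "Jupiter is the largest planet. Jupiter has many moons."
background = "The market fell today. The team won the game. The planet is warm."
retrieved = "<p>Jupiter has a large moon.</p><p>The team won the market game.</p>"

seed_tokens = tokenize(seed)
background_tokens = tokenize(background)
vocab = set(seed_tokens) | set(background_tokens)

topic_lm = unigram_model(seed_tokens, vocab)
background_lm = unigram_model(background_tokens, vocab)

units = split_units(retrieved)
scores = [log_likelihood_ratio(u, topic_lm, background_lm) for u in units]
for unit, score in zip(units, scores):
    print(f"{score:+.3f}  {unit}")
```

In this toy setup the unit that shares vocabulary with the seed document receives a positive score and the off-topic unit a negative one. A fuller pipeline along the lines of the claim would combine this value with the search and surface features before scoring.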
