Method for data and text mining and literature-based discovery

  • US 20040064438A1
  • Filed: 08/22/2003
  • Published: 04/01/2004
  • Est. Priority Date: 09/30/2002
  • Status: Active Grant
First Claim

1. A method of retrieving data, relevant to a topic of interest, from at least one collection of documents, comprising the steps of:

    selecting at least one collection;

    applying a test query to said at least one collection, thereby retrieving a first set of documents from said at least one collection, said test query including at least one test query term;

    classifying each document in a representative sample of documents within said first set of documents according to their relevance to said topic;

    extracting all phrases from said first set of documents;

    selecting high frequency, high technical content phrases from said extracted phrases;

    performing a phrase frequency analysis of at least one group of said first set of documents having a greater relevance to said subject matter than other documents within said first set of documents to generate a list of phrases including phrase frequency data for each listed phrase;

    grouping said selected high frequency, high technical content phrases into thematic categories;

    identifying at least one anchor phrase within said phrase frequency analyzed documents for each of said thematic categories;

    analyzing phrase co-occurrence of phrases in said phrase frequency analyzed documents to generate a list of co-occurrence pairs including co-occurrence data for each listed co-occurrence pair, each said co-occurrence pair consisting of an anchor phrase and another listed phrase;

    combining said list of phrases with said list of co-occurrence pairs to form a list of candidate query terms;

    selecting a plurality of listed query terms from said list of candidates;

    applying an additional query to said at least one collection, said additional query being said plurality of said listed query terms, thereby retrieving an additional set of documents from said at least one collection;

    classifying at least a representative sample of documents within said additional set of documents according to their relevance to said topic;

    determining, based upon said classification of said representative sample of said additional set of documents, the ratio of relevant to non-relevant documents that are retrieved by each term of said selected plurality of listed query terms;

    building a narrowed query consisting of those listed query terms within said plurality of query terms for which said ratio is above a predetermined lower limit;

    applying said narrowed query to said collection, thereby retrieving another set of documents from said at least one collection.
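
The phrase frequency, co-occurrence, and query-narrowing steps recited above can be illustrated with a minimal Python sketch. It is not the patented implementation. The function names (`phrase_frequencies`, `anchor_cooccurrence`, `relevance_ratios`, `build_narrowed_query`) are hypothetical, and so are several simplifications the claim does not specify: phrases are treated as fixed-length word n-grams, co-occurrence is counted at the document level, and a term is taken to "retrieve" a document when it appears in the document as a substring.

```python
import re
from collections import Counter


def phrase_frequencies(texts, n=2):
    """Count word n-grams across texts (assumption: a phrase is an n-gram)."""
    counts = Counter()
    for text in texts:
        words = re.findall(r"[a-z0-9]+", text.lower())
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1
    return counts


def anchor_cooccurrence(texts, anchor, phrases):
    """Count documents in which the anchor phrase and each listed phrase both
    appear (assumption: co-occurrence is measured per document)."""
    pairs = Counter()
    for text in texts:
        lowered = text.lower()
        if anchor in lowered:
            for phrase in phrases:
                if phrase != anchor and phrase in lowered:
                    pairs[(anchor, phrase)] += 1
    return pairs


def relevance_ratios(query_terms, classified_docs):
    """Ratio of relevant to non-relevant documents retrieved by each term.

    classified_docs is a list of (text, is_relevant) pairs standing in for the
    manually classified representative sample.
    """
    ratios = {}
    for term in query_terms:
        relevant = non_relevant = 0
        for text, is_relevant in classified_docs:
            if term in text.lower():
                if is_relevant:
                    relevant += 1
                else:
                    non_relevant += 1
        if relevant == 0 and non_relevant == 0:
            continue  # the term retrieved nothing in the sample; no ratio
        # A term that retrieves only relevant documents gets an infinite ratio.
        ratios[term] = float("inf") if non_relevant == 0 else relevant / non_relevant
    return ratios


def build_narrowed_query(query_terms, classified_docs, lower_limit):
    """Keep only the terms whose ratio is above the predetermined lower limit."""
    ratios = relevance_ratios(query_terms, classified_docs)
    return [t for t in query_terms if ratios.get(t, 0.0) > lower_limit]
```

In this sketch, calling `build_narrowed_query(terms, sample, lower_limit=1.0)` would keep only terms that retrieve more relevant than non-relevant documents in the classified sample. The resulting list would then be applied to the collection as the narrowed query. The choice of `lower_limit` is left open here, as the claim only calls it "predetermined".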
