Method for data and text mining and literature-based discovery
Abstract
Text searching is achieved by techniques including phrase frequency analysis and phrase co-occurrence analysis. In many cases, factor matrix analysis is also advantageously applied to select high technical content phrases to be analyzed for possible inclusion within a new query. The described techniques may be used to retrieve data, determine levels of emphasis within a collection of data, determine the desirability of conflating search terms, detect symmetry or asymmetry between two text elements within a collection of documents, generate a taxonomy of documents within a collection, and perform literature-based problem solving. (This abstract is intended only to aid those searching patents, and is not intended to limit the disclosure or claims in any manner.)
26 Claims
1. A method of retrieving data, relevant to a topic of interest, from at least one collection of documents, comprising the steps of:
selecting at least one collection;
applying a test query to said at least one collection, thereby retrieving a first set of documents from said at least one collection, said test query including at least one test query term;
classifying each document in a representative sample of documents within said first set of documents according to their relevance to said topic;
extracting all phrases from said first set of documents;
selecting high frequency, high technical content phrases from said extracted phrases;
performing a phrase frequency analysis of at least one group of said first set of documents having a greater relevance to said subject matter than other documents within said first set of documents to generate a list of phrases including phrase frequency data for each listed phrase;
grouping said selected high frequency, high technical content phrases into thematic categories;
identifying at least one anchor phrase within said phrase frequency analyzed documents for each of said thematic categories;
analyzing phrase co-occurrence of phrases in said phrase frequency analyzed documents to generate a list of co-occurrence pairs, each said co-occurrence pair consisting of an anchor phrase and another listed phrase, to generate a list of co-occurrence pairs including co-occurrence data for each listed co-occurrence pair;
combining said list of phrases with said list of co-occurrence pairs to form a list of candidate query terms;
selecting a plurality of listed query terms from said list of candidates;
applying an additional query to said at least one collection, said additional query being said plurality of said listed query terms, thereby retrieving an additional set of documents from said at least one collection;
classifying at least a representative sample of documents within said additional set of documents according to their relevance to said topic;
determining, based upon said classification of said representative sample of said additional set of documents, the ratio of relevant to non-relevant documents that are retrieved by each term of said selected plurality of listed query terms;
building a narrowed query consisting of those listed query terms within said plurality of query terms for which said ratio is above a predetermined lower limit;
applying said narrowed query to said collection, thereby retrieving another set of documents from said at least one collection.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9
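The last steps of claim 1 (scoring each query term by its ratio of relevant to non-relevant retrieved documents, then keeping only terms above a lower limit) can be illustrated with a minimal sketch. It is not the patented implementation: the function names, the case-insensitive substring match used as the "retrieval" rule, the toy sample, and the limit of 1.5 are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

def term_relevance_ratios(
    sample: List[Tuple[str, bool]], terms: List[str]
) -> Dict[str, float]:
    """Ratio of relevant to non-relevant sampled documents retrieved by each term.

    `sample` holds (document text, judged relevant?) pairs. A term is assumed
    to retrieve a document when it occurs as a case-insensitive substring.
    """
    ratios: Dict[str, float] = {}
    for term in terms:
        needle = term.lower()
        hits = [relevant for text, relevant in sample if needle in text.lower()]
        n_relevant = sum(1 for relevant in hits if relevant)
        n_non_relevant = len(hits) - n_relevant
        if n_non_relevant == 0:
            # No non-relevant hits: treat the ratio as unbounded (or 0 with no hits).
            ratios[term] = float("inf") if n_relevant else 0.0
        else:
            ratios[term] = n_relevant / n_non_relevant
    return ratios

def narrow_query(ratios: Dict[str, float], lower_limit: float) -> List[str]:
    """Keep only the terms whose ratio is above the predetermined lower limit."""
    return [term for term, ratio in ratios.items() if ratio > lower_limit]

# Toy classified sample (hypothetical data).
sample = [
    ("Text mining of the biomedical literature", True),
    ("Literature-based discovery using text mining", True),
    ("Coal mining safety literature review", False),
    ("Mining equipment maintenance", False),
]
ratios = term_relevance_ratios(sample, ["text mining", "literature", "mining"])
narrowed = narrow_query(ratios, lower_limit=1.5)
print(narrowed)
```

In this toy sample the bare term "mining" retrieves as many non-relevant as relevant documents, so it falls below the limit and is dropped from the narrowed query.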
10. A method of determining levels of emphasis, comprising the steps of:
selecting a collection of documents, each document containing at least one unstructured field;
extracting all phrases from said unstructured field;
filtering all extracted phrases to generate a list of high technical content phrases;
generating a co-occurrence matrix of high technical content phrases for said unstructured field;
normalizing matrix cell values of said co-occurrence matrix to generate a normalized matrix for said field;
grouping phrases from said unstructured field by clustering techniques on said normalized matrix;
summing the phrase frequencies of occurrence within each group, thereby indicating a level of emphasis for each group generated from said collection.
Dependent claims: 11, 12, 13, 14, 15, 16, 17, 18, 19
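The matrix steps of claim 10 can be sketched as follows. The claim does not fix a particular normalization or clustering technique, so this sketch assumes two simple choices: the equivalence index c_ij^2 / (c_i * c_j) for normalization, and single-link grouping of phrase pairs whose normalized value meets a threshold. The input is assumed to be already extracted and filtered phrases, one list per document; all names and data are hypothetical.

```python
from collections import Counter
from itertools import combinations

def cooccurrence(docs):
    """Count phrase occurrences, document frequencies, and pairwise co-occurrences."""
    occurrences = Counter()  # total occurrences of each phrase
    doc_freq = Counter()     # number of documents containing each phrase
    pairs = Counter()        # number of documents containing both phrases of a pair
    for phrases in docs:
        occurrences.update(phrases)
        present = sorted(set(phrases))
        doc_freq.update(present)
        pairs.update(combinations(present, 2))
    return occurrences, doc_freq, pairs

def normalize(pairs, doc_freq):
    """Equivalence index (an assumed normalization): c_ij^2 / (c_i * c_j)."""
    return {(a, b): c * c / (doc_freq[a] * doc_freq[b]) for (a, b), c in pairs.items()}

def group_phrases(phrases, normalized, threshold):
    """Single-link grouping: join phrases whose normalized value is >= threshold."""
    parent = {p: p for p in phrases}

    def find(p):
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for (a, b), value in normalized.items():
        if value >= threshold:
            parent[find(a)] = find(b)
    groups = {}
    for p in phrases:
        groups.setdefault(find(p), []).append(p)
    return [sorted(g) for g in groups.values()]

def emphasis_levels(groups, occurrences):
    """Sum phrase frequencies within each group; larger sums mean more emphasis."""
    totals = [(g, sum(occurrences[p] for p in g)) for g in groups]
    return sorted(totals, key=lambda item: -item[1])

# Hypothetical filtered phrases, one list per document.
docs = [
    ["neural network", "back propagation", "neural network"],
    ["neural network", "back propagation", "back propagation"],
    ["genetic algorithm", "fitness function"],
    ["genetic algorithm", "fitness function", "genetic algorithm"],
    ["neural network", "genetic algorithm"],
]
occurrences, doc_freq, pairs = cooccurrence(docs)
normalized = normalize(pairs, doc_freq)
groups = group_phrases(sorted(occurrences), normalized, threshold=0.5)
levels = emphasis_levels(groups, occurrences)
print(levels)
```

With this toy data the two themes separate cleanly, and the summed frequencies (7 versus 6) give the relative emphasis of each group.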
20. A method of classifying documents retrieved during a collection search, comprising the steps of:
performing a phrase frequency analysis upon said documents to obtain theme and sub-theme relationships and taxonomies of all high technical content phrases in said documents.
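As an illustration of the counting step behind a phrase frequency analysis, the sketch below treats a "phrase" as any run of one to three adjacent words that contains no stopword. That definition, the stopword list, and the function names are assumptions; deriving theme and sub-theme relationships from the counts is outside the scope of the sketch.

```python
import re
from collections import Counter

# A small illustrative stopword list (an assumption, not taken from the patent).
STOPWORDS = {"a", "an", "and", "the", "of", "in", "for", "to", "is", "are", "on", "with", "by"}

def extract_phrases(text, max_words=3):
    """Yield every run of 1..max_words adjacent words that contains no stopword."""
    words = re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())
    for start in range(len(words)):
        for n in range(1, max_words + 1):
            gram = words[start:start + n]
            # Stop growing the phrase at the end of the text or at a stopword.
            if len(gram) < n or any(w in STOPWORDS for w in gram):
                break
            yield " ".join(gram)

def phrase_frequencies(docs, max_words=3):
    """Count phrase occurrences over all documents."""
    counts = Counter()
    for text in docs:
        counts.update(extract_phrases(text, max_words))
    return counts

counts = phrase_frequencies(["Text mining of the literature", "Text mining and data mining"])
print(counts.most_common())
```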
21. A method of generating a taxonomy of a collection of documents, comprising the steps of:
selecting a collection of documents, each document containing at least one structured field;
extracting all phrases from said structured field;
factor matrix filtering all of said extracted phrases to generate a list of high technical content phrases;
generating a co-occurrence matrix of said listed phrases for said field;
normalizing cell values of said co-occurrence matrix to generate a normalized matrix for said field;
grouping said listed phrases for each said field using clustering techniques on said normalized matrix;
summing the frequencies of occurrence within each group, thereby indicating a level of emphasis for each group generated from said collection.
Dependent claims: 22, 23, 24
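The claims above do not spell out how factor matrix filtering is carried out. One plausible reading, sketched here purely as an assumption, is to compute a phrase-by-phrase correlation matrix from per-document phrase counts, extract the leading factors, and keep the phrases whose absolute loading on any extracted factor reaches a cutoff. The sketch uses unrotated principal-component loadings obtained by power iteration with deflation, which is a simplification of a full factor analysis; names, data, and the 0.5 cutoff are hypothetical.

```python
import math

def correlation_matrix(counts):
    """counts[d][p] is the count of phrase p in document d; returns Pearson correlations."""
    n_docs, n_phr = len(counts), len(counts[0])
    means = [sum(row[j] for row in counts) / n_docs for j in range(n_phr)]
    centered = [[row[j] - means[j] for j in range(n_phr)] for row in counts]
    norms = [math.sqrt(sum(row[j] ** 2 for row in centered)) for j in range(n_phr)]
    corr = [[0.0] * n_phr for _ in range(n_phr)]
    for i in range(n_phr):
        for j in range(n_phr):
            if norms[i] > 0 and norms[j] > 0:
                dot = sum(row[i] * row[j] for row in centered)
                corr[i][j] = dot / (norms[i] * norms[j])
    return corr

def factor_loadings(corr, n_factors, iterations=200):
    """Leading factors by power iteration with deflation; loading = sqrt(eigenvalue) * eigenvector."""
    n = len(corr)
    work = [row[:] for row in corr]
    factors = []
    for _ in range(n_factors):
        vec = [1.0 / (i + 1) for i in range(n)]  # fixed, non-uniform start vector
        for _ in range(iterations):
            nxt = [sum(work[i][j] * vec[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in nxt))
            if norm == 0.0:
                break
            vec = [x / norm for x in nxt]
        length = math.sqrt(sum(v * v for v in vec))
        unit = [v / length for v in vec]
        mv = [sum(work[i][j] * unit[j] for j in range(n)) for i in range(n)]
        eigval = sum(u * m for u, m in zip(unit, mv))  # Rayleigh quotient
        scale = math.sqrt(max(eigval, 0.0))
        factors.append([scale * u for u in unit])
        for i in range(n):  # deflate so the next pass finds the next factor
            for j in range(n):
                work[i][j] -= eigval * unit[i] * unit[j]
    return factors

def filter_phrases(phrases, factors, min_loading):
    """Keep phrases whose absolute loading on any factor is at least min_loading."""
    return [p for j, p in enumerate(phrases)
            if any(abs(f[j]) >= min_loading for f in factors)]

# Hypothetical counts: rows are documents, columns follow `phrases`.
phrases = ["neural network", "back propagation", "present study"]
counts = [[2, 1, 1], [0, 0, 1], [2, 1, 0], [0, 0, 0]]
factors = factor_loadings(correlation_matrix(counts), n_factors=1)
kept = filter_phrases(phrases, factors, min_loading=0.5)
print(kept)
```

Here the first two phrases vary together across documents and load on the single extracted factor, while the third is uncorrelated with them and is filtered out. The retained list would then feed the co-occurrence, normalization, and grouping steps recited in the claim.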
25. A method of literature-based problem solving, comprising the steps of:
identifying a problem;
selecting a source database comprising documents related to said problem, each of said documents including at least one unstructured field;
retrieving all documents relevant to said problem from said source database to form a set of initially retrieved documents;
extracting all phrases from said unstructured field of said set of initially retrieved documents;
factor matrix filtering all of said extracted phrases to generate a first list of high technical content phrases;
generating a co-occurrence matrix of said high technical content phrases from said first list;
normalizing matrix cell values of said co-occurrence matrix to generate a normalized matrix for said field;
grouping phrases from said unstructured field into thematic categories and subcategories by clustering techniques on said normalized matrix;
generating a directly related topical literature for each said subcategory by retrieving documents related to each of said subcategories, each said directly related topical literature being disjoint with said selected documents and with said directly related topical literature from said other subcategories, each document in each said directly related topical literature including at least one unstructured field;
extracting all phrases from said unstructured field of said directly related topical literature documents;
filtering all of said extracted phrases from said directly related topical literature documents to generate a second list of high technical content phrases;
generating a co-occurrence matrix of high technical content phrases from said second list;
normalizing matrix cell values of said co-occurrence matrix to generate a normalized matrix for said unstructured field from said topical literature documents;
grouping phrases from said unstructured field of said directly related topical literature documents into thematic categories by clustering techniques on said normalized matrix;
dividing said thematic categories into a first set of categories representing specific solutions to said problem and a second set of categories that do not represent specific solutions to said problem;
generating, for each of said second set of categories, a corresponding disjoint indirectly related literature;
extracting all phrases from each said indirectly related literature;
filtering all of said phrases extracted from said indirectly related literatures to generate a second list of high technical content phrases;
grouping said high technical content phrases from said second list into thematic categories for each said indirectly related literature, the set of categories consisting of said first set of categories and said high technical content phrases from said second list into thematic categories for each said indirectly related literature to form a set of basis categories;
generating, for each of said first set of categories, and for each of said indirectly related literature thematic categories that represent potential solutions to said problem, a third list of phrases, phrase combinations, and phrase co-occurrences;
filtering said third lists to remove all phrases and phrase combinations that appear in said initially retrieved documents, thereby forming filtered third lists;
determining the number of categories and the sum of frequencies over all of said basis categories for each phrase and phrase co-occurrence on said filtered third list;
ranking said phrases and phrase co-occurrences on said filtered third list by the number of categories and the sum of frequencies over all of said basis categories.
Dependent claims: 26
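The final filtering and ranking steps of claim 25 can be sketched briefly. The sketch assumes the per-category phrase frequencies have already been computed, and it assumes a particular sort order (number of categories first, then summed frequency, then alphabetical), which the claim does not specify. The category names, phrases, and counts are made-up toy values.

```python
from collections import defaultdict

def rank_candidates(category_counts, initial_phrases):
    """Rank candidate phrases found in the basis categories.

    category_counts: {category: {phrase: frequency}}
    initial_phrases: phrases that already appear in the initially retrieved
    documents; these are removed because they are not new to the problem literature.
    """
    n_categories = defaultdict(int)
    total_freq = defaultdict(int)
    for counts in category_counts.values():
        for phrase, freq in counts.items():
            if phrase in initial_phrases or freq <= 0:
                continue
            n_categories[phrase] += 1
            total_freq[phrase] += freq
    rows = [(p, n_categories[p], total_freq[p]) for p in n_categories]
    # Assumed ordering: more categories first, then higher summed frequency.
    return sorted(rows, key=lambda row: (-row[1], -row[2], row[0]))

# Toy basis categories with hypothetical counts.
category_counts = {
    "blood viscosity": {"fish oil": 4, "aspirin": 2, "nifedipine": 3},
    "platelet aggregation": {"fish oil": 5, "aspirin": 6},
    "vasoconstriction": {"fish oil": 1, "nifedipine": 7, "magnesium": 2},
}
ranking = rank_candidates(category_counts, initial_phrases={"nifedipine"})
print(ranking)
```

A phrase that recurs across many basis categories but is absent from the initially retrieved documents rises to the top, which is the kind of candidate the ranking is meant to surface.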
Specification