Information mining using domain specific conceptual structures
First Claim
1. A method executed by a computer processor and stored on a computer readable medium for use with a first set of documents related to a first topic of interest and a second set of documents related to a second topic of interest, the method comprising the steps of:
automatically generating a first taxonomy through a feature space derived from the first set of documents, wherein the feature space includes at least one of unstructured data, structured data, and annotations derived from text of the first set of documents, and the first taxonomy provides a first partition of the set of documents according to the taxonomy;
using domain-specific knowledge to re-partition the first set of documents to provide a second partition of the first set of documents;
creating a refined taxonomy for the first set of documents according to the second partition so that the refined taxonomy incorporates the domain specific knowledge;
using the refined taxonomy to categorize the first set of documents into a first set of categories;
creating a second set of categories of the first set of documents, wherein the second set of categories are independent of the second partition based on at least one of unstructured data, structured data, and annotations derived from text in the first set of documents;
constructing a contingency table having the first set of categories along a first axis and the second set of categories along a second axis, wherein the contingency table includes cells having respective actual values and for which respective expected values are computed, and the contingency table includes a cell having trending information;
displaying the first set of categories along a first axis and the second set of categories along a second axis on a display device;
comparing the expected value against the actual value of a cell to identify a category of interest;
computing a degree of significance for the actual value of the cell;
identifying a relationship between at least two different categories using the contingency table;
using the contingency table and trending information to identify a recent category with respect to some pre-determined date;
using an element of domain knowledge to re-categorize the first set of documents;
categorizing the second set of documents according to the first set of categories of the first set of documents, further including categorizing the second set of documents according to a criterion chosen from the group consisting of:
text within the second set of documents, structure within the second set of documents, and annotations derived from text within the second set of documents;
examining the first set of categories to identify correlations between categories; and
examining a category of the first set of categories to identify a document of interest, the document of interest being a representative document within the category.
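The contingency-table steps in the claim above (actual values, expected values, and a degree of significance per cell) can be illustrated with a minimal sketch. The claim does not state a formula, so the following assumes the common independence model, where the expected count of a cell is its row total times its column total divided by the grand total, and uses the Pearson residual as one possible significance score. The function names, the threshold, and the sample counts are hypothetical.

```python
from math import sqrt


def expected_counts(table):
    """Expected cell counts under independence: row_total * col_total / grand_total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    return [[r * c / total for c in col_totals] for r in row_totals]


def pearson_residuals(table):
    """(actual - expected) / sqrt(expected) per cell; an assumed significance score."""
    expected = expected_counts(table)
    return [
        [(a - e) / sqrt(e) if e > 0 else 0.0 for a, e in zip(actual_row, expected_row)]
        for actual_row, expected_row in zip(table, expected)
    ]


def cells_of_interest(table, threshold=2.0):
    """Cells whose actual count exceeds the expected count by more than `threshold` residual units."""
    residuals = pearson_residuals(table)
    return [
        (i, j)
        for i, row in enumerate(residuals)
        for j, value in enumerate(row)
        if value > threshold
    ]


# Hypothetical counts: rows are categories from the first set,
# columns are categories from the second set.
counts = [
    [30, 5],
    [10, 55],
]
print(expected_counts(counts))
print(cells_of_interest(counts))
```

A cell flagged here pairs one category from each axis, which is one way to read "identifying a relationship between at least two different categories"; other significance measures (for example a chi-squared test) would fit the claim language equally well.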
Abstract
A method and analytics tools for information mining incorporating domain specific knowledge and conceptual structures are disclosed, the method including: providing a first set of documents related to a first topic of interest; using a first taxonomy to categorize the first set of documents into a set of categories; providing a second set of documents related to a second topic of interest; categorizing the second set of documents according to the set of categories of the first set of documents; using an element of domain knowledge to re-categorize the first set of documents; and examining a category to identify a document of interest.
23 Citations
14 Claims
-
2. A method executed by a computer processor for use with a first set of documents related to a first topic of interest and a second set of documents related to a second topic of interest, comprising:
automatically generating a first taxonomy through a feature space derived from the first set of documents, wherein the feature space includes at least one of unstructured data, structured data, and annotations derived from text of the first set of documents, and the first taxonomy provides a first partition of the first set of documents according to the first taxonomy;
using domain-specific knowledge to re-partition the first set of documents to provide a second partition of the first set of documents;
using a first taxonomy to categorize the first set of documents into a first set of categories;
creating a second set of categories of the first set of documents, wherein the second set of categories are independent of the second partition based on at least one of unstructured data, structured data, and annotations derived from text in the first set of documents;
constructing a contingency table having the first set of categories along a first axis and the second set of categories along a second axis, wherein the contingency table includes cells having respective actual values and for which respective expected values are computed, and includes a cell having trending information;
displaying the first set of categories along a first axis and the second set of categories along a second axis on a display device;
comparing the expected value of a cell against the actual value of a cell to identify a category of interest;
computing a degree of significance for the actual value of the cell;
identifying a relationship between at least two different categories using the contingency table;
using the contingency table and trending information to identify a recent category with respect to some pre-determined date;
comparing the category of interest with the first taxonomy over time;
categorizing the second set of documents according to the set of categories of the first set of documents;
examining a category to identify a document of interest, the document of interest being identified as a document within a pre-specified distance of the centroid of a feature space derived from the first set of documents;
creating a second taxonomy different from and independent of the first taxonomy; and
combining the first taxonomy with the second taxonomy by merging classes in the first taxonomy with classes in the second taxonomy.
Dependent claims: 3, 4, 5
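The "document of interest" step in claim 2 selects documents that lie within a pre-specified distance of a centroid. A minimal sketch of that idea follows, assuming each document is already represented as a numeric feature vector and that distance is Euclidean; the claim fixes neither choice. The function names, document identifiers, and vectors are hypothetical.

```python
from math import dist


def centroid(vectors):
    """Component-wise mean of equal-length feature vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


def representative_documents(doc_vectors, max_distance):
    """Document ids within `max_distance` of the centroid, nearest first."""
    center = centroid(list(doc_vectors.values()))
    scored = [(dist(vec, center), doc_id) for doc_id, vec in doc_vectors.items()]
    return [doc_id for d, doc_id in sorted(scored) if d <= max_distance]


# Hypothetical two-dimensional feature vectors for four documents.
docs = {
    "a": [1.0, 0.0],
    "b": [0.0, 1.0],
    "c": [0.5, 0.5],
    "d": [4.5, 4.5],
}
print(centroid(list(docs.values())))
print(representative_documents(docs, max_distance=1.5))
```

In practice the vectors would come from the same feature space used to generate the taxonomy, and the centroid could be computed per category rather than over the whole set.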
-
-
6. A method executed by a computer processor for use with a set of documents related to a first topic of interest, comprising:
automatically generating a first taxonomy through a feature space derived from the set of documents, wherein the feature space includes at least one of unstructured data, structured data, and annotations derived from text of the set of documents, and the first taxonomy provides a first partition of the set of documents according to the first taxonomy;
using domain-specific knowledge to re-partition the set of documents to provide a second partition of the set of documents;
using a first taxonomy to categorize the set of documents into a first set of categories;
creating a second set of categories of the first set of documents, wherein the second set of categories are independent of the second partition based on at least one of unstructured data, structured data, and annotations derived from text in the first set of documents;
constructing a contingency table having the first set of categories along a first axis and the second set of categories along a second axis, each category of the second set of categories being associated with a plurality of categories of the first set of categories, wherein the contingency table includes cells having respective actual values and for which respective expected values are computed, and includes a cell having trending information;
displaying the first set of categories along a first axis and the second set of categories along a second axis on a display device;
comparing the expected value of a cell against the actual value of a cell to identify a category of interest;
computing a degree of significance for the actual value of the cell;
using the contingency table and trending information to identify a recent category with respect to some pre-determined date;
comparing each category in the first set of categories with each category in the second set of categories;
identifying a relationship between at least two different categories using the contingency table;
creating a second taxonomy different from and independent of the first taxonomy; and
combining the first taxonomy with the second taxonomy by merging classes in the first taxonomy with classes in the second taxonomy.
-
-
7. A method executed by a computer processor, comprising:
extracting a set of documents related to a specified topic from a data warehouse;
automatically generating a first taxonomy through a feature space derived from the set of documents, wherein the feature space includes at least one of unstructured data, structured data, and annotations derived from text of the set of documents, and the first taxonomy provides a first partition of the set of documents according to the first taxonomy;
using domain-specific knowledge to re-partition the set of documents to provide a second partition of the set of documents;
using a first taxonomy to categorize the set of documents into a first set of categories;
creating a second set of categories of the set of documents, wherein the second set of categories are independent of the second partition based on at least one of unstructured data, structured data, and annotations derived from text in the set of documents;
constructing a contingency table having the first set of categories along a first axis and the second set of categories along a second axis, wherein the contingency table includes cells having respective actual values and for which respective expected values are computed, and includes a cell having trending information;
displaying the first set of categories along a first axis and the second set of categories along a second axis on a display device;
comparing the expected value of a cell against the actual value of a cell to identify a category of interest;
computing a degree of significance for the actual value of the cell;
identifying a relationship between at least two different categories using the contingency table;
using the contingency table and trending information to identify a recent category with respect to some pre-determined date;
creating a second taxonomy different from and independent of the first taxonomy for the set of documents according to the second partition so that the different second taxonomy incorporates the domain-specific knowledge;
comparing each of a plurality of categories in the first partition of the set of documents with each of a plurality of categories in the second partition of the set of documents; and
combining the first taxonomy with the second taxonomy by merging classes in the first taxonomy with classes in the second taxonomy.
Dependent claims: 8, 9, 10, 11, 12
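The claims refer to trending information used "to identify a recent category with respect to some pre-determined date" without defining how recency is measured. One simple reading, sketched below as an assumption only, is to compute for each category the fraction of its documents dated on or after the cutoff and to flag categories where that fraction reaches a chosen level. The function names, the `min_fraction` parameter, and the sample data are hypothetical.

```python
from datetime import date


def recency_by_category(docs, cutoff):
    """Map each category to the fraction of its documents dated on or after `cutoff`.

    `docs` is a list of (category, document_date) pairs.
    """
    totals, recent = {}, {}
    for category, doc_date in docs:
        totals[category] = totals.get(category, 0) + 1
        if doc_date >= cutoff:
            recent[category] = recent.get(category, 0) + 1
    return {c: recent.get(c, 0) / totals[c] for c in totals}


def recent_categories(docs, cutoff, min_fraction=0.5):
    """Categories whose share of post-cutoff documents is at least `min_fraction`."""
    fractions = recency_by_category(docs, cutoff)
    return sorted(c for c, f in fractions.items() if f >= min_fraction)


# Hypothetical categorized documents with dates.
sample = [
    ("battery", date(2005, 1, 10)),
    ("battery", date(2006, 3, 1)),
    ("battery", date(2006, 7, 15)),
    ("display", date(2003, 2, 2)),
    ("display", date(2004, 5, 2)),
    ("display", date(2006, 2, 2)),
]
print(recent_categories(sample, cutoff=date(2006, 1, 1)))
```

The per-category fraction could be stored in a dedicated column of the contingency table, which is one way a table might "include a cell having trending information".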
-
-
13. A computer program product for use with a first set of documents related to a first topic of interest and a second set of documents related to a second topic of interest, the computer program product comprising a non-transitory computer-readable storage medium including a computer readable program, wherein the computer readable program when executed on a computer causes the computer to:
automatically generate a first taxonomy through a feature space derived from the first set of documents, wherein the feature space includes at least one of unstructured data, structured data, and annotations derived from text of the first set of documents, and the first taxonomy provides a first partition of the first set of documents according to the first taxonomy;
use domain-specific knowledge to re-partition the first set of documents to provide a second partition of the first set of documents;
use a first taxonomy to categorize the first set of documents into a first set of categories;
create a second set of categories of the first set of documents, wherein the second set of categories are independent of the second partition based on at least one of unstructured data, structured data, and annotations derived from text in the first set of documents;
construct a contingency table having the first set of categories along a first axis and the second set of categories along a second axis, wherein the contingency table includes cells having respective actual values and for which respective expected values are computed, and includes a cell having trending information;
display the first set of categories along a first axis and the second set of categories along a second axis on a display device;
compare the expected value of a cell against the actual value of a cell to identify a category of interest;
compute a degree of significance for the actual value of the cell;
identify a relationship between at least two different categories using the contingency table;
use the contingency table and trending information to identify a recent category with respect to some pre-determined date;
categorize the second set of documents according to the set of categories of the first set of documents;
compare each of the set of categories in the first set of documents with each of the set of categories in the second set of documents;
examine a category to identify a document of interest, wherein the document of interest typifies the category by being within a pre-specified distance of a centroid of a mathematical definition of the category;
create a second taxonomy different from and independent of the first taxonomy; and
combine the first taxonomy with the second taxonomy by merging classes in the first taxonomy with classes in the second taxonomy.
-
-
14. A computer program product stored on a non-transitory computer readable storage medium, the computer program product including a computer readable program, wherein the computer readable program when executed on a computer causes the computer to:
extract a set of documents related to a specified topic from a data warehouse;
automatically generate a first taxonomy through a feature space derived from the set of documents, wherein the feature space includes at least one of unstructured data, structured data, and annotations derived from text of the set of documents, and the first taxonomy provides a first partition of the set of documents according to the first taxonomy;
use domain-specific knowledge to re-partition the set of documents to provide a second partition of the set of documents;
use a first taxonomy to categorize the set of documents into a first set of categories;
create a second set of categories of the set of documents, wherein the second set of categories are independent of the second partition based on at least one of unstructured data, structured data, and annotations derived from text in the set of documents;
construct a contingency table having the first set of categories along a first axis and the second set of categories along a second axis, wherein the contingency table includes cells having respective actual values and for which respective expected values are computed, and a cell having trending information;
display the first set of categories along a first axis and the second set of categories along a second axis on a display device;
compare the expected value of a cell against the actual value of a cell to identify a category of interest;
compute a degree of significance for the actual value of the cell;
identify a relationship between at least two different categories using the contingency table;
use the contingency table and trending information to identify a recent category with respect to some pre-determined date;
compare the specified topic with the first taxonomy over time;
create a second taxonomy different from and independent of the first taxonomy for the same set of documents according to the second partition so that the different second taxonomy incorporates the domain-specific knowledge; and
combine the first taxonomy with the second taxonomy by merging classes in the first taxonomy with classes in the second taxonomy.
-
Specification