Information mining using domain specific conceptual structures

US 8,805,843 B2
Filed: 06/03/2008
Issued: 08/12/2014
Est. Priority Date: 02/13/2007
Status: Active Grant

Abstract

A method and analytics tools for information mining incorporating domain specific knowledge and conceptual structures are disclosed, the method including: providing a first set of documents related to a first topic of interest; using a first taxonomy to categorize the first set of documents into a set of categories; providing a second set of documents related to a second topic of interest; categorizing the second set of documents according to the set of categories of the first set of documents; using an element of domain knowledge to re-categorize the first set of documents; and examining a category to identify a document of interest.

14 Claims

1. A method executed by a computer processor and stored on a computer readable medium for use with a first set of documents related to a first topic of interest and a second set of documents related to a second topic of interest, the method comprising the steps of:
- automatically generating a first taxonomy through a feature space derived from the first set of documents, wherein the feature space includes at least one of unstructured data, structured data, and annotations derived from text of the first set of documents, and the first taxonomy provides a first partition of the set of documents according to the taxonomy;
  
  using domain-specific knowledge to re-partition the first set of documents to provide a second partition of the first set of documents;
  
  creating a refined taxonomy for the first set of documents according to the second partition so that the refined taxonomy incorporates the domain specific knowledge;
  
  using the refined taxonomy to categorize the first set of documents into a first set of categories;
  
  creating a second set of categories of the first set of documents, wherein the second set of categories are independent of the second partition based on at least one of unstructured data, structured data, and annotations derived from text in the first set of documents;
  
  constructing a contingency table having the first set of categories along a first axis and the second set of categories along a second axis, wherein the contingency table includes cells having respective actual values and for which respective expected values are computed, and the contingency table includes a cell having trending information;
  
  displaying the first set of categories along a first axis and the second set of categories along a second axis on a display device;
  
  comparing the expected value against the actual value of a cell to identify a category of interest;
  
  computing a degree of significance for the actual value of the cell;
  
  identifying a relationship between at least two different categories using the contingency table;
  
  using the contingency table and trending information to identify a recent category with respect to some pre-determined date;
  
  using an element of domain knowledge to re-categorize the first set of documents;
  
  categorizing the second set of documents according to the first set of categories of the first set of documents, further including categorizing the second set of documents according to a criterion chosen from the group consisting of:
  
  text within the second set of documents, structure within the second set of documents, and annotations derived from text within the second set of documents;
  
  examining the first set of categories to identify correlations between categories; and
  
  examining a category of the first set of categories to identify a document of interest, the document of interest being a representative document within the category.
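
The contingency-table steps of claim 1 can be illustrated with a minimal Python sketch. It assumes the expected value of a cell is the count implied by independence of the two categorizations (row total times column total, divided by the grand total) and that the degree of significance is a one-degree-of-freedom chi-squared p-value for the cell against the rest of the table. The claim does not prescribe either formula, and the function names and data layout are hypothetical.

```python
import math
from collections import Counter

def contingency_table(first_cats, second_cats):
    """Count documents for each (first category, second category) pair.

    first_cats and second_cats map a document id to its category label.
    """
    return Counter((first_cats[d], second_cats[d]) for d in first_cats)

def cell_statistics(table, row, col):
    """Return (actual, expected, p_value) for one cell.

    Assumption: expected counts follow from independence of the two axes,
    and significance is a chi-squared test (1 degree of freedom) on the
    2x2 table formed by the cell versus everything else.
    """
    total = sum(table.values())
    row_total = sum(n for (r, _), n in table.items() if r == row)
    col_total = sum(n for (_, c), n in table.items() if c == col)
    actual = table.get((row, col), 0)
    expected = row_total * col_total / total

    # 2x2 collapse: [in row, not in row] x [in col, not in col]
    observed = [
        actual, row_total - actual,
        col_total - actual, total - row_total - col_total + actual,
    ]
    margins = [
        (row_total, col_total), (row_total, total - col_total),
        (total - row_total, col_total), (total - row_total, total - col_total),
    ]
    chi2 = 0.0
    for obs, (r, c) in zip(observed, margins):
        exp = r * c / total
        if exp > 0:
            chi2 += (obs - exp) ** 2 / exp
    # Survival function of the chi-squared distribution with 1 d.o.f.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return actual, expected, p_value
```

A cell whose actual value is far above its expected value, with a small p-value, would flag the corresponding pair of categories as a candidate category of interest under these assumptions.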

2. A method executed by a computer processor for use with a first set of documents related to a first topic of interest and a second set of documents related to a second topic of interest, comprising:
- automatically generating a first taxonomy through a feature space derived from the first set of documents, wherein the feature space includes at least one of unstructured data, structured data, and annotations derived from text of the first set of documents, and the first taxonomy provides a first partition of the first set of documents according to the first taxonomy;
  
  using domain-specific knowledge to re-partition the first set of documents to provide a second partition of the first set of documents;
  
  using a first taxonomy to categorize the first set of documents into a first set of categories;
  
  creating a second set of categories of the first set of documents, wherein the second set of categories are independent of the second partition based on at least one of unstructured data, structured data, and annotations derived from text in the first set of documents;
  
  constructing a contingency table having the first set of categories along a first axis and the second set of categories along a second axis, wherein the contingency table includes cells having respective actual values and for which respective expected values are computed, and includes a cell having trending information;
  
  displaying the first set of categories along a first axis and the second set of categories along a second axis on a display device;
  
  comparing the expected value of a cell against the actual value of a cell to identify a category of interest;
  
  computing a degree of significance for the actual value of the cell;
  
  identifying a relationship between at least two different categories using the contingency table;
  
  using the contingency table and trending information to identify a recent category with respect to some pre-determined date;
  
  comparing the category of interest with the first taxonomy over time;
  
  categorizing the second set of documents according to the set of categories of the first set of documents;
  
  examining a category to identify a document of interest, the document of interest being identified as a document within a pre-specified distance of the centroid of a feature space derived from the first set of documents;
  
  creating a second taxonomy different from and independent of the first taxonomy; and
  
  combining the first taxonomy with the second taxonomy by merging classes in the first taxonomy with classes in the second taxonomy.

3. The method of claim 2, including using an element of domain knowledge to re-categorize the first set of documents.

4. The method of claim 2, wherein categorizing the second set of documents includes categorizing the second set of documents according to a criterion chosen from a group consisting of:
- text within the second set of documents, structure within the second set of documents, and annotations derived from text within the second set of documents.

5. The method of claim 2, including examining a set of mutually different categories to identify correlations between categories.
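
Claim 2 identifies a document of interest as one lying within a pre-specified distance of a centroid. The sketch below shows one way that test could look, assuming a bag-of-words feature space and Euclidean distance; both choices, along with every function name, are illustrative assumptions rather than anything the claim specifies.

```python
import math
from collections import Counter

def term_vector(text):
    """Bag-of-words term counts; a stand-in for the claimed feature space."""
    return Counter(text.lower().split())

def centroid(vectors):
    """Mean term frequency across the given documents."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return {term: count / len(vectors) for term, count in total.items()}

def distance(vec, center):
    """Euclidean distance over the union of terms (assumed metric)."""
    terms = set(vec) | set(center)
    return math.sqrt(sum((vec.get(t, 0) - center.get(t, 0)) ** 2 for t in terms))

def documents_of_interest(docs, max_distance):
    """Return ids of documents within max_distance of the centroid, nearest first.

    docs maps a document id to its text.
    """
    vectors = {doc_id: term_vector(text) for doc_id, text in docs.items()}
    center = centroid(list(vectors.values()))
    scored = sorted((distance(v, center), doc_id) for doc_id, v in vectors.items())
    return [doc_id for d, doc_id in scored if d <= max_distance]
```

Documents returned first sit closest to the centroid and so are the most typical of the set under the assumed metric.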

6. A method executed by a computer processor for use with a set of documents related to a first topic of interest, comprising:
- automatically generating a first taxonomy through a feature space derived from the set of documents, wherein the feature space includes at least one of unstructured data, structured data, and annotations derived from text of the set of documents, and the first taxonomy provides a first partition of the set of documents according to the first taxonomy;
  
  using domain-specific knowledge to re-partition the set of documents to provide a second partition of the set of documents;
  
  using a first taxonomy to categorize the set of documents into a first set of categories;
  
  creating a second set of categories of the first set of documents, wherein the second set of categories are independent of the second partition based on at least one of unstructured data, structured data, and annotations derived from text in the first set of documents;
  
  constructing a contingency table having the first set of categories along a first axis and the second set of categories along a second axis, each category of the second set of categories being associated with a plurality of categories of the first set of categories, wherein the contingency table includes cells having respective actual values and for which respective expected values are computed, and includes a cell having trending information;
  
  displaying the first set of categories along a first axis and the second set of categories along a second axis on a display device;
  
  comparing the expected value of a cell against the actual value of a cell to identify a category of interest;
  
  computing a degree of significance for the actual value of the cell;
  
  using the contingency table and trending information to identify a recent category with respect to some pre-determined date;
  
  comparing each category in the first set of categories with each category in the second set of categories;
  
  identifying a relationship between at least two different categories using the contingency table;
  
  creating a second taxonomy different from and independent of the first taxonomy; and
  
  combining the first taxonomy with the second taxonomy by merging classes in the first taxonomy with classes in the second taxonomy.
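
The final step of claim 6, combining two taxonomies by merging their classes, is stated without a merge rule. As a rough illustration only, the sketch below represents each taxonomy as a mapping from class name to a set of document ids and merges a class from the first taxonomy with its best-overlapping class from the second when their Jaccard overlap reaches a threshold. The overlap measure, the threshold, and all names are assumptions.

```python
def jaccard(a, b):
    """Overlap of two document-id sets, from 0.0 (disjoint) to 1.0 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def merge_taxonomies(first, second, threshold=0.5):
    """Combine two taxonomies (class name -> set of document ids).

    Assumptions: class names are distinct across the two taxonomies; a class
    in `first` is merged with its best-overlapping unused class in `second`
    when their Jaccard overlap reaches `threshold`; every other class is
    carried over unchanged.
    """
    combined = {}
    used = set()
    for name1, docs1 in first.items():
        best = max(
            (n for n in second if n not in used),
            key=lambda n: jaccard(docs1, second[n]),
            default=None,
        )
        if best is not None and jaccard(docs1, second[best]) >= threshold:
            combined[f"{name1}+{best}"] = docs1 | second[best]
            used.add(best)
        else:
            combined[name1] = set(docs1)
    # Carry over classes of the second taxonomy that were never merged.
    for name2, docs2 in second.items():
        if name2 not in used:
            combined[name2] = set(docs2)
    return combined
```

Note that the combined result of this sketch need not be a strict partition: a document can appear in both a merged class and a carried-over class.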

7. A method executed by a computer processor, comprising:
- extracting a set of documents related to a specified topic from a data warehouse;
  
  automatically generating a first taxonomy through a feature space derived from the set of documents, wherein the feature space includes at least one of unstructured data, structured data, and annotations derived from text of the set of documents, and the first taxonomy provides a first partition of the set of documents according to the first taxonomy;
  
  using domain-specific knowledge to re-partition the set of documents to provide a second partition of the set of documents;
  
  using a first taxonomy to categorize the set of documents into a first set of categories;
  
  creating a second set of categories of the set of documents, wherein the second set of categories are independent of the second partition based on at least one of unstructured data, structured data, and annotations derived from text in the set of documents;
  
  constructing a contingency table having the first set of categories along a first axis and the second set of categories along a second axis, wherein the contingency table includes cells having respective actual values and for which respective expected values are computed, and includes a cell having trending information;
  
  displaying the first set of categories along a first axis and the second set of categories along a second axis on a display device;
  
  comparing the expected value of a cell against the actual value of a cell to identify a category of interest;
  
  computing a degree of significance for the actual value of the cell;
  
  identifying a relationship between at least two different categories using the contingency table;
  
  using the contingency table and trending information to identify a recent category with respect to some pre-determined date;
  
  creating a second taxonomy different from and independent of the first taxonomy for the set of documents according to the second partition so that the different second taxonomy incorporates the domain-specific knowledge;
  
  comparing each of a plurality of categories in the first partition of the set of documents with each of a plurality of categories in the second partition of the set of documents; and
  
  combining the first taxonomy with the second taxonomy by merging classes in the first taxonomy with classes in the second taxonomy.

8. The method of claim 7, including:
- classifying the set of documents into classification classes independent of the second partition; and
  
  generating a contingency table for comparing the categories of the different second taxonomy with the classification classes of the set of documents.

9. The method of claim 7, including:
- classifying the set of documents according to at least one of structured fields, annotations, and a second taxonomy for the set of documents; and
  
  generating a contingency table for comparing the categories of the different second taxonomy with classification classes of the set of documents.

10. The method of claim 7, including:
- classifying the set of documents into classes independent of the second partition;
  
  generating a contingency table for comparing the categories of the different second taxonomy with the classes of the set of documents; and
  
  identifying a set of most closely-related categories using the contingency table.

11. The method of claim 7, including:
- classifying the set of documents into classes independent of the second partition;
  
  generating a contingency table for comparing the categories of the different second taxonomy with the classes of the set of documents; and
  
  identifying a set of mutually different recent categories using the contingency table.

12. The method of claim 7, including:
- classifying the set of documents into a first set of classes independent of the second partition;
  
  generating a first contingency table for comparing the mutually different categories of the different second taxonomy with the first classes of the set of documents;
  
  classifying the set of documents into a second set of classes independent of the second partition; and
  
  generating a second contingency table for comparing the categories of the different second taxonomy with the second classes of the set of documents.
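
Claims 7 and 11 refer to trending information and to identifying recent categories relative to a pre-determined date, without saying how recency is scored. One possible reading, sketched below, treats a category as recent when the share of its documents dated on or after the cutoff is well above the share for the whole collection. The ratio test, the `min_ratio` default, and the function name are all hypothetical.

```python
from datetime import date

def recent_categories(doc_dates, doc_categories, cutoff, min_ratio=1.5):
    """Return categories whose share of post-cutoff documents is unusually high.

    doc_dates: document id -> datetime.date
    doc_categories: document id -> category label
    Assumption: a category is "recent" when the fraction of its documents
    dated on or after `cutoff` is at least `min_ratio` times the fraction
    for the whole collection.
    """
    total = len(doc_dates)
    overall_recent = sum(1 for d in doc_dates.values() if d >= cutoff)
    if total == 0 or overall_recent == 0:
        return []
    baseline = overall_recent / total

    sizes, recents = {}, {}
    for doc_id, cat in doc_categories.items():
        sizes[cat] = sizes.get(cat, 0) + 1
        if doc_dates[doc_id] >= cutoff:
            recents[cat] = recents.get(cat, 0) + 1

    result = []
    for cat, size in sizes.items():
        ratio = (recents.get(cat, 0) / size) / baseline
        if ratio >= min_ratio:
            result.append((cat, ratio))
    # Most strongly trending categories first
    return [cat for cat, _ in sorted(result, key=lambda x: -x[1])]
```

In a contingency-table setting, the same ratio could be stored per cell as its trending information, with the baseline playing the role of the expected value.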

13. A computer program product for use with a first set of documents related to a first topic of interest and a second set of documents related to a second topic of interest, the computer program product comprising a non-transitory computer-readable storage medium including a computer readable program, wherein the computer readable program when executed on a computer causes the computer to:
- automatically generate a first taxonomy through a feature space derived from the first set of documents, wherein the feature space includes at least one of unstructured data, structured data, and annotations derived from text of the first set of documents, and the first taxonomy provides a first partition of the first set of documents according to the first taxonomy;
  
  use domain-specific knowledge to re-partition the first set of documents to provide a second partition of the first set of documents;
  
  use a first taxonomy to categorize the first set of documents into a first set of categories;
  
  create a second set of categories of the first set of documents, wherein the second set of categories are independent of the second partition based on at least one of unstructured data, structured data, and annotations derived from text in the first set of documents;
  
  construct a contingency table having the first set of categories along a first axis and the second set of categories along a second axis, wherein the contingency table includes cells having respective actual values and for which respective expected values are computed, and includes a cell having trending information;
  
  display the first set of categories along a first axis and the second set of categories along a second axis on a display device;
  
  compare the expected value of a cell against the actual value of a cell to identify a category of interest;
  
  compute a degree of significance for the actual value of the cell;
  
  identify a relationship between at least two different categories using the contingency table;
  
  use the contingency table and trending information to identify a recent category with respect to some pre-determined date;
  
  categorize the second set of documents according to the set of categories of the first set of documents;
  
  compare each of the set of categories in the first set of documents with each of the set of categories in the second set of documents;
  
  examine a category to identify a document of interest, wherein the document of interest typifies the category by being within a pre-specified distance of a centroid of a mathematical definition of the category;
  
  create a second taxonomy different from and independent of the first taxonomy; and
  
  combine the first taxonomy with the second taxonomy by merging classes in the first taxonomy with classes in the second taxonomy.

14. A computer program product stored on a non-transitory computer readable storage medium, the computer program product including a computer readable program, wherein the computer readable program when executed on a computer causes the computer to:
- extract a set of documents related to a specified topic from a data warehouse;
  
  automatically generate a first taxonomy through a feature space derived from the set of documents, wherein the feature space includes at least one of unstructured data, structured data, and annotations derived from text of the set of documents, and the first taxonomy provides a first partition of the set of documents according to the first taxonomy;
  
  use domain-specific knowledge to re-partition the set of documents to provide a second partition of the set of documents;
  
  use a first taxonomy to categorize the set of documents into a first set of categories;
  
  create a second set of categories of the set of documents, wherein the second set of categories are independent of the second partition based on at least one of unstructured data, structured data, and annotations derived from text in the set of documents;
  
  construct a contingency table having the first set of categories along a first axis and the second set of categories along a second axis, wherein the contingency table includes cells having respective actual values and for which respective expected values are computed, and a cell having trending information;
  
  display the first set of categories along a first axis and the second set of categories along a second axis on a display device;
  
  compare the expected value of a cell against the actual value of a cell to identify a category of interest;
  
  compute a degree of significance for the actual value of the cell;
  
  identify a relationship between at least two different categories using the contingency table;
  
  use the contingency table and trending information to identify a recent category with respect to some pre-determined date;
  
  compare the specified topic with the first taxonomy over time;
  
  create a second taxonomy different from and independent of the first taxonomy for the same set of documents according to the second partition so that the different second taxonomy incorporates the domain-specific knowledge; and
  
  combine the first taxonomy with the second taxonomy by merging classes in the first taxonomy with classes in the second taxonomy.
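
Each independent claim begins by automatically generating a first taxonomy through a feature space, which yields a partition of the documents. The claims do not name an algorithm. The sketch below assumes a k-means-style clustering over bag-of-words vectors with cosine similarity, seeded deterministically from the first `k` documents; these choices and all function names are illustrative assumptions.

```python
import math
from collections import Counter

def vectorize(text):
    """Bag-of-words term counts used as the (assumed) feature space."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def mean(vectors):
    """Component-wise mean of sparse term vectors."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return {t: c / len(vectors) for t, c in total.items()}

def generate_taxonomy(docs, k, iterations=10):
    """Partition docs (id -> text) into k classes; returns id -> class index."""
    ids = list(docs)
    vectors = {i: vectorize(docs[i]) for i in ids}
    centers = [dict(vectors[i]) for i in ids[:k]]  # deterministic seeds
    assignment = {}
    for _ in range(iterations):
        # Assign each document to its most similar center.
        new_assignment = {
            i: max(range(k), key=lambda c: cosine(vectors[i], centers[c]))
            for i in ids
        }
        if new_assignment == assignment:
            break  # converged
        assignment = new_assignment
        # Recompute each center from its current members.
        for c in range(k):
            members = [vectors[i] for i in ids if assignment[i] == c]
            if members:
                centers[c] = mean(members)
    return assignment
```

The returned mapping is a partition in the sense used by the claims: every document receives exactly one class. Re-partitioning with domain-specific knowledge would then amount to editing this mapping, for example by moving documents between classes or renaming and merging classes.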

Current Assignee: International Business Machines Corporation
Original Assignee: International Business Machines Corporation
Inventors: Chen, Ying; Kreulen, Jeffrey Thomas; Rhodes, James J.; Spangler, William Scott
Primary Examiner(s): Ahn, Sangwoo
Application Number: US12/132,515
Publication Number: US 20080243889A1
Time in Patent Office: 2,261 Days
Field of Search: None
US Class Current: 707/738
CPC Class Codes:
G06F 16/355   Class or cluster creation o...
G06F 16/358   Browsing; Visualisation the...
G06N 5/022   Knowledge engineering; Know...
