Methodologies and analytics tools for locating experts with specific sets of expertise

US 8,577,834 B2
Filed: 06/05/2008
Issued: 11/05/2013
Est. Priority Date: 02/13/2007
Status: Active Grant

Abstract

A method and analytics tools for locating experts with specific sets of expertise are disclosed, the method including providing a collection of documents P₀; generating categories representing fields of expertise derived from the collection of documents P₀; refining the taxonomy of the categories by applying user domain knowledge; extracting structured fields from the collection of documents P₀; constructing a contingency table having a first axis defined by the extracted structured fields and a second axis defined by the categories; and using the contingency table to identify a set of experts having a related expertise. The method may also include a network graph analysis that aids visualization of the relationship between people and expertise.
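The contingency table described in the abstract can be illustrated with a minimal sketch. It assumes each document has already been reduced to a list of author names (the structured field) and a single category label; the function names `build_contingency_table` and `experts_for` and the `min_docs` cutoff are hypothetical and do not come from the patent.

```python
from collections import Counter

def build_contingency_table(doc_records):
    """Count documents per (author, category) cell.

    doc_records: iterable of (authors, category) pairs, one per document.
    Returns a nested dict: table[author][category] -> document count.
    """
    counts = Counter()
    for authors, category in doc_records:
        for author in authors:
            counts[(author, category)] += 1
    author_names = sorted({a for a, _ in counts})
    categories = sorted({c for _, c in counts})
    # Counter returns 0 for cells that never occurred.
    return {a: {c: counts[(a, c)] for c in categories} for a in author_names}

def experts_for(table, category, min_docs=2):
    """Rank authors by document count in one category (hypothetical cutoff)."""
    hits = [(a, row[category]) for a, row in table.items() if row[category] >= min_docs]
    hits.sort(key=lambda pair: (-pair[1], pair[0]))
    return [a for a, _ in hits]

records = [
    (["Ada", "Bo"], "clustering"),
    (["Ada"], "clustering"),
    (["Bo"], "graphs"),
    (["Cy"], "clustering"),
]
table = build_contingency_table(records)
print(experts_for(table, "clustering"))
```

Here the first axis is the author name field and the second axis is the category, matching the axes named in the abstract; raw counts stand in for whatever weighting a real system might apply.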

16 Claims

1. A method for use with a collection of documents, the method comprising:
- generating categories representing fields of expertise derived from the collection of documents, including generating a taxonomy from the collection of documents, wherein said generating of categories includes generation of a feature space from text of the collection of documents and comparison of the collection of documents using a measure of distance in the feature space;
  
  converting each of the documents in the collection of documents into a numeric vector corresponding to word, feature, and information content of each of the collection of documents;
  
  extracting structured fields from the collection of documents, the extraction being performed by a computer processor;
  
  classifying the collection of documents based on the extracted structured fields;
  
  refining the taxonomy by applying user domain knowledge to create the classifications of documents;
  
  further refining the taxonomy by merging, deleting and adding classes to the taxonomy;
  
  constructing a contingency table having a first axis defined by the extracted structured fields and a second axis defined by the categories; and
  
  using the contingency table to identify a set of experts having a related expertise; and
  
  determining a statistical relationship between the identified set of experts and one of the extracted structured fields in the table.
- Dependent Claims (2, 3, 4)
- - 2. The method of claim 1, including:
    - overlaying document time information with the categories of the fields of expertise; and
      
      comparing the recentness of the expertise of the set of experts.
  - 3. The method of claim 1, wherein:
    - the structured fields include a structured name field including author names for the collection of documents.
  - 4. The method of claim 1, further comprising:
    - using trending information in the contingency table to identify a recent, with respect to a pre-determined time period, expertise.
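Claim 1 recites a feature space built from document text, a numeric vector per document, and a distance measure for comparing documents, without naming specific techniques. The sketch below is one possible reading, assuming raw term counts as the vector entries and cosine distance as the measure; both choices, and all function names, are assumptions for illustration.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def build_feature_space(docs):
    """Map every distinct term in the collection to a vector index."""
    vocab = sorted({tok for doc in docs for tok in tokenize(doc)})
    return {term: i for i, term in enumerate(vocab)}

def vectorize(text, feature_index):
    """Convert one document into a term-count vector over the feature space."""
    vec = [0.0] * len(feature_index)
    for tok, n in Counter(tokenize(text)).items():
        if tok in feature_index:
            vec[feature_index[tok]] = float(n)
    return vec

def cosine_distance(u, v):
    """Return 1 - cosine similarity; 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)

docs = ["graph clustering of text", "text clustering graph", "protein folding"]
index = build_feature_space(docs)
vectors = [vectorize(d, index) for d in docs]
d01 = cosine_distance(vectors[0], vectors[1])
d02 = cosine_distance(vectors[0], vectors[2])
print(d01 < d02)
```

Documents that are close under the chosen distance would fall into the same category when the taxonomy is generated; the clustering step itself is left out of this sketch.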

5. A method for use with a set of seed documents extracted from a data warehouse, the method comprising:
- searching the data warehouse to provide a set of additional documents similar to the set of seed documents, wherein said similarity is determined using a statistical method;
  
  generating an initial taxonomy for a combined document set that includes all documents from both the set of seed documents and the set of additional documents;
  
  generating categories for the initial taxonomy wherein said generating of categories includes generation of a feature space from text of the set of seed documents and comparison of the set of seed documents using a measure of distance in the feature space;
  
  iterating the processes of extracting, searching, and generating by the performance of a computer processor, and using domain knowledge to produce a refined taxonomy from the initial taxonomy;
  
  refining the taxonomy by creating new categories in the taxonomy based on a relationship between the categories in the taxonomy;
  
  classifying the combined document set using structured fields;
  
  using contingency analysis to generate a contingency table that compares categories of the refined taxonomy to the structured fields;
  
  calculating an expected value percentage for each of a plurality of cells in the contingency table as the expected percent of an author's documents out of a total number for one of the categories, multiplied by a percent of documents in the one of the categories out of a total number of documents for all categories;
  
  determining a statistical relationship between the author and one of the categories of the refined taxonomy in the table; and
  
  shading each of the plurality of cells in the contingency table based on the degree to which the expected value percentage exceeds an actual value in each of the plurality of cells in the contingency table.
- Dependent Claims (6, 7, 8, 9)
- - 6. The method of claim 5, further comprising:
    - classifying the combined document set using terms obtained from the structured fields, wherein the structured fields are extracted from the combined document set.
  - 7. The method of claim 5, further comprising:
    - using a name annotator to extract names from documents of the combined document set to form a structured field; and
      
      classifying the combined document set using names obtained from the formed structured field.
  - 8. The method of claim 5, wherein:
    - said generating of the initial taxonomy includes an analysis using words, bags of words, and phrases.
  - 9. The method of claim 5, further comprising:
    - wherein in the classifying of the combined document set using structured fields, at least one structured field includes names of people;
      
      examining the contingency table to find a relationship between the categories and the names of people; and
      
      further refining the refined taxonomy by filtering out noise using user domain knowledge.
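The expected value percentage and cell shading recited in claim 5 can be sketched as follows. The claim wording admits more than one reading; this sketch assumes the expected percentage of a cell is the author's share of all documents multiplied by the category's share of all documents (the usual independence expectation for a contingency table). The shading labels and the `step` threshold are hypothetical.

```python
def cell_percentages(table):
    """Return (actual, expected) percentage tables for table[author][category] counts."""
    grand_total = sum(sum(row.values()) for row in table.values())
    if grand_total == 0:
        raise ValueError("contingency table has no documents")
    category_totals = {}
    for row in table.values():
        for category, n in row.items():
            category_totals[category] = category_totals.get(category, 0) + n
    actual, expected = {}, {}
    for author, row in table.items():
        author_share = sum(row.values()) / grand_total
        actual[author] = {c: n / grand_total for c, n in row.items()}
        expected[author] = {
            c: author_share * (category_totals[c] / grand_total) for c in row
        }
    return actual, expected

def shade(actual_pct, expected_pct, step=0.05):
    """Pick a shade from how far the expected percentage exceeds the actual one."""
    gap = expected_pct - actual_pct
    if gap >= 2 * step:
        return "dark"
    if gap >= step:
        return "light"
    return "none"

table = {
    "Ada": {"clustering": 3, "graphs": 1},
    "Bo": {"clustering": 1, "graphs": 3},
}
actual, expected = cell_percentages(table)
shades = {
    (a, c): shade(actual[a][c], expected[a][c])
    for a in table for c in table[a]
}
print(shades)
```

The shading follows the claim's literal wording, keyed to the amount by which the expected value exceeds the actual value; a tool that highlights over-represented cells instead would simply flip the sign of the gap.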

10. A computer program product stored in a non-transitory computer useable storage medium including a computer readable program, wherein the computer readable program when executed on a computer causes the computer to:
- generate categories representing fields of expertise derived from a collection of documents, including generating a taxonomy from the collection of documents, wherein said generating of categories includes generation of a feature space from text of the collection of documents and comparison of the collection of documents using a measure of distance in the feature space;
  
  convert each of the documents in the collection of documents into a numeric vector corresponding to word, feature, and information content of each of the collection of documents;
  
  extract structured fields from the collection of documents, the extraction being performed by a computer processor;
  
  classify the collection of documents based on the extracted structured fields;
  
  refine the taxonomy by applying user domain knowledge to create the classifications of documents;
  
  further refine the taxonomy by merging, deleting and adding classes to the taxonomy;
  
  construct a contingency table having a first axis defined by the extracted structured fields and a second axis defined by the categories;
  
  use the contingency table to identify a set of experts having a related expertise; and
  
  determine a statistical relationship between the identified set of experts and one of the extracted structured fields in the table.
- Dependent Claims (11, 12, 13, 14, 15)
- - 11. The computer program product of claim 10, wherein:
    - a combined document set is classified in response to a skipped process.
  - 12. The computer program product of claim 10, wherein find a relationship between the generated categories and a collection of names of people, wherein each of said names of people is the name of an author of a document from said collection of documents.
  - 13. The computer program product of claim 12, wherein:
    - said structured fields are used to identify a co-authoring relationship among the collection of names of people and the collection of documents.
  - 14. The computer program product of claim 12, wherein:
    - the computer finds said relationship using a contingency table generated from said categories and said collection of names of people; and
      
      the computer plots a network graph from said contingency table.
  - 15. The computer program product of claim 12, wherein:
    - the computer finds said relationship using a contingency table generated from said categories and said collection of names of people, each of a plurality of cells of said contingency table having an actual value;
      
      the computer calculates a likelihood value for each of the plurality of cells; and
      
      the computer determines a degree of significance for each of the plurality of cells based on a comparison of the actual value and the likelihood value for each of the plurality of cells.
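Claim 14 recites plotting a network graph from the contingency table without describing the graph's structure. As a rough sketch, one could treat each non-empty cell as a weighted author-to-category edge and then link authors who share a category. The edge representation, the `min_count` filter, and the function names below are assumptions; rendering the graph is left out.

```python
from itertools import combinations

def graph_edges(table, min_count=1):
    """Turn table[author][category] counts into (author, category, weight) edges."""
    edges = []
    for author, row in sorted(table.items()):
        for category, n in sorted(row.items()):
            if n >= min_count:
                edges.append((author, category, n))
    return edges

def shared_expertise_links(edges):
    """Link every pair of authors connected to the same category."""
    by_category = {}
    for author, category, _ in edges:
        by_category.setdefault(category, []).append(author)
    links = set()
    for authors in by_category.values():
        for a, b in combinations(sorted(authors), 2):
            links.add((a, b))
    return sorted(links)

table = {
    "Ada": {"clustering": 3, "graphs": 0},
    "Bo": {"clustering": 1, "graphs": 2},
    "Cy": {"clustering": 0, "graphs": 4},
}
edges = graph_edges(table)
links = shared_expertise_links(edges)
print(edges)
print(links)
```

Raising `min_count` prunes weak author-to-category edges before the author-to-author links are derived, which is one simple way to keep such a graph readable.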

16. A computer program product for use with a collection of documents including a set of seed documents in a data warehouse, the computer program product stored in a non-transitory computer useable storage medium including a computer readable program, wherein the computer readable program when executed on a computer causes the computer to:
- search the data warehouse to provide a set of additional documents similar to the set of seed documents, wherein said similarity is determined using a statistical method;
  
  generate an initial taxonomy for a combined document set that includes all documents from both the set of seed documents and the set of additional documents;
  
  generate categories for the initial taxonomy wherein said generating of categories includes generation of a feature space from text of the set of seed documents and comparison of the set of seed documents using a measure of distance in the feature space;
  
  iterate the processes of extracting, searching, and generating by the performance of a computer processor, and using domain knowledge to produce a refined taxonomy from the initial taxonomy; and
  
  refine the taxonomy by creating new categories in the taxonomy based on a relationship between the categories in the taxonomy;
  
  classify the combined document set using structured fields;
  
  use contingency analysis to generate a contingency table that compares categories of the refined taxonomy to the structured fields;
  
  calculate an expected value percentage for each of a plurality of cells in the contingency table as the expected percent of an author's documents out of a total number for one of the categories, multiplied by a percent of documents in the one of the categories out of a total number of documents for all categories;
  
  determine a statistical relationship between the author and one of the categories of the refined taxonomy in the table; and
  
  shade each of the plurality of cells in the contingency table based on the degree to which the expected value percentage exceeds an actual value in each of the plurality of cells in the contingency table.
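Claims 5 and 16 begin by searching a data warehouse for additional documents similar to a seed set, with similarity "determined using a statistical method" that the claims do not name. The sketch below substitutes Jaccard overlap of token sets as a stand-in similarity score; that choice, the `threshold` value, and the function names are assumptions made only for illustration.

```python
import re

def token_set(text):
    """Return the set of lowercase alphabetic tokens in a document."""
    return set(re.findall(r"[a-z]+", text.lower()))

def jaccard(a, b):
    """Jaccard similarity of two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def expand_seed_set(seed_docs, warehouse_docs, threshold=0.3):
    """Return the seed documents plus warehouse documents similar to any seed."""
    seed_sets = [token_set(d) for d in seed_docs]
    additional = []
    for doc in warehouse_docs:
        if doc in seed_docs:
            continue  # already part of the seed set
        best = max((jaccard(token_set(doc), s) for s in seed_sets), default=0.0)
        if best >= threshold:
            additional.append(doc)
    return list(seed_docs) + additional

seeds = ["text clustering methods"]
warehouse = [
    "clustering methods for text mining",
    "protein folding kinetics",
    "text clustering methods",
]
combined = expand_seed_set(seeds, warehouse)
print(combined)
```

The returned list corresponds to the combined document set on which the initial taxonomy is generated; the claims' iteration of extracting, searching, and generating would repeat this expansion with the enlarged set as the new seed.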

Current Assignee: International Business Machines Corporation
Original Assignee: International Business Machines Corporation
Inventors: Chen, Ying; Kreulen, Jeffrey Thomas; Lelescu, Ana; Rhodes, James J.; Spangler, William Scott
Primary Examiner(s): Vital, Pierre
Assistant Examiner(s): Mitiku, Berhanu

Application Number: US12/134,098
Publication Number: US 20080301105A1
Time in Patent Office: 1,979 Days
Field of Search: None
US Class Current: 707/603
CPC Class Codes:
G06F 16/35   Clustering; Classification
G06F 16/36   Creation of semantic tools,...
G06F 16/9024   Graphs; Linked lists G06F16...
G06Q 10/06   Resources, workflows, human...
