Methodologies and analytics tools for locating experts with specific sets of expertise
First Claim
1. A method for use with a collection of documents, the method comprising:
generating categories representing fields of expertise derived from the collection of documents, including generating a taxonomy from the collection of documents, wherein said generating of categories includes generation of a feature space from text of the collection of documents and comparison of the collection of documents using a measure of distance in the feature space;
converting each of the documents in the collection of documents into a numeric vector corresponding to word, feature, and information content of each of the collection of documents;
extracting structured fields from the collection of documents, the extraction being performed by a computer processor;
classifying the collection of documents based on the extracted structured fields;
refining the taxonomy by applying user domain knowledge to create the classifications of documents;
further refining the taxonomy by merging, deleting and adding classes to the taxonomy;
constructing a contingency table having a first axis defined by the extracted structured fields and a second axis defined by the categories; and
using the contingency table to identify a set of experts having a related expertise; and
determining a statistical relationship between the identified set of experts and one of the extracted structured fields in the table.
Abstract
A method and analytics tools for locating experts with specific sets of expertise are disclosed, the method including providing a collection of documents P0; generating categories representing fields of expertise derived from the collection of documents P0; refining the taxonomy of the categories by applying user domain knowledge; extracting structured fields from the collection of documents P0; constructing a contingency table having a first axis defined by the extracted structured fields and a second axis defined by the categories; and using the contingency table to identify a set of experts having a related expertise. The method may also include a network graph analysis that aids visualization of the relationship between people and expertise.
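The contingency table described in the abstract can be illustrated with a minimal sketch. It assumes each document has already been reduced to an (author, category) pair; the sample data, the function names `contingency_table` and `top_experts`, and the choice to rank authors by raw document count are illustrative assumptions, not details taken from the patent.

```python
from collections import Counter

# Hypothetical classified documents: (structured field value, category) pairs.
# Here the structured field is assumed to be the author name.
docs = [
    ("alice", "databases"),
    ("alice", "databases"),
    ("alice", "networking"),
    ("bob", "networking"),
    ("bob", "networking"),
    ("carol", "databases"),
]

def contingency_table(pairs):
    """Count documents in each (field value, category) cell."""
    counts = Counter(pairs)
    rows = sorted({field for field, _ in pairs})
    cols = sorted({cat for _, cat in pairs})
    # First axis: structured field values; second axis: categories.
    return {r: {c: counts[(r, c)] for c in cols} for r in rows}

def top_experts(table, category, n=2):
    """Rank field values by document count in one category (an assumed heuristic)."""
    ranked = sorted(table, key=lambda r: table[r][category], reverse=True)
    return [r for r in ranked[:n] if table[r][category] > 0]

table = contingency_table(docs)
print(top_experts(table, "databases"))
```

Ranking by raw count is the simplest possible reading; the claims additionally call for a statistical relationship between experts and fields, which this sketch does not compute.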
21 Citations
16 Claims
1. A method for use with a collection of documents, the method comprising:
generating categories representing fields of expertise derived from the collection of documents, including generating a taxonomy from the collection of documents, wherein said generating of categories includes generation of a feature space from text of the collection of documents and comparison of the collection of documents using a measure of distance in the feature space;
converting each of the documents in the collection of documents into a numeric vector corresponding to word, feature, and information content of each of the collection of documents;
extracting structured fields from the collection of documents, the extraction being performed by a computer processor;
classifying the collection of documents based on the extracted structured fields;
refining the taxonomy by applying user domain knowledge to create the classifications of documents;
further refining the taxonomy by merging, deleting and adding classes to the taxonomy;
constructing a contingency table having a first axis defined by the extracted structured fields and a second axis defined by the categories; and
using the contingency table to identify a set of experts having a related expertise; and
determining a statistical relationship between the identified set of experts and one of the extracted structured fields in the table.
Dependent claims: 2, 3, 4
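The feature-space and distance steps of claim 1 can be sketched as follows. This is a minimal illustration, assuming a bag-of-words feature space built from whitespace tokens and cosine distance as the measure; the claim names neither choice, and the helper names and sample texts are hypothetical.

```python
import math
from collections import Counter

def build_feature_space(texts):
    """Sorted vocabulary of lowercase tokens; one dimension per term."""
    return sorted({tok for t in texts for tok in t.lower().split()})

def to_vector(text, vocab):
    """Convert a document into a numeric term-count vector over the vocabulary."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocab]

def cosine_distance(u, v):
    """1 minus cosine similarity; 1.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm else 1.0

# Hypothetical document texts.
texts = [
    "query optimizer for relational databases",
    "index tuning for relational databases",
    "packet routing in wireless networks",
]
vocab = build_feature_space(texts)
vectors = [to_vector(t, vocab) for t in texts]

# The two database documents sit closer together than either does to the
# networking document, which is what a clustering step would exploit.
print(cosine_distance(vectors[0], vectors[1]))
print(cosine_distance(vectors[0], vectors[2]))
```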
5. A method for use with a set of seed documents extracted from a data warehouse, the method comprising:
searching the data warehouse to provide a set of additional documents similar to the set of seed documents, wherein said similarity is determined using a statistical method;
generating an initial taxonomy for a combined document set that includes all documents from both the set of seed documents and the set of additional documents;
generating categories for the initial taxonomy wherein said generating of categories includes generation of a feature space from text of the set of seed documents and comparison of the set of seed documents using a measure of distance in the feature space;
iterating the processes of extracting, searching, and generating by the performance of a computer processor, and using domain knowledge to produce a refined taxonomy from the initial taxonomy;
refining the taxonomy by creating new categories in the taxonomy based on a relationship between the categories in the taxonomy;
classifying the combined document set using structured fields;
using contingency analysis to generate a contingency table that compares categories of the refined taxonomy to the structured fields;
calculating an expected value percentage for each of a plurality of cells in the contingency table as the expected percent of an author's documents out of a total number for one of the categories, multiplied by a percent of documents in the one of the categories out of a total number of documents for all categories;
determining a statistical relationship between the author and one of the categories of the refined taxonomy in the table; and
shading each of the plurality of cells in the contingency table based on the degree to which the expected value percentage exceeds an actual value in each of the plurality of cells in the contingency table.
Dependent claims: 6, 7, 8, 9
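The expected-value and shading steps of claim 5 can be sketched under one possible reading: the expected share of a cell is the author's share of all documents multiplied by the category's share of all documents. The claim wording is open to other readings. The threshold `step`, the three shade labels, and the sample counts are hypothetical, and the sketch keys shading on the signed difference (actual minus expected) so that gaps in either direction are visible.

```python
def expected_and_actual(table):
    """table: author -> category -> count. Returns cell -> (expected, actual) shares."""
    total = sum(sum(row.values()) for row in table.values())
    col_totals = {}
    for row in table.values():
        for cat, n in row.items():
            col_totals[cat] = col_totals.get(cat, 0) + n
    cells = {}
    for author, row in table.items():
        row_share = sum(row.values()) / total          # author's share of all documents
        for cat, n in row.items():
            col_share = col_totals[cat] / total        # category's share of all documents
            cells[(author, cat)] = (row_share * col_share, n / total)
    return cells

def shade(expected, actual, step=0.05):
    """Hypothetical three-level shading based on actual minus expected."""
    diff = actual - expected
    if diff >= step:
        return "dark"    # notably more documents than independence predicts
    if diff <= -step:
        return "light"   # notably fewer documents than independence predicts
    return "none"

# Hypothetical author-by-category counts.
table = {
    "alice": {"databases": 8, "networking": 2},
    "bob": {"databases": 2, "networking": 8},
}
cells = expected_and_actual(table)
shading = {cell: shade(e, a) for cell, (e, a) in cells.items()}
print(shading)
```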
10. A computer program product stored in a non-transitory computer useable storage medium including a computer readable program, wherein the computer readable program when executed on a computer causes the computer to:
generate categories representing fields of expertise derived from a collection of documents, including generating a taxonomy from the collection of documents, wherein said generating of categories includes generation of a feature space from text of the collection of documents and comparison of the collection of documents using a measure of distance in the feature space;
convert each of the documents in the collection of documents into a numeric vector corresponding to word, feature, and information content of each of the collection of documents;
extract structured fields from the collection of documents, the extraction being performed by a computer processor;
classify the collection of documents based on the extracted structured fields;
refine the taxonomy by applying user domain knowledge to create the classifications of documents;
further refine the taxonomy by merging, deleting and adding classes to the taxonomy;
construct a contingency table having a first axis defined by the extracted structured fields and a second axis defined by the categories;
use the contingency table to identify a set of experts having a related expertise; and
determine a statistical relationship between the identified set of experts and one of the extracted structured fields in the table.
Dependent claims: 11, 12, 13, 14, 15
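The claim does not say how the statistical relationship is determined. As one assumed possibility, a Pearson chi-squared statistic over the contingency table measures how far the observed counts depart from what independence between the two axes would predict. The function name and the sample tables below are illustrative only.

```python
def chi_squared(observed):
    """Pearson chi-squared statistic for a table given as a list of rows of counts."""
    row_totals = [sum(r) for r in observed]
    col_totals = [sum(c) for c in zip(*observed)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            # Expected count if the row and column variables were independent.
            exp = row_totals[i] * col_totals[j] / total
            if exp > 0:
                stat += (obs - exp) ** 2 / exp
    return stat

# Hypothetical tables: rows are experts, columns are field values.
strongly_related = [[10, 0], [0, 10]]
unrelated = [[5, 5], [5, 5]]
print(chi_squared(strongly_related), chi_squared(unrelated))
```

A larger statistic indicates a stronger departure from independence; turning it into a significance level would require the chi-squared distribution with the appropriate degrees of freedom, which is omitted here.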
16. A computer program product for use with a collection of documents including a set of seed documents in a data warehouse, the computer program product stored in a non-transitory computer useable storage medium including a computer readable program, wherein the computer readable program when executed on a computer causes the computer to:
search the data warehouse to provide a set of additional documents similar to the set of seed documents, wherein said similarity is determined using a statistical method;
generate an initial taxonomy for a combined document set that includes all documents from both the set of seed documents and the set of additional documents;
generate categories for the initial taxonomy wherein said generating of categories includes generation of a feature space from text of the set of seed documents and comparison of the set of seed documents using a measure of distance in the feature space;
iterate the processes of extracting, searching, and generating by the performance of a computer processor, and using domain knowledge to produce a refined taxonomy from the initial taxonomy; and
refine the taxonomy by creating new categories in the taxonomy based on a relationship between the categories in the taxonomy;
classify the combined document set using structured fields;
use contingency analysis to generate a contingency table that compares categories of the refined taxonomy to the structured fields;
calculate an expected value percentage for each of a plurality of cells in the contingency table as the expected percent of an author's documents out of a total number for one of the categories, multiplied by a percent of documents in the one of the categories out of a total number of documents for all categories;
determine a statistical relationship between the author and one of the categories of the refined taxonomy in the table; and
shade each of the plurality of cells in the contingency table based on the degree to which the expected value percentage exceeds an actual value in each of the plurality of cells in the contingency table.
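The seed-expansion search in claim 16 can be sketched as follows. The claim only requires that similarity be determined by "a statistical method"; this sketch assumes cosine similarity between term-count profiles, with the seed set summarized by the sum of its term counts. The function names, the `threshold` value, and the sample documents are hypothetical.

```python
import math
from collections import Counter

def term_counts(text):
    """Lowercase whitespace tokens mapped to their counts."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine similarity between two term-count mappings; 0.0 if either is empty."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def find_similar(seeds, warehouse, threshold=0.3):
    """Return non-seed warehouse documents similar enough to the summed seed profile."""
    profile = Counter()
    for s in seeds:
        profile.update(term_counts(s))
    return [d for d in warehouse
            if d not in seeds and cosine_similarity(term_counts(d), profile) >= threshold]

# Hypothetical seed set and warehouse contents.
seeds = ["gene expression in yeast", "yeast gene regulation"]
warehouse = seeds + ["protein expression in yeast cells", "bridge load analysis"]

additional = find_similar(seeds, warehouse)
combined = seeds + additional   # the combined document set the taxonomy is built on
print(additional)
```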
Specification