METHOD AND SYSTEM FOR ACTIVE LEARNING SCREENING PROCESS WITH DYNAMIC INFORMATION MODELING

US 20090083200A1
Filed: 09/19/2008
Published: 03/26/2009
Est. Priority Date: 09/21/2007
Status: Active Grant

Abstract

Systems and methods consistent with the present invention improve manual screening processes for a group of documents by implementing an active learning screening process with dynamic information modeling. A classification algorithm is trained to recognize the relationships between concept tags applied to a subset of the documents and true or correct utility or relevance ratings applied to the subset of documents. Once adequately trained using several subsets of documents, the classification algorithm may be applied to the entire group of documents, screening out documents that are not relevant or important.
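
The train-then-screen flow described in the abstract can be sketched as follows. This is a minimal illustration, not the patented implementation: the abstract does not name a classification algorithm, so a small Bernoulli naive Bayes over concept tags is assumed here, and the function names (`train`, `predict`, `screen`), the label strings, and the example concepts are all hypothetical.

```python
import math
from collections import Counter

def train(tagged_docs):
    """Learn concept/label relationships from (set_of_concepts, status_label) pairs."""
    label_counts = Counter(label for _, label in tagged_docs)
    concept_counts = {label: Counter() for label in label_counts}
    vocab = set()
    for concepts, label in tagged_docs:
        concept_counts[label].update(concepts)
        vocab |= concepts
    return label_counts, concept_counts, vocab

def predict(model, concepts):
    """Return the most probable status label for a document's concept tags."""
    label_counts, concept_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in label_counts.items():
        score = math.log(n / total)  # log prior of the label
        for c in vocab:
            # Add-one smoothed probability that concept c appears under this label.
            p = (concept_counts[label][c] + 1) / (n + 2)
            score += math.log(p if c in concepts else 1 - p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

def screen(model, documents, keep="relevant"):
    """Apply the trained model to the whole group; keep documents predicted relevant."""
    return [doc_id for doc_id, concepts in documents.items()
            if predict(model, concepts) == keep]

# Hypothetical manually rated subset (concept tags plus status label).
rated = [({"rct", "diabetes"}, "relevant"), ({"rct"}, "relevant"),
         ({"editorial"}, "irrelevant"), ({"editorial", "diabetes"}, "irrelevant")]
model = train(rated)
corpus = {"d1": {"rct"}, "d2": {"editorial"}, "d3": {"rct", "diabetes"}}
shortlist = screen(model, corpus)
```

In practice the rated subset would be extended over several rounds before `screen` is run on the full group, as the abstract notes.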

20 Claims

1. A computer-assisted method for screening information in a set of documents, comprising:
- a) creating a set of concepts representing subject matter in the documents;
  
  b) creating a plurality of status labels representing relevancy of the documents to an end use for the set of documents;
  
  c) selecting a first subset of the set of documents;
  
  d) identifying occurrences of concepts from the set of concepts in each document in the first subset based on subject matter in each document;
  
  e) assigning a status label from the plurality of status labels to each document in the first subset based on relevancy of the document to the end use for the document;
  
  f) processing, using a computer, the first subset of the set of documents to learn relationships among the set of concepts and the labels and create a classification model of the relationships;
  
  g) modifying the set of concepts based on the learned relationships among the set of concepts and the status labels;
  
  h) selecting a second subset of the set of documents;
  
  i) assigning a status label from the plurality of status labels to each document in the second subset based on relevancy of the document to the end use for the document;
  
  j) processing, using the computer, the second subset to further learn relationships among the set of concepts and the labels and refine the classification model of the relationships; and
  
  k) if the classification model of the relationships is satisfactory, then screening, using the computer, the set of documents based on the classification model of the relationships, such that a subset of relevant documents is identified.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9, 10)
  - 2. The computer-assisted method of claim 1, further comprising:
    - repeating d) through j) until the classification model is satisfactory.
  - 3. The computer-assisted method of claim 1, wherein f) processing the first subset of the set of documents to learn relationships among the set of concepts and the labels further comprises:
    - assigning a classification label to each document in the first subset; and
      
      wherein j) processing the second subset to further learn relationships among the set of concepts and the status labels further comprises:
      
      assigning a classification label to each document in the second subset.
  - 4. The computer-assisted method of claim 3, further comprising:
    - analyzing, using the computer, a correlation between the classification labels for the documents in the second subset and the status labels for the documents in the second subset; and
      
      determining that the classification model is satisfactory if there is a high correlation.
  - 5. The computer-assisted method of claim 1, wherein c) selecting a first subset further comprises:
    - selecting a first random subset of documents from the set of documents; and
      
      wherein h) selecting a second subset further comprises:
      
      selecting a second random subset of documents from the set of documents.
  - 6. The computer-assisted method of claim 1, wherein h) selecting a second subset further comprises:
    - selecting a second subset of documents by oversampling documents predicted to have a high or low likelihood of relevance by the classification model of the relationships.
  - 7. The computer-assisted method of claim 1, wherein g) modifying the set of concepts further comprises:
    - analyzing documents that were misclassified by the classification model to identify a concept common to each misclassified document.
  - 8. The computer-assisted method of claim 1, further comprising:
    - utilizing the subset of relevant documents for the end use.
  - 9. The computer-assisted method of claim 1, wherein the end use is a systematic review of literature.
  - 10. The computer-assisted method of claim 1, wherein the plurality of status labels comprises a first status label indicating a document is relevant and a second status label indicating a document is irrelevant;
    - and wherein the computer-assisted method further comprises:
      
      analyzing, using the computer, a first percentage of documents corresponding to the first status label and a second percentage of documents corresponding to the second status label, according to the classification model; and
      
      determining that the classification model is satisfactory if a sum of the first percentage and the second percentage exceeds a predetermined threshold.
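
The "satisfactory" tests of claims 4 and 10 could be computed along these lines. This is a sketch under stated assumptions: the claims do not fix a correlation measure, so the phi coefficient for binary labels is used as one possible choice, and the two percentages of claim 10 are read here as the share of relevant documents classified relevant plus the share of irrelevant documents classified irrelevant. The function names and threshold defaults are hypothetical.

```python
import math

def confusion(status, predicted, positive="relevant"):
    """Count agreements and disagreements between status and classification labels."""
    tp = fp = tn = fn = 0
    for s, p in zip(status, predicted):
        if s == positive and p == positive:
            tp += 1
        elif s == positive:
            fn += 1
        elif p == positive:
            fp += 1
        else:
            tn += 1
    return tp, fp, tn, fn

def phi_coefficient(status, predicted):
    """Correlation between two binary labelings (one reading of claim 4)."""
    tp, fp, tn, fn = confusion(status, predicted)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def percent_sum(status, predicted):
    """Sum of the two per-label percentages (one reading of claim 10)."""
    tp, fp, tn, fn = confusion(status, predicted)
    first = 100.0 * tp / (tp + fn) if tp + fn else 0.0   # relevant, classified relevant
    second = 100.0 * tn / (tn + fp) if tn + fp else 0.0  # irrelevant, classified irrelevant
    return first + second

def satisfactory_by_correlation(status, predicted, min_phi=0.8):
    # min_phi is an assumed value; the claim only says "high correlation".
    return phi_coefficient(status, predicted) >= min_phi

def satisfactory_by_percent_sum(status, predicted, threshold=180.0):
    # threshold is an assumed value; the claim only says "predetermined threshold".
    return percent_sum(status, predicted) > threshold

status = ["relevant"] * 3 + ["irrelevant"] * 5
predicted = ["relevant", "relevant", "irrelevant",
             "irrelevant", "irrelevant", "irrelevant", "irrelevant", "relevant"]
```

Either test would gate step k) of claim 1; if it fails, another labeled subset is processed first.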

11. A computer-assisted method for identifying relevant documents among a set of documents, comprising:
- a) tagging each document in the set of documents with a concept from a set of concepts representing subject matter found in each document;
  
  b) selecting a subset of documents from the set of documents;
  
  c) assigning a status label to each document in the subset, the status label representing a degree of importance of the document;
  
  d) training a classification algorithm using the subset of documents, where the classification algorithm relates the status labels assigned to the documents to the concepts tagged to the documents, using a computer;
  
  e) applying the classification algorithm to the subset of documents to create a classification label for each document in the subset, using the computer, the classification label representing a predicted degree of importance of the document;
  
  f) comparing the classification labels to the status labels, using the computer;
  
  g) if the classification labels substantially correspond to the status labels, then applying the classification algorithm to the set of documents to create a classification label for each document, using the computer; and
  
  h) if the classification labels do not substantially correspond to the status labels, then modifying the set of concepts representing subject matter found in the documents and repeating a) through h) using the modified set of concepts.
- Dependent claims (12, 13, 14, 15, 16, 17, 18, 19)
  - 12. The computer-assisted method of claim 11, further comprising:
    - utilizing the documents having a classification label representing a high predicted degree of importance for an end use.
  - 13. The computer-assisted method of claim 12, wherein the end use is a systematic review of literature.
  - 14. The computer-assisted method of claim 11, wherein the classification labels substantially correspond to the status labels if a predetermined number of iterations of b) through h) have been performed.
  - 15. The computer-assisted method of claim 11, wherein the classification labels substantially correspond to the status labels if a specified number of classification labels are the same as the status labels for the subset of documents.
  - 16. The computer-assisted method of claim 11, further comprising:
    - applying a diagnostic algorithm to the output of the classification algorithm to create a ranking of the set of concepts representing a contribution of each concept to the classification label; and
      
      wherein modifying the set of concepts representing subject matter found in the documents further comprises:
      
      modifying the set of concepts based on the ranking.
  - 17. The computer-assisted method of claim 11, further comprising:
    - applying a diagnostic algorithm to the output of the classification algorithm to identify a characterizing concept common to misclassified documents; and
      
      wherein modifying the set of concepts representing subject matter found in the documents further comprises:
      
      modifying the set of concepts based on the characterizing concept.
  - 18. The computer-assisted method of claim 11, wherein b) selecting a subset further comprises:
    - selecting a random subset of documents from the set of documents.
  - 19. The computer-assisted method of claim 11, wherein b) selecting a subset further comprises:
    - initially selecting a random subset of documents from the set of documents; and
      
      subsequently selecting a subset of documents by oversampling documents predicted to have a high or low likelihood of relevance according to the classification model of the relationships.
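
The oversampling step of claims 6 and 19 might look like the sketch below. The claims do not say how the oversampling is performed, so this example assumes weighted sampling without replacement (the Efraimidis-Spirakis key method), with documents whose predicted relevance falls at or beyond assumed cut-offs given a larger weight. The function name, the `low`/`high` cut-offs, and the `boost` factor are all hypothetical.

```python
import random

def oversample_extremes(scores, k, low=0.2, high=0.8, boost=4.0, rng=None):
    """Pick k document ids, favoring those predicted very likely or very unlikely relevant.

    scores maps a document id to a predicted likelihood of relevance in [0, 1].
    """
    rng = rng or random.Random()
    keyed = []
    for doc_id, score in scores.items():
        weight = boost if (score <= low or score >= high) else 1.0
        # Weighted sampling without replacement: draw u in [0, 1), use u ** (1 / weight)
        # as the sort key, and keep the k largest keys.
        keyed.append((rng.random() ** (1.0 / weight), doc_id))
    keyed.sort(reverse=True)
    return [doc_id for _, doc_id in keyed[:k]]

# Hypothetical predicted likelihoods for eleven unlabeled documents.
scores = {"d%d" % i: i / 10 for i in range(11)}
second_subset = oversample_extremes(scores, k=4, rng=random.Random(0))
```

With `boost=1.0` the function reduces to a plain random subset, which corresponds to the initial selection in claim 19.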

20. A system for identifying relevant documents among a set of documents, comprising:
- a memory device containing document data for each document in the set of documents, concept data associated with the document data, status label data associated with the document data and classification label data associated with the document data; and
  
  a processor, communicatively connected to the memory device, that executes code for performing operations comprising:
  
  a) for the document data for each document in the set of documents, storing in the memory device concept data corresponding to a set of concepts representing subject matter found in the document;
  
  b) selecting a subset of documents from the set of documents;
  
  c) for the document data for each document in the subset, storing in the memory device status label data for each document in the subset, the status label data representing a degree of importance of the document;
  
  d) executing a classification algorithm that is trained using the subset of documents, where the classification algorithm relates the status label data associated with the document data for each document in the subset to the associated concept data;
  
  e) for the document data for each document in the subset, storing in the memory device classification label data for each document in the subset, the classification label data representing a predicted degree of importance of the document;
  
  f) for the document data for each document in the subset, comparing the classification label data to the status label data;
  
  g) if the classification label data substantially corresponds to the status label data, then executing the classification algorithm using the set of documents as input to create classification label data associated with the document data for each document in the set of documents; and
  
  h) if the classification label data does not substantially correspond to the status label data, then modifying the set of concepts representing subject matter found in the documents and repeating b) through h) using the modified set of concepts.
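
The data held in the memory device of claim 20, and the comparison in steps f) through h), can be pictured with the sketch below. It is an illustration only: the record layout, field names, and the `substantially_corresponds` helper are hypothetical, and "substantially corresponds" is read here as a minimum count of matching labels, which is one of the readings offered in claim 15.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set

@dataclass
class DocumentRecord:
    """One document's data as stored in the memory device (assumed layout)."""
    doc_id: str
    text: str                                   # document data
    concepts: Set[str] = field(default_factory=set)  # concept data
    status_label: Optional[str] = None          # manually assigned degree of importance
    classification_label: Optional[str] = None  # predicted degree of importance

def substantially_corresponds(records: List[DocumentRecord], min_matches: int) -> bool:
    """Step f): compare classification label data to status label data for the subset."""
    matches = sum(
        1 for r in records
        if r.status_label is not None and r.classification_label == r.status_label
    )
    return matches >= min_matches

subset = [
    DocumentRecord("d1", "...", {"rct"}, "relevant", "relevant"),
    DocumentRecord("d2", "...", {"editorial"}, "irrelevant", "irrelevant"),
    DocumentRecord("d3", "...", {"rct", "editorial"}, "relevant", "irrelevant"),
]
# True would lead to step g) (classify the whole set); False to step h) (modify concepts).
proceed_to_full_set = substantially_corresponds(subset, min_matches=2)
```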

Current Assignee: Noblis, Inc.
Original Assignee: Noblis, Inc.
Inventors: Schmid, Christopher H.; Lau, Joseph; Pollara, Victor J.
Granted Patent: US 8,126,826 B2
US Class Current: 706/14
CPC Class Codes: G06N 20/00 Machine learning
