Methods and apparatus for asynchronous and interactive machine learning using word embedding within text-based documents and multimodal documents

US 10,062,039 B1
Filed: 06/28/2017
Issued: 08/28/2018
Est. Priority Date: 06/28/2017
Status: Active Grant

Abstract

A machine learning system continuously receives tag signals indicating membership relations between data objects from a data corpus and tag targets. The machine learning system is asynchronously and iteratively trained with the received tag signals to identify further data objects from the data corpus predicted to have a membership relation with the single tag target. The machine learning system constantly improves its predictive accuracy in short time by the continuous training of a backend machine learning model based on implicit and explicit tag signals gathered from a non-intrusive monitoring of user interactions during a review process of the data corpus.
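
A minimal sketch of how tag signals might be accumulated until enough of them exist to train on, assuming a simple in-memory store. The class name `TagSignalCollector`, its method names, and the rule that a seed set is handed out only once per tag target are illustrative assumptions, not taken from the patent; the asynchronous scheduling described in the abstract is left out.

```python
from collections import defaultdict


class TagSignalCollector:
    """Hypothetical helper: gathers tag signals and reports when a seed set is ready."""

    def __init__(self, seed_threshold):
        self.seed_threshold = seed_threshold
        # tag target -> {object id: True for positive, False for negative membership}
        self.tagged = defaultdict(dict)
        self.seeded = set()  # tag targets that already produced a seed set

    def receive(self, object_id, tag_target, is_member):
        """Record one tag signal; return the seed set the first time the threshold is met."""
        objects = self.tagged[tag_target]
        objects[object_id] = is_member  # re-tagging an object overwrites its earlier signal
        if tag_target not in self.seeded and len(objects) >= self.seed_threshold:
            self.seeded.add(tag_target)
            return dict(objects)
        return None
```

A caller would pass each returned seed set to whatever training routine it uses; later signals for the same tag target would feed re-training rather than a new seed set.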

46 Citations

21 Claims

1. A non-transitory medium storing code representing a plurality of processor-executable instructions, the code comprising code to cause the processor to:
- execute a machine-assisted iterative search over a data corpus via an asynchronous and interactive machine learning system;
  
  receive, via a user interface, a first series of tag signals, each tag signal from the first series indicating a membership relation between at least one data object from the data corpus and at least one tag target from a non-empty set of tag targets;
  
  the code to execute includes;
  
  select a seed set from a first set of data objects upon a determination that a number of data objects from the first set of data objects having a membership relation with a single tag target from the non-empty set of tag targets has reached a predetermined threshold corresponding to a number of elements of a training set; and
  
  train a machine learning model based on the seed set to identify further data objects from the data corpus predicted to have a membership relation with the single tag target;
  
  receive, via the user interface, a second series of tag signals, each tag signal from the second series indicating a membership relation between at least one data object from a second set of data objects and at least one tag target from the non-empty set of tag targets, the second set of data objects includes at least one data object predicted by the machine learning model as having a membership relation with the single tag target;
  
  the code to execute includes;
  
  calculate a membership score for each data object from the second set of data objects, the membership score corresponding to a predicted membership degree with respect to the single tag target;
  
  divide a membership scale of the single tag target into a number of 2b non-overlapping intervals of equal length with b positive non-overlapping intervals defined by a pair of positive endpoint numbers and b negative non-overlapping intervals defined by a pair of negative endpoint numbers, b corresponding to a number of score buckets of a histogram distribution;
  
  partition the second set of data objects into a number of training subsets equal to 2b+1, the training subsets including;
  
  (1) a training subset having all data objects from the second set of data objects whose membership relation with respect to the single tag target is undefined, (2) a first set of training subsets with b training subsets, each training subset from the first set of training subsets having data objects with membership scores within a positive non-overlapping interval from the b positive non-overlapping intervals, (3) a second set of training subsets with b training subsets, each training subset from the second set of training subsets having data objects with membership scores within a negative non-overlapping interval from the b negative non-overlapping intervals; and
  
  re-train the machine learning model based on data objects included in the training subset, the first set of training subsets, and the second set of training subsets;
  
  display at the user interface, via the asynchronous and interactive machine learning system and based on the re-trained machine learning model, a document object from the data corpus with a magnitude value corresponding to a membership degree between the document object and at least one tag target from the non-empty set of tag targets; and
  
  enable a user, via the asynchronous and interactive machine learning system, to provide feedback to the machine learning model via an accept input, a dismiss input, an input to modify sections in the document object or an input to modify magnitude values corresponding to membership degrees causing the machine learning model to improve based on the feedback.
  - 2. The non-transitory medium of claim 1, wherein the predetermined threshold is a first predetermined threshold, the non-transitory computer-readable medium further causes the processor to:
    - re-train the machine learning model based on the second set of data objects upon a determination that a number of elements of the second set of data objects matched with the single tag target has reached a second predetermined threshold corresponding to a number of elements of the training set, the second predetermined threshold greater than the first predetermined threshold.
  - 3. The non-transitory medium of claim 1, wherein the single tag target is a first single tag target, the non-transitory computer-readable medium further causes the processor to:
    - re-train the machine learning model based on the second set of data objects upon a determination that a number of elements of the second set of data objects matched with a second single tag target from the non-empty set of tag targets has reached the predetermined threshold corresponding to a number of elements of the training set, the first single tag target different from the second single tag target.
  - 4. The non-transitory medium of claim 1, wherein the code comprising code to cause the processor to train the machine learning model includes code to further cause the processor to:
    - divide each data object from the seed set into a set of pages;
      
      produce a set of three-dimensional tensor objects, each tensor object from the set of three-dimensional tensor objects (a) representing a data object from the seed set, and (b) including a first dimension with a value corresponding to a number of pages of that data object, a second dimension with a value corresponding to a page size of that data object, and a third dimension with a vector having a set of values indicating relationships between an indexed term included in that data object and a set of terms from a vocabulary, the page size corresponding to a fixed size memory region;
      
      produce a single tensor by stacking the set of three-dimensional tensor objects along the first dimension of each tensor object from the set of three-dimensional tensor objects;
      
      produce a set of equally sized mini-batches by dividing the single stacked tensor along the first dimension, each mini-batch from the set of equally sized mini-batches containing a same number of pages and corresponding to an equally sized memory region; and
      
      train the machine learning model with the set of equally sized mini-batches.
  - 5. The non-transitory medium of claim 1, wherein the code comprising code to cause the processor to re-train the machine learning model includes code to further cause the processor to:
    - calculate a membership score for each data object from the second set of data objects that corresponds to a predicted membership degree with respect to the single tag target;
      
      calculate a probability value for each data object from the second set of data objects such that data objects with positive and lower membership scores have a higher probability for their inclusion in the training set than data objects with positive and higher membership scores, the data objects with positive and lower membership scores predicted as members of a first semantically-distinct data object, the data objects with positive and higher membership scores predicted as members of a second semantically-distinct data object; and
      
      re-train the machine learning model with the training set including data objects based on their respective probabilities.
  - 6. The non-transitory medium of claim 1, wherein the code comprising code to cause the processor to train the machine learning model includes code to further cause the processor to:
    - generate, for each data object from the seed set, a sequence of numbers, each number in the sequence of numbers corresponding to a vocabulary index value associated with a non-empty set of terms in a vocabulary.
  - 7. The non-transitory medium of claim 1, wherein the code comprising code to cause the processor to train the machine learning model includes code to further cause the processor to:
    - produce a set of two-dimensional tensor objects including a two-dimensional tensor object for each data object from the seed set, each two-dimensional tensor object including a first tensor dimension corresponding to a term index, and a second tensor dimension corresponding to a numeric vector indicating a relationship between the term index and a set of terms from a vocabulary; and
      
      train a convolutional neural network at least in part with the two-dimensional tensor objects.
  - 8. The non-transitory medium of claim 1, wherein the membership relation of the number of data objects from the seed set having the membership relation with the single tag target from the non-empty set of tag targets indicates a positive membership relation.
  - 9. The non-transitory medium of claim 1, wherein the membership relation of the at least one data object predicted by the machine learning model as having the membership relation with the single tag target indicates a positive membership relation.
  - 10. The non-transitory medium of claim 1, wherein the membership relation of the at least one data object predicted by the machine learning model as having the membership relation with the single tag target indicates a negative membership relation.
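
The partition recited in claim 1 (2b equal-length score intervals plus one subset for undefined membership, 2b+1 subsets in total) can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the function name `partition_by_score`, a membership scale of [-scale, scale], the use of `None` for an undefined relation, the treatment of a score of exactly 0 as undefined, and the choice of which interval endpoint is closed are all hypothetical.

```python
import math


def partition_by_score(scores, b, scale=1.0):
    """Split objects into 2b+1 subsets: one undefined, b positive buckets, b negative buckets.

    scores maps object id -> membership score in [-scale, scale], or None when undefined.
    """
    width = scale / b  # every interval has the same length
    undefined = []
    positive = [[] for _ in range(b)]  # positive[i] covers (i*width, (i+1)*width]
    negative = [[] for _ in range(b)]  # negative[i] covers [-(i+1)*width, -i*width)
    for object_id, score in scores.items():
        if score is None or score == 0:  # assumption: 0 carries no membership information
            undefined.append(object_id)
            continue
        # Bucket index from the score magnitude, clamped to the valid range.
        index = min(b - 1, max(0, math.ceil(abs(score) / width) - 1))
        (positive if score > 0 else negative)[index].append(object_id)
    return undefined, positive, negative
```

With b = 4, for example, a score of 0.3 falls in the second positive bucket and a score of -0.6 in the third negative bucket; re-training would then draw from all nine subsets.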

11. A method, comprising:
- executing a machine-assisted iterative search over a data corpus via an asynchronous and interactive machine learning system;
  
  receiving at a processor, via a user interface, a first series of tag signals, each tag signal from the first series indicating a membership relation between at least one data object from the data corpus and at least one tag target from a non-empty set of tag targets;
  
  selecting a seed set from a first set of data objects upon a determination that a number of data objects from a first set of data objects having a membership relation with a single tag target from the non-empty set of tag targets has reached a predetermined threshold corresponding to a number of elements of a training set;
  
  training a machine learning model based on the seed set to identify further data objects from the data corpus predicted to have a membership relation with the single tag target;
  
  receiving at the processor, via the user interface, a second series of tag signals, each tag signal from the second series indicating a membership relation between at least one data object from a second set of data objects and at least one tag target from the non-empty set of tag targets, the second set of data objects includes at least one data object predicted by the machine learning model as having a membership relation with the single tag target;
  
  calculating a membership score for each data object from the second set of data objects and the membership score corresponding to a predicted membership degree with respect to the single tag target;
  
  dividing a membership scale of the single tag target into a number of 2b non-overlapping intervals of equal length with b positive non-overlapping intervals defined by a pair of positive endpoint numbers and b negative non-overlapping intervals defined by a pair of negative endpoint numbers, b corresponding to a number of score buckets of a histogram distribution;
  
  partitioning the second set of data objects into a number of training subsets equal to 2b+1, the training subsets including;
  
  (1) a training subset having all data objects from the second set of data objects whose membership relation with respect to the single tag target is undefined, (2) a first set of training subsets with b training subsets, each training subset from the first set of training subsets having data objects with membership scores within a positive non-overlapping interval from the b positive non-overlapping intervals, (3) a second set of training subsets with b training subsets, each training subset from the second set of training subsets having data objects with membership scores within a negative non-overlapping interval from the b negative non-overlapping intervals;
  
  re-training the machine learning model based on data objects included in the training subset, the first set of training subsets, and the second set of training subsets;
  
  displaying at the user interface, via the asynchronous and interactive machine learning system and based on the re-trained machine learning model, a document object from the data corpus with a magnitude value corresponding to a membership degree between the document object and at least one tag target from the non-empty set of tag targets; and
  
  enabling a user, via the asynchronous and interactive machine learning system, to provide feedback to the machine learning model via an accept input, a dismiss input, an input to modify sections in the document object or an input to modify magnitude values corresponding to membership degrees causing the machine learning model to improve based on the feedback.
  - 12. The method of claim 11, wherein the predetermined threshold is a first predetermined threshold, the method further comprising:
    - re-training the machine learning model based on the second set of data objects upon a determination that a number of elements of the second set of data objects matched with the single tag target has reached a second predetermined threshold corresponding to a number of elements of the training set, the second predetermined threshold greater than the first predetermined threshold.
  - 13. The method of claim 11, wherein the single tag target is a first single tag target, the method further comprising:
    - re-training the machine learning model based on the second set of data objects upon a determination that a number of elements of the second set of data objects matched with a second single tag target from the non-empty set of tag targets has reached the predetermined threshold corresponding to a number of elements of the training set, the first single tag target different from the second single tag target.
  - 14. The method of claim 11, wherein training the machine learning model includes:
    - dividing each data object from the seed set into a set of pages;
      
      producing a set of three-dimensional tensor objects, each tensor object from the set of three-dimensional tensor objects (a) representing a data object from the seed set, and (b) including a first dimension with a value corresponding to a number of pages of that data object, a second dimension with a value corresponding to a page size of that data object, and a third dimension with a vector having a set of values indicating relationships between an indexed term included in that data object and a set of terms from a vocabulary, the page size corresponding to a fixed size memory region;
      
      producing a single stacked tensor by stacking the set of three-dimensional tensor objects along the first dimension of each tensor object from the set of three-dimensional tensor objects;
      
      producing a set of equally sized mini-batches by dividing the single stacked tensor along the first dimension, each mini-batch from the set of equally sized mini-batches containing a same number of pages and corresponding to an equally sized memory region; and
      
      training the machine learning model with the set of equally sized mini-batches.
  - 15. The method of claim 11, wherein re-training the machine learning model includes:
    - calculating a membership score for each data object from the second set of data objects that corresponds to a predicted membership degree with respect to the single tag target;
      
      calculating a probability value for each data object from the second set of data objects such that data objects with positive and lower membership scores have a higher probability for their inclusion in the training set than data objects with positive and higher membership scores, the data objects with positive and lower membership scores predicted as members of a first semantically-distinct data object, the data objects with positive and higher membership scores predicted as members of a second semantically-distinct data object; and
      
      re-training the machine learning model with the training set including data objects based on their respective probabilities.
  - 16. The method of claim 11, wherein training the machine learning model includes:
    - generating, for each data object from the seed set, a sequence of numbers, each number in the sequence of numbers corresponding to a vocabulary index value associated with a non-empty set of terms in a vocabulary.
  - 17. The method of claim 11, wherein training the machine learning model includes:
    - producing a set of two-dimensional tensor objects including a two-dimensional tensor object for each data object from the seed set, each two-dimensional tensor object including a first tensor dimension corresponding to a term index, and a second tensor dimension corresponding to a numeric vector indicating a relationship between the term index and a set of terms from a vocabulary; and
      
      training a convolutional neural network at least in part with the two-dimensional tensor objects.
  - 18. The method of claim 11, wherein the membership relation of the number of data objects from the seed set having the membership relation with the single tag target from the non-empty set of tag targets indicates a positive membership relation.
  - 19. The method of claim 11, wherein the membership relation of the at least one data object predicted by the machine learning model as having the membership relation with the single tag target indicates a positive membership relation.
  - 20. The method of claim 11, wherein the membership relation of the at least one data object predicted by the machine learning model as having the membership relation with the single tag target indicates a negative membership relation.
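
Claims 5 and 15 require only that positively scored data objects with lower membership scores be more likely to enter the training set than those with higher scores; they do not state a formula. The sketch below picks one simple rule that satisfies that ordering. The linear falloff, the `floor` parameter, the assumption that positive scores lie in (0, 1], and both function names are hypothetical choices made for illustration.

```python
import random


def inclusion_probabilities(scores, floor=0.1):
    """Map each positive membership score in (0, 1] to an inclusion probability.

    Assumed rule (not from the claims): the probability falls linearly from 1.0
    toward `floor` as the score rises, so lower-scored positives are favored.
    Non-positive scores are left out because the claims do not address them.
    """
    return {
        object_id: 1.0 - (1.0 - floor) * score
        for object_id, score in scores.items()
        if score > 0
    }


def sample_training_set(scores, rng, floor=0.1):
    """Keep each positively scored object with its own inclusion probability."""
    probabilities = inclusion_probabilities(scores, floor)
    return [object_id for object_id, p in probabilities.items() if rng.random() < p]
```

Passing a seeded `random.Random` instance as `rng` makes a sampling run repeatable, which helps when comparing successive re-training rounds.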

21. An apparatus, comprising:
- a processor; and
  
  a memory storing instructions which, when executed by the processor, causes the processors to;
  
  execute a machine-assisted iterative search over a data corpus via an asynchronous and interactive machine learning system;
  
  receive, via a user interface, a first series of tag signals, each tag signal from the first series indicating a membership relation between at least one data object from a data corpus and at least one tag target from a non-empty set of tag targets;
  
  the code to execute includes;
  
  select a seed set from a first set of data objects upon a determination that a number of data objects from the first set of data objects having a membership relation with a single tag target from the non-empty set of tag targets has reached a predetermined threshold corresponding to a number of elements of a training set;
  
  divide each data object from the seed set into a set of pages each page from the set of pages having a page size corresponding to a fixed size memory region;
  
  produce a set of three-dimensional tensor objects, each tensor object from the set of three-dimensional tensor objects (a) representing a data object from the seed set, and (b) including (i) a first dimension with a value corresponding to a number of pages of that data object, (ii) a second dimension with a value corresponding to a page size of that data object, and (iii) a third dimension with a vector having a set of values indicating relationships between an indexed term included in that data object and a set of terms from a vocabulary, the page size corresponding to a fixed size memory region;
  
  produce a single tensor by stacking the set of three-dimensional tensor objects along the first dimension of each tensor object from the set of three-dimensional tensor objects;
  
  produce a set of equally sized mini-batches by dividing the single stacked tensor along the first dimension, each mini-batch from the set of equally sized mini-batches containing a same number of pages and corresponding to an equally sized memory region; and
  
  train the machine learning model with the set of equally sized mini-batches to identify further data objects from the data corpus predicted to have a membership relation with the single tag target;
  
  receive, via the user interface, a second series of tag signals, each tag signal from the second series indicating a membership relation between at least one data object from a second set of data objects and at least one tag target from the non-empty set of tag targets, the second set of data objects includes at least one data object predicted by the machine learning model as having a membership relation with the single tag target;
  
  the code to execute includes;
  
  re-train the machine learning model based on the second set of data objects;
  
  display at the user interface, via the asynchronous and interactive machine learning system and based on the re-trained machine learning model, a document object from the data corpus with a magnitude value corresponding to a membership degree between the document object and at least one tag target from the non-empty set of tag targets; and
  
  enable a user, via the asynchronous and interactive machine learning system, to provide feedback to the machine learning model via an accept input, a dismiss input, an input to modify sections in the document object or an input to modify magnitude values corresponding to membership degrees causing the machine learning model to improve based on the feedback.
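
The paging and batching steps shared by claims 4, 14, and 21 (with the vocabulary indexing of claims 6 and 16) can be sketched with nested lists standing in for tensors. Everything named here is an assumption made for illustration: the helper names, the use of index 0 both for padding and for unknown terms, the lookup table `embeddings` as the source of the per-term vectors, and the choice to drop leftover pages that do not fill a whole mini-batch.

```python
def to_index_sequence(tokens, vocabulary):
    """Map each token to its vocabulary index; unknown tokens map to index 0 (assumption)."""
    return [vocabulary.get(token, 0) for token in tokens]


def paginate(indices, page_size, pad_index=0):
    """Split an index sequence into fixed-size pages, padding the last page."""
    pages = [indices[i:i + page_size] for i in range(0, len(indices), page_size)]
    if pages:
        pages[-1] = pages[-1] + [pad_index] * (page_size - len(pages[-1]))
    return pages


def embed_document(tokens, vocabulary, embeddings, page_size):
    """Build a (pages x page_size x vector length) nested list for one document."""
    pages = paginate(to_index_sequence(tokens, vocabulary), page_size)
    return [[embeddings[index] for index in page] for page in pages]


def make_mini_batches(documents, pages_per_batch):
    """Stack document tensors along the page axis, then cut equally sized mini-batches."""
    stacked = [page for document in documents for page in document]
    usable = len(stacked) - len(stacked) % pages_per_batch  # assumption: drop the ragged tail
    return [stacked[i:i + pages_per_batch] for i in range(0, usable, pages_per_batch)]
```

Because every page has the same size and every mini-batch holds the same number of pages, each mini-batch occupies the same amount of memory regardless of how long the source documents are, which is the property the claims tie to a fixed size memory region.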

Current Assignee: CS Disco, Inc.
Original Assignee: CS Disco, Inc.
Inventors: Lockett, Alan
Primary Examiner(s): Misir, Dave
Application Number: US15/635,361
Time in Patent Office: 426 Days
Field of Search: 706 11
US Class Current
CPC Class Codes

G06N 20/00   Machine learning

G06N 3/044   Recurrent networks, e.g. Ho...

G06N 3/045   Combinations of networks

G06N 3/084   Backpropagation, e.g. using...
