Methods and apparatus for asynchronous and interactive machine learning using word embedding within text-based documents and multimodal documents
First Claim
1. A non-transitory medium storing code representing a plurality of processor-executable instructions, the code comprising code to cause the processor to:
execute a machine-assisted iterative search over a data corpus via an asynchronous and interactive machine learning system;
receive, via a user interface, a first series of tag signals, each tag signal from the first series indicating a membership relation between at least one data object from the data corpus and at least one tag target from a non-empty set of tag targets;
the code to execute includes:
select a seed set from a first set of data objects upon a determination that a number of data objects from the first set of data objects having a membership relation with a single tag target from the non-empty set of tag targets has reached a predetermined threshold corresponding to a number of elements of a training set; and
train a machine learning model based on the seed set to identify further data objects from the data corpus predicted to have a membership relation with the single tag target;
receive, via the user interface, a second series of tag signals, each tag signal from the second series indicating a membership relation between at least one data object from a second set of data objects and at least one tag target from the non-empty set of tag targets, the second set of data objects includes at least one data object predicted by the machine learning model as having a membership relation with the single tag target;
the code to execute includes:
calculate a membership score for each data object from the second set of data objects, the membership score corresponding to a predicted membership degree with respect to the single tag target;
divide a membership scale of the single tag target into a number of 2b non-overlapping intervals of equal length with b positive non-overlapping intervals defined by a pair of positive endpoint numbers and b negative non-overlapping intervals defined by a pair of negative endpoint numbers, b corresponding to a number of score buckets of a histogram distribution;
partition the second set of data objects into a number of training subsets equal to 2b+1, the training subsets including:
(1) a training subset having all data objects from the second set of data objects whose membership relation with respect to the single tag target is undefined,
(2) a first set of training subsets with b training subsets, each training subset from the first set of training subsets having data objects with membership scores within a positive non-overlapping interval from the b positive non-overlapping intervals,
(3) a second set of training subsets with b training subsets, each training subset from the second set of training subsets having data objects with membership scores within a negative non-overlapping interval from the b negative non-overlapping intervals; and
re-train the machine learning model based on data objects included in the training subset, the first set of training subsets, and the second set of training subsets;
display at the user interface, via the asynchronous and interactive machine learning system and based on the re-trained machine learning model, a document object from the data corpus with a magnitude value corresponding to a membership degree between the document object and at least one tag target from the non-empty set of tag targets; and
enable a user, via the asynchronous and interactive machine learning system, to provide feedback to the machine learning model via an accept input, a dismiss input, an input to modify sections in the document object or an input to modify magnitude values corresponding to membership degrees causing the machine learning model to improve based on the feedback.
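The partition into 2b+1 training subsets recited above can be illustrated with a minimal sketch. It is not the patented implementation: the function name partition_by_score is hypothetical, and the sketch assumes that membership scores lie in [-1, 1], that an undefined relation is represented by None, and that a score of exactly 0 is also treated as undefined. None of these details are stated in the claim.

```python
import math
from typing import Dict, List, Optional


def partition_by_score(scores: Dict[str, Optional[float]], b: int) -> List[List[str]]:
    """Split object ids into 2b+1 subsets by membership score.

    Index 0 holds objects whose membership relation is undefined.
    Indexes 1..b hold the b positive intervals in ascending order.
    Indexes b+1..2b hold the b negative intervals, ordered from the one
    closest to zero to the most negative one.
    Assumption (not from the claim): scores lie in [-1, 1].
    """
    if b < 1:
        raise ValueError("b must be a positive integer")
    subsets: List[List[str]] = [[] for _ in range(2 * b + 1)]
    for obj_id, score in scores.items():
        if score is None or score == 0:
            subsets[0].append(obj_id)  # undefined membership relation
            continue
        # Interval k (1-based) covers magnitudes in ((k-1)/b, k/b].
        k = min(b, max(1, math.ceil(abs(score) * b)))
        subsets[k if score > 0 else b + k].append(obj_id)
    return subsets


example = partition_by_score(
    {"a": 0.9, "b": 0.5, "c": None, "d": -0.2, "e": -0.75}, b=2
)
```

With b = 2 the scale is cut into four intervals of length 0.5, so five subsets result: one for undefined relations, two positive and two negative.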
Abstract
A machine learning system continuously receives tag signals indicating membership relations between data objects from a data corpus and tag targets. The machine learning system is asynchronously and iteratively trained with the received tag signals to identify further data objects from the data corpus predicted to have a membership relation with a given tag target. The machine learning system constantly improves its predictive accuracy in a short time by continuously training a backend machine learning model on implicit and explicit tag signals gathered through non-intrusive monitoring of user interactions during a review of the data corpus.
21 Claims
11. A method, comprising:
executing a machine-assisted iterative search over a data corpus via an asynchronous and interactive machine learning system;
receiving at a processor, via a user interface, a first series of tag signals, each tag signal from the first series indicating a membership relation between at least one data object from the data corpus and at least one tag target from a non-empty set of tag targets;
selecting a seed set from a first set of data objects upon a determination that a number of data objects from a first set of data objects having a membership relation with a single tag target from the non-empty set of tag targets has reached a predetermined threshold corresponding to a number of elements of a training set;
training a machine learning model based on the seed set to identify further data objects from the data corpus predicted to have a membership relation with the single tag target;
receiving at the processor, via the user interface, a second series of tag signals, each tag signal from the second series indicating a membership relation between at least one data object from a second set of data objects and at least one tag target from the non-empty set of tag targets, the second set of data objects includes at least one data object predicted by the machine learning model as having a membership relation with the single tag target;
calculating a membership score for each data object from the second set of data objects and the membership score corresponding to a predicted membership degree with respect to the single tag target;
dividing a membership scale of the single tag target into a number of 2b non-overlapping intervals of equal length with b positive non-overlapping intervals defined by a pair of positive endpoint numbers and b negative non-overlapping intervals defined by a pair of negative endpoint numbers, b corresponding to a number of score buckets of a histogram distribution;
partitioning the second set of data objects into a number of training subsets equal to 2b+1, the training subsets including:
(1) a training subset having all data objects from the second set of data objects whose membership relation with respect to the single tag target is undefined,
(2) a first set of training subsets with b training subsets, each training subset from the first set of training subsets having data objects with membership scores within a positive non-overlapping interval from the b positive non-overlapping intervals,
(3) a second set of training subsets with b training subsets, each training subset from the second set of training subsets having data objects with membership scores within a negative non-overlapping interval from the b negative non-overlapping intervals;
re-training the machine learning model based on data objects included in the training subset, the first set of training subsets, and the second set of training subsets;
displaying at the user interface, via the asynchronous and interactive machine learning system and based on the re-trained machine learning model, a document object from the data corpus with a magnitude value corresponding to a membership degree between the document object and at least one tag target from the non-empty set of tag targets; and
enabling a user, via the asynchronous and interactive machine learning system, to provide feedback to the machine learning model via an accept input, a dismiss input, an input to modify sections in the document object or an input to modify magnitude values corresponding to membership degrees causing the machine learning model to improve based on the feedback.
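The seed-set selection step of the method, in which training starts only once enough data objects carry a single tag target, might look roughly like the sketch below. The names TagSignal and select_seed_set are hypothetical, and representing a tag signal as an (object id, tag target) pair is an assumption made for illustration only.

```python
from typing import Iterable, List, Optional, Tuple

# Hypothetical representation of a tag signal: (data object id, tag target).
TagSignal = Tuple[str, str]


def select_seed_set(
    signals: Iterable[TagSignal], tag_target: str, threshold: int
) -> Optional[List[str]]:
    """Return the seed set once `threshold` distinct objects carry `tag_target`.

    Returns None while the threshold has not been reached, meaning that
    no model would be trained yet.
    """
    tagged: List[str] = []
    seen = set()
    for obj_id, target in signals:
        if target == tag_target and obj_id not in seen:
            seen.add(obj_id)
            tagged.append(obj_id)
            if len(tagged) >= threshold:
                return tagged
    return None


signals = [
    ("d1", "responsive"),
    ("d2", "privileged"),
    ("d1", "responsive"),  # repeated signal for the same object is ignored
    ("d3", "responsive"),
]
seed = select_seed_set(signals, "responsive", threshold=2)
```

Here the threshold plays the role of the predetermined number of training-set elements; a caller would invoke training only when the function returns a list.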
21. An apparatus, comprising:
a processor; and
a memory storing instructions which, when executed by the processor, cause the processor to:
execute a machine-assisted iterative search over a data corpus via an asynchronous and interactive machine learning system;
receive, via a user interface, a first series of tag signals, each tag signal from the first series indicating a membership relation between at least one data object from a data corpus and at least one tag target from a non-empty set of tag targets;
the code to execute includes:
select a seed set from a first set of data objects upon a determination that a number of data objects from the first set of data objects having a membership relation with a single tag target from the non-empty set of tag targets has reached a predetermined threshold corresponding to a number of elements of a training set;
divide each data object from the seed set into a set of pages, each page from the set of pages having a page size corresponding to a fixed size memory region;
produce a set of three-dimensional tensor objects, each tensor object from the set of three-dimensional tensor objects (a) representing a data object from the seed set, and (b) including (i) a first dimension with a value corresponding to a number of pages of that data object, (ii) a second dimension with a value corresponding to a page size of that data object, and (iii) a third dimension with a vector having a set of values indicating relationships between an indexed term included in that data object and a set of terms from a vocabulary, the page size corresponding to a fixed size memory region;
produce a single tensor by stacking the set of three-dimensional tensor objects along the first dimension of each tensor object from the set of three-dimensional tensor objects;
produce a set of equally sized mini-batches by dividing the single stacked tensor along the first dimension, each mini-batch from the set of equally sized mini-batches containing a same number of pages and corresponding to an equally sized memory region; and
train the machine learning model with the set of equally sized mini-batches to identify further data objects from the data corpus predicted to have a membership relation with the single tag target;
receive, via the user interface, a second series of tag signals, each tag signal from the second series indicating a membership relation between at least one data object from a second set of data objects and at least one tag target from the non-empty set of tag targets, the second set of data objects includes at least one data object predicted by the machine learning model as having a membership relation with the single tag target;
the code to execute includes:
re-train the machine learning model based on the second set of data objects;
display at the user interface, via the asynchronous and interactive machine learning system and based on the re-trained machine learning model, a document object from the data corpus with a magnitude value corresponding to a membership degree between the document object and at least one tag target from the non-empty set of tag targets; and
enable a user, via the asynchronous and interactive machine learning system, to provide feedback to the machine learning model via an accept input, a dismiss input, an input to modify sections in the document object or an input to modify magnitude values corresponding to membership degrees causing the machine learning model to improve based on the feedback.
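The paging, stacking, and mini-batching steps of the apparatus claim can be pictured with a small sketch using plain nested lists in place of a tensor library. The helper names paginate and make_mini_batches are hypothetical. Zero-padding the last page of a document and padding the stacked tensor with all-zero pages so that it divides evenly are assumptions of this sketch; the claim does not say how uneven sizes are handled.

```python
from typing import List

Vector = List[float]    # one term's relationships to vocabulary terms
Page = List[Vector]     # page_size x vocab_dim
Tensor3D = List[Page]   # num_pages x page_size x vocab_dim


def paginate(term_vectors: List[Vector], page_size: int, vocab_dim: int) -> Tensor3D:
    """Cut one document into fixed-size pages, zero-padding the last page."""
    pages: Tensor3D = []
    for start in range(0, len(term_vectors), page_size):
        page = [list(v) for v in term_vectors[start:start + page_size]]
        page += [[0.0] * vocab_dim for _ in range(page_size - len(page))]
        pages.append(page)
    return pages


def make_mini_batches(
    docs: List[Tensor3D], pages_per_batch: int, page_size: int, vocab_dim: int
) -> List[Tensor3D]:
    """Stack document tensors along the page axis, then split into equal batches."""
    stacked: Tensor3D = [page for doc in docs for page in doc]
    # Assumption: pad with all-zero pages so every batch has the same page count.
    while len(stacked) % pages_per_batch:
        stacked.append([[0.0] * vocab_dim for _ in range(page_size)])
    return [
        stacked[i:i + pages_per_batch]
        for i in range(0, len(stacked), pages_per_batch)
    ]


doc_a = paginate([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]], page_size=2, vocab_dim=2)
doc_b = paginate([[0.7, 0.8], [0.9, 1.0]], page_size=2, vocab_dim=2)
batches = make_mini_batches([doc_a, doc_b], pages_per_batch=2, page_size=2, vocab_dim=2)
```

Because every page has the same shape, every mini-batch holds the same number of values, which is one way to read the claim's "equally sized memory region".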
Specification