System and method for selecting data sample groups for machine learning of context of data fields for various document types and/or for test data generation for quality assurance systems
First Claim
1. A computing system implemented method for efficiently learning new forms in an electronic document preparation system, the method comprising:
receiving form data related to a new form having a plurality of data fields;
gathering training set data related to previously filled forms, each previously filled form having one or more completed data fields that correspond to a respective data field of the new form;
deleting from the training set data one or more sets of data of a previously filled form where a first set of data of the previously filled form matched a second set of data of the previously filled form and the deleted training set data includes the second set of data;
generating, for a first selected data field, dependency data indicating one or more possible dependencies for an acceptable function, the possible dependencies including one or more data fields of the new form other than the first selected data field, the possible dependencies further including one or more constants of the first selected data field, the possible dependencies further including one or more values of data fields from a form other than the new form;
generating, for a first selected data field of the plurality of data fields of the new form and based on the dependency data, candidate function data including a plurality of candidate functions;
generating, for the first selected data field and based on the dependency data, grouping data by forming a plurality of groups from the training set data based on respective categories and assigning each of a plurality of the previously filled forms to a respective one of the groups based on the categories;
generating, for the first selected data field, sampling data by selecting one or more previously filled forms from each group;
generating, for each candidate function, test data by applying the candidate function to a portion of the training set data corresponding to the sampling data related to the candidate function;
identifying one or more candidate functions of the plurality of candidate functions that have associated test data that are a best match to the training set data as compared with other candidate functions of the plurality of candidate functions;
generating one or more additional candidate functions, the additional candidate functions being based on the identified one or more candidate functions that have associated test data that are a best match;
repeatedly identifying generated candidate functions that have associated test data that are a best match to the training set data and generating one or more additional candidate functions, the additional candidate functions being based on the identified one or more candidate functions that have associated test data that are a best match until one or more candidate functions are determined to have associated test data that matches the training set data with a predetermined tolerance;
identifying, from the plurality of candidate functions, an acceptable function for the first selected data field by comparing the test data to the training set data and identifying test data that matches the training set data within a predetermined tolerance, the identified acceptable function being a candidate function associated with the matching test data; and
generating and outputting results data indicating the acceptable function for the first data field of the new form.
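The identify-and-regenerate loop recited above can be illustrated with a minimal sketch. It assumes, purely for illustration, that each candidate function is an integer-weighted sum of the dependency fields, that "best match" means lowest mean absolute difference, and that additional candidates are produced by nudging one weight at a time. The claim does not specify any of these choices, and the names `find_acceptable`, `mismatch`, `neighbours`, and the `line1`/`line2`/`line3` fields are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Form = Dict[str, float]
Coeffs = Tuple[int, ...]  # assumed candidate encoding: one integer weight per dependency field


def apply_candidate(coeffs: Coeffs, deps: List[str], form: Form) -> float:
    """Test data for one form: the weighted sum of its dependency fields."""
    return sum(w * form[d] for w, d in zip(coeffs, deps))


def mismatch(coeffs: Coeffs, deps: List[str], target: str, sample: List[Form]) -> float:
    """Mean absolute difference between test data and the recorded field values."""
    total = sum(abs(apply_candidate(coeffs, deps, f) - f[target]) for f in sample)
    return total / len(sample)


def neighbours(coeffs: Coeffs) -> List[Coeffs]:
    """Additional candidates based on a well-scoring one: change one weight by +/-1."""
    out = []
    for i in range(len(coeffs)):
        for step in (-1, 1):
            changed = list(coeffs)
            changed[i] += step
            out.append(tuple(changed))
    return out


def find_acceptable(deps: List[str], target: str, sample: List[Form],
                    tolerance: float = 0.01, keep: int = 2,
                    max_rounds: int = 20) -> Optional[Coeffs]:
    candidates = {tuple(0 for _ in deps)}
    for _ in range(max_rounds):
        # Rank candidates by how closely their test data matches the training data.
        ranked = sorted(candidates, key=lambda c: (mismatch(c, deps, target, sample), c))
        best = ranked[0]
        # Accept once every sampled form matches within the predetermined tolerance.
        if all(abs(apply_candidate(best, deps, f) - f[target]) <= tolerance for f in sample):
            return best
        # Otherwise keep the best few and generate additional candidates from them.
        candidates = set(ranked[:keep])
        for c in ranked[:keep]:
            candidates.update(neighbours(c))
    return None


# Hypothetical sampled forms in which line3 was filled in as line1 - line2.
sample = [
    {"line1": 100, "line2": 30, "line3": 70},
    {"line1": 50, "line2": 50, "line3": 0},
    {"line1": 10, "line2": 40, "line3": -30},
]
result = find_acceptable(["line1", "line2"], "line3", sample)
```

Here `result` holds the weights for `line1` and `line2` of the accepted candidate, or `None` if no candidate matched within the round limit. Keeping more than one candidate per round (`keep=2`) is a design choice in this sketch: it lets the search move past a candidate that scores well on average but cannot be improved by a single change.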
Abstract
A method and system learn new forms to be incorporated into an electronic document preparation system. The method and system receive form data related to a new form having a plurality of data fields that expect data values based on specific functions. The method and system gather training set data including previously filled forms having completed data fields corresponding to the data fields of the new form. The method and system group the training set data into groups and sample the groups. The method and system utilize machine learning in conjunction with the sampled training set data to identify an acceptable function for each of the data fields of the new form. The grouped and sampled training set data can also be passed to a quality assurance system.
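The grouping and sampling described in the abstract, together with the removal of matching data sets recited in the claims, might look like the following sketch. The category used here (whether each dependency field is negative, zero, or positive) is an assumption chosen for illustration, as are the function names and the `line1`/`line2` fields; the text above does not say how categories are defined.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple

Form = Dict[str, float]


def deduplicate(forms: List[Form], fields: List[str]) -> List[Form]:
    """Drop forms whose values for the given fields match an earlier form."""
    seen, unique = set(), []
    for form in forms:
        key = tuple(form.get(f) for f in fields)
        if key not in seen:
            seen.add(key)
            unique.append(form)
    return unique


def category(form: Form, fields: List[str]) -> Tuple[str, ...]:
    """Assumed category: sign pattern of the dependency fields."""
    def sign(value: float) -> str:
        return "neg" if value < 0 else "zero" if value == 0 else "pos"
    return tuple(sign(form[f]) for f in fields)


def group_and_sample(forms: List[Form], fields: List[str],
                     per_group: int = 1, seed: int = 0):
    """Assign each form to a group by category, then draw forms from every group."""
    groups = defaultdict(list)
    for form in forms:
        groups[category(form, fields)].append(form)
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    sample = []
    for key in sorted(groups):
        members = groups[key]
        sample.extend(rng.sample(members, min(per_group, len(members))))
    return dict(groups), sample


# Hypothetical previously filled forms; the second entry repeats the first.
forms = [
    {"line1": 5, "line2": 0},
    {"line1": 5, "line2": 0},
    {"line1": 7, "line2": 3},
    {"line1": 0, "line2": 0},
    {"line1": 9, "line2": 1},
]
unique = deduplicate(forms, ["line1", "line2"])
groups, sample = group_and_sample(unique, ["line1", "line2"])
```

Sampling from every group rather than from the pooled data keeps rare categories represented, so a candidate function is tested against each kind of case even when most forms fall into one group. The same `sample` could be handed to a quality assurance system as test data.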
62 Citations
31 Claims
1. A computing system implemented method for efficiently learning new forms in an electronic document preparation system, the method comprising:
receiving form data related to a new form having a plurality of data fields;
gathering training set data related to previously filled forms, each previously filled form having one or more completed data fields that correspond to a respective data field of the new form;
deleting from the training set data one or more sets of data of a previously filled form where a first set of data of the previously filled form matched a second set of data of the previously filled form and the deleted training set data includes the second set of data;
generating, for a first selected data field, dependency data indicating one or more possible dependencies for an acceptable function, the possible dependencies including one or more data fields of the new form other than the first selected data field, the possible dependencies further including one or more constants of the first selected data field, the possible dependencies further including one or more values of data fields from a form other than the new form;
generating, for a first selected data field of the plurality of data fields of the new form and based on the dependency data, candidate function data including a plurality of candidate functions;
generating, for the first selected data field and based on the dependency data, grouping data by forming a plurality of groups from the training set data based on respective categories and assigning each of a plurality of the previously filled forms to a respective one of the groups based on the categories;
generating, for the first selected data field, sampling data by selecting one or more previously filled forms from each group;
generating, for each candidate function, test data by applying the candidate function to a portion of the training set data corresponding to the sampling data related to the candidate function;
identifying one or more candidate functions of the plurality of candidate functions that have associated test data that are a best match to the training set data as compared with other candidate functions of the plurality of candidate functions;
generating one or more additional candidate functions, the additional candidate functions being based on the identified one or more candidate functions that have associated test data that are a best match;
repeatedly identifying generated candidate functions that have associated test data that are a best match to the training set data and generating one or more additional candidate functions, the additional candidate functions being based on the identified one or more candidate functions that have associated test data that are a best match until one or more candidate functions are determined to have associated test data that matches the training set data with a predetermined tolerance;
identifying, from the plurality of candidate functions, an acceptable function for the first selected data field by comparing the test data to the training set data and identifying test data that matches the training set data within a predetermined tolerance, the identified acceptable function being a candidate function associated with the matching test data; and
generating and outputting results data indicating the acceptable function for the first data field of the new form.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18
19. A system for efficiently learning new forms in an electronic document preparation system, the system comprising:
at least one processor; and
at least one memory coupled to the at least one processor, the at least one memory having stored therein instructions which, when executed by any set of the at least one processors, perform a process including:
receiving, with an interface module of a computing system, form data related to a new form having a plurality of data fields;
gathering, with a data acquisition module of a computing system, training set data related to previously filled forms, each previously filled form having completed data fields that each correspond to a respective data field of the new form;
deleting from the training set data one or more sets of data of a previously filled form where a first set of data of the previously filled form matched a second set of data of the previously filled form and the deleted training set data includes the second set of data;
generating, for a first selected data field, dependency data indicating one or more possible dependencies for an acceptable function, the possible dependencies including one or more data fields of the new form other than the first selected data field, the possible dependencies further including one or more constants of the first selected data field, the possible dependencies further including one or more values of data fields from a form other than the new form;
generating, with a grouping module of a computing system and for a first selected data field of the new form and based on the dependency data, grouping data by forming a plurality of groups from the training set data based on respective categories and assigning each of a plurality of the previously filled forms to a respective one of the groups;
generating, with a sampling module of a computing system, sampling data by selecting one or more previously filled forms from each group;
generating, with a machine learning module of a computing system, for the first selected data field and based on the dependency data, candidate function data including a plurality of candidate functions;
generating, with the machine learning module and for each candidate function, test data by applying the candidate function to a portion of the training set data corresponding to the sampling data;
identifying one or more candidate functions of the plurality of candidate functions that have associated test data that are a best match to the training set data as compared with other candidate functions of the plurality of candidate functions;
generating one or more additional candidate functions, the additional candidate functions being based on the identified one or more candidate functions that have associated test data that are a best match;
repeatedly identifying generated candidate functions that have associated test data that are a best match to the training set data and generating one or more additional candidate functions, the additional candidate functions being based on the identified one or more candidate functions that have associated test data that are a best match until one or more candidate functions are determined to have associated test data that matches the training set data with a predetermined tolerance;
identifying, with the machine learning module and from the plurality of candidate functions, an acceptable candidate for the first selected data field, by comparing the test data to the training set data and identifying test data that matches the training set data within a predetermined tolerance, the identified acceptable function being a candidate function associated with the matching test data;
generating, with the machine learning module, results data indicating the acceptable function for the first data field of the new form; and
outputting, with the interface module, the results data.
Dependent claims: 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31
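The dependency data recited in both independent claims names three kinds of possible dependency: other data fields of the new form, constants, and values of data fields from another form. One hypothetical way to represent that data and derive an initial set of candidate functions from it is sketched below. The `DependencyData` class, the operator set, and field names such as `other_form.line4` are assumptions for illustration and do not come from the claims.

```python
from dataclasses import dataclass, field
from itertools import permutations
from typing import Callable, Dict, List, Tuple

Values = Dict[str, float]
Candidate = Tuple[str, Callable[[Values], float]]  # (description, function)


@dataclass
class DependencyData:
    target: str                                   # the selected data field
    same_form_fields: List[str]                   # other fields of the new form
    constants: List[float] = field(default_factory=list)
    other_form_fields: List[str] = field(default_factory=list)


# Assumed operator vocabulary for building candidate functions.
OPS: Dict[str, Callable[[float, float], float]] = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "min": min,
    "max": max,
}


def operand_getters(dep: DependencyData) -> List[Candidate]:
    """One (label, getter) pair per possible dependency."""
    names = dep.same_form_fields + dep.other_form_fields
    getters = [(n, (lambda v, n=n: v[n])) for n in names]
    getters += [(str(c), (lambda v, c=c: c)) for c in dep.constants]
    return getters


def initial_candidates(dep: DependencyData) -> List[Candidate]:
    """Combine every ordered pair of dependencies with every operator."""
    out = []
    for (la, ga), (lb, gb) in permutations(operand_getters(dep), 2):
        for sym, op in OPS.items():
            out.append((f"{sym}({la}, {lb})",
                        lambda v, ga=ga, gb=gb, op=op: op(ga(v), gb(v))))
    return out


dep = DependencyData(
    target="line3",
    same_form_fields=["line1", "line2"],
    constants=[3000.0],
    other_form_fields=["other_form.line4"],
)
candidates = dict(initial_candidates(dep))
values = {"line1": 10, "line2": 4, "other_form.line4": 100}
difference = candidates["-(line1, line2)"](values)
```

Restricting candidates to the listed dependencies keeps the search space small: with four possible dependencies and four operators, this sketch produces 48 two-operand candidates rather than combinations over every field in the training set.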
Specification