Template-free extraction of data from documents
First Claim
1. A computer-implemented method for processing data, comprising:
obtaining text from a document associated with a user, wherein the document was generated based on a template and includes template text;
without removing any of the obtained text, applying a set of rules to each term in the obtained text to determine a context associated with the term, wherein the determined context includes a category and at least one of the rules specifies a regular expression for a character sequence matching the determined context;
applying an additional set of rules to refine a broad category of a plurality of terms in the obtained text to a refined category of fewer terms based on a location in the document of at least one term in the broad category of the plurality of terms;
extracting one or more terms from the obtained text without removing any of the template text from the obtained text and without extracting the one or more terms using code developed to process only documents generated based on the template;
storing each extracted term in one of a plurality of data elements according to the determined context; and
enabling use of the plurality of data elements with one or more applications without requiring manual input of the extracted terms into the one or more applications.
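The steps recited in the claim can be illustrated with a minimal sketch. It is not the patented implementation: the rule table, the category names (`amount`, `date`, `total_due`), the sample invoice, and the choice of "same line as a Total label" as the location test are all assumptions made for illustration. The sketch classifies every term with regular-expression rules, refines a broad category by document location, and stores the extracted terms in data elements, without deleting any text and without template-specific code.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Term:
    text: str
    line: int                       # zero-based line index within the document
    category: Optional[str] = None  # context assigned by the rules, if any


# Hypothetical context rules: each pairs a category with a regular expression.
CONTEXT_RULES = [
    ("amount", re.compile(r"\$?\d{1,3}(?:,\d{3})*\.\d{2}")),
    ("date", re.compile(r"\d{2}/\d{2}/\d{4}")),
]


def classify(text: str) -> list:
    """Apply the context rules to every term; no text is removed."""
    terms = []
    for line_no, line in enumerate(text.splitlines()):
        for token in line.split():
            term = Term(token, line_no)
            for category, pattern in CONTEXT_RULES:
                if pattern.fullmatch(token):
                    term.category = category
                    break
            terms.append(term)
    return terms


def refine(terms: list) -> list:
    """Narrow the broad 'amount' category using location (assumed rule):
    an amount on the same line as a 'Total' label becomes 'total_due'."""
    label_lines = {t.line for t in terms if t.text.lower().rstrip(":") == "total"}
    for t in terms:
        if t.category == "amount" and t.line in label_lines:
            t.category = "total_due"
    return terms


def extract(terms: list) -> dict:
    """Store each categorized term in a data element keyed by its context."""
    elements = {}
    for t in terms:
        if t.category is not None:
            elements.setdefault(t.category, []).append(t.text)
    return elements


# Hypothetical document text, including template text such as field labels.
SAMPLE = """ACME Utilities Invoice
Invoice date: 03/14/2015
Service charge: $40.00
Total: $42.50
"""

if __name__ == "__main__":
    print(extract(refine(classify(SAMPLE))))
```

Because the rules look only at character patterns and relative position, the same functions could in principle be run on a differently laid out invoice without change, which is the sense in which the approach is template-free.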
Abstract
The disclosed embodiments provide a system that processes data. During operation, the system obtains text from a document associated with a user. Next, the system applies a set of rules to each word in the text to determine a context associated with the word. The system then extracts data associated with the context from the text. Finally, the system enables use of the data with one or more applications without requiring manual input of the data into the one or more applications.
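The final step of the abstract, making extracted data usable by applications without manual entry, might look like the following sketch. Everything here is hypothetical: the data-element names, the field names of the imagined bill-pay application, and the `FIELD_MAP` table are illustrative assumptions, not part of the disclosure.

```python
import json

# Hypothetical data elements, as an extraction step might have stored them.
data_elements = {
    "payee": "ACME Utilities",
    "due_date": "03/14/2015",
    "total_due": "42.50",
}

# Hypothetical mapping from data-element names to one application's form fields.
FIELD_MAP = {
    "payee": "vendor_name",
    "due_date": "payment_date",
    "total_due": "payment_amount",
}


def prefill(elements: dict, field_map: dict) -> dict:
    """Build an application form payload from stored data elements,
    so the user does not have to retype the extracted values."""
    return {
        field_map[name]: value
        for name, value in elements.items()
        if name in field_map
    }


def export_json(payload: dict) -> str:
    """Serialize the payload so another application could consume it."""
    return json.dumps(payload, sort_keys=True)


if __name__ == "__main__":
    print(export_json(prefill(data_elements, FIELD_MAP)))
```

Keeping the mapping in a table separate from the extraction logic is one way several applications could share the same data elements, each with its own field map.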
20 Claims
1. A computer-implemented method for processing data, comprising:
obtaining text from a document associated with a user, wherein the document was generated based on a template and includes template text;
without removing any of the obtained text, applying a set of rules to each term in the obtained text to determine a context associated with the term, wherein the determined context includes a category and at least one of the rules specifies a regular expression for a character sequence matching the determined context;
applying an additional set of rules to refine a broad category of a plurality of terms in the obtained text to a refined category of fewer terms based on a location in the document of at least one term in the broad category of the plurality of terms;
extracting one or more terms from the obtained text without removing any of the template text from the obtained text and without extracting the one or more terms using code developed to process only documents generated based on the template;
storing each extracted term in one of a plurality of data elements according to the determined context; and
enabling use of the plurality of data elements with one or more applications without requiring manual input of the extracted terms into the one or more applications.
Dependent claims: 2, 3, 4, 5, 6, 7, 8
9. A system for processing data, comprising:
a memory;
a processor; and
a non-transitory computer-readable storage medium storing instructions that, when executed on the processor, cause the processor to instantiate:
a document-processing apparatus configured to obtain text from a document associated with a user, wherein the document was generated based on a template and includes template text;
an extraction apparatus configured to:
without removing any of the obtained text, apply a set of rules to each term in the obtained text to determine a context associated with the term, wherein the determined context includes a category and at least one of the rules specifies a regular expression for a character sequence matching the determined context;
apply an additional set of rules to refine a broad category of a plurality of terms in the obtained text to a refined category of fewer terms based on a location in the document of at least one term in the broad category of the plurality of terms;
extract one or more terms from the obtained text without removing any of the template text from the obtained text and without extracting the one or more terms using code developed to process only documents generated based on the template; and
store each extracted term in one of a plurality of data elements according to the determined context; and
a management apparatus configured to enable use of the plurality of data elements with one or more applications without requiring manual input of the extracted terms into the one or more applications.
Dependent claims: 10, 11, 12, 13, 14
15. A non-transitory computer-readable storage medium storing instructions that when executed by a computer cause the computer to perform a method for processing data, the method comprising:
obtaining text from a document associated with a user, wherein the document was generated based on a template and includes template text;
without removing any of the obtained text, applying a set of rules to each term in the obtained text to determine a context associated with the term, wherein the determined context includes a category and at least one of the rules specifies a regular expression for a character sequence matching the determined context;
applying an additional set of rules to refine a broad category of a plurality of terms in the obtained text to a refined category of fewer terms based on a location in the document of at least one term in the broad category of the plurality of terms;
extracting one or more terms from the obtained text without removing any of the template text from the obtained text and without extracting the one or more terms using code developed to process only documents generated based on the template;
storing each extracted term in one of a plurality of data elements according to the determined context; and
enabling use of the plurality of data elements with one or more applications without requiring manual input of the extracted terms into the one or more applications.
Dependent claims: 16, 17, 18, 19, 20
Specification