Template-free extraction of data from documents

US 10,019,535 B1
Filed: 08/06/2013
Issued: 07/10/2018
Est. Priority Date: 08/06/2013
Status: Active Grant

Abstract

The disclosed embodiments provide a system that processes data. During operation, the system obtains text from a document associated with a user. Next, the system applies a set of rules to each word in the text to determine a context associated with the word. The system then extracts data associated with the context from the text. Finally, the system enables use of the data with one or more applications without requiring manual input of the data into the one or more applications.
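
The flow described in the abstract can be illustrated with a minimal sketch. It is not the patented implementation: the rule names (`date`, `amount`, `ssn`), the regular expressions, and the sample text are all assumptions made for illustration. Each word is tested against the rules, the first matching rule supplies the context, and matching words are collected under that context with no template-specific code.

```python
import re

# Hypothetical rule set: each rule pairs a context label with a regular
# expression. Neither the labels nor the patterns come from the patent.
RULES = [
    ("date", re.compile(r"\d{2}/\d{2}/\d{4}")),
    ("amount", re.compile(r"\$\d+(?:,\d{3})*(?:\.\d{2})?")),
    ("ssn", re.compile(r"\d{3}-\d{2}-\d{4}")),
]


def determine_context(word):
    """Return the context of the first rule that matches the whole word."""
    for context, pattern in RULES:
        if pattern.fullmatch(word):
            return context
    return None


def extract(text):
    """Apply the rules to every word and group matching words by context."""
    data = {}
    for word in text.split():
        context = determine_context(word)
        if context is not None:
            data.setdefault(context, []).append(word)
    return data


sample = "Invoice date 08/06/2013 Total due $1,250.00 thank you"
result = extract(sample)
print(result)
```

The surrounding words ("Invoice", "Total due", and so on) are never stripped out; they simply fail to match any rule, which mirrors the "without removing any of the obtained text" language used in the claims.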

20 Claims

1. A computer-implemented method for processing data, comprising:
- obtaining text from a document associated with a user, wherein the document was generated based on a template and includes template text;
- without removing any of the obtained text, applying a set of rules to each term in the obtained text to determine a context associated with the term, wherein the determined context includes a category and at least one of the rules specifies a regular expression for a character sequence matching the determined context;
- applying an additional set of rules to refine a broad category of a plurality of terms in the obtained text to a refined category of fewer terms based on a location in the document of at least one term in the broad category of the plurality of terms;
- extracting one or more terms from the obtained text without removing any of the template text from the obtained text and without extracting the one or more terms using code developed to process only documents generated based on the template;
- storing each extracted term in one of a plurality of data elements according to the determined context; and
- enabling use of the plurality of data elements with one or more applications without requiring manual input of the extracted terms into the one or more applications.
  - 2. The computer-implemented method of claim 1, further comprising:
    - obtaining a modification to the determined context for one of the extracted terms from the user; and
    - using the modification to update the set of rules.
  - 3. The computer-implemented method of claim 2, wherein obtaining the modification to the determined context for the one of the extracted terms from the user involves:
    - obtaining an updated location in the document of the one of the extracted terms.
  - 4. The computer-implemented method of claim 1, wherein applying the set of rules to each term in the obtained text to determine the context associated with the term involves:
    - categorizing the term based on at least one of a character type and a character sequence in the term; and
    - determining the context based on the categorized term and a categorization of one or more terms in proximity to the term.
  - 5. The computer-implemented method of claim 4, wherein applying the set of rules to each term in the obtained text to determine the context associated with the term further involves:
    - determining the context based on a location of the term in the document.
  - 6. The computer-implemented method of claim 4, wherein the character type is at least one of:
    - a numeric character type;
    - an alphabetic character type;
    - an alphanumeric character type; and
    - a special character type.
  - 7. The computer-implemented method of claim 1, further comprising:
    - creating, for each data element, one or more tags representing the context.
  - 8. The computer-implemented method of claim 7, wherein enabling use of each data element with the one or more applications without requiring manual input of the extracted terms into the one or more applications involves:
    - obtaining, from an application, a request for data associated with a tag from the one or more tags;
    - matching the tag to one of the data elements; and
    - providing the one of the data elements to the application.
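
Claims 4 through 6 describe categorizing a term by character type and then using the categorization of nearby terms to determine context. The sketch below shows one way this could look. The four type names follow claim 6, but the classification logic and the neighbor rule (a numeric term takes the preceding alphabetic term as its context) are assumptions chosen for illustration, not rules disclosed in the patent.

```python
def character_type(term):
    """Classify a term as numeric, alphabetic, alphanumeric, or special."""
    if term.isdigit():
        return "numeric"
    if term.isalpha():
        return "alphabetic"
    if term.isalnum():
        return "alphanumeric"
    # Anything containing punctuation or symbols falls through to here.
    return "special"


def context_from_neighbors(terms, index):
    """Hypothetical proximity rule: a numeric term that directly follows an
    alphabetic term takes that term (lowercased) as its context."""
    if character_type(terms[index]) == "numeric" and index > 0:
        previous = terms[index - 1]
        if character_type(previous) == "alphabetic":
            return previous.lower()
    return None


terms = "Wages 52000 Box 1 ref A7 #".split()
types = [character_type(t) for t in terms]
print(list(zip(terms, types)))
print(context_from_neighbors(terms, 1))
```

A location-based rule in the sense of claim 5 could be layered on top by passing each term's position in the document alongside its text.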

9. A system for processing data, comprising:
- a memory;
- a processor; and
- a non-transitory computer-readable storage medium storing instructions that, when executed on the processor, cause the processor to instantiate:
  - a document-processing apparatus configured to obtain text from a document associated with a user, wherein the document was generated based on a template and includes template text;
  - an extraction apparatus configured to:
    - without removing any of the obtained text, apply a set of rules to each term in the obtained text to determine a context associated with the term, wherein the determined context includes a category and at least one of the rules specifies a regular expression for a character sequence matching the determined context;
    - apply an additional set of rules to refine a broad category of a plurality of terms in the obtained text to a refined category of fewer terms based on a location in the document of at least one term in the broad category of the plurality of terms;
    - extract one or more terms from the obtained text without removing any of the template text from the obtained text and without extracting the one or more terms using code developed to process only documents generated based on the template; and
    - store each extracted term in one of a plurality of data elements according to the determined context; and
  - a management apparatus configured to enable use of the plurality of data elements with one or more applications without requiring manual input of the extracted terms into the one or more applications.
  - 10. The system of claim 9, wherein the extraction apparatus is further configured to:
    - obtain a modification to the determined context for one of the extracted terms from the user; and
    - use the modification to update the set of rules.
  - 11. The system of claim 9, wherein applying the set of rules to each term in the obtained text to determine the context associated with the term involves:
    - categorizing the term based on at least one of a character type and a character sequence in the term; and
    - determining the context based on at least one of the categorized term, a categorization of one or more terms in proximity to the term, and a location of the term in the document.
  - 12. The system of claim 11, wherein the character type is at least one of:
    - a numeric character type;
    - an alphabetic character type;
    - an alphanumeric character type; and
    - a special character type.
  - 13. The system of claim 9, wherein the extraction apparatus is further configured to:
    - create, for each data element, one or more tags representing the context.
  - 14. The system of claim 13, wherein enabling use of each data element with the one or more applications without requiring manual input of the extracted terms into the one or more applications involves:
    - obtaining, from an application, a request for data associated with a tag from the one or more tags;
    - matching the tag to one of the data elements; and
    - providing the one of the data elements to the application.
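
Claims 13 and 14 describe tagging each data element with its context and serving application requests by tag. A minimal sketch of that lookup follows. The `DataElement` and `DataStore` classes, their method names, and the tag strings are hypothetical and do not appear in the patent.

```python
from dataclasses import dataclass, field


@dataclass
class DataElement:
    """One extracted term plus the tags that represent its context."""
    value: str
    tags: set = field(default_factory=set)


class DataStore:
    """Hypothetical stand-in for the management apparatus."""

    def __init__(self):
        self._elements = []

    def add(self, value, context):
        # Create a tag representing the context for the new data element.
        element = DataElement(value, {context})
        self._elements.append(element)
        return element

    def request(self, tag):
        # Match the requested tag to a data element and hand it back;
        # return None when no element carries that tag.
        for element in self._elements:
            if tag in element.tags:
                return element
        return None


store = DataStore()
store.add("08/06/2013", "invoice_date")
store.add("$1,250.00", "total_due")
match = store.request("total_due")
print(match)
```

An application asking for `total_due` receives the stored value directly, which is the sense in which the claims avoid manual input of the extracted terms.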

15. A non-transitory computer-readable storage medium storing instructions that when executed by a computer cause the computer to perform a method for processing data, the method comprising:
- obtaining text from a document associated with a user, wherein the document was generated based on a template and includes template text;
- without removing any of the obtained text, applying a set of rules to each term in the obtained text to determine a context associated with the term, wherein the determined context includes a category and at least one of the rules specifies a regular expression for a character sequence matching the determined context;
- applying an additional set of rules to refine a broad category of a plurality of terms in the obtained text to a refined category of fewer terms based on a location in the document of at least one term in the broad category of the plurality of terms;
- extracting one or more terms from the obtained text without removing any of the template text from the obtained text and without extracting the one or more terms using code developed to process only documents generated based on the template;
- storing each extracted term in one of a plurality of data elements according to the determined context; and
- enabling use of the plurality of data elements with one or more applications without requiring manual input of the extracted terms into the one or more applications.
  - 16. The non-transitory computer-readable storage medium of claim 15, the method further comprising:
    - obtaining a modification to the determined context for one of the extracted terms from the user; and
    - using the modification to update the set of rules.
  - 17. The non-transitory computer-readable storage medium of claim 15, wherein applying the set of rules to each term in the obtained text to determine the context associated with the term involves:
    - categorizing the term based on at least one of a character type and a character sequence in the term; and
    - determining the context based on at least one of the categorized term, a categorization of one or more terms in proximity to the term, and a location of the term in the document.
  - 18. The non-transitory computer-readable storage medium of claim 17, wherein the character type is at least one of:
    - a numeric character type;
    - an alphabetic character type;
    - an alphanumeric character type; and
    - a special character type.
  - 19. The non-transitory computer-readable storage medium of claim 15, the method further comprising:
    - creating, for each data element, one or more tags representing the context.
  - 20. The non-transitory computer-readable storage medium of claim 19, wherein enabling use of each data element with the one or more applications without requiring manual input of the extracted terms into the one or more applications involves:
    - obtaining, from an application, a request for data associated with a tag from the one or more tags;
    - matching the tag to one of the data elements; and
    - providing the one of the data elements to the application.
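
The refinement step recited in claim 15 (and likewise in claims 1 and 9) narrows a broad category to a refined category of fewer terms based on location in the document. The sketch below assumes that location is a line number and that the refinement rule is a simple line threshold. Both are illustrative assumptions, as are the `Term` class and the category names.

```python
from dataclasses import dataclass


@dataclass
class Term:
    text: str
    line: int      # hypothetical location: line number within the document
    category: str  # broad category assigned by the first set of rules


def refine_by_location(terms, broad, refined, max_line):
    """Hypothetical additional rule: terms in the broad category that appear
    on or before max_line are relabeled with the refined category."""
    result = []
    for term in terms:
        if term.category == broad and term.line <= max_line:
            result.append(Term(term.text, term.line, refined))
        else:
            result.append(term)
    return result


terms = [
    Term("08/06/2013", 2, "date"),
    Term("09/05/2013", 14, "date"),
    Term("$1,250.00", 15, "amount"),
]
refined = refine_by_location(terms, "date", "invoice_date", max_line=5)
print([(t.text, t.category) for t in refined])
```

Two terms share the broad `date` category, but only the one near the top of the document ends up in the refined `invoice_date` category, so the refined category holds fewer terms than the broad one.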

Current Assignee: Intuit, Inc.
Original Assignee: Intuit, Inc.
Inventors: Madhani, Sunil H.; Sreepathy, Anu; Kakkar, Samir Revti
Primary Examiner(s): Bibbee, Jared M
Application Number: US13/960,093
Time in Patent Office: 1,799 Days
CPC Class Codes:
G06F 16/313 Selection or weighting of t...
G06F 16/90 Details of database functio...
