Information extraction and annotation systems and methods for documents

US 10,387,557 B2
Filed: 05/09/2018
Issued: 08/20/2019
Est. Priority Date: 07/22/2013
Status: Active Grant

Abstract

Information extraction and annotation systems and methods for use in annotating and determining annotation instances are provided herein. Exemplary methods include receiving annotated documents, the annotated documents comprising annotated fields, analyzing the annotated documents to determine contextual information for each of the annotated fields, determining discriminative sequences using the contextual information, generating a proposed rule or a feature set using the discriminative sequences and annotated fields, and providing the proposed rule or the feature set to a document annotator.
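
The "contextual information" step in the abstract can be illustrated with a minimal sketch. It assumes, purely for illustration, that a document is a list of whitespace tokens and that the context of an annotated field is the few tokens immediately to its left. The `AnnotatedField` class, the `left_context` helper, and the `window` parameter are hypothetical names, not taken from the patent.

```python
from dataclasses import dataclass


@dataclass
class AnnotatedField:
    """Hypothetical container for one annotated field in a tokenized document."""
    name: str   # field label, e.g. "invoice_date"
    start: int  # index of the first token of the field value
    end: int    # index one past the last token of the field value


def left_context(tokens, field, window=3):
    """Return up to `window` tokens immediately preceding the annotated field."""
    lo = max(0, field.start - window)
    return tokens[lo:field.start]


tokens = "Please pay by Invoice Date : 2013-07-22 to avoid fees".split()
field = AnnotatedField("invoice_date", start=6, end=7)
context = left_context(tokens, field)
print(context)
```

Collecting such contexts across many annotated documents gives the strings from which discriminative sequences could then be derived.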

21 Claims

1. A method, comprising:
- receiving, by a context analysis module, annotated documents, the annotated documents comprising annotated fields;
  
  analyzing, by the context analysis module, the annotated documents to determine contextual information for each of the annotated fields;
  
  determining discriminative sequences using the contextual information by:
  
  determining, by a contiguity heuristics module, contiguous common subsequences between aligned pairs of strings of the annotated documents;
  
  determining, by the contiguity heuristics module, a frequency of occurrence of similar contiguous common subsequences; and
  
  wherein the contiguity heuristics module generates a proposed rule from contiguous common subsequences having a desired frequency of occurrence;
  
  providing, by the context analysis module, the proposed rule to a document annotator; and
  
  applying, by the document annotator, the proposed rule to a target document to annotate the target document.
- Dependent claims (2-13):
  - 2. The method of claim 1, wherein the discriminative sequences are identified within documents and used to develop both rule-based extractors and feature-based extractors.
  - 3. The method of claim 1, wherein sequence alignment is utilized to identify the discriminative sequences in the context of a specified field.
  - 4. The method of claim 1, wherein the discriminative sequences are used as an identifier providing a context for an annotated field with which they are associated.
  - 5. The method of claim 1, further comprising:
    - executing, by the context analysis module, a base annotation of original documents to create documents with base annotations, the base annotations comprising basic categories of words or groups of characters; and
      
      providing the documents with base annotations to the document annotator via a user interface.
  - 6. The method of claim 5, wherein the contiguous common subsequences are the longest.
  - 7. The method of claim 1, further comprising receiving feedback from the document annotator; and using the feedback to any of: approve the proposed rule, modify the proposed rule, and reject the proposed rule.
  - 8. The method of claim 1, further comprising converting, by a rule-based extractor generator, the proposed rule into a rule-based extractor.
  - 9. The method of claim 8, further comprising applying, by a rule-based annotator, the rule-based extractor to a target document to create the annotated document.
  - 10. The method of claim 1, wherein determining discriminative sequences further comprises:
    - determining, by the contiguity heuristics module, longest contiguous common subsequences between aligned pairs of strings of the annotated documents;
      
      determining, by the contiguity heuristics module, a frequency of occurrence of similar longest contiguous common subsequences; and
      
      wherein the contiguity heuristics module generates the proposed rule from the longest contiguous common subsequences having a desired frequency of occurrence.
  - 11. The method of claim 10, wherein determining longest contiguous common subsequences comprises:
    - aligning pairs of strings having possible contextual matches;
      
      normalizing the pairs of strings by extracting matching segments having a given length; and
      
      aggregating the normalized pairs of strings.
  - 12. The method of claim 11, further comprising applying a greedy contiguity heuristic to the aggregated normalized pairs of strings.
  - 13. The method of claim 12, wherein the greedy contiguity heuristic evaluates any of a number of matching segments, a number of gaps between segments, and variances between segment lengths.
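
Claims 11 to 13 describe aligning string pairs, keeping matching segments of a given length, and evaluating the result by segment count, gaps, and length variance. The sketch below is one possible reading, not the patented implementation: it uses Python's standard `difflib.SequenceMatcher` as a stand-in aligner, and the `matching_segments` and `contiguity_score` helpers, the `min_len` parameter, and the scoring formula are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher
from statistics import pvariance


def matching_segments(a, b, min_len=2):
    """Align two token lists and keep matching segments of at least min_len tokens."""
    blocks = SequenceMatcher(None, a, b, autojunk=False).get_matching_blocks()
    return [tuple(a[m.a:m.a + m.size]) for m in blocks if m.size >= min_len]


def contiguity_score(segments):
    """Hypothetical score: reward matched tokens, penalize gaps and uneven lengths."""
    if not segments:
        return float("-inf")
    lengths = [len(s) for s in segments]
    gaps = len(segments) - 1          # number of breaks between segments
    variance = pvariance(lengths)     # 0 when there is a single segment
    return sum(lengths) - gaps - variance


a = "Invoice Date : 2013-07-22".split()
b = "Invoice Date : 2014-01-05".split()
c = "Invoice No Date of : issue".split()

contiguous = contiguity_score(matching_segments(a, b, min_len=1))
fragmented = contiguity_score(matching_segments(a, c, min_len=1))
print(contiguous, fragmented)
```

Under this assumed scoring, a pair that matches in one unbroken run ranks above a pair whose matches are scattered across several one-token segments, which is the preference a greedy selection would then act on.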

14. A system, comprising:
- a processor; and
  
  logic encoded in one or more tangible media for execution by the processor, the logic when executed by the processor causing the system to perform operations comprising:
  
  receiving annotated documents comprising annotated fields;
  
  analyzing the annotated documents to determine contextual information for each of the annotated fields;
  
  determining discriminative sequences using the contextual information by:
  
  determining, by a contiguity heuristics module, longest contiguous common subsequences between aligned pairs of strings of the annotated documents;
  
  determining, by the contiguity heuristics module, a frequency of occurrence of similar longest contiguous common subsequences; and
  
  wherein the contiguity heuristics module generates a proposed rule from longest contiguous common subsequences having a desired frequency of occurrence;
  
  providing the proposed rule to a document annotator; and
  
  applying, by the document annotator, the proposed rule to a target document to automatically annotate the target document.
- Dependent claims (15-20):
  - 15. The system of claim 14, wherein the discriminative sequences are identified within documents and used to develop both rule-based extractors and feature-based extractors.
  - 16. The system of claim 14, wherein the processor further executes the logic to perform operations of:
    - executing a base annotation of original documents to create documents with base annotations, the base annotations comprising basic categories of words or groups of characters; and
      
      providing the documents with base annotations to a document annotator via a user interface.
  - 17. The system of claim 16, wherein the processor further executes the logic to perform operations of highlighting each of the base annotations within the user interface.
  - 18. The system of claim 14, wherein the processor further executes the logic to perform operations of receiving feedback from a document annotator; and using the feedback to any of approve the proposed rule, modify the proposed rule, and reject the proposed rule.
  - 19. The system of claim 14, further comprising a rule-based extractor generator that converts the proposed rule into a rule-based extractor.
  - 20. The system according to claim 19, further comprising a rule-based annotator that applies the rule-based extractor to a target document to create an annotated document.
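
Claims 16, 19, and 20 mention base annotations and a rule-based extractor applied to a target document. The following is a minimal sketch of how those pieces might fit together, assuming a proposed rule is simply a tuple of context tokens that precede a field value. The category table, `base_annotate`, and `make_extractor` are hypothetical names introduced here, not the patent's design.

```python
import re

# Hypothetical base categories for words or groups of characters.
BASE_CATEGORIES = [
    ("DATE", re.compile(r"\d{4}-\d{2}-\d{2}")),
    ("NUMBER", re.compile(r"\d+")),
    ("WORD", re.compile(r"[A-Za-z]+")),
]


def base_annotate(text):
    """Label each whitespace token with the first base category it fully matches."""
    annotated = []
    for token in text.split():
        for label, pattern in BASE_CATEGORIES:
            if pattern.fullmatch(token):
                annotated.append((token, label))
                break
        else:
            annotated.append((token, "OTHER"))
    return annotated


def make_extractor(rule_tokens, value_category):
    """Convert a proposed rule (context tokens) into a callable extractor."""
    rule = list(rule_tokens)
    n = len(rule)

    def extract(text):
        annotated = base_annotate(text)
        tokens = [token for token, _ in annotated]
        hits = []
        for i in range(len(tokens) - n):
            # The rule fires when the context matches and the next token
            # carries the expected base category.
            if tokens[i:i + n] == rule and annotated[i + n][1] == value_category:
                hits.append(tokens[i + n])
        return hits

    return extract


extract_date = make_extractor(("Invoice", "Date", ":"), "DATE")
found = extract_date("Ref 42 Invoice Date : 2019-08-20 Total 100")
print(found)
```

Keeping the extractor as a closure over the rule makes it easy to approve, modify, or discard a proposed rule before it is applied to further documents.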

21. A method, comprising:
- receiving, by a context analysis module, annotated documents, the annotated documents comprising annotated fields;
  
  analyzing, by the context analysis module, the annotated documents to determine contextual information for each of the annotated fields;
  
  determining discriminative sequences using the contextual information by:
  
  determining, by a contiguity heuristics module, longest contiguous common subsequences between aligned pairs of strings of the annotated documents;
  
  determining, by the contiguity heuristics module, a frequency of occurrence of similar longest contiguous common subsequences; and
  
  wherein the contiguity heuristics module generates a proposed rule from longest contiguous common subsequences having a desired frequency of occurrence;
  
  providing, by the context analysis module, the proposed rule to a document annotator; and
  
  applying, by the document annotator, the proposed rule to a target document to automatically annotate the target document.
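
The core of the independent claims (longest contiguous common subsequences between string pairs, followed by a frequency test) could be sketched as follows. This is an illustration under stated assumptions: strings are token lists, "similar" subsequences are treated as exactly equal ones, and the `min_count` threshold stands in for the "desired frequency of occurrence". `longest_common_run` and `propose_rule` are hypothetical helper names; `find_longest_match` is the standard `difflib` method.

```python
from collections import Counter
from difflib import SequenceMatcher
from itertools import combinations


def longest_common_run(a, b):
    """Longest contiguous run of tokens shared by token lists a and b."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    m = matcher.find_longest_match(0, len(a), 0, len(b))
    return tuple(a[m.a:m.a + m.size])


def propose_rule(contexts, min_count=2):
    """Count the longest common run of every pair; return the most frequent one."""
    counts = Counter()
    for a, b in combinations(contexts, 2):
        run = longest_common_run(a, b)
        if run:
            counts[run] += 1
    if not counts:
        return None
    # Prefer higher frequency, then longer runs.
    run, n = max(counts.items(), key=lambda kv: (kv[1], len(kv[0])))
    return run if n >= min_count else None


contexts = [
    ["Invoice", "Date", ":"],
    ["the", "Invoice", "Date", ":"],
    ["see", "Invoice", "Date", ":"],
    ["Total", "Due"],
]
rule = propose_rule(contexts)
print(rule)
```

A run that fails the frequency threshold yields no proposal, which mirrors the claim language that a rule is generated only from subsequences with the desired frequency.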

Current Assignee: Open Text Holdings, Inc. (Open Text Corporation)
Original Assignee: Open Text Holdings, Inc. (Open Text Corporation)
Inventors: Riediger, Julian Markus; Horng, Andy
Primary Examiner(s): Hong, Stephen S
Assistant Examiner(s): Ludwig, Matthew J
Application Number: US15/975,511
Publication Number: US 20180260370A1
Time in Patent Office: 468 Days
Field of Search: 715/230, 715/254, 715/255, 704/4, 704/9
CPC Class Codes:
G06F 16/00   Information retrieval; Data...
G06F 17/00   Digital computing or data p...
G06F 40/169   Annotation, e.g. comment da...
