Method and apparatus for forming a structured document from unstructured information

  • US 10,055,391 B2
  • Filed: 12/28/2015
  • Issued: 08/21/2018
  • Est. Priority Date: 09/06/2011
  • Status: Active Grant

1. A method, comprising:

    receiving, by a computer, an unstructured input document;

    extracting, by the computer, a plurality of tokens from the input document, each token of the plurality of tokens having a corresponding visual style of a plurality of visual styles;

    producing, by the computer for a first token of the plurality of tokens, a first probability distribution of the first token, the first probability distribution comprising a plurality of first probabilities each indicating a probability that the first token belongs to a corresponding class of a plurality of classes that are each:

      related to information conveyed by the plurality of tokens; and

      specific to a type of unstructured data items of the input document;

    determining, by the computer from the plurality of tokens, a plurality of surrounding tokens that occur near the first token within the input document;

    determining, by the computer, a first classification probability of the plurality of surrounding tokens, the first classification probability identifying the class in which the plurality of surrounding tokens are most likely to be classified;

    modifying, by the computer based on the class identified by the first classification probability, each of the plurality of first probabilities to produce a corresponding second probability of a plurality of second probabilities in a second probability distribution;

    producing, by the computer based on the visual style of the first token and the second probability distribution, a third probability distribution comprising a plurality of third probabilities each associated with a corresponding second probability of the plurality of second probabilities;

    determining, by the computer based at least on the third probability distribution, a classification of the first token into one of the plurality of classes; and

    forming, by the computer, a structured document from the first token and the classification.
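
The steps recited in claim 1 can be read as a three-stage probability refinement followed by a classification. The sketch below is only an illustration of that flow, not the patented implementation: the class names, the capitalisation heuristic in `first_distribution`, the `CONTEXT_BOOST` factor, the `STYLE_WEIGHTS` table, and the multiply-and-renormalise update are all assumptions made here, since the claim does not specify how any distribution is computed.

```python
from dataclasses import dataclass

# Hypothetical classes for one type of document (e.g. a citation).
CLASSES = ("title", "author", "body")

# Assumed factor by which the context class is favoured (not from the claim).
CONTEXT_BOOST = 2.0

# Assumed per-style weights for each class (not from the claim).
STYLE_WEIGHTS = {
    "bold": {"title": 3.0, "author": 1.0, "body": 1.0},
    "italic": {"title": 1.0, "author": 2.0, "body": 1.0},
}


@dataclass
class Token:
    text: str
    style: str  # visual style of the token, e.g. "bold" or "plain"


def normalize(weights):
    """Scale non-negative weights so that they sum to 1."""
    total = sum(weights.values())
    return {c: w / total for c, w in weights.items()}


def first_distribution(token):
    """First distribution: a toy heuristic based on capitalisation."""
    weights = {c: 1.0 for c in CLASSES}
    if token.text[:1].isupper():
        weights["title"] += 1.0
    if token.text.islower():
        weights["body"] += 1.0
    return normalize(weights)


def surrounding(tokens, i, window=1):
    """Tokens that occur within `window` positions of token i."""
    return tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]


def context_class(neighbours):
    """Class in which the surrounding tokens are most likely classified."""
    if not neighbours:
        return None
    totals = {c: 0.0 for c in CLASSES}
    for n in neighbours:
        for c, p in first_distribution(n).items():
            totals[c] += p
    return max(totals, key=totals.get)


def classify(tokens, i, window=1):
    """Return (class, third distribution) for token i."""
    token = tokens[i]
    first = first_distribution(token)

    # Second distribution: favour the class suggested by the context.
    ctx = context_class(surrounding(tokens, i, window))
    second = normalize(
        {c: p * (CONTEXT_BOOST if c == ctx else 1.0) for c, p in first.items()}
    )

    # Third distribution: reweight by the token's visual style.
    style = STYLE_WEIGHTS.get(token.style, {})
    third = normalize({c: p * style.get(c, 1.0) for c, p in second.items()})

    return max(third, key=third.get), third


def to_structured(tokens, window=1):
    """Form a structured document: one record per token with its class."""
    return [
        {"text": t.text, "class": classify(tokens, i, window)[0]}
        for i, t in enumerate(tokens)
    ]


tokens = [
    Token("Deep", "bold"),
    Token("Learning", "bold"),
    Token("Survey", "bold"),
    Token("by", "plain"),
]
label, third = classify(tokens, 1)
document = to_structured(tokens)
```

Here `classify(tokens, 1)` refines the distribution for "Learning" using its two neighbours and its bold style, and `to_structured` collects one labelled record per token. A real system would presumably replace the toy heuristics with trained models; the sketch only shows how the first, second, and third distributions relate to one another.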
