
Method and Apparatus for Forming a Structured Document from Unstructured Information

  • US 20130067319A1
  • Filed: 09/06/2012
  • Published: 03/14/2013
  • Est. Priority Date: 09/06/2011
  • Status: Active Grant
First Claim

1. A method of forming a structured document from an unstructured input document, the method comprising:

    receiving the input document from a data communication network;

    storing the received input document in a storage system;

    in a first computer process, extracting a plurality of textual tokens from the input document, each extracted token having a visual style;

    in a second computer process, applying a content classifier to the plurality of tokens to produce, for each token therein, a first probability distribution of the given token with respect to a plurality of textual classes;

    in a third computer process, redistributing the probabilities of each token, based on the classification of its surrounding tokens in context, thereby producing a second probability distribution of the given token with respect to the plurality of textual classes;

    in a fourth computer process, applying a visual style classifier to each token based on its visual style, thereby producing a third probability distribution of the given token with respect to the plurality of textual classes;

    determining a classification for each token into one of the plurality of textual classes as a function of the second and third probability distributions; and

    in the storage system, forming a structured document from the plurality of classified tokens.
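
The classification steps of the claim can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the patent: the two classes ("title" and "body"), the `Token` type, the toy rules inside `content_classifier` and `style_classifier`, the neighbour-averaging rule in `redistribute`, and the choice of a normalized product as the "function of the second and third probability distributions". The network receipt and storage steps are omitted.

```python
from dataclasses import dataclass

# Hypothetical textual classes; the claim does not name any.
CLASSES = ("title", "body")


@dataclass
class Token:
    text: str
    style: str  # hypothetical visual style label, e.g. "bold" or "regular"


def normalize(weights):
    """Scale non-negative weights so they sum to 1."""
    total = sum(weights.values())
    return {c: w / total for c, w in weights.items()}


def content_classifier(token):
    """First distribution: a toy rule that looks only at the token text."""
    if token.text.istitle():
        return {"title": 0.6, "body": 0.4}
    return {"title": 0.3, "body": 0.7}


def redistribute(first):
    """Second distribution: blend each token's distribution with the mean
    of its immediate neighbours (an assumed form of 'context')."""
    second = []
    for i, dist in enumerate(first):
        neighbours = [first[j] for j in (i - 1, i + 1) if 0 <= j < len(first)]
        blended = {}
        for c in CLASSES:
            if neighbours:
                context = sum(n[c] for n in neighbours) / len(neighbours)
            else:
                context = dist[c]  # a lone token has no context to borrow from
            blended[c] = 0.5 * dist[c] + 0.5 * context
        second.append(normalize(blended))
    return second


def style_classifier(token):
    """Third distribution: a toy rule that looks only at the visual style."""
    if token.style == "bold":
        return {"title": 0.8, "body": 0.2}
    return {"title": 0.2, "body": 0.8}


def classify(tokens):
    """Pick one class per token from the second and third distributions."""
    first = [content_classifier(t) for t in tokens]
    second = redistribute(first)
    third = [style_classifier(t) for t in tokens]
    labels = []
    for s, v in zip(second, third):
        # Assumed combination rule: normalized product of the two distributions.
        combined = normalize({c: s[c] * v[c] for c in CLASSES})
        labels.append(max(combined, key=combined.get))
    return labels


def form_structured_document(tokens, labels):
    """Group token texts by their assigned class."""
    document = {c: [] for c in CLASSES}
    for token, label in zip(tokens, labels):
        document[label].append(token.text)
    return document


tokens = [
    Token("Quarterly", "bold"),
    Token("Report", "bold"),
    Token("sales", "regular"),
    Token("rose", "regular"),
]
labels = classify(tokens)
document = form_structured_document(tokens, labels)
print(document)
```

In this toy input the two bold, capitalized tokens end up under "title" and the remaining two under "body". A real system would presumably use trained classifiers in place of the hard-coded rules.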

  • 3 Assignments