Method and Apparatus for Forming a Structured Document from Unstructured Information
Abstract
Illustrative embodiments improve upon prior machine learning techniques by introducing an additional classification layer that mimics human visual pattern recognition. Building upon classification passes that extract contextual information, illustrative embodiments look for hints of high-level semantic categorization that manifest as visual artifacts in the document, such as font family, font weight, text color, text justification, white space, or CSS class name. An improved lightweight markup language enables display of machine-categorized tokens on a screen for human correction, thereby providing ground truths for further machine classification.
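As a rough illustration of how human-corrected tokens could serve as ground truths for a visual style classifier, the sketch below estimates a probability distribution over classes from the visual style alone. It is a minimal sketch, not the patented method: the class labels, the `(font_weight, css_class)` style tuple, the function name `train_style_classifier`, and the use of Laplace smoothing are all assumptions made for this example.

```python
from collections import Counter, defaultdict

# Hypothetical class labels; the source does not enumerate the textual classes.
CLASSES = ["title", "author", "body"]


def train_style_classifier(labeled, classes=CLASSES, alpha=1.0):
    """Estimate P(class | style) from human-corrected (style, class) pairs.

    `style` is any hashable summary of a token's visual style; here it is a
    (font_weight, css_class) tuple. Laplace smoothing (alpha) is an assumption.
    """
    counts = defaultdict(Counter)
    for style, cls in labeled:
        counts[style][cls] += 1

    def classify(style):
        # Styles never seen in training fall back to a uniform distribution.
        seen = counts.get(style, Counter())
        total = sum(seen.values()) + alpha * len(classes)
        return {c: (seen[c] + alpha) / total for c in classes}

    return classify


# Hypothetical ground truths, as a human reviewer might have confirmed them.
ground_truth = [
    (("bold", "headline"), "title"),
    (("bold", "headline"), "title"),
    (("normal", "byline"), "author"),
    (("normal", "text"), "body"),
    (("normal", "text"), "body"),
    (("normal", "text"), "body"),
]

classify = train_style_classifier(ground_truth)
dist = classify(("bold", "headline"))  # title ≈ 0.6, author ≈ 0.2, body ≈ 0.2
unseen = classify(("italic", "caption"))  # uniform over the three classes
```

A count-based estimator is used here only because it is easy to inspect; any model that maps a visual style to a distribution over the same classes would fill the same role.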
19 Claims
1. A method of forming a structured document from an unstructured input document, the method comprising:
receiving the input document from a data communication network;
storing the received input document in a storage system;
in a first computer process, extracting a plurality of textual tokens from the input document, each extracted token having a visual style;
in a second computer process, applying a content classifier to the plurality of tokens to produce, for each token therein, a first probability distribution of the given token with respect to a plurality of textual classes;
in a third computer process, redistributing the probabilities of each token, based on the classification of its surrounding tokens in context, thereby producing a second probability distribution of the given token with respect to the plurality of textual classes;
in a fourth computer process, applying a visual style classifier to each token based on its visual style, thereby producing a third probability distribution of the given token with respect to the plurality of textual classes;
determining a classification for each token into one of the plurality of textual classes as a function of the second and third probability distributions; and
in the storage system, forming a structured document from the plurality of classified tokens.
Dependent claims: 2, 3, 4, 5, 6, 7, 8.
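The last steps of claim 1 can be sketched in a few lines. Claim 1 does not say how the probabilities are redistributed or which function combines the second and third distributions, so the example below makes two loud assumptions: redistribution blends each token's distribution with the mean of its neighbors' distributions, and the final class is the argmax of the renormalized product of the second and third distributions. The function names, the two-class label set, and the numbers are hypothetical.

```python
def normalize(dist):
    """Scale a dict of non-negative scores so they sum to 1 (uniform if all zero)."""
    total = sum(dist.values())
    if total == 0:
        return {c: 1.0 / len(dist) for c in dist}
    return {c: p / total for c, p in dist.items()}


def redistribute(first, window=1, weight=0.5):
    """Assumed context step: blend each token's first distribution with the
    mean distribution of its neighbors inside `window`."""
    second = []
    for i, own in enumerate(first):
        neighbors = first[max(0, i - window):i] + first[i + 1:i + 1 + window]
        if not neighbors:
            second.append(dict(own))
            continue
        blended = {
            c: (1 - weight) * own[c]
            + weight * sum(n[c] for n in neighbors) / len(neighbors)
            for c in own
        }
        second.append(normalize(blended))
    return second


def combine(second, third):
    """Assumed combining function: renormalized product, then argmax."""
    labels = []
    for s, t in zip(second, third):
        joint = normalize({c: s[c] * t[c] for c in s})
        labels.append(max(joint, key=joint.get))
    return labels


# First distributions (content classifier) for three consecutive tokens.
first = [
    {"title": 0.60, "body": 0.40},
    {"title": 0.45, "body": 0.55},
    {"title": 0.10, "body": 0.90},
]
# Third distributions (visual style classifier) for the same tokens.
third = [
    {"title": 0.9, "body": 0.1},
    {"title": 0.8, "body": 0.2},
    {"title": 0.2, "body": 0.8},
]

second = redistribute(first)
labels = combine(second, third)
print(labels)  # ['title', 'title', 'body']
```

In this toy input the middle token leans toward "body" on content and context alone; the visual style distribution is what tips it to "title", which is the kind of correction the additional classification layer is meant to supply.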
9. A non-transitory computer readable medium on which is stored program code for forming a structured document from an unstructured input document, the program code comprising:
program code for receiving the input document from a data communication network;
program code for storing the received input document in a storage system;
program code for extracting a plurality of textual tokens from the input document, each extracted token having a visual style;
program code for applying a content classifier to the plurality of tokens to produce, for each token therein, a first probability distribution of the given token with respect to a plurality of textual classes;
program code for redistributing the probabilities of each token, based on the classification of its surrounding tokens in context, thereby producing a second probability distribution of the given token with respect to the plurality of textual classes;
program code for applying a visual style classifier to each token based on its visual style, thereby producing a third probability distribution of the given token with respect to the plurality of textual classes;
program code for determining a classification for each token into one of the plurality of textual classes as a function of the second and third probability distributions; and
program code for forming a structured document from the plurality of classified tokens in the storage system.
Dependent claims: 10, 11, 12, 13, 14, 15, 16.
17. A system for forming a structured document from an unstructured input document, the system comprising:
a network connection that is configured to receive the input document from a data communication network;
a network address classifier, coupled to the network connection, that is configured to determine whether data retrieved from the data communication network is possibly relevant;
a translator, coupled to the network connection, that is configured to extract a plurality of textual tokens from the input document, each extracted token having a visual style;
a storage system for storing the extracted textual tokens and their visual styles;
a content classifier that operates on textual tokens and is configured to produce, for each token, a first probability distribution of the given token with respect to a plurality of textual classes;
a context classifier that operates on the textual tokens and is configured to redistribute the probabilities of each token, based on the classification of its surrounding tokens in context, thereby producing a second probability distribution of the given token with respect to the plurality of textual classes; and
a visual style classifier that operates on the textual tokens and is configured to produce, for each token based on its visual style, a third probability distribution of the given token with respect to the plurality of textual classes, wherein each textual token is classified into one of the plurality of textual classes as a function of the second and third probability distributions.
Dependent claims: 18, 19.
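To make the translator element of claim 17 more concrete, the following sketch extracts textual tokens from an HTML input and attaches a visual style to each one. It is only one possible reading, under stated assumptions: the input is HTML, a token is a whitespace-separated word, and the "visual style" is reduced to the enclosing tag name and its CSS class. The class name `Translator` and the sample markup are hypothetical; the parsing itself uses Python's standard `html.parser` module.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag and so should not be tracked as open.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class Translator(HTMLParser):
    """Collect (token, style) pairs; style is the enclosing (tag, css_class)."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open elements as (tag, css_class)
        self.tokens = []  # extracted (word, (tag, css_class)) pairs

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        css_class = dict(attrs).get("class") or ""
        self.stack.append((tag, css_class))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerates unclosed inner tags.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        style = self.stack[-1] if self.stack else ("", "")
        for word in data.split():
            self.tokens.append((word, style))


# Hypothetical input document.
html = (
    '<h1 class="headline">Structured Documents</h1>'
    '<p class="text">Plain body text</p>'
)
translator = Translator()
translator.feed(html)
translator.close()

for word, style in translator.tokens:
    print(word, style)
```

A fuller translator would presumably resolve computed styles such as font family, font weight, text color, and justification; the tag and class name are used here only because they can be read directly from the markup.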
Specification