Method and Apparatus for Forming a Structured Document from Unstructured Information

US 20130067319A1
Filed: 09/06/2012
Published: 03/14/2013
Est. Priority Date: 09/06/2011
Status: Active Grant


Abstract

Illustrative embodiments improve upon prior machine learning techniques by introducing an additional classification layer that mimics human visual pattern recognition. Building upon classification passes that extract contextual information, illustrative embodiments look for hints of high-level semantic categorization that manifest as visual artifacts in the document, such as font family, font weight, text color, text justification, white space, or CSS class name. An improved lightweight markup language enables display of machine-categorized tokens on a screen for human correction, thereby providing ground truths for further machine classification.

56 Citations


19 Claims

1. A method of forming a structured document from an unstructured input document, the method comprising:
- receiving the input document from a data communication network;
  
  storing the received input document in a storage system;
  
  in a first computer process, extracting a plurality of textual tokens from the input document, each extracted token having a visual style;
  
  in a second computer process, applying a content classifier to the plurality of tokens to produce, for each token therein, a first probability distribution of the given token with respect to a plurality of textual classes;
  
  in a third computer process, redistributing the probabilities of each token, based on the classification of its surrounding tokens in context, thereby producing a second probability distribution of the given token with respect to the plurality of textual classes;
  
  in a fourth computer process, applying a visual style classifier to each token based on its visual style, thereby producing a third probability distribution of the given token with respect to the plurality of textual classes;
  
  determining a classification for each token into one of the plurality of textual classes as a function of the second and third probability distributions; and
  
  in the storage system, forming a structured document from the plurality of classified tokens.
- Dependent claims 2 to 8:
  - 2. A method according to claim 1, wherein the input document comprises a markup language, and extracting the plurality of textual tokens comprises parsing the markup language.
  - 3. A method according to claim 1, wherein the input document comprises an image, and extracting the plurality of textual tokens comprises detecting a column in the image, optionally correcting the perspective of the image, super-sampling the image, or performing optical character recognition on the image.
  - 4. A method according to claim 1, wherein the visual style includes one or more of the group consisting of:
    - font name, font family, font weight, font size, text color, vertical alignment, horizontal alignment, text justification, text indentation, capitalization type, link type, amount of surrounding white space, and CSS class name.
  - 5. A method according to claim 1, wherein the style classifier does not classify any token based on a visual style that is not found in the input document.
  - 6. A method according to claim 1, wherein the input document is an HTML page, and the style classifier does not classify any token based on a relationship between HTML tags.
  - 7. A method according to claim 1, further comprising:
    - displaying the tokens on a video display;
      
      receiving an indication from an individual viewing the video display that a token has been misclassified; and
      
      reclassifying the token into a different textual class according to the indication.
  - 8. A method according to claim 1, wherein the input document comprises a restaurant menu.
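
The flow of claim 1 (a content pass, a context pass, a visual style pass, then a final decision from the second and third distributions) can be illustrated with a minimal Python sketch. The claim does not say how context redistributes probabilities or which function combines the distributions, so the neighbor-averaging rule, the `neighbor_weight` parameter, the product-based combination, and all function names below are illustrative assumptions rather than the patented method.

```python
from typing import Dict, List

Distribution = Dict[str, float]  # textual class -> probability


def normalize(scores: Distribution) -> Distribution:
    """Scale non-negative scores so they sum to 1."""
    total = sum(scores.values())
    if total == 0:
        # Fall back to a uniform distribution when there is no evidence.
        return {label: 1.0 / len(scores) for label in scores}
    return {label: value / total for label, value in scores.items()}


def redistribute_by_context(
    first: List[Distribution], neighbor_weight: float = 0.5
) -> List[Distribution]:
    """Assumed context pass: blend each token's first distribution with
    the average distribution of its immediate neighbors."""
    second = []
    for i, dist in enumerate(first):
        neighbors = [first[j] for j in (i - 1, i + 1) if 0 <= j < len(first)]
        blended = {}
        for label, p in dist.items():
            if neighbors:
                ctx = sum(n[label] for n in neighbors) / len(neighbors)
            else:
                ctx = p  # a lone token keeps its own probability
            blended[label] = (1 - neighbor_weight) * p + neighbor_weight * ctx
        second.append(normalize(blended))
    return second


def combine(second: Distribution, third: Distribution) -> str:
    """Assumed decision rule: multiply the second and third distributions
    class by class and pick the most probable class."""
    product = normalize({label: second[label] * third[label] for label in second})
    return max(product, key=product.get)


def classify_tokens(
    first: List[Distribution], third: List[Distribution]
) -> List[str]:
    """Run the context pass, then decide each token's class."""
    second = redistribute_by_context(first)
    return [combine(s, t) for s, t in zip(second, third)]
```

Here `first` would come from a content classifier and `third` from a visual style classifier, both of which are outside the scope of this sketch.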

9. A non-transitory computer readable medium on which is stored program code for forming a structured document from an unstructured input document, the program code comprising:
- program code for receiving the input document from a data communication network;
  
  program code for storing the received input document in a storage system;
  
  program code for extracting a plurality of textual tokens from the input document, each extracted token having a visual style;
  
  program code for applying a content classifier to the plurality of tokens to produce, for each token therein, a first probability distribution of the given token with respect to a plurality of textual classes;
  
  program code for redistributing the probabilities of each token, based on the classification of its surrounding tokens in context, thereby producing a second probability distribution of the given token with respect to the plurality of textual classes;
  
  program code for applying a visual style classifier to each token based on its visual style, thereby producing a third probability distribution of the given token with respect to the plurality of textual classes;
  
  program code for determining a classification for each token into one of the plurality of textual classes as a function of the second and third probability distributions; and
  
  program code for forming a structured document from the plurality of classified tokens in the storage system.
- Dependent claims 10 to 16:
  - 10. A medium according to claim 9, wherein the input document comprises a markup language, and the program code for extracting the plurality of textual tokens comprises program code for parsing the markup language.
  - 11. A medium according to claim 9, wherein the input document comprises an image, and the program code for extracting the plurality of textual tokens comprises program code for detecting a column in the image, optionally correcting the perspective of the image, super-sampling the image, or performing optical character recognition on the image.
  - 12. A medium according to claim 9, wherein the visual style includes one or more of the group consisting of:
    - font name, font family, font weight, font size, text color, vertical alignment, horizontal alignment, text justification, text indentation, capitalization type, link type, amount of surrounding white space, and CSS class name.
  - 13. A medium according to claim 9, wherein the program code for the style classifier does not classify any token based on a visual style that is not found in the input document.
  - 14. A medium according to claim 9, wherein the input document is an HTML page, and the program code for the style classifier does not classify any token based on a relationship between HTML tags.
  - 15. A medium according to claim 9, further comprising:
    - program code for displaying the tokens on a video display;
      
      program code for receiving an indication from an individual viewing the video display that a token has been misclassified; and
      
      program code for reclassifying the token into a different textual class according to the indication.
  - 16. A medium according to claim 9, wherein the input document comprises a restaurant menu.
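
Claims 10 and 12 describe extracting tokens by parsing a markup language, with each token carrying a visual style such as font weight, text color, or CSS class name. The following Python sketch shows one way this could look for HTML, using only the standard library's `html.parser`. Treating each text node as one token, reading style only from inline `style` and `class` attributes, and letting child elements inherit the parent's recorded style are all simplifying assumptions; the names `StyledTokenExtractor` and `extract_styled_tokens` are hypothetical.

```python
from html.parser import HTMLParser
from typing import Dict, List, Tuple

Style = Dict[str, str]


class StyledTokenExtractor(HTMLParser):
    """Collects (text, style) pairs, one per non-empty text node."""

    def __init__(self) -> None:
        super().__init__()
        self._stack: List[Tuple[str, Style]] = []  # open tags and their styles
        self.tokens: List[Tuple[str, Style]] = []

    def handle_starttag(self, tag, attrs):
        # Start from the enclosing element's style (assumed inheritance).
        style = dict(self._stack[-1][1]) if self._stack else {}
        attr_map = dict(attrs)
        if attr_map.get("class"):
            style["css_class"] = attr_map["class"]
        # Parse inline declarations such as "font-weight: bold; color: #333".
        for decl in (attr_map.get("style") or "").split(";"):
            if ":" in decl:
                name, value = decl.split(":", 1)
                style[name.strip().lower()] = value.strip()
        self._stack.append((tag, style))

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag, if there is one.
        for i in range(len(self._stack) - 1, -1, -1):
            if self._stack[i][0] == tag:
                del self._stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text:
            style = dict(self._stack[-1][1]) if self._stack else {}
            self.tokens.append((text, style))


def extract_styled_tokens(html: str) -> List[Tuple[str, Style]]:
    parser = StyledTokenExtractor()
    parser.feed(html)
    parser.close()
    return parser.tokens
```

A production extractor would also need to resolve external stylesheets and computed styles, which this sketch does not attempt.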

17. A system for forming a structured document from an unstructured input document, the system comprising:
- a network connection that is configured to receive the input document from a data communication network;
  
  a network address classifier, coupled to the network connection, that is configured to determine whether data retrieved from the data communication network is possibly relevant;
  
  a translator, coupled to the network connection, that is configured to extract a plurality of textual tokens from the input document, each extracted token having a visual style;
  
  a storage system for storing the extracted textual tokens and their visual styles;
  
  a content classifier that operates on textual tokens and is configured to produce, for each token, a first probability distribution of the given token with respect to a plurality of textual classes;
  
  a context classifier that operates on the textual tokens and is configured to redistribute the probabilities of each token, based on the classification of its surrounding tokens in context, thereby producing a second probability distribution of the given token with respect to the plurality of textual classes; and
  
  a visual style classifier that operates on the textual tokens and is configured to produce, for each token based on its visual style, a third probability distribution of the given token with respect to the plurality of textual classes, wherein each textual token is classified into one of the plurality of textual classes as a function of the second and third probability distributions.
- Dependent claims 18 and 19:
  - 18. A system according to claim 17, wherein at least one of the network address classifier, content classifier, and context classifier is implemented as a Bayesian filter.
  - 19. A system according to claim 17, wherein the visual style classifier is configured to be trained on the second probability distribution.
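
Claim 19 states that the visual style classifier can be trained on the second probability distribution. One plausible reading, sketched below in Python, is to pool the context-adjusted probabilities of all tokens that share a visual style and normalize the result, so that every token with that style receives the same third distribution. This pooling rule, the representation of a style as a single hashable key, and the function names are assumptions made for illustration; the claim itself does not specify the training procedure.

```python
from collections import defaultdict
from typing import Dict, Hashable, List

Distribution = Dict[str, float]  # textual class -> probability


def train_style_classifier(
    styles: List[Hashable], second: List[Distribution]
) -> Dict[Hashable, Distribution]:
    """Pool second-pass probabilities per visual style and normalize."""
    totals: Dict[Hashable, Dict[str, float]] = defaultdict(lambda: defaultdict(float))
    for style, dist in zip(styles, second):
        for label, p in dist.items():
            totals[style][label] += p  # soft count for this style and class

    model: Dict[Hashable, Distribution] = {}
    for style, sums in totals.items():
        mass = sum(sums.values())
        if mass == 0:
            # No evidence for this style: fall back to a uniform distribution.
            model[style] = {label: 1.0 / len(sums) for label in sums}
        else:
            model[style] = {label: value / mass for label, value in sums.items()}
    return model


def third_distributions(
    styles: List[Hashable], model: Dict[Hashable, Distribution]
) -> List[Distribution]:
    """Look up each token's third distribution from its visual style."""
    return [model[style] for style in styles]
```

For example, if two bold tokens have second distributions that lean toward a section-heading class, every bold token in the document would receive a third distribution that leans the same way.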

Current Assignee: Locu Incorporated (GoDaddy, Inc.)
Original Assignee: Locu Incorporated (GoDaddy, Inc.)
Inventors: Sidiroglou, Stylianos; Olszewski, Marek; Ansel, Jason; Piette, Marc; Reinsberg, Rene
Granted Patent: US 9,280,525 B2
US Class Current: 715/234

CPC Class Codes

G06F 16/35   Clustering; Classification

G06F 40/106   Display of layout of docume...

G06F 40/117   Tagging; Marking up details...

G06F 40/143   Markup, e.g. Standard Gener...

G06F 40/284   Lexical analysis, e.g. toke...
