Method and apparatus for forming a structured document from unstructured information

US 10,055,391 B2
Filed: 12/28/2015
Issued: 08/21/2018
Est. Priority Date: 09/06/2011
Status: Active Grant

Abstract

Illustrative embodiments improve upon prior machine learning techniques by introducing an additional classification layer that mimics human visual pattern recognition. Building upon classification passes that extract contextual information, illustrative embodiments look for hints of high-level semantic categorization that manifest as visual artifacts in the document, such as font family, font weight, text color, text justification, white space, or CSS class name. An improved lightweight markup language enables display of machine-categorized tokens on a screen for human correction, thereby providing ground truths for further machine classification.
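
As a rough illustration of what a token carrying a visual style might look like, the following Python sketch parses an HTML fragment and pairs each run of text with the nearest enclosing CSS class name. The `Token` type, the `StyledTokenParser` class, and the choice of CSS class as the only style attribute are assumptions made for this example; the abstract lists several other visual artifacts (font family, font weight, text color, and so on) that this sketch ignores.

```python
from dataclasses import dataclass
from html.parser import HTMLParser


@dataclass(frozen=True)
class Token:
    text: str
    css_class: str  # hypothetical stand-in for the token's visual style


# Elements that never have a closing tag, so they must not be pushed.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class StyledTokenParser(HTMLParser):
    """Collects text runs together with the nearest enclosing class name."""

    def __init__(self):
        super().__init__()
        self._classes = []  # class attribute of each currently open element
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        self._classes.append(dict(attrs).get("class") or "")

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self._classes:
            self._classes.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            # Use the innermost open element that declares a class, if any.
            style = next((c for c in reversed(self._classes) if c), "")
            self.tokens.append(Token(text, style))


def extract_tokens(html):
    parser = StyledTokenParser()
    parser.feed(html)
    parser.close()
    return parser.tokens


if __name__ == "__main__":
    page = ('<div class="section">Appetizers</div>'
            '<p class="item">Spring rolls <span class="price">$6</span></p>')
    for token in extract_tokens(page):
        print(token)
```

The sketch records only the style attached to each token and does not reason about how tags nest or relate to one another.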

18 Claims

1. A method, comprising:
- receiving, by a computer, an unstructured input document;
  
  extracting, by the computer, a plurality of tokens from the input document, each token of the plurality of tokens having a corresponding visual style of a plurality of visual styles;
  
  producing, by the computer for a first token of the plurality of tokens, a first probability distribution of the first token, the first probability distribution comprising a plurality of first probabilities each indicating a probability that the first token belongs to a corresponding class of a plurality of classes that are each:
  
  related to information conveyed by the plurality of tokens; and
  
  specific to a type of unstructured data items of the input document;
  
  determining, by the computer from the plurality of tokens, a plurality of surrounding tokens that occur near the first token within the input document;
  
  determining, by the computer, a first classification probability of the plurality of surrounding tokens, the first classification probability identifying the class in which the plurality of surrounding tokens are most likely to be classified;
  
  modifying, by the computer based on the class identified by the first classification probability, each of the plurality of first probabilities to produce a corresponding second probability of a plurality of second probabilities in a second probability distribution;
  
  producing, by the computer based on the visual style of the first token and the second probability distribution, a third probability distribution comprising a plurality of third probabilities each associated with a corresponding second probability of the plurality of second probabilities;
  
  determining, by the computer based at least on the third probability distribution, a classification of the first token into one of the plurality of classes; and
  
  forming, by the computer, a structured document from the first token and the classification.
- Dependent Claims (2, 3, 4, 5, 6, 7, 8)
  - 2. The method of claim 1, wherein the input document comprises an image, and extracting the plurality of tokens comprises detecting a column in the image, correcting the perspective of the image, super-sampling the image, or performing optical character recognition on the image.
  - 3. The method of claim 1, wherein the input document comprises a markup language, and extracting the plurality of tokens comprises parsing the markup language.
  - 4. The method of claim 1, wherein the input document is an HTML page, and the computer does not produce any of the first probability distribution, the second probability distribution, or the third probability distribution for the first token based on a relationship between HTML tags.
  - 5. The method of claim 1, wherein the input document comprises a restaurant menu, and forming the structured document comprises generating a structured web page representing the restaurant menu.
  - 6. The method of claim 1, wherein modifying each of the plurality of second probabilities comprises:
    - determining a plurality of input document visual styles from the corresponding visual style of each of the tokens;
      
      determining a second classification probability identifying the class in which the visual style of the first token is most likely to be classified; and
      
      generating each third probability of the plurality of third probabilities from the corresponding second probability and the second classification probability.
  - 7. The method of claim 1, wherein determining the classification of the first token is based on a function of the second probability distribution and the third probability distribution.
  - 8. The method of claim 1, wherein determining the classification of the first token comprises producing a plurality of relative likelihoods each associated with a corresponding class of the plurality of classes, a first relative likelihood (RL) of the plurality of relative likelihoods being associated with a first class of the plurality of classes and being calculated from the corresponding second probability (C) of the first class and the corresponding third probability (S) of the first class using the formula RL=C*S⁴.
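
The formula in claim 8 can be worked through with a small Python sketch. The function names, class labels, and probability values below are hypothetical and chosen only for illustration; the claim itself fixes nothing beyond RL=C*S⁴.

```python
def relative_likelihoods(context_probs, style_probs):
    """Compute RL = C * S**4 for every class.

    context_probs: second probability (C) per class
    style_probs:   third probability (S) per class
    """
    return {cls: c * style_probs[cls] ** 4 for cls, c in context_probs.items()}


def classify(context_probs, style_probs):
    """Pick the class with the highest relative likelihood."""
    rl = relative_likelihoods(context_probs, style_probs)
    return max(rl, key=rl.get)


# Hypothetical values: context favors "item_name", style favors "section_heading".
C = {"item_name": 0.75, "section_heading": 0.25}
S = {"item_name": 0.25, "section_heading": 0.75}

print(relative_likelihoods(C, S))
print(classify(C, S))
```

With these assumed numbers, "item_name" gets 0.75 × 0.25⁴ ≈ 0.0029 and "section_heading" gets 0.25 × 0.75⁴ ≈ 0.0791, so the style-based probability outweighs the context-based one. Raising S to the fourth power makes the result far more sensitive to the visual-style probability than to the contextual probability. The values are relative likelihoods and need not sum to one.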

9. A method, comprising:
- determining, by a computer, a first token of a plurality of tokens in an unstructured input document, the first token having a visual style;
  
  producing, by the computer, a first probability distribution of the first token across a plurality of classes, each class of the plurality of classes being related to a corresponding content of one or more of the plurality of tokens;
  
  modifying, by the computer, the first probability distribution to produce a second probability distribution of the first token across the plurality of classes, the second probability distribution being based on one or more classes of the plurality of classes, the one or more classes being likely to contain a plurality of surrounding tokens appearing near the first token in context of the input document;
  
  producing, by the computer, a third probability distribution of the first token across the plurality of classes, the third probability distribution being based on the visual style of the first token and the second probability distribution;
  
  determining, by the computer based at least on the third probability distribution, a classification of the first token into one of the plurality of classes; and
  
  forming, by the computer, a structured document from the first token and the classification.
- Dependent Claims (10, 11, 12, 13, 14)
  - 10. The method of claim 9, wherein the visual style is selected from the group consisting of:
    - font name, font family, font weight, font size, text color, vertical alignment, horizontal alignment, text justification, text indentation, capitalization type, link type, amount of surrounding white space, and CSS class name.
  - 11. The method of claim 9, further comprising:
    - displaying, by the computer, the plurality of tokens on a video display;
      
      receiving, by the computer, an indication from an individual viewing the video display that a second token of the plurality of tokens has been misclassified; and
      
      reclassifying the second token into a different class according to the indication.
  - 12. The method of claim 9, wherein the input document comprises an image, and determining the plurality of tokens comprises detecting a column in the image, correcting the perspective of the image, super-sampling the image, or performing optical character recognition on the image.
  - 13. The method of claim 9, wherein the input document is an HTML page, and the computer does not produce any of the first probability distribution, the second probability distribution, or the third probability distribution for the first token based on a relationship between HTML tags.
  - 14. The method of claim 9, wherein the input document comprises a restaurant menu, and forming the structured document comprises generating a structured web page representing the restaurant menu.
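
The three passes recited in claim 9 (content, surrounding context, visual style) can be sketched in Python as follows. This is a toy illustration under stated assumptions, not the patented implementation: the class labels, the keyword heuristics in `content_pass`, the multiplicative `boost`, the `style_priors` table, and the one-token window are all invented for the example, and the claims do not specify how each distribution is computed.

```python
from collections import Counter


def normalize(dist):
    """Scale scores so they sum to one."""
    total = sum(dist.values())
    return {c: p / total for c, p in dist.items()}


def content_pass(text):
    """First distribution, from the token text alone (toy heuristic)."""
    if text.startswith("$"):
        scores = {"heading": 1, "item": 1, "price": 8}
    elif text.isupper():
        scores = {"heading": 6, "item": 3, "price": 1}
    else:
        scores = {"heading": 2, "item": 6, "price": 2}
    return normalize(scores)


def context_pass(first, neighbor_dists, boost=2.0):
    """Second distribution: boost the class the neighbors most likely share."""
    if not neighbor_dists:
        return dict(first)
    votes = Counter(max(d, key=d.get) for d in neighbor_dists)
    likely = votes.most_common(1)[0][0]
    return normalize({c: p * (boost if c == likely else 1.0)
                      for c, p in first.items()})


def style_pass(second, style, style_priors):
    """Third distribution: reweight the second one by the token's style."""
    prior = style_priors.get(style, {})
    return normalize({c: p * prior.get(c, 1.0) for c, p in second.items()})


def classify_tokens(tokens, style_priors, window=1):
    """tokens is a list of (text, style) pairs in document order."""
    firsts = [content_pass(text) for text, _ in tokens]
    structured = []
    for i, (text, style) in enumerate(tokens):
        neighbors = firsts[max(0, i - window):i] + firsts[i + 1:i + 1 + window]
        second = context_pass(firsts[i], neighbors)
        third = style_pass(second, style, style_priors)
        structured.append({"text": text, "class": max(third, key=third.get)})
    return structured


if __name__ == "__main__":
    menu = [("Soups", "bold"), ("Miso soup", "plain"), ("$4", "plain")]
    priors = {"bold": {"heading": 8.0, "item": 0.5}}  # assumed style weights
    for row in classify_tokens(menu, priors):
        print(row)
```

In this example the text "Soups" looks like an ordinary item on content and context alone, and only the style pass, driven by the assumed weight for bold text, moves it to the heading class. The list of dictionaries returned by `classify_tokens` stands in for the structured document.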

15. A device for forming a structured document from an unstructured input document, the device comprising:
- memory storing program logic; and
  
  a processor in electrical communication with the memory and executing the program logic to:
  
  extract a plurality of tokens from the input document, each token of the plurality of tokens having a visual style;
  
  produce, for each token of the plurality of tokens, a corresponding first probability distribution across a plurality of classes each being related to information conveyed by the tokens;
  
  produce, for each token of the plurality of tokens, a corresponding second probability distribution across the plurality of classes, the corresponding second probability distribution being based at least in part on the class, of the plurality of classes, in which the token's surrounding tokens in context are most likely to be classified; and
  
  produce, for each token of the plurality of tokens, a corresponding third probability distribution across the plurality of classes, the corresponding third probability distribution being based at least in part on the corresponding visual style of the token; and
  
  classify each token of the plurality of tokens into one of the plurality of classes as a function of one or more of the first probability distribution, the second probability distribution, and the third probability distribution, wherein to classify each token, the processor executes the program logic to determine, for each class of the plurality of classes, a relative likelihood (RL) of token belonging to the class, by calculating the RL from the token's corresponding second probability distribution for the class (C) and the token's corresponding third probability distribution for the class (S) according to the function:
  
  RL=C*S⁴.
- Dependent Claims (16, 17, 18)
  - 16. The device of claim 15, wherein the second probability distribution of a corresponding token of the plurality of tokens is further based on the first probability distribution of the corresponding token.
  - 17. The device of claim 15, wherein the third probability distribution of a corresponding token of the plurality of tokens is further based on the second probability distribution of the corresponding token.
  - 18. The device of claim 15, wherein the input document comprises an image, and to extract the plurality of tokens from the input document, the processor executes the program logic to detect text in the image and extract the text as one or more of the plurality of tokens.

Current Assignee
Locu Incorporated (GoDaddy, Inc.)
Original Assignee
Locu Incorporated (GoDaddy, Inc.)
Inventors
Olszewski, Marek; Sidiroglou, Stylianos; Ansel, Jason; Piette, Marc; Reinsberg, Rene
Primary Examiner(s)
McIntosh, Andrew T

Application Number
US14/980,998
Publication Number
US 20160117295A1
Time in Patent Office
967 Days
CPC Class Codes

G06F 16/35   Clustering; Classification

G06F 40/106   Display of layout of docume...

G06F 40/117   Tagging; Marking up details...

G06F 40/143   Markup, e.g. Standard Gener...

G06F 40/284   Lexical analysis, e.g. toke...
