Probabilistic learning method for XML annotation of documents

US 8,543,906 B2
Filed: 06/29/2005
Issued: 09/24/2013
Est. Priority Date: 06/29/2005
Status: Expired due to Fees

Abstract

A document processor includes a parser that parses a document using a grammar having a set of terminal elements for labeling leaves, a set of non-terminal elements for labeling nodes, and a set of transformation rules. The parsing generates a parsed document structure including terminal element labels for fragments of the document and a tree of nodes linking the terminal element labels and conforming with the transformation rules. An annotator annotates the document with structural information based on the parsed document structure.
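As an illustration of the annotation step described in the abstract, the following minimal sketch turns a parsed document structure into XML. The tree representation, the element names (`Book`, `Section`, `author`, `title`, `para`), and the `to_xml` helper are all hypothetical and chosen only for this example; the patent text shown here does not prescribe a data format.

```python
import xml.etree.ElementTree as ET

# Hypothetical parsed document structure (an assumption for illustration):
# a node is (label, content). Non-terminal nodes carry a list of child nodes;
# leaves carry the text of the document fragment they label.
parsed = ("Book", [
    ("author", "Jane Doe"),
    ("Section", [
        ("title", "Introduction"),
        ("para", "First paragraph."),
    ]),
])

def to_xml(node):
    """Recursively convert a (label, content) node into an XML element."""
    label, content = node
    elem = ET.Element(label)
    if isinstance(content, str):
        # Leaf: terminal element label wrapping a document fragment.
        elem.text = content
    else:
        # Inner node: non-terminal element linking its children.
        for child in content:
            elem.append(to_xml(child))
    return elem

xml_text = ET.tostring(to_xml(parsed), encoding="unicode")
print(xml_text)
```

Each leaf becomes an element named after its terminal label, and each inner node becomes an enclosing element, so the XML nesting mirrors the tree produced by the parser.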

4 Claims

1. A document processor stored in a non-transitory medium comprising:
- a probabilistic classifier that classifies fragments of an input document respective to a set of terminal elements by assigning probability values for the fragments corresponding to elements of the set of terminal elements;
  
  a parser that defines a parsed document structure associating the input document fragments with terminal elements connected by links of non-terminal elements conforming with a probabilistic grammar defining transformation rules operating on elements selected from the set of terminal elements and a set of non-terminal elements, the parsed document structure being used to organize the input document, the parser including a joint probability optimizer that optimizes the parsed document structure respective to a joint probability of (i) the probability values of the associated terminal elements and (ii) probabilities of the connecting links of non-terminal elements derived from the probabilistic grammar;
  
  a classifier trainer that trains the probabilistic classifier respective to a set of training documents having pre-classified fragments; and
  
  a grammar derivation module that derives the probabilistic grammar from the set of training documents, each training document having a pre-assigned parsed document structure associating fragments of the training document with terminal elements connected by links of non-terminal elements.
  - 2. The document processor as set forth in claim 1, wherein the probabilistic grammar is a probabilistic context-free grammar and the joint probability optimizer employs a modified inside/outside optimization.
  - 3. The document processor as set forth in claim 1, wherein the computer is further programmed to implement:
    - an XML document converter that converts the input document to an XML document having an XML structure generated in accordance with the parsed document structure.
  - 4. The document processor as set forth in claim 3, wherein the XML document includes a DTD based on the probabilistic grammar.
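The joint probability in claim 1 can be illustrated with a small sketch. It assumes a grammar in Chomsky normal form and uses a Viterbi-style CYK search, which is a simpler stand-in and not the "modified inside/outside optimization" named in claim 2. The function names, the grammar, and the probability values are hypothetical. Each leaf score multiplies a classifier probability for a fragment by the probability of the rule that emits that terminal, and each inner node multiplies in the probability of the rule linking its children.

```python
from collections import defaultdict

def best_parse(frag_probs, unary, binary, start):
    """Find the most probable parse of a fragment sequence.

    frag_probs: one dict per fragment, terminal label -> classifier probability.
    unary:  dict (A, t) -> probability of rule A -> t (t is a terminal).
    binary: dict (A, B, C) -> probability of rule A -> B C.
    Returns (joint probability, tree), or (0.0, None) if no parse exists.
    """
    n = len(frag_probs)
    chart = defaultdict(dict)  # chart[(i, j)][A] = (probability, tree)
    # Leaves: combine classifier probability with the emitting rule.
    for i, probs in enumerate(frag_probs):
        for (A, t), p_rule in unary.items():
            p = p_rule * probs.get(t, 0.0)
            if p > chart[(i, i + 1)].get(A, (0.0, None))[0]:
                chart[(i, i + 1)][A] = (p, (A, t))
    # Inner nodes: combine two adjacent spans with a binary rule.
    for width in range(2, n + 1):
        for i in range(0, n - width + 1):
            j = i + width
            for k in range(i + 1, j):
                for (A, B, C), p_rule in binary.items():
                    if B in chart[(i, k)] and C in chart[(k, j)]:
                        pb, tb = chart[(i, k)][B]
                        pc, tc = chart[(k, j)][C]
                        p = p_rule * pb * pc
                        if p > chart[(i, j)].get(A, (0.0, None))[0]:
                            chart[(i, j)][A] = (p, (A, tb, tc))
    return chart[(0, n)].get(start, (0.0, None))

def leaf_labels(tree):
    """Read the terminal labels off a tree, left to right."""
    if len(tree) == 2:          # leaf node: (non-terminal, terminal)
        return [tree[1]]
    return leaf_labels(tree[1]) + leaf_labels(tree[2])

# Hypothetical grammar: a head with title and author, then a body paragraph.
unary = {("T", "title"): 1.0, ("A", "author"): 1.0, ("Body", "para"): 1.0}
binary = {("Doc", "Head", "Body"): 1.0,
          ("Head", "T", "A"): 0.9,   # title before author is the common order
          ("Head", "A", "T"): 0.1}
# Hypothetical classifier output for three fragments.
frag_probs = [{"title": 0.4, "author": 0.6},
              {"title": 0.3, "author": 0.7},
              {"para": 0.9, "title": 0.1}]

prob, tree = best_parse(frag_probs, unary, binary, "Doc")
print(leaf_labels(tree), prob)
```

In this toy input the classifier alone would label the first fragment `author`, but the joint optimum labels it `title`, because the grammar strongly favors a title followed by an author.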

Current Assignee: Xerox Corporation (Xerox Holdings Corp.)
Original Assignee: Xerox Corporation (Xerox Holdings Corp.)
Inventors: Chidlovskii, Boris; Fuselier, Jerome
Primary Examiner(s): NGUYEN, CHAU T
Application Number: US11/170,542
Publication Number: US 20070022373A1
Time in Patent Office: 3,009 Days
Field of Search: 715/205, 715/234, 715/236, 715/239, 715/248-249
US Class Current: 715/234
CPC Class Codes:
G06F 40/143 Markup, e.g. Standard Gener...
G06F 40/169 Annotation, e.g. comment da...
