Probabilistic learning method for XML annotation of documents
First Claim
1. A document processor stored in a non-transitory medium comprising:
a probabilistic classifier that classifies fragments of an input document respective to a set of terminal elements by assigning probability values for the fragments corresponding to elements of the set of terminal elements;
a parser that defines a parsed document structure associating the input document fragments with terminal elements connected by links of non-terminal elements conforming with a probabilistic grammar defining transformation rules operating on elements selected from the set of terminal elements and a set of non-terminal elements, the parsed document structure being used to organize the input document, the parser including a joint probability optimizer that optimizes the parsed document structure respective to a joint probability of (i) the probability values of the associated terminal elements and (ii) probabilities of the connecting links of non-terminal elements derived from the probabilistic grammar;
a classifier trainer that trains the probabilistic classifier respective to a set of training documents having pre-classified fragments; and
a grammar derivation module that derives the probabilistic grammar from the set of training documents, each training document having a pre-assigned parsed document structure associating fragments of the training document with terminal elements connected by links of non-terminal elements.
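The interplay between the classifier probabilities and the grammar probabilities in claim 1 can be illustrated with a small sketch. It is not the patented implementation: it assumes a toy grammar restricted to binary rules, substitutes a standard Viterbi-style CYK search for the joint probability optimizer, and uses relative-frequency counting for grammar derivation. All names (`FRAGMENT_PROBS`, `RULES`, `best_parse`, `leaf_labels`, `derive_rules`) and all probability values are hypothetical.

```python
import math
from collections import Counter

# Hypothetical classifier output: P(terminal | fragment) for three fragments.
FRAGMENT_PROBS = [
    {"title": 0.9, "para": 0.1},
    {"heading": 0.4, "para": 0.6},  # taken alone, the classifier prefers "para"
    {"para": 0.8, "heading": 0.2},
]

# Hypothetical binary rules: (parent, left, right) -> rule probability.
RULES = {
    ("Doc", "title", "Section"): 1.0,
    ("Section", "heading", "para"): 0.7,
    ("Section", "para", "para"): 0.3,
}


def best_parse(fragment_probs, rules, root):
    """Return (log joint probability, tree) of the best parse, or None."""
    n = len(fragment_probs)
    if n == 0:
        return None
    # chart[(i, j)] maps a symbol to (log probability, tree) over fragments i..j-1.
    chart = {}
    for i, probs in enumerate(fragment_probs):
        chart[(i, i + 1)] = {
            t: (math.log(p), (t, i)) for t, p in probs.items() if p > 0
        }
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            cell = {}
            for k in range(i + 1, j):
                for (parent, left, right), p in rules.items():
                    if left in chart[(i, k)] and right in chart[(k, j)]:
                        left_score, left_tree = chart[(i, k)][left]
                        right_score, right_tree = chart[(k, j)][right]
                        # Joint score: rule probability times both subtree scores.
                        score = math.log(p) + left_score + right_score
                        if parent not in cell or score > cell[parent][0]:
                            cell[parent] = (score, (parent, left_tree, right_tree))
            chart[(i, j)] = cell
    return chart[(0, n)].get(root)


def leaf_labels(tree):
    """Collect terminal labels left to right; leaves are (label, index) pairs."""
    if len(tree) == 2:
        return [tree[0]]
    return leaf_labels(tree[1]) + leaf_labels(tree[2])


def derive_rules(training_trees):
    """Estimate rule probabilities by relative frequency over annotated trees."""
    counts = Counter()

    def visit(tree):
        if len(tree) == 3:  # internal node: (parent, left subtree, right subtree)
            parent, left, right = tree
            counts[(parent, left[0], right[0])] += 1
            visit(left)
            visit(right)

    for tree in training_trees:
        visit(tree)
    totals = Counter()
    for (parent, _, _), count in counts.items():
        totals[parent] += count
    return {rule: count / totals[rule[0]] for rule, count in counts.items()}


score, tree = best_parse(FRAGMENT_PROBS, RULES, "Doc")
print(leaf_labels(tree), round(math.exp(score), 4))
# prints ['title', 'heading', 'para'] 0.2016
```

In this toy example the second fragment is labelled "heading" even though the classifier alone gives "para" the higher probability (0.6 versus 0.4), because the joint score 0.9 × 0.4 × 0.8 × 1.0 × 0.7 = 0.2016 exceeds the alternative 0.9 × 0.6 × 0.8 × 1.0 × 0.3 = 0.1296.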
Abstract
A document processor includes a parser that parses a document using a grammar having a set of terminal elements for labeling leaves, a set of non-terminal elements for labeling nodes, and a set of transformation rules. The parsing generates a parsed document structure including terminal element labels for fragments of the document and a nodes tree linking the terminal element labels and conforming with the transformation rules. An annotator annotates the document with structural information based on the parsed document structure.
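As a rough illustration of the annotation step described in the abstract, the sketch below turns a parsed document structure into XML markup. It assumes a hypothetical tree encoding (leaves as `(label, fragment_index)` pairs, internal nodes as `(label, child, ...)` tuples) and made-up fragment text; the function name `annotate` is likewise an assumption, and only Python's standard `xml.etree.ElementTree` module is used.

```python
import xml.etree.ElementTree as ET


def annotate(tree, fragments):
    """Build an XML element from a parse tree.

    Leaves are (label, fragment_index); internal nodes are (label, child, ...).
    """
    element = ET.Element(tree[0])
    if len(tree) == 2 and isinstance(tree[1], int):
        # Leaf: wrap the fragment text in its terminal element.
        element.text = fragments[tree[1]]
    else:
        # Internal node: nest each child under the non-terminal element.
        for child in tree[1:]:
            element.append(annotate(child, fragments))
    return element


# Hypothetical input document fragments and parsed structure.
fragments = ["Moby Dick", "Chapter 1", "Call me Ishmael."]
parsed = ("Doc", ("title", 0), ("Section", ("heading", 1), ("para", 2)))

xml_text = ET.tostring(annotate(parsed, fragments), encoding="unicode")
print(xml_text)
# prints <Doc><title>Moby Dick</title><Section><heading>Chapter 1</heading><para>Call me Ishmael.</para></Section></Doc>
```

Terminal elements become the innermost tags around fragment text, and non-terminal elements become the enclosing tags, which mirrors the leaves-and-nodes structure the abstract describes.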
4 Claims
Claim 1 is reproduced in full under First Claim above. Dependent claims: 2, 3, 4.
Specification