Document information processing apparatus
First Claim
1. A document information processing apparatus comprising:
a plain document input unit for inputting a plain document;
a dictionary storage unit for storing a dictionary used for form element analysis and syntactic analysis;
a form element analyzer for performing a form element analysis on the plain document inputted from said plain document input unit by using the dictionary stored in said dictionary storage unit so as to decompose the plain document into tokens;
a syntax analyzer for analyzing a part of speech of each of the tokens obtained by said form element analyzer based on a syntax of said plain document so as to generate a structured document containing meaningful words;
a data storage unit for storing data used for a markup process;
an element refinement processing unit for performing the markup process of reading each of the meaningful words in the structured document and automatically adding content to the structured document in association with at least one of the meaningful words in order to generate a markup document; and
a markup document output unit for outputting the markup document generated by said element refinement processing unit,
wherein
the added content is different from the markup tags in the markup document, and
the added content includes at least one of:
data, which is related to at least one of the meaningful words and which is read from the data storage unit, and
data, which is related to at least one of the meaningful words and which is generated according to a determined attribute of the at least one of the meaningful words.
Abstract
A document information processing apparatus includes a form element analyzer (12), a syntax analyzer (13), an element refinement processing unit (15), and a markup document output unit (17). The form element analyzer performs a form element analysis on a plain document inputted from a plain document input unit (10), using a dictionary stored in a dictionary storage unit, so as to decompose the plain document into tokens. The syntax analyzer analyzes the part of speech of each of the tokens obtained by the form element analyzer so as to generate a structured document containing meaningful words. The element refinement processing unit performs a markup process in which data stored in a data storage unit (14) and associated with each meaningful word in the structured document is added to that word, so as to generate a markup document. The markup document output unit outputs the markup document generated by the element refinement processing unit.
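For illustration only, the flow the abstract describes (decompose into tokens, determine parts of speech, add stored data to meaningful words, output a markup document) might be sketched as below. This is not the implementation disclosed in the specification. The dictionary entries, the data store contents, the `<w pos="...">` tag, and all function names are assumptions made for the example. A plain dictionary lookup stands in for real form element and syntax analysis, and only the case of added content read from a data store is covered.

```python
import re
from dataclasses import dataclass

# Hypothetical dictionary (stands in for the dictionary storage unit):
# lower-cased surface form -> part of speech.
DICTIONARY = {
    "the": "article",
    "meeting": "noun",
    "starts": "verb",
    "in": "preposition",
    "tokyo": "proper_noun",
}

# Hypothetical data store (stands in for the data storage unit):
# lower-cased meaningful word -> content to add next to that word.
DATA_STORE = {
    "tokyo": "capital of Japan",
}

# Assumption: only these parts of speech count as "meaningful words".
MEANINGFUL = {"noun", "proper_noun", "verb"}


@dataclass
class Token:
    text: str
    pos: str = "unknown"


def tokenize(plain):
    """Decompose a plain document into tokens (word characters only)."""
    return [Token(t) for t in re.findall(r"\w+", plain)]


def tag(tokens):
    """Assign a part of speech to each token by dictionary lookup."""
    for tok in tokens:
        tok.pos = DICTIONARY.get(tok.text.lower(), "unknown")
    return tokens


def mark_up(tokens):
    """Wrap meaningful words in a tag and add stored data as extra content.

    The added data is inserted as text inside the element, so it is
    content distinct from the markup tag itself.
    """
    parts = []
    for tok in tokens:
        if tok.pos not in MEANINGFUL:
            parts.append(tok.text)
            continue
        note = DATA_STORE.get(tok.text.lower())
        if note is None:
            parts.append(f'<w pos="{tok.pos}">{tok.text}</w>')
        else:
            parts.append(f'<w pos="{tok.pos}">{tok.text} ({note})</w>')
    return " ".join(parts)


def process(plain):
    """Run the whole pipeline: tokens -> tagged tokens -> markup document."""
    return mark_up(tag(tokenize(plain)))


if __name__ == "__main__":
    print(process("The meeting starts in Tokyo"))
```

In this sketch the word "Tokyo" is the only one with an entry in the hypothetical data store, so it is the only word that receives added content; the other meaningful words are tagged with their part of speech and left otherwise unchanged.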
25 Citations
21 Claims
Specification