Method for content mining of semi-structured documents
First Claim
1. A method for content mining of semi-structured documents comprising:
receiving a semi-structured document;
converting said semi-structured document to a document-type independent format;
analyzing formatting information of said semi-structured document;
adding information to said semi-structured document describing said semi-structured document's structure, based upon said analyzing; and
mining said semi-structured document for specified information, wherein said added information facilitates said content mining.
Abstract
Embodiments of the present invention are directed to a method for content mining of semi-structured documents. In one embodiment, a semi-structured document is first converted from a document-type specific format, such as HTML or PDF, to a document-type independent format, such as XML. The document formatting, which contains basic-level information about the document's structure, is then analyzed by a series of modules to develop a higher-level understanding of the document's structure. These modules append information to the document describing the features that collectively comprise the higher-level document structure. The appended information facilitates finding specified information within the document when content mining is performed.
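The flow described in the abstract (convert, analyze formatting, append structural information, mine) can be illustrated with a minimal Python sketch. Everything specific in it is an assumption made for illustration and is not taken from the patent: the `<doc>`/`<block>` XML vocabulary, the `TAG_STYLE` table, the "larger than body text and bold means heading" heuristic, and the function names `convert`, `annotate`, and `mine` are all hypothetical.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical mapping from HTML tags to format-neutral style attributes.
TAG_STYLE = {
    "h1": {"size": "24", "bold": "true"},
    "h2": {"size": "18", "bold": "true"},
    "p": {"size": "12", "bold": "false"},
}


class _ToGenericXml(HTMLParser):
    """Convert HTML into a document-type independent <doc> of <block> elements."""

    def __init__(self):
        super().__init__()
        self.root = ET.Element("doc")
        self._current = None

    def handle_starttag(self, tag, attrs):
        if tag in TAG_STYLE:
            # Copy the style dict so later annotation never touches TAG_STYLE.
            self._current = ET.SubElement(self.root, "block", dict(TAG_STYLE[tag]))
            self._current.text = ""

    def handle_data(self, data):
        if self._current is not None:
            self._current.text += data

    def handle_endtag(self, tag):
        if tag in TAG_STYLE:
            self._current = None


def convert(html):
    """Conversion step: HTML string in, generic XML tree out."""
    parser = _ToGenericXml()
    parser.feed(html)
    parser.close()
    return parser.root


def annotate(doc):
    """Analysis step: read formatting and append a 'role' attribute per block."""
    sizes = [int(b.get("size")) for b in doc.iter("block")]
    if not sizes:
        return doc
    body_size = min(sizes)  # assumed heuristic: smallest size is body text
    for block in doc.iter("block"):
        is_heading = int(block.get("size")) > body_size and block.get("bold") == "true"
        block.set("role", "heading" if is_heading else "body")
    return doc


def mine(doc, heading_text):
    """Mining step: collect body text that follows the named heading."""
    in_section, collected = False, []
    for block in doc.iter("block"):
        if block.get("role") == "heading":
            in_section = block.text.strip().lower() == heading_text.lower()
        elif in_section:
            collected.append(block.text.strip())
    return collected


html = (
    "<h1>Report</h1><p>Intro.</p>"
    "<h2>Results</h2><p>Yield rose 4%.</p>"
    "<h2>Notes</h2><p>None.</p>"
)
doc = annotate(convert(html))
results = mine(doc, "Results")
```

In this sketch the mining step never looks at font sizes or tags; it relies only on the `role` attribute appended during analysis, which is one way the added information could make the search simpler.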
61 Citations
21 Claims
1. A method for content mining of semi-structured documents comprising:
receiving a semi-structured document;
converting said semi-structured document to a document-type independent format;
analyzing formatting information of said semi-structured document;
adding information to said semi-structured document describing said semi-structured document's structure, based upon said analyzing; and
mining said semi-structured document for specified information, wherein said added information facilitates said content mining.
8. A computer system comprising:
a bus;
a memory unit coupled to said bus; and
a processor coupled to said bus, said processor for executing a method for content mining of semi-structured documents, said method comprising:
receiving a semi-structured document;
converting said semi-structured document to a document-type independent format;
analyzing formatting information of said semi-structured document;
adding information to said semi-structured document describing said semi-structured document's structure, based upon said analyzing; and
mining said semi-structured document for specified information, wherein said added information facilitates said content mining.
15. A computer-usable medium having computer-readable program code embodied therein for causing a computer system to perform a method for content mining of semi-structured documents comprising:
receiving a semi-structured document;
converting said semi-structured document to a document-type independent format;
analyzing formatting information of said semi-structured document;
adding information to said semi-structured document describing said semi-structured document's structure, based upon said analyzing; and
mining said semi-structured document for specified information, wherein said added information facilitates said content mining.
Specification