SYSTEM AND METHOD TO EXTRACT MODELS FROM SEMI-STRUCTURED DOCUMENTS
Abstract
Systems and associated methods for automated and semi-automated building of domain models for documents are described. Embodiments provide an approach to discover an information model by mining documentation about a particular domain captured in the documents. Embodiments classify the documents into one or more types corresponding to concepts using indicative words, identify candidate model elements (concepts) for document types, identify relationships both within and across document types, and consolidate and learn a global model for the domain.
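The stages named in the abstract (classify by indicative words, identify candidate concepts per document type, relate types, consolidate) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the patent: documents are modeled as field-to-value dictionaries, candidate concepts are simply field names that recur within a type, cross-type relationships are shared field names, and the names `INDICATIVE_WORDS`, `classify`, `build_global_model`, and `min_support` are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical indicative words per document type (user-supplied input).
INDICATIVE_WORDS = {
    "invoice": {"invoice", "amount", "due"},
    "contract": {"contract", "party", "term"},
}


def classify(doc):
    """Pick the type whose indicative words overlap the document's words most."""
    words = set()
    for key, value in doc.items():
        words.update(key.lower().split())
        words.update(str(value).lower().split())
    scores = {t: len(words & ind) for t, ind in INDICATIVE_WORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"


def build_global_model(docs, min_support=0.6):
    """Group documents by type, keep frequent field names as concepts,
    and relate any two types that share a concept."""
    by_type = defaultdict(list)
    for doc in docs:
        by_type[classify(doc)].append(doc)

    # Candidate concepts: fields present in at least min_support of a type's documents.
    concepts = {}
    for doc_type, members in by_type.items():
        counts = Counter(field for doc in members for field in doc)
        concepts[doc_type] = sorted(
            f for f, n in counts.items() if n / len(members) >= min_support
        )

    # Cross-type relationships: concepts that appear in both types.
    types = sorted(concepts)
    relations = []
    for i, a in enumerate(types):
        for b in types[i + 1:]:
            shared = sorted(set(concepts[a]) & set(concepts[b]))
            if shared:
                relations.append((a, b, shared))
    return {"concepts": concepts, "relations": relations}


docs = [
    {"invoice number": "17", "customer": "Acme", "amount due": "100"},
    {"invoice number": "18", "customer": "Bolt", "amount due": "250", "note": "rush"},
    {"contract id": "C-1", "customer": "Acme", "term": "12 months"},
]
model = build_global_model(docs)
```

In this toy run the one-off `note` field falls below the support threshold and is dropped, while `customer` links the two document types. A real embodiment would presumably use richer concept and relationship extraction than field-name counting.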
20 Claims
1. A method for producing a global model describing a collection of documents comprising:
accessing a collection of documents, the collection of documents comprising labeled documents and unlabeled documents;
receiving input identifying indicative words for classifications;
generating a classification model;
classifying documents of the collection of documents to produce classified documents of one or more types;
extracting concepts from the classified documents;
generating a global model from the concepts; and
outputting the global model.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9
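As one possible reading of the classification steps in claim 1, the sketch below builds a classification model from labeled documents plus user-supplied indicative words and then applies it to unlabeled documents. The weighting scheme (relative word frequency plus a fixed boost for indicative words), the example document types, and the names `generate_classification_model`, `classify`, and `boost` are assumptions for illustration only; the claim does not prescribe a particular classifier.

```python
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase whitespace tokenization (a deliberately simple assumption)."""
    return text.lower().split()


def generate_classification_model(labeled, indicative_words, boost=3.0):
    """Per-type word weights: relative frequency in the labeled documents,
    plus a fixed boost for each user-supplied indicative word."""
    counts = defaultdict(Counter)
    for text, doc_type in labeled:
        counts[doc_type].update(tokenize(text))

    model = {}
    for doc_type in set(counts) | set(indicative_words):
        total = sum(counts[doc_type].values()) or 1
        weights = {w: n / total for w, n in counts[doc_type].items()}
        for word in indicative_words.get(doc_type, ()):
            weights[word] = weights.get(word, 0.0) + boost
        model[doc_type] = weights
    return model


def classify(model, text):
    """Return the highest-scoring type, or None if no known word matches."""
    tokens = tokenize(text)
    scores = {t: sum(w.get(tok, 0.0) for tok in tokens) for t, w in model.items()}
    best = max(sorted(scores), key=scores.get)
    return best if scores[best] > 0 else None


# Hypothetical labeled documents, indicative words, and unlabeled documents.
labeled = [
    ("server outage reported in data center", "incident"),
    ("request to change firewall rule", "change"),
]
indicative = {"incident": ["outage", "failure"], "change": ["change", "upgrade"]}
unlabeled = ["disk failure on host", "upgrade database version"]

model = generate_classification_model(labeled, indicative)
classified = {text: classify(model, text) for text in unlabeled}
```

Here the indicative words let the model type documents whose vocabulary never appeared in the labeled set, which is the practical reason to combine the two inputs.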
10. A computer program product for producing a global model describing a collection of documents comprising:
a computer readable storage medium having computer readable program code embodied therewith, the computer readable program code comprising:
computer readable program code configured to access a collection of documents, the collection of documents comprising labeled documents and unlabeled documents;
computer readable program code configured to ascertain input identifying indicative words for classifications;
computer readable program code configured to generate a classification model;
computer readable program code configured to classify documents of the collection of documents to produce classified documents of one or more types;
computer readable program code configured to extract concepts from the classified documents;
computer readable program code configured to generate a global model from the concepts; and
computer readable program code configured to output the global model.
Dependent claims: 11, 12, 13, 14, 15, 16, 17, 18
19. A system for producing a global model describing a collection of documents comprising:
one or more processors; and
a memory operatively connected to the one or more processors;
wherein, responsive to execution of computer readable program code accessible to the one or more processors, the one or more processors are configured to:
access a collection of documents, the collection of documents comprising labeled documents and unlabeled documents;
receive input identifying indicative words for classifications;
generate a classification model;
classify documents of the collection of documents to produce classified documents of one or more types;
extract concepts from the classified documents;
generate a global model from the concepts; and
output the global model.
Dependent claims: 20
Specification