System and method to extract models from semi-structured documents

  • US 10,089,390 B2
  • Filed: 09/24/2010
  • Issued: 10/02/2018
  • Est. Priority Date: 09/24/2010
  • Status: Active Grant
First Claim

1. A method for producing a global model describing a collection of documents comprising:

  • executing with one or more processors one or more modules of computer program code configured for accessing a collection of documents, the collection of documents comprising labeled documents and unlabeled documents;

    receiving input of at least one indicative word, wherein the at least one indicative word comprises a descriptive word for classification and wherein the at least one indicative word indicates a probability of belonging to a classification based upon the indicative word occurring in a document during classification;

    classifying both labeled documents and unlabeled documents of the collection of documents to produce classified documents of one or more types, wherein the classifying comprises producing a domain sub-model for each document type, wherein the domain sub-model represents a graphical representation of a set of concepts contained within each document type and wherein the domain sub-model is generated using the labeled documents and the at least one indicative word;

    wherein the producing a domain sub-model for each document type comprises extracting concepts from each of the documents and determining relationships between the concepts, wherein the extracting concepts comprises producing concept pairs by identifying, within the collection of documents, co-occurring candidate concepts and wherein the determining relationships between the concepts comprises identifying relationship links between source and destination candidate concepts, wherein the identifying relationship links comprises extracting, from each of the documents of the collection of documents, a hierarchical structure, searching for adjacent container pairs within the hierarchical structures, and inferring directed relationships between elements within the adjacent container pairs;

    thereupon generating a global domain model for the documents of the collection of the documents by merging the produced domain sub-models, based on the relationships between the concepts;

    said generating of a global domain model comprising aggregating identified relationship links and corresponding concepts of each of the domain sub-models across the produced domain sub-models, wherein the relationship links and corresponding concepts selected for aggregation are based upon a strategy identified based upon a level of manual review;

    thereupon outputting the global model as a graphical representation comprising the aggregated concepts and relationship links between concepts;

    ascertaining one or more changes to the collection of documents; and

    generating a new global model based on the one or more changes to the collection of documents by reclassifying the collection of documents and generating a new global model using the new domain sub-models generated during reclassification of the collection of documents.
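
As an informal illustration only, and not the patented implementation, three steps recited in the claim (producing concept pairs from co-occurring candidate concepts, inferring directed links from adjacent container pairs in a hierarchical structure, and merging domain sub-models into a global model) can be sketched in Python. Every name here (`cooccurring_pairs`, `adjacent_links`, `merge_sub_models`, `min_support`) is hypothetical. The sketch assumes each document is already reduced to a list of concept strings, that a hierarchical structure is a nested dict with `name` and `children` keys, that link direction runs from the earlier sibling container to the later one, and that a simple support threshold stands in for the claim's aggregation strategy based on a level of manual review.

```python
from collections import Counter
from itertools import combinations


def cooccurring_pairs(documents):
    """Count unordered concept pairs that occur together in the same document."""
    counts = Counter()
    for concepts in documents:
        # Deduplicate and sort so each pair has one canonical ordering.
        for pair in combinations(sorted(set(concepts)), 2):
            counts[pair] += 1
    return counts


def adjacent_links(tree):
    """Return directed (source, destination) links between adjacent siblings.

    Assumption: direction runs from the earlier sibling to the later one.
    """
    links = []
    children = tree.get("children", [])
    for left, right in zip(children, children[1:]):
        links.append((left["name"], right["name"]))
    # Recurse so nested containers contribute their own adjacent pairs.
    for child in children:
        links.extend(adjacent_links(child))
    return links


def merge_sub_models(sub_models, min_support=1):
    """Aggregate links across sub-models into one global model.

    Assumption: a link is kept when at least `min_support` sub-models
    contain it; this threshold is a stand-in for a review-based strategy.
    """
    support = Counter()
    for links in sub_models.values():
        support.update(set(links))
    kept = {link for link, n in support.items() if n >= min_support}
    concepts = {concept for link in kept for concept in link}
    return {"concepts": concepts, "links": kept}
```

With two hypothetical sub-models, `{"invoice": [("A", "B"), ("a1", "a2")], "order": [("A", "B")]}`, calling `merge_sub_models(..., min_support=2)` keeps only the link shared by both document types. Raising or lowering `min_support` is one simple way to trade recall against the amount of manual review the resulting model would need.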

  • 2 Assignments