Decision-support expert system and methods for real-time exploitation of documents in non-English languages

  • US 8,078,551 B2
  • Filed: 08/31/2006
  • Issued: 12/13/2011
  • Est. Priority Date: 08/31/2005
  • Status: Active Grant
First Claim

1. A method for automatic real-time summarization and information extraction from one or more documents in a source language in order to present their content in a summary in a chosen target language and to input the content as categorized entities, the method comprising:

    a. selecting said source language;

    b. processing an input document into a textual format document while preserving visual information;

    c. performing linguistic analysis based on models of a linguistic-register associated with the analyzed document;

    d. extracting lexical instances corresponding to at least one of a pre-determined domain-specific lexicon and run-time lexical instances built according to pre-determined syntactic rules;

    e. extracting ontology elements from the lexical instances to obtain a set of ontological instances comprising instances from a pre-determined ontology and refined ontological instances, said refined ontological instances comprising a combination of pre-determined ontological instances;

    f. creating a document digest (DD) from the set of ontological instances, the DD being presented in a relationary map which retains information on features, context and linguistic origin of the components of the DD;

    g. using domain-specific statistical models relating to categorization clusters and categories within said categorization clusters to determine the most likely categorization of the document in relation to a spectrum of categories by comparing the updated DD to each category model;

    wherein the step of using domain-specific models includes:

    i. creating a vector representing each DD using a pre-categorized corpus of domain-specific documents;

    ii. analyzing those documents to extract a model of a document digest which corresponds to or contradicts each category of document in the domain;

    iii. using statistical algorithms to create the models and creating mixed models of the different results which are specific to each category;

    iv. creating a mixed model of the results of analysis of each text;

    v. calculating a similarity result between each of a plurality of statistical models obtained offline of said categories and the vector generated by the DD of the input text and determining the most likely category in each cluster;

    vi. adding ontological instances to the DD based on the input of the categorization upon which rules from the rule base are applied; and

    vii. analyzing the winner model to extract a confidence score related to the DD;

    h. creating in said target language a document summary for each of said one or more documents based on said DD and summarization rules.
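
Sub-steps g.i, g.v and g.vii can be illustrated with a minimal sketch. It is not the patented implementation: the claim does not name a similarity measure or a confidence formula, so cosine similarity and a winner-minus-runner-up margin are used here as stand-ins. The function names (`digest_vector`, `cosine`, `categorize`) and the sample category models are hypothetical, and the models are assumed to have been built offline as simple term-weight dictionaries.

```python
import math
from collections import Counter


def digest_vector(instances):
    """Turn the ontological instances of a document digest (DD) into a count vector."""
    return Counter(instances)


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(key, 0.0) for key, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def categorize(dd_vector, clusters):
    """Pick the most similar category in each cluster.

    Returns {cluster: (category, similarity, confidence)}. Confidence is
    ASSUMED to be the margin between the winner and the runner-up; the
    claim does not specify how it is computed.
    """
    results = {}
    for cluster, models in clusters.items():
        if not models:
            continue  # nothing to compare against in this cluster
        scored = sorted(
            ((cosine(dd_vector, model), name) for name, model in models.items()),
            reverse=True,
        )
        best_score, best_name = scored[0]
        runner_up = scored[1][0] if len(scored) > 1 else 0.0
        results[cluster] = (best_name, best_score, best_score - runner_up)
    return results


# Hypothetical category models, standing in for models obtained offline.
clusters = {
    "topic": {
        "finance": {"bank": 2.0, "transfer": 1.0},
        "travel": {"airport": 2.0, "visa": 1.0},
    }
}

dd = digest_vector(["bank", "transfer", "bank", "airport"])
result = categorize(dd, clusters)
print(result["topic"][0])
```

With these sample models the digest is assigned to the "finance" category of the "topic" cluster. A margin-based confidence is only one plausible reading of step g.vii; a calibrated probability would serve equally well under the claim language.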
