Decision-support expert system and methods for real-time exploitation of documents in non-English languages
First Claim
1. A method for automatic real-time summarization and information extraction from one or more documents in a source language in order to present their content in a summary in a chosen target language and to input the content as categorized entities, the method comprising:
a. selecting said source language;
b. processing an input document into a textual format document while preserving visual information;
c. performing linguistic analysis based on models of a linguistic-register associated with the analyzed document;
d. extracting lexical instances corresponding to at least one of a pre-determined domain-specific lexicon and run-time lexical instances built according to pre-determined syntactic rules;
e. extracting ontology elements from the lexical instances to obtain a set of ontological instances comprising instances from a pre-determined ontology and refined ontological instances, said refined ontological instances comprising a combination of pre-determined ontological instances;
f. creating a document digest (DD) from the set of ontological instances, the DD being presented in a relationary map which retains information on features, context and linguistic origin of the components of the DD;
g. using domain-specific statistical models relating to categorization clusters and categories within said categorization clusters to determine the most likely categorization of the document in relation to a spectrum of categories by comparing the updated DD to each category model;
wherein the step of using domain-specific models includes:
i. creating a vector representing each DD using a pre-categorized corpus of domain-specific documents;
ii. analyzing those documents to extract a model of a document digest which corresponds to or contradicts each category of document in the domain;
iii. using statistical algorithms to create the models and creating mixed models of the different results which are specific to each category;
iv. creating a mixed model of the results of analysis of each text;
v. calculating a similarity result between each of a plurality of statistical models obtained offline of said categories and the vector generated by the DD of the input text and determining the most likely category in each cluster;
vi. adding ontological instances to the DD based on the input of the categorization upon which rules from the rule base are applied; and
vii. analyzing the winner model to extract a confidence score related to the DD;
h. creating in said target language a document summary for each of said one or more documents based on said DD and summarization rules.
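Step g, and in particular sub-steps v and vii, can be illustrated with a minimal sketch. The claim does not name a similarity measure or a confidence formula, so everything specific here is an assumption made for illustration: cosine similarity over sparse term-weight vectors, a confidence score defined as the margin between the best and second-best category in a cluster, and the hypothetical names `cosine`, `categorize`, `dd_vector` and `clusters`. The category models are hard-coded stand-ins for models that the claim says are obtained offline from a pre-categorized corpus.

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(k, 0.0) for k, w in a.items())
    na = sqrt(sum(w * w for w in a.values()))
    nb = sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def categorize(dd_vector, clusters):
    """Return {cluster: (winning category, confidence)}.

    Assumption: confidence is the similarity margin between the winner and
    the runner-up; the claim only says a confidence score is extracted.
    Each cluster is assumed to contain at least one category model.
    """
    results = {}
    for cluster, models in clusters.items():
        ranked = sorted(
            ((cosine(dd_vector, model), name) for name, model in models.items()),
            reverse=True,
        )
        best_score, best_name = ranked[0]
        runner_up = ranked[1][0] if len(ranked) > 1 else 0.0
        results[cluster] = (best_name, best_score - runner_up)
    return results


# Hypothetical vector generated from the DD of one input text.
dd_vector = {"wire_transfer": 2.0, "shell_company": 1.0}

# Hypothetical offline category models, grouped by categorization cluster.
clusters = {
    "topic": {
        "finance": {"wire_transfer": 1.0, "shell_company": 1.0},
        "sports": {"match": 1.0, "goal": 1.0},
    },
    "urgency": {
        "high": {"wire_transfer": 1.0},
        "low": {"shell_company": 1.0},
    },
}

result = categorize(dd_vector, clusters)
for cluster, (category, confidence) in result.items():
    print(cluster, category, round(confidence, 3))
```

Each cluster is evaluated independently, which matches the claim's wording of determining the most likely category in each cluster. Sub-step vi, in which rules from the rule base add ontological instances to the DD based on the categorization, is not modeled in this sketch.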
Abstract
A method for real-time exploitation of documents in non-English languages includes processing an input document into a processed input document, extracting ontology elements from the processed input document to obtain a document digest (DD), statistically scoring each DD to obtain a DD with category scores, and refining the DD and the category scores to obtain a summary of each document in the form of a refined DD with refined category scores. The summary allows a user to estimate in real time whether the input document warrants added attention.
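The data flow in the abstract (a DD, then a DD with category scores, then a refined DD with refined category scores) can be sketched as follows. This is an illustration under stated assumptions, not the patented method: the `DocumentDigest` class, the `score` and `refine` functions, the share-of-matching-instances scoring and the drop-and-renormalize refinement are all hypothetical placeholders for the statistical models and rules that the abstract leaves unspecified.

```python
from dataclasses import dataclass, field


@dataclass
class DocumentDigest:
    """Hypothetical container for a DD: extracted instances plus scores."""
    instances: list
    category_scores: dict = field(default_factory=dict)


def score(dd, category_terms):
    """Toy scoring: share of DD instances that match each category's terms."""
    total = len(dd.instances) or 1
    scores = {
        category: sum(instance in terms for instance in dd.instances) / total
        for category, terms in category_terms.items()
    }
    return DocumentDigest(dd.instances, scores)


def refine(dd, floor=0.25):
    """Toy refinement: drop categories below `floor`, renormalize the rest."""
    kept = {c: s for c, s in dd.category_scores.items() if s >= floor}
    norm = sum(kept.values()) or 1.0
    return DocumentDigest(dd.instances, {c: s / norm for c, s in kept.items()})


# Hypothetical ontology elements extracted from one processed input document.
dd = DocumentDigest(["wire_transfer", "shell_company", "bank", "meeting"])

# Hypothetical category vocabularies standing in for the statistical models.
category_terms = {
    "finance": {"wire_transfer", "shell_company", "bank"},
    "logistics": {"meeting", "bank"},
    "travel": {"flight"},
}

scored = score(dd, category_terms)
refined = refine(scored)
print(refined.category_scores)
```

A user-facing summary would then present the refined category scores alongside the DD so that a reader can judge whether the document warrants added attention; that presentation layer is outside this sketch.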
14 Claims
Specification