Method for preserving conceptual distance within unstructured documents
First Claim
1. A computer-implemented method for characterizing content of documents by conceptual relationships, comprising:
applying natural language processing (NLP) to content in a plurality of documents to identify topics and subjects;
applying analytic analysis to the topics and subjects to identify conceptual relationships of the content in the plurality of documents;
partitioning the content in each of the plurality of documents into a first structured hierarchy, preserving at least one structure in each document inherent in the each document; and
providing access to content through a first index based upon utilizing the first structured hierarchy and through a second index utilizing a second structured hierarchy; and
wherein the content is characterized by optimizing a vector space model representation of the documents, the optimization performed by a system capable of answering questions, where:
the content from the plurality of documents is ingested by the system;
natural language processing is applied to the content in the plurality of documents to identify terms, topics, subjects and concepts;
the content is partitioned according to a semantic parse distance to identify a context for partitioned content;
the content and context is represented, by the system, utilizing a vector space model;
entries in the vector space model are eliminated based on a difference criteria; and
an iterative genetic algorithm is applied to optimize features of the vector space model.
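The last three steps of the claim (vector space representation, elimination of entries, and genetic optimization of features) can be illustrated with a minimal Python sketch. It is not the patented implementation. Everything concrete in it is an assumption made for illustration: the passages are taken as already partitioned, the vector space model is a plain term-frequency matrix, the "difference criteria" is read as dropping terms whose counts barely differ across passages, and the fitness function and the names `build_vsm`, `eliminate`, `fitness` and `evolve` are hypothetical.

```python
import math
import random
from collections import Counter

def build_vsm(passages):
    """Return (vocabulary, rows); each row is a term-frequency vector."""
    counts = [Counter(p.lower().split()) for p in passages]
    vocab = sorted(set().union(*counts))
    return vocab, [[c[t] for t in vocab] for c in counts]

def eliminate(vocab, rows, min_spread=1):
    """Assumed difference criterion: drop terms whose max-min count is below min_spread."""
    keep = [j for j in range(len(vocab))
            if max(r[j] for r in rows) - min(r[j] for r in rows) >= min_spread]
    return [vocab[j] for j in keep], [[r[j] for j in keep] for r in rows]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def fitness(mask, rows, related, unrelated):
    """Assumed objective: mean similarity of related pairs minus that of unrelated pairs."""
    proj = [[v for v, m in zip(r, mask) if m] for r in rows]
    mean_sim = lambda pairs: sum(cosine(proj[i], proj[j]) for i, j in pairs) / len(pairs)
    return mean_sim(related) - mean_sim(unrelated)

def evolve(rows, related, unrelated, generations=30, pop_size=20, seed=0):
    """Iterative genetic search over binary feature masks (needs at least 2 features)."""
    rng = random.Random(seed)
    n = len(rows[0])
    score = lambda m: fitness(m, rows, related, unrelated)
    pop = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    pop[0] = [1] * n                       # keep the all-features mask as a baseline
    for _ in range(generations):
        pop.sort(key=score, reverse=True)
        parents = pop[: pop_size // 2]     # elitist selection: best half survives
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n)      # one-point crossover
            child = a[:cut] + b[cut:]
            k = rng.randrange(n)           # single-bit mutation
            child[k] = 1 - child[k]
            children.append(child)
        pop = parents + children
    return max(pop, key=score)

# Illustrative data only: four short passages and hand-labeled pairs.
passages = [
    "the engine oil pressure warning",
    "the engine oil filter",
    "the tire pressure warning",
    "the tire rotation schedule",
]
related, unrelated = [(0, 1), (2, 3)], [(0, 2), (1, 3)]

raw_vocab, raw_rows = build_vsm(passages)
vocab, rows = eliminate(raw_vocab, raw_rows)   # "the" is dropped: same count everywhere
best = evolve(rows, related, unrelated)
selected = [t for t, m in zip(vocab, best) if m]
```

Because the best half of each generation is carried over unchanged, the selected mask never scores worse than the all-features baseline under this assumed objective. A real system would need a fitness signal tied to question-answering quality, which the claim mentions but does not specify here.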
Abstract
A method, system and computer-usable medium are disclosed for preserving conceptual distance within unstructured documents by characterizing conceptual relationships. Natural language processing is applied to content in a plurality of documents to identify topics and subjects. Analytic analysis is then applied to the identified topics and subjects to identify concepts. The content in each of the plurality of documents is partitioned into a first structured hierarchy, preserving at least one structure inherent in each document. Access is then provided to the content through a first index utilizing the first structured hierarchy and through a second index utilizing a second structured hierarchy. The conceptual relationship criteria are based upon a directed graph with weights based upon a similarity and a distance based upon concepts.
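The two access paths and the weighted directed graph described in the abstract can be sketched in Python as follows. This is an illustration under stated assumptions, not the disclosed system: the `Passage` type, the flat concept index standing in for the second structured hierarchy, the use of Jaccard overlap as the similarity, and the count of unshared concepts as the distance are all hypothetical choices, and the concepts are assumed to come from an upstream NLP step.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class Passage:
    path: tuple          # position in the document's own structure (document, chapter, section)
    text: str
    concepts: frozenset  # assumed output of an upstream NLP step

def build_indexes(passages):
    """First index follows document structure; second index is keyed by concept."""
    by_structure = {p.path: p for p in passages}
    by_concept = defaultdict(list)
    for p in passages:
        for c in sorted(p.concepts):
            by_concept[c].append(p.path)
    return by_structure, dict(by_concept)

def concept_graph(passages):
    """Directed edges follow reading order; each edge carries (similarity, distance)."""
    edges = {}
    ordered = sorted(passages, key=lambda p: p.path)
    for i, src in enumerate(ordered):
        for dst in ordered[i + 1:]:
            shared = src.concepts & dst.concepts
            if not shared:
                continue                             # no conceptual link, no edge
            union = src.concepts | dst.concepts
            similarity = len(shared) / len(union)    # Jaccard overlap (assumed measure)
            distance = len(union) - len(shared)      # concepts not shared (assumed measure)
            edges[(src.path, dst.path)] = (similarity, distance)
    return edges

# Illustrative data only.
a = Passage(("manual", "1", "1.1"), "Check the oil level weekly.", frozenset({"oil", "maintenance"}))
b = Passage(("manual", "1", "1.2"), "Replace the oil filter.", frozenset({"oil", "filter"}))
c = Passage(("manual", "2", "2.1"), "Rotate tires regularly.", frozenset({"tire", "maintenance"}))

by_structure, by_concept = build_indexes([a, b, c])
edges = concept_graph([a, b, c])
```

Keeping the structural path as the key of the first index is what lets a passage found through the concept index be traced back to its original place in the document.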
5 Claims
Claims 2, 3, 4 and 5 depend on claim 1, which is reproduced in full under First Claim above.
Specification