Method for preserving conceptual distance within unstructured documents

US 9,424,299 B2
Filed: 03/09/2015
Issued: 08/23/2016
Est. Priority Date: 10/07/2014
Status: Expired due to Fees

Abstract

A method, system and computer-usable medium are disclosed for preserving conceptual distance within unstructured documents by characterizing conceptual relationships. Natural language processing is applied to content in a plurality of documents to identify topics and subjects. Analytic analysis is then applied to the identified topics and subjects to identify concepts. The content in each of the plurality of documents is partitioned into a first structured hierarchy, preserving at least one structure in each document inherent in the each document. Access is then provided to the content through a first index based upon utilizing the first structured hierarchy and through a second index utilizing a second structured hierarchy. The conceptual relationship criteria are based upon a directed graph with weights based upon a similarity and a distance based upon concepts.
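The partitioning and dual-index steps described in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions, not the patented implementation: the heading convention (`# `), the names `partition`, `build_indexes` and `TOPIC_HIERARCHY`, and the keyword lookup standing in for NLP are all hypothetical.

```python
from collections import defaultdict

# Hypothetical lexicon mapping a term to a (field, topic) path in a second,
# concept-based hierarchy. A real system would derive this via NLP.
TOPIC_HIERARCHY = {
    "heart": ("medicine", "cardiology"),
    "artery": ("medicine", "cardiology"),
    "tumor": ("medicine", "oncology"),
}


def partition(doc_id, text):
    """Split a document into (path, passage) pairs, preserving its sections.

    Assumption: blocks are separated by blank lines and a block starting
    with '# ' is a section heading.
    """
    nodes = []
    section, para_no = "preamble", 0
    for block in text.strip().split("\n\n"):
        block = block.strip()
        if not block:
            continue
        if block.startswith("# "):
            section, para_no = block[2:].strip(), 0
            continue
        para_no += 1
        nodes.append(((doc_id, section, para_no), block))
    return nodes


def build_indexes(docs):
    """Return (structural index, conceptual index) over all passages."""
    structural = {}                 # first index: document structure -> passage
    conceptual = defaultdict(list)  # second index: topic path -> passage paths
    for doc_id, text in docs.items():
        for path, passage in partition(doc_id, text):
            structural[path] = passage
            words = {w.strip(".,;:").lower() for w in passage.split()}
            topics = {TOPIC_HIERARCHY[w] for w in words if w in TOPIC_HIERARCHY}
            for topic in sorted(topics):
                conceptual[topic].append(path)
    return structural, dict(conceptual)
```

The same passage is reachable either by where it sits in its source document or by the concept it mentions, which is the sense in which the two indexes use two different hierarchies in this sketch.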

5 Claims

1. A computer-implemented method for characterizing content of documents by conceptual relationships, comprising:
- applying natural language processing (NLP) to content in a plurality of documents to identify topics and subjects;
- applying analytic analysis to the topics and subjects to identify a conceptual relationships of the content in the plurality of documents;
- partitioning the content in each of the plurality of documents into a first structured hierarchy, preserving at least one structure in each document inherent in the each document; and
- providing access to content through a first index based upon utilizing the first structured hierarchy and through a second index utilizing a second structured hierarchy; and
- wherein the content is characterized by optimizing a vector space model representation of the documents, the optimization performed by a system capable of answering questions, where;
  - the content from the plurality of documents is ingested by the system;
  - natural language processing is applied to the content in the plurality of documents to identify terms, topics, subjects and concepts;
  - the content is partitioned according to a semantic parse distance to identify a context for partitioned content;
  - the content and context is represented, by the system, utilizing a vector space model;
  - entries in the vector space model are eliminated based on a difference criteria; and
  - an iterative genetic algorithm is applied to optimize features of the vector space model.
2. The method of claim 1, wherein:
- the conceptual relationship is based upon a directed graph with weights based upon a similarity and a distance based upon concepts.
3. The method of claim 2, wherein:
- the distance is based upon a topic hierarchy.
4. The method of claim 1, wherein:
- a ground truth is an optimized feature.
5. The method of claim 4, wherein:
- the genetic algorithm determines which features are used during the ingesting and has weighting based on semantic distance.
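
The vector space steps recited in claim 1 (representing passages as vectors, eliminating entries by a difference criterion, and iteratively optimizing features with a genetic algorithm) might look something like the sketch below. Several things here are assumptions, since the claims do not specify them: the "difference criteria" is read as dropping entries that are near-duplicates of one already kept, the optimized "features" are per-term weights, and the ground truth is a small list of (query, expected entry) pairs used as the fitness target. All function names are hypothetical, and the semantic-distance weighting of claim 5 is not modeled.

```python
import math
import random
from collections import Counter


def vectorize(text):
    """Bag-of-words term counts for one partitioned passage."""
    return Counter(text.lower().split())


def cosine(a, b, weights=None):
    """Cosine similarity; optional per-term weights default to 1.0."""
    w = weights or {}
    dot = sum(w.get(t, 1.0) * a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(w.get(t, 1.0) * v * v for t, v in a.items()))
    nb = math.sqrt(sum(w.get(t, 1.0) * v * v for t, v in b.items()))
    return dot / (na * nb) if na and nb else 0.0


def prune(vectors, min_difference=0.2):
    """Keep an entry only if (1 - cosine) to every kept entry is large enough."""
    kept = []
    for v in vectors:
        if all(1.0 - cosine(v, k) >= min_difference for k in kept):
            kept.append(v)
    return kept


def fitness(weights, vectors, ground_truth):
    """Count ground-truth queries whose best-matching entry is the expected one."""
    hits = 0
    for query, expected in ground_truth:
        q = vectorize(query)
        best = max(range(len(vectors)), key=lambda i: cosine(q, vectors[i], weights))
        hits += best == expected
    return hits


def evolve(vectors, ground_truth, generations=30, pop_size=20, seed=0):
    """Evolve term weights in [0, 1]; needs at least two distinct terms."""
    rng = random.Random(seed)
    terms = sorted({t for v in vectors for t in v})

    def score(individual):
        return fitness(dict(zip(terms, individual)), vectors, ground_truth)

    # Seed the population with uniform weights so the result is never worse
    # than the unweighted baseline; the rest are random.
    population = [[1.0] * len(terms)]
    population += [[rng.random() for _ in terms] for _ in range(pop_size - 1)]
    for _ in range(generations):
        population.sort(key=score, reverse=True)
        parents = population[: pop_size // 2]       # elitist truncation selection
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, len(terms))      # one-point crossover
            child = a[:cut] + b[cut:]
            child[rng.randrange(len(terms))] = rng.random()  # mutate one weight
            children.append(child)
        population = parents + children
    best = max(population, key=score)
    return dict(zip(terms, best)), score(best)
```

Because the top half of each generation is carried over unchanged, the best fitness never decreases between generations. A fixed `seed` keeps runs repeatable.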

Current Assignee: International Business Machines Corporation
Original Assignee: International Business Machines Corporation
Inventors: Bufe, John P., Winkler, Timothy P.
Primary Examiner(s): RIES, LAURIE ANNE
Application Number: US14/641,527
Publication Number: US 20160098398A1
Time in Patent Office: 533 Days
Field of Search
US Class Current: 1/1
CPC Class Codes:
G06F 16/2272   Management thereof
G06F 16/313   Selection or weighting of t...
G06F 16/93   Document management systems
G06F 40/247   Thesauruses; Synonyms
G06F 40/295   Named entity recognition
G06F 40/30   Semantic analysis
G06F 40/40   Processing or translation o...
