Identifying a stale data source to improve NLP accuracy

US 10,394,863 B2
Filed: 03/12/2013
Issued: 08/27/2019
Est. Priority Date: 03/12/2013
Status: Active Grant

Abstract

In some NLP systems, queries are compared to different data sources stored in a corpus to provide an answer to the query. However, the best data sources for answering the query may not currently be contained within the corpus, or the data sources in the corpus may contain stale data that provides an inaccurate answer. When receiving a query, the NLP system may evaluate the query to identify a data source that is likely to contain an answer to the query. If the data source is not currently contained within the corpus, the NLP system may ingest the data source. If the data source is already within the corpus, however, the NLP system may determine a time-sensitivity value associated with at least some portion of the query. This value may then be used to determine whether the data source should be re-ingested, e.g., because the information contained in the corpus is stale.
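
The decision flow described in the abstract can be sketched roughly as follows. This is an illustrative sketch only, not the patented implementation: the `SourceRecord` type, the `needs_ingest` function, the 30-day `max_age` default, and the rule that a time-sensitivity value in [0, 1] shrinks the tolerated age are all hypothetical assumptions. The abstract does not say how the value is computed or applied.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class SourceRecord:
    """Hypothetical bookkeeping entry for one data source in the corpus."""
    name: str
    last_ingested: Optional[datetime]  # None means the source was never ingested


def needs_ingest(record: SourceRecord, time_sensitivity: float, now: datetime,
                 max_age: timedelta = timedelta(days=30)) -> bool:
    """Decide whether to ingest or re-ingest a source before answering.

    Assumption: time_sensitivity lies in [0, 1], and higher values tolerate
    less staleness by shrinking the allowed age of the ingested copy.
    """
    if record.last_ingested is None:
        # The source is not in the corpus yet, so it should be ingested.
        return True
    allowed_age = max_age * (1.0 - time_sensitivity)
    return now - record.last_ingested > allowed_age
```

Under these assumptions, a query about yesterday's events would carry a high time-sensitivity value and trigger re-ingestion of a source ingested ten days ago, while a query about a historical fact would not.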

15 Claims

1. A system, comprising:
- a computer processor; and
- a memory containing a program that, when executed on the computer processor, performs an operation comprising:
  - receiving a query for processing by a natural language processing (NLP) system comprising a corpus containing data ingested from a plurality of data sources, wherein the data is formatted and stored into one or more objects and organized based on topic changes, and wherein the ingestion is performed by at least one hardware resource of the NLP system;
  - identifying a data source expected to contain an answer to the query using NLP, by:
    - dividing words in the query into different elements,
    - generating an annotation for each of the elements using the NLP system by determining a particular topic describing each of the elements, and
    - identifying a previously ingested data source in the corpus that is associated with previously-generated annotations matching the generated annotations for the elements;
  - upon determining that the previously ingested data in the corpus does not contain the answer to the query, determining whether new material has been added to the identified data source since the last time the identified data source was ingested into the corpus;
  - upon determining that new material has been added to the identified data source since the last time the identified data source was ingested into the corpus:
    - re-ingesting the identified data source whereby the new material is inserted into the corpus; and
    - processing the query to determine a lexical answer type for the query, based at least in part on a concept assigned to each of the elements, wherein the concepts were determined and assigned using NLP, and wherein the lexical answer type is a word or noun phrase that predicts a type of an answer to the query; and
  - generating an answer to the query based on the new material inserted into the corpus and based on the lexical answer type.
- Dependent claims (2, 3, 4, 5, 6, 14):
  - 2. The system of claim 1, wherein determining that the ingested data in the corpus does not contain the answer to the query further comprises:
    - processing the query to determine if any of the data sources included in the corpus contain the answer to the query.
  - 3. The system of claim 2, wherein determining whether the new material has been added to the identified data source since the last time the identified data source was ingested into the corpus is performed by the NLP system only if no answer is found when processing the query to determine if any of the data sources included in the corpus contain the answer to the query.
  - 4. The system of claim 2, wherein determining that the ingested data in the corpus does not contain the answer to the query further comprises:
    - upon determining that the identified data source contains a possible answer to the query, determining a confidence score associated with the possible answer; and
    - upon determining that the confidence score does not satisfy a confidence threshold, determining whether the new material has been added to the identified data source since the last time the identified data source was ingested into the corpus.
  - 5. The system of claim 1, wherein identifying the data source expected to contain the answer to the query further comprises:
    - performing a concept mapping to assign a concept to at least one element in the query; and
    - determining that the assigned concept of the at least one element in the query is related to a concept associated with the ingested data in the corpus from the data source.
  - 6. The system of claim 5, wherein the corpus stores metadata describing each of the plurality of data sources, wherein the assigned concept of the at least one element in the query is related to the metadata description of the identified data source.
  - 14. The system of claim 1, wherein organizing the data based on topic changes comprises identifying one or more features indicative of a topic change based on one or more changes in text formatting.
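
The flow recited in claims 1 to 4 can be illustrated with a toy sketch. Everything here is a hypothetical stand-in rather than the patented implementation: the `TOPICS` lexicon substitutes for NLP annotation, `find_answer` is a crude word-overlap scorer, the `Source` type and its `fetch` callback are invented for illustration, and the lexical answer type step is omitted entirely.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Set, Tuple

# Hypothetical topic lexicon standing in for the NLP annotator.
TOPICS: Dict[str, str] = {"score": "sports", "game": "sports",
                          "stock": "finance", "price": "finance"}


@dataclass
class Source:
    name: str
    annotations: Set[str]            # topics recorded when the source was ingested
    fetch: Callable[[], List[str]]   # returns the source's current passages
    ingested: List[str] = field(default_factory=list)


def annotate(query: str) -> Set[str]:
    """Divide the query into elements and map each known element to a topic."""
    elements = query.lower().replace("?", "").split()
    return {TOPICS[e] for e in elements if e in TOPICS}


def find_answer(passages: List[str], query: str) -> Tuple[Optional[str], float]:
    """Toy scorer: fraction of query words found in the best passage."""
    words = set(query.lower().replace("?", "").split())
    if not words:
        return None, 0.0
    best, best_score = None, 0.0
    for passage in passages:
        score = len(words & set(passage.lower().split())) / len(words)
        if score > best_score:
            best, best_score = passage, score
    return best, best_score


def answer_query(query: str, corpus: List[Source],
                 threshold: float = 0.5) -> Optional[str]:
    topics = annotate(query)
    # Pick a previously ingested source whose annotations match the query's.
    source = next((s for s in corpus if s.annotations & topics), None)
    if source is None:
        return None
    answer, confidence = find_answer(source.ingested, query)
    if answer is None or confidence < threshold:
        # The corpus lacks a confident answer: check the source for new material.
        new_material = [p for p in source.fetch() if p not in source.ingested]
        if new_material:
            source.ingested.extend(new_material)  # re-ingest into the corpus
            answer, confidence = find_answer(new_material, query)
    return answer if confidence >= threshold else None
```

Note that the source is only checked for new material after the corpus fails to produce a confident answer, which mirrors the ordering in claims 3 and 4; the 0.5 threshold is an arbitrary choice.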

7. A computer program product for maintaining a corpus in a natural language processing (NLP) system, the computer program product comprising:
- a computer-readable storage medium having computer-readable program code embodied therewith, the computer-readable program code comprising computer-readable program code configured to:
  - receive a query for processing by the NLP system comprising a corpus containing data ingested from a plurality of data sources, wherein the data is formatted and stored into one or more objects and organized based on topic changes, and wherein the ingestion is performed by at least one hardware resource of the NLP system;
  - identify a data source expected to contain an answer to the query using NLP, by:
    - dividing words in the query into different elements,
    - generating an annotation for each of the elements using the NLP system by determining a particular topic describing each of the elements, and
    - identifying a previously ingested data source in the corpus that is associated with previously-generated annotations matching the generated annotations for the elements;
  - upon determining that the previously ingested data in the corpus does not contain the answer to the query, determine whether new material has been added to the identified data source since the last time the identified data source was ingested into the corpus;
  - upon determining that new material has been added to the identified data source since the last time the identified data source was ingested into the corpus:
    - re-ingesting the identified data source whereby the new material is inserted into the corpus; and
    - processing the query to determine a lexical answer type for the query, based at least in part on a concept assigned to each of the elements, wherein the concepts were determined and assigned using NLP, and wherein the lexical answer type is a word or noun phrase that predicts a type of an answer to the query; and
  - generate an answer to the query based on the new material inserted into the corpus and based on the lexical answer type.
- Dependent claims (8, 9, 10, 11, 12, 13, 15):
  - 8. The computer program product of claim 7, wherein determining that the ingested data in the corpus does not contain the answer to the query further comprises computer-readable program code configured to:
    - process the query to determine if any of the data sources included in the corpus contain the answer to the query.
  - 9. The computer program product of claim 8, wherein determining whether the new material has been added to the identified data source since the last time the identified data source was ingested into the corpus is performed by the NLP system only if no answer is found when processing the query to determine if any of the data sources included in the corpus contain the answer to the query.
  - 10. The computer program product of claim 8, wherein determining that the ingested data in the corpus does not contain the answer to the query further comprises computer-readable program code configured to:
    - upon determining that the identified data source contains a possible answer to the query, determine a confidence score associated with the possible answer; and
    - upon determining that the confidence score does not satisfy a confidence threshold, determine whether the new material has been added to the identified data source since the last time the identified data source was ingested into the corpus.
  - 11. The computer program product of claim 7, wherein identifying the data source in the corpus expected to contain the answer to the query further comprises computer-readable program code configured to:
    - perform a concept mapping to assign a concept to at least one element in the query; and
    - determine that the assigned concept of the at least one element in the query is related to a concept associated with the ingested data in the corpus from the identified data source.
  - 12. The computer program product of claim 11, wherein the corpus stores metadata describing each of the plurality of data sources, wherein the assigned concept of the at least one element in the query is related to the metadata description of the identified data source.
  - 13. The computer program product of claim 7, wherein the corpus comprises a plurality of data from different data sources, wherein the data from the different data sources is organized based on a common format of the corpus.
  - 15. The computer program product of claim 7, wherein organizing the data based on topic changes comprises identifying one or more features indicative of a topic change based on one or more changes in text formatting.
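
Claims 14 and 15 recite organizing ingested data by detecting topic changes from changes in text formatting. One way this might look, as a minimal sketch under stated assumptions: the only formatting cue used here is a short all-capitals line treated as a heading, and both `looks_like_heading` and `split_on_topic_changes` are hypothetical names. The claims do not specify which formatting features are used.

```python
from typing import List


def looks_like_heading(line: str) -> bool:
    """Hypothetical formatting cue: a short, non-empty line in all capitals."""
    text = line.strip()
    return bool(text) and len(text) <= 60 and text.isupper()


def split_on_topic_changes(lines: List[str]) -> List[dict]:
    """Group lines into objects, starting a new object at each heading."""
    objects: List[dict] = []
    current = {"topic": None, "text": []}
    for line in lines:
        if looks_like_heading(line):
            # A formatting change signals a new topic: close the current object.
            if current["topic"] is not None or current["text"]:
                objects.append(current)
            current = {"topic": line.strip(), "text": []}
        elif line.strip():
            current["text"].append(line.strip())
    if current["topic"] is not None or current["text"]:
        objects.append(current)
    return objects
```

Each resulting object pairs a topic label with its passages, which is one plausible reading of data "formatted and stored into one or more objects and organized based on topic changes".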

Current Assignee: International Business Machines Corporation
Original Assignee: International Business Machines Corporation
Inventors: Clark, Adam T.; Dubbels, Joel C.; Huebert, Jeffrey K.; Petri, John E.
Primary Examiner(s): Neway, Samuel G
Application Number: US13/796,616
Publication Number: US 20140278352A1
Time in Patent Office: 2,359 Days
CPC Class Codes:
- G06F 16/3344 using natural language anal...
- G06F 40/30 Semantic analysis
