Identifying content and content relationship information associated with the content for ingestion into a corpus
First Claim
1. A method, in a data processing system comprising a processor and a memory configured to implement a natural language processing (NLP) system, for identifying content relationship for content copied by a content identification mechanism, the method comprising:
executing, by the processor of a computing device, a content identification mechanism, the content identification mechanism being resident in the memory device of the computing device;
identifying, by the content identification mechanism in the data processing system, the content from a website on another data processing system via a network using natural language processing (NLP);
generating, by the content identification mechanism, a file structure in the data processing system, wherein the file structure comprises the content parsed into a hierarchy and a set of cross reference information for the hierarchy;
populating, by the content identification mechanism, the file structure with path information for the content on the other data processing system that identifies a path to a current web page of the website;
identifying, by the content identification mechanism, relationship content information associated with the current web page based on at least one of the set of cross reference information or contextual clues of the content, wherein the relationship content is a path to a current web page where the relationship content is found as well as other identified content, including headers, section titles, page titles, web site structure, extracted concepts, information type, metadata, or other data about the content itself that is not within the content, including location of the content on the website, type or classification details of the website;
modifying, by the content identification mechanism, the file structure associated with the content with the relationship content information, wherein modifying the file structure associated with the content with the relationship content information is performed either through generating a new file structure with the path information as well as other identified content, augmenting an existing file structure with new information, or updating the existing file structure with a change in the path information or the other identified content;
identifying, by the content identification mechanism, one or more classification identifiers associated with the web page in order to classify the content from the website;
ingesting, by the content identification mechanism, the content from the website on the other data processing system via the network;
transmitting, by the content identification mechanism, the content and the file structure associated with the content to a specific corpus in the NLP system based on the one or more classification identifiers so that the NLP system may respond to inquiries using the content and information in the file structure associated with the content;
responsive to the content identification mechanism identifying changes to the content or the relationship content from the website or information associated with the current web page where the content is found on the website, updating, by the content identification mechanism, the file structure associated with the content thereby forming an updated file structure;
transmitting, by the content identification mechanism, the updated file structure associated with the content to the specific corpus in the NLP system based on the one or more classification identifiers so that the NLP system may respond to new inquiries using the content and information in the updated file structure associated with the content;
receiving, by a Question Answering (QA) system, a first question from a first user;
processing the first question, by one or more software engines of the QA system, using the updated file structure, into one or more queries to apply to a corpora and/or knowledge domain;
generating, by the QA system, one or more potential candidate answers for answering the first question;
generating, by the QA system, a confidence score for the one or more potential candidate answers to the first question, wherein the score is determined by comparing the one or more candidate answers to the first question using one or more reasoning algorithms;
generating a first set ranked list of candidate answers based on the confidence score for the one or more candidate answers;
storing the generated first set ranked list of candidate answers, by the QA system, in association with the first question received by the first user;
receiving, by the Question Answering (QA) system, a second question from a second user subsequent to the first question, the second question being the same as the first question received by the first user;
processing the second question, by one or more software engines of the QA system, using the updated file structure, into one or more queries to apply to a corpora and/or knowledge domain;
generating, by the QA system, one or more potential candidate answers for answering the second question;
generating, by the QA system, a confidence score for the one or more potential candidate answers to the second question, wherein the score is determined by comparing the one or more candidate answers to the second question using one or more reasoning algorithms;
generating a second set ranked list of candidate answers based on the confidence score for the one or more candidate answers to the second question;
comparing, by the QA system, the generated second set ranked list of candidate answers to the second question to the stored generated first set ranked list of candidate answers to the first question; and
identifying, by the QA system, differences between the first set ranked list of candidate answers and the second set ranked list of candidate answers.
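The final steps of the claim (rank candidate answers by confidence, store the list for the first question, then compare it with the list produced for the same question asked later) can be illustrated with a minimal sketch. The function names `rank_candidates` and `diff_ranked_lists`, and the representation of candidates as `(answer, confidence)` pairs, are assumptions made for illustration only; the claim does not prescribe any particular data layout or comparison rule.

```python
def rank_candidates(scored):
    """Return answers ordered by descending confidence.

    `scored` is a list of (answer, confidence) pairs (assumed shape).
    """
    return [answer for answer, _ in sorted(scored, key=lambda p: p[1], reverse=True)]


def diff_ranked_lists(first, second):
    """Report how the second ranked list differs from the stored first one."""
    added = [a for a in second if a not in first]
    removed = [a for a in first if a not in second]
    # Answers present in both lists whose rank position changed,
    # mapped to (old_position, new_position).
    moved = {
        a: (first.index(a), second.index(a))
        for a in second
        if a in first and first.index(a) != second.index(a)
    }
    return {"added": added, "removed": removed, "moved": moved}


# Hypothetical store of ranked lists keyed by question text.
stored_lists = {}
question = "What is the adult dosage?"
stored_lists[question] = rank_candidates([("A", 0.9), ("B", 0.7), ("C", 0.4)])

# The same question asked later, after the corpus was updated.
second_list = rank_candidates([("B", 0.95), ("A", 0.8), ("D", 0.5)])
differences = diff_ranked_lists(stored_lists[question], second_list)
```

In this sketch a difference is any answer that was added, removed, or re-ranked; other definitions (for example, comparing confidence scores directly) would be equally consistent with the claim language.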
Abstract
A mechanism is provided, in a data processing system comprising a processor and a memory configured to implement a natural language processing (NLP) system, for identifying content relationship for content copied by a content identification mechanism. The content identification mechanism identifies content from a website and then identifies relationship content information associated with a current web page where the content is found. The content identification mechanism modifies a file structure associated with the content with the relationship content information. The content identification mechanism identifies one or more classification identifiers in order to classify the content. Finally, the content identification mechanism transmits the content and the file structure to a specific corpus based on the one or more classification identifiers.
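As a rough illustration of the flow the abstract describes, the sketch below models a file structure that carries relationship content information and routes it to a corpus by classification identifier. Every name here (`FileStructure`, `add_relationship_info`, `route_to_corpus`, the dictionary keys, and the `"general"` fallback corpus) is hypothetical; the patent text does not specify a concrete schema or routing rule.

```python
from dataclasses import dataclass, field


@dataclass
class FileStructure:
    """Hypothetical record for one piece of identified content."""
    path: str                      # path to the current web page
    hierarchy: dict                # content parsed into a hierarchy
    cross_references: list = field(default_factory=list)
    relationship_info: dict = field(default_factory=dict)
    classification_ids: list = field(default_factory=list)


def add_relationship_info(fs, page):
    """Augment the file structure with data about the content that is
    not within the content itself (page title, headers, site location)."""
    fs.relationship_info.update(
        {
            "path": fs.path,
            "page_title": page.get("title", ""),
            "headers": page.get("headers", []),
            "site_section": page.get("section", ""),
        }
    )
    return fs


def route_to_corpus(fs, corpora, default="general"):
    """Append the file structure to the corpus matching the first known
    classification identifier; fall back to an assumed default corpus."""
    for cid in fs.classification_ids:
        if cid in corpora:
            corpora[cid].append(fs)
            return cid
    corpora.setdefault(default, []).append(fs)
    return default
```

Here `corpora` is assumed to be a plain dictionary mapping a classification identifier to a list of file structures; a real NLP system would replace this with its own ingestion interface.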
19 Claims
8. A computer program product comprising a computer readable storage medium having a computer readable program stored therein, wherein the computer readable program, when executed on a computing device, causes the computing device to:
execute, by the processor of a computing device, a content identification mechanism, the content identification mechanism being resident in the memory device of the computing device;
identify content from a website on another computing device via a network using natural language processing (NLP);
generate, by the content identification mechanism, a file structure in the data processing system, wherein the file structure comprises the content parsed into a hierarchy and a set of cross reference information for the hierarchy;
populate the file structure with path information for the content on the other data processing system that identifies a path to a current web page of the website;
identify relationship content information associated with the current web page based on at least one of the set of cross reference information or contextual clues of the content, wherein the relationship content is a path to a current web page where the relationship content is found as well as other identified content, including headers, section titles, page titles, web site structure, extracted concepts, information type, metadata, or other data about the content itself that is not within the content, including location of the content on the website, type or classification details of the website;
modify the file structure associated with the content with the relationship content information, wherein modifying the file structure associated with the content with the relationship content information is performed either through generating a new file structure with the path information as well as other identified content, augmenting an existing file structure with new information, or updating the existing file structure with a change in the path information or the other identified content;
identify one or more classification identifiers associated with the web page in order to classify the content from the website;
ingest, by the content identification mechanism, the content from the website on the other computing device via the network;
transmit the content and the file structure associated with the content to a specific corpus in a NLP system based on the one or more classification identifiers so that the NLP system may respond to inquiries using the content and information in the file structure associated with the content;
responsive to the content identification mechanism identifying changes to the content or the relationship content from the website or information associated with the current web page where the content is found on the website, update the file structure associated with the content thereby forming an updated file structure;
transmit the updated file structure associated with the content to the specific corpus in the NLP system based on the one or more classification identifiers so that the NLP system may respond to new inquiries using the content and information in the updated file structure associated with the content;
receive, by a Question Answering (QA) system, a first question from a first user;
process the first question, by one or more software engines of the QA system, using the updated file structure, into one or more queries to apply to a corpora and/or knowledge domain;
generate, by the QA system, one or more potential candidate answers for answering the first question;
generate, by the QA system, a confidence score for the one or more potential candidate answers to the first question, wherein the score is determined by comparing the one or more candidate answers to the first question using one or more reasoning algorithms;
generate a first set ranked list of candidate answers based on the confidence score for the one or more candidate answers;
store the generated first set ranked list of candidate answers, by the QA system, in association with the first question received by the first user;
receive, by the Question Answering (QA) system, a second question from a second user subsequent to the first question, the second question being the same as the first question received by the first user;
process the second question, by one or more software engines of the QA system, using the updated file structure, into one or more queries to apply to a corpora and/or knowledge domain;
generate, by the QA system, one or more potential candidate answers for answering the second question;
generate, by the QA system, a confidence score for the one or more potential candidate answers to the second question, wherein the score is determined by comparing the one or more candidate answers to the second question using one or more reasoning algorithms;
generate a second set ranked list of candidate answers based on the confidence score for the one or more candidate answers to the second question;
compare, by the QA system, the generated second set ranked list of candidate answers to the second question to the stored generated first set ranked list of candidate answers to the first question; and
identify, by the QA system, differences between the first set ranked list of candidate answers and the second set ranked list of candidate answers.
Dependent claims: 9, 10, 11, 12, 13
14. An apparatus comprising:
a processor; and
a memory coupled to the processor, wherein the memory comprises instructions which, when executed by the processor, cause the processor to:
execute, by the processor of a computing device, a content identification mechanism, the content identification mechanism being resident in the memory device of the computing device;
identify content from a website on another apparatus via a network using natural language processing (NLP);
generate, by the content identification mechanism, a file structure in the data processing system, wherein the file structure comprises the content parsed into a hierarchy and a set of cross reference information for the hierarchy;
populate the file structure with path information for the content on the other apparatus that identifies a path to a current web page of the website;
identify relationship content information associated with the current web page based on at least one of the set of cross reference information or contextual clues of the content, wherein the relationship content is a path to a current web page where the relationship content is found as well as other identified content, including headers, section titles, page titles, web site structure, extracted concepts, information type, metadata, or other data about the content itself that is not within the content, including location of the content on the website, type or classification details of the website;
modify the file structure associated with the content with the relationship content information, wherein modifying the file structure associated with the content with the relationship content information is performed either through generating a new file structure with the path information as well as other identified content, augmenting an existing file structure with new information, or updating the existing file structure with a change in the path information or the other identified content;
identify one or more classification identifiers associated with the web page in order to classify the content from the website;
transmit the content and the file structure associated with the content to a specific corpus in a NLP system based on the one or more classification identifiers so that the NLP system may respond to inquiries using the content and information in the file structure associated with the content;
responsive to the content identification mechanism identifying changes to the content or the relationship content from the website or information associated with the current web page where the content is found on the website, update the file structure associated with the content thereby forming an updated file structure;
transmit the updated file structure associated with the content to the specific corpus in the NLP system based on the one or more classification identifiers so that the NLP system may respond to new inquiries using the content and information in the updated file structure associated with the content;
receive, by a Question Answering (QA) system, a first question from a first user;
process the first question, by one or more software engines of the QA system, using the updated file structure, into one or more queries to apply to a corpora and/or knowledge domain;
generate, by the QA system, one or more potential candidate answers for answering the first question;
generate, by the QA system, a confidence score for the one or more potential candidate answers to the first question, wherein the score is determined by comparing the one or more candidate answers to the first question using one or more reasoning algorithms;
generate a first set ranked list of candidate answers based on the confidence score for the one or more candidate answers;
store the generated first set ranked list of candidate answers, by the QA system, in association with the first question received by the first user;
receive, by the Question Answering (QA) system, a second question from a second user subsequent to the first question, the second question being the same as the first question received by the first user;
process the second question, by one or more software engines of the QA system, using the updated file structure, into one or more queries to apply to a corpora and/or knowledge domain;
generate, by the QA system, one or more potential candidate answers for answering the second question;
generate, by the QA system, a confidence score for the one or more potential candidate answers to the second question, wherein the score is determined by comparing the one or more candidate answers to the second question using one or more reasoning algorithms;
generate a second set ranked list of candidate answers based on the confidence score for the one or more candidate answers to the second question;
compare, by the QA system, the generated second set ranked list of candidate answers to the second question to the stored generated first set ranked list of candidate answers to the first question; and
identify, by the QA system, differences between the first set ranked list of candidate answers and the second set ranked list of candidate answers.
Dependent claims: 15, 16, 17, 18, 19
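The update step recited in each independent claim (updating the file structure when the content or its relationship content changes) could be approximated as below. Detecting change by hashing is an assumption of this sketch, not something the claims state, and the helper names `fingerprint` and `refresh` and the dictionary keys are hypothetical.

```python
import hashlib
import json


def fingerprint(content, relationship_info):
    """Hash the content together with its relationship information.

    Keys are sorted so that equal dictionaries always hash identically.
    """
    payload = json.dumps(
        {"content": content, "relationship": relationship_info},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def refresh(file_structure, content, relationship_info):
    """Update `file_structure` in place if anything changed.

    Returns True when an updated file structure was formed (and would
    therefore need to be retransmitted to the corpus), False otherwise.
    """
    new_fp = fingerprint(content, relationship_info)
    if file_structure.get("fingerprint") == new_fp:
        return False
    file_structure.update(
        content=content,
        relationship_info=dict(relationship_info),
        fingerprint=new_fp,
    )
    return True
```

A change to the page path alone, with identical content, still yields a new fingerprint, which mirrors the claim's requirement that changes to relationship content also trigger an update.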
Specification