Identifying content and content relationship information associated with the content for ingestion into a corpus

US 10,642,935 B2
Filed: 05/12/2014
Issued: 05/05/2020
Est. Priority Date: 05/12/2014
Status: Active Grant


Abstract

A mechanism is provided, in a data processing system comprising a processor and a memory configured to implement a natural language processing (NLP) system, for identifying content relationship for content copied by a content identification mechanism. The content identification mechanism identifies content from a website and then identifies relationship content information associated with a current web page where the content is found. The content identification mechanism modifies a file structure associated with the content with the relationship content information. The content identification mechanism identifies one or more classification identifiers in order to classify the content. Finally, the content identification mechanism transmits the content and the file structure to a specific corpus based on the one or more classification identifiers.
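The flow summarized in the abstract can be sketched in Python as follows. This is a minimal illustration and not the patented implementation: the `FileStructure` class, the `CORPUS_BY_KEYWORD` table, and the keyword-matching classifier are all hypothetical names and simplifications introduced here.

```python
from dataclasses import dataclass, field
from urllib.parse import urlparse


@dataclass
class FileStructure:
    """Hypothetical container for one piece of identified content."""
    path: str                 # path to the web page where the content was found
    hierarchy: dict           # content parsed into sections (heading -> text)
    relationship: dict = field(default_factory=dict)
    classifiers: list = field(default_factory=list)


# Assumed keyword-to-corpus table; a real classifier would be more elaborate.
CORPUS_BY_KEYWORD = {"cardiology": "medical", "tax": "finance"}


def build_file_structure(url: str, page_title: str, sections: dict) -> FileStructure:
    parsed = urlparse(url)
    fs = FileStructure(path=parsed.path, hierarchy=dict(sections))
    # Relationship content: data about the content that is not within the content.
    fs.relationship = {
        "site": parsed.netloc,
        "page_title": page_title,
        "section_titles": list(sections),
        "path_segments": [s for s in parsed.path.split("/") if s],
    }
    # Classification identifiers, here matched against the path and the title.
    haystack = (parsed.path + " " + page_title).lower()
    fs.classifiers = sorted(k for k in CORPUS_BY_KEYWORD if k in haystack)
    return fs


def route_to_corpus(fs: FileStructure, default: str = "general") -> str:
    """Pick the target corpus from the first classification identifier."""
    for classifier in fs.classifiers:
        return CORPUS_BY_KEYWORD[classifier]
    return default
```

Under these assumptions, a page at `/health/cardiology/stents.html` would be routed to the `medical` corpus, while a page with no matching identifier falls back to `general`.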

23 Citations

19 Claims

1. A method, in a data processing system comprising a processor and a memory configured to implement a natural language processing (NLP) system, for identifying content relationship for content copied by a content identification mechanism, the method comprising:
- executing, by the processor of a computing device, a content identification mechanism, the content identification mechanism being resident in the memory device of the computing device;
  
  identifying, by the content identification mechanism in the data processing system, the content from a website on another data processing system via a network using natural language processing (NLP);
  
  generating, by the content identification mechanism, a file structure in the data processing system, wherein the file structure comprises the content parsed into a hierarchy and a set of cross reference information for the hierarchy;
  
  populating, by the content identification mechanism, the file structure with path information for the content on the other data processing system that identifies a path to a current web page of the website;
  
  identifying, by the content identification mechanism, relationship content information associated with the current web page based on at least one of the set of cross reference information or contextual clues of the content, wherein the relationship content is a path to a current web page where the relationship content is found as well as other identified content, including headers, section titles, page titles, web site structure, extracted concepts, information type, metadata, or other data about the content itself that is not within the content, including location of the content on the website, type or classification details of the website;
  
  modifying, by the content identification mechanism, the file structure associated with the content with the relationship content information, wherein modifying the file structure associated with the content with the relationship content information is performed either through generating a new file structure with the path information as well as other identified content, augmenting an existing file structure with new information, or updating the existing file structure with a change in the path information or the other identified content;
  
  identifying, by the content identification mechanism, one or more classification identifiers associated with the web page in order to classify the content from the website;
  
  ingesting, by the content identification mechanism, the content from the website on the other data processing system via the network;
  
  transmitting, by the content identification mechanism, the content and the file structure associated with the content to a specific corpus in the NLP system based on the one or more classification identifiers so that the NLP system may respond to inquiries using the content and information in the file structure associated with the content;
  
  responsive to the content identification mechanism identifying changes to the content or the relationship content from the website or information associated with the current web page where the content is found on the website, updating, by the content identification mechanism, the file structure associated with the content thereby forming an updated file structure;
  
  transmitting, by the content identification mechanism, the updated file structure associated with the content to the specific corpus in the NLP system based on the one or more classification identifiers so that the NLP system may respond to new inquiries using the content and information in the updated file structure associated with the content;
  
  receiving, by a Question Answering (QA) system, a first question from a first user;
  
  processing the first question, by one or more software engines of the QA system, using the updated file structure, into one or more queries to apply to a corpora and/or knowledge domain;
  
  generating, by the QA system, one or more potential candidate answers for answering the first question;
  
  generating, by the QA system, a confidence score for the one or more potential candidate answers to the first question, wherein the score is determined by comparing the one or more candidate answers to the first question using one or more reasoning algorithms;
  
  generating a first set ranked list of candidate answers based on the confidence score for the one or more candidate answers;
  
  storing the generated first set ranked list of candidate answers, by the QA system, in association with the first question received by the first user;
  
  receiving, by the Question Answering (QA) system, a second question from a second user subsequent to the first question, the second question being the same as the first question received by the first user;
  
  processing the second question, by one or more software engines of the QA system, using the updated file structure, into one or more queries to apply to a corpora and/or knowledge domain;
  
  generating, by the QA system, one or more potential candidate answers for answering the second question;
  
  generating, by the QA system, a confidence score for the one or more potential candidate answers to the second question, wherein the score is determined by comparing the one or more candidate answers to the second question using one or more reasoning algorithms;
  
  generating a second set ranked list of candidate answers based on the confidence score for the one or more candidate answers to the second question;
  
  comparing, by the QA system, the generated second set ranked list of candidate answers to the second question to the stored generated first set ranked list of candidate answers to the first question; and
  
  identifying, by the QA system, differences between the first set ranked list of candidate answers to the second set ranked list of candidate answers.
- - 2. The method of claim 1, further comprising:
    - identifying, by the content identification mechanism, other content on the website on the other data processing system via the network;
      
      identifying, by the content identification mechanism, cross reference information between the content and the other content;
      
      updating, by the content identification mechanism, the file structure associated with content with the cross reference information; and
      
      transmitting, by the content identification mechanism, the updated file structure associated with the content to the specific corpus in the NLP system based on the one or more classification identifiers.
  - 3. The method of claim 2, wherein the file structure of the other content is updated with the cross reference information associated with the content.
  - 4. The method of claim 2, wherein the cross reference information is identified using at least one of the group consisting of:
    - parsing, structural analysis, hierarchical analysis, or concept extraction.
  - 5. The method of claim 1, wherein the relationship content information is identified from a Uniform Resource Locator (URL) of the web page or an HyperText Markup Language (HTML) of the web page and wherein the relationship content information is utilized to determine document information directly identified in the content or associated with the content.
  - 6. The method of claim 1, wherein each content is selected from a group comprising a document, a video, an audio file, a recording, a picture, an artifact, an entry, or data.
  - 7. The method of claim 1, wherein the content identification mechanism comprises at least one of the group consisting of:
    - an Internet bot, a web crawler, a web scraper, a web spider, an ant, or an automatic indexer.
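Claim 5 states that relationship content information may be identified from the URL or the HTML of the web page. One way this could look, using only the Python standard library, is sketched below. The `HeadingCollector` class, the `relationship_from_page` function, and the choice to collect only the title and h1 to h3 headings are assumptions made for illustration; the claims do not prescribe a parser.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class HeadingCollector(HTMLParser):
    """Collects the page title and the text of h1-h3 headings."""

    def __init__(self):
        super().__init__()
        self._current = None   # tag currently being captured, if any
        self.title = ""
        self.headings = []

    def handle_starttag(self, tag, attrs):
        if tag in ("title", "h1", "h2", "h3"):
            self._current = tag
            if tag != "title":
                self.headings.append("")

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

    def handle_data(self, data):
        if self._current == "title":
            self.title += data
        elif self._current is not None:
            self.headings[-1] += data


def relationship_from_page(url: str, html: str) -> dict:
    """Derive relationship content from a page's URL and markup."""
    collector = HeadingCollector()
    collector.feed(html)
    collector.close()
    parts = urlparse(url)
    return {
        "site": parts.netloc,
        "path_segments": [s for s in parts.path.split("/") if s],
        "page_title": collector.title.strip(),
        "headings": [h.strip() for h in collector.headings],
    }
```

The returned dictionary holds data about the content (site, location on the site, titles) rather than the content itself, which is the distinction claim 1 draws.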

8. A computer program product comprising a computer readable storage medium having a computer readable program stored therein, wherein the computer readable program, when executed on a computing device, causes the computing device to:
- execute, by the processor of a computing device, a content identification mechanism, the content identification mechanism being resident in the memory device of the computing device;
  
  identify content from a website on another computing device via a network using natural language processing (NLP);
  
  generating, by the content identification mechanism, a file structure in the data processing system, wherein the file structure comprises the content parsed into a hierarchy and a set of cross reference information for the hierarchy;
  
  populate the file structure with path information for the content on the other data processing system that identifies a path to a current web page of the website;
  
  identify relationship content information associated with the current web page based on at least one of the set of cross reference information or contextual clues of the content, wherein the relationship content is a path to a current web page where the relationship content is found as well as other identified content, including headers, section titles, page titles, web site structure, extracted concepts, information type, metadata, or other data about the content itself that is not within the content, including location of the content on the website, type or classification details of the website;
  
  modify the file structure associated with the content with the relationship content information, wherein modifying the file structure associated with the content with the relationship content information is performed either through generating a new file structure with the path information as well as other identified content, augmenting an existing file structure with new information, or updating the existing file structure with a change in the path information or the other identified content;
  
  identify one or more classification identifiers associated with the web page in order to classify the content from the website;
  
  ingesting, by the content identification mechanism, the content from the website on the other computing device via the network;
  
  transmit the content and the file structure associated with the content to a specific corpus in a NLP system based on the one or more classification identifiers so that the NLP system may respond to inquiries using the content and information in the file structure associated with the content;
  
  responsive to the content identification mechanism identifying changes to the content or the relationship content from the website or information associated with the current web page where the content is found on the website, update the file structure associated with the content thereby forming an updated file structure; and
  
  transmit the updated file structure associated with the content to the specific corpus in the NLP system based on the one or more classification identifiers so that the NLP system may respond to new inquiries using the content and information in the updated file structure associated with the content;
  
  receive, by a Question Answering (QA) system, a first question from a first user;
  
  process the first question, by one or more software engines of the QA system, using the updated file structure, into one or more queries to apply to a corpora and/or knowledge domain;
  
  generate, by the QA system, one or more potential candidate answers for answering the first question;
  
  generate, by the QA system, a confidence score for the one or more potential candidate answers to the first question, wherein the score is determined by comparing the one or more candidate answers to the first question using one or more reasoning algorithms;
  
  generate a first set ranked list of candidate answers based on the confidence score for the one or more candidate answers;
  
  store the generated first set ranked list of candidate answers, by the QA system, in association with the first question received by the first user;
  
  receive, by the Question Answering (QA) system, a second question from a second user subsequent to the first question, the second question being the same as the first question received by the first user;
  
  process the second question, by one or more software engines of the QA system, using the updated file structure, into one or more queries to apply to a corpora and/or knowledge domain;
  
  generate, by the QA system, one or more potential candidate answers for answering the second question;
  
  generate, by the QA system, a confidence score for the one or more potential candidate answers to the second question, wherein the score is determined by comparing the one or more candidate answers to the second question using one or more reasoning algorithms;
  
  generate a second set ranked list of candidate answers based on the confidence score for the one or more candidate answers to the second question;
  
  compare, by the QA system, the generated second set ranked list of candidate answers to the second question to the stored generated first set ranked list of candidate answers to the first question; and
  
  identify, by the QA system, differences between the first set ranked list of candidate answers to the second set ranked list of candidate answers.
- - 9. The computer program product of claim 8, wherein the computer readable program further causes the computing device to:
    - identify other content on the website on the other computing device via the network;
      
      identify cross reference information between the content and the other content;
      
      update the file structure associated with content with the cross reference information; and
      
      transmit the updated file structure associated with the content to the specific corpus in the NLP system based on the one or more classification identifiers.
  - 10. The computer program product of claim 9, wherein the file structure of the other content is updated with the cross reference information associated with the content.
  - 11. The computer program product of claim 9, wherein the cross reference information is identified using at least one of the group consisting of:
    - parsing, structural analysis, hierarchical analysis, or concept extraction.
  - 12. The computer program product of claim 8, wherein the relationship content information is identified from a Uniform Resource Locator (URL) of the web page or an HyperText Markup Language (HTML) of the web page and wherein the relationship content information is utilized to determine document information directly identified in the content or associated with the content.
  - 13. The computer program product of claim 8, wherein the content identification mechanism comprises at least one of the group consisting of:
    - an Internet bot, a web crawler, a web scraper, a web spider, an ant, or an automatic indexer.
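Claims 9 and 10 describe recording cross reference information on the content and, in turn, on the other content. A minimal Python sketch follows, assuming that cross references are discovered from hyperlinks between ingested pages. That is only one possibility (claim 11 lists parsing, structural analysis, hierarchical analysis, or concept extraction), and the function name and dictionary layout are hypothetical.

```python
def add_cross_references(structures: dict) -> dict:
    """Record cross references between ingested pages in both directions.

    `structures` maps a page path to a dict with a "links" list (paths the
    page points at) and a "cross_references" set that this function fills.
    """
    for source, fs in structures.items():
        for target in fs["links"]:
            if target == source or target not in structures:
                continue  # skip self-links and pages that were not ingested
            # Claim 9: the content's file structure gains the cross reference.
            fs["cross_references"].add(target)
            # Claim 10: the other content's file structure is updated as well.
            structures[target]["cross_references"].add(source)
    return structures
```

Links to pages outside the ingested set are ignored here, so every recorded cross reference points at a file structure that actually exists.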

14. An apparatus comprising:
- a processor; and
  
  a memory coupled to the processor, wherein the memory comprises instructions which, when executed by the processor, cause the processor to:
  
  execute, by the processor of a computing device, a content identification mechanism, the content identification mechanism being resident in the memory device of the computing device;
  
  identify content from a website on another apparatus via a network using natural language processing (NLP);
  
  generate, by the content identification mechanism, a file structure in the data processing system, wherein the file structure comprises the content parsed into a hierarchy and a set of cross reference information for the hierarchy;
  
  populate the file structure with path information for the content on the other apparatus that identifies a path to a current web page of the website;
  
  identify relationship content information associated with the current web page based on at least one of the set of cross reference information or contextual clues of the content, wherein the relationship content is a path to a current web page where the relationship content is found as well as other identified content, including headers, section titles, page titles, web site structure, extracted concepts, information type, metadata, or other data about the content itself that is not within the content, including location of the content on the website, type or classification details of the website;
  
  modify the file structure associated with the content with the relationship content information, wherein modifying the file structure associated with the content with the relationship content information is performed either through generating a new file structure with the path information as well as other identified content, augmenting an existing file structure with new information, or updating the existing file structure with a change in the path information or the other identified content;
  
  identify one or more classification identifiers associated with the web page in order to classify the content from the website;
  
  transmit the content and the file structure associated with the content to a specific corpus in a NLP system based on the one or more classification identifiers so that the NLP system may respond to inquiries using the content and information in the file structure associated with the content;
  
  responsive to the content identification mechanism identifying changes to the content or the relationship content from the website or information associated with the current web page where the content is found on the website, update the file structure associated with the content thereby forming an updated file structure;
  
  transmit the updated file structure associated with the content to the specific corpus in the NLP system based on the one or more classification identifiers so that the NLP system may respond to new inquiries using the content and information in the updated file structure associated with the content;
  
  receive, by a Question Answering (QA) system, a first question from a first user;
  
  process the first question, by one or more software engines of the QA system, using the updated file structure, into one or more queries to apply to a corpora and/or knowledge domain;
  
  generate, by the QA system, one or more potential candidate answers for answering the first question;
  
  generate, by the QA system, a confidence score for the one or more potential candidate answers to the first question, wherein the score is determined by comparing the one or more candidate answers to the first question using one or more reasoning algorithms;
  
  generate a first set ranked list of candidate answers based on the confidence score for the one or more candidate answers;
  
  store the generated first set ranked list of candidate answers, by the QA system, in association with the first question received by the first user;
  
  receive, by the Question Answering (QA) system, a second question from a second user subsequent to the first question, the second question being the same as the first question received by the first user;
  
  process the second question, by one or more software engines of the QA system, using the updated file structure, into one or more queries to apply to a corpora and/or knowledge domain;
  
  generate, by the QA system, one or more potential candidate answers for answering the second question;
  
  generate, by the QA system, a confidence score for the one or more potential candidate answers to the second question, wherein the score is determined by comparing the one or more candidate answers to the second question using one or more reasoning algorithms;
  
  generate a second set ranked list of candidate answers based on the confidence score for the one or more candidate answers to the second question;
  
  compare, by the QA system, the generated second set ranked list of candidate answers to the second question to the stored generated first set ranked list of candidate answers to the first question; and
  
  identify, by the QA system, differences between the first set ranked list of candidate answers to the second set ranked list of candidate answers.
- - 15. The apparatus of claim 14, wherein the instructions further cause the processor to:
    - identify other content on the website on the other computing device via the network;
      
      identify cross reference information between the content and the other content;
      
      update the file structure associated with content with the cross reference information; and
      
      transmit the updated file structure associated with the content to the specific corpus in the NLP system based on the one or more classification identifiers.
  - 16. The apparatus of claim 15, wherein the file structure of the other content is updated with the cross reference information associated with the content.
  - 17. The apparatus of claim 15, wherein the cross reference information is identified using at least one of the group consisting of:
    - parsing, structural analysis, hierarchical analysis, or concept extraction.
  - 18. The apparatus of claim 14, wherein the path information is identified from a Uniform Resource Locator (URL) of the web page or an HyperText Markup Language (HTML) of the web page and wherein the path information is utilized to determine document information directly identified in the content or associated with the content.
  - 19. The apparatus of claim 15, wherein the content identification mechanism comprises at least one of the group consisting of:
    - an Internet bot, a web crawler, a web scraper, a web spider, an ant, or an automatic indexer.
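Each independent claim ends by ranking candidate answers by confidence score, storing the first ranked list, and comparing it with the list generated for the same question asked later. The comparison step could be sketched in Python as below. The function names, the tie-breaking rule, and the three-part shape of the result are assumptions for illustration; the claims only require that differences be identified.

```python
def rank(candidates: dict) -> list:
    """Order candidate answers by confidence score, highest first.

    Ties are broken alphabetically so the ordering is deterministic.
    """
    return [a for a, _ in sorted(candidates.items(), key=lambda kv: (-kv[1], kv[0]))]


def diff_rankings(first: list, second: list) -> dict:
    """Report how the second ranked list differs from the stored first one."""
    first_pos = {answer: i for i, answer in enumerate(first)}
    second_pos = {answer: i for i, answer in enumerate(second)}
    return {
        "added": [a for a in second if a not in first_pos],
        "removed": [a for a in first if a not in second_pos],
        # Answers present in both lists whose rank changed: (old, new) index.
        "moved": {
            a: (first_pos[a], second_pos[a])
            for a in first
            if a in second_pos and first_pos[a] != second_pos[a]
        },
    }


# Usage: store the first list keyed by question, then compare on a repeat.
stored = {}
question = "What are the risks of a stent?"
stored[question] = rank({"A": 0.9, "B": 0.7, "C": 0.4})
second = rank({"B": 0.95, "A": 0.8, "D": 0.5})
changes = diff_rankings(stored[question], second)
```

A non-empty result for a repeated question indicates that the answers shifted between the two askings, for example because the file structure in the corpus was updated in the meantime.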


Current Assignee: International Business Machines Corporation
Original Assignee: International Business Machines Corporation
Inventors: Bufe, III, John P.
Primary Examiner(s): Hang, Vu B
Application Number: US14/275,484
Publication Number: US 20150324350A1
Time in Patent Office: 2,185 Days

CPC Class Codes

G06F 16/24578   using ranking
G06F 16/3329   Natural language query form...
G06F 16/3344   using natural language anal...
G06F 40/30   Semantic analysis
G06F 40/40   Processing or translation o...
