Methods and Systems of Automatic Ontology Population
Abstract
Methods and systems for creating a knowledge graph that relates terms in a corpus of literature in the form of an assertion and provides a probability of the veracity of the assertion are disclosed herein. Various aspects of the invention are directed to and/or involve knowledge graphs and structured digital abstracts (SDAs) offering a machine readable representation of statements in a corpus of literature. Various methods and systems of the invention can automatically extract, structure, and visualize the statements. Such graphs and abstracts can be useful for a variety of applications including, but not necessarily limited to, semantic-based search tools for search of electronic medical records, specific content verticals (e.g. newswire, finance, history) and general internet searches.
44 Claims
1. A method for generating a knowledge graph from a corpus of literature wherein the corpus has multiple documents, comprising:
a. dividing documents from the corpus into sentences;
b. parsing each sentence into entries wherein an entry comprises (i) a pair of terms and (ii) a linguistic dependency path describing a directional relation between the terms;
c. creating a path-counts matrix from the parsed sentence entries comprising rows and columns wherein a row represents a pair of terms, a column represents a linguistic dependency path, and a cell represents the number of times in the corpus that the terms are connected by the path in a sentence;
d. creating a knowledge graph comprising a plurality of statements, wherein each statement is obtained from a portion of the corpus, each statement comprising at least four elements wherein two elements are terms, one element is a directional relation that connects the two terms to form an assertion, and one element is an estimated probability that the assertion is true or false, wherein at least two statements share one term in common and one term not in common and at least one statement comprises an assertion that is not a hypernym/hyponym assertion;
wherein the knowledge graph is created by:
i. creating a training data set by assigning to a subset of term pairs probabilities of the truth of a directional relation for the pair;
ii. using entries in the path-counts matrix and the training data set to produce rules for determining the probability related to the truth of a relation; and
iii. assigning probabilities of the truth of the relation for pairs of terms of the knowledge graph using the rules, thereby creating the knowledge graph; and
e. storing the knowledge graph on a computer readable medium.
(Dependent claims: 2, 3)

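As an illustration only, steps a through c of claim 1 might be sketched in Python as follows. The sketch assumes the sentences have already been parsed into (term, term, path) entries; the entry values, the path strings, and the function name `build_path_counts` are hypothetical and are not taken from the patent or from any real dependency parser.

```python
from collections import Counter

# Hypothetical parsed entries: (term_a, term_b, dependency_path).
# The path strings are made-up placeholders, not real parser output.
entries = [
    ("aspirin", "headache", "nsubj<-treats->dobj"),
    ("aspirin", "headache", "nsubj<-treats->dobj"),
    ("aspirin", "headache", "nsubj<-relieves->dobj"),
    ("ibuprofen", "fever", "nsubj<-treats->dobj"),
]


def build_path_counts(entries):
    """Return (pairs, paths, matrix).

    matrix[i][j] is the number of entries in which pairs[i] is
    connected by paths[j]; rows are term pairs, columns are paths.
    """
    counts = Counter(((a, b), path) for a, b, path in entries)
    pairs = sorted({pair for pair, _ in counts})
    paths = sorted({path for _, path in counts})
    # Counter returns 0 for (pair, path) combinations never observed.
    matrix = [[counts[(pair, path)] for path in paths] for pair in pairs]
    return pairs, paths, matrix


pairs, paths, matrix = build_path_counts(entries)
for pair, row in zip(pairs, matrix):
    print(pair, row)
```

A dense list of lists is used here for readability; for a real corpus most cells would be zero, so a sparse representation would likely be a better fit.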
4. A knowledge graph on a computer readable medium derived from a corpus of literature comprising a plurality of statements, wherein each statement is derived from a portion of the corpus, each statement comprising at least four elements wherein:
a. two elements are terms;
b. one element is a directional relation that connects the two terms to form an assertion; and
c. one element is an estimated probability that the assertion is true or false;
wherein at least two statements share one term in common and one term not in common and at least one statement comprises an assertion that is not a hypernym/hyponym assertion.
(Dependent claims: 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17)

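The four-element statement of claim 4 can be pictured as a small record type. The following Python sketch is an assumption-laden illustration: the class name `Statement`, its field names, the relation label `is_a` standing in for a hypernym/hyponym relation, and the probability values are all hypothetical.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Statement:
    term_a: str         # first term
    relation: str       # directional relation, read term_a -> term_b
    term_b: str         # second term
    probability: float  # estimated probability that the assertion is true


def share_exactly_one_term(s1, s2):
    """True if the statements have one term in common and one not."""
    return len({s1.term_a, s1.term_b} & {s2.term_a, s2.term_b}) == 1


def has_linked_pair(statements):
    """True if at least two statements share exactly one term."""
    return any(share_exactly_one_term(a, b)
               for a, b in combinations(statements, 2))


def has_non_hypernym(statements, hypernym_relations=frozenset({"is_a"})):
    """True if at least one relation is not a hypernym/hyponym relation."""
    return any(s.relation not in hypernym_relations for s in statements)


# Illustrative values only; the probabilities are invented.
graph = [
    Statement("aspirin", "treats", "headache", 0.92),
    Statement("aspirin", "is_a", "NSAID", 0.98),
]
print(has_linked_pair(graph), has_non_hypernym(graph))
```

Shared terms are what turn a list of isolated assertions into a graph: here "aspirin" is the node that joins the two statements.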
18. An automatically produced structural digital abstract of a document comprising a machine readable abstract comprising a plurality of statements wherein a statement comprises at least four elements wherein:
a. two elements are terms;
b. one element is a directional relation that connects the two terms to form an assertion; and
c. one element is an estimated probability that the assertion is true or false.
(Dependent claims: 19, 20)

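One way to make such an abstract machine readable is to serialize its statements as JSON. The sketch below is only an assumption about a possible encoding: the claim does not prescribe a format, and the function name `to_structured_abstract`, the JSON keys, the document identifier, and the sample values are hypothetical.

```python
import json


def to_structured_abstract(doc_id, statements):
    """Serialize (term, relation, term, probability) tuples as JSON."""
    return json.dumps(
        {
            "document": doc_id,
            "statements": [
                {
                    "term_1": term_1,
                    "relation": relation,
                    "term_2": term_2,
                    "probability": probability,
                }
                for term_1, relation, term_2, probability in statements
            ],
        },
        indent=2,
    )


# Invented sample statements for a hypothetical document "doc-001".
sda = to_structured_abstract(
    "doc-001",
    [
        ("aspirin", "treats", "headache", 0.92),
        ("aspirin", "is_a", "NSAID", 0.98),
    ],
)
print(sda)
```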
21. A method of semantically searching biomedical literature comprising:
a. providing a search string, wherein the string is at least one of a term, a relation, and an assertion of two terms with a directional relation linking the terms;
b. comparing the search string with a knowledge graph produced from a corpus of literature which is stored on a computer readable medium comprising a plurality of statements, wherein each statement is obtained from sentences within the corpus, each statement comprising at least four elements wherein:
i. two elements are terms;
ii. one element is a directional relation that connects the two terms to form an assertion;
iii. one element is an estimated probability that the assertion is true or false; and
iv. one element is a back-trace object that provides a link to the portion of the corpus from which the assertion was obtained;
c. ranking the statements obtained from the back-trace object that are most closely related to the search assertion; and
d. displaying a representation of a subset of the statements that are closely related to the search assertion.
(Dependent claims: 22, 23, 24, 25, 26, 27, 28, 29)

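The search steps of claim 21 might be sketched as follows. The ranking criterion used here (number of matching query fields, then estimated probability) is an assumption, since the claim does not specify how closeness is measured; the `Statement` fields, the `back_trace` locator format, and the sample data are likewise hypothetical. For simplicity a query is a (term, relation, term) tuple in which `None` means "unspecified", and a lone term is matched only in the first position.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Statement:
    term_a: str
    relation: str
    term_b: str
    probability: float
    back_trace: str  # hypothetical locator: "document-id:sentence-index"


def match_score(query, st):
    """Count how many specified query fields equal the statement's fields."""
    term_a, relation, term_b = query
    return sum([
        term_a is not None and term_a == st.term_a,
        relation is not None and relation == st.relation,
        term_b is not None and term_b == st.term_b,
    ])


def search(query, graph, top_k=3):
    """Return up to top_k matching statements, best match first."""
    hits = [(match_score(query, st), st) for st in graph]
    hits = [(score, st) for score, st in hits if score > 0]
    # Assumed ranking: more matching fields first, then higher probability.
    hits.sort(key=lambda hit: (hit[0], hit[1].probability), reverse=True)
    return [st for _, st in hits[:top_k]]


# Invented statements and probabilities.
graph = [
    Statement("aspirin", "treats", "headache", 0.92, "doc1:3"),
    Statement("aspirin", "treats", "fever", 0.75, "doc2:7"),
    Statement("ibuprofen", "treats", "headache", 0.88, "doc3:1"),
    Statement("aspirin", "is_a", "NSAID", 0.98, "doc1:1"),
]

results = search(("aspirin", "treats", None), graph)
for st in results:
    print(st.term_a, st.relation, st.term_b, st.probability, st.back_trace)
```

Each result carries its back-trace locator, so a display layer could link the statement to the sentence it came from.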
30. A computer implemented method of searching the internet comprising:
a. methodically searching documents on web pages;
b. extracting the content of the pages with a program that utilizes a path-counts matrix, pairs of terms, and corresponding relationship probabilities derived from a corpus of literature to extract pairs of terms and calculate probabilities for relations between the terms; and
c. storing the extracted content of the pages in a computer readable format.

31. A computer program product that generates a knowledge graph comprising:
a. code that divides documents from the corpus into sentences;
b. code that parses each sentence into entries wherein an entry comprises (i) a pair of terms and (ii) a linguistic dependency path describing a directional relation between the terms;
c. code that creates a path-counts matrix from the parsed sentence entries comprising rows and columns wherein a row represents a pair of terms, a column represents a linguistic dependency path, and a cell represents the number of times in the corpus that the terms are connected by the path in a sentence;
d. code that creates a knowledge graph comprising a plurality of statements, wherein each statement is obtained from a portion of the corpus, each statement comprising at least four elements wherein two elements are terms, one element is a directional relation that connects the two terms to form an assertion, and one element is an estimated probability that the assertion is true or false, wherein the knowledge graph is created by:
i. creating a training data set by assigning to a subset of term pairs probabilities of the truth of a directional relation for the pair;
ii. using entries in the path-counts matrix and the training data set to produce rules for determining the probability related to the truth of a relation; and
iii. assigning probabilities of the truth of the relation for pairs of terms of the knowledge graph using the rules, thereby creating the knowledge graph.

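Steps i through iii of element d can be illustrated with a small learning example. The claim does not name a learning method, so the sketch below substitutes plain logistic regression, fitted by batch gradient descent, as one possible reading of "rules"; the function names, the two path columns, the training probabilities, and the hyperparameters are all hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_rules(rows, labels, epochs=500, lr=0.1):
    """Fit one weight per path column plus a bias.

    rows are path-count vectors for the labeled term pairs; labels are
    the assigned probabilities (soft labels between 0 and 1 are allowed).
    """
    n_paths = len(rows[0])
    weights, bias = [0.0] * n_paths, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_paths, 0.0
        for x, y in zip(rows, labels):
            err = sigmoid(bias + sum(w * v for w, v in zip(weights, x))) - y
            grad_b += err
            for j, v in enumerate(x):
                grad_w[j] += err * v
        bias -= lr * grad_b / len(rows)
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad_w)]
    return weights, bias


def predict(row, weights, bias):
    """Estimated probability that the relation holds for a term pair."""
    return sigmoid(bias + sum(w * v for w, v in zip(weights, row)))


# Step i: a labeled subset. Column 0 counts a hypothetical "X treats Y"
# path, column 1 a hypothetical "X unlike Y" path.
train_rows = [[3, 0], [2, 0], [0, 3], [0, 2]]
train_probs = [1.0, 0.9, 0.0, 0.1]

# Step ii: learn the rules from the path counts and the training set.
weights, bias = train_rules(train_rows, train_probs)

# Step iii: assign probabilities to pairs outside the training subset.
p_supported = predict([4, 0], weights, bias)
p_unsupported = predict([0, 1], weights, bias)
print(round(p_supported, 3), round(p_unsupported, 3))
```

Logistic regression is chosen only because it maps count features to a probability directly and accepts soft labels; any classifier that outputs calibrated probabilities could take its place in this sketch.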
32. A computer program product that generates a structured digital abstract comprising:
a. code that divides a document into sentences, wherein the document belongs to or is to be added to a corpus of literature;
b. code that parses each sentence into entries wherein an entry comprises (i) a pair of terms and (ii) a linguistic dependency path describing a directional relation between the terms;
c. code that creates a path-counts matrix from the parsed sentence entries comprising rows and columns wherein a row represents a pair of terms, a column represents a linguistic dependency path, and a cell represents the number of times in the corpus that the terms are connected by the path in a sentence; and
d. code that creates a knowledge graph comprising a plurality of statements, wherein each statement is obtained from a portion of the corpus, each statement comprising at least four elements wherein two elements are terms, one element is a directional relation that connects the two terms to form an assertion, and one element is an estimated probability that the assertion is true or false, wherein the knowledge graph is related to the document, thereby creating a structured digital abstract.

33. A business method comprising:
a. entering into a contract with an owner of a corpus of literature to produce an ontological graph from their corpus;
b. producing a knowledge graph by creating a path-counts matrix from the parsed sentence entries from the corpus of literature wherein a column represents a linguistic dependency path, the rows represent a pair of terms, and the entries represent the number of times the terms are connected by the path in a sentence, wherein revenue is derived from the use of the knowledge graph that was generated from the owner's corpus of literature.
(Dependent claims: 34, 35)

36. A graph representing assertions derived from a body of literature, wherein the assertions are represented in statements, wherein each of the statements includes two terms and a relation, the relation term connecting the two terms, thereby forming an assertion, the graph comprising:
a. a plurality of assertions, each representing the two terms and a relation, wherein the relation is a directional relation; and
b. at least one estimated probability that the directional relation of at least one of the assertions is true or false.

37. A method for determining a confidence level of an assertion present in a body of literature wherein the assertion represents a relationship between two terms, the method comprising:
a. generating relational data to represent a relationship between each of the terms and the assertion; and
b. using the relational data to estimate a confidence level for the assertion.
(Dependent claims: 38)

39. A method for determining a veracity level of an assertion representing a relationship between two terms using a body of literature, the method comprising:
a. from the body of literature, automatically accessing assertions where each assertion represents a relation that connects the two terms;
b. for the automatically accessed statements, defining a numerically-based relationship with the assertion;
c. using the numerically-based relationship to generate estimated probability data as a confidence level for the assertion.

40. A computer implemented method comprising:
a. generating relational data from a corpus of literature for a pair of terms in a corpus of literature; and
b. correlating the relational data with a confidence level for an assertion, wherein the assertion comprises the terms and a directional relation that connects the terms.
(Dependent claims: 41, 42)

43. A method comprising:
a. executing computer code that generates training data comprising a plurality of elements, each element comprising (i) an assertion comprising a pair of terms from a corpus and a directional relation between the terms, (ii) a confidence level that the assertion is true or false for the terms and (iii) relational data between the terms derived from the corpus; and
b. executing computer code that generates a rule that classifies the confidence that the assertion is true or false for a pair of terms from the corpus.

44. A system comprising:
a. a database comprising a corpus of literature in machine readable form; and
b. a computer comprising an algorithm for determining a confidence level of an assertion present in a body of literature wherein the assertion represents a relationship between two terms, wherein the algorithm:
(i) generates relational data to represent a relationship between each of the terms and the assertion; and
(ii) uses the relational data to estimate a confidence level for the assertion.

Specification