Methods and Systems of Automatic Ontology Population

US 20090012842A1
Filed: 04/25/2008
Published: 01/08/2009
Est. Priority Date: 04/25/2007
Status: Abandoned Application

Abstract

Methods and systems for creating a knowledge graph that relates terms in a corpus of literature in the form of an assertion and provides a probability of the veracity of the assertion are disclosed herein. Various aspects of the invention are directed to and/or involve knowledge graphs and structured digital abstracts (SDAs) offering a machine readable representation of statements in a corpus of literature. Various methods and systems of the invention can automatically extract, structure, and visualize the statements. Such graphs and abstracts can be useful for a variety of applications including, but not necessarily limited to, semantic-based search tools for search of electronic medical records, specific content verticals (e.g. newswire, finance, history) and general internet searches.

44 Claims

1. A method for generating a knowledge graph from a corpus of literature wherein the corpus has multiple documents, comprising:
- a. dividing documents from the corpus into sentences;
  
  b. parsing each sentence into entries wherein an entry comprises (i) a pair of terms and (ii) a linguistic dependency path describing a directional relation between the terms;
  
  c. creating a path-counts matrix from the parsed sentence entries comprising rows and columns wherein a row represents a pair of terms, a column represents a linguistic dependency path, and a cell represents the number of times in the corpus that the terms are connected by the path in a sentence;
  
  d. creating a knowledge graph comprising a plurality of statements, wherein each statement is obtained from a portion of the corpus, each statement comprising at least four elements wherein two elements are terms, one element is a directional relation that connects the two terms to form an assertion, and one element is an estimated probability that the assertion is true or false, wherein at least two statements share one term in common and one term not in common and at least one statement comprises an assertion that is not a hypernym/hyponym assertion;
  
  wherein the knowledge graph is created by:
  
  i. creating a training data set by assigning to a subset of term pairs probabilities of the truth of a directional relation for the pair;
  
  ii. using entries in the path-counts matrix and the training data set to produce rules for determining the probability related to the truth of a relation; and
  
  iii. assigning probabilities of the truth of the relation for pairs of terms of the knowledge graph using the rules, thereby creating the knowledge graph; and
  
  e. storing the knowledge graph on a computer readable medium.
- Dependent Claims (2, 3)
  - 2. The method of claim 1 further comprising the step of creating a link from the knowledge graph to at least one sentence from which the probabilities were derived.
  - 3. The method of claim 1, wherein the training data set is modifiable by a user.
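
Steps a through c of claim 1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the applicants' implementation: the `ENTRIES` list, the path strings, and the `build_path_counts` helper are hypothetical, and the entries are hard-coded where a real system would run a sentence splitter and a dependency parser.

```python
from collections import Counter

# Hypothetical parsed entries: (term_a, term_b, dependency_path).
# Each tuple stands for one sentence in which the path connects the two terms.
# The path strings are made up for illustration only.
ENTRIES = [
    ("aspirin", "headache", "nsubj<-treats->dobj"),
    ("aspirin", "headache", "nsubj<-treats->dobj"),
    ("aspirin", "headache", "nsubj<-relieves->dobj"),
    ("BRCA1", "breast cancer", "nsubjpass<-associated->prep_with"),
]


def build_path_counts(entries):
    """Return (pairs, paths, matrix).

    Rows are ordered term pairs, columns are dependency paths, and each cell
    counts how many entries connect that pair through that path.
    """
    counts = Counter()
    for term_a, term_b, path in entries:
        counts[((term_a, term_b), path)] += 1

    pairs = sorted({pair for pair, _ in counts})
    paths = sorted({path for _, path in counts})
    # Counter returns 0 for combinations that never occurred.
    matrix = [[counts[(pair, path)] for path in paths] for pair in pairs]
    return pairs, paths, matrix


if __name__ == "__main__":
    pairs, paths, matrix = build_path_counts(ENTRIES)
    for pair, row in zip(pairs, matrix):
        print(pair, row)
```

The matrix is kept as a dense list of lists for readability; on a real corpus most cells would be zero, so a sparse representation would be the more likely choice.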

4. A knowledge graph on a computer readable medium derived from a corpus of literature comprising a plurality of statements, wherein each statement is derived from a portion of the corpus, each statement comprising at least four elements wherein:
- a. two elements are terms;
  
  b. one element is a directional relation that connects the two terms to form an assertion; and
  
  c. one element is an estimated probability that the assertion is true or false;
  
  wherein at least two statements share one term in common and one term not in common and at least one statement comprises an assertion that is not a hypernym/hyponym assertion.
- Dependent Claims (5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17)
  - 5. The graph of claim 4, wherein the assertion contains an ontological relationship.
  - 6. The graph of claim 4, wherein each statement comprises at least five elements wherein one element is a back-trace object that provides a link to the portion of the corpus that supports the veracity of the assertion.
  - 7. The graph of claim 4, wherein the probability element of some statements is automatically generated from a corpus of data.
  - 8. The graph of claim 4, wherein the probability element of most assertions in the graph is automatically generated from a corpus of data.
  - 9. The graph of claim 4, wherein the graph is a resource description framework.
  - 10. The graph of claim 9, wherein the framework is a probabilistic RDF.
  - 11. The graph of claim 4, wherein the probability element is derived from a path-counts matrix from the corpus of literature wherein a column represents a linguistic dependency path, a row represents a pair of terms, and an entry represents the number of times the pair of terms is connected by the path in a sentence.
  - 12. The graph of claim 11, wherein the path-counts matrix is from parsed sentences of the corpus of literature.
  - 13. The graph of claim 11, wherein the entry of the path-counts matrix represents a boolean vector of the number.
  - 14. The graph of claim 13, wherein the probability is calculated from the boolean vector by logistic regression.
  - 15. A method of searching a corpus of literature comprising obtaining the link from the back-trace object of the graph of claim 6.
  - 16. The method of claim 15 further comprising displaying the portion of the corpus from which the assertion was obtained.
  - 17. The graph of claim 5, wherein the ontological relationship is part of an ontology.
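
Claims 11 through 14 describe deriving the probability from a boolean form of the path counts by logistic regression. The sketch below is one possible reading, not the method as disclosed: it assumes the "boolean vector" means each count is reduced to present or absent, and the training rows, labels, learning rate, and function names are all hypothetical. It uses plain gradient descent so that it runs without third-party libraries.

```python
import math

# Hypothetical path-count rows for four labeled term pairs (three paths each).
TRAIN_COUNTS = [[3, 0, 0], [1, 2, 0], [0, 0, 4], [0, 1, 2]]
# Hypothetical training labels: 1.0 means the relation holds for the pair.
TRAIN_LABELS = [1.0, 1.0, 0.0, 0.0]


def to_boolean(row):
    """Assumed binarization: 1.0 if the path was seen at least once, else 0.0."""
    return [1.0 if count > 0 else 0.0 for count in row]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, x):
    """Estimated probability that the relation holds for feature vector x."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


def fit_logistic(features, labels, steps=2000, lr=0.5):
    """Fit logistic regression by batch gradient descent on cross-entropy loss."""
    n = len(features)
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(steps):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(features, labels):
            error = predict(weights, bias, x) - y
            grad_b += error
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
        bias -= lr * grad_b / n
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
    return weights, bias


if __name__ == "__main__":
    w, b = fit_logistic([to_boolean(r) for r in TRAIN_COUNTS], TRAIN_LABELS)
    # Score an unlabeled pair that was seen only through the first path.
    print(predict(w, b, to_boolean([2, 0, 0])))
```

The fitted weights play the role of the "rules" in claim 1, step ii, and `predict` corresponds to step iii. Because the gradient has the same form for labels anywhere between 0 and 1, the same code would accept the probabilistic training labels that claim 1 describes.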

18. An automatically produced structured digital abstract of a document comprising a machine readable abstract comprising a plurality of statements wherein a statement comprises at least four elements wherein:
- a. two elements are terms;
  
  b. one element is a directional relation that connects the two terms to form an assertion; and
  
  c. one element is an estimated probability that the assertion is true or false.
- Dependent Claims (19, 20)
  - 19. The structured digital abstract of claim 18 wherein the probability element is generated by applying rules determined using a path-counts matrix produced from parsed sentence entries from a corpus of literature, wherein a column in the path-counts matrix represents a linguistic dependency path, a row represents a pair of terms, and an entry represents the number of times in the corpus the terms are connected by the path in a sentence.
  - 20. The structured digital abstract of claim 18 wherein the assertions further comprise a link to the portion of the corpus from which the assertion was derived.
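
The statement structure recited in claims 4, 6, 18, and 20 (two terms, a directional relation, an estimated probability, and an optional link back to the source text) could be modeled roughly as follows. The class names, field names, and the `to_sda` helper are hypothetical, and treating a structured digital abstract as the statements traced back to one document is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class BackTrace:
    """Hypothetical back-trace object: where in the corpus the assertion came from."""
    document_id: str
    sentence_index: int


@dataclass(frozen=True)
class Statement:
    """Four required elements plus an optional fifth (the back-trace)."""
    subject: str          # first term
    relation: str         # directional relation, read subject -> obj
    obj: str              # second term
    probability: float    # estimated probability that the assertion is true
    back_trace: Optional[BackTrace] = None

    def __post_init__(self):
        if not 0.0 <= self.probability <= 1.0:
            raise ValueError("probability must lie in [0, 1]")


def to_sda(statements: List[Statement], document_id: str) -> List[Statement]:
    """Collect the statements whose back-trace points at one document."""
    return [
        s for s in statements
        if s.back_trace is not None and s.back_trace.document_id == document_id
    ]


if __name__ == "__main__":
    # Made-up statements for illustration only.
    graph = [
        Statement("aspirin", "treats", "headache", 0.93, BackTrace("doc-1", 4)),
        Statement("aspirin", "causes", "ulcer", 0.61, BackTrace("doc-2", 9)),
    ]
    print(to_sda(graph, "doc-1"))
```

The two example statements share the term "aspirin" and differ in the other term, which is the overlap condition claim 4 places on the graph.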

21. A method of semantically searching biomedical literature comprising:
- a. providing a search string, wherein the string is at least one of a term, a relation, and an assertion of two terms with a directional relation linking the terms;
  
  b. comparing the search string with a knowledge graph produced from a corpus of literature which is stored on a computer readable medium comprising a plurality of statements, wherein each statement is obtained from sentences within the corpus, each statement comprising at least four elements wherein:
  
  i. two elements are terms;
  
  ii. one element is a directional relation that connects the two terms to form an assertion;
  
  iii. one element is an estimated probability that the assertion is true or false; and
  
  iv. one element is a back-trace object that provides a link to the portion of the corpus from which the assertion was obtained;
  
  c. ranking the statements obtained from the back-trace object that are most closely related to the search assertion; and
  
  d. displaying a representation of a subset of the statements that are closely related to the search assertion.
- Dependent Claims (22, 23, 24, 25, 26, 27, 28, 29)
  - 22. The method of claim 21 further comprising displaying a sentence from the corpus from which the statement was obtained using the back-trace object.
  - 23. The method of claim 21 further comprising displaying a reference from the corpus from which the statement was obtained using the back-trace object.
  - 24. The method of claim 21 wherein the ranking is determined by at least one of the criteria selected from the group consisting of:
    - the extent to which the statements match the search assertion, the impact factor of the reference from which the statements were derived, the number of citations to the papers from which the statements were derived, the number of citations to the authors of each paper, the number of citations involving topics which the paper covers, the time at which these papers were published, and the extent to which a given statement is central to a given topic.
  - 25. The method of claim 21 wherein the knowledge graph is a structured digital abstract.
  - 26. The method of claim 21 wherein the knowledge graph is a resource description framework.
  - 27. The method of claim 26, wherein the framework is a probabilistic RDF.
  - 28. The method of claim 21 wherein the portion of a sentence from which the statement was obtained is highlighted.
  - 29. The method of claim 21 wherein entering search terms comprises issuing SQL or SPARQL queries.
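
The compare, rank, and display steps of claim 21 might look like the sketch below. It is an illustration under stated assumptions rather than the claimed method: the `GRAPH` contents are made up, the dictionary layout and function names are hypothetical, and ranking uses only the first criterion of claim 24 (how well a statement matches the search assertion), with the estimated probability as an assumed tie-breaker.

```python
# Made-up knowledge graph; "source" stands in for the back-trace object
# as a (document id, sentence index) pair.
GRAPH = [
    {"subject": "aspirin", "relation": "treats", "object": "headache",
     "probability": 0.93, "source": ("doc-1", 4)},
    {"subject": "aspirin", "relation": "causes", "object": "ulcer",
     "probability": 0.61, "source": ("doc-2", 9)},
    {"subject": "ibuprofen", "relation": "treats", "object": "fever",
     "probability": 0.88, "source": ("doc-3", 1)},
    {"subject": "BRCA1", "relation": "associated_with", "object": "breast cancer",
     "probability": 0.97, "source": ("doc-4", 2)},
]


def match_score(statement, query):
    """Fraction of the fields given in the query that the statement matches."""
    specified = [(key, value) for key, value in query.items() if value is not None]
    if not specified:
        return 0.0
    hits = sum(1 for key, value in specified
               if statement[key].lower() == value.lower())
    return hits / len(specified)


def search(graph, query, top_k=3):
    """Return up to top_k statements, best match first, probability as tie-breaker."""
    scored = [(match_score(s, query), s["probability"], s) for s in graph]
    scored = [item for item in scored if item[0] > 0]
    scored.sort(key=lambda item: (item[0], item[1]), reverse=True)
    return [statement for _, _, statement in scored[:top_k]]


if __name__ == "__main__":
    # A query may give a term, a relation, or a full assertion; None means "any".
    query = {"subject": "aspirin", "relation": "treats", "object": None}
    for hit in search(GRAPH, query):
        print(hit["subject"], hit["relation"], hit["object"],
              hit["probability"], hit["source"])
```

Each result carries its `source` pair, which is what a display step such as claims 22 and 23 would follow to show the originating sentence or reference.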

30. A computer implemented method of searching the internet comprising:
- a. methodically searching documents on web pages;
  
  b. extracting the content of the pages with a program that utilizes a path-counts matrix, pairs of terms, and corresponding relationship probabilities derived from a corpus of literature to extract pairs of terms and calculate probabilities for relations between the terms; and
  
  c. storing the extracted content of the pages in a computer readable format.

31. A computer program product that generates a knowledge graph comprising:
- a. code that divides documents from the corpus into sentences;
  
  b. code that parses each sentence into entries wherein an entry comprises (i) a pair of terms and (ii) a linguistic dependency path describing a directional relation between the terms;
  
  c. code that creates a path-counts matrix from the parsed sentence entries comprising rows and columns wherein a row represents a pair of terms, a column represents a linguistic dependency path, and a cell represents the number of times in the corpus that the terms are connected by the path in a sentence;
  
  d. code that creates a knowledge graph comprising a plurality of statements, wherein each statement is obtained from a portion of the corpus, each statement comprising at least four elements wherein two elements are terms, one element is a directional relation that connects the two terms to form an assertion, and one element is an estimated probability that the assertion is true or false, wherein the knowledge graph is created by:
  
  i. creating a training data set by assigning to a subset of term pairs probabilities of the truth of a directional relation for the pair;
  
  ii. using entries in the path-counts matrix and the training data set to produce rules for determining the probability related to the truth of a relation; and
  
  iii. assigning probabilities of the truth of the relation for pairs of terms of the knowledge graph using the rules, thereby creating the knowledge graph.

32. A computer program product that generates a structured digital abstract comprising:
- a. code that divides a document into sentences, wherein the document belongs to or is to be added to a corpus of literature;
  
  b. code that parses each sentence into entries wherein an entry comprises (i) a pair of terms and (ii) a linguistic dependency path describing a directional relation between the terms;
  
  c. code that creates a path-counts matrix from the parsed sentence entries comprising rows and columns wherein a row represents a pair of terms, a column represents a linguistic dependency path, and a cell represents the number of times in the corpus that the terms are connected by the path in a sentence; and
  
  d. code that creates a knowledge graph comprising a plurality of statements, wherein each statement is obtained from a portion of the corpus, each statement comprising at least four elements wherein two elements are terms, one element is a directional relation that connects the two terms to form an assertion, and one element is an estimated probability that the assertion is true or false, wherein the knowledge graph is related to the document, thereby creating a structured digital abstract.

33. A business method comprising:
- a. entering into a contract with an owner of a corpus of literature to produce an ontological graph from their corpus;
  
  b. producing a knowledge graph by creating a path-counts matrix from the parsed sentence entries from the corpus of literature wherein a column represents a linguistic dependency path, the rows represent a pair of terms, and the entries represent the number of times the terms are connected by the path in a sentence, wherein revenue is derived from the use of the knowledge graph that was generated from the owner's corpus of literature.
- Dependent Claims (34, 35)
  - 34. The business method of claim 33 wherein the revenue is derived by selling ad space on a web page that allows search of the knowledge graph.
  - 35. The business method of claim 33 wherein the revenue is derived by selling access to the database.

36. A graph representing assertions derived from a body of literature, wherein the assertions are represented in statements, wherein each of the statements includes two terms and a relation, the relation connecting the two terms, thereby forming an assertion, the graph comprising:
- a. a plurality of assertions, each representing the two terms and a relation, wherein the relation is a directional relation; and
  
  b. at least one estimated probability that the directional relation of at least one of the assertions is true or false.

37. A method for determining a confidence level of an assertion present in a body of literature wherein the assertion represents a relationship between two terms, the method comprising:
- a. generating relational data to represent a relationship between each of the terms and the assertion; and
  
  b. using the relational data to estimate a confidence level for the assertion.
- Dependent Claims (38)
  - 38. The method of claim 37 wherein the relational data is represented in a path-counts matrix.

39. A method for determining a veracity level of an assertion representing a relationship between two terms using a body of literature, the method comprising:
- a. from the body of literature, automatically accessing assertions where each assertion represents a relation that connects the two terms;
  
  b. for the automatically accessed statements, defining a numerically-based relationship with the assertion;
  
  c. using the numerically-based relationship to generate estimated probability data as a confidence level for the assertion.

40. A computer implemented method comprising:
- a. generating relational data from a corpus of literature for a pair of terms in a corpus of literature; and
  
  b. correlating the relational data with a confidence level for an assertion, wherein the assertion comprises the terms and a directional relation that connects the terms.
- Dependent Claims (41, 42)
  - 41. The method of claim 40 further comprising displaying the confidence level and the assertion on a user interface.
  - 42. The method of claim 40 further comprising providing the confidence level and assertion to a user conducting a computer based search.

43. A method comprising:
- a. executing computer code that generates training data comprising a plurality of elements, each element comprising (i) an assertion comprising a pair of terms from a corpus and a directional relation between the terms, (ii) a confidence level that the assertion is true or false for the terms and (iii) relational data between the terms derived from the corpus; and
  
  b. executing computer code that generates a rule that classifies the confidence that the assertion is true or false for a pair of terms from the corpus.

44. A system comprising:
- a. a database comprising a corpus of literature in machine readable form; and
  
  b. a computer comprising an algorithm for determining a confidence level of an assertion present in a body of literature wherein the assertion represents a relationship between two terms, wherein the algorithm:
  
  (i) generates relational data to represent a relationship between each of the terms and the assertion; and
  
  (ii) uses the relational data to estimate a confidence level for the assertion.

Current Assignee: Counsyl Incorporated (Myriad Genetics, Inc.)
Original Assignee: Counsyl Incorporated (Myriad Genetics, Inc.)
Inventors: Snow, Rion L.; Srinivasan, Balaji S.
Application Number: US12/110,199
Publication Number: US 20090012842A1
US Class Current: 705/10
CPC Class Codes:
G06F 16/3344 using natural language anal...
G06F 40/30 Semantic analysis
