INFERRING BIOLOGICAL PATHWAYS FROM UNSTRUCTURED TEXT ANALYSIS
Abstract
A biological pathway is a series of actions within an organism that leads to a pathology or otherwise changes the organism's state. In the cell, these actions typically take place between molecules called proteins. Proteins within the cell interact in ways that are not fully understood, but evidence concerning these interactions is constantly being collected and published by microbiologists. The disclosed method automatically infers such biological pathways between proteins by analyzing the body of published literature about those proteins.
19 Claims
1. A method for discovering a pathway among a set of biological and/or chemical entities, comprising:
a) providing documents about each of the biological and/or chemical entities;
b) creating a vector space representation of the documents based on words and/or phrases occurring in the documents;
c) for each biological and/or chemical entity, creating a centroid in the vector space based on the vectors corresponding to documents mentioning that biological and/or chemical entity;
d) creating a relative distance network of the biological and/or chemical entities, in view of the centroids, thereby identifying a particular pathway connecting the centroids; and
e) finding at least one most connected centroid on said particular pathway, thereby identifying a particular biological and/or chemical entity for further investigation, wherein said particular biological and/or chemical entity corresponds to said at least one most connected centroid.
Dependent claims: 2, 3, 4, 5, 6, 7.
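Steps b) and c) of claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation: the whitespace tokenizer, raw term counts, the token-match test for "mentions", and the entity names `geneA`, `geneB`, `geneC` are all assumptions made up for illustration, and the three toy documents carry no biological meaning.

```python
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokens stand in for "words and/or phrases".
    return text.lower().split()


def build_vocabulary(docs):
    # One vector-space dimension per distinct token, in sorted order.
    return sorted({tok for doc in docs for tok in tokenize(doc)})


def doc_vector(doc, vocab):
    # Raw term counts; a weighting scheme such as TF-IDF could be swapped in.
    counts = Counter(tokenize(doc))
    return [float(counts[term]) for term in vocab]


def entity_centroids(docs, entities):
    """Average the vectors of the documents that mention each entity."""
    vocab = build_vocabulary(docs)
    vectors = [doc_vector(doc, vocab) for doc in docs]
    centroids = {}
    for entity in entities:
        mentioning = [
            vec for doc, vec in zip(docs, vectors)
            if entity.lower() in tokenize(doc)
        ]
        if mentioning:  # skip entities that no document mentions
            centroids[entity] = [
                sum(column) / len(mentioning) for column in zip(*mentioning)
            ]
    return vocab, centroids


# Toy corpus (hypothetical entities and sentences).
docs = [
    "geneA activates geneB",
    "geneB inhibits geneC",
    "geneA binds geneC",
]
vocab, centroids = entity_centroids(docs, ["geneA", "geneB", "geneC"])
```

Each centroid has one coordinate per vocabulary term, so entities discussed in similar language end up close together in the vector space.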
8. A method comprising:
a. receiving a set of biological and/or chemical entities of interest, E;
b. identifying a document set, R, mentioning any biological and/or chemical entity, and/or a variant thereof in E;
c. creating a dictionary, D, from common terms and/or phrases in documents of document set R;
d. assigning each document in document set R a numeric vector using a vector space model based on said dictionary D;
e. computing a centroid for each biological and/or chemical entity in E by averaging numerical vectors of documents in R mentioning that biological and/or chemical entity;
f. computing a distance matrix listing a distance between pairs of centroids;
g. creating a relative neighborhood graph of biological and/or chemical entities in E based on said computed distance matrix, said relative neighborhood graph identifying a particular pathway connecting computed centroids; and
h. identifying, from said relative neighborhood graph, at least one most connected centroid and outputting biological and/or chemical entity associated with said at least one most connected centroid.
Dependent claims: 9, 10, 11, 12, 13, 14, 15, 16, 17.
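Steps f through h of claim 8 can be sketched in a few lines of Python. The sketch uses the standard definition of a relative neighborhood graph: two points are joined unless some third point is strictly closer to both of them than they are to each other. The claim does not name a distance measure, so Euclidean distance is an assumption here, as are the function names and the two-dimensional toy centroids `E1` to `E4`.

```python
import math


def euclidean(u, v):
    # Assumption: Euclidean distance; the claim does not fix a metric.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def distance_matrix(centroids):
    # Step f: distance between every pair of centroids.
    names = sorted(centroids)
    dist = {
        (a, b): euclidean(centroids[a], centroids[b])
        for a in names for b in names
    }
    return names, dist


def relative_neighborhood_graph(names, dist):
    # Step g: keep edge (a, b) unless a third centroid c is strictly
    # closer to both a and b than a and b are to each other.
    edges = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            blocked = any(
                max(dist[a, c], dist[b, c]) < dist[a, b]
                for c in names if c not in (a, b)
            )
            if not blocked:
                edges.append((a, b))
    return edges


def most_connected(names, edges):
    # Step h: entities whose centroids have the highest degree in the graph.
    degree = {name: 0 for name in names}
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    top = max(degree.values())
    return [name for name in names if degree[name] == top]


# Hypothetical centroids in a 2-D space, chosen only to keep the example small.
toy_centroids = {
    "E1": [0.0, 0.0],
    "E2": [1.0, 0.0],
    "E3": [2.0, 0.0],
    "E4": [1.0, 1.0],
}
names, dist = distance_matrix(toy_centroids)
edges = relative_neighborhood_graph(names, dist)
hubs = most_connected(names, edges)
```

In this toy layout `E2` lies between the other three points, so every surviving edge touches it and it is returned as the single most connected entity. The pairwise check is cubic in the number of entities, which is acceptable for a small entity set E.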
18. A non-transitory, computer accessible memory medium storing program instructions for discovering a pathway among a set of biological and/or chemical entities, wherein the program instructions are executable by a processor to:
a. receive a set of biological and/or chemical entities of interest, E;
b. identify a document set, R, mentioning any biological and/or chemical entity, and/or a variant thereof in E;
c. create a dictionary, D, from common terms and/or phrases in documents of document set R;
d. assign each document in document set R a numeric vector using a vector space model based on said dictionary D;
e. compute a centroid for each biological and/or chemical entity in E by averaging numerical vectors of documents in R mentioning that biological and/or chemical entity;
f. compute a distance matrix listing a distance between pairs of centroids;
g. create a relative neighborhood graph of biological and/or chemical entities in E based on said computed distance matrix, said relative neighborhood graph identifying a particular pathway connecting computed centroids; and
h. identify, from said relative neighborhood graph, at least one most connected centroid and outputting biological and/or chemical entity associated with said at least one most connected centroid.
19. A system for discovering a pathway among a set of biological/chemical entities, the system comprising:
one or more processors; and
a memory comprising instructions which, when executed by the one or more processors, cause the one or more processors to:
a. receive a set of biological and/or chemical entities of interest, E;
b. identify a document set, R, mentioning any biological and/or chemical entity, and/or a variant thereof in E;
c. create a dictionary, D, from common terms and/or phrases in documents of document set R;
d. assign each document in document set R a numeric vector using a vector space model based on said dictionary D;
e. compute a centroid for each biological and/or chemical entity in E by averaging numerical vectors of documents in R mentioning that biological and/or chemical entity;
f. compute a distance matrix listing a distance between pairs of centroids;
g. create a relative neighborhood graph of biological and/or chemical entities in E based on said computed distance matrix, said relative neighborhood graph identifying a particular pathway connecting computed centroids; and
h. identify, from said relative neighborhood graph, at least one most connected centroid and outputting biological and/or chemical entity associated with said at least one most connected centroid.
Specification