Systems, methods, and apparatus for processing documents to identify structures
Abstract
In various embodiments, multiple heterogeneous documents are processed to identify structures, such as chemical structures, contained therein, including non-embedded structures. Also described is a graphical user interface that permits a user to search for a structure or substructure within a set of electronic documents, then displays the matching structures as well as the actual pages of the documents on which the matching structures are found. Display of the actual pages allows the user to verify the matches and provides helpful context for the user.
19 Claims
1. An apparatus for electronically identifying and compiling chemical structures found in a storage facility comprising one or more electronic files, the apparatus comprising:
(a) a memory for storing a code defining a set of instructions; and
(b) a processor for executing the set of instructions, wherein the instructions, when executed by the processor, cause the processor to:
(i) identify a plurality of candidate chemical structures in at least a portion of the one or more electronic files of the storage facility, wherein each electronic file of the portion of the one or more electronic files comprises at least one respective non-embedded image of a chemical structure, and identifying a respective candidate chemical structure comprises identifying one or more graphical features common to chemical structures;
(ii) for each candidate chemical structure of the plurality of candidate chemical structures, derive a respective chemical structure object with an associated set of properties, wherein one or more properties of the set of properties is derived from at least a portion of the one or more graphical features common to chemical structures, a first property of the set of properties is a number of carbons, wherein the number of carbons is derived from the one or more graphical features common to chemical structures, and a second property of the set of properties comprises one of the following: (A) number of hetero atoms, (B) number of bonds, (C) number of bonds of a selected bond order, (D) number of rings, and (E) formula weight;
(iii) for each chemical structure object, apply one or more filters to at least one property of the associated set of properties, wherein the one or more filters includes a filter configured to eliminate chemical structure objects having a value of the first property of the set of properties less than a predetermined number of carbons;
(iv) for each chemical structure object, compute a respective confidence factor value based on two or more properties of the set of properties associated with the chemical structure object; wherein one or more chemical structure objects are eliminated based on respective confidence factor values in order to reduce false positives; and
(v) provide, for storage in a searchable electronic compendium of identified chemical structure objects, chemical structure objects not eliminated by the one or more filters.
Dependent claims: 2, 3, 4, 5, 6, 7, 8
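The filtering and scoring steps recited in claim 1 can be illustrated with a minimal sketch. Everything named here is hypothetical and not taken from the patent: the `StructureObject` class, the threshold values `MIN_CARBONS` and `MIN_CONFIDENCE`, and in particular the confidence heuristic. The claim only requires that the confidence factor be based on two or more properties; this sketch assumes one plausible choice, comparing the detected bond count against the count expected for a connected graph (atoms - 1 + rings).

```python
from dataclasses import dataclass


@dataclass
class StructureObject:
    """Hypothetical container for properties derived from a candidate image."""
    name: str
    num_carbons: int
    num_hetero_atoms: int
    num_bonds: int
    num_rings: int


MIN_CARBONS = 3       # assumed "predetermined number of carbons"
MIN_CONFIDENCE = 0.5  # assumed cutoff for the confidence factor


def passes_carbon_filter(obj, min_carbons=MIN_CARBONS):
    """Filter step: reject objects with fewer carbons than the threshold."""
    return obj.num_carbons >= min_carbons


def confidence_factor(obj):
    """Illustrative heuristic built from several properties (an assumption).

    For a connected graph of non-hydrogen atoms, the number of bonds equals
    atoms - 1 + rings. The score is the ratio of the smaller to the larger
    of the detected and expected bond counts, so it lies in [0, 1].
    """
    atoms = obj.num_carbons + obj.num_hetero_atoms
    if atoms == 0 or obj.num_bonds == 0:
        return 0.0
    expected_bonds = atoms - 1 + obj.num_rings
    low, high = sorted((obj.num_bonds, expected_bonds))
    return low / high


def compile_structures(candidates, min_carbons=MIN_CARBONS,
                       min_confidence=MIN_CONFIDENCE):
    """Return the objects that survive the carbon filter and the cutoff."""
    kept = []
    for obj in candidates:
        if not passes_carbon_filter(obj, min_carbons):
            continue  # eliminated by the carbon-count filter
        if confidence_factor(obj) < min_confidence:
            continue  # eliminated as a likely false positive
        kept.append(obj)
    return kept


candidates = [
    StructureObject("benzene", 6, 0, 6, 1),
    StructureObject("pyridine", 5, 1, 6, 1),
    StructureObject("stray_line", 1, 0, 1, 0),  # too few carbons
    StructureObject("table_grid", 4, 0, 12, 0),  # far too many bonds
]
kept = compile_structures(candidates)
print([obj.name for obj in kept])  # prints ['benzene', 'pyridine']
```

In this sketch the surviving objects are what step (v) would hand to the searchable compendium. A real system would tune both thresholds and would likely combine more properties, such as formula weight, in the score.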
9. A method for automated identification of graphical representations of chemical structures comprising:
identifying, by a processor of a computing device, within an electronic file comprising one or more non-embedded images of chemical structures, at least one candidate chemical structure, wherein the at least one candidate chemical structure is identified based at least in part on graphical image processing, wherein the graphical image processing comprises identifying one or more graphical features common to chemical structures;
for each of the at least one candidate chemical structure, deriving, by the processor, a respective chemical structure object, wherein the respective chemical structure object comprises a set of properties, wherein one or more properties of the set of properties is derived from at least a portion of the one or more graphical features common to chemical structures, a first property of the set of properties is a number of carbons, and a second property of the set of properties comprises one of the following: (A) number of hetero atoms, (B) number of bonds, (C) number of bonds of a selected bond order, (D) number of rings, and (E) formula weight;
for each chemical structure object, applying, by the processor, one or more filters, wherein each filter of the one or more filters is configured to compare one or more properties of the set of properties of a given chemical structure object to one or more predetermined values, and a first filter of the one or more filters is configured to eliminate chemical structure objects based at least in part on a determination that the chemical structure object has fewer than a predetermined number of carbons;
for each chemical structure object, calculating a respective confidence factor value based on two or more properties of the set of properties associated with the chemical structure object, wherein each chemical structure of the at least one confirmed chemical structure is identified based at least in part upon respective confidence score associated with the respective chemical structure object derived therefrom; and
identifying, by the processor, at least one confirmed chemical structure, wherein each chemical structure of the at least one confirmed chemical structure is identified based at least in part upon the respective chemical structure object derived therefrom avoiding elimination by the one or more filters.
Dependent claims: 10, 11, 12, 13
14. A non-transitory computer-readable medium having instructions stored thereon, wherein the instructions, when executed by a processor, cause the processor to:
identify, within an electronic file comprising one or more non-embedded images of chemical structures, at least one candidate chemical structure, wherein the at least one candidate chemical structure is identified based at least in part on graphical image processing, wherein the graphical image processing comprises identifying one or more graphical features common to chemical structures;
for each of the at least one candidate chemical structure, derive a set of properties, wherein one or more properties of the set of properties is derived from at least a portion of the one or more graphical features common to chemical structures, a first property of the set of properties is a number of carbons, and a second property of the set of properties comprises one of the following: (A) number of hetero atoms, (B) number of bonds, (C) number of bonds of a selected bond order, (D) number of rings, and (E) formula weight;
for each candidate chemical structure, apply one or more filters, wherein each filter of the one or more filters is configured to compare one or more properties of the set of properties of a given candidate chemical structure to one or more predetermined values, and a first filter of the one or more filters is configured to eliminate candidate chemical structures based at least in part on a determination that the given candidate chemical structure has fewer than a predetermined number of carbons;
for each chemical structure object, calculate a respective confidence factor value based on two or more properties of the set of properties associated with the chemical structure object, wherein each chemical structure of the at least one confirmed chemical structure is identified based at least in part upon respective confidence score associated with the respective chemical structure object derived therefrom; and
identify at least one confirmed chemical structure, wherein each chemical structure of the at least one confirmed chemical structure is identified based at least in part upon the candidate chemical structure avoiding elimination by the one or more filters.
Dependent claims: 15, 16, 17, 18, 19
Specification