System and method for global identification in a collection of documents
1. A method comprising:
- identifying matching pairs of one or more information objects corresponding to a real world object, one information object from a document and at least one information object from a document storage for a combination of global identification patterns that exist in the document and in the document storage;
ascertaining consistency of the matching pairs and determining which of the one or more information objects in the document are suitable for merging into the document storage; and
adding the one or more information objects from the document to the document storage to associate information objects corresponding to the real world object.
Techniques for machine-based identification of objects extracted from text documents in natural language are disclosed. An example method may comprise: identifying matching pairs of one or more information objects corresponding to a real world object, one information object from the document and at least one information object from the document storage for a combination of global identification patterns that exist in the document and in the document storage; ascertaining consistency of the matching pairs and determining which of the one or more information objects in the document are suitable for merging into the document storage; and adding the one or more information objects from the document to the document storage to associate information objects corresponding to the real world object.
|Fact-based object merging|
Patent #US 7,966,291 B1
Current AssigneeGoogle LLC
Sponsoring EntityGoogle Inc.
|Object Recognition Device, Object Recognition Method, Object Recognition Program, Feature Registration Device, Feature Registration Method, and Feature Registration Program|
Patent #US 20070269107A1
Current AssigneeSony Corporation
Sponsoring EntitySony Corporation
|Query Execution Systems and Methods|
Patent #US 20120310916A1
Current AssigneeYale University
Sponsoring EntityYale University
- 1. A method comprising:
identifying matching pairs of one or more information objects corresponding to a real world object, one information object from a document and at least one information object from a document storage for a combination of global identification patterns that exist in the document and in the document storage; ascertaining consistency of the matching pairs and determining which of the one or more information objects in the document are suitable for merging into the document storage; and adding the one or more information objects from the document to the document storage to associate information objects corresponding to the real world object.
- View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11)
- 12. A system comprising:
a memory; and a processor, coupled to the memory, the processor to; identify matching pairs of one or more information objects corresponding to a real world object, one information object from a document and at least one information object from a document storage for a combination of global identification patterns that exist in the document and in the document storage; ascertain consistency of the matching pairs and determining which of the one or more information objects in the document are suitable for merging into the document storage; and add the one or more information objects from the document to the document storage to associate information objects corresponding to the real world object.
- View Dependent Claims (13, 14, 15, 16, 17, 18)
- 19. A computer-readable non-transitory storage medium comprising executable instructions that, when executed by a processing device, cause the processing device to:
- View Dependent Claims (20)
This application is a continuation of U.S. application Ser. No. 14/714,556, filed May 18, 2015, and claims the benefit of priority to Russian Patent Application No. 2015/109,664, filed Mar. 19, 2015; disclosure of which is incorporated herein by reference in its entirety.
The present disclosure relates to the field of machine processing of natural language and identification of information in text documents.
When working with a collection of documents, it is often necessary to search for the desired information in the collection. Search results for information of interest may be generated by search engines using keywords entered by a user as a search query. Existing search systems enable users to use simple query languages to find documents that either contain or do not contain the words or word combinations specified by the user.
The search of information in numerous and heterogeneous text resources inevitably meets the facts that the same event, object, person is expressed in various documents by a different way using various words, expressions, notations etc. For example, a system of information extraction should understand that “Winter Olympics 2014”, HarrpHMep, rrporpaMMa H3BJieqeHH51 HHQ:>opMa:o;HH,ll;Omrma rroHHMaTh, qTo “Olympic Games in Sochi”, “Olympics in Sochi” etc. correspond to the same event, as well as “Yuri Gagarin”, “first cosmonaut of the Earth”, “first soviet cosmonaut” refer to the same person.
In order to increase the reliability and completeness of such searches and state that, for example, two objects from two different documents correspond to the same real world object, identifying features of such objects need to be determined or known. Still, even if some unique identifying features are determined for two objects, those objects can turn out to be different, such as complete namesakes.
Special models of presenting data, such as Resource Description Framework (RDF) are used to store information of objects in a collection of documents. RDF is a graph structure presenting a set of statements about entities, which are the real world objects (such as people, organizations, location), as well as facts (such as the fact of a person working at a particular organization). Each statement is presented in the form of three data entities (subject, predicate, object} and it is called a “triplet”. A plurality of statements—triplets form a graph with its nodes corresponding to objects and subjects linked by arcs—predicates directed from subjects to objects. Such RDF graphs can be constructed for one sentence as well as for the entire document in the collection of documents.
Each real world object in the collection is associated with one or more features of an RDF graph, and different copies of the same real world object in different documents can be characterized by the same features. Therefore, the task of global identification consists in comparing objects from texts in a natural language with each other and with real world objects and in creating RDF-graph and at one or more index of the document collection as different objects with identical features are represented in the RDF graph as the same object.
According to the present invention, a method of machine-based identification of one or more objects in a document and in a document storage, the objects corresponding to the same real world object, comprises: (a) searching for global identification patterns and a combination of the global identification patterns in the document; (b) searching for the same global identification patterns and their combinations in the document storage; (c) finding matching pairs of the combination of the patterns for said one or more objects from the document and from the document storage; (d) ascertaining consistency of the matching pairs and deciding which said one or more objects in the document can be merged with that in the document storage; and (e) adding the one or more objects in the document to the document storage. It is contemplated by the present method that step (e) further comprises adding one or more feature of said one or more information object of the document to the document storage if said one or more feature is absent from the document storage and if said one or more information object in the document and in the document storage corresponds to the same real world object.
It is also contemplated by the inventive method that step (e) further comprises adding one or more information object from the document to the document storage as a new information object if said one or more information object in the document storage does not have one or more information object in the document storage corresponding to the same real world object.
The method contemplates that the global identification patterns correspond to features of real world objects. The document and the document storage are presented in a form of an RDF graph.
The method also contemplates that a computer language SPARQL is used for communicating with the document storage. Additionally, ascertaining consistency of the matching pairs further comprises ascertaining cardinality of features of said one or more objects.
Various other aspects and embodiments of the disclosure are described in further detail below. It has been contemplated that features of one embodiment of the disclosure may be incorporated in other embodiments thereof without further recitation.
The Summary is neither intended nor should be construed as being representative of the full extent and scope of the present disclosure. All objects, features and advantages of the present disclosure will become apparent in the following detailed written description and in conjunction with the accompanying drawings.
The novel features believed being characteristic of the description are set forth in the appended claims.
The images in the drawings are simplified for illustrative purposes and are not depicted to scale.
To facilitate understanding, identical reference numerals are used in the drawings to designate, where possible, substantially identical elements that are common to the figures, except that alphanumerical extensions and/or suffixes may be added, when appropriate, to differentiate such elements.
Objects, features and advantages of the present disclosure are discussed below in reference to machine-based semantic identification of searched information in collections of text documents available, in an electronic form, in one of natural languages.
It has been contemplated that at least portions of the present invention may also be utilized for analysis of other types of documents (e.g., multi-language documents, etc.). We distinguish objects of real word and their mentions, names, references to them in texts in a natural language, that we call information objects.
Global identification in a collection of documents is a merging information objects related to the same object of real world in process of extracting information from a text (or corpus of texts) written in a natural language to merge RDF graphs of numerous documents into one common RDF graph by merging matching objects. In one embodiment, for accomplishing such goal a merger of information from a large selection of documents by running a global identification process for each document and successively comparing all documents among each other is applied. That process continues until the last pair of information objects is merged into the common RDF graph. Another possibility is merging the information objects by successively adding information objects from a new document to a storage. In order for the referenced merger to be effective, information from the documents need to be stored as special structures—RDF graphs.
Global identification consists of associating the information objects from the document with information objects already contained in the storage and merging identical information objects. Because the information about the objects is represented as a graph, global identification task may be alternatively formulated as a task of search for identical subgraphs in the RDF graphs of the document and the storage.
Global identification is implemented with the help of global identification patterns. The SPARQL language is used to define patterns. A single pattern and is responsible for only one of an information object'"'"'s features. Therefore, reliable identification generally uses combinations of patterns. For example, the pattern combination <“First name”, “Surname”> is used to identify an information object with type “Person”. A combination may consist of an arbitrary number of patterns. All global identification patterns are contained in a special library. It is structured so that patterns designed to identify various objects in the real world are stored separately from each other.
Different combinations can have different combination reliability, which is determined by three parameters. The first parameter is the weight which reflects how precisely the combination can identify objects. The weight ranges from 0 to 1. The closer it is to 1, the more probable it is that two objects correspond to the object/entity of the real world. The other two parameters are the unicity parameter and the specialty parameter. The specialty parameter indicates how many different values a combination has, the unicity parameter indicates how frequently different objects are characterized by the same combination value. The second and third parameters range from 0 to 1. If unicity is equal to 0, it means that the object has more than one feature. If specialty is equal to 0, it means that there is more than one object with that one feature. And vice versa, of unicity and specialty are equal to 1, then the feature of an object has only one value and there is only one object with that feature. For example, a combination <“first name”, “Surname”> has low combination reliability, because there can be many different people with the same first names and Surnames, but a combination <“first name”, “middle name”, “Surname”> will have a higher combination reliability, because complete namesakes are rare.
In one embodiment the reliable combinations is defined with the help of one or more classifiers. The classifiers are trained on a tagged corpus of texts, storing combinations of patterns that were used for merging information objects. The result of the classifier training is a set of combinations.
In one embodiment one or more linear classifiers are used. The classifiers output a set of coefficients—the weights bound to the patterns of global identification. Then, during the global identification for each information object the classifier can identify sets of patterns with the sum of weights exceeding a predetermined threshold, and determine such sets as combinations to be used for merging information objects.
Each pattern always contains variables this and that, called allocated variables. Variable this is the name for an identifiable information object. Variable this corresponds to an information object from the document, which is a merger candidate. Variable that is the name of an identifying information object (or the value of the feature). After comparing the patterns with the graph, variable that may correspond to either to an information object from the document or to a value of a simple property (a line, number or Boolean true/false) according to which the process of global identification will be accomplished. The value corresponding to that is called the value of the pattern for the given object.
In addition to this and that variables pattern may comprise local variables. These local variables define information objects, through which in RDF-graph of the document the identifiable information object (this) is linked with the identifying information object (that). If it is possible to find a group of information objects in the document, such that after the variables in the pattern are substituted for the objects of this group, the resulting substitution matches a subgraph of RDF-graph of the document, then the pattern is considered to be realized (or executed) in this document. The result of the searching and realizing of global identification patterns in the RDF-graph of document is a list of triples <this, that, pattern>, which are called matchings. Instead of this—variable in the list identifiers of information objects—candidates for merging are located, instead of that-variable identifiers of information objects are located or identifiers of specific values of simple properties (string, number, Boolean), depending on the feature specified in the pattern may be located.
Collections of documents, or, more specifically, their RDF graphs, are stored in a storage. Before the document enters the storage, the information must be extracted from the text and represented using a special data structure that enables rapid searching of the information and also allows it to be stored compactly. Moreover, the information extraction process itself represents a complex technical task, which for the purposes of the present invention is performed using a system of production rules that are in turn applied to structures resulting from a complete semantic-syntactic analysis.
The main steps of the method being described are outlined in
The results of the complete semantic and syntactic analysis are then used in the information extraction process at step 130, from which an RDF (Resource Description Framework) graph is generated. The information extraction module processes a forest of semantic-syntactic trees, one tree for each sentence of the source text. In accordance with the RDF concept, the extracted data is represented as a set of <subject, predicate, object> (<s, p, o>) triplets. The subject is some entity, or information object, that represents an object in the real world. The predicate is a certain feature that describes the subject. There are two types of predicates (features): attributes and relations. An attribute is a non-object feature with the value of a simple data type: string, integer, or Boolean value. A relation is an object feature having a value, which is another information object that represents a different entity in the real world. An object is therefore a given predicate'"'"'s value for a given subject and may be either a simple data type (integer, string, etc.) or the identifier of a different information object. There are various types of information objects, for example: Person, Location, Organization, Employer etc. All RDF data extracted from text conforms to a model of the domain (the types of information objects match concepts from an appropriate ontology) within which the information extraction module is running.
To add the information extracted from documents to the storage, global identification at step 140 may be performed. Its purpose is to join the RDF graphs of separate documents into one common graph, while merging information objects that represent the same object in the real world.
The global identification process concludes by importing the data derived from a new document into the storage of ontological data at step 150.
An RDF-graph of the document 200 is an input for the global identification process (wherein we assume that object identification has been performed within some document and that all information objects in the graph are different). In the first stage of the identification process, a search is launched for known patterns and combinations of known patterns 210 in the RDF-graph of the document 200. Then the storage is also searched 220 for the corresponding patterns and combinations. If corresponding patterns and combinations are found, then a list of information objects that are candidates for a merge 230 is generated. These candidate information objects are tested for consistency 240. Consistency means that merging the information objects does not violate the cardinality of their relations (consistency with the ontology is not violated). If a pair fails the test, then the identification process returns to step 220. If the list passes the consistency test 240, it means that the information object from the document is already contained in the storage and all of the information object'"'"'s new features extracted from the document are added 260 to the add list 270. During the step 220, if combinations found in the RDF-graph of the document do not have corresponding combinations in the storage, it means that the document contains new information objects and these new information objects are to be added to the add list 250. In the last step, the add list 270 is formed. It contains the new information objects from the document and the new features of the objects, already existing in the storage.
Template B is more complex, it defines the feature of information object of type <Person> “to be born in a given year”. This template also has graph representation 303 and triplet representation 304. Information object of type <PointOlTime>, which specifies the year, is not directly linked to the information object of type <Person> but through the fact of birth (<Birth>). This fact in RDF-graph is represented as a separate information object. Thus, for a description of this feature one should specify information objects of types <Person>, <PointOlTime> and <Birth>, object relations between them, and attributes for each of the information objects. Filled arrows on the graph representation 303 correspond to the object relations, dotted arrows correspond to non-object relations. The direction of the arrow is given from the subject “s” to the object “o”, the name of the arrow corresponds to the predicate “p”.
Then we realize the entire pattern 404 by identifying a path from the variable that to the variable this in the graph of the document. First, we identify all candidates for variable this. Then, for each of the identified candidate we look for a candidate for a neighbor variables (neighbor variable is a variable which is connected to the considered variable with a single arc). Then we look for the neighbors of the neighbors, and so on until there are no more candidates for neighbor variables or until a conflict is found. If no conflict is identified, than the pattern is fully realized, and we save all sets of objects that fall under this and that variables simultaneously. All such triples <this, that, the pattern> are combined into a list of matchings 405.
After all possible patterns of identification are found in the graph of the document, for this-objects of matchings we search for objects for which the complete combination is realized. The combination is realized when all of the patterns that are included in it are realized. After that, all matching for all combinations are collected and stored.
The need in implementing cascade global identification can arise only for the patterns, in which the variable that must be replaces by an information object. If this information object is not identified, the object that replaces this-variable also cannot be identified, because, in fact, there is no identifier, or in other words, its value does not exist.
When an information object is added to the storage, a number identifier is assigned to it. In one embodiment, the storage contains tables of searching indexes that facilitate searching (i.e. iterating) as described below:
- (a) double index <subject(s), document (d)>. For each information object, this index makes it possible to view a list of documents that contain it. A search of documents containing a sought-after information object may be conducted efficiently thanks to the fact that all pairs <s, d> for the sought after s are arranged sequentially in the table;
- (b) triplet index <subject (s), predicate (p), object (o)>: <s, p, o>;
- (c) quad index <document (d), subject (s), predicate (p), object (o)>, i.e. for each document the list of triplets extracted from that document is stored.
The tables contain identifiers of the concepts, predicates, information objects, documents, and simple property values. Concept and predicate (attribute and relation) identifiers may be assigned when defining a specific domain. In one embodiment, an information object'"'"'s identifier is assigned when a new node is added to the storage'"'"'s RDF graph (i.e. it is the information object'"'"'s index number in the storage). A document'"'"'s identifier is also assigned when the document is added to the storage. Simple property identifiers are identifiers of strings and numbers. String identifiers are computed using a special data structure called a trie. With a trie, a string may be used to quickly get its identifier and search for triplets where it is object'"'"'s value. Number identifiers may also be computed and stored using a trie (a binary tree having a key for each of its nodes). Objects have their own UR is that can be used in a query. When a query is sent to the storage, the response contains an iterator listing all the objects that fit the query. The iterator returns identifiers of the objects one by one in the ascending order.
For the list of formed queries the list of patterns 503 with iterators is generated that point to the identifiers of information objects from the storage that realize the patterns. The combination is considered to be realized if the identifier of an information object is found in all iterators corresponding to the patterns. After the list of patterns is formed, iterators need to be initialized 504. During initialization of the iterator the repository is searched for information objects that realize each template, empty iterators are deleted. Iterator is empty when in the storage there is no information object that realizes a corresponding pattern.
Then, we check whether the remaining patterns form a complete combination 505. If not, then there is no information object in the storage that was found in the document, and we can stop the process of finding pairs of candidates for merging 506. If, however, from the remaining list of patterns, we can form a combination, than we can start searching for candidates.
The process begins with sorting of iterators by their identification values. In an ordered list of patterns we identify the first and the last iterator for each combination. Then the minimum combination is selected. Minimum combination is a combination in which the index of the last iterator in an ordered list of patterns is minimal. Next it should be checked whether the combination of patterns was realized 508. The combination is realized, as previously mentioned, if all its patterns got the same information object. This condition is sufficient to check for the first and the last iterator of the combination. If the minimum combination is not realized, we should make the shift to the next iterator identifier 510, and in this case it is necessary to move the first iterators to the identifier which is the current value of the last iterator of not realized minimum combination. If the minimum combination is realized, then a pair of <source object of combinations, found identifier> is added in the resulting list of pairs of candidates for merging 509. Next the other combinations are viewed, and checked that the resulting pair of candidates for merging is also realized in other combinations. Then we proceed to step 510, but in this case the first iterators are shifted to the identifier equal to the current value of the last iterator of minimum combination plus one. After shifting the iterators, some lists of patterns may become empty, empty iterators are removed from the list 511, and then we return to step 505.
An example of searching combinations of patterns in the storage is illustrated in FIG. SA and
Fragment 513 illustrates the step of sorting iterators on the current values of objects'"'"' identifiers to which they point. In fragment 513 combination c1 is checked first. It is the minimum combination for this sorting of iterators: on the first local position (r1) and the last local position (r2) of this combination there is an object with identifier o1, i.e., the object with identifier 1 simultaneously realizes patterns r1 and r2, and therefore the complete combination is realized for this object. Further, the combination c2 is checked: on its first local position (r1) there is again an object with identifier o1, on its last local position (r4) there is an object with identifier o2. This means that the combination c2 is not realized, while the identifier of the object at the iterator 4 shouldn'"'"'t be taken into account. After a minimum combination is realized the identifier of the object for which this combination is realized should be removed, and the iterators must be shifted to a value not less than 1+1=2. Fragment 514 depicts iterators after the shift. In fragment 515 the resulting iterators are sorted by the identifiers value. Minimum combination at this time is combination c3, it is realized by the subject with identifier o2. Combination c2 is also realized, and combination c1 is not. In fragment 516 first iterators are shifted to a value of 2+1=3. In fragment 517, after sorting of iterators, minimum combination c1 is identified as not realized. This means that iterators should be shifted. In fragment 518 iterators are shifted to the identifier o5. In fragment 519 iterators are sorted, and all combinations are realized, the iterators are shifted to the identifier following the object o5 in pattern r1. There are no more information objects in pattern r1, and therefore the combinations c1 and c2 can no longer be realized. Therefore, as shown in fragment 521, further search will be made only for combination c3. The searching process stops when for every combination there are no more information objects that realize at least one pattern of this combination.
The hardware also typically receives a number of inputs and outputs for communicating information externally. For interface with a user or operator, the hardware\ may include one or more user input devices 606 (e.g., a keyboard, a mouse, imaging device, scanner, microphone) and a one or more output devices 608 (e.g., a Liquid Crystal Display (LCD) panel, a sound playback device (speaker)). To embody the present invention, the hardware typically includes at least one screen device.
For additional storage, the hardware may also include one or more mass storage devices 610, e.g., a floppy or other removable disk drive, a hard disk drive, a Direct Access Storage Device (DASD), an optical drive (e.g. a Compact Disk (CD) drive, a Digital Versatile Disk (DVD) drive) and/or a tape drive, among others. Furthermore, the hardware 1400 may include an interface with one or more networks 612 (e.g., a local area network (LAN), a wide area network (WAN), a wireless network, and/or the Internet among others) to permit the communication of information with other computers coupled to the networks. It should be appreciated that the hardware typically includes suitable analog and/or digital interfaces between the processor 602 and each of the components 604, 606, 608, and 612 as is well known in the art.
The hardware operates under the control of an operating system 614, and executes various computer software applications, components, programs, objects, modules, etc. to implement the techniques described above. Moreover, various applications, components, programs, objects, etc., collectively indicated by application software 616 in
In general, the routines executed to implement the embodiments of the invention may be implemented as part of an operating system or a specific application, component, program, object, module or sequence of instructions referred to as a “computer program.” A computer program typically comprises one or more instruction sets at various times in various memory and storage devices in a computer, and that, when read and executed by one or more processors in a computer, cause the computer to perform operations necessary to execute elements involving the various aspects of the invention. Moreover, while the invention has been described in the context of fully functioning computers and computer systems, those skilled in the art will appreciate that the various embodiments of the invention are capable of being distributed as a program product in a variety of forms, and that the invention applies equally to actually effect the distribution regardless of the particular type of computer-readable media used. Examples of computer-readable media include but are not limited to recordable type media such as volatile and non-volatile memory devices, floppy and other removable disks, hard disk drives, optical disks (e.g., Compact Disk Read-Only Memory (CD-ROMs), Digital Versatile Disks (DVDs), flash memory, etc.), among others. Another type of distribution may be implemented as Internet downloads.
Aspects of the present disclosure have been described above with respect to techniques for machine interpretation of information in text documents. However, it has been contemplated that portions of this disclosure may, alternatively or additionally, be implemented as separate program products or elements of other program products.
All statements, reciting principles, aspects, and embodiments of the disclosure and specific examples thereof are intended to encompass both structural and functional equivalents of the disclosure.
It will be apparent to those skilled in the art that various modifications can be made in the devices, methods, and program products of the present disclosure without departing from the spirit or scope of the disclosure. Thus, it is intended that the present disclosure includes modifications that are within the scope thereof and equivalents.