Method for retrieving semantically distant analogies
Abstract
A process of identifying terms or sets of terms in target domains having functional relationships (roles) analogous to terms (contained in the query) selected from a source domain whereby query-relevant but semantically distant (novel) analogies may be retrieved, corresponding to any user defined query. The process is capable of discovering deep functional analogies between terms in source and target domains, even where there is a misleading superficial matching of terms (i.e. same terms, with different meanings) between the query and the target domains. The process comprises the automated generation of abstract representations of source domain content, and application of the abstract representations of content to the efficient discovery of analogous objects in one or more semantically distant target domains. Said abstract representations of terms are preferably vectors in a high dimensionality space, encapsulating characteristic occurrence patterns of terms in the source domain.
39 Claims
-
1. A method for selectively retrieving records or portions of records which contain information relative to at least one chosen term, said method comprising:
-
a) receiving a well defined first body of records selected from a first knowledge domain;
b) computing by means of a processing unit an abstract representation of terms from said well defined first body of records, said abstract representation encoding the characteristic co-occurrence relationships between terms in said first domain, and storing said abstract representation in a storage unit;
c) employing said stored abstract representation of term co-occurrence relationships to retrieve records relevant to one or more terms from said first body of records, said relevant records retrieved from within a second body of records stored in the storage unit;
wherein said second body of records is selected from at least one other knowledge domain which is semantically distant from said first knowledge domain and wherein said second body of records is substantially free of records containing said chosen terms or any synonyms of said chosen terms; and
d) displaying said relevant records as retrieved in step c as an output, said output indicating the relative degree of relevance of each retrieved record with respect to at least one of said chosen terms wherein each of said chosen terms or synonym thereof, from said first knowledge domain is represented in a majority of the records in said first body of records and said abstract representation of terms from said first knowledge domain is a set of term vectors in a single multi-dimensional vector space, said set of term vectors computed from said well defined first body of records representing said first domain;
the relative orientations of said term vectors in said multi-dimensional space collectively encoding the term co-occurrence relationships characteristic of said first domain.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9
the semantic distance between the domains is the Euclidean distance between the positions of the respective domain centroids in common semantic space;
the domain radii are the respective averages of the absolute values of the Euclidean distances between the positions of the respective domain centroids and the positions of the individual records representing the respective domains in the common semantic space;
each domain in the common semantic space is represented by a well defined body of records, together constituting a mixed-domain corpus of records;
each of said well defined domain representative bodies of records in said mixed domain corpus contains at least about 100 individual records;
the common semantic space being a multi-dimensional vector space in which each record in said mixed-domain corpus of records is assigned a unique position based on its content;
the position of each record in the common semantic space is determined by computation of term vectors from the mixed-domain corpus and calculation of a summary vector for each record in the mixed domain corpus by use of the term vectors as computed from the mixed-domain corpus;
said summary vectors derived from said mixed corpus are such that for any subset of summary vectors corresponding to a subset of records from the mixed corpus there is a single logical relative orientation of the summary vectors that defines the relative meaning of the records in said mixed domain corpus;
each record in the mixed-domain corpus of records is uniquely assignable to one of the domains represented in the common semantic space.
-
-
8. The method according to claim 7 wherein the common semantic space is a vector space of about 100 to about 1000 dimensions, records represented in the common semantic space are ASCII text documents, said documents are written in the same human language, and each of said well defined domain representative bodies of records in said mixed domain corpus contain between about 1000 and about 50,000 individual records, each record containing a plurality of terms.
-
9. The method according to claim 8 wherein the semantic distance between said first domain and any other one of said knowledge domains is at least twice the sum of the respective domain radii as measured in the common semantic space.
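The distance criterion recited in claims 7 through 9 can be illustrated with a short sketch. It assumes each record already has a position (a list of floats) in the common semantic space; the function names and the `factor` parameter are hypothetical and are not taken from the patent text.

```python
import math

def centroid(points):
    """Component-wise mean of a list of equal-length position vectors."""
    n = len(points)
    dims = len(points[0])
    return [sum(p[d] for p in points) / n for d in range(dims)]

def euclidean(a, b):
    """Euclidean distance between two positions."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def domain_radius(points):
    """Average distance from the domain centroid to each record position."""
    c = centroid(points)
    return sum(euclidean(c, p) for p in points) / len(points)

def semantically_distant(domain_a, domain_b, factor=2.0):
    """True if the centroid distance is at least `factor` times the sum of
    the two domain radii (factor=2.0 mirrors the 'at least twice' wording)."""
    distance = euclidean(centroid(domain_a), centroid(domain_b))
    radii = domain_radius(domain_a) + domain_radius(domain_b)
    return distance >= factor * radii
```

Passing `factor=1.0` would correspond to a looser test in which the centroid distance only has to reach the sum of the radii.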
-
10. A method for efficiently retrieving semantically distant analogies, said method comprising:
-
a) assembling a first body of records according to a single consistent method of sampling;
b) computing an abstract representation of terms from said first body of records, said abstract representation encoding the characteristic co-occurrence relationships between terms in said first body, said co-occurrence relationships being stable with respect to sample size, and storing said abstract representation in a storage unit;
c) employing said stored abstract representation of term co-occurrence relationships to selectively retrieve records relevant to any of two or more terms chosen from said first body of records, said relevant records retrieved from within a second body of records stored in the storage unit;
wherein said second body of records is substantially devoid of records containing said two or more chosen terms or synonyms of said chosen terms; and
d) displaying said relevant records as an output, said output indicating the relative degree of relevance of each record retrieved with respect to at least one of said chosen terms wherein each of said two or more chosen terms, or synonyms thereof, from said first body of records are represented in at least about 70% of the records in said first body of records; and
wherein said first body of records contains at least about 100 records, each record containing a plurality of terms and said abstract representation of terms from said first body of records is a set of term vectors in a single multi-dimensional vector space, said set of term vectors computed from said first body of records; and
wherein relative orientations of said term vectors in said multi-dimensional space collectively encode the term co-occurrence relationships characteristic of said first body of records.
Dependent claims: 11, 12, 13, 14, 15, 16, 17
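The corpus conditions in claim 10 (at least about 100 records, each chosen term or a synonym present in at least about 70% of the first body, and a second body substantially devoid of those terms) can be checked mechanically. The sketch below is only an illustration: records are modeled as sets of terms, `chosen` is a hypothetical mapping from each chosen term to its set of variants (the term plus synonyms), and `max_leak` is an assumed numeric reading of "substantially devoid" that the claim itself does not define.

```python
def coverage(records, variants):
    """Fraction of records containing at least one of the term variants."""
    if not records:
        return 0.0
    hits = sum(1 for terms in records if not variants.isdisjoint(terms))
    return hits / len(records)

def check_bodies(first_body, second_body, chosen,
                 min_records=100, min_coverage=0.70, max_leak=0.01):
    """Report whether the two bodies of records satisfy the assumed thresholds.

    chosen: dict mapping each chosen term to a set of variants
            (the term itself plus any synonyms); hypothetical structure.
    """
    return {
        "enough_records": len(first_body) >= min_records,
        "terms_covered": all(coverage(first_body, v) >= min_coverage
                             for v in chosen.values()),
        "second_body_clean": all(coverage(second_body, v) <= max_leak
                                 for v in chosen.values()),
    }
```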
-
-
18. A method for selective retrieval of semantically distant analogies, said method comprising:
-
a) receiving one or more characteristic terms specific to a given knowledge domain of interest to a user;
b) assembling by use of at least one of the characteristic terms or a synonym thereof, a well defined body of records which support the characteristic terms by providing a domain specific context, and storing the body of records in a storage unit;
c) receiving one or more target domains which target domains are semantically distant from the domain represented by the characteristic terms of step a;
d) assembling a representative body of domain specific records from each of the target domains of step c, each record containing a plurality of terms, and combining said representative bodies of target domain records to create a search domain, wherein said search domain is substantially free of records containing characteristic terms or synonyms of characteristic terms received in step a, and storing the representative body of target domain records in the storage unit in a location separate from the body of records of step b;
e) computing by means of a processor a set of term vectors in a single multi-dimensional vector space for a body of selected terms from the body of records in step b, and storing the term vectors in the storage unit, wherein said body of selected terms includes and is greater than the set of characteristic terms received in step a, and further wherein the relative orientation of said term vectors corresponding to said body of selected terms collectively encodes characteristic co-occurrence relationships between the terms in the domain represented by the well defined body of records of step b;
f) computing a normalized summary vector using the term vectors of step e for each record in said search domain and storing the resulting set of summary vectors in the storage unit;
g) computing for at least one of the characteristic terms received in step a or a synonym thereof, the relative overlap of the term vector of said term or said synonym with each of the summary vectors obtained in step f and storing the results in the storage unit;
h) comparing the relative degree of vector overlap, as obtained from step g, between the term vector of said domain specific term or said synonym thereof and the summary vectors of the records in the search domain;
i) generating a relevance ranked list from said search domain; and
j) displaying by means of a suitable output device said relevance ranked list of records from said search domain for at least one of the domain specific terms received in step a or synonym thereof.
Dependent claims: 19, 20
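Steps f through i of claim 18 can be sketched as follows, taking the term vectors of step e as given. Two assumptions are made here that the claim does not state: the summary vector is formed by summing the term vectors of a record's terms before normalizing, and "relative overlap" is computed as a dot product of unit vectors. All names are hypothetical.

```python
import math

def normalize(v):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)

def summary_vector(record_terms, term_vectors):
    """Normalized sum of the term vectors of the terms found in one record.
    Terms without a term vector are skipped."""
    dims = len(next(iter(term_vectors.values())))
    total = [0.0] * dims
    for term in record_terms:
        vec = term_vectors.get(term)
        if vec is not None:
            for i, x in enumerate(vec):
                total[i] += x
    return normalize(total)

def rank_records(term, search_domain, term_vectors):
    """Return (overlap, record_id) pairs, highest overlap first.
    Overlap is taken to be the dot product (an assumption)."""
    query = normalize(term_vectors[term])
    scored = []
    for record_id, terms in search_domain.items():
        summary = summary_vector(terms, term_vectors)
        overlap = sum(a * b for a, b in zip(query, summary))
        scored.append((overlap, record_id))
    return sorted(scored, reverse=True)
```

The characteristic term itself need not appear in any record of the search domain; a record scores well when its other terms have term vectors oriented close to that of the characteristic term.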
-
-
21. A method for selective retrieval of semantically distant analogies, said method comprising:
-
a) receiving one or more source domains;
b) assembling a well defined first body of records from said source domains, wherein each record contains a plurality of terms, and storing said records in the storage unit;
c) receiving one or more target domains, said target domains being semantically distant from said source domains;
d) assembling a second body of records from said target domains, wherein each record contains a plurality of terms, and storing said second body of records in the storage unit of the computer in a location separate from said first body of records;
e) assembling a single training corpus comprising records from said first body of records and optionally in addition a lesser proportion of records from said second body of records, and storing said training corpus separately within the storage unit;
f) computing a set of term vectors in a single multi-dimensional vector space for a body of selected terms in the training corpus of step e, wherein the relative orientation of said term vectors corresponding to said selected terms collectively encode domain specific co-occurrence relationships between the terms within the training corpus, and storing said set of term vectors in the storage unit;
g) computing a set of normalized summary vectors for each record in said second body of records using the term vectors of step f and storing said summary vectors in the storage unit;
h) receiving one or more queries, each query containing one or more chosen terms from said first body of records which chosen terms are substantially absent from said second body of records and for which there are substantially no synonymous terms in said second body of records, further wherein said chosen terms form a subset within the body of selected terms in step f and said body of selected terms in step f is larger than said subset of chosen terms;
i) computing a query vector for at least one of the queries received in step h using the term vectors of step f and storing the resulting query vector in the storage unit;
j) computing, for at least one of the queries received in step h, a measure of the relative overlap of the query vector corresponding to said query with each of the summary vectors from step g, and storing the results in the storage unit;
k) for at least one query received in step h, displaying a relevance ranked list of records from said second body of records by comparing the relative degree of vector overlap, as obtained from step j, between the query vector of said query and the individual summary vectors of the records in said second body of records in order to conduct the relevance ranking.
Dependent claims: 22, 23, 24
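Claim 21 differs from claim 18 mainly in steps i and j, where a query vector is built for a query that may contain several chosen terms. The claim does not say how the term vectors are combined; the sketch below assumes a normalized sum and a dot-product overlap, and its function names are hypothetical.

```python
import math

def unit(v):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)

def query_vector(query_terms, term_vectors):
    """Normalized sum of the term vectors of the chosen terms in a query
    (assumed combination rule)."""
    known = [term_vectors[t] for t in query_terms if t in term_vectors]
    if not known:
        raise ValueError("no query term has a term vector")
    summed = [sum(column) for column in zip(*known)]
    return unit(summed)

def overlaps(qv, summary_vectors):
    """Dot product of the query vector with each stored summary vector."""
    return {record_id: sum(a * b for a, b in zip(qv, sv))
            for record_id, sv in summary_vectors.items()}
```

Sorting the resulting dictionary by value in descending order would give the relevance ranked list of step k.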
-
-
25. A method for selectively retrieving records or portions of records which contain information relative to at least one chosen term said method comprising the steps of:
-
a) inputting to a storage unit a training corpus consisting essentially of a plurality of records wherein a majority of said records in said training corpus contain said chosen term, and wherein each of said records in said training corpus further contains a plurality of other terms;
b) inputting to the storage unit a second body of records consisting essentially of a plurality of records wherein said second body of records is substantially free of records containing said chosen term or synonyms of said chosen term;
c) processing the records in said training corpus to compute a body of term vectors in a single multi-dimensional vector space for at least a portion of the terms in said training corpus, wherein said portion of terms processed includes and is greater than said chosen term, and wherein the relative orientation of the term vectors in said body of term vectors encodes co-occurrence relationships between terms in the training corpus, and storing said set of term vectors in the storage unit;
d) computing, by means of the term vectors of step c and the operation on said second body of records, a set of normalized summary vectors for the records in said second body of records, and storing said set of summary vectors in the storage unit;
e) computing by the operation on the summary vectors of step d and the term vector of said at least one chosen term as computed in step c, a measure of the relative amount of overlap between said term vector of said chosen term with each of the summary vectors of the records in said second body of records, and storing the results in the storage unit;
f) computing from the vector overlap results of step e, an ordered list of the vector overlap measures computed in step e, said ordered list arranged according to the relative amount of vector overlap; and
g) displaying an ordered output list of records from said second body of records, said output list corresponding to the ordered list computed in step f, wherein said output list contains only records from said second body of records.
Dependent claims: 26, 28, 29, 30, 31, 32
the semantic distance between the two bodies of records is the Euclidean distance between the positions of their respective centroids in the common semantic space; and
wherein the common semantic space is a multi-dimensional vector space in which each record in the mixed corpus of records, said mixed corpus consisting of all records from said training corpus plus all records from said second body of records, is assigned a unique position based on its content;
and wherein the position of each record in the common semantic space is determined by computation of term vectors from said mixed corpus and computation of a summary vector for each record in the mixed corpus by summing all said term vectors for the terms in each record and normalizing, said term vectors as obtained from the mixed corpus;
and wherein said summary vectors derived from said mixed corpus are such that for any subset of summary vectors corresponding to a subset of records from the mixed corpus there is a single logical relative orientation of the summary vectors that defines the relative meaning of the records in said mixed domain corpus.
-
-
32. The method according to claim 28 wherein the semantic distance between said training corpus and said second body of records is greater than the sum of the respective average distances between the positions of centroids of each body of records and the positions of the individual records within each body of records, as measured in a common semantic space and wherein said average distances are determined for each body of records by summing the absolute values of the Euclidean distances in the common semantic space between the respective centroid positions and the positions of all the records in the respective body of records and then dividing the total by the number of records in the respective body of records;
- and wherein
the semantic distance between the two bodies of records is the Euclidean distance between the positions of their respective centroids in the common semantic space; and
wherein the common semantic space is a multi-dimensional vector space in which each record in the mixed corpus of records, said mixed corpus consisting of all records from said training corpus plus all records from said second body of records, is assigned a unique position based on its content;
and wherein the position of each record in the common semantic space is determined by computation of term vectors from said mixed corpus and computation of a summary vector for each record in the mixed corpus by summing all said term vectors for the terms in each record and normalizing, said term vectors as obtained from the mixed corpus;
and wherein said summary vectors derived from said mixed corpus are such that for any subset of summary vectors corresponding to a subset of records from the mixed corpus there is a single logical relative orientation of the summary vectors that defines the relative meaning of the records in said mixed domain corpus.
-
27. In a computer having a storage unit and a processing unit, a computer implemented method suitable for selectively retrieving records which contain information relative to at least one chosen term said method comprising:
-
a) inputting to the computer storage unit by means of an input device a training corpus consisting of a plurality of records wherein a majority of said records in said training corpus contain said at least one chosen term, and wherein each of said records in said training corpus further contains a plurality of other terms;
b) inputting to the computer storage unit by means of an input device a second body of records consisting of a plurality of records wherein said second body of records is devoid of records containing said chosen term or synonyms of said chosen term;
c) processing by means of the processing unit the records in said training corpus in order to compute a body of term vectors in a single multi-dimensional vector space for at least a portion of the terms in said training corpus, wherein said portion of terms processed includes and is greater than said chosen term, and wherein the relative orientation of the term vectors in said body of term vectors encodes co-occurrence relationships between terms in the training corpus, and storing said set of term vectors in the storage unit;
d) computing, by means of the term vectors of step c and the processing unit operating on said second body of records, a set of normalized summary vectors for the records in said second body of records, and storing said set of summary vectors in the storage unit;
e) computing by means of the processing unit operating on the summary vectors of step d and the term vector of said at least one chosen term as computed in step c, a measure of the relative amount of overlap between said term vector of said chosen term with each of the summary vectors of the records in said second body of records, and storing the results in the storage unit;
f) computing by means of the processing unit operating on the vector overlap results of step e an ordered list of the vector overlap measures computed in step e, said ordered list arranged according to the relative amount of vector overlap; and
g) displaying by means of an output device an ordered list of records from said second body of records, said output list corresponding to the ordered list computed in step f, wherein said output list contains only records from said second body of records.
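Step c of claims 25 and 27 requires term vectors whose relative orientation encodes co-occurrence relationships, but the claims do not fix a particular construction. As a deliberately simple stand-in, and not the method of the patent, the sketch below uses rows of a record-level term co-occurrence count matrix as term vectors. The function name and the choice of "same record" as the co-occurrence window are assumptions.

```python
from itertools import combinations

def cooccurrence_term_vectors(records):
    """Build one count vector per term from a training corpus.

    records: iterable of term lists, one list per record.
    Returns (vocab, vectors), where vectors[t][i] is the number of records
    in which term t appears together with vocab[i]. A term's co-occurrence
    with itself is not counted.
    """
    vocab = sorted({term for record in records for term in record})
    index = {term: i for i, term in enumerate(vocab)}
    vectors = {term: [0] * len(vocab) for term in vocab}
    for record in records:
        # Count each unordered pair of distinct terms once per record.
        for a, b in combinations(sorted(set(record)), 2):
            vectors[a][index[b]] += 1
            vectors[b][index[a]] += 1
    return vocab, vectors
```

Raw count vectors grow with vocabulary size, so a practical system would likely reduce them to a few hundred dimensions; that step is outside this sketch.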
-
-
33. A method comprising automated generation of an abstract representation of terms from a first knowledge domain, said representation encoding the co-occurrence relationships between terms characteristic of said first domain;
- storing said representation in the storage unit of the computer;
receiving at least one chosen term from said first domain; and
applying said abstract representation of term co-occurrence relationships to facilitate the selective automated discovery of one or more objects, different from but analogous to said at least one chosen term, in one or more knowledge domains semantically distant from the first, wherein said abstract representation of terms from said first knowledge domain is a set of term vectors in a single multi-dimensional vector space, said set of term vectors generated from a well defined body of records representing said first domain, and wherein the relative orientations of said term vectors in said multi-dimensional vector space collectively encode the term co-occurrence relationships characteristic of said first domain.
Dependent claims: 34, 35, 36, 37, 38, 39
Specification