Dynamic taxonomy process for browsing and retrieving information in large heterogeneous data bases
First Claim
1. A process for retrieving information on large heterogeneous databases, wherein information retrieval is performed through visual queries on dynamic taxonomies, said dynamic taxonomies being an organization of concepts that ranges from a most general concept to a most specific concept, said concepts and their generalization or specialization relationships being an intension, documents in said databases being able to be classified under different concepts, said documents and their classification being called an extension, said process comprising:
- initially displaying a complete taxonomy for said retrieval;
selecting subsets of interest of said complete taxonomy in order to refine said retrieval, said subsets of interest being specified by selecting taxonomy concepts and combining them through boolean operations or being specified through querying methods, which retrieve classified documents according to different selection criteria, including words contained in a document;
displaying a reduced taxonomy for said selected set, said reduced taxonomy being derived from the original taxonomy by pruning the concepts under which no document in the selected subset of interest is classified; and
iteratively repeating said steps of selecting subsets and of displaying a reduced taxonomy to further refine said retrieval, wherein;
said process is performed on documents of any type and format;
said intension is organized as a hierarchy of concepts or as a directed acyclic graph of concepts, thereby allowing a concept to have multiple fathers;
said process dynamically reconstructs all relationships among concepts based on the classification without requiring, in the intension, concept relationships in addition to generalization or specialization, a relationship between any two concepts existing if and only if at least one document is classified (1) under a first concept or any descendants of the first concept, and (2) under a second concept, or any descendants of the second concept;
documents in said classification are classified under a concept at any level in said intension, including concepts with no sons;
said taxonomy supports operations for concept insertion, deletion, and modification;
said taxonomy supports operations for document insertion and classification, deletion, and reclassification;
documents in said classification are classified manually, programmatically, or automatically;
said process allows retrieval through different languages on a same database, while maintaining the same classification for all said languages;
said classification is explicitly stored as a set or list of documents for each concept or implicitly stored in external structures;
said explicitly stored classification includes, for each concept in the intension, deep classification, which records all documents classified under the concept and under any of its descendants, and a shallow classification, which records all the documents classified directly under the concepts, said shallow classification being only required if documents can be classified under non-terminal concepts and is equivalent, by definition, to the deep classification for terminal concepts;
said deep and shallow classifications are physically stored as compressed or uncompressed bit vectors, or as compressed or uncompressed inverted lists, or as Bloom'"'"'s filters, or in a relational database system;
said process accounts for an age of documents either explicitly or implicitly, or by a lazy reclassification of a minimum number of concepts;
said intension and classification are used either for querying/browsing the database or to dynamically inform a user when documents of interest are added or modified in the database; and
said step of displaying a reduced taxonomy either reports only the concepts belonging to the reduced taxonomy or, for each such concept, also reports how many documents in the interest set are classified under the concept.
3 Assignments
0 Petitions
Accused Products
Abstract
A process is disclosed for retrieving information in large heterogeneous data bases, wherein information retrieval through visual querying/browsing is supported by dynamic taxonomies; the process comprises the steps of: initially showing a complete taxonomy for the retrieval; refining the retrieval through a selection of subsets of interest, where the refining is performed by selecting concepts in the taxonomy and combining them through Boolean operations; showing a reduced taxonomy for the selected set; and further refining the retrieval through an iterative execution of the refining and showing steps.
71 Citations
12 Claims
-
1. A process for retrieving information on large heterogeneous databases, wherein information retrieval is performed through visual queries on dynamic taxonomies, said dynamic taxonomies being an organization of concepts that ranges from a most general concept to a most specific concept, said concepts and their generalization or specialization relationships being an intension, documents in said databases being able to be classified under different concepts, said documents and their classification being called an extension, said process comprising:
-
initially displaying a complete taxonomy for said retrieval;
selecting subsets of interest of said complete taxonomy in order to refine said retrieval, said subsets of interest being specified by selecting taxonomy concepts and combining them through boolean operations or being specified through querying methods, which retrieve classified documents according to different selection criteria, including words contained in a document;
displaying a reduced taxonomy for said selected set, said reduced taxonomy being derived from the original taxonomy by pruning the concepts under which no document in the selected subset of interest is classified; and
iteratively repeating said steps of selecting subsets and of displaying a reduced taxonomy to further refine said retrieval, wherein;
said process is performed on documents of any type and format;
said intension is organized as a hierarchy of concepts or as a directed acyclic graph of concepts, thereby allowing a concept to have multiple fathers;
said process dynamically reconstructs all relationships among concepts based on the classification without requiring, in the intension, concept relationships in addition to generalization or specialization, a relationship between any two concepts existing if and only if at least one document is classified (1) under a first concept or any descendants of the first concept, and (2) under a second concept, or any descendants of the second concept;
documents in said classification are classified under a concept at any level in said intension, including concepts with no sons;
said taxonomy supports operations for concept insertion, deletion, and modification;
said taxonomy supports operations for document insertion and classification, deletion, and reclassification;
documents in said classification are classified manually, programmatically, or automatically;
said process allows retrieval through different languages on a same database, while maintaining the same classification for all said languages;
said classification is explicitly stored as a set or list of documents for each concept or implicitly stored in external structures;
said explicitly stored classification includes, for each concept in the intension, deep classification, which records all documents classified under the concept and under any of its descendants, and a shallow classification, which records all the documents classified directly under the concepts, said shallow classification being only required if documents can be classified under non-terminal concepts and is equivalent, by definition, to the deep classification for terminal concepts;
said deep and shallow classifications are physically stored as compressed or uncompressed bit vectors, or as compressed or uncompressed inverted lists, or as Bloom'"'"'s filters, or in a relational database system;
said process accounts for an age of documents either explicitly or implicitly, or by a lazy reclassification of a minimum number of concepts;
said intension and classification are used either for querying/browsing the database or to dynamically inform a user when documents of interest are added or modified in the database; and
said step of displaying a reduced taxonomy either reports only the concepts belonging to the reduced taxonomy or, for each such concept, also reports how many documents in the interest set are classified under the concept. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12)
a dictionary relation for each language supported, which binds each concept identifier to a label describing that concept, which will be presented to the user when the taxonomy is displayed;
a father-to-son relation which lists, for each concept, all sons of the concept, said list being configurable as to be ordered; and
a son-to-father relation which lists, for each concept, all fathers of the concept; and
a conceptual schema for the extension comprises, for each concept;
a deep classification that lists all documents classified under the concept or any descendants of the concept;
a shallow classification, which lists all documents directly classified under the concept; and
a classification relation which lists, for each document, all concepts under which the document is directly classified, ancestors of said concepts being recoverable through the son-to-father relation in the intension.
-
-
3. The process according to claim 1, wherein boolean operations on concepts are implemented through corresponding set operations on the deep classification of said concepts.
-
4. The process according to claim 1, wherein said step of displaying a reduced taxonomy for the set selected in said selecting step comprises a testing operation such that a concept is displayed in the reduced taxonomy if an intersection between the set and the deep classification of the concept is not empty, the testing operation being configured to optionally count a number of documents in said intersection to show a user a number of documents in the set that are also classified under the concept, said testing operation being also configured to be applied to the shallow classification, if used, to show the user a number of documents in the set that are also directly classified under the concept, the numbers being useful when documents can be classified at any level of abstraction in the taxonomy, said testing operation being also configured to be applied to a set including a single document d, in order to compute a classification of d, if not explicitly stored, said testing operation being used to produce a reduced tree by testing and displaying the sons of a root and, on subsequent explosion of a concept c, testing and displaying the sons of concept c.
-
5. The process according to claim 1, wherein said taxonomy supports modification operations on the intension, such operations supporting:
-
an insertion of a new concept, performed by inserting the new concept in the dictionaries, in the father-to-son relation, and in the son-to-father relation;
a deletion of an existing concept C, performed by;
deleting from the intension all concepts in Cdown(C), wherein Cdown(C) denotes a set of all descendants of the concept C, in union with the concept C;
adding documents in the deep classification of the concept C to the shallow classification of a set of the fathers of the concept C in the taxonomy;
removing the shallow and deep classifications for all concepts in Cdown(C);
removing the concepts in Cdown(C) from a classification of all documents in the deep classification of the concept C;
changing a labeling of a concept, said changing step only requiring a modification of an appropriate dictionary;
changing a place of a concept in the taxonomy; and
adding an additional father to a concept in the taxonomy.
-
-
6. The process according to claim 1, wherein said taxonomy supports modification operations on the classification, such operations supporting:
-
an insertion of a new document d, said insertion comprising the steps of;
for each c∈
{C}, wherein C denotes a set of concepts under which d must be classified, inserting DID(d) in the shallow classification of c, if c is not a terminal concept and the shallow classification must be stored;
inserting DID(d) in the deep classification of Cup(c), Cup(c) denoting a set of all ancestors of the concept c in union with c; and
inserting c in a classification of d;
a deletion of an existing document d, said deletion comprising the steps of;
retrieving a set of concepts {C} under which d is shallowly classified;
for each c∈
{C}, deleting DID(d) from the shallow classification of c;
for each ancestor of c, c included, deleting DID(d) from the deep classification of each ancestor;
deleting an entry corresponding to DID(d) from the classification relation;
a reclassification of an existing document d, according to a different concept, said reclassification comprising the steps of;
letting d be initially classified under a concept c, possibly null, c′
being a new concept under which d must be classified, possibly null;
if both c and c′
are non-null, classifying d under c′
;
if c is null, additionally classifying d under c′
;
if c′
is null, removing the original classification under c;
if c is not null, eliminating DID(d) from the shallow classification of c;
eliminating DID(d) from the deep classification of all ancestors c″
of c, c included;
eliminating c from the classification of d;
if c′
is not null, inserting DID(d) in the shallow classification of c′
, if the shallow classification of c exists;
inserting DID(d) in the deep classification of all ancestors c″
of c, c included;
inserting c′
in the classification of d.
-
-
7. The process according to claim 1, wherein said deep and shallow classifications are physically stored as uncompressed or compressed bit vectors, and a counting of documents in a result of logic operations on bit vectors is performed through a constant table V whose size is 2n, whose element V[i] contains a number of bits at 1 in binary number i, and processing the uncompressed form of the bit vector n bits at a time, adding to a counter for every group j of n bits, whose binary value is v′
- , an amount V[v′
].
- , an amount V[v′
-
8. The process according to claim 1, wherein said classification is implicitly stored as virtual concepts in external structures, said virtual concepts being concepts for which neither actual sons, that are actual values of a domain to be represented, nor an actual classification are stored, but instead are computed, said virtual concept being able to be a simple virtual concept, which is completely described by four abstract operations:
-
V1;
given a virtual concept v, retrieve all its sons;
V2;
given a virtual concept v, retrieve its deep classification;
V3;
given the son s of a virtual concept v, retrieve its deep classification; and
V4;
given a document d, find all the terminal concepts, descendants of v, under which it is classified;
said abstract operations being implemented based on two abstract relations, for each virtual concept v;
a relation Sv which stores a set of documents for each value in a domain of values of the virtual concept; and
its inversion Cv, optionally stored, which stores a set of values for each document classified under said virtual concept;
said virtual concept being configured to be a derived virtual concept, which refers to said two abstract relations, with additional restrictions;
additional information being kept in the intension, for each concept, to describe its type;
for each simple virtual concept to address said S, and said optional Cv relation;
for each derived virtual concept to address said two abstract relations and a restriction to apply to its base relations (Sv and Cv).
-
-
9. The process according to claim 8, wherein said process accounts for an age of documents implicitly by means of virtual concepts, or by a lazy reclassification of a minimum number of concepts, a representation of time-varying concepts by virtual concepts being characterized in that the time-varying concepts, whose value is represented by abstract timestamps, can be represented by a virtual concept, representing with T the timestamp value of the current time, sons of said time-varying concept V being retrieved through an abstract query that selects all unique T-values in Cv where T is a current time, the deep classification of said time-varying concept V being retrieved through an abstract query that selects all unique DIDs from Cv, the classification of a son s of said time-varying concept V being retrieved through an abstract query that selects all unique DIDs in Cv with value=T-s, where T is the current time, in the lazy evaluation of time-varying concepts, the time-varying concepts being split into N intervals, from more recent to older, which are stored as real concepts, for each interval I, the following being kept:
-
(a) a list L(I) of DIDs in an interval ordered by decreasing timestamps, (b) an interval representative of IR(I), a last DID in the interval together with its timestamp, (c) a classification criterion for the interval, the classification of documents being periodically recomputed, for each interval I, by reclassifying IR(I) if IR(I) needs reclassification and setting IR(I) as the last DID in the ordered list (a), and deleting IR(I).DID from I, while iteratively inserting IR(I) in interval i+1 to N if IR(I) meets a classification criterion for interval i.
-
-
10. The process according to claim 1, wherein said dynamic taxonomy is used to represent data represented by a single view on an external database, the documents corresponding to tuples, rows, records, or objects, in a view V, and, in order to identify a document, a candidate key of the view being used as document identifier (DID) or two abstract relations being kept for mapping system-generated DIDs to and from a primary key PK of the view, a taxonomy T for V being able to be constructed by inserting concepts of interest for V in the taxonomy, each concept C being associated to a boolean clause B(C, t), where t denotes a tuple, said boolean clause being able to reference any attribute of t and returning true if and only if t must be classified under C, said concept C being a real concept or a virtual concept, said virtual concept using V itself for a synthesis of sons and extensions, wherein if said real concept represents an attribute A of V such that said boolean clause is true when t.A∈
- Sc, where Sc is a subset of the domain of attribute A, the boolean clause is replaced, for a better performance, by a table listing, for each value v in domain(A), corresponding concepts under which t is to be classified, in order to create new taxonomic generalizations, if the sons of a new taxonomic generalization G are real concepts, no boolean clause for G being needed, a classification under G being automatically performed by an insertion operation for a new document, a binding for real concepts requiring an insertion for any new tuple, a deletion if t is deleted or a reclassification if t is changed, said process, in order to classify t, locating a set C of concepts for which B(c, t), c∈
C is satisfied and classifying t under ∀
c∈
C, and binding for virtual concepts is realized by operations of V itself.
- Sc, where Sc is a subset of the domain of attribute A, the boolean clause is replaced, for a better performance, by a table listing, for each value v in domain(A), corresponding concepts under which t is to be classified, in order to create new taxonomic generalizations, if the sons of a new taxonomic generalization G are real concepts, no boolean clause for G being needed, a classification under G being automatically performed by an insertion operation for a new document, a binding for real concepts requiring an insertion for any new tuple, a deletion if t is deleted or a reclassification if t is changed, said process, in order to classify t, locating a set C of concepts for which B(c, t), c∈
-
11. The process according to claim 1, wherein dynamic taxonomies are used in order to represent user profiles of interest to alert users for new interesting documents based on dynamic taxonomy profiles, said process further comprising the steps of:
-
acquiring multiple conceptual expressions from a user, said conceptual expressions defining subjects in which the user is interested;
accepting said conceptual expressions by a monitoring system;
coupling, by said monitoring system, an abstract user address, to which alerts are to be sent, to said conceptual expressions;
monitoring, by said monitoring system, an information base for changes performed thereto, said information base being described by the same taxonomy used by users to express their interests;
when a change occurs in the information base, finding, by said monitoring system, the users to be alerted based on their interests;
dynamic taxonomy concepts being used to realize said monitoring step, for said dynamic taxonomy concepts additional expressions, being composed by AND with taxonomic expressions, and being solved, if required, after a corresponding taxonomic expression is satisfied, a checking step whether a document d satisfies a user specification S being performed by applying a query specified in S to a set {d} and checking whether d is retrieved, if specifications only comprise conjunction operations and document classification is only under terminal concepts, two abstract storage structures being used;
a directory of specifications (SD) relating SID and N, SPEC, where SID is an abstract identifier which uniquely identifies the specification, SPEC is the specification itself, N is a number of concepts referenced in the specification;
a specification inversion (SI), relating CID and SID, listing for each concept, represented by its concept identifier, all specifications represented by their specification ID using that concept wherein, when a specification is created, its abstract identifier is created, its directory entry being created in SD and a set of concepts referenced in the specification being stored in an inversion SI, while, when a document d is changed, C being a set of concepts under which d is classified, a set of specifications that apply to d being found as follows;
K being a set of concepts used to classify document d, for each concept k in K, SID(k) is a list of specifications for k, accessible through relation SI, ordered by increasing specification ID'"'"'s;
defining MergeCount(K) as a set composed of pairs (SID, N) such that SID is in MergeCount (K) if SID belongs to a SID (k), k in K;
if a pair (SID, N) is in MergeCount(K), N counts a number of SID(k) referencing SID;
if S is an initially empty set, which represents a set of specifications satisfied by d, for each pair (SID, N), retrieving SID.N from SD;
if SID.N=N;
S=S union SID, when there are specifications using unrestricted set operations, S, represented by SID(S), being a specification and the following steps being used;
transforming S in disjunctive normal form as a disjunction of conjunctions, each conjunctive clause in S being called a component of S, SIDi(S) denoting the i-th component of S;
storing a directory of specifications as two abstract relations;
SD, omitting N, and SCD, relating COMPONENT and SDI, N, where COMPONENT stores components of specifications, COMPONENT.SDI represents a specification ID of specification S of which COMPONENT is a component, and COMPONENT.N is a number of concepts referenced in the component;
storing a specification inversion as relation (SI) between CID and CID.COMPONENT, where CID is a concept identifier and CID.COMPONENT is a set of components referencing the concept identified by CID;
with K being the set of concepts used to classify document d, for each concept k in K, COMPONENT (k) is a list of components for k, accessible through relation SI, ordered by increasing component ID'"'"'s;
defining ComponentMergeCount(K) as a set composed of pairs (COMPONENT, N) such that COMPONENT is in ComponentMergeCount(K) if COMPONENT belongs to a COMPONENT(k), k in K;
if a pair (COMPONENT, N) is in ComponentMergeCount(K), N counting the number of COMPONENT(k) referencing COMPONENT;
with S being a set initially empty, for each pair (COMPONENT, N), retrieving COMPONENT.N through relation SCD;
if COMPONENT.N=N;
S=S union COMPONENT.SID, S representing a set of specifications satisfied by d;
the modification of a specification inversion SI comprising the steps of;
if a specification or component Z references concept C, represented by CID(C) then;
if C is a terminal concept;
CID(C).SID=CID(C).SID union Z, if Z is a specification,
CID(C).COMPONENT=CID(C).COMPONENT union Z, if Z is a component
if C is a non-terminal concept, for each k in Cdown(C);
ClD(k).SID=CID (k).SID union Z, if Z is a specification,
CID(k).COMPONENT=CID(k).COMPONENT union Z, if Z is a component,the specifications satisfied by a set of documents D whose cardinality is greater than 1 being computed by applying previous techniques without modifications to every document d in D, then removing possible duplicate specifications;
or being computed by;
defining as K a set of concepts used to classify D;
applying an adequate technique among the described ones; and
determining a set S of candidate specifications, every specification s in S being then checked by performing it on D.
-
-
12. The process according to claim 1, wherein the reduced taxonomy is totally computed in a single step, wherein RT is a set of concepts in the reduced taxonomy, RT being computed by applying said testing operation, for each concept C in the intension, and further in that said operation is speeded up through the steps of:
-
initializing a table S of size T, where T is a number of concepts in the intension and S[i] holds information on a current status of concept i, initialized at pending;
starting from uppermost levels, and continuing down in a tree, processing concept i;
if S[i] is empty, determining that i does not belong to RT, and continuing the processing with a next concept;
if S[i] is not empty, applying said testing operation to i;
if said testing operation produces a non-empty intersection, determining that i belongs to RT;
otherwise, determining that neither i nor any of its descendants belong to RT and setting to empty all S[j] in S, such that j is a descendant of i in the taxonomy, the descendants of I being either computed from the intension or being precomputed and stored in a table D, holding for each concept in the taxonomy a list of all concepts descending from it in the taxonomy, such a table being recomputed every time the intension changes.
-
Specification