
Method for cleansing sequence-based data at query time

  • US 7,516,128 B2
  • Filed: 11/14/2006
  • Issued: 04/07/2009
  • Est. Priority Date: 11/14/2006
  • Status: Expired due to Fees
First Claim

1. A computer-implemented method of cleansing anomalies from sequence-based data at query time, comprising:

    loading sequence-based data into a database managed by a database management system (DBMS) of a computing system, said loading being performed at a load time of said sequence-based data that precedes a query time of said sequence-based data;

    receiving a cleansing rule C at a cleansing rules engine of said computing system;

    automatically converting, by said cleansing rules engine, said cleansing rule C to a template, said template including logic to compensate for one or more anomalies in said sequence-based data;

    receiving, at said query time and by a query rewrite engine of said computing system, a user query to retrieve said sequence-based data;

    automatically rewriting, at said query time and by said query rewrite engine, said user query to provide a rewritten query, said automatically rewriting including applying said logic included in said template to compensate for said one or more anomalies; and

    executing, at said query time, said rewritten query by said DBMS, wherein an answer provided by said executing said rewritten query is identical to a result of executing said user query on a set of data generated by an application of said cleansing rule C to all of said sequence-based data, wherein said automatically rewriting includes:

    for each context reference X of one or more context references included in a pattern of said cleansing rule C on a relational table R, performing a loop that includes:

    setting a correlation condition cr to a list of one or more conjuncts, said one or more conjuncts comprising at least one of:

    one or more explicit conjuncts included in a condition of said cleansing rule C and referring to said context reference X and one or more implied conjuncts, each implied conjunct being on a cluster key of said relational table R or a sequence key of said relational table R, wherein said correlation condition cr is a correlation condition between said context reference X and T, said T being a target reference included in said pattern,

    if said context reference X is a position-based context reference, then retaining in said one or more conjuncts of said correlation condition cr only position-preserving conjuncts,

    binding s to said target reference T, wherein said s is a query condition on said relational table R and is included in said user query (Q),

    running a transitivity analysis between said correlation condition cr and said query condition s,

    determining d, said d being a set including any conjunct of a condition generated through said transitivity analysis that refers only to said context reference X, and

    if set d is not empty, adding set d to a context condition cc, otherwise setting said context condition cc to an empty set and breaking out of said loop, wherein said context condition cc defines a context set for context reference X; and

    if said context condition cc is not said empty set, generating an expanded rewrite Qe as said rewritten query, otherwise setting said expanded rewrite Qe to a null value, wherein said generating said expanded rewrite Qe includes:

    computing an expanded condition ec as s ∥ cc,

    simplifying said query condition s to an optimized query condition s', said simplifying including setting said optimized query condition s' equal to s - cc, and

    computing said expanded rewrite Qe by an expression σ_{s'}(Φ_C(σ_{ec}(R))), wherein said Φ_C(σ_{ec}(R)) is a result of applying said cleansing rule C on a data set σ_{ec}(R), wherein said data set σ_{ec}(R) is a result of directly pushing said expanded condition ec to said relational table R and cleansing data of said relational table R selected by said expanded condition ec, and

    wherein a result of said automatically rewriting is an assurance that said answer provided by said executing said rewritten query is identical to said result of said executing said user query on said set of data generated by said application of said cleansing rule C to all of said sequence-based data.
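The equivalence the claim asserts can be illustrated with a small Python sketch. Everything in it is an assumption made for illustration and is not taken from the patent: the `Read` row layout, the duplicate-removal rule standing in for cleansing rule C, the 5-minute `WINDOW`, and the context condition `cc`, which is derived by hand rather than by an automated transitivity analysis. The sketch also reads the operator in "s ∥ cc" as a logical OR, and it reapplies `s` unchanged as the outer filter instead of modeling the simplification to s'.

```python
from typing import Callable, List, NamedTuple


class Read(NamedTuple):
    """Hypothetical row of relational table R (layout assumed for illustration)."""
    epc: str     # cluster key: tag identifier
    rtime: int   # sequence key: read time in minutes
    reader: str  # reader that produced the row


WINDOW = 5  # assumed duplicate window in minutes


def select(pred: Callable[[Read], bool], rows: List[Read]) -> List[Read]:
    """Relational selection: keep rows satisfying pred, preserving order."""
    return [r for r in rows if pred(r)]


def cr(x: Read, t: Read) -> bool:
    """Correlation condition between context row X and target row T."""
    return x.epc == t.epc and x.reader == t.reader and 0 < t.rtime - x.rtime < WINDOW


def cleanse(rows: List[Read]) -> List[Read]:
    """Assumed rule C: drop T if an earlier matching read X exists in the input."""
    return [t for t in rows if not any(cr(x, t) for x in rows)]


def s(r: Read) -> bool:
    """User query condition on the target reference."""
    return 10 <= r.rtime <= 20


def cc(r: Read) -> bool:
    """Hand-derived context condition: s on T plus cr implies 5 < X.rtime < 20."""
    return 5 < r.rtime < 20


def ec(r: Read) -> bool:
    """Expanded condition, reading 's || cc' as a disjunction (an assumption)."""
    return s(r) or cc(r)


def cleanse_everything_then_query(rows: List[Read]) -> List[Read]:
    """Reference answer: apply C to all data, then run the user query."""
    return select(s, cleanse(rows))


def expanded_rewrite(rows: List[Read]) -> List[Read]:
    """Push ec to the table, cleanse only the selected rows, then reapply s."""
    return select(s, cleanse(select(ec, rows)))


R = [
    Read("tag1", 3, "A"),
    Read("tag1", 7, "A"),
    Read("tag1", 11, "A"),  # duplicate of the read at 7, which lies outside s
    Read("tag1", 18, "A"),
    Read("tag1", 19, "B"),
    Read("tag2", 12, "A"),
    Read("tag2", 14, "A"),  # duplicate of the read at 12
    Read("tag2", 25, "A"),
]

reference = cleanse_everything_then_query(R)
rewritten = expanded_rewrite(R)
pushed_s_only = select(s, cleanse(select(s, R)))
```

In this toy data, pushing only `s` below the cleansing step loses the context read at minute 7, so the read at minute 11 survives incorrectly. Widening the pushed-down filter to `ec` keeps that context row available while still cleansing fewer rows than the full table.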
