SYSTEM FOR DISCOVERING DATA ARTIFACTS IN AN ON-LINE DATA OBJECT

  • US 20080147642A1
  • Filed: 03/08/2007
  • Published: 06/19/2008
  • Est. Priority Date: 12/14/2006
  • Status: Abandoned Application
First Claim

1. A system for discovering data artifacts in an on-line data object, the system comprising:

    a data acquisition subsystem configured to parse the on-line data object into at least one string;

    a string pre-parser configured to divide each string into a set of separate characters;

    a lexical analyzer configured, for each set of separate characters, to aggregate the separate characters in that set of separate characters into a sequence of tokens, each token in the sequence of tokens being one of a word, a punctuation symbol, a HyperText-Markup-Language tag, and a number;

    a syntax analyzer configured, for each sequence of tokens during a first analysis phase, to:

    determine, for each of a plurality of rule sets, whether the sequence of tokens includes one or more candidate data artifacts of a distinct type to which that rule set corresponds, each of the plurality of rule sets being adapted to discovery of the distinct type of data artifact to which that rule set corresponds, at least one rule set in the plurality of rule sets including a context-free grammar;

    compute, for each candidate data artifact of a distinct type, a probability ranking indicating a degree of likelihood that the candidate data artifact is a data artifact of that distinct type; and

    classify each candidate data artifact as a data artifact of the distinct type for which a most favorable probability ranking was computed for that candidate data artifact;

    the syntax analyzer being configured to associate with each classified data artifact a subject found within the on-line data object; and

    a storage subsystem including at least one data structure in which to store the classified data artifacts, the storage subsystem being configured to index and organize the classified data artifacts by subject for retrieval in response to a search query indicating a particular subject.
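
The elements recited in claim 1 can be read as a pipeline: acquire strings, split them into characters, aggregate the characters into tokens, score candidate artifacts against several rule sets, keep the best-ranked type, and index the result by subject. The following is a minimal sketch of that flow, not the applicant's implementation. Every name in it (`pre_parse`, `lex`, `RULE_SETS`, `discover`, `search`) is hypothetical, the two rule sets (phone number and date) and their scores are invented for illustration, the subject is assumed to be the text of a `<title>` element, and flat token patterns stand in for the context-free grammar that the claim requires in at least one rule set.

```python
import re
from collections import defaultdict

# Token kinds named in the claim: HTML tag, number, word, punctuation symbol.
TOKEN_RE = re.compile(
    r"(?P<tag><[^>]+>)|(?P<number>\d+)|(?P<word>[A-Za-z]+)|(?P<punct>[^\sA-Za-z\d])"
)

def pre_parse(string):
    """String pre-parser: divide a string into separate characters."""
    return list(string)

def lex(chars):
    """Lexical analyzer: aggregate characters into (kind, text) tokens."""
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer("".join(chars))]

def find_candidates(tokens):
    """Yield 5-token spans shaped number, punct, number, punct, number."""
    kinds = [kind for kind, _ in tokens]
    shape = ["number", "punct", "number", "punct", "number"]
    for i in range(len(tokens) - 4):
        if kinds[i:i + 5] == shape:
            yield tokens[i:i + 5]

def score_phone(span):
    """Hypothetical rule set: heuristic score for NNN-NNN-NNNN."""
    n1, s1, n2, s2, n3 = (text for _, text in span)
    score = 0.5 if s1 == s2 == "-" else 0.0
    if (len(n1), len(n2), len(n3)) == (3, 3, 4):
        score += 0.5
    return score

def score_date(span):
    """Hypothetical rule set: heuristic score for M/D/YYYY."""
    n1, s1, n2, s2, n3 = (text for _, text in span)
    score = 0.5 if s1 == s2 == "/" else 0.0
    if len(n1) <= 2 and len(n2) <= 2 and len(n3) == 4:
        score += 0.5
    return score

# One scoring function per distinct artifact type.
RULE_SETS = {"phone": score_phone, "date": score_date}

def classify(span):
    """Return the artifact type with the most favorable score, and that score."""
    scores = {name: rule(span) for name, rule in RULE_SETS.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]

def find_subject(tokens):
    """Assumption: the subject is the run of words following a <title> tag."""
    words, inside = [], False
    for kind, text in tokens:
        if kind == "tag":
            inside = text.lower() == "<title>"
        elif inside and kind == "word":
            words.append(text)
    return " ".join(words) or "unknown"

def discover(document, index):
    """Run the pipeline on one document and store artifacts by subject."""
    strings = document.splitlines()  # data acquisition: one string per line
    token_seqs = [lex(pre_parse(s)) for s in strings]
    subject = find_subject([tok for seq in token_seqs for tok in seq])
    for tokens in token_seqs:
        for span in find_candidates(tokens):
            kind, score = classify(span)
            if score > 0:  # discard spans that no rule set supports
                index[subject].append((kind, "".join(t for _, t in span)))
    return index

def search(index, subject):
    """Storage lookup: retrieve classified artifacts for a subject."""
    return index.get(subject, [])

page = "<title>Acme Plumbing</title>\n<p>Call 555-123-4567, open since 12/14/2006.</p>"
index = discover(page, defaultdict(list))
print(search(index, "Acme Plumbing"))
# [('phone', '555-123-4567'), ('date', '12/14/2006')]
```

In this sketch both rule sets score every candidate span, so classification is a simple argmax over the scores. A fuller reading of the claim would replace the fixed five-token shape with grammar-driven parsing and the ad hoc scores with a calibrated probability ranking.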
