Method and system for classifying semi-structured documents

  • US 6,606,620 B1
  • Filed: 07/24/2000
  • Issued: 08/12/2003
  • Est. Priority Date: 07/24/2000
  • Status: Expired due to Fees
First Claim

1. A classifier, for use on a computer readable medium, for dynamically classifying a semi-structured document with a schema, comprising:

  • a vectorization module for parsing the document into a structured vector model, wherein the structured vector model is divided into a tree of sub-vectors to reflect a plurality of hierarchical levels beginning with a root and ending with a plurality of leaves;

    a sorting module for searching the document and for counting the occurrences of individual terms in the document;

    the sorting module further accounting for the frequency of occurrence of the terms at each hierarchical level to achieve a high contextual sensitivity;

    a testing module for assigning a class to the document by using a statistical model based on probability calculation to create a classification model; and

    wherein the classification model assigns a class to the document that maximizes the a posteriori class probability Pr[c|d,Fk] according to the following expression:

    \[
    \Pr[c \mid d, F_k] = \frac{\pi(c) \prod_{t \in p_d(i,j),\; t \in d \cap F_k(e_d(i,j))} f(c, p_d(i,j), t)^{n(d, p_d(i,j), t)}}{\sum_{c'} \pi(c') \prod_{t \in p_d(i,j),\; t \in d \cap F_k(e_d(i,j))} f(c', p_d(i,j), t)^{n(d, p_d(i,j), t)}}
    \]

    where d is the document, π(c) is a prior distribution on the class c;

    c′ is a class in a set of documents;

    pd is a path to a structure node ed from a root;

    n is a number of occurrences of term t in pd, f is a maximum likelihood estimation;

    Fk is a set of selected terms;

    F is a Fisher index defined by the following equation, where c1 and c2 are children of an internal class c0, and μ is the average number of occurrences of term t in class c:

    \[
    F(t) = \frac{\sum_{c_1, c_2} \left( \mu(c_1, t) - \mu(c_2, t) \right)^2}{\sum_{c} \frac{1}{|c|} \sum_{d \in c} \left( f(t, d, c) - \mu(c, t) \right)^2}.
    \]
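
The elements of claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation; it only shows one way the claimed quantities might be computed. Several things here are assumptions not stated in the claim: documents are XML parsed with the standard library, terms are whitespace-separated lowercase tokens, f(t,d,c) in the Fisher index is taken to be the raw count of a term at a path, the Fisher index is computed per (path, term) pair rather than per term, and add-one smoothing replaces the plain maximum likelihood estimate so that unseen terms do not zero out the product. The function names `path_term_counts`, `fisher_index`, `train`, and `posterior` are hypothetical.

```python
import math
import xml.etree.ElementTree as ET
from collections import Counter, defaultdict
from itertools import combinations


def path_term_counts(xml_text):
    """n(d, p, t): term counts keyed by (root-to-node path, term)."""
    counts = Counter()

    def walk(node, prefix):
        path = f"{prefix}/{node.tag}"
        for term in (node.text or "").lower().split():
            counts[(path, term)] += 1
        for child in node:
            walk(child, path)

    walk(ET.fromstring(xml_text), "")
    return counts


def fisher_index(docs_by_class, key):
    """Between-class scatter divided by within-class scatter for one key."""
    mu = {c: sum(d[key] for d in docs) / len(docs)
          for c, docs in docs_by_class.items()}
    between = sum((mu[a] - mu[b]) ** 2 for a, b in combinations(mu, 2))
    within = sum(sum((d[key] - mu[c]) ** 2 for d in docs) / len(docs)
                 for c, docs in docs_by_class.items())
    if within > 0:
        return between / within
    return math.inf if between > 0 else 0.0


def train(labeled_docs, k):
    """labeled_docs: list of (class, xml_text). Returns (prior, f, selected)."""
    docs_by_class = defaultdict(list)
    for c, xml_text in labeled_docs:
        docs_by_class[c].append(path_term_counts(xml_text))
    keys = {key for docs in docs_by_class.values() for d in docs for key in d}
    ranked = sorted(keys, key=lambda key: fisher_index(docs_by_class, key),
                    reverse=True)
    selected = set(ranked[:k])  # F_k: the k keys with the highest Fisher index
    prior = {c: len(docs) / len(labeled_docs)
             for c, docs in docs_by_class.items()}
    vocab = Counter(path for path, _ in selected)  # selected terms per path
    f = {}
    for c, docs in docs_by_class.items():
        agg = Counter()
        for d in docs:
            agg.update(d)
        totals = Counter()
        for (path, term), n in agg.items():
            if (path, term) in selected:
                totals[path] += n
        # Assumption: add-one smoothing instead of the raw ML estimate.
        f[c] = {(p, t): (1 + agg[(p, t)]) / (vocab[p] + totals[p])
                for p, t in selected}
    return prior, f, selected


def posterior(xml_text, prior, f, selected):
    """Pr[c | d, F_k], computed in log space and then normalized."""
    counts = path_term_counts(xml_text)
    log_score = {
        c: math.log(prior[c]) + sum(n * math.log(f[c][key])
                                    for key, n in counts.items()
                                    if key in selected)
        for c in prior
    }
    top = max(log_score.values())
    z = sum(math.exp(s - top) for s in log_score.values())
    return {c: math.exp(s - top) / z for c, s in log_score.items()}
```

Keying the counts by path as well as term is what lets the same word contribute differently depending on where in the document tree it occurs. A document would be assigned the class with the largest value returned by `posterior`, for example `max(post, key=post.get)`.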
