Retrieval of Structured Documents

  • US 20090012956A1
  • Filed: 09/16/2008
  • Published: 01/08/2009
  • Est. Priority Date: 01/06/2003
  • Status: Active Grant
First Claim

1. A method, comprising:

  • ranking structured elements within a structured document, the structured document including a document element or root element, at least one section element, and at least one paragraph element, the ranking including:

    for each paragraph element, in which Weight(ti,Pj) stands for the weight of the term ti in the paragraph Pj, “tf(ti,Pj)” is the term frequency of ti in this paragraph, N denotes the number of documents in the corpus, and ni represents the number of documents containing the term ti, calculating the terms' weights according to the calculation:

    $$\mathrm{Weight}(t_i, P_j) = \ln\bigl(1 + \mathrm{tf}(t_i, P_j)\bigr) \times \ln\frac{N}{n_i};$$

    for any section element Ej at the upper levels following a bottom-up fashion, in which “I(ti,Ej)” is the entropy measure of the term ti in element Ej, wherein if Weight(ti,Ej) ≧ average(Ej) + std_dev(Ej), the term ti is selected as an index term of the element Ej and all sub-elements of Ej would eliminate ti from their index term list, where “average(Ej)” denotes the arithmetic average of all terms' weights in the element Ej, and std_dev(Ej) denotes the standard deviation of these weights, calculating term weights using the calculation Weight(ti,Ej) = ln(1 + tf(ti,Ej)) × I(ti,Ej);

    repeating the calculating the term weights using the calculation Weight(ti,Ej) = ln(1 + tf(ti,Ej)) × I(ti,Ej) until the root element (i.e., the document element) is reached;

    obtaining paths for all evaluated candidate elements, and assigning the query terms' weights for the elements to the paths respectively;

    ranking paths, in which ln(N/ni) is the inverse document frequency (IDF) value of query term ti, which represents the query term's weight, and Q is the number of query terms in a query, using the calculation:

    $$\mathrm{Rank}(\mathrm{Path}_p) = \sum_{i=1}^{Q} \mathrm{Weight}(t_i, E_j) \times \ln\frac{N}{n_i};$$



    and returning elements corresponding to the ranked paths in descending order.
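
The calculations recited in claim 1 can be illustrated with a minimal Python sketch. This is an illustration under stated assumptions, not the patented implementation: all function names are hypothetical, the entropy measure I(ti,Ej) is not defined in the claim and is therefore passed in by the caller, std_dev is assumed to be the population standard deviation, and a query term missing from an element is assumed to contribute a weight of zero. The bottom-up elimination of index terms from sub-elements is not shown.

```python
import math
from statistics import mean, pstdev


def paragraph_weight(tf, n_docs, doc_freq):
    """Weight(ti,Pj) = ln(1 + tf(ti,Pj)) * ln(N / ni)."""
    return math.log(1 + tf) * math.log(n_docs / doc_freq)


def section_weight(tf, entropy):
    """Weight(ti,Ej) = ln(1 + tf(ti,Ej)) * I(ti,Ej).

    The claim does not define the entropy measure, so the caller supplies it.
    """
    return math.log(1 + tf) * entropy


def select_index_terms(weights):
    """Return terms whose weight is >= average(Ej) + std_dev(Ej).

    `weights` maps each term in one element to its weight.
    Assumption: population standard deviation (statistics.pstdev).
    """
    values = list(weights.values())
    threshold = mean(values) + pstdev(values)
    return {term for term, w in weights.items() if w >= threshold}


def rank_path(query_terms, element_weights, n_docs, doc_freq):
    """Rank(Path_p) = sum over query terms of Weight(ti,Ej) * ln(N / ni).

    Assumption: terms absent from the element or the corpus contribute zero.
    """
    return sum(
        element_weights.get(term, 0.0) * math.log(n_docs / doc_freq[term])
        for term in query_terms
        if term in doc_freq
    )


def rank_paths(query_terms, path_weights, n_docs, doc_freq):
    """Score every candidate path and return (path, score) pairs, best first."""
    scored = [
        (path, rank_path(query_terms, weights, n_docs, doc_freq))
        for path, weights in path_weights.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical corpus statistics and element paths, for illustration only.
    doc_freq = {"retrieval": 10, "xml": 50}
    path_weights = {
        "/doc/sec[1]": {"retrieval": 1.0},
        "/doc/sec[2]/p[1]": {"retrieval": 3.0, "xml": 0.5},
    }
    for path, score in rank_paths(["retrieval", "xml"], path_weights, 100, doc_freq):
        print(path, round(score, 3))
```

Keeping the entropy measure as an argument leaves the sketch neutral about how I(ti,Ej) is computed, since the claim recites only its role in the section-level weight.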
