Automatic learning and extending evolution handling method for Chinese basic block descriptive rule

  • CN 101,021,842 A
  • Filed: 03/09/2007
  • Published: 08/22/2007
  • Est. Priority Date: 03/09/2007
  • Status: Active Application
First Claim

1. A method for the automatic learning and extending evolution of description rules for Chinese fundamental blocks (basic blocks), characterized in that the method comprises the following steps in order:

  • (1) Computer initialization. (1.1) Build a language knowledge base comprising a fundamental block tagged corpus and a lexical knowledge base, wherein:

    I. The fundamental block tagged corpus annotates sentences of real Chinese text with words, parts of speech, and fundamental block descriptors, wherein:

    The total number of sentences is denoted T;

    An annotated sentence is $S = W + BC$, where $W = \{\langle w_i, t_i \rangle\}$, $w_i$ is the $i$-th word in the sentence, $t_i$ is the POS tag of the $i$-th word, $i \in [1, n]$, and $n$ is the total number of words in the sentence;

    $BC = \{bc_j\}$, where $bc_j$ is the $j$-th fundamental block in the sentence, $j \in [1, bcs]$, and $bcs$ is the total number of fundamental blocks in the sentence;

    Fundamental blocks are divided into single-word fundamental blocks, consisting of one word, and multi-word fundamental blocks, consisting of two or more words;
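As an illustration, an annotated sentence S = W + BC can be held in a small data structure. The following is a minimal Python sketch under stated assumptions: the class names `Block` and `Sentence`, the 0-based boundary convention, the `np-ZX` style of tag string, and the sample sentence are hypothetical and are not taken from the claim.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Block:
    """One fundamental block bc_j (hypothetical representation)."""
    left: int   # position before the first word of the block (assumed 0-based)
    right: int  # position after the last word of the block
    tag: str    # assumed "syntactic marker-relation mark" string, e.g. "np-ZX"

    @property
    def is_multiword(self) -> bool:
        # A multi-word fundamental block spans two or more words.
        return self.right - self.left >= 2


@dataclass
class Sentence:
    words: List[Tuple[str, str]]  # W: (w_i, t_i) pairs
    blocks: List[Block]           # BC: the fundamental blocks


# Hypothetical sample sentence with four words and three blocks.
sent = Sentence(
    words=[("我们", "r"), ("学习", "v"), ("自然", "n"), ("语言", "n")],
    blocks=[Block(0, 1, "np-SG"), Block(1, 2, "vp-SG"), Block(2, 4, "np-ZX")],
)
multiword = [b for b in sent.blocks if b.is_multiword]
```

Only the last block spans two words, so it is the only multi-word fundamental block in this sample.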

    II. The lexical knowledge base stores various lexical descriptors and comprises the following:

    The lexical association knowledge base contains description pairs for the syntactic relations formed between commonly used Chinese content words; its basic data format is:

    <word 1> <word 2> <part of speech 1> <part of speech 2> <syntactic relation mark>;

    The feature verb lists contain information, extracted from a syntactic information dictionary, on verbs that can take different types of objects; the basic data format is {<verb entry>}, and the entries are organized into separate verb lists by object type;

    The noun semantic information table contains 11 semantic classes for the main Chinese nouns: organization, person, artifact, natural object, information, spirit, event, attribute, quantity, time, and space; its basic data format is:

    <noun entry> <semantic class mark>;

    (1.2) Define the rule description state space and the fundamental block description rules, wherein:

    The rule description state space is defined as follows: for a specific word combination, the following description examples are extracted automatically from the annotated corpus sentences:

    $\langle w_{i-1}\ t_{i-1} \rangle \mid \langle w_i\ t_i \rangle \ldots \langle w_j\ t_j \rangle \mid \langle w_{j+1}\ t_{j+1} \rangle \rightarrow [1|0]$

    where $\langle w_i, t_i \rangle$ denotes the $i$-th word $w_i$ in the sentence and its POS tag $t_i$, $[i, j]$ is the interval of the word combination that satisfies the specified conditions, $w_{i-1}$ is its left-adjacent word, and $w_{j+1}$ is its right-adjacent word;

    "→ 1" indicates that the word combination forms a fundamental block in this context, i.e. it is a positive example; in that case the corresponding fundamental block mark is also given, in the form syntactic marker + relation mark;

    "→ 0" indicates that the word combination cannot form a fundamental block in this context, i.e. it is a counter-example;

    Together, these description examples form the rule description state space for this particular word combination;

    All positive examples in the state space form the positive example set, whose size is the positive example frequency;

    All counter-examples form the counter-example set, whose size is the counter-example frequency;

    For the state space above, a fundamental block description rule is defined, with the basic form:

    <structural combination> <reduction mark> <confidence>, wherein:

    The structural combination describes the internal combination structure of each fundamental block and has two levels, according to the descriptive power of the rule:

    a) a primitive rule, whose structural combination is described by a POS tag string;

    b) an extension rule, which adds lexical constraints and context restrictions to form a structural combination description with greater descriptive power;

    The reduction mark comprises two parts, a syntactic marker and a relation mark, and describes the basic syntactic information of the fundamental block;

    The confidence gives the expected reliability of applying the rule and is computed as $\theta = fp / (fp + fn)$, where $fp$ is the positive example frequency covered by the rule's state space and $fn$ is the counter-example frequency covered by the rule's state space;

    (1.3) Set up the following data structures:

    the constituent sequence stack ChkStack[], the primitive rule table BasRules[], the extension rule table ExpRules[], the state space description table ZTList[], the positive/counter-example annotated sentence table ExamSents[], and the extension processing queue EPList[], wherein:

    I. The constituent sequence stack ChkStack[] stores all annotation information extracted from a sentence of the fundamental block tagged corpus, including constituents such as words, single-word fundamental blocks, and multi-word fundamental blocks, forming a linear constituent annotation sequence for the sentence; each stack record contains the following information:

    <constituent flag> <constituent left boundary> <constituent right boundary> <syntactic marker> <relation mark>,

    giving the following basic record format:

    [cflag, cl, cr, cctag, crtag], wherein:

    The constituent flag cflag uses the following characters to denote the constituent category: W, word; B, single-word fundamental block; P, multi-word fundamental block;

    The constituent left boundary cl is the left boundary position of the constituent in the sentence, $cl \in [0, n-1]$;

    The constituent right boundary cr is the right boundary position of the constituent in the sentence, $cr \in [1, n]$;

    The syntactic marker cctag represents the external syntactic function of the constituent. For a word constituent it stores the POS tag, which is one of:

    n (noun), s (place word), t (time word), f (locality noun), r (pronoun), vM (auxiliary verb), v (verb), a (adjective), d (adverb), m (numeral), q (measure word), p (preposition), u (particle), c (conjunction), y (modal particle), e (interjection), w (punctuation);

    For a fundamental block constituent it stores the syntactic marker, which is one of:

    np (noun block), vp (verb block), sp (space block), tp (time block), mp (quantity block), ap (adjective block), dp (adverb block);

    The relation mark crtag represents the internal grammatical relation of the constituent. For a word constituent it stores the word itself;

    For a fundamental block constituent it stores the relation mark, which is one of:

    ZX (right-corner-centered structure), LN (chain relation structure), LH (coordinate relation), PO (predicate-object relation), SB (predicate-complement relation), AD (additional relation), AM (ambiguous interval), SG (single-word block), wherein:

    In the right-corner-centered structure, all words in the fundamental block depend directly on the head word at the right corner, forming a right-headed dependency structure; the basic pattern is $A_1 \ldots A_n H$, with dependencies $A_1 \rightarrow H, \ldots, A_n \rightarrow H$, where $H$ is the syntactic-semantic head word of the whole fundamental block and $A_1, \ldots, A_n$ are modifiers;

    In the chain relation structure, each word in the fundamental block depends in turn on its immediate right neighbor, forming a multi-head dependency chain arranged from left to right; the basic pattern is $H_0 H_1 \ldots H_n$, with dependencies $H_0 \rightarrow H_1, \ldots, H_{n-1} \rightarrow H_n$, where $H_i$, $i \in [1, n-1]$, are semantic aggregation points at different levels and $H_n$ is the syntactic-semantic head word of the whole fundamental block;

    In the coordinate relation, the words in the fundamental block form a coordinate structure;

    In the predicate-object relation, two words in the fundamental block form a predicate-object structure;

    In the predicate-complement relation, two words in the fundamental block form a predicate-complement structure;

    In the additional relation, two words in the fundamental block form an additional structure;

    II. The primitive rule table BasRules[] stores all primitive rules described by POS tag strings; its basic record format is:

    [r_stru, r_tag, fp, fn], where r_stru is the rule's structural combination, r_tag is the reduction mark, fp is the positive example frequency, and fn is the counter-example frequency;

    III. The extension rule table ExpRules[] stores all extension rules described with added lexical constraints and context restrictions; its basic record format is:

    [r_stru, r_tag, fp, fn, pelist, nelist], where r_stru, r_tag, fp, and fn are defined as in BasRules[], pelist is the index table of all positive examples covered by the rule's state space, and nelist is the index table of all counter-examples covered by the rule's state space;

    IV. The state space description table ZTList[] stores the data for each rule description state space; its basic format is [SentID, LWP, RWP, EF, r_tag], wherein:

    The sentence number SentID is a unique ID assigned to each annotated sentence in which a rule description example occurs;

    The left boundary position LWP is the position of the left boundary word of the description example in the annotated sentence;

    The right boundary position RWP is the position of the right boundary word of the description example in the annotated sentence;

    The example flag EF gives the class of the description example: 1 for a positive example, 0 for a counter-example;

    The reduction mark r_tag stores the syntactic marker and relation mark of the description example; for a counter-example it is NULL;

    V. The positive/counter-example annotated sentence table ExamSents[] stores all annotated sentences in which the description examples covered by each state space occur; its basic format is:

    <sentence number SentID, annotated sentence string S>;

    VI. The extension processing queue EPList[] stores the structural combination of each rule to be extended together with its state space information; its basic format is:

    <structural combination string r_stru of the rule to be extended, state space indexes ZTIndexs>, where each index value points to one record of the state space description table ZTList[];
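To make the relationship between these tables concrete, the following minimal Python sketch models the records of BasRules[], ExpRules[], and ZTList[] and shows how state space indexes can be split into pelist and nelist. The class names, the use of `None` for NULL, and all sample values are assumptions made for illustration only.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class BasicRule:                 # one BasRules[] record
    r_stru: str                  # structural combination, e.g. a POS tag string
    r_tag: str                   # reduction mark
    fp: int = 0                  # positive example frequency
    fn: int = 0                  # counter-example frequency


@dataclass
class ExtendedRule(BasicRule):   # one ExpRules[] record
    pelist: List[int] = field(default_factory=list)  # ZTList indexes, positive
    nelist: List[int] = field(default_factory=list)  # ZTList indexes, counter


@dataclass
class StateRecord:               # one ZTList[] record
    sent_id: int                 # SentID
    lwp: int                     # LWP
    rwp: int                     # RWP
    ef: int                      # EF: 1 positive example, 0 counter-example
    r_tag: Optional[str]         # None stands in for NULL on counter-examples


# Hypothetical sample data.
zt_list = [StateRecord(0, 2, 3, 1, "np-ZX"), StateRecord(1, 0, 1, 0, None)]
ep_list = [("n n", [0, 1])]      # EPList[] record: (r_stru, ZTIndexs)

rule = ExtendedRule("n n", "np-ZX")
for idx in ep_list[0][1]:
    record = zt_list[idx]
    (rule.pelist if record.ef == 1 else rule.nelist).append(idx)
rule.fp, rule.fn = len(rule.pelist), len(rule.nelist)
```

Because each extension rule keeps only index lists, several rules can share the same ZTList[] records without copying them.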

    (1.4) Load the following basic processing modules:

    (1.4.1) The rule reliability decision module, which uses different confidence and positive example frequency thresholds to divide all automatically acquired rules into four grades by reliability: highly reliable, moderately reliable, low-reliability, and unreliable; its steps are as follows:

    First step: input the rule's positive and counter-example frequencies fp and fn, and compute the rule confidence $\theta = fp / (fp + fn)$;

    Second step: according to the positive example frequency fp and the confidence θ, perform the following reliability classification and return the corresponding evaluation value:

    If one of the following conditions is met, the rule is highly reliable and 1 is returned:

    (fp >= 10) && (θ >= 0.85), or ((fp >= 5) && (fp < 10)) && (θ >= 0.9), or ((fp >= 2) && (fp < 5)) && (θ >= 0.95);

    If one of the following conditions is met, the rule is moderately reliable and 2 is returned:

    (fp >= 10) && (θ >= 0.5), or ((fp >= 5) && (fp < 10)) && (θ >= 0.55), or ((fp >= 2) && (fp < 5)) && (θ >= 0.6), or (fp > 0) && (θ >= 0.6);

    If one of the following conditions is met, the rule is a low-reliability rule and 3 is returned:

    (fp >= 10) && (θ >= 0.1), or ((fp >= 5) && (fp < 10)) && (θ >= 0.2), or ((fp >= 2) && (fp < 5)) && (θ >= 0.3), or (fp > 0) && (θ >= 0.3);

    In all other cases the rule is unreliable and 4 is returned;
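The grading above can be sketched as a small Python function. This is an illustrative reading rather than the claimed implementation: it assumes that the comparisons lost in the garbled conditions are all on the confidence θ, that the grades are tested in the order 1, 2, 3, and that a rule with no examples at all is unreliable. The function names are hypothetical.

```python
def confidence(fp: int, fn: int) -> float:
    """theta = fp / (fp + fn); returns 0.0 when there are no examples (assumed)."""
    total = fp + fn
    return fp / total if total else 0.0


def reliability(fp: int, fn: int) -> int:
    """Return 1 (high), 2 (moderate), 3 (low) or 4 (unreliable)."""
    theta = confidence(fp, fn)
    # Confidence thresholds per frequency band: (high, moderate, low).
    if fp >= 10:
        bands = (0.85, 0.5, 0.1)
    elif fp >= 5:
        bands = (0.9, 0.55, 0.2)
    elif fp >= 2:
        bands = (0.95, 0.6, 0.3)
    elif fp > 0:
        bands = (None, 0.6, 0.3)  # fp == 1 can never be highly reliable
    else:
        return 4
    for grade, threshold in enumerate(bands, start=1):
        if threshold is not None and theta >= threshold:
            return grade
    return 4


grades = [reliability(10, 1), reliability(6, 1), reliability(12, 88), reliability(2, 8)]
```

The catch-all clauses `(fp > 0) && (θ >= 0.6)` and `(fp > 0) && (θ >= 0.3)` only add the case fp = 1, since for larger fp the band-specific thresholds are already no stricter.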

    This reliability analysis function is used to classify and summarize the primitive rule table and the extension rule table, producing the following intermediate data files:

    Primitive rule data files: all primitive rules that have undergone positive/counter-example training are saved, by reliability grade, in four data files whose format is that of the primitive rule table BasRules[];

    Rules-to-be-extended data file: from the set of all primitive rules that have undergone positive/counter-example training, all extendable primitive rules are selected and saved in the rules-to-be-extended data file, whose format is that of the primitive rule table BasRules[]; in addition, a state space data file and an annotated sentence data file are generated for each rule to be extended, forming the initial data set for rule evolution learning;

    Extension rule data files: for each rule to be extended, the extension rules of different reliability obtained during extending evolution are saved in four data files whose format is that of the extension rule table ExpRules[];

    (1.4.2) The rule structural combination extension module, which carries out the following steps in order:

    First, determine whether a rule can be extended, using the following conditions:

    If the rule is highly reliable, it need not be extended;

    If the positive example frequency covered by the rule is less than Th, where Th = 6, it cannot be extended;

    If the rule already uses all internal lexical constraints and external context restrictions, it cannot be extended;

    Second, for each rule description example, extend the information of the word interval <L, R> according to the rule's existing structural combination description r_stru, obtaining the structural combination description strings of NRS new extension rules; the specific steps are as follows:

    First step: check the rule's existing structural combination description r_stru; if it is a primitive rule, carry out the "lexical constraint + context restriction" extension in sequence; if it already contains lexical constraint information, only the context restriction extension is needed;

    Second step: using the lexical knowledge base, examine in order the lexical constraints inside the word interval <L, R>:

    If lexical association pair information exists, generate a structural combination description string containing the lexical association pair constraint;

    If feature verb list information exists, generate a structural combination description string containing the feature verb constraint;

    If noun semantic class information exists, generate a structural combination description string containing the noun semantic class constraint;

    If POS tags of specific function words such as adverbs, prepositions, and locality nouns occur in the interval, generate a structural combination description string containing the corresponding word constraint;

    Third step: for each rule described by a basic POS tag string, or each rule already extended with lexical constraints, consider the following three combination patterns: the left-adjacent POS tag, the right-adjacent POS tag, and both the left- and right-adjacent POS tags, forming three rule descriptions with added context restrictions;
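The context restriction part of this module (the third step) can be illustrated with a short Python sketch. It covers only the three context patterns and leaves out the lexical constraint step. The function name, the `|` separator borrowed from the description example notation, and the `BOS`/`EOS` sentence-boundary placeholders are assumptions for illustration.

```python
from typing import List


def context_extensions(tags: List[str], left: int, right: int, r_stru: str) -> List[str]:
    """Return three context-restricted descriptions for one example.

    tags: POS tags of the sentence; [left, right] is the inclusive,
    0-based word interval covered by r_stru (assumed convention).
    """
    left_tag = tags[left - 1] if left > 0 else "BOS"
    right_tag = tags[right + 1] if right + 1 < len(tags) else "EOS"
    return [
        f"{left_tag} | {r_stru}",                 # left-adjacent POS tag
        f"{r_stru} | {right_tag}",                # right-adjacent POS tag
        f"{left_tag} | {r_stru} | {right_tag}",   # both neighbours
    ]


# Hypothetical tagged sentence: pronoun, verb, noun, noun, particle.
extended = context_extensions(["r", "v", "n", "n", "u"], 2, 3, "n n")
```

Each example in the state space yields its own three strings, so examples with different neighbours end up under different extension rules.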

    (1.4.3) The state space dynamic division module, which is realized through the interaction of the extension rule table, the extension processing queue, the state space description table, and the positive/counter-example annotated sentence table. The state space description table and the positive/counter-example annotated sentence table together form the complete state space description of a rule to be extended; the state space indexes stored in the extension processing queue dynamically link the different state spaces to each extension rule; and the positive/counter-example index tables stored inside each extension rule realize the dynamic division of the complete state space covered by the rule to be extended;

    The specific steps are as follows:

    First step: take one record to be extended, [r_stru, ZTIndexs], from the extension processing queue;

    Second step: obtain the total number of index entries EISum in the rule's state space index table ZTIndexs;

    Third step: process each of these records of the state space description table in order. Using its state space index, obtain the information of the rule description example, [SentID, LWP, RWP, EF, r_tag]; using SentID, retrieve the corresponding annotated sentence string from the positive/counter-example annotated sentence table ExamSents[]; locate the exact left and right boundary positions of the description example in the sentence, forming the word interval to be extended, <LWP, RWP>;

    Fourth step: using the lexical knowledge base and the existing rule description r_tag, call the rule structural combination extension module of step (1.4.2) to extend the information of the interval <LWP, RWP>, obtaining NRS new extension rule structural combination description strings;

    Fifth step: add each new extension rule description string in turn to the extension rule table ExpRules[] and obtain the subscript ERLid of the corresponding entry; according to the example flag EF of the current example, add the current state space index ZTIndexs[k] to the corresponding positive or counter-example index table of ExpRules[ERLid];

    In addition, the following parameter and basic function are defined:

    The extension learning start threshold Th: extending evolution learning is started only when the positive example frequency of a rule is greater than or equal to this value; Th = 6 at present;

    min: the minimum function; min(x, y) returns the smaller of x and y;

    (2) Extract the positive example descriptions of the primitive rules, as follows:

    (2.1) Initialize i = 0;

    (2.2) Initialize the constituent sequence stack ChkStack[];

    (2.3) Read the i-th annotated sentence from the tagged corpus, obtain its information, and store it in ChkStack[];

    (2.4) Initialize j = 0;

    (2.5) Obtain, in order, the annotation information of the j-th fundamental block in ChkStack[]: [cflag, cl, cr, cctag, crtag];

    (2.6) If this fundamental block is not a multi-word fundamental block, i.e. cflag ≠ 'P', go to (2.9);

    (2.7) Obtain the primitive rule information from it: the structural combination string $r\_stru = t_{cl} t_{cl+1} \ldots t_{cr}$ and the reduction mark r_tag = cctag + crtag;

    (2.8) Add the corresponding primitive rule record [r_stru, r_tag, 1, 0] to the primitive rule table, and accumulate the positive example frequency of identical structural combination strings;

    (2.9) If $j < bcs_i$, set j = j + 1 and repeat steps (2.5)-(2.8);

    (2.10) If i < T, set i = i + 1 and repeat steps (2.2)-(2.9);

    (2.11) Output the resulting primitive rule table, described as <structural combination> + <reduction mark> + <positive example frequency>, and terminate;
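Step (2) amounts to counting the POS tag strings of all multi-word fundamental blocks. The Python sketch below is one possible reading under stated assumptions: boundaries are treated as 0-based gap positions so that `tags[cl:cr]` covers the block, rules are keyed by the pair (r_stru, r_tag), the reduction mark is joined with a hyphen, and the sample corpus is invented.

```python
from collections import Counter
from typing import List, Tuple

# One ChkStack[] record: (cflag, cl, cr, cctag, crtag)
Record = Tuple[str, int, int, str, str]


def extract_basic_rules(corpus: List[Tuple[List[str], List[Record]]]) -> Counter:
    """Count positive examples per (r_stru, r_tag) over the whole corpus.

    corpus: one (pos_tags, chk_stack) pair per annotated sentence.
    """
    rules: Counter = Counter()
    for tags, chk_stack in corpus:
        for cflag, cl, cr, cctag, crtag in chk_stack:
            if cflag != "P":          # step (2.6): skip non multi-word blocks
                continue
            r_stru = " ".join(tags[cl:cr])   # step (2.7): POS tag string
            r_tag = f"{cctag}-{crtag}"       # assumed joining convention
            rules[(r_stru, r_tag)] += 1      # step (2.8): accumulate fp
    return rules


# Hypothetical two-sentence corpus.
corpus = [
    (["r", "v", "n", "n"],
     [("B", 0, 1, "np", "SG"), ("B", 1, 2, "vp", "SG"), ("P", 2, 4, "np", "ZX")]),
    (["n", "n", "v"],
     [("P", 0, 2, "np", "ZX"), ("B", 2, 3, "vp", "SG")]),
]
bas_rules = extract_basic_rules(corpus)
```

Here both sentences contribute one `n n` noun block, so the single resulting rule has a positive example frequency of 2.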

    (3) Carry out the positive/counter-example training of the primitive rules, as follows:

    (3.1) Read in the primitive rule table BasRules[] generated in step (2), and initialize i = 0;

    (3.2) Read the i-th annotated sentence from the fundamental block tagged corpus and obtain its total number of words $n_i$;

    (3.3) Scan the whole sentence from left to right; starting from each word in the sentence, form all possible word intervals <j, k> of length 2 to 6 and obtain the POS tag string $t_j t_{j+1} \ldots t_k$ of each interval; if this POS tag string occurs in the primitive rule table, add 1 to the total positive/counter-example frequency of the corresponding rule;

    (3.4) If i < T, set i = i + 1 and repeat steps (3.2)-(3.3);

    (3.5) Using the reliability analysis function of step (1.4.1), classify and summarize all primitive rules that have undergone positive/counter-example training, and save them in four primitive rule data files;

    (3.6) Extract all extendable primitive rules, save them in the rules-to-be-extended data file, and terminate;
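The interval scan of step (3.3) can be sketched in Python as follows. The function name and sample data are hypothetical, and deriving the counter-example frequency as the total frequency minus the positive example frequency from step (2) is an inference from the text, not something the claim states explicitly.

```python
from typing import Dict, List


def count_total_frequency(corpus_tags: List[List[str]], rule_strings: List[str],
                          min_len: int = 2, max_len: int = 6) -> Dict[str, int]:
    """Count every occurrence of each rule's POS tag string in the corpus."""
    totals = {s: 0 for s in rule_strings}
    for tags in corpus_tags:
        n = len(tags)
        for j in range(n):
            # k is the exclusive end, so interval lengths run from min_len
            # to max_len, clipped at the end of the sentence with min().
            for k in range(j + min_len, min(j + max_len, n) + 1):
                s = " ".join(tags[j:k])
                if s in totals:
                    totals[s] += 1
    return totals


# Hypothetical tagged sentences and rule strings.
corpus_tags = [["r", "v", "n", "n"], ["n", "n", "v"], ["a", "n", "n", "n"]]
totals = count_total_frequency(corpus_tags, ["n n", "n n n"])

fp = {"n n": 2, "n n n": 1}                   # assumed output of step (2)
fn = {s: totals[s] - fp[s] for s in totals}   # inferred counter-example counts
```

In the third sentence `n n` occurs twice inside a longer noun sequence, which is how a frequent tag string picks up counter-examples.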

    (4) Generate the state space description data of the rules to be extended, as follows:

    (4.1) Read in the table of rules to be extended from the rules-to-be-extended data file generated in step (3), and initialize i = 0;

    (4.2) Read the i-th annotated sentence S from the fundamental block tagged corpus and obtain its total number of words $n_i$;

    (4.3) Initialize the counter of description examples of rules to be extended found in the sentence, IsSent = 0;

    (4.4) Scan the whole sentence from left to right; starting from each word in the sentence, form all possible word intervals <j, k> of length 2 to 6 and obtain the POS tag string $t_j t_{j+1} \ldots t_k$ of each interval;

    (4.5) If this POS tag string occurs in the table of rules to be extended, generate the sentence number SentID of this annotated sentence, determine the example flag EF and the reduction mark r_tag from the annotation state of the interval in the sentence, generate a state space description record [SentID, j, k, EF, r_tag], save it in the corresponding state space data file, and set IsSent = IsSent + 1;

    (4.6) If no description example of a rule to be extended is found in the sentence, i.e. IsSent = 0, go to (4.8);

    (4.7) Generate a positive/counter-example annotated sentence record [SentID, S] and save it in the corresponding annotated sentence data file;

    (4.8) If i < T, set i = i + 1 and repeat steps (4.2)-(4.7); otherwise terminate;
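Step (4) can be sketched in Python as a scan that emits one state space record per matching interval. The sketch makes several assumptions that are not in the claim: SentID is the sentence's position in the corpus, intervals use an exclusive right boundary, the annotated multi-word blocks are given as a dictionary from (left, right) to reduction mark, and in-memory dictionaries stand in for the data files.

```python
from typing import Dict, List, Optional, Set, Tuple

# [SentID, LWP, RWP, EF, r_tag]; None stands in for NULL.
StateRecord = Tuple[int, int, int, int, Optional[str]]
AnnotatedSentence = Tuple[str, List[str], Dict[Tuple[int, int], str]]


def build_state_space(corpus: List[AnnotatedSentence], pending: Set[str],
                      max_len: int = 6):
    zt: Dict[str, List[StateRecord]] = {s: [] for s in pending}
    exam_sents: Dict[int, str] = {}
    for sent_id, (text, tags, blocks) in enumerate(corpus):
        found = 0                                  # IsSent counter, step (4.3)
        n = len(tags)
        for j in range(n):
            for k in range(j + 2, min(j + max_len, n) + 1):
                s = " ".join(tags[j:k])
                if s not in pending:
                    continue
                r_tag = blocks.get((j, k))         # annotated as a block here?
                ef = 1 if r_tag is not None else 0
                zt[s].append((sent_id, j, k, ef, r_tag))
                found += 1
        if found:                                  # steps (4.6)-(4.7)
            exam_sents[sent_id] = text
    return zt, exam_sents


# Hypothetical corpus: (sentence text, POS tags, annotated multi-word blocks).
corpus = [
    ("s0", ["r", "v", "n", "n"], {(2, 4): "np-ZX"}),
    ("s1", ["a", "n", "n", "n"], {(1, 4): "np-ZX"}),
    ("s2", ["v", "u"], {}),
]
zt, exam_sents = build_state_space(corpus, {"n n"})
```

The second sentence produces two counter-examples for `n n`, because the annotated block there is the longer `n n n` span; the third sentence has no matching interval and is therefore not stored.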

    (5) Carry out the extending evolution learning of the primitive rules, as follows:

    (5.1) Read in the table of rules to be extended from the rules-to-be-extended data file generated in step (3), obtain the total number of rules to be extended WERSum, and initialize r = 0;

    (5.2) Obtain the structural combination $r\_stru_r$ of the r-th rule to be extended, and select the corresponding positive/counter-example annotated sentence data file and state space data file;

    (5.3) Read the state space description table ZTList[] from the state space data file, read the positive/counter-example annotated sentence table ExamSents[] from the annotated sentence data file, and build the state space index ZTIndexs;

    (5.4) Generate a new record [$r\_stru_r$, ZTIndexs] and add it to the extension processing queue;

    (5.5) Initialize the extension rule table;

    (5.6) Call the state space dynamic division module of step (1.4.3) to carry out rule extending evolution learning;

    (5.7) Carry out reliability analysis and data saving for the newly obtained extension rules, as follows:

    (5.7.1) Obtain the total number of newly obtained extension rules ExpRSum;

    (5.7.2) Initialize the extension rule table subscript control variable k = 0;

    (5.7.3) Obtain the k-th extension rule: [r_stru, r_tag, fp, fn, pelist, nelist];

    (5.7.4) Carry out reliability analysis on its positive and counter-example frequencies fp and fn to obtain its reliability grade;

    (5.7.5) Save it in one of the four extension rule data files according to its reliability grade;

    (5.7.6) If the rule can be extended further, generate a new record [r_stru, pelist + nelist] and add it to the extension processing queue;

    (5.7.7) If k < ExpRSum, set k = k + 1 and repeat steps (5.7.3)-(5.7.6); otherwise terminate;

    (5.8) If the extension processing queue is not empty, go to (5.5);

    (5.9) If r < WERSum, set r = r + 1 and repeat steps (5.2)-(5.8); otherwise terminate.
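The queue-driven loop of steps (5.4) to (5.8) for a single rule to be extended can be sketched in Python as follows. This is a simplified illustration, not the claimed procedure: `expand` is a hypothetical stand-in for module (1.4.2) and is assumed to return an empty list once a rule has used all available constraints (otherwise the loop would not terminate), `is_highly_reliable` stands in for module (1.4.1), and file output is replaced by a returned dictionary.

```python
from collections import deque
from typing import Callable, Dict, List, Tuple

Example = Tuple[int, int, int, int]            # (SentID, LWP, RWP, EF)
Expand = Callable[[str, Example], List[str]]   # stand-in for module (1.4.2)
IndexLists = Tuple[List[int], List[int]]       # (pelist, nelist)

TH = 6  # extension learning start threshold from the claim


def evolve(r_stru: str, zt_list: List[Example], expand: Expand,
           is_highly_reliable: Callable[[int, int], bool]) -> Dict[str, IndexLists]:
    learned: Dict[str, IndexLists] = {}
    queue = deque([(r_stru, list(range(len(zt_list))))])      # step (5.4)
    while queue:                                              # step (5.8)
        stru, indexes = queue.popleft()
        exp_rules: Dict[str, IndexLists] = {}                 # step (5.5)
        for idx in indexes:                                   # step (5.6)
            for new_stru in expand(stru, zt_list[idx]):
                pelist, nelist = exp_rules.setdefault(new_stru, ([], []))
                (pelist if zt_list[idx][3] == 1 else nelist).append(idx)
        for new_stru, (pelist, nelist) in exp_rules.items():  # step (5.7)
            learned[new_stru] = (pelist, nelist)
            fp, fn = len(pelist), len(nelist)
            if fp >= TH and not is_highly_reliable(fp, fn):   # step (5.7.6)
                queue.append((new_stru, pelist + nelist))
    return learned


# Hypothetical data: ten examples of "n n"; the left neighbour is a verb for
# sentences 0-7 and a preposition for sentences 8-9.
left_tag = {i: "v" for i in range(8)}
left_tag.update({8: "p", 9: "p"})
zt_list = [(i, 2, 4, 1 if i < 6 else 0) for i in range(10)]


def expand(stru: str, example: Example) -> List[str]:
    # Adds a left-context restriction once; afterwards nothing is left to add.
    return [] if "|" in stru else [f"{left_tag[example[0]]} | {stru}"]


learned = evolve("n n", zt_list, expand, lambda fp, fn: fp / (fp + fn) >= 0.9)
```

In this run the rule `v | n n` is re-queued because it has six positive examples and a confidence of 0.75, but the second pass finds nothing further to add, so the loop ends.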
