Method of retrieving no word separation text data and a data retrieving apparatus therefor
Abstract
Full text data is divided into words to generate word separation data. All character strings are extracted from the full text data, each character string including N characters. The word separation data and position data are attached to each character string to generate index data. In word retrieving, character and segmentation agreement between query data and all character strings is checked. Word retrieving and/or character string retrieving are effected according to a selection command. The word separation data may include leading or trailing end data. In the word retrieving mode, the leading end of the first character and the trailing end of the last character are checked but the intermediate portion is not checked. Continuity of retrieved character strings is checked with reference to position data thereof. The word retrieving mode includes a number of modes including the completion agreement mode. A non-target word in retrieving is detected according to a word class and the word separation data is not attached to the non-target word. The word separation data is not attached to words of the affix. Sets of full text data are retrieved and the matching degrees are detected and the sets of full text data are ordered to provide various text agreement. The matching degree is also calculated with an operator.
25 Citations
40 Claims
-
1. A method of retrieving first and second candidate data in full text data including no word separation data, comprising the steps of:
-
(a) dividing said full text data into words and thereby generating word separation data;
(b) generating and storing index data including the steps of;
(c) extracting all character strings from said full text data, each character string including N characters, N being a natural number; and
(d) attaching said word separation data and character position data of each of said character strings to each of said character strings to generate said index data;
(e) inputting query data with segmentation indicative of leading and trailing ends of said query data;
(f) detecting agreement in word retrieving, said step (f) including steps of;
(g) collating said query data with each of said character strings in said index data to detect character agreement;
(h) collating said segmentation of said query data with said word separation data of each of said character strings to detect segmentation agreement;
(i) outputting said character position data of one of character strings showing said character agreement and said segmentation agreement; and
(j) detecting agreement in character string retrieving, said step (j) including steps of;
(k) collating said query data with each of said N characters in said index data; and
(l) outputting said character position data of one of said character strings showing only said character agreement, wherein either of said step (f) or step (j) is effected in accordance with a selection command and said index data is commonly used in the steps (f) and (j).
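As an illustration only, the steps of claim 1 can be sketched in Python. The sketch assumes the text has already been divided into words by some external word divider (not shown), uses N = 2, and handles only queries of exactly N characters. The names `build_index` and `retrieve`, and the posting layout `(position, leading, trailing)`, are hypothetical and do not come from the patent. String retrieval here returns every position with character agreement.

```python
from collections import defaultdict

def build_index(words, n=2):
    """Build an N-gram index from already segmented text (steps (a) to (d)).

    Each posting is (position, leading, trailing): the position of the
    N-gram's first character, whether a word starts at that character,
    and whether a word ends at the N-gram's last character.
    """
    text = "".join(words)
    starts, ends, pos = set(), set(), 0
    for w in words:
        starts.add(pos)          # a word begins at this character
        pos += len(w)
        ends.add(pos - 1)        # a word ends at this character
    index = defaultdict(list)
    for i in range(len(text) - n + 1):
        index[text[i:i + n]].append((i, i in starts, (i + n - 1) in ends))
    return index

def retrieve(index, query, mode):
    """Look up an N-character query using the same index in either mode."""
    postings = index.get(query, [])
    if mode == "string":         # character agreement only
        return [p for p, _, _ in postings]
    if mode == "word":           # character and segmentation agreement
        return [p for p, lead, trail in postings if lead and trail]
    raise ValueError(f"unknown mode: {mode}")

idx = build_index(["東京", "都", "京都", "大学"], n=2)
print(retrieve(idx, "京都", "string"))  # [1, 3]
print(retrieve(idx, "京都", "word"))    # [3]
```

In this example the string "京都" occurs twice in the text, but only the occurrence at position 3 coincides with word boundaries on both sides, so word retrieval reports one hit while string retrieval reports two.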
generating said word separation data to have leading and trailing end data of each of said words and in step (h), said segmentation of said query data is compared with said leading and trailing end data of each character string, and in step (i), said position data of said first candidate data is outputted when said segmentation of said query data agrees with said leading and trailing end data of said one character string.
-
-
3. A method as claimed in claim 2, wherein said step (a) further includes step of:
checking whether a first character having a first order in one of said character strings has leading and trailing ends;
attaching said leading end data to one of said character strings with respect to said first character when said first character has said leading end;
attaching said trailing end data to one of said character strings with respect to said first character when said first character has said trailing end;
checking whether a second character following said first character has a trailing end;
attaching said trailing end data to said one of said character strings with respect to said second character when said second character has said trailing end.
-
4. A method as claimed in claim 1, wherein both said steps (f) and (j) are effected in accordance with said selection command.
-
5. A method as claimed in claim 1, further comprising steps of:
-
dividing said query data into query character strings, each query character string includes N query characters, said step (g) being executed for said query character strings to obtain collating results of said query character strings, respectively;
estimating continuity of said character strings showing said character agreement with said query character strings in accordance with said position data of said character strings showing said character agreement, said step (h) being executed with respect to said word separation data just before the first character and said word separation data just after the last character of said character strings showing said character agreement and said continuity, wherein in step (i) said position data of said first candidate data is outputted when there is said continuity and said word separation data of the first and the last characters of said character strings agrees with said segmentation of said word separation data of the first and the last characters.
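A minimal sketch of the continuity check of claim 5, assuming N = 2 and text already divided into words. The query is split into overlapping N-character strings, each is looked up in the index, and a hit is reported only when the matching strings occur at consecutive positions. Word boundaries are then checked just before the first character and just after the last character; boundaries inside the match are ignored. The function names and the dictionary layout are hypothetical.

```python
def build_index(words, n=2):
    """Map each N-gram to {position: (leading, trailing)}."""
    text = "".join(words)
    starts, ends, pos = set(), set(), 0
    for w in words:
        starts.add(pos)
        pos += len(w)
        ends.add(pos - 1)
    index = {}
    for i in range(len(text) - n + 1):
        flags = (i in starts, (i + n - 1) in ends)
        index.setdefault(text[i:i + n], {})[i] = flags
    return index

def search(index, query, n=2, check_boundaries=True):
    """Return start positions where the whole query matches."""
    grams = [query[k:k + n] for k in range(len(query) - n + 1)]
    if not grams:
        return []
    hits = []
    for p, (lead, _) in sorted(index.get(grams[0], {}).items()):
        # Continuity: the k-th query N-gram must occur at position p + k.
        if not all((p + k) in index.get(g, {}) for k, g in enumerate(grams)):
            continue
        # Trailing flag is read from the posting of the last query N-gram.
        _, trail = index[grams[-1]][p + len(grams) - 1]
        if not check_boundaries or (lead and trail):
            hits.append(p)
    return hits

idx = build_index(["東京都", "は", "東京", "都庁", "の"])
print(search(idx, "東京都"))                          # [0]
print(search(idx, "東京都", check_boundaries=False))  # [0, 4]
```

The second occurrence of "東京都" at position 4 spans the words "東京" and "都庁", so it shows character agreement and continuity but no trailing-end agreement.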
-
-
6. A method as claimed in claim 5, wherein said segmentation agreement is detected in either of first to fifth modes in response to a mode command,
in said first mode, said segmentation agreement is established when said segmentation of the first and the last characters of said query data agrees with said word separation data just before the first character and said word separation data just after the last characters of said character string showing said character agreement;
in said second mode, said segmentation agreement is established when said segmentation of the first and the last characters of said query data agrees with said word separation data just before the first character and said word separation data just after the last characters of said character string showing said character agreement and when said segmentation of only the first character of said query data agrees with said word separation data just before the first character of said character string showing said character agreement;
in said third mode, said segmentation agreement is established when said segmentation of the first and the last characters of said query data agrees with said word separation data just before the first character and said word separation data just after the last characters of said character string showing said character agreement and when said segmentation of only the last character of said query data agrees with said word separation data just after the last character of said character string showing said character agreement;
in said fourth mode, said segmentation agreement is established when said segmentation of only the first character of said query data agrees with said word separation data just before the first character of said character string showing said character agreement; and
in said fifth mode, said segmentation agreement is established when said segmentation of only the last character of said query data agrees with said word separation data just before the first character of said character string showing said character agreement.
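One possible reading of the five modes, sketched in Python. The interpretation is an assumption: the second and third modes are read as accepting either full agreement or agreement at one end only, and the fourth and fifth modes as accepting agreement at one end only. The claim text for the fifth mode refers to the word separation data "just before the first character"; the sketch instead assumes the trailing end is meant, by symmetry with the fourth mode. The function name and the boolean inputs are hypothetical.

```python
def segmentation_agreement(mode, lead, trail):
    """Decide segmentation agreement for a character-matched string.

    lead:  True if the leading end of the query agrees with the word
           separation data just before the first matched character.
    trail: True if the trailing end of the query agrees with the word
           separation data just after the last matched character.
    """
    if mode == 1:                 # both ends agree
        return lead and trail
    if mode == 2:                 # both ends, or leading end only
        return lead
    if mode == 3:                 # both ends, or trailing end only
        return trail
    if mode == 4:                 # leading end only
        return lead and not trail
    if mode == 5:                 # trailing end only (assumed reading)
        return trail and not lead
    raise ValueError(f"unknown mode: {mode}")
```

Under this reading, the first mode corresponds to whole-word matching, while the second and third modes behave like prefix and suffix matching that also accept whole-word matches.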
-
-
7. A method as claimed in claim 1, further comprising the steps of:
-
detecting a condition of each word in said full text data; and
judging whether each word is a non-target word in retrieving in accordance with said condition, wherein in said step (d), said word separation data is not attached to said one character string including said non-target word when one of said words is judged to be non-target word and said segmentation agreement is not effected when said word separation data is not attached to said one character string.
-
-
8. A method as claimed in claim 2, further comprising the steps of:
-
detecting a condition of each word in said full text data; and
judging whether each word is a non-target word in retrieving in accordance with said condition, wherein in said step (d), said leading and trailing end data of said word separation data is not attached to said each character string when one of said words is judged to be a non-target word and said segmentation agreement is not detected when said word separation data is not attached to said one character string.
-
-
9. A method as claimed in claim 7, further comprising the steps of:
-
detecting whether each of said words connects the previous one of said words to the following one of said words; and
judging that one of said words is a non-target word when said one of words connects the previous one of said words to the following one of said words.
-
-
10. A method as claimed in claim 7, further comprising the steps of:
detecting a word class of each word in said full text data to detect said condition, wherein one of words is judged to be said non-target word in accordance with said word class.
-
11. A method as claimed in claim 7, further comprising the steps of:
detecting whether each of said words includes at least a hiragana character in said full text data to detect said condition, wherein one of said words is judged to be said non-target word when one of said words includes one hiragana character and when one of said words includes two hiragana characters.
-
12. A method as claimed in claim 7, further comprising the steps of:
detecting a frequency of appearance of each word in said full text, wherein one of said words is judged to be said non-target word when one of said words has said frequency which is higher than a reference.
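A minimal sketch of non-target word detection along the lines of claims 11 and 12. It assumes that "includes one hiragana character" and "includes two hiragana characters" refer to words consisting of one or two hiragana characters, and that the frequency reference is a simple occurrence count. Both readings, and the names `is_hiragana` and `non_target_words`, are assumptions.

```python
from collections import Counter

def is_hiragana(ch):
    """True for characters in the hiragana letter range U+3041 to U+3096."""
    return "\u3041" <= ch <= "\u3096"

def non_target_words(words, max_count):
    """Return the set of words judged to be non-target words."""
    freq = Counter(words)
    result = set()
    for w, count in freq.items():
        # Words made of one or two hiragana characters (assumed reading).
        short_kana = 1 <= len(w) <= 2 and all(is_hiragana(c) for c in w)
        # Words appearing more often than the reference count.
        if short_kana or count > max_count:
            result.add(w)
    return result

words = ["東京", "は", "日本", "の", "首都", "です", "東京", "東京"]
print(sorted(non_target_words(words, max_count=3)))  # ['です', 'の', 'は']
```

Word separation data would then be withheld for the character strings containing these words, so they can still be found by character string retrieval but not by word retrieval.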
-
13. A method as claimed in claim 5, wherein said step (h) is not executed for intermediate word between said first character and the last character of said character strings showing said character agreement.
-
14. A method as claimed in claim 2, further comprising the steps of:
detecting a prefix and a suffix of each word in said full text data, wherein said leading end data is not generated as said word separation data when the previous word of one of said words is prefix and said trailing end data is not generated as said word separation data when the following word of one of said words is suffix.
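The affix rule of claim 14 can be sketched as follows, assuming each word arrives tagged as `"prefix"`, `"suffix"`, or `"word"` by some external analyzer. The tags, the tuple layout, and the function name `boundaries` are hypothetical.

```python
def boundaries(tagged_words):
    """Return (starts, ends): character positions carrying leading end
    data and trailing end data, respectively.

    tagged_words is a list of (surface, kind) pairs.
    """
    starts, ends, pos = set(), set(), 0
    for i, (surface, _) in enumerate(tagged_words):
        prev_kind = tagged_words[i - 1][1] if i > 0 else None
        next_kind = tagged_words[i + 1][1] if i + 1 < len(tagged_words) else None
        if prev_kind != "prefix":    # no leading end right after a prefix
            starts.add(pos)
        pos += len(surface)
        if next_kind != "suffix":    # no trailing end right before a suffix
            ends.add(pos - 1)
    return starts, ends

starts, ends = boundaries([("新", "prefix"), ("製品", "word"), ("は", "word")])
print(sorted(starts), sorted(ends))  # [0, 3] [0, 2, 3]
```

In this example "製品" has no leading end at position 1, so a word search for "新製品" (positions 0 to 2) finds both ends marked, while "製品" alone lacks a leading end.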
-
15. A method as claimed in claim 14, further comprising the steps of:
detecting a word class of each word in said full text data to detect said prefix and said suffix.
-
16. A method as claimed in claim 14, further comprising the steps of:
detecting a frequency of appearance of each word in said full text, wherein one of words is judged to be said prefix and suffix in accordance with said frequency.
-
17. A method as claimed in claim 1, further comprising steps of:
-
numerically evaluating the results of said steps of (f) and (j), wherein said first and second candidate data is retrieved in sets of said full text data having document identification data, said method further comprising the steps of;
ordering said sets of said full text data in accordance with the results of said steps of (f) and (j) of said sets of said full text data; and
outputting said document identification data of said ordered full text data.
-
-
18. A method as claimed in claim 17, wherein said both steps of (f) and (j) are executed, said method further comprising the step of:
weighting said results of said steps (f) and (j) with different first and second coefficients, respectively.
-
19. A method as claimed in claim 18, wherein said first and second coefficients are determined such that any set of said full text data having the lowest numerically evaluated result of said step (f) is ranked higher than any set of said full text data having the highest numerically evaluated result of said (j).
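One way to choose coefficients satisfying the condition of claim 19, shown as a sketch. It assumes the numerical evaluation is a hit count per document, fixes the second coefficient at 1, and sets the first coefficient to one more than the largest string-only count, so a single word hit always outweighs any number of string-only hits. The scoring formula and the name `rank` are assumptions.

```python
def rank(docs):
    """Order document IDs by weighted score.

    docs maps doc_id -> (word_hits, string_only_hits).
    """
    c2 = 1
    # Largest possible string-only contribution, plus one.
    c1 = c2 * max((s for _, s in docs.values()), default=0) + 1
    scores = {d: c1 * w + c2 * s for d, (w, s) in docs.items()}
    # Highest score first; ties broken by document ID.
    return sorted(scores, key=lambda d: (-scores[d], d))

print(rank({"A": (0, 9), "B": (1, 0), "C": (2, 3)}))  # ['C', 'B', 'A']
```

Document B has a single word hit and document A has nine string-only hits, yet B is ranked above A.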
-
20. A method as claimed in claim 5, wherein said segmentation agreement is detected in either of first to third modes in response to a mode command,
in said first mode, said segmentation agreement is established when said segmentation of the first and the last characters of said query data agrees with said word separation data of the first and the last characters of said character string showing said character agreement;
in said second mode, said segmentation agreement is established when said segmentation of only the first character of said query data agrees with said word separation data of the first character of said character string showing said character agreement;
in said third mode, said segmentation agreement is established when said segmentation of only the last character of said query data agrees with said word separation data of the last character of said character string showing said character agreement, said method further comprising steps of;
weighting said results of said step (f) with first to third different coefficients in said first to third modes, respectively numerically evaluating the results of said steps of (f) and (j), wherein said first and second candidate data is retrieved in sets of said full text data having document identification data, said method further comprising the steps of;
ordering said sets of said full text data in accordance with the results of said steps of (f) and (j) of said sets of said full text data; and
outputting said document identification data of said ordered full text data.
-
-
21. A method as claimed in claim 20, further comprising the steps of:
-
inputting ordering commands for ordering said first to third modes;
generating said first to third coefficients in accordance with said ordering commands such that one of said first to third coefficients of which mode is the most highly ordered has a highest value, another of said first to third coefficients of which mode is the lowliest ordered has a lowest value, the other of said first to third coefficients of which mode is intermediately ordered has an intermediate value.
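A minimal sketch of generating mode coefficients from an ordering command, as in claim 21. The mode labels `"exact"`, `"prefix"`, and `"suffix"`, the linear spacing of coefficients, and both function names are assumptions made for illustration.

```python
def mode_coefficients(mode_order, top=3.0, step=1.0):
    """Assign the highest coefficient to the first mode in mode_order,
    decreasing by `step` for each following mode."""
    return {mode: top - i * step for i, mode in enumerate(mode_order)}

def weighted_score(hits_by_mode, coeffs):
    """Weight the per-mode hit counts of one document."""
    return sum(coeffs[mode] * hits for mode, hits in hits_by_mode.items())

coeffs = mode_coefficients(["prefix", "exact", "suffix"])
print(coeffs)                                             # {'prefix': 3.0, 'exact': 2.0, 'suffix': 1.0}
print(weighted_score({"exact": 2, "suffix": 1}, coeffs))  # 5.0
```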
-
-
22. A method as claimed in claim 21, wherein said first and second candidate data is successively retrieved in each set of said full text data having document identification data, said method further comprising the steps of:
-
classifying said sets of full text data into first to third groups such that said first group of said full text data includes said candidate data most highly ordered mode, said second group of said full text data includes said candidate data intermediately ordered mode but does not include said candidate data most highly ordered mode, and said third group of said full text data includes said candidate data lowliest ordered mode but does not include said candidate data most highly ordered and intermediately ordered modes;
ordering a first portion of said sets of said full text data in each of said first to third groups every said group in accordance with the number of pieces of said first candidate data retrieved in step (f) in respective full text data of said first portion; and
ordering a second portion of said sets of said full text data in which only said second candidate data is retrieved in step (j) in accordance with the number pieces of said second candidate data.
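The grouping and ordering of claim 22 might be sketched as below. The sketch assumes each document carries per-mode word hit counts and a string-only hit count, and that documents within a group are ordered by their total number of word hits. The data layout and the name `order_documents` are hypothetical.

```python
def order_documents(docs, mode_order):
    """Order document IDs by group, then by hit count within the group.

    docs maps doc_id -> {"modes": {mode: hits}, "string": hits}.
    mode_order lists the modes from most to least highly ordered.
    """
    def key(doc_id):
        d = docs[doc_id]
        for group, mode in enumerate(mode_order):
            if d["modes"].get(mode, 0) > 0:
                # Group is set by the most highly ordered mode present.
                return (group, -sum(d["modes"].values()), doc_id)
        # Documents with string-only hits come last, by string hit count.
        return (len(mode_order), -d.get("string", 0), doc_id)
    return sorted(docs, key=key)

docs = {
    "d1": {"modes": {"prefix": 5}, "string": 0},
    "d2": {"modes": {"exact": 1}, "string": 2},
    "d3": {"modes": {}, "string": 7},
    "d4": {"modes": {"exact": 2, "suffix": 1}, "string": 0},
    "d5": {"modes": {"suffix": 4}, "string": 0},
    "d6": {"modes": {}, "string": 9},
}
print(order_documents(docs, ["exact", "prefix", "suffix"]))
# ['d4', 'd2', 'd1', 'd5', 'd6', 'd3']
```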
-
-
23. A method as claimed in claim 20, further comprising the steps of:
-
detecting ratios between the number of said first candidate data and said second candidate data in said sets of full text data, respectively;
estimating accuracies of said sets of said full text data in operation in step (a) in accordance with said ratio, respectively, wherein said sets of full text data is ordered in accordance with said accuracies, respectively.
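A sketch of ordering by estimated segmentation accuracy, as in claim 23. The claim does not state how the ratio is formed; the sketch assumes the share of word hits among all hits and treats a higher share as a sign that the word division agreed with the query more often. The formula and both function names are assumptions.

```python
def segmentation_accuracy(word_hits, string_only_hits):
    """Share of hits that also showed segmentation agreement."""
    total = word_hits + string_only_hits
    return word_hits / total if total else 0.0

def order_by_accuracy(docs):
    """docs maps doc_id -> (word_hits, string_only_hits)."""
    return sorted(docs, key=lambda d: (-segmentation_accuracy(*docs[d]), d))

print(order_by_accuracy({"A": (3, 1), "B": (1, 3), "C": (0, 0)}))  # ['A', 'B', 'C']
```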
-
-
24. A method as claimed in claim 17, wherein in said step (e), said query data including a plurality of query character strings and at least an operator indicating operation among a plurality of query character strings are inputted, wherein in said step of ordering, said each of full text data is ordered in accordance with each of said query character strings, said method further comprising the step of:
finally ordering said sets of said full text data in accordance with the ordering result of said sets of said full text data and said operator.
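A sketch of the final ordering with an operator, as in claim 24. The claim does not say how the operator combines the per-query results; the sketch assumes per-document scores for each query character string and combines them with `min` for AND and `max` for OR, a common convention that is not taken from the patent. The name `combine` is hypothetical.

```python
def combine(scores_a, scores_b, operator):
    """Combine per-document scores of two query strings and rank the result.

    scores_a, scores_b map doc_id -> score; a missing document scores 0.
    """
    fn = {"AND": min, "OR": max}[operator]
    docs = set(scores_a) | set(scores_b)
    combined = {d: fn(scores_a.get(d, 0), scores_b.get(d, 0)) for d in docs}
    # Keep only documents with a positive combined score, best first.
    return sorted((d for d in combined if combined[d] > 0),
                  key=lambda d: (-combined[d], d))

a = {"A": 3, "B": 1}
b = {"B": 2, "C": 5}
print(combine(a, b, "AND"))  # ['B']
print(combine(a, b, "OR"))   # ['C', 'A', 'B']
```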
-
25. A data retrieving apparatus for retrieving first and second candidate data in full text data including no word separation data, comprising:
-
dividing means for dividing said full text data into words and thereby generating word separation data;
generation and storing means for generating and storing index data including;
extracting means for extracting all character strings from said full text data, each character string including N characters, N being a natural number; and
attaching means for attaching said word separation data and character position data of each of said character strings to each of said character strings to generate said index data;
inputting means for inputting query data with segmentation indicative of leading and trailing ends of said query data;
first detecting means for detecting agreement in word retrieving including;
first collating means for collating said query data with each of said character strings in said index data to detect character agreement;
second collating means for collating said segmentation of said query data with said word separation data of each of said character strings to detect segmentation agreement;
first outputting means for outputting said character position data of one of character strings showing said character agreement and said segmentation agreement; and
second detecting means for detecting agreement in character string retrieving including;
third collating means for collating said query data with each of said N characters in said index data; and
second outputting means for outputting said character position data of one of said character strings showing only said character agreement, wherein either of said first detecting means or said second detecting means is operated in accordance with a selection command and said index data is commonly used in said first and second detecting means.
query data dividing means for dividing said query data into query character strings, each query character string includes N query characters, said first collating means being operated for said query character strings to obtain collating results of said query character strings, respectively;
estimating means for estimating continuity of said character strings showing said character agreement with said query character strings in accordance with said position data of said character strings showing said character agreement, said second collating means being executed with respect to said word separation data just before the first character and said word separation data just after the last character of said character strings showing said character agreement and said continuity, wherein said position data of said first candidate data is outputted by said first outputting means when there is said continuity and said word separation data of the first and the last characters of said character strings agrees with said segmentation of said word separation data of the first and the last characters.
-
-
30. A data retrieving apparatus as claimed in claim 29, wherein said segmentation agreement is detected in either of first to fifth modes in response to a mode command,
in said first mode, said segmentation agreement is established when said segmentation of the first and the last characters of said query data agrees with said word separation data just before the first character and said word separation data just after the last characters of said character string showing said character agreement, respectively;
in said second mode, said segmentation agreement is established when said segmentation of the first and the last characters of said query data agrees with said word separation data just before the first character and said word separation data just after the last characters of said character string showing said character agreement, respectively, and when said segmentation of only the first character of said query data agrees with said word separation data just before the first character of said character string showing said character agreement;
in said third mode, said segmentation agreement is established when said segmentation of the first and the last characters of said query data agrees with said word separation data just before the first character and said word separation data just after the last characters of said character string showing said character agreement, respectively and when said segmentation of only the last character of said query data agrees with said word separation data just after the last character of said character string showing said character agreement;
in said fourth mode, said segmentation agreement is established when said segmentation of only the first character of said query data agrees with said word separation data just before the first character of said character string showing said character agreement; and
in said fifth mode, said segmentation agreement is established when said segmentation of only the last character of said query data agrees with said word separation data just before the first character of said character string showing said character agreement.
-
-
31. A data retrieving apparatus as claimed in claim 30, further comprising:
-
detecting means for detecting a condition of each word in said full text data; and
judging means for judging whether each word is a non-target word in retrieving in accordance with said condition, wherein said attaching means attaches said word separation data to one of said character strings including said non-target word when one of said words is judged as a non-target word and said segmentation agreement is not effected when said word separation data is not attached to said one of character strings.
-
-
32. A data retrieving apparatus as claimed in claim 25, further comprising:
-
third detecting means for detecting a condition of each word in said full text data; and
judging means for judging whether each word is a non-target word in retrieving in accordance with said condition, wherein said attaching means does not attach said leading and trailing end data of said word separation data to said each character string when one of said words is judged as a non-target word and said segmentation agreement is not detected when said word separation data is not attached to said one of said character strings.
-
-
33. A data retrieving apparatus as claimed in claim 26, further comprising:
third detecting means for detecting a prefix and a suffix of each word in said full text data, wherein said leading end data is not generated as said word separation data when the previous word of one of said words is prefix and said trailing end data is not generated as said word separation data when the following word of one of said words is suffix.
-
34. A data retrieving apparatus as claimed in claim 26, further comprising:
third detecting means for detecting a word class of each word in said full text data to detect said prefix and said suffix.
-
35. A data retrieving apparatus as claimed in claim 25, further comprising:
-
evaluating means for numerically evaluating the results of said first and second detecting means, wherein said first and second candidate data is retrieved in sets of said full text data having document identification data, said method further comprising;
ordering means for ordering said sets of said full text data in accordance with the results of said first and second detecting means of said sets of said full text data; and
third outputting means for outputting said document identification data of said ordered full text data.
-
-
36. A data retrieving apparatus as claimed in claim 35, wherein said first and second coefficients are determined such that any set of said full text data having the lowest numerically evaluated result of said first detecting means is ranked higher than any set of said full text data having the highest numerically evaluated result of said second detecting means.
-
37. A data retrieving apparatus as claimed in claim 35, wherein said segmentation agreement is detected in either of first to third modes in response to a mode command,
in said first mode, said segmentation agreement is established when said segmentation of the first and the last characters of said query data agrees with said word separation data of the first and the last characters of said character string showing said character agreement;
in said second mode, said segmentation agreement is established when said segmentation of only the first character of said query data agrees with said word separation data of the first character of said character string showing said character agreement;
in said third mode, said segmentation agreement is established when said segmentation of only the last character of said query data agrees with said word separation data of the last character of said character string showing said character agreement, said data retrieving apparatus further comprising;
weighting means for weighting said results of said first detecting means with first to third different coefficients in said first to third modes, respectively, evaluating means for numerically evaluating the results of said first and second detecting means, wherein said first and second candidate data is retrieved in sets of said full text data having document identification data, said data retrieving means further comprising;
ordering means for ordering said sets of said full text data in accordance with the results of said first and second detecting means of said sets of said full text data; and
third outputting means for outputting said document identification data of said ordered full text data.
-
-
38. A data retrieving apparatus as claimed in claim 37, further comprising:
-
inputting means for inputting ordering commands for ordering said first to third modes;
generating means for generating said first to third coefficients in accordance with said ordering commands such that one of said first to third coefficients of which mode is the most highly ordered has a highest value, another of said first to third coefficients of which mode is the lowliest ordered has a lowest value, and the other of said first to third coefficients of which mode is intermediately ordered has an intermediate value.
-
-
39. A data retrieving apparatus as claimed in claim 38, wherein said first and second candidate data is successively retrieved in each set of said full text data having document identification data, said data retrieving apparatus further comprising:
-
classifying means for classifying said sets of full text data into first to third groups such that said first group of said full text data includes said candidate data most highly ordered mode, said second group of said full text data includes said candidate data intermediately ordered mode but does not include said candidate data most highly ordered mode, and said third group of said full text data includes said candidate data lowliest ordered mode but does not include said candidate data most highly ordered and intermediately ordered modes;
first ordering means for ordering a first portion of said sets of said full text data in each of said first to third groups every said group in accordance with the number of pieces of said first candidate data retrieved by first detecting means in respective full text data of said first portion; and
second ordering means for ordering a second portion of said sets of said full text data in which only said second candidate data is retrieved by said second detecting means in accordance with the number pieces of said second candidate data.
-
-
40. A data retrieving apparatus as claimed in claim 35, further comprising:
-
third detecting means for detecting ratios between the number of said first candidate data and said second candidate data in said sets of full text data, respectively; and
estimating means for estimating accuracies of said sets of said full text data in operation by said dividing means in accordance with said ratio, respectively, wherein said sets of full text data is ordered in accordance with said accuracies, respectively.
-
Specification