Method of retrieving no word separation text data and a data retrieving apparatus therefor

US 6,546,401 B1
Filed: 07/17/2000
Issued: 04/08/2003
Est. Priority Date: 07/19/1999
Status: Expired due to Fees

Abstract

Full text data is divided into words to generate word separation data. All character strings are extracted from the full text data, each character string including N characters. The word separation data and position data are attached to each character string to generate index data. In word retrieving, character agreement and segmentation agreement between the query data and all character strings are checked. Word retrieving and/or character string retrieving is effected according to a selection command. The word separation data may include leading or trailing end data. In the word retrieving mode, the leading end of the first character and the trailing end of the last character are checked, but the intermediate portion is not checked. Continuity of the retrieved character strings is checked with reference to their position data. The word retrieving mode includes a number of modes, including the completion agreement mode. A non-target word in retrieving is detected according to its word class, and the word separation data is not attached to the non-target word. The word separation data is not attached to words that are affixes. Sets of full text data are retrieved, the matching degrees are detected, and the sets of full text data are ordered to provide various text agreement. The matching degree is also calculated with an operator.
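
The indexing and the two retrieval modes summarized in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names `build_index` and `search`, the choice N = 2, the in-memory dictionary layout, and the assumption that a word segmenter has already split the text into `words` are all hypothetical choices made for illustration.

```python
def build_index(words, n=2):
    """Index every n-character string with its position and word-boundary flags.

    `words` is the already-segmented text (assumed to come from some word
    segmenter). The index maps each n-character string to {position: flags},
    where flags[k] is True when a word boundary lies just before character k
    of the string, and flags[n] is True when one lies just after its end.
    """
    text = "".join(words)
    boundary = [False] * (len(text) + 1)
    boundary[0] = True
    pos = 0
    for w in words:
        pos += len(w)
        boundary[pos] = True  # a word ends (and the next one starts) here

    index = {}
    for i in range(len(text) - n + 1):
        flags = tuple(boundary[i:i + n + 1])
        index.setdefault(text[i:i + n], {})[i] = flags
    return index


def search(index, query, n=2, word_mode=False):
    """Return the start positions of `query`, using one shared index.

    Character string retrieval (word_mode=False) needs only character
    agreement. Word retrieval (word_mode=True) additionally requires a word
    boundary just before the first and just after the last matched character;
    boundaries inside the match are not checked.
    """
    if len(query) < n:
        raise ValueError("query must have at least n characters")
    hits = []
    for start in sorted(index.get(query[:n], {})):
        # Continuity check: the k-th query string must occur at start + k.
        postings = [index.get(query[k:k + n], {}).get(start + k)
                    for k in range(len(query) - n + 1)]
        if any(p is None for p in postings):
            continue
        if word_mode and not (postings[0][0] and postings[-1][-1]):
            continue
        hits.append(start)
    return hits


# Unsegmented text "東京都に行く京都に行く", with a hypothetical segmentation.
words = ["東京都", "に", "行く", "京都", "に", "行く"]
index = build_index(words)
print(search(index, "京都"))                  # [1, 6]
print(search(index, "京都", word_mode=True))  # [6]
```

In the example, character string retrieval finds "京都" inside "東京都" as well as the standalone word, while word retrieval returns only the occurrence whose leading and trailing ends coincide with word boundaries. Both calls read the same index, which mirrors the shared use of the index data by the two modes.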

40 Claims

1. A method of retrieving first and second candidate data in full text data including no word separation data, comprising the steps of:
- (a) dividing said full text data into words and thereby generating word separation data;
  
  (b) generating and storing index data including the steps of;
  
  (c) extracting all character strings from said full text data, each character string including N characters, N being a natural number; and
  
  (d) attaching said word separation data and character position data of each of said character strings to each of said character strings to generate said index data;
  
  (e) inputting query data with segmentation indicative of leading and trailing ends of said query data;
  
  (f) detecting agreement in word retrieving, said step (f) including steps of;
  
  (g) collating said query data with each of said character strings in said index data to detect character agreement;
  
  (h) collating said segmentation of said query data with said word separation data of each of said character strings to detect segmentation agreement;
  
  (i) outputting said character position data of one of character strings showing said character agreement and said segmentation agreement; and
  
  (j) detecting agreement in character string retrieving, said step (j) including steps of;
  
  (k) collating said query data with each of said N characters in said index data; and
  
  (l) outputting said character position data of one of said character strings showing only said character agreement, wherein either of said step (f) or step (j) is effected in accordance with a selection command and said index data is commonly used in the steps (f) and (j).
2. A method as claimed in claim 1, wherein said step (a) includes a step of:
3. A method as claimed in claim 2, wherein said step (a) further includes step of:
- checking whether a first character having a first order in one of said character strings has leading and trailing ends;
  
  attaching said leading end data to one of said character strings with respect to said first character when said first character has said leading end;
  
  attaching said trailing end data to one of said character strings with respect to said first character when said first character has said trailing end;
  
  checking whether a second character following said first character has a trailing end;
  
  attaching said trailing end data to said one of said character strings with respect to said second character when said second character has said trailing end.
4. A method as claimed in claim 1, wherein both said steps (f) and (j) are effected in accordance with said selection command.
5. A method as claimed in claim 1, further comprising steps of:
- dividing said query data into query character strings, each query character string includes N query characters, said step (g) being executed for said query character strings to obtain collating results of said query character strings, respectively;
  
  estimating continuity of said character strings showing said character agreement with said query character strings in accordance with said position data of said character strings showing said character agreement, said step (h) being executed with respect to said word separation data just before the first character and said word separation data just after the last character of said character strings showing said character agreement and said continuity, wherein in step (i) said position data of said first candidate data is outputted when there is said continuity and said word separation data of the first and the last characters of said character strings agrees with said segmentation of said word separation data of the first and the last characters.
6. A method as claimed in claim 5, wherein said segmentation agreement is detected in either of first to fifth modes in response to a mode command,
- in said first mode, said segmentation agreement is established when said segmentation of the first and the last characters of said query data agrees with said word separation data just before the first character and said word separation data just after the last characters of said character string showing said character agreement;
  
  in said second mode, said segmentation agreement is established when said segmentation of the first and the last characters of said query data agrees with said word separation data just before the first character and said word separation data just after the last characters of said character string showing said character agreement and when said segmentation of only the first character of said query data agrees with said word separation data just before the first character of said character string showing said character agreement;
  
  in said third mode, said segmentation agreement is established when said segmentation of the first and the last characters of said query data agrees with said word separation data just before the first character and said word separation data just after the last characters of said character string showing said character agreement and when said segmentation of only the last character of said query data agrees with said word separation data just after the last character of said character string showing said character agreement;
  
  in said fourth mode, said segmentation agreement is established when said segmentation of only the first character of said query data agrees with said word separation data just before the first character of said character string showing said character agreement; and
  
  in said fifth mode, said segmentation agreement is established when said segmentation of only the last character of said query data agrees with said word separation data just before the first character of said character string showing said character agreement.
7. A method as claimed in claim 1, further comprising the steps of:
- detecting a condition of each word in said full text data; and
  
  judging whether each word is a non-target word in retrieving in accordance with said condition, wherein in said step (d), said word separation data is not attached to said one character string including said non-target word when one of said words is judged to be non-target word and said segmentation agreement is not effected when said word separation data is not attached to said one character string.
8. A method as claimed in claim 2, further comprising the steps of:
- detecting a condition of each word in said full text data; and
  
  judging whether each word is a non-target word in retrieving in accordance with said condition, wherein in said step (d), said leading and trailing end data of said word separation data is not attached to said each character string when one of said words is judged to be a non-target word and said segmentation agreement is not detected when said word separation data is not attached to said one character string.
9. A method as claimed in claim 7, further comprising the steps of:
- detecting whether each of said words connects the previous one of said words to the following one of said words; and
  
  judging that one of said words is a non-target word when said one of words connects the previous one of said words to the following one of said words.
10. A method as claimed in claim 7, further comprising the steps of:
- detecting a word class of each word in said full text data to detect said condition, wherein one of words is judged to be said non-target word in accordance with said word class.
11. A method as claimed in claim 7, further comprising the steps of:
- detecting whether each of said words includes at least a hiragana character in said full text data to detect said condition, wherein one of said words is judged to be said non-target word when one of said words includes one hiragana character and when one of said words includes two hiragana characters.
12. A method as claimed in claim 7, further comprising the steps of:
- detecting a frequency of appearance of each word in said full text, wherein one of said words is judged to be said non-target word when one of said words has said frequency which is higher than a reference.
13. A method as claimed in claim 5, wherein said step (h) is not executed for intermediate word between said first character and the last character of said character strings showing said character agreement.
14. A method as claimed in claim 2, further comprising the steps of:
- detecting a prefix and a suffix of each word in said full text data, wherein said leading end data is not generated as said word separation data when the previous word of one of said words is prefix and said trailing end data is not generated as said word separation data when the following word of one of said words is suffix.
15. A method as claimed in claim 14, further comprising the steps of:
- detecting a word class of each word in said full text data to detect said prefix and said suffix.
16. A method as claimed in claim 14, further comprising the steps of:
- detecting a frequency of appearance of each word in said full text, wherein one of words is judged to be said prefix and suffix in accordance with said frequency.
17. A method as claimed in claim 1, further comprising steps of:
- numerically evaluating the results of said steps of (f) and (j), wherein said first and second candidate data is retrieved in sets of said full text data having document identification data, said method further comprising the steps of;
  
  ordering said sets of said full text data in accordance with the results of said steps of (f) and (j) of said sets of said full text data; and
  
  outputting said document identification data of said ordered full text data.
18. A method as claimed in claim 17, wherein said both steps of (f) and (j) are executed, said method further comprising the step of:
- weighting said results of said steps (f) and (j) with different first and second coefficients, respectively.
19. A method as claimed in claim 18, wherein said first and second coefficients are determined such that any set of said full text data having the lowest numerically evaluated result of said step (f) is ranked higher than any set of said full text data having the highest numerically evaluated result of said (j).
20. A method as claimed in claim 5, wherein said segmentation agreement is detected in either of first to third modes in response to a mode command,
- in said first mode, said segmentation agreement is established when said segmentation of the first and the last characters of said query data agrees with said word separation data of the first and the last characters of said character string showing said character agreement;
  
  in said second mode, said segmentation agreement is established when said segmentation of only the first character of said query data agrees with said word separation data of the first character of said character string showing said character agreement;
  
  in said third mode, said segmentation agreement is established when said segmentation of only the last character of said query data agrees with said word separation data of the last character of said character string showing said character agreement, said method further comprising steps of;
  
  weighting said results of said step (f) with first to third different coefficients in said first to third modes, respectively numerically evaluating the results of said steps of (f) and (j), wherein said first and second candidate data is retrieved in sets of said full text data having document identification data, said method further comprising the steps of;
  
  ordering said sets of said full text data in accordance with the results of said steps of (f) and (j) of said sets of said full text data; and
  
  outputting said document identification data of said ordered full text data.
21. A method as claimed in claim 20, further comprising the steps of:
- inputting ordering commands for ordering said first to third modes;
  
  generating said first to third coefficients in accordance with said ordering commands such that one of said first to third coefficients of which mode is the most highly ordered has a highest value, another of said first to third coefficients of which mode is the lowliest ordered has a lowest value, the other of said first to third coefficients of which mode is intermediately ordered has an intermediate value.
22. A method as claimed in claim 21, wherein said first and second candidate data is successively retrieved in each set of said full text data having document identification data, said method further comprising the steps of:
- classifying said sets of full text data into first to third groups such that said first group of said full text data includes said candidate data most highly ordered mode, said second group of said full text data includes said candidate data intermediately ordered mode but does not include said candidate data most highly ordered mode, and said third group of said full text data includes said candidate data lowliest ordered mode but does not include said candidate data most highly ordered and intermediately ordered modes;
  
  ordering a first portion of said sets of said full text data in each of said first to third groups every said group in accordance with the number of pieces of said first candidate data retrieved in step (f) in respective full text data of said first portion; and
  
  ordering a second portion of said sets of said full text data in which only said second candidate data is retrieved in step (j) in accordance with the number pieces of said second candidate data.
23. A method as claimed in claim 20, further comprising the steps of:
- detecting ratios between the number of said first candidate data and said second candidate data in said sets of full text data, respectively;
  
  estimating accuracies of said sets of said full text data in operation in step (a) in accordance with said ratio, respectively, wherein said sets of full text data is ordered in accordance with said accuracies, respectively.
24. A method as claimed in claim 17, wherein in said step (e), said query data including a plurality of query character strings and at least an operator indicating operation among a plurality of query character strings are inputted, wherein in said step of ordering, said each of full text data is ordered in accordance with each of said query character strings, said method further comprising the step of:
- finally ordering said sets of said full text data in accordance with the ordering result of said sets of said full text data and said operator.
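
The three segmentation agreement modes recited in claim 20 can be read as checking both ends, only the leading end, or only the trailing end of a character match. The following Python sketch shows one possible reading under stated assumptions: the names `Mode` and `segmentation_agrees` are hypothetical, and `boundary[i]` is assumed to be True when a word boundary lies just before character `i` of the text.

```python
from enum import Enum


class Mode(Enum):
    BOTH_ENDS = 1      # first mode: leading and trailing ends must both agree
    LEADING_ONLY = 2   # second mode: only the leading end is checked
    TRAILING_ONLY = 3  # third mode: only the trailing end is checked


def segmentation_agrees(boundary, start, length, mode):
    """Decide segmentation agreement for a character match.

    The match covers characters start .. start + length - 1. boundary[i] is
    True when a word boundary lies just before character i (assumed layout).
    """
    leading = boundary[start]
    trailing = boundary[start + length]
    if mode is Mode.BOTH_ENDS:
        return leading and trailing
    if mode is Mode.LEADING_ONLY:
        return leading
    return trailing


# Text "東京都知事" with a hypothetical segmentation 東京 | 都 | 知事.
boundary = [True, False, True, True, False, True]

# "京都" matches at position 1: it ends on a boundary but does not start on one.
print(segmentation_agrees(boundary, 1, 2, Mode.BOTH_ENDS))      # False
print(segmentation_agrees(boundary, 1, 2, Mode.TRAILING_ONLY))  # True
```

Under this reading, the second and third modes behave like forward and backward agreement, so a query can match the beginning or the end of a longer word without matching arbitrary interior substrings.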

25. A data retrieving apparatus for retrieving first and second candidate data in full text data including no word separation data, comprising:
- dividing means for dividing said full text data into words and thereby generating word separation data;
  
  generation and storing means for generating and storing index data including;
  
  extracting means for extracting all character strings from said full text data, each character string including N characters, N being a natural number; and
  
  attaching means for attaching said word separation data and character position data of each of said character strings to each of said character strings to generate said index data;
  
  inputting means for inputting query data with segmentation indicative of leading and trailing ends of said query data;
  
  first detecting means for detecting agreement in word retrieving including;
  
  first collating means for collating said query data with each of said character strings in said index data to detect character agreement;
  
  second collating means for collating said segmentation of said query data with said word separation data of each of said character strings to detect segmentation agreement;
  
  first outputting means for outputting said character position data of one of character strings showing said character agreement and said segmentation agreement; and
  
  second detecting means for detecting agreement in character string retrieving including;
  
  third collating means for collating said query data with each of said N characters in said index data; and
  
  second outputting means for outputting said character position data of one of said character strings showing only said character agreement, wherein either of said first detecting means or said second detecting means is operated in accordance with a selection command and said index data is commonly used in said first and second detecting means.
26. A data retrieving apparatus as claimed in claim 25, wherein said dividing means includes generating means for generating said word separation data to have leading and trailing end data of each of said words and said second collating means compares said segmentation of said query data with said leading and trailing end data of each character string, and said position data of said first candidate data is outputted by said first outputting means when said segmentation of said query data agrees with said leading and trailing end data of said one of character strings.
27. A data retrieving apparatus as claimed in claim 26, wherein said dividing means further includes:
- first checking means for checking whether a first character having a first order in one of said character strings has leading and trailing ends;
  
  first attaching means for attaching said leading end data to one of said character strings with respect to said first character when said first character has said leading end;
  
  second attaching means for attaching said trailing end data to one of said character strings with respect to said first character when said first character has said trailing end;
  
  second checking means for checking whether a second character following said first character has a trailing end;
  
  third attaching means for attaching said trailing end data to said one of said character strings with respect to said second character when said second character has said trailing end.
28. A data retrieving apparatus as claimed in claim 25, wherein both said first and second detecting means are operated in accordance with said selection command.
29. A data retrieving apparatus as claimed in claim 25, further comprising:
30. A data retrieving apparatus as claimed in claim 29, wherein said segmentation agreement is detected in either of first to fifth modes in response to a mode command,
- in said first mode, said segmentation agreement is established when said segmentation of the first and the last characters of said query data agrees with said word separation data just before the first character and said word separation data just after the last characters of said character string showing said character agreement, respectively;
  
  in said second mode, said segmentation agreement is established when said segmentation of the first and the last characters of said query data agrees with said word separation data just before the first character and said word separation data just after the last characters of said character string showing said character agreement, respectively, and when said segmentation of only the first character of said query data agrees with said word separation data just before the first character of said character string showing said character agreement;
  
  in said third mode, said segmentation agreement is established when said segmentation of the first and the last characters of said query data agrees with said word separation data just before the first character and said word separation data just after the last characters of said character string showing said character agreement, respectively and when said segmentation of only the last character of said query data agrees with said word separation data just after the last character of said character string showing said character agreement;
  
  in said fourth mode, said segmentation agreement is established when said segmentation of only the first character of said query data agrees with said word separation data just before the first character of said character string showing said character agreement; and
  
  in said fifth mode, said segmentation agreement is established when said segmentation of only the last character of said query data agrees with said word separation data just before the first character of said character string showing said character agreement.
31. A data retrieving apparatus as claimed in claim 30, further comprising:
- detecting means for detecting a condition of each word in said full text data; and
  
  judging means for judging whether each word is a non-target word in retrieving in accordance with said condition, wherein said attaching means attaches said word separation data to one of said character strings including said non-target word when one of said words is judged as a non-target word and said segmentation agreement is not effected when said word separation data is not attached to said one of character strings.
32. A data retrieving apparatus as claimed in claim 25, further comprising:
- third detecting means for detecting a condition of each word in said full text data; and
  
  judging means for judging whether each word is a non-target word in retrieving in accordance with said condition, wherein said attaching means does not attach said leading and trailing end data of said word separation data to said each character string when one of said words is judged as a non-target word and said segmentation agreement is not detected when said word separation data is not attached to said one of said character strings.
33. A data retrieving apparatus as claimed in claim 26, further comprising:
- third detecting means for detecting a prefix and a suffix of each word in said full text data, wherein said leading end data is not generated as said word separation data when the previous word of one of said words is prefix and said trailing end data is not generated as said word separation data when the following word of one of said words is suffix.
34. A data retrieving apparatus as claimed in claim 26, further comprising:
- third detecting means for detecting a word class of each word in said full text data to detect said prefix and said suffix.
35. A data retrieving apparatus as claimed in claim 25, further comprising:
- evaluating means for numerically evaluating the results of said first and second detecting means, wherein said first and second candidate data is retrieved in sets of said full text data having document identification data, said method further comprising;
  
  ordering means for ordering said sets of said full text data in accordance with the results of said first and second detecting means of said sets of said full text data; and
  
  third outputting means for outputting said document identification data of said ordered full text data.
36. A data retrieving apparatus as claimed in claim 35, wherein said first and second coefficients are determined such that any set of said full text data having the lowest numerically evaluated result of said first detecting means is ranked higher than any set of said full text data having the highest numerically evaluated result of said second detecting means.
37. A data retrieving apparatus claimed in claim 35, wherein said segmentation agreement is detected in either of first to third modes in response to a mode command,
- in said first mode, said segmentation agreement is established when said segmentation of the first and the last characters of said query data agrees with said word separation data of the first and the last characters of said character string showing said character agreement;
  
  in said second mode, said segmentation agreement is established when said segmentation of only the first character of said query data agrees with said word separation data of the first character of said character string showing said character agreement;
  
  in said third mode, said segmentation agreement is established when said segmentation of only the last character of said query data agrees with said word separation data of the last character of said character string showing said character agreement, said data retrieving apparatus further comprising;
  
  weighting means for weighting said results of said first detecting means with first to third different coefficients in said first to third modes, respectively, evaluating means for numerically evaluating the results of said first and second detecting means, wherein said first and second candidate data is retrieved in sets of said full text data having document identification data, said data retrieving means further comprising;
  
  ordering means for ordering said sets of said full text data in accordance with the results of said first and second detecting means of said sets of said full text data; and
  
  third outputting means for outputting said document identification data of said ordered full text data.
38. A data retrieving apparatus as claimed in claim 37, further comprising:
- inputting means for inputting ordering commands for ordering said first to third modes;
  
  generating means for generating said first to third coefficients in accordance with said ordering commands such that one of said first to third coefficients of which mode is the most highly ordered has a highest value, another of said first to third coefficients of which mode is the lowliest ordered has a lowest value, and the other of said first to third coefficients of which mode is intermediately ordered has an intermediate value.
39. A data retrieving apparatus as claimed in claim 38, wherein said first and second candidate data is successively retrieved in each set of said full text data having document identification data, said data retrieving apparatus further comprising:
- classifying means for classifying said sets of full text data into first to third groups such that said first group of said full text data includes said candidate data most highly ordered mode, said second group of said full text data includes said candidate data intermediately ordered mode but does not include said candidate data most highly ordered mode, and said third group of said full text data includes said candidate data lowliest ordered mode but does not include said candidate data most highly ordered and intermediately ordered modes;
  
  first ordering means for ordering a first portion of said sets of said full text data in each of said first to third groups every said group in accordance with the number of pieces of said first candidate data retrieved by first detecting means in respective full text data of said first portion; and
  
  second ordering means for ordering a second portion of said sets of said full text data in which only said second candidate data is retrieved by said second detecting means in accordance with the number pieces of said second candidate data.
40. A data retrieving apparatus as claimed in claim 35, further comprising:
- third detecting means for detecting ratios between the number of said first candidate data and said second candidate data in said sets of full text data, respectively; and
  
  estimating means for estimating accuracies of said sets of said full text data in operation by said dividing means in accordance with said ratio, respectively, wherein said sets of full text data is ordered in accordance with said accuracies, respectively.
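
Claims 17 to 19 and 35 to 36 describe numerically evaluating word retrieval and character string retrieval results, weighting them with different coefficients, and ordering documents so that any document with a word retrieval result ranks above any document with only character string results. One way to obtain that property is sketched below in Python. The function name `rank_documents`, the hit-count input format, and the specific choice of coefficients are assumptions for illustration; the claims do not prescribe how the coefficients are computed.

```python
def rank_documents(hits):
    """Order document ids by a weighted score of word and string hits.

    `hits` maps a document id to (word_hits, string_hits). The string
    coefficient is fixed at 1 and the word coefficient is chosen larger than
    the largest string hit count, so a single word hit always outweighs any
    number of string-only hits in this collection.
    """
    max_string = max((s for _, s in hits.values()), default=0)
    c_string = 1
    c_word = max_string + 1
    scores = {doc: c_word * w + c_string * s for doc, (w, s) in hits.items()}
    # Highest score first; ties are broken by document id for determinism.
    return sorted(scores, key=lambda doc: (-scores[doc], doc))


# Hypothetical hit counts: doc1 has many string-only hits, doc2 one word hit.
hits = {"doc1": (0, 9), "doc2": (1, 0), "doc3": (2, 3)}
print(rank_documents(hits))  # ['doc3', 'doc2', 'doc1']
```

Here `doc2`, with a single word hit, still ranks above `doc1`, which has nine character string hits and no word hit.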

Current Assignee: Matsushita Electric Industrial Company Limited (Panasonic Holdings Corporation)
Original Assignee: Matsushita Electric Industrial Company Limited (Panasonic Holdings Corporation)
Inventors: Iizuka, Yasuki; Kikuchi, Chuichi; Tanabe, Tomoko
Primary Examiner(s): Metjahic, Safet
Assistant Examiner(s): Chen, Te Yu
Application Number: US09/618,055
Time in Patent Office: 995 Days
Field of Search: 707/1, 707/3, 707/102-104, 707/530, 707/531, 707/540, 704/1-10, 704/530, 704/531-532
US Class Current: 1/1
CPC Class Codes:
G06F 40/284   Lexical analysis, e.g. toke...
Y10S 707/99945   Object-oriented database st...
Y10S 707/99948   Application of database or ...