Apparatus and method for retrieving character string based on classification of character
First Claim
Patent Images
1. A string retrieval apparatus for retrieving a given string out of registration strings, comprising:
- a first array unit registering number information corresponding to a prefix at a position of a first subscript, with an index of the prefix followed by a plurality of characters as the first subscript;
a second array unit registering a displacement amount corresponding to each of a plurality of groups obtained by classifying the plurality of characters following the prefix at a position of a second subscript, with the number information corresponding to the prefix as the second subscript;
a third array unit registering the index of the prefix at a position of a third subscript, with a sum of the displacement amount and an internal representation value of a character following the prefix as the third subscript; and
a retrieving unit retrieving the given string using said first, second and third array units.
1 Assignment
0 Petitions
Accused Products
Abstract
A character string retrieval apparatus classifies a plurality of characters following a prefix of a registration character string into a plurality of groups, and registers those following characters in an array structure using a different displacement amount for each group. The character string retrieval apparatus retrieves a given character string based on the displacement amount of a group corresponding to an input character.
44 Citations
33 Claims
-
1. A string retrieval apparatus for retrieving a given string out of registration strings, comprising:
-
a first array unit registering number information corresponding to a prefix at a position of a first subscript, with an index of the prefix followed by a plurality of characters as the first subscript;
a second array unit registering a displacement amount corresponding to each of a plurality of groups obtained by classifying the plurality of characters following the prefix at a position of a second subscript, with the number information corresponding to the prefix as the second subscript;
a third array unit registering the index of the prefix at a position of a third subscript, with a sum of the displacement amount and an internal representation value of a character following the prefix as the third subscript; and
a retrieving unit retrieving the given string using said first, second and third array units. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13)
said first array unit registers a displacement amount common to the one or more characters, in a subscript position identical to the index of the prefix followed by the one or more characters, said third array unit registers the index of the prefix followed by the one or more characters, in a subscript position identical to a sum of the displacement amount common to the one or more characters and an internal representation value of a character, and said retrieving unit includes an identifying unit judging which of number information or a displacement amount is a value registered in the first array unit. -
3. The string retrieval apparatus according to claim 1, wherein, when the number of the plurality of characters following the prefix exceeds a predetermined value,
said first array unit registers the number information corresponding to the prefix. -
4. The string retrieval apparatus according to claim 1, wherein, when a range of values of the plurality of characters following the prefix exceeds a predetermined value,
said first array unit registers the number information corresponding to the prefix. -
5. The string retrieval apparatus according to claim 1, further comprising
a prefix registering unit registering the prefix, with the number information corresponding to the subscript of said second array unit as a subscript. -
6. The string retrieval apparatus according to claim 1, further comprising
a calculating unit adding the displacement amount and the internal representation value of the character following the prefix, wherein said third array unit uses the sum of the displacement amount and the internal representation value of the character following the prefix as an index of a next prefix. -
7. The string retrieval apparatus according to claim 1, further comprising
a classifying unit classifying the plurality of characters following the prefix and calculating the plurality of groups. -
8. The string retrieval apparatus according to claim 7, wherein, when the given string contains the prefix and a next character is inputted succeeding the prefix,
said classifying unit calculates a group corresponding to the next character, and said retrieving unit extracts a displacement amount corresponding to the group of the next character from said second array unit using number information registered in said first array unit and checks whether or not the index of the prefix is registered in a position of said third array unit where a sum of the extracted displacement amount and an internal representation value of the next character is designated for a subscript. -
9. The string retrieval apparatus according to claim 7, wherein
said classifying unit classifies the plurality of characters following the prefix using one or more bits contained in codes of the plurality of characters following the prefix. -
10. The string retrieval apparatus according to claim 7, wherein
said classifying unit adopts a classification method out of two or more classification methods such as a deviation in number among characters contained in each of obtained groups may be a minimum. -
11. The string retrieval apparatus according to claim 10, wherein
said first array unit registers number information corresponding to the classification method adopted by said classification unit as the number information corresponding to the prefix. -
12. The string retrieval apparatus according to claim 1, further comprising
a calculating unit calculating a displacement amount for registering one or more characters contained in each of the plurality of groups in said first and third array unit corresponding to each of the plurality of groups. -
13. The string retrieval apparatus according to claim 12, wherein
said calculating unit adds an arbitrary addition value to values of the one or more characters contained in each group, calculates an addition value such as all of one or more obtained sums may correspond to empty positions in said third array unit, and calculates the minimum addition value out of obtained addition values as the displacement amount.
-
-
14. A string retrieval apparatus for retrieving a given string out of a plurality of registration strings, comprising:
-
a registering unit classifying a plurality of characters, which follow a prefix and respectively belong to different character strings with the same prefix, into a plurality of groups, each of the plurality of characters following the same prefix, and each of the different character strings being a registration string that has the same prefix as a leading part, assigning different displacement amounts to the respective groups and registering characters in each of the groups with each of the displacement amounts; and
a retrieving unit retrieving the given string using said registering unit.
-
-
15. A computer-readable storage medium recording a program to enable said computer to retrieve a given string out of registration strings, said program comprising:
-
when the given string contains a prefix followed by a plurality of characters and a next character is inputted succeeding the prefix, calculating a group corresponding to the next character out of a plurality of groups obtained by classifying the plurality of characters following the prefix;
referring to a first array registering number information corresponding to the prefix at a position of a first subscript, with an index of said prefix as the first subscript;
referring to a second array registering a displacement amount corresponding to each of said plurality of groups at a position of a second subscript, with the number information corresponding to the prefix as the second subscript and obtaining a displacement amount corresponding to the group of the next character; and
referring to a third array and checking whether or not the index of the prefix is registered in a position where a sum of an obtained displacement amount and an internal representation value of said next character is designated as a subscript.
-
-
16. A computer-readable storage medium recording a program to enable a computer to retrieve a given string out of a plurality of registration strings, said program comprising a process of referring to an array in which a plurality of characters following a prefix are classified into a plurality of groups, the plurality of characters respectively belonging to different character strings with the same prefix, each of the plurality of characters following the same prefix, and each of the different character strings being a registration string that has the same prefix as a leading part, assigning different displacement amounts to the respective groups, and registering characters in each of the groups with each of the displacement amounts.
-
17. A computer-readable storage medium recording data of registration strings, said data comprising:
-
first array data registering number information corresponding to a prefix followed by a plurality of characters, in a subscript identical to an index of the prefix;
second array data registering a displacement amount corresponding to each of a plurality of groups obtained by classifying the plurality of characters following the prefix, in a subscript position identical to the number information corresponding to the prefix; and
third array data registering the index of the prefix, in a subscript position identical to a sum of the displacement amount and an internal representation value of a character following the prefix.
-
-
18. A string retrieval method of retrieving a given string out of registration strings, comprising:
-
registering number information corresponding to a prefix followed by a plurality of characters in a first array, in a subscript position identical to an index of the prefix;
registering a displacement amount corresponding to each of a plurality of groups obtained by classifying the plurality of characters following the prefix in a second array, in a subscript position identical to the number information corresponding to the prefix;
registering the index of the prefix in a third array, with a sum of the displacement amount and a value of a character following the prefix as a subscript;
when the given string contains the prefix and a next character is inputted succeeding the prefix, calculating a group corresponding to the next character out of the plurality of groups;
referring to said first array, with the index of the prefix as a subscript and obtaining the number information corresponding to the prefix;
referring to said second array, with the number information corresponding to the prefix as a subscript and obtaining a displacement amount corresponding to the group of the next character; and
referring to the third array and checking whether or not the index of the prefix is registered in a position where a sum of the obtained displacement amount and an internal representation value of the next character is designated for a subscript.
-
-
19. A string retrieval method of retrieving a given string out of a plurality of registration strings, comprising:
-
classifying a plurality of characters, which follow a prefix and respectively belong to different character strings with the same prefix, into a plurality of groups, each of the plurality of characters following the same prefix, and each of the different character strings being a registration string that has the same prefix as a leading part;
assigning different displacement amounts to the respective groups;
registering characters in each of the groups with each of the displacement amounts in an array; and
referring to said array and retrieving the given character string.
-
-
20. A character code registration retrieval apparatus for registering character code character strings to be retrieved using keys in a double-array structure being an one-dimensional array of a data structure, comprising:
-
a parallel shift amount calculating unit calculating a parallel shift amount needed to register a character of each of the character strings to be retrieved using keys;
a first array having as a subscript an index of a prefix being a prefix of each of the character strings to be retrieved using keys;
an identifying unit identifying a registration value in said first array;
a second array registering information on a specific character following the prefix of the character string;
a key candidate point calculating unit calculating a sum of a parallel shift amount registered in said first and second arrays, and an internal representation value corresponding to a character following the prefix of the character string; and
a third array registering the index of the prefix of the character string, with the sum obtained by the key candidate point calculating unit as a subscript. - View Dependent Claims (21, 22)
a list unit generating a list of character codes frequently used in idioms and outputting a selection character code selected from the list of character codes;
a frequently-appearing character code selecting unit outputting a frequency threshold regarding up to what order number of character codes should be selected;
a frequently-appearing character code storing unit storing the selection character code selected from the list and outputting the selection character code and an index of the selection character code;
a dictionary unit being a character code dictionary registering idioms composed of character codes, classifying a job according to whether or not a focussed character is a prefix of an idiom composed of the selection character code and outputting each of groups obtained by classifying a character code following the selection character code of the prefix;
a group storing unit storing the groups obtained by classifying the character code following the selection character code of the prefix inputted from said dictionary unit;
a first BASE array unit to store said first array with number information of the selection character code;
a code classifying unit classifying the second character code of the idiom using at least one bits of the second character code in order to classify the characters following the selection character code of the prefix;
a parallel shift amount calculating unit calculating for each of the groups a minimum parallel shift amount, so that all values obtained by adding the minimum parallel shift amount to an internal representation value of each character of the group may indicate empty positions on said third array;
a parallel shift amount storing unit storing the parallel shift amount inputted from said parallel shift amount calculating unit and outputting the parallel shift amount to said second array;
a key candidate point calculating unit registering for each of said groups an index of the prefix being a parent of the character codes of the group in a subscript position, with a sum of the parallel shift amount and the internal representation value of each character of the group as a value of the subscript position on said third array and designating a value of the sum for an index of a next prefix consisting of ((prefix)+(focussed character));
a second BASE array unit to store said second array with the parallel shift amount for each of the groups outputted by said parallel shift amount storing unit based on both the code value outputted from said code classifying unit and the number information outputted from said list unit; and
a CHECK array unit as said third array for registering the index of the prefix in a place corresponding to the value of the sum.
-
-
22. The character code registration retrieval apparatus according to claim 20, further comprising:
-
a document inputting unit first designating a root of a TRIE structure as a prefix, setting up an end mark in the prefix as an end symbol, then instructing to input a character code as a character to be retrieved, and detecting a prefix of the inputted character code;
a first BASE array unit to store said first array with a numeric value from a place corresponding to the index of the prefix;
a registration value judging unit judging which of number information of a frequently-appearing prefix character code or a parallel shift amount is the numeric value inputted from said first BASE array unit, and when the numeric value is out of a range of indexes composing a TRIE, outputting the numeric value as the number information of the prefix character code, and when the numeric value is within the range of indexes, outputting the numeric value as the parallel shift amount;
a code classifying unit, when the numeric value inputted from said first BASE array unit is the number information of the prefix character code, classifying the inputted character code using at least one bit of the character code;
a second BASE array unit as said second array outputting a parallel shift amount from a place corresponding to both the number information of the prefix character code inputted from the registration value judging unit and a classification of the character code;
a parallel shift amount storing unit, when the numeric value inputted from said first BASE array unit is a parallel shift amount, storing the parallel shift amount;
a key candidate point calculating unit calculating a sum of the parallel shift amount and an internal representation value, of the inputted character;
a CHECK array unit as said third array outputting a key from a place corresponding the sum inputted from said key candidate point calculating unit; and
a key/prefix collating unit judging whether or not the key inputted from said CHECK array unit coincides with the index of the prefix, and when the inputted key is judged to coincide with the index of the prefix, judging that an idiom is registered in a dictionary.
-
-
23. A character code registration retrieval method of registering character code character strings to be retrieved using keys in a double-array structure being an one-dimensional array of a data structure, comprising:
-
a parallel shift amount calculation step of calculating a parallel shift amount needed to register a character of each of the character strings to be retrieved using keys;
a first array step of designating an index of a prefix of each of the character strings to be retrieved using keys as a subscript;
an identification step of identifying a registration value in the first array step;
a second array step of registering information on a specific character following the prefix of the character string indicated in the first array step;
a key candidate point calculation step of calculating a sum of a parallel shift amount registered in the first and second array steps and an internal representation value corresponding to a character following the prefix of the character string; and
a third array step of registering the index of the prefix of the character string, with the sum obtained in the key candidate point calculation step as a subscript. - View Dependent Claims (24, 25, 26, 27, 28, 29, 30, 31, 32)
judging which of number information of a leading specific character code locating at the head of an idiom, or a parallel shift amount of another leading character code or a prefix of a character string is a registration content registered in said first array step; and
when the registration content registered in said first array step is judged to be the number information of the leading specific character code, calculating a parallel shift amount referring to an array place designated by the number information in said second array step.
-
-
25. The character code registration retrieval method according to claim 24, further comprising the step of referring said second array step based on both the number information of the leading specific character code of the character string and a classification of a character code following the leading specific character code.
-
26. The character code registration retrieval method according to claim 25, further comprising the step of classifying the character code following the leading specific character code by utilizing a code value of the following character code.
-
27. The character code registration retrieval method according to claim 23, wherein said second array step includes the step of selecting a character frequently used to make idioms as the specific character following the prefix of the character string.
-
28. The character code registration retrieval method according to claim 23, wherein said second array step includes the step of, when characters following the prefix of the character string is a part of idioms, selecting characters in which a width of code values exceeds a predetermined threshold, as specific characters following the prefix of the character string.
-
29. The character code registration retrieval method according to claim 23, comprising:
-
a list step of generating a list of character codes frequently used in idioms and outputting a selection character code selected from the list of character codes;
a frequently-appearing character code selection step of outputting a frequency threshold regarding up to what order number of character codes should be selected;
a frequently-appearing character code storage step of storing the selection character code selected in said list step and outputting the selection character code and an index of the selection character code;
a dictionary step of using a character code dictionary registering idioms composed of character codes, classifying a job˜
according to whether or not a focussed character is a prefix of an idiom composed of the selection character code and outputting each of groups obtained by classifying a character code following the selection character code of the prefix;
a group storage step of storing the groups obtained by classifying the character code following the selection character code of the prefix generated in said dictionary step;
a first BASE array step as said first array step of calculating number information of the selection character code and storing the number information in a position of an index of the selection character code on a first BASE array;
a code classification step of classifying the second character code of the idiom using at least one bit of the second character code in order to classify the characters following the selection character code of the prefix;
a parallel shift amount calculation step of calculating for each of the groups a minimum parallel shift amount, so that all values obtained by adding an arbitrary parallel shift amount to an internal representation value of each character of the group may indicate empty positions on a CHECK array;
a parallel shift amount storage step of storing the parallel shift amount generated in said parallel shift amount calculation step and outputting the parallel shift amount to a second BASE array;
a key candidate point calculation step of calculating for each of said groups a sum of the parallel shift amount and the internal representation value of each character of the group as a subscript of the CHECK array and designating a value of the sum for an index of a next prefix consisting of ((prefix)+(focussed character));
a second BASE array step as said second array step of storing the parallel shift amount for each of the groups outputted in said parallel shift amount storage step based on both the code value generated in said code classification step and the number information generated in said list step; and
a CHECK array step as said third array step of registering an index of a prefix being a parent of each character code of the group in a place corresponding to the value of said sum in the CHECK array.
-
-
30. The character code registration retrieval method according to claim 29, comprising:
when characters following the prefix of the character string is a part of idioms, selecting characters in which a width of code values exceeds a predetermined threshold, as specific characters following the prefix of the character string.
-
31. The character code registration retrieval method according to claim 23, comprising:
-
a document input step of first designating a root of a TRIE structure for a prefix, setting up an end mark in the prefix as an end symbol, then instructing to input a character code of a character to be retrieved, and detecting a prefix of the inputted character code;
a first BASE array step as said first array step of extracting a numeric value from a place corresponding to the index of the prefix of a first BASE array;
a registration value judgement step of judging which of number information of a frequently-appearing prefix character code or a parallel shift amount is a numeric value generated in said first BASE array step, and when the numeric value is out of a range of indexes composing a TRIE, outputting the numeric value as the number information of the prefix character code, and when the numeric value is within the range of indexes, outputting the numeric value as a parallel shift amount;
a code classification step of, when the numeric value generated in from said first BASE array step is the number information of the prefix character code, classifying the inputted character code using at least one bit of the character code;
a second BASE array step as said second array step of extracting a parallel shift amount from a place of a second BASE array corresponding to both the number information of the prefix character code generated in the registration value judgement step and a classification of the character code;
a parallel shift amount storage step of, when the numeric value generated in said first BASE array step is a parallel shift amount, storing the parallel shift amount;
a key candidate point calculation step of calculating a sum of the parallel shift amount and an internal representation value of the inputted character;
a CHECK array step as said third array step of extracting a key from a place of a CHECK array corresponding to the sum calculated in said key candidate point calculation step; and
a key/prefix collation step of judging whether or not the key generated in said CHECK array step coincides with the index of the prefix, and when the inputted key is judged to coincide with the index of the prefix, judging that an idiom is registered in a dictionary.
-
-
32. The character code registration retrieval method according to claim 31, comprising the step of, when characters following the prefix of the character string is a part of idioms, selecting characters such as a width of code values exceeds a predetermined threshold, as specific characters following the prefix of the character string.
-
33. A string retrieval apparatus for retrieving a given string from a plurality of registered strings, comprising:
-
a storage unit to store a first array of number information for prefixes followed by strings of characters, a second array of displacement amounts stored at positions corresponding to the number information for a corresponding prefix, the displacement amounts corresponding to groups of characters obtained by classifying the strings of characters following the corresponding prefix;
a third array storing the index of the corresponding prefix in the first array at a position corresponding to a sum of the displacement amount for one of the groups of characters and an internal representation value of a character following the prefix in one of the strings of characters classified in the one of the groups of characters; and
a processor, coupled to said storage unit, to retrieve the given string using said first, second and third arrays.
-
Specification