Apparatus and method for compressing texts
Abstract
A text compressing apparatus comprising a morpheme-parsing unit for taking out words from sentences retrieved from an external storage unit, a dictionary retrieving unit for converting the morpheme-parsed words into form marks indicating the form of the detected words while referring to a parse dictionary, a structure-parsing unit for generating an equation tree for each sentence, an expression rewriting unit for rewriting the words which are referred to as nodes into their representative words, an equation converting unit for converting the equation trees rewritten into the representative expressions into words, and a Huffman coding unit for converting the words from the equation converting unit into strings of bits.
40 Claims
1. A text file compressing apparatus comprising:
word detecting means for detecting words in sentences comprising a text file;
representative word storage means for classifying a plurality of words into a plurality of sets of synonymous words and storing the sets with a representative word for each set;
representative word retrieval means for retrieving the representative words for the detected words from their respective sets in said representative word storage means;
representative word rewrite means for rewriting the detected words into the retrieved representative words; and
coding means for coding the text file after the detected words are rewritten by said representative word rewrite means.
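The rewriting step of claim 1 can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the synonym sets, the regex-based word detection, and the names `SYNONYM_SETS`, `REPRESENTATIVE`, and `rewrite_to_representatives` are all assumptions made for the example.

```python
import re

# Hypothetical synonym sets (not from the patent). The first entry of each
# set is treated as that set's representative word.
SYNONYM_SETS = [
    ["big", "large", "huge"],
    ["fast", "quick", "rapid"],
]

# Map every member of a set to the representative word of that set.
REPRESENTATIVE = {
    word: group[0] for group in SYNONYM_SETS for word in group
}


def rewrite_to_representatives(text):
    """Replace each detected word with its representative, if it has one."""
    def swap(match):
        word = match.group(0)
        # Words outside every synonym set are left unchanged.
        return REPRESENTATIVE.get(word.lower(), word)

    # Word detection is assumed to be a plain regex over alphabetic runs.
    return re.sub(r"[A-Za-z]+", swap, text)


print(rewrite_to_representatives("a huge and rapid change"))
```

Because several words collapse onto one representative, the rewritten text has a smaller vocabulary for the coding means to handle; the original word choice is not recoverable from the output of this sketch.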
-
2. The text file compressing apparatus of claim 1, wherein said coding means includes:
correspondence table storage means for storing a correspondence table showing a relation between a plurality of words and strings of bits; and
string-of-bits conversion means for converting each detected word in the text file into the string of bits after the detected words are rewritten into the representative words by said representative word rewrite means.
-
3. The text file compressing apparatus of claim 2, wherein:
the strings of bits in said correspondence table are variable; and
the representative word in each set is assigned a shortest string of bits compared with other words in the set.
-
4. The text file compressing apparatus of claim 3, wherein words with high probabilities with which the words appear in the texts are assigned short strings of bits, and words with small probabilities are assigned long strings of bits.
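Claims 3 and 4 describe a variable-length code in which frequent words receive short bit strings, and the abstract names Huffman coding for this step. The sketch below builds such a correspondence table with Huffman's algorithm. The function names `build_word_codes` and `encode` and the sample sentence are assumptions for illustration; the claims do not prescribe this exact construction.

```python
import heapq
from collections import Counter
from itertools import count


def build_word_codes(words):
    """Return a {word: bit string} table built with Huffman's algorithm."""
    freq = Counter(words)
    if not freq:
        return {}
    if len(freq) == 1:
        # A single distinct word still needs a one-bit code.
        return {next(iter(freq)): "0"}

    tiebreak = count()  # unique counter so tied entries never compare dicts
    heap = [(n, next(tiebreak), {w: ""}) for w, n in freq.items()]
    heapq.heapify(heap)

    while len(heap) > 1:
        # Merge the two least frequent subtrees, prefixing their codes.
        n1, _, left = heapq.heappop(heap)
        n2, _, right = heapq.heappop(heap)
        merged = {w: "0" + c for w, c in left.items()}
        merged.update({w: "1" + c for w, c in right.items()})
        heapq.heappush(heap, (n1 + n2, next(tiebreak), merged))

    return heap[0][2]


def encode(words, table):
    """Concatenate the bit strings of the words."""
    return "".join(table[w] for w in words)


words = "the cat saw the dog and the bird".split()
table = build_word_codes(words)
print(table["the"], len(encode(words, table)))
```

In this sample the most frequent word, "the", ends up with one of the shortest codes. If the text has already been rewritten to representative words, only representatives occur in it, so they are the words that compete for the short codes.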
-
5. The text file compressing apparatus of claim 2, wherein words with high probabilities with which the words appear in the texts are assigned short strings of bits, and words with small probabilities are assigned long strings of bits.
-
6. The text file compressing apparatus of claim 1, wherein said coding means includes:
dictionary storage means for storing a dictionary including a plurality of words arranged in an order of their respective probabilities; and
string-of-bits conversion means for converting the detected words in the text file into strings of bits after the detected words are rewritten into the representative words by said representative word conversion means, said strings of bits indicating a location of each word in the dictionary.
-
7. The text file compressing apparatus of claim 6 further comprising:
probability computing means for detecting the words in the sentences comprising the text file to compute probabilities with which the words appear in the text; and
dictionary generation means for generating a dictionary by arranging the detected words in relation with their respective probabilities, the resulting dictionary being stored in said dictionary storage means,
whereby the text file compressing apparatus yields the text file including strings of bits, and the dictionary generated by the dictionary generating means as a result of an operation.
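The dictionary-based coding of claims 6 and 7 can be illustrated with a short Python sketch. It orders the distinct words by how often they appear and codes each word as its position in that ordering. The fixed bit width per position is an assumption of this example (the claims only say the bits indicate a location in the dictionary), and the function names are hypothetical.

```python
from collections import Counter


def build_rank_dictionary(words):
    """List distinct words from most to least frequent (ties: first seen)."""
    return [w for w, _ in Counter(words).most_common()]


def _width(dictionary):
    # Assumed fixed width: enough bits to address every dictionary position.
    return max(1, (len(dictionary) - 1).bit_length())


def encode_by_position(words, dictionary):
    """Code each word as the binary index of its dictionary position."""
    index = {w: i for i, w in enumerate(dictionary)}
    width = _width(dictionary)
    return "".join(format(index[w], f"0{width}b") for w in words)


def decode_by_position(bits, dictionary):
    """Recover the word sequence from the bit string and the dictionary."""
    width = _width(dictionary)
    return [
        dictionary[int(bits[i:i + width], 2)]
        for i in range(0, len(bits), width)
    ]


words = "to be or not to be".split()
dictionary = build_rank_dictionary(words)
bits = encode_by_position(words, dictionary)
print(dictionary, bits)
```

Both the bit string and the dictionary are needed to decode, which matches the statement in claim 7 that the apparatus yields the coded text file together with the generated dictionary.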
-
8. The text file compressing apparatus of claim 6, wherein the dictionary stored in said dictionary storage means is shared by a plurality of text files.
-
9. The text file compressing apparatus of claim 1 further comprising:
take-out means for taking out sentences separately from the text file;
structure-parse means for structure-parsing each sentence to generate an equation tree, said equation tree having a tree structure with a plurality of nodes linked one with another;
string-of-characters conversion means for converting the equation tree into strings of characters, a parent-child relation being represented by a parenthesis; and
sentence conversion means for converting each sentence taken out by said take-out means into the strings of characters,
and wherein said correspondence table storage means stores the parenthesis in relation with strings of bits,
whereby said converting means converts the parenthesis into a string of bits.
-
10. The text file compressing apparatus of claim 9, wherein said string-of-characters conversion means includes:
a string-of-characters buffer in which the equation tree in the form of strings of characters is arranged;
first placement means for placing a word corresponding to a parent node at a top of the equation tree, and the words corresponding to all children nodes within said string-of-characters buffer, all the children nodes being parenthesized;
current node set means for making each child node into a current node in turn;
second placement means for placing a word corresponding to the current node and the words corresponding to all children nodes, all the children nodes being parenthesized; and
repetition means for repeatedly activating said second placement means for each child node.
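The conversion of an equation tree into a parenthesized string, as recited in claims 9 and 10, can be sketched recursively in Python. The `Node` class, the space-separated token layout, and the sample tree are assumptions for illustration; the claims describe the same placement in terms of a buffer and repeated activation rather than recursion.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One node of a hypothetical equation tree: a word plus child nodes."""
    word: str
    children: list = field(default_factory=list)


def to_parenthesized(node):
    """Place the node's word first, then all of its children in parentheses."""
    if not node.children:
        return node.word
    # Each child is handled in turn as the current node.
    inner = " ".join(to_parenthesized(child) for child in node.children)
    # Parentheses are kept as separate tokens so they can be given their
    # own bit strings in a correspondence table.
    return f"{node.word} ( {inner} )"


tree = Node("saw", [Node("cat", [Node("the")]), Node("dog")])
print(to_parenthesized(tree))
```

For the sample tree this prints `saw ( cat ( the ) dog )`, where each parenthesis pair marks one parent-child relation.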
-
11. A text file compressing apparatus for compressing a text file, said text file compressing apparatus comprising:
detecting means for detecting words in sentences comprising a text file;
form storage means for storing form information for a plurality of words, said form information comprising infinitives and form marks indicating forms of the words;
form information retrieval means for retrieving the form information that matches with the words detected by said detecting means from said form storage means;
form information rewrite means for rewriting the words detected by said detecting means into their respective form information; and
coding means for coding the text file after the words detected by said detecting means are rewritten into the representative words by said form information rewrite means.
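The form-information rewriting of claim 11 replaces an inflected word with its infinitive plus a form mark. The Python sketch below shows one way this could look. The table contents, the `+MARK` token notation, and the function names are assumptions for illustration only.

```python
# Hypothetical form storage: surface word -> (infinitive, form mark).
FORM_TABLE = {
    "went": ("go", "PAST"),
    "runs": ("run", "3SG"),
    "running": ("run", "ING"),
}

# Inverse table used to restore the surface word when expanding.
INVERSE = {pair: surface for surface, pair in FORM_TABLE.items()}


def to_form_information(words):
    """Rewrite each known word as its infinitive followed by a form mark."""
    out = []
    for word in words:
        if word in FORM_TABLE:
            infinitive, mark = FORM_TABLE[word]
            out.extend([infinitive, "+" + mark])
        else:
            out.append(word)
    return out


def from_form_information(tokens):
    """Undo the rewrite by merging each form mark into the preceding word."""
    out = []
    for token in tokens:
        if token.startswith("+") and out:
            out[-1] = INVERSE[(out[-1], token[1:])]
        else:
            out.append(token)
    return out


tokens = to_form_information(["she", "went", "running"])
print(tokens)
```

After this rewrite, "runs" and "running" share the token "run", and the small set of form marks recurs often, so both can be given short bit strings by the coding means.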
-
12. The text file compressing apparatus of claim 11, wherein said coding means includes:
correspondence table storage means for storing a correspondence table showing a relation between a plurality of words and strings of bits; and
string-of-bits conversion means for converting each detected word in the text file into the string of bits after the detected words are rewritten into the representative words by said representative word rewrite means.
-
13. The text file compressing apparatus of claim 12, wherein the infinitives are assigned shorter strings of bits compared with their different forms in said correspondence table.
-
14. The text file compressing apparatus of claim 13, wherein words with high probabilities with which the words appear in the texts are assigned short strings of bits, and words with small probabilities are assigned long strings of bits.
-
15. The text file compressing apparatus of claim 14, wherein said correspondence storage means further stores the form marks in relation with the strings of bits,
whereby said string-of-bits conversion means converts the form marks into the strings of bits.
-
16. The text file compressing apparatus of claim 12, wherein said coding means includes:
dictionary storage means for storing a dictionary including a plurality of words arranged in an order of their respective probabilities; and
string-of-bits conversion means for converting the detected words in the text file into strings of bits after the detected words are rewritten into the representative words by said representative word conversion means, said strings of bits indicating a location of each word in the dictionary.
-
17. A text file compressing apparatus comprising:
first detecting means for detecting words in sentences comprising a text file;
form storage means for storing form information for a plurality of words, said form information comprising infinitives and form marks indicating forms of the words;
form information retrieval means for retrieving the form information that matches with the words detected by said first detecting means from said form storage means;
form information rewrite means for rewriting the words detected by said first detecting means into their respective form information;
second detecting means for detecting words in the sentences comprising the text file;
representative word storage means for classifying a plurality of words into a plurality of sets of synonymous words to store the sets with a representative word for each set;
representative word retrieval means for retrieving the representative words for the words detected by said second detecting means from their respective sets in said representative word storage means;
representative word rewrite means for rewriting the words detected by said second detecting means into the retrieved representative words; and
coding means for coding the text file after the words detected by said second detecting means are rewritten by said representative word rewrite means.
-
18. The text file compressing apparatus of claim 17, wherein said coding means includes:
correspondence table storage means for storing a correspondence table showing a relation between a plurality of words and strings of bits; and
string-of-bits conversion means for converting each detected word in the text file into the string of bits after the detected words are rewritten into the representative words by said representative word rewrite means.
-
19. The text file compressing apparatus of claim 18, wherein:
the strings of bits in said correspondence table are variable; and
the representative word in each set is assigned a shortest string of bits compared with other words in the set and the infinitives are assigned shorter strings of bits compared with their different forms in said correspondence table.
-
20. The text file compressing apparatus of claim 19, wherein the representative words and infinitives with high probabilities with which the words appear in the texts are assigned short strings of bits, and words with small probabilities are assigned long strings of bits in said correspondence table.
-
21. The text file compressing apparatus of claim 20, wherein said correspondence storage means further stores the form marks in relation with the strings of bits,
whereby said string-of-bits conversion means converts the form marks into the strings of bits.
-
22. A method for compressing a text file for a text file compressing apparatus including a text file storage unit for storing text files, a representative word storage unit for classifying a plurality of words into a plurality of sets of synonymous words to store the sets with a representative word for each set, and a code information storage unit for storing code information used to code words in the text file, said method comprising the steps of:
detecting words in sentences comprising the text file;
retrieving the representative words from the sets for the detected words;
rewriting the detected words with the retrieved representative words; and
coding the text file after the detected words are rewritten into the representative words while referring to the code information in said code information storage unit.
-
23. The method of claim 22, wherein said coding step includes:
converting the words in the text file into strings of bits after the detected words are rewritten into the representative words while referring to the code information in said code information storage unit, the code information being stored in the form of a correspondence table showing a relation between a plurality of words and strings of bits.
-
24. The method of claim 22, wherein said coding step includes:
converting the words in the text file into strings of bits after the words are rewritten into the representative words while referring to the code information, the code information being stored in said code information storage unit in the form of a dictionary including words arranged in an order of probabilities with which the words appear in the text file, said strings of bits showing an order of each word in the code information.
-
25. The method of claim 24 further comprising the steps of:
detecting the words in the sentences comprising the text file and computing probabilities with which the words appear in the text; and
generating a dictionary by arranging the detected words in relation with their respective probabilities, the resulting dictionary being stored in said code information storage unit.
-
26. The method of claim 22 further comprising the steps of:
taking out sentences separately from the text file;
structure-parsing each sentence and generating an equation tree, said equation tree having a tree structure with a plurality of nodes linked one with another;
converting the equation tree into strings of characters, a parent-child relation being represented by a parenthesis; and
converting each sentence taken out by said sentence taking out step into the strings of characters,
wherein said code information storage unit stores the parenthesis in relation with strings of bits,
whereby the parenthesis is converted into a string of bits in said sentence-to-string-of-characters converting step.
-
27. The method of claim 26, wherein said equation-tree-to-string-of-characters converting step includes:
arranging the equation tree in the form of strings of characters;
placing a word corresponding to a parent node at a top of the equation tree, and the words corresponding to all children nodes arranged in said arranging sub-step, all the children nodes being parenthesized;
making each child node into a current node in turn;
placing a word corresponding to the current node and the words corresponding to all children nodes, all the children nodes being parenthesized; and
repeating said secondly mentioned placing sub-step for each child node.
-
28. A method for compressing a text file for a text file compressing apparatus including a text file storage unit for storing text files, a form storage unit for storing a plurality of words in the form of infinitives and form marks specifying a form of each word, and a code information storage unit for storing code information used to code words in the text file, said method comprising the steps of:
detecting words in sentences comprising the text file;
retrieving form information that matches with the detected words from said form storage unit;
rewriting the detected words with their respective form information; and
coding the text file after the words are rewritten into the form information while referring to the code information in said code information storage unit.
-
29. The method of claim 28, wherein said coding step includes:
converting the words in the text file into strings of bits after the detected words are rewritten into the representative words while referring to the code information in said code information storage unit, the code information being stored in the form of a correspondence table showing a relation between a plurality of words and strings of bits.
-
30. The method of claim 29, wherein said rewriting step includes:
rewriting the form marks in the text file into strings of bits while referring to said correspondence table, said correspondence table further showing a relation between the strings of bits and form marks.
-
31. A method for compressing a text file for a text file compressing apparatus including a text file storage unit for storing text files, a representative word storage unit for classifying a plurality of words into a plurality of sets of synonymous words to store the sets with a representative word for each set, a form storage unit for storing a plurality of words in the form of infinitives and form marks specifying a form of each word, and a code information storage unit for storing code information used to code words in the text file, said method comprising the steps of:
detecting words in sentences comprising the text file;
retrieving form information that matches with the detected words from said form storage unit;
rewriting the detected words with the form information;
detecting words in the post-rewrite sentences produced by the rewriting step;
retrieving the representative words for the words detected in said post-rewrite-word detecting step from the sets from said representative word storage unit;
rewriting the post-rewrite words into their respective representative words; and
coding the text file after the post-rewrite words are rewritten into their respective representative words.
-
32. The method of claim 31, wherein said coding step includes:
converting the words in the text file into strings of bits after the detected words are rewritten into the representative words while referring to the code information in said code information storage unit, the code information being stored in the form of a correspondence table showing a relation between a plurality of words and strings of bits.
-
33. A text file compressing apparatus comprising:
word detecting means for detecting words in sentences comprising a text file;
representative word storage means for classifying a plurality of words into a plurality of sets of synonymous words and storing the sets with a representative word for each set;
representative word retrieval means for retrieving the representative words for the detected words from their respective sets in said representative word storage means;
representative word rewrite means for rewriting the detected words into the retrieved representative words;
coding means for coding the text file after the detected words are rewritten by said representative word rewrite means;
take-out means for taking out sentences separately from the text file;
structure-parse means for structure-parsing each sentence to generate an equation tree, said equation tree having a tree structure with a plurality of nodes linked one with another;
string-of-characters conversion means for converting the equation tree into strings of characters, a parent-child relation being represented by a parenthesis; and
sentence conversion means for converting each sentence taken out by said take-out means into the strings of characters,
and wherein said correspondence table storage means stores the parenthesis in relation with strings of bits,
whereby said converting means converts the parenthesis into a string of bits.
-
34. The text file compressing apparatus of claim 33, wherein said string-of-characters conversion means includes:
a string-of-characters buffer in which the equation tree in the form of strings of characters is arranged;
first placement means for placing a word corresponding to a parent node at a top of the equation tree, and the words corresponding to all children nodes within said string-of-characters buffer, all the children nodes being parenthesized;
current node set means for making each child node a current node in turn;
second placement means for placing a word corresponding to the current node and the words corresponding to all children nodes, all the children nodes being parenthesized; and
repetition means for repeatedly activating said second placement means for each child node.
-
35. A method for compressing a text file for a text file compressing apparatus, the apparatus including a text file storage unit for storing text files, a representative word storage unit for classifying a plurality of words into a plurality of sets of synonymous words to store the sets with a representative word for each set, and a code information storage unit for storing code information used to code words in the text file, said method comprising the steps of:
detecting words in sentences comprising the text file;
retrieving the representative words from the sets for the detected words;
rewriting the detected words with the retrieved representative words;
coding the text file after the detected words are rewritten into the representative words while referring to the code information in said code information storage unit;
taking out sentences separately from the text file;
structure-parsing each sentence and generating an equation tree, said equation tree having a tree structure with a plurality of nodes linked one with another;
converting the equation tree into strings of characters, a parent-child relation being represented by a parenthesis; and
converting each sentence taken out by said sentence taking out step into the strings of characters,
wherein said code information storage unit stores the parenthesis in relation with strings of bits,
whereby the parenthesis is converted into a string of bits in said sentence-to-string-of-characters converting step.
-
36. The method of claim 35, wherein said equation-tree-to-string-of-characters converting step further comprises the steps of:
arranging the equation tree in the form of strings of characters;
placing a word corresponding to a parent node at a top of the equation tree, and the words corresponding to all children nodes arranged in said arranging sub-step, all the children nodes being parenthesized;
making each child node into a current node in turn;
placing a word corresponding to the current node and words corresponding to all children nodes, all the children nodes being parenthesized; and
repeating said secondly mentioned placing sub-step for each child node.
-
37. A text file compressing apparatus comprising:
word detecting means for detecting words in sentences comprising a text file;
representative word storage means for classifying a plurality of words into a plurality of sets of synonymous words and storing the sets with a representative word for each set;
representative word retrieval means for retrieving the representative words for the detected words from their respective sets in said representative word storage means;
representative word rewrite means for rewriting the detected words into the retrieved representative words;
coding means for coding the text file after the detected words are rewritten by said representative word rewrite means;
take-out means for taking out sentences separately from the text file;
structure-parse means for structure-parsing each sentence to generate an equation tree, said equation tree having a tree structure with a plurality of nodes linked one with another;
string-of-characters conversion means for converting the equation tree into strings of characters, a parent-child relation being represented by a predetermined character; and
sentence conversion means for converting each sentence taken out by said take-out means into the strings of characters,
and wherein said correspondence table storage means stores the predetermined character in relation with strings of bits,
whereby said converting means converts the predetermined character into a string of bits.
-
38. The text file compressing apparatus of claim 37, wherein said string-of-characters conversion means includes:
a string-of-characters buffer in which the equation tree in the form of strings of characters is arranged;
first placement means for placing a word corresponding to a parent node at a top of the equation tree, and the words corresponding to all children nodes within said string-of-characters buffer, all the children nodes being marked by the predetermined character;
current node set means for making each child node a current node in turn;
second placement means for placing a word corresponding to the current node and the words corresponding to all children nodes, all the children nodes being marked by the predetermined character; and
repetition means for repeatedly activating said second placement means for each child node.
-
39. A method for compressing a text file for a text file compressing apparatus, the apparatus including a text file storage unit for storing text files, a representative word storage unit for classifying a plurality of words into a plurality of sets of synonymous words to store the sets with a representative word for each set, and a code information storage unit for storing code information used to code words in the text file, said method comprising the steps of:
detecting words in sentences comprising the text file;
retrieving the representative words from the sets for the detected words;
rewriting the detected words with the retrieved representative words;
coding the text file after the detected words are rewritten into the representative words while referring to the code information in said code information storage unit;
taking out sentences separately from the text file;
structure-parsing each sentence and generating an equation tree, said equation tree having a tree structure with a plurality of nodes linked one with another;
converting the equation tree into strings of characters, a parent-child relation being represented by a predetermined character; and
converting each sentence taken out by said sentence taking out step into the strings of characters,
wherein said code information storage unit stores the predetermined character in relation with strings of bits,
whereby the predetermined character is converted into a string of bits in said sentence-to-string-of-characters converting step.
-
40. The method of claim 39, wherein said equation-tree-to-string-of-characters converting step further comprises the steps of:
arranging the equation tree in the form of strings of characters;
placing a word corresponding to a parent node at a top of the equation tree, and the words corresponding to all children nodes arranged in said arranging sub-step, all the children nodes being marked by the predetermined character;
making each child node into a current node in turn;
placing a word corresponding to the current node and words corresponding to all children nodes, all the children nodes being marked by the predetermined character; and
repeating said secondly mentioned placing sub-step for each child node.
-
Specification