Method and system for automatically extracting new word
Abstract
A method and system for automatically extracting new words are provided. Both are designed to extract new words efficiently from a large cleaned corpus.
15 Claims
1. A method of extracting new words automatically, said method comprising the steps of:
segmenting a cleaned corpus in a domain to form a segmented corpus;
splitting the segmented corpus to form sub strings, and counting the occurrences of each sub string appearing in the corpus; and
filtering out false candidates to output new words, wherein the new words are words not contained in a base vocabulary;
wherein the segmenting and the splitting is not dependent upon word boundaries;
wherein new words are determined based upon the domain of the cleaned corpus;
wherein the step of splitting and counting is implemented using a GAST (general atom suffix tree) contained in a reduced memory space;
wherein the GAST is implemented by limiting length of character sub strings.
(Dependent claims: 2, 3, 4, 5, 6, 7)
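The three steps recited in claim 1 (segment, split and count, filter) can be illustrated with a minimal sketch. It is not the patented implementation. The greedy matching in `segment`, the frequency threshold `min_count`, the single-character filter, and all function names are assumptions introduced here for illustration. The claim text does not specify how segmentation or false-candidate filtering is performed. A plain `Counter` stands in for the GAST.

```python
from collections import Counter


def segment(corpus, base_vocab, max_word_len=4):
    """Split text into atoms: known words where possible, else single characters.

    Greedy longest-match against the base vocabulary (an assumed strategy).
    """
    atoms = []
    i = 0
    while i < len(corpus):
        for n in range(min(max_word_len, len(corpus) - i), 0, -1):
            piece = corpus[i:i + n]
            if n == 1 or piece in base_vocab:
                atoms.append(piece)
                i += n
                break
    return atoms


def count_substrings(atoms, max_len=3):
    """Count every run of 1..max_len consecutive atoms (bounded length)."""
    counts = Counter()
    for start in range(len(atoms)):
        for n in range(1, max_len + 1):
            if start + n > len(atoms):
                break
            counts["".join(atoms[start:start + n])] += 1
    return counts


def extract_new_words(corpus, base_vocab, min_count=2, max_len=3):
    """Return candidate new words with their counts."""
    counts = count_substrings(segment(corpus, base_vocab), max_len)
    return {
        s: c
        for s, c in counts.items()
        # Hypothetical filter: frequent, longer than one character,
        # and absent from the base vocabulary.
        if c >= min_count and len(s) > 1 and s not in base_vocab
    }


candidates = extract_new_words("abxyabxyab", {"ab"})
```

Here `"xy"` occurs twice and is not in the base vocabulary, so it survives as a candidate, while `"ab"` is dropped because it is already known.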
8. An automatic new word extraction system, comprising:
a segmentor which segments a cleaned corpus in a domain to form a segmented corpus;
a splitter which splits the segmented corpus to form sub strings, and which counts the number of the sub strings appearing in the corpus; and
a filter which filters out false candidates to output new words, wherein the new words are words not contained in a base vocabulary;
wherein the segmenting and the splitting is not dependent upon word boundaries;
wherein new words are determined based upon the domain of the cleaned corpus;
wherein the splitter builds a GAST (general atom suffix tree) contained in a reduced memory space;
wherein the GAST limits the length of character sub strings.
(Dependent claims: 9, 10, 11, 12, 13, 14)
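The claim states that the splitter's GAST limits sub string length in order to fit a reduced memory space. The internal layout of the GAST is not described in this section, so the following is only a generic depth-bounded suffix trie over atoms, offered as a sketch of why a length limit shrinks the structure. The class names `Node` and `BoundedSuffixTrie` and the parameter `max_depth` are hypothetical.

```python
class Node:
    """Trie node holding an occurrence count and child links."""
    __slots__ = ("count", "children")

    def __init__(self):
        self.count = 0
        self.children = {}


class BoundedSuffixTrie:
    """Suffix trie that stores only the first max_depth atoms of each suffix."""

    def __init__(self, max_depth):
        self.max_depth = max_depth
        self.root = Node()

    def add(self, atoms):
        """Insert every suffix of the atom sequence, truncated to max_depth."""
        for start in range(len(atoms)):
            node = self.root
            for atom in atoms[start:start + self.max_depth]:
                child = node.children.get(atom)
                if child is None:
                    child = Node()
                    node.children[atom] = child
                child.count += 1
                node = child

    def count(self, atoms):
        """Occurrences of the atom sequence; 0 if absent or longer than max_depth."""
        node = self.root
        for atom in atoms:
            node = node.children.get(atom)
            if node is None:
                return 0
        return node.count

    def size(self):
        """Number of nodes below the root."""
        stack, total = [self.root], 0
        while stack:
            node = stack.pop()
            total += len(node.children)
            stack.extend(node.children.values())
        return total


trie = BoundedSuffixTrie(max_depth=2)
trie.add(list("abcabc"))
```

For the six-atom input above, a depth limit of 2 yields 6 nodes, whereas the same trie with no effective limit (`max_depth=6`) needs 15. The trade-off is that sub strings longer than the limit cannot be counted.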
15. A program storage device readable by machine, tangibly embodying a program of instructions executable by the machine to perform method steps for extracting new words automatically, said method comprising the steps of:
segmenting a cleaned corpus in a domain to form a segmented corpus;
splitting the segmented corpus to form sub strings, and counting the occurrences of each sub string appearing in the corpus; and
filtering out false candidates to output new words, wherein the new words are words not contained in a base vocabulary;
wherein the segmenting and the splitting is not dependent upon word boundaries;
wherein new words are determined based upon the domain of the cleaned corpus;
wherein the step of splitting and counting is implemented using a GAST (general atom suffix tree) contained in a reduced memory space;
wherein the GAST is implemented by limiting length of character sub strings.
Specification