Method for extracting multi-word technical terms from text
Abstract
A method and apparatus for extracting multi-word technical terms from a text file in a computer system. Word strings are selected from the text that have at least two words, that have at most a specified maximum number of words, that include none of a special set of selected tokens, and that only include selected characters. Word strings that occur fewer than a specified minimum number of times in the text file are deleted. The remaining strings form a set of word strings very likely to be multi-word technical terms. The quality of the set of word strings can be improved by deleting word strings that do not satisfy certain grammatical constraints.
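The selection criteria in the abstract can be read as a predicate over a single word string, followed by a frequency cut. The sketch below is only an illustration under stated assumptions: the stoplist contents, the maximum of four words, the choice of letters and hyphens as the "selected characters", and the names `is_candidate` and `drop_rare` are all hypothetical and do not come from the patent.

```python
import re

# Hypothetical parameter values; the abstract gives no concrete ones.
STOPLIST = {"the", "of", "a", "is", "in", "and"}
MAX_WORDS = 4
ALLOWED = re.compile(r"[A-Za-z-]+")  # assumed set of "selected characters"


def is_candidate(words, stoplist=STOPLIST, max_words=MAX_WORDS):
    """Return True if a word string meets the selection criteria."""
    return (
        2 <= len(words) <= max_words                            # length bounds
        and not any(w.lower() in stoplist for w in words)       # no stop tokens
        and all(ALLOWED.fullmatch(w) for w in words)            # allowed characters only
    )


def drop_rare(counts, min_count):
    """Delete word strings that occur fewer than min_count times."""
    return {s: n for s, n in counts.items() if n >= min_count}
```

For instance, `is_candidate(["hidden", "Markov", "model"])` holds, while `["state", "of", "art"]` is rejected because it contains a stoplist token. The grammatical constraints mentioned at the end of the abstract are not modeled here.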
28 Claims
1. Programmed computer apparatus for extracting a list of candidate multi-word technical terms from an input text file, a multi-word technical term being a string of at least two words having a particular meaning in some technical field, said apparatus comprising:
means for storing a stoplist of tokens which are assumed to not occur in multi-word technical terms, a token being a word, character or string of characters delimited by blanks and/or punctuation;
means for storing a maximum length parameter specifying a maximum number of tokens in any candidate multi-word technical term to be extracted;
means responsive to the stored stoplist for extracting text fragments from an input text file by identifying delimiting tokens in the input text file, including means for identifying as a delimiting token each token in the input text file which is the same as a token in the stored stoplist, the identified delimiting tokens defining text fragments therebetween;
means for deriving from the extracted text fragments all possible subsequences of tokens having a length of at least two tokens and not more than a maximum number of tokens specified by the stored maximum length parameter;
means for testing each of the derived subsequences against at least one filtering condition; and
means for creating a sublist of the derived subsequences which pass the at least one filtering condition, the created sublist being the list of candidate multi-word technical terms.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10
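The elements of claim 1 can be sketched as a small pipeline: tokenize, split into fragments at delimiting tokens, derive subsequences up to the maximum length, then filter. This is a minimal illustration, not the patented implementation. It assumes that punctuation marks also act as delimiting tokens, that "subsequences" means contiguous runs of tokens, and that the single filtering condition is "all tokens alphabetic"; the function names and the sample stoplist are hypothetical.

```python
import re


def tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    return re.findall(r"[A-Za-z0-9]+|[^\sA-Za-z0-9]", text)


def extract_fragments(tokens, stoplist):
    """Cut the token stream at stoplist tokens (and, by assumption, punctuation)."""
    fragments, current = [], []
    for tok in tokens:
        if tok.lower() in stoplist or not tok.isalnum():
            if current:
                fragments.append(current)
            current = []
        else:
            current.append(tok)
    if current:
        fragments.append(current)
    return fragments


def subsequences(fragment, max_len):
    """All contiguous runs of 2..max_len tokens within one fragment."""
    n = len(fragment)
    return [
        tuple(fragment[i:i + k])
        for k in range(2, min(max_len, n) + 1)
        for i in range(n - k + 1)
    ]


def candidate_terms(text, stoplist, max_len,
                    filters=(lambda seq: all(t.isalpha() for t in seq),)):
    """Derive subsequences and keep those passing every filtering condition."""
    derived = [
        seq
        for frag in extract_fragments(tokenize(text), stoplist)
        for seq in subsequences(frag, max_len)
    ]
    return [seq for seq in derived if all(f(seq) for f in filters)]


text = "The hidden Markov model is a model of the speech signal."
stoplist = {"the", "is", "a", "of"}  # hypothetical stoplist
print(candidate_terms(text, stoplist, max_len=3))
```

With this input the fragments are "hidden Markov model", "model" and "speech signal"; the one-token fragment yields no candidates because every candidate needs at least two tokens.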
11. Programmed computer apparatus for extracting a list of candidate multi-word technical terms from an input text file, a multi-word technical term being a string of at least two words having a particular meaning in some technical field, said apparatus comprising:
means for storing a stoplist of tokens which are assumed to not occur in multi-word technical terms, a token being a word, character or string of characters delimited by blanks and/or punctuation;
means for storing a frequency parameter specifying a minimum frequency of occurrence for a candidate multi-word technical term to be extracted;
means responsive to the stored stoplist for extracting text fragments from an input text file by identifying delimiting tokens in the input text file, including means for identifying as a delimiting token each token in the input text file which is the same as a token in the stored stoplist, the identified delimiting tokens defining text fragments therebetween;
means for deriving from the extracted text fragments each possible subsequence of tokens having a length of at least two tokens and which occurs in the input text file with a frequency not less than specified by the stored frequency parameter;
means for testing each of the derived subsequences against at least one filtering condition; and
means for creating a sublist of the derived subsequences which pass the at least one filtering condition, the created sublist being the list of candidate multi-word technical terms.
Dependent claims: 12, 13, 14, 15, 16
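Claim 11 replaces the maximum length parameter with a minimum frequency of occurrence. One way to picture the derivation step is to count every subsequence of two or more tokens across the extracted fragments and keep those that reach the threshold. The sketch below assumes the fragments have already been extracted as lists of tokens, treats subsequences as contiguous runs, and counts exact (case-sensitive) token matches; the function name `frequent_subsequences` is hypothetical.

```python
from collections import Counter


def frequent_subsequences(fragments, min_freq):
    """Count contiguous runs of >= 2 tokens and keep those seen at least min_freq times."""
    counts = Counter(
        tuple(frag[i:j])
        for frag in fragments
        for i in range(len(frag))
        for j in range(i + 2, len(frag) + 1)
    )
    return {seq: n for seq, n in counts.items() if n >= min_freq}


fragments = [
    ["hidden", "Markov", "model"],
    ["hidden", "Markov", "model"],
    ["speech", "signal"],
]
print(frequent_subsequences(fragments, min_freq=2))
```

Here "speech signal" occurs only once and is dropped, while the three subsequences of "hidden Markov model" each occur twice and are kept for the later filtering step.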
17. A computer implemented method of extracting a list of candidate multi-word technical terms from an input text file, a multi-word technical term being a string of at least two words having a particular meaning in some technical field, said method comprising the computer implemented steps of:
storing a stoplist of tokens which are assumed to not occur in multi-word technical terms, a token being a word, character or string of characters delimited by blanks and/or punctuation;
storing a maximum length parameter specifying a maximum number of tokens in any candidate multi-word technical term to be extracted;
extracting text fragments from an input text file by identifying delimiting tokens in the input text file at least in part by identifying as a delimiting token each token in the input text file which is the same as a token in the stored stoplist, the identified delimiting tokens defining text fragments therebetween;
deriving from the extracted text fragments all possible subsequences of tokens having a length of at least two tokens and no more than a maximum number of tokens specified by the stored maximum length parameter;
testing each of the derived subsequences against at least one filtering condition; and
creating a sublist of the derived subsequences which pass the at least one filtering condition, the created sublist being the list of candidate multi-word technical terms.
Dependent claims: 18, 19, 20, 21, 22, 23, 24
25. A computer implemented method of extracting a list of candidate multi-word technical terms from an input text file, a multi-word technical term being a string of at least two words having a particular meaning in some technical field, said method comprising the computer implemented steps of:
storing a stoplist of tokens which are assumed to not occur in multi-word technical terms, a token being a word, character or string of characters delimited by blanks and/or punctuation;
storing a frequency parameter specifying a minimum frequency of occurrence for a candidate multi-word technical term to be extracted;
extracting text fragments from an input text file by identifying delimiting tokens in the input text file at least in part by identifying as a delimiting token each token in the input text file which is the same as a token in the stored stoplist, the identified delimiting tokens defining text fragments therebetween;
deriving from the extracted text fragments all possible subsequences of tokens having a length of at least two tokens and which occur in the input text file with a frequency not less than specified by the stored frequency parameter;
testing each of the derived subsequences against at least one filtering condition; and
creating a sublist of the derived subsequences which pass the at least one filtering condition, the created sublist being the list of candidate multi-word technical terms.
Dependent claims: 26, 27, 28
Specification