Method and apparatus for classifying text
Abstract
The present invention provides a method and apparatus for classifying text by using two constants determined by analyzing the text. The first constant, G, classifies text in the order of constraint. It is defined by the equation G = log(N/L) / {log(N) - 1}, where N is the number of words and L is the number of different words in the text being classified. The second constant, R, is the correlation coefficient between the word length and the logarithm scaled rank order of word frequency. The values of the two constants can be used to determine how to classify text. In the case of English text, the text may be classified as computer language, text from a technical manual, English text written by foreigners or English text written by native English speakers.
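The two constants described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation. Several details are assumptions the abstract does not state: the function names (tokenize, constant_g, constant_r, pearson), the tokenization rule, base-10 logarithms, the Pearson form of the correlation coefficient, and alphabetical tie-breaking when words share a frequency.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Assumption: a "word" is a run of letters or apostrophes, case-folded.
    return re.findall(r"[a-z']+", text.lower())


def constant_g(words):
    """G = log(N/L) / (log(N) - 1), with N total words and L distinct words."""
    n = len(words)
    if n <= 10:
        # With base-10 logs (an assumption) the denominator is <= 0 for N <= 10.
        raise ValueError("need more than 10 words")
    distinct = len(set(words))
    return math.log10(n / distinct) / (math.log10(n) - 1)


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def constant_r(words):
    """Correlation between word length and log-scaled frequency rank."""
    freq = Counter(words)
    # Rank 1 is the most frequent word; ties broken alphabetically (assumption).
    ranked = sorted(freq, key=lambda w: (-freq[w], w))
    lengths = [len(w) for w in ranked]
    log_ranks = [math.log10(rank) for rank in range(1, len(ranked) + 1)]
    return pearson(lengths, log_ranks)
```

For a text of N = 100 words drawn from L = 10 distinct words, constant_g gives log(10) / (2 - 1) = 1 under the base-10 assumption. Calling constant_g(tokenize(text)) and constant_r(tokenize(text)) on the same text yields the pair of values that the abstract says is used for classification.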
11 Claims
-
1. A text classification method for use in a text classification system having first, second, and third counts, the method comprising the steps of
isolating incoming text by separating the incoming text into incoming words, incrementing said first count for each incoming word, comparing each incoming word with a first stored word list to determine if the incoming word matches with a stored word, adding the incoming word to the first stored list if there is no match and incrementing said second count, comparing each incoming word with a second stored word list to determine if the incoming word matches with a stored word and if there is no match, adding the incoming word to the second stored word list, incrementing said third count, determining a first constant by a ratio of said first count to said second count and determining a second constant by a ratio of said third count to said first count, and determining the text classification by the value of said first and second constants.
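The counting steps recited in claim 1 can be sketched as a single pass over the incoming words. This is a hedged illustration only. The function name count_and_ratio, the use of Python sets for the stored word lists, and the optional pre-seeded lists are assumptions. The claim does not say exactly when the third count is incremented; the sketch assumes it is incremented each time a word is added to the second stored list.

```python
def count_and_ratio(words, first_list=None, second_list=None):
    """One pass over the incoming words, following the steps of claim 1."""
    # Assumption: the stored word lists are sets, optionally pre-seeded.
    first_list = set(first_list or ())
    second_list = set(second_list or ())
    first_count = second_count = third_count = 0

    for word in words:
        first_count += 1                 # one increment per incoming word
        if word not in first_list:       # no match in the first stored list
            first_list.add(word)
            second_count += 1
        if word not in second_list:      # no match in the second stored list
            second_list.add(word)
            third_count += 1             # assumption: counted on each addition

    if first_count == 0 or second_count == 0:
        raise ValueError("ratios are undefined when a count is zero")

    return {
        "first_count": first_count,
        "second_count": second_count,
        "third_count": third_count,
        "first_constant": first_count / second_count,   # first / second count
        "second_constant": third_count / first_count,   # third / first count
    }
```

With both lists initially empty, the second and third counts coincide under this reading; they differ only when the second stored list starts with entries, for example count_and_ratio(words, second_list={"the", "cat"}).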
-
2. A text classification apparatus for use in a text classification system having first, second and third counts, the apparatus comprising
means for storing a word list, means for isolating incoming text by separating the incoming text into incoming words, means for incrementing said first count for each incoming word, means for comparing each incoming word with a first stored word list to determine if the incoming word matches with a stored word, means for adding the incoming word to the first stored list if there is no match and incrementing said second count, means for comparing each incoming word with a second stored word list to determine if the incoming word matches with a stored word and if there is no match, adding the incoming word to the second stored word list, means for incrementing said third count, means for determining a first constant by a ratio of said first count to said second count and determining a second constant by a ratio of said third count to said first count, and means for determining the text classification by the value of said first and second constants.
-
9. A text classification method for use in an English text classification system having first, second and third counts, the method comprising the steps of
isolating incoming text by separating the incoming text into incoming words, incrementing said first count for each incoming word, comparing each incoming word with a first stored word list to determine if the incoming word matches with a stored word, adding the incoming word to the first stored list if there is no match and incrementing said second count, comparing each incoming word with a second stored word list to determine if the incoming word matches with a stored word, and if there is no match, adding the incoming word to the second stored word list, incrementing said third count, determining the English text classification by the value of said first and second constants wherein the text classification is computer language, manual, English text written by a native English speaker or English text written by a non-native English speaker.
-
10. A text classification method for use in a text classification system having first and second counts, the method comprising the steps of
isolating incoming text by separating the incoming text into incoming words, incrementing said first count for each incoming word, a first comparing step for comparing each incoming word with a first stored word list to determine if the incoming word matches with a stored word, adding the incoming word to the first stored list if there is no match and incrementing said second count, a second comparing step for comparing each incoming word with a second stored word list to determine if the incoming word matches with a stored word and if there is no match, adding the incoming word to the second stored word list, and wherein said second comparing step increments a third count, and further including the steps of determining a first constant by a ratio of said first count to said second count and determining a second constant by a ratio of said third count to said first count, and determining the text classification by the value of said first and second constants.
-
11. A text classification apparatus for use in a text classification system having first and second counts, the apparatus comprising
means for storing a word list, means for isolating incoming text by separating the incoming text into incoming words, means for incrementing said first count for each incoming word, first comparing means for comparing each incoming word with a first stored word list to determine if the incoming word matches with a stored word, means for adding the incoming word to the first stored list if there is no match and incrementing said second count, and second comparing means for comparing each incoming word with a second stored word and if there is no match, adding the incoming word to the second stored word list, and wherein said second comparing means for comparing increments a third count, and further including means for determining a first constant by a ratio of said first count to said second count and determining a second constant by a ratio of said third count to said first count, and means for determining the text classification by the value of said first and second constants.
Specification