Method and apparatus for automatically identifying keywords within a document
Abstract
A trainable method of extracting keywords of one or more words is disclosed. According to the method, every word within a document that is not a stop word is stemmed and evaluated and receives a score. The scoring is performed based on a plurality of parameters which are adjusted through training prior to use of the method for keyword extraction. Each word having a high score is then replaced by a word phrase that is delimited by punctuation or stop words. The word phrase is selected from word phrases having the stemmed word therein. Repeated keywords are removed. The keywords are expanded and capitalisation is determined. The resulting list forms extracted keywords.
19 Claims
1. A method of generating a plurality of human intelligible keywords from an electronic, stored document including phrases, stop words delimiting the phrases, and punctuation, the method comprising the steps of:
a) providing features selected to be indicative of word/phrase significance, providing a training document and a set of human intelligible keywords dependent upon the training document, and producing training results in dependence upon the document and the human intelligible keywords, the training results including parameter values indicative of feature weighting for weighting the provided features in order to determine a measure of word/phrase significance;
b) using a computer to select from the document raw phrases comprised of one or more contiguous words excluding stop words, by utilizing stop words, or stop words and punctuation, to determine raw phrases to be selected; and
c) using a form of the raw phrases, generating the plurality of human intelligible keywords by evaluating the selected raw phrases based on the provided features and the parameter values, wherein the step of selecting raw phrases is performed in dependence upon the training results and in the absence of part-of-speech tagging and a lexicon of target human intelligible keywords.
a frequency of the raw phrase occurrence within the document;
a measure of closeness to a starting portion of the document; and
a length of the raw phrase.
4. A method of generating a plurality of human intelligible keywords as defined in claim 1, wherein stop words or stop words and punctuation are used as delimiters to locate raw phrases to be selected.
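As an illustration only, the raw-phrase selection of claims 1 and 4 and the three features listed above (frequency, closeness to the start of the document, phrase length) might be sketched as follows. The stop-word list, the function names, and the way closeness is expressed as a relative word position are assumptions made for this sketch and are not taken from the claims. No part-of-speech tagger or keyword lexicon is used.

```python
import re

# Hypothetical, deliberately tiny stop-word list; a practical list would be far longer.
STOP_WORDS = {"a", "an", "and", "are", "by", "for", "in", "is", "of", "or", "the", "to"}


def select_raw_phrases(text, stop_words=STOP_WORDS):
    """Return (phrase, word_position) pairs for runs of contiguous non-stop words."""
    phrases = []
    position = 0  # index of the current word within the document
    # Punctuation splits the text into segments, so no phrase spans a punctuation mark.
    for segment in re.split(r"[^\w\s'-]+", text.lower()):
        current, start = [], position
        for word in segment.split():
            if word in stop_words:
                # A stop word closes the phrase collected so far.
                if current:
                    phrases.append((" ".join(current), start))
                current = []
            else:
                if not current:
                    start = position
                current.append(word)
            position += 1
        if current:
            phrases.append((" ".join(current), start))
    return phrases


def phrase_features(text, stop_words=STOP_WORDS):
    """Map each raw phrase to its frequency, relative first position, and length."""
    total_words = max(1, len(text.split()))
    features = {}
    for phrase, start in select_raw_phrases(text, stop_words):
        entry = features.setdefault(
            phrase,
            {
                "frequency": 0,
                "first_position": start / total_words,  # 0.0 means the very start
                "length": len(phrase.split()),
            },
        )
        entry["frequency"] += 1
    return features
```

Calling `phrase_features` on a short text returns one entry per distinct raw phrase; a smaller `first_position` value means the phrase first appears closer to the beginning of the document.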
5. A method of generating a plurality of human intelligible keywords from an electronic, stored document including phrases, stop words delimiting the phrases, and punctuation, the method comprising the steps of:
a) providing a plurality of parameter values relating to weights and determined through a process of training;
b) using a computer to select from the document, raw phrases comprised of one or more contiguous words excluding stop words; and
c) using a form of the raw phrases, generating the plurality of human intelligible keywords by evaluating the selected raw phrases in dependence upon the parameter values used for weighting in order to determine a measure of human intelligible keyword significance, how closely a human intelligible keyword reflects the electronic, stored document.
c) providing a training document;
d) providing a set of human intelligible keywords that are dependent upon the training document;
e) providing a set of weights that are independent of the training document;
f) performing steps (a) and (b) on the training document;
g) comparing the generated human intelligible keywords with the provided human intelligible keywords;
h) until the comparison is within predetermined limits, adjusting the weights in dependence upon the comparison and iterating steps (f) through (h), wherein the values of the adjusted weights form the parameter values.
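The training loop in steps (c) through (h) can be sketched as below, purely as an illustration. The claims do not say how the weights are adjusted, so the random perturbation used here, the linear weighted-sum score, the overlap measure, and all function names are assumptions of this sketch. Candidate phrases are assumed to arrive as a mapping from phrase to a tuple of feature values.

```python
import random


def extract(candidates, weights, k):
    """Rank candidate phrases by a weighted sum of feature values; keep the top k."""
    def score(phrase):
        return sum(w * f for w, f in zip(weights, candidates[phrase]))
    return sorted(candidates, key=score, reverse=True)[:k]


def agreement(generated, provided):
    """Fraction of the provided keywords that were also generated (step g)."""
    return len(set(generated) & set(provided)) / len(provided)


def train(candidates, provided, weights, target=1.0, max_rounds=1000, seed=0):
    """Adjust the weights until the comparison reaches `target` or rounds run out."""
    rng = random.Random(seed)
    k = len(provided)
    best = agreement(extract(candidates, weights, k), provided)  # steps f and g
    for _ in range(max_rounds):
        if best >= target:  # comparison is within the predetermined limit
            break
        # Step h (assumed rule): perturb every weight by a small random amount.
        trial = [w + rng.uniform(-0.5, 0.5) for w in weights]
        result = agreement(extract(candidates, trial, k), provided)
        if result >= best:  # keep adjustments that do not make the match worse
            weights, best = trial, result
    return weights, best
```

The returned weights play the role of the parameter values of step (a). Because a trial is kept only when agreement does not drop, the reported agreement never falls below that of the initial weights.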
9. A method of generating a plurality of human intelligible keywords from a document as defined in claim 5, wherein the step of training comprises the steps of:
c) providing a plurality of training documents;
d) providing sets of human intelligible keywords for each training document;
e) providing a set of weights that are independent of the training document;
f) performing steps (a) and (b) on the training documents;
g) comparing the human intelligible keywords generated for each document with the human intelligible keywords provided for said document;
h) until the comparisons are within predetermined limits, adjusting the weights in dependence upon the comparisons and iterating steps (f) through (h), wherein the values of the adjusted weights form the parameter values.
10. A method of generating a plurality of human intelligible keywords from a document as defined in claim 9, wherein the training is performed using a genetic algorithm.
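Claim 10 names a genetic algorithm without fixing its details. One possible shape, offered as a sketch under stated assumptions, is shown below: each individual is a weight vector, fitness is the mean overlap between generated and provided keywords across the training documents, the fitter half of the population survives, and children are produced by uniform crossover plus a single Gaussian mutation. The population size, mutation scale, and every name in the code are hypothetical.

```python
import random


def top_k(candidates, weights, k):
    """Rank candidate phrases by a weighted sum of feature values; keep the top k."""
    def score(phrase):
        return sum(w * f for w, f in zip(weights, candidates[phrase]))
    return sorted(candidates, key=score, reverse=True)[:k]


def fitness(weights, training_set):
    """Mean fraction of provided keywords recovered across all training documents."""
    total = 0.0
    for candidates, provided in training_set:
        found = top_k(candidates, weights, len(provided))
        total += len(set(found) & set(provided)) / len(provided)
    return total / len(training_set)


def evolve(training_set, n_weights, pop_size=20, generations=30, seed=0):
    """Evolve a weight vector; the fitter half always survives (elitism)."""
    rng = random.Random(seed)
    population = [
        [rng.uniform(-1, 1) for _ in range(n_weights)] for _ in range(pop_size)
    ]
    for _ in range(generations):
        population.sort(key=lambda w: fitness(w, training_set), reverse=True)
        parents = population[: pop_size // 2]  # selection
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            child = [rng.choice(pair) for pair in zip(a, b)]  # uniform crossover
            i = rng.randrange(n_weights)
            child[i] += rng.gauss(0, 0.3)  # mutate one weight
            children.append(child)
        population = parents + children
    return max(population, key=lambda w: fitness(w, training_set))
```

Here `training_set` is a list of `(candidates, provided_keywords)` pairs, one per training document, matching steps (c) and (d) of claim 9. Keeping the parents in the next generation means the best fitness cannot decrease from one generation to the next.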
11. A method of generating a plurality of human intelligible keywords from a document as defined in claim 5, comprising the step of determining an ordering of the human intelligible keywords in dependence upon training data sets independent of the document.
12. A method of generating a plurality of human intelligible keywords from a document as defined in claim 11, wherein the step of determining an ordering is based on an evaluation of a plurality of indicators for each key word, and wherein each indicator is weighted with a weighting factor, similar indicators evaluated for different human intelligible keywords using a same weighting factor.
13. A method of generating a plurality of human intelligible keywords from a document as defined in claim 5, wherein the plurality of weighted criteria forms a decision tree.
14. A method of generating a plurality of human intelligible keywords from a document as defined in claim 5, further comprising the step of stemming words within selected phrases by truncating the words to a predetermined number of characters.
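Stemming by truncation, as recited in claim 14, is simple enough to show in a few lines. The default length of 5 characters and the function names are assumptions for illustration; the claim says only that the number of characters is predetermined.

```python
def truncate_stem(word, length=5):
    """Stem a word by keeping only its first `length` characters (lowercased)."""
    return word.lower()[:length]


def stem_phrase(phrase, length=5):
    """Apply truncation stemming to every word of a phrase."""
    return " ".join(truncate_stem(word, length) for word in phrase.split())
```

With a length of 5, "extraction" and "extracting" both reduce to "extra", so they are treated as the same word without any dictionary or morphological analysis.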
15. A method of generating a plurality of human intelligible keywords from an electronic, stored document including phrases, stop words delimiting the phrases, and punctuation, the method comprising the steps of:
aa) providing a plurality of indicators and a weight associated with each of the indicators, each indicator and associated weight indicative of word/phrase significance within the document;
a) generating a list of words within the document that are not stop words for determining a score in dependence upon an evaluation of each word of the list in dependence upon the plurality of indicators and the associated and same weights for each indicator, scores for different words in the list determined using same indicators and same weights;
b) ordering the list of words in dependence upon scores;
contiguous words excluding stop words; and
c) for each word in the list, selecting all raw phrases of one or more words containing a word having a predetermined similarity for determining a score for each selected raw phrase; and
d) replacing said word in the list with a most desirable word/phrase comprising a word having a predetermined similarity.
aa) stemming each word in the first list by the ordered steps of selecting a number of characters; and
truncating words within the raw phrases to a length corresponding to the selected number of characters;
dd) stemming each word in each selected word phrase;
ff) unstemming the word phrases in the list of replaced word stems.
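The word-then-phrase flow of claims 15 and 16 might look roughly like the sketch below. It is a simplified illustration, not the claimed method: word frequency stands in for the weighted indicators, a shared truncated stem stands in for the "predetermined similarity", and the "most desirable" phrase is assumed to be the most frequent matching phrase, with longer phrases winning ties. All names and default values are hypothetical.

```python
from collections import Counter


def stem(word, n=5):
    """Truncation stem: the first n characters of the word."""
    return word[:n]


def keywords_from_words(raw_phrases, n=5, max_keywords=3):
    """Score stemmed words, then replace each top word with its best raw phrase.

    `raw_phrases` is a list of lowercase phrases in document order.
    """
    # Score every stemmed word; plain frequency is an assumed stand-in for
    # the weighted indicators of step (a).
    word_scores = Counter(stem(w, n) for p in raw_phrases for w in p.split())
    ordered = [s for s, _ in word_scores.most_common()]  # step (b): order by score
    phrase_counts = Counter(raw_phrases)
    keywords = []
    for word_stem in ordered:
        # Step (c): all raw phrases containing a word with the same stem.
        matching = [
            p for p in phrase_counts
            if any(stem(w, n) == word_stem for w in p.split())
        ]
        # Step (d): replace the word with its most desirable phrase.
        best = max(matching, key=lambda p: (phrase_counts[p], len(p.split())))
        if best not in keywords:  # skip phrases already chosen
            keywords.append(best)
        if len(keywords) == max_keywords:
            break
    return keywords
```

Because matching is done on stems while the original phrases are kept for output, the returned keywords are already in unstemmed, human-readable form.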
17. A method of generating a plurality of human intelligible keywords from a document as defined in claim 16, comprising the step of selecting at most a predetermined number of different words from the list of words.
18. A method of generating a plurality of human intelligible keywords from a document as defined in claim 16, wherein the step of replacing said word comprises the step of removing duplicate word phrases from the list of replaced words.
19. A method of generating a plurality of human intelligible keywords from a document as defined in claim 15 wherein at least one of steps (b) and (e) is performed in dependence upon a plurality of weighted criteria, the weights determined by a step of training.
Specification