Method and apparatus for highlighting and categorizing documents using coded word tokens
Abstract
Highlighting and categorization of documents is carried out by using word tokens which represent words appearing in a document. Elimination of certain unimportant word tokens is first completed, after which the remaining words of the document are ranked according to their word token appearance rates. These rates are then used to highlight frequently appearing words in the document which indicate the document's topic. The document can also be categorized using document profiles developed from the word tokens.
50 Claims
1. A method for highlighting and categorizing images from a document using a sequence of word tokens representing words of the document, the word tokens comprising character shape code classes, each word of the document being represented by only one word token, the method comprising the steps of:
eliminating predetermined character shape code classes from said sequence of word tokens;
removing predetermined common function word tokens from said sequence of word tokens to form a reduced sequence of word tokens using a pattern matching technique and a stop token list;
determining word token frequency appearance rates for the word tokens of the reduced sequence;
ranking said frequency of appearance rates;
determining nth or more most frequently appearing word tokens based on the ranked frequency of appearance rates;
highlighting words of the document corresponding to the nth or more most frequently appearing word tokens.
Dependent claims: 2, 3, 4, 5.
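The steps of claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation. The shape code alphabet, the eliminated classes, and the stop token list below are hypothetical values chosen for illustration, the pattern matching is assumed to be a full regular-expression match, and "nth or more most frequently appearing" is read as "the n highest-ranked tokens".

```python
import re
from collections import Counter

# Hypothetical values, chosen only for illustration (not taken from the patent).
ELIMINATED_CLASSES = set(".,!-")           # assumed shape code classes to drop
STOP_TOKEN_PATTERNS = ["Axx", "xx", "ix"]  # assumed shape codes of function words


def strip_classes(token, classes=ELIMINATED_CLASSES):
    """Remove the eliminated shape code classes from a single word token."""
    return "".join(code for code in token if code not in classes)


def remove_stop_tokens(tokens, patterns=STOP_TOKEN_PATTERNS):
    """Drop empty tokens and tokens that fully match a stop token pattern."""
    stop_re = re.compile("|".join(f"(?:{p})" for p in patterns))
    return [tok for tok in tokens if tok and not stop_re.fullmatch(tok)]


def appearance_rates(tokens):
    """Map each token to its share of the reduced sequence."""
    if not tokens:
        return {}
    counts = Counter(tokens)
    return {tok: count / len(tokens) for tok, count in counts.items()}


def top_tokens(rates, n):
    """Rank tokens by rate (ties broken alphabetically) and keep the first n."""
    return sorted(rates, key=lambda tok: (-rates[tok], tok))[:n]


def highlight_positions(tokens, n):
    """Return the indices of the words whose tokens are among the top n."""
    cleaned = [strip_classes(tok) for tok in tokens]  # stays aligned with words
    rates = appearance_rates(remove_stop_tokens(cleaned))
    top = set(top_tokens(rates, n))
    return [i for i, tok in enumerate(cleaned) if tok in top]


if __name__ == "__main__":
    sequence = ["Axx", "xxAx", "xxAx.", "ix", "Agxx", "xxAx", "Agxx", "xigx"]
    print(highlight_positions(sequence, 2))
```

Because the cleaned token list stays aligned with the original sequence, the returned indices can be mapped straight back to word positions in the document image for highlighting.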
6. A method for highlighting and categorizing images from a document using a sequence of word tokens representing words of the document, the word tokens comprising character shape code classes, each word of the document being represented by only one word token, the method comprising the steps of:
eliminating predetermined character shape code classes from said sequence of word tokens;
removing predetermined common function word tokens from said sequence of word tokens to form a reduced sequence of word tokens using a pattern matching technique and a stop token list;
determining word token frequency appearance rates for the word tokens of the reduced sequence;
ranking said frequency of appearance rates;
determining nth or more most frequently appearing word tokens based on the ranked frequency of appearance rates;
highlighting words of the document corresponding to the nth or more most frequently appearing word tokens; and
categorizing the document into one of a plurality of pre-existing categories.
Dependent claims: 7, 8, 9.
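The categorizing step can be sketched as a comparison between a document profile and the profiles of pre-existing categories. The claim does not say how that comparison is made. The sketch below assumes a profile is a mapping from word token to appearance rate and uses cosine similarity as a stand-in measure. The category names and profile values are hypothetical.

```python
import math


def cosine_similarity(profile_a, profile_b):
    """Cosine similarity between two token -> appearance-rate profiles."""
    shared = set(profile_a) & set(profile_b)
    dot = sum(profile_a[tok] * profile_b[tok] for tok in shared)
    norm_a = math.sqrt(sum(rate * rate for rate in profile_a.values()))
    norm_b = math.sqrt(sum(rate * rate for rate in profile_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def categorize(document_profile, category_profiles):
    """Pick the pre-existing category whose profile is most similar."""
    if not category_profiles:
        raise ValueError("at least one category profile is required")
    return max(
        category_profiles,
        key=lambda name: cosine_similarity(document_profile, category_profiles[name]),
    )


if __name__ == "__main__":
    # Hypothetical profiles keyed by shape-coded word tokens.
    document = {"xxAx": 0.5, "Agxx": 0.3, "xigx": 0.2}
    categories = {
        "finance": {"xxAx": 0.6, "Agxx": 0.4},
        "sports": {"xigx": 0.7, "AxxA": 0.3},
    }
    print(categorize(document, categories))
```

Any other similarity or distance measure could be substituted in `categorize` without changing the surrounding steps.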
10. A method for highlighting and categorizing images from a document using a sequence of word tokens representing words of the document, the word tokens comprising character shape code classes, each word of the document being represented by only one word token, the method comprising the steps of:
eliminating predetermined character shape code classes from said sequence of word tokens;
removing predetermined common function word tokens and numerical word tokens from said sequence of word tokens to form a reduced sequence of word tokens using a pattern matching technique and a stop token list comprising an optional token list;
determining word token frequency appearance rates for the word tokens of the reduced sequence;
ranking said frequency of appearance rates;
determining nth or more most frequently appearing word tokens based on the ranked frequency of appearance rates; and
highlighting words of the document corresponding to the nth or more most frequently appearing word tokens.
Dependent claims: 11, 12, 13.
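Claim 10 adds numerical word tokens and an optional token list to the removal step. One way this might look, as a sketch only: numerical tokens are assumed to consist solely of a hypothetical digit shape code `N`, the base stop tokens are hypothetical, and the optional token list is treated as extra stop tokens supplied by the caller.

```python
import re

# Hypothetical values for illustration; none of them come from the patent.
BASE_STOP_TOKENS = ["Axx", "ix"]  # assumed shape codes of common function words
NUMERIC_PATTERN = r"N+"           # assumes every digit maps to shape code "N"


def build_stop_matcher(optional_tokens=()):
    """Compile one pattern covering numeric, base, and optional stop tokens."""
    literals = [re.escape(tok) for tok in (*BASE_STOP_TOKENS, *optional_tokens)]
    return re.compile("|".join([NUMERIC_PATTERN, *literals]))


def reduce_sequence(tokens, optional_tokens=()):
    """Return the reduced sequence with every fully matching token removed."""
    matcher = build_stop_matcher(optional_tokens)
    return [tok for tok in tokens if not matcher.fullmatch(tok)]


if __name__ == "__main__":
    sequence = ["Axx", "NNNN", "xxAx", "ix", "Agxx", "N"]
    print(reduce_sequence(sequence))
    print(reduce_sequence(sequence, optional_tokens=["Agxx"]))
```

Using `fullmatch` keeps a short stop token from removing a longer token that merely contains it.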
14. A method for highlighting and characterizing images from a document using a sequence of word tokens representing words of the document, the word tokens comprising character shape code classes, each word of the document being represented by only one word token, the method comprising the steps of:
eliminating predetermined character shape code classes from said sequence of word tokens;
removing predetermined common function word tokens and numerical word tokens from said sequence of word tokens to form a reduced sequence of word tokens using a pattern matching technique and a stop token list comprising an optional token list;
determining word token frequency appearance rates for the word tokens of the reduced sequence;
ranking said frequency of appearance rates;
determining nth or more most frequently appearing word tokens based on the ranked frequency of appearance rates;
highlighting words of the document corresponding to the nth or more most frequently appearing word tokens; and
categorizing the document into one of a plurality of pre-existing categories.
Dependent claims: 15, 16.
17. A method for highlighting and characterizing images from a document using a sequence of word tokens representing words of the document, the word tokens comprising character shape code classes, each word of the document being represented by only one word token, the method comprising the steps of:
eliminating predetermined character shape code classes from said sequence of word tokens;
removing predetermined common function word tokens from said sequence of word tokens to form a reduced sequence of word tokens using a pattern matching technique and a stop token list;
determining word token frequency appearance rates for the word tokens of the reduced sequence;
ranking said frequency of appearance rates;
determining nth or more most frequently appearing word tokens based on the ranked frequency of appearance rates; and
categorizing the document into one of a plurality of pre-existing categories.
Dependent claims: 18, 19, 20, 21.
22. A method for highlighting and categorizing images from a document using a sequence of word tokens representing words of the document, the word tokens comprising character shape code classes, each word of the document being represented by only one word token, the method comprising the steps of:
eliminating predetermined character shape code classes from said sequence of word tokens;
removing predetermined common function word tokens and numerical word tokens from said sequence of word tokens to form a reduced sequence of word tokens using a pattern matching technique and a stop token list comprising an optional token list;
determining word token frequency appearance rates for the word tokens of the reduced sequence;
ranking said frequency of appearance rates;
determining nth or more most frequently appearing word tokens based on the ranked frequency of appearance rates; and
categorizing the document into one of a plurality of pre-existing categories.
Dependent claims: 23, 24, 25.
26. An apparatus for highlighting and categorizing images from a document using a sequence of word tokens representing words of the document, the word tokens comprising character shape code classes, each word of the document being represented by only one word token, the apparatus comprising:
means for eliminating predetermined character shape code classes from said sequence of word tokens;
means for removing predetermined common function word tokens from said sequence of word tokens to form a reduced sequence of word tokens using a pattern matching technique and a stop token list;
means for determining word token frequency appearance rates for the word tokens of the reduced sequence;
means for ranking said frequency of appearance rates;
means for determining the nth or more most frequently appearing word tokens based on the ranked frequency of appearance rates; and
means for highlighting words of the document corresponding to the nth or more most frequently appearing word tokens.
Dependent claims: 27, 28, 29, 30.
31. An apparatus for highlighting and categorizing images from a document using a sequence of word tokens representing words of the document, the word tokens comprising character shape code classes, each word of the document being represented by only one word token, the apparatus comprising:
means for eliminating predetermined character shape code classes from said sequence of word tokens;
means for removing predetermined common function word tokens from said sequence of word tokens to form a reduced sequence of word tokens using a pattern matching technique and a stop token list;
means for determining word token frequency appearance rates for the word tokens of the reduced sequence;
means for ranking said frequency of appearance rates;
means for determining nth or more most frequently appearing word tokens based on the ranked frequency of appearance rates;
means for highlighting words of the document corresponding to the nth or more most frequently appearing word tokens; and
means for categorizing the document into one of a plurality of pre-existing categories.
Dependent claims: 32, 33, 34.
35. An apparatus for highlighting and categorizing images from a document using a sequence of word tokens representing words of the document, the word tokens comprising character shape code classes, each word of the document being represented by only one word token, the apparatus comprising:
means for eliminating predetermined character shape code classes from said sequence of word tokens;
means for removing predetermined common function word tokens and numerical word tokens from said sequence of word tokens to form a reduced sequence of word tokens using a pattern matching technique and a stop token list comprising an optional token list;
means for determining word token frequency appearance rates for the word tokens of the reduced sequence;
means for ranking said frequency of appearance rates;
means for determining the nth or more most frequently appearing word tokens based on the ranked frequency of appearance rates; and
means for highlighting words of the document corresponding to the nth or more most frequently appearing word tokens.
Dependent claims: 36, 37, 38.
39. An apparatus for highlighting and categorizing images from a document using a sequence of word tokens representing words of the document, the word tokens comprising character shape code classes, each word of the document being represented by only one word token, the apparatus comprising:
means for eliminating predetermined character shape code classes from said sequence of word tokens;
means for removing predetermined common function word tokens and numerical word tokens from said sequence of word tokens to form a reduced sequence of word tokens using a pattern matching technique and a stop token list comprising an optional token list;
means for determining word token frequency appearance rates for the word tokens of the reduced sequence;
means for ranking said frequency of appearance rates;
means for determining the nth or more most frequently appearing word tokens based on the ranked frequency of appearance rates;
means for highlighting words of the document corresponding to the nth or more most frequently appearing word tokens; and
means for categorizing the document into one of a plurality of pre-existing categories.
Dependent claims: 40, 41.
42. An apparatus for highlighting and categorizing images from a document using a sequence of word tokens, the word tokens comprising character shape code classes, each word of the document being represented by only one word token, the apparatus comprising:
means for eliminating predetermined character shape code classes from said sequence of word tokens;
means for removing predetermined common function word tokens from said sequence of word tokens to form a reduced sequence of word tokens using a pattern matching technique and a stop token list;
means for determining word token frequency appearance rates for the word tokens of the reduced sequence;
means for ranking said frequency of appearance rates;
means for determining the nth or more most frequently appearing word tokens based on the ranked frequency of appearance rates; and
means for categorizing the document into one of a plurality of pre-existing categories.
Dependent claims: 43, 44, 45, 46.
47. An apparatus for highlighting and categorizing images from a document using a sequence of word tokens representing words of the document, the word tokens comprising character shape code classes, each word of the document being represented by only one word token, the apparatus comprising:
means for eliminating predetermined character shape code classes from said sequence of word tokens;
means for removing predetermined common function word tokens and numerical word tokens from said sequence of word tokens to form a reduced sequence of word tokens using a pattern matching technique and a stop token list comprising an optional token list;
means for determining word token frequency appearance rates for the word tokens of the reduced sequence;
means for ranking said frequency of appearance rates;
means for determining the nth or more most frequently appearing word tokens based on the ranked frequency of appearance rates; and
means for categorizing the document into one of a plurality of pre-existing categories.
Dependent claims: 48, 49, 50.
Specification