Two-pass hash extraction of text strings
First Claim
1. A method for recognizing text, the method comprising:
- generating a plurality of generated terms used in a text string;
calculating, by a calculating device, a plurality of hash values from the plurality of generated terms;
creating a plurality of hash buckets respectively corresponding to the plurality of hash values;
maintaining a plurality of occurrence count values respectively corresponding to the plurality of hash buckets, each of the plurality of occurrence count values respectively indicating a number of times ones of the plurality of generated terms occur in the text string having a hash value that respectively correspond to the plurality of occurrence count values'"'"' respective hash bucket;
discarding ones of the plurality of hash buckets having respective occurrence count values less than a first predetermined value;
adding dictionary terms to a dictionary, the dictionary terms comprising ones of the plurality of generated terms having respective hash values corresponding to any of the plurality of hash values respectively corresponding to the remaining plurality of hash buckets, the dictionary including a plurality of frequency count values respectively indicating the number of times each of the dictionary terms occurred in the text string; and
removing at least one dictionary term that was added to the dictionary due to hash collisions.
2 Assignments
0 Petitions
Accused Products
Abstract
Data compression and key word recognition may be provided. A first pass may walk a text string, generate terms, and calculate a hash value for each generated term. For each hash value, a hash bucket may be created where an associated occurrence count may be maintained. The hash buckets may be sorted by occurrence count and a few top buckets may be kept. Once those top buckets are known, a second pass may walk the text string, generate terms, and calculate a hash value for each term. If the hash values of terms match hash values of one of the kept buckets, then the term may be considered a frequent term. Consequently, the term may be added to a dictionary along with a corresponding frequency count. Then, the dictionary may be examined to remove terms that may not be frequent, but appeared due to hash collisions.
-
Citations
20 Claims
-
1. A method for recognizing text, the method comprising:
-
generating a plurality of generated terms used in a text string; calculating, by a calculating device, a plurality of hash values from the plurality of generated terms; creating a plurality of hash buckets respectively corresponding to the plurality of hash values; maintaining a plurality of occurrence count values respectively corresponding to the plurality of hash buckets, each of the plurality of occurrence count values respectively indicating a number of times ones of the plurality of generated terms occur in the text string having a hash value that respectively correspond to the plurality of occurrence count values'"'"' respective hash bucket; discarding ones of the plurality of hash buckets having respective occurrence count values less than a first predetermined value; adding dictionary terms to a dictionary, the dictionary terms comprising ones of the plurality of generated terms having respective hash values corresponding to any of the plurality of hash values respectively corresponding to the remaining plurality of hash buckets, the dictionary including a plurality of frequency count values respectively indicating the number of times each of the dictionary terms occurred in the text string; and removing at least one dictionary term that was added to the dictionary due to hash collisions. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13)
-
-
14. A computer-readable storage medium which stores a set of instructions which when executed performs a method for recognizing text, the method executed by the set of instructions comprising:
-
creating a plurality of hash buckets respectively corresponding to a plurality of hash values corresponding to a plurality of generated terms in a text string wherein at least a portion of the plurality of generated terms comprise sub-strings; maintaining a plurality of occurrence count values respectively corresponding to the plurality of hash buckets, each of the plurality of occurrence count values respectively indicating a number of times ones of the plurality of generated terms occur in the text string having a hash value that respectively correspond to the plurality of occurrence count values'"'"' respective hash bucket; discarding ones of the plurality of hash buckets having respective occurrence count values less than a first predetermined value; adding dictionary terms to a dictionary, the dictionary terms comprising ones of the plurality of generated terms having respective hash values corresponding to any of the plurality of hash values respectively corresponding to the remaining plurality of hash buckets, the dictionary including a plurality of frequency count values respectively indicating the number of times each of the dictionary terms occurred in the text string; removing at least one dictionary term that was added to the dictionary due to hash collisions, the at least one dictionary term not being a frequent term; and ranking the dictionary terms using an evaluation function configured for data compression wherein the dictionary terms are ranked based upon a plurality respective indexes respectively corresponding to each of the dictionary terms, each of the plurality indexes respectively comprising the frequency count value of each respective dictionary term multiplied by a length of each respective dictionary term. - View Dependent Claims (15, 16, 17)
-
-
18. A system for recognizing text, the system comprising:
-
a memory storage; and a processing unit coupled to the memory storage, wherein the processing unit is operative to; create a plurality of hash buckets respectively corresponding to a plurality of hash values corresponding to a plurality of generated terms in a text string wherein each of the plurality of generated terms comprise individual strings; maintain a plurality of occurrence count values respectively corresponding to the plurality of hash buckets, each of the plurality of occurrence count values respectively indicating a number of times ones of the plurality of generated terms occur in the text string having a hash value that respectively correspond to the plurality of occurrence count values'"'"' respective hash bucket; discard ones of the plurality of hash buckets having respective occurrence count values less than a first predetermined value; add dictionary terms to a dictionary, the dictionary terms comprising ones of the plurality of generated terms having respective hash values corresponding to any of the plurality of hash values respectively corresponding to the remaining plurality of hash buckets, the dictionary including a plurality of frequency count values respectively indicating the number of times each of the dictionary terms occurred in the text string; removing at least on dictionary term that was added to the dictionary due to hash collisions, the at least one dictionary term not being a frequent term; and rank the dictionary terms using an evaluation function configured for keyword recognition wherein the dictionary terms are ranked based upon their respective frequency count values. - View Dependent Claims (19, 20)
-
Specification