Mapping words, phrases using sequential-pattern to find user specific trends in a text database
First Claim
1. A computer executed method for discovering trends in a database, comprising:
mapping words in a plurality of words to a data-sequence of data contained in a data field and identifiable by a position identifier, the data-sequence having transactions where a transaction includes a set of items, a word mapped to a single-item transaction in a data-sequence;
mapping phrases to a sequential-pattern of data contained in the data field and identifiable by a position identifier, the sequential-pattern of data having sets of items, a phrase mapped to a sequential-pattern having one item in each set of items;
generating a time stamp for each word of a plurality of words mapped in a data field, the time stamp specifying a data field location;
partitioning the database into discrete portions using the time stamp for each word;
determining a support value for a phrase, the support value representing a number of data-sequences in a selected data partition containing the phrase; and
outputting trends based upon the support values of the phrases, by:
determining frequent phrases using the mapping of each phrase, a phrase being frequent if the presence of the phrase in the selected partition of the database exceeds a minimum required support value; and
outputting trends using only the frequent phrases.
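The support and frequency steps of the claim can be illustrated with a minimal Python sketch. It rests on several assumptions not stated in the claim: each document is a list of words whose list index serves as the position identifier, a phrase is a contiguous run of words, and partitions are keyed by a hypothetical period label such as a year. The function names are illustrative only.

```python
from collections import defaultdict

def contains_phrase(words, phrase):
    """True if `phrase` (a tuple of words) occurs contiguously in `words`."""
    n = len(phrase)
    return any(tuple(words[i:i + n]) == phrase
               for i in range(len(words) - n + 1))

def phrase_support(partitions, phrase):
    """Per partition, count the documents (data-sequences) containing the phrase."""
    return {
        label: sum(contains_phrase(doc, phrase) for doc in docs)
        for label, docs in partitions.items()
    }

def frequent_phrases(docs, length, min_support):
    """Phrases of `length` words whose support in `docs` exceeds `min_support`."""
    counts = defaultdict(int)
    for doc in docs:
        # Use a set so a phrase is counted once per document, not once per occurrence.
        seen = {tuple(doc[i:i + length]) for i in range(len(doc) - length + 1)}
        for phrase in seen:
            counts[phrase] += 1
    return {p: c for p, c in counts.items() if c > min_support}

# Hypothetical toy database, partitioned by year (an assumption for illustration).
partitions = {
    1990: [["data", "mining", "is", "new"], ["text", "retrieval"]],
    1991: [["data", "mining", "tools"], ["data", "mining", "grows"], ["image", "data"]],
}
print(phrase_support(partitions, ("data", "mining")))
print(frequent_phrases(partitions[1991], 2, 1))
```

Support here is counted per document rather than per occurrence, following the claim's wording that support is "a number of data-sequences ... containing the phrase", and the strict comparison mirrors "exceeds a minimum required support value".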
4 Assignments
0 Petitions
Abstract
A method and apparatus for mining text databases, employing sequential pattern phrase identification and shape queries, to discover trends. The method passes over a desired database using a dynamically generated shape query. Documents within the database are selected based on specific classifications and user-defined partitions. Once a partition is specified, transaction IDs are assigned to the words in the text documents according to their placement within each document. The transaction IDs encode the position of each word within the document and also mark sentence, paragraph, and section breaks; in one embodiment they are represented as long integers with the sentence boundaries. A maximum and a minimum gap between the words of a phrase may be specified, as may the minimum support that all phrases must meet for the selected time period. A generalized sequential pattern method is used to generate those phrases in each partition that meet the minimum support threshold. The shape query engine takes the set of phrases for the partition of interest and selects those that match a given shape query. A query may request a trend such as "recent upwards trend", "recent spikes in usage", "downward trends", or "resurgence of usage". Once the phrases matching the shape query are found, they are presented to the user.
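To make the transaction-ID and shape-query ideas concrete, here is a hedged Python sketch. The abstract does not give the encoding constants or the shape-query language, so everything specific below is an assumption: `SENTENCE_JUMP` is a hypothetical offset added at each sentence break, only sentence breaks are modelled, and "recent upwards trend" is read as strictly rising support over the last few partitions.

```python
SENTENCE_JUMP = 1_000  # hypothetical offset added at each sentence break

def assign_transaction_ids(sentences):
    """Give every word an integer ID; IDs jump at each sentence boundary."""
    tagged, next_id = [], 0
    for sentence in sentences:
        for word in sentence:
            tagged.append((next_id, word))
            next_id += 1
        # The jump ensures a small max_gap can never span two sentences.
        next_id += SENTENCE_JUMP
    return tagged

def occurs_with_gaps(tagged, phrase, min_gap=1, max_gap=1):
    """True if the words of `phrase` occur in order, with every ID difference
    between consecutive phrase words lying in [min_gap, max_gap]."""
    def extend(prev_id, k):
        if k == len(phrase):
            return True
        return any(
            word == phrase[k]
            and (prev_id is None or min_gap <= tid - prev_id <= max_gap)
            and extend(tid, k + 1)
            for tid, word in tagged
        )
    return extend(None, 0)

def is_recent_upward_trend(supports, periods=3):
    """One possible reading of "recent upwards trend": support rises strictly
    across the last `periods` partitions."""
    recent = supports[-periods:]
    return len(recent) == periods and all(a < b for a, b in zip(recent, recent[1:]))

tagged = assign_transaction_ids([["text", "mining", "works"], ["well", "enough"]])
print(tagged)
print(occurs_with_gaps(tagged, ("text", "works"), max_gap=2))
print(occurs_with_gaps(tagged, ("works", "well"), max_gap=5))
print(is_recent_upward_trend([5, 3, 4, 6, 9]))
```

Encoding the breaks in the ID itself means the gap test is plain integer arithmetic: a phrase that would straddle a sentence boundary fails the `max_gap` check without any separate boundary lookup.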
160 Citations
16 Claims
5. A signal-bearing medium tangibly embodying a program of machine-readable instructions executable by a digital processing apparatus to perform a method for discovering trends in a database, said method comprising:
mapping words in a plurality of words to a data-sequence of data contained in a data field and identifiable by a position identifier, the data-sequence having transactions where a transaction includes a set of items, a word mapped to a single-item transaction in a data-sequence;
mapping phrases to a sequential-pattern of data contained in the data field and identifiable by a position identifier, the sequential-pattern of data having sets of items, a phrase mapped to a sequential-pattern having one item in each set of items;
generating a time stamp for each word of a plurality of words mapped in a data field, the time stamp specifying a data field location;
partitioning the database into discrete portions using the time stamp for each word;
determining a support value for a phrase, the support value representing a number of data-sequences in a selected data partition containing the phrase; and
outputting trends based upon the support values of the phrases, by:
determining frequent phrases using the mapping of each phrase, a phrase being frequent if the presence of the phrase in the selected partition of the database exceeds a minimum required support value; and
outputting trends using only the frequent phrases.
9. A digital processing machine used to discover trends in a database, the device comprising:
a database;
a digital processing apparatus, the digital processing apparatus configured to receive data and commands from a signal-bearing medium tangibly embodying a program of machine-readable instructions executable by the digital processing apparatus and used to perform a method for discovering trends in a database, said method comprising:
mapping words in a plurality of words to a data-sequence of data contained in a data field and identifiable by a position identifier, the data-sequence having transactions where a transaction includes a set of items, a word mapped to a single-item transaction in a data-sequence;
mapping phrases to a sequential-pattern of data contained in the data field and identifiable by a position identifier, the sequential-pattern of data having sets of items, a phrase mapped to a sequential-pattern having one item in each set of items;
generating a time stamp for each word of a plurality of words mapped in a data field, the time stamp specifying a data field location;
partitioning the database into discrete portions using the time stamp for each word;
determining a support value for a phrase, the support value representing a number of data-sequences in a selected data partition containing the phrase; and
outputting trends based upon the support values of the phrases, by:
determining frequent phrases using the mapping of each phrase, a phrase being frequent if the presence of the phrase in the selected partition of the database exceeds a minimum required support value; and
outputting trends using only the frequent phrases.
13. A digital processing machine for discovering trends in a database, the device comprising:
a database;
a means for processing the database, the processing means configured to receive data and commands from a signal-bearing medium tangibly embodying a program of machine-readable instructions executable by the processing means and used to perform a method for discovering trends in a database, said method comprising:
cleansing the database to remove undesired data from each data field;
mapping words in a plurality of words to a data-sequence of data contained in a data field and identifiable by a position identifier, the data-sequence having transactions where a transaction includes a set of items, a word mapped to a single-item transaction in a data-sequence;
mapping phrases to a sequential-pattern of data contained in the data field and identifiable by a position identifier, the sequential-pattern of data having sets of items, a phrase mapped to a sequential-pattern having one item in each set of items;
generating a time stamp for each word of a plurality of words mapped in a data field, the time stamp specifying a data field location;
partitioning the database into discrete portions using the time stamp for each word;
determining a support value for a phrase, the support value representing a number of data-sequences in a selected data partition containing the phrase; and
outputting trends based upon the support values of the phrases, by:
determining frequent phrases using the mapping of each phrase, a phrase being frequent if the presence of the phrase in the selected partition of the database exceeds a minimum required support value; and
outputting trends using only the frequent phrases.
Specification