Mapping words, phrases using sequential-pattern to find user specific trends in a text database

US 6,006,223 A
Filed: 08/12/1997
Issued: 12/21/1999
Est. Priority Date: 08/12/1997
Status: Expired due to Term

Abstract

A method and apparatus for mining text databases, employing sequential pattern phrase identification and shape queries, to discover trends. The method passes over a desired database using a dynamically generated shape query. Documents within the database are selected based on specific classifications and user defined partitions. Once a partition is specified, transaction IDs are assigned to the words in the text documents depending on their placement within each document. The transaction IDs encode both the position of each word within the document as well as representing sentence, paragraph, and section breaks, and are represented in one embodiment as long integers with the sentence boundaries. A maximum and minimum gap between words in the phrases and the minimum support all phrases must meet for the selected time period may be specified. A generalized sequential pattern method is used to generate those phrases in each partition that meet the minimum support threshold. The shape query engine takes the set of phrases for the partition of interest and selects those that match a given shape query. A query may take the form of requesting a trend such as "recent upwards trend", "recent spikes in usage", "downward trends", and "resurgence of usage". Once the phrases matching the shape query are found, they are presented to the user.
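The transaction-ID idea in the abstract can be illustrated with a small sketch. It assumes one possible encoding, which the record above does not specify: word positions increase by 1, and sentence and paragraph breaks add a jump larger than any gap a user would allow, so a gap-constrained phrase can never straddle a boundary. The constants `SENTENCE_JUMP` and `PARAGRAPH_JUMP` and the function names are hypothetical, and section breaks are omitted for brevity.

```python
import re

# Hypothetical offsets (not from the patent): a boundary adds a jump that is
# larger than any word gap a user would reasonably allow.
SENTENCE_JUMP = 1_000
PARAGRAPH_JUMP = 100_000


def assign_transaction_ids(text):
    """Return [(transaction_id, word), ...] for one document."""
    ids = []
    tid = 0
    for p_idx, paragraph in enumerate(re.split(r"\n\s*\n", text.strip())):
        if p_idx > 0:
            tid += PARAGRAPH_JUMP
        for s_idx, sentence in enumerate(re.split(r"(?<=[.!?])\s+", paragraph)):
            if s_idx > 0:
                tid += SENTENCE_JUMP
            for word in re.findall(r"[a-z0-9']+", sentence.lower()):
                tid += 1
                ids.append((tid, word))
    return ids


def contains_phrase(ids, phrase, min_gap=1, max_gap=1):
    """True if the words of `phrase` occur in order, with every consecutive
    pair of transaction IDs differing by at least min_gap and at most max_gap."""
    if not phrase:
        return False

    def search(prev_tid, k):
        if k == len(phrase):
            return True
        for tid, word in ids:
            if word == phrase[k] and min_gap <= tid - prev_tid <= max_gap:
                if search(tid, k + 1):
                    return True
        return False

    return any(search(tid, 1) for tid, word in ids if word == phrase[0])
```

With the default gaps of 1, only adjacent words in the same sentence form a phrase; raising `max_gap` admits phrases with intervening words, while the boundary jumps still keep matches inside a single sentence.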

160 Citations

16 Claims

1. A computer executed method for discovering trends in a database, comprising:
- mapping words in a plurality of words to a data-sequence of data contained in a data field and identifiable by a position identifier, the data-sequence having transactions where a transaction includes a set of items, a word mapped to a single-item transaction in a data-sequence;
  
  mapping phrases to a sequential-pattern of data contained in the data field and identifiable by a position identifier, the sequential-pattern of data having sets of items, a phrase mapped to a sequential-pattern having one item in each set of items;
  
  generating a time stamp for each word of a plurality of words mapped in a data field, the time stamp specifying a data field location;
  
  partitioning the database into discrete portions using the time stamp for each word;
  
  determining a support value for a phrase, the support value representing a number of data-sequences in a selected data partition containing the phrase; and
  
  outputting trends based upon the support values of the phrases, by:
  
  determining frequent phrases using the mapping of each phrase, a phrase being frequent if the presence of the phrase in the selected partition of the database exceeds a minimum required support value; and
  
  outputting trends using only the frequent phrases.
- Dependent Claims (2, 3, 4)
- - 2. The method recited in claim 1 where outputting trends further comprises pruning the frequent phrases based upon user-defined constraints to reduce the number of phrases used to identify trends.
  - 3. The method recited in claim 1 including cleansing the database to remove undesired data from each data field.
  - 4. The method recited in claim 1 including caching histories of the support values.
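As a rough illustration of the support and frequent-phrase steps recited in claims 1 and 4, the following sketch counts, per partition, how many documents contain each phrase, keeps a history of those counts, and reports phrases whose support rises in every period. It is a simplification under stated assumptions: each document is taken to carry a single period label (for example a filing year), phrases are enumerated by brute force as contiguous n-grams instead of using a generalized sequential pattern method, and all function names are hypothetical.

```python
from collections import defaultdict


def phrases_in(words, max_len=2):
    """All contiguous word n-grams of length 1..max_len, as a set of tuples."""
    return {
        tuple(words[i:i + n])
        for n in range(1, max_len + 1)
        for i in range(len(words) - n + 1)
    }


def support_histories(documents, max_len=2):
    """documents: iterable of (period, text) pairs.

    Returns {phrase: {period: support}}, where support is the number of
    documents in that period that contain the phrase at least once.
    """
    history = defaultdict(lambda: defaultdict(int))
    for period, text in documents:
        for phrase in phrases_in(text.lower().split(), max_len):
            history[phrase][period] += 1
    return history


def frequent_phrases(history, period, min_support):
    """Phrases whose support in `period` exceeds min_support."""
    return {
        phrase: counts[period]
        for phrase, counts in history.items()
        if counts.get(period, 0) > min_support
    }


def upward_trends(history, periods, min_support):
    """Phrases frequent in the last period whose support rises at every step."""
    result = []
    for phrase in frequent_phrases(history, periods[-1], min_support):
        series = [history[phrase].get(p, 0) for p in periods]
        if all(a < b for a, b in zip(series, series[1:])):
            result.append(phrase)
    return sorted(result)
```

For example, calling `support_histories` on a list of `(year, text)` pairs and then `upward_trends(history, [1995, 1996, 1997], min_support=2)` returns only phrases that are frequent in 1997 and grew in each year. Because the history is built once and kept, later queries over other periods can reuse it without another pass over the documents.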

5. A signal-bearing medium tangibly embodying a program of machine-readable instructions executable by a digital processing apparatus to perform a method for discovering trends in a database, said method comprising:
- mapping words in a plurality of words to a data-sequence of data contained in a data field and identifiable by a position identifier, the data-sequence having transactions where a transaction includes a set of items, a word mapped to a single-item transaction in a data-sequence;
  
  mapping phrases to a sequential-pattern of data contained in the data field and identifiable by a position identifier, the sequential-pattern of data having sets of items, a phrase mapped to a sequential-pattern having one item in each set of items;
  
  generating a time stamp for each word of a plurality of words mapped in a data field, the time stamp specifying a data field location;
  
  partitioning the database into discrete portions using the time stamp for each word;
  
  determining a support value for a phrase, the support value representing a number of data-sequences in a selected data partition containing the phrase; and
  
  outputting trends based upon the support values of the phrases, by:
  
  determining frequent phrases using the mapping of each phrase, a phrase being frequent if the presence of the phrase in the selected partition of the database exceeds a minimum required support value; and
  
  outputting trends using only the frequent phrases.
- Dependent Claims (6, 7, 8)
- - 6. The signal-bearing medium recited in claim 5 and used in performing a method for discovering trends in a database, where the method step of identifying trends further comprises pruning the frequent phrases based upon user-defined constraints to reduce the number of phrases used to identify trends.
  - 7. The signal-bearing medium recited in claim 6 and used in performing a method for discovering trends in a database, the method including cleansing the database to remove undesired data from each data field.
  - 8. The signal-bearing medium recited in claim 6 and used in performing a method for discovering trends in a database, the method including caching histories of the support values.

9. A digital processing machine used to discover trends in a database, the device comprising:
- a database;
  
  a digital processing apparatus, the digital processing apparatus configured to receive data and commands from a signal-bearing medium tangibly embodying a program of machine-readable instructions executable by the digital processing apparatus and used to perform a method for discovering trends in a database, said method comprising:
  
  mapping words in a plurality of words to a data-sequence of data contained in a data field and identifiable by a position identifier, the data-sequence having transactions where a transaction includes a set of items, a word mapped to a single-item transaction in a data-sequence;
  
  mapping phrases to a sequential-pattern of data contained in the data field and identifiable by a position identifier, the sequential-pattern of data having sets of items, a phrase mapped to a sequential-pattern having one item in each set of items;
  
  generating a time stamp for each word of a plurality of words mapped in a data field, the time stamp specifying a data field location;
  
  partitioning the database into discrete portions using the time stamp for each word;
  
  determining a support value for a phrase, the support value representing a number of data-sequences in a selected data partition containing the phrase; and
  
  outputting trends based upon the support values of the phrases, by:
  
  determining frequent phrases using the mapping of each phrase, a phrase being frequent if the presence of the phrase in the selected partition of the database exceeds a minimum required support value; and
  
  outputting trends using only the frequent phrases.
- Dependent Claims (10, 11, 12)
- - 10. The machine recited in claim 9, where the method step of outputting trends performed by the digital processing apparatus further comprises pruning the frequent phrases based upon user-defined constraints to reduce the number of phrases used to identify trends.
  - 11. The machine recited in claim 10, where the method performed by the digital processing apparatus includes cleansing the database to remove undesired data from each data field.
  - 12. The machine recited in claim 11, where the method performed by the digital processing apparatus includes retaining histories of the support values.

13. A digital processing machine for discovering trends in a database, the device comprising:
- a database;
  
  a means for processing the database, the processing means configured to receive data and commands from a signal-bearing medium tangibly embodying a program of machine-readable instructions executable by the processing means and used to perform a method for discovering trends in a database, said method comprising:
  
  cleansing the database to remove undesired data from each data field;
  
  mapping words in a plurality of words to a data-sequence of data contained in a data field and identifiable by a position identifier, the data-sequence having transactions where a transaction includes a set of items, a word mapped to a single-item transaction in a data-sequence;
  
  mapping phrases to a sequential-pattern of data contained in the data field and identifiable by a position identifier, the sequential-pattern of data having sets of items, a phrase mapped to a sequential-pattern having one item in each set of items;
  
  generating a time stamp for each word of a plurality of words mapped in a data field, the time stamp specifying a data field location;
  
  partitioning the database into discrete portions using the time stamp for each word;
  
  determining a support value for a phrase, the support value representing a number of data-sequences in a selected data partition containing the phrase; and
  
  outputting trends based upon the support values of the phrases, by:
  
  determining frequent phrases using the mapping of each phrase, a phrase being frequent if the presence of the phrase in the selected partition of the database exceeds a minimum required support value; and
  
  outputting trends using only the frequent phrases.
- Dependent Claims (14, 15, 16)
- - 14. The digital processing machine recited in claim 13 where outputting trends further comprises pruning the frequent phrases based upon user-defined constraints to reduce the number of phrases used to identify trends.
  - 15. The digital processing machine recited in claim 13 including cleansing the database to remove undesired data from each data field.
  - 16. The digital processing machine recited in claim 13 including caching histories of the support values.

Current Assignee: GlobalFoundries, Inc.
Original Assignee: International Business Machines Corporation
Inventors: Lent, Brian Scott; Agrawal, Rakesh; Srikant, Ramakrishnan
Primary Examiner(s): Amsbury, Wayne
Assistant Examiner(s): CHANNAVAJJALA, SRIRAMA T
Application Number: US08/909,911
Time in Patent Office: 861 Days
Field of Search: 707/1, 707/2, 707/3, 707/6, 707/7, 707/8, 707/9, 707/10, 707/100, 707/203, 707/101, 707/102, 707/103, 707/201, 707/511, 707/531, 707/532, 707/535, 705/10, 704/1, 704/8, 704/9, 704/10
US Class Current: 704/251

CPC Class Codes
G06F 16/313   Selection or weighting of t...
G06F 2216/03   Data mining
Y10S 707/968   Partitioning
Y10S 707/99933   Query processing, i.e. sear...
Y10S 707/99935   Query augmenting and refini...
Y10S 707/99936   Pattern matching access
Y10S 707/99943   Generating database or data...
Y10S 707/99953   Recoverability
