System for information discovery

US 20030097375A1
Filed: 11/16/2002
Published: 05/22/2003
Est. Priority Date: 09/13/1996
Status: Active Grant

Abstract

A sequence of word filters is used to eliminate terms in the database which do not discriminate document content, resulting in a filtered word set and a topic word set whose members are highly predictive of content. These two word sets are then formed into a two dimensional matrix with matrix entries calculated as the conditional probability that a document will contain a word in a row given that it contains the word in a column. The matrix representation allows the resultant vectors to be utilized to interpret document contents.
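The matrix construction described in the abstract can be illustrated with a minimal Python sketch. It is not taken from the patent: the function name `conditional_probability_matrix`, the toy documents, and the choice to estimate each probability from simple document counts are all assumptions made for illustration. Rows are topic words and columns are filtered words, so each entry estimates the probability that a document contains the row word given that it contains the column word.

```python
def conditional_probability_matrix(docs, topic_words, filtered_words):
    """Return m[t][w], an estimate of P(doc contains t | doc contains w).

    Hypothetical helper: probabilities are estimated by counting documents,
    which is an assumption and not a detail stated in the source text.
    """
    doc_sets = [set(d) for d in docs]
    matrix = {}
    for t in topic_words:
        row = {}
        for w in filtered_words:
            # Number of documents containing the conditioning word w.
            n_w = sum(1 for d in doc_sets if w in d)
            # Number of documents containing both w and the topic word t.
            n_tw = sum(1 for d in doc_sets if w in d and t in d)
            row[w] = n_tw / n_w if n_w else 0.0
        matrix[t] = row
    return matrix


# Toy corpus (invented for this example): each document is a list of words.
docs = [
    ["engine", "fuel", "car"],
    ["engine", "car", "wheel"],
    ["fuel", "price", "market"],
    ["market", "price", "stock"],
]
topic_words = ["car", "market"]
filtered_words = ["engine", "fuel", "price"]

m = conditional_probability_matrix(docs, topic_words, filtered_words)
for t in topic_words:
    print(t, [m[t][w] for w in filtered_words])
```

Each column of the result, read across the topic words, gives one filtered word a vector of topic associations, which is one way to read "providing said matrix entries as vectors".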

5 Claims

1. A method for analyzing and characterizing a database of electronically formatted natural language based documents comprising the steps of:
- a) subjecting the database to a sequence of word filters to eliminate terms in the database which do not discriminate document content, resulting in a filtered word set whose members are highly predictive of content;
- b) defining a subset of the filtered word set as the topic set, said topic set being characterized as the set of filtered words which best discriminate the content of the documents which contain them,
- c) forming a two dimensional matrix with the words contained within the topic set defining one dimension of said matrix and the words contained within the filtered word set comprising the other dimension of said matrix,
- d) calculating matrix entries as the conditional probability that a document in the database will contain each word in the topic set given that it contains each word in the filtered word set, and
- e) providing said matrix entries as vectors to interpret the document contents of said database.
2. The method of claim 1 wherein one of said sequence of filters comprises a frequency filter, a topicality filter and an overlap filter.
3. The method of claim 2 wherein one of said topicality filter comprises the steps of:
- a) calculating the expected distribution of each word contained in said database,
- b) measuring the actual distribution of each word contained in said database,
- c) expressing the ratio of said actual distribution to said expected distribution, and
- d) defining said set of topic words as those which fall below a predetermined value of said ratio.
4. The method of claim 2 wherein one of said frequency filter comprises the steps of:
- a) defining a predetermined upper and lower limit for the frequency of said words in the database,
- b) determining the frequency of occurrence of each word contained in said database,
- c) further defining said set of topic words as those words whose frequency of occurrence in the database are above said predetermined lower limit and below said predetermined upper limit.
5. The method of claim 2 wherein one of said overlap filter comprises the steps of:
- a) defining a preset limit for joint distribution of word pairs occurring within said database,
- b) calculating the joint distribution of word pairs occurring within said database,
- c) defining the set of word pairs whose joint distribution falls above said preset limit,
- d) further defining said set of topic words as not containing one of those words for each set of word pairs whose joint distribution falls above said preset upper limit.
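The three filters named in claims 2 to 5 can be sketched in Python as follows. This is an illustrative reading, not the patented implementation, and it rests on several assumptions the claims leave open: "actual distribution" is taken to be the number of documents containing a word, "expected distribution" is the number of documents the word would reach if its occurrences were scattered at random, and "joint distribution" of a pair is the share of documents containing both words among those containing either. The function names, thresholds, and toy corpus are hypothetical.

```python
from collections import Counter
from itertools import combinations


def frequency_filter(docs, lower, upper):
    """Keep words whose total frequency is above `lower` and below `upper`."""
    counts = Counter(w for d in docs for w in d)
    return {w for w, c in counts.items() if lower < c < upper}


def topicality_filter(docs, words, max_ratio):
    """Keep words whose actual/expected document spread is below `max_ratio`.

    Assumption: expected spread models occurrences falling at random
    into n documents; the claims do not specify this formula.
    """
    n = len(docs)
    total = Counter(w for d in docs for w in d)
    doc_freq = Counter(w for d in docs for w in set(d))
    topics = set()
    for w in words:
        expected = n * (1 - (1 - 1 / n) ** total[w])
        if doc_freq[w] / expected < max_ratio:
            topics.add(w)
    return topics


def overlap_filter(docs, topics, limit):
    """For each pair of topic words that co-occur too often, drop one word.

    Assumption: pair overlap is measured as |both| / |either| over documents.
    """
    doc_sets = [set(d) for d in docs]
    kept = set(topics)
    for a, b in combinations(sorted(topics), 2):
        if a not in kept or b not in kept:
            continue  # one word of this pair was already removed
        both = sum(1 for d in doc_sets if a in d and b in d)
        either = sum(1 for d in doc_sets if a in d or b in d)
        if either and both / either > limit:
            kept.discard(b)
    return kept


# Toy corpus invented for this example.
docs = [
    ["the", "reactor", "reactor", "reactor", "core", "core", "core", "report"],
    ["the", "market", "market", "market", "report"],
    ["the", "report", "study"],
    ["the", "study", "report", "once"],
]

filtered = frequency_filter(docs, lower=1, upper=4)
topics = topicality_filter(docs, filtered, max_ratio=0.8)
kept = overlap_filter(docs, topics, limit=0.5)
print(sorted(filtered), sorted(topics), sorted(kept))
```

In this toy run the frequency filter drops the very common and very rare words, the topicality filter keeps words concentrated in few documents, and the overlap filter removes one of two words that always appear together. Which word of an overlapping pair to drop is not fixed by the claims; the sketch arbitrarily keeps the alphabetically first.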

Current Assignee: Kelly A. Pennock, Nancy E. Miller
Original Assignee: Kelly A. Pennock, Nancy E. Miller
Inventors: Nancy E. Miller, Kelly A. Pennock

Granted Patent: US 6,772,170 B2
US Class Current: 1/1
CPC Class Codes

G06F 16/30   of unstructured textual dat...

G06F 16/3332   Query translation

Y10S 707/99932   Access augmentation or opti...

Y10S 707/99933   Query processing, i.e. sear...

Y10S 707/99935   Query augmenting and refini...

Y10S 707/99936   Pattern matching access

Y10S 707/99937   Sorting

Y10S 707/99942   Manipulating data structure...

Y10S 707/99943   Generating database or data...
