System and method for database tomography

US 5,440,481 A
Filed: 10/28/1992
Issued: 08/08/1995
Est. Priority Date: 10/28/1992
Status: Expired due to Fees

Abstract

A process for analyzing full text is provided for identifying often-repeated, high user interest word phrases in a database. Often-repeated, high user interest word phrases are defined as pervasive theme areas (PTAs). The process also allows the relationship defined as connectivity among the various PTAs to be identified. In addition, phrases that are in proximity to the PTAs and which are strongly supportive of the PTAs are identified. Numerical indices, figures of merit, and user-defined thresholds are used to quantify relations between PTAs and among PTAs and phrases.
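
The identification of often-repeated word phrases can be illustrated with a minimal sketch. It is not the patented implementation: the tokenizer, the function names (`tokenize`, `count_phrases`, `select_ptas`), and the `stop_phrases` parameter are assumptions made for illustration. It counts every adjacent m-word phrase for m = 1 to n and keeps those whose frequency exceeds a user-set threshold.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens (a simplifying assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def count_phrases(tokens, max_n):
    """Count every adjacent m-word phrase for m = 1..max_n."""
    counts = Counter()
    for m in range(1, max_n + 1):
        for i in range(len(tokens) - m + 1):
            counts[tuple(tokens[i:i + m])] += 1
    return counts


def select_ptas(counts, threshold, stop_phrases=frozenset()):
    """Return (phrase, count) pairs with count above the threshold, most frequent first.

    `stop_phrases` is a hypothetical user-supplied set of trivial phrases to skip.
    """
    candidates = [
        (phrase, c) for phrase, c in counts.items()
        if c > threshold and phrase not in stop_phrases
    ]
    # Sort by descending frequency, then alphabetically for a stable order.
    return sorted(candidates, key=lambda pc: (-pc[1], pc[0]))
```

In this sketch the threshold alone decides which phrases become candidate PTAs; the "high user interest" judgment described above would still be made by a person reviewing the sorted list.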

14 Claims

1. A system for full-text database searching, for identification of often repeated phrases which by virtue of their repeated occurrence, frequency of occurrence above a user-set threshold, or user input constitute phrases having a high user-interest designated as pervasive theme areas (PTAs), said phrases consisting of one to n words (n*words), where n is an integer, in one or more documents defined as the database, relationships defined as connectivity among said PTAs, and phrases in close physical proximity to and which are supportive of said PTAs, comprising:
- means for introducing document information content into a full-text database in digital form;
  
  means for digitally storing said database;
  
  means for processing said digitally stored database;
  
  means operatively associated with said processing means and said storing means for identifying pervasive theme areas (PTAs) defined as often-repeating word phrases consisting of one or more adjacent words such that said phrases are one word phrases, adjacent 2 word phrases, adjacent 3 word phrases . . . and adjacent n* word phrases, and for entering said phrases in said storing means;
  
  means for identifying phrases in said database related to said PTAs, said phrases being defined as m words, where m=1,2,3, . . . n and where each word phrase for m=2,3, . . . n is composed of adjacent words, said word phrase for m=1 being a single word phrase, for m=2 a double word phrase, for m=3 a triple word phrase . . . and for n=m an nth word phrase, by applying a user specified range of interest R expressed as a number of single words appearing both before and after said PTAs, and for storing said identified phrases in said storing means;
  
  means for counting for each PTA the extracted phrases within said range of said PTA stored in said storage means, sorting all phrases found for each PTA by frequency of occurrence, listing each PTA and its related sorted list of extracted phrases, and storing said counts and said lists of PTA's and their related sorted list of extracted phrases in said storing means;
  
  means for quantifying the strength of relationship between extracted phrases and each pervasive theme area (PTA) applying user-predefined numerical indices and figures of merit, and providing the results of said quantifying means to said storing means;
  
  means for obtaining the results of said quantification from said quantifying means and said storing means and presenting said results to said user for user-selection of phrases having a relationship to each PTA predicated on the relationship strengths obtained by said quantifying means;
  
  means for identifying PTAs which are closely related, said means employing user-input figure of merit threshold values above a user-predetermined number for selecting phrases of high-user interest, said means storing identified closely related PTAs in said storing means;
  
  means for identifying phrases in common among PTA and storing those identified in said storing means;
  
  means for identifying and grouping related PTA based upon the number of phrases in common among the PTA, said number being above a user-input predetermined threshold, each group having at least one PTA having extracted phrases in common with one or more other PTA in said group, said groupings of PTA's stored in said storing means; and
  
  means for displaying relationships among related PTA and between PTA and related phrases said display means connected to said processing means.
- - 2. The system of claim 1 wherein said means for identifying pervasive theme areas in said database, comprises:
    - a means for counting frequency of occurrence of said n* word phrases;
      
      a means for creating a list of all n* word phrases and the frequency of occurrence for each of said n* word phrases;
      
      a means for sorting said list of n* word phrases by frequency of occurrence;
      
      a means for defining pervasive theme areas from said list of sorted phrases; and
      
      a means for selecting the number of said n* word phrases to be used as pervasive theme areas.
  - 3. The system of claim 2 wherein said means for identifying phrases in said database related to said pervasive theme areas (PTAs), comprises:
    - a means for extracting phrases for each PTA from the full-text database which occur within a set range of single words of the PTA; and
      
      a means for listing the extracted phrases for each PTA and their frequency of occurrence in the database.
  - 4. The system of claim 3 wherein said means for quantifying the strength of relationship between said phrases and each pervasive theme area, comprises:
    - a means for producing numerical indices and figures of merit from said database; and
      
      a means for applying said numerical indices and figures of merit to quantify the strength of relationship of each said phrase and each pervasive theme area.
  - 5. The system of claim 4 wherein said means for producing numerical indices and figures of merit employs definitions for numerical indices selected from the group consisting of absolute frequency of occurrence, denominated C_j, of the pervasive theme area, absolute frequency of occurrence of the extracted phrases C_i for single, adjacent double, adjacent triple . . . adjacent nth, and C_ij the frequency of occurrence of the extracted phrases (single, adjacent double, adjacent triple, . . . adjacent mth) within a set range of single words of the pervasive theme area.
  - 6. The system of claim 5 wherein said means for producing numerical indices and figures of merit employs definitions for figure of merit selected from the group consisting of the ratios of the frequencies of occurrence C_ij/C_i, C_ij/C_j, and (C_ij^2)/(C_i*C_j).
  - 7. The system of claim 6 wherein said means for identifying pervasive theme areas (PTAs) which are closely related comprises:
    - a means for defining threshold values of the figures of merit for selecting single, adjacent double, adjacent triple, . . . adjacent mth word phrases of high user interest;
      
      a means for selecting phrases of high interest having figures of merit above said threshold, said selection being made from the list of extracted phrases;
      
      a means for computing commonality defined as the degree of similarity or close relatedness of extracted phrases between different pervasive theme areas; and
      
      a means for generating groups of PTAs, each PTA in a group having extracted phrases in common with at least one other PTA in the group.
  - 8. The system of claim 1 wherein said means for identifying pervasive theme areas in said database comprises a means for creating a list of all phrases and for each phrase indicating its order in said list in accordance with a user prespecified sort criteria;
    - a means for sorting said phrases in accordance with said user pre-specified sort criteria;
      
      a means for defining pervasive theme areas from said phrases;
      
      a means for selecting a number of said phrases to be used as pervasive theme areas.
  - 9. The system of claim 8 wherein said sort criteria is alphabetical by consecutive word order in said phrase.
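
The proximity extraction of claim 3 and the indices and figures of merit of claims 5 and 6 can be sketched as follows. This is an illustrative reading under stated assumptions, not the patented implementation: the function names and the `radius` and `max_n` parameters are hypothetical, the range is taken as a fixed number of words on each side of the PTA, and overlapping windows around nearby PTA occurrences are not de-duplicated.

```python
from collections import Counter


def find_occurrences(tokens, phrase):
    """Return the start indices at which `phrase` (a tuple of words) occurs."""
    m = len(phrase)
    return [i for i in range(len(tokens) - m + 1) if tuple(tokens[i:i + m]) == phrase]


def cooccurrence_counts(tokens, pta, radius, max_n):
    """Count phrases of 1..max_n adjacent words within `radius` words of the PTA.

    The returned counter plays the role of C_ij. Windows before and after each
    PTA occurrence are scanned separately, so no phrase spans the PTA itself.
    """
    c_ij = Counter()
    for start in find_occurrences(tokens, pta):
        before = tokens[max(0, start - radius):start]
        after = tokens[start + len(pta):start + len(pta) + radius]
        for window in (before, after):
            for m in range(1, max_n + 1):
                for i in range(len(window) - m + 1):
                    c_ij[tuple(window[i:i + m])] += 1
    return c_ij


def figures_of_merit(c_ij, c_i, c_j):
    """Compute the three ratios named in claim 6 (c_i and c_j must be nonzero).

    c_i: absolute frequency of the extracted phrase,
    c_j: absolute frequency of the PTA,
    c_ij: frequency of the phrase within range of the PTA.
    """
    return {
        "cij_over_ci": c_ij / c_i,
        "cij_over_cj": c_ij / c_j,
        "cij_sq_over_ci_cj": c_ij ** 2 / (c_i * c_j),
    }
```

A phrase would then be kept for a PTA when its chosen figure of merit exceeds the user-defined threshold; which ratio to use is left to the user in the claims.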

10. The computer implemented method of full-text database searching, comprising the steps of:
- a. assembling information into a full-text database by scanning documents and storing digitized results in said computer;
  
  b. eliminating trivial phrases from said databases by comparing a user-input list of such phrases with the entire contents of said database and deleting matches with said list;
  
  c. using the definition of phrase as m*word=phrase where m=1,2,3, . . . n and where each word phrase for m=2,3 . . . n is composed of adjacent words, said word phrase for m=1 being a single word phrase, for m=2 an adjacent double word phrase, and for m=3 an adjacent triple word phrase, . . . and for m=n an adjacent nth word phrase, creating a list of all single word phrases, a list of all adjacent double word phrases, a list of all adjacent triple word phrases, . . . , and a list of all adjacent nth word phrases and their frequencies of occurrence in the database;
  
  d. sorting each list of said phrases by their frequency of occurrence in said list;
  
  e. identifying pervasive theme areas in the information in said database;
  
  f. defining pervasive theme areas from said sorted list of phrases as the most frequently occurring phrases of high user-interest;
  
  g. identifying phrases in said database that are related to said pervasive theme areas;
  
  h. quantifying strength of relationships between said identified phrases and said pervasive theme areas;
  
  i. identifying pervasive theme areas which are closely related;
  
  j. displaying relationships among related pervasive theme areas and pervasive theme areas and related phrases;
  
  wherein the step of identifying phrases related to pervasive theme area further comprises the steps of:
  
  k. extracting phrases for each pervasive theme area (PTA) from the full-text database which occur within a user-identified range of interest plus or minus a range of words of the PTA; and
  
  l. listing the extracted phrases and their frequency of occurrence in the database for each PTA.
- - 11. The method of claim 10 wherein the step of quantifying the strength of relationship between phrases and each pervasive theme area (PTA) further comprises the steps of:
    - a. preparing numerical indices and figures of merit for quantifying strength of relationship between extracted phrases and their PTA; and
      
      b. applying said numerical indices and figures of merit to said phrases relative to their respective PTA.
  - 12. The method of claim 11 wherein the step of identifying pervasive theme areas (PTA) which are closely related further comprises the steps of:
    - a. defining threshold values above some predetermined number for the figures of merit for selecting phrases;
      
      b. selecting, from the list of extracted phrases for each PTA, phrases of high user interest having figures of merit above the threshold;
      
      c. computing commonality of extracted phrases among the different phrases in terms of the numbers of phrases in common among PTA, or equivalent; and
      
      d. generating groups of PTA such that each PTA in a given group has extracted phrases in common with at least one other PTA in the group.
  - 13. The method of claim 12 wherein said step of identifying pervasive theme area further comprises the steps:
    - creating a list of all phrases sorted and ordered in accordance with frequency of occurrence of said phrases;
      
      sorting said ordered phrases in accordance with a user pre-specified user-interest criteria;
      
      defining pervasive theme areas from said user interest phrases; and
      
      selecting the number of said phrases to be used as pervasive theme areas.
  - 14. The system of claim 13 wherein said sort criteria is alphabetical by consecutive word order in said phrase.
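
The grouping step of claims 7 and 12 can be read as finding connected groups of PTAs whose selected phrase lists overlap. The sketch below takes that reading as an assumption: commonality is measured as the number of shared phrases, two PTAs are linked when that number exceeds a user-supplied `min_common`, and groups are built with a union-find structure. The function name and data layout are hypothetical.

```python
from itertools import combinations


def group_ptas(related, min_common):
    """Group PTAs whose selected phrase sets share more than `min_common` phrases.

    `related` maps each PTA to the set of phrases selected for it. Each PTA in
    a returned group shares phrases with at least one other PTA in that group.
    """
    parent = {pta: pta for pta in related}

    def find(x):
        # Follow parent links to the group representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(related, 2):
        if len(related[a] & related[b]) > min_common:
            parent[find(a)] = find(b)  # merge the two groups

    groups = {}
    for pta in related:
        groups.setdefault(find(pta), []).append(pta)
    # PTAs with no sufficiently similar partner do not form a group.
    return [sorted(members) for members in groups.values() if len(members) > 1]
```

Because links are transitive in this sketch, two PTAs can land in the same group without sharing phrases directly, which matches the claim wording that each PTA need only have phrases in common with at least one other member.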

Current Assignee
Naval Air Warfare Center
Original Assignee
The United States of America as represented by the Secretary of the Navy
Inventors
Kostoff, Ronald N., Miles, David L., Eberhart, Henry J.
Primary Examiner(s)
Huntley, David M.
Assistant Examiner(s)
Bodendorf, Andrew

Application Number

US07/967,341
Time in Patent Office

1,014 Days
Field of Search

364/419.08, 364/419.07, 364/419.13, 364/419.19, 395/600
US Class Current

1/1
CPC Class Codes

G06F 16/313   Selection or weighting of t...

G06F 40/216   using statistical methods

G06F 40/289   Phrasal analysis, e.g. fini...

Y10S 707/99935   Query augmenting and refini...
