System and method for database tomography
First Claim
1. A system for full-text database searching, for identification of often repeated phrases which by virtue of their repeated occurrence, frequency of occurrence above a user-set threshold, or user input constitute phrases having a high user-interest designated as pervasive theme areas (PTAs), said phrases consisting of one to n words (n*words), where n is an integer, in one or more documents defined as the database, relationships defined as connectivity among said PTAs, and phrases in close physical proximity to and which are supportive of said PTAs, comprising:
means for introducing document information content into a full-text database in digital form;
means for digitally storing said database;
means for processing said digitally stored database;
means operatively associated with said processing means and said storing means for identifying pervasive theme areas (PTAs) defined as often-repeating word phrases consisting of one or more adjacent words such that said phrases are one word phrases, adjacent 2 word phrases, adjacent 3 word phrases . . . and adjacent n* word phrases, and for entering said phrases in said storing means;
means for identifying phrases in said database related to said PTAs, said phrases being defined as m words, where m=1,2,3, . . . n and where each word phrase for m=2,3, . . . n is composed of adjacent words, said word phrase for m=1 being a single word phrase, for m=2 a double word phrase, for m=3 a triple word phrase . . . and for n=m an nth word phrase, by applying a user specified range of interest R expressed as a number of single words appearing both before and after said PTAs, and for storing said identified phrases in said storing means;
means for counting for each PTA the extracted phrases within said range of said PTA stored in said storage means, sorting all phrases found for each PTA by frequency of occurrence, listing each PTA and its related sorted list of extracted phrases, and storing said counts and said lists of PTA's and their related sorted list of extracted phrases in said storing means;
means for quantifying the strength of relationship between extracted phrases and each pervasive theme area (PTA) applying user-predefined numerical indices and figures of merit, and providing the results of said quantifying means to said storing means;
means for obtaining the results of said quantification from said quantifying means and said storing means and presenting said results to said user for user-selection of phrases having a relationship to each PTA predicated on the relationship strengths obtained by said quantifying means;
means for identifying PTAs which are closely related, said means employing user-input figure of merit threshold values above a user-predetermined number for selecting phrases of high-user interest, said means storing identified closely related PTAs in said storing means;
means for identifying phrases in common among PTA and storing those identified in said storing means;
means for identifying and grouping related PTA based upon the number of phrases in common among the PTA, said number being above a user-input predetermined threshold, each group having at least one PTA having extracted phrases in common with one or more other PTA in said group, said groupings of PTA's stored in said storing means; and
means for displaying relationships among related PTA and between PTA and related phrases, said display means connected to said processing means.
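The grouping element of the claim (PTAs grouped by the number of extracted phrases they have in common, with that number above a user-input threshold) can be illustrated with a minimal sketch. The claim does not specify a grouping algorithm, so the choices here are assumptions: the names `shared_phrases` and `group_ptas` are hypothetical, each PTA's extracted phrases are modeled as a Python set, and groups are formed as connected components, one plausible reading of "each group having at least one PTA having extracted phrases in common with one or more other PTA in said group".

```python
from itertools import combinations


def shared_phrases(related):
    """Map each pair of PTAs to the set of extracted phrases they share.

    `related` maps a PTA (string) to the set of phrases extracted near it.
    """
    return {
        (a, b): related[a] & related[b]
        for a, b in combinations(sorted(related), 2)
    }


def group_ptas(related, threshold):
    """Group PTAs linked by more than `threshold` phrases in common.

    Assumption: groups are connected components of the "shares enough
    phrases" relation; PTAs linked to no other PTA are left out.
    """
    parent = {pta: pta for pta in related}

    def find(x):
        # Follow parent links to the representative, shortening the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), common in shared_phrases(related).items():
        if len(common) > threshold:
            parent[find(a)] = find(b)

    groups = {}
    for pta in related:
        groups.setdefault(find(pta), []).append(pta)
    # Keep only groups in which some PTA shares phrases with another.
    return sorted(sorted(g) for g in groups.values() if len(g) > 1)


related = {
    "remote sensing": {"satellite", "imagery", "radar"},
    "radar": {"satellite", "imagery", "antenna"},
    "fiber optics": {"laser", "cable"},
}
groups = group_ptas(related, threshold=1)
```

With the illustrative data above, "radar" and "remote sensing" share two phrases and fall into one group, while "fiber optics" shares none and is left ungrouped. Raising `threshold` makes the grouping stricter.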
Abstract
A process for analyzing full text is provided for identifying often-repeated, high user interest word phrases in a database. Often-repeated, high user interest word phrases are defined as pervasive theme areas (PTAs). The process also allows the relationships, defined as connectivity, among the various PTAs to be identified. In addition, phrases that are in proximity to the PTAs and which are strongly supportive of the PTAs are identified. Numerical indices, figures of merit, and user-defined thresholds are used to quantify relations among PTAs and between PTAs and phrases.
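As a rough illustration of how often-repeated phrases of one to n adjacent words might be counted and filtered by a user-set frequency threshold, consider the following minimal sketch. It is not the patented implementation: the function names `phrase_counts` and `candidate_ptas` are hypothetical, and tokenization by lowercasing and splitting on whitespace is an assumption made only to keep the example short.

```python
from collections import Counter


def phrase_counts(words, n):
    """Count every run of n adjacent words in the word list."""
    return Counter(
        " ".join(words[i:i + n]) for i in range(len(words) - n + 1)
    )


def candidate_ptas(text, max_n, threshold):
    """For each phrase length 1..max_n, list the phrases that occur more
    than `threshold` times, sorted by descending frequency.

    Assumption: words are separated by whitespace and case is ignored.
    """
    words = text.lower().split()
    result = {}
    for n in range(1, max_n + 1):
        counts = phrase_counts(words, n)
        result[n] = [(p, c) for p, c in counts.most_common() if c > threshold]
    return result


text = "laser radar uses laser radar pulses"
candidates = candidate_ptas(text, max_n=2, threshold=1)
```

In the patent's terms, a user would then pick the phrases of high interest from these frequency-sorted lists; the sketch covers only the counting and thresholding.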
14 Claims
10. The computer implemented method of full-text database searching, comprising the steps of:
a. assembling information into a full-text database by scanning documents and storing digitized results in said computer;
b. eliminating trivial phrases from said database by comparing a user-input list of such phrases with the entire contents of said database and deleting matches with said list;
c. using the definition of phrase as m*word=phrase where m=1,2,3, . . . n and where each word phrase for m=2,3, . . . n is composed of adjacent words, said word phrase for m=1 being a single word phrase, for m=2 an adjacent double word phrase, for m=3 an adjacent triple word phrase, . . . and for m=n an adjacent nth word phrase, creating a list of all single word phrases, a list of all adjacent double word phrases, a list of all adjacent triple word phrases, . . . , and a list of all adjacent nth word phrases and their frequencies of occurrence in the database;
d. sorting each list of said phrases by their frequency of occurrence in said list;
e. identifying pervasive theme areas in the information in said database;
f. defining pervasive theme areas from said sorted list of phrases as the most frequently occurring phrases of high user-interest;
g. identifying phrases in said database that are related to said pervasive theme areas;
h. quantifying strength of relationships between said identified phrases and said pervasive theme areas;
i. identifying pervasive theme areas which are closely related;
j. displaying relationships among related pervasive theme areas and pervasive theme areas and related phrases;
wherein the step of identifying phrases related to pervasive theme area further comprises the steps of:
k. extracting phrases for each pervasive theme area (PTA) from the full-text database which occur within a user-identified range of interest plus or minus a range of words of the PTA; and
l. listing the extracted phrases and their frequency of occurrence in the database for each PTA.
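Steps k and l (extracting phrases that fall within a range of words before and after each PTA occurrence, then listing them with their frequencies) can be sketched as follows. This is an illustrative reading, not the claimed method itself: the function name `phrases_near` and the parameters `radius` and `max_len` are hypothetical, the frequency counted is the frequency within the windows rather than in the whole database, and phrases lying in overlapping windows of nearby occurrences are counted once per window.

```python
from collections import Counter


def phrases_near(words, pta, radius, max_len=1):
    """Count phrases of 1..max_len adjacent words found within `radius`
    words before or after each occurrence of `pta`.

    `words` is the database as a list of words; `pta` is a list of words.
    Returns (phrase, count) pairs sorted by descending count.
    """
    k = len(pta)
    counts = Counter()
    for i in range(len(words) - k + 1):
        if words[i:i + k] != pta:
            continue
        before = words[max(0, i - radius):i]   # up to `radius` words before
        after = words[i + k:i + k + radius]    # up to `radius` words after
        for window in (before, after):
            for n in range(1, max_len + 1):
                for j in range(len(window) - n + 1):
                    counts[" ".join(window[j:j + n])] += 1
    return counts.most_common()


words = ("pulsed laser radar tracks targets while "
         "laser radar maps terrain").split()
nearby = phrases_near(words, ["laser", "radar"], radius=2)
```

Here "targets" lies within two words of both occurrences of "laser radar", so it heads the sorted list. Trivial words such as "while" would normally have been removed beforehand by step b.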
Specification