Creation of a category tree with respect to the contents of a data stock
First Claim
1. A system for analyzing data to establish a category tree comprising:
a data source;
an inventory representation of data in communication with the data source;
a computer unit having a processor in communication with said data source and said inventory representation of data;
software executing on said processor to;
1. create a list of words of each element within the inventory representation of data;
2. filter out stop words in each of said list of words;
3. calculate a significance value for each word remaining in each said list of words;
4. sort said list of words in descending order according to the significance values to create a sorted list of words;
5. reduce said sorted list of words to a maximum number of top elements to create a reduced list of words;
6. store said reduced list of words in a persistent memory;
7. detect co-occurrences within the stored reduced list of words;
8. store said co-occurrences as a table in the persistent memory;
9. retrieve words from the stored reduced list of words which have the highest significance values but which have no co-occurrences with each other;
10. establish a first level of the category tree using said retrieved words;
11. retrieve a list of co-occurrences for each word of said first level from said stored reduced list of words;
12. create a corresponding list of words for each said list of co-occurrences having no co-occurrences with each other;
13. calculate a frequency of co-occurrences for each of said corresponding list of words;
14. sort said corresponding list of words in descending order according to the frequency to create a sorted corresponding list of words;
15. reduce said sorted corresponding list of words to a predetermined maximum number of top elements to create a reduced corresponding list of words;
16. establish a subordinate level of the category tree using said reduced corresponding list of words; and
17. iteratively repeat steps 11 through 16 while no further co-occurrences can be retrieved from said persistent memory for a set of superior categories, wherein in step 11 the retrieved co-occurrences exists for all superior categories in said category tree;
wherein the category tree is consolidated for display on a display device.
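The steps of this claim can be sketched in a few functions. This is a minimal illustration, not the patented implementation. The claim does not define the significance measure, so a TF-IDF-style score is assumed here. The stop-word list, the helper names (`reduce_word_lists`, `cooccurrence_table`, `first_level`, `subtree`) and the size limits are hypothetical, and in-memory structures stand in for the persistent memory. The sketch also ranks candidates by frequency before discarding mutually co-occurring words, which simplifies the order of steps 12 through 15.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations

STOP_WORDS = {"a", "an", "and", "for", "in", "of", "the", "to", "with"}  # illustrative only

def reduce_word_lists(elements, max_words=3):
    """Steps 1-5: word list per element, stop-word filter, score, sort, truncate."""
    lists = [[w for w in text.lower().split() if w not in STOP_WORDS] for text in elements]
    doc_freq = Counter(w for words in lists for w in set(words))
    reduced = []
    for words in lists:
        tf = Counter(words)
        # Assumed significance: term frequency times a smoothed inverse document frequency.
        scores = {w: tf[w] * math.log(1 + len(lists) / doc_freq[w]) for w in tf}
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        reduced.append([(w, scores[w]) for w in ranked[:max_words]])
    return reduced

def cooccurrence_table(reduced):
    """Steps 7-8: count how often two words share a reduced list."""
    table = Counter()
    for entries in reduced:
        for a, b in combinations(sorted(w for w, _ in entries), 2):
            table[(a, b)] += 1
    return table

def cooccur(table, a, b):
    return table[tuple(sorted((a, b)))] > 0

def first_level(reduced, table, max_categories=3):
    """Steps 9-10: most significant words that do not co-occur with each other."""
    best = defaultdict(float)
    for entries in reduced:
        for w, score in entries:
            best[w] = max(best[w], score)
    level = []
    for w in sorted(best, key=lambda w: (-best[w], w)):
        if not any(cooccur(table, w, c) for c in level):
            level.append(w)
        if len(level) == max_categories:
            break
    return level

def subtree(path, reduced, table, max_children=2):
    """Steps 11-17: children must co-occur with every superior category on `path`."""
    freq = Counter()
    for entries in reduced:
        words = {w for w, _ in entries}
        if set(path) <= words:
            freq.update(words - set(path))
    children = []
    for w, _ in sorted(freq.items(), key=lambda kv: (-kv[1], kv[0])):
        if len(children) < max_children and not any(cooccur(table, w, c) for c in children):
            children.append(w)
    # Recursion stops once no word co-occurs with the whole path.
    return {c: subtree(path + [c], reduced, table, max_children) for c in children}

elements = [
    "python code python library",
    "a python snake reptile",
    "java code java compiler",
    "the java island coffee",
]
reduced = reduce_word_lists(elements)
table = cooccurrence_table(reduced)
tree = {w: subtree([w], reduced, table) for w in first_level(reduced, table)}
```

In this toy data stock the ambiguous words "python" and "java" never share a list, so both can serve as first-level categories, and their subordinate levels separate the programming sense from the other sense.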
Abstract
Methods for automatically creating a category tree with respect to the contents of a data stock, wherein a taxonomy of the data stock is created on the basis of co-occurrences. A further object of the invention is a data processing system comprising data that represent information in at least one data stock accessible via at least one data source, the system being designed and/or adapted to at least partially carry out a method according to the invention. A further object of the invention is a data processing device for the electronic processing of data, comprising a control and/or computer unit, an input unit and an output unit, the device being designed and/or adapted to at least partially carry out a method according to the invention, preferably using at least part of a data processing system according to the invention.
10 Citations
29 Claims
1. A system for analyzing data to establish a category tree, as recited under First Claim.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10
11. A system for analyzing data to establish a category tree comprising:
a data source;
an inventory representation of data in communication with the data source;
a computer unit having a processor in communication with said data source and said inventory representation of data;
software executing on said processor to;
1. create sets of words having a pre-determinable number of significant words for each text of the inventory representation of data;
2. store each set of words in a persistent memory as a list of words with an identifier of the related set of words for each word;
3. retrieve a list of words from each set of words;
4. establish a first level of the category tree with said retrieved list of words;
5. retrieve co-occurrences within each set of words stored in said persistent memory for each word in said list of words of the first level of the category tree;
6. store the co-occurrences in said persistent memory as a list of words;
7. establish a subordinate level of the category tree based on the list of co-occurrences;
8. determine co-occurrences for each word combination of the first and each subordinate level of the category tree within the stored sets of words in said persistent memory;
9. store said co-occurrences of each word combinations in said persistent memory;
10. iteratively repeat steps 7 through 9 for subordinated levels of the category tree until no further co-occurrences can be determined in step 8 for each combination of words;
wherein the category tree is consolidated for display on a display device.
Dependent claims: 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23
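The set-based procedure of claim 11 can be illustrated with an inverted index from each word to the identifiers of the sets that contain it. This is only a sketch under assumptions: the names `build_index`, `expand` and `category_tree` are hypothetical, the first level is passed in by the caller because the claim does not say how it is selected, and a dictionary stands in for the persistent memory. A word combination is treated as co-occurring when at least one stored set contains all of its words.

```python
from collections import defaultdict

def build_index(word_sets):
    """Step 2: record, for each word, the identifiers of the sets it belongs to."""
    index = defaultdict(set)
    for set_id, words in word_sets.items():
        for word in words:
            index[word].add(set_id)
    return index

def expand(path, word_sets, index):
    """Steps 5-10: find words that share a set with every word on `path`."""
    shared_ids = set.intersection(*(index[w] for w in path))
    children = set()
    for set_id in shared_ids:
        children |= set(word_sets[set_id])
    children -= set(path)
    # Recursion ends when no further co-occurrence exists for the combination.
    return {c: expand(path + (c,), word_sets, index) for c in sorted(children)}

def category_tree(word_sets, first_level):
    """Steps 4-10: build the tree below each first-level word."""
    index = build_index(word_sets)
    return {w: expand((w,), word_sets, index) for w in first_level}

# Hypothetical sets of significant words, keyed by set identifier.
word_sets = {1: ["engine", "wheel"], 2: ["engine", "piston"], 3: ["wheel", "tire"]}
tree = category_tree(word_sets, ["engine"])
```

Because every level is derived from set intersections, the same words can appear under several paths. A real system would presumably prune or merge such branches when the tree is consolidated for display.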
24. A system for analyzing data to establish a category tree comprising:
a data source;
an inventory representation of data in communication with the data source;
a computer unit having a processor in communication with said data source and said inventory representation of data;
software executing on said processor to;
1. create sets of words having a pre-determinable number of significant words for each text of the inventory representation of data;
2. store each set of words in a persistent memory as a list of words, with an identifier of the related set of words for each word;
3. retrieve a list of words from all words in said persistent memory;
4. establish a first level of the category tree with said retrieved list of words;
5. compare each word in said list of words to each word within the sets of words stored in the persistent memory, to determine whether two words match or achieve a predefined minimum similarity with respect to each other, wherein in case of no match of a word in said list of words this word will be skipped, and wherein in case of a match or given minimum similarity between the one word and all other words of said sets of words a weighted link having the weight 0.1 will be created if no link already exists, else the weight of the link will be increased by 0.1 and wherein if a weight of 1.0 is exceeded, the weight will be reset to 0.9 and all other links will be reduced to a value of 90%, else the increased weight will be used;
6. retrieve the links of each word on the retrieved list of words;
7. store the links in a list of words;
8. retrieve a subordinated level of the category tree based on its stored list of words;
9. retrieve the links of each word on the created list of words and at least one stored list of words;
10. store the links in a list of words;
11. iteratively repeat the steps 8 through 10 for subordinated levels of the category tree until the number of the links retrieved in step 9 is equal to zero;
wherein the category tree is consolidated for display on a display device.
Dependent claims: 25, 26, 27, 28, 29
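The link-weight rule in step 5 of claim 24 is concrete enough to sketch on its own. The following assumes that links are kept in a dictionary keyed by word pair. The function name `reinforce` and the constant names are hypothetical, and rounding is added only to avoid floating-point drift. Deciding when two words match or reach the minimum similarity is left outside the sketch.

```python
STEP = 0.1    # initial weight and increment per match
CAP = 1.0     # weight that must not be exceeded
RESET = 0.9   # weight assigned when the cap is exceeded
DECAY = 0.9   # all other links are reduced to 90% at that point

def reinforce(links, key):
    """Create or strengthen links[key] following the rule stated in step 5."""
    if key not in links:
        links[key] = STEP
        return
    increased = round(links[key] + STEP, 10)
    if increased > CAP:
        # Cap exceeded: scale every other link down, then reset this one.
        for other in links:
            if other != key:
                links[other] = round(links[other] * DECAY, 10)
        links[key] = RESET
    else:
        links[key] = increased

links = {}
for _ in range(10):
    reinforce(links, ("car", "engine"))   # climbs from 0.1 to 1.0
reinforce(links, ("car", "wheel"))
reinforce(links, ("car", "wheel"))        # now 0.2
reinforce(links, ("car", "engine"))       # would reach 1.1, so the reset applies
```

The reset keeps frequently matched links near the top of the scale while gradually weakening links that are no longer reinforced.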
Specification