Method and apparatus for classifying documents within a class hierarchy creating term vector, term file and relevance ranking
5 Assignments
0 Petitions
Abstract
A method for classifying a document based on content within a class hierarchy. The class hierarchy comprises a plurality of category nodes stored within a tree data structure. Each of the plurality of category nodes includes a category name corresponding to a unique directory and a category definition comprising a set of defining terms. The class hierarchy is searched to determine appropriate categories for classification of the document. The document is then stored in directories corresponding to the categories selected for classification. If no categories are produced by the search, a system administrator is notified of the unsuccessful search.
356 Citations
75 Claims
-
1. A method for classifying a document based on content within a class hierarchy, the method comprising:
-
initializing the class hierarchy, the class hierarchy having a root category node within a tree data structure, the root category node having a user-defined category name;
displaying the class hierarchy;
accepting a user-selected command for manipulating the class hierarchy;
processing a category command in response to the user-selected command having a first predefined state, causing the class hierarchy to contain a plurality of category nodes, said processing the category command further comprising;
storing a category name in one of the plurality of category nodes, wherein each of the plurality of category nodes corresponds to a unique directory;
storing a NodeID within one of the plurality of category nodes, the NodeID defining the unique directory;
storing a nodetype within one of the plurality of category nodes, the nodetype when having a predefined type allowing a new category node to be added to a selected one of the plurality of category nodes, and otherwise preventing the new category node from being added to the selected one of the plurality of category nodes;
storing a ParentID within one of the plurality of category nodes, the ParentID indicating a NodeID of a parent category node;
storing a LinkID within a first one of the plurality of category nodes, the LinkID indicating a NodeID of a second one of the plurality of category nodes when the nodetype is of a predefined type;
creating a class hierarchy by providing a plurality of category nodes stored in a tree data structure within a memory, each of said plurality of category nodes having a category name corresponding to a unique directory and a set of defining terms;
creating a plurality of terms files, each of said plurality of terms files corresponding to one of said plurality of category nodes and including a corresponding set of defining terms and one or more document fragments stored under said one of said plurality of category nodes, said set of defining terms including a term corresponding to one of said plurality of category nodes and said one or more document fragments including a reference to one or more documents and indexing information indicating contiguous multi-term portions of said documents to be extracted during indexing, said set of defining terms and said document fragments together providing a definition of files to be contained in said unique directory referenced by said one of said plurality of category nodes;
creating one or more term vectors for each of said terms files, each of said term vectors containing a weight assigned to each of one or more common terms of the corresponding terms file according to frequency of occurrence in the corresponding terms file;
creating a document vector for the document, said document vector containing a weight assigned to the terms of the document according to frequency of occurrence;
providing a relevance ranking between said terms files and said document by comparing said document vector with said one or more term vectors; and
storing said document within said document directory hierarchy at a location corresponding to a category node having a term vector which has a relevance ranking that matches a selected criteria.
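The vector steps at the end of claim 1 (term vectors, a document vector, and a relevance ranking between them) can be illustrated with a minimal sketch. The claim only requires weights "according to frequency of occurrence" and a comparison of vectors; the relative-frequency weighting, the cosine similarity measure, the `threshold` parameter standing in for the "selected criteria", and all function names below are assumptions made for illustration, not the method the patent specifies.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Weight each term by its relative frequency of occurrence (an assumed weighting)."""
    terms = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter(terms)
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()} if total else {}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts (an assumed measure)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_categories(document_text, terms_files, threshold=0.2):
    """Return (category name, score) pairs meeting the threshold, best match first."""
    doc_vec = term_vector(document_text)
    scored = [(name, cosine(doc_vec, term_vector(body)))
              for name, body in terms_files.items()]
    kept = [(name, score) for name, score in scored if score >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Hypothetical terms files keyed by category name.
terms_files = {
    "Finance": "bank loan interest bank credit",
    "Sports": "match team score goal team",
}
# Only "Finance" shares terms with this document, so it is the only match.
ranking = rank_categories("the bank raised interest on the loan", terms_files)
```

In this sketch the document would then be stored at the directory of each category in `ranking`.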
-
-
8. A method for classifying a document based on content within a class hierarchy, the class hierarchy comprising a plurality of category nodes stored in a tree data structure within a memory, each of the plurality of category nodes having a category name corresponding to a unique directory and a set of defining terms, the method comprising:
-
initializing the class hierarchy, the class hierarchy having a root category node within a tree data structure, the root category node having a user-defined category name;
displaying the class hierarchy;
accepting a user-selected command for manipulating the class hierarchy;
processing a category command in response to the user-selected command having a first predefined state, causing the class hierarchy to contain a plurality of category nodes, said processing the category command further comprising;
storing a category name in one of the plurality of category nodes, wherein each of the plurality of category nodes corresponds to a unique directory;
storing a NodeID within one of the plurality of category nodes, the NodeID defining the unique directory;
storing a nodetype within one of the plurality of category nodes, the nodetype when having a predefined type allowing a new category node to be added to a selected one of the plurality of category nodes, and otherwise preventing the new category node from being added to the selected one of the plurality of category nodes;
storing a ParentID within one of the plurality of category nodes, the ParentID indicating a NodeID of a parent category node;
storing a LinkID within a first one of the plurality of category nodes, the LinkID indicating a NodeID of a second one of the plurality of category nodes when the nodetype is of a predefined type;
creating a plurality of terms files, each of the plurality of terms files corresponding to one of the plurality of category nodes and including a corresponding set of defining terms and one or more document fragments stored under said one of said plurality of category nodes, said set of defining terms including a term corresponding to one of said plurality of category nodes and said one or more document fragments including a reference to one or more documents and indexing information indicating contiguous multi-term portions of said documents to be extracted during indexing, said set of defining terms and said document fragments together providing a definition of files to be contained in said unique directory referenced by said one of said plurality of category nodes;
building a path-to-name translation listing containing each category name and unique directory pair for each of the plurality of category nodes;
indexing the class hierarchy using Fulcrum to create an index file containing term vectors corresponding to the plurality of terms files;
classifying at least one document within a document directory hierarchy using said term vectors; and
indexing the document directory hierarchy using Fulcrum.
retrieving a set of defining terms for one of the plurality of category nodes;
forming a terms file within the unique directory in the class hierarchy corresponding to the one of the plurality of category nodes; and
storing the set of defining terms within the terms file.
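The category node fields recited in claim 8 (category name, NodeID, nodetype, ParentID, LinkID) and the path-to-name translation listing can be sketched as a small data structure. This is a sketch under stated assumptions: the nodetype values `"category"` and `"link"`, the use of the NodeID as the directory name, and every class and function name are hypothetical and do not come from the claims.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class CategoryNode:
    node_id: int                   # NodeID: defines the node's unique directory
    name: str                      # category name
    nodetype: str                  # hypothetical values: "category" or "link"
    parent_id: Optional[int]       # ParentID: NodeID of the parent, None for the root
    link_id: Optional[int] = None  # LinkID: NodeID of the linked node (link nodes only)


def add_child(nodes: Dict[int, CategoryNode], parent_id: int, node: CategoryNode) -> None:
    """Add a node under parent_id only if the parent's nodetype allows children."""
    if nodes[parent_id].nodetype != "category":
        raise ValueError("parent nodetype does not allow new category nodes")
    node.parent_id = parent_id
    nodes[node.node_id] = node


def directory_path(nodes: Dict[int, CategoryNode], node_id: int) -> str:
    """Build a directory path by following ParentID values up to the root."""
    parts = []
    current: Optional[int] = node_id
    while current is not None:
        parts.append(str(current))  # assumption: directories are named by NodeID
        current = nodes[current].parent_id
    return "/".join(reversed(parts))


def path_to_name_listing(nodes: Dict[int, CategoryNode]) -> Dict[str, str]:
    """Pair each node's unique directory path with its category name."""
    return {directory_path(nodes, nid): node.name for nid, node in nodes.items()}


# Hypothetical hierarchy: a root, two nested categories, and one link node.
nodes = {1: CategoryNode(1, "Root", "category", None)}
add_child(nodes, 1, CategoryNode(2, "Finance", "category", None))
add_child(nodes, 2, CategoryNode(3, "Banking", "category", None))
add_child(nodes, 1, CategoryNode(4, "See Banking", "link", None, link_id=3))
listing = path_to_name_listing(nodes)
```

Keeping the listing as a plain mapping lets a later storing step translate between category names and directories without walking the tree again.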
-
-
11. The method according to claim 8, the step of classifying further comprising the following steps:
-
creating a document vector for a document to be classified within the class hierarchy;
searching the class hierarchy for the term vectors which are relevant to the document to determine appropriate categorization of the document, the step of searching if successful returning a list of matching category names, and otherwise notifying a system administrator that the step of searching was unsuccessful; and
storing the document in directories corresponding to the matching category names if the step of searching is successful.
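The control flow of claim 11 (search for relevant term vectors, return matching category names on success, otherwise notify a system administrator) can be sketched as follows. The claims attribute the actual comparison to Fulcrum; the `overlap` score here is a toy stand-in rather than Fulcrum's ranking, and the `threshold` and `notify` parameters are hypothetical.

```python
import logging

logger = logging.getLogger("classifier")


def overlap(u, v):
    """Toy relevance score: sum of the smaller weight over shared terms."""
    return sum(min(w, v[t]) for t, w in u.items() if t in v)


def classify(document_id, document_vector, term_vectors, score=overlap,
             threshold=0.2, notify=None):
    """Return matching category names; notify the administrator when none match."""
    matches = [name for name, vec in term_vectors.items()
               if score(document_vector, vec) >= threshold]
    if not matches:
        # Unsuccessful search: report it instead of storing the document.
        (notify or logger.warning)(f"no category matched document {document_id}")
    return matches


# Hypothetical term vector for a single category.
term_vectors = {"Finance": {"bank": 0.5, "loan": 0.5}}
notifications = []
found = classify("doc-1", {"bank": 0.4, "rate": 0.6}, term_vectors,
                 notify=notifications.append)
missed = classify("doc-2", {"goal": 1.0}, term_vectors,
                  notify=notifications.append)
```

Passing `notify` as a callable keeps the sketch independent of any particular alerting channel; by default it logs a warning.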
-
-
12. The method according to claim 11, wherein the step of searching comprises:
comparing the document vector to the term vectors using Fulcrum, the step of comparing returning a list of matching categories within the class hierarchy according to user-defined criteria, the list of matching categories corresponding to the at least one of the plurality of category nodes.
-
13. The method according to claim 11, the step of storing further comprising the following steps:
-
retrieving a directory path corresponding to one of the matching category names utilizing the path-to-name translation listing;
creating directories within the document directory hierarchy corresponding to the directory path if not already existing within the document directory hierarchy; and
adding the document to a leaf directory within the document directory hierarchy corresponding to the retrieved directory path.
-
-
14. The method according to claim 13, wherein the step of adding comprises linking the document to the directory.
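The storing steps of claims 13 and 14 (look up the directory path for a matching category name, create any missing directories, and link the document into the leaf directory) can be sketched with standard library calls. The listing layout (directory path mapped to category name), the choice of a hard link via `os.link`, and the function name are assumptions; the claims do not say which kind of link is used.

```python
import os


def store_document(document_path, category_name, path_to_name, hierarchy_root):
    """Link the document into the leaf directory for category_name; return the link path."""
    # Translate the category name back to its directory path.
    directory = next((path for path, name in path_to_name.items()
                      if name == category_name), None)
    if directory is None:
        raise KeyError(f"no directory listed for category {category_name!r}")
    leaf = os.path.join(hierarchy_root, directory)
    os.makedirs(leaf, exist_ok=True)  # creates only the directories that are missing
    target = os.path.join(leaf, os.path.basename(document_path))
    if not os.path.exists(target):
        # Hard link (assumption); requires source and target on the same filesystem.
        os.link(document_path, target)
    return target
```

Linking rather than copying lets one document appear under several matching categories without duplicating its contents; `os.symlink` would be an alternative where hard links are not available.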
-
15. The method according to claim 11, the step of storing further comprising the following steps:
-
creating a directory within the document directory hierarchy corresponding to each of the unique directories within the class hierarchy;
retrieving a directory path corresponding to one of the matching category names utilizing the path-to-name translation listing; and
adding the document to a directory within the document directory hierarchy corresponding to the retrieved directory path.
-
-
16. The method according to claim 15, wherein the step of adding comprises linking the document to the directory.
-
17. The method of claim 8, wherein each of said contiguous multi-term portions is one sentence or longer.
-
18. The method of claim 8, wherein each of said contiguous multi-term portions is one paragraph or longer.
-
19. The method of claim 8, wherein each of said contiguous multi-term portions is 25 words or longer.
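Claims 17 through 19 constrain the length of the contiguous multi-term portions that a document fragment points to. A minimal sketch of a fragment record and its extraction is shown below, using the 25-word minimum of claim 19. The record layout (a document reference plus a word offset and count), whitespace word splitting, and all names are assumptions; the claims only state that a fragment holds a document reference and indexing information identifying the portion.

```python
from dataclasses import dataclass


@dataclass
class DocumentFragment:
    document_ref: str  # reference to the source document (hypothetical: a file path)
    start_word: int    # index of the first word of the portion
    word_count: int    # number of contiguous words in the portion


def extract_portion(fragment, document_text, min_words=25):
    """Return the contiguous portion named by the fragment, enforcing a minimum length."""
    if fragment.word_count < min_words:
        raise ValueError(f"portion must be at least {min_words} words")
    words = document_text.split()
    end = fragment.start_word + fragment.word_count
    portion = words[fragment.start_word:end]
    if len(portion) < fragment.word_count:
        raise ValueError("fragment extends past the end of the document")
    return " ".join(portion)
```

A sentence or paragraph minimum (claims 17 and 18) would need a sentence or paragraph splitter in place of the word count check.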
-
20. A computer system for classifying a document comprising:
-
a processor; and
a memory having stored therein the following;
means for initializing the class hierarchy, the class hierarchy having a root category node within a tree data structure, the root category node having a user-defined category name;
means for displaying the class hierarchy;
accepting a user-selected command for manipulating the class hierarchy;
means for processing a category command in response to the user-selected command having a first predefined state, causing the class hierarchy to contain a plurality of category nodes, said processing means further comprising;
means for storing a category name in one of the plurality of category nodes, wherein each of the plurality of category nodes corresponds to a unique directory;
means for storing a NodeID within one of the plurality of category nodes, the NodeID defining the unique directory;
means for storing a nodetype within one of the plurality of category nodes, the nodetype when having a predefined type allowing a new category node to be added to a selected one of the plurality of category nodes, and otherwise preventing the new category node from being added to the selected one of the plurality of category nodes;
means for storing a ParentID within one of the plurality of category nodes, the ParentID indicating a NodeID of a parent category node;
means for storing a LinkID within a first one of the plurality of category nodes, the LinkID indicating a NodeID of a second one of the plurality of category nodes when the nodetype is of a predefined type;
means for creating a class hierarchy having a plurality of category nodes stored in a tree data structure within a memory, each of the plurality of category nodes having a category name corresponding to a unique directory and a set of defining terms;
means for creating a plurality of terms files, each of the plurality of terms files corresponding to one of the plurality of category nodes and including a corresponding set of defining terms and one or more document fragments stored under said one of said plurality of category nodes, said set of defining terms including a term corresponding to one of said plurality of category nodes and said one or more document fragments including a reference to one or more documents and indexing information indicating contiguous multi-term portions of said documents to be extracted during indexing, said set of defining terms and said document fragments together providing a definition of files to be contained in said unique directory referenced by said one of said plurality of category nodes;
means for creating one or more term vectors for each of said terms files, each of said term vectors containing a weight assigned to each of one or more common terms of the corresponding terms file according to frequency of occurrence in the corresponding terms file;
means for building a path-to-name translation listing containing a directory path and category name for each of the plurality of category nodes;
an indexing means for providing a relevance ranking between the terms files and the document by comparing the document vector with the at least one term vector; and
means for storing the document within the document directory hierarchy at a location corresponding to a category node having a term vector which has a relevance ranking that matches a selected criteria.
-
-
26. A computer system for classifying a document based on content within a class hierarchy, the class hierarchy comprising a plurality of category nodes stored in a tree data structure within a memory, each of the plurality of category nodes having a category name corresponding to a unique directory and a set of defining terms, the computer system comprising:
-
a processor; and
a memory having stored therein the following;
means for initializing the class hierarchy, the class hierarchy having a root category node within a tree data structure, the root category node having a user-defined category name;
means for displaying the class hierarchy;
means for accepting a user-selected command for manipulating the class hierarchy;
means for processing a category command in response to the user-selected command having a first predefined state, causing the class hierarchy to contain a plurality of category nodes, said processing means further comprising;
means for storing a category name in one of the plurality of category nodes, wherein each of the plurality of category nodes corresponds to a unique directory;
means for storing a NodeID within one of the plurality of category nodes, the NodeID defining the unique directory;
means for storing a nodetype within one of the plurality of category nodes, the nodetype when having a predefined type allowing a new category node to be added to a selected one of the plurality of category nodes, and otherwise preventing the new category node from being added to the selected one of the plurality of category nodes;
means for storing a ParentID within one of the plurality of category nodes, the ParentID indicating a NodeID of a parent category node;
means for storing a LinkID within a first one of the plurality of category nodes, the LinkID indicating a NodeID of a second one of the plurality of category nodes when the nodetype is of a predefined type;
means for creating a plurality of terms files, each of the plurality of terms files corresponding to one of the plurality of category nodes and including a corresponding set of defining terms and one or more document fragments stored under said one of said plurality of category nodes, said set of defining terms including a term corresponding to one of said plurality of category nodes and said one or more document fragments including a reference to one or more documents and indexing information indicating contiguous multi-term portions of said documents to be extracted during indexing, said set of defining terms and said document fragments together providing a definition of files to be contained in said unique directory referenced by said one of said plurality of category nodes;
means for making one or more term vectors for each of said terms files, each of said term vectors containing a weight assigned to each of one or more common terms of the corresponding terms file according to the frequency of occurrence in the corresponding terms file;
means for making a document vector for the document, said document vector containing a weight assigned to the terms of the document according to frequency of occurrence, means for building a path-to-name translation listing containing each category name and unique directory pair for each of the plurality of category nodes;
means for indexing the class hierarchy using Fulcrum to create an index file containing term file vectors corresponding to the plurality of terms files;
means for classifying at least one document within a document directory hierarchy using the term file vectors; and
means for indexing the document directory hierarchy using Fulcrum.
means for retrieving a set of defining terms for one of the plurality of category nodes;
means for forming a terms file within the unique directory in the class hierarchy corresponding to the one of the plurality of category nodes; and
means for storing the set of defining terms within the terms file.
-
-
30. The computer system according to claim 26, the means for classifying further comprising:
-
means for creating a document vector for a document to be classified within the class hierarchy;
means for searching the class hierarchy for the terms file vectors which are relevant to the document to determine appropriate categorization of the document, the step of searching if successful returning a list of matching category names, and otherwise notifying a system administrator that the step of searching was unsuccessful; and
means for storing the document in directories corresponding to the matching category names if the step of searching is successful.
-
-
31. The computer system according to claim 30, wherein the means for searching comprises:
means for comparing the document vector to the term vectors using Fulcrum, the step of comparing returning a list of matching categories within the class hierarchy according to the user-defined criteria, the list of matching categories corresponding to the at least one of the plurality of category nodes.
-
32. The computer system according to claim 30, the means for storing further comprising:
-
means for retrieving a directory path corresponding to one of the matching category names utilizing the path-to-name translation listing;
means for creating directories within the document directory hierarchy corresponding to the directory path if not already existing within the document directory hierarchy; and
means for adding the document to a leaf directory within the document directory hierarchy corresponding to the retrieved directory path.
-
-
33. The computer system according to claim 32, wherein the means for adding comprises linking the document to the directory.
-
34. The computer system according to claim 30, the means for storing further comprising:
-
means for creating a directory within the document directory hierarchy corresponding to each of the unique directories within the class hierarchy;
means for retrieving a directory path corresponding to one of the matching category names utilizing the path-to-name translation listing; and
means for adding the document to a directory within the document directory hierarchy corresponding to the retrieved directory path.
-
-
35. The computer system according to claim 34, wherein the means for adding comprises linking the document to the directory.
-
36. The computer system of claim 26, wherein each of said contiguous multi-term portions is one sentence or longer.
-
37. The computer system of claim 26, wherein each of said contiguous multi-term portions is one paragraph or longer.
-
38. An article of manufacture, comprising:
-
a computer usable medium having a computer readable program code means embodied therein for classifying a document based on content within a class hierarchy, the computer readable program code means in the article of manufacture comprising;
computer-readable program means for initializing the class hierarchy, the class hierarchy having a root category node within a tree data structure, the root category node having a user-defined category name;
computer-readable program means for displaying the class hierarchy;
computer-readable program means for accepting a user-selected command for manipulating the class hierarchy;
computer-readable program means for processing a category command in response to the user-selected command having a first predefined state, causing the class hierarchy to contain a plurality of category nodes, said processing the category command further comprising;
computer-readable program means for storing a category name in one of the plurality of category nodes, wherein each of the plurality of category nodes corresponds to a unique directory;
computer-readable program means for storing a NodeID within one of the plurality of category nodes, the NodeID defining the unique directory;
computer-readable program means for storing a nodetype within one of the plurality of category nodes, the nodetype when having a predefined type allowing a new category node to be added to a selected one of the plurality of category nodes, and otherwise preventing the new category node from being added to the selected one of the plurality of category nodes;
computer-readable program means for storing a ParentID within one of the plurality of category nodes, the ParentID indicating a NodeID of a parent category node;
computer-readable program means for storing a LinkID within a first one of the plurality of category nodes, the LinkID indicating a NodeID of a second one of the plurality of category nodes when the nodetype is of a predefined type;
computer-readable program means for creating the class hierarchy by providing a plurality of category nodes stored in a tree data structure within a memory, each of the plurality of category nodes having a category name corresponding to a unique directory and a set of defining terms and one or more document fragments stored under said one of said plurality of category nodes, said set of defining terms including a term corresponding to one of said plurality of category nodes and said one or more document fragments including a reference to one or more documents and indexing information indicating contiguous multi-term portions of said documents to be extracted during indexing, said set of defining terms and said document fragments together providing a definition of files to be contained in said unique directory referenced by said one of said plurality of category nodes;
computer-readable program means for creating a plurality of terms files, each of the plurality of terms files corresponding to one of the plurality of category nodes and including a corresponding set of defining terms;
computer readable program means for making one or more term vectors for each of said terms files, each of said term vectors containing a weight assigned to each of one or more common terms of the corresponding terms file according to frequency of occurrence in the corresponding terms file;
computer readable program means for creating a document vector for the document, said document vector containing a weight assigned to the terms of the document according to frequency of occurrence, computer-readable program means for providing a relevance ranking between the terms files and the document by comparing the document vector with said one or more term vectors; and
computer-readable program means for storing the document within the document directory hierarchy at a location corresponding to a category node having a term vector which has a relevance ranking that matches a selected criteria.
computer-readable program means for creating at least one term file vector by indexing at least one of the plurality of terms files within the class hierarchy; and
computer-readable program means for creating a document vector by indexing a document which is selected for classification within a document directory hierarchy.
-
-
41. The computer program product of claim 38, wherein the computer-readable program means for storing further includes computer-readable program means for defining the location according to a directory path within a path-to-name translation listing which corresponds to the category node having a term vector which has a relevance ranking that matches a selected criteria.
-
42. The article of manufacture of claim 38, wherein each of said contiguous multi-term portions is one sentence or longer.
-
43. The article of manufacture of claim 38, wherein each of said contiguous multi-term portions is one paragraph or longer.
-
44. The article of manufacture of claim 38, wherein each of said contiguous multi-term portions is 25 words or longer.
-
45. A computer-readable medium recording software, the software disposed on a computer to perform a method for classifying a document based on content within a class hierarchy, the class hierarchy comprising a plurality of category nodes stored in a tree data structure within a memory, each of the plurality of category nodes having a category name corresponding to a unique directory and a set of defining terms, the method comprising:
-
initializing the class hierarchy, the class hierarchy having a root category node within a tree data structure, the root category node having a user-defined category name;
displaying the class hierarchy;
accepting a user-selected command for manipulating the class hierarchy;
processing a category command in response to the user-selected command having a first predefined state, causing the class hierarchy to contain a plurality of category nodes, said processing the category command further comprising;
storing a category name in one of the plurality of category nodes, wherein each of the plurality of category nodes corresponds to a unique directory;
storing a NodeID within one of the plurality of category nodes, the NodeID defining the unique directory;
storing a nodetype within one of the plurality of category nodes, the nodetype when having a predefined type allowing a new category node to be added to a selected one of the plurality of category nodes, and otherwise preventing the new category node from being added to the selected one of the plurality of category nodes;
storing a ParentID within one of the plurality of category nodes, the ParentID indicating a NodeID of a parent category node;
storing a LinkID within a first one of the plurality of category nodes, the LinkID indicating a NodeID of a second one of the plurality of category nodes when the nodetype is of a predefined type;
creating a plurality of terms files, each of the plurality of terms files corresponding to one of the plurality of category nodes and including a corresponding set of defining terms and one or more document fragments stored under said one of said plurality of category nodes, said set of defining terms including a term corresponding to one of said plurality of category nodes and said one or more document fragments including a reference to one or more documents and indexing information indicating contiguous multi-term portions of said documents to be extracted during indexing, said set of defining terms and said document fragments together providing a definition of files to be contained in said unique directory referenced by said one of said plurality of category nodes;
making one or more term vectors for each of said terms files, each of said term vectors containing a weight assigned to each of one or more common terms of the corresponding terms file according to frequency of occurrence in the corresponding terms file;
building a path-to-name translation listing containing each category name and unique directory pair for each of the plurality of category nodes;
indexing the class hierarchy using Fulcrum to create an index file containing term file vectors corresponding to the plurality of terms files;
classifying at least one document within a document directory hierarchy using terms vectors; and
indexing the document directory hierarchy using Fulcrum.
retrieving a set of defining terms for one of the plurality of category nodes;
forming a terms file within the unique directory in the class hierarchy corresponding to the one of the plurality of category nodes; and
storing the set of defining terms within the terms file.
48. The computer-readable medium according to claim 45, the step of classifying further comprising the following steps:
creating a document vector for a document to be classified within the class hierarchy;
searching the class hierarchy for the terms vectors which are relevant to the document to determine appropriate categorization of the document, the step of searching if successful returning a list of matching category names, and otherwise notifying a system administrator that the step of searching was unsuccessful; and
storing the document in directories corresponding to the matching category names if the step of searching is successful.
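Illustrative sketch (not part of the claims): the classifying steps of claim 48 can be pictured as a small Python function. The names make_vector, classify, min_shared_terms and notify are hypothetical, and a simple shared-term count stands in for the Fulcrum search, which the claims do not detail.

```python
from collections import Counter


def make_vector(text):
    """Lower-case the text, split on whitespace, and count each term."""
    return Counter(text.lower().split())


def classify(document_text, term_vectors, min_shared_terms=2, notify=print):
    """Return the category names whose term vector shares enough terms
    with the document vector.

    term_vectors maps a category name to a Counter of terms.
    min_shared_terms is an assumed stand-in for the matching criteria.
    notify is an assumed stand-in for alerting the system administrator.
    """
    doc_vector = make_vector(document_text)
    matches = [
        name
        for name, vector in term_vectors.items()
        if len(set(doc_vector) & set(vector)) >= min_shared_terms
    ]
    if not matches:
        # Unsuccessful search: report it instead of storing the document.
        notify("classification failed: no category matched the document")
    return matches
```

A caller would store the document under each returned category name, and would store nothing when the list is empty.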
49. The computer-readable medium according to claim 48, wherein the step of searching comprises:
comparing the document vector to the term vectors using Fulcrum, the step of comparing returning a list of matching categories within the class hierarchy according to user-defined criteria, the list of matching categories corresponding to the at least one of the plurality of category nodes.
50. The computer-readable medium according to claim 48, the step of storing further comprising the following steps:
retrieving a directory path corresponding to one of the matching category names utilizing the path-to-name translation listing;
creating directories within the document directory hierarchy corresponding to the directory path if not already existing within the document directory hierarchy; and
adding the document to a leaf directory within the document directory hierarchy corresponding to the retrieved directory path.
51. The computer-readable medium according to claim 50, wherein the step of adding comprises linking the document to the directory.
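Illustrative sketch (not part of the claims): the storing steps of claims 50 and 51 map naturally onto standard file-system calls. The function name store_document, the dictionary form of the path-to-name translation listing, and the copy fallback are assumptions made for this example.

```python
import os
import shutil


def store_document(doc_path, category_name, path_listing, doc_root):
    """File doc_path under the directory that path_listing gives for
    category_name, creating missing directories along the way.

    path_listing is assumed to be a dict of category name -> relative path.
    """
    relative_dir = path_listing[category_name]   # path-to-name translation lookup
    leaf_dir = os.path.join(doc_root, relative_dir)
    os.makedirs(leaf_dir, exist_ok=True)         # create only what does not exist yet
    target = os.path.join(leaf_dir, os.path.basename(doc_path))
    if not os.path.exists(target):
        try:
            os.link(doc_path, target)            # link the document rather than copy it
        except OSError:
            # Assumed fallback for file systems without hard-link support.
            shutil.copy2(doc_path, target)
    return target
```

Linking keeps a single copy of the document on disk even when it matches several categories.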
52. The computer-readable medium according to claim 48, the step of storing further comprising the following steps:
creating a directory within the document directory hierarchy corresponding to each of the unique directories within the class hierarchy;
retrieving a directory path corresponding to one of the matching category names utilizing the path-to-name translation listing; and
adding the document to a directory within the document directory hierarchy corresponding to the retrieved directory path.
53. The computer-readable medium according to claim 52, wherein the step of adding comprises linking the document to the directory.
54. The computer-readable medium recording software of claim 45, wherein each of said contiguous multi-term portions is one sentence or longer.
55. The computer-readable medium recording software of claim 45, wherein each of said contiguous multi-term portions is one paragraph or longer.
56. The computer-readable medium recording software of claim 45, wherein each of said contiguous multi-term portions is 25 words or longer.
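Illustrative sketch (not part of the claims): claim 45 recites term vectors whose weights follow each term's frequency of occurrence in the corresponding terms file. One simple reading, assumed here, is relative frequency. The function name make_term_vector and the min_count cut-off for what counts as a "common" term are hypothetical.

```python
from collections import Counter


def make_term_vector(terms_file_text, min_count=1):
    """Weight each term by its relative frequency in the terms file text.

    terms_file_text is assumed to hold the defining terms followed by the
    text of the extracted document fragments.
    min_count is an assumed threshold for keeping a term in the vector.
    """
    counts = Counter(terms_file_text.lower().split())
    total = sum(counts.values())
    return {
        term: count / total
        for term, count in counts.items()
        if count >= min_count
    }
```

Longer fragments contribute more term occurrences, so they shift the weights further than a bare list of defining terms would.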
57. A computer data signal embodied in a carrier wave and representing sequences of instructions which, when executed by a processor, cause said processor to classify a document based on content within a class hierarchy, by performing the following steps:
initializing the class hierarchy, the class hierarchy having a root category node within a tree data structure, the root category node having a user-defined category name;
displaying the class hierarchy;
accepting a user-selected command for manipulating the class hierarchy;
processing a category command in response to the user-selected command having a first predefined state, causing the class hierarchy to contain a plurality of category nodes, said processing the category command further comprising;
storing a category name in one of the plurality of category nodes, wherein each of the plurality of category nodes corresponds to a unique directory;
storing a NodeID within one of the plurality of category nodes, the NodeID defining the unique directory;
storing a nodetype within one of the plurality of category nodes, the nodetype when having a predefined type allowing a new category node to be added to a selected one of the plurality of category nodes, and otherwise preventing the new category node from being added to the selected one of the plurality of category nodes;
storing a ParentID within one of the plurality of category nodes, the ParentID indicating a NodeID of a parent category node;
storing a LinkID within a first one of the plurality of category nodes, the LinkID indicating a NodeID of a second one of the plurality of category nodes when the nodetype is of a predefined type;
creating the class hierarchy by providing a plurality of category nodes stored in a tree data structure within a memory, each of the plurality of category nodes having a category name corresponding to a unique directory and a set of defining terms and one or more document fragments stored under said one of said plurality of category nodes, said set of defining terms including a term corresponding to one of said plurality of category nodes and said one or more document fragments including a reference to one or more documents and indexing information indicating contiguous multi-term portions of said documents to be extracted during indexing, said set of defining terms and said document fragments together providing a definition of files to be contained in said unique directory referenced by said one of said plurality of category nodes;
creating a plurality of terms files, each of the plurality of terms files corresponding to one of the plurality of category nodes and including a corresponding set of defining terms;
making one or more term vectors for each of said terms files, each of said term vectors containing a weight assigned to each of one or more common terms of the corresponding terms files according to frequency of occurrence in the corresponding terms file;
providing a relevance ranking between the term vectors and the document by comparing said document vector with the at least one term vector; and
storing the document within the document directory hierarchy at a location corresponding to a category node having a term vector which has a relevance ranking that matches a selected criteria.
computer-readable program means for creating at least one term file vector by indexing at least one of the plurality of terms files within the class hierarchy; and
computer-readable program means for creating a document vector by indexing a document which is selected for classification within a document directory hierarchy.
60. The computer data signal of claim 57, wherein the step of storing further includes a step of defining the location according to a directory path within a path-to-name translation listing which corresponds to the category node having a term vector which has a relevance ranking that matches a selected criteria.
61. The computer data signal of claim 57, wherein each of said contiguous multi-term portions is one sentence or longer.
62. The computer data signal of claim 57, wherein each of said contiguous multi-term portions is one paragraph or longer.
63. The computer data signal of claim 57, wherein each of said contiguous multi-term portions is 25 words or longer.
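Illustrative sketch (not part of the claims): claim 57 recites a relevance ranking obtained by comparing a document vector with the term vectors, followed by storage under categories whose ranking meets a selected criteria. The claims do not name a similarity measure, so cosine similarity and a fixed threshold are assumed here purely for illustration. The names cosine, rank_categories and threshold are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse vectors held as dicts."""
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_categories(doc_vector, term_vectors, threshold=0.5):
    """Return category names ordered by descending similarity, keeping
    only those whose score reaches the assumed threshold."""
    scored = [
        (cosine(doc_vector, vector), name)
        for name, vector in term_vectors.items()
    ]
    return [name for score, name in sorted(scored, reverse=True) if score >= threshold]
```

The returned names would then be translated to directory paths to decide where the document is stored.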
64. A computer data signal embodied in a carrier wave comprising:
means for initializing the class hierarchy, the class hierarchy having a root category node within a tree data structure, the root category node having a user-defined category name;
means for displaying the class hierarchy;
means for accepting a user-selected command for manipulating the class hierarchy;
means for processing a category command in response to the user-selected command having a first predefined state, causing the class hierarchy to contain a plurality of category nodes, said processing means further comprising;
means for storing a category name in one of the plurality of category nodes, wherein each of the plurality of category nodes corresponds to a unique directory;
means for storing a NodeID within one of the plurality of category nodes, the NodeID defining the unique directory;
means for storing a nodetype within one of the plurality of category nodes, the nodetype when having a predefined type allowing a new category node to be added to a selected one of the plurality of category nodes, and otherwise preventing the new category node from being added to the selected one of the plurality of category nodes;
means for storing a ParentID within one of the plurality of category nodes, the ParentID indicating a NodeID of a parent category node;
means for storing a LinkID within a first one of the plurality of category nodes, the LinkID indicating a NodeID of a second one of the plurality of category nodes when the nodetype is of a predefined type;
means for creating a class hierarchy by providing a plurality of category nodes in a tree data structure within a memory, each of the plurality of category nodes having a category name corresponding to a unique directory and a set of defining terms;
means for creating a plurality of terms files, each of the plurality of terms files corresponding to one of the plurality of category nodes and including a corresponding set of defining terms and one or more document fragments stored under said one of said plurality of category nodes, said set of defining terms including a term corresponding to one of said plurality of category nodes and said one or more document fragments including a reference to one or more documents and indexing information indicating contiguous multi-term portions of said documents to be extracted during indexing, said set of defining terms and said document fragments together providing a definition of files to be contained in said unique directory referenced by said one of said plurality of category nodes;
means for creating one or more term vectors for each of said terms files, each of said term vectors containing a weight assigned to each of one or more common terms of the corresponding terms files according to frequency of occurrence in the corresponding terms file;
means for creating a path-to-name translation listing containing each category name and unique directory pair for each of the plurality of category nodes;
means for indexing the class hierarchy using Fulcrum to create an index file containing term file vectors corresponding to the plurality of terms files;
means for classifying at least one document within a document directory hierarchy using the term vectors; and
means for indexing the document directory hierarchy using Fulcrum.
retrieving a set of defining terms for one of the plurality of category nodes;
forming a terms file within the unique directory in the class hierarchy corresponding to the one of the plurality of category nodes; and
storing the set of defining terms within the terms file.
67. The computer data signal according to claim 64, the step of classifying further comprising the following steps:
creating a document vector for a document to be classified within the class hierarchy;
searching the class hierarchy for the terms vectors which are relevant to the document to determine appropriate categorization of the document, the step of searching if successful returning a list of matching category names, and otherwise notifying a system administrator that the step of searching was unsuccessful; and
storing the document in directories corresponding to the matching category names if the step of searching is successful.
68. The computer data signal according to claim 67, wherein the step of searching comprises:
comparing the document vector to the term vectors using Fulcrum, the step of comparing returning a list of matching categories within the class hierarchy according to user-defined criteria, the list of matching categories corresponding to the at least one of the plurality of category nodes.
69. The computer data signal according to claim 67, the step of storing further comprising the following steps:
retrieving a directory path corresponding to one of the matching category names utilizing the path-to-name translation listing;
creating directories within the document directory hierarchy corresponding to the directory path if not already existing within the document directory hierarchy; and
adding the document to a leaf directory within the document directory hierarchy corresponding to the retrieved directory path.
70. The computer data signal according to claim 69, wherein the step of adding comprises linking the document to the directory.
71. The computer data signal according to claim 67, the step of storing further comprising the following steps:
creating a directory within the document directory hierarchy corresponding to each of the unique directories within the class hierarchy;
retrieving a directory path corresponding to one of the matching category names utilizing the path-to-name translation listing; and
adding the document to a directory within the document directory hierarchy corresponding to the retrieved directory path.
72. The computer data signal according to claim 71, wherein the step of adding comprises linking the document to the directory.
73. The computer data signal of claim 64, wherein each of said contiguous multi-term portions is one sentence or longer.
74. The computer data signal of claim 64, wherein each of said contiguous multi-term portions is one paragraph or longer.
75. The computer data signal of claim 64, wherein each of said contiguous multi-term portions is 25 words or longer.
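Illustrative sketch (not part of the claims): the category node fields recited in claim 64 (category name, NodeID, nodetype, ParentID, LinkID) and the path-to-name translation listing can be modelled as below. The nodetype values "category" and "link", the sequential NodeID scheme, and the use of category names as path components are assumptions for this example. The claims state only that NodeID defines the unique directory.

```python
from dataclasses import dataclass
from typing import Dict, Optional

CATEGORY = "category"  # assumed nodetype that may receive child nodes
LINK = "link"          # assumed nodetype that refers to another node via LinkID


@dataclass
class CategoryNode:
    node_id: int
    name: str
    nodetype: str = CATEGORY
    parent_id: Optional[int] = None
    link_id: Optional[int] = None


class ClassHierarchy:
    def __init__(self, root_name):
        # The root category node carries the user-defined category name.
        self.nodes: Dict[int, CategoryNode] = {0: CategoryNode(0, root_name)}

    def add_node(self, parent_id, name, nodetype=CATEGORY, link_id=None):
        """Add a child node if the parent's nodetype permits it."""
        parent = self.nodes[parent_id]
        if parent.nodetype != CATEGORY:
            raise ValueError("nodes of this type cannot take children")
        if nodetype == LINK and link_id not in self.nodes:
            raise ValueError("a link node must point at an existing NodeID")
        node_id = len(self.nodes)
        self.nodes[node_id] = CategoryNode(node_id, name, nodetype, parent_id, link_id)
        return node_id

    def path_listing(self):
        """Return {category name: directory path} by following each
        node's ParentID up to the root."""
        listing = {}
        for node in self.nodes.values():
            parts, current = [], node
            while current is not None:
                parts.append(current.name)
                current = self.nodes.get(current.parent_id)
            listing[node.name] = "/".join(reversed(parts))
        return listing
```

This sketch keys the listing by category name, so it assumes names are unique across the hierarchy.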
Specification