Computer aided document retrieval
Abstract
A method of determining cluster attractors for a plurality of documents, each document comprising at least one term. The method comprises calculating, in respect of each term, a probability distribution indicative of the frequency of occurrence of the, or each, other term that co-occurs with said term in at least one of said documents. Then, the entropy of the respective probability distribution is calculated. Finally, at least one of said probability distributions is selected as a cluster attractor depending on the respective entropy value. The method allows very small clusters to be formed, enabling more focused retrieval during a document search.
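The three steps in the abstract (per-term distribution, entropy, selection) can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the patented implementation. The function names `context_distributions`, `entropy` and `select_attractors` are hypothetical. The distribution for a term z is assumed to be the pooled term counts over all documents containing z, normalised to sum to 1 (z's own counts are included in the pool). The abstract only says selection depends on the entropy value; picking the k lowest-entropy terms is an assumption made here.

```python
import math
from collections import Counter


def context_distributions(docs):
    """Map each term z to an assumed distribution p(y | z).

    docs is a list of documents, each a list of terms.
    Assumption: p(y | z) is the count of y pooled over all documents
    containing z, divided by the total term count of those documents.
    """
    tf = [Counter(doc) for doc in docs]   # per-document term frequencies
    vocab = set().union(*docs)            # every distinct term
    dists = {}
    for z in vocab:
        pooled = Counter()
        for counts in tf:
            if z in counts:               # document belongs to X(z)
                pooled.update(counts)
        total = sum(pooled.values())
        dists[z] = {y: c / total for y, c in pooled.items()}
    return dists


def entropy(dist):
    """Shannon entropy (in bits) of a {term: probability} mapping."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


def select_attractors(docs, k):
    """Return the k terms with the lowest-entropy distributions.

    Assumption: low entropy marks a good attractor; the source only
    states that selection depends on the entropy value.
    """
    dists = context_distributions(docs)
    ranked = sorted(dists, key=lambda z: (entropy(dists[z]), z))
    return [(z, dists[z]) for z in ranked[:k]]


docs = [
    ["apple", "banana", "apple"],
    ["apple", "cherry"],
    ["banana", "banana"],
]
attractors = select_attractors(docs, k=1)
```

Ties in entropy are broken alphabetically so the ranking is deterministic; a threshold on entropy or on document frequency could replace the fixed k.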
18 Claims
1. A computer-implemented method of determining cluster attractors for use in clustering a plurality of documents, each document comprising at least one term, each term comprising one or more words, the method comprising:
- causing a computer to calculate, in respect of each term, a probability distribution that is indicative of:
  - in the instance where a document comprises said term and said one other term that co-occurs with said term in at least one of said documents, the frequency of occurrence of said one other term, and
  - in the instance where a document comprises said term and more than one other term that co-occurs with said term in at least one of said documents, the respective frequency of occurrence of each other term, that co-occurs with said term in at least one of said documents;
- causing a computer to calculate, in respect of each term, the entropy of the respective probability distribution; and
- causing the computer to select at least one of said probability distributions as a cluster attractor depending on the respective entropy value;
- wherein the selected cluster attractor is a clustering focus for at least some of said documents, and wherein said probability distribution is calculated as [formula not reproduced in the source], where tf(x, y) is a term frequency of a term y in a document x, X(z) is a set of all documents of said plurality of documents that contain a term z, and t is a term index.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
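The expression that should follow "calculated as" does not appear in the claim text above. One form consistent with the definitions given there (tf(x, y), X(z), and the term index t) is sketched below. It is an assumed reconstruction, not the authoritative claim wording. The symbols p(y | z) and H(z) are introduced here for illustration, and the entropy is written in its standard Shannon form.

```latex
\[
p(y \mid z) \;=\; \frac{\sum_{x \in X(z)} tf(x, y)}{\sum_{x \in X(z)} \sum_{t} tf(x, t)}
\]
\[
H(z) \;=\; -\sum_{y} p(y \mid z)\,\log p(y \mid z)
\]
```

Under this reading, the numerator counts occurrences of y across all documents containing z, and the denominator is the total term count of those documents, so the values p(y | z) sum to 1 over y.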
17. An apparatus for determining cluster attractors for a plurality of documents, each document comprising at least one term, each term comprising one or more words, the apparatus comprising:
- means for calculating, in respect of each term, a probability distribution indicative of:
  - in the instance where a document comprises said term and one other term that co-occurs with said term in at least one of said documents, the frequency of occurrence of said one other term, and
  - in the instance where a document comprises said term and more than one other term that co-occurs with said term in at least one of said documents, the respective frequency of occurrence of each other term;
- means for calculating, in respect of each term, the entropy of the respective probability distribution; and
- means for selecting at least one of said probability distributions as a cluster attractor depending on the respective entropy value,
- wherein the selected cluster attractor is a clustering focus for at least some of said documents, and wherein said probability distribution is calculated as [formula not reproduced in the source], where tf(x, y) is a term frequency of a term y in a document x, X(z) is a set of all documents of said plurality of documents that contain a term z, and t is a term index.
Dependent claims: 18
Specification