System and method for generating taxonomies with applications to content-based recommendations
First Claim
1. A method of generating a graph taxonomy of information which is represented by a plurality of vectors, the graph taxonomy including a plurality of nodes and a plurality of edges, said method comprising the steps of:
- (a) receiving the information represented by the plurality of vectors in the form of a plurality of data objects;
(b) generating the plurality of nodes, each node of the plurality of nodes associated with ones of the plurality of vectors;
(c) establishing a tree hierarchy in a data memory based on the plurality of nodes;
(d) calculating a plurality of distances between ones of the plurality of nodes; and
(e) connecting ones of the plurality of nodes with other ones of the plurality of nodes by ones of the plurality of edges based on the plurality of distances to obtain the graph taxonomy, wherein the plurality of distances are respectively computed using the plurality of vectors associated with the plurality of nodes.
6 Assignments
0 Petitions
Accused Products
Abstract
A graph taxonomy of information which is represented by a plurality of vectors is generated. The graph taxonomy includes a plurality of nodes and a plurality of edges. The plurality of nodes is generated, and each node of the plurality of nodes is associated with ones of the plurality of vectors. A tree hierarchy is established based on the plurality of nodes. A plurality of distances between ones of the plurality of nodes is calculated. Ones of the plurality of nodes are connected with other ones of the plurality of nodes by ones of the plurality of edges based on the plurality of distances. The information represented by the plurality of vectors may be, for example, a plurality of documents such as Web Pages.
427 Citations
36 Claims
-
1. A method of generating a graph taxonomy of information which is represented by a plurality of vectors, the graph taxonomy including a plurality of nodes and a plurality of edges, said method comprising the steps of:
-
(a) receiving the information represented by the plurality of vectors in the form of a plurality of data objects;
(b) generating the plurality of nodes, each node of the plurality of nodes associated with ones of the plurality of vectors;
(c) establishing a tree hierarchy in a data memory based on the plurality of nodes;
(d) calculating a plurality of distances between ones of the plurality of nodes; and
(e) connecting ones of the plurality of nodes with other ones of the plurality of nodes by ones of the plurality of edges based on the plurality of distances to obtain the graph taxonomy, wherein the plurality of distances are respectively computed using the plurality of vectors associated with the plurality of nodes. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 34, 35, 36)
providing a test document; and
recommending, to a user, ones of the plurality of subject categories based on an other plurality of distances between the test document and at least one of the plurality of nodes.
-
-
4. The method of generating the graph taxonomy of claim 1, wherein the step of generating the plurality of nodes, further comprises the steps of:
-
partitioning the plurality of vectors into a plurality of clusters;
calculating a pseudo-centroid for each of the plurality of clusters;
calculating a projected pseudo-centroid based on the pseudo-centroid; and
associating the projected pseudo-centroid with one of the plurality of nodes.
-
-
5. The method of generating the graph taxonomy of claim 1, wherein the plurality of nodes includes a plurality of high level nodes and a plurality of low level nodes, and the step of generating the plurality of nodes further comprises the steps of:
-
generating the plurality of low level nodes;
generating the plurality of high level nodes based on the plurality of low level nodes; and
connecting one of the plurality of low level nodes with one of the plurality of high level nodes by one of the plurality of edges.
-
-
6. The method of generating the graph taxonomy of claim 5, wherein the plurality of high level nodes includes a parent node which is connected to at least one child node of the plurality of low level nodes, and step (e) further comprises the steps of:
-
calculating a mean of at least one of the plurality of distances between the parent node and said at least one child node;
calculating a standard deviation of said at least one of the plurality of distances between the parent node and said at least one child node; and
connecting the parent node with a node of the plurality of low level nodes based on the mean and the standard deviation.
-
-
7. The method of generating the graph taxonomy of claim 1, wherein the step of generating the plurality of nodes, further comprises the steps of:
-
partitioning the plurality of vectors into a plurality of clusters;
calculating a plurality of pseudo-centroids corresponding to the plurality of clusters;
calculating a plurality of projected pseudo-centroids based of the plurality of pseudo-centroids;
selecting at least one of the plurality of projected pseudo-centroids based on the number of vectors belonging to each of the plurality of cluster; and
associating said at least one of the plurality of projected pseudo-centroids with at least one of the plurality of nodes.
-
-
8. The method of generating the graph taxonomy of claim 1, wherein the step of generating the plurality of nodes, further comprises the steps of:
-
(i) selecting the ones of the plurality of vectors;
(ii) partitioning the ones of the plurality of vectors into a plurality of clusters;
(iii) calculating a plurality of projected pseudo-centroids corresponding to the plurality of clusters;
(iv) selecting at least one of the plurality of projected pseudo-centroids; and
(v) associating said at least one of the plurality of pseudo-centroids with at least one of the plurality of nodes.
-
-
9. The method of generating the graph taxonomy of claim 8, wherein steps (i) through (iv) are repeated.
-
10. The method of generating the graph taxonomy of claim 9, wherein the number of ones of the plurality of vectors selected is based on the number of repetitions performed of steps (i) through (iv).
-
11. The method of generating the graph taxonomy of claim 9, wherein the number of said at least one of the plurality of projected pseudo-centroids selected is based on the number of repetitions performed of steps (i) through (iv).
-
12. The method of generating the graph taxonomy of claim 8, further comprising the step of selecting a plurality of seeds from the plurality of vectors, wherein step (ii) is performed based on the plurality of seeds.
-
13. The method of generating the graph taxonomy of claim 12, wherein steps (i) through (iv) are repeated.
-
14. The method of generating the graph taxonomy of claim 13, wherein the number of seeds selected from the plurality of vectors is based on the number of repetitions performed of steps (i) through (iv).
-
15. The method of generating the graph taxonomy of claim 1, wherein the plurality of nodes includes a plurality of low level nodes and a plurality of high level nodes, and the step of establishing a tree hierarchy, further comprises the step of:
generating the plurality of high level nodes from the plurality of low level nodes.
-
16. The method of generating the graph taxonomy of claim 15, wherein the step of generating the plurality of high level nodes from the plurality of low level nodes, further comprises the steps of:
-
creating a plurality of meta-documents corresponding to the plurality of low level nodes; and
generating the plurality of high level nodes based on the plurality of meta-documents.
-
-
17. The method of generating the graph taxonomy of claim 16, further comprising the steps of:
-
partitioning the plurality of vectors into a plurality of clusters based on the plurality of meta-documents;
calculating a pseudo-centroid for each of the plurality of clusters; and
calculating a projected pseudo-centroid based on the pseudo-centroid.
-
-
18. The method of generating the graph taxonomy of claim 17, wherein the step of generating the plurality of high level nodes from the plurality of low level nodes, further comprises the step of
associating the projected pseudo-centroids with one of the plurality of high level nodes. -
19. The method of generating the graph taxonomy of claim 15, wherein the step of calculating a plurality of distances between ones of the plurality of nodes, further includes the step of
calculating the plurality of distances between each of the plurality of nodes and any of the plurality of nodes. -
20. The method of generating the graph taxonomy of claim 19, wherein the step of connecting ones of the plurality of nodes with other ones of the plurality of nodes by ones of the plurality of edges, further includes the step of
connecting ones of the plurality of low level nodes with ones of the plurality of high level nodes by ones of the plurality of edges based on the plurality of distances. -
21. The method of generating the graph taxonomy of claim 2, wherein the plurality of nodes includes a plurality of leaf nodes, further comprising the steps of:
-
providing a test document;
calculating an other plurality of distances between the plurality of leaf nodes and the test document;
selecting ones of the plurality of leaf nodes based on the other plurality of distances;
calculating a plurality of similarity measures between selected ones of the plurality of leaf nodes; and
sorting the selected ones of the plurality of leaf nodes based on the plurality of similarity measures.
-
-
22. The method of generating the graph taxonomy of claim 21, wherein a first leaf node of the selected ones of the plurality of leaf nodes is associated with a first meta-document and a second leaf node of the selected ones of the plurality of leaf nodes is associated with a second meta-documents, the first and second meta-documents including the plurality of word frequencies, and the step of calculating a plurality of similarity measures between selected ones of the plurality of leaf nodes, further comprises the steps of:
-
setting to zero ones of the plurality of word frequencies included in the first meta-document if corresponding ones of the plurality of word frequencies are nonzero in the second meta-document; and
setting to zero ones of the plurality of word frequencies included in the second meta-document if corresponding ones of the plurality of word frequencies are nonzero in the first meta-document.
-
-
23. The method of generating the graph taxonomy of claim 2, wherein the plurality of nodes includes a plurality of leaf nodes and a plurality of high level nodes, further comprising the steps of:
-
providing a test document;
calculating an other plurality of distances between the plurality of leaf nodes and the test document;
selecting ones of the plurality of leaf nodes based on the other plurality of distances; and
recommending to a user ones of the plurality of subject categories corresponding to ones of the plurality of high level nodes from which there is path in the graph taxonomy to the selected ones of the plurality of leaf nodes.
-
-
24. The method of generating the graph taxonomy of claim 23, wherein the step of recommending to a user ones of the plurality of subject categories, further comprises the step of
calculating a plurality of similarity measures between the test document and the selected one of the plurality of leaf nodes to which there is the path in the graph taxonomy to the ones of the plurality of high level nodes; - and
calculating a sum of ones of the plurality of similarity measures associated with one of the plurality of high level nodes.
- and
-
34. The method of generating the graph taxonomy of claim 1, wherein said step of calculating a pseudo-centroid for each of the plurality of clusters comprises the steps of applying a damping function to each of a set of vectors of a cluster, and calculating a vector sum of a result of the damping function.
-
35. The method of generating the graph taxonomy of claim 1, wherein said step of applying a damping function comprises the step of applying a predetermined function to all components of a vector.
-
36. The method of generating the graph taxonomy of claim 1, wherein the plurality of distances correspond to textual distances, based on words within documents.
-
25. An article of manufacture comprising a computer usable medium having computer readable program code means embodied therein for generating a graph taxonomy of information which is represented by a plurality of vectors, the graph taxonomy including a plurality of nodes and a plurality of edges, the computer readable program code means in said article of manufacture comprising computer readable program code means for causing a computer to effect:
-
(a) receiving the information represented by the plurality of vectors in the form of a plurality of data objects;
(b) generating the plurality of nodes, each node of the plurality of nodes associated with ones of the plurality of vectors;
(c) establishing a tree hierarchy in a data memory based on the plurality of nodes;
(d) calculating a plurality of distances between ones of the plurality of nodes; and
(e) connecting ones of the plurality of nodes with other ones of the plurality of nodes by ones of the plurality of edges based on the plurality of distances to obtain the graph taxonomy, wherein the plurality of distances are respectively computed using the plurality of vectors associated with the plurality of nodes. - View Dependent Claims (26, 27)
providing a test document; and
recommending, to a user, ones of the plurality of subject categories based on an other plurality of distances between the test document and at least one of the plurality of nodes.
-
-
28. A program storage device readable by machine, tangibly embodying a program of instructions executable by the machine to perform method steps for generating a graph taxonomy of information which is represented by a plurality of vectors, the graph taxonomy including a plurality of nodes and a plurality of edges, said method comprising the steps of:
-
(a) receiving the information represented by the plurality of vectors in the form of a plurality of data objects;
(b) generating the plurality of nodes, each node of the plurality of nodes associated with ones of the plurality of vectors;
(c) establishing a tree hierarchy in a data memory based on the plurality of nodes;
(d) calculating a plurality of distances between ones of the plurality of nodes; and
(e) connecting ones of the plurality of nodes with other ones of the plurality of nodes by ones of the plurality of edges based on the plurality of distances to obtain the graph taxonomy, wherein the plurality of distances are respectively computed using the plurality of vectors associated with the plurality of nodes. - View Dependent Claims (29, 30)
providing a test document; and
recommending, to a user, ones of the plurality of subject categories based on an other plurality of distances between the test document and at least one of the plurality of nodes.
-
-
31. A computer program product comprising a computer usable medium having computer readable program code means embodied therein for causing a generation of a graph taxonomy of information which is represented by a plurality of vectors, the graph taxonomy including a plurality of nodes and a plurality of edges, the computer readable program code means in said computer program product comprising computer readable program code means for causing a computer to effect:
-
(a) receiving the information represented by the plurality of vectors in the form of a plurality of data objects;
(b) generating the plurality of nodes, each node of the plurality of nodes associated with ones of the plurality of vectors;
(c) establishing a tree hierarchy in a data memory based on the plurality of nodes;
(d) calculating a plurality of distances between ones of the plurality of nodes; and
(e) connecting ones of the plurality of nodes with other ones of the plurality of nodes by ones of the plurality of edges based on the plurality of distances to obtain the graph taxonomy, wherein the plurality of distances are respectively computed using the plurality of vectors associated with the plurality of nodes. - View Dependent Claims (32, 33)
providing a test document; and
recommending, to a user, ones of the plurality of subject categories based on an other plurality of distances between the test document and at least one of the plurality of nodes.
-
Specification