Construction of trainable semantic vectors and clustering, classification, and searching using trainable semantic vectors
Abstract
An apparatus and method are disclosed for producing a semantic representation of information in a semantic space. The information is first represented in a table that stores values which indicate a relationship with predetermined categories. The categories correspond to dimensions in the semantic space. The significance of the information with respect to the predetermined categories is then determined. A trainable semantic vector (TSV) is constructed to provide a semantic representation of the information. The TSV has dimensions equal to the number of predetermined categories and represents the significance of the information relative to each of the predetermined categories. Various types of manipulation and analysis, such as searching, classification, and clustering, can subsequently be performed on a semantic level.
374 Citations
33 Claims
-
1. A method of operating a computer system to organize items by constructing a trainable semantic vector representative of a data point in a semantic space, wherein the data point corresponds to at least one word, character string or document, the method comprising the steps:
-
constructing a table for storing information indicative of a relationship between items represented by predetermined data points and predetermined categories corresponding to dimensions in the semantic space;
determining the significance of a selected data point with respect to each of the predetermined categories;
constructing a trainable semantic vector for the selected data point based on the significance of the selected data point with respect to each of the predetermined categories, wherein the trainable semantic vector has dimensions equal to the number of predetermined categories and represents the strength of the selected data point with respect to the predetermined categories; and
wherein the step of determining comprises the steps of:
determining a first index representing the proportion of each category containing the selected data point; and
determining a second index representing the distribution of the selected data point's occurrences across all categories.
the first index represents the percentage of the predetermined datasets that contain the selected data point in each category; and
the second index is the probability distribution of the selected data point's occurrences in the predetermined datasets across all categories.
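The two indices recited above can be illustrated with a minimal sketch. The table layout (a dict mapping each category to a list of datasets, each dataset being a list of words) and the function name `significance_indices` are assumptions made for this example; the claims do not prescribe a data structure.

```python
def significance_indices(point, categories):
    """Return (first_index, second_index) for `point`.

    categories: dict mapping category name -> list of datasets,
    where each dataset is a list of data points (e.g. words).
    This layout is hypothetical, chosen only for illustration.
    """
    first = {}        # fraction of datasets in each category containing the point
    occurrences = {}  # raw occurrence count of the point in each category
    for name, datasets in categories.items():
        containing = sum(1 for d in datasets if point in d)
        first[name] = containing / len(datasets) if datasets else 0.0
        occurrences[name] = sum(d.count(point) for d in datasets)
    total = sum(occurrences.values())
    # Probability distribution of the point's occurrences across all categories.
    second = {name: (n / total if total else 0.0)
              for name, n in occurrences.items()}
    return first, second


table = {
    "sports": [["goal", "team", "goal"], ["team", "coach"]],
    "finance": [["goal", "market"], ["market", "bond"], ["bond"], ["rate"]],
}
first, second = significance_indices("goal", table)
```

Here the first index is computed per category, while the second index is normalized so that its values sum to one over all categories.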
-
-
6. The method of claim 5 further comprising a step of minimizing the number of dimensions in the semantic representation of the selected data point.
-
7. The method of claim 6, wherein the step of minimizing comprises the steps:
-
sorting the values of each dimension in the semantic representation of the selected data point in decreasing order;
determining a minimum number of dimensions for the semantic representation of the selected data point based on the sorted values; and
discarding all dimensions below the minimum number of dimensions.
-
-
8. The method of claim 7, wherein the minimum number of dimensions is determined when at least 90% of the total mass of the semantic representation of the selected data point has been reached.
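The pruning described in claims 7 and 8 might look like the following sketch. It assumes that "total mass" means the sum of the dimension values and that a vector is stored as a dict from category to value; both are assumptions for illustration, and `prune_by_mass` is a hypothetical name.

```python
def prune_by_mass(vector, mass_fraction=0.90):
    """Keep the largest dimensions until `mass_fraction` of the total is reached.

    vector: dict mapping dimension (category) -> non-negative value.
    ASSUMPTION: "mass" is taken to be the plain sum of the values.
    """
    # Sort dimensions by value, in decreasing order.
    ranked = sorted(vector.items(), key=lambda kv: kv[1], reverse=True)
    total = sum(value for _, value in ranked)
    if total == 0:
        return {}
    kept, running = {}, 0.0
    for dim, value in ranked:
        kept[dim] = value
        running += value
        if running >= mass_fraction * total:
            break  # minimum number of dimensions reached; discard the rest
    return kept


tsv = {"a": 50, "b": 30, "c": 15, "d": 5}
pruned = prune_by_mass(tsv)
```

With these sample values, the three largest dimensions together carry 95 of 100 units of mass, so the smallest dimension is discarded.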
-
9. The method of claim 6, further comprising a step of normalizing the value of the percentage of data points occurring in each category.
-
10. The method of claim 9, further comprising a step of determining a weighted average of the normalized percentage of data points occurring in each category and the probability distribution of a data point's occurrence across all categories for each category.
-
11. The method of claim 10, wherein the step of determining a weighted average is performed based on the formula
-
12. The method of claim 11, wherein the predetermined weighting factor has a value of about 0.75.
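The formula referenced in claim 11 is not reproduced in this text, so the following is only a sketch of one common form of weighted average, a convex combination with a single weighting factor. Which of the two terms receives the factor of about 0.75 is a guess, and `weighted_significance` is a hypothetical name.

```python
def weighted_significance(normalized_pct, probability, weight=0.75):
    """Blend two per-category indices with a single weighting factor.

    ASSUMPTION: the claimed formula is not available here; this uses
    weight * a + (1 - weight) * b, and assigning `weight` to the
    normalized percentage (rather than the probability) is a guess.
    """
    return {
        cat: weight * normalized_pct[cat] + (1.0 - weight) * probability[cat]
        for cat in normalized_pct
    }


blended = weighted_significance({"x": 1.0, "y": 0.0}, {"x": 0.0, "y": 1.0})
```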
-
13. A method of operating a computer system to organize items by constructing a trainable semantic vector representative of a data point contained within predetermined datasets in a semantic space, wherein the data point corresponds to at least one word, character string or document, and each of the predetermined datasets corresponds to a collection of one or more words, character strings and/or documents, the method comprising the steps:
-
clustering the predetermined datasets into a plurality of unspecified clusters;
defining a plurality of categories such that each category corresponds to one of the plurality of unspecified clusters;
assigning each predetermined dataset to the category corresponding to the cluster to which the dataset belongs;
constructing a table for storing information indicative of a relationship between items represented by predetermined data points contained within the predetermined datasets and said plurality of categories, wherein each category corresponds to a dimension in a semantic space;
determining the significance of a selected data point with respect to each of the plurality of categories;
constructing a trainable semantic vector for the selected data point based on the significance of the selected data point with respect to each of the predetermined categories, wherein the trainable semantic vector has dimensions equal to the number of predetermined categories and represents the strength of the selected data point with respect to the predetermined categories; and
wherein the step of determining comprises the steps of:
determining a first index representing the proportion of each category containing the selected data point; and
determining a second index representing the probability distribution of the selected data point's occurrences in the predetermined datasets across all categories.
-
-
14. A method of operating a computer system to organize items by producing a semantic representation of a dataset in a semantic space, wherein the dataset corresponds to a collection of one or more words, character strings or documents, the method comprising the steps:
-
constructing a table for storing information indicative of a relationship between items represented by predetermined data points within the dataset and predetermined categories corresponding to dimensions in the semantic space, wherein the data point corresponds to at least one word, character string or document;
determining the significance of each data point with respect to the predetermined categories;
constructing a trainable semantic vector for each data point based on the significance of each data point with respect to the predetermined categories, wherein each trainable semantic vector has dimensions equal to the number of predetermined categories and represents the relative strength of its corresponding data point with respect to each of the predetermined categories; and
combining the trainable semantic vectors for the data points in the dataset to form the semantic representation of the dataset.
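Claim 14 does not state how the per-data-point vectors are combined. As one possible reading, the sketch below uses an element-wise sum over categories; the summation rule, the dict representation, and the name `combine_vectors` are all assumptions for illustration.

```python
def combine_vectors(vectors):
    """Combine per-data-point vectors into one dataset-level vector.

    vectors: iterable of dicts mapping category -> value.
    ASSUMPTION: combination is an element-wise sum; the claim does not
    specify the rule, and an average or weighted sum would also fit.
    """
    combined = {}
    for vec in vectors:
        for cat, value in vec.items():
            combined[cat] = combined.get(cat, 0.0) + value
    return combined


word_vectors = [
    {"sports": 0.5, "finance": 0.25},
    {"sports": 0.25, "finance": 0.5},
    {"sports": 0.25},
]
dataset_vector = combine_vectors(word_vectors)
```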
sorting the values of each dimension in the semantic representation of the dataset in decreasing order;
determining a minimum number of dimensions for the semantic representation of the dataset based on the sorted values; and
discarding all dimensions below the minimum number of dimensions.
-
-
23. The method of claim 22, wherein the minimum number of dimensions is determined when at least 90% of the total mass of the semantic representation of the dataset has been reached.
-
24. The method of claim 22, wherein the step of determining a minimum number of dimensions comprises the steps:
-
calculating the first derivative and second derivative of the semantic representation of the dataset at prescribed dimensions;
comparing the first derivative and second derivative to predetermined first and second pruning thresholds, respectively; and
identifying the minimum number of dimensions based on the step of comparing.
-
-
25. The method of claim 24, wherein the first pruning threshold is about 0.05, and the second pruning threshold is about 0.005.
-
26. The method of claim 24, wherein the derivatives are calculated in intervals of 10.
-
27. The method of claim 24 wherein the step of identifying comprises the steps:
-
detecting a dimension at which the first derivative is lower than the first pruning threshold, and the second derivative is lower than the second pruning threshold;
doubling the value of the detected dimension;
comparing the doubled value of the detected dimension to a predetermined limit to determine a stop point corresponding to the lower value of the two; and
setting the minimum number of dimensions for the semantic representation of the dataset equal to the value of the stop point.
-
-
28. The method of claim 27, wherein the predetermined limit is 1000.
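The derivative-based stop point of claims 24 through 28 can be sketched as follows, using the thresholds, interval, and limit recited there as defaults. Several details are assumptions, since the claims do not fix them: derivatives are approximated by finite differences, their absolute values are compared against the thresholds, the detected dimension is taken as the start of the first flat interval, and the fallback when no such interval exists is hypothetical.

```python
def stop_point(sorted_values, interval=10, t1=0.05, t2=0.005, limit=1000):
    """Estimate the minimum number of dimensions to keep.

    sorted_values: dimension values already sorted in decreasing order.
    ASSUMPTIONS: finite-difference derivatives sampled every `interval`
    dimensions, compared by absolute value; none of this is specified
    in the claims.
    """
    positions = list(range(0, len(sorted_values), interval))
    # First derivative over each sampled interval.
    first = [
        (sorted_values[b] - sorted_values[a]) / interval
        for a, b in zip(positions, positions[1:])
    ]
    for i in range(1, len(first)):
        # Second derivative as the change in slope between adjacent intervals.
        second = (first[i] - first[i - 1]) / interval
        if abs(first[i]) < t1 and abs(second) < t2:
            detected = positions[i]          # assumed: start of the flat interval
            return min(2 * detected, limit)  # double it, then cap at the limit
    # Hypothetical fallback: no flat region found, keep everything up to the limit.
    return min(len(sorted_values), limit)


# A curve that drops steeply for 20 dimensions and is flat afterwards.
values = [100.0 - 5 * k for k in range(20)] + [1.0] * 40
keep = stop_point(values)
```

Doubling the detected dimension leaves a margin past the point where the curve flattens, and the limit bounds the result for very long vectors.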
-
29. A system for constructing a trainable semantic vector representative of a data point in a semantic space, wherein the data point corresponds to at least one word, character string or document, the system comprising:
-
a computer configured to:
construct a table for storing information indicative of a relationship between predetermined data points and predetermined categories corresponding to dimensions in the semantic space;
determine the significance of a selected data point with respect to each of said predetermined categories by:
determining a first index representing the proportion of each category containing the selected data point; and
determining a second index representing the distribution of the selected data point's occurrences across all categories; and
construct a trainable semantic vector for said selected data point based on the significance of the selected data point with respect to each of the predetermined categories, wherein the trainable semantic vector has dimensions equal to the number of predetermined categories and represents the strength of the selected data point with respect to the predetermined categories.
-
-
30. A system for producing a semantic representation of a dataset in a semantic space, wherein the dataset corresponds to a collection of one or more words, character strings and/or documents, the system comprising:
-
a computer configured to:
construct a table for storing information indicative of a relationship between predetermined data points within the dataset and predetermined categories corresponding to dimensions in the semantic space, wherein each of the data points corresponds to at least one word, character string or document;
determine the significance of each data point with respect to said predetermined categories;
construct a trainable semantic vector for each data point based on the significance of each data point with respect to the predetermined categories, wherein each said trainable semantic vector has dimensions equal to the number of said predetermined categories and represents the relative strength of its corresponding data point with respect to each of said predetermined categories; and
combine the trainable semantic vectors for the data points in said dataset to form the semantic representation of said dataset.
-
-
31. A computer-readable medium carrying one or more sequences of instructions for constructing a trainable semantic vector representative of a data point in a semantic space, wherein the data point corresponds to at least one word, character string or document, and execution of the one or more sequences of instructions by one or more processors causes the one or more processors to perform the steps of:
-
constructing a table for storing information indicative of a relationship between predetermined data points and predetermined categories corresponding to dimensions in the semantic space;
determining the significance of a selected data point with respect to each of the predetermined categories;
constructing a trainable semantic vector for the selected data point based on the significance of the selected data point with respect to each of the predetermined categories, wherein the trainable semantic vector has dimensions equal to the number of predetermined categories and represents the strength of the selected data point with respect to the predetermined categories; and
wherein the step of determining comprises the steps of:
determining a first index representing the proportion of each category containing the selected data point; and
determining a second index representing the distribution of the selected data point's occurrences across all categories.
-
-
32. A computer-readable medium carrying one or more sequences of instructions for producing a semantic representation of a dataset in a semantic space, wherein the dataset corresponds to a collection of one or more words, character strings and/or documents, and execution of the one or more sequences of instructions by one or more processors causes the one or more processors to perform steps of:
-
constructing a table for storing information indicative of a relationship between predetermined data points and predetermined categories corresponding to dimensions in the semantic space, wherein each of the data points corresponds to at least one word, character string or document;
determining the significance of each data point with respect to the predetermined categories;
constructing a trainable semantic vector for each data point based on the significance of each data point with respect to the predetermined categories, wherein each trainable semantic vector has dimensions equal to the number of predetermined categories and represents the relative strength of its corresponding data point with respect to each of the predetermined categories; and
combining the trainable semantic vectors for the data points in the dataset to form the semantic representation of the dataset.
-
-
33. In a document comprising at least two data points, wherein each of the at least two data points correspond to words or character strings occurring in the document, a method for operating a computer system to organize items by constructing a trainable semantic vector representative of the data points in a semantic space, the method comprising the steps:
-
constructing a table for storing information indicative of a relationship between items represented by the data points and predetermined categories corresponding to dimensions in the semantic space;
determining the significance of a selected data point with respect to each of the predetermined categories; and
constructing a trainable semantic vector for the selected data point based on the significance of the selected data point with respect to each of the predetermined categories, wherein the trainable semantic vector has dimensions equal to the number of predetermined categories and represents the strength of the selected data point with respect to the predetermined categories.
-
Specification