Construction of trainable semantic vectors and clustering, classification, and searching using trainable semantic vectors

US 6,751,621 B1
Filed: 05/02/2000
Issued: 06/15/2004
Est. Priority Date: 01/27/2000
Status: Expired due to Term

Abstract

An apparatus and method are disclosed for producing a semantic representation of information in a semantic space. The information is first represented in a table that stores values which indicate a relationship with predetermined categories. The categories correspond to dimensions in the semantic space. The significance of the information with respect to the predetermined categories is then determined. A trainable semantic vector (TSV) is constructed to provide a semantic representation of the information. The TSV has dimensions equal to the number of predetermined categories and represents the significance of the information relative to each of the predetermined categories. Various types of manipulation and analysis, such as searching, classification, and clustering, can subsequently be performed on a semantic level.
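As an illustration of the table described in the abstract, the following is a minimal sketch, assuming the data points are words and the categories are labels attached to training documents. The function name `build_table` and the `(category, text)` input format are hypothetical and do not come from the patent.

```python
from collections import Counter, defaultdict

def build_table(labeled_docs):
    """Count how often each word occurs in each category.

    labeled_docs is a list of (category, text) pairs (a hypothetical format).
    Returns (categories, table), where table[word][category] is a count and
    each category corresponds to one dimension of the semantic space.
    """
    categories = sorted({cat for cat, _ in labeled_docs})
    table = defaultdict(Counter)
    for cat, text in labeled_docs:
        for word in text.lower().split():
            table[word][cat] += 1
    return categories, dict(table)

docs = [
    ("finance", "bank loan interest bank"),
    ("finance", "stock market interest"),
    ("nature", "river bank water"),
]
categories, table = build_table(docs)

# One row of the table: occurrences of "bank" per category.
row = [table["bank"][c] for c in categories]
print(categories, row)
```

Each row has one entry per category, so any vector derived from a row has as many dimensions as there are predetermined categories.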

374 Citations

33 Claims

1. A method of operating a computer system to organize items by constructing a trainable semantic vector representative of a data point in a semantic space, wherein the data point corresponds to at least one word, character string or document, the method comprising the steps:
- constructing a table for storing information indicative of a relationship between items represented by predetermined data points and predetermined categories corresponding to dimensions in the semantic space;
  
  determining the significance of a selected data point with respect to each of the predetermined categories;
  
  constructing a trainable semantic vector for the selected data point based on the significance of the selected data point with respect to each of the predetermined categories, wherein the trainable semantic vector has dimensions equal to the number of predetermined categories and represents the strength of the selected data point with respect to the predetermined categories; and
  
  wherein the step of determining comprises the steps of:
  
  determining a first index representing the proportion of each category containing the selected data point; and
  
  determining a second index representing the distribution of the selected data point's occurrences across all categories.
2. The method of claim 1, wherein the relative strength of the data points corresponds to the number of times each data point occurs in each category.
3. The method of claim 1, wherein the data points correspond to words or other character strings occurring in a document.
4. The method of claim 1, wherein the predetermined data points are contained within one or more predetermined datasets, wherein each of the one or more datasets corresponds to a collection of one or more words, character strings, and/or documents.
5. The method of claim 4, wherein the
6. The method of claim 5 further comprising a step of minimizing the number of dimensions in the semantic representation of the selected data point.
7. The method of claim 6, wherein the step of minimizing comprises the steps:
- sorting the values of each dimension in the semantic representation of the selected data point in decreasing order;
  
  determining a minimum number of dimensions for the semantic representation of the selected data point based on the sorted values; and
  
  discarding all dimensions below the minimum number of dimensions.
8. The method of claim 7, wherein the minimum number of dimensions is determined when at least 90% of the total mass of the semantic representation of the selected data point has been reached.
9. The method of claim 6, further comprising a step of normalizing the value of the percentage of data points occurring in each category.
10. The method of claim 9, further comprising a step of determining a weighted average of the normalized percentage of data points occurring in each category and the probability distribution of a data point'"'"'s occurrence across all categories for each category.
11. The method of claim 10, wherein the step of determining a weighted average is performed based on the formula
12. The method of claim 11, wherein the predetermined weighting factor has a value of about 0.75.
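The two indices of claim 1, the weighted average of claims 9 to 12, and the 90% mass cutoff of claims 7 and 8 can be sketched as follows. This is an illustration under stated assumptions, not the patented formula: the formula of claim 11 is missing from this text, so the combination `weight * u + (1 - weight) * v`, the choice of which index receives the 0.75 weight, and all function and parameter names are hypothetical.

```python
def significance(word_counts, category_totals, weight=0.75):
    """Sketch of a TSV for one data point (assumed formula, see lead-in).

    word_counts: occurrences of the data point per category.
    category_totals: total occurrences of all data points per category.
    """
    cats = list(category_totals)
    # First index: share of each category taken up by the data point,
    # normalized so the values sum to 1 across categories.
    u = [word_counts.get(c, 0) / category_totals[c] for c in cats]
    u_sum = sum(u)
    u = [x / u_sum if u_sum else 0.0 for x in u]
    # Second index: distribution of the data point's occurrences
    # across all categories.
    total = sum(word_counts.get(c, 0) for c in cats)
    v = [word_counts.get(c, 0) / total if total else 0.0 for c in cats]
    # Assumed weighted average of the two indices.
    return {c: weight * a + (1 - weight) * b for c, a, b in zip(cats, u, v)}

def prune_by_mass(vector, mass=0.90):
    """Keep the largest dimensions until `mass` of the total is reached."""
    ranked = sorted(vector.items(), key=lambda kv: kv[1], reverse=True)
    target = mass * sum(vector.values())
    kept, running = {}, 0.0
    for cat, value in ranked:
        if running >= target:
            break
        kept[cat] = value
        running += value
    return kept

tsv = significance(
    {"finance": 2, "nature": 1},
    {"finance": 7, "nature": 3, "sports": 5},
)
kept = prune_by_mass(tsv)
print(tsv)
print(kept)
```

In this example the "sports" dimension carries no mass, so it is discarded once the first two dimensions together exceed 90% of the total.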

13. A method of operating a computer system to organize items by constructing a trainable semantic vector representative of a data point contained within predetermined datasets in a semantic space, wherein the data point corresponds to at least one word, character string or document, and each of the predetermined datasets corresponds to a collection of one or more words, character strings and/or documents, the method comprising the steps:
- clustering the predetermined datasets into a plurality of unspecified clusters;
  
  defining a plurality of categories such that each category corresponds to one of the plurality of unspecified clusters;
  
  assigning each predetermined dataset to the category corresponding to the cluster to which the dataset belongs;
  
  constructing a table for storing information indicative of a relationship between items represented by predetermined data points contained within the predetermined datasets and said plurality of categories, wherein each category corresponds to a dimension in a semantic space;
  
  determining the significance of a selected data point with respect to each of the plurality of categories;
  
  constructing a trainable semantic vector for the selected data point based on the significance of the selected data point with respect to each of the predetermined categories, wherein the trainable semantic vector has dimensions equal to the number of predetermined categories and represents the strength of the selected data point with respect to the predetermined categories; and
  
  wherein the step of determining comprises the steps of:
  
  determining a first index representing the proportion of each category containing the selected data point; and
  
  determining a second index representing the probability distribution of the selected data point's occurrences in the predetermined datasets across all categories.
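The first three steps of claim 13 (cluster the datasets, define one category per cluster, assign each dataset to its category) can be sketched as follows. The claim does not name a clustering algorithm; the greedy word-overlap clustering below is a stand-in chosen for brevity, and the names `jaccard`, `greedy_cluster`, and the 0.2 threshold are hypothetical.

```python
def jaccard(a, b):
    """Word-set overlap between two datasets, from 0.0 to 1.0."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def greedy_cluster(datasets, threshold=0.2):
    """Assign each dataset to the first cluster whose seed it resembles.

    A dataset that resembles no existing seed starts a new cluster.
    Returns one cluster index per dataset.
    """
    seeds, labels = [], []
    for text in datasets:
        words = set(text.lower().split())
        for idx, seed in enumerate(seeds):
            if jaccard(words, seed) >= threshold:
                labels.append(idx)
                break
        else:
            seeds.append(words)
            labels.append(len(seeds) - 1)
    return labels

datasets = [
    "bank loan interest",
    "loan interest rate",
    "river water fish",
    "fish water lake",
]
labels = greedy_cluster(datasets)

# One category per discovered cluster; each becomes a semantic dimension.
categories = [f"cluster_{i}" for i in sorted(set(labels))]
assignment = dict(zip(datasets, (categories[i] for i in labels)))
print(assignment)
```

The resulting categories have no predefined meaning, which matches the claim's "unspecified clusters"; the table and TSV steps then proceed as in claim 1.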

14. A method of operating a computer system to organize items by producing a semantic representation of a dataset in a semantic space, wherein the dataset corresponds to a collection of one or more words, character strings or documents, the method comprising the steps:
- constructing a table for storing information indicative of a relationship between items represented by predetermined data points within the dataset and predetermined categories corresponding to dimensions in the semantic space, wherein the data point corresponds to at least one word, character string or document;
  
  determining the significance of each data point with respect to the predetermined categories;
  
  constructing a trainable semantic vector for each data point based on the significance of each data point with respect to the predetermined categories, wherein each trainable semantic vector has dimensions equal to the number of predetermined categories and represents the relative strength of its corresponding data point with respect to each of the predetermined categories; and
  
  combining the trainable semantic vectors for the data points in the dataset to form the semantic representation of the dataset.
15. The method of claim 14, wherein the semantic representation of the dataset is in the form of a vector having dimensions equal to the predetermined number of categories.
16. The method of claim 14, wherein the dataset corresponds to a document, and each data point corresponds to a word or character string occurring in the document.
17. The method of claim 14, further comprising a step of scaling the semantic representation of the dataset using a vote vector having entries that store a vote value for each dimension of the semantic representation of the dataset.
18. The method of claim 17, wherein each vote value is at least 10.
19. The method of claim 17, wherein each vote value is representative of the number of data points whose corresponding TSV dimension is greater than a predetermined minimum value.
20. The method of claim 19, wherein the predetermined minimum value is about 0.5.
21. The method of claim 14, further comprising a step of minimizing the number of dimensions in the semantic representation of the dataset.
22. The method of claim 21, wherein the step of minimizing comprises the steps:
23. The method of claim 22, wherein the minimum number of dimensions is determined when at least 90% of the total mass of the semantic representation of the dataset has been reached.
24. The method of claim 22, wherein the step of determining a minimum number of dimensions comprises the steps:
- calculating the first derivative and second derivative of the semantic representation of the dataset at prescribed dimensions;
  
  comparing the first derivative and second derivative to predetermined first and second pruning thresholds, respectively; and
  
  identifying the minimum number of dimensions based on the step of comparing.
25. The method of claim 24, wherein the first pruning threshold is about 0.05, and the second pruning threshold is about 0.005.
26. The method of claim 24, wherein the derivatives are calculated in intervals of 10.
27. The method of claim 24 wherein the step of identifying comprises the steps:
- detecting a dimension at which the first derivative is lower than the first pruning threshold, and the second derivative is lower than the second pruning threshold;
  
  doubling the value of the detected dimension;
  
  comparing the doubled value of the detected dimension to a predetermined limit to determine a stop point corresponding to the lower value of the two; and
  
  setting the minimum number of dimensions for the semantic representation of the dataset equal to the value of the stop point.
28. The method of claim 27, wherein the predetermined limit is 1000.
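Claims 14 to 28 can be illustrated with the following sketch, which makes several assumptions that the claims leave open: TSVs are combined by summation, the vote vector scales the sum by multiplication, and the derivatives are approximated by absolute finite differences over sorted values. The function names and the cap at the vector length are hypothetical; the constants 0.5, 10, 0.05, 0.005, and 1000 come from claims 20 and 25 to 28. The minimum vote value of claim 18 is not modeled.

```python
def combine(tsvs, min_value=0.5):
    """Sum per-dimension TSV values and scale them by a vote vector.

    votes[d] counts the data points whose TSV value in dimension d
    exceeds min_value (claims 19 and 20).
    """
    dims = len(tsvs[0])
    summed = [sum(t[d] for t in tsvs) for d in range(dims)]
    votes = [sum(1 for t in tsvs if t[d] > min_value) for d in range(dims)]
    scaled = [s * v for s, v in zip(summed, votes)]  # assumed scaling rule
    return summed, votes, scaled

def stop_point(sorted_values, step=10, t1=0.05, t2=0.005, limit=1000):
    """Number of dimensions to keep, given values sorted in decreasing order."""
    n = len(sorted_values)
    for d in range(2 * step, n, step):
        # Finite-difference estimates of the first and second derivatives.
        first = abs(sorted_values[d] - sorted_values[d - step]) / step
        prev = abs(sorted_values[d - step] - sorted_values[d - 2 * step]) / step
        second = abs(first - prev) / step
        if first < t1 and second < t2:
            # Double the detected dimension, then take the lower of that
            # value and the limit (also capped at the vector length).
            return min(2 * d, limit, n)
    return min(n, limit)

tsvs = [[0.6, 0.4, 0.0], [0.7, 0.2, 0.1], [0.1, 0.9, 0.0]]
summed, votes, scaled = combine(tsvs)
print(votes, scaled)

values = [1 / (i + 1) for i in range(200)]  # already in decreasing order
print(stop_point(values))
```

With the sample curve, both derivative estimates first fall below their thresholds at dimension 30, so the stop point is the doubled value, 60.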

29. A system for constructing a trainable semantic vector representative of a data point in a semantic space, wherein the data point corresponds to at least one word, character string or document, the system comprising:
- a computer configured to:
  
  construct a table for storing information indicative of a relationship between predetermined data points and predetermined categories corresponding to dimensions in the semantic space;
  
  determine the significance of a selected data point with respect to each of said predetermined categories by:
  
  determining a first index representing the proportion of each category containing the selected data point; and
  
  determining a second index representing the distribution of the selected data point's occurrences across all categories; and
  
  construct a trainable semantic vector for said selected data point based on the significance of the selected data point with respect to each of the predetermined categories, wherein the trainable semantic vector has dimensions equal to the number of predetermined categories and represents the strength of the selected data point with respect to the predetermined categories.

30. A system for producing a semantic representation of a dataset in a semantic space, wherein the dataset corresponds to a collection of one or more words, character strings and/or documents, the system comprising:
- a computer configured to:
  
  construct a table for storing information indicative of a relationship between predetermined data points within the dataset and predetermined categories corresponding to dimensions in the semantic space, wherein each of the data points corresponds to at least one word, character string or document;
  
  determine the significance of each data point with respect to said predetermined categories;
  
  construct a trainable semantic vector for each data point based on the significance of each data point with respect to the predetermined categories, wherein each said trainable semantic vector has dimensions equal to the number of said predetermined categories and represents the relative strength of its corresponding data point with respect to each of said predetermined categories; and
  
  combine the trainable semantic vectors for the data points in said dataset to form the semantic representation of said dataset.

31. A computer-readable medium carrying one or more sequences of instructions for constructing a trainable semantic vector representative of a data point in a semantic space, wherein the data point corresponds to at least one word, character string or document, and execution of the one or more sequences of instructions by one or more processors causes the one or more processors to perform the steps of:
- constructing a table for storing information indicative of a relationship between predetermined data points and predetermined categories corresponding to dimensions in the semantic space;
  
  determining the significance of a selected data point with respect to each of the predetermined categories;
  
  constructing a trainable semantic vector for the selected data point based on the significance of the selected data point with respect to each of the predetermined categories, wherein the trainable semantic vector has dimensions equal to the number of predetermined categories and represents the strength of the selected data point with respect to the predetermined categories; and
  
  wherein the step of determining comprises the steps of:
  
  determining a first index representing the proportion of each category containing the selected data point; and
  
  determining a second index representing the distribution of the selected data point's occurrences across all categories.

32. A computer-readable medium carrying one or more sequences of instructions for producing a semantic representation of a dataset in a semantic space, wherein the dataset corresponds to a collection of one or more words, character strings and/or documents, and execution of the one or more sequences of instructions by one or more processors causes the one or more processors to perform steps of:
- constructing a table for storing information indicative of a relationship between predetermined data points and predetermined categories corresponding to dimensions in the semantic space, wherein each of the data points corresponds to at least one word, character string or document;
  
  determining the significance of each data point with respect to the predetermined categories;
  
  constructing a trainable semantic vector for each data point based on the significance of each data point with respect to the predetermined categories, wherein each trainable semantic vector has dimensions equal to the number of predetermined categories and represents the relative strength of its corresponding data point with respect to each of the predetermined categories; and
  
  combining the trainable semantic vectors for the data points in the dataset to form the semantic representation of the dataset.

33. In a document comprising at least two data points, wherein each of the at least two data points corresponds to a word or character string occurring in the document, a method for operating a computer system to organize items by constructing a trainable semantic vector representative of the data points in a semantic space, the method comprising the steps:
- constructing a table for storing information indicative of a relationship between items represented by the data points and predetermined categories corresponding to dimensions in the semantic space;
  
  determining the significance of a selected data point with respect to each of the predetermined categories;
  
  constructing a trainable semantic vector for the selected data point based on the significance of the selected data point with respect to each of the predetermined categories, wherein the trainable semantic vector has dimensions equal to the number of predetermined categories and represents the strength of the selected data point with respect to the predetermined categories.

Current Assignee: Manning & Napier Information Services LLC
Original Assignee: Manning & Napier Information Services LLC
Inventors: Yuan, Bo; Snyder, David L.; Calistri-Yeh, Randall J.; Osborne, George B.
Primary Examiner(s): Robinson, Greta
Assistant Examiner(s): Pannala, Sathyanaraya R

Application Number: US09/562,916
Time in Patent Office: 1,505 Days
Field of Search: 707/1-5, 707/102, 707/104.1, 707/100, 704/9-10, 704/1-2, 706/45-50
US Class Current: 1/1
CPC Class Codes:
G06F 16/36   Creation of semantic tools,...
Y10S 707/99931   Database or file accessing
Y10S 707/99932   Access augmentation or opti...
Y10S 707/99933   Query processing, i.e. sear...
Y10S 707/99934   Query formulation, input pr...
Y10S 707/99935   Query augmenting and refini...
Y10S 707/99936   Pattern matching access
Y10S 707/99939   Privileged access
Y10S 707/99945   Object-oriented database st...
Y10S 707/99948   Application of database or ...
