Construction of trainable semantic vectors and clustering, classification, and searching using trainable semantic vectors

US 7,406,456 B2
Filed: 04/14/2004
Issued: 07/29/2008
Est. Priority Date: 01/27/2000
Status: Expired due to Term

Abstract

An apparatus and method are disclosed for producing a semantic representation of information in a semantic space. The information is first represented in a table that stores values which indicate a relationship with predetermined categories. The categories correspond to dimensions in the semantic space. The significance of the information with respect to the predetermined categories is then determined. A trainable semantic vector (TSV) is constructed to provide a semantic representation of the information. The TSV has dimensions equal to the number of predetermined categories and represents the significance of the information relative to each of the predetermined categories. Various types of manipulation and analysis, such as searching, classification, and clustering, can subsequently be performed on a semantic level.
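
As a rough illustration of the construction the abstract describes, the Python sketch below builds a word-by-category count table from a few labeled snippets and turns each row into a vector with one dimension per category. The significance formula used here (the product of P(word | category) and P(category | word)) is an assumption chosen for illustration, since the abstract does not state one; the function names and sample data are hypothetical.

```python
from collections import Counter, defaultdict


def build_table(labeled_texts):
    """Count how often each word occurs in each category (the 'table')."""
    table = defaultdict(Counter)  # word -> Counter({category: count})
    for category, text in labeled_texts:
        for word in text.lower().split():
            table[word][category] += 1
    return table


def build_vectors(table, categories):
    """Return one vector per word, with one dimension per category."""
    # Total number of word occurrences seen in each category.
    category_totals = Counter()
    for counts in table.values():
        category_totals.update(counts)

    vectors = {}
    for word, counts in table.items():
        word_total = sum(counts.values())
        vector = []
        for category in categories:
            n = counts.get(category, 0)
            total = category_totals[category]
            p_word_given_cat = n / total if total else 0.0
            p_cat_given_word = n / word_total
            # Assumed significance formula, not taken from the patent text.
            vector.append(p_word_given_cat * p_cat_given_word)
        vectors[word] = vector
    return vectors


# Hypothetical training data: (category, text) pairs.
TRAINING = [("sports", "ball game ball"), ("finance", "bank loan game")]
CATEGORIES = ["sports", "finance"]

vectors = build_vectors(build_table(TRAINING), CATEGORIES)
for word, vector in sorted(vectors.items()):
    print(word, [round(x, 3) for x in vector])
```

A word that occurs in only one category ("ball") ends up with weight on a single dimension, while a word spread evenly across categories ("game") gets a small, equal weight on each.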

8 Claims

1. A method for a data processing system to efficiently identify at least one dataset from a collection of datasets according to a query containing information indicative of desired datasets, wherein each dataset is a document and includes one or more data points and each data point corresponds to at least one of a word, a phrase, and a sentence, the method comprising the machine-executed steps:
- for each dataset, constructing a semantic vector representing each dataset;
  
  receiving the query containing information indicative of desired datasets;
  
  for the query, constructing a semantic vector representing the query;
  
  selecting datasets based on a distance between the semantic vector for the query and the semantic vector of each dataset; and
  
  displaying information of the selected datasets to be corresponding to the desired datasets identified in the query;
  
  wherein:
  
  the query or each of the datasets includes at least one data point; and
  
  the semantic vector for the query or each of the datasets is constructed by the steps of:
  
  for each data point, identifying a relationship between each data point and multiple predetermined categories corresponding to dimensions in the semantic space;
  
  determining the significance of each data point with respect to the multiple predetermined categories according to a predetermined formula;
  
  for each data point, constructing a semantic vector representing each data point, wherein each semantic vector has dimensions equal to the number of multiple predetermined categories and represents the significance of its corresponding data point with respect to each of the multiple predetermined categories; and
  
  based on the semantic vector for each of the at least one data point, form the semantic vector representing the query or each of the datasets; and
  
  wherein the significance of each data point is determined by calculating a probability distribution of each data point occurring in each predetermined category and a probability distribution of the data point's occurrence across all predetermined categories.
- 2. The method of claim 1, wherein the datasets correspond to documents and the query is a natural language query.
- 3. The method of claim 1, further comprising a step of clustering the selected datasets in real time.
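
The search flow recited in claim 1 can be sketched as follows. This is a minimal Python illustration, not the patented implementation: the word vectors are made-up values, a text's vector is assumed to be the normalized sum of its word vectors, and Euclidean distance is assumed as the distance measure. The claim itself fixes none of these choices.

```python
import math

# Hypothetical word vectors over three categories: sports, finance, health.
WORD_VECTORS = {
    "ball": [0.9, 0.0, 0.1],
    "game": [0.8, 0.1, 0.1],
    "bank": [0.0, 0.9, 0.1],
    "loan": [0.0, 1.0, 0.0],
    "diet": [0.1, 0.0, 0.9],
}
NUM_CATEGORIES = 3


def text_vector(text):
    """Sum the vectors of known words, then scale to unit length (assumed rule)."""
    total = [0.0] * NUM_CATEGORIES
    for word in text.lower().split():
        vector = WORD_VECTORS.get(word)
        if vector is not None:
            total = [t + v for t, v in zip(total, vector)]
    norm = math.sqrt(sum(t * t for t in total))
    return [t / norm for t in total] if norm else total


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def search(query, documents, top_k=2):
    """Return the top_k (doc_id, distance) pairs closest to the query."""
    query_vec = text_vector(query)
    scored = [
        (doc_id, euclidean(query_vec, text_vector(text)))
        for doc_id, text in documents.items()
    ]
    return sorted(scored, key=lambda pair: pair[1])[:top_k]


# Hypothetical document collection and query.
DOCUMENTS = {"d1": "ball game", "d2": "bank loan", "d3": "diet"}
results = search("loan", DOCUMENTS)
for doc_id, dist in results:
    print(doc_id, round(dist, 3))
```

Because the query and the documents are mapped into the same category space, a document can rank first without sharing any literal word with the query, as long as its words are significant for the same categories.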

4. A method for efficiently identifying data points in a semantic lexicon related to a dataset, wherein the dataset is a document and includes one or more data points and each data point corresponds to at least one of a word, a phrase, and a sentence, the method comprising the machine-executed steps:
- constructing a semantic vector representing the dataset;
  
  selecting data points based on a distance between the semantic vector for the dataset and the semantic vector of each data point;
  
  identifying said selected data points to be related to the dataset; and
  
  displaying a result of the identifying step;
  
  wherein:
  
  the semantic vector for the dataset is constructed by the steps of:
  
  for each data point, identifying a relationship between each data point and multiple predetermined categories corresponding to dimensions in the semantic space;
  
  determining the significance of each data point with respect to the multiple predetermined categories according to a predetermined formula;
  
  constructing a semantic vector representing each data point, wherein each semantic vector has dimensions equal to the number of multiple predetermined categories and represents the significance of its corresponding data point with respect to each of the multiple predetermined categories; and
  
  based on the semantic vector representing each of the at least one data point, form the semantic vector of the dataset; and
  
  wherein the significance of each data point is determined by calculating a probability distribution of each data point occurring in each predetermined category and a probability distribution of the data point's occurrence across all predetermined categories.
- 5. The method of claim 4, wherein the dataset is a document and the data points are words.
- 6. The method of claim 4, wherein the dataset is a natural language query in a search system and the data points are words.
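
Claim 4 runs the comparison in the other direction: given a dataset, it selects lexicon entries whose vectors lie close to the dataset's vector. A minimal Python sketch under stated assumptions follows; the lexicon values are invented, the dataset vector is assumed to be the average of its word vectors, and Euclidean distance is assumed, none of which the claim specifies.

```python
import math

# Hypothetical lexicon: word -> vector over (sports, finance, health).
LEXICON = {
    "ball": [0.9, 0.0, 0.1],
    "bank": [0.0, 0.9, 0.1],
    "loan": [0.0, 1.0, 0.0],
    "credit": [0.0, 0.95, 0.05],
    "diet": [0.1, 0.0, 0.9],
}


def dataset_vector(text):
    """Average the vectors of the words found in the lexicon (assumed rule)."""
    known = [LEXICON[w] for w in text.lower().split() if w in LEXICON]
    if not known:
        return [0.0, 0.0, 0.0]
    return [sum(column) / len(known) for column in zip(*known)]


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def related_terms(text, top_k=3):
    """Return the top_k lexicon words nearest to the dataset's vector."""
    target = dataset_vector(text)
    ranked = sorted(LEXICON, key=lambda word: euclidean(target, LEXICON[word]))
    return ranked[:top_k]


related = related_terms("bank loan")
print(related)
```

In this toy lexicon, "credit" is returned for the text "bank loan" even though it does not appear in the text, which is the kind of related-term lookup the claim describes.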

7. A system for identifying at least one dataset from a collection of datasets according to a query containing information indicative of desired datasets, wherein each dataset is a document and includes one or more data points and each data point corresponds to at least one of a word, a phrase, and a sentence, the system comprising:
- a computer configured to:
  
  construct a semantic vector representing each dataset;
  
  receive the query containing information indicative of desired datasets;
  
  construct a semantic vector representing the query;
  
  select datasets based on a distance between the semantic vector for the query and the semantic vector of each dataset; and
  
  display information of the selected datasets to be corresponding to the desired datasets identified in the query;
  
  wherein:
  
  the query or each of the datasets includes at least one data point; and
  
  the semantic vector for the query or each of the datasets is constructed by the machine-executed steps of:
  
  for each data point, identifying a relationship between each data point and multiple predetermined categories corresponding to dimensions in the semantic space;
  
  determining the significance of each data point with respect to the multiple predetermined categories according to a predetermined formula;
  
  constructing a semantic vector representing each data point, wherein each semantic vector has dimensions equal to the number of multiple predetermined categories and represents the significance of its corresponding data point with respect to each of the multiple predetermined categories; and
  
  based on the semantic vector for each of the at least one data point, form the semantic vector of the query or each of the datasets; and
  
  wherein the significance of each data point is determined by calculating a probability distribution of each data point occurring in each predetermined category and a probability distribution of the data point's occurrence across all predetermined categories.

8. A computer-readable medium carrying one or more sequences of instructions for efficiently identifying at least one dataset from a collection of datasets according to a query containing information indicative of desired datasets, each dataset being a document and including one or more data points and each data point corresponding to at least one of a word, a phrase, and a sentence, wherein execution of the one or more sequences of instructions by one or more processors causes the one or more processors to perform the steps of:
- constructing a semantic vector representing each dataset;
  
  receiving the query containing information indicative of desired datasets;
  
  constructing a semantic vector for the query;
  
  selecting datasets based on a distance between the semantic vector for the query and the semantic vector of each dataset; and
  
  displaying information of the selected datasets to be corresponding to the desired datasets identified in the query;
  
  wherein:
  
  the query or each of the datasets includes at least one data point; and
  
  the semantic vector for the query or each of the datasets is constructed by the steps of:
  
  for each data point, identifying a relationship between each data point and multiple predetermined categories corresponding to dimensions in the semantic space;
  
  determining the significance of each data point with respect to the multiple predetermined categories according to a predetermined formula;
  
  constructing a semantic vector representing each data point, wherein each semantic vector has dimensions equal to the number of multiple predetermined categories and represents the significance of its corresponding data point with respect to each of the multiple predetermined categories; and
  
  based on the semantic vector for each of the at least one data point, form the semantic vector of the query or each of the datasets; and
  
  wherein the significance of each data point is determined by calculating a probability distribution of each data point occurring in each predetermined category and a probability distribution of the data point's occurrence across all predetermined categories.

Current Assignee
Manning & Napier Information Services LLC
Original Assignee
Manning & Napier Information Services LLC
Inventors
Yuan, Bo; Snyder, David L.; Calistri-Yeh, Randall J.; Osborne, George B.
Primary Examiner(s)
Vincent, David
Assistant Examiner(s)
Buss, Benjamin J.

Application Number
US10/823,685
Publication Number
US 20040199505A1
Time in Patent Office
1,567 Days
Field of Search
706/55, 706/934, 706/45-50, 707/6, 707/1-5, 707/100, 707/102, 707/104.1, 704/9
US Class Current
706/55
CPC Class Codes

G06F 16/36   Creation of semantic tools,...

Y10S 707/99931   Database or file accessing

Y10S 707/99932   Access augmentation or opti...

Y10S 707/99933   Query processing, i.e. sear...

Y10S 707/99934   Query formulation, input pr...

Y10S 707/99935   Query augmenting and refini...

Y10S 707/99936   Pattern matching access

Y10S 707/99939   Privileged access

Y10S 707/99945   Object-oriented database st...

Y10S 707/99948   Application of database or ...
