Method for document retrieval and for word sense disambiguation using neural networks
Abstract
A method for storing and searching documents also useful in disambiguating word senses and a method for generating a dictionary of context vectors. The dictionary of context vectors provides a context vector for each word stem in the dictionary. A context vector is a fixed length list of component values corresponding to a list of word-based features, the component values being an approximate measure of the conceptual relationship between the word stem and the word-based feature. Documents are stored by combining the context vectors of the words remaining in the document after uninteresting words are removed. The summary vector obtained by adding all of the context vectors of the remaining words is normalized. The normalized summary vector is stored for each document. The data base of normalized summary vectors is searched using a query vector and identifying the document whose vector is closest to that query vector. The normalized summary vectors of each document can be stored using cluster trees according to a centroid consistent algorithm to accelerate the searching process. Said searching process also gives an efficient way of finding nearest neighbor vectors in high-dimensional spaces.
584 Citations
23 Claims
1. A method for storing a searchable representation of a document comprising the steps of:
inputting a document containing a series of words in machine readable form into a processing system;
removing from consideration any words in said series of words that are also found in a predetermined list of uninteresting words;
locating in a dictionary of context vectors a context vector for each word remaining in said series of words, each context vector providing for each of a plurality of word-based features, a component value representative of a conceptual relationship between said word and said word-based feature;
combining the context vectors for each word remaining in said series of words to obtain a summary vector for said document;
normalizing said summary vector to produce a normalized summary vector; and
storing said normalized summary vector.
Dependent claims: 2, 3, 4, 5, 6, 7
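The steps of claim 1 can be illustrated with a minimal sketch. The stop list, the three-component context vectors, and the function name `summary_vector` are hypothetical and invented for illustration. The sketch also assumes that "combining" means component-wise addition (as the abstract describes) and that "normalizing" means scaling to unit Euclidean length, which the claim itself does not specify.

```python
import math

# Hypothetical toy data: a stop list and 3-component context vectors.
STOP_WORDS = {"the", "a", "of", "on"}
CONTEXT_VECTORS = {
    "cat": [1.0, 0.0, 0.5],
    "sat": [0.0, 1.0, 0.0],
    "mat": [0.5, 0.5, 0.0],
}

def summary_vector(words, dictionary, stop_words):
    """Add the context vectors of the remaining words, then scale to unit length."""
    dim = len(next(iter(dictionary.values())))
    total = [0.0] * dim
    for word in words:
        w = word.lower()
        # Skipping words missing from the dictionary is an assumption of this sketch.
        if w in stop_words or w not in dictionary:
            continue
        total = [t + c for t, c in zip(total, dictionary[w])]
    norm = math.sqrt(sum(t * t for t in total))
    # An all-zero vector cannot be normalized, so it is returned unchanged.
    return [t / norm for t in total] if norm > 0 else total

doc_vector = summary_vector("The cat sat on the mat".split(),
                            CONTEXT_VECTORS, STOP_WORDS)
```

Storing `doc_vector` alongside a document identifier would then correspond to the final step of the claim.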
8. A method for generating a searchable representation of a query comprising the steps of:
inputting a query comprising a plurality of query words or texts, each text containing a series of words in machine readable form, into a processing system;
assigning a weight to each query word or text;
for each query word or text, locating a context vector or normalized summary vector, respectively, each of said vectors providing for each of a plurality of word-based features a component value representative of a conceptual relationship between said query word or text and said word-based feature;
multiplying the vector of each query word or text by the weight assigned to said query word or text to produce a weighted context vector for each query word and a weighted summary vector for each text; and
summing the weighted context vectors and weighted summary of said plurality of query words and texts to generate a summary for said query.
Dependent claims: 9
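The weighting and summing steps of claim 8 amount to a weighted vector sum, sketched below. The function name `query_vector`, the weights, and the two example vectors are hypothetical; in practice the vectors would come from the dictionary of context vectors (for query words) or from stored normalized summary vectors (for query texts).

```python
def query_vector(weighted_items):
    """Return the sum of weight * vector over (weight, vector) pairs."""
    dim = len(weighted_items[0][1])
    total = [0.0] * dim
    for weight, vec in weighted_items:
        total = [t + weight * v for t, v in zip(total, vec)]
    return total

# Hypothetical inputs: one query word and one query text.
word_vec = [1.0, 0.0, 0.0]   # context vector of a query word
text_vec = [0.0, 0.6, 0.8]   # normalized summary vector of a query text
q = query_vector([(2.0, word_vec), (0.5, text_vec)])
```

Giving the word a larger weight than the text pulls the resulting vector toward the word's features.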
10. A method for cataloging searchable representations of a plurality of documents comprising the steps of:
(a) generating a normalized summary vector for each document of said plurality of documents to create a corresponding plurality of normalized summary vectors;
(b) assigning each of said normalized summary vectors to one of a plurality of nodes in accordance with a centroid consistent clustering algorithm;
(c) calculating a centroid for each of said nodes;
(d) repeating steps (b) and (c) for the normalized summary vectors on one or more of said nodes to create a tree of nodes.
Dependent claims: 11, 12
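Steps (b) through (d) can be sketched as a recursive split. The claim does not name a particular centroid consistent clustering algorithm, so this sketch substitutes plain Lloyd-style (k-means) iterations, which, once converged, leave every vector closer to its own node's centroid than to a sibling's. The node layout (a dict with `centroid`, `vectors`, `children`), the naive seeding, and the parameters `k`, `iters`, and `leaf_size` are all assumptions made for illustration.

```python
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def centroid(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def split(vectors, k=2, iters=20):
    """Lloyd-style iterations: assign to the nearest centroid, then recompute."""
    centroids = [list(v) for v in vectors[:k]]  # naive seeding (assumption)
    groups = []
    for _ in range(iters):
        groups = [[] for _ in centroids]
        for v in vectors:
            i = min(range(len(centroids)), key=lambda j: dist(v, centroids[j]))
            groups[i].append(v)
        groups = [g for g in groups if g]        # drop empty groups
        new_centroids = [centroid(g) for g in groups]
        if new_centroids == centroids:           # assignments are stable
            break
        centroids = new_centroids
    return groups

def build_tree(vectors, leaf_size=2):
    """Steps (b)-(d): cluster, compute centroids, repeat on each child node."""
    node = {"centroid": centroid(vectors), "vectors": vectors, "children": []}
    if len(vectors) > leaf_size:
        groups = split(vectors)
        if len(groups) > 1:  # guard against a split that cannot separate the node
            node["children"] = [build_tree(g, leaf_size) for g in groups]
    return node

# Hypothetical 2-D vectors standing in for normalized summary vectors.
tree = build_tree([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
```

If the iteration limit is reached before the assignments stabilize, the resulting split is only approximately centroid consistent.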
13. A document cataloging and retrieval method comprising the steps of:
inputting a plurality of documents in machine readable form into a processing system;
generating a normalized summary vector for each document of said plurality of documents to create a corresponding plurality of normalized summary vectors;
assigning each of said normalized summary vectors in said plurality of normalized summary vectors to one of a plurality of nodes in accordance with a centroid consistent clustering algorithm;
for a plurality of said nodes, assigning each of the normalized summary vectors assigned to said node to one of a plurality of nodes on a subsequent level in accordance with a centroid consistent clustering algorithm and repeating this step for nodes on subsequent levels to form a cluster tree of nodes, each node characterized by an approximate centroid;
forming a query vector;
searching said tree of nodes for a normalized summary vector which is closest to said query vector by conducting a depth first tree walk through said tree of nodes and pruning a node and any nodes branching therefrom if, upon comparing the approximate centroid of said node with said query vector and the approximate centroid of another node branching from the same node that said node branches from, it is not possible for a closer normalized summary vector to be on said node than the closest normalized summary vector found so far without violating centroid consistency; and
retrieving the document corresponding to the normalized summary vector obtained after searching or pruning all nodes on said tree.
14. A method for locating on a cluster tree the closest vector to a query vector comprising the steps of:
providing a cluster tree for a plurality of vectors, said tree having a parent node to which all of the vectors in said plurality of vectors are assigned and subsequent levels each with a plurality of nodes branching from a node on a previous level, each node including a subset of the vectors from the node it branches from characterized by an approximate centroid wherein the vectors on a node of a subsequent level are each closer to the approximate centroid of its node than to the approximate centroid of any other node on said subsequent level branching from the same node;
forming a query vector;
searching said cluster tree of nodes for a normalized summary vector which is closest to said query vector by conducting a depth first tree walk taking the node branching from a parent having the closest approximate centroid of all the other nodes branching from the parent and pruning a node and any nodes branching therefrom if it is not possible for a closer normalized summary vector to be on said node than the closest normalized summary vector found so far without violating centroid consistency; and
identifying the closest normalized summary vector obtained from said searching.
Dependent claims: 15
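The depth first walk with pruning can be sketched as follows. The claim states the pruning condition only abstractly, so this sketch uses one possible reading: because every vector on a node is closer to that node's centroid than to a sibling's, it lies on the node's side of the hyperplane bisecting the two centroids, and the distance from the query to that hyperplane is a lower bound on the distance to any vector on the node. Euclidean distance, the dict-based node layout, the hand-built two-leaf tree, and the function names are assumptions for illustration. The bound is only strictly valid when the stored centroids satisfy consistency exactly.

```python
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def lower_bound(query, node, siblings):
    """Least possible distance from query to a vector on `node`, assuming each
    such vector is closer to node's centroid than to any sibling's centroid."""
    bound = 0.0
    d_node = dist(query, node["centroid"])
    for sib in siblings:
        if sib is node:
            continue
        gap = dist(node["centroid"], sib["centroid"])
        if gap == 0:
            continue
        d_sib = dist(query, sib["centroid"])
        # Signed distance from the query to the bisecting hyperplane.
        bound = max(bound, (d_node ** 2 - d_sib ** 2) / (2 * gap))
    return bound

def search(node, query, best=(math.inf, None)):
    """Depth first walk, nearest child first; returns (distance, vector)."""
    if not node["children"]:
        for v in node["vectors"]:
            d = dist(query, v)
            if d < best[0]:
                best = (d, v)
        return best
    children = sorted(node["children"], key=lambda c: dist(query, c["centroid"]))
    for child in children:
        if lower_bound(query, child, children) >= best[0]:
            continue  # prune: no vector on this subtree can beat the current best
        best = search(child, query, best)
    return best

# Hypothetical hand-built tree with two centroid consistent leaves.
leaf_a = {"centroid": [0.0, 0.5], "vectors": [[0.0, 0.0], [0.0, 1.0]], "children": []}
leaf_b = {"centroid": [10.0, 10.5], "vectors": [[10.0, 10.0], [10.0, 11.0]], "children": []}
root = {"centroid": [5.0, 5.5], "vectors": leaf_a["vectors"] + leaf_b["vectors"],
        "children": [leaf_a, leaf_b]}

result = search(root, [1.0, 1.0])
```

In this example the walk visits `leaf_a` first, and the bound for `leaf_b` already exceeds the best distance found, so `leaf_b` is skipped without examining its vectors.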
16. A word sense disambiguation method comprising the steps of:
inputting into a processing system in machine readable form a series of words including and surrounding an ambiguous word in a text;
removing from consideration any words in said series of words that are also found in a predetermined list of uninteresting words;
locating in a dictionary of context vectors a context vector for each word remaining in said series of words;
combining the context vectors for each remaining word to obtain a summary vector for said series of words;
locating a plurality of context vectors in said dictionary of context vectors corresponding to a plurality of meanings for said ambiguous word; and
combining said summary vector with each of said context vectors associated with said ambiguous word to obtain a relative distance between each of said context vectors and said summary vector, said relative distances serving as a measure of the relative appropriateness of each of said meanings.
Dependent claims: 17, 18, 19
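A minimal sketch of the disambiguation steps in claim 16 follows. The claim does not say how the summary vector is "combined" with each sense vector, so the sketch assumes a dot product against the unit-length summary vector, with a larger value meaning a more appropriate sense. The two-component vectors, the sense labels, the stop list, and the choice to leave the ambiguous word itself out of the summary are all hypothetical.

```python
import math

# Hypothetical toy data; the two components loosely mean "water" and "finance".
STOP_WORDS = {"the", "by", "a"}
CONTEXT_VECTORS = {
    "river": [1.0, 0.0],
    "water": [0.8, 0.2],
    "money": [0.0, 1.0],
    "loan": [0.1, 0.9],
}
SENSE_VECTORS = {"bank/shore": [1.0, 0.0], "bank/finance": [0.0, 1.0]}

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def rank_senses(window, ambiguous, senses, dictionary, stop_words):
    """Score each sense of `ambiguous` against the summary of its surrounding words."""
    dim = len(next(iter(dictionary.values())))
    total = [0.0] * dim
    for w in window:
        # Assumption: the ambiguous word and unknown words do not contribute.
        if w == ambiguous or w in stop_words or w not in dictionary:
            continue
        total = [t + c for t, c in zip(total, dictionary[w])]
    norm = math.sqrt(dot(total, total))
    summary = [t / norm for t in total] if norm > 0 else total
    scores = [(label, dot(summary, vec)) for label, vec in senses.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)

ranking = rank_senses(["water", "by", "the", "river", "bank"], "bank",
                      SENSE_VECTORS, CONTEXT_VECTORS, STOP_WORDS)
```

The first entry of `ranking` is the sense whose context vector lies closest to the surrounding text under this scoring.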
20. A method for generating a dictionary of context vectors comprising:
providing a corpus of documents, each document including a series of words;
creating a list of all of said words in said corpus of documents;
inputting component values to generate context vectors for a core group of words;
temporarily assigning a zero context vector to the words on said list not included in said core group;
for each word with a zero vector in order of appearance on said list, combining the context vectors for words appearing close to said word within each of the series of words in said documents to generate a context vector for said word.
Dependent claims: 21, 22, 23
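The bootstrapping procedure of claim 20 can be sketched as below. The claim leaves open what "appearing close to" and "combining" mean, so the sketch assumes a fixed window of neighboring positions and component-wise addition, without any normalization. The function name `bootstrap`, the `window` parameter, the tiny corpus, and the core vectors are hypothetical.

```python
def bootstrap(corpus, core, window=2):
    """Derive context vectors for non-core words from the words near them."""
    dim = len(next(iter(core.values())))
    # List every word in order of first appearance in the corpus.
    word_list = []
    for doc in corpus:
        for w in doc:
            if w not in word_list:
                word_list.append(w)
    # Core words keep their supplied vectors; all others start at zero.
    vectors = {w: list(core.get(w, [0.0] * dim)) for w in word_list}
    for word in word_list:
        if word in core:
            continue
        total = [0.0] * dim
        for doc in corpus:
            for i, w in enumerate(doc):
                if w != word:
                    continue
                lo, hi = max(0, i - window), min(len(doc), i + window + 1)
                for j in range(lo, hi):
                    if j != i:
                        total = [t + c for t, c in zip(total, vectors[doc[j]])]
        # Later words on the list see this newly derived vector.
        vectors[word] = total
    return vectors

# Hypothetical corpus and core group.
corpus = [["river", "stream", "flows"], ["money", "cash", "flows"]]
core = {"river": [1.0, 0.0], "money": [0.0, 1.0]}
vectors = bootstrap(corpus, core)
```

Because words are processed in list order, a word derived early (here `stream`) contributes to words derived later (here `flows`), while words still at zero contribute nothing.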
Specification