Process and system for retrieval of documents using context-relevant semantic profiles
Abstract
A process and system for database storage and retrieval are described, along with methods for obtaining semantic profiles from a training text corpus, i.e., text of known relevance, a method for using the training to guide context-relevant document retrieval, and a method for limiting the range of documents that need to be searched after a query. A neural network is used to extract semantic profiles from the text corpus. A new set of documents, such as World Wide Web pages obtained from the Internet, is then submitted for processing to the same neural network, which computes a semantic profile representation for these pages using the semantic relations learned from profiling the training documents. These semantic profiles are then organized into clusters in order to minimize the time required to answer a query. When a user queries the database, i.e., the set of documents, his or her query is similarly transformed into a semantic profile and compared with the semantic profile of each cluster of documents to identify the closest cluster. The query profile is then compared with each of the documents in that cluster. Documents with the closest weighted match to the query are returned as search results.
-
18 Claims
-
1. A method for extracting a semantic profile from a text corpus, the method comprising the steps of:
-
parsing the text corpus into paragraphs and words;
processing the text corpus to remove stop words and inflectional morphemes;
arranging the vocabulary of the parsed and processed text corpus in an order;
creating a reference set of vectors from the text corpus by transforming each paragraph of the parsed and processed text corpus into a corresponding vector of K elements arranged in the order, K being the size of the vocabulary, each element of the vector corresponding to said each paragraph determined by application of a first predetermined function to the number of occurrences within said each paragraph of the word corresponding to said each element;
presenting the reference set of vectors to a neural network to train the neural network in the word relationships of the text corpus; and
storing activation patterns of the hidden units of the trained neural network for use as the semantic profile of the text corpus.
parsing the document into word units;
processing the document to remove inflectional morphemes;
transforming the parsed and processed document into a text vector of K elements arranged in the order, each element of the text vector determined by application of a second predetermined function to the number of occurrences within said parsed and processed document of the word corresponding to said each element;
presenting the text vector to the trained neural network for processing; and
storing the resulting hidden unit activation patterns of the trained neural network after the processing of the text vector as the document's semantic profile.
-
-
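The steps of claim 1, and the document-profiling steps that follow it, can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not recited in the claims: the stop list, the crude suffix stripping, the alphabetical vocabulary order, the 0/1 "predetermined function", and the choice of a single-hidden-layer autoencoder (TinyAutoencoder is a hypothetical name) trained by plain gradient descent.

```python
import math
import random
from collections import Counter

# Hypothetical stop list; the claims do not enumerate one.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "is", "to", "on"}

def tokenize(text):
    """Split text into lowercase words and drop stop words. Stripping a final
    's' is only a crude stand-in for removing inflectional morphemes."""
    words = []
    for raw in text.lower().split():
        word = "".join(ch for ch in raw if ch.isalpha())
        if not word or word in STOP_WORDS:
            continue
        if len(word) > 3 and word.endswith("s"):
            word = word[:-1]
        words.append(word)
    return words

def build_vocabulary(paragraphs):
    """Arrange the vocabulary in an order (alphabetical, by assumption)."""
    return sorted({w for p in paragraphs for w in tokenize(p)})

def to_vector(text, vocabulary):
    """K-element vector; the assumed function maps each word count to 0 or 1."""
    counts = Counter(tokenize(text))
    return [1.0 if counts[w] > 0 else 0.0 for w in vocabulary]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

class TinyAutoencoder:
    """Assumed network shape: K inputs, a few hidden units, K outputs."""

    def __init__(self, k, hidden_units, seed=0):
        rng = random.Random(seed)
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(k)]
                   for _ in range(hidden_units)]
        self.w2 = [[rng.uniform(-0.5, 0.5) for _ in range(hidden_units)]
                   for _ in range(k)]

    def hidden(self, x):
        """Hidden unit activation pattern, used here as the semantic profile."""
        return [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in self.w1]

    def output(self, h):
        return [sigmoid(sum(w * hj for w, hj in zip(row, h))) for row in self.w2]

    def error(self, vectors):
        """Total squared reconstruction error over a set of vectors."""
        return sum((yi - xi) ** 2
                   for x in vectors
                   for yi, xi in zip(self.output(self.hidden(x)), x))

    def train(self, vectors, epochs=100, lr=0.5):
        """Backpropagation on squared error, one vector at a time."""
        for _ in range(epochs):
            for x in vectors:
                h = self.hidden(x)
                y = self.output(h)
                d_out = [(yi - xi) * yi * (1 - yi) for yi, xi in zip(y, x)]
                d_hid = [hj * (1 - hj) *
                         sum(d_out[i] * self.w2[i][j] for i in range(len(y)))
                         for j, hj in enumerate(h)]
                for i, row in enumerate(self.w2):
                    for j in range(len(row)):
                        row[j] -= lr * d_out[i] * h[j]
                for j, row in enumerate(self.w1):
                    for i in range(len(row)):
                        row[i] -= lr * d_hid[j] * x[i]

paragraphs = [
    "Neural networks learn word relationships.",
    "Documents are retrieved by semantic profiles.",
    "Queries are compared with clusters of documents.",
]
vocabulary = build_vocabulary(paragraphs)
reference = [to_vector(p, vocabulary) for p in paragraphs]
model = TinyAutoencoder(len(vocabulary), hidden_units=3)
model.train(reference)
corpus_profile = [model.hidden(v) for v in reference]
document_profile = model.hidden(to_vector("Semantic profiles of documents", vocabulary))
```

An autoencoder is chosen only because a later claim compares the text vector with the network output, which suggests that input and output share the same K dimensions; other network types may fit the claim language equally well.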
3. A method for determining whether the document represented in terms of the document's semantic profile as in claim 2 is relevant to the text corpus, the method comprising the steps of:
-
applying a first relevance criterion to the document'"'"'s semantic profile;
excluding the document from a database of relevant documents if the document fails the first relevance criterion; and
storing an identifier and a summary of the document if the document passes the first relevance criterion.
-
-
4. A method according to claim 3, wherein the step of applying the first relevance criterion further includes the steps of:
-
calculating a normalized inner product between the text vector and the output of the trained neural network resulting from processing the text vector; and
comparing the normalized inner product to a first empirically determined threshold;
wherein the normalized inner product being less than the first empirically determined threshold signifies failing the first relevance criterion.
-
-
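The test recited in claim 4 can be sketched as follows. The normalized inner product is taken here to be the cosine of the angle between the two vectors; the function names and the threshold value in the usage line are illustrative assumptions, since the claim only says the threshold is determined empirically.

```python
import math

def normalized_inner_product(u, v):
    """Inner product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def passes_first_criterion(text_vector, network_output, threshold):
    """A value below the threshold signifies failing the criterion."""
    return normalized_inner_product(text_vector, network_output) >= threshold

# Hypothetical threshold of 0.8, chosen only for this example.
is_relevant = passes_first_criterion([1.0, 0.0, 1.0], [0.9, 0.2, 0.8], 0.8)
```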
5. A method according to claim 3, wherein the application of said second predetermined function to the number of occurrences within said parsed and processed document of the word corresponding to said each element includes weighting the number of occurrences by the inverse length of the document and by an inverse document frequency.
-
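One possible reading of the weighting in claim 5 is sketched below. The claim does not give a formula, so the logarithmic form of the inverse document frequency, the parameter names, and the choice to count document frequency over reference paragraphs are all assumptions.

```python
import math

def weighted_vector(counts, doc_length, doc_freq, n_docs):
    """Weight each raw count by 1/doc_length and by an assumed IDF, log(n_docs/df).

    counts[i]   occurrences of vocabulary word i in the document
    doc_length  total number of words in the parsed and processed document
    doc_freq[i] number of reference paragraphs that contain word i
    n_docs      number of reference paragraphs
    """
    vector = []
    for count, df in zip(counts, doc_freq):
        idf = math.log(n_docs / df) if df else 0.0
        vector.append((count / doc_length) * idf if doc_length else 0.0)
    return vector

weights = weighted_vector([2, 0, 1], 4, [1, 2, 4], 4)
```

Under this reading, a word that appears in every reference paragraph receives zero weight, and a long document does not score higher merely because it repeats a word more often.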
6. A method according to claim 3, wherein the step of applying the first relevance criterion further includes the steps of:
-
counting the total number of words in the parsed and processed document; and
comparing the total number of words to a second threshold;
wherein the total number of words being more than the second threshold signifies failing the first relevance criterion.
-
-
7. A method according to claim 6, wherein said second threshold is approximately 250,000 words.
-
8. A method according to claim 3, wherein the step of applying the first relevance criterion further includes the steps of:
-
counting m, m being the number of words in the parsed and processed document that are also in the vocabulary;
counting n, n being the number of words in the parsed and processed document;
and comparing r to a first empirically determined range;
wherein r being outside of the first empirically determined range signifies failing the first relevance criterion.
-
-
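The length test of claims 6 and 7 and the vocabulary test of claim 8 can be sketched together. The text of claim 8 as reproduced here does not define r; the sketch assumes r = m / n, the fraction of the document's words that are in the vocabulary. The function names and the range bounds in the usage lines are likewise illustrative assumptions.

```python
def fails_length_test(words, max_words=250_000):
    """More words than the threshold (approximately 250,000) signifies failing."""
    return len(words) > max_words

def coverage_ratio(words, vocabulary):
    """Assumed definition r = m / n: m in-vocabulary words out of n words."""
    n = len(words)
    m = sum(1 for w in words if w in vocabulary)
    return m / n if n else 0.0

def fails_coverage_test(words, vocabulary, low, high):
    """r outside the empirically determined range [low, high] signifies failing."""
    r = coverage_ratio(words, vocabulary)
    return not (low <= r <= high)

vocabulary = {"neural", "network", "query"}
words = ["neural", "network", "banana", "query"]
rejected = fails_length_test(words) or fails_coverage_test(words, vocabulary, 0.5, 1.0)
```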
9. A method for building a contextual semantic space of a set of documents in terms of semantic profiles of the documents in the set relative to the semantic profile of the text corpus derived according to claim 1, the method comprising the steps of:
-
parsing each document into word units;
processing said each document to remove inflectional morphemes;
transforming said each parsed and processed document into a text vector of K elements arranged in the order, each element of said each document's text vector determined by application of a second predetermined function to the number of occurrences within said each parsed and processed document of the word corresponding to said each element;
presenting each text vector to the trained neural network for processing; and
storing resulting hidden unit activation patterns of the trained neural network after the processing of said each text vector as said each document's semantic profile.
-
-
10. A method for clustering semantic profiles of the documents of the set within the contextual semantic space built according to claim 9, the method comprising the step of organizing the stored semantic profiles of the documents into clusters within the contextual semantic space.
-
11. A method according to claim 10, wherein said organizing step comprises the step of using a self-organizing feature map neural network.
-
12. A method according to claim 10, wherein said organizing step comprises the step of using a K-means clustering statistical procedure.
-
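Claim 12 names K-means clustering as one way to organize the stored semantic profiles. A minimal sketch of the standard K-means iteration over profile vectors follows; the function names, the random initialization from the profiles themselves, and the fixed iteration count are assumptions of this example. Note that k here is the number of clusters, unrelated to the vocabulary size K.

```python
import random

def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def mean_vector(vectors):
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def k_means(profiles, k, iterations=20, seed=0):
    """Return (centroids, labels) after alternating assignment and update steps."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(profiles, k)]
    labels = [0] * len(profiles)
    for _ in range(iterations):
        # Assignment step: attach each profile to its nearest centroid.
        labels = [min(range(k), key=lambda c: squared_distance(p, centroids[c]))
                  for p in profiles]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(profiles, labels) if label == c]
            if members:
                centroids[c] = mean_vector(members)
    return centroids, labels

profiles = [[0.0, 0.0], [0.0, 0.1], [1.0, 1.0], [1.0, 0.9]]
centroids, labels = k_means(profiles, k=2)
```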
13. A method for clustering semantic profiles of the documents of the set within the contextual semantic space built according to claim 9, the method comprising the step for clustering the stored semantic profiles of the documents within the contextual semantic space.
-
14. A method for ranking, by relevance to a query, the documents with semantic profiles clustered in accordance with claim 13, the method comprising the steps of:
-
parsing the query into word units;
processing the query to remove inflectional morphemes;
transforming the parsed and processed query into a query text vector of K elements arranged in the order, each element of the query text vector determined by application of a third predetermined function to the number of occurrences within said parsed and processed query of the word corresponding to said each element of the query text vector;
presenting the query text vector to the trained neural network for processing;
storing the resulting hidden unit activation patterns of the trained neural network after the processing of the query text vector as the query semantic profile;
determining centroids of the clusters;
comparing the query semantic profile to the centroids;
identifying a first cluster having a first centroid that is similar to the query semantic profile; and
comparing semantic profiles of documents belonging to the first cluster with the query semantic profile to rank the documents belonging to the cluster by relevance to the query.
-
-
15. A method according to claim 14, wherein said comparing step comprises the steps of:
-
calculating a normalized inner product between the query semantic profile and semantic profile of each document in the first cluster; and
ranking said each document in the first cluster according to the calculated normalized inner product corresponding to said each document in the first cluster.
-
-
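The query-time steps of claims 14 and 15 can be sketched as below, starting from a query semantic profile that has already been computed. The data layout (a dictionary of clusters, each mapping hypothetical document identifiers to profiles) is an assumption, as is the use of the normalized inner product to decide which centroid is "similar"; the claims recite that measure only for ranking documents within the cluster. Clusters are assumed to be non-empty.

```python
import math

def normalized_inner_product(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def centroid(vectors):
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def rank_in_nearest_cluster(query_profile, clusters):
    """clusters maps cluster id -> {document id: semantic profile}.

    Returns the id of the most similar cluster and its documents, ranked by
    normalized inner product with the query profile, highest first."""
    centroids = {cid: centroid(list(docs.values())) for cid, docs in clusters.items()}
    best = max(centroids,
               key=lambda cid: normalized_inner_product(query_profile, centroids[cid]))
    scored = sorted(((normalized_inner_product(query_profile, profile), doc_id)
                     for doc_id, profile in clusters[best].items()), reverse=True)
    return best, [(doc_id, score) for score, doc_id in scored]

clusters = {
    "a": {"d1": [1.0, 0.0], "d2": [0.9, 0.1]},
    "b": {"d3": [0.6, 0.8], "d4": [0.0, 1.0]},
}
best_cluster, ranking = rank_in_nearest_cluster([1.0, 0.0], clusters)
```

Comparing the query with one centroid per cluster before touching individual documents is what limits the number of profiles that must be scored for each query.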
16. A method according to claim 14, further comprising the step of applying a second relevance criterion to the ranked documents.
-
17. A method for ranking, by relevance to a query, the documents with semantic profiles clustered in accordance with claim 10, and for selecting j of the ranked documents, j being a first predetermined number, the method comprising the steps of:
-
parsing the query into word units;
processing the query to remove inflectional morphemes;
transforming the parsed and processed query into a query text vector of K elements arranged in the order, each element of the query text vector determined by application of a third predetermined function to the number of occurrences within said parsed and processed query of the word corresponding to said each element of the query text vector;
presenting the query text vector to the trained neural network for processing;
storing the resulting hidden unit activation patterns of the trained neural network after the processing of the query text vector as the query semantic profile;
determining centroids of the clusters;
comparing the query semantic profile to the centroids;
identifying a first cluster having a first centroid that is similar to the query semantic profile;
comparing semantic profiles of documents belonging to the first cluster with the query semantic profile to rank the documents belonging to the first cluster by relevance to the query;
applying a second relevance criterion to the ranked documents belonging to the first cluster;
if fewer than j documents pass the second relevance criterion, identifying a second cluster having a second centroid that is similar to the query semantic profile;
comparing semantic profiles of documents belonging to the second cluster with the query semantic profile to rank the documents belonging to the second cluster by relevance to the query;
applying the second relevance criterion to the ranked documents belonging to the second cluster; and
selecting j documents from the ranked documents that passed the second relevance criterion.
-
-
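The fallback in claim 17, where a second cluster is searched only if fewer than j documents pass, can be sketched as follows. The second relevance criterion is not defined in the claim, so a minimum normalized inner product (min_score) is assumed here. The max_clusters parameter is also an assumption; its default of 2 matches the first and second clusters that the claim recites. Clusters are assumed to be non-empty.

```python
import math

def normalized_inner_product(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def centroid(vectors):
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def select_top_j(query_profile, clusters, j, min_score, max_clusters=2):
    """clusters maps cluster id -> {document id: semantic profile}."""
    # Visit clusters in order of centroid similarity to the query.
    order = sorted(
        clusters,
        key=lambda cid: normalized_inner_product(
            query_profile, centroid(list(clusters[cid].values()))),
        reverse=True)
    passed = []
    for cid in order[:max_clusters]:
        ranked = sorted(((normalized_inner_product(query_profile, profile), doc_id)
                         for doc_id, profile in clusters[cid].items()), reverse=True)
        # Assumed second relevance criterion: score at or above min_score.
        passed.extend((doc_id, score) for score, doc_id in ranked if score >= min_score)
        if len(passed) >= j:
            break  # enough documents; the next cluster is not searched
    passed.sort(key=lambda item: item[1], reverse=True)
    return passed[:j]

clusters = {
    "a": {"d1": [1.0, 0.0], "d2": [0.9, 0.1]},
    "b": {"d3": [0.6, 0.8], "d4": [0.0, 1.0]},
}
top_three = select_top_j([1.0, 0.0], clusters, j=3, min_score=0.5)
```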
18. A system for ranking of documents by relevance to a query within a context defined by a text corpus, the system comprising:
-
means for parsing the text corpus into paragraphs and words;
means for arranging the vocabulary of the parsed text corpus in an order;
means for creating a reference set of vectors from the text corpus by transforming each paragraph of the parsed text corpus into a corresponding vector of K elements arranged in the order, K being the size of the vocabulary, each element of the vector corresponding to said each paragraph determined by application of a first predetermined function to the number of occurrences within said each paragraph of the word corresponding to said each element;
a neural network;
means for presenting the reference set of vectors to the neural network to train the neural network in the word relationships of the text corpus;
means for retrieving and storing activation patterns of the hidden units of the trained neural network for use as a semantic profile of the text corpus;
means for parsing each document into word units;
means for transforming said each parsed document into a text vector of K elements arranged in the order, each element of said each document's text vector determined by application of a second predetermined function to the number of occurrences within said each parsed document of the word corresponding to said each element;
means for presenting each text vector to the trained neural network for processing;
means for retrieving and storing a semantic profile of said each document, said semantic profile of said each document being hidden unit activation patterns of the trained neural network resulting from the processing of said each text vector;
means for organizing stored semantic profiles of the documents into clusters;
means for determining centroids of the clusters;
means for parsing the query into word units;
means for transforming the parsed query into a query text vector of K elements arranged in the order, each element of the query text vector determined by application of a third predetermined function to the number of occurrences within said parsed query of the word corresponding to said each element of the query text vector;
means for presenting the query text vector to the trained neural network for processing;
means for retrieving and storing a query semantic profile, the query semantic profile being hidden unit activation patterns of the trained neural network resulting from the processing of the query text vector;
means for comparing the query semantic profile to the centroids;
means for identifying a first cluster having a first centroid that is similar to the query semantic profile; and
means for comparing semantic profiles of documents belonging to the first cluster with the query semantic profile to rank the documents belonging to the first cluster by relevance to the query.
-
Specification