Process and system for retrieval of documents using context-relevant semantic profiles

US 6,189,002 B1
Filed: 12/08/1999
Issued: 02/13/2001
Est. Priority Date: 12/14/1998
Status: Expired due to Term

Abstract

A process and system for database storage and retrieval are described, along with methods for obtaining semantic profiles from a training text corpus, i.e., text of known relevance, a method for using the training to guide context-relevant document retrieval, and a method for limiting the range of documents that need to be searched after a query. A neural network is used to extract semantic profiles from the text corpus. A new set of documents, such as World Wide Web pages obtained from the Internet, is then submitted for processing to the same neural network, which computes a semantic profile representation for these pages using the semantic relations learned from profiling the training documents. These semantic profiles are then organized into clusters in order to minimize the time required to answer a query. When a user queries the database, i.e., the set of documents, his or her query is similarly transformed into a semantic profile and compared with the semantic profile of each cluster of documents to find the closest cluster. The query profile is then compared with each of the documents in that cluster. Documents with the closest weighted match to the query are returned as search results.

18 Claims

1. A method for extracting a semantic profile from a text corpus, the method comprising the steps of:
- parsing the text corpus into paragraphs and words;
  
  processing the text corpus to remove stop words and inflectional morphemes;
  
  arranging the vocabulary of the parsed and processed text corpus in an order;
  
  creating a reference set of vectors from the text corpus by transforming each paragraph of the parsed and processed text corpus into a corresponding vector of K elements arranged in the order, K being the size of the vocabulary, each element of the vector corresponding to said each paragraph determined by application of a first predetermined function to the number of occurrences within said each paragraph of the word corresponding to said each element;
  
  presenting the reference set of vectors to a neural network to train the neural network in the word relationships of the text corpus; and
  
  storing activation patterns of the hidden units of the trained neural network for use as the semantic profile of the text corpus.
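
The parsing and vector-building steps of claim 1 can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the stop-word list, the suffix-stripping rule, the sorted vocabulary order, and the choice of a raw count as the "first predetermined function" are all assumptions, and the neural network training step is omitted.

```python
import re

# Hypothetical stop-word list and suffix rules; the claim specifies neither.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "in", "to"}
SUFFIXES = ("ing", "ed", "es", "s")


def strip_inflection(word):
    """Naive stand-in for removing inflectional morphemes."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokenize(text):
    """Lowercase, split into words, drop stop words, strip suffixes."""
    words = re.findall(r"[a-z]+", text.lower())
    return [strip_inflection(w) for w in words if w not in STOP_WORDS]


def build_reference_vectors(corpus):
    """Return (vocabulary, vectors): one K-element count vector per paragraph."""
    paragraphs = [tokenize(p) for p in re.split(r"\n\s*\n", corpus) if p.strip()]
    vocabulary = sorted({w for p in paragraphs for w in p})  # assumed order: alphabetical
    index = {w: i for i, w in enumerate(vocabulary)}
    vectors = []
    for words in paragraphs:
        vec = [0] * len(vocabulary)
        for w in words:
            vec[index[w]] += 1  # assumed first function: raw occurrence count
        vectors.append(vec)
    return vocabulary, vectors
```

Each paragraph becomes one row of the reference set, and K is simply `len(vocabulary)`.
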
2. A method for representing a document in terms of the document's semantic profile relative to the semantic profile of the text corpus derived in claim 1, the method comprising the steps of:
3. A method for determining whether the document represented in terms of the document's semantic profile as in claim 2 is relevant to the text corpus, the method comprising the steps of:
- applying a first relevance criterion to the document's semantic profile;
  
  excluding the document from a database of relevant documents if the document fails the first relevance criterion; and
  
  storing an identifier and a summary of the document if the document passes the first relevance criterion.
4. A method according to claim 3, wherein the step of applying the first relevance criterion further includes the steps of:
- calculating a normalized inner product between the text vector and the output of the trained neural network resulting from processing the text vector; and
  
  comparing the normalized inner product to a first empirically determined threshold;
  
  wherein the normalized inner product being less than the first empirically determined threshold signifies failing the first relevance criterion.
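
As a hedged illustration of the check in claim 4, the sketch below reads "normalized inner product" as the dot product divided by the product of the two Euclidean norms (an assumption; the claim does not define the normalization). The function names and the zero-vector convention are hypothetical.

```python
import math


def normalized_inner_product(u, v):
    """Dot product of u and v divided by the product of their Euclidean norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumed convention for an all-zero vector
    return dot / (norm_u * norm_v)


def passes_first_criterion(text_vector, network_output, threshold):
    """Fail when the normalized inner product is below the threshold."""
    return normalized_inner_product(text_vector, network_output) >= threshold
```

The threshold itself is described only as empirically determined, so any value passed here is a placeholder.
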
5. A method according to claim 3, wherein the application of said second predetermined function to the number of occurrences within said parsed and processed document of the word corresponding to said each element includes weighting the number of occurrences by the inverse length of the document and by an inverse document frequency.
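
One possible reading of the weighting in claim 5 is sketched below. The claim names the two factors but not their exact form, so the logarithmic inverse document frequency and the use of the summed counts as the document length are assumptions, as are all parameter names.

```python
import math


def weight_counts(counts, doc_freq, num_docs):
    """Weight raw counts by inverse document length and an assumed IDF.

    counts   -- K raw occurrence counts for one document
    doc_freq -- K values: how many reference documents contain each word
    num_docs -- number of reference documents used for the IDF
    """
    length = sum(counts)  # assumed length: total in-vocabulary occurrences
    if length == 0:
        return [0.0] * len(counts)
    weighted = []
    for count, df in zip(counts, doc_freq):
        idf = math.log(num_docs / df) if df > 0 else 0.0  # assumed IDF form
        weighted.append(count / length * idf)
    return weighted
```
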
6. A method according to claim 3, wherein the step of applying the first relevance criterion further includes the steps of:
- counting the total number of words in the parsed and processed document; and
  
  comparing the total number of words to a second threshold;
  
  wherein the total number of words being more than the second threshold signifies failing the first relevance criterion.
7. A method according to claim 6, wherein said second threshold is approximately 250,000 words.
8. A method according to claim 3, wherein the step of applying the first relevance criterion further includes the steps of:
- counting m, m being the number of words in the parsed and processed document that are also in the vocabulary;
  
  counting n, n being the number of words in the parsed and processed document;
  
  calculating $r = \frac{m}{n - m}$;
  
  and comparing r to a first empirically determined range;
  
  wherein r being outside of the first empirically determined range signifies failing the first relevance criterion.
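
The ratio test of claim 8 can be illustrated with a short sketch. The function names, the inclusive range bounds, and the handling of the case where every word is in the vocabulary (so that n - m is zero) are assumptions not stated in the claim.

```python
def vocabulary_ratio(doc_words, vocabulary):
    """Return r = m / (n - m): m counts in-vocabulary words, n counts all words."""
    n = len(doc_words)
    m = sum(1 for w in doc_words if w in vocabulary)
    if n == m:
        return float("inf")  # assumed handling when no word falls outside the vocabulary
    return m / (n - m)


def passes_ratio_check(doc_words, vocabulary, low, high):
    """Fail when r lies outside the (assumed inclusive) range [low, high]."""
    return low <= vocabulary_ratio(doc_words, vocabulary) <= high
```

For example, a document of four words with three in the vocabulary gives r = 3 / (4 - 3) = 3.
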
9. A method for building a contextual semantic space of a set of documents in terms of semantic profiles of the documents in the set relative to the semantic profile of the text corpus derived according to claim 1, the method comprising the steps of:
- parsing each document into word units;
  
  processing said each document to remove inflectional morphemes;
  
  transforming said each parsed and processed document into a text vector of K elements arranged in the order, each element of said each document's text vector determined by application of a second predetermined function to the number of occurrences within said each parsed and processed document of the word corresponding to said each element;
  
  presenting each text vector to the trained neural network for processing; and
  
  storing resulting hidden unit activation patterns of the trained neural network after the processing of said each text vector as said each document's semantic profile.
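
To make the idea of a hidden-unit activation pattern concrete, the sketch below computes the hidden layer of a single-hidden-layer network for one text vector. The claims do not specify the network architecture or activation function, so the sigmoid units and the `weights`/`biases` layout are assumptions made purely for illustration.

```python
import math


def hidden_activations(text_vector, weights, biases):
    """Return the hidden-layer activations for one K-element text vector.

    weights[h] is the K-element input weight row for hidden unit h;
    biases[h] is that unit's bias. Both would come from a trained network.
    """
    profile = []
    for row, bias in zip(weights, biases):
        net = sum(w * x for w, x in zip(row, text_vector)) + bias
        profile.append(1.0 / (1.0 + math.exp(-net)))  # assumed sigmoid unit
    return profile
```

The returned list plays the role of the document's semantic profile: its length equals the number of hidden units, which is typically much smaller than K.
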
10. A method for clustering semantic profiles of the documents of the set within the contextual semantic space built according to claim 9, the method comprising the step of organizing the stored semantic profiles of the documents into clusters within the contextual semantic space.
11. A method according to claim 10, wherein said organizing step comprises the step of using a self-organizing feature map neural network.
12. A method according to claim 10, wherein said organizing step comprises the step of using K-means clustering statistical procedure.
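
A minimal K-means sketch for grouping semantic profiles, in the spirit of claim 12, might look like the following. Initializing the centroids from the first k profiles and running a fixed number of iterations are simplifying assumptions; note also that `k` here is the number of clusters, not the vocabulary size K used elsewhere in the claims.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Component-wise mean of a non-empty list of profiles."""
    return [sum(col) / len(points) for col in zip(*points)]


def k_means(profiles, k, iterations=20):
    """Cluster profiles; return (centroids, labels)."""
    centroids = [list(p) for p in profiles[:k]]  # assumed initialization
    labels = [0] * len(profiles)
    for _ in range(iterations):
        # Assignment step: each profile joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in profiles
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(profiles, labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = mean(members)
    return centroids, labels
```
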
13. A method for clustering semantic profiles of the documents of the set within the contextual semantic space built according to claim 9, the method comprising the step for clustering the stored semantic profiles of the documents within the contextual semantic space.
14. A method for ranking, by relevance to a query, the documents with semantic profiles clustered in accordance with claim 13, the method comprising the steps of:
- parsing the query into word units;
  
  processing the query to remove inflectional morphemes;
  
  transforming the parsed and processed query into a query text vector of K elements arranged in the order, each element of the query text vector determined by application of a third predetermined function to the number of occurrences within said parsed and processed query of the word corresponding to said each element of the query text vector;
  
  presenting the query text vector to the trained neural network for processing;
  
  storing the resulting hidden unit activation patterns of the trained neural network after the processing of the query text vector as the query semantic profile;
  
  determining centroids of the clusters;
  
  comparing the query semantic profile to the centroids;
  
  identifying a first cluster having a first centroid that is similar to the query semantic profile; and
  
  comparing semantic profiles of documents belonging to the first cluster with the query semantic profile to rank the documents belonging to the cluster by relevance to the query.
15. A method according to claim 14, wherein said comparing step comprises the steps of:
- calculating a normalized inner product between the query semantic profile and semantic profile of each document in the first cluster; and
  
  ranking said each document in the first cluster according to the calculated normalized inner product corresponding to said each document in the first cluster.
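
The query-time steps of claims 14 and 15 can be sketched as a two-stage lookup: pick the cluster whose centroid is most similar to the query profile, then rank only that cluster's documents. Using the same normalized inner product for the centroid comparison is an assumption (claim 14 says only "similar"), and the `clusters` data layout is hypothetical.

```python
import math


def cosine(u, v):
    """Normalized inner product; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms if norms else 0.0


def rank_in_nearest_cluster(query_profile, centroids, clusters):
    """clusters[i] is a list of (doc_id, profile) pairs for centroids[i]."""
    best = max(
        range(len(centroids)),
        key=lambda i: cosine(query_profile, centroids[i]),
    )
    scored = [(cosine(query_profile, prof), doc_id) for doc_id, prof in clusters[best]]
    scored.sort(reverse=True)  # highest similarity first
    return best, [(doc_id, score) for score, doc_id in scored]
```

Only the documents in one cluster are scored, which is the stated purpose of clustering: limiting how many profiles must be compared per query.
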
16. A method according to claim 14, further comprising the step of applying a second relevance criterion to the ranked documents.
17. A method for ranking, by relevance to a query, the documents with semantic profiles clustered in accordance with claim 10, and for selecting j of the ranked documents, j being a first predetermined number, the method comprising the steps of:
- parsing the query into word units;
  
  processing the query to remove inflectional morphemes;
  
  transforming the parsed and processed query into a query text vector of K elements arranged in the order, each element of the query text vector determined by application of a third predetermined function to the number of occurrences within said parsed and processed query of the word corresponding to said each element of the query text vector;
  
  presenting the query text vector to the trained neural network for processing;
  
  storing the resulting hidden unit activation patterns of the trained neural network after the processing of the query text vector as the query semantic profile;
  
  determining centroids of the clusters;
  
  comparing the query semantic profile to the centroids;
  
  identifying a first cluster having a first centroid that is similar to the query semantic profile;
  
  comparing semantic profiles of documents belonging to the first cluster with the query semantic profile to rank the documents belonging to the first cluster by relevance to the query;
  
  applying a second relevance criterion to the ranked documents belonging to the first cluster;
  
  if fewer than j documents pass the second relevance criterion, identifying a second cluster having a second centroid that is similar to the query semantic profile;
  
  comparing semantic profiles of documents belonging to the second cluster with the query semantic profile to rank the documents belonging to the second cluster by relevance to the query;
  
  applying the second relevance criterion to the ranked documents belonging to the second cluster; and
  
  selecting j documents from the ranked documents that passed the second relevance criterion.
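
The fallback logic of claim 17 can be sketched as below. Two assumptions go beyond the claim text: the "second relevance criterion" is modeled as a minimum similarity score, and the search continues through further clusters in order of centroid similarity rather than stopping after the second. All names are hypothetical.

```python
import math


def cosine(u, v):
    """Normalized inner product; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms if norms else 0.0


def select_top_j(query_profile, centroids, clusters, j, min_score):
    """Visit clusters from most to least similar until j documents pass."""
    order = sorted(
        range(len(centroids)),
        key=lambda i: cosine(query_profile, centroids[i]),
        reverse=True,
    )
    passed = []
    for i in order:
        scored = [(cosine(query_profile, prof), doc_id) for doc_id, prof in clusters[i]]
        # Assumed second relevance criterion: a minimum similarity score.
        passed.extend(s for s in scored if s[0] >= min_score)
        if len(passed) >= j:
            break  # enough documents; no need to open another cluster
    passed.sort(reverse=True)
    return [doc_id for _, doc_id in passed[:j]]
```
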

18. A system for ranking of documents by relevance to a query within a context defined by a text corpus, the system comprising:
- means for parsing the text corpus into paragraphs and words;
  
  means for arranging the vocabulary of the parsed text corpus in an order;
  
  means for creating a reference set of vectors from the text corpus by transforming each paragraph of the parsed text corpus into a corresponding vector of K elements arranged in the order, K being the size of the vocabulary, each element of the vector corresponding to said each paragraph determined by application of a first predetermined function to the number of occurrences within said each paragraph of the word corresponding to said each element;
  
  a neural network;
  
  means for presenting the reference set of vectors to the neural network to train the neural network in the word relationships of the text corpus;
  
  means for retrieving and storing activation patterns of the hidden units of the trained neural network for use as a semantic profile of the text corpus;
  
  means for parsing each document into word units;
  
  means for transforming said each parsed document into a text vector of K elements arranged in the order, each element of said each document's text vector determined by application of a second predetermined function to the number of occurrences within said each parsed document of the word corresponding to said each element;
  
  means for presenting each text vector to the trained neural network for processing;
  
  means for retrieving and storing a semantic profile of said each document, said semantic profile of said each document being hidden unit activation patterns of the trained neural network resulting from the processing of said each text vector;
  
  means for organizing stored semantic profiles of the documents into clusters;
  
  means for determining centroids of the clusters;
  
  means for parsing the query into word units;
  
  means for transforming the parsed query into a query text vector of K elements arranged in the order, each element of the query text vector determined by application of a third predetermined function to the number of occurrences within said parsed query of the word corresponding to said each element of the query text vector;
  
  means for presenting the query text vector to the trained neural network for processing;
  
  means for retrieving and storing a query semantic profile, the query semantic profile being hidden unit activation patterns of the trained neural network resulting from the processing of the query text vector;
  
  means for comparing the query semantic profile to the centroids;
  
  means for identifying a first cluster having a first centroid that is similar to the query semantic profile; and
  
  means for comparing semantic profiles of documents belonging to the first cluster with the query semantic profile to rank the documents belonging to the first cluster by relevance to the query.

Current Assignee: DTI of Washington LLC
Original Assignee: Dolphin Search
Inventors: Roitblat, Herbert L.
Primary Examiner(s): Alam, Hosain T.
Assistant Examiner(s): Shah, Sanjiv
Application Number: US09/457,190
Time in Patent Office: 433 Days
Field of Search: 707/1, 707/3, 707/5, 707/532, 707/4, 707/6, 707/202, 707/531, 706/15, 706/16, 706/25, 704/9, 704/231
US Class Current: 1/1
CPC Class Codes:
G06F 16/30   of unstructured textual dat...
G06F 16/313   Selection or weighting of t...
G06F 16/367   Ontology
Y10S 707/99931   Database or file accessing
Y10S 707/99935   Query augmenting and refini...