System and method for analysis and clustering of documents for search engine
Abstract
A system and method for searching documents in a data source and, more particularly, for analyzing and clustering documents for a search engine. The system and method include analyzing and processing documents to secure the infrastructure and standards for optimal document processing. By incorporating Computational Intelligence (CI) and statistical methods, the document information is analyzed and clustered using novel techniques for knowledge extraction. A comprehensive dictionary is built based on the keywords identified by these techniques from the entire text of the document. The text is parsed for keywords or the number of their occurrences and the context in which each keyword appears in the documents. The whole document is identified by the knowledge that is represented in its contents. Based on such knowledge extracted from all the documents, the documents are clustered into meaningful groups in a catalog tree. The results of document analysis and clustering information are stored in a database.
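The dictionary-building and parsing steps described in the abstract can be illustrated with a minimal Python sketch. It is not the patented method: the abstract does not say how keywords are selected, so the stop-word filter, the `window` parameter, and the names `tokenize` and `build_dictionary` are all assumptions made for illustration. The sketch records, for each keyword, its total number of occurrences and a short snippet of surrounding words as its context.

```python
import re
from collections import defaultdict

# Hypothetical stop-word list; the source does not specify keyword selection.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "to", "is", "for", "on"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_dictionary(documents, window=2):
    """Map each keyword to its occurrence count and its contexts.

    `documents` maps a document id to its full text. The context of an
    occurrence is the keyword plus up to `window` tokens on each side.
    """
    dictionary = defaultdict(lambda: {"count": 0, "contexts": []})
    for doc_id, text in documents.items():
        tokens = tokenize(text)
        for i, token in enumerate(tokens):
            if token in STOP_WORDS:
                continue  # assumed filter: stop words are not keywords
            entry = dictionary[token]
            entry["count"] += 1
            start = max(0, i - window)
            snippet = " ".join(tokens[start:i + window + 1])
            entry["contexts"].append((doc_id, snippet))
    return dict(dictionary)
```

Calling `build_dictionary({"d1": "The search engine indexes documents"})` would yield one entry per non-stop word, each carrying a count of 1 and a snippet taken from `d1`.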
33 Claims
1. A method for analyzing and processing documents, comprising the steps of:
building a dictionary based on keywords from an entire text of the documents, analyzing text of the documents for the keywords or a number of occurrences of the keywords and a context in which the keywords appear in the text; and
clustering documents into groups of clusters based on information obtained in the analyzing step, wherein each cluster of the groups of clusters includes a set of documents containing a same word or phrase.
Dependent claims: 2-31
32. A system for analyzing and processing documents, comprising the steps of:
a module for building a dictionary based on the keywords from an entire text of the documents, a module for analyzing text of the documents for the keywords or a number of occurrences of the keywords and a context in which the keywords appear in the text; and
a module for clustering documents into groups of clusters based on information obtained in the analyzing step, wherein each cluster of the group of clusters is a set of documents containing a same word or phrase.
33. A machine readable medium containing code for analyzing and processing documents, comprising the steps of:
building a dictionary based on the keywords from an entire text of the documents, analyzing text of the documents for the keywords or a number of occurrences of the keywords and a context in which the keywords appear in the text; and
clustering documents into groups of clusters based on information obtained in the analyzing step, wherein each cluster of the group of clusters is a set of documents containing a same word or phrase.
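The clustering element shared by the independent claims, in which each cluster is a set of documents containing a same word or phrase, can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `cluster_by_shared_keyword` and the `min_docs` threshold are hypothetical, single words stand in for "word or phrase", and the claims do not prescribe this or any particular data structure.

```python
import re
from collections import defaultdict


def cluster_by_shared_keyword(documents, min_docs=2):
    """Group document ids under each word they have in common.

    `documents` maps a document id to its text. A cluster is kept only
    if at least `min_docs` documents contain the word (an assumed
    threshold, not taken from the claims).
    """
    postings = defaultdict(set)
    for doc_id, text in documents.items():
        # Use a set so a word repeated within one document counts once.
        for word in set(re.findall(r"[a-z]+", text.lower())):
            postings[word].add(doc_id)
    return {
        word: sorted(ids)
        for word, ids in postings.items()
        if len(ids) >= min_docs
    }
```

In this sketch a document may belong to several clusters, one for each word it shares with other documents.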
Specification