Granular knowledge based search engine
First Claim
1. A system of indexing documents comprising the steps of:
a. preprocessing documents to extract words;
b. then extracting keywords by calculating a TFIDF for each word, wherein the step of calculating a TFIDF further comprises the substeps of:
i. calculating a term frequency;
ii. calculating a document frequency;
iii. calculating a total number of documents in which a term appears at least once;
c. then comparing the TFIDF for each word with a TFIDF predefined threshold;
d. then finding keyword association by generating a plurality of keyword sets, wherein the step of generating a plurality of keyword sets further comprises the substeps of:
i. filtering keyword sets that do not meet a predefined within-distance threshold; and
ii. filtering keyword sets that do not meet a predefined support threshold, wherein the support threshold is compared to a support level which is proportional to the percentage of documents that contain the keyword set;
e. then providing a clustering of keyword sets and building a document index having a clustering of keyword sets;
f. then providing a search result in the form of a document cluster.
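Steps a through d of the claim can be sketched in Python. This is a minimal illustration under stated assumptions, not the patented implementation: the claim does not give a TFIDF formula, so the common form tf × log(N / df) is assumed; keyword sets are limited to pairs; and the within-distance threshold is taken to be a maximum gap in token positions. The function names `preprocess`, `tfidf_keywords`, and `keyword_pairs` are hypothetical.

```python
import math
import re
from collections import Counter
from itertools import combinations


def preprocess(text):
    """Step a: lowercase the text and extract alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_keywords(docs_tokens, threshold):
    """Steps b-c: keep words whose TFIDF meets the threshold in some document.

    Assumed formula: (count / document length) * log(N / df).
    """
    n_docs = len(docs_tokens)
    # Document frequency: number of documents containing the word at least once.
    df = Counter(word for tokens in docs_tokens for word in set(tokens))
    keywords = set()
    for tokens in docs_tokens:
        for word, count in Counter(tokens).items():
            score = (count / len(tokens)) * math.log(n_docs / df[word])
            if score >= threshold:
                keywords.add(word)
    return keywords


def keyword_pairs(docs_tokens, keywords, max_distance, min_support):
    """Step d: keyword pairs that pass the distance and support filters.

    A pair counts for a document if its two keywords occur within
    max_distance token positions of each other. Support is the fraction
    of documents in which the pair counts.
    """
    doc_count = Counter()
    for tokens in docs_tokens:
        positions = [(i, w) for i, w in enumerate(tokens) if w in keywords]
        close = set()
        for (i, a), (j, b) in combinations(positions, 2):
            if a != b and j - i <= max_distance:
                close.add(frozenset((a, b)))
        doc_count.update(close)  # each pair counted once per document
    n_docs = len(docs_tokens)
    return {pair for pair, n in doc_count.items() if n / n_docs >= min_support}
```

Counting each pair at most once per document keeps the support level proportional to the percentage of documents containing the keyword set, as the claim requires. Larger keyword sets could be grown from surviving pairs in the manner of association rule mining, but that extension is not shown here.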
Abstract
The application borrows terminology from data mining, association rule learning, and topology. A geometric structure represents the collection of concepts in a document set. Within that structure, a high-frequency keyword set whose keywords co-occur closely represents one concept. Document analysis seeks to automate the understanding of the knowledge that expresses the author's ideas. Granular computing theory deals with rough sets and fuzzy sets. One of the key insights of rough set research is that selecting different sets of features or variables yields different concept granulations. Here, as in elementary rough set theory, a “concept” means a set of entities that are indistinguishable or indiscernible to the observer (i.e., a simple concept), or a set of entities composed from such simple concepts (i.e., a complex concept).
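The point about feature selection and concept granulation can be illustrated with a short Python sketch. It assumes entities are described by named attribute values and treats two entities as indiscernible when they agree on every selected feature. The function `granulate` and the sample attributes are hypothetical and are not taken from the application.

```python
from collections import defaultdict


def granulate(entities, features):
    """Partition entities into indiscernibility classes over the chosen features.

    Two entities fall in the same class (a simple concept) when they have
    identical values for every feature in `features`.
    """
    classes = defaultdict(set)
    for name, attrs in entities.items():
        key = tuple(attrs[f] for f in features)
        classes[key].add(name)
    return sorted(sorted(members) for members in classes.values())


# Hypothetical documents described by two attributes.
entities = {
    "d1": {"topic": "search", "length": "short"},
    "d2": {"topic": "search", "length": "long"},
    "d3": {"topic": "topology", "length": "short"},
}

by_topic = granulate(entities, ["topic"])            # [['d1', 'd2'], ['d3']]
by_length = granulate(entities, ["length"])          # [['d1', 'd3'], ['d2']]
by_both = granulate(entities, ["topic", "length"])   # [['d1'], ['d2'], ['d3']]
```

Each choice of features yields a different partition of the same three entities, which is the sense in which different feature sets give different granulations. A complex concept would be a union of such classes.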
20 Claims
1. A system of indexing documents, as recited in full under First Claim above. Dependent claims: 2, 3, 4, 5, 11.
6. A system of indexing documents comprising the steps of:
a. preprocessing documents to extract words;
b. then extracting keywords by calculating a TFIDF for each word;
c. then comparing the TFIDF for each word with a TFIDF predefined threshold;
d. then finding keyword association by generating a plurality of keyword sets;
e. then providing a clustering of keyword sets and building a document index having a clustering of keyword sets;
f. then allowing user selection of a query presented in the clustering of keyword sets;
g. then receiving a user selection of a query presented in the clustering of keyword sets;
h. then providing a search result in the form of a document cluster.
Dependent claims: 7, 8, 9, 10, 12, 13, 14, 15, 16, 17, 18, 19, 20.
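Steps e through h of claim 6 can be sketched as a lookup from keyword sets to document clusters. The claim does not specify a clustering method, so this Python sketch makes a simplifying assumption: each keyword set acts as its own cluster label, and the document cluster is the set of documents containing that keyword set. The names `build_index`, `list_queries`, and `search` are hypothetical.

```python
from collections import defaultdict


def build_index(doc_keyword_sets):
    """Step e: map each keyword set to the documents that contain it."""
    index = defaultdict(set)
    for doc_id, keyword_sets in doc_keyword_sets.items():
        for keyword_set in keyword_sets:
            index[frozenset(keyword_set)].add(doc_id)
    return dict(index)


def list_queries(index):
    """Step f: present the indexed keyword sets as selectable queries."""
    return sorted(sorted(keyword_set) for keyword_set in index)


def search(index, selection):
    """Steps g-h: return the document cluster for the selected keyword set."""
    return sorted(index.get(frozenset(selection), set()))


# Hypothetical input: keyword sets already found for each document.
doc_keyword_sets = {
    "doc1": [{"rough", "set"}],
    "doc2": [{"rough", "set"}, {"index", "documents"}],
    "doc3": [{"index", "documents"}],
}
index = build_index(doc_keyword_sets)
queries = list_queries(index)
result = search(index, ["set", "rough"])
```

Storing keyword sets as `frozenset` keys makes the lookup independent of keyword order, so a user selection of "set, rough" retrieves the same cluster as "rough, set".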
Specification