System and method of automatic discovery of terms in a document that are relevant to a given target topic
Abstract
A computer program product is provided as an automatic mining system that discovers terms relevant to a given target topic from large databases of unstructured information, such as the World Wide Web (WWW). The automatic mining system operates in three stages. In the first stage, a new terms discoverer discovers the terms in a document. In the second stage, a candidate terms discoverer discovers potentially relevant terms. In the third stage, a relevant terms discoverer refines or tests the discovered relevance to filter false relevance. The new terms discoverer includes a system for the automatic mining of patterns and relations, a system for the automatic mining of new relationships, and a system for selecting new terms from relations. In one embodiment, the system for the automatic mining of patterns and relations identifies a set of related terms on the WWW with a high degree of confidence, using a duality concept, and includes a terms database and two identifiers: a relation identifier and a pattern identifier. The system for the automatic mining of new relationships includes a database, a knowledge module, and a statistics module. The knowledge module includes a stemming unit, a synonym check unit, and a domain knowledge check unit. The candidate terms discoverer includes a metadata extractor, a document vector module, an association module, a filtering module, and a database. The relevant terms discoverer includes a stop word filter and a system for the automatic construction of a generalization-specialization hierarchy of terms, which comprises a terms database, an augmentation module, a generalization detection module, and a hierarchy database.
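The second stage can be illustrated with a minimal sketch. It assumes that the document vector module records only whether a term is present in a document (so repeated occurrences within one document do not count) and that "relevance" is measured as rule confidence. The abstract names neither the statistic nor any function names, so `document_vectors`, `association_rules`, and the confidence measure are illustrative assumptions, not the patented implementation.

```python
from itertools import permutations


def document_vectors(docs, vocabulary):
    """Map each document to the set of vocabulary terms it contains.

    Only presence is recorded, so a term repeated many times in one
    document counts once (assumed reading of the document vector module).
    """
    vocab = set(vocabulary)
    vectors = []
    for doc in docs:
        words = set(doc.lower().split())
        vectors.append(words & vocab)
    return vectors


def association_rules(vectors, threshold):
    """Return {(a, b): confidence} for rules a -> b above the threshold.

    Confidence = (documents containing a and b) / (documents containing a).
    Using confidence as the relevance measure is an assumption.
    """
    terms = set().union(*vectors)
    rules = {}
    for a, b in permutations(sorted(terms), 2):
        with_a = sum(1 for v in vectors if a in v)
        with_both = sum(1 for v in vectors if a in v and b in v)
        confidence = with_both / with_a
        if confidence > threshold:  # keep rules that surpass the threshold
            rules[(a, b)] = confidence
    return rules


docs = [
    "java java applet code",
    "java applet browser",
    "coffee java beans",
]
vectors = document_vectors(docs, {"java", "applet", "coffee"})
rules = association_rules(vectors, threshold=0.6)
```

With this toy corpus, the rule `applet -> java` is kept because every document containing "applet" also contains "java", while `java -> coffee` is filtered out because only one of the three "java" documents mentions "coffee".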
22 Claims
1. A system of automatically discovering terms in a document that are relevant to a given target topic, comprising:
a new terms discoverer for identifying the terms in the document;
a candidate terms discoverer for identifying potentially relevant terms from the terms identified by the new terms discoverer;
the candidate terms discoverer comprising:
an association module that performs a statistical analysis, regardless of an occurrence frequency of the terms within a single document; and
a filtering module that filters association rules with relevance that surpasses a user specified threshold; and
a relevant terms discoverer that identifies relevant terms by applying the association rules to the potentially relevant terms identified by the candidate terms discoverer, to refine a relevance of the potentially relevant terms by filtering false relevance.
14. A system of automatically discovering terms in a document that are relevant to a given target topic, comprising:
a new terms discoverer for identifying the terms in the document;
a candidate terms discoverer for identifying potentially relevant terms from the terms identified by the new terms discoverer;
the candidate terms discoverer comprising:
an association module that performs a statistical analysis, regardless of an occurrence frequency of the terms within a single document; and
a filtering module that filters association rules with relevance that surpasses a user specified threshold; and
a relevant terms discoverer that identifies relevant terms by applying the association rules to the potentially relevant terms identified by the candidate terms discoverer, to refine a relevance of the potentially relevant terms by filtering false relevance.
wherein the knowledge module includes one or more of:
a stemming unit, a synonym check unit, or a domain knowledge check unit.
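As a rough illustration of how the three units of the knowledge module might cooperate, the sketch below normalizes a term by stemming it, maps it to a canonical synonym, and then checks it against a list of known domain terms. The claim does not say how any unit works, so the naive suffix-stripping `stem`, the `SYNONYMS` table, the `DOMAIN_TERMS` set, and the reading of the "domain knowledge check" as a lookup are all hypothetical.

```python
# Illustrative data only; neither table comes from the patent.
SYNONYMS = {"www": "web"}
DOMAIN_TERMS = {"applet", "browser", "web"}


def stem(term):
    """Naive suffix stripping; a stand-in for a real stemming unit."""
    for suffix in ("ies", "ing", "es", "s"):
        if term.endswith(suffix) and len(term) > len(suffix) + 2:
            base = term[: -len(suffix)]
            return base + "y" if suffix == "ies" else base
    return term


def normalize(term):
    """Stemming unit followed by synonym check unit."""
    stemmed = stem(term.lower())
    return SYNONYMS.get(stemmed, stemmed)


def check_term(term):
    """Return (canonical form, whether the domain knowledge already has it)."""
    canonical = normalize(term)
    return canonical, canonical in DOMAIN_TERMS
```

For example, `check_term("Applets")` yields `("applet", True)`, whereas `check_term("coffee")` yields `("coffee", False)`, marking "coffee" as a term not yet covered by the assumed domain knowledge.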
17. The system according to claim 15, wherein the candidate terms discoverer includes a metadata extractor, a document vector module, an association module, a filtering module, and a database for storing relevant terms; and
wherein the system for the automatic construction of the generalization hierarchy of terms includes an augmentation module, a generalization detection module, and a hierarchy database.
18. A method of automatically discovering terms in a document that are relevant to a given target topic, comprising:
identifying the terms in the document by means of a new terms discoverer;
identifying potentially relevant terms from the terms identified by the new terms discoverer by means of a candidate terms discoverer that performs a statistical analysis, regardless of an occurrence frequency of the terms within a single document and that filters association rules with relevance that surpasses a user specified threshold; and
identifying relevant terms by applying the association rules to the potentially relevant terms identified by the candidate terms discoverer, to refine a relevance of the potentially relevant terms by filtering false relevance, by means of a relevant terms discoverer.
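The final step of the method can be sketched as a filter over the candidate terms. The claim does not spell out how the association rules are applied, so the sketch assumes one plausible reading: a candidate is kept only if a surviving rule links it to the target topic, and stop words (which tend to co-occur with everything) are dropped as false relevance. The names `relevant_terms` and `STOP_WORDS` and the rule format `{(term, target): score}` are hypothetical.

```python
# Illustrative stop word list; not taken from the patent.
STOP_WORDS = {"the", "a", "of", "and", "page"}


def relevant_terms(candidates, rules, target):
    """Keep candidates backed by a rule pointing at the target topic.

    `rules` maps (antecedent, consequent) pairs to a relevance score and
    is assumed to contain only rules that already passed the threshold.
    """
    kept = []
    for term in candidates:
        if term in STOP_WORDS:
            continue  # frequent everywhere, so its relevance is spurious
        if (term, target) in rules:
            kept.append(term)
    return kept


rules = {
    ("applet", "java"): 1.0,
    ("the", "java"): 0.9,
    ("coffee", "tea"): 0.8,
}
result = relevant_terms(["applet", "the", "coffee"], rules, "java")
```

Here `result` is `["applet"]`: "the" is rejected as a stop word despite its high score, and "coffee" is rejected because no rule ties it to the target topic "java".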
21. A computer usable medium having instruction codes for automatically discovering terms in a document that are relevant to a given target topic, comprising:
a first set of instruction codes for identifying the terms in the document;
a second set of instruction codes for identifying potentially relevant terms from the terms identified by the first set of instruction codes;
for performing a statistical analysis regardless of an occurrence frequency of the terms within a single document; and
for filtering association rules with relevance that surpasses a user specified threshold; and
a third set of instruction codes that identifies relevant terms by applying the association rules to the potentially relevant terms identified by the second set of instruction codes, to refine a relevance of the potentially relevant terms by filtering false relevance.
Specification