System and method for the automatic recognition of relevant terms by mining link annotations
Abstract
A computer program product is provided as an automatic mining system that identifies a set of relevant terms from a large text database of unstructured information, such as the World Wide Web (WWW), with a high degree of confidence, by association mining and refinement of co-occurrences using hypertext link metadata. The automatic mining system includes a software package comprising a metadata extractor, a document vector module, an association module, and a filtering module, together with a database for storing the mined sets of relevant terms. The system scans the downloaded hypertext links, rather than the entire body of each document, for related information. As a result, the crawler does not need to perform a relatively lengthy download of the document content, which reduces download and processing time.
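To make the metadata-extraction step concrete, here is a minimal sketch of pulling link annotations out of a page with Python's standard `html.parser` module. It is an illustration only, not the patented implementation: the class name `LinkMetadataExtractor`, the function `extract_link_metadata`, and the choice of `href`, `title`, and anchor text as the "metadata" are all assumptions made for this example, and the page markup is assumed to be available as a string.

```python
from html.parser import HTMLParser


class LinkMetadataExtractor(HTMLParser):
    """Hypothetical helper: records href, title, and anchor text of each <a> element."""

    def __init__(self):
        super().__init__()
        self.links = []        # completed link records
        self._current = None   # record being filled while inside <a>...</a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            attr_map = dict(attrs)
            self._current = {
                "href": attr_map.get("href") or "",
                "title": attr_map.get("title") or "",
                "text": "",
            }

    def handle_data(self, data):
        # Text outside of links (the document body) is deliberately ignored.
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            # Collapse runs of whitespace in the anchor text.
            self._current["text"] = " ".join(self._current["text"].split())
            self.links.append(self._current)
            self._current = None


def extract_link_metadata(html):
    """Return a list of link records found in the given HTML string."""
    parser = LinkMetadataExtractor()
    parser.feed(html)
    parser.close()
    return parser.links
```

Only the text inside link elements is retained, which mirrors the abstract's point that the link annotations, not the full document body, are the input to the mining steps.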
30 Claims
1. A system for automatically and iteratively mining relevant terms comprising:
a metadata extractor for extracting hypertext links from a document, the hypertext links containing metadata terms c_{n,m};
a document vector module for creating a vector for the document, using the hypertext links;
an association module for measuring the number of documents that contain the metadata terms c_{n,m} in the hypertext links to perform a statistical analysis;
wherein the association module discovers association rules from the document vector based primarily on the hypertext links;
wherein the association rules comprise a support metric for an association rule (X|Y), where X and Y are sets of terms, and where a support p(X, Y) is defined as a joint probability of the frequency of co-occurrence of the sets of terms X and Y; and
wherein the association rules further comprise a hybrid metric H(s,c) that normalizes a support function n(s) and a confidence function n(c), and is expressed as follows:
- Dependent claims: 2, 3, 4, 5, 6
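The support metric recited in claim 1 can be illustrated with a short sketch. It assumes each document is represented as a set of terms taken from its hypertext links, and it reads the rule (X|Y) as "X implies Y". The `confidence` function uses the standard association-rule definition p(X, Y) / p(X), which is a textbook assumption rather than something stated in the claim text above. The hybrid metric H(s,c) is not implemented because its expression does not appear in this text. The function names are hypothetical.

```python
def support(documents, x, y):
    """Joint probability p(X, Y): fraction of documents containing all terms of X and Y."""
    if not documents:
        return 0.0
    both = set(x) | set(y)
    hits = sum(1 for terms in documents if both <= terms)
    return hits / len(documents)


def confidence(documents, x, y):
    """Standard confidence p(X, Y) / p(X); an assumption, not taken from the claim."""
    x = set(x)
    containing_x = sum(1 for terms in documents if x <= terms)
    if containing_x == 0:
        return 0.0
    both = x | set(y)
    hits = sum(1 for terms in documents if both <= terms)
    return hits / containing_x
```

For example, with four documents whose link terms are `{"ibm", "db2"}`, `{"ibm", "db2", "websphere"}`, `{"ibm"}`, and `{"oracle"}`, the rule from `{"ibm"}` to `{"db2"}` has support 2/4 and confidence 2/3.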
7. A computer program product for automatically and iteratively mining relevant terms comprising:
a metadata extractor for extracting hypertext links from a document, the hypertext links containing metadata terms c_{n,m};
a document vector module for creating a vector for the document, using the hypertext links;
an association module for measuring the number of documents that contain the metadata terms c_{n,m} in the hypertext links to perform a statistical analysis;
wherein the association module discovers association rules from the document vector based primarily on the hypertext links;
wherein the association rules comprise a support metric for an association rule (X|Y), where X and Y are sets of terms, and where a support p(X, Y) is defined as a joint probability of the frequency of co-occurrence of the sets of terms X and Y; and
wherein the association rules further comprise a hybrid metric H(s,c) that normalizes a support function n(s) and a confidence function n(c), and is expressed as follows:
- Dependent claims: 8, 9, 10, 11, 12
13. A method for automatically and iteratively mining relevant terms comprising:
extracting hypertext links containing metadata terms c_{n,m} from a document;
creating a vector for the document, using the hypertext links;
measuring the number of documents that contain the metadata terms c_{n,m} in the hypertext links to perform a statistical analysis;
discovering association rules from the document vector based primarily on the hypertext links;
wherein the association rules comprise a support metric for an association rule (X|Y), where X and Y are sets of terms, and where a support p(X, Y) is defined as a joint probability of the frequency of co-occurrence of the sets of terms X and Y; and
wherein the association rules further comprise a hybrid metric H(s,c) that normalizes a support function n(s) and a confidence function n(c), and is expressed as follows:
- Dependent claims: 14, 15, 16, 17, 18
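The vector-creation and document-counting steps of the method in claim 13 can be sketched as follows. This is one possible reading, not the patent's own procedure: the binary (present/absent) vector encoding, the lowercase alphanumeric tokenization, and the names `link_terms`, `document_frequencies`, and `document_vector` are all assumptions introduced for illustration.

```python
import re
from collections import Counter


def link_terms(link_texts):
    """Set of lowercased word tokens found in a document's link annotations."""
    terms = set()
    for text in link_texts:
        terms.update(re.findall(r"[a-z0-9]+", text.lower()))
    return terms


def document_frequencies(corpus):
    """Count, for each term, the number of documents whose links contain it.

    corpus maps a document id to the list of that document's link annotation strings.
    """
    counts = Counter()
    for link_texts in corpus.values():
        # A set is used so each term counts at most once per document.
        counts.update(link_terms(link_texts))
    return counts


def document_vector(link_texts, vocabulary):
    """Binary vector over the vocabulary: 1 if the term occurs in the links, else 0."""
    present = link_terms(link_texts)
    return [1 if term in present else 0 for term in vocabulary]
```

A sorted list of the keys of `document_frequencies(corpus)` can serve as the shared vocabulary, so that every document vector has the same dimensions and the per-term document counts line up with the vector positions.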
19. A computer program product having instruction codes for automatically and iteratively mining relevant terms comprising:
a first set of instruction codes for extracting hypertext links from a document, the hypertext links containing metadata terms c_{n,m};
a second set of instruction codes for creating a vector for the document, using the hypertext links;
a third set of instruction codes for measuring the number of documents that contain the metadata terms c_{n,m} in the hypertext links to perform a statistical analysis;
wherein the third set of instruction codes discovers association rules from the document vector based primarily on the hypertext links;
wherein the association rules comprise a support metric for an association rule (X|Y), where X and Y are sets of terms, and where a support p(X, Y) is defined as a joint probability of the frequency of co-occurrence of the sets of terms X and Y; and
wherein the association rules further comprise a hybrid metric H(s,c) that normalizes a support function n(s) and a confidence function n(c), and is expressed as follows:
- Dependent claims: 20, 21, 22, 23, 24
25. A system for automatically and iteratively mining relevant terms comprising:
means for extracting hypertext links from a document, the hypertext links containing metadata terms c_{n,m};
means for creating a vector for the document, using the hypertext links;
means for measuring the number of documents that contain the metadata terms c_{n,m} in the hypertext links to perform a statistical analysis;
wherein the means for measuring the number of documents that contain the metadata terms c_{n,m} in the hypertext links discovers association rules from the document vector based primarily on the hypertext links;
wherein the association rules comprise a support metric for an association rule (X|Y), where X and Y are sets of terms, and where a support p(X, Y) is defined as a joint probability of the frequency of co-occurrence of the sets of terms X and Y; and
wherein the association rules further comprise a hybrid metric H(s,c) that normalizes a support function n(s) and a confidence function n(c), and is expressed as follows:
- Dependent claims: 26, 27, 28, 29, 30
Specification