EXTRACTING STRUCTURED DATA FROM WEB FORUMS
Abstract
The web forum data extraction technique extracts structured data from web forums using both page-level information and site-level knowledge. To do this, the technique determines what kinds of page objects a forum site has, which object each page belongs to, and how the different page objects are connected to one another. This information can be obtained by reconstructing the sitemap of the target forum, which is based on a Data Object Model of the target forum. The technique collects three kinds of evidence for data extraction: 1) inner-page features, which cover both semantic and layout information on an individual page; 2) inter-vertex features, which describe linkage-related observations; and 3) inner-vertex features, which characterize interrelationships among pages in one vertex. The technique employs Markov Logic Networks to combine these types of evidence statistically for inference and can thereby extract the desired structures.
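The statistical combination of evidence can be illustrated with the log-linear form that Markov Logic Networks use, where a candidate labeling receives a probability proportional to exp(sum of weight times satisfied-formula count). The following is only a minimal sketch under stated assumptions: the weights, the label names, and the function `label_probabilities` are hypothetical, and the code is not the patent's formulas or a full MLN inference engine.

```python
import math

# Hypothetical weights, one per evidence type; a real system would learn them.
WEIGHTS = {
    "inner_page": 1.5,    # semantic/layout cues on a single page
    "inter_vertex": 1.0,  # linkage observations between sitemap vertices
    "inner_vertex": 0.5,  # agreement among pages inside one vertex
}

def label_probabilities(counts_by_label):
    """Turn {label: {evidence type: satisfied count}} into probabilities."""
    scores = {
        label: math.exp(sum(WEIGHTS[kind] * n for kind, n in counts.items()))
        for label, counts in counts_by_label.items()
    }
    total = sum(scores.values())
    return {label: score / total for label, score in scores.items()}

# Made-up evidence counts for two candidate labels of one page element.
candidates = {
    "post_record": {"inner_page": 2, "inter_vertex": 1, "inner_vertex": 1},
    "navigation_bar": {"inner_page": 0, "inter_vertex": 1, "inner_vertex": 0},
}
probs = label_probabilities(candidates)
best = max(probs, key=probs.get)
```

Because the three evidence types enter a single weighted sum, strong site-level evidence can offset weak page-level evidence and vice versa, which is the point of inferring over them jointly.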
20 Claims
1. A computer-implemented process for extracting structured data from web forums, comprising:
training a model for predicting the probability of given data structures existing in a web forum by using training web forum sites, and an associated set of features and a web forum sitemap for each of the training web forum sites;
inputting a set of one or more target web forum sites and associated target web forum sitemaps;
extracting features from the one or more input target web forum sites using the associated target web forum sitemaps; and
using the trained model and the extracted features from the one or more input web forum sites to extract data from the one or more input target web forum sites.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11
12. A system for extracting data from web forums, comprising:
a general purpose computing device;
a computer program comprising program modules executable by the general purpose computing device, wherein the computing device is directed by the program modules of the computer program to,
input at least one web forum site for which data is to be extracted;
perform sitemap recovery to recover the web forum sitemap structure of the at least one web forum site;
perform feature extraction using the web forum sitemap structure to extract features from the at least one web forum site; and
input the extracted features into a joint inference model to obtain the location of the data to be extracted.
Dependent claims: 13, 14
15. A computer-implemented process for extracting data from web forums, comprising:
recovering the sitemap of a target web forum site;
extracting features of an input target web forum site using the recovered sitemap;
inputting the extracted features of the target web forum site to a joint inference model that employs Markov Logic Networks to predict the likelihood of given data structures existing in pages of the input target forum site;
using the joint inference model to predict the likelihood of given data structures existing in pages of the input target web forum site; and
extracting the predicted data structures from the input target web forum site.
Dependent claims: 16, 17, 18, 19, 20
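The sitemap recovery step recited in claims 12 and 15 can be pictured as grouping crawled pages into vertices and counting the links between those groups. The claims do not say how the grouping is done, so the sketch below substitutes a simple stand-in: pages whose URLs differ only in digit runs are assumed to belong to the same vertex. The function names `url_pattern` and `recover_sitemap` and the example URLs are hypothetical.

```python
import re
from collections import defaultdict
from urllib.parse import urlparse

def url_pattern(url):
    """Collapse digit runs so pages from one template share a key (an assumption)."""
    parsed = urlparse(url)
    key = parsed.path + ("?" + parsed.query if parsed.query else "")
    return re.sub(r"\d+", "N", key)

def recover_sitemap(links):
    """Build vertices and edges from (source_url, target_url) crawl pairs."""
    vertices = defaultdict(set)  # pattern -> set of page URLs
    edges = defaultdict(int)     # (source pattern, target pattern) -> link count
    for src, dst in links:
        s, d = url_pattern(src), url_pattern(dst)
        vertices[s].add(src)
        vertices[d].add(dst)
        edges[(s, d)] += 1
    return vertices, edges

# Made-up crawl of a forum: board pages linking to thread pages.
links = [
    ("http://forum.example/board.php?id=1", "http://forum.example/thread.php?id=10"),
    ("http://forum.example/board.php?id=1", "http://forum.example/thread.php?id=11"),
    ("http://forum.example/board.php?id=2", "http://forum.example/thread.php?id=12"),
]
vertices, edges = recover_sitemap(links)
```

In this toy crawl the two board pages fall into one vertex and the three thread pages into another, joined by a single edge. Link counts on such edges are one plausible source for the inter-vertex features, and comparisons among the pages inside a vertex for the inner-vertex features.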
Specification