USING STRUCTURED DATABASE FOR WEBPAGE INFORMATION EXTRACTION
Abstract
A structured database is used for webpage information extraction, in particular to obtain training data from webpages for training a statistical model. The structured database has a plurality of entries, wherein each entry comprises a plurality of fields. One of the fields comprises a URL (uniform resource locator), while another field comprises information at least similar to other information to be located in a webpage associated with the URL. For at least some of the entries in the structured database, a webpage associated with the URL is retrieved. The webpage is analyzed and, if information similar to the information in the structured database is found in the webpage, the webpage is identified as suitable to be considered as a training sample.
20 Claims
1. A computer-implemented method of obtaining webpage training samples, the method comprising:
- accessing a structured database having a plurality of entries, wherein each entry comprises a plurality of fields, one of the fields comprising a URL (uniform resource locator) and another one of the fields comprising first information at least similar to second information to be located in a webpage associated with the URL; and
- for each of the plurality of entries in the structured database, retrieving a webpage associated with the URL; and
- analyzing the webpage to find the second information therein corresponding to the first information in the structured database, and if the second information is found in the webpage, storing information indicative of the webpage as a training sample.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11
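The steps of claim 1 can be illustrated with a minimal sketch. This is not the patented implementation: the function names, the field names `url` and `address`, the injected `fetch` callable, and the whitespace-insensitive substring test used to decide that the second information "is found" are all assumptions chosen for illustration.

```python
import re
from typing import Callable, Dict, List, Optional


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return re.sub(r"\s+", " ", text).strip().lower()


def collect_training_samples(
    entries: List[Dict[str, str]],
    fetch: Callable[[str], Optional[str]],
    url_field: str = "url",       # hypothetical field name
    info_field: str = "address",  # hypothetical field name
) -> List[Dict[str, str]]:
    """Return a record for every entry whose webpage contains the entry's stored information."""
    samples = []
    for entry in entries:
        url = entry.get(url_field)
        first_info = entry.get(info_field)
        if not url or not first_info:
            continue  # nothing to retrieve or nothing to look for
        page_text = fetch(url)  # assumed to return the page text, or None on failure
        if page_text is None:
            continue
        # Assumed matching rule: the stored value appears verbatim after normalization.
        if normalize(first_info) in normalize(page_text):
            samples.append({"url": url, "field": info_field, "value": first_info})
    return samples
```

Passing `fetch` as a parameter keeps the sketch independent of any particular HTTP library; in practice it could wrap whatever retrieval mechanism is available, and the substring test could be replaced by a looser similarity measure.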
12. A computer-implemented method of obtaining webpage training samples, the method comprising:
- accessing a structured database having a plurality of entries, wherein each entry comprises a plurality of fields, one of the fields comprising a URL (uniform resource locator) and another one of the fields comprising first information at least similar to second information to be located in a webpage associated with the URL; and
- for each of the plurality of entries in the structured database, retrieving a webpage associated with the URL; and
- analyzing the webpage to obtain an indication of the similarity of the second information therein with the first information in the structured database, and if the indication indicates substantial correspondence, analyzing the webpage so as to obtain values of markup language related features pertaining to the second information.
Dependent claims: 13, 14, 15, 16, 17
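One way to picture the similarity indication and the markup language related features of claim 12 is the sketch below. It is only an illustration under stated assumptions: the claim does not say how similarity is measured or which features are collected, so the use of `difflib.SequenceMatcher`, the default threshold of 0.8, and the chosen features (enclosing tag, nesting depth, tag path, `class` attribute) are hypothetical choices.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser

VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}  # tags that never get a closing tag


class TextNodeCollector(HTMLParser):
    """Collects each text node together with the stack of tags that encloses it."""

    def __init__(self):
        super().__init__()
        self.stack = []  # currently open (tag, attributes) pairs
        self.nodes = []  # (text, tag path, attributes of the innermost tag)

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append((tag, dict(attrs)))

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag, tolerating sloppy HTML.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = " ".join(data.split())
        if text:
            path = [t for t, _ in self.stack]
            attrs = self.stack[-1][1] if self.stack else {}
            self.nodes.append((text, path, attrs))


def markup_features(html, first_info, threshold=0.8):
    """Return markup features of the best-matching text node, or None if no node is similar enough."""
    collector = TextNodeCollector()
    collector.feed(html)
    collector.close()
    best = None
    for text, path, attrs in collector.nodes:
        score = SequenceMatcher(None, first_info.lower(), text.lower()).ratio()
        if best is None or score > best[0]:
            best = (score, text, path, attrs)
    if best is None or best[0] < threshold:
        return None  # no substantial correspondence, so no features are extracted
    score, text, path, attrs = best
    return {
        "similarity": score,
        "matched_text": text,
        "tag": path[-1] if path else None,
        "depth": len(path),
        "path": "/".join(path),
        "class": attrs.get("class"),
    }
```

Feature values of this kind could then serve as inputs when training a statistical extraction model, which is the purpose the abstract describes.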
18. A system for obtaining webpage training samples, the system comprising:
- a structured database having a first plurality of entries and a second plurality of entries, wherein each entry of the first plurality of entries and the second plurality of entries comprises a plurality of fields, one of the fields comprising a URL (uniform resource locator) and another one of the fields in the first plurality of entries comprises first information at least similar to second information to be located in a webpage associated with the URL, and wherein said another one of the fields in the second plurality of entries lacks information;
- a webpage processing module configured to operate with the structured database and access the Internet, the webpage processing module configured to retrieve a webpage associated with the URL for each entry of only the first plurality of entries in the database and not the second plurality of entries, and configured to obtain a score for each webpage retrieved and rank the webpages based on the score.
Dependent claims: 19, 20
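The behavior of the webpage processing module in claim 18 can be sketched as follows. The claim does not define the score, so the token-overlap score used here is an assumption, as are the function names, the field names, and the injected `fetch` callable. The sketch only shows the overall shape: entries whose information field is empty are never retrieved, and the retrieved pages are ranked by score.

```python
import re
from typing import Callable, Dict, List, Optional, Tuple


def token_overlap(info: str, page_text: str) -> float:
    """Assumed score: fraction of the stored value's tokens that also occur in the page."""
    tokens = re.findall(r"\w+", info.lower())
    if not tokens:
        return 0.0
    page_tokens = set(re.findall(r"\w+", page_text.lower()))
    return sum(1 for t in tokens if t in page_tokens) / len(tokens)


def rank_pages(
    entries: List[Dict[str, str]],
    fetch: Callable[[str], Optional[str]],
    url_field: str = "url",       # hypothetical field name
    info_field: str = "address",  # hypothetical field name
) -> List[Tuple[str, float]]:
    """Score pages for entries that carry information and return (url, score) pairs, best first."""
    scored = []
    for entry in entries:
        info = entry.get(info_field)
        if not info:
            continue  # "second plurality": the field lacks information, so no retrieval
        url = entry[url_field]
        page_text = fetch(url)  # assumed to return the page text, or None on failure
        if page_text is None:
            continue
        scored.append((url, token_overlap(info, page_text)))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored
```

Ranking by score makes it possible to keep only the highest-scoring pages as training samples, although the cutoff itself is not specified in the claim.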
Specification