Automatic Extraction of Structured Web Content
First Claim
1. In a computing environment, a method performed on at least one processor comprising, extracting structured information from sets of URLs, including using search trail data to determine a wrapper for extracting data items from web pages corresponding to each set, and determining relevant data items from the data items extracted from the web pages.
Abstract
Described is extracting structured information from web pages for use in directly answering queries with data items from the structured data. Users' post-search browsing behaviors (search trails) are treated as implicit labels as to the relevance between web content and user queries, and are used to determine wrappers for extracting structured information. In one implementation, a system identifies websites from web search logs, builds wrappers from users' search trails, filters out bad wrappers (from inconsistent user clicks), and combines structured information from different web sites, e.g., for each query.
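The wrapper-building, filtering, and combining steps named in the abstract can be illustrated with a minimal sketch. It is not the patented implementation. It assumes a wrapper can be stood in for by a single DOM path string, that each search trail record carries hypothetical `site` and `clicked_path` fields, and that a "bad" wrapper is one whose most-clicked path falls below a hypothetical `min_agreement` share of that site's clicks. The `extracted` mapping stands in for pages that have already been parsed.

```python
from collections import Counter, defaultdict


def build_wrappers(trails, min_agreement=0.6):
    """Pick, per site, the DOM path users clicked most; drop inconsistent sites.

    `trails` is a list of dicts with hypothetical keys "site" and
    "clicked_path". `min_agreement` is an assumed threshold, not a value
    taken from the source.
    """
    clicks_by_site = defaultdict(Counter)
    for trail in trails:
        clicks_by_site[trail["site"]][trail["clicked_path"]] += 1

    wrappers = {}
    for site, counts in clicks_by_site.items():
        path, votes = counts.most_common(1)[0]
        # Filter out wrappers built from inconsistent user clicks.
        if votes / sum(counts.values()) >= min_agreement:
            wrappers[site] = path
    return wrappers


def combine_items(wrappers, extracted):
    """Merge items extracted by each site's wrapper, without duplicates."""
    merged = []
    for site in sorted(wrappers):
        for item in extracted.get(site, {}).get(wrappers[site], []):
            if item not in merged:
                merged.append(item)
    return merged


# Made-up search trails for a single query.
trails = (
    [{"site": "a.example", "clicked_path": "/html/body/ul/li/a"}] * 3
    + [{"site": "a.example", "clicked_path": "/html/body/div/a"}]
    + [{"site": "b.example", "clicked_path": "/html/body/p/a"}]
    + [{"site": "b.example", "clicked_path": "/html/body/span/a"}]
    + [{"site": "c.example", "clicked_path": "/html/body/table/tr/td"}] * 2
)

# Made-up, pre-parsed page content: site -> DOM path -> data items.
extracted = {
    "a.example": {
        "/html/body/ul/li/a": ["Song A", "Song B"],
        "/html/body/div/a": ["Ad"],
    },
    "b.example": {"/html/body/p/a": ["Noise"]},
    "c.example": {"/html/body/table/tr/td": ["Song B", "Song C"]},
}

wrappers = build_wrappers(trails)
items = combine_items(wrappers, extracted)
```

In this toy data, `b.example` is dropped because its clicks split evenly between two paths, while the items from `a.example` and `c.example` are merged into one list for the query.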
20 Claims
- 1. In a computing environment, a method performed on at least one processor comprising, extracting structured information from sets of URLs, including using search trail data to determine a wrapper for extracting data items from web pages corresponding to each set, and determining relevant data items from the data items extracted from the web pages.
- 15. In a computing environment, a system comprising, a URL pattern summarizer that determines patterns of URLs among URLs clicked for named entity queries and provides sets of uniformly formatted URLs based upon the patterns, and an information extractor that consumes the sets of uniformly formatted URLs and search trail data to determine one or more wrappers for each set, and extracts structured information from web pages in that set.
- 18. One or more computer-readable media having computer-executable instructions, which when executed perform steps, comprising: summarizing patterns of URLs to provide sets of uniformly formatted URLs, each set associated with a named entity; and for each set: (a) using search trail data to determine wrappers for extracting data items from the URL pages corresponding to that set; and (b) selecting a wrapper for extracting structured data corresponding to the named entity associated with that set. (Dependent claims: 19, 20.)
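The URL pattern summarization recited in claims 15 and 18 can be pictured with a small sketch. The heuristic here is an assumption made for illustration, not the method the claims define: each URL is reduced to its host plus path with the last path segment replaced by `*`, and only patterns shared by at least a hypothetical `min_size` URLs are kept as a set of uniformly formatted URLs. The function names and example URLs are made up, and the association of each set with a named entity is left out.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def url_pattern(url):
    """Reduce a URL to host + path, with the last path segment wildcarded."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    if segments:
        segments[-1] = "*"  # assumed: the final segment is the varying part
    return parts.netloc + "/" + "/".join(segments)


def summarize(urls, min_size=2):
    """Group URLs by pattern; keep only patterns with enough members."""
    groups = defaultdict(list)
    for url in urls:
        groups[url_pattern(url)].append(url)
    return {p: members for p, members in groups.items() if len(members) >= min_size}


# Made-up URLs clicked for named entity queries.
clicked = [
    "http://movies.example/artist/123",
    "http://movies.example/artist/456",
    "http://movies.example/about",
]
url_sets = summarize(clicked)
```

Here the two `artist` pages fall under one pattern and form a set, while the lone `about` page is discarded because no other URL shares its pattern.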
Specification