Automatic Extraction of Structured Web Content

US 20110307479A1
Filed: 06/10/2010
Published: 12/15/2011
Est. Priority Date: 06/10/2010
Status: Abandoned Application

Abstract

Described is extracting structured information from web pages for use in directly answering queries with data items from the structured data. Users' post-search browsing behaviors (search trails) are treated as implicit labels as to the relevance between web content and user queries, and are used to determine wrappers for extracting structured information. In one implementation, a system identifies websites from web search logs, builds wrappers from users' search trails, filters out bad wrappers (from inconsistent user clicks), and combines structured information from different web sites, e.g., for each query.

20 Claims

1. In a computing environment, a method performed on at least one processor comprising, extracting structured information from sets of URLs, including using search trail data to determine a wrapper for extracting data items from web pages corresponding to each set, and determining relevant data items from the data items extracted from the web pages.
  - 2. The method of claim 1 wherein using the search trail data to determine a wrapper comprises processing pages to generate a set of candidate wrappers, and to determine an entity name for each page.
  - 3. The method of claim 2 further comprising, selecting a wrapper from among the candidate wrappers, including applying each candidate wrapper to the pages to obtain one or more strings extracted by that candidate wrapper, and selecting the wrapper based on the entity name inferred from the queries and clicks of web search users versus each string extracted by that candidate wrapper.
  - 4. The method of claim 2 further comprising, removing candidate wrappers having low coverage from the set of candidate wrappers.
  - 5. The method of claim 2 further comprising, removing candidate wrappers having low uniqueness from the set of candidate wrappers.
  - 6. The method of claim 1 further comprising, summarizing patterns of URLs to provide the sets of URLs as uniformly formatted URLs, including inputting name entities of different categories, and processing a query log that indicates user clicks on URLs returned in a search page, to find common patterns.
  - 7. The method of claim 6 wherein summarizing the patterns comprises comparing a pattern against patterns in a pattern set, generalizing a generalized pattern corresponding to an existing pattern in the pattern set, and adding the generalized pattern into the pattern set.
  - 8. The method of claim 6 wherein summarizing the patterns comprises performing a comparison of a pattern against a pattern set, and adding the pattern into the pattern set based upon a result of the comparison.
  - 9. The method of claim 1 wherein determining the relevant data items from the data items extracted from the web pages comprises using a graph regularization-based approach to identify the relevant data items, including representing each item as a node in the graph, and adding an edge between each pair of data items that are extracted from parts of pages having a common format.
  - 10. The method of claim 9 further comprising, assigning scores to the nodes, each score indicating a likelihood of relevance for that node's associated data item.
  - 11. The method of claim 9 further comprising, processing the graph to determine whether a wrapper provides relevant or irrelevant items.
  - 12. The method of claim 1 further comprising, accessing the structured data to provide a more directed search result in response to a query.
  - 13. The method of claim 12 further comprising ranking search results based upon predicted relevance of data items determined from one or more search and browsing logs.
  - 14. The method of claim 1 further comprising propagating semantics among uniformly formatted web pages in a website.
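
Claims 2 to 5 describe generating candidate wrappers, pruning those with low coverage or low uniqueness, and selecting a wrapper by comparing its extracted strings with the entity name inferred from queries and clicks. The Python sketch below is one possible reading, not the applicants' implementation. The claims do not define coverage or uniqueness, so the definitions here are assumptions, as are the thresholds, the `Wrapper` callable type, the `field` helper, and the toy `key=value|key=value` page format.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical wrapper type: takes a page's text, returns one extracted string or None.
Wrapper = Callable[[str], Optional[str]]


def coverage(extracted: List[Optional[str]]) -> float:
    """Assumed definition: share of pages on which the wrapper extracts anything."""
    if not extracted:
        return 0.0
    return sum(1 for s in extracted if s) / len(extracted)


def uniqueness(extracted: List[Optional[str]]) -> float:
    """Assumed definition: share of distinct values among non-empty extractions."""
    values = [s for s in extracted if s]
    if not values:
        return 0.0
    return len(set(values)) / len(values)


def select_wrapper(
    pages: List[str],
    entity_names: List[str],
    candidates: Dict[str, Wrapper],
    min_coverage: float = 0.8,    # assumed threshold
    min_uniqueness: float = 0.8,  # assumed threshold
) -> Optional[str]:
    """Return the name of the candidate whose output best matches the entity names."""
    if not pages:
        return None
    best_name, best_score = None, -1.0
    for name, wrapper in candidates.items():
        extracted = [wrapper(page) for page in pages]
        # Drop candidates with low coverage or low uniqueness (claims 4 and 5).
        if coverage(extracted) < min_coverage or uniqueness(extracted) < min_uniqueness:
            continue
        # Compare each extracted string with the entity name for that page (claim 3).
        matches = sum(
            1
            for s, e in zip(extracted, entity_names)
            if s and s.strip().lower() == e.strip().lower()
        )
        score = matches / len(pages)
        if score > best_score:
            best_name, best_score = name, score
    return best_name


def field(key: str) -> Wrapper:
    """Toy wrapper factory for pages written as 'key=value|key=value'."""
    def extract(page: str) -> Optional[str]:
        pairs = dict(part.split("=", 1) for part in page.split("|"))
        return pairs.get(key)
    return extract


pages = [
    "title=Inception|site=MovieDB",
    "title=Avatar|site=MovieDB",
    "title=Up|site=MovieDB",
]
# Entity names as they might be inferred from the queries that led to each page.
entity_names = ["inception", "avatar", "up"]
candidates = {"title": field("title"), "site": field("site"), "year": field("year")}
chosen = select_wrapper(pages, entity_names, candidates)
```

In this toy data the `site` candidate extracts the same string from every page and so fails the uniqueness check, while `year` extracts nothing and fails the coverage check, leaving `title` as the only candidate that is scored against the entity names.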

15. In a computing environment, a system comprising, a URL pattern summarizer that determines patterns of URLs among URLs clicked for named entity queries and provides sets of uniformly formatted URLs based upon the patterns, and an information extractor that consumes the sets of uniformly formatted URLs and search trail data to determine one or more wrappers for each set, and extracts structured information from web pages in that set.
  - 16. The system of claim 15 further comprising an authority analyzer that determines relevant data items from the structured information extracted from the web pages, by processing data extracted from similarly or uniformly formatted parts in web pages.
  - 17. The system of claim 15 wherein the information extractor determines one or more wrappers for each set by processing the web pages to generate a set of candidate wrappers.
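
The URL pattern summarizer of claim 15, and the compare, generalize, and add steps of claims 6 to 8, could be sketched roughly as follows. This is an illustration under stated assumptions rather than the method in the specification: the pattern representation (host plus path segments), the `*` wildcard, the rule that two patterns merge only when they differ in at most one segment, and the `example.com` URLs are all hypothetical. The sketch also replaces the existing pattern with its generalization instead of keeping both.

```python
from typing import List, Optional, Tuple
from urllib.parse import urlsplit

# Assumed representation: (host, segment, segment, ...), where "*" is a wildcard.
Pattern = Tuple[str, ...]


def to_pattern(url: str) -> Pattern:
    """Split a clicked URL into its host and non-empty path segments."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    return (parts.netloc.lower(), *segments)


def generalize(a: Pattern, b: Pattern, max_diff: int = 1) -> Optional[Pattern]:
    """Merge two patterns of the same shape, or return None if they are too different."""
    if len(a) != len(b) or a[0] != b[0]:
        return None
    diff = sum(1 for x, y in zip(a, b) if x != y)
    if diff > max_diff:
        return None
    return tuple(x if x == y else "*" for x, y in zip(a, b))


def summarize(urls: List[str]) -> List[Pattern]:
    """Compare each URL's pattern with the pattern set and generalize or add it."""
    pattern_set: List[Pattern] = []
    for url in urls:
        pattern = to_pattern(url)
        for i, existing in enumerate(pattern_set):
            merged = generalize(existing, pattern)
            if merged is not None:
                pattern_set[i] = merged  # keep the generalized pattern
                break
        else:
            pattern_set.append(pattern)  # no compatible pattern: add as new
    return pattern_set


clicked_urls = [
    "http://www.example.com/title/tt001/",
    "http://www.example.com/title/tt002/",
    "http://www.example.com/name/nm001/",
    "http://www.example.com/about",
]
patterns = summarize(clicked_urls)
```

Each resulting pattern stands for one set of uniformly formatted URLs; URLs matching `("www.example.com", "title", "*")` would be handed to the information extractor as a single set.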

18. One or more computer-readable media having computer-executable instructions, which when executed perform steps, comprising:
- summarizing patterns of URLs to provide sets of uniformly formatted URLs, each set associated with a named entity; and
- for each set;
  (a) using search trail data to determine wrappers for extracting data items from the URL pages corresponding to that set; and
  (b) selecting a wrapper for extracting structured data corresponding to the named entity associated with that set.

  - 19. The one or more computer-readable media of claim 18 wherein selecting the wrapper includes determining relevance of data items in the structured data extracted by a wrapper.
  - 20. The one or more computer-readable media of claim 18 wherein determining the relevance of the data items comprises representing each data item as a node in a regularization graph, adding an edge between each pair of data items that are extracted from parts of pages having a common format, and assigning scores to the nodes, each score indicating a likelihood of relevance for that node's associated data item.
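
Claims 9, 10, and 20 describe a graph in which each data item is a node, items extracted from commonly formatted page parts share an edge, and each node receives a relevance score. The claims do not state the regularization objective, so the sketch below substitutes a generic label-propagation update (a blend of each node's prior score with the average of its neighbours) as an assumption. The `format_key` grouping, the `alpha` weight, the iteration count, and the sample priors are likewise hypothetical.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, List, Set, Tuple


def build_graph(items: List[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """items holds (item_id, format_key); link every pair sharing a format_key."""
    by_format: Dict[str, List[str]] = defaultdict(list)
    for item_id, format_key in items:
        by_format[format_key].append(item_id)
    graph: Dict[str, Set[str]] = {item_id: set() for item_id, _ in items}
    for group in by_format.values():
        for a, b in combinations(group, 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def propagate(
    graph: Dict[str, Set[str]],
    prior: Dict[str, float],
    alpha: float = 0.5,    # assumed weight given to neighbours
    iterations: int = 50,  # assumed fixed number of sweeps
) -> Dict[str, float]:
    """Smooth prior relevance scores over the graph (generic label propagation)."""
    scores = {node: prior.get(node, 0.0) for node in graph}
    for _ in range(iterations):
        updated = {}
        for node, neighbours in graph.items():
            if neighbours:
                avg = sum(scores[m] for m in neighbours) / len(neighbours)
            else:
                avg = scores[node]  # isolated node: nothing to borrow from
            updated[node] = (1 - alpha) * prior.get(node, 0.0) + alpha * avg
        scores = updated
    return scores


# Hypothetical items: three share format "fmtA", two share format "fmtB".
items = [("a1", "fmtA"), ("a2", "fmtA"), ("a3", "fmtA"), ("b1", "fmtB"), ("b2", "fmtB")]
# Hypothetical priors, e.g. items matched by user queries in search trails.
prior = {"a1": 1.0, "a2": 1.0}
graph = build_graph(items)
scores = propagate(graph, prior)
```

Here `a3` has no prior evidence of its own but ends up with a positive score because it shares a format with two items that do, whereas the `fmtB` items stay at zero. A wrapper whose items all score low could then be treated as providing irrelevant items, in the spirit of claim 11.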

Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Microsoft Corporation
Inventors: Yin, Xiaoxin; Tan, Wenzhao; Li, Xiao; Suzue, Yutaka; Apacible, Johnson T.; Tu, Yi-Chin
Application Number: US12/797,614
Publication Number: US 20110307479A1
US Class Current: 707/728
CPC Class Codes: G06F 16/9535 Search customisation based ...
