Computer method and apparatus for extracting data from web pages
Abstract
Computer method and apparatus for extracting information from a Web page is disclosed. The invention apparatus is formed of an extractor coupled to receive Web pages from a source. The extractor uses natural language processing to extract desired information from the Web page. A storage subsystem receives from the extractor the extracted desired information and stores the extracted desired information in a database. The invention method for extracting data from a Web page includes the computer implemented steps of (i) using natural language processing, finding possible formal names on a given Web page, (ii) using pattern matching, searching the given Web page for formal names not found by the natural language processing, and (iii) refining a combined set of the found formal names to produce a working set of people and organization names extracted from the given Web page. The refining includes determining aliases of respective people and organization names, so as to effectively reduce duplicate names.
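The alias determination mentioned at the end of the abstract can be illustrated with a minimal sketch. The abstract does not say how aliases are determined, so everything here is an assumption made for illustration: the suffix list `ORG_SUFFIXES`, the rule that a shorter name is an alias when its tokens appear in order inside a longer name, and the function names `normalize`, `is_alias`, and `resolve_aliases`.

```python
import re

# Hypothetical suffix list; the source does not enumerate any.
ORG_SUFFIXES = {"inc", "corp", "co", "ltd", "llc"}


def normalize(name: str) -> tuple[str, ...]:
    """Lowercase the name, drop punctuation and organization suffixes."""
    tokens = re.findall(r"[A-Za-z]+", name.lower())
    return tuple(t for t in tokens if t not in ORG_SUFFIXES)


def is_alias(short: tuple[str, ...], full: tuple[str, ...]) -> bool:
    """Assumed rule: `short` is an alias if its tokens occur in `full` in order."""
    remaining = iter(full)
    return bool(short) and all(tok in remaining for tok in short)


def resolve_aliases(names: list[str]) -> dict[str, list[str]]:
    """Map each canonical (longest) name to the shorter forms it absorbs."""
    canonical: dict[str, list[str]] = {}
    # Visit longer names first so they become the canonical entries.
    for name in sorted(names, key=lambda n: len(normalize(n)), reverse=True):
        key = normalize(name)
        for full in canonical:
            if is_alias(key, normalize(full)):
                if name != full:
                    canonical[full].append(name)
                break
        else:
            canonical[name] = []
    return canonical
```

A rule this simple will over-merge in practice (a bare surname attaches to the first matching full name), so it should be read only as an illustration of how collapsing aliases reduces duplicate names.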
48 Claims
1. A method for extracting data from a Web page comprising the computer-implemented steps of:
using natural language processing, finding possible formal names on a given Web page, the step of finding producing a first found set of formal names;
searching the given Web page for formal names not found by the natural language processing step of finding, said searching producing a second set of formal names; and
refining a combined set of formal names formed of the first found set and the second set, said refining producing a working set of people and organization names extracted from the given Web page.
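The three steps of claim 1 can be sketched as a small two-pass pipeline. This is not the patented implementation: the first pass below is a crude regular-expression stand-in for natural language processing (it only looks for capitalized words after a title such as "Dr."), and the title list, the organization suffix list, and the function names `first_pass`, `second_pass`, `refine`, and `extract_names` are all assumptions made for illustration.

```python
import re

CAP = r"[A-Z][a-z]+"  # one capitalized word


def first_pass(text: str) -> set[str]:
    """Stand-in for the NLP step: capitalized word runs after a title cue."""
    pattern = rf"\b(?:Mr|Ms|Dr)\.\s+({CAP}(?:\s+{CAP})*)"
    return set(re.findall(pattern, text))


def second_pass(text: str, already_found: set[str]) -> set[str]:
    """Pattern matching for names the first pass missed (organization suffixes)."""
    pattern = rf"\b({CAP}(?:\s+{CAP})*\s+(?:Inc|Corp|Ltd)\.?)"
    return set(re.findall(pattern, text)) - already_found


def refine(first: set[str], second: set[str]) -> list[str]:
    """Combine both sets, normalize spacing and trailing periods, deduplicate."""
    combined = first | second
    cleaned = {" ".join(name.split()).rstrip(".") for name in combined}
    return sorted(cleaned)


def extract_names(text: str) -> list[str]:
    """Run the two passes and return the refined working set."""
    first = first_pass(text)
    second = second_pass(text, first)
    return refine(first, second)
```

Keeping the two passes as separate functions mirrors the claim language, in which the second search is defined relative to what the first step did not find.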
15. A method for extracting information from a Web page document comprising the computer implemented steps of:
performing a lexical analysis on a given Web page document to identify elements of interest, the elements of interest producing formal names;
detecting a regular recurrence of a certain type of element, the detecting producing additional formal names;
resolving aliases of the produced formal names and additional formal names to form a working set of names of people and/or organizations named in the given Web page document.
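One possible reading of "detecting a regular recurrence of a certain type of element" in claim 15 is that a repeated markup element, such as consecutive list items, signals a list of names. The sketch below follows that reading using Python's standard `html.parser`. The interpretation itself, the `min_count` threshold, the `VOID_TAGS` set, and the names `RecurrenceDetector` and `recurring_items` are assumptions, not details taken from the claim.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Tags that never have a closing tag; skipped so they do not capture text.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class RecurrenceDetector(HTMLParser):
    """Collect the text found directly inside each kind of tag."""

    def __init__(self) -> None:
        super().__init__()
        self.stack: list[str] = []
        self.texts: dict[str, list[str]] = defaultdict(list)

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop back to the matching open tag, tolerating unclosed children.
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack:
            self.texts[self.stack[-1]].append(text)


def recurring_items(html: str, min_count: int = 3) -> dict[str, list[str]]:
    """Return text items for tags that recur at least `min_count` times."""
    parser = RecurrenceDetector()
    parser.feed(html)
    parser.close()
    return {tag: items for tag, items in parser.texts.items()
            if len(items) >= min_count}
```

The items returned for a recurring tag would be candidates for the "additional formal names" of the claim, still subject to the alias resolution step that follows.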
38. Computer apparatus for extracting information from a Web page comprising:
a source of Web pages of interest;
an extractor coupled to receive Web pages from the source, the extractor being computer implemented and using natural language processing to extract desired information from the Web pages; and
a storage subsystem coupled to the extractor for storing the extracted desired information in a data store.
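The three elements of claim 38 (source, extractor, storage subsystem) can be sketched as a small pipeline. The claim does not specify any of the concrete choices below: the in-memory page list standing in for the source, the toy regular-expression extractor standing in for natural language processing, the SQLite table layout, and the names `toy_extractor` and `run_pipeline` are all hypothetical.

```python
import re
import sqlite3
from typing import Callable, Iterable

Page = tuple[str, str]  # (url, page text); an assumed representation


def toy_extractor(text: str) -> list[str]:
    """Placeholder extractor: pairs of adjacent capitalized words."""
    return re.findall(r"\b[A-Z][a-z]+ [A-Z][a-z]+\b", text)


def run_pipeline(source: Iterable[Page],
                 extractor: Callable[[str], list[str]],
                 conn: sqlite3.Connection) -> int:
    """Feed pages from the source through the extractor into the data store.

    Returns the number of (url, name) rows held in the store afterwards.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS names "
        "(url TEXT, name TEXT, UNIQUE(url, name))"
    )
    for url, text in source:
        for name in extractor(text):
            # OR IGNORE skips a name already stored for the same page.
            conn.execute("INSERT OR IGNORE INTO names VALUES (?, ?)", (url, name))
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM names").fetchone()[0]
```

For example, `run_pipeline(pages, toy_extractor, sqlite3.connect(":memory:"))` runs the whole chain against an in-memory database. Passing the extractor in as a callable keeps the source and storage independent of how the extraction is done.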
Specification