Computer method and apparatus for extracting data from web pages
Abstract
Computer method and apparatus for extracting information from a Web page is disclosed. The invention apparatus is formed of an extractor coupled to receive Web pages from a source. The extractor uses natural language processing to extract desired information from the Web page. A storage subsystem receives from the extractor the extracted desired information and stores the extracted desired information in a database. The invention method for extracting data from a Web page includes the computer implemented steps of (i) using natural language processing, finding possible formal names on a given Web page, (ii) using pattern matching, searching the given Web page for formal names not found by the natural language processing, and (iii) refining a combined set of the found formal names to produce a working set of people and organization names extracted from the given Web page. The refining includes determining aliases of respective people and organization names, so as to effectively reduce duplicate names.
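The three steps listed in the abstract can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the patented implementation: the natural language processing step is replaced by a crude cue-word heuristic (honorifics and corporate suffixes), and every identifier (`find_names_by_cues`, `find_names_by_pattern`, `refine`, `extract_names`, `PAGE`), every regular expression, and the token-subset alias rule is hypothetical and introduced here for illustration.

```python
import re

# ASSUMPTION: stand-in for step (i). A real system would use a parser or
# named-entity recognizer; here a name is whatever follows an honorific,
# or a run of capitalized words ending in a corporate suffix.
PERSON_CUE = re.compile(r"\b(?:Mr|Ms|Mrs|Dr)\. ([A-Z][a-z]+(?: [A-Z][a-z]+)?)")
ORG_CUE = re.compile(r"\b((?:[A-Z][A-Za-z]+ )+(?:Inc|Corp|LLC|Ltd)\.?)")

# ASSUMPTION: step (ii) pattern, matching lines shaped like "First [M.] Last, Title".
LINE_PATTERN = re.compile(
    r"^([A-Z][a-z]+(?: [A-Z]\.)? [A-Z][a-z]+), [A-Z]", re.MULTILINE
)


def find_names_by_cues(page):
    """Step (i) stand-in: return the first found set of formal names."""
    return set(PERSON_CUE.findall(page)) | set(ORG_CUE.findall(page))


def find_names_by_pattern(page, already_found):
    """Step (ii): pattern-match names that step (i) did not find."""
    return set(LINE_PATTERN.findall(page)) - already_found


def refine(first, second):
    """Step (iii): merge both sets and fold aliases into the longest form.

    ASSUMPTION: a name is treated as an alias when its tokens are a proper
    subset of a longer name's tokens ("Doe" -> "Jane Q. Doe").
    """
    combined = sorted(first | second, key=lambda n: (-len(n.split()), n))
    working = {}  # canonical name -> set of aliases
    for name in combined:
        tokens = set(name.split())
        for canonical in working:
            if tokens < set(canonical.split()):
                working[canonical].add(name)
                break
        else:
            working[name] = set()
    return working


def extract_names(page):
    first = find_names_by_cues(page)
    second = find_names_by_pattern(page, first)
    return refine(first, second)


PAGE = (
    "Acme Widgets Inc. announced that Dr. Jane Doe will join the board.\n"
    "Jane Q. Doe, Chief Executive Officer\n"
    "John Smith, Vice President\n"
    "Dr. Doe previously worked at Globex Corp."
)

if __name__ == "__main__":
    for canonical, aliases in extract_names(PAGE).items():
        print(canonical, sorted(aliases))
```

On the sample page, "Jane Doe" and "Doe" fold into "Jane Q. Doe", so the working set holds four entries instead of six. The subset rule is deliberately naive: it would also merge two different people who share a surname, which a fuller alias resolver would need to guard against.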
192 Citations
20 Claims
1. A method for extracting data from a Web page comprising the computer-implemented steps of:
using natural language processing, finding possible formal names on a given Web page, the step of finding producing a first found set of formal names;
searching the given Web page for formal names not found by the natural language processing step of finding, said searching producing a second set of formal names; and
refining a combined set of formal names formed of the first found set and the second set, said refining producing a working set of people and organization names extracted from the given Web page.
2. A method as claimed in claim 1 wherein the step of refining includes rejecting predefined formal names as not being people names of interest.
3. A method as claimed in claim 1 wherein the step of refining includes determining aliases of respective people and organization names in the combined set, so as to reduce effective duplicate names.
4. A method as claimed in claim 1 wherein the step of finding further finds professional titles and determines organization for which a person named on the given Web page holds that title.
5. A method as claimed in claim 4 wherein the step of finding includes employing rules to extract at least title and formal names.
6. A method as claimed in claim 1 wherein the step of finding further includes determining educational background of a person named on the given Web page, the educational background including at least one of name of institution, degree earned from the institution and date of graduation from the institution.
7. A method as claimed in claim 1 wherein the step of finding further includes determining biographical information relating to a person named on the given Web page.
8. A method as claimed in claim 7 wherein the step of determining biographical information includes determining current and previous employment history of the named person.
9. A method as claimed in claim 1 further comprising the steps of:
determining type of the given Web page; and
from the determined type, defining contents of different portions of the Web page, such that the steps of finding and searching are performed as a function of the defined contents.
10. A method as claimed in claim 9 wherein the step of determining type of the given Web page includes determining structure or arrangement of contents of the Web page.
11. A method as claimed in claim 10 further comprising the step of using the determined type, deducing additional information regarding a named person or organization on the given Web page, the additional information supplementing information found on another Web page of a same Web site as the given Web page.
12. A method as claimed in claim 1 wherein the step of finding further includes determining at least one of addresses, telephone number, and email address relating to a person or organization named on the given Web page.
13. A method as claimed in claim 1 wherein the step of searching employs pattern matching.
14. A database having records formed by data extracted from Web pages by the method of claim 1.
15. A method for extracting information from a Web page document comprising the computer implemented steps of:
performing a lexical analysis on a given Web page document to identify elements of interest, the elements of interest producing formal names;
detecting a regular recurrence of a certain type of element, the detecting producing additional formal names;
resolving aliases of the produced formal names and additional formal names to form a working set of names of people and/or organizations named in the given Web page document.
16. A method as claimed in claim 15, further comprising the step of transforming the given Web page document into a standardized form, the step of transforming including identifying page structure of the Web page document.
17. A method as claimed in claim 15, further comprising the step of assigning a type to each line in the given Web page document, the step of assigning a type indicating purpose of each line in the given Web page document.
18. A method as claimed in claim 17 wherein the step of performing a lexical analysis further identifies elements of interest on lines of certain assigned types.
19. A method as claimed in claim 17 wherein the step of detecting includes using pattern matching, detecting a regular recurrence of a certain type of line, to produce additional formal names.
20. A method as claimed in claim 15 wherein the step of performing a lexical analysis includes syntactically and grammatically identifying elements of interest.
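The line-typing and recurrence-detection idea of claims 17 and 19 can be illustrated with a small Python sketch. It is a hedged illustration only, not the method as actually implemented: the line types ("blank", "title", "candidate", "prose"), the title keyword list, the requirement of at least three evenly spaced title lines, and the names `line_type`, `names_from_recurrence`, and `PAGE` are all assumptions made for the example.

```python
import re

# ASSUMPTION: a line containing one of these words is a professional title.
TITLE_WORDS = re.compile(r"\b(?:Chief|President|Officer|Director|Manager)\b")


def line_type(line):
    """Assign a coarse type indicating the apparent purpose of a line."""
    text = line.strip()
    if not text:
        return "blank"
    if TITLE_WORDS.search(text):
        return "title"
    words = text.split()
    # Short lines made only of capitalized words could be names or headings.
    if len(words) <= 4 and all(w[0].isupper() for w in words):
        return "candidate"
    return "prose"


def names_from_recurrence(page):
    """Infer names from a regularly recurring (candidate, title) line layout.

    ASSUMPTION: "regular recurrence" means at least three title lines, all
    spaced the same number of lines apart. The line directly above each
    title line is then taken to be a name if it is typed "candidate".
    """
    lines = page.splitlines()
    types = [line_type(line) for line in lines]
    title_rows = [i for i, t in enumerate(types) if t == "title"]
    gaps = {b - a for a, b in zip(title_rows, title_rows[1:])}
    if len(title_rows) < 3 or len(gaps) != 1:
        return []  # no regular pattern detected
    return [
        lines[i - 1].strip()
        for i in title_rows
        if i > 0 and types[i - 1] == "candidate"
    ]


PAGE = "\n".join([
    "Management Team",
    "",
    "Jane Q. Doe",
    "Chief Executive Officer",
    "",
    "Raj Patel",
    "Chief Technology Officer",
    "",
    "Mei-Ling O'Brien",
    "Vice President of Sales",
])

if __name__ == "__main__":
    print(names_from_recurrence(PAGE))
```

The point of relying on layout rather than on the text of each line is that a name such as "Mei-Ling O'Brien", which a simple name pattern might reject, is still recovered because it sits in the position where the page repeatedly places names. The heading "Management Team" is also typed "candidate" but is not returned, since no title line follows it.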
Specification