Method and system for identifying and matching companies to business event information
Abstract
The present invention provides a system, method and computer program product for identifying and matching company names to business event information. A crawler crawls and downloads documents by starting from a pre-defined set of links. A parser breaks down the downloaded documents into components such as text, titles and links. An evaluator evaluates the parsed documents and selects documents on the basis of the amount of relevant information contained in the documents. An information extractor identifies the occurrences of company names in the text contained in the selected documents. It also identifies occurrences of business events, specified by a pre-defined set of event phrases, in the text contained in the selected documents. Further, the information extractor matches the identified company names to the identified business events in order to generate company-business event pairs.
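The abstract does not say how the amount of relevant information is measured. As a rough illustration only, the sketch below assumes the information quantity score is the number of event-phrase hits per 100 words and that documents are kept when the score meets a threshold. The phrase list, the formula, the threshold and the function names are all hypothetical and are not taken from the patent.

```python
import re

# Hypothetical event phrases; the source only says the set is pre-defined.
EVENT_PHRASES = ["acquired", "merged with", "filed for bankruptcy"]


def information_quantity_score(text, phrases=EVENT_PHRASES):
    """Return event-phrase hits per 100 words (an assumed definition)."""
    words = re.findall(r"\w+", text.lower())
    if not words:
        return 0.0
    # Normalise whitespace and punctuation so multi-word phrases match.
    normalised = " ".join(words)
    hits = sum(
        len(re.findall(r"\b" + re.escape(phrase) + r"\b", normalised))
        for phrase in phrases
    )
    return 100.0 * hits / len(words)


def select_documents(docs, threshold=1.0):
    """Keep only documents whose score meets the (assumed) threshold."""
    return [doc for doc in docs if information_quantity_score(doc) >= threshold]
```

With these assumptions, a short text such as "Acme Inc. acquired Beta Corp." scores well above the threshold, while a text with no event phrases scores zero and is dropped.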
22 Claims
1. A system for identifying and matching company names and business events occurring in a document, the system comprising:
a. a crawler for downloading documents;
b. a parser for parsing the downloaded documents;
c. an evaluator for evaluating the parsed documents to select documents on the basis of an information quantity score, the information quantity score being a measure of amount of relevant information contained in each of the parsed documents; and
d. an information extractor for identifying and matching business events to company names, the business events and company names being present in the selected documents.
Dependent claims: 2, 3, 4, 5, 6, 7
8. A method for identifying and matching company names and business events, the method comprising the steps of:
a. crawling a first set of links on a network to download documents;
b. parsing the downloaded documents;
c. evaluating the parsed documents to select documents on the basis of an information quantity score, the information quantity score being a measure of amount of relevant information contained in the document; and
d. processing the selected documents to generate company-event pairs from information present in text contained in the document.
Dependent claims: 9, 10, 11, 12, 13, 14, 15
16. A computer program product comprising a computer usable medium having a computer readable program code embodied therein for identifying and matching company names and business events, the computer program code performing the steps of:
a. crawling a pre-defined first set of links to download documents referenced by the pre-defined first set of links;
b. parsing the downloaded documents;
c. evaluating the parsed documents to select documents on the basis of an information quantity score, the information quantity score being a measure of amount of relevant information contained in the document;
d. identifying company names and business events in the text contained in each of the selected documents; and
e. matching the identified business events to the identified company names for each of the selected documents.
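Step d of claim 16 can be illustrated with a small sketch. It assumes company names are found by looking for a suffix indicator (for example "Inc" or "Corp") and taking the run of capitalised words before it, and that business events are found by searching for a pre-defined phrase or one of its listed variations. The suffix list, the phrase table and both function names are hypothetical; the heuristic is deliberately simple and would mislabel cases such as "The Acme Inc.".

```python
import re

# Hypothetical suffix indicators and event phrases (not from the patent).
SUFFIXES = ["Inc", "Corp", "Ltd", "LLC"]
EVENT_PHRASES = {
    "acquisition": ["acquired", "acquires", "to acquire"],
    "merger": ["merged with", "merges with", "to merge with"],
}

# One or more capitalised words followed by a suffix indicator.
_NAME_RE = re.compile(
    r"((?:[A-Z][\w&-]*\s+)+)(" + "|".join(SUFFIXES) + r")\b\.?"
)


def find_company_names(text):
    """Return (name, character offset) for each suffix-based match."""
    return [
        (m.group(1).strip() + " " + m.group(2), m.start())
        for m in _NAME_RE.finditer(text)
    ]


def find_business_events(text):
    """Return (event label, character offset), sorted by offset."""
    found = []
    for event, variants in EVENT_PHRASES.items():
        # Search for the phrase itself and each listed variation.
        for phrase in [event] + variants:
            pattern = r"\b" + re.escape(phrase) + r"\b"
            for m in re.finditer(pattern, text, re.IGNORECASE):
                found.append((event, m.start()))
    return sorted(found, key=lambda item: item[1])
```

Returning character offsets alongside each name and event keeps the positions available for a later matching step.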
17. A system for identifying and matching company names and business events, the system comprising:
a. a crawler for downloading documents, the documents being referenced by links present in a pre-defined first set of links;
b. a parser for parsing the downloaded documents to break the downloaded documents into components including at least one of free text, title and a second set of links;
c. an evaluator for evaluating the parsed documents to select documents on the basis of an information quantity score, the information quantity score being a measure of amount of relevant information contained in the documents; and
d. an information extractor for identifying and matching business events to company names, the business events and company names being present in the text contained in the selected documents;
wherein the information extractor comprises:
i. a company name extractor for identifying company names in the text contained in the selected documents;
ii. a business event extractor for identifying business events in the text contained in the selected documents; and
iii. an entity-event matcher for matching the identified business events to the identified company names for each of the selected documents and computing a match score for each of the matches in each of the selected documents.
iv. a confidence rating generator for generating a confidence rating for each of the selected documents.
Dependent claims: 18, 19
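The parser of claim 17, element b, breaks a document into free text, a title and a second set of links. A minimal sketch of that split, assuming the downloaded documents are HTML, can be built on Python's standard `html.parser` module. The class name `DocumentParser` and the helper `parse_document` are hypothetical and are not named in the patent.

```python
from html.parser import HTMLParser


class DocumentParser(HTMLParser):
    """Collects the title, visible text and outgoing links of one page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self.text_parts = []
        self._in_title = False
        self._skip_depth = 0  # greater than zero inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif not self._skip_depth and data.strip():
            self.text_parts.append(data.strip())


def parse_document(html):
    """Return the three components as a plain dictionary."""
    parser = DocumentParser()
    parser.feed(html)
    parser.close()
    return {
        "title": parser.title.strip(),
        "text": " ".join(parser.text_parts),
        "links": parser.links,
    }
```

The collected links could feed a further crawl, and the free text is what an information extractor would scan.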
20. A method for identifying and matching company names and business events, the method comprising the steps of:
a. crawling a network to download documents referenced by a pre-defined first set of links;
b. parsing the downloaded documents to break down the downloaded documents into components, the components comprising at least one of free text, titles and a second set of links to other documents;
c. evaluating the parsed documents to select documents on the basis of an information quantity score, the information quantity score being a measure of amount of relevant information contained in the parsed document;
d. identifying the occurrences of business events in text contained in the selected documents;
wherein identifying the occurrences of business events in text contained in the selected documents involves:
i. identifying the business events in the text by locating phrases exactly as they occur in the pre-defined set of phrases; and
ii. identifying the business events by searching the text for variations of the phrases present in the pre-defined set of phrases; and
e. identifying occurrences of company names in text contained in the selected documents;
wherein identifying the occurrences of company names in text contained in the selected documents involves:
i. identifying the occurrences of company names in the text by searching for a set of company name suffix indicators in the text;
ii. applying a pre-defined set of heuristics to identify the company name preceding the identified company name suffix indicator; and
f. matching identified business events to identified company names to generate company-business event pairs;
wherein matching identified business events to identified company names to generate company-business event pairs involves:
i. determining a match between the identified business events and the identified company names for each of the selected documents;
ii. computing a match score for each of the matches in each of the selected documents, the score being based on a distance between the identified company name and the identified business event in the selected document.
iii. calculating a confidence rating for each of the selected documents, wherein the confidence rating is calculated on the basis of contribution from matches between business events and company names and contribution from orphan events within each of the selected documents, the orphan events being business events that are not associated with any company name in the selected document.
Dependent claims: 21, 22
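Step f of claim 20 ties the match score to the distance between a company name and a business event, and the confidence rating to both matches and orphan events, but the text here gives no formulas. The sketch below is one possible reading under loud assumptions: distance is measured in characters, the score is 1 / (1 + distance), an event with no company name within `max_distance` is an orphan, and the confidence rating is the sum of match scores divided by the number of matches plus orphans. All of these choices, and the function names, are hypothetical.

```python
def match_events(names, events, max_distance=50):
    """Pair each event with its nearest company name.

    names and events are lists of (label, character offset) tuples.
    Returns (matches, orphans), where each match is
    (company name, event label, match score).
    """
    matches, orphans = [], []
    for event, event_pos in events:
        best = None  # (name, distance) of the closest name so far
        for name, name_pos in names:
            distance = abs(event_pos - name_pos)
            if distance <= max_distance and (best is None or distance < best[1]):
                best = (name, distance)
        if best is None:
            orphans.append(event)  # no company name close enough
        else:
            # Assumed scoring rule: closer pairs score higher.
            matches.append((best[0], event, 1.0 / (1.0 + best[1])))
    return matches, orphans


def confidence_rating(matches, orphans):
    """Assumed rating: orphans count in the denominator and lower it."""
    total = len(matches) + len(orphans)
    if total == 0:
        return 0.0
    return sum(score for _, _, score in matches) / total
```

Under this reading a document whose events all sit next to a company name rates higher than one with several orphan events, which is the qualitative behaviour the claim describes.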
Specification