Systems and methods for enhancing web-based searching
First Claim
1. An information gathering system implemented in a computer system for optimizing searching comprising:
a processor and memory;
a data extraction tool executing in the computer system, in communication with a database, extracting website content to enable full text searching, the website content being extracted from a plurality of websites associated with business entities that are classified according to a standard industry classification system (SIC), which is a predefined taxonomy of business activities having verified information about the business entities;
the database, in communication with the data extraction tool, storing the extracted website content according to a classification system that is based on the predefined taxonomy of SIC business activities;
a content analyzer, in communication with the database, identifying commonly occurring keywords in the extracted website content from the websites of business entities that are similarly classified in the SIC predefined taxonomy of SIC business activities, where the commonly occurring keywords identified are used to update the classification system, the updated classification system being used to optimize searching in response to search queries;
the content analyzer identifying commonly occurring keywords that are used to create a new category to update the classification system by;
identifying keyword matches in the extracted website content by identifying any commonly occurring keywords or phrases in the extracted website content; and
processing the matches identified by determining whether any of the keywords or phrases in the identified matches contain one or more keywords associated with any of the business activities in the SIC predefined taxonomy; and
a full text indexed search engine, in communication with the database, processing a search query by matching the search query against the database, where at least a portion of the search results are clustered based on their respective SIC business activity category.
Abstract
A system for enhancing web-based searching is provided. Categorizing and clustering techniques are used to optimize searching. Businesses are classified using a control group of predetermined categories. The predetermined categories may be SIC codes or headings that are used to describe business activities. The website address for a business listed in the control group is determined, and the content of the business's website is extracted. The extracted content is associated with the predetermined category that the business is classified under. The extracted content is used to further enhance the overall classification scheme. The system may compare and match the extracted content with the content of other businesses' websites that are similarly categorized. If a relevant keyword match is identified in several of the websites, the keyword may be used to update the classification scheme. A new category or sub-category can be created based on this keyword. Furthermore, when a search is performed, the search results are organized by these categories, and using various processes, the most common results are kept and the less relevant results are discarded.
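The keyword-matching step described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the sample businesses, the stopword list, the `min_share` threshold, and the function names are all assumptions made for illustration. It groups extracted website text by SIC code and reports the words that occur on a given share of the sites within each category.

```python
import re
from collections import Counter, defaultdict

# Hypothetical control group: (business name, SIC code, extracted website text).
SITES = [
    ("Acme Plumbing", "1711", "Emergency plumbing and drain cleaning, water heater repair"),
    ("Best Pipes", "1711", "Drain cleaning specialists; water heater installation"),
    ("City Plumbers", "1711", "Licensed plumbers for drain cleaning and leak detection"),
    ("Joe's Diner", "5812", "Breakfast, burgers and fresh coffee served all day"),
]

# Assumed stopword list; a real system would use a much larger one.
STOPWORDS = {"and", "for", "the", "all"}


def tokenize(text):
    """Return the set of distinct lowercase words in text, minus stopwords."""
    words = re.findall(r"[a-z]+", text.lower())
    return {w for w in words if w not in STOPWORDS and len(w) > 2}


def common_keywords(sites, min_share=0.6):
    """Map each SIC code to the words found on at least min_share of its sites."""
    by_category = defaultdict(list)
    for _name, sic, text in sites:
        by_category[sic].append(tokenize(text))

    result = {}
    for sic, docs in by_category.items():
        if len(docs) < 2:
            continue  # a single site cannot show that a keyword is "common"
        counts = Counter(word for doc in docs for word in doc)
        result[sic] = sorted(w for w, n in counts.items() if n / len(docs) >= min_share)
    return result


print(common_keywords(SITES))
```

Each word that clears the threshold is only a candidate for a new category or sub-category; deciding which candidates are relevant is left open here, as it is in the abstract.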
53 Claims
1. An information gathering system implemented in a computer system for optimizing searching comprising:
a processor and memory;
a data extraction tool executing in the computer system, in communication with a database, extracting website content to enable full text searching, the website content being extracted from a plurality of websites associated with business entities that are classified according to a standard industry classification system (SIC), which is a predefined taxonomy of business activities having verified information about the business entities;
the database, in communication with the data extraction tool, storing the extracted website content according to a classification system that is based on the predefined taxonomy of SIC business activities;
a content analyzer, in communication with the database, identifying commonly occurring keywords in the extracted website content from the websites of business entities that are similarly classified in the SIC predefined taxonomy of SIC business activities, where the commonly occurring keywords identified are used to update the classification system, the updated classification system being used to optimize searching in response to search queries;
the content analyzer identifying commonly occurring keywords that are used to create a new category to update the classification system by;
identifying keyword matches in the extracted website content by identifying any commonly occurring keywords or phrases in the extracted website content; and
processing the matches identified by determining whether any of the keywords or phrases in the identified matches contain one or more keywords associated with any of the business activities in the SIC predefined taxonomy; and
a full text indexed search engine, in communication with the database, processing a search query by matching the search query against the database, where at least a portion of the search results are clustered based on their respective SIC business activity category.
- View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30)
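The final element of claim 1, a full text indexed search whose results are clustered by SIC business activity category, can be sketched as follows. This is an illustrative toy rather than the claimed system: the `Index` class, its methods, the example URLs, and the AND-style matching of query terms are all assumptions.

```python
import re
from collections import defaultdict


def tokens(text):
    """Split text into lowercase alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


class Index:
    """Toy inverted index that remembers each document's SIC code."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> ids of documents containing it
        self.docs = {}                    # document id -> (url, SIC code)

    def add(self, doc_id, url, sic, text):
        self.docs[doc_id] = (url, sic)
        for term in tokens(text):
            self.postings[term].add(doc_id)

    def search(self, query):
        """Return {SIC code: [urls]} for documents containing every query term."""
        terms = tokens(query)
        if not terms:
            return {}
        hits = set.intersection(*(self.postings.get(t, set()) for t in terms))
        clusters = defaultdict(list)
        for doc_id in sorted(hits):
            url, sic = self.docs[doc_id]
            clusters[sic].append(url)
        return dict(clusters)


# Hypothetical extracted content, keyed by an arbitrary document id.
index = Index()
index.add(1, "a.example", "1711", "water heater repair and drain cleaning")
index.add(2, "b.example", "1711", "drain cleaning specialists")
index.add(3, "c.example", "7538", "brake and engine repair")

print(index.search("repair"))
print(index.search("drain cleaning"))
```

A query such as "repair" matches sites in two categories, so the caller receives two clusters and can present them separately instead of as one flat list.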
31. A method of optimizing searching in a data processing system comprising the steps of:
identifying websites of business entities that are associated with a standard industry classification system (SIC) predefined taxonomy of business activities having verified information of the business entities;
extracting website content from the websites of business entities that are associated with the SIC predefined taxonomy of SIC business activities to enable full text searching;
processing the extracted content to identify commonly occurring keywords in the extracted content from websites of business entities that have been similarly classified in the SIC predefined taxonomy of SIC business activities, where one or more of the commonly occurring keywords are used to update a classification system, where the classification system is based on the SIC predefined taxonomy of SIC business activities;
creating a new category to update the classification system by;
identifying keyword matches in the extracted website content by identifying any commonly occurring keywords or phrases in the extracted website content; and
processing the matches identified by determining whether any of the keywords or phrases in the identified matches contain one or more keywords associated with any of the business activities in the SIC predefined taxonomy;
processing a search query using full text index searching by matching the search query against at least a portion of the extracted content, where at least a portion of the search results are clustered according to their respective SIC business activity category.
- View Dependent Claims (32, 33, 34, 35, 36, 37, 38, 39, 40, 41)
42. A computer implemented system for optimizing searching comprising:
a processor and memory;
means for identifying websites of business entities that are associated with a standard industry classification system (SIC) predefined taxonomy of business activities having verified information of the business entities;
means for extracting website content from the websites of business entities associated with the SIC predefined taxonomy of SIC business activities to enable full text searching;
means for processing the extracted content to identify commonly occurring words in the extracted content from websites of business entities that have been similarly classified by the SIC predefined taxonomy of SIC business activities, where one or more of the commonly occurring keywords are used to update the classification system, where the classification scheme is based on the SIC predefined taxonomy of SIC business activities;
means for creating a new category to update the classification system by;
means for identifying keyword matches in the extracted website content by identifying any commonly occurring keywords or phrases in the extracted website content; and
means for processing the matches identified by determining whether any of the keywords or phrases in the identified matches contain one or more keywords associated with any of the business activities in the SIC predefined taxonomy;
means for processing a search query using a full text index searching by matching the search query against at least a portion of the extracted content, where at least a portion of the search results are clustered according to their respective SIC business activity category.
43. A method of optimizing searching in a data processing system comprising the steps of:
obtaining website content of business entities that are included in a control group, where the control group defines a classification system that is based on a standard industry classification system (SIC) predefined taxonomy of SIC business activities having verified information about the business entities, the classification system being used to classify and store the website content of the business entities;
searching for unclassified business entities using at least a portion of the website content of business entities in the control group;
classifying the unclassified business entities based on the portion of website content of business entities in the control group;
identifying commonly occurring keywords in website content of business entities that are similarly classified in the classification system of the control group;
allowing the control group to grow if a substantial amount of commonly occurring keywords have been identified; and
creating a new business entities category to update the classification system of the control group allowing the control group to grow by;
identifying keyword matches in the stored website content by identifying any commonly occurring keywords or phrases in the website content; and
processing the matches identified by determining whether any of the keywords or phrases in the identified matches contain one or more keywords associated with any of the SIC business activities in the SIC predefined taxonomy; and
using the control group to optimize full text index searching of the website content by organizing search results according to categories in the classification system of the control group.
- View Dependent Claims (44, 45, 46, 47, 48, 49, 50)
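The classifying step of claim 43, in which an unclassified business entity is assigned a category based on website content from the control group, might look like the sketch below. The claim does not specify a scoring method; the simple keyword-overlap score, the `min_overlap` threshold, and the sample keyword sets are assumptions chosen only to make the idea concrete.

```python
import re

# Hypothetical keywords already identified for control-group categories.
CATEGORY_KEYWORDS = {
    "1711": {"drain", "cleaning", "heater", "water"},
    "5812": {"breakfast", "coffee", "burgers"},
}


def classify(text, category_keywords, min_overlap=2):
    """Return the SIC code whose keywords overlap the text most, or None.

    A category is assigned only when at least min_overlap of its keywords
    appear in the text; ties go to the lowest SIC code.
    """
    words = set(re.findall(r"[a-z]+", text.lower()))
    best_sic, best_score = None, 0
    for sic, keywords in sorted(category_keywords.items()):
        score = len(words & keywords)
        if score > best_score:
            best_sic, best_score = sic, score
    return best_sic if best_score >= min_overlap else None


print(classify("We offer drain cleaning and water heater service", CATEGORY_KEYWORDS))
print(classify("Tax preparation and bookkeeping", CATEGORY_KEYWORDS))
```

Returning `None` below the threshold keeps weakly matching sites out of the control group, which is one plausible reading of letting the group grow only when a substantial number of common keywords is found.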
51. A method of optimizing searching in a data processing system comprising the steps of:
storing website content associated with business entities that are included in a control group, where the control group defines a classification system that is based on a standard industry classification system (SIC) predefined taxonomy of business activities having verified information about the business entities, the classification system being used to classify and store the website content of the business entities;
responding to a search request by using full text index searching of the website content;
identifying commonly occurring keywords in website content from the search results and website content of business entities in the control group that are classified in the SIC predefined taxonomy of SIC business activities;
grouping the search results based on the commonly occurring keywords identified;
using website content from the search results to allow the control group to grow if a substantial amount of groups of commonly occurring keywords have been identified; and
creating a new business entities category to update the classification system of the control group, allowing the control group to grow by;
identifying keyword matches in the stored website content by identifying any commonly occurring keywords or phrases in the website content; and
processing the matches identified by determining whether any of the keywords or phrases in the identified matches contain one or more keywords associated with any of the SIC business activities in the SIC predefined taxonomy.
- View Dependent Claims (52, 53)
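The last two steps of claim 51 test whether a commonly occurring phrase contains a keyword already associated with an SIC business activity. One way to picture that test is the sketch below, where a phrase that passes is proposed as a new category under the matching SIC code. The function name, the whole-word matching rule, and the sample data are assumptions, not details taken from the claims.

```python
# Hypothetical keywords associated with SIC business activities.
SIC_KEYWORDS = {
    "1711": {"plumbing", "heating"},
    "5812": {"eating", "restaurant"},
}


def propose_subcategories(phrases, sic_keywords):
    """Map each SIC code to the phrases containing one of its keywords.

    Matching is on whole words, so "heating repair" matches the keyword
    "heating" while "free parking" matches nothing and is dropped.
    """
    proposals = {}
    for phrase in phrases:
        words = set(phrase.lower().split())
        for sic, keywords in sic_keywords.items():
            if words & keywords:
                proposals.setdefault(sic, []).append(phrase)
    return proposals


common_phrases = ["emergency plumbing", "free parking", "heating repair"]
print(propose_subcategories(common_phrases, SIC_KEYWORDS))
```

Phrases that match no SIC keyword are set aside instead of creating a category, so generic phrases shared by many sites do not pollute the classification system.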
Specification