System and method for aggregating and ranking data from a plurality of web sites
First Claim
1. A method for automatically collecting data from a plurality of targeted web sites to aggregate said data, the method comprising a plurality of stages:
automatically and periodically querying for said data from a plurality of related sites, said related sites comprising at least one web page that was not previously analyzed;
analyzing the results from said querying, said results comprising at least one webpage, said analyzing comprising:
geometrical analyzing of a page layout of the webpage, wherein said geometrical analyzing comprises determining one or more geometrical properties of the webpage, wherein said determining one or more geometrical properties comprises decomposing said page layout of the document into a plurality of layout subareas to render said page layout to form a rendered layout, determining one or more rectangles in each of said layout subareas, and determining height, width and position of each of said rectangles to form said geometrical properties of said rendered layout;
locating recurring patterns of said rectangles in said rendered layout;
searching for a plurality of record containers within said recurring patterns of said rectangles according to said layout subareas wherein said record containers are defined as having an organized inner structure of said rectangles;
selecting a record container to form a selected record container;
semantically analyzing a record from said selected record container to form a previously semantically analyzed record if a previously semantically analyzed record is not stored;
determining a relevancy of a record to form a relevant record from said selected record container according to said one or more geometrical properties by comparing said recurring patterns of said rectangles and said organized inner rectangles of records to said recurring patterns of said rectangles and said organized inner rectangles of records of a previously semantically analyzed relevant record;
storing the relevant record data in an aggregated data base to aggregate said data;
storing said recurring patterns of rectangles to form stored recurring patterns of rectangles;
comparing recurring patterns of rectangles on said at least one webpage that was not previously analyzed to said stored recurring patterns of rectangles to search for a match;
if no match is found, performing said above stages of the method for said at least one webpage that was not previously analyzed; and
retrieving said data from said aggregated data base, upon demand from user.
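The "locating recurring patterns of said rectangles" stage can be illustrated with a minimal sketch. The claim does not say how recurrence is detected, so the criterion used here (rectangles sharing an identical width and height) is an assumption, and the names `Rect` and `recurring_patterns` are hypothetical.

```python
from collections import defaultdict
from typing import NamedTuple


class Rect(NamedTuple):
    """A rendered rectangle: position plus width and height (hypothetical type)."""
    x: int
    y: int
    width: int
    height: int


def recurring_patterns(rects, min_repeats=2):
    """Group rectangles by (width, height) and keep sizes that repeat.

    Assumption: two rectangles belong to the same pattern when their
    sizes are identical. The claim leaves the actual criterion open.
    """
    groups = defaultdict(list)
    for r in rects:
        groups[(r.width, r.height)].append(r)
    return {
        size: sorted(members, key=lambda r: (r.y, r.x))
        for size, members in groups.items()
        if len(members) >= min_repeats
    }


# Illustrative layout: a header, three equally sized listings, a sidebar.
rects = [
    Rect(0, 0, 800, 60),
    Rect(20, 80, 600, 120),
    Rect(20, 210, 600, 120),
    Rect(20, 340, 600, 120),
    Rect(640, 80, 140, 380),
]
patterns = recurring_patterns(rects)
```

In this made-up layout only the three listing rectangles repeat, so they would be the candidates searched for record containers; the header and sidebar occur once and are dropped.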
Abstract
System and method for collecting information from a plurality of related sites, analyzing the information and storing the relevant information in a data base for future use. According to one embodiment of the present invention, the system uses the provided list of sites, whether obtained automatically or separately, queries them and analyzes the result retrieved from each site. The information may also optionally and preferably be ranked.
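The store-then-retrieve part of the abstract can be sketched as follows. This is only an illustration: the abstract does not describe a schema or a ranking criterion, so the table layout, the numeric `score` column, and the function names are all assumptions, and an in-memory SQLite database stands in for the data base.

```python
import sqlite3


def open_store():
    """Create an in-memory stand-in for the aggregated data base (assumed schema)."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE records (site TEXT, title TEXT, score REAL)")
    return db


def store_record(db, site, title, score):
    """Store one relevant record; `score` is a hypothetical ranking value."""
    db.execute("INSERT INTO records VALUES (?, ?, ?)", (site, title, score))


def retrieve_ranked(db, limit=10):
    """Return stored records on demand, highest score first."""
    cur = db.execute(
        "SELECT site, title, score FROM records ORDER BY score DESC LIMIT ?",
        (limit,),
    )
    return cur.fetchall()


db = open_store()
store_record(db, "site-a.example", "Listing A", 0.4)
store_record(db, "site-b.example", "Listing B", 0.9)
store_record(db, "site-c.example", "Listing C", 0.7)
top_two = retrieve_ranked(db, limit=2)
```

Ranking is applied at retrieval time here so that records from different sites are ordered together; the source only says that ranking is optional, not where it happens.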
31 Citations
17 Claims
1. A method for automatically collecting data from a plurality of targeted web sites to aggregate said data, the method comprising a plurality of stages:
automatically and periodically querying for said data from a plurality of related sites, said related sites comprising at least one web page that was not previously analyzed;
analyzing the results from said querying, said results comprising at least one webpage, said analyzing comprising:
geometrical analyzing of a page layout of the webpage, wherein said geometrical analyzing comprises determining one or more geometrical properties of the webpage, wherein said determining one or more geometrical properties comprises decomposing said page layout of the document into a plurality of layout subareas to render said page layout to form a rendered layout, determining one or more rectangles in each of said layout subareas, and determining height, width and position of each of said rectangles to form said geometrical properties of said rendered layout;
locating recurring patterns of said rectangles in said rendered layout;
searching for a plurality of record containers within said recurring patterns of said rectangles according to said layout subareas wherein said record containers are defined as having an organized inner structure of said rectangles;
selecting a record container to form a selected record container;
semantically analyzing a record from said selected record container to form a previously semantically analyzed record if a previously semantically analyzed record is not stored;
determining a relevancy of a record to form a relevant record from said selected record container according to said one or more geometrical properties by comparing said recurring patterns of said rectangles and said organized inner rectangles of records to said recurring patterns of said rectangles and said organized inner rectangles of records of a previously semantically analyzed relevant record;
storing the relevant record data in an aggregated data base to aggregate said data;
storing said recurring patterns of rectangles to form stored recurring patterns of rectangles;
comparing recurring patterns of rectangles on said at least one webpage that was not previously analyzed to said stored recurring patterns of rectangles to search for a match;
if no match is found, performing said above stages of the method for said at least one webpage that was not previously analyzed; and
retrieving said data from said aggregated data base, upon demand from user.
Dependent Claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15
16. A system for automatically collecting data from a plurality of targeted web sites to aggregate said data, comprising:
a data base and a processor, said processor performing the following processes:
a. a crawler process for fetching data from a provided list of related web sites, said related sites comprising at least one web page that was not previously analyzed;
b. a geometrical analyzer process for analyzing said data, said data comprising at least one webpage, said analyzing comprising:
geometrical analyzing of a page layout of the webpage, wherein said geometrical analyzing comprises determining one or more geometrical properties of the webpage, wherein said determining one or more geometrical properties comprises decomposing said page layout of the document into a plurality of layout subareas to render said page layout to form a rendered layout, determining one or more rectangles in each of said layout subareas, and determining height, width and position of each of said rectangles to form said geometrical properties of said rendered layout;
locating recurring patterns of said rectangles in said rendered layout;
searching for a plurality of record containers within said recurring patterns of said rectangles according to said layout subareas wherein said record containers are defined as having an organized inner structure of said rectangles;
selecting a record container to form a selected record container;
semantically analyzing a record from said selected record container to form a previously semantically analyzed record if a previously semantically analyzed record is not stored;
determining a relevancy of a record to form a relevant record from said selected record container according to said one or more geometrical properties by comparing said recurring patterns of said rectangles and said organized inner rectangles of records to said recurring patterns of said rectangles and said organized inner rectangles of records of a previously analyzed relevant record;
storing said recurring patterns of rectangles to form stored recurring patterns of rectangles;
comparing recurring patterns of rectangles on said at least one webpage that was not previously analyzed to said stored recurring patterns of rectangles to search for a match;
if no match is found, performing said above stages of the method for said at least one webpage that was not previously analyzed; and
c. a semantic layer for textually analyzing said relevant record to retrieve information;
wherein said data base stores the information retrieved by said semantic layer.
17. A method for automatically collecting data from a plurality of targeted web sites to aggregate said data, the method comprising a plurality of stages:
automatically and periodically querying for said data from a plurality of related sites, said related sites comprising at least one web page that was not previously analyzed;
analyzing the results from said querying, said results comprising at least one webpage, said analyzing comprising:
geometrical analyzing of a page layout of the webpage, wherein said geometrical analyzing comprises determining one or more geometrical properties of the webpage, wherein said determining one or more geometrical properties comprises decomposing said page layout of the document into a plurality of layout subareas to render said page layout to form a rendered layout, determining one or more rectangles in each of said layout subareas, and determining height, width and position of each of said rectangles to form said geometrical properties of said rendered layout;
locating recurring patterns of said rectangles in said rendered layout, wherein said recurring patterns of rectangles do not have a geometrically fixed position within the webpage;
searching for a plurality of record containers within said recurring patterns of said rectangles according to said layout subareas wherein said record containers are defined as having an organized inner structure of said rectangles;
selecting a record container to form a selected record container;
semantically analyzing a record from said selected record container to form a previously semantically analyzed record if a previously semantically analyzed record is not stored;
determining a relevancy of a record to form a relevant record from said selected record container according to said one or more geometrical properties by comparing said recurring patterns of said rectangles and said organized inner rectangles of records to said recurring patterns of said rectangles and said organized inner rectangles of records of a previously analyzed relevant record, wherein only said recurring patterns are compared and not a location of said rectangles within the webpage;
storing the relevant record data in an aggregated data base to aggregate said data;
storing said recurring patterns of rectangles to form stored recurring patterns of rectangles;
comparing recurring patterns of rectangles on said at least one webpage that was not previously analyzed to said stored recurring patterns of rectangles to search for a match, wherein only said recurring patterns are compared and not a location of said rectangles within the webpage;
if no match is found, performing said above stages of the method for said at least one webpage that was not previously analyzed; and
retrieving said data from said aggregated data base, upon demand from user.
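Claim 17 requires that only the recurring patterns be compared, not the location of the rectangles within the webpage. One way to picture this, offered as a sketch and not as the patented method, is to shift each group of rectangles to a common origin before comparing. The `signature` and `needs_full_analysis` helpers and the `stored_patterns` set are hypothetical names, and exact equality of the shifted rectangles is an assumed matching rule.

```python
def signature(rects):
    """Build a position-independent signature for (x, y, width, height) tuples.

    Each rectangle is shifted so the group's top-left corner sits at (0, 0);
    two groups that differ only by their place on the page then compare equal.
    """
    origin_x = min(x for x, _, _, _ in rects)
    origin_y = min(y for _, y, _, _ in rects)
    return tuple(sorted((x - origin_x, y - origin_y, w, h) for x, y, w, h in rects))


stored_patterns = set()  # stand-in for the stored recurring patterns


def needs_full_analysis(rects):
    """Return True when no stored pattern matches, and remember the new one."""
    sig = signature(rects)
    if sig in stored_patterns:
        return False
    stored_patterns.add(sig)
    return True


# Same two-record pattern at different page positions, plus a different pattern.
page_a = [(20, 80, 600, 120), (20, 210, 600, 120)]
page_b = [(120, 380, 600, 120), (120, 510, 600, 120)]
page_c = [(0, 0, 300, 50), (0, 60, 300, 50)]

first = needs_full_analysis(page_a)
second = needs_full_analysis(page_b)
third = needs_full_analysis(page_c)
```

Under these assumptions `page_a` triggers the full analysis, `page_b` matches the stored pattern despite its offset, and `page_c` is treated as a new pattern.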
Specification