Methods and systems for determining a meaning of a document to match the document to content
First Claim
1. A method for determining a source meaning for a web page document, the method performed by a document server implemented as a network of computer processors or as a single computer system, the document server executing a document engine, the method comprising:
receiving a web page document;
identifying, based on formatting information of the web page document, a collection of different regions contained within the webpage document, that would be displayed to a user visiting said web page document, wherein the regions contained within the webpage document contain content between opening and closing HTML or XML tags;
determining concepts expressed in each of the previously identified different regions in the collection, wherein determining the concepts expressed in each of the different regions comprises identifying words contained within each of the different regions and aligning the words with concepts;
determining scores for the concepts expressed in each of the different regions, wherein the score for a concept expressed in at least one of the different regions is based on a size of the at least one of the different regions;
creating a ranked global list of concepts based at least in part on said scores for said concepts expressed in each of said different regions;
removing unrelated concepts from said global list of concepts;
determining the source meaning for the web page document, wherein determining the source meaning includes excluding the unrelated concepts from the determination of the source meaning and wherein the source meaning is a vector of said determined concepts expressed in the web page document; and
making the previously determined source meaning available.
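The steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation: the word-to-concept lexicon `LEXICON`, the set of region tags, the scoring rule (concept hits multiplied by the region's word count), and the `cutoff` used to drop unrelated concepts are all assumptions introduced for illustration, since the claim does not fix any of them.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Hypothetical word-to-concept lexicon; the claim does not specify one.
LEXICON = {
    "salmon": "fishing", "trout": "fishing", "rod": "fishing",
    "reel": "fishing", "mortgage": "finance", "loan": "finance",
}


class RegionParser(HTMLParser):
    """Collects the words between opening and closing tags as regions."""

    REGION_TAGS = {"title", "h1", "p", "div"}  # assumed formatting cues

    def __init__(self):
        super().__init__()
        self.regions = []  # one list of words per closed region
        self._stack = []   # word buffers for regions still open

    def handle_starttag(self, tag, attrs):
        if tag in self.REGION_TAGS:
            self._stack.append([])

    def handle_data(self, data):
        # Text goes to the innermost open region; text outside any region is ignored.
        if self._stack:
            self._stack[-1].extend(re.findall(r"[a-z]+", data.lower()))

    def handle_endtag(self, tag):
        if tag in self.REGION_TAGS and self._stack:
            words = self._stack.pop()
            if words:
                self.regions.append(words)


def source_meaning(html, cutoff=0.25):
    """Return a normalized concept vector (dict) for an HTML string."""
    parser = RegionParser()
    parser.feed(html)
    parser.close()

    global_scores = Counter()
    for words in parser.regions:
        size = len(words)  # region size, measured in words (assumption)
        hits = Counter(LEXICON[w] for w in words if w in LEXICON)
        for concept, count in hits.items():
            global_scores[concept] += count * size

    ranked = global_scores.most_common()  # ranked global list of concepts
    if not ranked:
        return {}

    # Assumed relatedness test: drop concepts scoring far below the top one.
    top_score = ranked[0][1]
    kept = {c: s for c, s in ranked if s >= cutoff * top_score}
    total = sum(kept.values())
    return {c: s / total for c, s in kept.items()}
```

Calling `source_meaning(page_html)` on a page returns a dictionary mapping each retained concept to its weight, which stands in for the claimed vector of concepts. A real system would likely replace the threshold with a semantic relatedness measure between concepts.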
Abstract
Systems and methods for determining a meaning of a document to match the document to content are described. In one aspect, a source article is accessed, a plurality of regions in the source article are identified, at least one local concept associated with each region is determined, the local concepts of each region are analyzed to identify any unrelated regions, the local concepts associated with any unrelated regions are eliminated to determine relevant concepts, the relevant concepts are analyzed to determine a source meaning for the source article, and the source meaning is matched with an item meaning associated with an item from a set of items.
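The final matching step described in the abstract can be sketched as a vector comparison. The abstract does not say how the source meaning and the item meaning are compared; cosine similarity is used here purely as an assumed, common choice, and the names `cosine` and `best_match` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse concept vectors (dicts)."""
    dot = sum(weight * b.get(concept, 0.0) for concept, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_match(source, items):
    """Return (item_name, similarity) for the closest item, or None if there are no items."""
    if not items:
        return None
    scored = ((name, cosine(source, meaning)) for name, meaning in items.items())
    return max(scored, key=lambda pair: pair[1])
```

Here `source` is the source meaning of the article and `items` maps each candidate item (for example, an advertisement) to its own concept vector; both structures are illustrative.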
29 Claims
1. A method for determining a source meaning for a web page document, the method performed by a document server implemented as a network of computer processors or as a single computer system, the document server executing a document engine, the method comprising:
receiving a web page document;
identifying, based on formatting information of the web page document, a collection of different regions contained within the webpage document, that would be displayed to a user visiting said web page document, wherein the regions contained within the webpage document contain content between opening and closing HTML or XML tags;
determining concepts expressed in each of the previously identified different regions in the collection, wherein determining the concepts expressed in each of the different regions comprises identifying words contained within each of the different regions and aligning the words with concepts;
determining scores for the concepts expressed in each of the different regions, wherein the score for a concept expressed in at least one of the different regions is based on a size of the at least one of the different regions;
creating a ranked global list of concepts based at least in part on said scores for said concepts expressed in each of said different regions;
removing unrelated concepts from said global list of concepts;
determining the source meaning for the web page document, wherein determining the source meaning includes excluding the unrelated concepts from the determination of the source meaning and wherein the source meaning is a vector of said determined concepts expressed in the web page document; and
making the previously determined source meaning available.
Dependent claims: 2, 3, 4, 5
6. A computer-implemented method comprising:
receiving, by a computing system, a source web page document;
identifying, by the computing system based on formatting information of the source web page document, a plurality of regions contained within the source webpage document, that would be displayed to a user visiting said source web page document, wherein the regions contained within the source webpage document contain content between opening and closing HTML or XML tags;
determining, by the computing system, at least one local concept expressed within each previously identified region, wherein determining the at least one local concept comprises identifying words in the document and aligning the words with concepts, wherein said at least one local concept expressed within said each previously identified region is a concept expressed by two or more words contained within the region;
determining a score for a local concept expressed in each previously-identified region, wherein the score is based on an importance associated with the previously-identified region;
analyzing, by the computing system, the previously determined at least one local concept of each region to identify and eliminate from consideration one or more local concepts that are unrelated to local concepts of other of said previously identified regions by creating a ranked global list of all of said local concepts;
analyzing, by the computing system, the previously identified regions to identify and eliminate from consideration one or more regions that are unrelated to other previously identified regions by comparing a ranked list of local concepts for each of said previously identified regions to said global list;
determining, by the computing system, a source meaning for the source web page document, wherein the source meaning for the source web page document is a weighted vector of said previously determined local concepts expressed in the source web page document that remain after the eliminations; and
matching, by the computing system, the source web page document with an item selected from a set of items by comparing the previously determined source meaning and a meaning of the item.
Dependent claims: 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20
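The region-elimination step of claim 6, in which each region's ranked list of local concepts is compared to the global list, can be sketched as follows. The claim does not define the comparison; the rule below (keep a region only if its top-ranked concepts overlap the top-ranked global concepts) and the names `filter_regions` and `top_k` are assumptions made for illustration.

```python
from collections import Counter


def filter_regions(region_scores, top_k=3):
    """Drop regions whose top local concepts share nothing with the global top concepts.

    region_scores: list of dicts, one per region, mapping concept -> score.
    Returns (kept_regions, weighted_vector).
    """
    # Ranked global list: sum every region's local concept scores.
    global_scores = Counter()
    for scores in region_scores:
        global_scores.update(scores)
    global_top = {c for c, _ in global_scores.most_common(top_k)}

    # Compare each region's ranked local list against the global list.
    kept = []
    for scores in region_scores:
        local_top = {c for c, _ in Counter(scores).most_common(top_k)}
        if local_top & global_top:
            kept.append(scores)

    # Weighted vector of the local concepts that remain after elimination.
    remaining = Counter()
    for scores in kept:
        remaining.update(scores)
    total = sum(remaining.values())
    vector = {c: s / total for c, s in remaining.items()} if total else {}
    return kept, vector
```

A smaller `top_k` makes the filter stricter, because fewer global concepts are available for a region to overlap with. A production system could instead use a rank-correlation or semantic-distance measure between the two lists.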
21. A non-transitory computer-readable medium storing program code operable to cause one or more computers to perform operations comprising:
receiving a source web page document;
identifying, by a preprocessor based on formatting information of the source web page document, a plurality of regions contained within the source webpage document, that would be displayed to a user visiting said source web page document, wherein the regions contained within the source webpage document contain content between opening and closing HTML or XML tags;
determining at least one local concept expressed in each previously identified region, wherein determining the at least one local concept comprises identifying words in the document and aligning the words with concepts, wherein said at least one local concept expressed in the previously identified region is expressed by two or more words in the region;
determining a score for a local concept expressed in each previously-identified region, wherein the score is based on a size of, or an importance associated with, the previously-identified region;
analyzing the previously determined at least one local concept of each region to identify and eliminate from consideration one or more local concepts that are unrelated to local concepts of other of said previously identified regions by creating a ranked global list of all of said local concepts;
analyzing the previously identified regions to identify and eliminate from consideration one or more regions that are unrelated to regions by comparing a ranked list of local concepts for each of said previously identified regions to said global list;
determining a source meaning for the source web page document, wherein the source meaning for the source web page document is a weighted vector of said previously determined local concepts expressed in the source web page document that remain after the eliminations; and
matching the source web page document with an item selected from a set of items by comparing the previously determined source meaning and a meaning of the item.
Dependent claims: 22, 23, 24, 25, 26, 27, 28, 29
Specification