Sniffing hypertext content to determine type
Abstract
Correct detection of embedded content type enables an operating system to launch the correct program to handle the embedded content. A page retrieval module retrieves an HTML page from a server, the contents of which are parsed by a parsing module. An embedded content analyzer gathers information from the parsed page about content embedded within the web page and proceeds to determine the type of content that is embedded. Content type is determined by analyzing various parameters such as a type specified by the web page, content type provided by an HTTP response, known file extensions present in a URL associated with the content or with the name of the file itself, and by sniffing the file. In one embodiment, the results of each analysis are weighted and a determination is made based upon the weighted total of results.
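
The weighted determination mentioned in the abstract can be illustrated with a minimal Python sketch. The abstract gives no weight values, so the WEIGHTS table, the source names, and the function name weighted_type are assumptions made purely for illustration; this is one plausible reading, not the patented implementation.

```python
from collections import defaultdict
from typing import Dict, Optional

# Hypothetical weights for each source of evidence (not from the patent).
WEIGHTS = {
    "page_declared": 1.0,  # type specified by the web page markup
    "http_header": 2.0,    # content type provided by the HTTP response
    "url_extension": 1.0,  # known file extension in the URL or file name
    "sniffed": 3.0,        # type inferred by sniffing the file itself
}


def weighted_type(evidence: Dict[str, Optional[str]]) -> Optional[str]:
    """Return the content type with the largest weighted total, or None."""
    totals = defaultdict(float)
    for source, mime in evidence.items():
        if mime is not None:
            totals[mime.lower()] += WEIGHTS[source]
    if not totals:
        return None
    return max(totals, key=totals.get)


evidence = {
    "page_declared": "video/x-flv",
    "http_header": "text/plain",
    "url_extension": "video/x-flv",
    "sniffed": "video/x-flv",
}
result = weighted_type(evidence)
```

In this example three sources agree on video/x-flv (total weight 5.0) and outvote the generic text/plain header (weight 2.0).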
22 Claims
1. A method for determining a type of embedded content in a web page, the method comprising:
receiving web page content;
parsing the received web page content;
determining from the parsing that the web page content specifies embedded content to be retrieved;
requesting the embedded content;
receiving the embedded content and a response header;
analyzing the received embedded content to determine a first type of the embedded content;
analyzing the received response header to determine a second type of the embedded content; and
responsive to one of the first type of the embedded content and the second type of the embedded content not being an excluded content type, determining a third type of the embedded content based on the first type of the embedded content and based on the second type of the embedded content, wherein the third type of the embedded content is either the first type of the embedded content or the second type of the embedded content; and
responsive to the first type of the embedded content and the second type of the embedded content being excluded content types, determining the third type of the embedded content based on a highest score of a plurality of generated scores for a plurality of possible content types, the plurality of possible content types comprising the second type of the embedded content, the first type of the embedded content, and a content type associated with a file extension for the embedded content.
(Dependent claims: 2-14)
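
The two branches of claim 1 can be sketched in Python as follows. This is an illustrative reading only: the contents of EXCLUDED, the function name resolve_type, the score table, and the rule for choosing between the sniffed and header types when at least one of them is not excluded are all assumptions, since the claim does not specify them.

```python
from typing import Dict, Optional

# Hypothetical set of "excluded" (too generic to trust) content types.
EXCLUDED = {"text/plain", "application/octet-stream"}


def resolve_type(
    sniffed: str,             # "first type": from analyzing the content
    header: str,              # "second type": from the response header
    ext_type: Optional[str],  # type associated with the file extension
    scores: Dict[str, float], # hypothetical per-type scores
) -> str:
    """Return the "third type" following the two branches of claim 1."""
    if sniffed not in EXCLUDED or header not in EXCLUDED:
        # Branch 1: the result is either the first or the second type.
        # Assumed tie-break: prefer the sniffed type when it is usable.
        return sniffed if sniffed not in EXCLUDED else header
    # Branch 2: both are excluded, so pick the highest-scoring candidate
    # among the header type, the sniffed type, and the extension type.
    candidates = [t for t in (header, sniffed, ext_type) if t is not None]
    return max(candidates, key=lambda t: scores.get(t, 0.0))


scores = {"text/plain": 0.2, "application/octet-stream": 0.1, "video/mp4": 0.9}
both_excluded = resolve_type(
    "text/plain", "application/octet-stream", "video/mp4", scores
)
```

Here both the sniffed and header types are excluded, so the extension-derived type video/mp4 wins on score.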
15. A computer program product for determining a type of embedded content in a web page, the computer program product stored on a non-transitory computer readable medium and including instructions configured to cause a processor to carry out the steps of:
receiving web page content;
parsing the received web page content;
determining from the parsing that the web page content specifies embedded content to be retrieved;
requesting the embedded content;
receiving the embedded content and a response header;
analyzing the received embedded content to determine a first type of the embedded content;
analyzing the received response header to determine a second type of the embedded content; and
responsive to one of the first type of the embedded content and the second type of the embedded content not being an excluded content type, determining a third type of the embedded content based on the first type of the embedded content and based on the second type of the embedded content, wherein the third type of the embedded content is either the first type of the embedded content or the second type of the embedded content; and
responsive to the first type of the embedded content and the second type of the embedded content being excluded content types, determining the third type of the embedded content based on a highest score of a plurality of generated scores for a plurality of possible content types, the plurality of possible content types comprising the second type of the embedded content, the first type of the embedded content, and a content type associated with a file extension for the embedded content.
16. A system for determining a type of embedded content in a web page, the system comprising:
receiving means, for receiving web page content;
parsing means, coupled to the receiving means, for parsing the received web page content;
determining means, coupled to the parsing means, for determining from the parsing that the web page content specifies embedded content to be retrieved;
requesting means, coupled to the determining means, for requesting the embedded content;
receiving means, coupled to the requesting means, for receiving the embedded content and a response header; and
analyzing the received embedded content to determine a first type of the embedded content;
analyzing the received response header to determine a second type of the embedded content; and
responsive to one of the first type of the embedded content and the second type of the embedded content not being an excluded content type, determining a third type of the embedded content based on the first type of the embedded content and based on the second type of the embedded content, wherein the third type of the embedded content is either the first type of the embedded content or the second type of the embedded content; and
responsive to the first type of the embedded content and the second type of the embedded content being excluded content types, determining the third type of the embedded content based on a highest score of a plurality of generated scores for a plurality of possible content types, the plurality of possible content types comprising the second type of the embedded content, the first type of the embedded content, and a content type associated with a file extension for the embedded content.
17. A system for determining a type of embedded content in a web page, the system comprising:
a processor;
a page receiving module executed by the processor for receiving web page content;
a parsing module, coupled to the page receiving module and executed by the processor, for parsing the received web page content; and
an embedded content analyzer, coupled to the parsing module and executed by the processor, for:
determining from the parsing that the web page content specifies embedded content to be retrieved;
requesting the embedded content;
receiving the embedded content and a response header;
analyzing the received embedded content to determine a first type of the embedded content;
analyzing the received response header to determine a second type of the embedded content;
responsive to one of the first type of the embedded content and the second type of the embedded content not being an excluded content type, determining a third type of the embedded content based on the first type of the embedded content and based on the second type of the embedded content, wherein the third type of the embedded content is either the first type of the embedded content or the second type of the embedded content; and
responsive to the first type of the embedded content and the second type of the embedded content being excluded content types, determining the third type of the embedded content based on a highest score of a plurality of generated scores for a plurality of possible content types, the plurality of possible content types comprising the second type, the first type, and a content type associated with a file extension for the embedded content.
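
As a rough illustration of the parsing module and the content-analysis step named in claim 17, the Python sketch below collects references to embedded content from markup and sniffs a type from leading bytes. The class name EmbeddedRefParser, the choice of tags, the MAGIC table, and the sniff function are hypothetical; the claim does not say which tags or byte signatures are examined.

```python
from html.parser import HTMLParser
from typing import List, Optional, Tuple

# Hypothetical mapping: tag name -> attribute that holds the content URL.
URL_ATTRS = {"embed": "src", "img": "src", "object": "data"}

# A few well-known file signatures (leading "magic" bytes).
MAGIC = [
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"%PDF-", "application/pdf"),
]


class EmbeddedRefParser(HTMLParser):
    """Collects (url, declared_type) pairs for embedded content."""

    def __init__(self) -> None:
        super().__init__()
        self.refs: List[Tuple[str, Optional[str]]] = []

    def handle_starttag(self, tag, attrs):
        wanted = URL_ATTRS.get(tag)
        if wanted is None:
            return
        attr_map = dict(attrs)
        url = attr_map.get(wanted)
        if url:
            self.refs.append((url, attr_map.get("type")))


def sniff(data: bytes) -> Optional[str]:
    """Guess a content type from the first bytes, or None if unknown."""
    for prefix, mime in MAGIC:
        if data.startswith(prefix):
            return mime
    return None


parser = EmbeddedRefParser()
parser.feed('<p><embed src="movie.bin" type="video/x-flv"><img src="a.png"></p>')
refs = parser.refs
```

Each collected URL would then be requested, and the response body passed to sniff alongside the response header, before a final type is chosen.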
18. A method for determining a type of embedded content in a web page, the method comprising:
receiving web page content;
parsing the received web page content to identify a reference to embedded content;
requesting the referenced embedded content;
receiving the embedded content and an associated response header, the response header specifying a first content type of the embedded content for the received embedded content;
sniffing the received embedded content to determine a second content type of the embedded content, the determination having an associated level of confidence;
responsive to the level of confidence associated with the determined second content type of the embedded content exceeding a threshold level, displaying the embedded content on the web page using the second content type of the embedded content;
responsive to the level of confidence associated with the determined second content type of the embedded content not exceeding the threshold level and responsive to the first content type of the embedded content not being an excluded content type, displaying the embedded content on the web page using the first content type of the embedded content; and
responsive to the level of confidence associated with the determined second content type of the embedded content not exceeding the threshold level and responsive to the first content type of the embedded content being an excluded content type, displaying the received embedded content using a content type with a highest score of a plurality of generated scores for a plurality of possible content types, the plurality of possible content types comprising the sniffed second content type of the embedded content, the specified first content type of the embedded content and a content type associated with a file extension for the embedded content.
(Dependent claims: 19-22)
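
The three-way decision in claim 18 can be sketched in Python as below. The threshold value of 0.8, the EXCLUDED set, the score table, and the function name choose_display_type are assumptions for illustration; the claim specifies only the ordering of the tests, not any concrete values.

```python
from typing import Dict, Optional

# Hypothetical set of excluded (generic) content types.
EXCLUDED = {"text/plain", "application/octet-stream"}


def choose_display_type(
    header_type: str,          # "first content type": from the header
    sniffed_type: str,         # "second content type": from sniffing
    confidence: float,         # confidence of the sniffed result
    ext_type: Optional[str],   # type associated with the file extension
    scores: Dict[str, float],  # hypothetical per-type scores
    threshold: float = 0.8,    # assumed threshold level
) -> str:
    """Pick the content type used to display the embedded content."""
    # 1. A confident sniff result wins outright.
    if confidence > threshold:
        return sniffed_type
    # 2. Otherwise trust the header unless it names an excluded type.
    if header_type not in EXCLUDED:
        return header_type
    # 3. Otherwise fall back to the highest-scoring candidate.
    candidates = [t for t in (sniffed_type, header_type, ext_type) if t is not None]
    return max(candidates, key=lambda t: scores.get(t, 0.0))


fallback = choose_display_type(
    "text/plain", "application/pdf", 0.3, "application/pdf",
    {"application/pdf": 0.7, "text/plain": 0.2},
)
```

In the example the sniff confidence is below the threshold and the header type is excluded, so the scored fallback selects application/pdf.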
Specification