Infrastructure enabling intelligent execution and crawling of a web application
1. A method comprising:
- accessing, by one or more computing systems associated with a social-networking system, a structured document of a network application, the structured document comprising structural information and content items, each content item comprising one or more embedded scripts, resources, or identifiers for the resources;
processing, by the one or more computing systems, the structured document to generate a model representation of the structured document;
executing, by the one or more computing systems, a plurality of the content items of the structured document;
generating, by the one or more computing systems, a plurality of snapshots of the model representation of the structured document, each snapshot comprising a respective modified copy of the model representation and corresponding to a respective executed content item of the plurality of content items;
logging, by the one or more computing systems, the plurality of snapshots of the model representation of the structured document;
creating, by the one or more computing systems, a behavior model of the network application based on the plurality of snapshots of the model representation of the structured document, the behavior model representing at least a communication of data between the network application and one or more third-party servers; and
determining, by the one or more computing devices, based on the behavior model, compliance by the network application with one or more requirements of the social-networking system, wherein the determining compliance is further based on whether the network application is passing data received from the social-networking system to the one or more third-party servers.
In particular embodiments, a method comprises accessing, by one or more computing systems associated with a social-networking system, a structured document of a network application, the structured document comprising structural information and content comprising one or more embedded scripts, resources, or identifiers for the resources. The method further comprises processing the structured document to generate a model representation of the structured document, executing at least some of the content of the structured document and logging multiple snapshots of the model representation of the structured document as the model representation is generated in response to one or more interactions initiated by execution of the content. The method further comprises creating a behavior model of the network application based on the multiple snapshots of the model representation of the structured document and determining, based on the behavior model, compliance by the network application with one or more requirements of the social-networking system.
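By way of illustration only, the snapshot logging summarized above might be sketched as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: the `Node` class is a toy stand-in for the model representation, each content item is modeled as a plain Python callable, and all names (`execute_and_snapshot`, `ad.js`, `pixel.js`) are hypothetical.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class Node:
    """Toy stand-in for one element of the model (e.g., DOM) representation."""
    tag: str
    children: list = field(default_factory=list)


def execute_and_snapshot(model, content_items):
    """Execute each content item and log a modified copy of the model after it.

    Each content item is a (name, callable) pair; the callable mutates the
    model, standing in for execution of an embedded script (an assumption).
    """
    snapshots = []
    for name, run in content_items:
        run(model)
        # Deep-copy so later executions cannot alter snapshots already logged.
        snapshots.append((name, copy.deepcopy(model)))
    return snapshots


# Two hypothetical embedded scripts, each adding one element to the page.
def add_ad_frame(model):
    model.children.append(Node("iframe"))


def add_tracking_pixel(model):
    model.children.append(Node("img"))


root = Node("body")
log = execute_and_snapshot(
    root, [("ad.js", add_ad_frame), ("pixel.js", add_tracking_pixel)]
)
for name, snap in log:
    print(name, [child.tag for child in snap.children])
```

Each logged entry pairs one executed content item with the state of the model immediately after that execution, so differences between consecutive snapshots can be attributed to a single item.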
- 25. A system comprising:
- one or more processors associated with a social-networking system; and
logic encoded in one or more computer-readable tangible storage media that, when executed by the one or more processors, is operable to:
access a structured document of a network application, the structured document comprising structural information and content items, each content item comprising one or more embedded scripts, resources, or identifiers for the resources; process the structured document to generate a model representation of the structured document; execute a plurality of the content items of the structured document; generate a plurality of snapshots of the model representation of the structured document, each snapshot comprising a respective modified copy of the model representation and corresponding to a respective executed content item of the plurality of content items; log the plurality of snapshots of the model representation of the structured document; create a behavior model of the network application based on the plurality of snapshots of the model representation of the structured document, the behavior model representing at least a communication of data between the network application and one or more third-party servers; and determine, based on the behavior model, compliance by the network application with one or more requirements of the social-networking system, wherein the determining compliance is further based on whether the network application is passing data received from the social-networking system to the one or more third-party servers.
- 45. One or more computer-readable non-transitory storage media embodying software that is operable when executed to:
process the structured document to generate a model representation of the structured document; execute a plurality of the content items of the structured document; log the plurality of snapshots of the model representation of the structured document;
This application is a continuation under 35 U.S.C. § 120 of U.S. patent application Ser. No. 12/755,275, filed 6 Apr. 2010, which is incorporated herein by reference.
The present disclosure relates generally to Web applications, and more particularly, to generating behavior models of Web applications.
In various example embodiments, one or more described web pages or web applications may be associated with a social networking system or social networking service. However, alternate embodiments may have application to the retrieval and rendering of structured documents or web applications hosted by any type of network addressable resource or web site. As used herein, a “user” may be an individual, a group, or an entity (such as a business or third party application). Additionally, as used herein, “or” may imply “and” as well as “or;” that is, “or” does not necessarily preclude “and,” unless explicitly stated or implicitly implied.
Particular embodiments may operate in, or in conjunction with, a wide area network environment, such as the Internet, including multiple network addressable systems.
Each client device 30 may generally be a computer or computing device including functionality for communicating (e.g., remotely) over a computer network. Client device 30 may be a desktop computer, laptop computer, personal digital assistant (PDA), in- or out-of-car navigation system, smart phone or other cellular or mobile phone, or mobile gaming device, among other suitable computing devices. Client device 30 may execute one or more client applications, such as a web browser (e.g., MICROSOFT WINDOWS INTERNET EXPLORER, MOZILLA FIREFOX, APPLE SAFARI, GOOGLE CHROME, and OPERA), to access and view content over a computer network. In particular implementations, the client applications allow a user of client device 30 to enter addresses of specific network resources to be retrieved, such as resources hosted by social networking system 20. These addresses can be Uniform Resource Locators (URLs). In addition, once a page or other resource has been retrieved, the client applications may provide access to other pages or records when the user “clicks” on hyperlinks to other resources. By way of example, such hyperlinks may be located within the web pages and provide an automated way for the user to enter the URL of another page and to retrieve that page.
In one example embodiment, social networking system 20 comprises computing systems that allow users at client devices 30 to communicate or otherwise interact with each other and access content, such as user profiles, as described herein. Social networking system 20 is a network addressable system that, in various example embodiments, comprises one or more physical servers 22a or 22b (hereinafter referred to collectively as servers 22) as well as data store 24, as illustrated in
Physical servers 22 may host functionality directed to the operations of social networking system 20. By way of example, social networking system 20 may host a website that allows one or more users, at one or more client devices 30, to view and post information (including internal or external hypertext links), as well as communicate with one another via the website. Hereinafter servers 22 may be referred to as server 22, although server 22 may include numerous servers hosting, for example, social networking system 20, as well as other content distribution servers, data stores, and databases. Data store 24 may store content and data relating to, and enabling, operation of the social networking system as digital data objects. A data object, in particular implementations, is an item of digital information typically stored or embodied in a data file, database or record. Content objects may take many forms, including: text (e.g., ASCII, SGML, HTML), images (e.g., jpeg, tif and gif), graphics (vector-based or bitmap), audio, video (e.g., mpeg), or other multimedia, and combinations thereof. Content object data may also include executable code objects (e.g., games executable within a browser window or frame), podcasts, etc. Logically, data store 24 corresponds to one or more of a variety of separate and integrated databases, such as relational databases and object-oriented databases, that maintain information as an integrated collection of logically related records or files stored on one or more physical systems. Structurally, data store 24 may generally include one or more of a large class of data storage and management systems. In particular embodiments, data store 24 may be implemented by any suitable physical system(s) including components, such as one or more database servers, mass storage media, media library systems, storage area networks, data storage clouds, and the like. 
In one example embodiment, data store 24 includes one or more servers, databases (e.g., MySQL), and/or data warehouses.
Data store 24 may include data associated with different social networking system 20 users or client devices 30. In particular embodiments, the social networking system 20 maintains a user profile for each user of the system 20. User profiles include data that describe the users of a social network, which may include, for example, proper names (first, middle and last of a person, a trade name and/or company name of a business entity, etc.), biographic, demographic, and other types of descriptive information, such as work experience, educational history, hobbies or preferences, geographic location, and additional descriptive data. By way of example, user profiles may include a user's birthday, relationship status, city of residence, and the like. The system 20 may further store data describing one or more relationships between different users. The relationship information may indicate users who have similar or common work experience, group memberships, hobbies, or educational history. A user profile may also include privacy settings governing access to the user's information by other users. In particular embodiments, the social networking system 20 maintains in data store 24 a number of objects for the different kinds of items with which a user may interact while accessing social networking system 20. In one example embodiment, these objects include user profiles, application objects, and message objects (such as for wall posts, emails and other messages). In one embodiment, an object is stored by the system 20 for each instance of its associated item. These objects and the actions discussed herein are provided for illustration purposes only, and it can be appreciated that an unlimited number of variations and features can be provided on a social networking system 20.
When a user at a client device 30 desires to view a particular web page (hereinafter also referred to as target structured document) hosted by social networking system 20 or a web application hosted by a web application server 40 and made available in conjunction with social networking system 20, the user's web browser, or other document rendering engine or suitable client application, formulates and transmits a request to social networking system 20. The request generally includes a URL, or other document identifier, as well as metadata or other information. By way of example, the request may include information identifying the user, such as a user ID, as well as information identifying or characterizing the web browser or operating system running on the user's client computing device 30. The request may also include location information identifying a geographic location of the user's client device or a logical network location of the user's client device. The request may also include a timestamp identifying when the request was transmitted.
Generally, a web application is an application that may be accessed via a web browser or other client application over a network, or a computer software application that is coded in a web browser-supported language and sometimes reliant on a web browser to render the application executable. Web applications have gained popularity largely as a result of the ubiquity of web browsers, the convenience of using a web browser launched at a remote computing device as a client (sometimes referred to as a thin client), and the corresponding ability to update and maintain web applications without necessarily distributing and installing software on remote clients. Often, to implement a web application, the web application requires access to one or more resources provided at a backend server of an associated website. Additionally, web applications may often require access to additional resources associated with other applications.
When a third party application developer creates a web application for use by users of social networking system 20, the developer may decide whether to configure the web application using, for example, IFrames or FBML, as the default for the application's canvas (base) pages. A canvas page is the address where the web application is located or cached within social networking system 20. When users access a given web application via social networking system 20, the users are taken to the canvas page, which serves as a base page that effectively hosts the web application and in which the web application is rendered and displayed.
Thus, in particular embodiments, social networking system 20 includes or is coupled to an infrastructure or platform that enables intelligent crawling and actual execution of web pages and web applications, and which may log information that may be used to discover web application providers and ad network servers 50 that may not deliver ads according to service agreements in line with social networking system 20, or which otherwise perform unscrupulously. In particular embodiments, social networking system 20 includes or is coupled to a primary-secondary distributed computing system that includes primary computing system (primary) 602 and one or more secondary computing systems (secondary) 604. Each secondary 604 is capable of running one or more crawler processes 606. In particular embodiments, each crawler process 606 may effectively behave as, or similar to, a headless browser (e.g., a browser capable of navigating and rendering web applications and pages without requiring a physical monitor or display in connection with the browser), and uses test user credentials generated or stored in user credential database 610 to crawl web applications scheduled by primary 602.
Still further, each crawler process 606 effectively emulates a browser client at a client device 30, and as such, has browser-privileged access to all the content accessible by a browser. Furthermore, each crawler process 606 may have access to computer network cloud 60 either directly or through social networking system 20. Also, in this way, each crawler process 606 may be completely unplugged from the user-side in that, in particular embodiments, the crawler processes 606 do not interact with real users of social networking system 20.
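By way of illustration only, the scheduling of crawls by a primary onto secondaries might be sketched as follows. This is a minimal sketch under stated assumptions: the round-robin assignment, the `CrawlJob` record, and every name and URL shown are hypothetical and are not drawn from the disclosure.

```python
from dataclasses import dataclass
from itertools import cycle


@dataclass(frozen=True)
class CrawlJob:
    """One scheduled crawl: which application, as which test user, and where."""
    app_url: str
    test_user: str
    secondary: str


def schedule(app_urls, test_user_by_app, secondaries):
    """Assign each web application to a secondary in round-robin order.

    The test user is looked up per application, so the same test user can be
    reused each time a given application is crawled (an assumed policy).
    """
    ring = cycle(secondaries)
    return [CrawlJob(url, test_user_by_app[url], next(ring)) for url in app_urls]


# Hypothetical applications, test users, and secondary names.
apps = ["https://app-a.example/canvas", "https://app-b.example/canvas",
        "https://app-c.example/canvas"]
users = {apps[0]: "test_user_1", apps[1]: "test_user_2", apps[2]: "test_user_1"}
jobs = schedule(apps, users, ["secondary-1", "secondary-2"])
for job in jobs:
    print(job)
```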
In particular embodiments, using the test user credentials and web application URL specified in the query received from primary 602, the scheduled crawler process may first log in to a login page of social networking system 20 at 704. In alternate embodiments, no login step may be required. By way of example, the test user may be preauthorized by social networking system 20 or automatically granted access via crawler process 606. In particular embodiments, the crawler process then attempts to access the specified web application by transmitting, at 706, a request for the web application's canvas page as, for example, described with reference to the flowcharts of
In some implementations, the crawler process 606 may log external interactions (such as outgoing requests and responses) as a structured document is being loaded and the model representation of the page is being generated (710). In particular embodiments, as the structured document is being processed and the DOM representation is being generated, crawler process 606 tracks and logs (e.g., in any suitable database), at 710, one or more interactions including outgoing requests transmitted as a result of executing embedded calls, scripts, or code segments as well as, in some embodiments, incoming responses received from web application server 40, ad network 50, or other locations (e.g., in the case of redirects). In particular embodiments, to perform the tracking/logging, crawler process 606 can be constructed to take advantage of certain functions of an underlying browser application (such as GECKO) that handle network requests. For example, the functions that handle network requests may support certain services, on which another process or module may install hooks. When an event, such as an outgoing request or incoming response, occurs, a callback function can be called so that crawler process 606 can log the event. In particular embodiments, an overlying programming layer is configured to track interactions, such as, by way of example, all network requests made and transmitted by crawler process 606, such as in response to crawler process 606 executing calls, scripts, or other executable code segments embedded within the base page or the web application content itself. In particular embodiments, the overlying programming layer may also monitor, track, or log incoming responses transmitted to or for the crawler process.
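By way of illustration only, the hook-and-callback logging described above might be sketched as follows. This is a minimal sketch under stated assumptions: `NetworkLayer` is a toy stand-in for the browser functions that handle network requests and does not reflect any actual GECKO interface, no real network traffic is generated, and all names and URLs are hypothetical.

```python
import time
from urllib.parse import urlsplit


class NetworkLayer:
    """Toy stand-in for the browser functions that handle network requests."""

    def __init__(self):
        self._hooks = []

    def install_hook(self, callback):
        """Register a callback invoked for every request and response event."""
        self._hooks.append(callback)

    def _notify(self, kind, url):
        for hook in self._hooks:
            hook(kind, url)

    def send_request(self, url):
        self._notify("request", url)
        # A real layer would perform the fetch here; this sketch only
        # reports that a response arrived for the same URL.
        self._notify("response", url)


class InteractionLog:
    """Collects logged events; a real crawler might write to a database."""

    def __init__(self):
        self.events = []

    def record(self, kind, url):
        self.events.append({
            "time": time.time(),
            "kind": kind,
            "url": url,
            "domain": urlsplit(url).hostname,
        })


network = NetworkLayer()
interactions = InteractionLog()
network.install_hook(interactions.record)

# Simulates an embedded script requesting an ad from a hypothetical ad network.
network.send_request("https://ads.example.com/serve?slot=1")
for event in interactions.events:
    print(event["kind"], event["domain"])
```

Because the log is populated through the installed hook rather than by the code that issues the request, any request triggered by executed page content is captured without modifying that content.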
The following illustrates an example raw data output by the enumeration script illustrated above:
The raw data output can be processed as set forth below to facilitate logging and querying of the data:
In particular embodiments, crawler 606 generates, at 712, a behavior model of the web application based on all or a filtered set of the logged interactions, and the enumerated features of the page. In particular embodiments, the behavior model specifies the URL of the web application, the URLs or domain names of various resources for which requests were sent (including requests for ads sent to ad networks), the URLs or domain names associated with various resources received from third party servers (including ads from ad networks), all or a portion of the HTML for the web application URL, all or a portion of the raw text of the web application (such as the raw and processed output described above), among other desired information. In particular, the behavior model provides a map of the outgoing requests, including requests made for ads to ad networks 50 by the web application, as a result of crawler 606 executing embedded calls within the web application content. In this way, all of the domains where requests are sent from the web application for the particular test user over a number of scheduled crawls (or even across test users, although in particular embodiments, the same test user is used in crawling a particular web application every time the web application is crawled in order to take advantage of and preserve previously downloaded cookies) may be used to provide insight into what particular ad networks a particular web application is using. In one implementation, a separate process, such as a process hosted on a primary computing system, may itself apply a rule set to the generated behavior model to determine whether the network application meets one or more requirements or is otherwise suitable. In addition, various features of the page are also logged for further analysis and tracking.
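By way of illustration only, building a behavior model from logged requests and applying a rule set to it might be sketched as follows. This is a minimal sketch under stated assumptions: the dictionary layout of the model, the rule format (a name paired with a predicate), the example rule, and all URLs are hypothetical.

```python
from collections import Counter
from urllib.parse import urlsplit


def build_behavior_model(app_url, logged_request_urls):
    """Summarize where the application sent requests during a crawl."""
    app_domain = urlsplit(app_url).hostname
    counts = Counter(urlsplit(url).hostname for url in logged_request_urls)
    third_party = {d: n for d, n in counts.items() if d != app_domain}
    return {
        "app_url": app_url,
        "request_domains": dict(counts),
        "third_party_domains": third_party,
    }


def apply_rules(model, rules):
    """Return the names of the rules that the behavior model fails."""
    return [name for name, is_satisfied in rules if not is_satisfied(model)]


# Hypothetical crawl log for one application.
requests = [
    "https://app.example.com/canvas",
    "https://ads.example.net/serve?slot=top",
    "https://ads.example.net/serve?slot=side",
    "https://tracker.example.org/pixel.gif",
]
model = build_behavior_model("https://app.example.com/canvas", requests)

# Hypothetical requirement: contact at most one third-party domain.
rules = [
    ("at_most_one_third_party_domain",
     lambda m: len(m["third_party_domains"]) <= 1),
]
failed = apply_rules(model, rules)
print(model["third_party_domains"], failed)
```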
More particularly, in some embodiments, the logged data for a given web application may be transferred to Hadoop or other distributed computing platform for subsequent processing including filtering the data to ascertain which of the logged requests or associated URLs or domain names correspond to ad networks and generating a second list or log of the ad networks. In particular embodiments, this second log may then be queried against a list of known rogue, unscrupulous, banned, or otherwise in-violation ad networks. Furthermore, in some embodiments, crawler 606 may also capture, in the behavior model, various parameters sent to various domains, especially ad network domains, to determine if the web application or ad network providing the ad requested by the web application is passing any data received from social networking system 20 about the test user to other parties. Still further, the enumerated attributes of a landing page, for example, can be compared against one or more profiles to possibly identify a phishing site or some other unauthorized or undesirable application.
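By way of illustration only, querying logged requests against a list of banned ad networks and inspecting request parameters for test-user data might be sketched as follows. This is a minimal sketch under stated assumptions: exact-match comparison of parameter values is a simplification (encoded or hashed values would not be detected), and the function name, domains, and user values are hypothetical.

```python
from urllib.parse import parse_qsl, urlsplit


def find_violations(logged_request_urls, banned_domains, test_user_values):
    """Flag requests to banned domains and parameters carrying test-user data."""
    findings = []
    for url in logged_request_urls:
        parts = urlsplit(url)
        if parts.hostname in banned_domains:
            findings.append(("banned_domain", parts.hostname, url))
        # Compare each query parameter value against known test-user data.
        for key, value in parse_qsl(parts.query):
            if value in test_user_values:
                findings.append(("user_data_in_parameter", key, url))
    return findings


# Hypothetical data: one banned ad network and two known test-user values.
banned = {"rogue-ads.example.net"}
user_values = {"1000123", "test.user@example.com"}
crawl_log = [
    "https://ads.example.org/serve?slot=top",
    "https://rogue-ads.example.net/serve?slot=side",
    "https://ads.example.org/sync?uid=1000123",
]
violations = find_violations(crawl_log, banned, user_values)
for kind, detail, url in violations:
    print(kind, detail, url)
```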
Still other enabled embodiments may include the ability to record, log, or index how a web application appears or functions at a particular point in time to track changes over a determined window of time. Other embodiments may include automatically mapping out the flow paths of a given web application, including recording, logging, or indexing how an application's canvas page appears (and what its functionality is) at various points in a possible user flow. Still other embodiments may relate to a “socially-enabled” search index that includes elements of how one or more applications interact with or appear to logged-in users. Still other embodiments may include mapping or tracking variations in application functionality that depend on who the logged-in user is, or various categories of demographics or other characteristics or attributes available from the user profiles. Yet other embodiments may include indexing, mapping, or tracking how an application's functionality varies over geographic location, browser type, or type of computing device, for example.
As described herein, any of the described processes or methods can be implemented as a series of computer-readable instructions, embodied or encoded on or within a tangible data storage medium, that when executed are operable to cause one or more processors to implement the operations described above. For smaller datasets, the operations described above can be executed on a single computing platform or node. For larger systems and resulting data sets, parallel computing platforms can be used, for example, using Hive to accomplish ad hoc querying, summarization, and data analysis, as well as incorporating statistical modules by embedding mapper and reducer scripts, such as Python or Perl scripts that implement a statistical algorithm. Other development platforms that can leverage Hadoop or other Map-Reduce execution engines can be used as well. The Apache Software Foundation has developed a collection of programs called Hadoop, which includes: (a) a distributed file system; and (b) an application programming interface (API) and corresponding implementation of MapReduce.
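By way of illustration only, a mapper and reducer of the kind mentioned above might be sketched in Python as follows. This is a minimal sketch under stated assumptions: it counts logged requests per domain, runs the map, sort, and reduce phases in a single local process rather than on Hadoop, and the function names and URLs are hypothetical.

```python
from itertools import groupby
from urllib.parse import urlsplit


def mapper(lines):
    """Emit a (domain, 1) pair for every logged request URL."""
    for line in lines:
        url = line.strip()
        if url:
            # Relative or malformed URLs have no hostname; group them together.
            yield urlsplit(url).hostname or "unknown", 1


def reducer(pairs):
    """Sum counts per domain; expects pairs sorted by domain, as after a shuffle."""
    for domain, group in groupby(pairs, key=lambda pair: pair[0]):
        yield domain, sum(count for _, count in group)


def run_local(lines):
    """Simulate the map, sort (shuffle), and reduce phases in one process."""
    return dict(reducer(sorted(mapper(lines))))


# Hypothetical logged request URLs, one per line.
logged = [
    "https://ads.example.net/serve?slot=top\n",
    "https://tracker.example.org/pixel.gif\n",
    "https://ads.example.net/serve?slot=side\n",
]
domain_counts = run_local(logged)
print(domain_counts)
```

On a distributed platform the sort between the two phases would be performed by the framework; `run_local` only imitates that step so the sketch can run on a single node.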
Multiple nodes also facilitate the parallel processing of large databases. In some embodiments, a master server, such as 22a, receives a job from a client and then assigns tasks resulting from that job to slave servers or nodes, such as servers 22b, which do the actual work of executing the assigned tasks upon instruction from the master and which move data between tasks. In some embodiments, the client jobs will invoke Hadoop's MapReduce functionality, as discussed above.
Likewise, in some embodiments, a master server, such as server 22a, governs a distributed file system that supports parallel processing of large databases. In particular, the master server 22a manages the file system's namespace and block mapping to nodes, as well as client access to files, which are actually stored on slave servers or nodes, such as servers 22b. In turn, in some embodiments, the slave servers do the actual work of executing read and write requests from clients and perform block creation, deletion, and replication upon instruction from the master server.
While the foregoing processes and mechanisms can be implemented by a wide variety of physical systems and in a wide variety of network and computing environments, the server or computing systems described below provide example computing system architectures for didactic, rather than limiting, purposes.
The elements of hardware system 800 are described in greater detail below. In particular, network interface 816 provides communication between hardware system 800 and any of a wide range of networks, such as an Ethernet (e.g., IEEE 802.3) network, a backplane, etc. Mass storage 818 provides permanent storage for the data and programming instructions to perform the above-described functions implemented in the servers 22a, 22b, whereas system memory 814 (e.g., DRAM) provides temporary storage for the data and programming instructions when executed by processor 802. I/O ports 820 are one or more serial and/or parallel communication ports that provide communication between additional peripheral devices, which may be coupled to hardware system 800.
Hardware system 800 may include a variety of system architectures, and various components of hardware system 800 may be rearranged. For example, cache 804 may be on-chip with processor 802. Alternatively, cache 804 and processor 802 may be packaged together as a “processor module,” with processor 802 being referred to as the “processor core.” Furthermore, certain embodiments of the present invention may neither require nor include all of the above components. For example, the peripheral devices shown coupled to standard I/O bus 808 may couple to high performance I/O bus 806. In addition, in some embodiments, only a single bus may exist, with the components of hardware system 800 being coupled to the single bus. Furthermore, hardware system 800 may include additional components, such as additional processors, storage devices, or memories.
In one implementation, the operations of the embodiments described herein are implemented as a series of executable modules run by hardware system 800, individually or collectively in a distributed computing environment. In a particular embodiment, a set of software modules and/or drivers implements a network communications protocol stack, parallel computing functions, browsing and other computing functions, optimization processes, and the like. The foregoing functional modules may be realized by hardware, executable modules stored on a computer readable medium, or a combination of both. For example, the functional modules may comprise a plurality or series of instructions to be executed by a processor in a hardware system, such as processor 802. Initially, the series of instructions may be stored on a storage device, such as mass storage 818. However, the series of instructions can be tangibly stored on any suitable storage medium, such as a diskette, CD-ROM, ROM, EEPROM, etc. Furthermore, the series of instructions need not be stored locally, and could be received from a remote storage device, such as a server on a network, via network/communications interface 816. The instructions are copied from the storage device, such as mass storage 818, into memory 814 and then accessed and executed by processor 802.
An operating system manages and controls the operation of hardware system 800, including the input and output of data to and from software applications (not shown). The operating system provides an interface between the software applications being executed on the system and the hardware components of the system. Any suitable operating system may be used, such as the LINUX Operating System, the Apple Macintosh Operating System, available from Apple Computer Inc. of Cupertino, Calif., UNIX operating systems, Microsoft® Windows® operating systems, BSD operating systems, and the like. Of course, other implementations are possible. For example, the functions described herein may be implemented in firmware or on an application specific integrated circuit.
Furthermore, the above-described elements and operations can be comprised of instructions that are stored on storage media. The instructions can be retrieved and executed by a processing system. Some examples of instructions are software, program code, and firmware. Some examples of storage media are memory devices, tape, disks, integrated circuits, and servers. The instructions are operational when executed by the processing system to direct the processing system to operate in accord with the invention. The term “processing system” refers to a single processing device or a group of inter-operational processing devices. Some examples of processing devices are integrated circuits and logic circuitry. Those skilled in the art are familiar with instructions, computers, and storage media.
The present disclosure encompasses all changes, substitutions, variations, alterations, and modifications to the example embodiments herein that a person having ordinary skill in the art would comprehend. Similarly, where appropriate, the appended claims encompass all changes, substitutions, variations, alterations, and modifications to the example embodiments herein that a person having ordinary skill in the art would comprehend. By way of example, while embodiments of the present disclosure have been described as operating in connection with a social networking website, various embodiments of the present invention can be used in connection with any communications facility that supports web applications. Furthermore, in some embodiments the terms “web service” and “web site” may be used interchangeably and additionally may refer to a custom or generalized API on a device, such as a mobile device (e.g., cellular phone, smart phone, personal GPS, personal digital assistant, personal gaming device, etc.), that makes API calls directly to a server.