Identifying and tracking sensitive data
Abstract
A method of classifying privacy relevance of an application programming interface (API) comprises analyzing a set of input applications to identify a plurality of custom APIs and generating a respective taint specification for each identified custom API. The method further comprises generating taint flows based on each taint specification and matching features and associated feature values from the taint flows to a set of feature templates. The method also comprises correlating the matched features and associated feature values with respective privacy relevance of the plurality of custom APIs to identify a set of privacy relevant features. The method further comprises detecting a candidate API, extracting features from the candidate API and comparing the extracted features to the set of privacy relevant features. Based on the comparison, a label is assigned to the candidate API indicating privacy relevance of the candidate API.
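The last two steps of the abstract (comparing a candidate API's features to the set of privacy relevant features, then assigning a label) can be illustrated with a minimal Python sketch. The feature encoding as (name, value) pairs, the overlap ratio, the 0.5 threshold, and the function name `label_candidate` are all assumptions made for illustration; the source does not specify how the comparison is scored.

```python
from typing import FrozenSet, Tuple

# Hypothetical encoding: a feature is a (feature name, feature value) pair.
Feature = Tuple[str, str]


def label_candidate(candidate: FrozenSet[Feature],
                    privacy_relevant: FrozenSet[Feature],
                    threshold: float = 0.5) -> str:
    """Label a candidate API by the share of its features that are privacy relevant.

    The overlap ratio and the threshold are illustrative assumptions,
    not the scoring rule of the claimed method.
    """
    if not candidate:
        return "not privacy relevant"
    overlap = len(candidate & privacy_relevant) / len(candidate)
    return "privacy relevant" if overlap >= threshold else "not privacy relevant"


# Example with made-up feature names and values.
relevant = frozenset({("sink", "network"), ("source", "device_id")})
candidate = frozenset({("sink", "network"), ("arg_type", "String")})
label = label_candidate(candidate, relevant)
```

Here one of the candidate's two features appears in the privacy relevant set, so the overlap ratio is 0.5 and the candidate meets the assumed threshold.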
19 Claims
1. A computer-implemented method of classifying privacy relevance of an application programming interface (API), the computer implemented method comprising:
in response to receiving a set of input applications, analyzing, by a processor of a computer system, the set of input applications to identify a plurality of custom APIs, via one or more abstract syntax trees (ASTs), wherein representative code of the set of input applications is stored in the one or more ASTs;
generating, by the processor of the computer system, a respective taint specification for each identified custom API, each respective taint specification relating one or more sources of data to one or more data sinks;
generating, by the processor of the computer system, one or more taint flows based on the each respective taint specification, the one or more taint flows being a data path and associated data values between a source of data and a data sink, via data recorded from instrumenting the set of input applications based on the each respective taint specification;
matching, by the processor of the computer system, one or more features and associated feature values from the one or more taint flows to a set of feature templates, via a representative code of each application of the set of input applications, wherein the representative code is searched to find one or more occurrences of each identified custom API;
correlating, by the processor of the computer system, the matched one or more features and associated feature values with respective privacy relevance of the plurality of custom APIs to identify a set of privacy relevant features;
clustering, by the processor of the computer system, the custom APIs from the set of input applications into separate groups based on similarity between the matched one or more features and associated feature values of each identified custom API, wherein the clustering is unsupervised;
detecting, by the processor of the computer system, a candidate API;
extracting, by the processor of the computer system, one or more features from the candidate API;
comparing, by the processor of the computer system, the one or more features extracted from the candidate API to the set of privacy relevant features;
assigning, by the processor of the computer system, a label to the candidate API indicating privacy relevance of the candidate API; and
outputting an indication of the privacy relevancy of the candidate API via a user output device.
Dependent claims: 2, 3, 4, 5, 6, 7
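Claim 1 describes a taint specification as relating sources of data to data sinks, and a taint flow as a data path and associated data values between a source and a sink, built from data recorded by instrumentation. A minimal Python sketch of that relationship follows. The class names `TaintSpec` and `TaintFlow`, the (source, sink, value) record layout, and every API name in the example are hypothetical; the claim does not define a concrete data format.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Tuple


@dataclass(frozen=True)
class TaintSpec:
    """Hypothetical taint specification for one custom API."""
    api: str
    sources: FrozenSet[str]
    sinks: FrozenSet[str]


@dataclass(frozen=True)
class TaintFlow:
    """Hypothetical taint flow: a source-to-sink path plus the observed value."""
    api: str
    source: str
    sink: str
    value: str


# Assumed shape of one instrumentation record: (source, sink, observed value).
Record = Tuple[str, str, str]


def taint_flows(spec: TaintSpec, records: List[Record]) -> List[TaintFlow]:
    """Keep only the recorded paths that the specification relates."""
    return [
        TaintFlow(spec.api, source, sink, value)
        for source, sink, value in records
        if source in spec.sources and sink in spec.sinks
    ]


# Example with made-up API names and values.
spec = TaintSpec("Tracker.send", frozenset({"getDeviceId"}), frozenset({"httpPost"}))
records = [("getDeviceId", "httpPost", "device-1234"), ("getTime", "httpPost", "12:00")]
flows = taint_flows(spec, records)
```

Only the first record survives, because `getTime` is not listed as a source in the assumed specification.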
8. A program product comprising a non-transitory processor-readable storage medium having program instructions embodied thereon, wherein the program instructions are configured, when executed by at least one programmable processor, to cause the at least one programmable processor to:
in response to receiving a set of input applications analyze, by a processor of a computer system, the set of input applications to identify a plurality of custom application programming interfaces (APIs), via one or more abstract syntax trees (ASTs), wherein representative code of the set of input applications is stored in the one or more ASTs;
generate, by the processor of the computer system, a respective taint specification for each identified custom API, each respective taint specification relating one or more sources of data to one or more data sinks;
generate, by the processor of the computer system, one or more taint flows based on the each respective taint specification, the one or more taint flows being a data path and associated data values between a source of data and a data sink, via data recorded from instrumenting the set of input applications based on the each respective taint specification;
match, by the processor of the computer system, one or more features and associated feature values from the one or more taint flows to a set of feature templates, via a representative code of each application of the set of input applications, wherein the representative code is searched to find one or more occurrences of each identified custom API;
correlate, by the processor of the computer system, the matched one or more features and associated feature values with respective privacy relevance of the plurality of custom APIs to identify a set of privacy relevant features;
cluster, by the processor of the computer system, the plurality of custom APIs from the set of input applications into separate groups based on similarity between the matched one or more features and associated feature values of each identified custom API, wherein the clustering is unsupervised;
detect, by the processor of the computer system, a candidate API;
extract, by the processor of the computer system, one or more features from the candidate API;
compare the one or more features extracted from the candidate API to the set of privacy relevant features;
assign a label to the candidate API indicating privacy relevance of the candidate API; and
output an indication of the privacy relevancy of the candidate API via a user output device.
Dependent claims: 9, 10, 11, 12, 13
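The claims require that custom APIs be clustered into separate groups by feature similarity without supervision, but they do not name a clustering algorithm. The Python sketch below swaps in one simple possibility, a greedy threshold grouping over Jaccard similarity. The similarity measure, the 0.5 threshold, the function names, and the feature strings are all assumptions for illustration.

```python
from typing import Dict, FrozenSet, List


def jaccard(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """Jaccard similarity of two feature sets (1.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def cluster_apis(features: Dict[str, FrozenSet[str]],
                 min_similarity: float = 0.5) -> List[List[str]]:
    """Greedy, label-free grouping: an API joins the first cluster that
    contains a sufficiently similar member, otherwise it starts a new one."""
    clusters: List[List[str]] = []
    for api in sorted(features):  # sorted for a deterministic result
        for cluster in clusters:
            if any(jaccard(features[api], features[member]) >= min_similarity
                   for member in cluster):
                cluster.append(api)
                break
        else:
            clusters.append([api])
    return clusters


# Example with made-up APIs and feature strings.
api_features = {
    "a.send": frozenset({"sink=network", "value=device_id"}),
    "b.upload": frozenset({"sink=network", "value=device_id", "arg=String"}),
    "c.log": frozenset({"sink=file"}),
}
groups = cluster_apis(api_features)
```

In this example `a.send` and `b.upload` share two of three distinct features and land in one group, while `c.log` shares none and forms its own. No privacy labels are consulted at any point, which is what makes the grouping unsupervised.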
14. A computer system comprising:
a memory;
a network interface; and
a processor communicatively coupled to the memory and the network interface,
wherein the processor is configured to obtain a set of input applications via the network interface and to analyze the set of input applications to identify a plurality of custom application programming interface (APIs) in the set of input applications, via one or more abstract syntax trees (ASTs), wherein representative code of the set of input applications is stored in the one or more ASTs,
wherein the processor is further configured to determine a set of privacy relevant features from the plurality of identified custom APIs and to store the set of privacy relevant features in the memory,
wherein the representative code is searched to find one or more occurrences of each identified custom API,
wherein the processor is further configured to cluster the custom APIs from the set of input applications into separate groups based on similarity between respective identified feature values of each identified custom API, wherein the clustering is unsupervised,
wherein the processor is further configured to detect execution of a candidate API subsequent to storing the set of privacy relevant features in the memory,
wherein the processor is further configured to extract one or more features from the candidate API,
wherein the processor is further configured to compare the extracted one or more features to the set of privacy relevant features in order to determine the privacy relevance of the candidate API,
wherein the processor is further configured to assign a label to the candidate API indicating privacy relevance of the candidate API,
wherein the processor is configured to provide an indication of the privacy relevancy of the candidate API via the user output device,
wherein the processor is configured to determine the set of privacy relevant features by;
generating, by the processor of the computer system, a respective taint specification for each identified custom API, each respective taint specification relating one or more sources of data to one or more data sinks;
generating, by the processor of the computer system, one or more taint flows based on each respective taint specification, the one or more taint flows being a data path and associated data values between a source of data and a data sink, via data recorded from instrumenting the set of input applications based on the each respective taint specification;
matching, by the processor of the computer system, one or more features and associated feature values from the one or more taint flows to a set of feature templates, via a representative code of each application of the set of input applications;
correlating, by the processor of the computer system, the matched one or more features and associated feature values with respective privacy relevance of the plurality of custom APIs to identify the set of privacy relevant features and associated feature values.
Dependent claims: 15, 16, 17, 18, 19
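The independent claims store representative code in abstract syntax trees and search it for occurrences of each identified custom API. As a rough illustration only, the sketch below uses Python's standard `ast` module on Python source as a stand-in for whatever application language the method targets. Treating "custom API" as "a function defined inside the application itself" is an assumption, as are the function name `custom_api_occurrences` and the sample source.

```python
import ast
from collections import Counter


def custom_api_occurrences(source: str) -> Counter:
    """Count calls to functions that the given source defines itself.

    Assumption: a function defined in the application stands in for a
    'custom API'; calls to anything else (builtins, libraries) are ignored.
    """
    tree = ast.parse(source)
    defined = {
        node.name
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    }
    calls: Counter = Counter()
    for node in ast.walk(tree):
        if isinstance(node, ast.Call):
            func = node.func
            if isinstance(func, ast.Name):
                name = func.id        # plain call: send_report(...)
            elif isinstance(func, ast.Attribute):
                name = func.attr      # attribute call: obj.send_report(...)
            else:
                continue
            if name in defined:
                calls[name] += 1
    return calls


# Made-up application source used only for this example.
SAMPLE = '''
def send_report(data):
    print(data)

def main():
    send_report("a")
    send_report("b")
'''
occurrences = custom_api_occurrences(SAMPLE)
```

The sample defines `send_report` and calls it twice, so it is counted twice; `print` is called but not defined in the sample, so it is not treated as a custom API.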
Specification