Many companies, both digital natives and those in traditional regulated industries such as Finance, Healthcare and Telecom, use their data and cloud technologies to solve complex problems, enable rapid ML experimentation, and bring new products to market.
However, before moving data workloads to the cloud, many organisations in regulated industries weigh risk versus reward. The risks fall mainly into three categories: external cybersecurity threats, data exfiltration or insider threats, and cloud provider access to data. Additionally, data residency requirements specified by regulations play a crucial role in choosing cloud over on-premises solutions. While these risks apply equally to many customers, the security bar, and therefore the scrutiny from customers within regulated industries, is far higher than in other industries.
This blog post describes a set of controls you can leverage to create data products in compliance with security and regulatory requirements using Google Cloud services. We have worked with a number of customers and have observed that their viewpoints differ on how controls should be applied. We suggest you consult your CDO, CISO and legal counsel to ensure that the controls on your GCP projects are in line with the regulatory requirements of your company and industry.
Data Residency and Processing
You may be required to store and process your customers’ data within a specified location due to data residency regulations such as GDPR, CCPA etc. In Google Cloud, you can control where data is stored and processed by restricting resource locations at the Organization, Project or individual service level. Of these, policy-based restrictions at the Organization level are the most convenient to set up.
You can use one of the curated value groups to choose a geographic location or locations. Typically, a value group maps to a single region (e.g. London) or to a group of regions (e.g. the European Union). Restricting resource locations at the Organization level applies to all services that support resource locations. Alternatively, you can apply resource location restrictions to individual services, e.g. BigQuery, Pub/Sub, Dataflow, etc. The latter allows application-specific customisation and may be the preferred approach for some use cases.
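As a sketch, an Organization-level location restriction using a curated value group can be set with the `gcloud` CLI; the organization ID below is a placeholder, and you should verify the constraint and value group names against the current documentation:

```shell
# Allow resources to be created only in EU locations, org-wide.
# "123456789" is a placeholder organization ID.
gcloud resource-manager org-policies allow \
    gcp.resourceLocations \
    in:eu-locations \
    --organization=123456789
```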
Protecting Sensitive Data
In addition to data residency requirements, several regulations require that data be protected from unauthorized access and highlight the importance of encryption as a mechanism to safeguard data in the event of unauthorized access or data theft. Regulations such as PCI DSS are very prescriptive. The Bank of England’s Prudential Regulation Authority (PRA) has published a set of supervisory statements detailing its expectations of PRA-regulated firms in relation to outsourcing and cloud adoption.
Google Cloud offers multiple options for encrypting data at rest in services such as Cloud Pub/Sub, Cloud Storage, BigQuery and Persistent Disks. Before delving into the details, we should note that all data on Google Cloud is encrypted by default, as described in this encryption at rest paper. The default encryption uses the AES-256 standard and provides strict key access controls and auditing.
While the default encryption method using a Google-managed key may be sufficient for your use case, Google Cloud offers further encryption options, such as Customer-Managed Encryption Keys (CMEK) and External Key Management (EKM), which have proven very effective for customers in regulated industries. Both options give you fine-grained control over key management: you can disable or revoke a key, and rotate it periodically to reduce the risk of data breaches. Keys can also be generated externally and imported into Cloud KMS, further tightening control over access to data. Additionally, to comply with FIPS 140-2 requirements, you can store encryption keys in Cloud HSM.
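A minimal sketch of creating a CMEK key with automatic rotation might look like the following; the key ring and key names, location, and rotation schedule are illustrative placeholders:

```shell
# Create a key ring and a CMEK key in a specific region,
# with automatic rotation every 90 days.
gcloud kms keyrings create analytics-keyring --location=europe-west2

gcloud kms keys create analytics-key \
    --keyring=analytics-keyring \
    --location=europe-west2 \
    --purpose=encryption \
    --rotation-period=90d \
    --next-rotation-time=2022-01-01T00:00:00Z
```

The resulting key resource name can then be supplied to services such as BigQuery or Cloud Storage when creating CMEK-protected resources.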
In some cases, encrypting data at the storage layer alone may not be sufficient: Personally Identifiable Information (PII) in nearly every industry, and Price Sensitive Information (PSI) in Financial Services, require additional protection. Moreover, many use cases require data obfuscation so that data can be used without revealing the actual information. On Google Cloud, you can use the Cloud Data Loss Prevention (DLP) service to discover, classify, automatically mask, tokenise and transform sensitive elements in structured and unstructured data (e.g. fields containing notes or images).
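To illustrate the idea behind deterministic tokenisation, which Cloud DLP provides natively through its de-identification transformations, here is a self-contained Python sketch. The regex, token format and in-code key are simplifications for illustration only; in practice the key would be held in Cloud KMS and the transformation performed by DLP itself:

```python
import hmac
import hashlib
import re

def tokenise_emails(text: str, key: bytes) -> str:
    """Replace email addresses with deterministic tokens.

    The same input always yields the same token, so joins and
    aggregations still work on the tokenised data.
    """
    def repl(match: re.Match) -> str:
        digest = hmac.new(key, match.group(0).encode(), hashlib.sha256)
        return f"EMAIL({digest.hexdigest()[:12]})"
    return re.sub(r"[\w.+-]+@[\w-]+\.[\w.]+", repl, text)

# Hypothetical key for illustration; never hard-code keys in practice.
key = b"example-key"
masked = tokenise_emails("Contact alice@example.com or bob@example.com", key)
print(masked)
```

Because the tokenisation is deterministic, analysts can still count distinct users or join datasets on the token without ever seeing the raw email address.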
Once data is in the cloud, it undergoes several transformations to create curated datasets that power descriptive and prescriptive analytics. Google Cloud pipeline services such as Dataflow, Dataproc and Data Fusion offer additional controls through CMEK encryption and the Confidential VM offering. You can also manage, monitor and govern data across data lakes and data warehouses using Dataplex (in preview), which helps centralize metadata discovery and data governance.
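As one sketch of applying CMEK to a pipeline service, a Dataflow template job can be run with a customer-managed key; the project, bucket and key names below are placeholders, and you should confirm the flag against the current `gcloud dataflow` reference:

```shell
# Run a Dataflow template job encrypted with a customer-managed key.
gcloud dataflow jobs run wordcount-cmek \
    --region=europe-west2 \
    --gcs-location=gs://dataflow-templates/latest/Word_Count \
    --dataflow-kms-key=projects/my-project/locations/europe-west2/keyRings/analytics-keyring/cryptoKeys/analytics-key \
    --parameters=inputFile=gs://my-bucket/input.txt,output=gs://my-bucket/output
```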
Data encryption and masking techniques help to mitigate external cybersecurity risks.
Mitigating Data Exfiltration Risk
Risks from insider threats are harder to mitigate than external threats, and the inherently multi-tenant nature of cloud environments adds further difficulty. Typically, you will have multiple GCP projects for different applications. VPC Service Controls helps you mitigate the risk of data exfiltration by isolating multi-tenant services and ensuring that only authorized networks can access sensitive data. Service perimeters set up using VPC Service Controls restrict resource access to allowed IP addresses, identities, and trusted client devices. Ingress and egress rules also allow you to securely exchange data with resources outside of the perimeter.
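A minimal service perimeter can be sketched with the `gcloud` CLI as follows; the access policy ID, project number and perimeter name are placeholders:

```shell
# Create a service perimeter around one project, restricting
# BigQuery and Cloud Storage access to inside the perimeter.
gcloud access-context-manager perimeters create data_perimeter \
    --title="data-perimeter" \
    --resources=projects/111111111111 \
    --restricted-services=bigquery.googleapis.com,storage.googleapis.com \
    --policy=222222222222
```

Additional projects can be added to `--resources` so that services in the same perimeter can exchange data freely while calls from outside are blocked.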
Cloud Provider Access Management
Google Cloud’s commitment to access management relies on a set of products: Access Transparency, Access Approval and Key Access Justifications. Access Transparency logs give you information on actions taken by Google personnel and are part of Google’s long-term commitment to transparency and user trust. You can use Access Transparency logs to verify that Google personnel access your content only for valid business reasons, such as fixing an outage or attending to support requests. Access Approval, currently in preview, takes this further by requiring your explicit approval whenever Google personnel need to access your content.
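As a sketch, enrolling a project in Access Approval might look like the following; the project ID and notification address are placeholders, and the exact flags should be checked against the current `gcloud access-approval` reference:

```shell
# Enrol a project in Access Approval for all supported services
# and route approval requests to a security mailbox.
gcloud access-approval settings update \
    --project=my-project \
    --enrolled_services=all \
    --notification_emails="security-team@example.com"
```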
Explaining Model Predictions
To realize the potential of AI in regulated industries, AI must be developed and used responsibly. To that end, we have established principles that guide Google’s AI applications, best practices for sharing our work with communities outside of Google, and programs to operationalise our efforts. Our AI Principles serve as our ethical charter for research and product development. To further put responsible AI into practice, we have made a Responsible AI Toolkit available to help developers apply free responsible AI tools and techniques throughout the ML workflow.
Putting it all together – A functional architecture
We’ve put together an example functional architecture to highlight the components of a secure data pipeline and storage, as discussed in this article. The architecture highlights the various stages of data storage, processing and consumption. At each stage, the process interacts with services providing adequate controls, e.g. encryption, identification and tokenisation of sensitive data, and transparency reporting via logging. All of the components are deployed within a service perimeter to prevent data exfiltration.