Databricks on GCP – A practitioner's guide to data exfiltration protection.


The Databricks Lakehouse Platform provides a unified set of tools for building, deploying, sharing, and maintaining enterprise-grade data solutions at scale. Databricks integrates with Google Cloud security in your cloud account and manages and deploys cloud infrastructure on your behalf.

The overarching goal of this article is to mitigate the following risks:

  • Data access from a browser on the internet or an unauthorized network using the Databricks web application.
  • Data access from a client on the internet or an unauthorized network using the Databricks API.
  • Data access from a client on the internet or an unauthorized network using the Cloud Storage (GCS) API.
  • A compromised workload on a Databricks cluster writing data to an unauthorized storage resource on GCP or the internet.

Databricks supports several GCP-native tools and services that help protect data in transit and at rest. One such service is VPC Service Controls, which provides a way to define security perimeters around Google Cloud resources. Databricks also supports network security controls, such as firewall rules based on network or secure tags. Firewall rules let you control inbound and outbound traffic to your GCE virtual machines.
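
As an illustration, a minimal Terraform sketch of a tag-scoped egress rule might look like the following; the rule name, network name, tag, and priority are placeholders, not values prescribed by Databricks:

resource "google_compute_firewall" "deny_all_egress" {
  name      = "deny-all-egress"              # hypothetical rule name
  network   = "databricks-dataplane"         # hypothetical VPC name
  direction = "EGRESS"
  priority  = 65000                          # low priority so specific allow rules win

  deny {
    protocol = "all"
  }

  destination_ranges = ["0.0.0.0/0"]
  target_tags        = ["databricks-dataplane"]   # network/secure tag on the GCE VMs
}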

Encryption is another important component of data protection. Databricks supports several encryption options, including customer-managed encryption keys, key rotation, and encryption at rest and in transit. Databricks-managed encryption keys are used by default and enabled out of the box. Customers can also bring their own encryption keys managed by Google Cloud Key Management Service (KMS).
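
For example, here is a minimal sketch of a customer-managed key with automatic rotation in Cloud KMS; the key ring and key names are placeholders, and the location should match your workspace region:

resource "google_kms_key_ring" "databricks" {
  name     = "databricks-cmek"                   # hypothetical key ring name
  location = "us-central1"                       # match your workspace region
}

resource "google_kms_crypto_key" "workspace" {
  name            = "databricks-workspace-key"   # hypothetical key name
  key_ring        = google_kms_key_ring.databricks.id
  rotation_period = "7776000s"                   # ~90 days; adjust to your rotation policy
}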

Before we begin, let's take a look at the Databricks deployment architecture:

Databricks is structured to enable secure cross-functional team collaboration while keeping a significant number of backend services managed by Databricks, so you can stay focused on your data science, data analytics, and data engineering tasks.

Databricks operates out of a control plane and a data plane.

  • The control plane includes the backend services that Databricks manages in its own Google Cloud account. Notebook commands and other workspace configurations are stored in the control plane and encrypted at rest.
  • Your Google Cloud account manages the data plane and is where your data resides. This is also where data is processed. You can use built-in connectors so your clusters can connect to data sources to ingest data or for storage. You can also ingest data from external streaming data sources, such as events data, streaming data, IoT data, and more.

The following diagram represents the flow of data for Databricks on Google Cloud:

High-level Architecture

High-level view of the default deployment architecture.

Network Communication Paths

Let's understand the communication paths we want to secure. Databricks can be consumed by users and applications in numerous ways, as shown below:

High-level view of the communication paths.

A Databricks workspace deployment includes the following network paths to secure:

  1. Users who access the Databricks web application, aka the workspace
  2. Users or applications that access the Databricks REST APIs
  3. Databricks data plane VPC network to the Databricks control plane service. This includes the secure cluster connectivity relay and the workspace connection for the REST API endpoints.
  4. Data plane to your storage services
  5. Data plane to external data sources, e.g. package repositories like PyPI or Maven

From an end-user perspective, paths 1 and 2 require ingress controls, and paths 3, 4, and 5 require egress controls.

In this article, our focus is on securing egress traffic from your Databricks workloads and providing the reader with prescriptive guidance on the proposed deployment architecture. While we're at it, we'll also share best practices for securing ingress (user/client into Databricks) traffic.

Proposed Deployment Architecture

Deployment Architecture

Create the Databricks workspace on GCP with the following features:

  1. Customer-managed GCP VPC for the workspace deployment
  2. Private Service Connect (PSC) for web application/API (frontend) and control plane (backend) traffic
    • User to web application / APIs
    • Data plane to control plane
  3. Traffic to Google services over Private Google Access
    • Customer-managed services (e.g. GCS, BQ)
    • Google Cloud Storage (GCS) for logs (health telemetry and audit) and Google Container Registry (GCR) for Databricks runtime images
  4. Databricks workspace (data plane) GCP project secured using VPC Service Controls (VPC SC)
  5. Customer-managed encryption keys
  6. Ingress control for the Databricks workspace/APIs using an IP access list (a Terraform sketch follows this list)
  7. Traffic to external data sources filtered via the VPC firewall [optional]
    • Egress to public package repositories
    • Egress to the Databricks-managed Hive metastore
  8. Databricks to the GCP-managed GKE control plane
    • Databricks control plane to GKE control plane (kube-apiserver) traffic over an authorized network
    • Databricks data plane GKE cluster to GKE control plane over VPC peering
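
As referenced in item 6 above, a minimal sketch of that ingress control using the Databricks Terraform provider; the label and CIDR range below are placeholders:

resource "databricks_workspace_conf" "this" {
  custom_config = {
    "enableIpAccessLists" = true
  }
}

resource "databricks_ip_access_list" "allow_corp" {
  label        = "corp-network"            # hypothetical label
  list_type    = "ALLOW"
  ip_addresses = ["203.0.113.0/24"]        # replace with your corporate egress CIDRs
  depends_on   = [databricks_workspace_conf.this]
}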

Essential Reading

Before you begin, please ensure that you are familiar with these topics:

Prerequisites

  • A Google Cloud account.
  • A Google Cloud project in the account.
  • A GCP VPC with three subnets precreated, see the requirements here (a subnet sketch follows this list)
  • A GCP IP range for the GKE master resources
  • Use the Databricks Terraform provider 1.13.0 or higher. Always use the latest version of the provider.
  • A Databricks on Google Cloud account in the project.
  • A Google account and a Google service account (GSA) with the required permissions.
    • To create a Databricks workspace, the required roles are defined here. Because the GSA may provision additional resources beyond the Databricks workspace, for example a private DNS zone, A records, PSC endpoints, etc., it is better to have the project owner role to avoid any permission-related issues.
  • On your local development machine, you must have:
    • The Terraform CLI: see Download Terraform on the website.
    • The Terraform Google Cloud Provider: there are several options available here and here to configure authentication for the Google provider. Databricks does not have any preference in how Google provider authentication is configured.
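
For the precreated VPC and subnets, one possible shape is sketched below, modeling the pod and service subnets as secondary ranges on the node subnet; all names and CIDR sizes are placeholders, so size them per the linked requirements:

resource "google_compute_network" "dataplane" {
  name                    = "databricks-dataplane"    # hypothetical VPC name
  auto_create_subnetworks = false
}

resource "google_compute_subnetwork" "node" {
  name                     = "databricks-node"        # node (primary) subnet
  region                   = "us-central1"
  network                  = google_compute_network.dataplane.id
  ip_cidr_range            = "10.0.0.0/20"            # placeholder node range
  private_ip_google_access = true                     # needed for Private Google Access

  secondary_ip_range {
    range_name    = "pods"                            # pod (secondary) range
    ip_cidr_range = "10.4.0.0/16"
  }
  secondary_ip_range {
    range_name    = "services"                        # service (secondary) range
    ip_cidr_range = "10.8.0.0/20"
  }
}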

Keep in Mind

  • Both Shared VPC and standalone VPC are supported
  • The Google Terraform provider supports OAuth2 access tokens to authenticate GCP API calls, and that is what we have used to configure authentication for the Google Terraform provider in this article (a provider configuration sketch follows this list)
    • The access tokens are short-lived (1 hour) and not auto-refreshed
  • The Databricks Terraform provider depends upon the Google Terraform provider to provision GCP resources
  • No changes, including resizing the subnet IP address space or changing the PSC endpoint configuration, are allowed post workspace creation
  • If your Google Cloud organization policy has domain-restricted sharing enabled, please make sure that both the Google Cloud customer ID for Databricks (C01p0oudw) and your own organization's customer ID are in the policy's allowed list. See the Google article Setting the organization policy. If you need help, contact your Databricks representative before provisioning the workspace.
  • Make sure that the service account used to create the Databricks workspace has the required roles and permissions.
  • If you have VPC SC enabled on your GCP projects, please update it per the ingress and egress rules listed here.
  • Understand the IP address space requirements; a quick reference table is available over here
  • Here is a list of gcloud commands that you may find useful
  • Databricks does support global access settings in case you need the Databricks workspace (PSC endpoint) to be accessed by a resource running in a different region from where Databricks is.
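
A minimal sketch of the provider wiring under that OAuth2 access-token approach; the variable names mirror the variables table later in this article, the "accounts" alias is an assumption, and the token could be supplied from gcloud auth print-access-token:

variable "google_access_token" {
  type      = string
  sensitive = true    # e.g. output of: gcloud auth print-access-token
}

provider "google" {
  project      = var.google_project_name
  region       = var.google_region
  access_token = var.google_access_token
}

provider "databricks" {
  alias                  = "accounts"                            # hypothetical alias
  host                   = "https://accounts.gcp.databricks.com"
  google_service_account = var.google_service_account_email
}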

Deployment Guide

There are several ways to implement the proposed deployment architecture:

  • Use the UI
  • Databricks Terraform Provider [recommended & used in this article]
  • Databricks REST APIs

Regardless of the approach you use, the resource creation flow would look like this:

Deployment Guide

GCP resource and infrastructure setup

This is a prerequisite step. How the required infrastructure is provisioned, i.e. using Terraform, gcloud, or the GCP cloud console, is out of the scope of this article. Here is a list of the GCP resources required:

GCP Resource Type | Purpose | Details
Project | Create the Databricks workspace (ws) | Project requirements
Service Account | Used with Terraform to create the ws | Databricks Required Role and Permission. In addition to this, you may also need additional permissions depending upon the GCP resources you are provisioning.
VPC + Subnets | Three subnets per ws | Network requirements
Private Google Access (PGA) | Keeps traffic between the Databricks control plane VPC and the customer's VPC private | Configure PGA
DNS for PGA | Private DNS zone for private APIs | DNS setup
Private Service Connect Endpoints | Makes Databricks control plane services available over private IP addresses. Private endpoints need to reside in their own, separate subnet. | Endpoint creation (a sketch follows this table)
Encryption Key | Customer-managed encryption key used with Databricks | Cloud KMS-based key, supports auto key rotation. The key can be "software" or "HSM", aka hardware-backed.
Google Cloud Storage Account for Audit Log Delivery | Storage for Databricks audit log delivery | Configure log delivery
Google Cloud Storage (GCS) Account for Unity Catalog | Root storage for Unity Catalog | Configure Unity Catalog storage account
Add or update VPC SC policy | Add Databricks-specific ingress and egress rules | Ingress & egress YAML along with the gcloud command to create a perimeter. Databricks project numbers and PSC attachment URIs are available over here.
Add/Update Access Level using Access Context Manager | Add the Databricks regional control plane NAT IP to your access policy so that ingress traffic is only allowed from an allow-listed IP | A list of Databricks regional control plane egress IPs is available over here
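
For the PSC endpoint row above, a minimal sketch of one consumer endpoint; all names are placeholders, the service attachment URI comes from the Databricks documentation, and the same pattern is repeated for both the frontend (workspace) and backend (relay) attachments:

resource "google_compute_address" "backend_pe_ip" {
  name         = "backend-pe-ip"            # hypothetical name
  region       = "us-central1"
  subnetwork   = "databricks-pe"            # the dedicated private-endpoint subnet
  address_type = "INTERNAL"
}

resource "google_compute_forwarding_rule" "backend_pe" {
  name                  = "backend-pe"      # hypothetical name
  region                = "us-central1"
  network               = "databricks-dataplane"
  ip_address            = google_compute_address.backend_pe_ip.id
  load_balancing_scheme = ""                # must be empty for PSC consumer endpoints
  target                = "<RELAY-SERVICE-ATTACHMENT-URI>"   # regional service attachment URI
}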

Create Workspace

  • Clone the Terraform scripts from here
    • To keep things simple, grant the project owner role to the GSA on the service and shared VPC projects
  • Update the *.vars files as per your environment setup
Variable | Details
google_service_account_email | [NAME]@[PROJECT].iam.gserviceaccount.com
google_project_name | PROJECT where the data plane will be created
google_region | E.g. us-central1, supported regions
databricks_account_id | Locate your account id
databricks_account_console_url | https://accounts.gcp.databricks.com
databricks_workspace_name | [ANY NAME]
databricks_admin_user | Provide at least one user email id. This user will be made a workspace admin upon creation. This is a required field.
google_shared_vpc_project | PROJECT where the VPC used by the data plane is located. If you are not using a Shared VPC, enter the same value as google_project_name.
google_vpc_id | VPC ID
gke_node_subnet | NODE SUBNET name, aka the PRIMARY subnet
gke_pod_subnet | POD SUBNET name, aka a SECONDARY subnet
gke_service_subnet | SERVICE SUBNET name, aka a SECONDARY subnet
gke_master_ip_range | GKE control plane IP address range. Needs to be /28.
cmek_resource_id | projects/[PROJECT]/locations/[LOCATION]/keyRings/[KEYRING]/cryptoKeys/[KEY]
google_pe_subnet | A dedicated subnet for private endpoints, recommended size /28. Please review the network topology options available before proceeding. For this deployment we are using the "Host the Databricks users (clients) and the Databricks data plane on the same network" option.
workspace_pe | Unique name, e.g. frontend-pe
relay_pe | Unique name, e.g. backend-pe
relay_service_attachment | List of regional service attachment URIs
workspace_service_attachment | List of regional service attachment URIs
private_zone_name | E.g. "databricks"
dns_name | gcp.databricks.com. (the trailing . is required)
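
Putting a few of these together, an illustrative *.vars file might look like this; every value below is a placeholder for your environment, not a value from this article, and the service attachment URIs (omitted here) come from the Databricks documentation:

google_service_account_email   = "deployer@my-project.iam.gserviceaccount.com"
google_project_name            = "my-project"
google_region                  = "us-central1"
databricks_account_id          = "00000000-0000-0000-0000-000000000000"
databricks_account_console_url = "https://accounts.gcp.databricks.com"
databricks_workspace_name      = "secure-workspace"
databricks_admin_user          = "admin@example.com"
google_shared_vpc_project      = "my-project"
google_vpc_id                  = "databricks-dataplane"
gke_node_subnet                = "databricks-node"
gke_pod_subnet                 = "pods"
gke_service_subnet             = "services"
gke_master_ip_range            = "10.12.0.0/28"
cmek_resource_id               = "projects/my-project/locations/us-central1/keyRings/databricks-cmek/cryptoKeys/databricks-workspace-key"
google_pe_subnet               = "databricks-pe"
workspace_pe                   = "frontend-pe"
relay_pe                       = "backend-pe"
private_zone_name              = "databricks"
dns_name                       = "gcp.databricks.com."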

If you do not want to use the IP access list and wish to completely lock down workspace access (UI and APIs) outside of your corporate network, then you would need to:

  • Comment out the databricks_workspace_conf and databricks_ip_access_list resources in workspace.tf
  • Update the databricks_mws_private_access_settings resource's public_access_enabled setting from true to false in workspace.tf (see the sketch after this list)
    • Please note that the public_access_enabled setting cannot be changed after the workspace is created
  • Make sure that you have Interconnect Attachments, aka vlanAttachments, created so that traffic from on-premises networks can reach the GCP VPC (where the private endpoints exist) over a dedicated interconnect connection.
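
For reference, a minimal sketch of what that resource could look like once public access is disabled; the resource and variable names are assumptions and your workspace.tf may differ, with the Databricks provider configured against the account console:

resource "databricks_mws_private_access_settings" "pas" {
  provider                     = databricks.accounts             # account-level provider
  private_access_settings_name = "pas-secure-workspace"          # hypothetical name
  region                       = var.google_region
  public_access_enabled        = false                           # no UI/API access from the public internet
}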

Successful Deployment Check

Upon successful deployment, the Terraform output would look like this:

backend_end_psc_status = "Backend psc status: ACCEPTED"
front_end_psc_status = "Frontend psc status: ACCEPTED"
workspace_id = "workspace id: <UNIQUE-ID.N>"
ingress_firewall_enabled = "true"
ingress_firewall_ip_allowed = tolist([
"xx.xx.xx.xx",
"xx.xx.xx.xx/xx"
])
service_account = "Default SA attached to GKE nodes
<SA-NAME>@<PROJECT>.iam.gserviceaccount.com"
workspace_url = "https://<UNIQUE-ID.N>.gcp.databricks.com"

Post Workspace Creation

  • Validate that the DNS records are created; follow this doc to understand the required A records.
  • Configure Unity Catalog (UC)
  • Assign the workspace to UC
  • Add users/groups to the workspace via UC Identity Federation
  • Auto-provision users/groups from your identity providers
  • Configure Audit Log Delivery
  • If you are not using UC and wish to use the Databricks-managed Hive metastore, then add an egress firewall rule to your VPC as explained here (a sketch of this rule follows this list)
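
A minimal sketch of such an egress allow rule; the IP, port, tag, and priority below are placeholders, so take the actual regional metastore address and port from the linked Databricks documentation:

resource "google_compute_firewall" "allow_hive_metastore" {
  name      = "allow-databricks-hive-metastore"   # hypothetical rule name
  network   = "databricks-dataplane"              # hypothetical VPC name
  direction = "EGRESS"
  priority  = 1000                                # evaluated before a broad egress deny

  allow {
    protocol = "tcp"
    ports    = ["3306"]                           # placeholder port; confirm in Databricks docs
  }

  destination_ranges = ["203.0.113.10/32"]        # placeholder: regional metastore IP from Databricks docs
  target_tags        = ["databricks-dataplane"]   # hypothetical tag on data plane nodes
}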

Getting Started with Data Exfiltration Protection with Databricks on Google Cloud

We discussed using cloud-native security controls to implement data exfiltration protection for your Databricks on GCP deployments, all of which can be automated to enable data teams at scale. Some other things that you may want to consider and implement as part of this project are:
