Deploy Azure Databricks in a Private Link scenario

This document aims to provide guidance on deploying Azure Databricks in an Azure C3 Hardened subscription and validating access to data stored in secured Storage Accounts protected by Private Endpoints.

[[TOC]]


Introduction

The Azure Databricks solution requires a particular Virtual Network configuration that is not compatible with the Virtual Network provided in your Azure C3 Hardened subscription. This document will help you deploy an Azure Databricks solution in your subscription and access a Storage Account containing data protected by Private Endpoints.

At the center of the diagram is the tdp-he-vnet-rg resource group, which contains the Virtual Network provided with your C3 Azure Hardened subscription. On the right side of the diagram is a dedicated resource group named DemoDataBrickC3. This resource group contains a Storage Account holding sensitive data that must not be exposed on the Internet. Access to this Storage Account is protected: only a C3 Azure Hardened virtual machine can access it, using the Private Endpoint of the Storage Account.

Overview

On the left side of the diagram is a new resource group named DemoDataBricks that contains a dedicated Virtual Network, not connected to the TDF Virtual Network. In this document we provision the Databricks infrastructure (left side of the diagram) and configure a Private Endpoint linked to the Storage Account. The only point not covered in this document is how to configure the second Private Endpoint, connected to the tdp-he-vnet.

DNS considerations

When creating a Private Endpoint, we must provide name resolution for the new network interface that will be connected to the Storage Account. The TDF central DNS infrastructure is responsible for managing the content of the private DNS zones related to Azure services configured for Private Endpoint. The service only creates these DNS records if the Private Endpoint is linked to the tdp-he-vnet Virtual Network, not for any other Virtual Network.

So when we create the Private Endpoint on the left side of the diagram, we must create the Private DNS zone related to the service, privatelink.blob.core.windows.net, and link it to the Virtual Network dedicated to Azure Databricks. As a result, even though two Azure Private DNS zones have the same name, they do not return the same result for DNS resolution: each Virtual Network resolves names through the zone it is linked to.
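
The effect of having two zones with the same name can be sketched with a small model. This is only an illustration, not how Azure DNS is implemented: the Virtual Network name databricks-vnet, the record name mystorageaccount and both IP addresses are hypothetical values chosen for the example.

```scala
// Toy model of a private DNS zone: a name, the Virtual Networks it is
// linked to, and its A records (record name -> private IP address).
final case class PrivateZone(
  name: String,
  linkedVnets: Set[String],
  records: Map[String, String])

// Two zones with the same name, each linked to a different Virtual Network.
// All names and addresses below are hypothetical.
val tdfZone = PrivateZone(
  "privatelink.blob.core.windows.net",
  Set("tdp-he-vnet"),
  Map("mystorageaccount" -> "10.0.0.5"))

val databricksZone = PrivateZone(
  "privatelink.blob.core.windows.net",
  Set("databricks-vnet"),
  Map("mystorageaccount" -> "10.1.2.4"))

// A query only sees the zones linked to the Virtual Network it comes from.
def resolve(zones: Seq[PrivateZone], fromVnet: String, record: String): Option[String] =
  zones
    .filter(_.linkedVnets.contains(fromVnet))
    .flatMap(_.records.get(record))
    .headOption

val zones = Seq(tdfZone, databricksZone)
```

In this model the same record resolves to the address of a different Private Endpoint depending on the Virtual Network that issues the query, and a Virtual Network linked to neither zone gets no private answer at all.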

Deployment steps

This section describes the steps required to deploy your Azure Databricks solution into a dedicated Virtual Network:

  • Provision a dedicated Virtual Network
  • Provision a dedicated subnet for Private Endpoint
  • Disable Network policies for Private Endpoints
  • Provision an Azure Databricks workspace
  • Provision an Azure Databricks cluster
  • Provision a Storage Account with sample data
  • Secure Storage Account with a Private Endpoint
  • Validate DNS resolution from Databricks clusters
  • Mount file to Azure Databricks

Provision a dedicated Virtual Network

This dedicated Virtual Network will be used to deploy the Azure Databricks solution. As illustrated below, the Virtual Network has two address spaces (see Azure Databricks Virtual Network requirements).

Dedicated Virtual Network

This configuration cannot be performed on the Virtual Network provided with your C3 Azure Hardened subscription. For this reason, you must create a new Virtual Network dedicated to the Azure Databricks solution. Only one requirement applies: this new Virtual Network must be located in the same Azure region as the Storage Account we will be accessing.

Provision a dedicated subnet for Private Endpoint

Private Endpoints come with some limitations. At the time of writing, Private Endpoints are not yet compatible with the following features:

  • Network Security Groups
  • Route tables

Because some products such as Databricks rely on these network features, we must create a dedicated subnet used only by Private Endpoints. As documented in Network security group rules, the Databricks solution comes with a set of Network Security Groups applied to the subnets used by the solution. To use Private Endpoints, we therefore need to introduce a separate subnet dedicated to them.

Disable Network policies for Private Endpoints

By default, network policies such as user-defined routes and Network Security Groups are not compatible with Private Endpoints. Any newly created Virtual Network must be reconfigured to disable Private Endpoint network policies on the subnet that will host Private Endpoints. This procedure is described in Manage network policies for private endpoints.

This operation must be performed on the dedicated Private Endpoint subnet created in the previous step.

Provision an Azure Databricks workspace

The next step is to create an Azure Databricks workspace. This resource must:

  • Be located in the same Azure subscription as the Storage Account containing the data
  • Be located in the same Azure region as the Storage Account

When creating the Azure Databricks workspace, be sure to select the following options:

  • Deploy Azure Databricks workspace in your own Virtual Network (VNET)
  • Deploy Azure Databricks workspace with Secure Cluster Connectivity (No Public IP)
  • Select the Virtual Network created in the previous step

Dedicated Virtual Network

Note: the CIDR ranges of the public and private subnets are important. The size of the address space determines how many nodes you can have. Additional information is available in Address space and maximum cluster nodes.
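
As a rough sketch of that sizing rule, the upper bound on cluster nodes can be estimated from the subnet prefix length. The sketch assumes that Azure reserves five IP addresses in every subnet and that each cluster node consumes one address in the public subnet and one in the private subnet. The function names are hypothetical, and the linked documentation remains the authoritative source for the actual limits.

```scala
// Number of IP addresses Azure reserves in every subnet (assumption
// stated above).
val azureReservedAddresses = 5L

// Estimated maximum number of cluster nodes for one subnet, given its
// CIDR prefix length (for example 26 for a /26 subnet).
def maxClusterNodes(prefixLength: Int): Long = {
  require(prefixLength >= 1 && prefixLength <= 32,
    "prefix length must be between 1 and 32")
  val totalAddresses = 1L << (32 - prefixLength)
  math.max(0L, totalAddresses - azureReservedAddresses)
}

// Each node needs one address in both subnets, so the smaller subnet
// limits the cluster size.
def maxNodesForWorkspace(publicPrefix: Int, privatePrefix: Int): Long =
  math.min(maxClusterNodes(publicPrefix), maxClusterNodes(privatePrefix))
```

Under these assumptions a pair of /26 subnets allows at most 59 nodes, and pairing a /24 subnet with a /26 subnet does not raise that limit.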

Once the deployment process completes, you should have an Azure Databricks workspace as illustrated below:

Azure Databricks Workspace

This workspace was provisioned using the Standard SKU, which does not provide the Virtual Network peering feature. This feature is only available in the Premium SKU. In the C3 Azure Hardened subscription, this Azure Databricks feature cannot be used.

Provision an Azure Databricks cluster

An Azure Databricks workspace needs compute resources to process data. Connect to the Databricks management portal URL and create a cluster using the "New cluster" option.

Create Azure Databricks cluster

The creation process is relatively simple. The only point to consider is the Azure quota for the virtual machine size you select: if you do not have enough quota for the selected virtual machine, the deployment will fail, as illustrated below:

Cluster creation fails because of quotas

The cluster deployment takes a few minutes. The following operations are performed:

  • A dedicated resource group is created to host all cluster-related resources
  • Multiple virtual machines are configured with two network interfaces, both connected to the dedicated Virtual Network we provisioned, but not on the same subnet

Provision a Storage Account with sample data

The Azure Databricks cluster will consume sample data located in a dedicated Storage Account. This Storage Account must:

  • Be created in the same Azure subscription as the Azure Databricks workspace and cluster
  • Be created in the same Azure region as the Azure Databricks workspace and cluster
  • Be created with the Allow Blob public access option set to Disabled
  • Have a container named demo with its public access level set to Private
  • Contain the 1000_Sales_Records.csv file, uploaded into the demo container

Once the file is uploaded into the demo container, click the Generate SAS option for the uploaded file as illustrated below:

Generate a SAS token

Keep the Blob SAS token. It will be used later in this document as the access key for the Storage Account.
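
Before using the token, it can be worth checking that it grants read permission and has not expired. A SAS token is a query string whose fields include sp (permissions), se (expiry time), sr (resource type) and sig (signature). The sketch below parses those fields. The helper names are hypothetical and the sample token is made up, with a placeholder instead of a real signature.

```scala
import java.net.URLDecoder
import java.time.Instant
import scala.util.Try

// Split a SAS token ("sp=r&se=...&sig=...") into its query parameters.
def parseSas(token: String): Map[String, String] =
  token.stripPrefix("?").split("&").toSeq
    .filter(_.contains("="))
    .map { pair =>
      val i = pair.indexOf('=')
      pair.substring(0, i) -> URLDecoder.decode(pair.substring(i + 1), "UTF-8")
    }
    .toMap

// True when the token grants read permission ("r" in sp) and its expiry
// time (se) is later than `now`. A missing or unparsable se counts as expired.
def isUsableForRead(token: String, now: Instant): Boolean = {
  val params = parseSas(token)
  val canRead = params.get("sp").exists(_.contains("r"))
  val notExpired = params.get("se")
    .flatMap(se => Try(Instant.parse(se)).toOption)
    .exists(_.isAfter(now))
  canRead && notExpired
}

// Hypothetical token for illustration only.
val sampleSas =
  "sp=r&st=2021-01-01T08:00:00Z&se=2021-01-02T08:00:00Z&spr=https&sv=2020-02-10&sr=b&sig=PLACEHOLDER%3D"
```

In a notebook, calling isUsableForRead(sas, Instant.now()) before mounting gives an early hint when a mount failure is caused by an expired token rather than by the network configuration.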

Secure Storage Account with a Private Endpoint

The Azure Databricks cluster will not access the Storage Account publicly but through a Private Endpoint. A Private Endpoint can be compared to the network interface of a virtual machine. We will create a Private Endpoint connected to a subnet of the Virtual Network we provisioned at the beginning of this document. The Private Endpoint configuration is documented in the table below:

Parameter | Configuration
Subscription | Same subscription as Azure Databricks
Azure region | Same region as Azure Databricks
Connection method | Connect to an Azure resource in my directory
Resource subscription | Same subscription as Azure Databricks
Resource type | Microsoft.Storage/storageAccounts
Resource | Name of the Storage Account created in the previous step
Target sub-resource | Blob

Private Endpoint service configuration

The last step of the Private Endpoint configuration is to link it to the dedicated Azure Databricks Virtual Network created in the first step. The network configuration is documented in the table below:

Parameter | Configuration
Virtual Network | Virtual Network created for the Databricks solution
Subnet | Dedicated subnet, distinct from the public and private subnets created for Databricks
Integrate with Private DNS zone | Yes
DNS zone | Created in the same Azure subscription

Make sure that the newly created private DNS zone named privatelink.blob.core.windows.net has a Virtual Network link to the Virtual Network created in the first step, as illustrated below:

Private DNS Zone Virtual Network link

Creating a Private Endpoint establishes a link with the related Azure resource. Access to the resource requires an approval, which can be:

  • Automatic
  • Manual

When you own the Azure resource, approval is automatic. That is the case when you create a Private Endpoint for a resource located in your Azure subscription. If you do not manage the resource, its owner needs to approve the Private Endpoint to finalize the setup. That will be the case if the resource is located in another Azure subscription.

Private Endpoint approval process

Once you have created your Private Endpoint resource, the Azure resource owner needs to validate the request in the Azure Private Link Center of the Azure portal or at the Azure resource level. In the illustration below, the Azure resource owner has a pending request that needs validation. The owner must review this request carefully because the Private Endpoint resource can be located in any Azure subscription.

Private Endpoint approval

Azure resource owners should only accept Private Endpoints from Azure subscriptions they trust, and should only approve requests from Azure subscriptions related to the Thales Digital Factory with the same level of security. Trusting a Private Endpoint from a C2 Azure Hardened subscription linked to a resource located in an Azure C3 Hardened subscription is strictly forbidden.
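
The approval rule above can be written down as a small decision function, which may help when reviewing pending requests. This is only a model of the rule as stated in this document: the type and field names are hypothetical, and nothing here calls an Azure API.

```scala
// Security level of a hardened subscription.
sealed trait Classification
case object C2 extends Classification
case object C3 extends Classification

// A pending Private Endpoint request, as seen by the resource owner.
// All field names are hypothetical.
final case class EndpointRequest(
  requesterIsTdfSubscription: Boolean, // requesting subscription belongs to the Thales Digital Factory
  requesterLevel: Classification,      // level of the subscription hosting the Private Endpoint
  resourceLevel: Classification)       // level of the subscription hosting the target resource

// Approve only requests from TDF subscriptions with the same security
// level as the resource.
def shouldApprove(request: EndpointRequest): Boolean =
  request.requesterIsTdfSubscription &&
    request.requesterLevel == request.resourceLevel
```

With this model, a request from a C2 subscription targeting a resource in a C3 subscription is rejected, as is any request from a subscription outside the Thales Digital Factory.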

Validate DNS resolution from Databricks clusters

Azure Databricks cluster virtual machines use the DNS resolution provided by the Virtual Network. Go back to the Azure Databricks administration portal and use the Create Notebook option with the settings documented in the table below:

Parameter | Configuration
Default Language | Scala
Cluster | Name of the cluster created previously

Create a Notebook

Type the following command in the notebook:

    %sh nslookup <name of your storage account>.privatelink.blob.core.windows.net

Test Private DNS resolution

You should get a successful DNS resolution provided by the default Azure DNS (168.63.129.16). The answer should contain an IPv4 address from the Virtual Network provisioned for the Azure Databricks solution.
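
To check the returned address without reading CIDR ranges by eye, a notebook cell can test whether it belongs to one of the address spaces of the Virtual Network. The sketch below is a plain IPv4 range check. The helper names, the address spaces and the sample addresses are hypothetical, so replace them with the values of your own Virtual Network.

```scala
// Convert a dotted IPv4 address such as "10.1.2.4" to its 32-bit value.
def ipv4ToLong(ip: String): Long =
  ip.split('.').foldLeft(0L)((acc, octet) => (acc << 8) | octet.toLong)

// True when `ip` belongs to the CIDR block, for example "10.1.0.0/16".
def isInCidr(ip: String, cidr: String): Boolean = {
  val parts = cidr.split('/')
  val prefix = parts(1).toInt
  val mask = (0xFFFFFFFFL << (32 - prefix)) & 0xFFFFFFFFL
  (ipv4ToLong(ip) & mask) == (ipv4ToLong(parts(0)) & mask)
}

// True when `ip` belongs to any address space of the Virtual Network.
def isInVnet(ip: String, addressSpaces: Seq[String]): Boolean =
  addressSpaces.exists(isInCidr(ip, _))

// Hypothetical address spaces of the dedicated Virtual Network.
val databricksVnet = Seq("10.1.0.0/16", "10.2.0.0/24")
```

In a notebook, the address to check could be taken from the nslookup output or obtained with java.net.InetAddress.getByName(host).getHostAddress. A public address in the answer suggests that the private DNS zone is not linked to the Virtual Network.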

Mount file to Azure Databricks

In this last step we use the notebook to mount the content of our Azure container into Azure Databricks using the following commands:

    val containerName = "demo"
    val storageAccountName = "<Your Storage Account>"
    // Blob SAS token generated for the file in the previous step
    val sas = "<Shared Access token created for the file>"
    // Spark configuration key that carries the SAS token for this container
    val config = "fs.azure.sas." + containerName + "." + storageAccountName + ".blob.core.windows.net"
    dbutils.fs.mount(
      source = "wasbs://" + containerName + "@" + storageAccountName + ".blob.core.windows.net/1000_Sales_Records.csv",
      mountPoint = "/mnt/myfile",
      extraConfigs = Map(config -> sas))

Mount file to Azure Databricks

The process should take a few seconds. You should see:

  • A job was created to process the operation
  • The job status is Succeeded
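
A frequent cause of mount failures is a mismatch between the configuration key and the source URL, since both must name the same container and Storage Account. As a sketch, the two strings used by the mount command can be built by small helpers so that they cannot drift apart. The helper names and the account name mystorageaccount are hypothetical.

```scala
// Spark configuration key that carries the SAS token for a container.
def sasConfigKey(container: String, account: String): String =
  s"fs.azure.sas.$container.$account.blob.core.windows.net"

// wasbs:// URL of a blob inside a container.
def wasbsSource(container: String, account: String, blobPath: String): String =
  s"wasbs://$container@$account.blob.core.windows.net/$blobPath"

// Example values: "demo" comes from this document, the account name is made up.
val exampleKey = sasConfigKey("demo", "mystorageaccount")
val exampleSource = wasbsSource("demo", "mystorageaccount", "1000_Sales_Records.csv")
```

The results can then be passed to dbutils.fs.mount as the source argument and as the key of the extraConfigs map.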

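If the mount was successful, we should be able to access the file content using the following command:

    val mydf = spark.read
      .option("header", "true")       // first row holds the column names
      .option("inferSchema", "true")  // let Spark infer the column types
      .csv("/mnt/myfile")             // mount point used in the mount command
    display(mydf)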

You should be able to see the following content:

Storage Account content accessed from Azure Databricks

Additional notes

Enforce security at Storage Account level

From a technical point of view, it is possible to strengthen the security of the Storage Account by setting the Allow access from option to Selected networks. Connections to the Storage Account made through the Private Endpoint won't be affected. Any other access to the data plane of the Storage Account would then require either configuring a Service Endpoint for the Storage service or creating a Private Endpoint connected to the tdp-he-vnet Virtual Network.

Enforce Storage Account Strict isolation

Thales DataLake use case

This Databricks setup can also be used to access the DataLake services provided by the Thales Digital Factory. This scenario only requires some minor changes:

  • Private Endpoint approval
  • Use Service Principal for authentication
  • Change the notebook authentication sequence

When creating the Private Endpoint referencing the Storage Account used by the Thales Digital Factory, there is no automatic approval (the TDF DataLake team owns the Storage Account, not you). Your Private Endpoint stays in a pending state until the TDF DataLake team approves it. The TDF DataLake team can only approve your Private Endpoint if it complies with a simple rule: the Private Endpoint must be created in a C3 Azure Hardened subscription (no other subscription type is allowed).

For security reasons, the TDF DataLake team does not allow SAS key authentication to consume their service. You must rely on a Service Principal or Azure Managed Identity and request that this identity be granted access to the TDF DataLake Storage Account at the data plane level.

Finally, you must adapt the notebook example provided in this document as shown below:

    val configs = Map(
      "fs.azure.account.auth.type" -> "OAuth",
      "fs.azure.account.oauth.provider.type" -> "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
      "fs.azure.account.oauth2.client.id" -> "<Azure AD Application Client ID>",
      "fs.azure.account.oauth2.client.secret" -> "<Azure AD Application Secret>",
      "fs.azure.account.oauth2.client.endpoint" -> "https://login.microsoftonline.com/<Azure AD Tenant ID>/oauth2/token")

    dbutils.fs.mount(
      source = "abfss://<Storage Account Container name>@<Storage Account Name>.dfs.core.windows.net/<File>",
      mountPoint = "/mnt/<Local mount name>",
      extraConfigs = configs)

Access to TDF DataLake
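
To keep the tenant, client and Storage Account values in one place, the configuration map and the source URL can be produced by small helper functions. This is a sketch under assumptions: the function names and every sample value in the comments are hypothetical, and only the configuration keys come from the notebook example above.

```scala
// Build the OAuth configuration map expected by the abfss driver.
def oauthConfigs(clientId: String, clientSecret: String, tenantId: String): Map[String, String] =
  Map(
    "fs.azure.account.auth.type" -> "OAuth",
    "fs.azure.account.oauth.provider.type" ->
      "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id" -> clientId,
    "fs.azure.account.oauth2.client.secret" -> clientSecret,
    "fs.azure.account.oauth2.client.endpoint" ->
      s"https://login.microsoftonline.com/$tenantId/oauth2/token")

// abfss:// URL of a path inside a container. Note the "." (not "@")
// between the Storage Account name and dfs.core.windows.net.
def abfssSource(container: String, account: String, path: String): String =
  s"abfss://$container@$account.dfs.core.windows.net/$path"

// In a notebook the result would be passed to the mount command, e.g.
//   dbutils.fs.mount(
//     source = abfssSource("<container>", "<account>", "<file>"),
//     mountPoint = "/mnt/<Local mount name>",
//     extraConfigs = oauthConfigs("<client id>", "<secret>", "<tenant id>"))
```

Rather than pasting the application secret into the notebook, it is usually preferable to read it from a Databricks secret scope with dbutils.secrets.get and pass the result as clientSecret.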

Security consideration

When creating the Private Endpoint resource, make sure to select a resource located in your subscription. In this case, the Private Endpoint configuration is approved automatically. For TDF, this is the only supported scenario: establishing a trust relationship with resources that are not located in your Azure Hardened subscription is forbidden.

Conclusion

In conclusion, data stored in a Storage Account in a C3 Azure Hardened subscription can be processed by an Azure Databricks instance that is not connected to the same Virtual Network. When Databricks clusters access the Storage Account, traffic travels over the Microsoft backbone and is not exposed on the Internet.