Deploy Azure DataBricks in a Private Link scenario
This document aims to provide guidance on deploying Azure DataBricks in an Azure C3 Hardened subscription and validating access to data stored in secured Storage Accounts protected by Private Endpoints.
Introduction
The Azure DataBricks solution requires a particular Virtual Network configuration that is not compatible with the Virtual Network provided in your Azure C3 Hardened subscription. This document will help you deploy an Azure DataBricks solution in your subscription and access a Storage Account containing data protected by Private Endpoints.
At the center of the diagram is the tdp-he-vnet-rg resource group, which contains the Virtual Network provided with your C3 Azure Hardened subscription. On the right side of the diagram is a dedicated resource group named DemoDataBrickC3. This resource group contains a Storage Account holding sensitive data that must not be exposed on the Internet. Access to this Storage Account is protected: only a C3 Azure Hardened virtual machine can access it (using the Private Endpoint of the Storage Account).
On the left side of the diagram is a new resource group named DemoDataBricks that contains a dedicated Virtual Network, not connected to the TDF Virtual Network. In this document we will provision the DataBricks infrastructure (left side of the diagram) and configure a Private Endpoint linked to the Storage Account. The only point not covered in this document is how to configure the second Private Endpoint, connected to the tdp-he-vnet.
DNS considerations
When creating a Private Endpoint, we must provide name resolution for the new network interface that will be connected to the Storage Account. The TDF central DNS infrastructure is responsible for managing the content of the private DNS zones related to Azure services configured for Private Endpoint. The service only creates these DNS records if the Private Endpoint is linked to the tdp-he-vnet Virtual Network, not to any other Virtual Network.
So, when we create the Private Endpoint on the left side of the diagram, we must create the Private DNS zone related to the service, privatelink.blob.core.windows.net, and link it to the Virtual Network dedicated to Azure DataBricks. As a result, even though two Azure Private DNS zones have the same name, they do not return the same result for DNS resolution.
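The effect of having two zones with the same name can be illustrated with a small model. This is only a sketch: the record name, the IP addresses and the `databricks-vnet` label are made up for illustration, and real resolution is performed by Azure DNS, not by code like this.

```scala
// Hypothetical model: a Private DNS zone only answers for the Virtual Network it is linked to.
final case class PrivateZone(name: String, records: Map[String, String])

val zoneName = "privatelink.blob.core.windows.net"

// Two zones with the same name but different content (names and addresses are invented).
val zonesByVnet: Map[String, PrivateZone] = Map(
  "tdp-he-vnet"     -> PrivateZone(zoneName, Map("mystorage" -> "10.10.0.4")),
  "databricks-vnet" -> PrivateZone(zoneName, Map("mystorage" -> "10.139.64.4"))
)

// The answer depends on the Virtual Network the query comes from.
def resolve(vnet: String, host: String): Option[String] =
  zonesByVnet.get(vnet).flatMap(_.records.get(host))
```

In this model, the same record name resolves to a different private address in each Virtual Network, and a Virtual Network without a linked zone gets no private answer at all.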
Deployment steps
This section describes the steps required to deploy your Azure DataBricks solution into a dedicated Virtual Network:
- Provision a dedicated Virtual Network
- Provision a dedicated subnet for Private Endpoint
- Disable Network policies for Private Endpoints
- Provision an Azure DataBricks workspace
- Provision an Azure DataBricks cluster
- Provision a Storage Account with sample data
- Secure Storage Account with a Private Endpoint
- Validate DNS resolution from DataBricks clusters
- Mount file to Azure DataBricks
Provision a dedicated Virtual Network
This dedicated Virtual Network will be used to deploy the Azure DataBricks solution. As illustrated below, the Virtual Network has two address spaces (see Azure DataBricks Virtual Network requirements).
This configuration cannot be applied to the Virtual Network provided with your C3 Azure Hardened subscription. For this reason, you must create a new Virtual Network dedicated to the Azure DataBricks solution. There is only one requirement: this new Virtual Network must be located in the same Azure region as the Storage Account we will access.
Provision a dedicated subnet for Private Endpoint
Using Private Endpoints comes with some limitations. At the time of writing, Private Endpoints are not compatible with the following features:
- Network Security Groups
- Route Tables
Because some products such as DataBricks rely on these network features, we must create a dedicated subnet used only by Private Endpoints. As documented in Network security group rules, the DataBricks solution comes with a set of Network Security Groups applied to the subnets it uses. To use Private Endpoints, we therefore need to introduce a separate subnet dedicated to them.
Disable Network policies for Private Endpoints
By default, network policies such as user-defined routes and Network Security Groups are not compatible with Private Endpoints. Any newly created Virtual Network must be reconfigured to disable Private Endpoint network policies on the subnet that will host Private Endpoints. This procedure is described in Manage network policies for private endpoints.
This operation must be performed on the dedicated Private Endpoint subnet created in the previous step.
Provision an Azure DataBricks workspace
The next step is to create an Azure DataBricks workspace. This resource must:
- Be located in the same Azure subscription as the Storage Account containing data
- Be located in the same Azure region as the Storage Account
When creating the Azure DataBricks workspace, be sure to select the following options:
- Deploy Azure DataBricks workspace in your own Virtual Network (VNET)
- Deploy Azure DataBricks workspace with secure Cluster Connectivity (No Public IP)
- Select the Virtual Network created in the previous step
Note: the CIDR ranges of the public and private subnets are important, because the size of the address space determines how many nodes you can have. Additional information is available in Address space and maximum cluster nodes.
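As a rough way to reason about subnet sizing, the sketch below estimates the node limit from the two prefix lengths. It assumes that Azure reserves five addresses in every subnet and that each cluster node consumes one address in each of the two subnets; the function names are illustrative and the authoritative figures remain those of the Azure DataBricks documentation.

```scala
// Azure reserves 5 IP addresses in every subnet:
// network address, default gateway, two for Azure DNS, and broadcast.
val AzureReservedIps = 5

// Addresses left for cluster nodes in a subnet with the given CIDR prefix length.
// The accepted range (/8 to /29) is an assumption of this sketch.
def usableAddresses(prefixLength: Int): Int = {
  require(prefixLength >= 8 && prefixLength <= 29, "prefix length must be between 8 and 29")
  (1 << (32 - prefixLength)) - AzureReservedIps
}

// Each node needs one address in both subnets, so the smaller subnet sets the limit.
def maxClusterNodes(publicPrefix: Int, privatePrefix: Int): Int =
  math.min(usableAddresses(publicPrefix), usableAddresses(privatePrefix))
```

For example, `maxClusterNodes(26, 26)` gives 59, and pairing a /24 subnet with a /26 subnet gives the same result because the /26 is the limiting subnet.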
Once the deployment process completes, you should have an Azure DataBricks workspace as illustrated below:
This workspace was provisioned using the Standard SKU, which does not provide the Virtual Network peering feature. This feature is only available in the Premium SKU. In the C3 Azure Hardened subscription, this Azure DataBricks feature cannot be used.
Provision an Azure DataBricks cluster
An Azure DataBricks workspace needs compute resources to process data. Connect to the DataBricks management portal URL and create a cluster using the "New cluster" option.
The creation process is relatively simple; the only point to consider is the Azure quota for the virtual machine size you select. If you do not have enough quota for the selected virtual machine size, the deployment will fail.
The cluster deployment takes a few minutes. The following operations are performed:
- A dedicated resource group is created to host all cluster-related resources
- Multiple virtual machines are configured with two network interfaces, both connected to the dedicated Virtual Network we provisioned but not on the same subnet.
Provision a Storage Account with sample data
The Azure DataBricks cluster will consume sample data located in a dedicated Storage Account. This Storage Account will:
- Be created in the same Azure subscription as the Azure DataBricks workspace and cluster
- Be created in the same Azure region as the Azure DataBricks workspace and cluster
- Be created with the Allow Blob public access option set to Disabled
- Have a container named demo with its public access level set to Private
- Contain the 1000_Sales_Records.csv sample file in the demo container
Once the file is uploaded into the demo container, click the Generate SAS option for the uploaded file, as illustrated below:
Keep the Blob SAS token. It will be used later in this document as the access key for the Storage Account.
Secure Storage Account with a Private Endpoint
The Azure DataBricks cluster will not access the Storage Account publicly but through a Private Endpoint. A Private Endpoint can be compared to the network interface of a virtual machine. We will create a Private Endpoint connected to a subnet of the Virtual Network provisioned at the beginning of this document. The Private Endpoint configuration is documented in the table below:
Parameter | Configuration |
---|---|
Subscription | Same subscription as Azure DataBricks |
Azure region | Same region as Azure DataBricks |
Connection method | Connect to an Azure resource in my Directory |
Resource subscription | Same subscription as Azure DataBricks |
Resource type | Microsoft.Storage/storageAccounts |
Resource | Name of the Storage Account created in the previous step |
Target Sub-Resource | Blob |
The last step of the Private Endpoint configuration is to link it to the dedicated Azure DataBricks Virtual Network created in the first step. The network configuration is documented in the table below:
Parameter | Configuration |
---|---|
Virtual Network | Virtual Network created for DataBricks solution |
Subnet | Dedicated subnet distinct from public and private subnets created for DataBricks |
Integrate with Private DNS zone | Yes |
DNS zone | Created in the same Azure subscription |
Make sure that the newly created Private DNS zone named privatelink.blob.core.windows.net has a Virtual Network link to the Virtual Network created in the first step, as illustrated below:
Creating a Private Endpoint establishes a link with the related Azure resource. Access to the resource requires an approval, which can be:
- Automatic
- Manual
When you own the Azure resource, approval is automatic. This is the case when you create a Private Endpoint for a resource located in your own Azure subscription. If you do not manage the resource, its owner needs to approve the Private Endpoint to finalize the setup. This will be the case if the resource is located in another Azure subscription.
Once you have created your Private Endpoint resource, the Azure resource owner needs to validate the request in the Private Link Center of the Azure portal or at the Azure resource level. In the illustration below, the Azure resource owner has a pending request that needs validation. The resource owner must review this request carefully because the Private Endpoint resource can be located in any Azure subscription.
Azure resource owners should only accept Private Endpoints from Azure subscriptions they trust. They should only approve requests from Azure subscriptions related to the Thales Digital Factory with the same level of security. Trusting a Private Endpoint from a C2 Azure Hardened subscription linked to a resource located in an Azure C3 Hardened subscription is strictly forbidden.
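The approval rule above can be summarised as a simple check. The sketch below is purely illustrative: the types, field names and the `mayApprove` function are hypothetical and do not correspond to any Azure API; the actual decision stays with the resource owner.

```scala
// Hypothetical security levels of Azure Hardened subscriptions.
sealed trait SecurityLevel
case object C2 extends SecurityLevel
case object C3 extends SecurityLevel

// Hypothetical description of a pending Private Endpoint request.
final case class EndpointRequest(subscription: String, level: SecurityLevel, isTdfSubscription: Boolean)

// Approve only requests coming from a TDF subscription with the same security level as the resource.
def mayApprove(resourceLevel: SecurityLevel, request: EndpointRequest): Boolean =
  request.isTdfSubscription && request.level == resourceLevel
```

With this rule, a request from a C2 subscription targeting a C3 resource is rejected, as is any request from a subscription that is not related to the Thales Digital Factory.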
Validate DNS resolution from DataBricks clusters
Azure DataBricks cluster virtual machines use the DNS resolution provided by the Virtual Network. Go back to the Azure DataBricks administration portal and use the Create Notebook option with the settings documented in the table below:
Parameter | Configuration |
---|---|
Default Language | Scala |
Cluster | Name of the cluster created previously |
Type the following commands in the notebook:
%sh
nslookup <name of your storage account>.privatelink.blob.core.windows.net
You should get a successful DNS resolution provided by the default Azure DNS (168.63.129.16). The answer should return an IPv4 address from the Virtual Network provisioned for the Azure DataBricks solution.
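The same check can also be done from a Scala cell with the standard Java resolver. This is a minimal sketch: the helper names are illustrative, and it assumes the Virtual Network uses a private (RFC 1918) address space, so that an address returned through the Private Endpoint is reported as site-local.

```scala
import java.net.InetAddress
import scala.util.Try

// Resolve a host name with the resolver of the machine running the notebook.
// Returns an empty list when the name cannot be resolved.
def resolveAll(host: String): Seq[String] =
  Try(InetAddress.getAllByName(host).toSeq.map(_.getHostAddress)).getOrElse(Seq.empty)

// True when the address belongs to 10.0.0.0/8, 172.16.0.0/12 or 192.168.0.0/16.
def isPrivateIPv4(ip: String): Boolean =
  Try(InetAddress.getByName(ip).isSiteLocalAddress).getOrElse(false)

// True when the name resolves and every returned address is private.
def resolvesPrivately(host: String): Boolean = {
  val ips = resolveAll(host)
  ips.nonEmpty && ips.forall(isPrivateIPv4)
}
```

Calling `resolvesPrivately` with the blob endpoint name of your Storage Account should return true from the cluster once the Private DNS zone is linked to the Virtual Network; a public address in the answer suggests that the zone link is missing.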
Mount file to Azure DataBricks
In this last step we will use the notebook to mount the content of our Azure container into Azure DataBricks using the following commands:
val containerName = "demo"
val storageAccountName = "<Your Storage Account>"
val sas = "<Shared Access token created for the file>" // Blob SAS token generated earlier
// Configuration key that carries the SAS token for this container
val config = "fs.azure.sas." + containerName + "." + storageAccountName + ".blob.core.windows.net"
dbutils.fs.mount(
  source = "wasbs://" + containerName + "@" + storageAccountName + ".blob.core.windows.net/1000_Sales_Records.csv",
  mountPoint = "/mnt/myfile",
  extraConfigs = Map(config -> sas))
The process should take a few seconds. You should see that:
- A job was created to process the operation
- The job status is Succeeded
If the mount was successful, we should be able to access the file content using the following command:
// Read the mounted CSV file, using the first row as header
val mydf = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/mnt/myfile")
display(mydf)
You should be able to see the following content:
Additional notes
Enforce security at Storage Account level
From a technical point of view, it is possible to strengthen the security of the Storage Account by setting Allow access from to Selected networks. Connections to the Storage Account made through the Private Endpoint will not be affected, but the only way to access the data plane of the Storage Account would then be to configure a Service Endpoint for the Storage service or to create a Private Endpoint connected to the tdp-he-vnet Virtual Network.
Thales DataLake use case
This DataBricks setup can also be used to access DataLake services provided by the Thales Digital Factory. This scenario only requires some minor changes:
- Private Endpoint approval
- Use Service Principal for authentication
- Change the notebook authentication sequence
When creating the Private Endpoint referencing the Storage Account used by the Thales Digital Factory, there is no automatic approval (the TDF DataLake team owns the Storage Account, not you). Your Private Endpoint will remain in a pending state, waiting for the TDF DataLake team to approve it. The TDF DataLake team can only approve your Private Endpoint if it complies with a simple rule: the Private Endpoint must be created in a C3 Azure Hardened subscription (no other subscription type is allowed).
For security reasons, the TDF DataLake team does not allow SAS key authentication to consume their service. You must rely on a Service Principal or an Azure Managed Identity and request that this identity be granted access to the TDF DataLake Storage Account at the data plane level.
Finally, you must adapt the notebook example provided in this document as shown below:
// OAuth (client credentials) settings for the Service Principal
val configs = Map(
  "fs.azure.account.auth.type" -> "OAuth",
  "fs.azure.account.oauth.provider.type" -> "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
  "fs.azure.account.oauth2.client.id" -> "<Azure AD Application Client ID>",
  "fs.azure.account.oauth2.client.secret" -> "<Azure AD Application Secret>",
  "fs.azure.account.oauth2.client.endpoint" -> "https://login.microsoftonline.com/<Azure AD Tenant ID>/oauth2/token"
)
dbutils.fs.mount(
  source = "abfss://<Storage Account Container name>@<Storage Account Name>.dfs.core.windows.net/<File>",
  mountPoint = "/mnt/<Local mount name>",
  extraConfigs = configs
)
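To avoid typing mistakes in the configuration keys and in the source URI, the values can be assembled by small helper functions. This is a sketch only: `oauthConfigs` and `abfssSource` are hypothetical names introduced here for illustration, and the configuration keys are the ones used in the example above.

```scala
// Hypothetical helper: builds the OAuth settings for a Service Principal.
def oauthConfigs(clientId: String, clientSecret: String, tenantId: String): Map[String, String] = Map(
  "fs.azure.account.auth.type" -> "OAuth",
  "fs.azure.account.oauth.provider.type" -> "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
  "fs.azure.account.oauth2.client.id" -> clientId,
  "fs.azure.account.oauth2.client.secret" -> clientSecret,
  "fs.azure.account.oauth2.client.endpoint" -> s"https://login.microsoftonline.com/$tenantId/oauth2/token"
)

// Hypothetical helper: builds the abfss URI for a container, Storage Account and path.
// A leading slash in the path is removed to avoid a double slash in the URI.
def abfssSource(container: String, account: String, path: String): String =
  s"abfss://$container@$account.dfs.core.windows.net/${path.stripPrefix("/")}"
```

The results can be passed to `dbutils.fs.mount` as `extraConfigs` and `source`. Rather than writing the application secret in the notebook, consider reading it from a DataBricks secret scope with `dbutils.secrets.get`.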
Security consideration
When creating the Private Endpoint resource, be sure to select a resource that is located in your subscription. In this case, the Private Endpoint configuration will be approved automatically. For TDF scenarios, this is the only scenario we support. Establishing a trust relationship with resources that are not located in your Azure Hardened subscription is forbidden.
Conclusion
In conclusion, data stored in a Storage Account in a C3 Azure Hardened subscription can be processed by an Azure DataBricks instance that is not connected to the same Virtual Network. When DataBricks clusters access the Storage Account, traffic stays on the Microsoft backbone and is not exposed on the Internet.