Introduction
In this article, we will discuss how to secure Azure Data Lake Storage using a Private Endpoint. The main benefit of a Private Endpoint is that on-prem and Azure services can connect to the data lake over the Microsoft backbone network, so data never has to cross the public internet. This improves network security and reduces the risk of data exfiltration.
Challenges
In the last article, we set up a Virtual network integration to allow services within the Virtual Network's Subnet to connect to our data lake. This solution works well, but our scope has expanded. We now need to enable on-prem applications and other Azure services to connect to our data lake through private networking. The traffic must stay within the Microsoft backbone network and be restricted to the private virtual network.
Private Endpoint is designed for this use case and can be implemented on various Azure services, including the data lake. Before we dive into the tutorial, it is important to understand the new concepts and services required to achieve this:
- Private DNS zone
The Private DNS zone service translates a private domain name into an IP address. A DNS record in the zone gives the data lake a unique private hostname and maps it to the related private IP address.
- Network interface (NIC)
If you have created a Virtual Machine (VM), a network interface should be familiar to you. A network interface obtains an IP address from the Virtual network. In this use case, it will be a private IP address.
- Private Endpoint
Put simply, a Private Endpoint associates the data lake's unique domain record in the Private DNS zone with the Network interface.
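To make the naming concrete, here is a minimal Python sketch of the hostnames involved. It assumes the Azure public cloud naming pattern for storage accounts, where each sub-resource has its own 'privatelink' zone. The account name 'mydatalake' and the function names are hypothetical and used for illustration only.

```python
# Sketch of the hostnames behind a storage Private Endpoint.
# Assumption: Azure public cloud naming (core.windows.net).
# "mydatalake" is a hypothetical storage account name.

SUB_RESOURCES = ("blob", "dfs")


def public_fqdn(account: str, sub_resource: str) -> str:
    """Hostname that clients keep using before and after the change."""
    return f"{account}.{sub_resource}.core.windows.net"


def private_dns_zone(sub_resource: str) -> str:
    """Name of the Private DNS zone for one storage sub-resource."""
    return f"privatelink.{sub_resource}.core.windows.net"


def privatelink_fqdn(account: str, sub_resource: str) -> str:
    """Record inside the Private DNS zone that maps to the private IP."""
    return f"{account}.{private_dns_zone(sub_resource)}"


if __name__ == "__main__":
    for sub in SUB_RESOURCES:
        print(public_fqdn("mydatalake", sub), "->", privatelink_fqdn("mydatalake", sub))
```

Clients do not change their connection strings. The public hostname is resolved through the 'privatelink' name, and the Private DNS zone answers with the private IP of the network interface.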
It is important to understand the added cost of implementing this solution. Azure charges for the Private Endpoint itself, for both ingress and egress traffic that passes through it, and for the Private DNS zone. For a detailed cost estimate, please use the Azure Pricing Calculator.
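The cost components above can be combined into a rough estimate. The Python sketch below assumes a simple model: an hourly fee per Private Endpoint, a per-GB fee on inbound and outbound data, and a flat monthly amount for Private DNS. The function and parameter names are hypothetical, and no real prices are built in; take the rates from the pricing calculator for your region.

```python
# Rough monthly cost model for the Private Endpoint setup.
# Assumption: billing = endpoint hours + data processed + a flat DNS amount.
# All rates are inputs; nothing here reflects actual Azure prices.

def estimate_monthly_cost(
    hours: float,
    endpoint_hourly_rate: float,
    gb_inbound: float,
    gb_outbound: float,
    per_gb_rate: float,
    dns_monthly_fee: float = 0.0,
    endpoint_count: int = 1,
) -> float:
    """Return the estimated monthly cost in the currency of the rates."""
    endpoint_fee = endpoint_count * hours * endpoint_hourly_rate
    data_fee = (gb_inbound + gb_outbound) * per_gb_rate
    return endpoint_fee + data_fee + dns_monthly_fee


if __name__ == "__main__":
    # Placeholder numbers for illustration only, not real prices.
    total = estimate_monthly_cost(
        hours=100, endpoint_hourly_rate=0.5,
        gb_inbound=10, gb_outbound=30, per_gb_rate=0.25,
        dns_monthly_fee=1.0, endpoint_count=2,  # one endpoint each for blob and dfs
    )
    print(total)
```

Because the data lake needs two Private Endpoints ('blob' and 'dfs'), the hourly component is counted twice in the usage example.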
Tutorial
In our previous tutorial, we set up a Virtual network integration with our data lake. The first thing we need to do is remove this integration:
- Navigate to the Networking tab of our data lake.
- Under 'Firewalls and virtual networks', click the '...' next to the Virtual Network and select 'Remove'.
- Click 'Save'.
- After the Virtual network integration is successfully removed, click on the 'Private endpoint connections' tab, then click '+ Private endpoint'. This will bring up the Private endpoint creation wizard.
- Create a Private Endpoint - Basics
Provide the Resource group and the Instance details. Click 'Next: Resource >' to continue the wizard.
- Create a Private Endpoint - Resource
Private Endpoint is available for various types of storage services. For the data lake, we need to create a Private Endpoint for both 'blob' and 'dfs'. Since the wizard allows only one sub-type per Private Endpoint, we will select 'blob' first. Click 'Next: Configuration >' to continue the wizard.
After reaching the end of this tutorial, remember to repeat these steps for 'dfs' as well.
Note: we create the 'blob' Private Endpoint because Azure Storage Explorer uses the Blob API to retrieve container information. Without it, we will receive an error.
- Create a Private Endpoint - Config
To keep the services organized, I created a dedicated resource group for all networking components (sandbox-dataPlatform-network). I also created a subnet called 'privateEndpoint' for all the Private Endpoints. Alternatively, you can create a different subnet for each Private Endpoint type. Click 'Next: Tags >' to continue the wizard.
- Create a Private Endpoint - Tags
Finally, we will create the tags required for cost management. Since the data lake will be used by many projects, the new services are assigned to the 'shared' cost center. Click 'Next: Review + create >' to continue the wizard.
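For readers who prefer scripting over the portal, the wizard inputs can be sketched as request parameters in Python. This is only an illustration under stated assumptions: the helper names, the connection naming, the 'costCenter' tag key, and the example IDs are hypothetical, and the snippet only builds dictionaries. It does not call Azure, and it does not create the Private DNS zone link that the portal wizard sets up.

```python
# Sketch: wizard inputs expressed as parameters, one set per sub-resource.
# Assumption: the dict shape follows the azure-mgmt-network Python SDK, where
# it could be passed to private_endpoints.begin_create_or_update(...).
# Verify against the SDK documentation before relying on it.

def storage_account_id(subscription_id: str, resource_group: str, account: str) -> str:
    """Build the ARM resource ID of the storage account (the data lake)."""
    return (
        f"/subscriptions/{subscription_id}/resourceGroups/{resource_group}"
        f"/providers/Microsoft.Storage/storageAccounts/{account}"
    )


def private_endpoint_parameters(location, subnet_id, storage_id, sub_resource, tags=None):
    """Parameters for one Private Endpoint targeting one sub-resource."""
    return {
        "location": location,
        "tags": tags or {},
        "subnet": {"id": subnet_id},  # the 'privateEndpoint' subnet from the Config step
        "private_link_service_connections": [
            {
                "name": f"{sub_resource}-connection",  # hypothetical naming
                "private_link_service_id": storage_id,
                "group_ids": [sub_resource],  # 'blob' or 'dfs'
            }
        ],
    }


def parameters_for_data_lake(location, subnet_id, storage_id, tags=None):
    """The data lake needs both endpoints, so build one set for each."""
    return {
        sub: private_endpoint_parameters(location, subnet_id, storage_id, sub, tags)
        for sub in ("blob", "dfs")
    }
```

Building both parameter sets in one call mirrors the reminder to repeat the wizard for 'dfs' after finishing 'blob'.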
Testing
After setting up the Private Endpoint, we should see three new resources in the Azure portal: the Private Endpoint, the network interface, and the Private DNS zone.
To test the Private Endpoint, we will connect to a virtual machine on the same network as our data lake. In the VM, open the Command Prompt by typing 'cmd' in the Run window, then run the nslookup command against the data lake's hostname.
In the screenshot, we can see that the data lake's public hostname now resolves through the Private Endpoint and returns the private IP address.
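The same check can be scripted. The Python sketch below resolves a hostname and reports whether the answer is a private address, which is what we expect when running inside the virtual network. The function name is hypothetical, and the resolver is passed in as a parameter so the logic can be exercised without network access.

```python
# Sketch: scripted version of the nslookup check.
# Assumption: the script runs on a VM inside the virtual network and uses
# the same DNS settings as nslookup on that machine.
import ipaddress
import socket


def resolves_to_private_ip(hostname, resolve=socket.gethostbyname):
    """Resolve hostname and return (address, is_private).

    `resolve` defaults to the system resolver; a stand-in function can be
    supplied for testing.
    """
    address = resolve(hostname)
    return address, ipaddress.ip_address(address).is_private
```

From the VM, calling `resolves_to_private_ip` with the data lake's 'blob' and 'dfs' hostnames should return a private address for each. A public address suggests that the Private DNS zone is not linked to the virtual network or that the endpoint for that sub-type is missing.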
References
Summary
Private Endpoint is a secure way to connect to our data lake. By keeping the traffic private, the network security team has better control over who can access our data lake and from where.
The trade-off is a more complex setup and added cost. For a detailed comparison between Virtual Network integration and Private Endpoint, I have provided a link in the References section above.
If you are setting up Private Endpoint for your data lake, do not forget to create the Private Endpoint for 'dfs'!
Happy Learning!