
Getting Started with Big Data Clusters – Part 1

I Want My First Big Data Cluster TODAY!

So, you’ve heard about Big Data Clusters in SQL Server 2019 and want to get some hands-on experience? Let’s get you started today! This article will guide you step by step through deploying a Big Data Cluster in Azure Kubernetes Service (AKS) from a Windows client. The idea is to get your Big Data Cluster up and running fast, so don’t expect a deep dive into every component along the way.

Is There Anything I Should Know Before Getting Started?

Since we will focus on how to deploy a Big Data Cluster (BDC), let’s assume that you have a rough idea of what a BDC is. If this is not the case, you can watch my talk “Introducing SQL Server 2019” from last year’s PASS Summit on PASStv.

BDC runs on a technology called Kubernetes – AKS is just one way of using Kubernetes as your base layer to deploy a BDC. While you don’t need to be an expert in Kubernetes, I highly recommend reading The Illustrated Children’s Guide to Kubernetes or, if you prefer a deeper dive, watching this video by MVP Anthony Nocentino: Inside Kubernetes – An Architectural Deep Dive.

We will mainly interact with our BDC through a tool called Azure Data Studio, one of the client tools provided by Microsoft to work with SQL Server. If you are unfamiliar with it, take a look at the official docs.

Finally, as I am lazy, we will mainly deploy through the command line, including the installation of our prerequisites. This will happen through a package manager called Chocolatey (https://chocolatey.org/).

What Will I Need for This Exercise?

As we will deploy this cluster in Azure Kubernetes Service, you will need an Azure subscription. This can be a free trial, as we will not be using a ton of resources (unless you plan to keep this cluster up and running for a while).

Since you will need that Azure subscription anyway, I would also recommend deploying a client VM running Windows 10 or Windows Server, just to make sure the prerequisites and tools don’t interfere with anything else you are already running. The deployment would also work from a Mac or Linux machine but would require a slightly different approach – feel free to ping me if you need this.

I chose to deploy a Windows 10 VM using a Standard B2s machine – even if you keep this running for a few days, it will only cost you a few cents per hour. That said, if you decide to just deploy from your existing laptop, you should probably be fine.
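If you would rather script that client VM instead of clicking through the Azure portal, here is a minimal sketch using the Azure CLI from a bash shell (Azure Cloud Shell, for example). All names, the region, and the password are placeholders of my choosing, and I am using Windows Server 2019 here because it has a simple image alias:

    # Create a resource group for the client VM (name and region are placeholders)
    az group create --name bdc-client-rg --location westeurope

    # Create a small B2s client VM - Windows Server 2019 is used here for its simple image alias
    az vm create \
      --resource-group bdc-client-rg \
      --name bdc-client \
      --image Win2019Datacenter \
      --size Standard_B2s \
      --admin-username bdcadmin \
      --admin-password '<a-strong-password>'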


Can We Finally Get Started?

YES! Let’s start by installing Chocolatey, followed by the BDC prerequisites.

Connect to the client you’ll be using to run your deployment and open a new PowerShell window. Run this command to install Chocolatey (choco):
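(The snippet below is the standard install routine from chocolatey.org, split across lines for readability – double-check the current version at https://chocolatey.org/install before running it in an elevated PowerShell window.)

    # Standard Chocolatey install routine - verify against https://chocolatey.org/install
    Set-ExecutionPolicy Bypass -Scope Process -Force
    [System.Net.ServicePointManager]::SecurityProtocol = [System.Net.ServicePointManager]::SecurityProtocol -bor 3072
    iex ((New-Object System.Net.WebClient).DownloadString('https://community.chocolatey.org/install.ps1'))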

Next, we’ll install the prerequisites (python, kubernetes-cli, azure-cli, Azure Data Studio, and azdata). You can do this from the same PowerShell window:
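(A sketch of what that could look like – the Chocolatey package names are my best guess and worth verifying with choco search. Note that azdata shipped as a Python package at the time, installed via pip rather than Chocolatey:)

    # Install the client tools (package names are assumptions - verify with 'choco search')
    choco install python kubernetes-cli azure-cli azuredatastudio -y
    # azdata was distributed as a pip-installable Python package around the SQL Server 2019 release
    pip3 install -r https://aka.ms/azdata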

The last preparatory step is to grab a small Python script from GitHub, which will take care of the actual deployment. Close the PowerShell window and open a command prompt. Run the following curl command to download the script:
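(The script in question is Microsoft’s deploy-sql-big-data-aks.py sample. The exact repository path below is my assumption and may have moved since, so double-check it in the microsoft/sql-server-samples repository:)

    rem Download the AKS deployment sample script (verify the path in microsoft/sql-server-samples)
    curl --output deploy-sql-big-data-aks.py https://raw.githubusercontent.com/microsoft/sql-server-samples/master/samples/features/sql-big-data-cluster/deployment/aks/deploy-sql-big-data-aks.py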

We’re almost ready to go. To be able to deploy the BDC to Azure, we need to log in to our Azure subscription using:
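    rem Opens a browser window to authenticate against your Azure subscription
    az login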

This will open a web browser window where you will authenticate using the credentials associated with your Azure subscription. After you’ve logged in, you can close the web browser. You will see a list of all your subscriptions (if you have multiple). Copy the ID of the subscription to be used. Then run the deployment script:
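(Assuming you saved the script under the name used in the download step above:)

    rem Start the interactive deployment
    python deploy-sql-big-data-aks.py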

This script will ask you for a couple of parameters, some of which have default values – for those, just stick with the defaults for now.

  • Azure subscription ID: Use the ID you got from az login.
  • Name for your resource group (this script will create a new RG for your BDC).
  • Azure Region.
  • Size of the VMs to be used for the cluster.
  • Number of worker nodes.
  • Name of the BDC (think of this like a named instance).
  • Username for the admin user.
  • Password for the admin user.


After the last step, the script will:

  • Create a Resource Group.
  • Create a Service Principal.
  • Create an AKS Cluster.
  • Deploy a BDC to this AKS Cluster.


The script will take around 30 minutes and will finish by printing the endpoints of your BDC. The output should look similar to this:
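(The original output listing is not reproduced here. As a rough illustration based on the documented default ports, it contains entries along these lines:)

    SQL Server Master Instance Front-End      <external-ip>,31433
    HDFS/Spark Gateway (Knox)                 <external-ip>:30443
    Cluster Management Service                <external-ip>:30080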

Look for the SQL Server Master Instance Front-End (port 31433); you will need it later.

Wait?! What Just Happened?

By running just a few commands in a shell, we deployed a full BDC in Azure. All of this was made possible by a package manager (choco) and Kubernetes: this combination allows us to deploy a complex system in a very simple way.

You can connect to it, just like any other SQL Server, using Azure Data Studio and the SQL Master endpoint from above.
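If you prefer a quick smoke test from the command line, the same endpoint works with any regular SQL Server client tool, for example sqlcmd (the user name and password are the ones you chose during deployment):

    rem <external-ip> is the SQL Server Master Instance Front-End address from the endpoint list
    sqlcmd -S <external-ip>,31433 -U <your-admin-user> -P <your-password> -Q "SELECT @@VERSION"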

You are now connected to your BDC and can start working with it.

ADS is also an effective way to manage and troubleshoot your BDC.

Instead of using the Python script above, we could also have deployed the BDC directly through ADS. Additionally, the script called a tool named azdata, which uses two configuration files that define the sizing and configuration of your BDC. We left all of those at their default values for now, leaving a deep dive into them for another day.
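As a pointer for that deep dive: azdata can list and export those configuration profiles. The commands below reflect how this worked around the SQL Server 2019 release and may differ in newer versions:

    rem List the built-in deployment profiles (each consists of a bdc.json and a control.json)
    azdata bdc config list
    rem Copy a profile locally so you can inspect or customize the two files
    azdata bdc config init --source aks-dev-test --target my-custom-profile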


What if I Have Additional Questions?

This post is part of a mini-series. In the following posts, we will look at how you can leverage the capabilities of your BDC through Data Virtualization or by using flat files stored in the storage pool.

In the meantime, if you have questions, please feel free to reach out to me on Twitter: @bweissman – my DMs are open.

About the author

Ben Weissman has been working with SQL Server since SQL Server 6.5, mainly in the BI/data warehousing field. He is a Data Platform MVP, an MCSE Data Management and Analytics certified professional, and a Certified Data Vault Data Modeler. He is also the first BimlHero Certified Expert in Germany, as well as a co-author of “SQL Server Big Data Clusters” and “The Biml Book”.

Ben has been involved in more than 150 BI projects and is always looking for ways to become more productive and make SQL Server even more fun!

Together with his team at Solisyon, Ben provides training, implementation, and consultancy for SQL/BI developers and data analysts in upper-mid-market companies around the globe.

