
Getting Started With Big Data Clusters – Part 3

How Can I Work With File-Based Data in a Big Data Cluster?

So far, we’ve deployed a Big Data Cluster and explored its Data Virtualization capabilities.

Now, it’s time to look at how you can use your Big Data Cluster to consume file-based data from a CSV or Parquet file. The Big Data Cluster component used for this is the so-called Storage Pool.

Is There Anything I Should Know Before Getting Started?

You will need to have access to a Big Data Cluster – like the one we’ve deployed in the first post – as well as Azure Data Studio.

Also, you will need some file-based data. Our example will be using the 2015 flight delay dataset, which you can get for free at https://www.kaggle.com/usdot/flight-delays.

You will also need your BDC’s gateway endpoint to access HDFS files.

To retrieve it, you have two options.

The first option is to open a command prompt and run:

azdata login

azdata will prompt you for your namespace (if you deployed using the default: sqlbigdata), username, and password. You can then run the following command:

azdata bdc endpoint list -o table

This will return the endpoints again. This time, look for “Gateway to access HDFS files, Spark”:

The other option would be to go through Azure Data Studio. Connect to your Big Data Cluster, right-click the Server and choose “Manage”:

If you switch to the “SQL Server Big Data Cluster” pane on the next screen, you will see all the endpoints of your instance:

How Can I Get Data into the Storage Pool?

Getting data into your storage pool can be achieved through Azure Data Studio, as well as through the command line. The latter is the preferred way when you’re automating the upload of your data – for example, hourly logfiles from your webserver.

Let’s first take a look at how to do that in Azure Data Studio. If you connect to a Big Data Cluster, ADS will show a node for your HDFS. If you right-click it, you can create new directories and upload files:

Create a directory “FlightDelays” and upload the files “flights.csv” and “airlines.csv”. If you expand the HDFS node and your newly created directory, both files should show up:

Another way of uploading files would be through curl using this command:

curl -s -S -L -k -u root:<Password> -X PUT "https://<endpoint>/gateway/default/webhdfs/v1/<target>?op=create&overwrite=true" -H "Content-Type: application/octet-stream" -T "<sourcefile>"

In our case, it would look like this (replace your endpoint accordingly):

curl -s -S -L -k -u root:MySQLBigData2019 -X PUT "https://13.88.129.102:30443/gateway/default/webhdfs/v1/FlightDelays/airports.csv?op=create&overwrite=true" -H "Content-Type: application/octet-stream" -T "airports.csv"
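
If you want to automate this (for example, for the hourly logfile scenario mentioned earlier), you can wrap the same call in a small script. The following is only a minimal sketch, assuming a bash shell, the endpoint and password from the example above, and the standard WebHDFS MKDIRS operation for creating the target directory; a PowerShell loop would work just as well:

#!/bin/bash
# Hypothetical values - replace with your own gateway endpoint, password, and target folder
ENDPOINT="13.88.129.102:30443"
PASSWORD="MySQLBigData2019"
TARGETDIR="FlightDelays"

# Make sure the target directory exists (standard WebHDFS MKDIRS operation)
curl -s -S -L -k -u root:"$PASSWORD" -X PUT \
  "https://$ENDPOINT/gateway/default/webhdfs/v1/$TARGETDIR?op=MKDIRS"

# Upload every CSV file in the current folder, reusing the call shown above
for f in *.csv; do
  curl -s -S -L -k -u root:"$PASSWORD" -X PUT \
    "https://$ENDPOINT/gateway/default/webhdfs/v1/$TARGETDIR/$f?op=create&overwrite=true" \
    -H "Content-Type: application/octet-stream" -T "$f"
done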

If you refresh your folder in ADS, the third file should show up as well:

How Can I Make That Data Usable?

Now that we’ve uploaded some data to the storage pool, we need to make it accessible to SQL Server. To do this, we need an external data source within our database, a file format, and an external table pointing to either a file or a directory. If you point your external table to a directory, all files in that directory will be used and will need to have the same structure.

Azure Data Studio also provides a wizard for flat files. Right-click the file “airlines.csv” in ADS and click “Create External Table From CSV Files”:

This will open up another 4-step wizard. In the first step, you will define which database to create the external table in, as well as its name and schema. You will also define the name of the external data source (leave this as SqlStoragePool) and the name for the file format (change this to CSV):

The next step will just give you a preview of your data, whereas the important third step is where you will define your column names and data types:

Like most other tools and wizards for flat files, it will try to guess the data types for each column, but it only uses a limited sample size. It is also rather pessimistic for some data types (even though every entry in IATA_CODE is exactly two characters long and contains no Unicode characters, it recommends NVARCHAR(50)) and potentially too optimistic for others. If you were not provided a proper schema for your file(s), this may turn into a trial-and-error exercise.

The last step will provide you with an overview of what you’ve configured in the wizard, again with the option to immediately create those objects or script them.

Choose “Generate Script” and look at the output. The first part of the script creates the external data source, pointing to the storage pool:

USE [DataVirt];
CREATE EXTERNAL DATA SOURCE [SqlStoragePool]
    WITH (LOCATION = N'sqlhdfs://controller-svc/default');

The next step creates a (rather generic) file format:

CREATE EXTERNAL FILE FORMAT [CSV]
    WITH (FORMAT_TYPE = DELIMITEDTEXT, FORMAT_OPTIONS (FIELD_TERMINATOR = N',', STRING_DELIMITER = N'\"', FIRST_ROW = 2));

The last step creates the external table, which will use the previously created format and data source:

CREATE EXTERNAL TABLE [dbo].[airlines]
(
    [IATA_CODE] nvarchar(50) NOT NULL,
    [AIRLINE] nvarchar(50) NOT NULL
)
WITH (LOCATION = N'/FlightDelays/airlines.csv', DATA_SOURCE = [SqlStoragePool], FILE_FORMAT = [CSV]);
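
As mentioned above, LOCATION does not have to be a single file. As a hedged sketch of the directory variant, assuming a hypothetical /FlightDelays/DailyFlights/ directory that only contains CSV files sharing one structure, an external table could be pointed at the whole folder:

-- Hypothetical example: LOCATION points to a directory instead of a single file,
-- so every file under /FlightDelays/DailyFlights/ is read and must share this structure.
CREATE EXTERNAL TABLE [dbo].[daily_flights]
(
    [FLIGHT_DATE]   date NOT NULL,
    [AIRLINE]       nvarchar(50) NOT NULL,
    [FLIGHT_NUMBER] bigint NOT NULL
)
WITH (LOCATION = N'/FlightDelays/DailyFlights/', DATA_SOURCE = [SqlStoragePool], FILE_FORMAT = [CSV]);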

For the other two files, you can either go through the wizard again or just create them by running these two statements:

CREATE EXTERNAL TABLE [dbo].[airports]
(
    [IATA_CODE] nvarchar(50) NOT NULL,
    [AIRPORT] nvarchar(100) NOT NULL,
    [CITY] nvarchar(50) NOT NULL,
    [STATE] nvarchar(50) NOT NULL,
    [COUNTRY] nvarchar(50) NOT NULL,
    [LATITUDE] float,
    [LONGITUDE] float
)
WITH (LOCATION = N'/FlightDelays/airports.csv', DATA_SOURCE = [SqlStoragePool], FILE_FORMAT = [CSV]);
CREATE EXTERNAL TABLE [dbo].[flights]
(
    [YEAR] bigint NOT NULL,
    [MONTH] bigint NOT NULL,
    [DAY] bigint NOT NULL,
    [DAY_OF_WEEK] bigint NOT NULL,
    [AIRLINE] nvarchar(50) NOT NULL,
    [FLIGHT_NUMBER] bigint NOT NULL,
    [TAIL_NUMBER] nvarchar(50),
    [ORIGIN_AIRPORT] nvarchar(50) NOT NULL,
    [DESTINATION_AIRPORT] nvarchar(50) NOT NULL,
    [SCHEDULED_DEPARTURE] time NOT NULL,
    [DEPARTURE_TIME] time,
    [DEPARTURE_DELAY] bigint,
    [TAXI_OUT] bigint,
    [WHEELS_OFF] time,
    [SCHEDULED_TIME] bigint NOT NULL,
    [ELAPSED_TIME] bigint,
    [AIR_TIME] bigint,
    [DISTANCE] bigint NOT NULL,
    [WHEELS_ON] time,
    [TAXI_IN] bigint,
    [SCHEDULED_ARRIVAL] time NOT NULL,
    [ARRIVAL_TIME] time,
    [ARRIVAL_DELAY] bigint,
    [DIVERTED] bit NOT NULL,
    [CANCELLED] bit NOT NULL,
    [CANCELLATION_REASON] nvarchar(50),
    [AIR_SYSTEM_DELAY] bigint,
    [SECURITY_DELAY] bigint,
    [AIRLINE_DELAY] bigint,
    [LATE_AIRCRAFT_DELAY] bigint,
    [WEATHER_DELAY] bigint
)
WITH (LOCATION = N'/FlightDelays/flights.csv', DATA_SOURCE = [SqlStoragePool], FILE_FORMAT = [CSV]);

You will notice that the latter two tables make use of the same external data source and file format. If you expand your database and its tables in ADS, you will also see the three tables:

How Can I Query That Data?

Just like with data from external data sources as described in our previous post, this data can be queried as if it were coming from a local table:

SELECT * FROM airlines
SELECT * FROM airports

Of course, you can also join those tables against each other or against other tables in your database:

SELECT TOP 10 fl.AIRLINE, al.*, fl.FLIGHT_NUMBER
FROM flights fl
INNER JOIN airlines al ON fl.AIRLINE = al.IATA_CODE

If the structure of your file doesn’t match your external table definition – for example, if you’ve defined a column as a SMALLINT but it contains larger numbers – you would see a data conversion error, including information about which column and row caused the issue. In this case, you would need to recreate your external table using the proper data types.
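
Since an external table only stores metadata and no data is moved, this is a quick drop-and-recreate. A minimal sketch, where the table [demo_delays], its columns, and its file path are purely hypothetical:

-- Hypothetical example: [DELAY_MINUTES] was originally created as smallint,
-- but the underlying file contains larger values. External tables hold no data,
-- so dropping and recreating them with a wider type is a metadata-only change.
DROP EXTERNAL TABLE [dbo].[demo_delays];

CREATE EXTERNAL TABLE [dbo].[demo_delays]
(
    [FLIGHT_NUMBER] bigint NOT NULL,
    [DELAY_MINUTES] bigint            -- widened from smallint
)
WITH (LOCATION = N'/FlightDelays/demo_delays.csv', DATA_SOURCE = [SqlStoragePool], FILE_FORMAT = [CSV]);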

That Seems Slow…

While performance may not be bad for smaller files like airlines.csv, queries tend to get very slow once the files get even slightly bigger. Therefore, I highly recommend converting those files to the Parquet format first. This can also be done from within your Big Data Cluster.

The Microsoft samples on GitHub contain examples of how to convert a CSV file into a Parquet file, as well as how to create an external table based on a Parquet file.
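
To give a rough idea of the T-SQL side (the conversion itself would happen in Spark, following those samples), here is a hedged sketch. It assumes the converted data was written to a hypothetical /FlightDelays/flights_parquet directory and that the column list matches the Parquet schema:

-- Hedged sketch: a Parquet file format plus an external table over the converted data.
-- The location and column list are assumptions; align them with your conversion output.
CREATE EXTERNAL FILE FORMAT [parquet]
    WITH (FORMAT_TYPE = PARQUET);

CREATE EXTERNAL TABLE [dbo].[flights_parquet]
(
    [YEAR] bigint,
    [MONTH] bigint,
    [DAY] bigint,
    [AIRLINE] nvarchar(50),
    [FLIGHT_NUMBER] bigint,
    [DEPARTURE_DELAY] bigint
)
WITH (LOCATION = N'/FlightDelays/flights_parquet', DATA_SOURCE = [SqlStoragePool], FILE_FORMAT = [parquet]);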

What if I Have Additional Questions?

Please feel free to reach out to me on Twitter: @bweissman – my DMs are open. Also make sure to check out the posts by my friend Mohammad Darab – he has a ton of content prepared for you!

Ben Weissman
About the author

Ben Weissman has been working with SQL Server since SQL Server 6.5, mainly in the BI/Data Warehousing field. He is a Data Platform MVP, an MCSE Data Management and Analytics certified professional, and a Certified Data Vault Data Modeler. He is also the first BimlHero Certified Expert in Germany, as well as a co-author of "SQL Server Big Data Clusters" and "The Biml Book".

Ben has been involved in more than 150 BI Projects and is always looking for ways to become more productive and make SQL Server even more fun!

Together with his team at Solisyon, Ben provides training, implementation, and consultancy for SQL/BI developers and data analysts in upper-mid-market companies around the globe.
