Loading data into Azure SQL Data Warehouse using PolyBase



PolyBase

  

Azure SQL Data Warehouse and PolyBase 


We have talked and walked through PolyBase and enablement the PolyBase feature in the SQL Server 2016. It is a new feature that serves to process T-SQL query to read data from external data sources, including relational as well non-relational databases.

Along with PolyBase, we went through an introduction and a brief architecture of Azure SQL Data Warehouse that is a cloud based enterprise solution for data warehousing workload built on top of Massively Parallel Processing (MPP) design. MPP engine leverages a scale out architecture to distribute the computational processing of massive data across multiple nodes to execute quickly.

Next, onwards in this article, we will dive deep to do some hands on activity loading data from Azure Storage Blobs to Azure SQL Data Warehouse using PolyBase feature. Since we have already covered in one of the articles, the provisioning of an Azure Storage Blobs so will focus the following tasks – 
  • Create an Azure SQL Data Warehouse
  • Connect the Azure SQL Data Warehouse using SQL Server Management Studio (SSMS)
  • Upload a csv file to earlier created Azure Storage Blob Container
  • Load data into Azure SQL Data Warehouse from the csv file (Blob Container), using PolyBase


Meanwhile, if required, then you can visit few previous articles to have a brief awareness about the following artifacts – 


Pre-requisites


Before moving ahead, we need some pre-requisites to accomplish this Azure SQL Data Warehouse and PolyBase hands on activities on top of Azure Cloud.
  1. Azure subscription, if you don't have an account then sign up for a free Azure account - https://azure.microsoft.com/en-gb/free/
  2. Download the required lab files, essential to this task from the GitHub repository - https://github.com/rajendraazure/SQLDW-PolyBase
  3. Must have an existing Azure Storage Blob Container, where you can upload files.
  4. Must have a VM or local workstation, well equipped with SQL Server 2016 and PolyBase enabled.


Create Azure Storage Blob Containers


In this walk-through, I am using an existing Azure Storage Blob; you either can provision a new one or go with an earlier created Storage Blob. 
Essential to log on the Azure Portal and move to the Storage Blobs where you will have to create container as – 
  • sensors-data

Login to the Azure portal https://portal.azure.com/

STEP – 1: Launch Azure Storage Blob 


On the Hub menu, click All services and select the storage account, required to browse and load its blade in the Azure portal. Here you can see all details and options under different sections like Essentials and Services.

Storage account


Time to create one container under the storage account as we considered earlier; click the Blobs, which will load the container blade.

Storage Blobs

STEP – 2: Create a Container


Since, you do not have any containers so the list is empty, click + Container from the menu to launch the New container blade with following properties – 
  • Name – The name of the container (case-sensitive).
  • Public access level – Simply go ahead with default selection as Private (no anonymous access).


Add Container


Click the OK button to proceed with creating the said container; notification will be appeared about the validation and deployment. Sooner you can see the new entry in the container list.

Container List


Prepare sample Data Files


In the context of this Azure SQL Data Warehouse and PolyBase hands on activity, you will use PolyBase to load data from your Azure Blob store account to your Azure SQL Data Warehouse. Henceforth required a simple data file for the blob container.

Either you can create these files or download the same from the GitHub location; you have noticed the same in the pre-requisites section the GitHub link. Go to the said GitHub location and download the sample file - https://github.com/rajendraazure/SQLDW-PolyBase

GitHub


Here you can see three different files, though required the third one i.e. weather data – 
  1. 2016Orders.csv – Some random orders for the year 2016
  2. 2017Orders.csv – Some random orders for the year 2017
  3. BeachWeatherStationsAutomatedSensors.csv – A small CSV file contains sensor data from various Lake Michigan beaches in Chicago, Illinois. This dataset was obtained from the city’s open data portal (https://data.cityofchicago.org).


Upload sample Data Files to Azure Storage Blob Containers


Once downloaded the weather file, required to upload in the storage containers, which you have created in earlier steps. Navigate to the Azure storage account and select the container named sensors-data.

Sensors-data


You can see there is no file so go ahead by clicking the Upload button from the top menu. It will launch the Upload blob blade where required few property details, make sure to fill the appropriate details before uploading a file to the blob container as – 
  • Files – File name will be displayed once you browse the data file.
  • Overwrite if files already exist – Select the box.


Under Advanced option, provide a few more details as – 
  • Authentication type – Simply go with default selection SAS.
  • Blob type – No doubt, it would be a Block blob as the default selection.
  • Block size – Go with default selection as 4 MB.
  • Upload to folder – Provide a folder name.


Upload Blob


Once you have filled all details then can click the Upload button to load the data file from your local source to the Azure Blob Container. Sooner the data file will be uploaded to the container, refresh the blob container blade and you can notice this new data file exist under order folder.

Weather data file

Provisioning an Azure SQL Data Warehouse 


You see an Azure SQL Data Warehouse is a fully managed (Platform-as-a-Service (PaaS)) and scalable cloud based distributed relational database management system for data warehousing workloads. In this activity, the PolyBase enables to query the storage data by using T-SQL statements and load into an Azure SQL Data Warehouse. 

If required, in the meantime, visit one of previous articles to know the brief about SQL Data Warehouse and its architecture – 
On the other hand, have a look the Microsoft official link – 

STEP – 1: Launch an Azure SQL Data Warehouse 


In the Microsoft Azure portal, click the + Create a resource from the Hub and click the Databases from the Azure Marketplace. It will load all available database services under the Featured section, select the SQL Data Warehouse or can search from the search box.

SQL Data Warehouse
  
Promptly the SQL Data Warehouse blade will be loaded, the time you click the SQL Data Warehouse and the required details needed to submit for the following properties – 
  • Database name – The name of database.
  • Subscription – The name of your Azure subscription, it would be either free, paid or specific subscription etc. 
  • Resource group – The name of the resource group used to organize related resources, you can either create a new resource or choose an existing one.
  • Select source – Select Sample and later select the AdventureWorksDW under the Select sample box (though you can go with Blank).

New SQL Data Warehouse

STEP – 2: Configure an Azure SQL Server 


Next, the property Server, here essentials to configure the server settings to create a new server with the following sets – 
  • Server name: The name of SQL Data Warehouse DB server (a unique name). 
  • Server admin login: Administrator login name as per your choice. 
  • Password: Strong password for the above said administrator.
  • Confirm password – Confirm the same password. 
  • Location – Ideally, select the same location as your storage account.
  • Allow azure services to access server: Go ahead with default selection as Selected.


SQL DB Server

If you see all information are correct, then proceed by clicking the Select button. It will create the said DB server with defined properties. Once again, continue to provide rest properties details in the context of creating a SQL Data Warehouse under the blade – 
  • Server – Now, it has configured with the created server name.
  • Performance level – Need to select appropriate performance level that fits for our task.
  • Collation – Usually it display the default selection as SQL_Latin1_General_CP1_CI_AS. 


Click the Performance level that launch another blade to configure performance and decide the pricing tier. Though If you need some different configuration , then configure and click the Apply button from the Configure performance blade.

Configure performance

Since we are dealing with sample data that does not have massive size so better to go with the default selection. Now the remaining part is Collation, usually it maps with the default selection SQL_Latin1_General_CP1_CI_AS. However, here it will be specified by the selected sample source.

SQL Data Warehouse

Now, nearly all properties have filled up; proceed by clicking the Create button to create the SQL Data Warehouse instance. Post validation, it can take a few minutes to be deployed, depends on the configuration as you configured the SQL Data.

In the Azure Portal, you can view the notification and sooner you will be notified once SQL Data Warehouse is created successfully. Once deployed, you can verify the newly created SQL Data Warehouse under the SQL databases blade as a new entry.

SQL Databases


Congratulation, an Azure SQL Data Warehouse is created!! 😊

Connect to SQL Data Warehouse 


In this demo exercise, you will connect to the SQL Data Warehouse to proceed further to complete the PolyBase exercises.

Move to Azure portal and expand the SQL Data Warehouse, you can see the Server name under the Overview section. Copy the server name that is required to connect from a client/local workstation.

SQL Data Warehouse


Use your taskbar shortcut or any other preferred method to launch SQL Server Management Studio (SSMS) from a client/local workstation; it will launch the Connect to Server dialog.

Here, essentials to provide server details as – 
  • Server type – Database Engine
  • Server name – Provide the SQL Data Warehouse server name, which you copied in earlier steps
  • Authentication – Change the Authentication drop-down list to SQL Server Authentication
  • Login – Same name that you had provided during provisioning of server.
  • Password – Same password that you had provided during provisioning of server.

Connect to Server

Once you submit all details, and then proceed ahead by clicking the Connect button. If everything goes well, it will connect the SQL Data Warehouse.

However, in my case encountered an error, will talk in the next section.

Firewall Rule – Fix Exception


If you have not configured the firewall for the DB server previously, then might you encounter can be an exception as ‘Client with IP address ‘**.**.**.**’ is not allowed to access the server, like – 

Exception

Move to Azure Portal and select the Firewalls and virtual networks option under the Security section of the provisioned Azure SQL Data Warehouse. Here need to add one rule with precise Client IP address, which can allow access the Azure SQL Data Warehouse service.

Firewalls and Virtual Networks


Sometimes you can see a different IP address in blade due to proxy settings, henceforth make sure to create a rule assigning the same IP address that appears in the exception message.

Allow access to Azure services


Now try to connect again, this time SQL Data Warehouse will be connected easily with the SQL Server Management Studio (SSMS).

DB Object Explorer


Next, in the Object Explorer pane, expand the node for your SQL Data Warehouse, and continue to expand nodes to see the database and sample tables that were loaded during the provisioning of SQL Data Warehouse.

Object Explorer


On the SSMS toolbar, click the New Query button and launch new query window. In the query window, execute the following T-SQL query – 

SELECT SalesAmount, ProductLine FROM dbo.SalesByCategory;

Select query


Review the results, you have now successfully connected to and queried your SQL Data Warehouse.

Load and Transform Data using T-SQL with PolyBase


In this hand on an activity, you will load the data into the Azure SQL Data Warehouse that extracted from the CSV file in Azure Storage. In this exercise, you will execute T-SQL with PolyBase that can load and transform the data.

To pursue the tasks, required to create desired DB objects, which is essentials to map and import the data from the CSV file in Azure Storage. In fact, the said file BeachWeatherStationsAutomatedSensors.csv is a small CSV file that contains sensor data from various Lake Michigan beaches in Chicago, Illinois. 

SOURCE: Dataset was obtained from the city’s open data portal - https://data.cityofchicago.org.

STEP – 1: Get the Azure Storage Access Keys 


Since you supposed to execute some T-SQL query to read the Azure Storage, therefore storage access details required to map with a query.

Move to the Azure Portal, click the Access keys under Settings of your Azure Storage account blade, copy, and note down the Storage account name and Key1.

Access Keys


Apart from this copy the container name which appears in the Blobs list, as – 

Container


Henceforth you will catch the following essential details – 
  1. Storage account name – democloudstorage
  2. Container name – sensors-data
  3. Key1 - ***************************************************************************************************************************=='


STEP – 2: Prepare Requisites Objects 


In this activity task, you will create a table for the weather data that will be extracted from the weather file in Azure storage. Now, time to prepare the T-SQL query that will create requisites objects, look below listed code-snippet.

CREATE MASTER KEY;
CREATE DATABASE SCOPED CREDENTIAL DemoCloudStorageCredential WITH IDENTITY = 'user',
SECRET = '*****************************************************************************************************************************==';
CREATE EXTERNAL DATA SOURCE DemoCloudStorage WITH (
TYPE = HADOOP,
LOCATION =
'wasbs://sensors-data@democloudstorage.blob.core.windows.net',
CREDENTIAL = DemoCloudStorageCredential
);
CREATE EXTERNAL FILE FORMAT TextFile WITH (
FORMAT_TYPE = DelimitedText, FORMAT_OPTIONS (FIELD_TERMINATOR = ',')
);

Here you can see that Access Key, Storage Container and Storage Account has mapped with the T-SQL statements. Like,
  • SECRET = Access Key 1
  • LOCATION = wasbs://StorageContainer@StorageAccount.blob.core.windows.net


NOTE: Make sure to apply the above details, as it exists in the Azure Portal, because the account, container and the key fields are case-sensitive.

Post synched-up the details, execute the T-SQL query that will create necessary objects before proceeding the creation of an external table – 

PolyBase T-SQL


You can notice, DemoCloudStorage and TextFile objects will be appear under the External Data Source and External File Formats respectively.

Object Explorer


You can execute a couple of queries to validate new objects by querying the following catalog views – 
  • sys.database_credentials
  • sys.external_data_sources
  • sys.external_file_formats


SELECT * FROM sys.database_credentials
SELECT * FROM sys.external_data_sources
SELECT * FROM sys.external_file_formats

Select queries

STEP – 3: Create an External Table 


In previous steps, you synched-up and created the External Data Source, now time to create the external table for loading weather data. Prepare the following query with the data source mapping and execute in a query window, which connected to the SQL Data Warehouse.

CREATE EXTERNAL TABLE dbo.BeachSensorsExternal (
StationName VARCHAR(50) NOT NULL,
MeasurementTimestamp VARCHAR(50) NOT NULL,
AirTemperature DECIMAL(9,2) NULL,
WetBulbTemperature DECIMAL(9,2) NULL,
Humidity DECIMAL(9,2) NULL,
RainIntensity DECIMAL(9,2) NULL,
IntervalRain DECIMAL(9,2) NULL,
TotalRain DECIMAL(9,2) NULL,
PrecipitationType DECIMAL(9,2) NULL,
WindDirection DECIMAL(9,2) NULL,
WindSpeed DECIMAL(9,2) NULL,
MaximumWindSpeed DECIMAL(9,2) NULL,
BarometricPressure DECIMAL(9,2) NULL,
SolarRadiation DECIMAL(9,2) NULL,
Heading DECIMAL(9,2) NULL,
BatteryLife DECIMAL(9,2) NULL,
MeasurementTimestampLabel VARCHAR(50) NOT NULL,
MeasurementID VARCHAR(100) NOT NULL
)
WITH (
LOCATION='/',
DATA_SOURCE=DemoCloudStorage, FILE_FORMAT=TextFile
);

Create External Table

It will create an external table; you can validate the same in the Object Explorer.

Object Explorer


You can validate the data also; execute the SELECT query to retrieve weather data from the external table. It will display top 10 records of the weather data that has been loaded from the weather file in Azure Storage.

SELECT TOP 10 * FROM dbo.BeachSensorsExternal;

Select query


STEP – 4: Create another Table 


In previous steps, you have created an External Table containing the raw weather data successfully. Now you will create another table in the SQL Data Warehouse that will transform the data and load it, look the following code snippet – 

CREATE TABLE [dbo].[BeachSensor]
WITH (
DISTRIBUTION = ROUND_ROBIN,
CLUSTERED COLUMNSTORE INDEX
) AS
SELECT
StationName,
CAST(MeasurementTimestamp as DATETIME) AS MeasurementDateTime,
AirTemperature,
WetBulbTemperature,
Humidity,
RainIntensity,
IntervalRain,
TotalRain,
PrecipitationType,
Lab 03 | Integrating and Ingesting Data
WindDirection,
WindSpeed,
MaximumWindSpeed,
BarometricPressure,
SolarRadiation,
Heading,
BatteryLife
FROM dbo.BeachSensorsExternal;

SELECT COUNT(*) FROM dbo.BeachSensor;

Here, T-SQL statements are creating a new table named BeachSensor, then take a subset of columns from the raw external table. Post loads of the modified data into BeachSensor table, executing a SELECT query to display the table’s record count.

Create Table


Congratulation, loading data into the Azure SQL Data Warehouse done!! 😊

Thus, in this walked through we uploaded a raw data file into an Azure Blob Container and covered to provisioning an Azure SQL Data Warehouse. Later on, by using PolyBase T-SQL queries, loaded the raw data into the SQL Data Warehouse and verified the same.

Keep visiting for further posts.


SQL Server 2016 - PolyBase Installation on Windows


PolyBase Installation


PolyBase - Introduction


In one of previous articles, we went through the definitions and artifacts about the Azure SQL Data Warehouse and PolyBase. PolyBase is a new feature available in SQL Server 2016 and later versions. It serves to process T-SQL query to relational and non-relational databases (NoSQL). Certainly, using PolyBase you can run queries on external tables and files in Hadoop or in Azure Blob Storage as well import or export data to/from Hadoop.

If needed, then you can visit the post to get a brief introduction about the Azure SQL Data Warehouse and PolyBase .

PolyBase – Installation Prerequisites


PolyBase is a new feature and included in the SQL Server 2016 installer, here we will go through the installation of SQL Server 2016 and enablement the PolyBase feature. In this series, the following pre-requisites needed to proceed further – 
  1. 64-bit SQL Server 2016 or later version or can go with a trial version of SQL Server  - https://www.microsoft.com/evalcenter/evaluate-sql-server-2016
  2. Microsoft .NET Framework 4.5.
  3. Oracle Java Runtime Environment 7. JR7 or later is required, because PolyBase uses Java to connect to Hadoop. You can download JRE on the Oracle JRE page - https://www.oracle.com/technetwork/java/javase/downloads/jre8-downloads-2133155.html
  4. Minimum memory – 4 GB and Space – 2 GB, though recommended minimum of 16 GB RAM.


PolyBase – Installation Mode


Before proceeding to install PolyBase on the SQL Server instance, must be aware about the installation mode i.e. – 
  • Single node (Standalone)
  • PolyBase scale-out group


Since after the installation of PolyBase either standalone or in a scale-out group, you cannot change the configuration later. Even to update any setting later, you have to uninstall and reinstall the feature again. 

Sometime a single node SQL Server instance with PolyBase create some performance bottleneck during dealing with massive data sets in Hadoop or Azure Blob Storage. In that scenario, it would be better to opt the scale-out group option. The PolyBase Group feature allows you to create a cluster of SQL Server instances to process large data sets from external data sources, such as Hadoop or Azure Blob Storage, in a scale-out fashion for better query performance. You can now scale your SQL Server compute to meet the performance demands of your workload. For a PolyBase scale-out group, make sure that:
  • All the machines are on the same domain.
  • You use the same service account and password during PolyBase installation.
  • Your SQL Server instances can communicate with one another over the network.
  • The SQL Server instances are all the same version of SQL Server.


STEP – 1: Launch SQL Installation Wizard


If you are moving with Evaluation edition, then download the setup file using the said URL - https://www.microsoft.com/evalcenter/evaluate-sql-server-2016.

Once downloaded, try to execute the setup file under administrator account that will launch the installation type wizard. In fact, it is essential to select the PolyBase features, so preferably go with Custom mode, and go ahead by clicking the Custom mode.

Evaluation - Custome mode


Sooner it will ask the target location where installation related files would be downloaded.

Evaluation - Media Location

But, If you are moving ahead with the SQL Server 2016 license version like,  Standard or any other edition, then go to the folder and run the SQL Server setup.exe under administrator account - 

SQL Server Installation files

Sooner, the SQL Server Installation Center wizard will be launched, containing all relevant steps and features – 

SQL Server Installation Center


STEP – 2: Installation Process


On the left side of the wizard, select second option Installation, then selects New SQL Server stand-alone installation or add features to an existing installation from the numerous options available.

Sql Server - Installation

A new wizard with a default Product Key option will be launched by clicking the above said option. Here you have opted to provide the product key or go with free edition, select the appropriate and proceed by clicking the Next button.

SQL Server - Product Key


Next, you will get license terms and conditions (License Terms), select the acceptance check box and click the Next button to proceed further.

SQL Server - License Terms

In case of Evaluation edition, there will be not any license terms and conditions.  

STEP – 3: Feature Selection


By accepting the Microsoft software license terms and conditions onwards, the system will proceed further activities meanwhile like, Global Rules, Product Updates, Install Setup Files and Install Rules. Next, time to select the instance feature from the Feature Selection page.

Here must select the PolyBase Query Service for External Data along with other features –

SQL Server - Feature Selection

By clicking the Next button, setup will execute the rules and validate all pre-requisites to make installation succeed. During installing the selected features, you can get the failure message about Oracle JRE 7, as I got it – 
  
SQL Server - Feature Rules

If Oracle Java Runtime Environment 7 or later is not installer earlier, the setup will ask to install the same because PolyBase works on top of Java to connecting the Hadoop.

In some scenario, the failure would be a bit different such as listed below in case of Evaluation edition – 

SQL Server - Global Rules


STEP – 4: Installing the JRE 


In the meantime the JRE 7 or later version is mandatory to fulfil the PolyBase setup, so move to Oracle official download page and download 64 bit compatible Java version – 

JAVA SE Runtime


Move inside of the downloaded folder and execute the jre-8u201-windows-x64.exe file to install Java Runtime Environment. It will launch the Java Setup wizard to proceed further.

Welcome to Java


By clicking the Install button the installation will be started and sooner you will get the completion acknowledgement.

Installed Java


NOTE: Versions 7 (starting from 7.51) and 8 are supported as well JRE and Server JRE both work. JRE9 and JRE10 are not supported.

STEP – 5: Re-run Feature Selection


Post installation of JRE requisite, re-run the SQL Server 2016 setup’s rules under the Feature Rules section. Here you can get all rule validated successfully, go ahead and click the Next button.

SQL Server - Feature Rules

STEP – 6: Instance & PolyBase Configuration


Post Feature Rules validation the setup will launch the Instance Configuration wizard. Here you can go with either default instance, or some precise named instance. 

SQL Server - Instance Configuration

After Instance configuration, the setup will launch the PolyBase Configuration wizard. Since this is SQL Server 2016 Standard edition, so we have only one available option as scale-out group.

However, if you are re-installing the SQL Server and using an Evaluation edition, then it will ask a fresh installation or will go ahead with adding features to an existing one.

SQL Server - Installation Type

In case of SQL Server 2016 Standard edition, need to specify a port range with at least six ports; SQL Server setup allocates the available ports from the range automatically. Proceed with default port range for the PolyBase services and click the Next button.

PolyBase Configuration

However, in case of Evaluation edition, you can go with Standalone mode since it is a demo-based activity in the context of PolyBase exposure.

SQL Server - PolyBase Configuration


STEP – 7: Server Configuration


Once you configured the Instance and PolyBase settings, time to provide the Service Account. Here again, as per SQL Server edition, you will have to provide account like a common domain account or a standard service account etc.

For example, the wizard will be appearing a bit different from the Standard and Evaluation editions, like - 
   
SQL Server - Server Configuration


SQL Server - Server Configuration

STEP – 8: Remaining Configuration


Post Server Configuration, required to set up remaining configurations like – 

Analysis Service Configuration – Here it is essential to select the Server Mode and assign at least one admin-privileged account.

Analysis Services Configuration


Reporting Service Configuration – By default, you can get only one option unless until you select the reporting feature earlier from the Feature Selection section.

Reporting Services Configuration


Consent to install Microsoft R Open – Since I had selected the Microsoft R feature during selection so will have to accept the downloading and installation of Microsoft R Open.

Microsoft R Open


STEP – 9: Complete the Installation


Post acceptance of the Microsoft R Open, by clicking the Next button setup will be proceeding further and will be Ready to Install with all configured settings as well selected services.

Ready to Install


It will start the installation, the time you click the Install button. Setup will take some time to complete the installation and you will get the Complete page with success acknowledgement.

Complete


Congratulations, PolyBase installed!! 😊

Enable the PolyBase


You just completed the successful installation of the PolyBase, which must be enabled to access the features. Next, connect the SQL Server (SSMS) and execute the following T-SQL command – 

USE master;  
GO  
EXEC sp_configure 'show advanced option', '1';  
RECONFIGURE; 
GO
EXEC sp_configure 'hadoop connectivity', 4; 
GO 
RECONFIGURE; 
GO

Here show advanced option in 1 is used to enable an advanced option and hadoop connectivity equal to 4, is used to connect to Hadoop in Windows Servers.

Post execution of the T-SQL statements, you can see the succeed message – 

T-SQL statements

Confirm the PolyBase Installation


Now PolyBase is ready and you can query the file and tables in Azure Blob or Hadoop using PolyBase. However, it is always good to validate the PolyBase installation to avoid any surprises later. 

Execute the following command, if PolyBase is installed, the return is 1 otherwise 0 will be returned.

SELECT SERVERPROPERTY ('IsPolyBaseInstalled') AS IsPolyBaseInstalled;

T-SQL Statements

Apart from this, PolyBase installs three user databases as listed below – 
  • DWConfiguration
  • DWDiagnostics, and 
  • DWQueue 


The above said databases should exist as well required for the PolyBase use, henceforth does not delete or alter them.

Database Objects


In a summarization, here you walked through the SQL Server 2016 installation along with PolyBase features. Now in further next articles we will do some hands on activities with PolyBase like, leveraging PolyBase to load data into Azure SQL Data Warehouse from an Azure Blob Storage.

Keep visiting the blog!