Connecting to an Azure HDInsight Cluster and Hadoop Hands-on Activities






In my previous post I explained the concepts of Big Data and HDInsight and provisioned a cluster on top of the Azure cloud. Here we will connect to that same cluster and work through some hands-on activities, such as processing Big Data with Hadoop in a few different ways.

An HDInsight cluster is a prerequisite for these tasks, so either select an existing HDInsight cluster or create a new one. To provision a new cluster, please refer to my previous post, titled Big Data and HDInsight Cluster – provisioning on top of Azure cloud.

Here you will cover the following Hadoop hands-on activities on top of an Azure HDInsight cluster – 
  • Connect to an HDInsight cluster
  • Browse cluster storage
  • Execute commands to explore the HDFS file system
  • Upload and process data files
  • Run MapReduce jobs using the sample functions


Connecting to an HDInsight Cluster


I already provisioned an HDInsight cluster in the previous post, so I will log on to the Azure portal again to fetch the SSH details - https://portal.azure.com/

I am using a Windows based computer, so I will use the PuTTY application to connect to the cluster.

STEP – 1 

Go to the Azure portal and select the HDInsight Cluster blade. Here, click the SSH + Cluster login link under the Settings category; it will load the Cluster Dashboard.



STEP – 2

Next, select the Hostname and it will display the endpoint through which you can connect to the cluster.

In my case the hostname is something like - HDInsightDemo-ssh.azurehdinsight.net


STEP – 3

Open PuTTY, and in the Session page, enter the host name (the earlier copied hostname) into the Host Name box. Then, under Connection type, select SSH and click Open.



If you get a security alert saying that the server’s host key is not cached in the registry and asking whether to connect or abandon the connection, simply click Yes to continue.



STEP – 4

After the security alert, when prompted, enter the SSH username and password you specified while provisioning the cluster. Make sure to use the SSH username, not the cluster login username.



Once authenticated, you will be connected to the HDInsight cluster console, where you can see a few details such as the Ubuntu 16.04.5 LTS server on top of which this HDInsight cluster is running.
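If you are not on Windows, or prefer a terminal, the same connection can usually be made with the OpenSSH client instead of PuTTY. The sketch below only builds the endpoint name from a cluster name, following the hostname pattern shown in STEP 2. The function name hdi_ssh_host and the username sshuser are placeholders I made up for illustration, not values from the portal.

```shell
# Build the SSH endpoint from a cluster name, following the pattern seen in
# this post (HDInsightDemo -> HDInsightDemo-ssh.azurehdinsight.net).
# hdi_ssh_host is a hypothetical helper name.
hdi_ssh_host() {
  printf '%s-ssh.azurehdinsight.net\n' "$1"
}

# Example (not run here): "sshuser" is a placeholder for your own SSH username.
# ssh "sshuser@$(hdi_ssh_host HDInsightDemo)"
```

As with PuTTY, you would be asked to accept the host key on first connection and then to enter the SSH password.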


 Congratulations, Azure HDInsight Cluster connected!! 😊

Some hands on activities with HDInsight cluster 


Since you have successfully opened an SSH console to the cluster, you can now use it to work with the cluster's shared storage system. Hadoop uses a file system named HDFS, which in Azure HDInsight clusters is implemented as a blob container in Azure Storage.

Next, it is time to do some hands-on activities on top of the cluster using Hadoop commands. Note that the commands are case-sensitive.

Browse Cluster Storage


Task – 1  

Execute the following command to view the contents of the root folder in the HDFS file system. 

hdfs dfs -ls /
 

Task – 2 

Execute the following command to view the contents of the /example folder in the HDFS file system. This folder contains sub-folders for sample apps, data, and JAR components etc.

hdfs dfs -ls /example



Task – 3

Execute the following command to view the contents of the /example/data/gutenberg folder, which contains sample text files.

hdfs dfs -ls /example/data/gutenberg



Task – 4
Execute the following command to view the text in the davinci.txt file on console.

hdfs dfs -text /example/data/gutenberg/davinci.txt
  
 


You can see the file contains a large volume of unstructured text.

Run a MapReduce Job


MapReduce is a framework for writing applications that process huge amounts of data. It is a processing technique and a programming model for distributed computing based on Java.

MapReduce refers to two distinct tasks that Hadoop programs perform to distribute the processing of data across the nodes in the cluster. The first is the map job, which takes a set of data and converts it into another set of data, where individual elements are broken down into tuples structured as key-value pairs.

Next, the reduce job takes the output from a map as input and combines those data tuples into a smaller set of tuples. As the order of the name MapReduce suggests, the reduce job is always performed after the map job.
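To make the map and reduce steps concrete, here is a small local sketch of a word count written as a shell pipeline. It only mimics the three phases (map, sort/shuffle, reduce) on a single machine and is not how Hadoop actually runs a job. The function names map_words, reduce_counts and wordcount_local are my own, and it assumes words are separated by whitespace.

```shell
# Local illustration only: map -> sort (shuffle) -> reduce for a word count.

# Map: emit one "word<TAB>1" pair per whitespace-separated word on stdin.
map_words() {
  tr -s '[:space:]' '\n' | awk 'NF { print $0 "\t" 1 }'
}

# Reduce: input is sorted, so equal keys are adjacent; sum the counts per key.
reduce_counts() {
  awk -F '\t' '
    $1 != key { if (NR > 1) print key "\t" total; key = $1; total = 0 }
    { total += $2 }
    END { if (NR > 0) print key "\t" total }
  '
}

# Full pipeline: sorting between map and reduce plays the role of the shuffle.
wordcount_local() {
  map_words | LC_ALL=C sort | reduce_counts
}

# Usage: printf 'to be or not to be\n' | wordcount_local
# prints be 2, not 1, or 1, to 2 (one tab-separated pair per line)
```

The sort step is what brings identical keys together, which is why the reduce step can work with a single pass over its input.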

Task – 1

Execute the following command to view the sample Java jars stored in the cluster head node.

ls /usr/hdp/current/hadoop-mapreduce-client




Task – 2

Execute the following command to get a list of MapReduce functions available in the hadoop-mapreduce-examples.jar.

hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar


Task – 3

Execute the following command to get help for the wordcount function in the hadoop-mapreduce-examples.jar that is stored in the cluster head.

hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar wordcount


Task – 4

Execute the following command to run a MapReduce job using the wordcount function in hadoop-mapreduce-examples.jar. It processes the davinci.txt file you already viewed and stores the results of the job in the /example/demoresults folder.

hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar wordcount /example/data/gutenberg/davinci.txt /example/demoresults

The MapReduce job will start processing the file with the wordcount function promptly.


As soon as the MapReduce job completes, you can see the related details on the console.
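One thing to keep in mind if you repeat this task: a MapReduce job refuses to start when its output folder already exists. A minimal sketch for clearing the old folder first is shown below, assuming you really want to discard the earlier results. The helper name clean_output_dir is my own; it relies only on the standard hdfs dfs -test and hdfs dfs -rm options.

```shell
# Remove an HDFS output folder if it already exists, so a job can be re-run.
# clean_output_dir is a hypothetical helper name; it deletes data, so use it
# only on folders whose contents you no longer need.
clean_output_dir() {
  local dir="$1"
  if hdfs dfs -test -d "$dir"; then
    hdfs dfs -rm -r "$dir"
  fi
}

# Example (run on the cluster before repeating Task 4):
# clean_output_dir /example/demoresults
```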

Task – 5

After the MapReduce job completes, execute the following command to view the output folder. Notice the file named part-r-00000, which has been created by the job.

hdfs dfs -ls /example/demoresults



Task – 6

Execute the following command to view the results in the output file part-r-00000.

hdfs dfs -text /example/demoresults/part-r-00000



Uploading and Processing Data Files


In the previous hands-on activities, you executed a MapReduce job on a sample file that is provided with HDInsight. Now you will upload your own data to the Azure blob store, process it with Hadoop, and then download the results for analysis on your local computer.

Task – 1

You need a reasonably large text file for this task; you can either create one or download one. I am going to download one, for example, some product reviews – 


Task – 2

Since you need to upload this file to the Azure Blob Storage of the HDInsight cluster, you can either use Azure Storage Explorer or upload it manually via the Azure portal.

Go to the Azure portal and select the HDInsight cluster storage account.


Next, move inside the Blobs.



After selecting the default container, you can see the HDInsight container blade, where the HDFS files and folders are displayed.


Task – 3

Click the Upload link inside the Container blade and upload the earlier downloaded sample file reviews.txt into a new folder named demofiles.

Leave all other options at their defaults and click the Upload button on the Upload blob blade.


Shortly you will get an acknowledgement that the upload succeeded.



Task – 4

Switch to the SSH console for your HDInsight cluster and execute the following commands to list the uploaded folder and view the file contents.

hdfs dfs -ls /demofiles
hdfs dfs -text /demofiles/reviews.txt
  




Task – 5

Execute the following command to run a MapReduce job using the wordcount function in hadoop-mapreduce-examples.jar to process the uploaded file reviews.txt and store the results of the job in the /demofiles/results folder.

hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar wordcount /demofiles/reviews.txt /demofiles/results

The MapReduce job will promptly start to process the file and will complete shortly.




Task – 6

Execute the following command to view the output folder /demofiles/results, and verify that a file named part-r-00000 has been created by the job.

hdfs dfs -ls /demofiles/results



Task – 7

Next, move to the Azure portal, go inside the HDInsight container, and verify the results folder and output files.


Task – 8

Click the part-r-00000 file to see summary details about the blob, then click the Download link to download it to your local computer.



Task – 9

The part-r-00000 text file is a tab-delimited file, so you can use either a spreadsheet application or a normal text editor to see the word counts.

I am opening the file using the Notepad++ editor.
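If you have a Unix-style shell available (for example the cluster's SSH console, or a local Linux or macOS terminal), the downloaded file can also be summarised from the command line. The sketch below lists the most frequent words. It assumes each line has the form word, tab, count, as described above, and the function name top_words is my own.

```shell
# Print the N most frequent words from a tab-delimited "word<TAB>count" file.
# top_words is a hypothetical helper; N defaults to 10.
top_words() {
  local file="$1" n="${2:-10}"
  # Sort by the second field, numerically descending; break ties by word.
  LC_ALL=C sort -t "$(printf '\t')" -k2,2nr -k1,1 "$file" | head -n "$n"
}

# Example: top_words part-r-00000 20
```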



Keep visiting for further articles! 👍

Big Data and HDInsight Cluster – provisioning on top of Azure cloud





Big Data – Introduction


In today's IT world, Big Data is the latest watchword. Big Data is a term used for a collection of large and complex data sets. These massive volumes of data are so mixed and chaotic that they are difficult to store and process with traditional data processing applications or database management tools.

Big Data is everywhere these days. Some striking real-life big data examples - 
  1. Retail companies handle millions of customer transactions every hour.
  2. It is believed that a single jet engine can generate more than 10 terabytes of data within thirty minutes of flight.
  3. On social media, statistics indicate that more than 500 TB of new data gets ingested into Facebook's databases, and more than 230 million tweets are created on Twitter, every day.
  4. Present-day cars have nearly 100 sensors, which monitor fuel level, tire pressure, etc. and generate a lot of sensor data.
  5. Stock exchanges such as the New York Stock Exchange generate about one terabyte of new trade data per day.

Big Data – Sources




In reality, whenever someone opens an application on their phone, surfs the internet, searches for something in a search engine, or carries out many other everyday activities, a piece of data is gathered. In brief, the following are the major sources of big data – 
  1. Social networking sites like Facebook, Google, LinkedIn, Twitter etc.
  2. E-commerce sites like Amazon, Flipkart etc.
  3. Weather stations like the India Meteorological Department.
  4. Telecom companies like Airtel, Vodafone, Idea etc.
  5. Share markets like NSE, BSE etc.
  6. Medical records, etc.


Big Data – Categories


Big Data could be of three types – 



[1] Structured 

Any type of data that can be stored, accessed, and processed in a pre-defined format is called Structured Data. For example, an RDBMS is one of the best examples of structured data, where we can manage data with the help of different schemas and process it using SQL.

[2] Unstructured

Any data whose form or structure is unknown, which cannot be stored in an RDBMS and is not easy to analyze, is classified as Unstructured Data. For example, heterogeneous data sources: the result of a Google search returns a combination of text files, images, audio, video, etc.

[3] Semi – structured

It is a type of data that does not have a formal structure in RDBMS terms, but does have some organizational properties and elements. For example, an API that generates either XML or JSON output.

Big Data – Four V’s


Four precise terms, known as the four V’s, are associated with big data; they define its characteristics and help sharpen the definition of big data.



[1] Velocity

Velocity is one of the major characteristics of big data. It refers to the pace at which data is generated by different sources every day. It deals with the speed at which data flows in from sources and is processed to meet demand, which determines the real potential in the data.

[2] Volume 

Volume refers to the massive amount of data, which grows day by day at a very fast pace from a variety of sources, including business transactions, social media, and sensor or machine-to-machine data.

[3] Veracity

Veracity refers to the trustworthiness of data: doubtful or uncertain data leads to inconsistency and incompleteness. For big data to be of worth to an organization, make sure it is correct.

[4] Variety

Variety refers to the assorted sources and nature of data, which may be structured, unstructured, or semi-structured. In brief, big data can be varied: it can exist in different forms such as images, audio, video, and sensor data.

Azure HDInsight and Hadoop cluster – Introduction


Azure HDInsight is a Hadoop service offering based on the Hortonworks Data Platform (HDP) and hosted on top of the Azure cloud. It is a cloud-based, fully managed, full-spectrum, open-source analytics service for enterprises to process massive amounts of data. HDInsight also supports a broad range of scenarios, such as batch processing in terms of extract, transform, and load (ETL), data warehousing, machine learning, internet of things (IoT), and data science.

Apache Hadoop is an open-source distributed data processing framework that uses HDFS, YARN resource management, and a simple MapReduce programming model to process and analyze batch data in parallel.

You can visit my previous post to do some hands on activity with Hadoop on top of Windows 10.

Azure HDInsight deploys and provisions Apache Hadoop clusters on top of Azure cloud, providing a software framework designed to manage, analyze, and report on big data with high availability and utilization.

Next, you will provision an HDInsight cluster, run a sample MapReduce job on the cluster, and check the results.

Provisioning and Configuring an HDInsight Cluster


Pre-requisites


STEP - 1

Log in to the Azure portal https://portal.azure.com/

Click ‘+ Create a resource’ in the left-hand menu, then open the Analytics category under the Azure Marketplace tab. Click the HDInsight link under the Featured category – 


After you select HDInsight, a new blade for creating an HDInsight cluster appears, with several categories – 



STEP - 2

On the Basics tab, make sure the correct subscription is selected and submit appropriate details as follows – 
  • Cluster Name – Enter a unique name.
  • Subscription – Select your Azure subscription.
  • Cluster Type – Hadoop.
  • Operating System – Linux.
  • Version – Select the default one, most probably it is the latest version of Hadoop.
  • Cluster Login Username – Enter a user name of your choice.
  • Cluster Login Password – Enter a strong password.
  • SSH Username – Enter another user name of your choice (to access the cluster remotely).
  • SSH Password – You can use the same password as above.




Next, either choose an existing Resource group or create a new one using Create new. I am going with the earlier created resource group ‘rajResource’.

I then kept the default data center location, East US 2, and clicked the Next button.


STEP - 3

After you click the Next button, the Storage blade appears, where you need to submit the following details – 
  • Primary storage type – Azure Storage
  • Selection Method – My Subscriptions
  • Storage account – Either select an existing one or create a new storage account.
  • Default Container – Enter a new name or go with default selection with cluster name.




Apart from this, leave the remaining two options, Additional storage accounts and Data Lake Storage Gen1 access, at their default Optional selection.

There is no need to provide any input for the Metastore Settings either; it is an optional setting, so you can go ahead and click the Next button.



STEP - 4

After you click the Next button, the Cluster summary blade appears and, once validation succeeds, you can see the details of the HDInsight cluster you are about to create.


Azure HDInsight clusters are billed on a per-minute basis, and a cluster runs a group of nodes depending on the component. Nodes vary by group (for example, Worker Node and Head Node), quantity, and so on, so we choose the smallest available size for demo purposes.

Visit the official Microsoft Azure pricing page for more details. I went with two Worker Nodes and clicked the Next button – 


After you click the Next button, you will get the Script actions blade. Leave the default selection, since it is an optional input, and click the Next button – 



STEP - 5

Next, the Cluster summary blade appears again and, once validation succeeds, you can see the details of the HDInsight cluster you are about to create.



If all the information looks fine, click the Create button. It will take a while for the cluster to be provisioned and for its status to show as Running (a good time to have a cup of tea! ☕).


Note: As soon as an HDInsight cluster is running, the credit in your Azure subscription will start to be charged. After the demo lab, do not forget to clean up the resources to avoid using your Azure credit unnecessarily.

Shortly you will get a notification once the HDInsight cluster has been provisioned successfully.



Congratulations, the HDInsight Cluster is deployed!! 😊 Time to connect to the cluster.

View Cluster details in the Azure Portal


In the Microsoft Azure portal, select your resources and move to the HDInsight Cluster blade, where the summary of your created cluster appears.



On the HDInsight Cluster blade, you can also change the size setting, for example scaling the number of worker nodes to meet processing demand.



Cluster dashboards


Apache Ambari is a web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters. It is a fully open-source Apache project and a graphical interface to Hadoop. You can explore the dashboard for your cluster using this web application.


Click the Ambari home link under the Cluster dashboard section; it will redirect to the cluster portal and ask you to log on. Make sure to provide the cluster login user name and password, not the SSH user name.

For example, in my case – https://hdinsightdemo.azurehdinsight.net/



After a successful login, the web application will display the dashboard of the HDInsight cluster, where you can see all running Big Data components and their details – 


Connecting to an HDInsight Cluster


The HDInsight cluster has been provisioned, and you have also gone through the dashboard using the Ambari web application. Now it is time to connect to the HDInsight cluster using an SSH client such as PuTTY.

Since I am using a Windows based computer, I will use the PuTTY application to connect to the cluster.

STEP – 1 

Go to the Azure portal and select the HDInsight Cluster blade. Here, click the SSH + Cluster login link under the Settings category; it will load the Cluster Dashboard.



STEP – 2

Next, select the Hostname and it will display the endpoint through which you can connect to the cluster.

In my case the hostname is something like - HDInsightDemo-ssh.azurehdinsight.net


STEP – 3

Open PuTTY, and in the Session page, enter the host name (the earlier copied hostname) into the Host Name box. Then, under Connection type, select SSH and click Open.



If you get a security alert saying that the server’s host key is not cached in the registry and asking whether to connect or abandon the connection, simply click Yes to continue.



STEP – 4

After the security alert, when prompted, enter the SSH username and password you specified while provisioning the cluster. Make sure to use the SSH username, not the cluster login username.
 

Once authenticated, you will be connected to the HDInsight cluster console, where you can see a few details such as the Ubuntu 16.04.5 LTS server on top of which this HDInsight cluster is running.



Congratulations, Azure HDInsight Cluster connected!! 😊

In the next post we will connect to the same cluster and process Big Data via Hadoop through some hands-on activities like – 
  • Browse cluster storage
  • Run a MapReduce job
  • Upload and process data files etc.

Stay in touch!! 👍