Introducing U-SQL: Making Big Data Processing Easy

Microsoft recently announced Azure Data Lake, a new Azure service for analytics in the Microsoft Azure cloud. It provides a large-scale repository and runs YARN as a default service, which benefits developers, DBAs, and data scientists who need to analyse large-scale data. YARN is Apache Hadoop's next-generation MapReduce, also known as MapReduce 2.0 (MRv2).

Azure Data Lake offers a managed way to run Hadoop, Spark, and HBase services. It uses U-SQL, a language that combines the benefits of SQL with scalable, distributed query capabilities for developers and DBAs who currently work with Big Data. It also lets you efficiently analyse data in the store and across relational stores such as Azure SQL Database.

Azure Data Lake

Benefits of Data Lake

Azure Data Lake makes batch, real-time, and interactive analytics easy. It lets you:

  • Store and analyze data of any kind and size.
  • Develop faster, debug and optimize smarter.
  • Interactively explore patterns in your data.
  • No learning curves—use U-SQL, Spark, Hive, HBase, and Storm.
  • Managed and supported with an enterprise-grade SLA.
  • Dynamically scales to match your business priorities.
  • Enterprise-grade security with Azure Active Directory.
  • Built on YARN, designed for the cloud.

Why use U-SQL?

U-SQL is a highly scalable language. Looking at the demands of Big Data analytics, several requirements emerge. Here are the major ones, as described on the MSDN blog:

  • Process any type of data. From analysing BotNet attack patterns in security logs to extracting features from images and videos for machine learning, the language needs to let you work on any data.

  • Use custom code easily to express your complex, often proprietary business algorithms. The example scenarios above may all require custom processing that is not easily expressed in standard query languages, ranging from user-defined functions to custom input and output formats.

  • Scale efficiently to any size of data without forcing you to focus on scale-out topologies, plumbing code, or the limitations of a specific distributed infrastructure.
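Because U-SQL expressions are C# expressions, custom logic can be mixed directly into a query without a separate UDF framework. A minimal sketch, assuming a rowset @t with string columns author and tweet (names are illustrative, not from the original script):

```sql
// Sketch: C# string methods used inline in a U-SQL SELECT.
// @t, author, and tweet are assumed names for illustration.
@normalized =
    SELECT author.ToLowerInvariant() AS author,  // C# method call on a column
           tweet.Trim() AS tweet
    FROM @t
    WHERE tweet.Contains("@");                   // C# predicate as a filter
```

More elaborate custom processing (custom extractors, outputters, and user-defined operators written in C#) follows the same pattern of referencing .NET code from the script.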

How to use U-SQL?

To see how U-SQL is used, consider this example from the MSDN blog. Let’s assume I have downloaded my Twitter history of all my tweets, retweets, and mentions as a CSV file and placed it in my Azure Data Lake Store. The top 50 rows of the .csv file can be seen here:

[Image: preview of the Twitter history .csv file. Source: blogs.msdn.com]

The following script simply counts the number of tweets for each author in the tweet “network”:

  @t = EXTRACT date string
             , time string
             , author string
             , tweet string
       FROM "/input/MyTwitterHistory.csv"
       USING Extractors.Csv();

  @res = SELECT author
              , COUNT(*) AS tweetcount
         FROM @t
         GROUP BY author;

  OUTPUT @res TO "/output/MyTwitterAnalysis.csv"
  ORDER BY tweetcount DESC
  USING Outputters.Csv();
The script above shows the three major steps of processing data with U-SQL:
  1. Extract data from your source. Data types are based on C# data types, and the built-in extractor library is used to read and schematize the CSV file.

  2. Transform using SQL and/or custom user-defined operators. The preceding example uses a familiar SQL expression that performs a GROUP BY aggregation.

  3. Output the result into a file. You can also store it in a U-SQL table for further processing.
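Step 3 mentions storing results in a U-SQL table instead of a file. Following the pattern shown on the same MSDN blog, the OUTPUT statement can be replaced with a CREATE TABLE … AS SELECT; the table and index names here are illustrative:

```sql
// Sketch: persist @res into a U-SQL table rather than a CSV file.
// Table name and index name are assumed for illustration.
CREATE TABLE TweetAuthorsAndCount(
    INDEX idx CLUSTERED(author ASC)   // clustered index on the author column
    PARTITIONED BY HASH(author)       // hash-distribute rows by author
) AS SELECT * FROM @res;
```

A table created this way keeps the data schematized inside the Data Lake, so later scripts can query it directly instead of re-extracting the CSV.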

This was just an introduction to U-SQL. For further understanding, refer to the MSDN blog examples, which show how to:

  • Add additional information about the people mentioned in the tweets.
  • Extend my aggregation to return how often people in my tweet network author tweets and how often they are mentioned.