Intro to T-SQL for Data Science
Data Science uses multitudes of scientific methods, algorithms, and processes to extract knowledge and insight from data and uses it across wide range of domains. Data Science helps to extract patterns from raw data which would have been invisible beforehand. In order to access and experiment with these data, a Database is essential. Thus, today we’ll learn about the Database as a whole and use examples and features of T-SQL to learn to find meaning out of raw data.
Check out other articles on
SQL.
T-SQL is one of the mostly widely used SQL Derivatives. It is known as a transactional language used to define how things are to be done. T SQL is one of the easiest and most effective way to get things done. It is importantly used to create applications and also to add business login into the application the backend systems.
Cloud or Cloud Computing can be defined as the on-demand availability of the ecosystem of computer resources ranging from data storage (cloud storage) to computer power without the user having to actively manage the system. These are the data centers which are available to users across the internet. We can say, Cloud is the architecture of the system where someone else owns the hardware such as AWS, Azure and Alibaba Cloud and the user can use it for their need at the time of convenience. This can also be called as Off-Premise.
On-premises is the scenario where the user owns the Hardware and the software systems are deployed on the very premises of the user or of the organization instead of remote facility for instance, Cloud Computing or Server Farm.
If you want to install in your own machine,
Download and Install SQL Server Express
-> SQL Server 2019 Express Edition
- Express – which is free
- Web
- Standard
- Enterprise
- Developer
Watch SQL Server Conference Videos on T-SQL for Better Performance,
Tools of the Trade (IDE)
SSMS
SQL Server Management Studio (SSMS) is an application software which is widely used to configure, manage and administer different components within the Microsoft SQL Server.
VS Code
Visual Studio (VS) Code is a free software developed by Microsoft for Windows, Linux and macOS which developers and engineering write and edit code, debug, refactor and provides multiple other functionalities.
What is an IDE?
In layman’s terms, this is an environment where you can write your code and/or perform your analysis. For E.g., Microsoft SQL Server Management Studio, VS Code, Azure Data Studio.
We would highly suggest you to try out Azure Data Studio to start out T-SQL journey as it is free and to learn from the experience.
Database
Database can be defined as the collections of information which is organized such that the information could be accessed and worked upon. It is often controlled by a database management system which all together with database is known as a database system, in short form just called database.
Database Management System (DBMS)
A DBMS acts as the interface between the database and its end-users such that the users can access, manage and update the information.
Database Objects
Database objects can be understood as the objects in a database which are used in order to store or refer to the data. Tables, Views, Sequences, Indexes, Clusters and Synonyms are all examples of the database objects.
Table
A set of related data stored in our database
Excel Reference
A sheet in Excel. Just like Excel contains spreadsheets, relational databases are composed of tables.
A set of data of a particular type, generally there is a value for each row in our table.
A collection of fields that make up a record.
The smallest source of data in a database
The logical processing order of a SELECT statement explains the methods of how the query will be process and the final result will be achieved.
- FROM
- ON
- JOIN
- WHERE
- GROUP BY
- WITH CUBE or WITH ROLLUP
- HAVING
- SELECT
- DISTINCT
- ORDER BY
- TOP
Working with Data:
Selecting
This is the process of retrieving one or many rows or columns from one or many tables in a database. E.g., Returning 20 records with our 11 selected columns from the Cars Table
Filtering
This is the process of excluding data based on predefined rules. E.g., Returning 12 records with our 11 selected columns from the Cars Table where the cars have a body style of type “convertible”
Ordering
This is the process of sorting a dataset based on a specified sort order. E.g., Returning 12 records with our 11 selected columns from the Cars Table where the cars have a body style of type “convertible” and ordering them from the widest based on the Wheelbase.
Grouping
Grouping data within a dataset typically over 1 or more columns. E.g., Returning All the records from the Cars dataset, grouping the data over the Make to see how many Models each Make has that is of Body_Style Convertible and what is the Average Price for each Model.
Modifying
Permanent VS Temporary data modifications. For Permanent we use the UPDATE statement, for temporary we can use what best suite our needs. E.g., We are correcting the spelling error of ‘Audi’ in the Make the top statement will only display the corrected data, where the bottom will permanently change the underlying data.
Working with Multiple datasets
Joining
Joining 2 or more datasets together on a common “join” key. The SQL join allows us to collect two or more tables using the common identifiers. There are different types of join such as: Inner Join, Outer Join, Cross Join and Semi-join.
If you want to dive deeper into T – SQL, watch this video embedded below,
SET Operations
The SET Operations like Union, Intersect, Minus help us get meaningful outputs from the data stored in table under various special conditions.
Watch this video by
AI 42 to dive indepth in T-SQL for Data Science,
Views and Procedures
Creates a virtual table with predefined Logic, this is useful for repetitive data alterations.
Common Table Expressions (CTE’s)
It is a temporary result set which can be referenced within another CRUD statement.
Pivot Vs Un-Pivot
Pivot and Un-Pivot can be used to alter the expressions of table into another table. Pivot rotates the expression of the table by transforming the unique values from a column in the expression in numerous columns in output whereas Unpivot does the opposite operation of Pivot.
Stored Procedures
An encapsulated set of T-SQL commands that can be ran repeatedly.
Rank
The RANK() function is a widely used window function which assigns a rank to each row in a partition of a result set. It returns a rank for each row in our window. 1+ the number in the ranking before it.
Analytic Functions
It calculates the aggregate value which is based on the group of rows. It returns multiple rows for each group. We can extensively use analytic functions to calculate moving averages, percentages, running total and top – N results within a group.
The analytic functions supported by the Transact- SQL are as follows,
- CUME_DIST
- FIRST_VALUE
- LAG
- LAST_VALUE
- LEAD
- PERCENT_RANK
- PERCENTILE_CONT
- PERCENTILE_DISC
Aggregate Functions
It calculates a set of values and thus returns back an individual value. It is widely used with the GROUP BY clause of the SELECT statement. Aggregate functions are always deterministic.
The aggregate functions provided by T-SQL is as listed below,
- APPROX_COUNT_DISTINCT
- AVG
- CHECKSUM_AG
- COUNT
- COUNT_BIG
- GROUPIN
- GROUPING_ID
- MAX
- MIN
- STDEV
- STDEVP
- STRING_AG
- SUM
- VAR
- VARP
In this article, we learned different aspects of T-SQL. Understanding the resources itself is critical. We learned about different tools from Cloud Computing to On-premise systems to IDEs to work on. Using these tools, will help us get used to the technologies that are extensively employed in the professional setting. It also helps us experience with the state of art technologies. Furthermore, we learnt database, DBMS, different database objects, logical processing order, data modification methods and numerous other functionalities provided by T-SQL. This is a stepping stone to become able to play with data. All these tools will help us down the line to operate on data as per our needs and help us make progress on our Data Science journey. Learning to extract knowledge and insights from data as per the need for our projects and research works will make us a competent Data Scientist.