SQL is not the sexiest language on the block, and most data scientists I know prefer to stick to R and/or Python. Some common complaints I hear about SQL are:
- It is hard to read, and as a consequence large SQL statements are hard to debug.
- Version control with databases often requires additional tooling to work with git.
However, since SQL was designed to work with and process data, I think that every data scientist should have at least a basic understanding of how to write queries efficiently. While I am no expert in SQL myself, maybe you will still find something interesting here that can help you in your work.
There are a lot of SQL flavors out there, but in this post I will focus on Microsoft’s T-SQL. It should be easy enough to translate T-SQL to your SQL flavor of choice.
Before we get started, I would recommend you download SQL Server Management Studio or VS Code with the MSSQL extension. Another cross-platform option is SQL Operations Studio, which offers a VS Code style look and feel. To follow along, you will need access to a SQL Server instance. If you don't, you can:
- Install SQL Server 2017 locally on either Windows or Linux
- Install SQL Server 2017 using Docker or
- spin up a SQL Server instance in the cloud using Azure or AWS
Once you are connected to a SQL Server instance, you are ready to go:
Creating a database and a table
First, let’s create a database and a table:
```sql
CREATE DATABASE test_db

-- select database to use for current connection:
USE test_db

CREATE TABLE test_table (
    id INT NOT NULL,
    last_name VARCHAR(50) NOT NULL,
    first_name VARCHAR(50),
    age INT
);
```
You can access the online help for a specific command by highlighting a command and pressing F1.
A schema is basically a namespace in a database and is also closely tied to permission settings. The default schema is [dbo] and is specified immediately after the database name:
```sql
SELECT * FROM [server_name].[test_db].[dbo].[test_table]
```
This is the fully qualified name of a table, starting with the server [server_name], then the database [test_db], followed by the schema [dbo] and the table [test_table] at the very end. Most of the time you will probably not bother to write the fully qualified name, especially if you use the default [dbo] schema, since no explicit schema implies [dbo].
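For illustration, here is a hypothetical extra schema (the name staging is my own choice). Note that CREATE SCHEMA has to be the first statement in its batch, hence the GO separator:

```sql
-- create a separate namespace next to [dbo]:
CREATE SCHEMA staging
GO

-- the same table name can now exist in both schemas:
CREATE TABLE staging.test_table (id INT NOT NULL)
```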
Now we are going to add some data to our table:
```sql
INSERT INTO dbo.test_table (id, last_name, first_name, age)
VALUES
    ('1', 'Phu', 'Winnie', NULL),
    ('2', 'Pan', 'Peter', '17')
```
Importing data from one SQL Server into another is also easy. You can either write a query:
```sql
INSERT INTO [your_schema_here].[your_table_here]
SELECT *
FROM [source_server].[source_database].[source_schema].[your_table_here]
```
or use the ‘SQL Server Import and Export Data’ tool, which is pretty much self-explanatory.
There are only a couple of commands that cover probably 99% of all cases you are going to encounter:
```sql
-- * selects all columns
SELECT *
FROM [test_db].[dbo].[test_table]

-- select specific columns by name
SELECT
     id
    ,last_name
    ,first_name
    ,age
FROM [test_db].[dbo].[test_table]
```
Note: SQL is NOT case sensitive (at least not with the default collation), so LAST_NAME, last_name, Last_Name, and every other variation refer to the same column.
Some names are reserved keywords and need to be put in square brackets if you want to use them as table or column names, like so:
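(A throwaway example of my own: order, key, and desc are all reserved words.)

```sql
-- without the brackets, each of these names would be a syntax error:
CREATE TABLE [order] (
    [key]  INT,
    [desc] VARCHAR(50)
)
```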
The following commands always need to be specified in this order:
SELECT > FROM > WHERE > GROUP BY > HAVING > ORDER BY
The number on the right side of each statement specifies the order in which the statement is going to be processed by the database engine:
```sql
SELECT *                            -- 5
FROM test_table                     -- 1
WHERE id = '1' OR id = '2'          -- 2: filter rows
GROUP BY id, last_name, first_name  -- 3: group rows
HAVING COUNT(last_name) = 1         -- 4: filter groups
ORDER BY id, last_name              -- 6
OFFSET 2 ROWS                       -- 7: ORDER BY is obligatory before OFFSET!
FETCH NEXT 10 ROWS ONLY             -- 7
```
Since ORDER BY is (almost) the last statement to be executed, you can already use column aliases and columns defined in the SELECT statement there.
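A minimal sketch (the alias full_name is my own):

```sql
-- full_name is defined in SELECT (step 5), yet ORDER BY (step 6)
-- can already refer to it:
SELECT first_name + ' ' + last_name AS full_name
FROM dbo.test_table
ORDER BY full_name
```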
Temporary tables and Common Table Expressions
There are basically two ways to work with temporary data in SQL:
- Create temporary tables or
- use common table expressions, aka CTEs
Creating a temporary table is easy: all you need to do is add INTO #temp_table_name to your SELECT statement, like so:
```sql
-- without sub-select:
SELECT *
INTO #temp_table
FROM dbo.test_table

-- with sub-select (note that the derived table needs an alias):
SELECT *
INTO #temp_table
FROM (
    SELECT *
    FROM dbo.test_table
) AS src
```
Personally, I prefer the option with the sub-select, as it is easier to see that data is written to a temporary table. As you can see, all you need to do to create a temporary table is prefix its name with a single hashtag (#). This temporary table is available as long as the current connection of the current user is open. You can also create a temp table using two hashtags (##); such a global temp table is available to any user from any connection and is only deleted once all open connections are closed.
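A quick sketch of both variants (the table names are my own):

```sql
-- one hashtag: local temp table, visible only in the current connection
SELECT * INTO #local_copy FROM dbo.test_table

-- two hashtags: global temp table, visible to all users and connections
-- until the last connection referencing it closes
SELECT * INTO ##global_copy FROM dbo.test_table
```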
Usually, you do not open and close connections very often (at least I do not), so you might want to drop your temp table on each re-run. You can do that like so:
```sql
-- drop the temp table if it exists:
IF OBJECT_ID('tempdb..#temp_table') IS NOT NULL
    DROP TABLE #temp_table

SELECT *
INTO #temp_table
FROM dbo.test_table
```
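On SQL Server 2016 and later, you can get the same re-run safety with the more concise DROP TABLE IF EXISTS:

```sql
-- equivalent shorthand (SQL Server 2016+):
DROP TABLE IF EXISTS #temp_table
```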
With a common table expression you do not need to bother with cleaning up temp tables, as a CTE is only available at run time of its query. In a sense, a CTE is a temp table that is not persisted after the query finishes, so you never need to drop it. A major benefit, in my opinion, is also that CTEs allow you to structure your query in a more readable manner:
```sql
WITH name_of_CTE_1 AS (
    SELECT *
    FROM dbo.test_table
),
name_of_CTE_2 AS (
    SELECT *
    FROM dbo.test_table
)
SELECT
     a.id
    ,a.last_name
    ,b.first_name
FROM name_of_CTE_1 a
LEFT JOIN name_of_CTE_2 b
    ON a.id = b.id
```
As you might have guessed, the UPDATE statement allows you to alter existing records in a table. First, you specify the table you want to update, then you SET the columns you want to change, followed by a WHERE clause that restricts which rows are updated:
```sql
UPDATE [test_db].[dbo].[test_table]
SET first_name = 'Peter Wilfried',
    last_name = 'Paan'
WHERE last_name = 'Pan'
```
Some more useful commands
I found the following commands quite useful:
```sql
-- select unique rows
SELECT DISTINCT
     col1
    ,col2
FROM some_table

-- counting
SELECT
    -- count all rows
     COUNT(*)
    -- count rows that are not NULL in col1
    ,COUNT(col1)
FROM some_table

-- aggregation functions
SELECT
    -- sum per group
     SUM(col1) OVER (PARTITION BY col2, col3) AS SUM_COL1_GROUP
    -- rolling sum per group
    ,SUM(col1) OVER (PARTITION BY col2, col3 ORDER BY col4
                     ROWS BETWEEN 30 PRECEDING AND 1 PRECEDING) AS ROLLING_SUM_COL1
    -- median per group
    ,PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY col4)
        OVER (PARTITION BY col2) AS MEDIAN_COL2_GROUP
FROM some_table

-- checking cases
SELECT
     col1
    ,CASE col2
        WHEN 'a' THEN 1
        WHEN 'b' THEN 2
        ELSE NULL
     END AS col2_case
FROM some_table

-- select top n values, including ties on the last place
SELECT TOP 10 WITH TIES col1
FROM some_table
ORDER BY col1  -- WITH TIES requires an ORDER BY
```
Last but not least, I will quickly talk about data types.
SQL Server offers essentially six data type categories:
- Exact numerics: INT, BIGINT, DECIMAL/NUMERIC, MONEY, ...
- Approximate numerics: FLOAT, REAL
- Date and time: DATE, TIME, DATETIME2, ...
- Character strings: CHAR, VARCHAR
- Unicode character strings: NCHAR, NVARCHAR
- Binary strings: BINARY, VARBINARY
Choosing the right data type depends on your specific application, and depending on your choice you will need more or less storage space. One important point, though, when you work with characters: use nvarchar if you need Unicode characters!
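A minimal sketch, assuming a server with a default Latin collation (in which plain varchar cannot represent these characters); note the N'' prefix that marks a Unicode literal:

```sql
DECLARE @v VARCHAR(10)  = N'你好'  -- degrades to '??' in a Latin collation
DECLARE @n NVARCHAR(10) = N'你好'  -- stored correctly as Unicode

SELECT @v AS varchar_result, @n AS nvarchar_result
```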
To convert data from one type to another you can use:
```sql
-- plain vanilla conversion from one format to another;
-- fails if the cast is not possible
CAST(column AS DATATYPE)

-- returns NULL if casting is not successful
TRY_CAST(column AS DATATYPE)

-- same as CAST, but with additional style options for dates
CONVERT(DATATYPE, column)

-- returns NULL if the conversion is not successful
TRY_CONVERT(DATATYPE, column)
```
Using our sample table:
```sql
SELECT
     CAST([id] AS NUMERIC) AS id                    -- cast from INT to NUMERIC
    ,TRY_CAST([last_name] AS NUMERIC) AS last_name  -- will return NULL
    ,[first_name]
    ,[age]
FROM [test_db].[dbo].[test_table]
```
Dealing with NULLs
NULL in SQL Server represents missing or unknown values. Any expression involving NULL will return NULL, for example 'character' + NULL = NULL. Note that the empty string is NOT equal to NULL; the empty string is known, and it is empty :)
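A quick sketch of both claims:

```sql
SELECT
     'character' + NULL AS with_null   -- yields NULL
    ,'character' + ''   AS with_empty  -- yields 'character'
```

SQL Server also provides a few functions for dealing with NULLs: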
```sql
-- return value if column is NULL
ISNULL(column, value)

-- return NULL if column equals value, otherwise return column
NULLIF(column, value)

-- find the first non-NULL value and return it (here: 'a')
COALESCE(NULL, 'a', 'b')

-- combining functions; the replacement must be convertible to the
-- type of the first argument, hence the extra CAST to VARCHAR:
ISNULL(CAST(TRY_CAST([last_name] AS NUMERIC) AS VARCHAR(20)), 'NA')
```
That was a brief detour into data types.
I hope you found this short overview useful. I plan to cover working with (columnstore) indices and the SQL Server Machine Learning Library in another post.