SQL as Sculpture
An Intuitive Way To Understand Queries



Facebook share
LinkedIn share
Twitter share



(This article is part of a forthcoming series on Understanding SQL Through Analogies. Analogies help you understand complex new things by comparing them to things that you already understand well. Or in this case, something you can easily understand - like sculpture.)



If you look at it from the right perspective - from way up high - you can say that sculpting, creating a sculpture, is about just two things:


  1. Assembling a mass of material (aka your medium)
  2. Carving that material down

Okay. Maybe that's a little too high, a little too abstract. How about if we zoom in just a bit closer...

A sculpture is often made of clay. Depending on how big you want your sculpture to be, you might be able to find a single block of clay that's large enough to suit your need. But if you can't - if you need a block that's bigger than that - then you need to mush multiple smaller blocks together, into a single (larger) block.

After assembling your block, you use a variety of tools to carve it down. You start with coarser tools that slice off big chunks. You work your way toward more delicate tools that make barely perceptible refinements. At some point, a sense of perfection overtakes you and you triumphantly bellow:


Finis!


This carving - painstaking craftsmanship - is sometimes described as finding (or even freeing) the fully formed sculpture that is hidden within (that inherently exists in) the material. On the one hand, that's an amazing way to think about making a sculpture. On the other hand, it depresses me - because I lack such insight (and fine motor skills) myself.

Or... maybe I don't. Maybe I only lack these things in an artistic context.

What might it look like if we considered writing a SQL statement - a query, a SELECT statement - as an act of sculpting, of creating a sculpture?




Assembling Your Data


The first thing you need to do in making... um... a... data... sculpture... is to assemble your material.

In SQL, your medium is sets of data. Note that I did not say that your medium is data. I said that it is sets of data. That is an essential distinction that is lost on a lot of would be relational database users.

A database is a set of tables. A table is a set of rows. A row is a set of fields (column values).

A query (a SELECT statement) builds and returns a result set. That result set is structurally equivalent to a table - it is a set of (zero or more) rows, each of which is a set of (one or more) fields.

We can think of a table (a set of rows) as a block of data, sort of like a block of clay. In doing so, we're getting more abstract again. But this time it helps us to build our intuition.

If you think of tables as blocks of data, then you can think of every SELECT statement as starting with an initial or primary block of data. Your primary block of data is the table that you name in the FROM clause. For example, here we retrieve all of the fields, in all of the rows, from a primary block of data (a table) named flashcards.


SELECT * FROM flashcards;

Sometimes we need more than one block of data. This will be the case when our primary table does not have all of the data that we need - we need some of the rows, or some of the columns, from some other table(s). This poses two challenges:


  • arranging the blocks (tables) sensibly
  • mushing the blocks (tables) together

But before we can do these things, we need to think about what the blocks might look like.

When we work with clay, we can expect that the blocks we mush together will all be of the same size and shape. If they're not, it really doesn't matter - because it's just clay. One blob of clay is (mostly) indistinguishable from any other.

When we work with data, though, we usually don't have that expectation. Most of the time, the other blocks (tables) we need are a different shape than our primary block. By the shape of a table, we mean the set of columns that define the table. If two tables are defined with the same set of columns, we can say that the tables are of the same shape. If they differ in any way, column-wise, we say that the tables are differently shaped.


Arranging and Mushing Data Blocks of the Same Shape

When you have data blocks (tables) of the same shape, you can (you must) stack them on top of each other. You can do this because the columns align - each data block (table) has the same columns, in the same order, as all of the others. Each column in a data block can thus support its corresponding column in any data block(s) stacked above it.

Recall, as I said before, though, that a result set is equivalent to a table. That doesn't just mean that a result set looks like a table. It also means that we can treat a result set like a table. This is a crucial characteristic of result sets, because I wasn't quite precise in the last paragraph. You cannot actually stack tables on top of each other in a query. But you can stack result sets. And that gets us to the same place.

We thus arrange and mush (stack) identically shaped data blocks (result sets) in what we call a union operation. We do this simply by inserting the UNION keyword in between two queries.

For example, here we will stack a result set from querying a QA/staging table (flashcards_qa) below a result set from querying our primary (flashcards) table.


SELECT * FROM flashcards
UNION 
SELECT * FROM flashcards_qa;

The UNION operator is used to stack two result sets, one on top of the other. It gathers together the rows returned by one query with those returned by another query, merging them all together into a single, larger result set.

We can (we should) think of a UNION operation as adding more rows (but not more columns) to our primary block of data. It makes the primary block taller (but not wider).

There's not much else to arranging and mushing together data blocks of the same shape. But you don't have to stop at two blocks. You can stack as many queries as you want, as long as the result set from each query is of the same shape - the same set of fields - as all of the others.


Arranging and Mushing Data Blocks of Different Shapes

When you have data blocks (tables) of different shapes (different sets of columns), you can't stack them on top of each other. You can't do that because their columns can't align. So your only option for arranging them is to lay them next to each other, side-by-side.

But in order to do that, the tables have to have at least one column that is common between them - a column of the same semantic type (and the same data type). This is often an identifier field, such as a user_id number, or a purchase_id number. But it could be anything - a date field, or even a text field. If the two tables do not have any common column(s) at all (one or more, of the same semantic type), they cannot be mushed together, under any circumstances.

We arrange and mush differently shaped tables in what we call a table join operation. We append the JOIN keyword to our query's FROM clause, along with the name of the table we want to add columns from. Then we use the ON keyword to tell the DBMS how the tables are related.

For example, here we join the columns from a table named scores to those of our flashcards table. Both tables have a card_id field.


SELECT *
FROM flashcards
    JOIN scores
        ON flashcards.card_id = scores.card_id;

We can (we should) think of a table join operation as adding more columns to our primary block of data. It makes our primary data block wider (but not taller).

We're glossing over a lot of detail here. But the intuition is simple - laying one or more other tables alongside your primary table.




Carving Your Data


When it comes to carving your data block down to the size and shape you need, it should come as no surprise that you have two categories of choices. You can carve down the height of the block - the quantity of rows in the result set. Or you can carve down the width of the block - the quantity of fields (column values) in the result set. In most cases, you will do both.

A reminder, though, that with the simplest query, we do neither. When we select * (an asterisk in the column list), we get all of the fields in the returned rows. When our query only includes the SELECT and FROM clauses, we get all of the rows returned from our table(s). So this query will retrieve all of the fields, in all of the rows, from our flashcards table:


SELECT *
FROM flashcards;

To carve down the width of our data block - the quantity of fields in the result set - we simply name the columns we want - and only those columns - in the SELECT clause's column list:


SELECT question FROM flashcards;
SELECT question, answer FROM flashcards;

When it comes to carving down the height of our data block - the quantity of rows in the result set - we have more than one tool at our disposal.

The most important and most commonly used tool is the WHERE clause. You can think of it as the chef's knife of carving tools. You can use it to slice off huge chunks, or shave off little bits, from your block of data.


SELECT * FROM flashcards WHERE author = 'Mike';
SELECT * FROM flashcards WHERE create_date > '2020-09-01';
SELECT * FROM flashcards WHERE question like '%awesome%';

You can also use it to make multiple cuts, all at once.


SELECT * FROM flashcards
WHERE author = 'Mike'
    AND create_date > '2020-09-01'
    AND question like '%awesome%';

The LIMIT clause is another common tool for carving off rows from a data block/set. It's not very sophisticated though, at least not by itself. You can think of it as the kiddie scissors of carving tools. Here, we use it to retrieve only the first 25 rows from our flashcards table:


SELECT * FROM flashcards
LIMIT 25;

The LIMIT clause is often used in combination with the OFFSET clause, in order to take a slice from somewhere in the middle. Here, we will retrieve only the 21st through 30th rows from our table:


SELECT * FROM flashcards
LIMIT 10
OFFSET 20;

One more row carving tool, before we move on - aggregation functions. An aggregation function is best understood as a grouping tool, rather than as a carving tool. But reducing the quantity of rows is an intended side effect of grouping. So it's reasonable to think of an aggregation function as a carving tool, as well.

We use an aggregation function to summarize the values in a particular column. Here, for example, we count the rows in our flashcard table:


SELECT count(question)
FROM flashcards;

This query reduces the flashcard table's thousands of rows (and half dozen columns) down to a result set having a single field, in a single row.

Besides counting things, we can use aggregation functions to get an average of a column's values, or a minimum or maximum value, or any of many other options.


Of course, we can use all of these tools together, to carve our assembled data block down to just the size (rows) and shape (columns) that we need.

The trick is to approach the task of carving methodically, cutting off one chunk at a time, until you are left with just the rows and columns that you want.




A Full Example


Here is a (simplified) set of tables for a multi-user flashcard system:



Table "users"
  Column   |  Type   
-----------+---------
 user_id   | integer 
 user_name | text    


Table "flashcards"
     Column  |  Type   
-------------+---------
 card_id     | integer 
 question    | text    
 answer      | text    


Table "scores"
   Column    |  Type             
-------------+-------------
 user_id     | integer     
 card_id     | integer     
 score       | smallint    
 review_time | timestamp   


The scores table records how well each student performed with each flashcard at various points in time. Note that this table links back to (it relates to) both the flashcards table and the users table (via the card_id and user_id fields, respectively).

Our goal in this example is to find the average score for each flashcard question for a particular user.

We start by gathering our data. We want to know about all of the questions, so we will choose the flashcards table as our primary block/set/table.


FROM flashcards

We need the score column from our scores table. That table isn't the same shape as the flashcards table. (It has a different set of columns.) So we need to arrange it alongside the flashcards table via a table join.


FROM flashcards
    JOIN scores ON flashcards.card_id = scores.card_id

That's all we need to do as far as gathering and arranging our data set. With just this FROM clause, our result set will have as many rows as there are in the scores table - one for each time each user reviewed each card. It will have all of the fields from both of the tables:



     Column             |  Type   
------------------------+---------
 flashcards.card_id     | integer 
 flashcards.question    | text    
 flashcards.answer      | text    
 scores.user_id         | integer     
 scores.card_id         | integer     
 scores.score           | smallint    
 scores.review_time     | timestamp   


We only want rows for one particular user - say, user #5. So we'll add a WHERE clause to carve off all of the rows for all of the other users. This reduces our result set from millions of rows to hundreds.


FROM flashcards
    JOIN scores ON flashcards.card_id = scores.card_id
WHERE scores.user_id = 5

We only need the question field and the score field. So we'll write our SELECT clause to retrieve only those two columns. This carves off all of the other columns, reducing our result set from the seven original columns to just two.


SELECT flashcards.question, scores.score
FROM flashcards
    JOIN scores ON flashcards.card_id = scores.card_id
WHERE scores.user_id = 5

As it stands now, we will have a row for each time that user #5 reviewed each question. We want only one row for each question, with an average of the scores for the question. So we need to use the avg() function on the score field.


SELECT flashcards.question, avg(scores.score)
FROM flashcards
    JOIN scores ON flashcards.card_id = scores.card_id
WHERE scores.user_id = 5
GROUP BY flashcards.question;

An aggregation function needs a GROUP BY clause to tell the DBMS which column(s) we want to summarize over. But, other than that, we're done. We built up a data set, block-by-block. Then we carved it down, cut-by-cut, until we had only the rows and columns needed to answer our question.




A Better Way to Write SELECT Statements


In the preceding (full) example, you may have noticed that I did not write my query from the top down, in the clause order expected by the DBMS:


  1. SELECT
  2. FROM
  3. WHERE
  4. GROUP BY

When you write a query from the top down, you write it in a rote, mechanical way. You think more about how you have to talk to the DBMS, rather than about what you want from it. You're focused on rules, and worried about making mistakes.

That's not an intuitive way to think about what your result set should look like.

Instead, write a query like you would craft a sculpture. Build up a block of data. Then carve it down.


  1. Start with the FROM table as the core of your data set
  2. JOIN (and/or UNION) any other tables needed to complete the set
  3. Use the WHERE clause to carve off unwanted rows
  4. Use the SELECT clause to carve off unwanted columns
  5. Add in other clauses or operators as necessary

At each step, pause to think about - to visualize - what the result set would look like, at that point. Then think about what else you need to do - what you need to add to the block, or remove from the block - in order to end up with just the rows and fields that you want in the result set.

SQL as sculpture?

SQL is sculpture.




What We Learned


At a high level, writing a query, a SELECT statement, is about two things - in this order:


  1. assembling a set of data (from one or more tables)
  2. carving that data set down to just the rows and columns that you want

In assembling our data set, we start with a primary table, identified in the FROM clause. We can make the set wider (more columns) with JOIN. We can make the set taller (more rows) with UNION.

To carve the set down, our primary tool for making it shorter (fewer rows) is the WHERE clause. We make the set narrower (fewer columns) by naming only the columns we want, in the SELECT clause.


Data Assembling and Carving Tools


Data Block GoalData Set GoalPrimary ToolAdditional Tools
WiderMore columnsJOIN
TallerMore rowsUNION, UNION ALL
NarrowerFewer columnsSELECT
ShorterFewer rowsWHERELIMIT, OFFSET, aggregation functions (e.g. count, sum, min, max, avg), DISTINCT, INTERSECT, EXCEPT

We focused solely on queries (SELECT statements) in this article because DML statements are much simpler. All of them only ever operate on a single table - so we never make the data block taller or wider.

But the WHERE clause is used (more often than not) in UPDATE and DELETE statements in the same way as in a SELECT statement - to carve off rows that we don't want to update or delete.


Queries can be as complicated as you need them to be. Or they can be as expressive as you want them to be. It all depends on your perspective.

But they don't have to be frustrating - if you understand what you're doing when you write them.

The analogy of SQL as sculpture is just one of the ways that our Using SQL Databases flashcard deck teaches SQL differently. This deck doesn't just develop rote memory. It builds deep - and lasting - conceptual understanding. If you want to really learn all of this, start there.







Author/Feedback

Michael Reichner is the Founder of SRSoterica (this site), and the author of spaced repetition system (SRS) flashcard decks for absorbing the concepts that underlie complex programming subjects.

Discuss this article on the site that it was linked from, or at the SRSoterica subreddit.

Last updated: November 10, 2020


Tell your friends and coworkers about this article:


Facebook share
LinkedIn share
Twitter share


Would you like to be notified about new articles?





Do you need to learn SQL? Do you need to understand it better?

Using SQL Databases is a spaced repetition system (SRS) flashcard deck for learning how to work with relational databases. This deck teaches you how to understand databases (structurally), how to understand SQL statements (as set operations), and how to perform common tasks in SQL.


Our decks are targeted at the Absorption stage of the learning process. We bring subject matter expertise, SRS expertise and instructional design expertise to every deck that we craft. You will come to understand SQL at a much deeper level than most other programmers.