Web Development Lesson 7 - Databases
Joins
Objective
In this activity you'll use data from two tables and you'll learn why separating data into multiple tables is a good idea.
Setup
Null Value
character_name | hero_villain | first_appeared | power | iron_man | first_avenger | thor | avengers |
Black Widow | Hero | 1964 | Skill | NULL | NULL | NULL | True |
Captain America | Hero | 1941 | Biological | NULL | True | NULL | True |
Hulk | Hero | 1962 | Biological | NULL | NULL | NULL | True |
Iron Man | Hero | 1963 | Technology | True | NULL | NULL | True |
Loki | Villain | 1949 | Magic | NULL | NULL | True | True |
Scarlet Witch | Hero | 1964 | Magic | NULL | NULL | NULL | NULL |
Multiple Tables
Now consider if we added a column for all 27 films released so far in the MCU, not to mention the 4 TV series. Our table would be enormous, and most of it would be empty (NULL).
That would take more time to manage, slow down our queries and take more space on our server. None of these are good outcomes, but what can we do about it?
We could make each combination of hero and movie its own row of data, like this.
character_name | hero_villain | first_appeared | power | movie |
Black Widow | Hero | 1964 | Skill | Iron Man |
Black Widow | Hero | 1964 | Skill | Avengers |
Captain America | Hero | 1941 | Biological | First Avenger |
Captain America | Hero | 1941 | Biological | Avengers |
Hulk | Hero | 1962 | Bilogical | Avengers |
Iron Man | Hero | 1963 | Technology | Iron Man |
Iron Man | Hero | 1963 | Technology | Avengers |
Loki | Villain | 1949 | Magic | Avengers |
This looks much better. It takes less space, and would be faster, but it still has problems.
The first is that a lot of data is repeated. See how we have Black Widow's information (hero_villain, first_appeared and power) twice. If we want to update that data, we'd have to update every single row with Black Widow. That will slow things down and could lead to errors.
The second is more subtle. Did you notice that we're missing Scarlet Witch? We could include a row for her without movie data, but how would we manage later when we added Age of Ultron? Would we leave a NULL row taking up space in the database for no purpose? We could check for NULL rows and replace those when we add the first movie, but it's not ideal.
The answer is to separate the data into two tables. One with information about the hero, and another tracking which films they've been in.
You've seen the first already.
Let's take a look at the second. Click on the table called 'appearances' in the left pane of PHPMyAdmin.
Here we see a table with a list of hero/movie pairs.
Unique Index
Remember that every table must have a unique index. But in this list, Iron Man and Loki both appear more than once in 'character_name' and 'The Avengers' appears five times in 'movie', so neither is appropriate for our primary index.
Instead, we create a new field that will be unique. By custom, we call it 'id'.
Click on the 'Structure' tab to see how this is defined.
It's an integer with the 'unsigned' attribute. This means that it can only be positive numbers. In computing, where numbers are stored as a collection of ones and zeros, each called a 'bit', one bit is usually used to indicate whether a number is positive or negative. By making our index unsigned, we are doubling the number of rows we can have.
Also notice under 'Extra' that the 'id' field is set to 'AUTO_INCREMENT' which means that when we add rows the new row will have a number one higher than the highest number previously used. This saves us from having to find an unused number ourselves.
So how do we combine the data from these tables?
Joins
SELECT *
FROM heroes
LEFT JOIN appearances ON heroes.character_name=appearances.character_name
The results now show all the information that we had in our most recent table.
The advantage here is that we only need to update one line in our heroes table for each hero, so there is much less risk of error and we're taking less storage space. It's also quicker to process, but that's probably not noticeable for such a small database.
Let's break down the query.
SELECT *
is the same as before. It says to show all fields. In this case, show all fields from all tables included in the query.
FROM heroes
is the same as before, though I've left out the backticks. Backticks are not required in most cases.
The last line is new. LEFT JOIN
means that we're connecting all rows from the second table that match the first table, and we're using the 'character_name' field of both tables to identify the match (heroes.character_name=appearances.character_name
).
Notice that we identify the fields to use by naming the database first and separating them by a .
. This ensures that the database doesn't get confused because the field names are the same in both tables.
Also notice that Scarlet Witch appears in the results because she's in the first table, even though she's not listed in the second table. The missing information is listed as NULL
. Let's see what happens if we swap the tables around.
We could swap the table names in the query FROM appearances LEFT JOIN heroes ON heroes.character_name=appearances.character_name
, but we can also simply change the type of join from LEFT
to RIGHT
.
SELECT *
FROM heroes
RIGHT JOIN appearances ON heroes.character_name=appearances.character_name
When you run this query, Scarlet Witch disappears because the second table (appearances) is our base and she doesn't appear in that table.
Let's go back to the original query so we can see Scarlet Witch.
SELECT *
FROM heroes
LEFT JOIN appearances ON heroes.character_name=appearances.character_name
Before we go on, notice that we've got unnecessary information in our results. 'id' is required by the database, but for now it has no value for us. Nor is there any value in repeating the 'character_name' data.
Let's grab only the fields we actually need.
SELECT character_name, hero_villain, first_appeared, power, movie
FROM heroes
LEFT JOIN appearances ON heroes.character_name=appearances.character_name
Notice that we've replaced *
with the list of specific fields we want to include.
Unfortunately, when you run the query you'll see an error saying that the column 'character_name' is ambiguous. Although they're exactly the same in both tables, the database needs to be told which one to choose.
SELECT heroes.character_name, hero_villain, first_appeared, power, movie
FROM heroes
LEFT JOIN appearances ON heroes.character_name=appearances.character_name
Next: PHP Database Connections