Introduction to relational databases with SQL

Relational Database (DB)

A relational database is a digital representation of a relational model of data (how information is linked together).
A typical DB stores the organized data into tables (or relations).
In a table, the different columns correspond to the fields (or attribute) of our data and the rows correspond to the different records (or tuples).

Figure 1. A simple relational database (source: wikipedia).

DB Tool for this tutorial

There are many digital database software out there. Some are easy to use, some are slightly more complicated. Some are free, some are proprietary. But they all have more or less the same basis, which we'll cover here (a portion of it anyway!)

Most of them can be used directly from the terminal, but some also provide a Graphical User Interface (GUI). A GUI can be extremely useful, especially when you start dealing with relational databases.

For this tutorial, we will use SQLite along with DB Browser for SQLite (GUI, Figure 2). You may already have SQLite on your machine.

Other popular options are MySQL (e.g. LSST), or PostgreSQL (along with pgAdmin as GUI).

Figure 2. DB Browser for SQLite.

Figure 3. MySQL Workbench GUI.

Organising your data

Say we want to create a database to represent a catalogue of galaxies classified by morphologies.

We would like to have a list of all galaxies from our survey and the morphology to which they belong.

An easy way to represent the information would be to put everything together in one table.

The problem with this is that if you delete all entries (rows), you would not only delete your list of galaxies, but all morphologies as well. You would therefore have to start from scratch.

Figure 4. (Outdated) Hubble Classification of Galaxy Morphology.

Split and create relations

To keep our data organized, it's good practice to categorize our information into logical groups.

We want to keep independent entities separate and create relations between them. This process is called normalisation.

Simply by the GUI (e.g. using buttons, dropdown menues and input/tick boxes), the Galaxy and Morphology data can be structured into two separate tables (say MORPHOLOGY and GALAXY), as shown in Figures 5 to 7.

We now have created the relation between the tables, having the morphology data independent to the galaxy data, and associating each galaxy to a specific morphology.

From Figures 5 and 6, we can see the relationship between MORPHOLOGY and GALAXY is built on the attribute morphologyNo. This was made possible by setting up the "Foreign Key" during the creation of the table GALAXY with MORPHOLOGY(morphologyNo).

Therefore, when we will want to insert a new galaxy along with its morphology, we will get an error if we do not provide a valid morphologyNo. In other word, the insertion won't happen if we want to assign a morphology that doesn't exist.

Extra

Extra information about PRIMARY KEY.

Note

By default, DB Browser displayed the menus in French on my machine. Don't you worry, it should be in your langage on your machine!

Figure 5. Creating the MORPHOLOGY table.

Figure 6. Creating the GALAXY table.

Figure 7. Manually adding data to the MORPHOLOGY table.

Interacting with the database using SQL

Now that we have a structure for our database, it would be nice to be able to interact with it. For instance, it would be nice to add new records of galaxies or new morphologies. To do this, we use a language called SQL.

The GUI we use here helps lets us create tables and insert fields without having to type any line of SQL (like in Figure 7). However, once you know SQL, it's quite practical and easy to use. For instance, you won't need to remember what is the morphologyNo to insert a new galaxy (which can be very impractical if you have a lot of items in a table).

The SQL language consists of a only few key words. Some keywords were already shown in the table creation figures (e.g. CREATE TABLE). Some important keywords are shown on the right.

More information about the multiple keywords and how to use them can be found here. Also, ways to combine and use them is well explained on the Software Carpentry webpage.

INSERT INTO VALUES

SELECT FROM

JOIN ON

WHERE GROUP BY HAVING

ORDER BY

Let's try SQL

Say we want to insert a new record in the MORPHOLOGY table for irregular galaxies. We can do it as follow in the "Execute SQL" section of DB Browser:

INSERT INTO MORPHOLOGY(morphologyNo, morphologyName) VALUES(NULL, 'Irregular');

Since morphologyNo is an auto_increment variable, typing NULL during the INSERT INTO query lets the database assign the next available value.

Figure 8. Query result.

A bit more

Lets now insert a new galaxy in the GALAXY table:

INSERT INTO GALAXY(galaxyNo, galaxyName, morphologyNo) VALUES(NULL, 'Large Magellanic Cloud', (SELECT morphologyNo FROM MORPHOLOGY WHERE morphologyName LIKE 'irregular'));

Now, let's find all galaxies, and also display their morphology name:

SELECT GALAXY.galaxyName, MORPHOLOGY.morphologyName FROM GALAXY JOIN MORPHOLOGY ON GALAXY.morphologyNo = MORPHOLOGY.morphologyNo;

This will return you all galaxies (well, only the one we added so far) and its morphology name (no numbers). The JOIN*** links the two tables ON columns name morphologyNo.

*** By default, JOIN means INNER JOIN in SQLite.

Figure 9. Query result #2.

Figure 10. The Large Magellanic Cloud (source: fromquarkstoquasars.com).

Extra

If you want to know more about SQL itself, I encourage you to look at w3school SQL's tutorial or at this quick reference page. There are many DB tutorials online. Just google something like "SQLite tutorial" and you'll find something quickly (for instance, Software Carpentry)!

For example, here is a tutorial about MySQL. And here is one about PostgreSQL

Other topics you may want to read about (but not restricted to) are the different kinds of joins (e.g. left join, right join, ...), auto_increment a field to avoid typing them yourself, modify a table (ALTER TABLE), importing data, and some more advanced concepts such as inverted index, etc.

Have a look at the left side bar of w3school It's a good source of information about SQL (and for all kind of web development).

Figure 11. Venn diagrams of the different types of JOIN (source: dsin.files.wordpress.com).

Last but not least: SDSS SkyServer!

Now that you have an idea of how works a database and how to query it, have a look at SDSS SkyServer's SQL Search. You can now query the database quickly without using Imaging Query form or Spectral Query form.

Also, for tutorial and examples on how to use SkyServer's SQL Search, look at this page.

For more information about how the database is structured, look at this page.

Have fun with your colleagues/new .astro friends!

Example

To get quasars with redshifts less than 3.5, ordered by right ascension, that are in both the FIRST Bright Quasar Survey (FBQS) and SDSS.
SELECT sp.ra, sp.dec, sp.z, sp.zErr, sp.zWarning, f.ra as FIRST_ra, f.dec as FIRST_dec FROM SpecPhotoAll AS sp INNER JOIN FIRST AS f ON sp.objid = f.objid WHERE class = 'QSO' AND sp.z < 3.5 ORDER BY sp.ra A snapshot of the outcome is shown in Figure 13.

Figure 12. SDSS DR12 Database schema (source: http://skyserver.sdss.org/dr12).

Figure 13. Query result.