RDBMSes are used in finance industry and similar industries. At the same time RDBMSes and the relational model played a role in web developement. I think it was accidentially.
The main use case of RDBMSes remains business data. Think of tables called customers
or orders
or anything like this. And so, correctness of the data is a great concern. Because of this,
these databases are usually transactional.
Business data is often heavily normalized. The database schemas used for dynamic web pages often mimic those databases that store business data, which works but is obviously not always the best solution.
In dynamic websites, the concept of data correctness virtually doesn’t exist. The mere purpose of transactions (if they are used) is to prevent data that causes the displaying PHP script to crash.
In general speaking, data correctness in web applications is a SHOULD-BE, but not a MUST-BE.
There is an excellent Article about Database normalisation on Wikipedia. Normalization is to eliminate Transitive dependencies as well as insertion, update, and deletion anomalies.
Informally, a relational database relation is often described as “normalized” if it meets 3NF (third normal form).
The role of RDBMSes in dynamic websites is closely related with the beginnings of PHP and MySQL in the 90s. Before PHP gained traction, websites have been mostly written in Perl. PHP initially began as a set of CGI programs written in Perl known as PHP/FI (Personal Home Page Tools/Forms Interpreter). MySQL arose to be the first (propably) Free and Open Source SQL database (Postgres didn’t support SQL at the time), furthermore MySQL was built with to be compatible to the Semi-Free mSQL dbms, which was free for personal use, and therefore popular.
Within the evolution of the WWW, mSQL and later MySQL was extensively used in dynamic web pages, which have been written in Perl and later more and more PHP. Initial systems used MySQL merly as a document database with SQL interface, but as these web applications (or CMS or Wiki engines) became more and more complex, heavily normalized database models had been implemented, as this is the most obvious thing to do with an RDBMS.
A “denormalized” Table:
Table Blog_Posts
id | title | body | category | tag |
---|---|---|---|---|
1 | A paradox on … | …some text… | Poetry | Paradox |
1 | A paradox on … | …some text… | Philosophy | Paradox |
1 | A paradox on … | …some text… | Poetry | Ideas |
1 | A paradox on … | …some text… | Philosophy | Ideas |
This table is obviously denormalized (or unnormalized). Everyone will propably agree, that this is a waste of diskspace.
The “normalized” solution:
Table Blog_Posts
id | title | body |
---|---|---|
1 | A paradox on … | …some text… |
Table Blog_Post_Categories
id | category |
---|---|
1 | Poetry |
1 | Philosophy |
Table Blog_Post_Tags
id | tag |
---|---|
1 | Paradox |
1 | Ideas |
This is the obvious solution as ruled by the 3NF (I think).
My solution:
Table Blog_Posts
id | title | body | categories | tags |
---|---|---|---|---|
1 | A paradox on … | …some text… | {Poetry,Philosophy} | {Paradox,Ideas} |
This example violates even the 1NF (first normal form) by having multiple values per field. However modern RDBMS usually have array types. And having arrays within fields is not expensive (at least not in PostgreSQL).
Despite not adhering to the normal forms, this example doesn’t suffer from the kind of data explodification as seen in the “denormalized” example above. Yet, it doesn’t suffer from type of segmentation as the “normalized” example.
My solution also improves the data locality and CPU overhead. And, ironically, this solution saves more disk space than seperate tables.
Disclaimer: This works since a blog post is usually associated to less than 10 categorys and tags, rather than tousands of those.
…like this, but this is obvious!
The query:
SELECT p.id AS id,title,body,category,tag
FROM Blog_Posts p
JOIN Blog_Post_Categories c ON c.d = p.id
JOIN Blog_Post_Tags t ON t.id = p.id;
is going to give you:
id | title | body | category | tag |
---|---|---|---|---|
1 | A paradox on … | …some text… | Poetry | Paradox |
1 | A paradox on … | …some text… | Philosophy | Paradox |
1 | A paradox on … | …some text… | Poetry | Ideas |
1 | A paradox on … | …some text… | Philosophy | Ideas |
All your application wants is the Blog_Posts
record + the list of
categories, the Blog-post is associated with + a list of of tags, the Blog-post
is associated with.
What your application doesn’t want is this giant table-block. Those table-blocks are interesting for automated report generation in the finance industry, but they are not interesting for web developers.
Revisiting the Digg-Scenario.
Lets suppose we have the following schema…
CREATE TABLE Diggs (
id integer primary key,
itemid integer,
userid integer,
digdate timestamp
);
CREATE TABLE Friends (
id serial primary key,
userid integer,
username text,
friendid integer,
friendname text,
mutual boolean,
date_created timestamp,
-- Constraints...
unique (userid,friendid)
);
…and the following indexes:
CREATE INDEX ON Diggs (itemid,userid);
CREATE INDEX ON Friends (userid);
For a given user-id, we want to get the friends of the friends, and from those, we want any Diggs of an Item.
SELECT d.digdate,d.id
FROM Friends mf,Friends fomf,Diggs d
WHERE
mf.userid = $myself AND
fomf.userid = mf.friendid AND
d.userid = fomf.friendid AND
d.itemid = $specific_item
ORDER BY d.digdate DESC, d.id DESC
LIMIT 4;
To the average user, it should be obvious how the query plan should look like: go straight through the indexes! But take a look at the query plan: If you request the query plan using explain or an equivalent command, you will often find something bizarre.
In large scale web applications, joins are often pulled into the application.