Designing a system against a relational database involves several key elements that ensure data integrity, efficient query performance, and maintainability. Here’s a summarized overview of these design elements: This article I am going to strive to cover the first of numerous key elements of designing a system against a relational database. I will eventually cover the following topics, but this post will specifically be based on the first topic data modeling:
The general idea with data modeling for a relational database (and for other types of databases) is to build the database in a way that caters to your specific usage needs. This involves multiple layered tasks. Each of the sections below I’ll define simple the task, then elaborate on characteristics of that particular task.
In this post I’ve put together some of the naming conventions, rules, and ideas that I tend to follow when creating database schemas to work with. This also applies to schema-less databases, distributed systems databases, graph, time series, or whatever else I am working with. It’s always good to have some good conventions to work with, and the descriptions and ideas in this post are a solid starting point.
What We’re Working With, Some of The Database Rules
Let’s start with a SQL Server rule. Table names must be less than 128 characters. To force SQL Server to use non-standard table names one can use brackets. Then in scripts these names have to be single quoted.
There are many other rules about naming things in SQL Server. But let’s talk about some other database specific rules for other databases.
Postgres names fold to lowercase versus uppercase, which is different then many other databases. Throw in some double quotes however and you can use names like MyTable, MYTABLE, and mytable. These would all be the same without double qutoes, add the double quotes around those names and “MYTABLE” becomes different than “MyTable” and different than “mytable”.
SQL identifiers in Postgres and key words must begin with letters (a-z), which include diacritical marks and non-Latin letters. After the first letter and identifier can have letters, underscores, digits, or dollar signs. If an identifier is double quoted, you can also yse keywords, albeit I would very strongly reccommend against this practice.
As these examples provide, there are a number of ways that the rules are just different enough from one database to another that it is often very helpful to use a naming convention that would work across databases. I’m often working with a variety of databases so this post will cover naming convention ideas and respective patterns and practices around them that would work with every conceivable database I know to exist!
Table, Column, Tuple, or Related Naming Conventions
Table, column, and related object names should contain only letters, numbers as characters in the body of the name and not as the preface characters, underscores and absolutely no spaces or special characters. In summary, use a case scheme like Camel or Pascal Case but do not use Snake or Kebab Case.
Use meaningful names from the business or organizational domain being modeled. Such as “BankingUsers”, “Transactions”, “railroads”, or “railroad_Systems”.
Use singular word names if at all possible, only moving to compound word naming if absolutely necessary. Ideally names would be single words like “User”, “Transactions”, “railroad”, or “system” and exclude compound names like “railroadSystem” until it is needed to prevent confusion or naming collisions.
Columns that are primary or foreign keys should be prefaced with PK_ and FK_ respectively, and in my moderately humble opinion, stick to just PK or FK using Camel or Pascal Case. For other metadata, indexes, and related names use a respective preface or postfix conventions.
It’s also a good idea to choose plural or singular for table names. However, be sure to choose one or the other so that frameworks, Object Relationship/Relational Mappers/Mapping (ORM), and other tools can effectively name things when used. For example, when table names are singular, many ORM frameworks when generating code would take a singular table name like Customer and make it Customers and have objects of Customer.
Schema/Domain Naming Conventions
In many databases there are additional organizational and related structures that help us to setup tables, functions, stored procedures, compiled SQL/queries, and other objects in groupings. Naming these objects accordingly is easiest by following the same convention as the table naming convention.
For example, if the table naming convention is following Camel Case then continue that;
Table Names for Returns in an E-commerce Domain:
Table Names for Pricing in an E-commerce Domain:
Table Names for Core Tables, for multiple schemas, within the E-commerce Domain:
Here’s an example I put together with some naming conventions I’ve found useful, reduces confusion, and manages to tell a reasonable amount of information about the domain space and schema of the database without conflicts.
Throughout this schema I’ve used Camel Casing, with most single word column and single word table names. Keeping it simple, such as Id and Stamp are some of the recurring columsn that are useful for retrieval, relationships, and determining origins of data over time. In production settings there are default columns that are often needed and one can rest assured, a time stamp is most likely one of those that is needed everywhere for audits!
For my foriegn key columns, one can determine the relationship by the name of the column itself. For example, in the Connection table the are two foreign keys, one to the Action table and one to the Source table, to the respective Id columns in each of those tables. The one exception is NoteJot, which I named because Note tends to conflict in certain systems. In that table I’ve added a relationship, for recursive data, back to itself with the use of the NoteId foreign key back to the table’s primary key Id.
The diagram above, without any further details can be used to create the schema, with SQL code that would look like the following.
We have the tables and keys, all following the Camel Case standard. Much of this however could be – if you preferred – to Pascal Case however switching them to Snake or Kebab Case would cause a number of issues depending on the database.
Do NOT Use These
Unless you want to spend tons of time with errors, debugging, and related issues skip these practices.
Generally throughout databases it is best to skip Snake or Kebab Case. Various situations they’re fine, but overall they’re likely to run into conflicts, naming limitations, or other concerns. It’s best to just skip them and remove the concern.
When you’re using data that has variance in how it is represented, do not use multitudes of formats. For dates, location, geographic, or related data it is best to stick to a particular format that is repeated throughout the database. Using mixed representations, for example with dates a MMDDYYYY format and then a DD-MM-YYYY format, methods, functions, or other elements that consume or process this data will need to account for this. Creating more time consuming and error prone code.
Once you pick a naming scheme for any particular database object type, stick to the naming scheme. For example, if you go with Camel Casing for your tables, use Camel Casing for all of the tables and don’t switch to Pascal for some of them. Specifically, however with this guidance, is if you switch object types, for example you name the tables Pascal Cased but switch to Camel Case for indexes, that’s perfect. Then if ever reviewing a list of objects irrespective of kinds of objects, one can differentiate merely by the conventions used.
Summary & Caveat
Working up a set of patterns, practices, rules, and generally conventions to work with on the database side of things is immensely useful. It helps the Database Administrators, Data Scientists, Software Developers, others that need to utilize the database, and to communicate with each other in reference to the database and data.
I dig through a lot of internet results and blog entries that show CRUD data modeling all the time. A lot of these blog entries and documentation are pretty solid. Unfortunately, rarely do we end up with data that is accurately or precisely modeled the way it ought to be or the way we would ideally use it. In this post I’m going to take some sample elements of data and model it out for various uses. Then reconstitute that data into different structures for various uses within microservices, loading, reading, both in normalized form and denormalized form.
The Domain: Railroad Systems & Services
The domain I chose for this particular example is the entire global spectrum of rail services. Imagine if you would a system that can track all the trains in the world, or even just the trains in a particular area of the world, like the United States. In the United States the trains can be broken down into logical structures of data for various things like freight trains and passenger trains. Trains operated under a particular operator like Amtrak, Union Pacific, or Norfolk Southern, and their respective consists that the train is made up of. Let’s get into some particular word definitions to fully detail this domain. Continue reading “Beyond CRUD n’ Cruft Data-Modeling”→
Another flight down to the bay area. Today it was Alaska AirFlight 330 from Seattle to San Jose. It was mostly a clear day at start, with a solid layer of bright cloud cover exiting Washington on the way down to Oregon. As we crossed over that arbitrary human defined line of Oregon and California, nature presented us with even more perfectly glowing bright cloud cover. This is Cascadia after all and it’s basically covered in clouds the majority of the time. On departure I also noted Bremerton has three aircraft carriers in dock along with a normal plethora of other naval vessels. The amount of naval power in the area is always pretty awe inspiring.
Why was I in flight once again? I am heading down to teach with Jeff Carpenter (@jscarp) at the South Bay Cassandra User Group‘s Cassandra Day events. These are single day events, where we cover an introduction to Apache Cassandra, concepts of data-modeling for Apache Cassandra, and then a wrap up of application development with the respective drivers. Now if you aren’t in Santa Clara – or ya know Menlo Park, San Jose, Oakland, San Francisco, or well, the surrounding area – there are other days scheduled! We also have days scheduled that aren’t even located in the Bay, so check out the full list of events:
NOTE: If you’re interested in Seattle, Portland, or Vancouver BC area events, scroll all the way down to the end of this blog entry I’ve got more details for you!
Introduction to Apache Cassandra
In the introduction to Apache Cassandra we cover an overview of the architecture and features of the distributed database. Starting off with a definition of a distributed hash ring and how this is used in Apache Cassandra to provide data storage across the nodes that make up the Apache Cassandra Database. Moving on we’ll get into the other capabilities, trade offs of data replication between nodes, configuration settings, and a lot more.
For data modeling we start off with a short review of relational database data modeling to provide something that is more familiar for many people. From this, we then build off of many concepts around denormalization, breaking apart various levels of normalization forms, and then get into the thinking and approach behind modeling an application in a distributed database and go deeper with details around Apache Cassandra.
For application development, focusing around the Java language and technology stack, we’ll start with some concepts around how the drivers connect to and work with Apache Cassandra. We’ll open up some code too, get into some code changes and additions, to get more familiar with how the driver works and some of the capabilities of the driver itself.
In the coming weeks (ok, maybe a month or two) we’ll be updating this material for Apache Cassandra v4 and additionally, I’m aiming to line up some half day and probably some full day workshops in the Cascadian region: Portland, Seattle, and Vancouver BC. They’ll be almost identical except for a few tweaks, but you’ll have to RSVP to find out the details!
Also, if you’re in between any of those cities and have a stop on the Amtrak Cascades, let me know and we’ll get an RSVP list started for your city and see if we can get the required attendee count to make it official!