Normalization in Relational Databases

This is a continuation of my posts on relational databases, started here. Previous posts on this theme include “Data Modeling”, “Let’s Talk About Database Schema”, “The Exasperating Topic of Database Indexes”, “The Keys & Relationships of Relational Databases”, and “Relational Database Query Optimization”.

Suppose you’re a die-hard fan of progressive death metal, with a particular affinity for bands like Allegaeon. Over the years, you’ve accumulated a vast collection of CDs, vinyl records, and other memorabilia.

Your collection has grown so much that you decide to document every item meticulously. Each piece of memorabilia carries information about the album, track titles, band members, and so forth. If you scribbled every detail down in one continuous list, you’d end up with a lot of repeated information. For instance, both “Proponent for Sentience” and “Formshifter” would mention the same band members, such as Riley McShane and Michael Stancel.

Normalization is akin to setting up separate lists or sections in your documentation: one section purely for “Band Members”, where you detail the members of Allegaeon over time, and another for “Albums”, where instead of listing the band members all over again you simply refer back to the “Band Members” section. This kind of organization cuts down on redundancy and ensures that if, say, a band member leaves, you have only one spot to update.
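As a rough sketch of what those separate sections look like in code (my own illustration, not from the post, and the line-ups are placeholders rather than an accurate discography), band members live in one place and albums refer to them by ID:

```go
package main

import "fmt"

// BandMember is the single, authoritative record for a member.
// If a member changes or leaves, only this record is touched.
type BandMember struct {
	ID   int
	Name string
	Role string
}

// Album references members by ID instead of repeating their details,
// which is the essence of normalizing the collection.
type Album struct {
	Title     string
	Year      int
	MemberIDs []int
}

func main() {
	members := map[int]BandMember{
		1: {ID: 1, Name: "Riley McShane", Role: "Vocals"},
		2: {ID: 2, Name: "Michael Stancel", Role: "Guitar"},
	}

	// Line-ups here are illustrative only.
	albums := []Album{
		{Title: "Proponent for Sentience", Year: 2016, MemberIDs: []int{1, 2}},
		{Title: "Formshifter", Year: 2012, MemberIDs: []int{1, 2}},
	}

	// Each album looks up members in the one shared "Band Members" section.
	for _, a := range albums {
		fmt.Printf("%s (%d):\n", a.Title, a.Year)
		for _, id := range a.MemberIDs {
			m := members[id]
			fmt.Printf("  %s - %s\n", m.Name, m.Role)
		}
	}
}
```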

Continue reading “Normalization in Relational Databases”

My Top 2 Reading List for Go & Data

Right now I’ve got a couple of books in the queue.

“Black Hat Go: Go Programming for Hackers and Pentesters” by Tom Steele, Chris Patten, and Dan Kottmann.


I picked this book up after a good look at the index. There is a basic intro to Go at the beginning of the book to cover language fundamentals, but immediately after that it dives into the meat of the topic: understanding TCP and the TCP handshake, writing a scanner, and building a proxy. Chapters 3 and 4 cover some basics of HTTP servers, routers, and middleware, and then chapter 5 returns to topics of interest around exploiting DNS and writing DNS clients. I perused the index a bit further and noted that it covers SMB and NTLM, has a chapter on “Abusing Databases and Filesystems”, and then moves on to raw packet processing, writing and porting exploit code, and a host of other topics. This is going to be an interesting book to dig into and write about in the coming weeks. I’m pretty excited about it and hope to write a thorough review upon completion.
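As a taste of the kind of thing those early chapters deal with, writing a simple TCP connect scanner looks roughly like the sketch below. This is my own toy version of the general technique, not code from the book, and the target host and port range are placeholders.

```go
package main

import (
	"fmt"
	"net"
	"sort"
	"sync"
	"time"
)

func main() {
	// Placeholder target; only scan hosts you have permission to probe.
	host := "scanme.nmap.org"

	var (
		mu   sync.Mutex
		open []int
		wg   sync.WaitGroup
	)

	// Attempt a TCP connect to each port; a successful dial means the port is open.
	for port := 1; port <= 1024; port++ {
		wg.Add(1)
		go func(p int) {
			defer wg.Done()
			addr := fmt.Sprintf("%s:%d", host, p)
			conn, err := net.DialTimeout("tcp", addr, 2*time.Second)
			if err != nil {
				return // closed or filtered
			}
			conn.Close()
			mu.Lock()
			open = append(open, p)
			mu.Unlock()
		}(port)
	}

	wg.Wait()
	sort.Ints(open)
	for _, p := range open {
		fmt.Printf("port %d open\n", p)
	}
}
```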

“Designing Data-Intensive Applications” by Martin Kleppmann


This book is familiar territory, as I’ve spent much of my career working on the kinds of topics it covers. What I suspect is that I’ll enjoy reading the material presented in an orderly and concise way, versus the chaotic and disruptive way I’ve acquired my knowledge of these topics.

From the index, the book starts off with the foundations of data systems and the ideas around building reliable, scalable, and maintainable applications. This provides a good basis from which to dive into the other topics. From there it looks like we’ll get a run through the birth of NoSQL, the object-relational mismatch, and the related insanity that this has bred in the industry! Then comes a solid dive into graph-like, traditional, and multi-model data modeling. The first quarter of the book then covers everything from hash indexes, SSTables (another familiar topic), LSM-Trees, B-Trees, and related indexing structures, before wrapping up with star and snowflake schemas for analytics, column-oriented storage, compression, sort orders in column storage, and related material on aggregation in data cubes and materialized views.

That’s just the first 25%! From there Martin covers a wide range of topics that, if you’re in the industry and plan to deal with large-scale data-intensive applications, you need to be intimately familiar with!
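To make one of those storage-engine ideas concrete, here is a toy sketch of the hash-index idea: an append-only log file plus an in-memory map from key to byte offset, so a read is a single seek. This is my own simplification for illustration, not code from the book, and the file format and names are made up.

```go
package main

import (
	"fmt"
	"io"
	"os"
)

// logStore is a toy append-only log with an in-memory hash index:
// each key maps to the byte offset of its latest value in the file.
type logStore struct {
	f     *os.File
	index map[string]int64
}

func openStore(path string) (*logStore, error) {
	f, err := os.OpenFile(path, os.O_APPEND|os.O_CREATE|os.O_RDWR, 0o644)
	if err != nil {
		return nil, err
	}
	return &logStore{f: f, index: map[string]int64{}}, nil
}

// Set appends "key,value\n" to the log and records the value's offset.
func (s *logStore) Set(key, value string) error {
	off, err := s.f.Seek(0, io.SeekEnd)
	if err != nil {
		return err
	}
	if _, err := s.f.WriteString(fmt.Sprintf("%s,%s\n", key, value)); err != nil {
		return err
	}
	// The value starts right after "key," within the record.
	s.index[key] = off + int64(len(key)+1)
	return nil
}

// Get seeks straight to the recorded offset and reads up to the newline.
func (s *logStore) Get(key string) (string, bool) {
	off, ok := s.index[key]
	if !ok {
		return "", false
	}
	buf := make([]byte, 256)
	n, _ := s.f.ReadAt(buf, off)
	for i := 0; i < n; i++ {
		if buf[i] == '\n' {
			return string(buf[:i]), true
		}
	}
	return string(buf[:n]), true
}

func main() {
	s, err := openStore("toy.log")
	if err != nil {
		panic(err)
	}
	defer s.f.Close()

	s.Set("album", "Formshifter")
	s.Set("album", "Proponent for Sentience") // later write wins; the index points at the new offset
	if v, ok := s.Get("album"); ok {
		fmt.Println(v)
	}
}
```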

Reading

Over the next few months, while I read through these books, I hope to provide summaries and related notes on the material. Who knows, maybe you’ll want to dive into the material yourself! Until then, happy thrashing code, and may you have high retention and comprehension in your reading!

Beyond CRUD n’ Cruft Data-Modeling

I dig through a lot of internet results and blog entries that show CRUD data modeling all the time. A lot of these blog entries and documentation are pretty solid. Unfortunately, we rarely end up with data that is accurately or precisely modeled the way it ought to be, or the way we would ideally use it. In this post I’m going to take some sample elements of data and model them out for various uses, then reconstitute that data into different structures for use within microservices, for loading, and for reading, in both normalized and denormalized forms.

The Domain: Railroad Systems & Services

The domain I chose for this particular example is the entire global spectrum of rail services. Imagine, if you would, a system that can track all the trains in the world, or even just the trains in a particular area of the world, like the United States. In the United States, trains can be broken down into logical structures of data for things like freight trains and passenger trains, the particular operator a train runs under, like Amtrak, Union Pacific, or Norfolk Southern, and the respective consist that each train is made up of. Let’s get into some particular word definitions to fully detail this domain.

Continue reading “Beyond CRUD n’ Cruft Data-Modeling”
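As a rough preview of how that domain might fall out into structures, here is a small sketch in Go. The names (Operator, Train, ConsistItem, TrainView) and the split between a normalized form and a denormalized read model are my own guesses for illustration, not the post’s actual model.

```go
package main

import "fmt"

// Normalized form: each entity lives once and is referenced by ID.
type Operator struct {
	ID   int
	Name string // e.g. Amtrak, Union Pacific, Norfolk Southern
}

type Train struct {
	ID         int
	OperatorID int
	Service    string // "freight" or "passenger"
}

// A consist is the set of locomotives and cars a train is made up of.
type ConsistItem struct {
	TrainID  int
	Position int
	UnitType string // "locomotive", "boxcar", "coach", ...
}

// Denormalized read model: everything a reading service needs in one document,
// at the cost of duplicating operator and consist details per train.
type TrainView struct {
	TrainID  int
	Operator string
	Service  string
	Units    []string
}

func main() {
	operators := map[int]Operator{1: {ID: 1, Name: "Amtrak"}}
	trains := []Train{{ID: 10, OperatorID: 1, Service: "passenger"}}
	consist := []ConsistItem{
		{TrainID: 10, Position: 1, UnitType: "locomotive"},
		{TrainID: 10, Position: 2, UnitType: "coach"},
	}

	// Build the denormalized view from the normalized pieces.
	for _, t := range trains {
		view := TrainView{TrainID: t.ID, Operator: operators[t.OperatorID].Name, Service: t.Service}
		for _, c := range consist {
			if c.TrainID == t.ID {
				view.Units = append(view.Units, c.UnitType)
			}
		}
		fmt.Printf("%+v\n", view)
	}
}
```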

Go Library Data Generation Timings

Recently I put together some quick code to give some timings on the various data generation libraries available for Go. For each library there were a few key pieces of data generation I wanted to time:

  • First Name – basically a first name of some sort, like Adam, Nancy, or Frank.
  • Full Name – something like Jason McCormick or Sally Smith.
  • Address – A basic street address, or whatever the generator might provide.
  • User Agent – Such as that which is sent along with a browser request.
  • Color – Something like red, blue, green, or another color beyond the basics.
  • Email – A fully formed, albeit faked email address.
  • Phone – A phone number, ideally with area code and prefix too.
  • Credit Card Number – Ideally a properly formed one, which many of the generators seem to provide based on Visa, Mastercard, or related company specifications.
  • Sentence – A standard multi-word, lorem-ipsum-based sentence would be perfect.

I went through and searched for libraries that I wanted to try out, and of all the libraries I found, I narrowed it down to three. Since Go import paths double as repository locations, the imports below show exactly where each library lives (a rough sketch of the timing harness itself follows the list):

  • “github.com/bxcodec/faker” – faker – Faker generates data based on a Struct, which is a pretty cool way to determine what type of data you want and to get it returned in a particularly useful format.
  • “github.com/icrowley/fake” – fake – Fake is a library inspired by the ffaker and forgery Ruby gems. Not that you’d necessarily be familiar with those, but if you are, you have instant insight into how this library works.
  • “github.com/malisit/kolpa” – kolpa – This is another data generator that creates fake data for various types of data, structures, strings, sentences, and more.
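Rather than reproduce any one library’s API from memory here, a generic timing harness shows the shape of the measurement: wrap each generator in a func() string and time a batch of calls. This is a rough sketch; the timeGenerator helper and the stub generator are my own stand-ins, and in the real timings each library’s first-name, full-name, email, and other functions would slot in where the stub is.

```go
package main

import (
	"fmt"
	"time"
)

// timeGenerator runs a generator n times and reports total and per-call duration.
func timeGenerator(name string, n int, gen func() string) {
	start := time.Now()
	for i := 0; i < n; i++ {
		_ = gen()
	}
	elapsed := time.Since(start)
	fmt.Printf("%-12s %d calls in %v (%v per call)\n", name, n, elapsed, elapsed/time.Duration(n))
}

func main() {
	const n = 100000

	// Stand-in generator; each library's functions would be dropped in here instead.
	stub := func() string { return "Adam" }

	timeGenerator("First Name", n, stub)
	timeGenerator("Full Name", n, stub)
	timeGenerator("Email", n, stub)
}
```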

Continue reading “Go Library Data Generation Timings”

Restarting Data Diluvium – Four Steps

I’ve got four steps I’m going through to reboot the Data Diluvium Project & the respective CLI app I started about a year ago. I got a little ways into the project and then got a bit distracted; it happens. Here are the next steps I’m taking, and for those interested in helping out, I’ll be blogging the work here and also sending out updates via my Thrashing Code Newsletter. You can sign up and select all the news, or just the open source project news if you only want to follow the projects.

Step 0: Write Up the Ideas Behind the Project

Ok, so this will arrive subsequently; for now I just wanted to get these notes and intentions written down. Previously I’d written about the idea here, and here. That said, after many discussions with a number of people, there will be some twists and turns to the project to make it more useful and streamlined in the CLI & services.

Step 1: Cleanup The Repository

Currently the repository is kind of a mess. I’m going to aim to do the following over the next few days.

  • Write up contributor issues/files for the repo.
  • Rewrite the documentation (the initial docs, that is) to detail the intent of the data generator ideas.
  • Move the CLI into a parallel repo designed specifically to work against this repo’s project.
  • Write up a README.md that will detail what Data Diluvium is exactly as well as point to the project site and provide installation and setup instructions.
  • Set up the first databases to target: PostgreSQL, Cassandra, and *maybe* one other database, but I’m not sure which one. Feel free to file an issue with a suggestion.

Step 2: Cleanup & Publish a new Project Website

This is a simple one: I need to write up copy with the details, specifically feature descriptions and intended examples. This will provide the starting point on which to base the work for the project. It will be similar to one of those living documents, in that the documentation will, can, and should change as the project is developed.

Step 3: Get More Cats Coding!

I’ve pinged a few people I know are interested in helping out, but we’re always looking for others to help with PRs and related efforts around the project(s). If you’re game, the easiest way to get started would be to ping me directly via DM on Twitter @adron and to sign up on my Thrashing Code Newsletter and select Open Source Projects Only (unless you want all the things).

…anyway, getting to work on these tasks. Happy coding!