Normalization in Relational Databases

This is a continuation of my posts on relational databases, started here. Previous posts on this theme include “Data Modeling”, “Let’s Talk About Database Schema”, “The Exasperating Topic of Database Indexes”, “The Keys & Relationships of Relational Databases”, and “Relational Database Query Optimization”.

Suppose you’re a die-hard fan of progressive death metal, with a particular affinity for bands like Allegaeon. Over the years, you’ve accumulated a vast collection of CDs, vinyl records, and other memorabilia.

Your collection has grown so much that you decide to document every item meticulously. Each piece of memorabilia carries information about the album, track titles, band members, and so forth. If you scribbled every detail down in one continuous list, you’d end up with a lot of repeated information. For instance, both “Proponent for Sentience” and “Formshifter” would mention the same band members, such as Riley McShane and Michael Stancel.

Normalization is akin to setting up separate lists or sections in your documentation: one section purely for “Band Members”, where you detail the members of Allegaeon over time, and another for “Albums”, where instead of listing the band members all over again you simply refer back to the “Band Members” section. This kind of organization cuts down on redundancy and ensures that if, say, a band member leaves, you have only one spot to update.
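As a rough sketch of what those separate sections look like in code (my own illustration, not from the post, and the line-ups are placeholders rather than an accurate discography), band members live in one place and albums refer to them by ID:

```go
package main

import "fmt"

// BandMember is the single, authoritative record for a member.
// If a member changes or leaves, only this record is touched.
type BandMember struct {
	ID   int
	Name string
	Role string
}

// Album references members by ID instead of repeating their details,
// which is the essence of normalizing the collection.
type Album struct {
	Title     string
	Year      int
	MemberIDs []int
}

func main() {
	members := map[int]BandMember{
		1: {ID: 1, Name: "Riley McShane", Role: "Vocals"},
		2: {ID: 2, Name: "Michael Stancel", Role: "Guitar"},
	}

	// Line-ups here are illustrative only.
	albums := []Album{
		{Title: "Proponent for Sentience", Year: 2016, MemberIDs: []int{1, 2}},
		{Title: "Formshifter", Year: 2012, MemberIDs: []int{1, 2}},
	}

	// Each album looks up members in the one shared "Band Members" section.
	for _, a := range albums {
		fmt.Printf("%s (%d):\n", a.Title, a.Year)
		for _, id := range a.MemberIDs {
			m := members[id]
			fmt.Printf("  %s - %s\n", m.Name, m.Role)
		}
	}
}
```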

Continue reading “Normalization in Relational Databases”

My Top 2 Reading List for Go & Data

Right now I’ve got a couple of books in the queue.

“Black Hat Go: Go Programming for Hackers and Pentesters” by Tom Steele, Chris Patten, and Dan Kottmann.


I picked this book up after a good look at the index. There is a basic intro to Go at the beginning of the book to cover language fundamentals, but immediately after that it dives into the meat of the topic: understanding TCP and the TCP handshake, writing a scanner, and building a proxy. Chapters 3 and 4 cover some basics of HTTP servers, routers, and middleware, and then chapter 5 returns to topics of interest around exploiting DNS and writing DNS clients. I perused the index a bit further and noted that it covers SMB and NTLM, has a chapter on “Abusing Databases and Filesystems”, and then moves on to raw packet processing, writing and porting exploit code, and a host of other topics. This is going to be an interesting book to dig into and write about in the coming weeks. I’m pretty excited about it and hope to write a thorough review upon completion.
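As a taste of the kind of thing those early chapters deal with, writing a simple TCP connect scanner looks roughly like the sketch below. This is my own toy version of the general technique, not code from the book, and the target host and port range are placeholders.

```go
package main

import (
	"fmt"
	"net"
	"sort"
	"sync"
	"time"
)

func main() {
	// Placeholder target; only scan hosts you have permission to probe.
	host := "scanme.nmap.org"

	var (
		mu   sync.Mutex
		open []int
		wg   sync.WaitGroup
	)

	// Attempt a TCP connect to each port; a successful dial means the port is open.
	for port := 1; port <= 1024; port++ {
		wg.Add(1)
		go func(p int) {
			defer wg.Done()
			addr := fmt.Sprintf("%s:%d", host, p)
			conn, err := net.DialTimeout("tcp", addr, 2*time.Second)
			if err != nil {
				return // closed or filtered
			}
			conn.Close()
			mu.Lock()
			open = append(open, p)
			mu.Unlock()
		}(port)
	}

	wg.Wait()
	sort.Ints(open)
	for _, p := range open {
		fmt.Printf("port %d open\n", p)
	}
}
```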

“Designing Data-Intensive Applications” by Martin Kleppmann


This book is familiar territory, as I’ve spent much of my career working on the kinds of topics it covers. What I suspect is that I’ll enjoy reading the material presented in an orderly and concise way, versus the chaotic and disruptive way I’ve acquired my knowledge of these topics.

From the index, the book starts off with the foundations of data systems and the ideas around building reliable, scalable, and maintainable applications. This provides a good basis from which to dive into the other topics. From there it looks like we’ll get a run through the birth of NoSQL, the object-relational mismatch, and the related insanity that this has bred in the industry! Then comes a solid dive into graph-like, traditional, and multi-model data modeling. The first quarter of the book then covers everything from hash indexes, SSTables (another familiar topic), LSM-Trees, B-Trees, and related indexing structures, before wrapping up with star and snowflake schemas for analytics, column-oriented storage, compression, sort orders in column storage, and related material on aggregation in data cubes and materialized views.

That’s just the first 25%! From there Martin covers a wide range of topics that, if you’re in the industry and plan to deal with large-scale data-intensive applications, you need to be intimately familiar with!
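To make one of those storage-engine ideas concrete, here is a toy sketch of the hash-index idea: an append-only log file plus an in-memory map from key to byte offset, so a read is a single seek. This is my own simplification for illustration, not code from the book, and the file format and names are made up.

```go
package main

import (
	"fmt"
	"io"
	"os"
)

// logStore is a toy append-only log with an in-memory hash index:
// each key maps to the byte offset of its latest value in the file.
type logStore struct {
	f     *os.File
	index map[string]int64
}

func openStore(path string) (*logStore, error) {
	f, err := os.OpenFile(path, os.O_APPEND|os.O_CREATE|os.O_RDWR, 0o644)
	if err != nil {
		return nil, err
	}
	return &logStore{f: f, index: map[string]int64{}}, nil
}

// Set appends "key,value\n" to the log and records the value's offset.
func (s *logStore) Set(key, value string) error {
	off, err := s.f.Seek(0, io.SeekEnd)
	if err != nil {
		return err
	}
	if _, err := s.f.WriteString(fmt.Sprintf("%s,%s\n", key, value)); err != nil {
		return err
	}
	// The value starts right after "key," within the record.
	s.index[key] = off + int64(len(key)+1)
	return nil
}

// Get seeks straight to the recorded offset and reads up to the newline.
func (s *logStore) Get(key string) (string, bool) {
	off, ok := s.index[key]
	if !ok {
		return "", false
	}
	buf := make([]byte, 256)
	n, _ := s.f.ReadAt(buf, off)
	for i := 0; i < n; i++ {
		if buf[i] == '\n' {
			return string(buf[:i]), true
		}
	}
	return string(buf[:n]), true
}

func main() {
	s, err := openStore("toy.log")
	if err != nil {
		panic(err)
	}
	defer s.f.Close()

	s.Set("album", "Formshifter")
	s.Set("album", "Proponent for Sentience") // later write wins; the index points at the new offset
	if v, ok := s.Get("album"); ok {
		fmt.Println(v)
	}
}
```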

Reading

Over the next few months, while I read through these books, I hope to provide summaries and related notes on the material. Who knows, maybe you’ll want to dive into the material yourself! Until then, happy thrashing code, and may you have high retention and comprehension in your reading!

Beyond CRUD n’ Cruft Data-Modeling

I dig through a lot of internet results and blog entries that show CRUD data modeling all the time. A lot of these blog entries and documentation are pretty solid. Unfortunately, we rarely end up with data that is accurately or precisely modeled the way it ought to be, or the way we would ideally use it. In this post I’m going to take some sample elements of data and model them out for various uses, then reconstitute that data into different structures for use within microservices, for loading, and for reading, in both normalized and denormalized forms.

The Domain: Railroad Systems & Services

The domain I chose for this particular example is the entire global spectrum of rail services. Imagine, if you would, a system that can track all the trains in the world, or even just the trains in a particular area of the world, like the United States. In the United States, trains can be broken down into logical structures of data for things like freight trains and passenger trains, the particular operator a train runs under, like Amtrak, Union Pacific, or Norfolk Southern, and the respective consist that each train is made up of. Let’s get into some particular word definitions to fully detail this domain.

Continue reading “Beyond CRUD n’ Cruft Data-Modeling”
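As a rough preview of how that domain might fall out into structures, here is a small sketch in Go. The names (Operator, Train, ConsistItem, TrainView) and the split between a normalized form and a denormalized read model are my own guesses for illustration, not the post’s actual model.

```go
package main

import "fmt"

// Normalized form: each entity lives once and is referenced by ID.
type Operator struct {
	ID   int
	Name string // e.g. Amtrak, Union Pacific, Norfolk Southern
}

type Train struct {
	ID         int
	OperatorID int
	Service    string // "freight" or "passenger"
}

// A consist is the set of locomotives and cars a train is made up of.
type ConsistItem struct {
	TrainID  int
	Position int
	UnitType string // "locomotive", "boxcar", "coach", ...
}

// Denormalized read model: everything a reading service needs in one document,
// at the cost of duplicating operator and consist details per train.
type TrainView struct {
	TrainID  int
	Operator string
	Service  string
	Units    []string
}

func main() {
	operators := map[int]Operator{1: {ID: 1, Name: "Amtrak"}}
	trains := []Train{{ID: 10, OperatorID: 1, Service: "passenger"}}
	consist := []ConsistItem{
		{TrainID: 10, Position: 1, UnitType: "locomotive"},
		{TrainID: 10, Position: 2, UnitType: "coach"},
	}

	// Build the denormalized view from the normalized pieces.
	for _, t := range trains {
		view := TrainView{TrainID: t.ID, Operator: operators[t.OperatorID].Name, Service: t.Service}
		for _, c := range consist {
			if c.TrainID == t.ID {
				view.Units = append(view.Units, c.UnitType)
			}
		}
		fmt.Printf("%+v\n", view)
	}
}
```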

Go Library Data Generation Timings

Recently I put together some quick code to give some timings on the various data generation libraries available for Go. For each library there were a few key pieces of data generation I wanted to time:

  • First Name – basically a first name of some sort, like Adam, Nancy, or Frank.
  • Full Name – something like Jason McCormick or Sally Smith.
  • Address – A basic street address, or whatever the generator might provide.
  • User Agent – Such as that which is sent along with a browser request.
  • Color – Something like red, blue, green, or another color beyond the basics.
  • Email – A fully formed, albeit faked email address.
  • Phone – A phone number, ideally with area code and prefix too.
  • Credit Card Number – Ideally a properly formed one, which many of the generators seem to provide based on Visa, Mastercard, or related company specifications.
  • Sentence – A standard multi-word, lorem-ipsum-based sentence would be perfect.

I went through and searched for libraries that I wanted to try out, and of all the libraries I found, I narrowed it down to three. Since Go import paths double as repository locations, the imports below show exactly where each library lives (a rough sketch of the timing harness itself follows the list):

  • “github.com/bxcodec/faker” – faker – Faker generates data based on a Struct, which is a pretty cool way to determine what type of data you want and to get it returned in a particularly useful format.
  • “github.com/icrowley/fake” – fake – Fake is a library inspired by the ffaker and forgery Ruby gems. Not that you’d necessarily be familiar with those, but if you are, you have instant insight into how this library works.
  • “github.com/malisit/kolpa” – kolpa – This is another data generator that creates fake data for various types of data, structures, strings, sentences, and more.
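Rather than reproduce any one library’s API from memory here, a generic timing harness shows the shape of the measurement: wrap each generator in a func() string and time a batch of calls. This is a rough sketch; the timeGenerator helper and the stub generator are my own stand-ins, and in the real timings each library’s first-name, full-name, email, and other functions would slot in where the stub is.

```go
package main

import (
	"fmt"
	"time"
)

// timeGenerator runs a generator n times and reports total and per-call duration.
func timeGenerator(name string, n int, gen func() string) {
	start := time.Now()
	for i := 0; i < n; i++ {
		_ = gen()
	}
	elapsed := time.Since(start)
	fmt.Printf("%-12s %d calls in %v (%v per call)\n", name, n, elapsed, elapsed/time.Duration(n))
}

func main() {
	const n = 100000

	// Stand-in generator; each library's functions would be dropped in here instead.
	stub := func() string { return "Adam" }

	timeGenerator("First Name", n, stub)
	timeGenerator("Full Name", n, stub)
	timeGenerator("Email", n, stub)
}
```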

Continue reading “Go Library Data Generation Timings”

Restarting Data Diluvium – Four Steps

I’ve got four steps I’m going through to reboot the Data Diluvium Project & the respective CLI app I started about a year ago. I got a little ways into the project and then got a bit distracted; it happens. Here are the next steps I’m taking, and for those interested in helping out, I’ll be blogging the work here and also sending out updates via my Thrashing Code Newsletter. You can sign up and select all the news, or just the open source project news if you only want to follow the projects.

Step 0: Write Up the Ideas Behind the Project

Ok, so this will arrive subsequently; for now I just wanted to get these notes and intentions written down. Previously I’d written about the idea here, and here. That said, after many discussions with a number of people, there will be some twists and turns to the project to make it more useful and streamlined in the CLI & services.

Step 1: Cleanup The Repository

Currently the repository is kind of a mess. I’m going to aim to do the following over the next few days.

  • Write up contributor issues/files for the repo.
  • Rewrite the documentation (the initial docs, that is) to detail the intent of the data generator ideas.
  • Move the CLI into a parallel repo designed specifically to work against this repo’s project.
  • Write up a README.md that will detail what Data Diluvium is exactly as well as point to the project site and provide installation and setup instructions.
  • Set up the first databases to target: PostgreSQL, Cassandra, and *maybe* one other database, but I’m not sure which one. Feel free to file an issue with a suggestion.

Step 2: Cleanup & Publish a new Project Website

This is a simple one: I need to write up copy with the details, specifically feature descriptions and intended examples. This will provide the starting point on which to base the work for the project. It will be similar to one of those living documents, in that the documentation will, can, and should change as the project is developed.

Step 3: Get More Cats Coding!

I’ve pinged a few people I know are interested in helping out, but we’re always looking for others to help with PRs and related efforts around the project(s). If you’re game, the easiest way to get started would be to ping me directly via DM on Twitter @adron and to sign up on my Thrashing Code Newsletter and select Open Source Projects Only (unless you want all the things).

…anyway, getting to work on these tasks. Happy coding!