My Top 2 Reading List for Go & Data

Right now I’ve got a couple books in queue.

“Black Hat Go: Go Programming for Hackers and Pentesters” by Tom Steele, Chris Patten, and Dan Kottmann.


I picked this book up after a good look at the index. There’s a basic intro to Go at the beginning of the book to cover language fundamentals, but it dives into the meat of the topic right after that: understanding the TCP handshake and TCP itself, writing a scanner, and building a proxy. Chapters 3 and 4 cover some basics of HTTP servers, routers, and middleware, then chapter 5 returns to topics of interest around exploiting DNS and writing DNS clients. I perused the index a bit further and noted it covers SMB and NTLM, has a chapter on “Abusing Databases and Filesystems”, then moves on to raw packet processing, writing and porting exploit code, and a host of other topics. This is going to be an interesting book to dig into and write about in the coming weeks. I’m pretty excited about it and hope to write a thorough review upon completion.
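Just to make the scanner idea concrete before I even crack the book, here’s a minimal sketch of a TCP connect scan in Go (my own toy example under my own assumptions, not a listing from the book):

```go
package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	// A successful dial completes the full TCP three-way handshake,
	// so a connection that opens means the port is open.
	for port := 20; port <= 25; port++ {
		addr := fmt.Sprintf("scanme.nmap.org:%d", port)
		conn, err := net.DialTimeout("tcp", addr, 2*time.Second)
		if err != nil {
			continue // closed or filtered
		}
		conn.Close()
		fmt.Printf("port %d open\n", port)
	}
}
```

The book presumably builds this out with concurrency and proper error handling; this is just the bare idea.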

“Designing Data-Intensive Applications” by Martin Kleppmann


This book is familiar territory, as I’ve spent much of my career working on the topics it covers. What I suspect is that I’ll enjoy reading the material presented in an orderly and concise way, versus the chaotic and disruptive way I’ve acquired my knowledge of these topics.

From the index, the book starts off with the foundations of data systems and the ideas around building reliable, scalable, and maintainable applications. This provides a good basis from which to dive into the other topics. From there it looks like we’ll get a run through the birth of NoSQL, object-relational mismatches, and the related insanity this has bred in the industry! Then comes a solid dive into graph-like, traditional, and multi-model data modeling. The first quarter of the book goes on to cover everything from hash indexes, SSTables (another familiar topic), LSM-trees, B-trees, and related indexing structures, before wrapping up with star and snowflake schemas for analytics, column-oriented storage, compression, sort orders in column storage, and related material on aggregation in data cubes and materialized views.

That’s just the first 25%! From there Martin covers a wide range of topics that, if you’re in the industry and plan to deal with large-scale data-intensive applications, you need to be intimately familiar with!

Reading

Over the next few months while I read through these books I hope to provide summaries and related notes on the material. Who knows, maybe you’ll want to dive into the material yourself! Until then happy thrashing code and may you have high retention and comprehension in reading!

Beyond CRUD n’ Cruft Data-Modeling

I dig through a lot of internet search results and blog entries that show CRUD data modeling, and a lot of those entries and docs are pretty solid. Unfortunately, rarely do we end up with data modeled as accurately or precisely as it ought to be, or the way we would ideally use it. In this post I’m going to take some sample elements of data and model them out for various uses, then reconstitute that data into different structures for use within microservices, for loading and for reading, in both normalized and denormalized form.

The Domain: Railroad Systems & Services

The domain I chose for this particular example is the entire global spectrum of rail services. Imagine, if you would, a system that can track all the trains in the world, or even just the trains in a particular area of the world, like the United States. In the United States the trains can be broken down into logical structures of data for various things like freight trains and passenger trains: trains operated by a particular operator like Amtrak, Union Pacific, or Norfolk Southern, and the respective consists each train is made up of. Let’s get into some particular word definitions to fully detail this domain. Continue reading “Beyond CRUD n’ Cruft Data-Modeling”
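As a rough taste of where that modeling could start, here’s a hypothetical Go sketch of the domain. The type names and fields are my illustrative choices for this excerpt, not the post’s final model:

```go
package rail

// Operator is a railroad operating company, e.g. Amtrak or Union Pacific.
type Operator struct {
	ID   int
	Name string
}

// Train references its operator by ID, the normalized form you'd
// store in a relational database.
type Train struct {
	ID         int
	OperatorID int
	Kind       string // "freight" or "passenger"
}

// ConsistItem is one unit (locomotive or car) in a train's ordered consist.
type ConsistItem struct {
	TrainID  int
	Position int
	UnitType string // e.g. "locomotive", "boxcar", "coach"
}

// TrainView is a denormalized read model: the flattened shape a
// microservice might serve without any joins.
type TrainView struct {
	TrainID  int
	Operator string
	Kind     string
	Consist  []string
}
```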

Go Library Data Generation Timings

Recently I put together some quick code to give some timings on the various data generation libraries available for Go. For each library there were a few key pieces of data generation I wanted to time:

  • First Name – basically a first name of some sort, like Adam, Nancy, or Frank.
  • Full Name – something like Jason McCormick or Sally Smith.
  • Address – A basic street address, or whatever the generator might provide.
  • User Agent – Such as that which is sent along with a browser request.
  • Color – Something like red, blue, green, or other color beyond the basics.
  • Email – A fully formed, albeit faked email address.
  • Phone – A phone number, ideally with area code and prefix too.
  • Credit Card Number – Ideally a properly formed one, which many of the generators seem to provide based on VISA, Mastercard, or related company specifications.
  • Sentence – A standard multi-word lorem ipsum based sentence would be perfect.

I went through and searched for libraries that I wanted to try out, and of all the libraries I found I narrowed it down to three. Because of how Go imports work, adding the import for each library also gives you its repo location:

  • “github.com/bxcodec/faker” – faker – Faker generates data based on a Struct, which is a pretty cool way to determine what type of data you want and to get it returned in a particularly useful format.
  • “github.com/icrowley/fake” – fake – Fake is a library inspired by the ffaker and forgery Ruby gems. Not that you’d be familiar with those, but if you are you have instant insight into how this library works.
  • “github.com/malisit/kolpa” – kolpa – This is another data generator that creates fake data for various types of data, structures, strings, sentences, and more.
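To give a sense of how I’m timing these, here’s a minimal harness sketch. It assumes icrowley/fake’s function names (fake.FirstName, fake.FullName, fake.EmailAddress, and so on), so double-check those against the library docs before running it:

```go
package main

import (
	"fmt"
	"time"

	"github.com/icrowley/fake"
)

// timeGenerator calls gen n times and prints the total elapsed time,
// enough to compare the libraries head to head.
func timeGenerator(label string, n int, gen func() string) {
	start := time.Now()
	for i := 0; i < n; i++ {
		_ = gen()
	}
	fmt.Printf("%-12s %d calls in %v\n", label, n, time.Since(start))
}

func main() {
	const n = 100000
	timeGenerator("first name", n, fake.FirstName)
	timeGenerator("full name", n, fake.FullName)
	timeGenerator("email", n, fake.EmailAddress)
	timeGenerator("phone", n, fake.Phone)
	timeGenerator("sentence", n, fake.Sentence)
}
```

The same wrapper works for the other two libraries by swapping in their generator functions.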

Continue reading “Go Library Data Generation Timings”

Restarting Data Diluvium – Four Steps

I’ve got four steps I’m going through to reboot the Data Diluvium Project & the respective CLI app I started about a year ago. I got a little ways into the project and then got a bit distracted; it happens. Here are the next steps I’m taking, and for those interested in helping out I’ll be blogging the work here and also sending out updates via my Thrashing Code Newsletter. You can sign up and select all the news, or just the open source project news if you only want to follow the projects.

Step 0: Write Up the Ideas Behind the Project

Ok, so the full write-up will arrive subsequently; for now I just wanted to get these notes and intentions written down. Previously I’d written about the idea here, and here. After many discussions with a number of people, there will be some twists and turns to the project to make it more useful and streamlined in the CLI & services.

Step 1: Clean Up the Repository

Currently the repository is kind of a mess. I’m going to aim to do the following over the next few days.

  • Write up contributor issues/files for the repo.
  • Rewrite the documentation (initial docs that is) to detail the intent of the data generator ideas.
  • Move the CLI into a parallel repo designed specifically to work against this repo’s project.
  • Write up a README.md that will detail what Data Diluvium is exactly as well as point to the project site and provide installation and setup instructions.
  • Set up the first databases to target as PostgreSQL, Cassandra, and *maybe* one other database, but I’m not sure which one. Feel free to file an issue with a suggestion. (A rough sketch of what a pluggable target could look like follows this list.)
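Here’s that sketch: one hypothetical shape for a pluggable database target in Go. This is entirely my speculation on the design, not the project’s actual interface:

```go
package diluvium

// Target is one guess at what a pluggable database target could look
// like for Data Diluvium: connect, load generated rows, close.
type Target interface {
	// Connect opens a connection using a target-specific DSN.
	Connect(dsn string) error
	// Load writes a batch of generated rows into the named table.
	Load(table string, rows []map[string]interface{}) error
	// Close releases the underlying connection.
	Close() error
}
```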

Step 2: Clean Up & Publish a New Project Website

This is a simple one: I need to write up copy with the details, specifically feature descriptions and intended examples. This will provide the starting point on which to base the work for the project. It will be similar to one of those living documents, in that the documentation will, can, and should change as the project is developed.

Step 3: Get More Cats Coding!

I’ve pinged a few people I know are interested in helping out, but we’re always looking for others to help with PRs and related efforts around the project(s). If you’re game, the easiest way to get started is to ping me directly via DM on Twitter @adron and to sign up for my Thrashing Code Newsletter, selecting Open Source Projects Only (unless you want all the things).

…anyway, getting to work on these tasks. Happy coding!

Big Ole Invisible Freight Railroads (Let’s Talk Volumes)

Let’s talk about freight shipments in the United States for a moment. You often see 18-wheelers and such hauling stuff back and forth on the roadways, but did you realize they account for just shy of 30% of freight movement in the United States by ton-miles carried? Railroads carry a whopping 40%. That’s right: the relatively invisible, barely interruptive freight railroads, much cleaner than 18-wheelers or planes!

Ton-miles from FRA data on freight systems (referenced below via the FRA DOT link):

https://www.fra.dot.gov/Page/P0362

Let’s talk about a few other things. The road systems in the US, which have the largest amount of road damage costs attributed to weather and trucking, cost the US taxpayer over $100 billion a year. That’s often $100 billion beyond the $38–42 billion a year in gas taxes we pay (more). The railroads in the United States, however, rarely take any taxpayer money and rely almost solely on shipping fees paid by customers. Freight railroads are among the only entities that actually pay the vast majority of their own operational and capital costs, and even beyond that, unlike those dastardly automobile hand-outs, car welfare, and related subsidies! They even contribute back to society with business expansions in communities, charity programs, and more (read here, here, and here). Overall, the freight railroads have built, continue to build, and continue to be one of the greatest industrial assets this nation or any nation has ever seen. But I do digress…


That Freight Rail Data

I didn’t actually start this post to yammer on about how cool and great the freight railroads are; I wanted to talk about some data. You see, I wanted to get hold of some interesting data, and I find the massive freight movements of the railroads, and of the United States overall, rather interesting. Thus I set out on a quest to seek out and find all the data I could. This is the path and the data I found, and soon you too will find this data showing up in the applications and tooling that I build and distribute, via open source projects and other means, to you dear reader!

Just for context, here are some details of the data I’m going to dig into. The following data is primarily monitored and managed, for market and regulatory reasons, by a government entity called the Surface Transportation Board, or STB. The STB’s website is straight-up circa 1999, so it’s worth a look just for all that historical glory! Anyway, here’s the short description from the STB’s site itself.

The Surface Transportation Board is an independent adjudicatory and economic-regulatory agency charged by Congress with resolving railroad rate and service disputes and reviewing proposed railroad mergers.

The agency has jurisdiction over railroad rate and service issues and rail restructuring transactions (mergers, line sales, line construction, and line abandonments); certain trucking company, moving van, and non-contiguous ocean shipping company rate matters; certain intercity passenger bus company structure, financial, and operational matters; and rates and services of certain pipelines not regulated by the Federal Energy Regulatory Commission. The agency has authority to investigate rail service matters of regional and national significance.


The STB is one place with a lot of regulations that garner a lot of data from the Class I railroads. These Class I railroads make up the vast bulk of rail traffic in the United States, and also Canada and Mexico! Overall, there are over 140,000 miles of track that the railroads operate on, all built and managed by the freight railroads themselves, except for a few hundred miles in the northeastern United States. For comparison, that’s 140k miles of railroad, while our Eisenhower Interstate Highway System is only 47,856 miles and costs about $40–66 billion per year (amazing Quora answer w/ tons of ref links) just in maintenance and upgrades!

The defining classification of railroads in the United States is the Class I railroad designation, as described on Wikipedia:

In the United States, the Surface Transportation Board defines a Class I railroad as “having annual carrier operating revenues of $250 million or more in 1991 dollars”, which adjusted for inflation was $452,653,248 in 2012.[1] According to the Association of American Railroads, Class I railroads had a minimum carrier operating revenue of $346.8 million (USD) in 2006,[2] $359 million in 2007,[3] $401.4 million in 2008,[4] $378.8 million in 2009,[5] $398.7 million in 2010[6](p1) and $433.2 million in 2011.[7]

Association of American Railroads (AAR)

The AAR provides a number of rolled-up data points, total tons carried, and many other interesting pieces of data in their data center pages. This isn’t always useful if you want data to work with, but it was one of the first sources I regularly stumbled upon in my search for actual load data and such.

The AAR also posts weekly load data totals, which show some pretty awesome graphs of ton loads and related figures. There are a lot of things that can be discerned from this data too. For instance, whatever Trump’s stupid declarations about coal this and coal that, the data shows coal is dying off as a commodity for energy.

Down roughly 40% from 2007.

Anyway, amidst the STB and AAR I kept digging and digging for APIs, source data, or something I could get at with better granularity. After almost an hour of searching this evening I realized I was getting into this mess a bit deeper than need be, but hot damn, this data was fascinating. I am, after all, a data nerd, so I kept digging.

I finally stumbled on something with granularity, at least daily, with YCharts. But the issue there was that I had to sign up for a subscription before I could check anything out, so that was a non-starter, especially since I didn’t even know if the granularity would go down to what I’d like to see.

Realizing that looking at regulatory bodies and related entities wasn’t going to get down to the granularity I wanted, I got curious about the individual railroads. There are only seven Class I railroads, so why not dig into each specifically? Off I started on that path!

The first thing I stumbled on with this change in effort was this hilarious line in a post about big data: “Union Pacific’s Lynden Tennison doesn’t exactly have a problem with ‘Big Data.’ But unlike most CIOs, he wants one.” I immediately thought to myself, “Oh LOLz, do the railroads really even know the data they’re collecting? Surely they do, but maybe they’re not even sure where it’s actually at!” It’s entirely possible they’re all just sitting on this treasure trove of data and it’s being squandered off in some regulatory office of bean counters, versus actually being available to innovate on, whether for the railroads or others for that matter. Albeit, I’ll give Tennison credit: he’s probably on the right path, and Union Pacific hasn’t been doing a bad job lately.

I then found the weekly and related reports on Union Pacific’s site under their data area. But again, these were duplicates of the PDF files I’d seen in the AAR reports. Not very useful to work with.

As I realized this was an effort in vain, I also stumbled upon the reality that the railroads buy big, vastly overpriced hardware and software to do very specific things. The costs tend to be validated by operational improvements, but much of it is overpriced IBM or Oracle deals where the railroads are getting raked over the coals. In other words, IBM and Oracle are landing some sweet sales, but the railroads aren’t. It makes them all that much more impressive that they cost Americans so little actual money and provide such a massive service in return.

With that, my efforts end without success for today. But I’ve gained a few more tidbits of trivia about the STB and AAR that, oddly enough, as a fan of rail systems, I didn’t have. Thus, it was kind of successful.

Oh well, back to searching for other interesting data to work with!