Strata, Ninjas, Distributed Data Day, and Graph Day Trip Recap

This last week was a helluva set of trips, conferences to attend, topics to discuss, and projects to move forward on. In this post I’ll attempt to run the gamut of events and the graph of things that are traversing from the conference nodes onward! (See what I did there, yeah, transpiling that graph verbiage onto events and related efforts!)

Monday Flight(s)

Monday involved some flying around the country for me via United. It was supposed to be a singular flight, but hey, why not some adventures around the country for shits and giggles right! Two TIL’s (Things I Learned) that I might have known already, but repetition reinforces one’s memory.

  1. If you think you’ve bought a nonstop ticket, be sure to verify that there isn’t a stopover halfway through the trip. If there are any delays or related changes, your plane might be taken away, you’ll get shuffled off to who knows what other flight, and then you end up spending the whole day flying around instead of taking the 6 hour flight you’re supposed to have.
  2. Twitter sentiment tends to be right: it’s good policy to avoid United. They schedule their planes, logistical positions, and crews in ways that generally become problematic quickly when there’s a mere minor delay or two.

Tuesday Strata Day Zero (Train & Workshop Day)

Tuesday rolled in and Strata kicked off with a host of activities. I rolled in to scope out our booth, but overall Tuesday was a low yield activity day. Eventually I met up with the team and we rolled out for an impromptu team dinner, drinks, and further discussions. We headed off to Ninja, which, if you haven’t been there, is a worthy adventure for those brave enough. I had enough fun that I felt I should relay this info and provide a link or three so you too could go check it out.

Wednesday Strata Day One

Day one of Strata kicked off and my day involved mostly discussions with speakers, meetings, a few analyst discussions, and going around to booths to check out which technologies I needed to add to my “check it out soon” list. Here are a few of the things I noted that are now on the list.

I also worked with the video team and cut some video introductions for Strata and upcoming DataStax Developer Days Announcements. DataStax Developer Days are free events coming to a range of cities. Check them out here and sign up for whichever you’re up for attending. I’m looking forward to teaching those sessions and learning from attendees about their use cases and domains in which they’re working.

The cities you’ll find us coming to soon:

I wish I could come and teach in every city but I narrowed it down to Chicago and Dallas, so if you’re in those cities, I look forward to meeting you there! Otherwise you’ll get to meet other excellent members of the team!

This evening we went to Death Ave. The food was great, the drinks solid, and the name was simply straight up metal, albeit it’s a rather upper crust dining experience and no brutal metal was to be seen or heard. However, I’d definitely recommend the joint, especially for groups, as they have a whole room you can get if you’ve got enough people, and that improves the experience over standard dining.

Thursday Strata Day Two

I scheduled my flights oddly for this day, which in turn left me without any time to spend at Strata. But those are the issues one runs into when things are booked back to back on opposite coasts of the country! Thus, this day involved me returning to Newark via Penn Station and flying back out to San Francisco. As some of you may know, I’m a bit of a train geek, so I took a New Jersey NEC (Northeast Corridor) train headed for Trenton out of Penn back to the airport.

The train, whether you’re taking the Acela, Metroliner, NJ Transit, or whatever is rolling along to Newark that day, is the way to go in my opinion. I’ve taken the bus, which is slightly cheaper, but meh, it’s an icky east coast intercity bus. The difference in price is a buck or three or something, nothing significant, and of course you can jump in an Uber, taxi, or other transport also. Even when they can make it faster I tend to prefer the train. It’s just more comfortable, I don’t have to deal with a driver, and it’s more reliable. The turnpikes and roadways into NYC from Newark aren’t always 100%, and during rush hour don’t even expect to get to the city in a timely manner. To each their own, but for those that might not know, beware the taxi price range of $55 base plus tolls, which often will put your trip into Manhattan into the $99 or above price range. If you’re going to any of the other boroughs you’d better go ahead and take out a loan from the bank.

The trip from Newark to San Francisco was aboard United on a Boeing 757. I kid you not, regardless of airline, if you get to fly on a 757 versus a 737 or Airbus 319 or 320, it’s preferable. Especially for flights in the 2+ hour range. There is just a bit more space, the engines make less noise, the overall plane flies smoother, and the list of comforts is just a smidgen better all around. The 757 is the way to go for cross continent flights!

In San Francisco I took the standard BART route straight into the city and over to the Airbnb I was staying at in Potrero Hill, right by Farley’s on Texas Street if you know the area. I often pick the area because it’s cheap (relatively), super chill, has good food nearby, isn’t really noisy, and is super close to where the Distributed Data Summit and Graph Day conference venue is located.

The rest of Thursday included some pizza and a short bout of hacking some Go. Then a moderately early turn in around midnight to get rested for the next day.

Friday Distributed Data Summit

I took the short stroll down Texas Street. While walking I watched a few Caltrain commuter trains roll by heading into downtown San Francisco. Eventually I got to 16th, crossed the rail line, and found the walkway through campus to the conference venue. I walked toward the building entrance and there was my fellow DataStaxian Amanda. We chatted a bit and then I headed over to check out the schedule and our DataStax booth.

We had a plethora of our rather interesting and fun new DataStax t-shirts. I’ll be picking some up the week after next during our DevRel week get together. I’ll be hauling these back up to Seattle and could prospectively get some sent out to others in the US if you’re interested. Here are a few pictures of the t-shirts.

After that I joined the audience for Nate McCall’s keynote. It was good; he put together a good parallel between life and finding and starting to work with and on Cassandra. It was a good kick off, and afterward I delved into a few other talks. Overall, all were solid, and some will even have videos posted on the DataStax Academy YouTube account. Follow me @Adron or the @DataStaxAcademy account to get the tweets when they’re live, or alternatively just subscribe to the YouTube channel (honestly, that’s probably the easiest way)!

After the conference wrapped up we rolled through some pretty standard awesome hanging out DevRel DataStax style. It involved the following ordered events:

  1. Happy hour at Hawthorne in San Francisco with drink tickets, some tasty light snacks, and most excellent conversation about anything and everything on the horizon for Cassandra and also a fair bit of chatter about what we’re lining up for upcoming DataStax releases!
  2. BEER over yonder at the world famous Mikkeller Bar. This place is always pretty kick ass. Rock n’ Roll, seriously stout beer, more good convo and plotting to take over the universe, and an all around good time.
  3. Chinese Food in CHINA TOWN! So good! Some chow mein, curry, and a host of things. I’m a big fan of always taking a walk into Chinatown in San Francisco and getting some eats. It’s worth it!

Alright, after that, unlike everybody else that then walked a mere two blocks to their hotel or had taken a Lyft back, I took a solid walk all the way down to the Embarcadero. Walked along for a bit until I decided I’d walked enough and boarded a T-third line train out to Dogpatch. Then walked that last 6 or so blocks up the hill to Texas Street. Twas an excellent night and a great time with everybody!

Saturday Graph Day

Do you do graph stuff? Lately I’ve started looking into graph database tech again since I’ll be working on and putting together some reference material and code around the DataStax Graph Database that has been built onto the Cassandra distro. I’m still, honestly, kind of a newb at a lot of this, but I’m getting it figured out quickly. I do after all have a ton of things I’d like to put into and be able to query against from a graph database perspective. Lots of graph problems of course don’t directly correlate to a graph database being the solution, but it’s indeed part of the solution!

Overall, it was an easy day; the video team got a few more talks and I attended several myself. Again, same thing as previously mentioned: subscribe to the channel on YouTube or follow me on Twitter @Adron or the crew @DataStaxAcademy to get notified when the videos are released.

Summary

It has been a whirlwind week! Exhausting but worth it. New connections were made, and my own network of contacts and graph of understanding on many topics have expanded. I even got a short little time in New York among all the activity to do some studying, something I always love to break away and do. I do say though, I’m looking forward to getting back to the coding, Twitch streams, and the day to day in Seattle again. I’ve got some solid material coming together and I’m looking forward to blogging that too, and it only gets put together when I’m on the ground at home in Seattle.

Cheers, happy thrashing code!

Wrap Up for August of 2018

Thrashing Code Sessions via Twitch & Kick Ass Dis-Sys Meetup

Got some excellent coding and systems setup coming up in the next few days. There’s also a meetup on the 28th with Tim Kellogg and Alena Hall presenting on some interesting topics around distributed database data working on Kubernetes and WebAssembly of the hot temperament type. On top of that, a new surprise guest is scheduled to swing into Valhalla on my Twitch channel and help build out a cluster and the respective DHCP, DNS, and related configuration needed for a setup on the metal!

Schedule

 

Chapter 2 in My Twitch Streaming

A while back I started down the path of getting a Twitch channel started. At this point I’ve gotten a channel set up, which I’ve dubbed Thrashing Code, albeit it still just has “adronhall” all over it. I’ll get those details further refined as I work on it more.

Today I recorded a new Twitch stream about doing a Twitch stream and created an edited video of all the pieces, cameras, and angles. It could prospectively help people get started, though it’s just my experiences so far and my efforts to get everything connected right. The actual video stream recording is available, and I’ll leave it on the channel. The video I edited will also be available, and I’ll post a link here.

Tomorrow will be my first official Twitch stream at 3pm PST. If you’re interested in watching, check out my Twitch profile here and follow; it’ll ping you when I go live. This first streaming session, or episode, or whatever you want to call it, will include a couple of topics. I’ll be approaching these topics as someone just starting, so if you join, help hold me to that! Don’t let me skip ahead, and if you notice I left out something key, please join and chat at me during the process. I want to make sure I’m covering all the bases as I step through toward achieving the key objectives. Which speaking of…

Tomorrow’s Mission Objectives

  1. Create a DataStax Enterprise Cassandra Cluster in Google Cloud Platform.
  2. Create a .NET project using the latest cross-platform magical bits that will have a library for abstracting the data source(s), a console interface for using the application, and of course a test project.
  3. Configure & connect to the distributed database cluster.

Mission Stretch Objectives

  1. Start a GitHub repo to share the project with others.
  2. Set up some .github templates for feature request issues or related issues.
  3. Write up some GitHub Issue feature requests and maybe even add some extra features to the CLI for…??? no idea ??? determine 2-3 during the Twitch stream.

If you’d like to follow along, here’s what I have installed. You’re welcome to follow along with the same tooling I’ve got here or with some variation of it. Feel free to bring up tooling via chat if you’re curious about it and I’ll answer questions where and when I can.

  • Ubuntu v18.04
  • .NET Core v2.1
  • DataStax Enterprise v6

Distributed Systems: Cassandra, DataStax, a Short SITREP

SITREP = Situation Report. It’s military speak. 💂🏻‍♂️

Apache Cassandra is one of the most popular databases in use today. It has many characteristics and distinctive architectural details. In this post I’ll provide a description and some details for a number of these features and characteristics. Then, after that (i.e. toward the end, so skip there if you just want the differences), I’m going to summarize key differences with the latest release of the DataStax Enterprise 6 version of the database.

Cassandra Characteristics

Cassandra is a linearly scalable, highly available, fault tolerant, distributed database. That is, just to name a few of the most important characteristics. The Cassandra database is also cross-platform (runs on any operating system), multi-cloud (runs on and across multiple clouds), and can survive regional data center outages or even, in multi-cloud scenarios, entire cloud provider outages!

Columnar Store, Column Based, or Column Family? What? Ok, so you might have read a number of things about what Cassandra actually is. Let’s break this down. First off, a columnar, column store, or column oriented database guarantees data location for a single column in a node on disk. The column may span a bunch of or all of the rows, depending on where or how you specify partitions. However, this isn’t what the Cassandra database uses. Cassandra is a column-family database.

A column-family storage architecture makes sure the data is stored based on locality of the data at the partition level, not the column level. Rows are grouped into partitions by a partition key, then clustered (sorted) within each partition by a specified clustering column or columns. Because of this, to query Cassandra you must know the partition key in order to avoid full data scans!
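To make the partition key and clustering column idea a bit more concrete, here’s a minimal sketch in Python. It is not how Cassandra is implemented; `ToyTable` and its methods are hypothetical names I made up for illustration, and everything lives in memory on one machine. It just shows why a read by partition key touches one partition while an ad-hoc predicate has to walk every row.

```python
from bisect import insort
from collections import defaultdict


class ToyTable:
    """Hypothetical toy model: partition key -> rows sorted by clustering value."""

    def __init__(self):
        self.partitions = defaultdict(list)

    def insert(self, partition_key, clustering_value, data):
        # insort keeps each partition's rows ordered by clustering value.
        insort(self.partitions[partition_key], (clustering_value, data))

    def read_partition(self, partition_key):
        # Direct lookup: only the one partition is touched.
        return list(self.partitions.get(partition_key, []))

    def scan_where(self, predicate):
        # Ad-hoc query without a partition key: every row gets visited.
        matches, scanned = [], 0
        for key, rows in self.partitions.items():
            for clustering_value, data in rows:
                scanned += 1
                if predicate(data):
                    matches.append((key, clustering_value, data))
        return matches, scanned


table = ToyTable()
table.insert("sensor-1", 3, "21.5C")
table.insert("sensor-1", 1, "20.0C")
table.insert("sensor-2", 2, "19.0C")

rows = table.read_partition("sensor-1")  # rows come back in clustering order
matches, scanned = table.scan_where(lambda data: data == "19.0C")
```

The read by partition key goes straight to one bucket, while `scan_where` reports how many rows it had to look at, which in this toy is all of them.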

Cassandra’s partitions are guaranteed to be on the same node and, within a sorted strings table (referred to most commonly as an SSTable), in the same location within that file. That said, depending on the compaction strategy, a partition can end up split across multiple files on a disk. So really, data locality isn’t strictly guaranteed.
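Here’s a rough sketch of how a partition can end up spread across more than one SSTable until compaction merges things. It’s a toy under loud assumptions: `ToyStore` is a hypothetical name, files are just in-memory tuples, rows are ordered by raw key rather than by token, and “later file wins” stands in for the timestamp based reconciliation a real storage engine does.

```python
class ToyStore:
    """Hypothetical toy: a memtable that flushes to immutable sorted 'files'."""

    def __init__(self):
        self.memtable = {}   # partition key -> {clustering value: data}
        self.sstables = []   # each flush appends one immutable, sorted tuple

    def write(self, partition_key, clustering_value, data):
        self.memtable.setdefault(partition_key, {})[clustering_value] = data

    def flush(self):
        # Freeze the memtable, sorted by partition key then clustering value.
        sstable = tuple(
            (pk, ck, data)
            for pk in sorted(self.memtable)
            for ck, data in sorted(self.memtable[pk].items())
        )
        self.sstables.append(sstable)
        self.memtable = {}

    def sstables_containing(self, partition_key):
        # Which files hold at least one row of this partition?
        return [
            index
            for index, sstable in enumerate(self.sstables)
            if any(pk == partition_key for pk, _, _ in sstable)
        ]

    def compact(self):
        # Merge every file into one; later files overwrite earlier ones.
        merged = {}
        for sstable in self.sstables:
            for pk, ck, data in sstable:
                merged[(pk, ck)] = data
        self.sstables = [
            tuple((pk, ck, merged[(pk, ck)]) for pk, ck in sorted(merged))
        ]


store = ToyStore()
store.write("user-42", 1, "login")
store.flush()
store.write("user-42", 2, "logout")
store.flush()

before = store.sstables_containing("user-42")  # partition spans two files
store.compact()
after = store.sstables_containing("user-42")   # back to a single file
```

Two flushes leave the same partition in two files, and a read would have to consult both. After the compaction pass the partition is contiguous again in one file.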

Column-family stores are great for high throughput writes and the ability to linearly scale horizontally (ya know, getting lots and lots of nodes in the cloud!). Reads using the partition key are extremely fast since this key points to exactly where the data resides. However, the flip side, at least last I knew, is that any type of ad-hoc query that doesn’t use the partition key often leads to a full scan of the data.

A sort of historically trivial but important point is that the column-family term comes from the storage engine originally used, which was based on a key value store. The value was a set of column value tuples, which were often referred to as a family. Later this family was abstracted into partitions, and then the storage engine was matched to that abstraction. Whew, ok, so that’s a lot of knowledge being coagulated into a solid eh!  [scuse’ my odd artful language use if you visualized that!]

With all of this described, and that little history sprinkled in, the description of Cassandra in the README.asc file of the actual Cassandra GitHub repo makes just a little more sense. The file starts off with a description,

Apache Cassandra is a highly-scalable partitioned row store. Rows are organized into tables with a required primary key.

Partitioning means that Cassandra can distribute your data across multiple machines in an application-transparent matter. Cassandra will automatically repartition as machines are added and removed from the cluster.

Row store means that like relational databases, Cassandra organizes data by rows and columns. The Cassandra Query Language (CQL) is a close relative of SQL.
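The partitioning bit of that README excerpt, where data is spread across machines and shuffles around as machines are added, can be sketched with a token ring. This is a minimal sketch with assumptions called out: `ToyRing` and `token` are hypothetical names, each node gets a single token, MD5 is only a stand-in for a stable hash, and replication is ignored entirely.

```python
import hashlib
from bisect import bisect_left


def token(value):
    # Stable hash of a string to a big integer (MD5 is just a stand-in here).
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class ToyRing:
    """Hypothetical toy ring: one token per node, no replication."""

    def __init__(self, nodes):
        self.ring = sorted((token(node), node) for node in nodes)

    def add_node(self, node):
        self.ring = sorted(self.ring + [(token(node), node)])

    def owner(self, partition_key):
        # The owner is the first node token at or after the key's token,
        # wrapping around to the start of the ring if there is none.
        tokens = [node_token for node_token, _ in self.ring]
        index = bisect_left(tokens, token(partition_key)) % len(self.ring)
        return self.ring[index][1]


keys = [f"key-{i}" for i in range(1000)]
ring = ToyRing(["node-a", "node-b", "node-c"])
before = {key: ring.owner(key) for key in keys}

ring.add_node("node-d")
after = {key: ring.owner(key) for key in keys}

# Keys whose owner changed once the fourth node joined the ring.
moved = [key for key in keys if before[key] != after[key]]
```

The point of the ring layout is that any key in `moved` lands on the new node; keys owned by the other nodes’ untouched ranges stay where they were, so adding a machine doesn’t reshuffle everything.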

Now that I’ve covered the 101 level of what Cassandra is, I’ll take a look at DataStax and their respective offering.

DataStax

DataStax Enterprise at first glance might be a bit confusing since immediate questions pop up like, “Doesn’t DataStax make Cassandra?”, “Isn’t DataStax just selling support for Cassandra?”, or “Eh, wha, who is DataStax and what does this have to do with Cassandra?”. Well, I’m gonna tell ya all about where we are today and how all of these things fit.

Performance

DataStax provides a whole selection of amenities around a database, which is derived from the Cassandra Distributed Database System. The core product and these amenities are built into what we refer to as “DataStax Enterprise 6”. One of the specific differences is that the database engine itself has been modified out of band and now delivers 2x the performance of the standard Cassandra database engine. I was somewhat dubious when I joined, but after the third party benchmarks were completed that showed the difference, I grew more confident. That confidence has grown further as I’ve gotten to work with the latest version; I can tell in more than a few situations that it’s faster.

Read Repair & NodeSync

If you already use Cassandra, read repair works a certain way and that still works just fine in DataStax Enterprise 6. But one also has the option of using NodeSync which can help eliminate scripting, manual intervention, and other repair operations.
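For anyone who hasn’t run into read repair before, the general idea is that a read compares what the replicas return and pushes the newest version back to any replica that is behind. Here’s a minimal sketch of that idea only. The `read_with_repair` function is a hypothetical name, replicas are plain dicts, and this doesn’t model consistency levels, digests, or NodeSync at all.

```python
def read_with_repair(replicas, key):
    """Toy read repair: return the newest value and fix stale replicas.

    Each replica is a dict mapping key -> (timestamp, value).
    """
    responses = [(replica.get(key), replica) for replica in replicas]
    present = [response for response, _ in responses if response is not None]
    if not present:
        return None

    # Last write wins: the highest timestamp is treated as the truth.
    newest = max(present, key=lambda versioned: versioned[0])

    # Write the newest version back to any replica that is stale or missing it.
    for response, replica in responses:
        if response != newest:
            replica[key] = newest
    return newest[1]


replica_a = {"k": (1, "old")}
replica_b = {"k": (2, "new")}
replica_c = {}

value = read_with_repair([replica_a, replica_b, replica_c], "k")
```

After the read, all three toy replicas hold the same newest version of the key, which is the effect repair is after.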

Spark SQL Connectivity

There’s also an always-on SQL engine for automated uptime for apps using DataStax Enterprise Analytics. This provides a better level of service for analytics requests and end-user analytics. On a related note, DataStax Studio also has notebook support for Spark SQL now. Writing one’s Spark SQL gets a little easier with this option.

Multi-Cloud / Hybrid-Cloud

Another huge advantage of DataStax Enterprise is going multi-cloud or hybrid-cloud with DataStax Enterprise Cassandra. Between the Lifecycle Manager (LCM), OpsCenter, and related tooling, getting up and running with a cluster across a varying range of data centers, wherever they may be, is quick and easy.

Summary

I’ll be providing deeper dives into the particular technology, the specific differences, and more in the future. For now I’ll wrap up this post as I’ve got a few others coming distinctively related to distributed database systems themselves ranging from specific principles (like CAP Theorem) to operational (how to and best ways to manage) and development (patterns and practices of developing against) related topics.

Overall the solutions that DataStax offers are solid advantages if you’re stepping into any large scale data (big data or whatever one would call their plethora of data) needs. Over the coming months I’ve got a lot of material – from architectural research and guidance to tactical coding implementation work – that I’ll be blogging about and providing. I’m really looking forward to exploring these capabilities, being the developer advocate to DataStax for the community of users, and learning a thing or three million.