Distributed Systems: Cassandra, DataStax, a Short SITREP

SITREP = Situation Report. It’s military speak. 💂🏻‍♂️

Apache Cassandra is one of the most popular databases in use today. It has many distinctive characteristics and architectural details. In this post I’ll provide a description and some details for a number of these features and characteristics. Then, toward the end (skip there if you just want the differences), I’m going to summarize the key differences in the latest release of the DataStax Enterprise 6 version of the database.

Cassandra Characteristics

Cassandra is a linearly scalable, highly available, fault-tolerant, distributed database, just to name a few of its most important characteristics. The Cassandra database is also cross-platform (runs on any operating system), multi-cloud (runs on and across multiple clouds), and can survive regional data center outages or, in multi-cloud scenarios, even entire cloud provider outages!

Columnar Store, Column Based, or Column Family? What? Ok, so you might have read a number of things about what Cassandra actually is. Let’s break this down. First off, a columnar (or column store, or column-oriented) database guarantees data locality for a single column on a node’s disk. The column may span many or all of the rows, depending on where or how you specify partitions. However, this isn’t what the Cassandra database uses. Cassandra is a column-family database.

A column-family storage architecture stores data based on locality at the partition level, not the column level. Cassandra groups rows and columns into partitions split by a partition key, then clusters them together by a specified clustering column or columns. Because of this, to query Cassandra you must know the partition key in order to avoid full data scans!
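To make that concrete, here’s a toy sketch in Python (my own illustration, not Cassandra internals; the real thing uses Murmur3 tokens and virtual nodes) of how the partition key alone determines which node owns a row:

```python
import hashlib

# Toy token ring: each node owns a slice of the hash space.
# (Illustrative only -- node names and the md5 hash are assumptions.)
NODES = ["node-a", "node-b", "node-c"]

def token(partition_key: str) -> int:
    # Hash the partition key into the token space.
    return int(hashlib.md5(partition_key.encode()).hexdigest(), 16)

def node_for(partition_key: str) -> str:
    # Every row sharing this partition key maps to the same node,
    # so a query that supplies the key goes straight there.
    return NODES[token(partition_key) % len(NODES)]

# Rows for the same partition key always land together.
assert node_for("alice") == node_for("alice")
```

Without the partition key, there’s no token to compute, so there’s no way to know which node to ask; that’s exactly why ad-hoc queries fall back to scanning.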

Cassandra guarantees that a partition lives on the same node, in the same location within a sorted string table (referred to most commonly as an SSTable). Though, depending on the compaction strategy, a partition can end up split across multiple files on disk. So really, data locality isn’t strictly guaranteed.
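For a rough mental model of what one partition looks like inside an SSTable, here’s an illustrative sketch (assumed names and values, not the real on-disk format): rows within the partition are kept sorted by the clustering column, so a range read within the partition is a sequential slice rather than a scan.

```python
# One partition inside a toy "SSTable" (illustrative only).
sstable_partition = {
    "sensor-42": [           # partition key
        (1001, 20.1),        # (clustering column: timestamp, value)
        (1002, 20.4),        # rows stay sorted by the clustering column
        (1003, 19.9),
    ]
}

def read_range(pkey: str, start: int, end: int):
    # Rows are already ordered, so a window is just a contiguous slice.
    return [(t, v) for t, v in sstable_partition[pkey] if start <= t <= end]

assert read_range("sensor-42", 1002, 1003) == [(1002, 20.4), (1003, 19.9)]
```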

Column-family stores are great for high-throughput writes and the ability to scale linearly, horizontally (ya know, getting lots and lots of nodes in the cloud!). Reads using the partition key are extremely fast since this key points to exactly where the data resides. However, any type of ad-hoc query (at least last I knew) leads to a full scan of the data.
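As a toy contrast (again, just an illustration of the access pattern, not actual Cassandra code), a partition-key read is a direct lookup while an ad-hoc filter has to touch every partition:

```python
# Assumed sample data: (table, partition key) -> row.
data = {
    ("user", "alice"): {"city": "Seattle"},
    ("user", "bob"): {"city": "Portland"},
}

def read_by_key(key):
    # Keyed read: we know exactly where the row lives.
    return data[key]

def ad_hoc_query(city):
    # No partition key given: every partition gets scanned.
    return [k for k, row in data.items() if row["city"] == city]

assert read_by_key(("user", "alice")) == {"city": "Seattle"}
assert ad_hoc_query("Portland") == [("user", "bob")]
```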

A sort of historically trivial but important point: the column-family term comes from the storage engine originally used, which was based on a key-value store. The value was a set of column-value tuples, which were often referenced as a family. Later this family was abstracted into partitions, and then the storage engine was matched to that abstraction. Whew, ok, so that’s a lot of knowledge being coagulated into a solid, eh!  [scuse’ my odd artful language use if you visualized that!]

With all of this described, and that little history sprinkled in, the description of Cassandra in the README.asc file of the actual Cassandra GitHub repo makes just a little more sense. The file starts off with a description,

Apache Cassandra is a highly-scalable partitioned row store. Rows are organized into tables with a required primary key.

Partitioning means that Cassandra can distribute your data across multiple machines in an application-transparent matter. Cassandra will automatically repartition as machines are added and removed from the cluster.

Row store means that like relational databases, Cassandra organizes data by rows and columns. The Cassandra Query Language (CQL) is a close relative of SQL.

Now that I’ve covered the 101 level of what Cassandra is, I’ll take a look at DataStax and their respective offering.

DataStax

DataStax Enterprise at first glance might be a bit confusing, since immediate questions pop up like, “Doesn’t DataStax make Cassandra?”, “Isn’t DataStax just selling support for Cassandra?”, or “Eh, wha, who is DataStax and what does this have to do with Cassandra?”. Well, I’m gonna tell ya all about where we are today and how all of these things fit.

Performance

DataStax provides a whole selection of amenities around a database derived from the Cassandra distributed database system. The core product and these amenities are built into what we refer to as “DataStax Enterprise 6”. One of the specific differences is that the database engine itself has been modified out of band and now delivers 2x the performance of the standard Cassandra database engine. I was somewhat dubious when I joined, but after the third-party benchmarks were completed and showed the difference I grew more confident. My confidence in this speed increase has grown as I’ve worked with the latest version; in more than a few situations I can tell it’s faster.

Read Repair & NodeSync

If you already use Cassandra, read repair works a certain way and that still works just fine in DataStax Enterprise 6. But one also has the option of using NodeSync which can help eliminate scripting, manual intervention, and other repair operations.

Spark SQL Connectivity

There’s also an always-on SQL engine for automated uptime for apps using DataStax Enterprise Analytics. This provides better handling of analytics requests and end-user analytics. On a related note, DataStax Studio also has notebook support for Spark SQL now. Writing one’s Spark SQL gets a little easier with this option.

Multi-Cloud / Hybrid-Cloud

Another huge advantage of DataStax Enterprise is going multi-cloud or hybrid-cloud with DataStax Enterprise Cassandra. Between the Lifecycle Manager (LCM), OpsCenter, and related tooling, getting up and running with a cluster across a varying range of data centers, wherever they may be, is quick and easy.

Summary

I’ll be providing deeper dives into the particular technology, the specific differences, and more in the future. For now I’ll wrap up this post, as I’ve got a few others coming that are distinctly related to distributed database systems themselves, ranging from specific principles (like the CAP theorem) to operational (how to and best ways to manage) and development (patterns and practices of developing against) topics.

Overall, the solutions that DataStax offers are solid advantages if you’re stepping into any large-scale data (big data or whatever one would call their plethora of data) needs. Over the coming months I’ve got a lot of material, from architectural research and guidance to tactical coding implementation work, that I’ll be blogging about and providing. I’m really looking forward to exploring these capabilities, being the developer advocate at DataStax for the community of users, and learning a thing or three million.

The Conversations and Samples of Multi-Cloud

Over the last few weeks I’ve been putting together multi-cloud conversations and material related to multi-cloud implementation and the operational situations that exist today. I took a quick look at some of my repos on GitHub and realized I’d put together a multi-cloud Node.js sample app some time ago and should update it. I’ll get to that, hopefully, but I also stumbled into some tweets and other material and wanted to collect a few of them together here.

Conversations on Multi-cloud

  • Mitchell Hashimoto of HashiCorp posted a well written comment/article on what he’s been seeing (for some time) on Reddit.
  • A well-worded tweet… lots of talk per Google’s somewhat underlying push for GKE on-prem. Which means more clouds, more zones, and more multi-cloud options.

  • Distributed Data Show Conversations

Leave a comment, tweet at me (@adron), and let me know your thoughts or what you’re working on re: multi-cloud. I’m curious to learn about and hear more war stories.

Oh, exFAT Doesn’t Work on Linux

But to the rescue comes the search engine. I found some material on the matter and, as I’ve learned frequently, don’t count out Linux when it comes to supporting nearly everything on Earth. Sure enough, there’s support for exFAT (really, why wouldn’t there be?).

Check out this repo: https://github.com/relan/exfat

There’s of course the git clone and make and make install path, or there’s also the apt install path.

git clone https://github.com/relan/exfat.git
cd exfat
autoreconf --install
./configure
make

Then run make install (with sudo if needed).

sudo make install

Of course, as with most things on Linux, no reboot is needed; just use it now to mount a drive.

sudo mount.exfat-fuse /dev/spec /mnt/exfat

To note, if you’re using Ubuntu 18.04 the support will just be available; re-click on the drive or memory device you’ve just attached and it will now appear. Pretty sweet. If you want to use apt, just run this command.

sudo apt install exfat-fuse

That’s it. Now you’ve got exFAT support on Linux.

The Method I Use Setting Up a Dev Machine for Node.js

UPDATED: April 4th, 2019, March 17th, 2022, and again on November 18th, 2024.

It seems every few months the setup of whatever tech stack gets tweaked a bit. This is a collection of information I used to set up my box recently. First off, for the development box I always use nvm, as it is routine to need a different version of Node.js installed for various repositories and such. The best article I’ve found that is up to date for Ubuntu 18.04 is Digital Ocean’s article (kind of typical for them to have the best article, as their blog is exceptionally good). In it, I’ve noticed the specific installation of nvm has changed since I last worked with it many months ago.

Continue reading “The Method I Use Setting Up a Dev Machine for Node.js”

Quick San Francisco Trip, Multi-Cloud Conversations, and A Network of Excellent People to Follow

A Quick Trip

Boarded the 17x King County Metro bus at 6:11am yesterday. The trip was entirely uneventful until we arrived in Belltown in downtown Seattle. A guy with only his underwear on boarded the bus. Yup, you got that right, just like those underwear nightmares people have! That guy was living this situation for whatever unknown reasons. He didn’t make too much fuss though and eventually settled into one of the back seats on our packed bus.

Downtown, as I got off, I was pleasantly surprised that Victrola was open in the newly leased Amazon building near Pine & Pike on 3rd. I remember it being built and opening, but for some reason had completely forgotten it was there until today. Having a Victrola at this location sets it up perfectly for the mornings one needs to get to the airport. It’s located in a way that you can get off any connecting bus, grab an espresso at Victrola, and then enter the downtown tunnel to board LINK to the airport. Previously, the only option was really to wait until you got to the airport and buy some of the lackluster espresso options they have at SEATAC.

Once I boarded the LINK I got into some notes for the upcoming Distributed Data Show that I would be recording while in San Francisco. But also got into a little review of some Node.js JavaScript code that I’d pulled down previously. I’ve been hard pressed to get into the code base and add some updates, logging, and minor logic for Killrvideo. Hopefully today, or maybe even this evening while I’m on the rest of this adventure.

All the things in between then occurred: the flight to San Jose, the oddball taxi ride to the Santa Clara DataStax HQ, then the Lyft ride to Caltrain to ride into downtown San Francisco to my hotel. ’Twas an excellently chill ride into the city and even a little productive. A good night of sleep and then up and at ’em. Headed to the nearest Blue Bottle, which was on the way to my next effort of the trip. Blue Bottle was solid as always, and into the studios I went.

Multi-Cloud Conversations

At this point of this quick trip I’ve just finished shooting a few new episodes of the Distributed Data Show (subscribe) with Jeff Carpenter (@jscarp) and Amanda Moran (@amandadatastax). Just recently I also got to shoot an episode with Patrick McFadin (@patrickmcfadin) too. We’ve been enjoying a number of conversations around multi-cloud considerations, what going multi-cloud means, and even what exactly multi-cloud really means. It’s been fun, and now I’m putting together some additional blog posts and related material to follow the posts with more detail. So keep an eye out for the Distributed Data Show. You can also take the lazy route and subscribe to be notified when new episodes are released, or you could even set your calendar for every Tuesday, because we follow an old-school, traditional scheduling approach to episode releases.

I’ve included a recap on a few of the recent episodes below, which you may want to check out just to get an idea of what’s to come. Great as a listen while commuting, something to put on while you’re relaxing a bit but want to think about or listen to a conversation on a topic, or just curious what’s up!

In this last episode DataStax Vanguard Lead Chelsea Navo (@mazdagirluk) joined in to talk about how the team helps enterprises meet the challenges posed by disrupting technology and competitors. Some of the highlights of the conversation include:

1:57 Defining what exactly enterprise transformation is.

4:00 Trends including retooling batch workloads around real-time requirements, handling larger data sets, and the uber-popular use of streaming data we see today.

7:37 Techniques on getting teams trained up on cutting-edge technology.

9:46 There’s an argument for “skateboard solutions”! I now want to write up a whole practice around this!

16:27 Data modeling. Challenges of. Challenges around. Data modeling, it’s a good topic of exploration.

25:56 The data layer is still typically the hardest part of your application! (The assertion is made; agree, disagree, or ??)

Continue reading “Quick San Francisco Trip, Multi-Cloud Conversations, and A Network of Excellent People to Follow”