Lena presents “So You Want to Run Data-Intensive Systems on Kubernetes”

If you’re interested in running data-intensive systems (think Apache Cassandra, DataStax Enterprise, Kafka, Spark, TensorFlow, Elasticsearch, Redis, etc.) on Kubernetes, this is a great talk. @Lenadroid covers what options are available in Kubernetes and how architectural features around pods, jobs, stateful sets, and replica sets work together to provide distributed systems capabilities. She then delves into custom resource definitions (CRDs), operators, and Helm charts, along with upcoming and peripheral features that can help you host various complex distributed systems. I’ve included references below the video here, enjoy.

References:

Cassandra Datacenter & Racks

The previous post in this series is Distributed Database Things to Know: Consistent Hashing.

Let’s talk about the analogy of Apache Cassandra Datacenter & Racks to actual datacenters and racks. I kind of enjoy the use of the terms datacenter and racks to describe architectural elements of Cassandra. However, as time moves on, the relationship between these terms and why they’re called datacenters and racks can become obscured.

Take, for instance, a datacenter: it could be a cloud provider, an actual physical datacenter location, a zone in Azure, or a region in some other provider. What a Datacenter in Cassandra parlance actually is can vary, but the origins of why it’s called a Datacenter remain the same. What makes up a rack can also vary, but the origin of that term remains the same too.

Origins: Racks & Datacenters?

Let’s cover the actual things in this industry we call datacenters and racks first, unrelated to the Apache Cassandra terms.

Racks: The easiest way to describe a physical rack is to show pictures of datacenter racks via the ole’ Google images.

racks.png

A rack is something that is located in a datacenter, or even just someone’s garage in some odd scenarios. Ya know, if somebody wants serious hardware to work with. The rack then holds a number of servers, often of various kinds. As you can see from the images above, there’s a wide range of these racks.

Datacenter: Again, the easiest way to describe a datacenter is to just look at a bunch of pictures of datacenters, albeit you see lots of racks again. But really, that’s what a datacenter is: a building that has lots and lots of racks.

data-center.png

However, in Apache Cassandra (and likewise in DataStax Enterprise products) a datacenter and a rack do not directly correlate to a physical datacenter or rack. The idea is more of an abstraction than a hard mapping to the physical realm. It is better to think of datacenters and racks as a way to structure and organize your DataStax Enterprise or Apache Cassandra architecture. From a tree perspective of organizing your cluster, think of things in this hierarchy.

  • Cluster
    • Datacenter(s)
      • Rack(s)
        • Server(s)
          • Node (vnode)

Apache Cassandra Datacenter

An Apache Cassandra Datacenter is a group of nodes, related and configured within a cluster for replication purposes. Setting up a specific set of related nodes as a datacenter helps to reduce latency, keeps transactions from being impacted by other workloads, and so on. The replication factor can also be set up to write to multiple datacenters, providing additional flexibility in architectural design and organization. One specific element of datacenters to note is that each usually contains only one node type:

Depending on the replication factor, data can be written to multiple datacenters. Datacenters must never span physical locations. Each datacenter usually contains only one node type. The node types are:

  • Transactional: Previously referred to as a Cassandra node.
  • DSE Graph: A graph database for managing, analyzing, and searching highly-connected data.
  • DSE Analytics: Integration with Apache Spark.
  • DSE Search: Integration with Apache Solr. Previously referred to as a Solr node.
  • DSE SearchAnalytics: DSE Search queries within DSE Analytics jobs.

Apache Cassandra Racks

An Apache Cassandra Rack is a grouped set of servers. Cassandra’s architecture uses racks so that replicas of the same data are not stored redundantly inside a single rack; spreading replicas across different racks protects the data in case one rack goes down. Within a datacenter there can be multiple racks with multiple servers, as the hierarchy shown above would dictate.
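To make the rack idea a bit more concrete, here is a minimal Python sketch of rack-aware placement. This is not Cassandra’s actual placement code; the `Node` type, the `place_replicas` helper, the node and rack names, and the ring order are all hypothetical and only illustrate the “spread replicas across racks first” idea.

```python
from __future__ import annotations

from typing import NamedTuple


class Node(NamedTuple):
    name: str
    rack: str


def place_replicas(ring: list[Node], start: int, replication_factor: int) -> list[Node]:
    """Walk the ring from `start`, preferring nodes on racks not used yet."""
    ordered = [ring[(start + i) % len(ring)] for i in range(len(ring))]
    chosen: list[Node] = []
    used_racks: set[str] = set()

    # First pass: take at most one node per rack.
    for node in ordered:
        if len(chosen) == replication_factor:
            break
        if node.rack not in used_racks:
            chosen.append(node)
            used_racks.add(node.rack)

    # Second pass: with fewer racks than replicas, fill up with remaining nodes.
    for node in ordered:
        if len(chosen) == replication_factor:
            break
        if node not in chosen:
            chosen.append(node)

    return chosen


# Hypothetical six-node datacenter with three racks.
ring = [
    Node("node1", "rack1"),
    Node("node2", "rack1"),
    Node("node3", "rack2"),
    Node("node4", "rack2"),
    Node("node5", "rack3"),
    Node("node6", "rack3"),
]

replicas = place_replicas(ring, start=0, replication_factor=3)
print([f"{n.name}@{n.rack}" for n in replicas])
# ['node1@rack1', 'node3@rack2', 'node5@rack3']
```

With three replicas and three racks, each replica lands on its own rack, so losing any one rack still leaves two copies available.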

To determine where data goes within a rack or set of racks, Apache Cassandra uses what is referred to as a snitch. A snitch determines which rack and datacenter a particular node belongs to and, by virtue of that, informs where the replicas of data will end up. The replication strategy is informed by the snitch, and there are numerous kinds of snitches. Some examples include:

  • SimpleSnitch – this snitch treats order as proximity. It is primarily used in single-datacenter deployments.
  • Dynamic Snitching – the dynamic snitch monitors read latencies to avoid reading from hosts that have slowed down.
  • RackInferringSnitch – proximity is determined by rack and datacenter, which are assumed to correspond to the 3rd and 2nd octet of each node’s IP address, respectively. This particular snitch is often used as an example for writing a custom snitch class since it isn’t particularly useful unless it happens to match one’s deployment conventions.
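The octet convention described for RackInferringSnitch can be sketched in a few lines of Python. This is only an illustration of the convention, not the snitch’s real implementation; `infer_location`, `same_rack`, and the sample IP addresses are made up for the example.

```python
from __future__ import annotations

import ipaddress


def infer_location(ip: str) -> tuple[str, str]:
    """Return (datacenter, rack) taken from the 2nd and 3rd octets of an IPv4 address."""
    octets = ipaddress.IPv4Address(ip).packed  # four bytes, one per octet
    return str(octets[1]), str(octets[2])


def same_rack(a: str, b: str) -> bool:
    """Two nodes share a rack only if both datacenter and rack octets match."""
    return infer_location(a) == infer_location(b)


# Hypothetical node addresses.
for ip in ["10.20.30.1", "10.20.30.2", "10.20.31.1", "10.21.30.1"]:
    datacenter, rack = infer_location(ip)
    print(f"{ip}: datacenter={datacenter} rack={rack}")
```

Here `10.20.30.1` and `10.20.30.2` end up in the same datacenter and rack, `10.20.31.1` is in the same datacenter but a different rack, and `10.21.30.1` is in a different datacenter altogether. It also shows why this snitch only helps if your IP addressing actually follows that layout.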

In the future I’ll outline a few more snitches, how some of them work with more specific detail, and I’ll get into a whole selection of other topics. Be sure to subscribe to the blog, the ole’ RSS feed works great too, and follow @CompositeCode for blog updates. For discourse and hot takes follow me @Adron.

Distributed Database Things to Know Series

  1. Consistent Hashing
  2. Apache Cassandra Datacenter & Racks (this post)

 

DataStax Developer Days

Over the last week I had the privilege and adventure of coming out to Chicago and Dallas to teach about the operations and security capabilities of DataStax Enterprise. More about that later in this post. First I’ll elaborate on and answer the following:

  • What is DataStax Developer Day? Why would you want to attend?
  • Where have DataStax Developer Day events been held so far, and where are future events going to be held?
  • Possibilities for future events near a city you live in.

What is DataStax Developer Day?

The developer day event at DataStax is focused on DataStax Enterprise, the product built on Apache Cassandra. However, I have to add the very important note that this isn’t merely a product pitch; you can and will learn about distributed databases and systems in a general sense too. We talk about a number of the core principles behind distributed systems, such as the pivotally important consistent hash ring, datacenters and racks, gossip, replication, snitches, and more. We feel it’s important to include enough theory alongside the configuration and features covered that you understand the who, what, where, why, and how behind them too.

The day’s course material starts from the assumption that one has not worked with or played with Apache Cassandra or DataStax Enterprise. However, we have a number of courses throughout the day that delve into more specific details and advanced topics. There are three specific tracks:

  1. Cassandra Track – this track consists of three workshops: Core Cassandra, Cassandra Data Modeling, and Cassandra Application Development. [more details]
  2. DSE Track – this track consists of three workshops: DataStax Enterprise Search, DataStax Enterprise Analytics, and DataStax Enterprise Graph. [more details]
  3. Bonus Content – This track has two workshops: DataStax Enterprise Overview and DataStax Enterprise Operations and Security.  [more details]

Why would you want to attend?

  • One huge rad awesome reason is that the developer day events are FREE. But really, nothing is ever free right? You’d want to take a day away from the office to join us, so there’s that.
  • You also might want to even stay a little later after the event as we always have a solidly enjoyable happy hour so we can all extend conversations into the evening and talk shop. After all, working with distributed databases, managing data, and all that jazz is honestly pretty enjoyable when you’ve got awesome systems like this to work with, so an extended conversation into the evening is more than worth it!
  • You’ll get a firm basis of knowledge and skills around the use and management of Apache Cassandra and DataStax Enterprise, and more than a few ideas about how they can extend your system’s capabilities.
  • You’ll get a chance to go beyond merely the distributed database system of Apache Cassandra itself and delve into graph, what it is and how it works, analytics, and search too. All workshops take a look at the architecture, uses, and what these capabilities will provide your systems.
  • You’ll also have one-on-one time with DataStax engineers and other technical members of the team to ask questions, talk about architecture and solutions that you may be working on, or generally discuss any number of graph, analytics, search, or distributed systems related questions.

Where have DataStax Developer Day events been held so far, and where are future events going to be held? So far we’ve held events in New York City, Washington DC, Chicago, and Dallas. We’ve got two more events scheduled, with one in London, England and one in Paris, France.

Future events? With a number of events completed and a few on the calendar, we’re interested in hearing about future possible locations for events. Where are you located and where might an event of this sort be useful for the community? I can think of a number of cities, but ranking them to know where to get something scheduled next is difficult, which is why the team is looking for input. So ping me via @Adron, email, or just send me a quick message from here.

Riak is… A Whole Big List of Things

What is Riak? Who builds it? Who maintains it? Can I download it? How does it work? What are the features?

Here’s the start of answers to these questions and more.

First, the basic high level description:

“Riak is an open source, highly scalable, fault-tolerant distributed database.”

That’s the first line you’ll read when checking out the product via the Basho product link. It provides good information, but here I’m going to add more to the definition without the need to dig around yourself. Maybe I can save you some time & provide some links directly to solid information in the docs. Kind of a “Cliff Notes” of Riak. Let’s take this feature by feature which will in turn get us to a definitive definition of what exactly Riak is.

Riak is Open Source.

Riak is built and contributed to by the community, with Basho being the steward and an active member that extends, builds and provides support for additional products. The avenues to reach the Riak Open Source Community members are pretty straightforward, following known avenues of communication. Hit us up on the email list, and especially feel free to contribute and ask questions via the GitHub Basho organization. There is also the Basho Riak Blog and the weekly recap, or you can jump into the IRC chat room #riak on freenode. Oh, and there’s a Twitter feed, @basho.

So what exactly does this get you, when you become a user or contributor of Riak? The entire community is behind you, will help you get started using Riak and provide help whenever you run into problems. If you want SLAs or 24 hour support Basho can provide this for you. But for bugs, issues, queries, searching and all sorts of other related development questions there is the community. An open source community like this is passionate, which means you’ll have support like no closed source company will ever provide you, and absolutely no closed source product’s community will provide you. We’re talking about a different level of interest, passion and levels of personal involvement.

Riak is a key value based database store.

Riak is a key value store. What exactly is a key value store? It’s pretty simple, and you’re probably already familiar with the idea. A key value pair is made up of two pieces of data: the first, the key, is the identifier for the second, the value. This gives a system or developer using key value storage a schema-less way of working with data.
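If it helps to see that in code, here is a tiny Python sketch of the key value idea. It is not Riak’s client API; the `store` dictionary and the `put` and `get` helpers are hypothetical stand-ins that just show a key identifying an opaque value, with no shared schema between values.

```python
from __future__ import annotations

import json

# The "database" is just a mapping from a key to opaque bytes.
store: dict[str, bytes] = {}


def put(key: str, value: object) -> None:
    """Serialize any JSON-friendly value and file it under the key."""
    store[key] = json.dumps(value).encode("utf-8")


def get(key: str) -> object:
    """Look the key up and hand back the deserialized value."""
    return json.loads(store[key].decode("utf-8"))


# Two values with completely different shapes live side by side.
put("user:42", {"name": "Ada", "languages": ["Erlang", "Python"]})
put("counter:visits", 1024)

print(get("user:42"))
print(get("counter:visits"))
```

The store never inspects what is inside a value, which is exactly what makes it schema-less: the application decides what each value means.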

Riak is designed for highly distributed environments.

This type of distributed isn’t the “we put one database over here and one database over there and you gotta figure out how they work together” type of distribution. So this isn’t some of that oddball pretend stuff Oracle keeps foisting on people. This is honest to goodness distribution, of the sort where when one node goes down you don’t blink, you don’t stop eating dinner, you don’t sweat it. You just continue onward with life knowing full well that you’ll just spool up another node when you need to.

Riak is master-less, with no single point of failure.

This is one of those self-explanatory features. But what does a master-less system provide us? One thing is no single point of failure. Because all nodes can act autonomously to work around the loss of one or more nodes, it also adds to the high availability of the system.

Riak is fault tolerant, like a disk drive you wish was real.

Ever have a backup disk drive? What? You don’t have one of those? Ugh. Ok, so imagine you had one. Backup drives have an unfortunately, oddly high failure rate, so if you do backups like good practices dictate, eventually you’ll end up with some dead drives.

RAID, both software and hardware, is built specifically to deal with this type of failure. A distributed system like Riak bumps the level of abstraction above software or hardware RAID, enabling an even greater level of fault tolerance. That’s not to dismiss the relevance of RAID, but with a multi-node system like Riak you can easily remove nodes and swap them out as needed, keeping costs down by using simple drives in simple machines. If you want to, you could indeed get higher I/O machines and faster drives, but it isn’t necessary to ensure fault tolerance in a Riak database system.

Riak scales, with hot swappable nodes enabling zero downtime.

The ability to commit hot swappable changes while in the midst of operating starts at a very low level for Riak. Erlang, the language used to build Riak, has the ability to change pieces of an application in real time built into the precepts of the language. This provides, at the core, the inherent capability to change out systems and, by way of architectural design, the ability for nodes in Riak to be changed out simply by removing them from the cluster ring. Once that is done it is just as simple to add another node or nodes back into the cluster ring, enabling a number of additional practices around upgrades, hot swaps for failures, or even version changes.
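To get a feel for why pulling a node out of a ring and putting one back is cheap, here is a small Python sketch of a generic consistent hash ring. Riak’s real ring is more sophisticated than this, so treat the `Ring` class, the node names, and the key names as hypothetical; the point is only that removing a node reassigns the keys that node owned and leaves everything else where it was.

```python
from __future__ import annotations

import bisect
import hashlib


def ring_hash(value: str) -> int:
    """Map a string to a position on the ring."""
    return int.from_bytes(hashlib.sha1(value.encode("utf-8")).digest()[:8], "big")


class Ring:
    def __init__(self, nodes: tuple[str, ...] = ()) -> None:
        self._points: list[tuple[int, str]] = []  # sorted (position, node) pairs
        for node in nodes:
            self.add(node)

    def add(self, node: str) -> None:
        bisect.insort(self._points, (ring_hash(node), node))

    def remove(self, node: str) -> None:
        self._points.remove((ring_hash(node), node))

    def owner(self, key: str) -> str:
        """The owner is the first node at or after the key's position, wrapping around."""
        if not self._points:
            raise LookupError("ring has no nodes")
        positions = [position for position, _ in self._points]
        index = bisect.bisect_right(positions, ring_hash(key))
        return self._points[index % len(self._points)][1]


ring = Ring(("node-a", "node-b", "node-c", "node-d"))
keys = [f"key-{i}" for i in range(1000)]

before = {key: ring.owner(key) for key in keys}
ring.remove("node-b")  # take a node out of the ring
after = {key: ring.owner(key) for key in keys}
moved = [key for key in keys if before[key] != after[key]]
print(f"{len(moved)} of {len(keys)} keys changed owner")

ring.add("node-b")  # put it back
restored = {key: ring.owner(key) for key in keys}
print("ownership restored:", restored == before)
```

Every key that moved had belonged to the removed node, and adding the node back restores the original ownership, which is the property that makes swapping nodes in and out of a ring practical.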

Riak can be used as a building block for distributed (aka cloud) infrastructure.

The concepts and contractual components that Riak Database is built on are available for use via the Riak Core Project. If you’re looking into starting a project around distributed systems this is a great place to get started. Also be sure to do a general web search (read: Google) for “riak core” and you’ll find lots of material around the project, and projects people have started with it as a base. I’m currently in the process of putting together one of these projects myself.

Riak is eventually consistent.

The term eventually consistent is becoming more and more commonplace. Riak is one of the many distributed systems that use the concept of eventual consistency. The idea is that even though all nodes may not immediately receive a new or updated piece of data, they eventually will receive that update and be synchronized with the rest of the cluster ring. This goes back to the equality of nodes and the removal of the master concept, which provides availability and other capabilities with some trade-off in the synchronization of data through eventual consistency.
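A toy Python sketch makes the “eventually” part easier to picture. This is not how Riak resolves conflicts or synchronizes nodes; the `Replica` class, the integer versions, and the simple “newest version wins” rule are assumptions made purely for illustration.

```python
from __future__ import annotations


class Replica:
    def __init__(self, name: str) -> None:
        self.name = name
        self.data: dict[str, tuple[int, str]] = {}  # key -> (version, value)

    def write(self, key: str, value: str, version: int) -> None:
        """Accept the write only if it is newer than what this replica holds."""
        current = self.data.get(key)
        if current is None or version > current[0]:
            self.data[key] = (version, value)

    def read(self, key: str) -> str | None:
        entry = self.data.get(key)
        return None if entry is None else entry[1]

    def sync_from(self, other: Replica) -> None:
        """Pull everything the other replica knows; newer versions win."""
        for key, (version, value) in other.data.items():
            self.write(key, value, version)


replicas = [Replica("a"), Replica("b"), Replica("c")]

# Two writes land on different replicas; nobody has the full picture yet.
replicas[0].write("greeting", "hello", version=1)
replicas[2].write("greeting", "hola", version=2)
print([r.read("greeting") for r in replicas])
# ['hello', None, 'hola']

# One round of every replica pulling from every other replica.
for target in replicas:
    for source in replicas:
        if source is not target:
            target.sync_from(source)

print([r.read("greeting") for r in replicas])
# ['hola', 'hola', 'hola']
```

Between the two prints the replicas disagree, and a read could return any of the three answers depending on which replica served it. After the sync round they all agree, and no master was ever needed to coordinate it.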

In Summary

That’s round one for the many features of Riak. I’ll be adding more in the future, but for now this is a good starting point in knowing about and knowing what Riak is, what it can be used for, and how it might help you extend, maintain or invent the next great piece of technology.

Distributed Coding Prefunc: Up and Running With Erlang

Before diving into architecture, coding, descriptions and other things related to distributed computing over the coming months, it helps to become familiar with a language like Erlang. I’m going to dive immediately into getting Erlang up and running before any theory, description or otherwise, so following the most direct installation…

Installing Erlang

This is easy on OS X, provided of course that you have Xcode and the Developer Tools installed.

[sourcecode language="bash"]
curl -O http://erlang.org/download/otp_src_R15B01.tar.gz
tar zxvf otp_src_R15B01.tar.gz
cd otp_src_R15B01
[/sourcecode]

Then configure and compile. With the latest Xcode and tools you can use the LLVM compiler.

[sourcecode language="bash"]
CFLAGS=-O0 ./configure --disable-hipe --enable-smp-support --enable-threads \
--enable-kernel-poll --enable-darwin-64bit
[/sourcecode]
[sourcecode language="bash"]
make
sudo make install
[/sourcecode]

For more information on building Erlang you can also check out the Erlang Organization Site. It’s that simple, so now that it is up and running you should be able to check that all is right with the install by checking the version.

[sourcecode language="bash"]
erl -version
[/sourcecode]

In my follow up blog entry I’m going to take you through the Rebar Riak Core Templates. This will get you up and running with an Erlang Application. This application can then be used either as a stand alone Erlang App for whatever you want to build with it or as a great starting point to build against Riak.