Got some excellent coding and systems setup coming up in the next few days. Also a meetup on the 28th with Tim Kellogg and Alena Hall presenting on some interesting topics around distributed database data working on Kubernetes and WebAssembly of the hot temperament type. A new surprise guest addition on my Twitch channel that is scheduled to swing into Valhalla and help build out a cluster and respective needed DHCP, DNS, and related configuration for a setup on the metal!
August 23rd 10:00am DataStax Academy Twitch Stream – This session will include David and I as we dig into the Killrvideo Reference Application and discuss the pull request with the SSL additions to the code base. We’ll talk through the changes, why they’ve been made, and what advantage they provide for deployment into a multi-cloud environment.
August 24th, 3:33pm – Go Coding on Colligere – In this session I’ve finally gotten to the point where I can start filling out some of the CLI functionality. I’ll be adding over the next few days items to the issues list on Github too. Feel free to add content, or related items to the issue log too!
One of the features that is often available in Riak Client software (including the CorrguatedIron .NET Client, the riak-js client and others) is the ability to send requests to the Riak Cluster through a round robin style approach. What this means is each IP, of each node within the Riak Cluster is entered into a config file for the client. The client then goes through that list to send off requests to read, write or delete data in the database.
The client being responsible and knowledgeable about the data tier of the application in an architecture is an immediate red flag! The concept around SoC (Separation of Concerns) dictates that
“SoC is a principle for separating a computer program into distinct sections, such that each section addresses a separate concern.“
Having the client provide a network tier layer to round robin communication with the database leaves us in a scenario that should be separated into individual concerns. Below is some basic guidance on eliminating this SoC issue.
Client ONLY sends and receives communication: The client, especially in the situation with a distributed system like Riak should only be dealing with sending and receiving information from the cluster or a facade that provides an interface for that cluster.
Another layer should deal with the network communication and division of nodes and node communication. Ideally, in the case or Riak, and most distributed systems this should be dealt with at the network device layer (router).
The network device (router) layer would ideally be able to have (through software likely) a way to automate the failure, inclusion or exclusion of nodes with the cluster system. If a node goes down, the network device should handle the immediate cessation of communication with that node from all clients, routing the communication accordingly to an active node.
The node itself needs to maintain a continual information state available to the network. Ideally the network state would identify any addition or removal of a node and if possible the immediate failure of a node. Of course it isn’t always possible to be informed of a failure, but the first line of defense should start within the cluster itself among the nodes.
Having the client handle all of these parts of the functional architecture leads to a number of problems, not merely that the guidance of the SoC concept is broken. With the client attempting to track and be aware of the individual nodes in the cluster, it sets the client with a huge responsibility.
Take for instance the riak-js client. If a node goes down the client will need to be aware of which node has gone down. For a few seconds (yes, you have to wait entire seconds at this level) the node will be gone and the client won’t know it is down. The client would just have to reasonably wait. When the communication times out, the client would then have to have the responsibility of marking that particular node as down. At this point the client must track which node it is in some type of data repository local to the client. The client must also set a time or some way to identify when the node comes back up. Several questions start to come up such as;
Does the client do an arbitrary test to determine when the node comes back up?
When the node comes back up is it considered alive or damaged?
How would the client manage the IP (or identifier) of the node that has gone down?
How long would the client store that the node is down?
The list of questions can get long pretty quick, thus the bad karma of not following a good practice around separating your concerns appropriately! One has to be careful, a god class might be right around the corner otherwise! That’s it for this quick journey into some distributed database usage guidelines. Until next, happy data sciencing. 😉
PaaS Systems aren’t always effectively distributed. Heroku has fallen over every time east-1 has gone down at AWS. Not that I’m saying they’ve done bad, just pointing that out. With Cloud Foundry, there’s several key SPOFs (Single Points of Failure), and with all PaaS Systems the data tier is often the neglected pairing of the system. I’ve been wanting to write about this for a few months now and Deploycon has lit a fire for me to do just that.
Deploycon – “Platform Services and Developer Expectations” **
I’m on a panel at Deploycon titled “Platform Services and Developer Expectations” and this leads right back around to that. This SPOF issue is concerning to me as PaaS Providers talk up the offerings more and more with little light actually shone on this issue. In some ways each is moving away form their respective SPOFs, but overall they’re all pretty prevalent throughout. For security, each has a non-distributed database, which technically needs backed up still – no clear replication or other mechanisms setup to ensure data integrity in a failure situation. Of course, the huge saving grace with a PaaS, is that if the overall system goes down or a SPOF blows up, all the existing deployed applications will generally continue to run. Unless of course the routing and networking are also SPOF. This is the largest glaring concern with PaaS Systems that I see today.
One of the other things about PaaS that has always led to a ton of questions is “what about my PostGresql/mysql/Riak/mongodb/database thing and how do I do X, Y, Z with it to ensure scalability in my PaaS.” In almost every case it ends with a simple and unfortunate answer, “…when it comes to data, a PaaS doesn’t really do a damn thing for ya…” This is obviously not very helpful. The entire reason to put a PaaS into place is to simplify life, the sad fact that it barely does a thing for the data tier isn’t very helpful.
Now, hold on a second before you start screaming at me about “but a PaaS does X, Y and Z and isn’t even supposed to touch that aspect of things…” let me elaborate a bit more. The panel at Deploycon states “…Developer Expectations” and when things are getting simplified in the way a PaaS does, developers assume that if it does all this fancy magic for an application it ought to simplify the data side of things too! Right? Well no, and it isn’t going to for the foreseeable future. But no matter what, it doesn’t change the fact that developers often have that expectation.
Now, I could write at length about all the reasons that PaaS doesn’t really do anything for the data tier. I could wax poetic about how a distributed database (re: Riak, Cassandra, etc) just doesn’t lend itself to a cookie cutter approach to deployment under a PaaS or an RDBMS has umpteen different configurations for stability, scaling, hot swappable services, and other such complexities around the data tier. But instead I’m going to skip all, maybe cover some of those things another day, and jump right into some of the things that are actually moving forward to fill this gap.
BOSH, Cloud Foundry, OpenShift & fixing the data tier…
The most obvious reason there isn’t a simple turn key solution to the data side of things with a PaaS ecosystem is that data is complex and extremely diverse. There’s distributed key/value stores (Riak, Cassandra), there’s sort of kind of distributed databases (Mongo), graph databases (Neo4j), the age old RDBMS (DB2, SQL Server, Oracle’s Stuff, etc) and the million solutions around that, there’s key/value in memory styled databases that are insanely fast, like Redis. Expanding just slightly you have software that works around these systems such as Hadoop & Riak CS & the list goes on. All of it focused on the data tier and maintaining one, two or some form of the three points around CAP Theorem (http://en.wikipedia.org/wiki/CAP_theorem), atomicity and other key capbilities.
All of the PaaS Systems, including public and private often have some sort of plug-in style architectures for data. Whether it is Apprenda which is closed to community and closed source or an ongoing open to community PaaS like OpenShift or Cloud Foundry, things still fall almost entirely to the developers or database team to build an architecture around the data. When looking at solutions to simplify data in PaaS Systems the closed source solutions we have no idea what they’re up to in this regard. The one’s that are open source or in large part public and involved in the community PaaSes, like EngineYard, Heroku, Cloudbees and others we can really see the directions and efforts around creating real PaaS style solutions to the data tier problem.
BOSH, Vagrant, etc… One of the best solutions I’ve seen so far is the ability of Bosh, which was created by the Cloud Foundry team while at VMware, to spool up an environment that includes such things as a Riak Cluster (or other cluster). Currently Brian McClain & Dr Nic have worked to put together such Bosh + Vagrant scripts & get things rolling. I myself will be spending some considerable time on just that. But beyond that this is a good start in enabling data tier back ends.
How to close the gap, between absurdly simple application deployment and still arduous and difficult data tier deployment? For the next several years I think we’ll have cumbersome deployment practices around the data tier. There won’t be anything as elegantly simple as Cloud Foundry’s single line deployment or AppFog’s one click deployment of a web application. The best we can do at this time, is to streamline around pieces and architectures, and at least get them into a kind of simple 3 step deployment.
Please drop a comment or two on how you think we might simplify the data side of the PaaS toolchain. Also drop a few tweets in the twitterverse too, I’m sure that’ll be exploding as usual. I’m @adron, ping me.