Riak in a .NET World

Jeremiah's Demo Works, IT WORKS IT WORKS!
Jeremiah’s Demo Works, IT WORKS IT WORKS!

A few days ago Troy Howard, Jeremiah Peschka and I all traveled via Amtrak Cascades up to Seattle. The mission was simple, Jeremiah was presenting “Riak in a .NET World”, I was handling logistics and Troy was handling video.

So I took the video that Troy shot, I edited it, put together some soundtrack to it and let Jeremiah’s big data magic shine. He covers the basics around RDBMSes, SQL Server in this case but easily it applies to any RDBMS in large part. These basics bring us up to where and why an architecture needs to shift from an RDBMS solution to a distributed solution like Riak. After stepping through some of the key reasons to move to Riak, Jeremiah walks through a live demo of using CorrugatedIron, the .NET Client for Riak (Github repo). During the walk through he covers the specific characteristics of how CorrugatedIron interacts with Riak through indexs, buckets and during puts and pulls of data.

Toward the end of the video Joseph Blomstedt @jtuple, Troy Howard @thoward37, Jeremiah Peschka @peschkaj, Clive Boulton @iC and Richard Turner @bitcrazed. Also note, I’ve enabled download for this specific video since it is actually a large video (1.08GB total). So you may want to download and watch it if you don’t have a super reliable high speed internet connection.

Also for more on Jeremiah’s work check out http://www.brentozar.com/articles/riak/  and contact him at http://www.brentozar.com/contact/

Farewell Basho, It’s Been Swell Yo!

Whew, it’s been a total blast working at Basho. I’ve accomplished a ton of things. Riak is a solid distributed database system and I’m glad to have worked with the team on advocating its use, teaching distributed systems ideas and concepts and generally spreading the knowledge. I’ve seen some truly great things that people are hacking together, setting up for projects and redesigning old systems to utilize newer, better, faster and more capable distributed systems concepts and ideas. Some of the things I’m happy to have contributed to in my time at Basho.

…and there has been a whole lot more. Suffice it to say, Basho has provided me with some sweet opportunities to work on some extremely interesting data projects from a very data sciency point of view (yeah I know sciency aint a word). There may be more Riak work and Riak meetups and Riak hacks and Riak who knows what coming from me, but the meetups & such are now at the hands of the core Riak crew and…

Where Am I Headed?

Right now, I’m moving 20 blocks away from where I currently live, setting up a couch to hack on and grabbing a beer. I’ve got a few personal projects I’ve been wanting to work on. Then I’m taking a few weeks to do some side projects that have been on the burner. Keep an eye out, I’ll be kicking off one, maybe two of these open source projects in the next few days. As @tsantero twitted…

…I’m going to attack my own notebook of ideas. Maybe I’ll even work on that Riak CS Video object store that Tom and I spoke about 10 months ago? Either way, whatever the projects are, I’ll have them posted right here. Until then…

Cheers & Happy Hacking!

Philadelphia Riak Training & Presentations

R Graphs
R Graphs

This week I’ve traveled to Philadelphia to meet with a number of the Basho team to work together and receive training with the trainers or the best ways to approach content on Riak and more generally the best ways we can all brainstorm up to approach specific topics. Some of those topics include things like:

  • Access Patterns around Log Storage & Analysis
  • Bloom Filters
  • CRDTs
  • Consensus Protocols
  • Erasure Coding
  • LevelDB and Bitcask Backends
  • MDC Repl

Out of the options we discussed in training today I ran with benchmarking. It is always near and dear to many of the customers’, clients’ and curious’ that I talk to. I dove in to see what exactly we offer with basho_bench (docs info, github repo) in detail and functionality, but also dove into other benchmarks are out there that others may have run in the past.

basho_bench

What exactly is basho_bench? The basho_bench project is a code repo on Github that offers a set of benchmarking tests that are run against a Riak cluster. There are a few prerequisites to the quick steps below, the prerequisites are:

  1. Make sure you have a cluster or a devrel (Basho docs on devrel) setup that you can point basho_bench at.
  2. Erlang R15B03 should be installed (OS-X Compiler battles and fixes).
  3. Make sure R is installed.

Now to get basho_bench setup.

[sourcecode language=”bash”]
git clone git://github.com/basho/basho_bench.git
cd basho_bench
make all
[/sourcecode]

Once that is done building, review the directory structure that is in the basho_bench directory. The following should be available in the directory.

[sourcecode language=”bash”]
$ ls
FAQ deps rebar
LICENSE ebin rebar.config
Makefile examples src
README.org include tests
basho_bench priv
[/sourcecode]

The examples directory has several default config files available to run with basho_bench for testing. If there is a devrel setup with the default 127.0.0.1 IP usage, just run the following command to begin generating stats. If the cluster being tested is not a devrel with 127.0.0.1 then give the configuration section of the docs a read for information on how to point basho_bench at an alternative cluster.

[sourcecode language=”bash”]
./basho_bench examples/http.config
[/sourcecode]

You’ll see something akin to this spit out onto the screen.

[sourcecode language=”bash”]
$ ./basho_bench examples/http.config
10:34:41.736 [debug] Lager installed handler lager_console_backend into lager_event
10:34:41.752 [debug] Lager installed handler {lager_file_backend,"/Users/adronhall/Codez/basho_bench/tests/20130808_103441/error.log"} into lager_event
10:34:41.752 [debug] Lager installed handler {lager_file_backend,"/Users/adronhall/Codez/basho_bench/tests/20130808_103441/console.log"} into lager_event
10:34:41.757 [debug] Lager installed handler error_logger_lager_h into error_logger
10:34:41.758 [info] Application lager started on node nonode@nohost
10:34:41.767 [info] Est. data size: 488.28 KB
10:34:41.807 [debug] Supervisor sasl_safe_sup started alarm_handler:start_link() at pid
10:34:41.808 [debug] Supervisor sasl_safe_sup started overload:start_link() at pid
10:34:41.809 [debug] Supervisor sasl_sup started supervisor:start_link({local,sasl_safe_sup}, sasl, safe) at pid
10:34:41.810 [debug] Supervisor sasl_sup started release_handler:start_link() at pid
10:34:41.811 [info] Application sasl started on node nonode@nohost

….AND A WHOLE LOT MORE HERE….

10:35:42.954 [info] {{{put_re,{"localhost",4567,"/","{\"this\":\"is_json_%%V\"}"},[{‘Content-Type’,’application/json’}]},{put_re,{"localhost",4567,"/","{\"this\":\"is_json_%%V\"}"},[{‘Content-Type’,’application/json’}]}},{put,{conn_failed,{error,econnrefused}}}}: 1701
10:35:42.955 [info] Application basho_bench exited with reason: stopped
10:35:42.955 [info] Test completed after 1 mins.
$
[/sourcecode]

After letting the test run for it’s designated minute, run the following command to get some pretty graphs with R.

[sourcecode language=”bash”]
make results
[/sourcecode]

…or maybe run…

[sourcecode language=”bash”]
priv/summary.r -i tests/current
[/sourcecode]

The reason I post both is that ‘make results‘ doesn’t seem to always work to build the results and the manual execution will actually get the results built. With the results built, check the tests directory in the basho_bench directory for the summary.png file. If you open the file it should look something like this.

Default empty http.config results from basho_bench. (Click for full size image)
Default empty http.config results from basho_bench. (Click for full size image)

From here you can now run basho_bench and get the results that are specific to basho_bench. However, this now leads me to a higher abstract topic of why do benchmarking in the first place.

Why Benchmark? How to Benchmark!

The definition for benchmark,

bench·mark

[bench-mahrk] 
noun
1.a standard of excellence, achievement, etc., against which similar things must be measured or judged: The new hotel is a benchmark in opulence and comfort.
2.any standard or reference by which others can be measured or judged: The current price for crude oil may become the benchmark.
3.Computers. an established point of reference against which computers or programs can be measured in tests comparing their performance, reliability, etc.
4.Surveying. Usually, bench mark. a marked point of known or assumed elevation from which other elevations may be established. Abbreviation:  BM
adjective
5.of, pertaining to, or resulting in a benchmark: benchmark test,benchmark study.

While basho_bench provides an interesting baseline test that shows various pieces of data to work with, it shows nothing by default that is specific to YOUR use case. The basho_bench is not ideal for your production environment, it is not your dev or user acceptance testing or test criteria, it is an example. To truly get effective numbers that really encompass your needs for your project you will need to provide custom configuration for basho_bench or write your own specific benchmark.

The reason behind this is, with Riak as with other NoSQL solutions, is that you’re working toward a goal that is very data specific and unknown. It has specific domain logic and criteria that is specific to the use case, a custom benchmark can provide real data related to that domain logic and criteria.

In the end, even though basho_bench is a great tool to get started, do basic tests, and a great project to get ideas from it is not the panacea benchmark. You’ll need to create the specific benchmark for your use case yourself.

Happy hacking (and benchmarking).

Architectural PaaS Cracks or Crack PaaS

Over the last couple years there have been two prominent open source PaaS Solutions come onto the market. Cloud Foundry & OpenShift. There’s been a lot of talk about these plays and the talk has slowly but steadily turned into traction. Large enterprises are picking these up and giving their developers and operations staff a real chance to make changes. Sometimes disruptive in a very good way.

However, with all the grandeur I’m going to hit on the negatives. These are the missing parts, the serious pain points beyond just some little deployment nuisance. Then a last note on why, even amidst the pain points, you still need to make real movement with PaaS tooling and technologies.

Negative: The Data Story is Lacking

Both Cloud Foundry and OpenShift have a way to plug into databases easily.

Cloud Foundry provides ways to build a Cloud Foundry Service that becomes the bound and hooked in SQL Server, MySQL, Postgresql, Redis or whatever data storage service you need. For more details on building a service, check out the echo example on the vcap sample github project.

OpenShift has what are called Cartridges which provide the ability to add databases and other services into the system. For more information about the cartridges check out Red Hat’s OpenShift Documentation and also the forums.

Cloud Foundry and OpenShift however have distinctive weak spots when it comes to services that go beyond a mere single instance database. In the case of a true distributed database such as Cassandra, HBase or Riak, it is inordinately difficult to integrate a system that any PaaS inter-operates with well. In some cases it’s irrelevant to even try.

The key problem being that both of the PaaS systems assume the mantle of master while subjugating the distributed database a lower tier of coordination. The way to resolve this at the moment is to do an autonomous installation of Riak, Cassandra, Neo4j or other database that may be distributed, stored hot swappable, or otherwise spread across multiple machine or instance points. Then create a bound connection between it and the PaaS Application that is hosted. This is the big negative in PaaS systems and tooling right now, the data story just doesn’t expand well to the latest in data and database technologies. I’ll elaborate more about this below.

Negative: Deployment is Sometimes Easy, Maintenance is Sometimes Hard

Cloud Foundry is extremely rough to deploy, unless you use Bosh to deploy to either VMware Virtualized instances or AWS. Now, you could if resources were available get Bosh to deploy your Cloud Foundry environment anywhere you wanted. However, that’s not easy to do. Bosh is still a bit of a black box. I myself along with others in the community are working to document Bosh, but it is slow going.

OpenShift is dramatically easier to deploy, but is missing a few key pieces once deployed that draw some additional operational overhead. One of those is that OpenShift requires more networking management to handle routing between various parts of the PaaS Ecosystem.

Overall, this boils down to what you need between the two PaaS tool chains. If you want Cloud Foundry’s automatic routing and management between nodes. This is a viable route, but if your team wants to manage the networking tier more autonomous from the PaaS environment then maybe OpenShift is the way to go. In the end, it’s negative bumpy territory to determine which you may or may not want based on that.

Negative: Full Spectrum Polyglot, Missing Some

Cloud Foundry has a wider selection of languages and frameworks with community involvement around those with groups like Iron Foundry. OpenShift I’m sure will be getting to parity in the coming months. I have no doubt between both of these PaaS Ecosystems that they’ll expand to new languages and frameworks over time. Being polyglot after all is a no brainer these days!

Why PaaS Is, IMHO, Still Vitally Important

First toss out the idea that huge, web scale, Facebooks and Googles need to be built. Think about what the majority of developers out there in the world work on. Tons and tons and tons of legacy or greenfield enterprise applications. Sometimes the developer is lucky enough to work on a full vertical mix of things for a small business, but generally, the standard developer in the world is working on an enterprise app.

PaaS tooling takes the vast majority of that enterprise app maintenance from an operational side and tosses it out. Instead of managing a bunch of servers with a bunch of different apps the operations team manages an ecosystem that has a bunch of apps. This, for the enterprises that have enough foresight and have managed their IT assets well enough to be able to implement and use PaaS tooling, is HUGE!

For companies working to stay relevant in the enterprise, for companies looking to make inroads into the enterprise and especially for enterprises that are looking to maintain, grow or struggling to keep ahead of the curve – PaaS tooling is something that is a must have.

Just ask a dev, do they want to spend a few hours configuring and testing a server?  Do they want to deploy their application and focus on building more value into that application?

…being I’ve spent a few years being the developer, I’ll hedge on the side of adding value.

What’s Next?

So what’s next? Two major things in my opinion.

1. Fill the data gap. Most of the PaaS tooling needs to bridge the gap with the data story. I’m working my part with testing, development and efforts to get real options built into these environments, but this often leads back to the data story of PaaS being weak. What’s the solution here? I’m in talks, ongoing, planning sessions ongoing, and we’ll eventually get a solid solution around the data side.

2. Fix deployments & deployment management. Bosh isn’t straight forward or obvious in what it does, Cloud Foundry is easily the hardest thing to deploy with many dependencies. OpenShift is easier to deploy and neither of them actually have a solid management story over time. Bosh does some impressive updates of Cloud Foundry, and OpenShift has some upgrade methods, but still over time and during day to day operations there hasn’t been any clear cut wins with viewing, monitoring and managing nodes and data within these environments.