Light up a Riak Cluster with AWS, A Few Notes…

I wanted to write up an intro to getting Riak installed on AWS, even though the steps are absurdly simple and already available on the Basho Docs site, there’s a few extra notes that can be very helpful for a few specific points during the process.

Start off by logging into AWS. At this point you can take two different paths that are almost identical. You can follow the path of using the pre-built AWS Marketplace image of Riak, or just start form scratch. The difference is a total of about 2 steps; installing & setting some security port connections. I’m going to step through without using the prebuilt image in these instructions.

Security Group

First thing you’ll need to get a security group with the correct permissions setup. For that, you’ll need to make a security group.

NOTE: No, I didn’t mean to misspell Riak, but it’s in there now.  😉

Before adding the ports, go to the security group details tab and copy the security group id. I’ve pointed it out in the image above.

Now add the following three and assign the security group to the ports; 4369, 8099 & 6000-7999. For the source set it to the security group id. Once you get all three added the list should look like this (below). For each rule click the Add Rule button and remember to click the Apply Rule Changes. I often forget this because the screen on some of the machines I use only shows to the bottom of the Add Rule button, so you’ll have to scroll down to find the Apple Rule Changes button.

Now add the standard port 22 for SSH. Next get the final two of 8087 and 8098 setup and we’re ready for moving on to creating the virtual machines.

Server Virtual Machines

For creating virtual machines I just clicked on Launch Instance and used the classic wizard. From there you get a selection of items. I’ve used the AWS image to do this, but would actually suggest using a CentOS image of your choice or Red Hat Enterprise Linux (RHEL). Another great option is to use the Ubuntu 12.04 LTS. Really though, use whatever Linux version or distro you like, there are 1-2 step instructions for installing Riak on almost every distro out.

Next just launch a single instance. We’ll be able to launch duplicates of these further along in the process. I’ve selected a “Micro” here but I’m not intending to do anything with a remotely heavy load right now. At some point, I’ll upgrade this cluster to larger instances when I start putting it under a real load. I’ll have another blog entry to describe exactly how I do this too.

Keep hitting continue until you get to the key pair selection. Pick the key pair you want, either making a new one for this cluster or use one you already have. Either way works fine.

Continue again until you can select the security group that we created above.

Now keep hitting that continue button, until you get to launch, and launch this thing. Once the instance is launched launch your preferred SSH connection tooling. The easiest way I’ve found for getting the most current private IP to connect to with the appropriate command is to right click on the instance in the AWS Console and click on Connect. There you’ll find the command to connect via SSH.

Paste that in and hit enter in your SSH App, you’ll see something akin to this.

[sourcecode language=”bash”]
$ cd Codez/working-content/
$ ssh -i riaktionz.pem root@ec2-54-245-201-97.us-west-2.compute.amazonaws.com
The authenticity of host ‘ec2-54-245-201-97.us-west-2.compute.amazonaws.com (54.245.201.97)’ can’t be established.
RSA key fingerprint is 31:18:ac:1a:ac:fc:6e:6d:55:e8:8a:83:9a:8f:c7:5f.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added ‘ec2-54-245-201-97.us-west-2.compute.amazonaws.com,54.245.201.97′ (RSA) to the list of known hosts.
Please login as the user "ubuntu" rather than the user "root".
[/sourcecode]

Enter yes to continue connecting. For some instance types, like Ubuntu you’ll have to do some teaks to log into as “ubuntu” vs. “root” and the same goes for the AWS image or others. I’ll leave that to you, dear reader to get connected via ole’ SSH.

One of the other things, that you may have to do some tweaking about and googling, is figuring out the firewall setups on the various virtual machine images. For the RHEL you’ll want to turn off the firewall or open up the specific connection ports and such. Since the AWS firewall does this, it isn’t particularly important for the OS to continue running its firewall service. In this case, I’ve turned off the OS firewall and just rely on the AWS firewall. To turn off the RHEL firewall, execute the following commands.

[sourcecode language=”bash”]
[root@ip-x-x-x-x]# service iptables save
iptables: Saving firewall rules to /etc/sysconfig/iptables:[ OK ]
[root@ip-x-x-x-x]# service iptables stop
iptables: Flushing firewall rules: [ OK ]
iptables: Setting chains to policy ACCEPT: filter [ OK ]
iptables: Unloading modules: [ OK ]
[root@ip-x-x-x-x]# chkconfig iptables off
[root@ip-x-x-x-x]#
[/sourcecode]

Now is a perfect time to start those other instances. Navigate into the AWS Console again and right click on the virtual machine instance you’ve created. On that menu select Launch More Like This.

Go through and check the configuration on each of these, make sure the firewall is turned off, etc. Then move on to the next step and install Riak and cluster them. So it’s time to get to the distributed, massively complex, extensive list of steps to install & cluster Riak. Ok, so that’s sarcasm.  😉

Step 1: Install Riak

Install Riak on each of the instances.

[sourcecode language=”bash”]
package=basho-release-6-1.noarch.rpm && \
wget http://yum.basho.com/gpg/$package -O /tmp/$package && \
sudo rpm -ivh /tmp/$package
sudo yum install riak
[/sourcecode]

NOTE: For other installation methods, such as directly downloading the RPM or other Linux OSes, check out the http://docs.basho.com/riak/latest/tutorials/installation/Installing-on-RHEL-and-CentOS/.

Step 2: Setup the Cluster

On the first instance, get the IP. You won’t need to do anything to this instance, just keep the IP handy. Then move on to the second instance and run the cluster command.

[sourcecode language=”bash”]
sudo riak-admin cluster join riak@<ip_of_the_first_node>
[/sourcecode]

Do this on each of the instances you’ve added, using that first node. When you’ve added them all, on that last instance (or really any of them) then run the plan. This will get you a display plan of what will take place when the cluster is committed.

[sourcecode language=”bash”]
sudo riak-admin cluster plan
[/sourcecode]

If that looks all cool. Commit the plan.

[sourcecode language=”bash”]
sudo riak-admin cluster commit
[/sourcecode]

Get a check of the cluster.

[sourcecode language=”bash”]
sudo riak-admin member_status
[/sourcecode]

That’s it all done. You know have a Riak Cluster. For more operations to try out your cluster, check out this list of base API Operations.

Not So Versus, Riak Versus Redis

Recently a friend of mine and fellow coder, harkening back to my Russell Investments Enterprise Developer days posed the discussion of Redis versus Riak. Well, first off, I thought I myself needed to write down a line by line comparison. I’ve worked with both in various ways but I’d never really thought about them lined up side by side. Often I think of the two pieces of technologies as complementary in various ways. But before I dive into that, let’s take a look at the stats side by side of each.

Company/Maintainer/Builder Basho Technologies @basho Salvatore Sanfilippo @antirez w/ VMware
Official Product Name Riak Redis
License Apache (link) BSD (link)
Storage Type Key Value Key Value
Protocols HTTP/RESTful & Custom Binaries  Telnet like / Proprietary
Replication/Clustering Masterless Master / Slave Replication
Language/Framework Erlang / C C / C++
Best Use Dynamo style architecture & concepts. Primarily used for extreme high availability. Best known and used for extremely fast access to quickly changing data and known size.
Key Feature Fault Tolerant Crazy Fast

Redis is primarily something you’re going to use to move data in and out at crazy fast speeds. However, when you need to store data, it isn’t ideal. There are ways, but it tends to work better handing off to something else to store the data. However Riak on the other hand isn’t always the fastest database, but it’ll withstand serious hits and still maintain integrity of data. It is also tunable for writes, reads and other characteristics that enable tuning and also integrity of the data among nodes. The more nodes in Riak you have the higher available iOPs and fault tolerance. The other thing that Riak can do over time, is truly scale from a horizontal and vertical perspective. Grow Riak tall and wide, it’ll give you linear performance and integrity improvements.

The thing I’ve seen over and over, is Riak as a store and Redis as a cache or other temporal specific data store for websites or other high transaction systems. Overall each serves a very specific purpose but work well in conjunction with each other when you’re rolling together an extremely high performance architecture with an extremely highly available back end data store.

If you’re in Seattle and up for lunch this Wednesday, join me for the Riak Nerd Lunch.

RICON Hits the Airwaves

This last week has been a bit more exciting for a number of reasons than I expected and dragged out for a few other reasons. (shakes fist in air with frustration!)

RICON|East lands in New York City!

Last year, well before I was working for Basho, I signed up for RICON in San Francisco. RICON is a conference with the key intellects behind distributed computing, such as Eric Brewer (he’s slightly involved with that CAP Theorem) and many others. You can even get a taste of the conference since the presentations are online and available to watch anytime you’d like to. For this year in New York City we already have some great speakers lined up and more to come!

2013 Speakers as of now…  (stay tuned for more)

Camille Fournier @skamille – VP of Technical Architecture at Rent The Runway blogs at Elided Branches

Camille is going to provide attendees with some serious Apache ZooKeeper knowledge. She’s, as stated, VP of Technical Architecture at Rent the Runway. Rent the Runway is a NYC-based eCommerce startup that rents designer dresses and accessories. They operate systems that support a business with a unique combination of challenges, from those related to product discovery, to those related to reservation booking and pricing, to a warehouse fulfillment system that handles 100% returns.

The ZooKeeper topic I’ll leave entirely for her to fill you in on! So, among all the others, another talk you’ll want to come to RICON to hear!

Sean Cribbs @seancribbs – Software Engineer at Basho Technologies

Sean blogs at seancribbs, which sounds a bit redundant, but hey… he gets down to the business of writing about some really insightful things. We’re fortunate at Basho to have him hacking away on projects, and super stoked to be working along side him! One of his latest posts “Property Driven Grammer Development” is seriously worth a read. Be sure to check out Sean’s Github and come gain gray matter activity while Sean lays down some brain power.

Kyle Kingsbury @aphyr – Member of Technical Set => {A | A ∉ A}, blogs at Aphyr

Riot starter Kyle is coming back to RICON East. Last year Kyle brought some fire to the RICON Conference and this time he’ll bring some fire in full presentation format this time. I’m looking forward to hearing about whatever he’s going to enlighten us about. Cheers Kyle, welcome to the speaker line up!

Neha Nerula @neha – blog at Transient Neha

Neha is currently a PhD Student at MIT, but has serious cred from working at Digg & Google. She’ll have some content you’ll want to give an attentive listen to! Even though she’s not so transient, she’s been to more than a few cities to work in. Currently she’s at MIT and has worked on the Intelligence Initiative, BFlow, Intrusion Recovery for Database-backed Web Application, and a host of other papers and other projects.

The team is moving so fast, there’s more speakers announced already, and I’ll be giving a portfolio to each of them as they’re announced and as fast as I can type!

As for an example of what RICON is like, here’s a few of the videos from 2012 that I really enjoyed…

2012 Talks

Pattern of Innovation: Riak Usage at BestBuy.com – Joel Crabb, RICON2012 from Basho Technologies on Vimeo.

Keep CALM and Query On – Joe Hellerstein, RICON2012 from Basho Technologies on Vimeo.

Advancing Distributed Systems – Eric Brewer, RICON2012 from Basho Technologies on Vimeo.

Riak in the Cloud – Ines Sombra and Michael Brodhead, RICON2012 from Basho Technologies on Vimeo.

A Few Notes on Riak 1.3 RC

Full context – Riak 1.3 RC came out just a couple dozen hours ago. RC stands for release candidate, which in turn basically means that version 1.3 is complete and any other additions will be for quick fixes or any issues that crop up. I’ve just started rolling a few new systems myself with this new version and hope you’ll join me in taking a hack at it. Let’s jump into a few reason why you’d want to leap into 1.3. You can read about the features below via the release notes also, but I’ve turned them into smaller bit size chunks below.

  • Giddyup in action!
    Giddyup in action!

    The first thing with the latest v1.3 has been the massive effort put into testing via the riak_test and the giddyup repos. Ongoing there will be a much easier way to move forward in features & quality. This is one of the reasons I love working for Basho, the whole team isn’t about smoke and mirrors with testing, they readily and diligently work on testing. Which to add context, remember we’re talking about distributed systems here, which aren’t exactly the easiest thing to test. One doesn’t just merely walk in and write unit tests and assume a distributed systems is tested. This moves us forward, and those that want to contribute and get involved more heavily in Riak now have a platform to dive in confidently when using these testing repositories.

  • Active Anti-Entropy – Alright, now we’re getting to the features with bad ass sounding names. Also referred to as AAE, this feature grabs bad replica data and begins a correction through read repair to protect data. It’s one more layer of protection against any type of data loss, disaster, bit rot, etc).
  • MapReduce Sink Backpressure – This one reminds me of tuning when setting up forced induction, AKA a turbo on a car. But I digress, I’ve snagged a description from the release notes for this feature, “Riak Pipe brought inter-stage backpressure to Riak KV’s MapReduce system. However, prior to Riak 1.3, that backpressure did not extend to the sink. It was assumed that the Protocol Buffers or HTTP endpoint could handle the full output rate of the pipe. With Riak 1.3, backpressure has been extended to the sink so that those endpoint processes no longer become overwhelmed. This backpressure is tunable via a soft cap on the size of the sink’s buffer, and a period at which a worker should check that cap. These can be configured at the Riak console by setting application environment variables” ….suffice it to say this helps out with map reduce in certain situations.
  • Additional IPv6 Support – Riak Handoff and Protocol Buggers listen ala IPv6 now. Nuff’ said.
  • Luke removal – Luke is completely and utterly gone now. Dead. Don’t look for Luke here.
  • Riaknostic – This is now part of the default featureset instead of separate tooling.
  • SmartOS 1.8 Packages – They’re available.
  • Health Check – This is a pretty awesome system that’s been added. Basically it watches the system and enables and disables services based on conditions. It’s super easy, just flick the switch in the app.config.
    [sourcecode language=”yml”]
    {enable_health_checks, true}
    [/sourcecode]
  • Reset Bucket Properties – A quickie definition from the release notes “The HTTP interface now supports resetting bucket properties to their default values. Bucket properties are stored in Riak’s ring structure that is gossiped around the cluster. Resetting bucket properties for buckets that are no longer used or that are using the default properties can reduce the amount of gossiped data.”

There were also a lot of PRs and more that you can check out on Github. These are the main key features that are now available and ready for use in 1.3. Check em’ out, feel free to contact me or any of the team to ask questions, let us know your 2 cents or otherwise banter about. Cheers! Sometime in the coming days I’ll have a quick start, akin to what’s in the docs, but with some specific ops on some IaaS Providers. So keep reading, coming up soon.

Happy hacking!  \m/   \m/

The Database Deluge… Who’s Who

These are the top NoSQL Solutions in the market today that are open source, readily available, with a strong and active community, and actively making forward progress in development and innovations in the technology. I’ve provided them here, in no order, with basic descriptions, links to their main website presence, and with short lists of some of their top users of each database. Toward the end I’ve provided a short summary of the database and the respective history of the movement around No SQL and the direction it’s heading today.

Cassandra

http://cassandra.apache.org/

Cassandra is a distributed databases that offers high availability and scalability. Cassandra supports a host of features around replicating data across multiple datacenters, high availability, horizontal scaling for massive linear scaling, fault tolerance and a focus, like many NoSQL solutions around commodity hardware.

Cassandra is a hybrid key-value & row based database, setup on top of a configuration focused architecture. Cassandra is fairly easy to setup on a single machine or a cluster, but is intended for use on a cluster of machines. To insure the availability of features around fault tolerance, scaling, et al you will need to setup a minimal cluster, I’d suggest at least 5 nodes (5 nodes being my personal minimum clustered database setup, this always seems to be a solid and safe minimum).

Cassandra also has a query language called CQL or Cassandra Query Langauge. Cassandra also support Apache Projects Hive, Pig with Hadoop integration for map reduce.

Who uses Cassandra?

  • IBM
  • HP
  • Netflix
  • …many others…

HBase

http://hbase.apache.org/

In the book, Seven Databases in Seven Weeks, the Apache HBase Project is described as a nail gun. You would not use HBase to catalog your sales list just like you wouldn’t use a nail gun to build a dollhouse. This is an apt description of HBase.

HBase is a column-oriented database. It’s very good at scaling out. The origins of HBase are rooted in BigTable by Google. The proprietary database is described in in the 2006 white paper, “Bigtable: A Distributed Storage System for Structured Data.”

HBase stores data in buckets called tables, the tables contain cells that are at the intersection of rows and columns. Because of this HBase has a lot of similar characteristics to a relational database. However the similarities are only in name.

HBase also has several features that aren’t available in other databases, such as; versioning, compression, garbage collection and in memory tables. One other feature that is usually only available in relational databases is strong consistency guarantees.

The place where HBase really shines however is in queries against enormous datasets.

HBase is designed architecturally to be fault tolerate. It does this through write-ahead logging and distributed configuration. At the core of the architecture HBase is built on Hadoop. Hadoop is a sturdy, scalable computing platform that provides a distribute file system and mapreduce capabilities.

Who is using it?

  • Facebook uses HBase for its messaging infrastructure.
  • Stumpleupon uses it for real-time data storage and analytics.
  • Twitter uses HBase for data generation around people search & storing logging & monitoring data.
  • Meetup uses it for site data.
  • There are many others including Yahoo!, eBay, etc.

Mongo

http://www.mongodb.org/

MongoDB is built and maintained by a company called 10gen. MongoDB was released in 2009 and has been rising in popularity quickly and steadily since then. The name, contrary to the word mongo, comes from the word humongous. The key goals behind MongoDB are performance and easy data access.

The architecture of MongoDB is around document database principles. The data can be queried in an ad-hoc way, with the data persisted in a nested way. This database also, like most NoSQL databases enforces no schema, however can have specific document fields that can be queried off of.

Who is using it?

  • Foursquare
  • bit.ly
  • CERN for collecting data from the large Hadron Collider
  • …others…

Redis

http://redis.io/

Redis stands for Remote Dictionary Service. The most common capability Redis is known for, is blindingly fast speed. This speed comes from trading durability. At a base level Redis is a key-value store, however sometimes classifying it isn’t straight forward.

Redis is a key-value store, and often referred to as a data structure server with keys that can be string, hashes, lists, sets and sorted sets. Redis is also, stepping away from only being a key-value store, into the realm of being a publish-subscribe and queue stack. This makes Redis one very flexible tool in the tool chest.

Who is using it?

  • Blizzard (You know, that World of Warcraft game maker)  😉
  • Craigslist
  • flickr
  • …others…

Couch

http://couchdb.apache.org/

Another Apache Project, CouchDB is the idealized JSON and REST document database. It works as a document database full of key-value pairs with the values a set number of types including nested with other key-value objects.

The primary mode of querying CouchDB is to use incremental mapreduce to produce indexed views.

One other interesting characteristic about CouchDB is that it’s built with the idea of a multitude of deployment scenarios. CouchDB might be deployed to some big servers or may be a mere service running on your Android Phone or Mac OS-X Desktop.

Like many NoSQL options CouchDB is RESTful in operation and uses JSON to send data to and from clients.

The Node.js Community also has an affinity for Couch since NPM and a lot of the capabilities of Couch seem like they’re just native to JavaScript. From the server aspect of the database to the JSON format usage to other capabilities.

Who uses it?

  • NPM – Node Package Manager site and NPM uses CouchDB for storing and providing the packages for Node.js.

Couchbase (UPDATED January 18th)

Ok, I realized I’d neglected to add Couchbase (thus the Jan 18th update), which is an open source and interesting solution built off of Membase and Couch. Membase isn’t particularly a distributed database, or database, but between it and couch joining to form Couchbase they’ve turned it into a distributed database like couch except with some specific feature set differences.

A lot of the core architecture features of Couch are available, but the combination now adds auto-sharding clusters, live/hot swappable upgrades and changes, memchaced APIs, and built in data caching.

Who uses it?

  • Linkedin
  • Orbitz
  • Concur
  • …and others…

Neo4j

http://www.neo4j.org/

Neo4j steps away from many of the existing NoSQL databases with its use of a graph database model. It stored data as a graph, mathematically speaking, that relates to the other data in the database. This database, of all the databases among the NoSQL and SQL world, is very whiteboard friendly.

Neo4j also has a varied deployment model, being able to deploy to a small or large device or system. It has the ability to store dozens of billions of edges and nodes.

Who is using it?

  • Accenture
  • Adobe
  • Lufthansa
  • Mozilla
  • …others…

Riak

Riak is a key-value, distributed, fault tolerant, resilient database written in Erlang.  It uses the Riak Core project as a codebase for the distributed core of the system. I further explained Riak, since yes, I work for Basho who are the makers of Riak, in a separate blog entry “Riak is… A Big List of Things“. So for a description of the features around Riak check that out.

Who is using Riak?

In Summary

One of the things you’ll notice with a lot of these databases and the NoSQL movement in general is that it originated from companies needing to go “web scale” and RDBMSs just couldn’t handle or didn’t meet the specific requirements these companies had for the data. NoSQL is in no way a replacement to relational or SQL databases except in these specific cases where need is outside of the capability or scope of SQL & Relational Databases and RDBMSs.

Almost every NoSQL database has origins that go pretty far back, but the real impetus and push forward with the technology came about with key efforts at Google and Amazon Web Services. At Google it was with BigTable Paper and at Amazon Web Services it was with the Dynamo Paper. As time moved forward with the open source community taking over as the main innovator and development model around big data and the NoSQL database movement. Today the Apache Project has many of the projects under its guidance along with other companies like Basho and 10gen.

In the last few years, many of the larger mainstays of the existing database industry have leapt onto the bandwagon. Companies like Microsoft, Dell, HP and Oracle have made many strategic and tactical moves to stay relevant with this move toward big data and nosql databases solutions. However, the leadership is still outside of these stalwarts and in the hands of the open source community. The related companies and organizations that are focused on that community such as 10gen, Basho and the Apache Organization still hold much of the future of this technology in the strategic and tactical actions that they take since they’re born from and significant parts of the community itself.

For an even larger list of almost every known NoSQL Database in existence check out NoSQL Database .org.