The Database Deluge… Who’s Who

These are the top NoSQL solutions on the market today that are open source, readily available, backed by a strong and active community, and actively making forward progress in development and innovation. I’ve listed them here, in no particular order, with basic descriptions, links to their main websites, and short lists of some of the top users of each database. Toward the end I’ve provided a short summary of the databases, the history of the NoSQL movement, and the direction it’s heading today.

Cassandra

http://cassandra.apache.org/

Cassandra is a distributed database that offers high availability and scalability. It supports replicating data across multiple datacenters, horizontal scaling aimed at near-linear growth in capacity, fault tolerance and, like many NoSQL solutions, a focus on commodity hardware.

Cassandra is a hybrid key-value & row based database, set up on top of a configuration focused architecture. Cassandra is fairly easy to set up on a single machine or a cluster, but is intended for use on a cluster of machines. To ensure the availability of features around fault tolerance, scaling, et al. you will need to set up a minimal cluster. I’d suggest at least 5 nodes (5 nodes being my personal minimum for a clustered database setup; this always seems to be a solid and safe minimum).

Cassandra also has a query language called CQL, or Cassandra Query Language. Cassandra also supports the Apache projects Hive and Pig, with Hadoop integration for MapReduce.

Who uses Cassandra?

  • IBM
  • HP
  • Netflix
  • …many others…

HBase

http://hbase.apache.org/

In the book, Seven Databases in Seven Weeks, the Apache HBase Project is described as a nail gun. You would not use HBase to catalog your sales list just like you wouldn’t use a nail gun to build a dollhouse. This is an apt description of HBase.

HBase is a column-oriented database. It’s very good at scaling out. The origins of HBase are rooted in BigTable by Google. That proprietary database is described in the 2006 white paper, “Bigtable: A Distributed Storage System for Structured Data.”

HBase stores data in buckets called tables, and the tables contain cells that sit at the intersection of rows and columns. Because of this, HBase appears to share a lot of characteristics with a relational database. However, the similarities are only in name.

HBase also has several features that aren’t available in many other databases, such as versioning, compression, garbage collection and in-memory tables. One other feature that is usually only available in relational databases is strong consistency guarantees.
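To make the “cells at the intersection of rows and columns” idea and the versioning feature a bit more concrete, here’s a rough sketch in plain Python. It is not the HBase API; the `VersionedTable` class, its method names and the `max_versions` setting are all hypothetical names I made up for illustration. The only idea it borrows is that each cell can hold several timestamped versions, with the oldest ones dropped.

```python
import time
from collections import defaultdict


class VersionedTable:
    """Toy table (hypothetical, not HBase): each (row, column) cell keeps
    several timestamped versions of its value."""

    def __init__(self, max_versions=3):
        self.max_versions = max_versions
        # row key -> column name -> list of (timestamp, value), newest first
        self.rows = defaultdict(lambda: defaultdict(list))

    def put(self, row, column, value, timestamp=None):
        ts = time.time() if timestamp is None else timestamp
        versions = self.rows[row][column]
        versions.append((ts, value))
        versions.sort(key=lambda pair: pair[0], reverse=True)
        # Drop the oldest versions beyond the configured limit.
        del versions[self.max_versions:]

    def get(self, row, column, versions=1):
        # Return up to `versions` values, newest first.
        return [value for _, value in self.rows[row][column][:versions]]


table = VersionedTable(max_versions=2)
table.put("user1", "info:email", "old@example.com", timestamp=1)
table.put("user1", "info:email", "new@example.com", timestamp=2)
table.put("user1", "info:email", "newest@example.com", timestamp=3)

latest = table.get("user1", "info:email")
history = table.get("user1", "info:email", versions=5)
```

Here `latest` holds only the newest value, while `history` holds the two newest, because the third and oldest version was dropped when the limit of two was exceeded.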

The place where HBase really shines however is in queries against enormous datasets.

HBase is designed architecturally to be fault tolerant. It does this through write-ahead logging and distributed configuration. At the core of the architecture HBase is built on Hadoop. Hadoop is a sturdy, scalable computing platform that provides a distributed file system and MapReduce capabilities.

Who is using it?

  • Facebook uses HBase for its messaging infrastructure.
  • StumbleUpon uses it for real-time data storage and analytics.
  • Twitter uses HBase for data generation around people search & storing logging & monitoring data.
  • Meetup uses it for site data.
  • There are many others including Yahoo!, eBay, etc.

Mongo

http://www.mongodb.org/

MongoDB is built and maintained by a company called 10gen. MongoDB was released in 2009 and has been rising in popularity quickly and steadily since then. The name, contrary to the word mongo, comes from the word humongous. The key goals behind MongoDB are performance and easy data access.

The architecture of MongoDB is built around document database principles. The data can be queried in an ad-hoc way, and documents are persisted with nested structure. Like most NoSQL databases it enforces no schema, yet specific document fields can still be queried.
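As a rough illustration of ad-hoc queries against nested, schema-less documents, here’s a small sketch in plain Python. It doesn’t use MongoDB or any driver; `get_path` and `find` are hypothetical helpers of my own, and the dotted path style (`address.city`) is only meant to echo how nested fields are commonly addressed in document databases.

```python
def get_path(document, dotted_path):
    """Follow a dotted path such as 'address.city' into nested dicts."""
    current = document
    for part in dotted_path.split("."):
        if not isinstance(current, dict) or part not in current:
            return None  # the field is simply absent in this document
        current = current[part]
    return current


def find(collection, query):
    """Return the documents whose fields equal every value in the query."""
    return [
        doc for doc in collection
        if all(get_path(doc, path) == wanted for path, wanted in query.items())
    ]


# No schema: every document can have a different shape.
people = [
    {"name": "Ann", "address": {"city": "Portland", "state": "OR"}},
    {"name": "Bob", "address": {"city": "Seattle", "state": "WA"}, "tags": ["db"]},
    {"name": "Cy"},
]

portlanders = find(people, {"address.city": "Portland"})
```

Note that the document for Cy has no address at all, and the query just skips it instead of failing.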

Who is using it?

  • Foursquare
  • bit.ly
  • CERN for collecting data from the Large Hadron Collider
  • …others…

Redis

http://redis.io/

Redis stands for Remote Dictionary Server. The capability Redis is best known for is blindingly fast speed. This speed comes from trading away some durability. At a base level Redis is a key-value store, however classifying it isn’t always straightforward.

Redis is a key-value store, often referred to as a data structure server, with values that can be strings, hashes, lists, sets and sorted sets. Redis is also stepping away from being only a key-value store and into the realm of publish-subscribe and queueing. This makes Redis one very flexible tool in the tool chest.
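To show what “data structure server plus publish-subscribe” means in practice, here’s a toy in-memory sketch in plain Python. It is not Redis and not a Redis client; `TinyStore` is a hypothetical class, and its method names are only borrowed from the Redis commands SET, GET, RPUSH, LPOP, SUBSCRIBE and PUBLISH to keep the analogy readable.

```python
from collections import defaultdict


class TinyStore:
    """Toy in-memory store (hypothetical, not Redis itself)."""

    def __init__(self):
        self.data = {}
        self.subscribers = defaultdict(list)

    # Plain string values.
    def set(self, key, value):
        self.data[key] = value

    def get(self, key):
        return self.data.get(key)

    # List values, usable as a simple queue.
    def rpush(self, key, *values):
        self.data.setdefault(key, []).extend(values)
        return len(self.data[key])

    def lpop(self, key):
        items = self.data.get(key)
        return items.pop(0) if items else None

    # Publish-subscribe: messages go to callbacks and are not stored.
    def subscribe(self, channel, callback):
        self.subscribers[channel].append(callback)

    def publish(self, channel, message):
        for callback in self.subscribers[channel]:
            callback(message)
        return len(self.subscribers[channel])


store = TinyStore()
store.set("greeting", "hello")
store.rpush("jobs", "resize", "email")
first_job = store.lpop("jobs")

received = []
store.subscribe("news", received.append)
delivered = store.publish("news", "deploy finished")
```

The same store acts as a key-value lookup, a work queue (`rpush` and `lpop`) and a message bus, which is roughly why classifying Redis isn’t straightforward.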

Who is using it?

  • Blizzard (You know, that World of Warcraft game maker)  😉
  • Craigslist
  • flickr
  • …others…

Couch

http://couchdb.apache.org/

Another Apache project, CouchDB is the idealized JSON and REST document database. It works as a document database full of key-value pairs, with the values drawn from a set number of types, including nested key-value objects.

The primary mode of querying CouchDB is to use incremental mapreduce to produce indexed views.
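Here’s a rough sketch of the idea behind an incrementally maintained mapreduce view, in plain Python. This is an assumption-laden toy, not CouchDB’s implementation or API: the `View` class and the `by_author` map function are hypothetical names of mine. The point is only that when one document changes, just that document’s rows in the index get recomputed.

```python
class View:
    """Toy incrementally maintained view (hypothetical, not CouchDB)."""

    def __init__(self, map_fn):
        self.map_fn = map_fn
        self.rows_by_doc = {}  # doc id -> list of (key, value) rows emitted

    def update(self, doc):
        # Incremental step: only this document's rows are recomputed.
        self.rows_by_doc[doc["_id"]] = list(self.map_fn(doc))

    def query(self, reduce_fn=None):
        rows = sorted(row for rows in self.rows_by_doc.values() for row in rows)
        if reduce_fn is None:
            return rows
        grouped = {}
        for key, value in rows:
            grouped.setdefault(key, []).append(value)
        return {key: reduce_fn(values) for key, values in grouped.items()}


def by_author(doc):
    # Map function: emit one (key, value) row per document with an author.
    if "author" in doc:
        yield doc["author"], 1


view = View(by_author)
view.update({"_id": "a", "author": "ann"})
view.update({"_id": "b", "author": "bob"})
view.update({"_id": "c", "author": "ann"})
counts_before = view.query(reduce_fn=sum)

# Changing one document only touches that document's rows in the index.
view.update({"_id": "c", "author": "bob"})
counts_after = view.query(reduce_fn=sum)
```

After the last update the count for ann drops to one and the count for bob rises to two, without re-mapping documents a and b.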

One other interesting characteristic about CouchDB is that it’s built with the idea of a multitude of deployment scenarios. CouchDB might be deployed to some big servers or may be a mere service running on your Android Phone or Mac OS-X Desktop.

Like many NoSQL options CouchDB is RESTful in operation and uses JSON to send data to and from clients.

The Node.js community also has an affinity for Couch, since NPM uses it and a lot of the capabilities of Couch feel native to JavaScript, from the server aspect of the database to the JSON format usage to other capabilities.

Who uses it?

  • NPM – Node Package Manager site and NPM uses CouchDB for storing and providing the packages for Node.js.

Couchbase (UPDATED January 18th)

Ok, I realized I’d neglected to add Couchbase (thus the Jan 18th update), which is an interesting open source solution built off of Membase and Couch. Membase isn’t particularly a distributed database, or even a database, but by joining it with Couch to form Couchbase they’ve turned it into a distributed database like Couch, with some specific feature set differences.

A lot of the core architecture features of Couch are available, but the combination now adds auto-sharding clusters, live/hot swappable upgrades and changes, memcached APIs, and built-in data caching.

Who uses it?

  • LinkedIn
  • Orbitz
  • Concur
  • …and others…

Neo4j

http://www.neo4j.org/

Neo4j steps away from many of the existing NoSQL databases with its use of a graph database model. It stores data as a graph, mathematically speaking, in which each piece of data relates to other data in the database. This database, of all the databases in the NoSQL and SQL world, is very whiteboard friendly.

Neo4j also has a varied deployment model, being able to deploy to a small or large device or system. It has the ability to store tens of billions of edges and nodes.
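To show why a graph model is so whiteboard friendly, here’s a minimal sketch in plain Python of nodes, named relationships and a short traversal. It is not Neo4j or its query language; the `Graph` class and the `KNOWS` and `WORKS_AT` relationship names are hypothetical and only there for illustration.

```python
from collections import defaultdict, deque


class Graph:
    """Toy directed graph with named relationships (hypothetical, not Neo4j)."""

    def __init__(self):
        self.edges = defaultdict(list)  # node -> list of (relationship, node)

    def relate(self, start, relationship, end):
        self.edges[start].append((relationship, end))

    def reachable(self, start, relationship, max_hops):
        """Breadth-first walk along one relationship type, up to max_hops."""
        seen = {start}
        found = []
        queue = deque([(start, 0)])
        while queue:
            node, hops = queue.popleft()
            if hops == max_hops:
                continue  # do not expand beyond the hop limit
            for rel, neighbour in self.edges.get(node, []):
                if rel == relationship and neighbour not in seen:
                    seen.add(neighbour)
                    found.append(neighbour)
                    queue.append((neighbour, hops + 1))
        return found


graph = Graph()
graph.relate("alice", "KNOWS", "bob")
graph.relate("bob", "KNOWS", "carol")
graph.relate("carol", "KNOWS", "dave")
graph.relate("alice", "WORKS_AT", "acme")

friends_of_friends = graph.reachable("alice", "KNOWS", max_hops=2)
```

The question “who does alice know within two hops” maps directly onto the boxes and arrows you’d draw on a whiteboard, which is the appeal of the model.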

Who is using it?

  • Accenture
  • Adobe
  • Lufthansa
  • Mozilla
  • …others…

Riak

Riak is a key-value, distributed, fault tolerant, resilient database written in Erlang. It uses the Riak Core project as a codebase for the distributed core of the system. Since, yes, I work for Basho, the makers of Riak, I’ve explained Riak further in a separate blog entry, “Riak is… A Big List of Things”. For a description of Riak’s features, check that out.

Who is using Riak?

In Summary

One of the things you’ll notice with a lot of these databases, and the NoSQL movement in general, is that they originated from companies needing to go “web scale” when RDBMSs just couldn’t handle or didn’t meet the specific requirements these companies had for their data. NoSQL is in no way a replacement for relational or SQL databases, except in the specific cases where the need is outside the capability or scope of relational databases.

Almost every NoSQL database has origins that go pretty far back, but the real impetus and push forward with the technology came about with key efforts at Google and Amazon Web Services. At Google it was with the BigTable paper and at Amazon Web Services it was with the Dynamo paper. As time moved forward, the open source community took over as the main innovator and development model around big data and the NoSQL database movement. Today the Apache Software Foundation has many of the projects under its guidance, alongside companies like Basho and 10gen.

In the last few years, many of the larger mainstays of the existing database industry have leapt onto the bandwagon. Companies like Microsoft, Dell, HP and Oracle have made many strategic and tactical moves to stay relevant with this move toward big data and NoSQL database solutions. However, the leadership is still outside of these stalwarts and in the hands of the open source community. The companies and organizations focused on that community, such as 10gen, Basho and the Apache Software Foundation, still hold much of the future of this technology in the strategic and tactical actions they take, since they’re born from and remain significant parts of the community itself.

For an even larger list of almost every known NoSQL Database in existence check out NoSQL Database .org.

Distributed Coding Prefunc: Installing QuickCheck for Great Testing

A few weeks ago I kicked off this series of “Distributed Coding Prefunc: Up and Running with Erlang” and had wanted to keep up the momentum, but as life goes I had to tackle a few other things first. But now, it’s time to get back on track with some distributed computing. Since I intend to write tests with my samples, as I often do, I decided to take a stab at QuickCheck.

Before going forward, note that there is QuickCheck for Haskell and there is a QuickCheck for Erlang. Since the point of this “Distributed Coding Prefunc” is to get started coding with Erlang from zero, I’ll be talking about the Erlang version here. This version is created by John Hughes and Koen Claessen, starting the Quviq Company in 2006.

To download QuickCheck, choose the version you intend to use. I’ve chosen the commercial license version from the download page.

At the command prompt, install QuickCheck by running Erlang and then run the install with these commands.

Launch Erlang:

[sourcecode language=”bash”]
$ erl
Erlang R15B01 (erts-5.9.1) [source] [smp:4:4] [async-threads:0] [hipe] [kernel-poll:false]

Eshell V5.9.1 (abort with ^G)
1>
[/sourcecode]

Then execute the install:

[sourcecode language=”Erlang”]
1> eqc_install:install().
[/sourcecode]

If the execution of the install displays this error, you’ll need to use sudo.

[sourcecode language=”bash”]
Installing ["pulse-1.27.7","eqc-1.27.7","eqc_mcerlang-1.27.7"].
Failed to copy pulse-1.27.7–copy returned {error,eacces}??
** exception exit: {{error,eacces},"pulse-1.27.7"}
in function eqc_install:'-copy_quickcheck/3-lc$^0/1-0-'/3 (../src/eqc_install.erl, line 63)
in call from eqc_install:install2/4 (../src/eqc_install.erl, line 44)
[/sourcecode]

Kill Erlang with a ctrl+c and restart Erlang with the sudo command.

[sourcecode language=”bash”]
$ sudo erl
[/sourcecode]

Now when you install you should see the following result or something similar. You’ll be asked whether to proceed; enter a lowercase ‘y’ to continue. It may be different for some, but when I hit uppercase ‘Y’ (I suppose I got overzealous to install QuickCheck) it finished as if I’d answered no or something else.

[sourcecode language=”bash”]
1> eqc_install:install().
Installation program for "Quviq QuickCheck" version 1.27.7.
Installing in directory /usr/local/lib/erlang/lib.
This will delete conflicting versions of QuickCheck, namely
[]
Proceed? y
Installing ["pulse-1.27.7","eqc-1.27.7","eqc_mcerlang-1.27.7"].
Quviq QuickCheck is installed successfully.
Looking in "/Users/adronhall"… .emacs not found
Could not find your .emacs file!
Try install("path-to-emacs-file") or install(new_emacs).
Bookmark the documentation at /usr/local/lib/erlang/lib/eqc-1.27.7/doc/index.html.
ok
[/sourcecode]

You’ll note above, I don’t currently have emacs installed. The reason it looks for emacs is because QuickCheck has templates/ops mode for emacs. So if you use emacs you’re in luck. I on the other hand, don’t, so I’ll just be using this from wherever I’m using it.

In addition to the lack of emacs, another important thing to note from the message is the link to documentation. Once you have this link, open it up and check out the docs. They’re broken out into easily readable topics and are a good place to do initial reference checking while you’re writing up your specs.

If you have a license, it is important to note that if you’ve used sudo for your installation you’ll need to kill your running Erlang session and start it anew without sudo. Otherwise you’ll run into issues down the road trying to use the libs (unless of course you want to go hack on your permissions manually). Once you’re ready to register the software it’s simply one command, where the x’s stand in for your license key.

[sourcecode language=”bash”]
eqc:registration("xxxxxxxxxxxx").
[/sourcecode]

Alright, next time we’re on to next steps…

Riak is… A Whole Big List of Things

What is Riak? Who builds it? Who maintains it? Can I download it? How does it work? What are the features?

Here’s the start of answers to these questions and more.

First, the basic high level description:

“Riak is an open source, highly scalable, fault-tolerant distributed database.”

That’s the first line you’ll read when checking out the product via the Basho product link. It provides good information, but here I’m going to add more to the definition without the need to dig around yourself. Maybe I can save you some time & provide some links directly to solid information in the docs. Kind of a “Cliff Notes” of Riak. Let’s take this feature by feature which will in turn get us to a definitive definition of what exactly Riak is.

Riak is Open Source.

Riak is built and contributed to by the community, with Basho being the steward and an active member that extends, builds and provides support for additional products. The avenues to reach the Riak open source community members are pretty straightforward, following known channels of communication. Hit us up on the email list, feel free to contribute and ask questions via the Basho organization on Github, read the Basho Riak blog and the weekly recap, or jump into the IRC chat room #riak on freenode. Oh, and there’s a twitter feed, @basho.

So what exactly does this get you, when you become a user or contributor of Riak? The entire community is behind you, will help you get started using Riak and provide help whenever you run into problems. If you want SLAs or 24 hour support Basho can provide this for you. But for bugs, issues, queries, searching and all sorts of other related development questions there is the community. An open source community like this is passionate, which means you’ll have support like no closed source company will ever provide you, and absolutely no closed source product’s community will provide you. We’re talking about a different level of interest, passion and levels of personal involvement.

Riak is a key value based database store.

Riak is a key value store. What exactly is a key value store? It’s pretty simple and you’re probably already familiar with the idea. A key value pair is made up of two pieces of data: the first, the key, is the identifier for the second, the value. This gives a system or developer using key value storage a schema-less way of working with data.

Riak is designed for highly distributed environments.

This type of distribution isn’t the “we put one database over here and one database over there and you gotta figure out how they work together” type. So this isn’t some of that oddball pretend stuff Oracle keeps foisting on people. This is honest to goodness distribution, the sort where one node goes down and you don’t blink, you don’t stop eating dinner, you don’t sweat it. You just continue onward with life knowing full well that you’ll just spool up another node when you need to.

Riak is master-less, with no single point of failure.

This is one of those self-explanatory features. But what does a master-less system provide us? One thing is no single point of failure. Because all nodes can act autonomously to work around the loss of one or more nodes, it also adds to the high availability of the system.

Riak is fault tolerant, like a disk drive you wish was real.

Ever have a backup disk drive? What? You don’t have one of those? Ugh. Ok, so imagine you had a backup disk drive, and that it had an unfortunately high failure rate. Why? Because, you know, they have an oddly high failure rate. If you do backups like good practices dictate, eventually you’ll end up with some dead drives.

RAID, both software and hardware, is built specifically to deal with this type of failure. A distributed system like Riak bumps the level of abstraction above software or hardware RAID, enabling another level of even greater fault tolerance. This doesn’t remove the relevance of RAID, but with a multi-node system like Riak you can easily remove nodes and swap them out as needed, keeping costs down by using simple drives in simple machines. If you want to, you could indeed get higher I/O machines and faster drives, but it isn’t necessary to ensure fault tolerance in a Riak database system.

Riak scales, with hot swappable nodes enabling zero downtime.

The ability to commit hot swappable changes while in the midst of operating starts at a very low level for Riak. Erlang, the language used to build Riak, has the ability to change pieces of a running application in real time built into the precepts of the language. This provides, at the core, the inherent capability to change out systems and, by way of the architectural design, the ability for nodes in Riak to be changed out simply by removing them from a cluster ring. Once that is done it is just as simple to add another node or nodes back into the cluster ring, enabling a number of additional practices around upgrades, hot swaps for failures, or even version changes.
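To give a feel for why pulling a node out of a ring is cheap, here’s a rough sketch of generic consistent hashing in plain Python. I’m assuming a textbook consistent hash ring here, not Riak’s actual partition claim logic; the `Ring` class, the `ring_position` helper and the `vnodes_per_node` parameter are hypothetical names for illustration.

```python
import bisect
import hashlib


def ring_position(text):
    """Hash a string onto a large integer ring (SHA-1, 160 bits)."""
    return int(hashlib.sha1(text.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Toy consistent hash ring (hypothetical, not Riak's implementation)."""

    def __init__(self, vnodes_per_node=64):
        self.vnodes_per_node = vnodes_per_node
        self.positions = []  # sorted positions on the ring
        self.owners = {}     # position -> node name

    def add_node(self, node):
        # Each node claims several positions so load spreads out evenly.
        for i in range(self.vnodes_per_node):
            position = ring_position(f"{node}#{i}")
            self.owners[position] = node
            bisect.insort(self.positions, position)

    def remove_node(self, node):
        self.positions = [p for p in self.positions if self.owners[p] != node]
        self.owners = {p: n for p, n in self.owners.items() if n != node}

    def node_for(self, key):
        if not self.positions:
            raise LookupError("ring has no nodes")
        # First position clockwise from the key, wrapping around at the end.
        index = bisect.bisect(self.positions, ring_position(key))
        return self.owners[self.positions[index % len(self.positions)]]


ring = Ring()
for name in ("node1", "node2", "node3"):
    ring.add_node(name)

keys = [f"user:{n}" for n in range(200)]
before = {key: ring.node_for(key) for key in keys}

ring.remove_node("node2")
after = {key: ring.node_for(key) for key in keys}
```

Only the keys that lived on node2 move to a new owner; every key that was on node1 or node3 stays exactly where it was, so the rest of the cluster carries on undisturbed.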

Riak can be used as a building block for distributed (aka cloud) infrastructure.

The concepts and contractual components that Riak is built on are available for use via the Riak Core project. If you’re looking into starting a project around distributed systems this is a great place to get started. Also be sure to do a general web search (i.e. Google) for “riak core” and you’ll find lots of material around the project, and projects people have started with it as a base. I’m currently in the process of putting together one of these projects myself.

Riak is eventually consistent.

The term eventually consistent is becoming more and more commonplace. Riak is one of the many distributed systems that use the concept of eventual consistency. The idea is that even though all nodes may not immediately receive a new or updated piece of data, they eventually will receive that update and be synchronized with the cluster ring of nodes. This goes back to the equality of nodes and the master-less design, which provide availability and other capabilities at the cost of some trade-off in the synchronization of data.
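Here’s a minimal sketch of eventual consistency in plain Python, assuming a deliberately simplified “newest timestamp wins” rule. Real systems, Riak included, use more careful mechanisms to reconcile conflicting writes, so treat the `Replica` class and its `sync_with` method as hypothetical teaching props and not as how Riak does it.

```python
class Replica:
    """Toy replica (hypothetical): resolves conflicts by newest timestamp."""

    def __init__(self, name):
        self.name = name
        self.data = {}  # key -> (timestamp, value)

    def write(self, key, value, timestamp):
        current = self.data.get(key)
        # Keep the incoming write only if it is newer than what we hold.
        if current is None or timestamp > current[0]:
            self.data[key] = (timestamp, value)

    def read(self, key):
        entry = self.data.get(key)
        return None if entry is None else entry[1]

    def sync_with(self, other):
        # Exchange entries in both directions; newer timestamps win.
        for key in set(self.data) | set(other.data):
            for source, target in ((self, other), (other, self)):
                if key in source.data:
                    timestamp, value = source.data[key]
                    target.write(key, value, timestamp)


a, b, c = Replica("a"), Replica("b"), Replica("c")

a.write("cart", "book", timestamp=1)
stale = b.read("cart")  # b has not heard about the write yet

a.sync_with(b)
b.sync_with(c)
converged = [replica.read("cart") for replica in (a, b, c)]
```

Right after the write, replica b still answers with nothing for the key; once the replicas have exchanged data, all three return the same value. That window in between is the trade-off for not having a master.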

In Summary

That’s round one for the many features of Riak. I’ll be adding more in the future, but for now this is a good starting point in knowing about and knowing what Riak is, what it can be used for, and how it might help you extend, maintain or invent the next great piece of technology.

New Relic, The King Makers, MS Open Tech, Riak VMs and Life Gets Easier Today

Today Microsoft released, in partnership with a number of companies including Basho, Hupstream and Bitnami, the VM Depot. I’ve always followed Bitnami, so it’s really cool to see their VM releases for Jenkins (CI build server), WordPress, the Ruby 1.9.3 stack, Node.js and about everything you can imagine out there alongside our Basho Riak CentOS image. If you want a great way to get kick started with Riak and you’re set up with Windows Azure, now there is an even easier way to get rolling.

Over on the Basho blog we’ve announced the MS Open Tech and Basho Collaboration. I won’t repeat what was stated there, but want to point out two important things:

  1. Once you get a Riak image going, remember there’s the whole community and the Basho team itself that is there to help you get things rolling via the mail list. If you’re looking for answers, you’ll be able to get them there. Even if you get everything running smoothly, join in anyway and at least just lurk. 🙂
  2. The RTFM value factor is absolutely huge for Riak. Basho has a superb documentation site here. So definitely, when jumping into or researching Riak as software you may want to build on, use for your distributed systems or the Riak Key Value Databases, check out the documentation. Super easy to find things, super easy to read, and really easy to get going with.

So give Riak a try on Windows Azure via the VM Depot. It gets easier by the day, and gives you even more data storage options, distribution capabilities and high availability that is hard to imagine.

New Relic & The Rise of the New Kingmakers

In other news, my good friends at New Relic, in partnership with Redmonk analyst Stephen O’Grady, have released a book he’s written titled The New Kingmakers: How Developers Conquered the World. You may know New Relic as the huge developer advocates that they are, with the great analytics tools they provide. Either way, give it a look and read the book. It’s not a giant thousand page tome, so it just takes a nice lunch break and you’ll get the pleasure of flipping the pages of the book Stephen has put together. You might have read the blog entry that started the whole “Kingmakers” statement; if you haven’t, give that a read first.

I personally love the statement, and have used it a few times myself. In relation to the saying and the book, I’ll have a short review and more to say in the very near future. Until then…

Cheers, enjoy the read, the virtual images and happy hacking.

Node PDX => Possible Speakers?

I started writing this blog entry about a month ago. I had not run it by my friend Troy, and I wanted to make sure I didn’t jump the gun. However, I’d sort of let this entry get a little dusty, and Troy (@thoward37 on twitter) inadvertently brought up the topic and I’ve sprung this

So here it is, consider this blog article my Node PDX 2013 get the ball rolling article. Last year we (Troy, I and a team of volunteers, thanks everybody, ya’ll ROCK!) managed to put together, what we’ve been told was a totally kick ass conference, for zero cost to the conference goers, with great speakers, live video feeds and post conference videos and we did it all in about 3 weeks (maybe it was 3 weeks and 2 days). It’s how we roll, hard core, serious and dedicated to what we create and what we push forward for the community.

This year isn’t going to be an exception, except that we’re going to give ourselves a bit more than 3 weeks. We haven’t set a date for the conference yet, but we’ll be doing that soon. What I want to throw out there is – who should we bring in to speak this year? I have a few people I’ve seen speak, or that I know I’d like to hear speak, and I wanted to mention them right now. I apologize if this is forward, because I haven’t even spoken to them about it. So this blog entry is a complete surprise to them – so don’t set any expectations yet! Additionally, even though I want to see these individuals speak, that doesn’t mean I’m circumventing the process we used last year. Speakers will indeed have to make a git pull request against our Node PDX Github repo just like we did before.  🙂

With that said, hopefully we can get dates and times set so that these more excellent individuals and a host of others can come and speak this year, enjoy some Portland, have a beer/coffee/donut/yerba mate/ or two on us and have a great time!

Kelly Sommers – I’d love to hear Kelly come into PDX and give us the lowdown on some big data, distributed systems & JavaScripty libs that help us tie all that together beautifully.
Blog: http://kellabyte.com/
Twitter: https://twitter.com/kellabyte

Max Ogden – This guy is hard core, seriously, coding his way around the world and getting people involved. He’s been at more civic hacks and helped connect cities in more ways than I knew one could. It’d be awesome if he could swing into town and inform us about some of those connections. (Plus, maybe we can get some proper metal thrashin’ & a guest appearance for the sound track on the upcoming hard core coder show.)
Blog: http://maxogden.com/
Twitter: https://twitter.com/maxogden
Github: https://github.com/maxogden

Chris Williams
Twitter: Chris got off of Twitter, yet the account still exists, to focus on the things that are important – namely real life, family and people. I totally support him in his efforts. For more about Chris, check out and follow the JSConf circuit. You’ll find him organizing the hell out of the conferences and making them awesome while keeping the negativity on the run!

Angelina Fabbro – I posted some jsconf videos a while back and really enjoyed Angelina’s talk. Today via twitter, thanks to @thoward37, I now know she also likes some Club Mate.
Twitter: https://twitter.com/angelinamagnum
Site: http://realityhacking.net/

So anyway, like I was saying, no promises here. @thoward37 and I (@adron) will be kicking off the call for proposals real soon. So as always, keep reading, subscribe, and I’ll have more news soon.