Deploycon, PaaS & the pending data tier gravity fallout…

For a quick recap of last year's Deploycon & related talks, check out my "Day #3 => DeployCon && Enterprise && Data Gravity" entry from last year.

PaaS Systems aren't always effectively distributed. Heroku has fallen over every time east-1 has gone down at AWS. Not that I'm saying they've done badly, just pointing that out. With Cloud Foundry, there are several key SPOFs (Single Points of Failure), and with all PaaS Systems the data tier is often the neglected part of the system. I've been wanting to write about this for a few months now and Deploycon has lit a fire for me to do just that.

Deploycon – “Platform Services and Developer Expectations” **

I'm on a panel at Deploycon titled "Platform Services and Developer Expectations" and this leads right back around to that. This SPOF issue is concerning to me as PaaS providers talk up their offerings more and more with little light actually shone on the issue. In some ways each is moving away from their respective SPOFs, but overall they're still pretty prevalent throughout. For security, each has a non-distributed database, which still technically needs to be backed up – no clear replication or other mechanisms set up to ensure data integrity in a failure situation. Of course, the huge saving grace with a PaaS is that if the overall system goes down or a SPOF blows up, all the existing deployed applications will generally continue to run. Unless of course the routing and networking are also SPOFs. This is the largest glaring concern with PaaS Systems that I see today.

One of the other things about PaaS that has always led to a ton of questions is "what about my PostgreSQL/MySQL/Riak/MongoDB/database thing and how do I do X, Y, Z with it to ensure scalability in my PaaS?" In almost every case it ends with a simple and unfortunate answer: "…when it comes to data, a PaaS doesn't really do a damn thing for ya…" This is obviously not very helpful. The entire reason to put a PaaS into place is to simplify life; the sad fact is that it barely does a thing for the data tier.

Now, hold on a second before you start screaming at me about "but a PaaS does X, Y and Z and isn't even supposed to touch that aspect of things…" Let me elaborate a bit more. The panel title at Deploycon says "…Developer Expectations", and when things get simplified the way a PaaS does it, developers assume that if it does all this fancy magic for an application it ought to simplify the data side of things too! Right? Well no, and it isn't going to for the foreseeable future. But no matter what, that doesn't change the fact that developers often have that expectation.

Now, I could write at length about all the reasons that PaaS doesn't really do anything for the data tier. I could wax poetic about how a distributed database (re: Riak, Cassandra, etc.) just doesn't lend itself to a cookie cutter approach to deployment under a PaaS, or how an RDBMS has umpteen different configurations for stability, scaling, hot swappable services, and other such complexities around the data tier. But instead I'm going to skip all that, maybe cover some of it another day, and jump right into some of the things that are actually moving forward to fill this gap.

BOSH, Cloud Foundry, OpenShift & fixing the data tier…

The most obvious reason there isn't a simple turn key solution to the data side of things with a PaaS ecosystem is that data is complex and extremely diverse. There are distributed key/value stores (Riak, Cassandra), sort of kind of distributed databases (Mongo), graph databases (Neo4j), the age old RDBMS (DB2, SQL Server, Oracle's stuff, etc.) and the million solutions around that, and there are insanely fast in-memory key/value databases like Redis. Expanding just slightly you have software that works around these systems such as Hadoop & Riak CS, and the list goes on. All of it is focused on the data tier and maintaining one, two or some form of the three points around the CAP Theorem (http://en.wikipedia.org/wiki/CAP_theorem), atomicity and other key capabilities.

All of the PaaS Systems, public and private, often have some sort of plug-in style architecture for data. Whether it is Apprenda, which is closed to community and closed source, or an ongoing open-to-community PaaS like OpenShift or Cloud Foundry, things still fall almost entirely to the developers or database team to build an architecture around the data. When looking at solutions to simplify data in PaaS Systems, with the closed source solutions we have no idea what they're up to in this regard. With the ones that are open source, or in large part public and involved in the community PaaSes, like EngineYard, Heroku, Cloudbees and others, we can really see the directions and efforts around creating real PaaS style solutions to the data tier problem.

BOSH, Vagrant, etc…  One of the best solutions I've seen so far is the ability of BOSH, which was created by the Cloud Foundry team while at VMware, to spool up an environment that includes such things as a Riak cluster (or other cluster). Currently Brian McClain & Dr Nic have worked to put together such BOSH + Vagrant scripts & get things rolling. I myself will be spending some considerable time on just that. Beyond that, this is a good start in enabling data tier back ends.

How do we close the gap between absurdly simple application deployment and the still arduous and difficult data tier deployment? For the next several years I think we'll have cumbersome deployment practices around the data tier. There won't be anything as elegantly simple as Cloud Foundry's single line deployment or AppFog's one click deployment of a web application. The best we can do at this time is to streamline around pieces and architectures, and at least get them into a kind of simple three step deployment.

Please drop a comment or two on how you think we might simplify the data side of the PaaS toolchain. Drop a few tweets in the twitterverse too; I'm sure that'll be exploding as usual. I'm @adron, ping me.

Cheers, happy data architecting.

** the Deploycon panel will be at 4:30pm in Santa Clara on April 2nd. Come check it out.

ORMs Suck, I’m Asking & I’m Telling

Here's a thing that's come up already. ORMs, or Object Relational Mappers, are an RDBMS based thing for devs that want, in essence, a statically typed object to deal with when writing code (yes, I know there's a ton of other things an ORM can do or be used for, but I'm going with a simple explanation here; here's more info). For some situations I can see where that's a productivity booster, but in other situations it is a very broken ideal, especially when it comes to highly complex queries, performance and a host of other functionality in the higher end demands of Enterprise or *web scale* type systems.

Two things have brought up this question in a new light:

  • One: Why would you use an ORM in a dynamic language like JavaScript, that has a simple native format like JSON? It tends to make far less sense to do this. What’s your take? Like the idea? Hate the idea?
  • Two: In a schemaless database, forcing an ORM on something that is built specifically to not have this feature set (for a number of reasons) seems to break the design ideas of the system. Whether it is Mongo, Riak, Cassandra or whatever, why would you actually want, or not want, to use an ORM? I've got my own thoughts but would like to know what people think about this notion.
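To make question one concrete, here's a tiny sketch (plain Node.js, hypothetical data, no particular ORM library) of why an ORM buys so little in JavaScript: a document-store record is already a native object, so the "mapping" an ORM would do is effectively a no-op.

```javascript
// A record as it would live in a JSON document store; all names are made up.
const user = { id: "users/1", name: "Adron", tags: ["paas", "data"] };

// Persisting and rehydrating is just serialization, no mapping layer needed.
const stored = JSON.stringify(user);   // what goes over the wire
const rehydrated = JSON.parse(stored); // what comes back, usable as-is

console.log(rehydrated.name);
```

Compare that to a statically typed language, where the mapper is doing real work translating rows into class instances.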

Please comment, I'll be diving into any feedback too.

Cloud Computing and Distributed Computing, Something is Broken

First off, I’m going to start off with some definitions to clarify things for this conversation.

Cloud Computing, in general, has been perverted to mean almost anything available for sale today in technology. It's rhetorically stupid, but we all still use the term to some degree. Going back to cloud computing at the core, we're talking about virtually managed and often distributed systems: distributed geographically with no single real point of failure. Almost every cloud computing enabled site or system these days is a complete lie when it comes to geographically dispersed, cloned nodes with no real point of failure, resilient to outages and related problems.

Distributed Systems, on the other hand, is a term that is not contorted or misused, at least at this moment in time. There's always the possibility that the media completely botches it up later. But right now, distributed computing, distributed databases and distributed systems generally refer to what cloud computing used to sort of mean. So here are some specific definitions of distributed technology.

  • Distributed computing refers to the use of distributed systems to solve computational problems. In distributed computing, a problem is divided into many tasks, each of which is solved by one or more computers, which communicate autonomously. ref: http://en.wikipedia.org/wiki/Distributed_computing
  • A distributed database is a database in which storage devices are not all attached to a common processing unit such as the CPU. It may be stored in multiple computers located in the same physical location, or may be dispersed over a network of interconnected computers. Unlike parallel systems, in which the processors are tightly coupled and constitute a single database system, a distributed database system consists of loosely coupled sites that share no physical components. Collections of data (e.g. in a database) can be distributed across multiple physical locations. A distributed database can reside on network servers on the Internet, on corporate intranets or extranets, or on other company networks. The replication and distribution of databases improves database performance at end-user worksites. To ensure that the distributed databases are up to date and current, there are two processes: replication and duplication. Replication involves using specialized software that looks for changes in the distributed database. Once the changes have been identified, the replication process makes all the databases look the same. The replication process can be very complex and time consuming depending on the size and number of the distributed databases. ref: http://en.wikipedia.org/wiki/Distributed_database
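The replication process in that second definition can be sketched as a toy model, assuming a single primary that pushes each detected change to every replica (all names hypothetical; real replication engines also handle ordering, conflicts and failures):

```javascript
// Toy model of replication: a change lands on the primary and is pushed out
// so that all the copies "look the same". Not any real database's API.
const primary = new Map();
const replicas = [new Map(), new Map()];

function write(key, value) {
  primary.set(key, value);     // the change is made on the primary
  for (const replica of replicas) {
    replica.set(key, value);   // replication pushes the change to each copy
  }
}

write("users/1", { name: "Adron" });
```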

So these definitions provide a basis for my next topic and my frustration with the current state of "cloud computing" providers. To summarize, the problem I have is simply that almost every provider continues to perpetuate legacy client-to-server architectures, or server heavy setups with an RDBMS or other single point of failure database on the back end. This misses the advantages of cloud computing in so many ways: the high availability, the resiliency, the performance of scaling by adding nodes versus other vastly more expensive means. One of those standard means is throwing away one machine to get a bigger more powerful machine, which has distinct and clear limitations. Let's look at some specific examples that are encouraged by the providers.

The Failures in Cloud Computing & Distributed Systems

SharePoint – I'm not picking on SharePoint in this particular scenario because of its notoriously poor user experience or worse developer experience; I'm calling it out for a single point of failure architecture. SharePoint doesn't rely on a distributed database, and it can't be installed in any easy way on multiple web application servers to share load. Even if it could be, the relational database holds very distinct and specific limitations that cannot be overcome. In large environments it must be sharded at an application level, so in large corporations with heavy usage the system runs into all sorts of complex bottlenecks and performance nightmares.

Standard RDBMS + Web App – This is a very common configuration which, if kept on an RDBMS, dramatically raises data storage cost for any site that needs to scale. The largest problem with scaling is that an RDBMS is set up for vertical scale improvements, in other words the "buy a bigger machine with more resources" solution. This is not very ideal if you want to actually maintain high availability. In actuality, having an RDBMS alone as the primary data repository is probably one of the worst existing, and continually encouraged, architectural decisions made for any website that may one day need to scale.
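For contrast, here's a minimal sketch of the horizontal alternative: spreading keys across whatever nodes exist, so capacity grows by adding machines instead of buying a bigger one. All names are hypothetical, and real systems use consistent hashing plus rebalancing and replication rather than this naive modulo scheme:

```javascript
// Naive hash-based sharding: each key deterministically maps to one node.
function hashKey(key) {
  let hash = 0;
  for (const ch of key) hash = (hash * 31 + ch.charCodeAt(0)) >>> 0;
  return hash;
}

function nodeFor(key, nodes) {
  return nodes[hashKey(key) % nodes.length]; // pick the node holding this key
}

const threeNodes = ["node-a", "node-b", "node-c"];
const fourNodes = [...threeNodes, "node-d"]; // scaling out = adding a node

// Caveat: with plain modulo, adding a node remaps most keys, which is
// exactly why real distributed stores use consistent hashing instead.
```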

CRMs and Other ERP, Single Repo Mail Servers – This list is huge. Whether it is Exchange attached to a proprietary data store, stuck on top of Oracle, or glued to some sharded database or data store of sorts, it is another bad use of distributed computing resources. CRMs almost always sit on top of some relational database that is geographically bound. Another huge problem is ERP tooling in general. These frequently sit on top of pricey, proprietary databases like Oracle or SQL Server, with no clear scaling plan except "wait for it to process" types of situations. Many large corporations end up having to work around these problems with massively expensive customizations of these products, especially when your employee count is over 50k, and there are more than a few of those out there.

The Future

So don't be fooled into thinking you're getting some "cloud solution" when the product uses a traditionally designed architecture. The solutions of the future will be distributed, will utilize the computing grid better, and will prospectively be cheaper in many ways. Whatever the case, the marketing push for the cloud has become worthless, and if you're looking for the real power in computing these days you'll look into distributed systems and how those can work for you. There's massive potential in properly distributing systems and building out applications accordingly; much of that potential has only begun to be tapped.

Raven DB, A Kick Starter using Tier 3 IaaS

I’m putting together a pretty sweet little application. It does some basic things that are slowly but surely expanding to cover some interesting distributed cloud computing business cases. But today I’m going to dive into my Raven DB experience. The idea is that Raven DB will act as the data repository for a set of API web services (that seems kind of obvious, I state the obvious sometimes).

The first thing we need is a server instance to get our initial node up and running. You can use whatever service, virtualization tools, or a physical server if you want. I’m going to use Tier 3’s Services in my example, so just extrapolate for your own situation.

First I’ve logged in to the Tier 3 Control Site and am going to create a server instance.

Building the Tier 3 Node for Raven DB

Creating the Server Instance (Click for full size image)

Next step is to assign resources. Since this is just a single Raven DB node, and I won’t have it under heavy load, I’ve minimized the resources. This is more of a developers install, but it could easily be a production deploy, just allocate more resources as needed. Also note, I’ve added 50 GB of storage to this particular instance.

Setting Resources (Click for full size image)

Now that we’ve set these, click next and on the next screen click on server task. Here add the public IP option and select the following services to open their ports.

Setting up a Public IP and the respective ports/services (Click for full size image)

The task will display once added as an item on the create server view. Once that is done, click create server so the server build out will start.

Creating the Server (Click for full size image)

Now log in with RDP to start setting up the server in preparation of loading Raven DB. The first thing you’ll want to do is go ahead and get Windows Update turned on. My preference is to just turn it on and get every update that is available. Once that is done, make sure to get the latest .NET 4 download from Windows Update too.

Getting Windows Update Turned On (Click for full size image)

Once all of the updates are finished and .NET 4 is installed, we'll get down to the business of getting Raven DB installed. In this specific example I'll be installing Raven DB as a Windows Service; it can, however, be installed under IIS, so there are other options depending on how you need it installed.

Installing Raven DB

To get the software to install, navigate over to the Raven DB site at http://ravendb.net/ from the new instance we’ve just spun up. Click on the Download button and you’ll find the latest build over on the right hand side. Click to download the latest software package to a preferred location on the system.

Raven DB – Open Source 2nd Generation Document DB (Click to navigate to the site)

Once you've downloaded it (I've put my download in the root of the R:\ partition I created), unzip it into a directory. I've just unzipped it into R:\ to make the paths easy to find; feel free to put it anywhere you would prefer. In our Tier 3 environment the R drive is on a higher speed, higher IOPS drive system, so its capabilities exceed your standard EBS/AMI or S3 style storage mechanisms.

Saving the Raven DB Download (Click for full size image)
Saving to R:/ (Click for full size image)

At this point, open a command prompt to install Raven DB as a service. Navigate to the drive and folder location you've saved the file to. Below I've displayed a list of the folders and files in the structure.

CLI actions (click for full size image)

Once you're in the path of the Raven.Server.exe file, run it with the /install switch to get Raven DB running as a Windows Service.
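For reference, the whole exchange at the command prompt amounts to something like this (Windows command prompt; substitute the folder you actually unzipped into):

```
R:\> cd <folder containing Raven.Server.exe>
R:\<folder>> Raven.Server.exe /install
```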

Raven DB Installation results (click for full size image)

To verify that it is up and running (which, if you've gotten these results, you can rest assured it is, but I always like to just see the services icon) check out the Services MMC.

Launching services (Click for full size image)

There it is running…

Now, you're not done yet. There are a few other things you may want to take note of to be sure you're up and running in every way you need to be.

The management and HTTP transport for Raven DB is done on port 8080, so you'll have to open that port if you want to connect to the services of the database externally. On Windows, open up the Windows Firewall. Right click on the Inbound Rules and click Add Rule.

Select Port (Click for full size image)
Select the Raven.Server.exe (click to see full size)
Inbound Rule (Click for full size image)
Open up however needed. (Click for full size image)
Public Private etc. (Click for full size image)

Enter a name and description on the next wizard dialog screen and click on Finish.
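If you'd rather skip the wizard, the same inbound rule can be created from an elevated command prompt with netsh (the rule name here is arbitrary):

```
netsh advfirewall firewall add rule name="Raven DB 8080" dir=in action=allow protocol=TCP localport=8080
```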

Displayed active firewall rule (Click for full size image)

Now if you navigate to the IP of the instance with port 8080 you’ll be able to load the management portal for Raven DB and verify it is running and you have remote access.

Raven DB Management Screen (Click for full size image)

At this point, if you’d like more evidence of success, click on the “Create Sample Data” button and the management screen will generate some data.

Raven DB Management console with data (Click for full size image)

At this point you have a live Raven DB instance up and running in Tier 3. The next step is to break out and add nodes for better data integrity, etc.

Summary

In this write up I've shown a short how-to on installing and getting Raven DB ready for use on Windows Server 2008 in Tier 3's Enterprise Cloud environment. In the very near future I'll broach the topics of big data with Raven DB and other databases like Riak, and their usage in a cloud environment like Tier 3. Thanks for reading, cheers!

Going Hard Core: VMware's Cloud Foundry Forks, Uhuru & Iron Foundry Review

Back in December, Uhuru Software and Tier 3 released two different forks of Cloud Foundry that enabled .NET support. I wasn't sure which I wanted to use, since I had some serious Cloud Foundry work I was about to dive into, so I've picked them apart to determine how each works. This is what I've found so far.

Uhuru

Iron Foundry

That covers the basic links to the downloads, community, and other points of presence; now it is time to dig into some of the differences I've found. First though, I got a good environment set up to test each of the forks, from within the same Cloud Foundry Environment! So this is how I've set this up…

Setting up the Virtual Machines w/ VMware Fusion

I suspect you could do this with some other virtualization software, but VMware is probably the easiest to use and set up on OS X & Windows. I haven't tried this on Linux, so there's another space where I'd have to give it a go. Using ESX I suspect this would also be extremely easy to set up. It's up to you, but I'm doing all of this with VMware Fusion. The environment I'm using for this comparison consists of the following virtual images:

Micro Cloud Foundry Instances

These instances were easy, I just downloaded them from the Cloud Foundry site on the Micro Cloud Foundry Download Page. The simple configuration is outlined in "Micro Cloud Foundry Installation & Setup".

Iron Foundry Instances

For this, I downloaded the available VM on the Iron Foundry Site here.

Uhuru Instances

I setup the Uhuru Instances using the instructions available from Uhuru Software here.

Setting up Some Controllers

So the first thing I did was dive into setting up a controller, or actually two, because I wanted to have an Iron Foundry Environment and a Uhuru Software Environment. After that I'd try to mix and match them and figure out differences or conflicts. The instructions listed under "Uhuru Instances" above have information regarding setup of a controller for the Uhuru Software Environment, which is what I followed. It is also a good idea to get set up with PuTTY or ready with SSH for usage of Cloud Foundry, Uhuru Software, and Iron Foundry.