Join the Apollo Beta for FREE! Help the Databass!

Hello to all the data curious, database lovers, and sciency datamungers! I have a small favor to ask of you all. At DataStax we just opened up our Apollo service (i.e. “Apache Cassandra as a Service”, our DBaaS offering) and I’m looking for people who want to test drive the database! Now, you don’t have to actually tell me you’re using it or anything, but I’d love to know if you are. Maybe we could even chat about your experience using it.

To get started:

Sign up here.
Create a database here.
Pick a driver here. [C#/F#, Node.js/JavaScript, Java, C++, and Python] – I added F# cuz ya know, that’s how F# works and all, you just use the C# driver and BOOM, you’ve got F# access!!
Write CQL and execute it against the database!
Profit!

Alright, the profit step is where you let me know what works for you and what doesn’t. Feel free to comment here, ping me via Twitter @adron, use the response form here, or reach me however you prefer to message. I’d be super stoked to chat!

Getting Started Specifics

To create a database, once you’ve got an account, just navigate to https://apollo.datastax.com/createDatabase and you’ll get prompted with the following screen.

Currently, during the beta, we have AWS as the provider option, and you can choose between the Developer, Startup, Standard, and Enterprise tiers, each offering various configurations and prospective future SLAs and such.

Once you have the database name, keyspace, user name, and your password set, click on Launch Database and the spin up of the multi-node database will begin. You’ll be greeted with a message notifying you that it’ll take a little bit of time for the database to spin up and that an email will be sent once it is done. Enjoy a coffee in the meantime.

Once the database spins up there are two key sections on the database page. First, there are the connection details. They’re located in the bottom left of the database page.

If you click on “Learn How” you’ll be linked directly to the docs pages, which have multiple examples of how to get connected to the database you’ve just created. You can also reset your password here and retrieve the security bundle (it’s a tar/zip file) that your applications will need to authenticate with.
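For a rough idea of what connecting with that security bundle looks like from Python, here is a minimal sketch. It assumes the DataStax Python driver (`pip install cassandra-driver`, version 3.20 or later, which is where the `cloud` option showed up). The bundle path, user name, password, and keyspace are placeholders I made up, and the helper names are mine, so treat the docs behind “Learn How” as the real reference.

```python
from pathlib import Path


def build_cloud_config(bundle_path: str) -> dict:
    """Return the `cloud` argument the DataStax Python driver accepts."""
    return {"secure_connect_bundle": str(Path(bundle_path).expanduser())}


def connect(bundle_path: str, username: str, password: str, keyspace: str):
    """Open a session against the hosted database (hypothetical helper)."""
    # Imported lazily so build_cloud_config works without the driver installed.
    from cassandra.auth import PlainTextAuthProvider
    from cassandra.cluster import Cluster

    cluster = Cluster(
        cloud=build_cloud_config(bundle_path),
        auth_provider=PlainTextAuthProvider(username, password),
    )
    return cluster.connect(keyspace)


# Usage, with placeholder values (swap in the ones you picked at creation time):
#   session = connect("~/secure-connect-mydb.zip", "my_user", "my_password", "my_keyspace")
#   row = session.execute("SELECT release_version FROM system.local").one()
#   print(row.release_version)
```

The bundle carries the connection and certificate details, which is why no host names or ports appear anywhere in the sketch.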

The other part that can be really helpful, especially as you do any development or testing with your database, is the Grafana dashboard. It’s on the Health tab of the database page.

A trick that I used to get an easier, full screen view of all the metrics is to inspect the page right at the metrics. Within that markup you’ll find the iframe, and its source gives you the link that goes specifically to the Grafana metrics. They look pretty nice broken out of frame! As you work through queries and such keep an eye on this for extra insight.
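If you’d rather not dig through the inspector by hand, the same trick can be sketched with Python’s standard library. This assumes you’ve copied the page markup out of the browser’s inspector (if the iframe is injected by JavaScript, a plain “view source” may not contain it). The sample markup and URL below are made up for illustration.

```python
from html.parser import HTMLParser


class IframeSrcCollector(HTMLParser):
    """Collect the src attribute of every <iframe> in an HTML document."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser hands over tag names already lowercased.
        if tag == "iframe":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def find_iframe_sources(html: str) -> list:
    collector = IframeSrcCollector()
    collector.feed(html)
    return collector.sources


# Made-up markup standing in for what you copied from the Health tab.
sample = '<div class="metrics"><iframe src="https://example.invalid/grafana/d/abc"></iframe></div>'
print(find_iframe_sources(sample))
```

Open whichever link comes back in its own tab and you get the dashboard without the surrounding page.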

Any other thoughts, contemplations, or otherwise, do get in touch!

Jonathan Ellis talks about Five Lessons in Distributed Databases

Notes on the talk…

If it’s not SQL it’s not a database. Watch, you’ll get to hear why… ha!

Then Jonathan covers the recent history (sort of recent, the last ~20ish years) of the industry and how we’ve gotten to this point in database technology.

It takes 5+ years to build a database.

Also tens of millions of dollars over that period of time. Both are needed in droves, time and money.

…more below the video.

The customer is always right.

Even when they’re clearly wrong, they’re largely right.

For numbers 4 and 5 you’ll have to watch the video. Lots of good stuff in this video, including comparisons of Cosmos, DynamoDB, Apache Cassandra, and DataStax Enterprise, how these distributed databases work, their performance (third-party metrics are shown), and more details!

Distributed Systems: Cassandra, DataStax, a Short SITREP

SITREP = Situation Report. It’s military speak. 💂🏻‍♂️

Apache Cassandra is one of the most popular databases in use today. It has many characteristics and distinctive architectural details. In this post I’ll provide a description and some details for a number of these features and characteristics, divided as such. Then, after that (i.e. toward the end, so skip there if you just want the differences) I’m going to summarize key differences with the latest release, DataStax Enterprise 6.

Cassandra Characteristics

Cassandra is a linearly scalable, highly available, fault tolerant, distributed database, just to name a few of its most important characteristics. The Cassandra database is also cross-platform (runs on any operating system), multi-cloud (runs on and across multiple clouds), and can survive regional data center outages or, in multi-cloud scenarios, even entire cloud provider outages!

Columnar Store, Column Based, or Column Family? What? Ok, so you might have read a number of things about what Cassandra actually is. Let’s break this down. First off, a columnar or column store or column oriented database guarantees data location for a single column in a node on disk. The column may span a bunch of the rows or all of them, depending on where or how you specify partitions. However, this isn’t what the Cassandra Database uses. Cassandra is a column-family database.

A column-family storage architecture makes sure the data is stored based on locality of the data at the partition level, not the column level. Cassandra partitions group rows and columns split by a partition key, then clustered together by a specified clustering column or columns. To query Cassandra, because of this, you must know the partition key in order to avoid full data scans!
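To make the partition key and clustering column idea a bit more concrete, here is a toy in-memory model. It is only a sketch of the data layout, not how Cassandra’s storage engine is implemented, and the `readings` table, its columns, and the `ToyTable` class are all invented for illustration.

```python
from bisect import insort
from collections import defaultdict

# The CQL shape this models (hypothetical table, not from any real schema):
#   CREATE TABLE readings (
#       sensor_id text, taken_at int, value double,
#       PRIMARY KEY ((sensor_id), taken_at)
#   );
# sensor_id is the partition key, taken_at is the clustering column.


class ToyTable:
    def __init__(self):
        # partition key -> list of (clustering value, row value), kept sorted
        self.partitions = defaultdict(list)

    def insert(self, sensor_id, taken_at, value):
        insort(self.partitions[sensor_id], (taken_at, value))

    def read_partition(self, sensor_id):
        """Keyed read: touches exactly one partition."""
        return list(self.partitions.get(sensor_id, []))

    def scan_where_value_above(self, threshold):
        """Filter on a non-key column: has to visit every partition."""
        return [
            (pk, taken_at, value)
            for pk, rows in self.partitions.items()
            for taken_at, value in rows
            if value > threshold
        ]


table = ToyTable()
table.insert("a", 3, 20.5)
table.insert("a", 1, 19.0)
table.insert("b", 2, 30.0)
print(table.read_partition("a"))         # rows come back ordered by taken_at
print(table.scan_where_value_above(20))  # walks both partitions to answer
```

The keyed read is a single dictionary lookup, while the filter on `value` walks everything, which is the difference the partition key makes.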

Cassandra has these partitions that are guaranteed to be on the same node and, within a sorted strings table (referred to most commonly as an SSTable *), in the same location within that file. However, depending on the compaction strategy, this can change and the partition can be split across multiple files on a disk. So really, data locality on disk isn’t guaranteed.

Column-family stores are great for high throughput writes and the ability to linearly scale horizontally (ya know, getting lots and lots of nodes in the cloud!). Reads using the partition key are extremely fast since this key points to exactly where the data resides. However, any type of ad-hoc query that doesn’t use the partition key often, at least last I knew, leads to a full scan of the data.

A sort of historically trivial but important point is that the column-family term comes from the storage engine originally used, which was based on a key value store. The value was a set of column value tuples, which were often referred to as a family; later this family was abstracted into partitions, and then the storage engine was matched to that abstraction. Whew, ok, so that’s a lot of knowledge being coagulated into a solid eh! [scuse’ my odd artful language use if you visualized that!]

With all of this described, and that little history sprinkled in, the description of Cassandra in the README.asc file of the actual Cassandra GitHub repo makes just a little more sense. The file starts off with a description:

Apache Cassandra is a highly-scalable partitioned row store. Rows are organized into tables with a required primary key.

Partitioning means that Cassandra can distribute your data across multiple machines in an application-transparent matter. Cassandra will automatically repartition as machines are added and removed from the cluster.

Row store means that like relational databases, Cassandra organizes data by rows and columns. The Cassandra Query Language (CQL) is a close relative of SQL.
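The “Partitioning” paragraph in that README excerpt is easier to picture with a toy hash ring. This is a sketch under loud assumptions: it uses MD5 as a stand-in hash (Cassandra’s default partitioner is Murmur3 based), one token per node (real clusters typically use many virtual nodes), and no replication. The node names and the `ToyRing` class are made up.

```python
import hashlib
from bisect import bisect_right


def token(value: str) -> int:
    # Stand-in hash for illustration; not the hash Cassandra actually uses.
    return int.from_bytes(hashlib.md5(value.encode("utf-8")).digest()[:8], "big")


class ToyRing:
    def __init__(self, nodes):
        # Each node owns the arc of the ring that ends at its token.
        self.ring = sorted((token(n), n) for n in nodes)

    def owner(self, partition_key: str) -> str:
        tokens = [t for t, _ in self.ring]
        # First node token clockwise from the key's token, wrapping around.
        index = bisect_right(tokens, token(partition_key)) % len(self.ring)
        return self.ring[index][1]


def moved_keys(keys, before: ToyRing, after: ToyRing):
    """Keys whose owning node differs between two ring layouts."""
    return [k for k in keys if before.owner(k) != after.owner(k)]


keys = [f"sensor-{i}" for i in range(1000)]
three = ToyRing(["node-a", "node-b", "node-c"])
four = ToyRing(["node-a", "node-b", "node-c", "node-d"])
moved = moved_keys(keys, three, four)
print(len(moved), "of", len(keys), "keys changed owner after adding node-d")
```

The thing to notice is that every key that moves lands on the new node, and everything else stays put. That property is what lets a cluster repartition as machines come and go without reshuffling all of the data.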

Now that I’ve covered the 101 level of what Cassandra is, I’ll take a look at DataStax and their respective offering.

DataStax

DataStax Enterprise at first glance might be a bit confusing since immediate questions pop up like, “Doesn’t DataStax make Cassandra?”, “Isn’t DataStax just selling support for Cassandra?”, or “Eh, wha, who is DataStax and what does this have to do with Cassandra?”. Well, I’m gonna tell ya all about where we are today regarding how all of these things fit.

Performance

DataStax provides a whole selection of amenities around a database that is derived from the Cassandra Distributed Database System. The core product and these amenities are built into what we refer to as “DataStax Enterprise 6”. One of the specific differences is that the database engine itself has been modified out of band and now delivers 2x the performance of the standard Cassandra database engine. I was somewhat dubious when I joined, but after the third party benchmarks were completed and showed the difference, I grew more confident. That confidence has kept growing as I’ve gotten to work with the latest version, since I can tell in more than a few situations that it’s faster.

Read Repair & NodeSync

If you already use Cassandra, read repair works a certain way, and that still works just fine in DataStax Enterprise 6. But one also has the option of using NodeSync, which can help eliminate scripting, manual intervention, and other repair operations.
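For anyone who hasn’t run into read repair before, the general idea can be sketched as a toy: on a read, compare the copies held by the replicas, take the one with the newest write timestamp, and write it back to any replica that is stale or missing it. This is a deliberately simplified illustration with invented names. It is not Cassandra’s actual read path (which involves digests and consistency levels) and it says nothing about how NodeSync is implemented.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Cell:
    value: str
    write_timestamp: int  # any monotonically increasing integer will do here


def read_with_repair(replicas: Dict[str, Dict[str, Cell]], key: str) -> Optional[Cell]:
    """Return the newest copy of `key` and overwrite stale or missing replicas."""
    copies = [store[key] for store in replicas.values() if key in store]
    if not copies:
        return None
    # Last write wins: the copy with the highest timestamp is the truth.
    newest = max(copies, key=lambda cell: cell.write_timestamp)
    for store in replicas.values():
        if store.get(key) != newest:
            store[key] = newest  # the "repair" step
    return newest


replicas = {
    "n1": {"k": Cell("old", 1)},
    "n2": {"k": Cell("new", 2)},
    "n3": {},
}
print(read_with_repair(replicas, "k"))
print(replicas["n3"])
```

After the read, all three toy replicas hold the same newest cell, which is the convergence that repair tooling is after.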

Spark SQL Connectivity

There’s also an always-on SQL engine for automated uptime for apps using DataStax Enterprise Analytics. This provides better support for analytics requests and end-user analytics. On a related note, DataStax Studio now has notebook support for Spark SQL, so writing one’s Spark SQL gets a little easier.

Multi-Cloud / Hybrid-Cloud

Another huge advantage of DataStax Enterprise is going multi-cloud or hybrid-cloud with DataStax Enterprise Cassandra. Between the Lifecycle Manager (LCM), OpsCenter, and related tooling, getting up and running with a cluster across a varying range of data centers, wherever they may be, is quick and easy.
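On the data side, spanning data centers comes down to telling a keyspace how many replicas to keep in each one. Here is a small sketch that builds that CQL statement. The keyspace name and data center names are placeholders I made up; in a real cluster the names have to match the data center names your cluster actually reports, and the replica counts are just examples.

```python
def create_keyspace_cql(keyspace: str, replicas_per_dc: dict) -> str:
    """Build a CREATE KEYSPACE statement using NetworkTopologyStrategy."""
    if not replicas_per_dc:
        raise ValueError("at least one data center is required")
    options = ", ".join(
        f"'{dc}': {int(count)}" for dc, count in sorted(replicas_per_dc.items())
    )
    return (
        f"CREATE KEYSPACE IF NOT EXISTS {keyspace} "
        f"WITH replication = {{'class': 'NetworkTopologyStrategy', {options}}};"
    )


# Hypothetical data center names: one in a cloud region, one on premises.
print(create_keyspace_cql("sensors", {"aws_us_west": 3, "onprem_pdx": 3}))
```

The resulting string can be handed to whichever driver session or CQL shell you already use.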

Summary

I’ll be providing deeper dives into the particular technology, the specific differences, and more in the future. For now I’ll wrap up this post, as I’ve got a few others coming that relate specifically to distributed database systems themselves, ranging from specific principles (like CAP Theorem) to operational (how to and best ways to manage) and development (patterns and practices of developing against) topics.

Overall the solutions that DataStax offers are solid advantages if you’re stepping into any large scale data (big data or whatever one would call their plethora of data) needs. Over the coming months I’ve got a lot of material – from architectural research and guidance to tactical coding implementation work – that I’ll be blogging about and providing. I’m really looking forward to exploring these capabilities, being the developer advocate to DataStax for the community of users, and learning a thing or three million.
