Adron's Composite Code multi-cloud

I’ve been deep in the trenches of multi-cloud tooling for years now, exploring everything from Kubernetes to Terraform, and all the glue that holds modern infrastructures together. Recently, I took a deep dive into Azure Arc, Microsoft’s hybrid and multi-cloud management solution, to see what it brings to the table. What follows is a breakdown of Azure Arc, framed through the lens of someone who’s seen the evolution of these tools over time.

The Core Idea Behind Azure Arc

At its core, Azure Arc is Microsoft’s answer to the complexities of hybrid and multi-cloud management. The idea is simple yet powerful: provide a unified management experience for resources, regardless of where they live. Whether you’re running workloads on-premises, across different cloud providers, or out on the edge, Azure Arc aims to bring them all under a single, cohesive management umbrella.

What Azure Arc Really Does

Azure Arc extends Azure’s management capabilities beyond its own boundaries. Think of it as a bridge that connects your existing infrastructure to Azure’s powerful tools and services. Once your resources are “Arc-enabled,” you can manage them just like you would any native Azure resource. This means applying policies, leveraging Azure’s security features, and using monitoring tools – all from within the Azure portal.

The beauty of Azure Arc is that it doesn’t discriminate based on where your resources are. Whether it’s a Linux server running in your own datacenter, a Kubernetes cluster on Google Cloud, or even a SQL database on AWS, Azure Arc brings it all together. This isn’t just about management, though. Azure Arc also allows you to deploy Azure services, like SQL and PostgreSQL Hyperscale, directly into your non-Azure environments.

Azure Arc for Kubernetes

Azure Arc’s support for Kubernetes is where things get particularly interesting. If you’re managing Kubernetes clusters across different environments—whether it’s AKS, EKS on AWS, GKE on Google Cloud, or even an on-premises setup—Azure Arc brings these disparate clusters into the fold under Azure’s management.

With Azure Arc, you can attach your Kubernetes clusters to Azure, enabling you to deploy applications consistently across all your clusters using GitOps, apply consistent security and governance policies, and even monitor and manage them from the Azure portal. This is incredibly powerful in multi-cloud environments where you might have clusters spread across different platforms but need a unified approach to management and operations.

The integration with AKS is seamless, of course, but the real power lies in Azure Arc’s ability to connect with other cloud providers’ Kubernetes offerings. Whether you’re dealing with AWS’s EKS, GCP’s GKE, or a custom Kubernetes setup, Azure Arc enables a level of control and consistency that can be a game-changer in complex, hybrid environments.

Defining Azure Arc

Azure Arc is, in essence, a set of technologies that extend Azure’s control plane to wherever your resources reside. Here’s what it means in practice:

Unified Server Management: You can connect and manage your Windows and Linux servers across on-premises, edge, and multi-cloud environments from a single pane of glass within Azure.
Kubernetes Cluster Integration: Azure Arc allows you to attach and manage Kubernetes clusters from anywhere. This means consistent management, monitoring, and governance across your entire Kubernetes estate, regardless of where those clusters are running.
Data Services Anywhere: With Azure Arc, you can run Azure’s data services, like Azure SQL and PostgreSQL Hyperscale, in any environment. This gives you the flexibility to use Azure’s data capabilities wherever you need them most.
Consistent Governance and Security: Perhaps one of the biggest wins here is the ability to enforce compliance and governance policies consistently across all your resources, no matter where they’re deployed.

In short, Azure Arc is Microsoft’s play to bring coherence and control to the sprawling, often chaotic, world of hybrid and multi-cloud environments. It’s a tool that’s not just about visibility but about giving you the power to manage, secure, and optimize your entire infrastructure from a single point of control. And in a world where resources are scattered across different platforms and locations, that’s a game-changer.

Cassandra Characteristics

Cassandra is a linearly scalable, highly available, fault-tolerant, distributed database, to name just a few of its most important characteristics. The Cassandra database is also cross-platform (runs on any operating system), multi-cloud (runs on and across multiple clouds), and can survive regional data center outages or, in multi-cloud scenarios, even entire cloud provider outages!

Columnar Store, Column Based, or Column Family? What?

Ok, so you might have read a number of things about what Cassandra actually is. Let’s break this down. First off, a columnar, column store, or column-oriented database guarantees data location for a single column in a node on disk. The column may span some or all of the rows, depending on where or how you specify partitions. However, this isn’t what the Cassandra database uses. Cassandra is a column-family database.
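To make the columnar idea a bit more concrete, here’s a minimal Python sketch. The sample rows and the names in it are made up for illustration and aren’t taken from any real database. It just shows the same data pivoted so that each column’s values sit together, which is the kind of locality a column store is after.

```python
# Hypothetical sample data, laid out row by row.
rows = [
    {"id": 1, "city": "Portland", "temp": 19},
    {"id": 2, "city": "Seattle", "temp": 17},
    {"id": 3, "city": "Portland", "temp": 22},
]

# Column-oriented layout: one contiguous list of values per column.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# An aggregate over a single column only has to read that column's list.
average_temp = sum(columns["temp"]) / len(columns["temp"])
```

A real column store does this on disk with compression and much more, so treat this only as a picture of the layout.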

A column-family storage architecture makes sure the data is stored based on locality of the data at the partition level, not the column level. In Cassandra, rows are grouped into partitions by a partition key and, within each partition, clustered together by a specified clustering column or columns. Because of this, you must know the partition key when you query Cassandra in order to avoid full data scans!

Cassandra guarantees that a partition lives on a single node and, within a sorted strings table (most commonly referred to as an SSTable), in the same location within that file. However, depending on the compaction strategy, a partition can end up split across multiple files on disk. So really, data locality isn’t guaranteed.

Column-family stores are great for high throughput writes and the ability to linearly scale horizontally (ya know, getting lots and lots of nodes in the cloud!). Reads using the partition key are extremely fast since this key points to exactly where the data resides. However, any type of ad-hoc query that doesn’t use the partition key often leads, at least last I knew, to a full scan of the data.
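Here’s a rough Python sketch of that difference between a partition key read and an ad-hoc query. It is a toy in-memory model, not Cassandra’s storage engine, and the class and method names (PartitionedTable, read_partition, scan_where) are hypothetical ones I picked for illustration.

```python
from collections import defaultdict


class PartitionedTable:
    """Toy model: rows grouped by partition key, ordered by clustering key."""

    def __init__(self):
        # partition key -> list of (clustering key, row), kept sorted
        self.partitions = defaultdict(list)

    def insert(self, partition_key, clustering_key, row):
        partition = self.partitions[partition_key]
        partition.append((clustering_key, row))
        # Keep rows ordered by clustering key inside the partition.
        partition.sort(key=lambda entry: entry[0])

    def read_partition(self, partition_key):
        # One lookup by key: only the named partition is touched.
        return [row for _, row in self.partitions.get(partition_key, [])]

    def scan_where(self, predicate):
        # No partition key given, so every row in every partition is visited.
        matches, visited = [], 0
        for partition in self.partitions.values():
            for _, row in partition:
                visited += 1
                if predicate(row):
                    matches.append(row)
        return matches, visited


table = PartitionedTable()
table.insert("sensor-1", 2, {"temp": 21})
table.insert("sensor-1", 1, {"temp": 19})
table.insert("sensor-2", 1, {"temp": 30})

by_key = table.read_partition("sensor-1")
hot, visited = table.scan_where(lambda row: row["temp"] > 20)
```

The read by key goes straight to one partition and returns its rows in clustering order, while the predicate query has to visit all three rows to find the two that match.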

A sort of historically trivial but important point is that the column-family term comes from the storage engine originally used, which was based on a key-value store. The value was a set of column-value tuples, which were often referred to as a family. Later this family was abstracted into partitions, and then the storage engine was matched to that abstraction. Whew, ok, so that’s a lot of knowledge being coagulated into a solid eh! [scuse’ my odd artful language use if you visualized that!]

With all of this described, and that little history sprinkled in, the description of Cassandra in the README.asc file of the actual Cassandra GitHub repo makes just a little more sense. The file starts off with a description:

Apache Cassandra is a highly-scalable partitioned row store. Rows are organized into tables with a required primary key.

Partitioning means that Cassandra can distribute your data across multiple machines in an application-transparent matter. Cassandra will automatically repartition as machines are added and removed from the cluster.

Row store means that like relational databases, Cassandra organizes data by rows and columns. The Cassandra Query Language (CQL) is a close relative of SQL.
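The partitioning part of that description can be sketched as a token ring. This is a simplified Python illustration under a few assumptions of my own: it uses MD5 from the standard library as a stand-in hash (Cassandra’s default partitioner is Murmur3-based), gives each node a single token instead of virtual nodes, and the Ring class and node names are hypothetical.

```python
import hashlib
from bisect import bisect_left


def token(value: str) -> int:
    # Stand-in hash for illustration only; not Cassandra's partitioner.
    digest = hashlib.md5(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


class Ring:
    """Toy token ring: each node owns the keys up to and including its token."""

    def __init__(self, nodes):
        self.entries = sorted((token(node), node) for node in nodes)

    def add_node(self, node):
        self.entries = sorted(self.entries + [(token(node), node)])

    def owner(self, partition_key):
        tokens = [t for t, _ in self.entries]
        # First node token at or after the key's token, wrapping around.
        index = bisect_left(tokens, token(partition_key)) % len(self.entries)
        return self.entries[index][1]


keys = [f"user-{i}" for i in range(1000)]
ring = Ring(["node-a", "node-b", "node-c"])
before = {key: ring.owner(key) for key in keys}

ring.add_node("node-d")
after = {key: ring.owner(key) for key in keys}
moved = [key for key in keys if before[key] != after[key]]
```

The point of the sketch is that when node-d joins, the only keys that change owner are the ones node-d takes over. Everything else stays put, which is roughly why adding machines doesn’t mean reshuffling the whole data set.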

Now that I’ve covered the 101 level of what Cassandra is, I’ll take a look at DataStax and its offering.

DataStax

DataStax Enterprise at first glance might be a bit confusing since immediate questions pop up like, “Doesn’t DataStax make Cassandra?”, “Isn’t DataStax just selling support for Cassandra?”, or “Eh, wha, who is DataStax and what does this have to do with Cassandra?”. Well, I’m gonna tell ya all about where we are today regarding how all of these things fit.

Performance

DataStax provides a whole selection of amenities around a database that is derived from the Cassandra distributed database system. The core product and these amenities are built into what we refer to as “DataStax Enterprise 6”. One of the specific differences is that the database engine itself has been modified out of band and now delivers 2x the performance of the standard Cassandra database engine. I was somewhat dubious when I joined, but after the third-party benchmarks were completed and showed the difference, I grew more confident. That confidence has kept growing as I’ve gotten to work with the latest version, since I can tell in more than a few situations that it’s faster.

Read Repair & NodeSync

If you already use Cassandra, read repair works a certain way, and that still works just fine in DataStax Enterprise 6. But you also have the option of using NodeSync, which can help eliminate scripting, manual intervention, and other repair operations.

Spark SQL Connectivity

There’s also an always-on SQL engine for automated uptime for apps using DataStax Enterprise Analytics. This provides better support for analytics requests and end-user analytics. On a related note, DataStax Studio now also has notebook support for Spark SQL. Writing one’s Spark SQL gets a little easier with this option.

Multi-Cloud / Hybrid-Cloud

Another huge advantage of DataStax Enterprise is going multi-cloud or hybrid-cloud with DataStax Enterprise Cassandra. Between the Lifecycle Manager (LCM), OpsCenter, and related tooling, getting up and running with a cluster across a varying range of data centers, wherever they may be, is quick and easy.

Summary

I’ll be providing deeper dives into the particular technology, the specific differences, and more in the future. For now I’ll wrap up this post, as I’ve got a few others coming that relate specifically to distributed database systems themselves, ranging from specific principles (like CAP Theorem) to operational (how to and best ways to manage) and development (patterns and practices of developing against) topics.

Overall the solutions that DataStax offers are solid advantages if you’re stepping into any large scale data (big data or whatever one would call their plethora of data) needs. Over the coming months I’ve got a lot of material – from architectural research and guidance to tactical coding implementation work – that I’ll be blogging about and providing. I’m really looking forward to exploring these capabilities, being the developer advocate to DataStax for the community of users, and learning a thing or three million.