Development Workspace with Terraform on Azure: Part 4 – DSE w/ Packer + Importing State 4 Terraform

The next thing I wanted setup for my development workspace is a DataStax Enterprise Cluster. This will give me all of the Apache Cassandra database features plus a lot of additional features around search, OpsCenter, analytics, and more. I’ll elaborate on that in some future posts. For now, let’s get an image built we can use to add nodes to our cluster and setup some other elements.

1: DataStax Enterprise

The general installation instructions for the process I’m stepping through here in this article can be found in this documentation. To do this I started with a Packer template like the one I setup in the second part of this series. It looks, with the installation steps taken out, just like the code below.

[code language=”javascript”]
{
“variables”: {
“client_id”: “{{env `TF_VAR_clientid`}}”,
“client_secret”: “{{env `TF_VAR_clientsecret`}}”,
“tenant_id”: “{{env `TF_VAR_tenant_id`}}”,
“subscription_id”: “{{env `TF_VAR_subscription_id`}}”,
“imagename”: “”,
“storage_account”: “adronsimagestorage”,
“resource_group_name”: “adrons-images”
},

“builders”: [{
“type”: “azure-arm”,

“client_id”: “{{user `client_id`}}”,
“client_secret”: “{{user `client_secret`}}”,
“tenant_id”: “{{user `tenant_id`}}”,
“subscription_id”: “{{user `subscription_id`}}”,

“managed_image_resource_group_name”: “{{user `resource_group_name`}}”,
“managed_image_name”: “{{user `imagename`}}”,

“os_type”: “Linux”,
“image_publisher”: “Canonical”,
“image_offer”: “UbuntuServer”,
“image_sku”: “18.04-LTS”,

“azure_tags”: {
“dept”: “Engineering”,
“task”: “Image deployment”
},

“location”: “westus2”,
“vm_size”: “Standard_DS2_v2”
}],
“provisioners”: [{
“execute_command”: “chmod +x {{ .Path }}; {{ .Vars }} sudo -E sh ‘{{ .Path }}'”,
“inline”: [
“”
],
“inline_shebang”: “/bin/sh -x”,
“type”: “shell”
}]
}
[/code]

In the section marked “inline” I setup the steps for installing DataStax Enterprise.

[code language=”javascript”]
“apt-get update”,
“apt-get install -y openjdk-8-jre”,
“java -version”,
“apt-get install libaio1”,
“echo \”deb https://debian.datastax.com/enterprise/ stable main\” | sudo tee -a /etc/apt/sources.list.d/datastax.sources.list”,
“curl -L https://debian.datastax.com/debian/repo_key | sudo apt-key add -“,
“apt-get update”,
“apt-get install -y dse-full”
[/code]

The first part of this process the machine image needs Open JDK installed, which I opted for the required version of 1.8. For more information about the Open JDK check out this material:

The next thing I needed to do was to get everything setup so that I could use this Azure Image to build an actual Virtual Machine. Since this process however is built outside of the primary Terraform main build process, I need to import the various assets that are created for Packer image creation and the actual Packer images. By importing these asset resources into Terraform’s state I can then write configuration and code around them as if I’d created them within the main Terraform build process. This might sound a bit confusing, but check out this process and it might make more sense. If it is still confusing do let me know, ping me on Twitter @adron and I’ll elaborate or edit things so that they read better.

check-box-64Verification Checklist

  • At this point there now exists a solidly installed and baked image available for use to create a Virtual Machine.

2: Terraform State & Terraform Import Resources

Ok, if you check out part 1 of this series I setup Azure CLI, Terraform, and the pertinent configuration and parts to build out infrastructure as code using HCL (Hashicorp Configuration Language) with a little bit of Bash as glue here and there. Then in Part 2 and Part 3 I setup Packer images and some Terraform resources like Kubernetes and such. All of that is great, but these two parts of the process are now in entirely two different unknown states. The two pieces are:

  1. Packer Images
  2. Terraform Infrastructure

The Terraform Infrastructure doesn’t know the Packer Images exist, but they are sitting there in a resource group in Azure. The way to make Terraform aware that these images exist is to import the various things that store the images. To import these resources into the Terraform state, before doing an apply, run the terraform import command.

In order to get all of the resources we need in which to operate and build images, the following import commands need issued. I wrote a script file to help me out with each of these, and used jq to make retrieval of the Packer created Azure Image ID’s a bit easier. That code looks like this:

[code language=”bash”]
BASECASSANDRA=$(az image list | jq ‘map({name: “basecassandra”, id})’ | jq -r ‘.[0].id’)
BASEDSE=$(az image list | jq ‘map({name: “basedse”, id})’ | jq -r ‘.[0].id’)
[/code]

Breaking down the jq commands above, the following actions are being taken. First, the Azure CLI command is issued, az image list which is then piped | into the jq command of jq 'map({name: "theimagenamehere", id})'. This command takes the results of the Azure CLI command and finds the name element with the appropriate image name, matches that and then gets the id along with it. That command is then piped into another command that returns just the value of the id jq -r '.id'. The -r is a switch that tells jq to just return the raw data, without enclosing double quotes.

I also needed to import the resource group all of these are in too, which following a similar jq command style of piping one command’s results into another, issued this command to get the Resource Group ID RG-IMPORT=$(az group show --name adronsimages | jq -r '.id'). With those three ID’s there is one more element needed to be able to import these into Terraform state.

The Terraform resources that these imported pieces of state will map to need declared, which means the Terraform HCL itself needs written out. For that, there are the two images that are needed and the Resource Group. I added the images in an images.tf files and the Resource Group goes in the resource_groups.tf file.

[code language=”javascript”]
resource “azurerm_image” “basecassandra” {
name = “basecassandra”
location = “West US”
resource_group_name = azurerm_resource_group.imported_adronsimages.name

os_disk {
os_type = “Linux”
os_state = “Generalized”
blob_uri = “{blob_uri}”
size_gb = 30
}
}

resource “azurerm_image” “basedse” {
name = “basedse”
location = “West US”
resource_group_name = azurerm_resource_group.imported_adronsimages.name

os_disk {
os_type = “Linux”
os_state = “Generalized”
blob_uri = “{blob_uri}”
size_gb = 30
}
}
[/code]

Then the Resource Group.

[code language=”javascript”]
resource “azurerm_resource_group” “imported_adronsimages” {
name = “adronsimages”
location = var.locationlong

tags = {
environment = “Development Images”
}
}
[/code]

Now, issuing these Terraform commands will pull the current state of those resource into the state, which we can then issue further Terraform commands and applies from.

[code language=”bash]
terraform import azurerm_image.basedse $BASEDSE
terraform import azurerm_image.basecassandra $BASECASSANDRA
terraform import azurerm_resource_group.imported_adronsimages $RG_IMPORT
[/code]

Running those commands, the results come back something like this.

terraform-imports

Verification Checklist

  • At this point there now exists a solidly installed and baked image available for use to create a Virtual Machine.
  • Now there is also state in Terraform, that understands where and what these resources are.

Summary, for now.

This post is shorter than I’d like it to be. But it was taking to long for the next steps to get written up – but fear not they’re on the way! In the coming post I’ll cover more of other resource elements we’ll need to import, what is next for getting a virtual machine built off of the image that is now available, some Terraform HCL refactoring, and most importantly putting together the actual DataStax Enterprise / Apache Cassandra Clusters! So stay tuned, subscribe to the blog, and of course follow me on the Twitters @Adron.

Let’s Talk Top 7 Options for Database Gumbo

When one starts to dig into databases things get really complex really fast. There’s not only a whole plethora of database companies and projects, but database types, storage engines, and other options and functionality to choose from. One place to get a start is just to take a look at the crazy long list of databases on db-engines. In this post I’m going to take a look at a few of the top database engines to create a starting point – which I’ll reference – for future video streaming coding sessions (follow me @ twitch.tv/adronhall).

My Options for Database Gumbo

  1. Apache Cassandra / DataStax Enterprise
  2. Postgresql
  3. SQL Server
  4. Elasticsearch
  5. Redis
  6. SQLite
  7. Dynamo DB

The Reasons

Ok, so the list is as such, and as stated it’s my list. There are a lot of databases, and of course some are still more used such as Oracle. However here’s some of the logic and reasoning behind my choices above.

Oracle

First off I feel like I need to broach the Oracle topic. Mostly because of their general use in industry. I’m not doing anything with Oracle now, nor have I for years for a long, long, LONG list of reasons. Using their software tends to be buried in bureaucratic, oddly broken and unnecessary usage today anyway. They use predatory market tactics, completely dishonorable approach to sales and services, as well as threatening and suing people for doing benchmarks, and a host of other practices. In face to face experiences, Oracle tends to give off experiences, that Lawrence from Office Space would say, “naw man, I think you’d get your ass kicked for that!” and I agree. Oracle’s practices are too often disgusting. But even from the purely technical point of view, the Oracle Database and ecosystem itself really isn’t better than other options out there. It is indeed a better, more intelligently strategic and tactical option to use a number of alternatives.

Apache Cassandra / DataStax Enterprise

This combo has multiple reasons and logic to be on the list. First and foremost, much of my work today is using DataStax Enterprise (DSE) and Apache Cassandra since I work for DataStax. But it’s important to know I didn’t just go to DataStax because I needed a job, but because I chose them (and obviously they chose me by hiring me) because of the team and technology. Yes, they pay me, but it’s very much a two way street, I advocate Cassandra and DSE because I personally know the tech is top tier and solid.

On the fact that Apache Cassandra is top tier and solid, it is simply the remaining truly masterless distributed database that provides a linear path of scalability on the market that you can use, buy support for, and is actually actively and knowingly maintained not just by DataStax but by members of the community. One could make an argument for MongoDB but I’ll maybe elaborate on that in the future.

In addition to being a solid distributed database there are capabilities inherent in Apache Cassandra because of the data types and respective the CQL (Cassandra Query Language) that make it a great database to use too. DataStax Enterprise extends that to provide spatial (re: GIS/Geo Data/Queries), graph data, analytics engine, and more built on other components like SOLR and related technology. Overall a great database and great prospective combinations with the database.

Postgresql

Postgres is a relational database that has been around for a long time. It’s got some really awesome features like native JSON support, which I’m a big fan of. But I digress, there’s tons of other material that lays out thoroughly why to use Postgres which I very much agree with.

Just from the perspective of the extensive and rich data types Postgres is enough to be put on this list, but considering there are a lot of reasons around multi-tenancy, scalability, and related characteristics that are mostly unique to Postgres it’s held a solid position.

SQL Server

This one is on my list for a few reasons that have nothing to do with features or capabilities. This is the first database I was responsible for in its entirety. Administration, queries, query tuning, setup, and developer against with the application tier. I think of all my experience, this database I’ve spent the most time with, with Apache Cassandra being a close second, then Postgres and finally Riak.

Kind of a pattern there eh? Relational, distributed, relational, distributed!

The other thing about SQL Server however is the integrations, tooling, and related development ecosystem around SQL Server is above and beyond most options out there. Maybe, with a big maybe, Oracle’s ecosystem might be comparable but the pricing is insanely different. In that SQL Server basically can carry the whole workload, reporting, ETL, and other feature capabilities that the Oracle ecosystem has traditionally done. Combine SQL Server with SSIS (SQL Server Integration Services), SSRS (SQL Server Reporting Services), and other online systems like Azure’s SQL Database and the support, tooling, and ecosystem is just massive. Even though I’ve had my ins and outs with Microsoft over the years, I’ve always found myself enjoying working on SQL Server and it’s respective tooling options and such. It’s a feature rich, complete, solidly, and generally well performing relational database, full stop.

Elasticsearch

Ok, this is kind of a distributed database of sorts but focused more exclusively (not totally since it’s kind of expanded its roles) search engine. Overall I’ve had good experiences with Elasticsearch and it’s respective ELK (or Elastic ecosystem) of tooling and such, with some frustrating flakiness here and there over the years. Most of my experience has come from an operational point of view with Elasticsearch. I’ve however done a fair bit of work over the years in supporting teams that are doing actual software development against the system. I probably won’t write a huge amount about Elasticsearch in the coming months, but I’ll definitely bring it up at certain times.

Redis / SQLite / DynamoDB

These I’ll be covering in the coming months. For Redis and DynamoDB I have wanted to dig in for some comparison analysis from the perspective of implementing data tiers against these databases, where they are a good option, and determining where they’re just an outright bad option.

For SQLite I’ve used it on and off for many years, but have wanted to sit down and just learn it and try out some of its features a bit more.

Property Graph Modeling with an FU Towards Supernodes – Jonathan Lacefield

Some notes along with this talk. Which is about ways to mitigate super nodes, partitioning strategies, and related efforts. Jonathan’s talk is vendor neutral, even though he works at DataStax. Albeit that’s not odd to me, since that’s how we roll at DataStax anyway. We take pride in working with DSE but also with knowing the various products out there, as things are, we’re all database nerds after all. (more below video)

In the video, I found the definition slide for super node was perfect.

supernodes.png

See that super node? Wow, Florida is just covered up by the explosive nature of that super node! YIKES!

In the talk Jonathan also delves deeper into the vertexes, adjacent vertices, and the respective neighbors. With definitions along the way, so it’s a great talk to watch even if you’re not up to speed on graph databases and graph math and all that related knowledge.

impacts

traversal.png

The super node problem he continues on to describe have two specific problems that are detailed; query performance traversals and storage retrieval. Such as a Gremlin traversal (one’s query), moving along creating traversers, until it hits a super node, where a computational explosion occurs.

Whatever your experience, this talk has some great knowledge to expand your ideas on how to query, design, and setup data in your graph databases to work against. Along with that more than a few elements of knowledge about what not to do when designing a schema for your graph data. Give a listen, it’s worth your time.

 

Jonathan Ellis talks about Five Lessons in Distributed Databases

Notes on the talk…

  1. If it’s not SQL it’s not a database. Watch, you’ll get to hear why… ha!

Then Jonathan covers the recent history (sort of recent, the last ~20ish years) of the industry and how we’ve gotten to this point in database technology.

  1. It takes 5+ years to build a database.

Also the tens of millions of dollars with that period of time. Both are needed, in droves, time and money.

…more below the video.

  1. The customer is always right.

Even when they’re clearly wrong, they’re largely right.

For number 4 and 5 you’ll have to watch the video. Lot’s good stuff in this video including comparisons of Cosmos, Dynamo DB, Apache Cassandra, DataStax Enterprise, and how these distributed databases work, their performance (3rd Party metrics are shown) and more details!

DataStax Developer Days

Over the last week I had the privilege and adventure of coming out to Chicago and Dallas to teach about operations and security capabilities of DataStax Enterprise. More about that later in this post, first I’ll elaborate on and answer the following:

  • What is DataStax Developer Day? Why would you want to attend?
  • Where are the current DataStax Developer Day events that have been held, and were future events are going to be held?
  • Possibilities for future events near a city you live in.

What is DataStax Developer Day?

The way we’ve organized this developer day event at DataStax, is focused around the DataStax Enterprise built on Apache Cassandra product, however I have to add the very important note that this isn’t merely just a product pitch type of thing, you can and will learn about distributed databases and systems in a general sense too. We talk about a number of the core principles behind distributed systems such as the pivotally important consistent hash ring, datacenter and racks, gossip, replication, snitches, and more. We feel it’s important that there’s enough theory that comes along with the configuration and features covered to understand who, what, where, why, and how behind the configuration and features too.

The starting point of the day’s course material is based on the idea that one has not worked with or played with a Apache Cassandra or DataStax Enterprise. However we have a number of courses throughout the day that delve into more specific details and advanced topics. There are three specific tracks:

  1. Cassandra Track – this track consists of three workshops: Core Cassandra, Cassandra Data Modeling, and Cassandra Application Development. [more details]
  2. DSE Track – this track consists of three workshops: DataStax Enterprise Search, DataStax Enterprise Analytics, and DataStax Enterprise Graph. [more details]
  3. Bonus Content – This track has two workshops: DataStax Enterprise Overview and DataStax Enterprise Operations and Security.  [more details]

Why would you want to attend?

  • One huge rad awesome reason is that the developer day events are FREE. But really, nothing is ever free right? You’d want to take a day away from the office to join us, so there’s that.
  • You also might want to even stay a little later after the event as we always have a solidly enjoyable happy hour so we can all extend conversations into the evening and talk shop. After all, working with distributed databases, managing data, and all that jazz is honestly pretty enjoyable when you’ve got awesome systems like this to work with, so an extended conversation into the evening is more than worth it!
  • You’ll get a firm basis of knowledge and skillset around the use, management, and more than a few ideas about how Apache Cassandra and DataStax Enterprise can extend your system’s systemic capabilities.
  • You’ll get a chance to go beyond merely the distributed database system of Apache Cassandra itself and delve into graph, what it is and how it works, analytics, and search too. All workshops take a look at the architecture, uses, and what these capabilities will provide your systems.
  • You’ll also have one on one time with DataStax engineers, and other technical members of the team to ask questions, talk about architecture and solutions that you may be working on, or generally discuss any number of graph, analytics, search, or distributed systems related questions.

Where are the current DataStax Developer Day events that have been held, and were future events are going to be held? So far we’ve held events in New York City, Washington DC, Chicago, and Dallas. We’ve got two more events scheduled with one in London, England and one in Paris, France.

Future events? With a number of events completed and a few on the calendar, we’re interested in hearing about future possible locations for events. Where are you located and where might an event of this sort be useful for the community? I can think of a number of cities, but organizing them into order to know where to get something scheduled next is difficult, which is why the team is looking for input. So ping me via @Adron, email, or just send me a quick message from here.