Dynamic Data Generation with JavaScript

This video shows the process detailed below in this blog entry, to provide the choice of video or a quick read! πŸ‘πŸ»πŸ˜

I coded up some JavaScript to generate some data for a table recently and it seemed relatively useful, so here it is ready to use as you may. (The complete js file is below the description of the individual code segments below). This file simple data generation is something I put together to create a csv for some quick data imports into a database (Postgres, SQL Server, or anything you may want). With that in mind, I added the libraries and initialized the repo with the libraries I would need.

npm install faker
npm install fs
faker = require('faker');
fs = require('fs');

Next up I included the column row of data for the csv. I decided to go ahead and setup the variable at this point, as it would be needed as I would add the rest of the csv data to the variable itself. There is probably a faster way to do this, but this was the quickest path from the perspective of getting something working right now.

After the colum row, I also setup the base 8 UUIDs that would related to the project_id values to randomly use throughout data generation. The idea behind this is that the project_id values are the range of values that would be in the data that Subhendu would have, and all the ip and other recorded data would be recorded with and related to a specific project_id. I used a UUID generation site to generate these first 8 values, that site is available here.

After that I went ahead and added the for loop that would be used to step through and generate each record.

var data = "id,country,ip,created_at,updated_at,project_id\n";
let project_ids = [
    'c16f6dd8-facb-406f-90d9-45529f4c8eb7',
    'b6dcbc07-e237-402a-bf11-12bf2226c243',
    '33f45cab-0e14-4830-a51c-fd44a62d1adc',
    '5d390c9e-2cfa-471d-953d-f6727972aeba',
    'd6ef3dfd-9596-4391-b0ef-3d7a8a1a6d10',
    'e72c0ed8-d649-4c53-97c5-da793d7a8228',
    'bf020fd2-2514-4709-8108-a2810e61c503',
    'ead66a4a-968a-448c-a796-51c6a1da0c20'];

for (var i = 0; i < 500000; i++) {
    // TODO: Generation will go here.
}

The next thing that I wanted to sort out are the two dates. One would be the created_at value and the other the updated_at value. The updated_at date needed to show as occurring after the created_at date, for obvious reasons. To make sure I could get this calculated I added a function to perform the randomization! First two functions to get additions for days and hours, then getting the random value to add for each, then getting the calculated dates.

function addDays(datetime, days) {
    let date = new Date(datetime.valueOf());
    date.setDate(date.getDate() + days);
    return date;
}

function addHours(datetime, hours) {
    let time = new Date(datetime.valueOf())
    time.setTime(time.getTime() + (hours*60*60*1000));
    return time;
}

var days = faker.datatype.number({min:0, max:7})
var hours = faker.datatype.number({min:0, max:24})

var updated_at = new Date(faker.date.past())
var created_at = addHours(addDays(updated_at, -days), -hours)

With the date time stamps setup for the row data generation I moved on to selecting the specific project_id for the row.

var proj_id = project_ids[faker.datatype.number({min:0, max: 7})]

One other thing that I knew I’d need to do is filter for the ' or , values located in the countries that would be selected. The way I clean that data to ensure it doesn’t break the SQL bulk import process is kind of cheap and in production data I wouldn’t do this, but it works great for generated data like this.

var cleanCountry = faker.address.country().replace(",", " ").replace("'", " ")

If you’re curious why I’m calculating these before I do the general data generation and set the row up, I like to keep the row of actual data calls to either a set variable assignment or at most one dot level deep in my calls. As you’ll see now in the row level data being generated below.

data2 += 
    faker.datatype.uuid() + "," +
    cleanCountry + "," +
    faker.internet.ip() + "," +
    created_at.toISOString() + "," +
    updated_at.toISOString() + "," +
    proj_id + "\n"

Now the last step is to create the file for all these csv rows.

fs.writeFile('kundu_table_data.csv', data, function (err) {
  if (err) return console.log(err);
  console.log('Data file written.');
});

The results.

Setup Postgres, and GraphQL API with Hasura on Azure

Key Technologies: HasuraPostgresTerraformDocker, and Azure.

I created a data model to store railroad systems, services, scheduled, time points, and related information, detailing the schema “Beyond CRUD n’ Cruft Data-Modeling” with a few tweaks. The original I’d created for Apache Cassandra, and have since switched to Postgres giving the option of primary and foreign keys, relations, and the related connections for the model.

In this post I’ll use that schema to build out an infrastructure as code solution with Terraform, utilizing Postgres and Hasura (OSS).

Prerequisites

Docker Compose Development Environment

For the Docker Compose file I just placed them in the root of the repository. Add a docker-compose.yaml file and then added services. The first service I setup was the Postgres/PostgreSQL database. This is using the standard Postgres image on Docker Hub. I opted for version 12, I do want it to always restart if it gets shutdown or crashes, and then the last of the obvious settings is the port which maps from 5432 to 5432.

For the volume, since I might want to backup or tinker with the volume, I put the db_data location set to my own Codez directory. All my databases I tend to setup like this in case I need to debug things locally.

The POSTGRES_PASSWORD is an environment variable, thus the syntax ${PPASSWORD}. This way no passwords go into the repo. Then I can load the environment variable via a standard export POSTGRES_PASSWORD="theSecretPasswordHere!" line in my system startup script or via other means.

services:
  postgres:
    image: postgres:12
    restart: always
    volumes:
      - db_data:/Users/adron/Codez/databases
    environment:
      POSTGRES_PASSWORD: ${PPASSWORD}
    ports:
      - 5432:5432

For the db_data volume, toward the bottom I add the key value setting to reference it.

volumes:
  db_data:

Next I added the GraphQL solution with Hasura. The image for the v1.1.0 probably needs to be updated (I believe we’re on version 1.3.x now) so I’ll do that soon, but got the example working with v1.1.0. Next I’ve got the ports mapped to open 8080 to 8080. Next, this service will depend on the postgres service already detailed. Restart, also set on always just as the postgres service. Finally two evnironment variables for the container:

  • HASURA_GRAPHQL_DATABASE_URL – this variable is the base postgres URL connection string.
  • HASURA_GRAPHQL_ENABLE_CONSOLE – this is the variable that will set the console user interface to initiate. We’ll definitely want to have this for the development environment. However in production I’d likely want this turned off.
  graphql-engine:
    image: hasura/graphql-engine:v1.1.0
    ports:
      - "8080:8080"
    depends_on:
      - "postgres"
    restart: always
    environment:
      HASURA_GRAPHQL_DATABASE_URL: postgres://postgres:logistics@postgres:5432/postgres
      HASURA_GRAPHQL_ENABLE_CONSOLE: "true"

At this point the commands to start this are relatively minimal, but in spite of that I like to create a start and stop shell script. My start script and stop script simply look like this:

Starting the services.

docker-compose up -d

For the first execution of the services you may want to skip the -d and instead watch the startup just to become familiar with the events and connections as they start.

Stopping the services.

docker-compose down

πŸš€ That’s it for the basic development environment, we’re launched and ready for development. With the services started, navigate to https://localhost:8080/console to start working with the user interface, which I’ll have a more details on the “Beyond CRUD n’ Cruft Data-Modeling” swap to Hasura and Postgres in an upcoming blog post.

For full syntax of the docker-compose.yaml check out this gist: https://gist.github.com/Adron/0b2ea637b5e00681f4d62404805c3a00

Terraform Production Environment

For the production deployment of this stack I want to deploy to Azure, use Terraform for infrastructure as code, and the Azure database service for Postgres while running Hasura for my API GraphQL tier.

For the Terraform files I created a folder and added a main.tf file. I always create a folder to work in, generally, to keep the state files and initial prototyping of the infrastructre in a singular place. Eventually I’ll setup a location to store the state and fully automate the process through a continues integration (CI) and continuous delivery (CD) process. For now though, just a singular folder to keep it all in.

For this I know I’ll need a few variables and add those to the file. These are variables that I’ll use to provide values to multiple resources in the Terraform templating.

variable "database" {
  type = string
}

variable "server" {
  type = string
}

variable "username" {
  type = string
}

variable "password" {
  type = string
}

One other variable I’ll want so that it is a little easier to verify what my Hasura connection information is, will look like this.

output "hasura_url" {
  value = "postgres://${var.username}%40${azurerm_postgresql_server.logisticsserver.name}:${var.password}@${azurerm_postgresql_server.logisticsserver.fqdn}:5432/${var.database}"
}

Let’s take this one apart a bit. There is a lot of concatenated and interpolated variables being wedged together here. This is basically the Postgres connection string that Hasura will need to make a connection. It includes the username and password, and all of the pertinent parsed and string escaped values. Note specifically the %40 between the ${var.username} and ${azurerm_postgresql_server.logisticsserver.name} variables while elsewhere certain characters are not escaped, such as the @ sign. When constructing this connection string, it is very important to be prescient of all these specific values being connected together. But, I did the work for you so it’s a pretty easy copy and paste now!

Next I’ll need the Azure provider information.

provider "azurerm" {
  version = "=2.20.0"
  features {}
}

Note that there is a features array that is just empty, it is now required for the provider to designate this even if the array is empty.

Next up is the resource group that everything will be deployed to.

resource "azurerm_resource_group" "adronsrg" {
  name     = "adrons-rg"
  location = "westus2"
}

Now the Postgres Server itself. Note the location and resource_group_name simply map back to the resource group. Another thing I found a little confusing, as I wasn’t sure if it was a Terraform name or resource name tag or the server name itself, is the “name” key value pair in this resource. It is however the server name, which I’ve assigned var.server. The next value assigned “B_Gen5_2” is the Azure designator, which is a bit cryptic. More on that in a future post.

After that information the storage is set to, I believe if I RTFM’ed correctly to 5 gigs of storage. For what I’m doing this will be fine. The backup is setup for 7 days of retention. This means I’ll be able to fall back to a backup from any of the last seven days, but after 7 days the backups are rolled and the last day is deleted to make space for the newest backup. The geo_redundant_backup_enabled setting is set to false, because with Postgres’ excellent reliability and my desire to not pay for that extra reliability insurance, I don’t need geographic redundancy. Last I set auto_grow_enabled to true, albeit I do need to determine the exact flow of logic this takes for this particular implementation and deployment of Postgres.

The last chunk of details for this resource are simply the username and password, which are derived from variables, which are derived from environment variables to keep the actual username and passwords out of the repository. The last two bits set the ssl to enabled and the version of Postgres to v9.5.

resource "azurerm_postgresql_server" "logisticsserver" {
  name = var.server
  location = azurerm_resource_group.adronsrg.location
  resource_group_name = azurerm_resource_group.adronsrg.name
  sku_name = "B_Gen5_2"

  storage_mb                   = 5120
  backup_retention_days        = 7
  geo_redundant_backup_enabled = false
  auto_grow_enabled            = true

  administrator_login          = var.username
  administrator_login_password = var.password
  version                      = "9.5"
  ssl_enforcement_enabled      = true
}

Since the database server is all setup, now I can confidently add an actual database to that database. Here the resource_group_name pulls from the resource group resource and the server_name pulls from the server resource. The name, being the database name itself, I derive from a variable too. Then the character set is UTF8 and collation is set to US English, which are generally standard settings on Postgres being installed for use within the US.

resource "azurerm_postgresql_database" "logisticsdb" {
  name                = var.database
  resource_group_name = azurerm_resource_group.adronsrg.name
  server_name         = azurerm_postgresql_server.logisticsserver.name
  charset             = "UTF8"
  collation           = "English_United States.1252"
}

The next thing I discovered, after some trial and error and a good bit of searching, is the Postgres specific firewall rule. It appears this is related to the Postgres service in Azure specifically, as for a number of trials and many errors I attempted to use the standard available firewalls and firewall rules that are available in virtual networks. My understanding now is that the Postgres Servers exist outside of that paradigm and by relation to that have their own firewall rules.

This firewall rule basically attaches the firewall to the resource group, then the server itself, and allows internal access between the Postgres Server and the Hasura instance.

resource "azurerm_postgresql_firewall_rule" "pgfirewallrule" {
  name                = "allow-azure-internal"
  resource_group_name = azurerm_resource_group.adronsrg.name
  server_name         = azurerm_postgresql_server.logisticsserver.name
  start_ip_address    = "0.0.0.0"
  end_ip_address      = "0.0.0.0"
}

The last and final step is setting up the Hasura instance to work with the Postgres Server and the designated database now available.

To setup the Hasura instance I decided to go with the container service that Azure has. It provides a relatively inexpensive, easier to setup, and more concise way to setup the server than setting up an entire VM or full Kubernetes environment just to run a singular instance.

The first section sets up a public IP address, which of course I’ll need to change as the application is developed and I’ll need to provide an actual secured front end. But for now, to prove out the deployment, I’ve left it public, setup the DNS label, and set the OS type.

The next section in this resource I then outline the container details. The name of the container can be pretty much whatever you want it to be, it’s your designator. The image however is specifically hasura/graphql-engine. I’ve set the CPU and memory pretty low, at 0.5 and 1.5 respectively as I don’t suspect I’ll need a ton of horsepower just to test things out.

Next I set the port available to port 80. Then the environment variables HASURA_GRAPHQL_SERVER_PORT and HASURA_GRAPHQL_ENABLE_CONSOLE to that port to display the console there. Then finally that wild concatenated interpolated connection string that I have setup as an output variable – again specifically for testing – HASURA_GRAPHQL_DATABASE_URL.

resource "azurerm_container_group" "adronshasure" {
  name                = "adrons-hasura-logistics-data-layer"
  location            = azurerm_resource_group.adronsrg.location
  resource_group_name = azurerm_resource_group.adronsrg.name
  ip_address_type     = "public"
  dns_name_label      = "logisticsdatalayer"
  os_type             = "Linux"


  container {
    name   = "hasura-data-layer"
    image  = "hasura/graphql-engine"
    cpu    = "0.5"
    memory = "1.5"

    ports {
      port     = 80
      protocol = "TCP"
    }

    environment_variables = {
      HASURA_GRAPHQL_SERVER_PORT = 80
      HASURA_GRAPHQL_ENABLE_CONSOLE = true
    }
    secure_environment_variables = {
      HASURA_GRAPHQL_DATABASE_URL = "postgres://${var.username}%40${azurerm_postgresql_server.logisticsserver.name}:${var.password}@${azurerm_postgresql_server.logisticsserver.fqdn}:5432/${var.database}"
    }
  }

  tags = {
    environment = "datalayer"
  }
}

With all that setup it’s time to test. But first, just for clarity here’s the entire Terraform file contents.

provider "azurerm" {
  version = "=2.20.0"
  features {}
}

resource "azurerm_resource_group" "adronsrg" {
  name     = "adrons-rg"
  location = "westus2"
}

resource "azurerm_postgresql_server" "logisticsserver" {
  name = var.server
  location = azurerm_resource_group.adronsrg.location
  resource_group_name = azurerm_resource_group.adronsrg.name
  sku_name = "B_Gen5_2"

  storage_mb                   = 5120
  backup_retention_days        = 7
  geo_redundant_backup_enabled = false
  auto_grow_enabled            = true

  administrator_login          = var.username
  administrator_login_password = var.password
  version                      = "9.5"
  ssl_enforcement_enabled      = true
}

resource "azurerm_postgresql_database" "logisticsdb" {
  name                = var.database
  resource_group_name = azurerm_resource_group.adronsrg.name
  server_name         = azurerm_postgresql_server.logisticsserver.name
  charset             = "UTF8"
  collation           = "English_United States.1252"
}

resource "azurerm_postgresql_firewall_rule" "pgfirewallrule" {
  name                = "allow-azure-internal"
  resource_group_name = azurerm_resource_group.adronsrg.name
  server_name         = azurerm_postgresql_server.logisticsserver.name
  start_ip_address    = "0.0.0.0"
  end_ip_address      = "0.0.0.0"
}

resource "azurerm_container_group" "adronshasure" {
  name                = "adrons-hasura-logistics-data-layer"
  location            = azurerm_resource_group.adronsrg.location
  resource_group_name = azurerm_resource_group.adronsrg.name
  ip_address_type     = "public"
  dns_name_label      = "logisticsdatalayer"
  os_type             = "Linux"


  container {
    name   = "hasura-data-layer"
    image  = "hasura/graphql-engine"
    cpu    = "0.5"
    memory = "1.5"

    ports {
      port     = 80
      protocol = "TCP"
    }

    environment_variables = {
      HASURA_GRAPHQL_SERVER_PORT = 80
      HASURA_GRAPHQL_ENABLE_CONSOLE = true
    }
    secure_environment_variables = {
      HASURA_GRAPHQL_DATABASE_URL = "postgres://${var.username}%40${azurerm_postgresql_server.logisticsserver.name}:${var.password}@${azurerm_postgresql_server.logisticsserver.fqdn}:5432/${var.database}"
    }
  }

  tags = {
    environment = "datalayer"
  }
}

variable "database" {
  type = string
}

variable "server" {
  type = string
}

variable "username" {
  type = string
}

variable "password" {
  type = string
}

output "hasura_url" {
  value = "postgres://${var.username}%40${azurerm_postgresql_server.logisticsserver.name}:${var.password}@${azurerm_postgresql_server.logisticsserver.fqdn}:5432/${var.database}"
}

To run this, similarly to how I setup the dev environment, I’ve setup a startup and shutdown script. The startup script named prod-start.sh has the following commands. Note the $PUSERNAME and $PPASSWORD are derived from environment variables, where as the other two values are just inline.

cd terraform

terraform apply -auto-approve \
    -var 'server=logisticscoresystemsdb' \
    -var 'username='$PUSERNAME'' \
    -var 'password='$PPASSWORD'' \
    -var 'database=logistics'

For the full Terraform file check out this gist: https://gist.github.com/Adron/6d7cb4be3a22429d0ff8c8bd360f3ce2

Executing that script gives me results that, if everything goes right, looks similarly to this.

./prod-start.sh 
azurerm_resource_group.adronsrg: Creating...
azurerm_resource_group.adronsrg: Creation complete after 1s [id=/subscriptions/77ad15ff-226a-4aa9-bef3-648597374f9c/resourceGroups/adrons-rg]
azurerm_postgresql_server.logisticsserver: Creating...
azurerm_postgresql_server.logisticsserver: Still creating... [10s elapsed]
azurerm_postgresql_server.logisticsserver: Still creating... [20s elapsed]


...and it continues.

Do note that this process will take a different amount of time and is completely normal for it to take ~3 or more minutes. Once the server is done in the build process a lot of the other activities start to take place very quickly. Once it’s all done, toward the end of the output I get my hasura_url output variable so that I can confirm that it is indeed put together correctly! Now that this is preformed I can take next steps and remove that output variable, start to tighten security, and other steps. Which I’ll detail in a future blog post once more of the application is built.

... other output here ...


azurerm_container_group.adronshasure: Still creating... [40s elapsed]
azurerm_postgresql_database.logisticsdb: Still creating... [40s elapsed]
azurerm_postgresql_database.logisticsdb: Still creating... [50s elapsed]
azurerm_container_group.adronshasure: Still creating... [50s elapsed]
azurerm_postgresql_database.logisticsdb: Creation complete after 51s [id=/subscriptions/77ad15ff-226a-4aa9-bef3-648597374f9c/resourceGroups/adrons-rg/providers/Microsoft.DBforPostgreSQL/servers/logisticscoresystemsdb/databases/logistics]
azurerm_container_group.adronshasure: Still creating... [1m0s elapsed]
azurerm_container_group.adronshasure: Creation complete after 1m4s [id=/subscriptions/77ad15ff-226a-4aa9-bef3-648597374f9c/resourceGroups/adrons-rg/providers/Microsoft.ContainerInstance/containerGroups/adrons-hasura-logistics-data-layer]

Apply complete! Resources: 5 added, 0 changed, 0 destroyed.

Outputs:

hasura_url = postgres://postgres%40logisticscoresystemsdb:theSecretPassword!@logisticscoresystemsdb.postgres.database.azure.com:5432/logistics

Now if I navigate over to logisticsdatalayer.westus2.azurecontainer.io I can view the Hasura console! But where in the world is this fully qualified domain name (FQDN)? Well, the quickest way to find it is to navigate to the Azure portal and take a look at the details page of the container itself. In the upper right the FQDN is available as well as the IP that has been assigned to the container!

Navigating to that FQDN URI will bring up the Hasura console!

Next Steps

From here I’ll take up next steps in a subsequent post. I’ll get the container secured, map the user interface or CLI or whatever the application is that I build lined up to the API end points, and more!

References

Sign Up for Thrashing Code

For JavaScript, Go, Python, Terraform, and more infrastructure, web dev, and coding in general I stream regularly on Twitch at https://twitch.tv/adronhall, post the VOD’s to YouTube along with entirely new tech and metal content at https://youtube.com/c/ThrashingCode.

Survey of Go Libraries for Database Work

Over the past few months I’ve picked up a number of libraries in the Go ecosystem to help me get work done around database engineering. These libraries are ones that I have used to do a range of work primarily around Apache Cassandra, DataStax Enterprise, PostgreSQL, and to a lesser degree MS SQL Server, MySQL, and others. The following is a survey of libraries that I’ve found to be pretty solid for getting the job done.

DevOps Days Vancouver - Architecture Guidance - Venomous Database Reliability Engineering (5)

I’ve broken the follow tooling libraries out into the following categories:

  • Observability, Monitoring, & Insight – I created this section, and added libraries to it based specifically on the specific and peculiarly pedantic nature of observability in light of monitoring that work to provide insight into one’s applications they’re responsible for. For additional information about observability check out the Wikipedia article on the topic observability, it’s a great starting point. For monitoring however it gets more specific with a breakdown of monitoring types: application performance monitoring, network monitoring, system monitoring, and business transaction monitoring. The libraries in this section apply to some or all of the criteria in this definitions.
  • Data Schema Migration – Managing one’s data schema for a database, even really, truly, honestly if you have a schema-less system you still need to manage the underlying schema at some level.
  • Flow, Pipelines, Extraction, Transformation, and Loading – This section is mutative in the sense that it includes a lot of various types of libraries that have a very wide range of work to do and they offers a plethora of ways to do this work. Creating pipelines, to flow sequences, to extraction and transformation, to standard bulk loading. These libraries provide ways to get the data where you need it when you need it there in effective and reliable ways.
  • Database Backup Libraries – There are a zillion different things to maintaining effective and useful database backups; onsite storage, offsite storage, rotation periods, transmission & security control, scheduling, full or differential, and other topics of concern. One of the most important and often overlooked aspect of database backups is actually restoring the database from backup! These libraries can be used to get those backups, automate, and implement restoration of data in a more seamless way.
  • Database Drivers – At the core of any programmable automation of databases, one needs to have some way to connect to and work with the databases they’re automating, that’s where database drivers come into play. For Go, there’s a ton of support on every relatively known database in existence. MS SQL, Apache Cassandra, PostgreSQL, and dozens more!

DevOps Days Vancouver - Architecture Guidance - Venomous Database Reliability Engineering

Veneur – Largely used by and originating from Stripe. This library works as a distributed, fault tolerant pipeline for data emitted from run time on systems and services throughout your environment. It has server implementations of the DogStatsD protocol or SSFΒ (Sensor Sensibility Format)Β for aggregating metrics and sending these metrics for storage or via sinks to various other systems. The system can also works up histograms, sets, and counters as global aggregator.

TLDR;

Veneur is a convenient sink for various observability primitives with lots of outputs!

Honeycomb.io – Honeycomb I did some work for back in February of 2018 and gotta say I loved the team. Charity @mipsytipsy, Christine @cyen, Ben @maplebed and crew are tops! Friendly, wildly smart, and humble thrown in for good measure. With that said, I’m also a fan of the product. It’s a solid high cardinality, query and event intake system for observability. There are libraries for Go as well as others, and it’s pretty easy to use the library to setup ingest for appropriately instrumented applications.

TLDR;

Honeycomb.io is a Saas tool with available libraries for Go to provide observability insight and data collection for your applications!

OpenCensus – This framework and toolsetprovides ways to get telemetry out of your services. CurrentlyΒ  there are libraries for a number of languages that allow you to capture, manipulate, and export metrics and distributed traces to your data store of choice. The key idea is that OpenCensus works via tracing through the course of events in an application and that data is logged for awareness, insight, and thus observability of your systems.

TLDR;

OpenCensus is a library that provides ways to gather telemetry for your services and store it in your choice of a location.

RxGo – This library is a reactive extensions built for Go. This one is as much a programming concept as it is a way to enhance and specifically focus on observability, so let’s take a look at the intro example they’ve got on the actual repo README.md itself.

ReactiveX, or Rx for short, is an API for programming with observable streams. This is a ReactiveX API for the Go language.

ReactiveXΒ is a new, alternative way of asynchronous programming to callbacks, promises and deferred. It is about processing streams of events or items, with events being any occurrences or changes within the system.

In Go, it is simpler to think of a observable stream as a channel which canΒ SubscribeΒ to a set of handler or callback functions.

The pattern is that youΒ SubscribeΒ to anΒ ObservableΒ using anΒ Observer:

subscription := observable.Subscribe(observer)

AnΒ ObserverΒ is a type consists of threeΒ EventHandlerΒ fields, theΒ NextHandler,Β ErrHandler, andΒ DoneHandler, respectively. These handlers can be evoked withΒ OnNext,Β OnError, andΒ OnDoneΒ methods, respectively.

TheΒ ObserverΒ itself is also anΒ EventHandler. This means all types mentioned can be subscribed to anΒ Observable.

nextHandler := func(item interface{}) interface{} {
    if num, ok := item.(int); ok {
        nums = append(nums, num)
    }
}

// Only next item will be handled.
sub := observable.Subscribe(handlers.NextFunc(nextHandler))

TLDR;

RxGo are the reactive extensions that make it easier to go full scale and spectrum observability, with significantly greater insight into your applications over time and the events they execute.

DevOps Days Vancouver - Architecture Guidance - Venomous Database Reliability Engineering (1)

Go-Migrate – This library is written in Go and handles data schema migrations for a significant number of databases; PostgreSQL, MySQL, SQLite, RedShift, Neo4j, CockroadDB, and that’s just a few.

Example:

migrate -source file://path/to/migrations -database postgres://localhost:5432/database up 2

TLDR;

Go-Migrate is an open source library that can be used via CLI or in code to manage all your schema migration needs.

Gocqlx Migrate – This library primarily provides extensions to the Go CQL driver library, and one of those extensions specifically is a data-schema migration functionality.

Example:

package main

import (
    "context"

    "github.com/scylladb/gocqlx/migrate"
)

const dir = "./cql" 

func main() {
    session := CreateSession()
    defer session.Close()

    ctx := context.Background()
    if err := migrate.Migrate(ctx, session, dir); err != nil {
        panic(err)
    }
}

TLDR;

Gocqlx Migrate is a feature of the Gocqlx extensions library that can be used for schema migrations from within code.

DevOps Days Vancouver - Architecture Guidance - Venomous Database Reliability Engineering (2)

Pachyderm – (Open Source Repo) A pachyderm is

a very large mammal with thick skin, especially an elephant, rhinoceros, or hippopotamus.

So it is kind of a fitting name for this library. The library, the project itself, has found funding and bills itself as “Scalable, Reproducible Data Science“. I’ve used it minimally myself, but find it continually popping up on my “use this tool because you’ll need a ton of the features” list.

TLDR;

Pachyderm is an open source library, and paired capital funded company, that does indeed provide scalable, reproducible data science in addition to being a great library for your ETL and related data management needs.

Reflow – This library provides incremental data processing in the cloud. Providing this ability gives scientists and engineers the ability to put tools together, packaged in Docker images, using programming constructs. The library then evaluates the programs transparently parallelizing the work and memoizing results – i.e. using go routines and caching data appropriately to speed up tasks. The library was created atΒ GRAILΒ to manage our NGS (next generation sequencing) bioinformatics workloads onΒ AWS, but has also been used for many other applications, including model training and ad-hoc data analyses. Severl of Reflow’s key features include:

  • functional, lazy, type-safe Domain Specific Language (DSL) for writing workflow programs.
  • the runtime for the DSL evaluates incrementally, coordinating cluster execution, and memoization.
  • a cluster scheduler to dynamically provision and tear down resources in the cloud (currently AWS is supported).
  • with containers the same processing workloads can also be executed locally.

TLDR;

Reflow provides a way for data scientists, and by proxy database administrators, data programmers, programmers, and anybody that needs to work through ETL or related work to write programs against that data in the cloud or locally.

DevOps Days Vancouver - Architecture Guidance - Venomous Database Reliability Engineering (3)

ResticΒ (Github) – Restic is a backup CLI and Go library that will backup to a number of sources, a few including; local directory, sftp, http REST, S3, Google Cloud Storage, Azure Blob Storage, and others.

Restic follows several objectives:

  • The tool aims to be easy, with minimal singular steps to execute a backup.
  • The tool aims to be fast, using appropriate mechanisms to ensure speedy backups.
  • The tool aims to provide verifiable backups that can easily be restored.
  • The tool aims to incorporate cryptographic guarantees of confidentiality to make sure the backups are secure.
  • The tool aims to be efficient with additional snapshots only taking the storage of the actual increment and de-duplicated to save space in the storage back end.

DevOps Days Vancouver - Architecture Guidance - Venomous Database Reliability Engineering (4)

For each of these there’s a particular single driver that I use for each. Except in the case of Apache Cassandra and DataStax Enterprise I have also picked up gocqlx to add to my gocql usage.

PostgreSQL – Features:

  • SSL
  • Handles bad connections for database/sql
  • Scan time.Time correctly (i.e. timestamp[tz], time[tz], date)
  • Scan binary blobs correctly (i.e. bytea)
  • Package for hstore support
  • COPY FROM support
  • pq.ParseURL for converting urls to connection strings for sql.Open.
  • Many libpq compatible environment variables
  • Unix socket support
  • Notifications: LISTEN/NOTIFY
  • pgpass support

Gocql & Gocqlx

Gocql Features:

  • Modern Cassandra client using the native transport
  • Automatic type conversions between Cassandra and Go
    • Support for all common types including sets, lists and maps
    • Custom types can implement a Marshaler and Unmarshaler interface
    • Strict type conversions without any loss of precision
    • Built-In support for UUIDs (version 1 and 4)
  • Support for logged, unlogged and counter batches
  • Cluster management
    • Automatic reconnect on connection failures with exponential falloff
    • Round robin distribution of queries to different hosts
    • Round robin distribution of queries to different connections on a host
    • Each connection can execute up to n concurrent queries (whereby n is the limit set by the protocol version the client chooses to use)
    • Optional automatic discovery of nodes
    • Policy based connection pool with token aware and round-robin policy implementations
  • Support for password authentication
  • Iteration over paged results with configurable page size
  • Support for TLS/SSL
  • Optional frame compression (using snappy)
  • Automatic query preparation
  • Support for query tracing
  • Support for Cassandra 2.1+ binary protocol version 3
    • Support for up to 32768 streams
    • Support for tuple types
    • Support for client side timestamps by default
    • Support for UDTs via a custom marshaller or struct tags
  • Support for Cassandra 3.0+ binary protocol version 4
  • An API to access the schema metadata of a given keyspace

Gocqlx Features:

  • Binding query parameters form struct or map
  • Scanning results directly into struct or slice
  • CQL query builder (package qb)
  • Super simple CRUD operations based on table model (package table)
  • Database migrations (package migrate)

Go-MSSQLDB – Features:

  • Can be used with SQL Server 2005 or newer
  • Can be used with Microsoft Azure SQL Database
  • Can be used on all go supported platforms (e.g. Linux, Mac OS X and Windows)
  • Supports new date/time types: date, time, datetime2, datetimeoffset
  • Supports string parameters longer than 8000 characters
  • Supports encryption using SSL/TLS
  • Supports SQL Server and Windows Authentication
  • Supports Single-Sign-On on Windows
  • Supports connections to AlwaysOn Availability Group listeners, including re-direction to read-only replicas.
  • Supports query notifications

So this is just a few of the libraries I use, have worked with, and suggest checking out if you’re delving into database work and especially building systems around databases for reliability and related efforts.

If you’ve got other libraries that you’ve used, or really like, definitely leave a comment and let me know and I’ll update the post to include new libraries for Go. Subscribe to the blog too as I’ve got more posts in the cooker for database work, Go libraries and usage with databases, and a lot more. Happy thrashing code!

Bunches of Databases in Bunches of Weeks – PostgreSQL Day 1

May the database deluge begin, it’s time for “Bunches of Databases in Bunches of Weeks”. We’ll get into looking at databases similar to how they’re approached in “7 Databases in 7 Weeks“. In this session I got into a hard look at PostgreSQL or as some refer to it just Postgres. This is the first of a few sessions on PostgreSQL in which I get the database installed locally on Ubuntu. Which is transferable to any other operating system really, PostgreSQL is awesome like that. Then after installing and getting pgAdmin 4, the user interface for PostgreSQL working against that, I go the Docker route. Again, pointing pgAdmin 4 at that and creating a database and an initial table.

Below the video here I’ve added the timeline and other details, links, and other pertinent information about this series.

0:00 – The intro image splice and metal intro with tunes..
3:34 – Start of the video database content.
4:34 – Beginning the local installation of Postgres/PostgreSQL on the local machine.
20:30 – Getting pgAdmin 4 installed on local machine.
24:20 – Taking a look at pgAdmin 4, a stroll through setting up a table, getting some basic SQL from and executing with pgAdmin 4.
1:00:05 – Installing Docker and getting PostgreSQL setup as a container!
1:00:36 – Added the link to the stellar post at Digital Ocean’s Blog.
1:00:55 – My declaration that if Digital Ocean just provided documentation I’d happily pay for it, their blog entries, tutorials, and docs are hands down some of the best on the web!
1:01:10 – Installing Postgesql on Ubuntu 18.04.
1:06:44 – Signing in to Docker hub and finding the official Postgresql Docker Image.
1:09:28 – Starting the container with Docker.
1:10:24 – Connecting to the Docker Postgresql Container with pgadmin4.
1:13:00 – Creating a database and working with SQL, tables, and other resources with pgAdmin4 against the Docker container.
1:16:03 – The hacker escape outtro. Happy thrashing code!

For each of these sessions for the “Bunches of Databases in Bunches of Weeks” series I’ll follow this following sequence. I’ll go through each database in this list of my top 7 databasesΒ for day 1 (see below), then will go through each database and work through the day 2, and so on. Accumulating additional days similarly to the “7 Databases in 7 Weeks

Day 1” of the Database, I’ll work toward building a development installation of the particular database. For example, in this session I setup PostgreSQL by installing it to the local machine and also pulled a Docker image to run PostgreSQL.

Day 2” of the respective database, I’ll get into working against the database with CQL, SQL, or whatever that one would use to work specifically with the database directly. At this point I’ll also get more deeply into the types, inserting, and storing data in the respective database.

Day 3” of the respective database, I’ll get into connecting an application with C#, Node.js, and Go. Implementing a simple connection, prospectively a test of the connection, and do a simple insert, update, and delete of some sort against the respective database built on the previous day 2 of the same database.

Day 4” and onward I’ll determine the path and layout of the topic later, so subscribe on YouTube and Twitch, and tune in. The events are scheduled, with the option to be notified when a particular episode is coming on that you’d like to watch here on Twitch.

Next Events for “Bunches of Databases in Bunches of Days

Let’s Talk Top 7 Options for Database Gumbo

When one starts to dig into databases things get really complex really fast. There’s not only a whole plethora of database companies and projects, but database types, storage engines, and other options and functionality to choose from. One place to get a start is just to take a look at the crazy long list of databases on db-engines. In this post I’m going to take a look at a few of the top database engines to create a starting point – which I’ll reference – for future video streaming coding sessions (follow me @ twitch.tv/adronhall).

My Options for Database Gumbo

  1. Apache Cassandra / DataStax Enterprise
  2. Postgresql
  3. SQL Server
  4. Elasticsearch
  5. Redis
  6. SQLite
  7. Dynamo DB

The Reasons

Ok, so the list is as such, and as stated it’s my list. There are a lot of databases, and of course some are still more used such as Oracle. However here’s some of the logic and reasoning behind my choices above.

Oracle

First off I feel like I need to broach the Oracle topic. Mostly because of their general use in industry. I’m not doing anything with Oracle now, nor have I for years for a long, long, LONG list of reasons. Using their software tends to be buried in bureaucratic, oddly broken and unnecessary usage today anyway. They use predatory market tactics, completely dishonorable approach to sales and services, as well as threatening and suing people for doing benchmarks, and a host of other practices. In face to face experiences, Oracle tends to give off experiences, that Lawrence from Office Space would say, “naw man, I think you’d get your ass kicked for that!” and I agree. Oracle’s practices are too often disgusting. But even from the purely technical point of view, the Oracle Database and ecosystem itself really isn’t better than other options out there. It is indeed a better, more intelligently strategic and tactical option to use a number of alternatives.

Apache Cassandra / DataStax Enterprise

This combo has multiple reasons and logic to be on the list. First and foremost, much of my work today is using DataStax Enterprise (DSE) and Apache Cassandra since I work for DataStax. But it’s important to know I didn’t just go to DataStax because I needed a job, but because I chose them (and obviously they chose me by hiring me) because of the team and technology. Yes, they pay me, but it’s very much a two way street, I advocate Cassandra and DSE because I personally know the tech is top tier and solid.

On the fact that Apache Cassandra is top tier and solid, it is simply the remaining truly masterless distributed database that provides a linear path of scalability on the market that you can use, buy support for, and is actually actively and knowingly maintained not just by DataStax but by members of the community. One could make an argument for MongoDB but I’ll maybe elaborate on that in the future.

In addition to being a solid distributed database there are capabilities inherent in Apache Cassandra because of the data types and respective the CQL (Cassandra Query Language) that make it a great database to use too. DataStax Enterprise extends that to provide spatial (re: GIS/Geo Data/Queries), graph data, analytics engine, and more built on other components like SOLR and related technology. Overall a great database and great prospective combinations with the database.

Postgresql

Postgres is a relational database that has been around for a long time. It’s got some really awesome features like native JSON support, which I’m a big fan of. But I digress, there’s tons of other material that lays out thoroughly why to use Postgres which I very much agree with.

Just from the perspective of the extensive and rich data types Postgres is enough to be put on this list, but considering there are a lot of reasons around multi-tenancy, scalability, and related characteristics that are mostly unique to Postgres it’s held a solid position.

SQL Server

This one is on my list for a few reasons that have nothing to do with features or capabilities. This is the first database I was responsible for in its entirety. Administration, queries, query tuning, setup, and developer against with the application tier. I think of all my experience, this database I’ve spent the most time with, with Apache Cassandra being a close second, then Postgres and finally Riak.

Kind of a pattern there eh? Relational, distributed, relational, distributed!

The other thing about SQL Server however is the integrations, tooling, and related development ecosystem around SQL Server is above and beyond most options out there. Maybe, with a big maybe, Oracle’s ecosystem might be comparable but the pricing is insanely different. In that SQL Server basically can carry the whole workload, reporting, ETL, and other feature capabilities that the Oracle ecosystem has traditionally done. Combine SQL Server with SSIS (SQL Server Integration Services), SSRS (SQL Server Reporting Services), and other online systems like Azure’s SQL Database and the support, tooling, and ecosystem is just massive. Even though I’ve had my ins and outs with Microsoft over the years, I’ve always found myself enjoying working on SQL Server and it’s respective tooling options and such. It’s a feature rich, complete, solidly, and generally well performing relational database, full stop.

Elasticsearch

Ok, this is kind of a distributed database of sorts but focused more exclusively (not totally since it’s kind of expanded its roles) search engine. Overall I’ve had good experiences with Elasticsearch and it’s respective ELK (or Elastic ecosystem) of tooling and such, with some frustrating flakiness here and there over the years. Most of my experience has come from an operational point of view with Elasticsearch. I’ve however done a fair bit of work over the years in supporting teams that are doing actual software development against the system. I probably won’t write a huge amount about Elasticsearch in the coming months, but I’ll definitely bring it up at certain times.

Redis / SQLite / DynamoDB

These I’ll be covering in the coming months. For Redis and DynamoDB I have wanted to dig in for some comparison analysis from the perspective of implementing data tiers against these databases, where they are a good option, and determining where they’re just an outright bad option.

For SQLite I’ve used it on and off for many years, but have wanted to sit down and just learn it and try out some of its features a bit more.