Concurrency Patterns in Go: A Short Deep Dive Series

Introduction

Concurrency is one of those topics that can make even experienced developers break out in a cold sweat. It’s like trying to juggle flaming chainsaws while riding a unicycle on a tightrope. But here’s the thing – in today’s world of multi-core processors, distributed systems, and high-performance applications, understanding concurrency isn’t just a nice-to-have skill; it’s absolutely essential.

Go, with its goroutines and channels, makes concurrency more approachable than most languages. But just because it’s easier doesn’t mean it’s easy. You still need to understand the patterns, the pitfalls, and the best practices to build robust, scalable systems.

That’s what this series is about. We’re going to dive deep into the concurrency patterns that every Go developer should know. Not just the theory – we’ll look at real, working code examples that you can run, modify, and learn from. This blog post is going to also act as the index to the posts, with today starting with the Pipeline Pattern.

What You’ll Learn

This series covers 12 essential concurrency patterns, each with practical examples and detailed explanations. Here’s what’s coming:

Continue reading →

CAP Theorem Insights for Apache Kafka and Flink

In this article, I’ll explore CAP Theorem and its implications on distributed systems, particularly focusing on Apache Kafka, Apache Flink, and Apache Cassandra. I’ll then dissect how CAP influences these systems in real-world scenarios, delve into some of the edge cases like split-brain scenarios, and offer actionable strategies to mitigate challenges. Finally, a wrap up with deployment strategies for self-hosted environments and discuss how Confluent Cloud tackles CAP-related challenges.

What is the CAP Theorem?

The CAP Theorem, introduced by Eric Brewer, states that in a distributed data system, you can only guarantee two out of the following three properties:

Consistency (C): Every read receives the most recent write or an error.
Availability (A): Every request receives a response, even if it’s not the most recent write.
Partition Tolerance (P): The system continues to function despite network partitions.

This means that distributed systems inherently make trade-offs, and understanding these trade-offs is key to designing robust architectures.

CAP Theorem and Apache Kafka/Flink

Continue reading →

Join the Apollo Beta for FREE! Help the Databass!

Hello to all the data curious, database lovers, and sciency datamungers! I have a small favor to ask of you all. At DataStax we just opened up our Apollo service i.e. “Apache Cassandra as a Service” i.e. DBaaS offering and I’m looking for people that want to test drive the database! Now, you don’t have to actually tell me you’re using it or anything, but I’d love to know if you are. Maybe we could even chat about your experience using it.

To get started:

Sign up here.
Create a database here.
Pick a driver here. [C#/F#, Node.js/JavaScript, Java, C++, and Python] – I added F# cuz ya know, that’s how F# works and all, you just use the C# driver and BOOM, you’ve got F# access!!
Write CQL and execute the database!
Profit!

Alright, where profit is that’s when you let me know what works for you and what doesn’t. Feel free to comment here, ping me via Twitter @adron, or via the response form here, or however you’ve got to message me. I’d be super stoked to chat!

Getting Started Specifics

To create a database, once you’ve got an account, just navigate to https://apollo.datastax.com/createDatabase and you’ll get prompted with the following screen.

Currently during beta we have AWS as the provider option, and you can choose between Developer, Startup, Standard, and Enterprise. Each offering various configurations and future prospective SLA’s and such.

Once you have the database name, keyspace, user name, and you password set, click on Launch Database and the spin up of the multi-node database will begin. You’ll be greeted with a message notifying you that it’ll take a little bit of time for the database to spin up and an email will be sent once it is done. Enjoy a coffee in the meantime.

Once the database spins up there are two key sections on the database page. First, there is the connection details. They’re located in the bottom left of the database page.

If you click on the “Learn How” you’ll get directly linked to the docs pages with multiple examples of how to get connected to the database you’ve just created. You can also reset your password here and retrieve the security bundle (it’s a tar/zip file) that you’ll need to authenticate any applications with.

The other part that can be really helpful, especially as you do any development or testing with your database is the grafana dashboard. It’s on the Health tab of the database page.

A trick that I used, to get an easier and full screen view of all the metrics, is to inspect the page right at the metrics, within that you’ll find the iframe in which to get the link specifically to the Grafana metrics. They look pretty nice broken out of frame! As you work through queries and such keep an eye on this for extra insight.

Any other thoughts, contemplation, or otherwise do get in touch!

In Flight to Apache Cassandra Days

Another flight down to the bay area. Today it was Alaska Air Flight 330 from Seattle to San Jose. It was mostly a clear day at start, with a solid layer of bright cloud cover exiting Washington on the way down to Oregon. As we crossed over that arbitrary human defined line of Oregon and California, nature presented us with even more perfectly glowing bright cloud cover. This is Cascadia after all and it’s basically covered in clouds the majority of the time. On departure I also noted Bremerton has three aircraft carriers in dock along with a normal plethora of other naval vessels. The amount of naval power in the area is always pretty awe inspiring.

Why was I in flight once again? I am heading down to teach with Jeff Carpenter (@jscarp) at the South Bay Cassandra User Group‘s Cassandra Day events. These are single day events, where we cover an introduction to Apache Cassandra, concepts of data-modeling for Apache Cassandra, and then a wrap up of application development with the respective drivers. Now if you aren’t in Santa Clara – or ya know Menlo Park, San Jose, Oakland, San Francisco, or well, the surrounding area – there are other days scheduled! We also have days scheduled that aren’t even located in the Bay, so check out the full list of events:

https://www.datastax.com/company/events

NOTE: If you’re interested in Seattle, Portland, or Vancouver BC area events, scroll all the way down to the end of this blog entry I’ve got more details for you!

Introduction to Apache Cassandra

In the introduction to Apache Cassandra we cover an overview of the architecture and features of the distributed database. Starting off with a definition of a distributed hash ring and how this is used in Apache Cassandra to provide data storage across the nodes that make up the Apache Cassandra Database. Moving on we’ll get into the other capabilities, trade offs of data replication between nodes, configuration settings, and a lot more.

Data Modeling

For data modeling we start off with a short review of relational database data modeling to provide something that is more familiar for many people. From this, we then build off of many concepts around denormalization, breaking apart various levels of normalization forms, and then get into the thinking and approach behind modeling an application in a distributed database and go deeper with details around Apache Cassandra.

Application Development

For application development, focusing around the Java language and technology stack, we’ll start with some concepts around how the drivers connect to and work with Apache Cassandra. We’ll open up some code too, get into some code changes and additions, to get more familiar with how the driver works and some of the capabilities of the driver itself.

Most of the code, concepts, and related material in use around Java and the tech stack are directly usable on C#, JavaScript, and even using the community open source Go CQL Library.

Coming soon…

In the coming weeks (ok, maybe a month or two) we’ll be updating this material for Apache Cassandra v4 and additionally, I’m aiming to line up some half day and probably some full day workshops in the Cascadian region: Portland, Seattle, and Vancouver BC. They’ll be almost identical except for a few tweaks, but you’ll have to RSVP to find out the details!

Also, if you’re in between any of those cities and have a stop on the Amtrak Cascades, let me know and we’ll get an RSVP list started for your city and see if we can get the required attendee count to make it official!

Survey of Go Libraries for Database Work

Over the past few months I’ve picked up a number of libraries in the Go ecosystem to help me get work done around database engineering. These libraries are ones that I have used to do a range of work primarily around Apache Cassandra, DataStax Enterprise, PostgreSQL, and to a lesser degree MS SQL Server, MySQL, and others. The following is a survey of libraries that I’ve found to be pretty solid for getting the job done.

I’ve broken the follow tooling libraries out into the following categories:

Observability, Monitoring, & Insight – I created this section, and added libraries to it based specifically on the specific and peculiarly pedantic nature of observability in light of monitoring that work to provide insight into one’s applications they’re responsible for. For additional information about observability check out the Wikipedia article on the topic observability, it’s a great starting point. For monitoring however it gets more specific with a breakdown of monitoring types: application performance monitoring, network monitoring, system monitoring, and business transaction monitoring. The libraries in this section apply to some or all of the criteria in this definitions.
Data Schema Migration – Managing one’s data schema for a database, even really, truly, honestly if you have a schema-less system you still need to manage the underlying schema at some level.
Flow, Pipelines, Extraction, Transformation, and Loading – This section is mutative in the sense that it includes a lot of various types of libraries that have a very wide range of work to do and they offers a plethora of ways to do this work. Creating pipelines, to flow sequences, to extraction and transformation, to standard bulk loading. These libraries provide ways to get the data where you need it when you need it there in effective and reliable ways.
Database Backup Libraries – There are a zillion different things to maintaining effective and useful database backups; onsite storage, offsite storage, rotation periods, transmission & security control, scheduling, full or differential, and other topics of concern. One of the most important and often overlooked aspect of database backups is actually restoring the database from backup! These libraries can be used to get those backups, automate, and implement restoration of data in a more seamless way.
Database Drivers – At the core of any programmable automation of databases, one needs to have some way to connect to and work with the databases they’re automating, that’s where database drivers come into play. For Go, there’s a ton of support on every relatively known database in existence. MS SQL, Apache Cassandra, PostgreSQL, and dozens more!

Veneur – Largely used by and originating from Stripe. This library works as a distributed, fault tolerant pipeline for data emitted from run time on systems and services throughout your environment. It has server implementations of the DogStatsD protocol or SSF (Sensor Sensibility Format) for aggregating metrics and sending these metrics for storage or via sinks to various other systems. The system can also works up histograms, sets, and counters as global aggregator.

TLDR;

Veneur is a convenient sink for various observability primitives with lots of outputs!

Honeycomb.io – Honeycomb I did some work for back in February of 2018 and gotta say I loved the team. Charity @mipsytipsy, Christine @cyen, Ben @maplebed and crew are tops! Friendly, wildly smart, and humble thrown in for good measure. With that said, I’m also a fan of the product. It’s a solid high cardinality, query and event intake system for observability. There are libraries for Go as well as others, and it’s pretty easy to use the library to setup ingest for appropriately instrumented applications.

TLDR;

Honeycomb.io is a Saas tool with available libraries for Go to provide observability insight and data collection for your applications!

OpenCensus – This framework and toolset provides ways to get telemetry out of your services. Currently there are libraries for a number of languages that allow you to capture, manipulate, and export metrics and distributed traces to your data store of choice. The key idea is that OpenCensus works via tracing through the course of events in an application and that data is logged for awareness, insight, and thus observability of your systems.

TLDR;

OpenCensus is a library that provides ways to gather telemetry for your services and store it in your choice of a location.

RxGo – This library is a reactive extensions built for Go. This one is as much a programming concept as it is a way to enhance and specifically focus on observability, so let’s take a look at the intro example they’ve got on the actual repo README.md itself.

ReactiveX, or Rx for short, is an API for programming with observable streams. This is a ReactiveX API for the Go language.

ReactiveX is a new, alternative way of asynchronous programming to callbacks, promises and deferred. It is about processing streams of events or items, with events being any occurrences or changes within the system.

In Go, it is simpler to think of a observable stream as a channel which can Subscribe to a set of handler or callback functions.

The pattern is that you Subscribe to an Observable using an Observer:
subscription := observable.Subscribe(observer)
An Observer is a type consists of three EventHandler fields, the NextHandler, ErrHandler, and DoneHandler, respectively. These handlers can be evoked with OnNext, OnError, and OnDone methods, respectively.

The Observer itself is also an EventHandler. This means all types mentioned can be subscribed to an Observable.
nextHandler := func(item interface{}) interface{} {
    if num, ok := item.(int); ok {
        nums = append(nums, num)
    }
}

// Only next item will be handled.
sub := observable.Subscribe(handlers.NextFunc(nextHandler))

TLDR;

RxGo are the reactive extensions that make it easier to go full scale and spectrum observability, with significantly greater insight into your applications over time and the events they execute.

Go-Migrate – This library is written in Go and handles data schema migrations for a significant number of databases; PostgreSQL, MySQL, SQLite, RedShift, Neo4j, CockroadDB, and that’s just a few.

Example:

migrate -source file://path/to/migrations -database postgres://localhost:5432/database up 2

TLDR;

Go-Migrate is an open source library that can be used via CLI or in code to manage all your schema migration needs.

Gocqlx Migrate – This library primarily provides extensions to the Go CQL driver library, and one of those extensions specifically is a data-schema migration functionality.

Example:

package main

import (
    "context"

    "github.com/scylladb/gocqlx/migrate"
)

const dir = "./cql" 

func main() {
    session := CreateSession()
    defer session.Close()

    ctx := context.Background()
    if err := migrate.Migrate(ctx, session, dir); err != nil {
        panic(err)
    }
}

TLDR;

Gocqlx Migrate is a feature of the Gocqlx extensions library that can be used for schema migrations from within code.

Pachyderm – (Open Source Repo) A pachyderm is

a very large mammal with thick skin, especially an elephant, rhinoceros, or hippopotamus.

So it is kind of a fitting name for this library. The library, the project itself, has found funding and bills itself as “Scalable, Reproducible Data Science“. I’ve used it minimally myself, but find it continually popping up on my “use this tool because you’ll need a ton of the features” list.

TLDR;

Pachyderm is an open source library, and paired capital funded company, that does indeed provide scalable, reproducible data science in addition to being a great library for your ETL and related data management needs.

Reflow – This library provides incremental data processing in the cloud. Providing this ability gives scientists and engineers the ability to put tools together, packaged in Docker images, using programming constructs. The library then evaluates the programs transparently parallelizing the work and memoizing results – i.e. using go routines and caching data appropriately to speed up tasks. The library was created at GRAIL to manage our NGS (next generation sequencing) bioinformatics workloads on AWS, but has also been used for many other applications, including model training and ad-hoc data analyses. Severl of Reflow’s key features include:

functional, lazy, type-safe Domain Specific Language (DSL) for writing workflow programs.
the runtime for the DSL evaluates incrementally, coordinating cluster execution, and memoization.
a cluster scheduler to dynamically provision and tear down resources in the cloud (currently AWS is supported).
with containers the same processing workloads can also be executed locally.

TLDR;

Reflow provides a way for data scientists, and by proxy database administrators, data programmers, programmers, and anybody that needs to work through ETL or related work to write programs against that data in the cloud or locally.

Restic (Github) – Restic is a backup CLI and Go library that will backup to a number of sources, a few including; local directory, sftp, http REST, S3, Google Cloud Storage, Azure Blob Storage, and others.

Restic follows several objectives:

The tool aims to be easy, with minimal singular steps to execute a backup.
The tool aims to be fast, using appropriate mechanisms to ensure speedy backups.
The tool aims to provide verifiable backups that can easily be restored.
The tool aims to incorporate cryptographic guarantees of confidentiality to make sure the backups are secure.
The tool aims to be efficient with additional snapshots only taking the storage of the actual increment and de-duplicated to save space in the storage back end.

For each of these there’s a particular single driver that I use for each. Except in the case of Apache Cassandra and DataStax Enterprise I have also picked up gocqlx to add to my gocql usage.

PostgreSQL – Features:

SSL
Handles bad connections for database/sql
Scan time.Time correctly (i.e. timestamp[tz], time[tz], date)
Scan binary blobs correctly (i.e. bytea)
Package for hstore support
COPY FROM support
pq.ParseURL for converting urls to connection strings for sql.Open.
Many libpq compatible environment variables
Unix socket support
Notifications: LISTEN/NOTIFY
pgpass support

Gocql & Gocqlx –

Gocql Features:

Modern Cassandra client using the native transport
Automatic type conversions between Cassandra and Go
- Support for all common types including sets, lists and maps
- Custom types can implement a Marshaler and Unmarshaler interface
- Strict type conversions without any loss of precision
- Built-In support for UUIDs (version 1 and 4)
Support for logged, unlogged and counter batches
Cluster management
- Automatic reconnect on connection failures with exponential falloff
- Round robin distribution of queries to different hosts
- Round robin distribution of queries to different connections on a host
- Each connection can execute up to n concurrent queries (whereby n is the limit set by the protocol version the client chooses to use)
- Optional automatic discovery of nodes
- Policy based connection pool with token aware and round-robin policy implementations
Support for password authentication
Iteration over paged results with configurable page size
Support for TLS/SSL
Optional frame compression (using snappy)
Automatic query preparation
Support for query tracing
Support for Cassandra 2.1+ binary protocol version 3
- Support for up to 32768 streams
- Support for tuple types
- Support for client side timestamps by default
- Support for UDTs via a custom marshaller or struct tags
Support for Cassandra 3.0+ binary protocol version 4
An API to access the schema metadata of a given keyspace

Gocqlx Features:

Binding query parameters form struct or map
Scanning results directly into struct or slice
CQL query builder (package qb)
Super simple CRUD operations based on table model (package table)
Database migrations (package migrate)

Go-MSSQLDB – Features:

Can be used with SQL Server 2005 or newer
Can be used with Microsoft Azure SQL Database
Can be used on all go supported platforms (e.g. Linux, Mac OS X and Windows)
Supports new date/time types: date, time, datetime2, datetimeoffset
Supports string parameters longer than 8000 characters
Supports encryption using SSL/TLS
Supports SQL Server and Windows Authentication
Supports Single-Sign-On on Windows
Supports connections to AlwaysOn Availability Group listeners, including re-direction to read-only replicas.
Supports query notifications

So this is just a few of the libraries I use, have worked with, and suggest checking out if you’re delving into database work and especially building systems around databases for reliability and related efforts.

If you’ve got other libraries that you’ve used, or really like, definitely leave a comment and let me know and I’ll update the post to include new libraries for Go. Subscribe to the blog too as I’ve got more posts in the cooker for database work, Go libraries and usage with databases, and a lot more. Happy thrashing code!

Category: Distributed Things

Concurrency Patterns in Go: A Short Deep Dive Series

Introduction

What You’ll Learn

Like this: