Category Archives: Databases

Distributed Database Things to Know: Gossip

Some of the names used for features in distributed databases can obscure what the feature actually does. Gossip, however, is pretty spot on. Within a group of people gossiping, the purpose is to find out each other’s business. What’s going on with Frank, who’s he seeing, and Sally started a business, say what! In the end, all the gossipers get into the business and understand what Frank, Sally, and the whole crew are up to. This is a good analogy for what gossip does in a distributed database, or distributed systems in general.

Gossip between nodes works on a peer-to-peer basis. It’s a communication protocol with the purpose of minding the other nodes’ business so the node doing the gossiping can go about its own. The process runs every second and exchanges state messages between the nodes, which can then update their respective state, keeping all nodes informed.

To prevent over-communication and mixed messages, the gossip list for every node in the cluster is derived from seed nodes. When a node boots up it initiates its gossip from a seed node, of which we usually have a few, and then continues with that gossip list. Note that seed nodes aren’t a single point of failure, as other nodes in the cluster will take their place if need be; they’re just designated as the lead to initiate a gossip list from.
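The exchange described above can be sketched in a few lines of Python. This is a toy illustration only, not Cassandra's actual gossip implementation: the `Node` class, the `(version, status)` tuples, and the "highest version wins" merge rule are all assumptions made for the example.

```python
import random


class Node:
    """One cluster member holding its own view of every node's state."""

    def __init__(self, name):
        self.name = name
        # node name -> (heartbeat version, status), as seen by this node
        self.view = {name: (0, "UP")}

    def beat(self):
        """Bump this node's own heartbeat so peers can tell its entry is fresh."""
        version, status = self.view[self.name]
        self.view[self.name] = (version + 1, status)


def exchange(a, b):
    """Both sides keep, per node, whichever entry has the higher version."""
    merged = dict(a.view)
    for name, entry in b.view.items():
        if name not in merged or entry[0] > merged[name][0]:
            merged[name] = entry
    a.view, b.view = dict(merged), dict(merged)


def join(node, seed):
    """A booting node starts its gossip by talking to a seed node."""
    exchange(node, seed)


def gossip_round(nodes, rng):
    """One tick: every node bumps its heartbeat and gossips with one random peer."""
    for node in nodes:
        node.beat()
        peer = rng.choice([n for n in nodes if n is not node])
        exchange(node, peer)


if __name__ == "__main__":
    rng = random.Random(42)
    seed = Node("seed")
    cluster = [seed] + [Node(f"node{i}") for i in range(1, 5)]
    for node in cluster[1:]:
        join(node, seed)
    for _ in range(20):
        gossip_round(cluster, rng)
    for node in cluster:
        print(node.name, sorted(node.view))
```

Right after joining, only the seed has heard from everybody; the random rounds are what spread that knowledge to the rest of the cluster.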

It is also important in Apache Cassandra to designate at least one seed node per replication group (i.e. datacenter) in the seed list, and more than one per datacenter is recommended for fault tolerance. Otherwise gossip has to communicate across higher-latency links to reach another datacenter, which can eat at the response time and performance of the gossip. Think of sending a snail mail USPS letter to a friend to get gossip news! That would take months just to find out what’s going on, and it’s kind of the same for nodes going across datacenters to talk to a seed node.

Bunches of Databases in Bunches of Weeks – PostgreSQL Day 1

May the database deluge begin, it’s time for “Bunches of Databases in Bunches of Weeks”. We’ll get into looking at databases similar to how they’re approached in “7 Databases in 7 Weeks”. In this session I took a hard look at PostgreSQL, or as some refer to it, just Postgres. This is the first of a few sessions on PostgreSQL, in which I get the database installed locally on Ubuntu. That is transferable to any other operating system really, PostgreSQL is awesome like that. Then, after installing pgAdmin 4, the user interface for PostgreSQL, and getting it working against that installation, I go the Docker route, again pointing pgAdmin 4 at the container and creating a database and an initial table.

Below the video here I’ve added the timeline and other details, links, and other pertinent information about this series.

0:00 – The intro image splice and metal intro with tunes.
3:34 – Start of the video database content.
4:34 – Beginning the local installation of Postgres/PostgreSQL on the local machine.
20:30 – Getting pgAdmin 4 installed on local machine.
24:20 – Taking a look at pgAdmin 4, a stroll through setting up a table, then generating some basic SQL and executing it with pgAdmin 4.
1:00:05 – Installing Docker and getting PostgreSQL set up as a container!
1:00:36 – Added the link to the stellar post at Digital Ocean’s Blog.
1:00:55 – My declaration that if Digital Ocean just provided documentation I’d happily pay for it, their blog entries, tutorials, and docs are hands down some of the best on the web!
1:01:10 – Installing PostgreSQL on Ubuntu 18.04.
1:06:44 – Signing in to Docker Hub and finding the official PostgreSQL Docker image.
1:09:28 – Starting the container with Docker.
1:10:24 – Connecting to the Docker PostgreSQL container with pgAdmin 4.
1:13:00 – Creating a database and working with SQL, tables, and other resources with pgAdmin 4 against the Docker container.
1:16:03 – The hacker escape outro. Happy thrashing code!

For each of these sessions in the “Bunches of Databases in Bunches of Weeks” series I’ll follow this sequence. I’ll go through day 1 for each database in my list of top 7 databases (see below), then go through each database again for day 2, and so on, accumulating additional days similarly to “7 Databases in 7 Weeks”.

“Day 1” of the database, I’ll work toward building a development installation of the particular database. For example, in this session I set up PostgreSQL by installing it on the local machine and also pulled a Docker image to run PostgreSQL.

“Day 2” of the respective database, I’ll get into working against the database with CQL, SQL, or whatever one would use to work with that database directly. At this point I’ll also get more deeply into the types, inserting, and storing data in the respective database.

“Day 3” of the respective database, I’ll get into connecting an application with C#, Node.js, and Go. That means implementing a simple connection, prospectively a test of the connection, and a simple insert, update, and delete of some sort against the database built on day 2.

“Day 4” and onward I’ll determine the path and layout of the topic later, so subscribe on YouTube and Twitch, and tune in. The events are scheduled, with the option to be notified when a particular episode is coming on that you’d like to watch here on Twitch.

Distributed Database Things to Know: Snitches

Snitches. What a great name for a feature right? I’d bring up the Harry Potter thing, but I’m gonna let that one fly. (get it, it flies!)

A snitch determines where nodes go among the racks and datacenters. These are the Cassandra-specific racks and datacenters, however, so check out my previous post on datacenters and racks for more detail on what they are in relation to Cassandra and DataStax Enterprise (DSE). Snitches tell the database about the network topology of the system. This lets requests be routed efficiently and enables Cassandra and DSE to distribute replicas by grouping the machines accordingly. All nodes within a cluster must use the same snitch.

Snitch Options

The following are the options we have for how a snitch determines node placement.

  • DseSimpleSnitch – This is the default snitch and is intended only for development deployments. It doesn’t recognize datacenter or rack information, and simply needs a keyspace defined to use SimpleStrategy with a replication factor set. Its use makes it a bit easier to set up a cluster for development.
  • GossipingPropertyFileSnitch – This snitch is usable for production. Rack and datacenter information for the local node is defined in the cassandra-rackdc.properties file, which then propagates this to other nodes via gossip.
  • Ec2Snitch – This is a great snitch for simple cluster deployments that reside in a single region. For this snitch, the region name is used as the datacenter name and availability zones are set up as racks. That gives us a setup that matches datacenter and racks to region and zones, making it pretty easy to remember which is where. Because of this mapping, which follows how EC2 itself is laid out, this snitch isn’t usable for multi-region clusters.
  • Ec2MultiRegionSnitch – This snitch can be used for multi-region deployments. To use this snitch, settings need to be made in both the cassandra.yaml file and the cassandra-rackdc.properties file. The way this snitch works is by using the public IP designated in the broadcast_address to allow this multi-region connection.
  • GoogleCloudSnitch – This snitch, as is somewhat obvious by the name, is for DSE deployments on Google Cloud Platform (GCP). This snitch uses datacenters and racks similarly mapped as the Ec2Snitch with datacenters mapped to regions and racks mapped to zones.
  • CloudstackSnitch – This snitch is for Apache Cloudstack. Zone naming is free-form in Cloudstack so this snitch uses <country> <location> <az> notation.
  • PropertyFileSnitch – The way this snitch works is by proximity, determined by rack and datacenter. It uses network details configured in the cassandra-topology.properties file, with the datacenter names defined using standard convention. These need to correlate to the names of the actual datacenters in the keyspace definition. The nodes in the cluster are then described in the cassandra-topology.properties file, which must be exactly the same on every node in the cluster.
  • RackInferringSnitch – This snitch is kind of funny, because it’s a usable snitch, but it’s also an example snitch. It determines the proximity of nodes by datacenter and rack too. However, it assumes these correspond to the second and third octet of the node’s IP address. It is best used as an example for writing custom snitch classes, unless of course this matches your actual deployment conventions.
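To make the octet convention of the RackInferringSnitch concrete, here is a minimal Python sketch of the idea. It is not the snitch's real code (which is a Java class inside Cassandra); the function names are hypothetical and the only thing taken from the description above is "second octet is the datacenter, third octet is the rack".

```python
import ipaddress


def infer_location(ip):
    """Read datacenter and rack straight out of an IPv4 address."""
    octets = ipaddress.IPv4Address(ip).packed  # four bytes, one per octet
    return {"datacenter": str(octets[1]), "rack": str(octets[2])}


def same_datacenter(ip_a, ip_b):
    """Two nodes share a datacenter when their second octets match."""
    return infer_location(ip_a)["datacenter"] == infer_location(ip_b)["datacenter"]


def same_rack(ip_a, ip_b):
    """Two nodes share a rack when both second and third octets match."""
    return infer_location(ip_a) == infer_location(ip_b)


if __name__ == "__main__":
    print(infer_location("10.20.30.40"))
    print(same_rack("10.20.30.40", "10.20.31.40"))
```

This also shows why the snitch only helps when the IP plan was laid out with that convention in mind; with arbitrary addressing the inferred racks and datacenters mean nothing.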

Those are the basics on snitches. I recently wrote about another distributed database architectural concept called consistent hashing, which is important to understand for distributed databases like Cassandra and DataStax Enterprise.

Let’s Talk Top 7 Options for Database Gumbo

When one starts to dig into databases things get really complex really fast. There’s not only a whole plethora of database companies and projects, but database types, storage engines, and other options and functionality to choose from. One place to get a start is just to take a look at the crazy long list of databases on db-engines. In this post I’m going to take a look at a few of the top database engines to create a starting point – which I’ll reference – for future video streaming coding sessions (follow me @ twitch.tv/adronhall).

My Options for Database Gumbo

  1. Apache Cassandra / DataStax Enterprise
  2. PostgreSQL
  3. SQL Server
  4. Elasticsearch
  5. Redis
  6. SQLite
  7. DynamoDB

The Reasons

Ok, so the list is as such, and as stated it’s my list. There are a lot of databases, and of course some, such as Oracle, are still more widely used. However, here’s some of the logic and reasoning behind my choices above.

Oracle

First off I feel like I need to broach the Oracle topic, mostly because of their general use in industry. I’m not doing anything with Oracle now, nor have I for years, for a long, long, LONG list of reasons. Using their software today tends to mean being buried in bureaucracy and oddly broken, unnecessary tooling anyway. They use predatory market tactics and a completely dishonorable approach to sales and services, along with threatening and suing people for doing benchmarks, and a host of other practices. In face-to-face dealings, Oracle tends to give off the kind of experience where Lawrence from Office Space would say, “naw man, I think you’d get your ass kicked for that!” and I agree. Oracle’s practices are too often disgusting. But even from the purely technical point of view, the Oracle Database and ecosystem itself really isn’t better than other options out there. It is indeed a better, more intelligently strategic and tactical option to use one of a number of alternatives.

Apache Cassandra / DataStax Enterprise

This combo has multiple reasons and logic to be on the list. First and foremost, much of my work today is using DataStax Enterprise (DSE) and Apache Cassandra since I work for DataStax. But it’s important to know I didn’t just go to DataStax because I needed a job, but because I chose them (and obviously they chose me by hiring me) because of the team and technology. Yes, they pay me, but it’s very much a two way street, I advocate Cassandra and DSE because I personally know the tech is top tier and solid.

On the fact that Apache Cassandra is top tier and solid, it is simply the remaining truly masterless distributed database that provides a linear path of scalability on the market that you can use, buy support for, and is actually actively and knowingly maintained not just by DataStax but by members of the community. One could make an argument for MongoDB but I’ll maybe elaborate on that in the future.

In addition to being a solid distributed database, there are capabilities inherent in Apache Cassandra, because of its data types and the respective CQL (Cassandra Query Language), that make it a great database to use too. DataStax Enterprise extends that to provide spatial (re: GIS/Geo Data/Queries), graph data, an analytics engine, and more, built on other components like Solr and related technology. Overall it’s a great database with great prospective combinations around it.

PostgreSQL

Postgres is a relational database that has been around for a long time. It’s got some really awesome features like native JSON support, which I’m a big fan of. But I digress, there’s tons of other material that lays out thoroughly why to use Postgres which I very much agree with.

The extensive and rich data types alone are enough to put Postgres on this list, but there are also a lot of reasons around multi-tenancy, scalability, and related characteristics, mostly unique to Postgres, for which it has held a solid position.

SQL Server

This one is on my list for a few reasons that have nothing to do with features or capabilities. This is the first database I was responsible for in its entirety: administration, queries, query tuning, setup, and development against it with the application tier. Of all my experience, this is the database I’ve spent the most time with, with Apache Cassandra being a close second, then Postgres and finally Riak.

Kind of a pattern there eh? Relational, distributed, relational, distributed!

The other thing about SQL Server, however, is that the integrations, tooling, and related development ecosystem around it are above and beyond most options out there. Maybe, with a big maybe, Oracle’s ecosystem might be comparable, but the pricing is insanely different. SQL Server can basically carry the whole workload of reporting, ETL, and the other capabilities that the Oracle ecosystem has traditionally covered. Combine SQL Server with SSIS (SQL Server Integration Services), SSRS (SQL Server Reporting Services), and online systems like Azure’s SQL Database, and the support, tooling, and ecosystem is just massive. Even though I’ve had my ins and outs with Microsoft over the years, I’ve always found myself enjoying working on SQL Server and its respective tooling options and such. It’s a feature rich, complete, solid, and generally well performing relational database, full stop.

Elasticsearch

Ok, this is kind of a distributed database of sorts, but focused more exclusively (not totally, since it has expanded its roles) on being a search engine. Overall I’ve had good experiences with Elasticsearch and its respective ELK (or Elastic ecosystem) tooling and such, with some frustrating flakiness here and there over the years. Most of my experience with Elasticsearch has come from an operational point of view. I have, however, done a fair bit of work over the years supporting teams that are doing actual software development against the system. I probably won’t write a huge amount about Elasticsearch in the coming months, but I’ll definitely bring it up at certain times.

Redis / SQLite / DynamoDB

These I’ll be covering in the coming months. For Redis and DynamoDB I have wanted to dig in for some comparison analysis from the perspective of implementing data tiers against these databases, where they are a good option, and determining where they’re just an outright bad option.

For SQLite I’ve used it on and off for many years, but have wanted to sit down and just learn it and try out some of its features a bit more.

Cassandra Datacenter & Racks

The last post in this series is Distributed Database Things to Know: Consistent Hashing.

Let’s talk about the analogy of Apache Cassandra Datacenter & Racks to actual datacenter and racks. I kind of enjoy the use of the terms datacenter and racks to describe architectural elements of Cassandra. However, as time moves on the relationship between these terms and why they’re called datacenter and racks can be obfuscated.

Take for instance, a datacenter could just be a cloud provider, an actual physical datacenter location, a zone in Azure, or region in some other provider. What an actual Datacenter in Cassandra parlance actually is can vary, but the origins of why it’s called a Datacenter remains the same. The elements of racks also can vary, but also remain the same.

Origins: Racks & Datacenters?

Let’s cover the actual things in this industry we call datacenter and racks first, unrelated to Apache Cassandra terms.

Racks: The easiest way to describe a physical rack is to show pictures of datacenter racks via the ole’ Google images.

racks.png

A rack is something that is located in a data-center, or even just someone’s garage in some odd scenarios. Ya know, if somebody wants serious hardware to work with. The rack then has a number of servers, often various kinds, within that rack itself. As you can see from the images above there’s a wide range of these racks.

Datacenter: Again, the easiest way to describe a datacenter is to just look at a bunch of pictures of datacenters, albeit you see lots of racks again. But really, that’s what a datacenter is: a building that has lots and lots of racks.

data-center.png

However, in Apache Cassandra (and respectively DataStax Enterprise products) a datacenter and rack do not directly correlate to a physical rack or datacenter. The idea is more of an abstraction than a hard mapping to the physical realm. In turn it is better to think of datacenters and racks as a way to structure and organize your DataStax Enterprise or Apache Cassandra architecture. From a tree perspective of organizing your cluster, think of things in this hierarchy.

  • Cluster
    • Datacenter(s)
      • Rack(s)
        • Server(s)
          • Node (vnode)

Apache Cassandra Datacenter

An Apache Cassandra Datacenter is a group of nodes, related and configured within a cluster for replication purposes. Setting up a specific set of related nodes as a datacenter helps to reduce latency, keeps transactions from being impacted by other workloads, and has related benefits. Depending on the replication factor, data can also be written to multiple datacenters, providing additional flexibility in architectural design and organization. Datacenters must never span physical locations. Each datacenter usually contains only one node type. The node types are:

  • Transactional: Previously referred to as a Cassandra node.
  • DSE Graph: A graph database for managing, analyzing, and searching highly-connected data.
  • DSE Analytics: Integration with Apache Spark.
  • DSE Search: Integration with Apache Solr. Previously referred to as a Solr node.
  • DSE SearchAnalytics: DSE Search queries within DSE Analytics jobs.

Apache Cassandra Racks

An Apache Cassandra Rack is a grouped set of servers. The architecture of Cassandra uses racks so that no replica is stored redundantly inside a singular rack, ensuring that replicas are spread around through different racks in case one rack goes down. Within a datacenter there could be multiple racks with multiple servers, as the hierarchy shown above would dictate.
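The "spread replicas across racks" idea can be sketched as a walk around the ring that skips nodes in racks that already hold a replica. This is a simplified illustration under my own assumptions (a flat list standing in for the token ring, hypothetical node and rack names), not Cassandra's actual placement code.

```python
def place_replicas(ring, start, replication_factor):
    """Pick replica nodes for one piece of data.

    ring: list of (node, rack) tuples in ring order.
    start: index of the node that owns the data's position on the ring.
    """
    chosen, used_racks, skipped = [], set(), []
    for step in range(len(ring)):
        if len(chosen) == replication_factor:
            break
        node, rack = ring[(start + step) % len(ring)]
        if rack not in used_racks:
            # First replica for this rack: take it.
            chosen.append(node)
            used_racks.add(rack)
        else:
            # Rack already holds a replica: remember the node as a fallback.
            skipped.append(node)
    # Fewer racks than replicas: fall back to the nodes skipped earlier.
    for node in skipped:
        if len(chosen) == replication_factor:
            break
        chosen.append(node)
    return chosen


if __name__ == "__main__":
    ring = [("n1", "rack1"), ("n2", "rack1"), ("n3", "rack2"),
            ("n4", "rack2"), ("n5", "rack3"), ("n6", "rack3")]
    print(place_replicas(ring, 0, 3))
```

With three racks and a replication factor of three, each rack ends up with exactly one copy, so losing a whole rack still leaves two replicas available.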

To determine where data goes within a rack or sets of racks, Apache Cassandra uses what is referred to as a snitch. A snitch determines which rack and datacenter a particular node belongs to and, by way of that, informs where the replicas of data will end up. There are numerous kinds of snitches that can inform the replication strategy this way; some examples include:

  • SimpleSnitch – this snitch treats order as proximity. It is primarily used in single-datacenter deployments.
  • Dynamic Snitching – the dynamic snitch monitors read latencies to avoid reading from hosts that have slowed down.
  • RackInferringSnitch – Proximity is determined by rack and datacenter, assumed corresponding to 3rd and 2nd octet of each node’s IP address. This particular snitch is often used as an example for writing a custom snitch class since it isn’t particularly useful unless it happens to match one’s deployment conventions.
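As a rough illustration of the dynamic snitching idea from the list above, the sketch below keeps a smoothed read latency per host and orders replicas so the slow ones are tried last. The real scoring in Cassandra is more involved; the `LatencyTracker` class, the smoothing factor, and the host names here are all assumptions for the example.

```python
class LatencyTracker:
    """Keeps an exponentially weighted average of read latency per host."""

    def __init__(self, alpha=0.5):
        self.alpha = alpha  # weight given to the newest sample
        self.scores = {}    # host -> smoothed latency in milliseconds

    def record(self, host, latency_ms):
        previous = self.scores.get(host)
        if previous is None:
            self.scores[host] = float(latency_ms)
        else:
            self.scores[host] = self.alpha * latency_ms + (1 - self.alpha) * previous

    def order(self, replicas):
        """Return replicas fastest first; hosts with no samples sort to the front."""
        return sorted(replicas, key=lambda host: self.scores.get(host, 0.0))


if __name__ == "__main__":
    tracker = LatencyTracker()
    for host, latency in [("a", 10), ("b", 50), ("c", 20)]:
        tracker.record(host, latency)
    print(tracker.order(["a", "b", "c"]))
    tracker.record("a", 200)  # host "a" has slowed down
    print(tracker.order(["a", "b", "c"]))
```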

In the future I’ll outline a few more snitches, how some of them work with more specific detail, and I’ll get into a whole selection of other topics. Be sure to subscribe to the blog, the ole’ RSS feed works great too, and follow @CompositeCode for blog updates. For discourse and hot takes follow me @Adron.

Distributed Database Things to Know Series

  1. Consistent Hashing
  2. Apache Cassandra Datacenter & Racks (this post)

 

A New Adventure of Multi-model Distributed Graph Time Series […etc…] Database(s) Explorations Begins!

I arrived at the airport, sending a few tweets of this or that nature with all of this GitHub and Microsoft news. I have a great view out the window from the Alaska Lounge just before heading to the D gates. For you aeronautics fans like myself, here’s a picture of that view and a few of those Alaska planes with one of the newly acquired Virgin America planes!

IMG_5264

All this news with GitHub and Microsoft was easily eclipsing WWDC18, and in the meanwhile little ole’ me is on my way to a new adventure in my career. So, priorities being what they are and the news being exciting, I’m more excited to announce that today I’m joining a most excellent team at DataStax, to bring forth investigation, research, knowledge, ideas, and whatever else I can as a Developer Evangelist with the crew here! I’m unbelievably stoked as I’ve been searching for a company that would check all of my “will this work” check boxes for some months now! DataStax won out among the other prospective candidate companies and I’m starting today!

To kick off this adventure, I’m heading to San Francisco to join in the fun attending DevxCon. I’ll be there a little later today, hopefully in time for the kick off (ya know, pending flights and BART are all timely and such)! Then a full day of the conf, then later will join the team for a visit to DataStax HQ and maybe a few surprises. I’m super excited and ready to bring awesome content your way, while inventing, building, and experimenting my way through some awesome technologies!

In-memory Orchestrate Local Development Database

I was talking with Tory Adams @BEZEI2K about working with Orchestrate’s services. We’re totally sold on what they offer and are looking forward to a lot of the technology that is in the works. The day to day building against Orchestrate is super easy, and setting up collections for dev or test or whatever is so easy that nothing has stood in our way. Except one thing…

Every once in a while we have to work disconnected, for whatever reason: Comcast cable goes out, we decide to jump on a train, or one of us ends up on one of those Q400 puddle jumpers that doesn’t have wifi! But regardless of being disconnected from wifi, cable, or internet connectivity, we still want to be able to code and test!

In Memory Orchestrate Wrapper

Enter the idea of creating an in memory Orchestrate database wrapper. Using something like convict.js one could easily redirect all the connections as necessary when developing locally. That way development continues right along and when the application is pushed live, it’s redirected to the appropriate Orchestrate connections and keys!

This in memory “fake” or “mock” would need to have the key value, events, and graph store setup just like Orchestrate. With the possibility of having this in memory one could also easily write tests against a real fake and be able to test connected or disconnected without mocking. Not to say that’s a good or bad idea, but just one more tool in the tool chest doesn’t hurt!
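To give a feel for what such a wrapper might look like, here is a small Python sketch of an in-memory fake with key/value, events, and graph relations, plus a switch that picks it during local development. Everything in it is hypothetical: the class and method names are not Orchestrate's client API, and the `APP_ENV` variable merely stands in for whatever configuration tool (convict.js or otherwise) would do the redirecting.

```python
import os
from collections import defaultdict


class InMemoryStore:
    """A disconnected stand-in exposing key/value, events, and graph relations."""

    def __init__(self):
        self._kv = defaultdict(dict)      # collection -> key -> value
        self._events = defaultdict(list)  # (collection, key, type) -> [event data]
        self._edges = defaultdict(set)    # (collection, key, relation) -> {(collection, key)}

    def put(self, collection, key, value):
        self._kv[collection][key] = value

    def get(self, collection, key):
        return self._kv[collection].get(key)

    def add_event(self, collection, key, event_type, data):
        self._events[(collection, key, event_type)].append(data)

    def events(self, collection, key, event_type):
        return list(self._events[(collection, key, event_type)])

    def relate(self, collection, key, relation, to_collection, to_key):
        self._edges[(collection, key, relation)].add((to_collection, to_key))

    def related(self, collection, key, relation):
        targets = sorted(self._edges[(collection, key, relation)])
        return [self.get(c, k) for c, k in targets]


def make_store(environ=os.environ):
    """Pick the fake for local development; the hosted client would go in the other branch."""
    if environ.get("APP_ENV", "development") == "development":
        return InMemoryStore()
    raise NotImplementedError("wire up the real hosted client here")


if __name__ == "__main__":
    store = make_store({"APP_ENV": "development"})
    store.put("users", "adron", {"name": "Adron"})
    store.put("users", "tory", {"name": "Tory"})
    store.relate("users", "adron", "works_with", "users", "tory")
    print(store.related("users", "adron", "works_with"))
```

Application code that only ever calls `make_store()` never needs to know whether it is talking to the fake or the real service, which is the whole point of the wrapper.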

If something like this doesn’t pop up in the next week or three, I might just have to kick off this project myself! If anybody is interested please reach out to me and let’s discuss! I’m open to writing it in JavaScript, C#, Java or whatever poison pill you’d prefer. (I’m not polyglot to limit my options!!)

Other Ideas, Development Shop Swap

Another idea that I’ve been pondering is setting up a development shop swap. I’ll leave the reader to determine what that means!  😉  Feel free to throw down ideas that this might bring up and I’ll incorporate that into the soon to be implementation. I’ll have more information about that idea right here once the project gets rolling. In the meantime, happy coding!