Let’s Talk Top 7 Options for Database Gumbo

When one starts to dig into databases things get really complex really fast. There’s not only a whole plethora of database companies and projects, but database types, storage engines, and other options and functionality to choose from. One place to get a start is just to take a look at the crazy long list of databases on db-engines. In this post I’m going to take a look at a few of the top database engines to create a starting point – which I’ll reference – for future video streaming coding sessions (follow me @ twitch.tv/adronhall).

My Options for Database Gumbo

  1. Apache Cassandra / DataStax Enterprise
  2. Postgresql
  3. SQL Server
  4. Elasticsearch
  5. Redis
  6. SQLite
  7. Dynamo DB

The Reasons

Ok, so the list is as such, and as stated it’s my list. There are a lot of databases, and of course some are still more used such as Oracle. However here’s some of the logic and reasoning behind my choices above.

Oracle

First off I feel like I need to broach the Oracle topic. Mostly because of their general use in industry. I’m not doing anything with Oracle now, nor have I for years for a long, long, LONG list of reasons. Using their software tends to be buried in bureaucratic, oddly broken and unnecessary usage today anyway. They use predatory market tactics, completely dishonorable approach to sales and services, as well as threatening and suing people for doing benchmarks, and a host of other practices. In face to face experiences, Oracle tends to give off experiences, that Lawrence from Office Space would say, “naw man, I think you’d get your ass kicked for that!” and I agree. Oracle’s practices are too often disgusting. But even from the purely technical point of view, the Oracle Database and ecosystem itself really isn’t better than other options out there. It is indeed a better, more intelligently strategic and tactical option to use a number of alternatives.

Apache Cassandra / DataStax Enterprise

This combo has multiple reasons and logic to be on the list. First and foremost, much of my work today is using DataStax Enterprise (DSE) and Apache Cassandra since I work for DataStax. But it’s important to know I didn’t just go to DataStax because I needed a job, but because I chose them (and obviously they chose me by hiring me) because of the team and technology. Yes, they pay me, but it’s very much a two way street, I advocate Cassandra and DSE because I personally know the tech is top tier and solid.

On the fact that Apache Cassandra is top tier and solid, it is simply the remaining truly masterless distributed database that provides a linear path of scalability on the market that you can use, buy support for, and is actually actively and knowingly maintained not just by DataStax but by members of the community. One could make an argument for MongoDB but I’ll maybe elaborate on that in the future.

In addition to being a solid distributed database there are capabilities inherent in Apache Cassandra because of the data types and respective the CQL (Cassandra Query Language) that make it a great database to use too. DataStax Enterprise extends that to provide spatial (re: GIS/Geo Data/Queries), graph data, analytics engine, and more built on other components like SOLR and related technology. Overall a great database and great prospective combinations with the database.

Postgresql

Postgres is a relational database that has been around for a long time. It’s got some really awesome features like native JSON support, which I’m a big fan of. But I digress, there’s tons of other material that lays out thoroughly why to use Postgres which I very much agree with.

Just from the perspective of the extensive and rich data types Postgres is enough to be put on this list, but considering there are a lot of reasons around multi-tenancy, scalability, and related characteristics that are mostly unique to Postgres it’s held a solid position.

SQL Server

This one is on my list for a few reasons that have nothing to do with features or capabilities. This is the first database I was responsible for in its entirety. Administration, queries, query tuning, setup, and developer against with the application tier. I think of all my experience, this database I’ve spent the most time with, with Apache Cassandra being a close second, then Postgres and finally Riak.

Kind of a pattern there eh? Relational, distributed, relational, distributed!

The other thing about SQL Server however is the integrations, tooling, and related development ecosystem around SQL Server is above and beyond most options out there. Maybe, with a big maybe, Oracle’s ecosystem might be comparable but the pricing is insanely different. In that SQL Server basically can carry the whole workload, reporting, ETL, and other feature capabilities that the Oracle ecosystem has traditionally done. Combine SQL Server with SSIS (SQL Server Integration Services), SSRS (SQL Server Reporting Services), and other online systems like Azure’s SQL Database and the support, tooling, and ecosystem is just massive. Even though I’ve had my ins and outs with Microsoft over the years, I’ve always found myself enjoying working on SQL Server and it’s respective tooling options and such. It’s a feature rich, complete, solidly, and generally well performing relational database, full stop.

Elasticsearch

Ok, this is kind of a distributed database of sorts but focused more exclusively (not totally since it’s kind of expanded its roles) search engine. Overall I’ve had good experiences with Elasticsearch and it’s respective ELK (or Elastic ecosystem) of tooling and such, with some frustrating flakiness here and there over the years. Most of my experience has come from an operational point of view with Elasticsearch. I’ve however done a fair bit of work over the years in supporting teams that are doing actual software development against the system. I probably won’t write a huge amount about Elasticsearch in the coming months, but I’ll definitely bring it up at certain times.

Redis / SQLite / DynamoDB

These I’ll be covering in the coming months. For Redis and DynamoDB I have wanted to dig in for some comparison analysis from the perspective of implementing data tiers against these databases, where they are a good option, and determining where they’re just an outright bad option.

For SQLite I’ve used it on and off for many years, but have wanted to sit down and just learn it and try out some of its features a bit more.

A New Adventure of Multi-model Distribute Graph Time Series […etc…] Database(s) Explorations Begins!

I arrived at the airport, sending a few tweets of this or that nature with all of this Github and Microsoft News. I have a great view out the window from the Alaska Lounge just before heading to the D gates. For you aeronautics fans like myself, here’s a picture of that view and a few of those Alaska Planes with one of the newly acquired Virgin America Planes!

IMG_5264

All this news with Github and Microsoft was easily eclipsing WWDC18 and in the meanwhile little ole’ me is on my way to a new adventure in my career. So priorities what they are, the news being excited, I’m more excited today to announce today I’m joining a most excellent team at DataStax! to bring forth investigation, research, knowledge, ideas, and whatever else I can as a Developer Evangelist with the crew here at DataStax! I’m unbelievably stoked as I’ve been searching for a company that would check all of my “will this work” check boxes for some months now! DataStax won out among the other prospective candidate companies and I’m starting today!

datastax_logo_blue

To kick off this adventure, I’m heading to San Francisco to join in the fun attending DevxCon. I’ll be there a little later today, hopefully in time for the kick off (ya know, pending flights and BART are all timely and such)! Then a full day of the conf, then later will join the team for a visit to DataStax HQ and maybe a few surprises. I’m super excited and ready to bring awesome content your way, while inventing, building, and experimenting my way through some awesome technologies!

Restarting Data Diluvium – Four Steps

I’ve got three steps I’m going through to reboot the Data Diluvium Project & the respective CLI app I started about a year ago. I got a little ways into the project and then a bit distracted, it happens. Here’s the next steps I’m taking and for those interested in helping out I’ll be blogging the work here, and also sending out updates via my Thrashing Code Newsletter. You can sign up and select all the news or just the open source project news if you just want to follow the projects.

Step 0: Write Up the Ideas Behind the Project

Ok, so this will arrive subsequently. So far, just wanted to get these notes and intentions written down. Previously I’d written about the idea here, and here. Albeit after many discussion with a number of people, there will be some twists and turns to the project to make it more useful and streamlined in CLI & services.

Step 1: Cleanup The Repository

Currently the repository is kind of a mess. I’m going to aim to do the following over the next few days.

  • Write up contributor issues/files for the repo.
  • Rewrite the documentation (initial docs that is) to detail the intent of the data generator ideas.
  • Incorporate the CLI to a repo that is parallel to this repo that is designed specifically to work against this repo’s project.
  • Write up a README.md that will detail what Data Diluvium is exactly as well as point to the project site and provide installation and setup instructions.
  • Setup the first databases to target as Postgresql, Cassandra, and *maybe* one other database, but I’m not sure which one. Feel free to file an issue with a suggestion.

Step 2: Cleanup & Publish a new Project Website

This is a simple one, I need to write up copy with the details, specific with feature descriptions and intended examples. This will provide the start point to base the work for the project. It will be similar to one of those living documents in that the documentation will, can, and should change as the project is developed.

Step 3: Get More Cats Coding!

catI’ve pinged a few people I know are interested in helping out, but we’re always looking for others to help with PRs and related efforts around the project(s). If you’re game, the easiest way to get started would be to ping me directly via DM on Twitter @adron and to sign up on my Thrashing Code Newsletter and select Open Source Projects Only (unless you want all the things).

…anyway, getting to work on these tasks. Happy coding!

Data Diluvium Design Ideas

This post includes a collection of my thoughts on design and architecture of a data generation service project I’ve started called Data Diluvium. I’m very open to changes, new ideas, or completely different paradigms around these plans altogether. You can jump into the conversation thread. What kind of data do you often need? What systems do you want it inserted into?

Breakdown of Article Ideas:

  • Collected Systems API – This API service idea revolves around a request that accepts a schema type for a particular database type, an export source for inserting the data into, and generating an amount of data per the requested amount. The response then initiates that data generation, while responding with a received message and confirmation based on what it has received.
  • Individual Request API – This API service idea (thanks to Dave Curylo for this one, posted in the thread) revolves around the generation of data, requested at end points for a particular type of random data generation.

Alright, time to dive deeper into each of these.

Collected Systems APIs

  • https://datadiluvium.com/schema/generate – This API end point would take a schema with the various properties needed. For any that aren’t set, a default would be set. The generation process would then randomize, generate, and insert this data into any destination source specified. Here are some prospective examples I’ve created:

    A very basic sample JSON schema

    [
      {
        "schema": "relational",
        "database": "text"
      }
    ]
    

    In this particular example, I’ve created the simplist schema that could be sent into the service. For this particular situation I’d have (currently not decided yet) defaults that would randomly create a table, with a single column, and generate one element of data in that table. Other properties could be set, which would give control over the structure created in which to insert the data into. An example would be the following.

    [
      {
        "schema": "relational",
        "database": "postgresql",
        "structure": [
          {
            "table": "Users",
            "columns": [
              {"name": "id", "type": "uuid"},
              {"name": "firstname", "type": "firstname"},
              {"name": "lastname", "type": "lastname"},
              {"name": "email_address", "type": "email"}
            ]
          },
          {
            "table": "Addresses",
            "columns": [
              {"name": "id", "type": "uuid"},
              {"name": "street", "type": "address"},
              {"name": "city", "type": "city"},
              {"name": "state", "type": "state"},
              {"name": "postalcode", "type": "zip"}
            ]
          },
          {
            "table": "Transactions",
            "columns": [
              { "name": "id", "type": "uuid" },
              { "name": "transaction", "type": "money" },
              { "name": "stamp", "type": "date" }
            ]
          }
        ]
      }
    ]
    

    In this example, the properties included are three tables; UsersAddresses, and Transactions. In the first table, Users, the columsn would be; idfirstnamelastname, and email_address. Each of these then have a type property which sets the type of data to be generated for the columns. The same type of set of properties is then included for the Addressesand Transactions tables and their respective columns.

    Some additional questions remain, such as if the tables exist in the database, would the insertion build SQL to create the tables? Should it be assumed that the tables exist already and have the appropriate settings set to insert the data into the tables? Again, a great thing to discuss on the thread here.

  • https://datadiluvium.com/schema/validate – This could be used to validate a schema request body. Simply submit a schema and a validation response would be returned with “Valid” or “Invalid”. In the case of an invalid response, a list of prospective and known errors would be returned.

These two API end points focus around building out large data to test systemic environments and the respective construction of those environments. The actual generation of the data is assumed for this API service and the individual generation of data is discussed below in the individual request APIs.

Individual Request APIs

The following API calls could be implemented with fairly straight forward random data generation. A number can easily be randomized and returned, a word can be chosen from a dictionary, and a city returned from a list of cities. The following are prospective API calls to return data of this type.

The next level of complexity for data generation would be the slightly structured data generation. Instead of having an arbitrary list of addresses, we could or would prospectively generate one. But on the other hand, maybe we should just randomly create actual addresses that can be validated against an actual real address? That seems to have the possiblity of issues in the real world in spite of the fact all the addresses out there in the world are basically publicly accessible data. But questioning how or what the data would or could actually represent will be a great thing to discuss in the thread.

The next level of data generation complexity would be to generate sentences and other related data. This could be done a number of ways. If we wanted to have it generate intelligent sentences that made sense, it would take a little bit more work then for example generating lorum ipsum.

TLDR;

This blog entry just details the starting point of features for the Data Diluvium Project. If you’d like to jump into the project too, let me know. I’m generally working on the project during the weekends and a little during the week. There’s already a building project base that I’m starting with. If you’re interested in writing some F#, check out the work Dave Curylo has done here. I’ve been pondering breaking it out to another project and sticking to the idea of microservice but with F# for the work he put in. Anyway if you’ve got ideas on how to generate data, how you’d like to use it in your applications, or other related ideas please dive into the conversation on the Github Thread here.

History of Symphonize.js – JavaScript Client Pivot to Data Generation Library

…the history of symphonize.js So Far!

NOTE: If you just want to check out the code bits, scroll down to the sub-title #symphonize #hacking. Also important to note I’m putting the library through a fairly big refactor at the moment so that everything aligns with the documentation that I’ve recently created. So many things may not be implemented, but we’re moving toward v0.1.0, which will be a functional implementation of the library available via npm based entirely on the documentation and specs that I outline after the history.

A Short History

I started the symphonize.js project back on the 1st of November. Originally I started the project as a client driver library for Orchestrate.io, but within a day Chris Molozian commented and pointed out that there was already a client driver library for Orchestrate.io available that Steve Kaliski (Github @sjkaliski and Twitter @stevekaliski and http://stevekaliski.com/) had coded called logically orchestrate.js. Since this was available I did a pivot to symphonize.js being a data generation project instead.

The comment that enabled symphonize.js to pivot from client driver to data generation library.
The comment that made me realize symphonize.js should pivot from client driver to data generation library.

The Official Start of Symphonize.js

After that start and quick pivot I posted a blog with Orchestrate.io titled “Test Data Builder Symphonize.js With Chance.js (1/3)” to officially start the project. In that post I covered key value and graph basics, with a dive into using chance.js and orchestrate.js with examples. Near the same time I also posted a related blog on publishing an NPM module, which is the deployment focus of Symphonize.js.

Reasons Reasoning

There are two main reasons why I chose Orchestrate.io and a data generation library as the two things I wanted to combine. The first, is I knew the orchestrate.io team and really dug what they were building. I wanted to work with it and check out how well it would work for my use cases in the future. The ability to go sit down, discuss with them what they were building was great (which I interviewed Matt Heitzenroder @roder that you can watch Orchestrate.io, Stop Dealing With the Database Infrastructure!) The second reason is that my own startup that I’m co-founding with Aaron Gray (@agray) needed to use key value and graph data storage of some type, somewhere. Orchestrate.io looked like a perfect fit. After some research, giving it a go, it fit very well into what we are building.

CRUD, cURL Hacking & Next Steps

Early December I knocked out two support articles about testing APIs with cURL in Some JavaScript API Coding With Restify & Express & Hacking it With cURL …Segment #1 (with some Webstorm to boot) and Some JavaScript API Coding With Restify & Express & Hacking it With cURL …Segment #2 and an article on the Orchestrate.io Blog for part 2 of that series titled Symphonize Some Create, Read, Update & Delete [CRUD] via Orchestrate.js (2/3).

December then rolled into the standard holiday doldrums and slowdowns. So fast forward to January post a few rounds of beer and good tidings and I got the 3rd in the series published titled Getting Serious With Symphony.js – JavaScript TDD/BDD Coding Practices (3/3). The post doesn’t speak too much to symphony.js usage but instead my efforts to use TDD or BDD practices in trying to write the library.

Slowly I made progress in building the library and finally it’s in a mostly releasable state now. I use this library daily in working with the code base for Deconstructed and imagine I’ll use it ongoing for many other projects. I hope others might be able to find uses for it too and maybe even add capabilities or ideas. Just ping me via Twitter @adron or Github @adron, add an issue on Github and I’ll be happy to accept pull requests for new features, code refactoring, add you to the project or whatever else you’re interested in.

#symphonize #hacking

Now for the nitty gritty. If you’re up for using or contributing to the project check out the symphonize.js github pages site first. It’s got all the information to help get you kick started. However, you can keep reading as I’ve included much of the information there along with the examples from the README.md below.

NOTE: As I mentioned at the top of this blog entry, the funcitonal implementation of code isn’t available via npm just yet, myself and some others are ripping through a good refactor to align the implementation fo the library with the rewritten and newly available documentation – included blow and at the github pages.

How to use this project in one of your projects.

[sourcecode language=”bash”]
npm install symphonize
[/sourcecode]

How to setup this project for development.

First fork the repository located at https://github.com/Adron/symphonize.

[sourcecode language=”javascript”]
git clone git@github.com:YourUserName/symphonize.git
cd symphonize
npm install
[/sourcecode]

Using The Library

The intended usage is to invocate the JavaScript object and then call generate. That’s it, a super simple process. The code would look like this:

[sourcecode language=”javascript”]var Symphonize = require(‘../bin/symphonize’);
var symphonize = new Symphonize();
[/sourcecode]

The basic constructor invocation like this utilizes the generate.json file to generate data from. To inject the json configuration programmatically just inject the json configuration information via the constructor.

[sourcecode language=”javascript”]
var configJson = {"schema":"keyvalue"};

var Symphonize = require(‘../bin/symphonize’);
var symphonize = new Symphonize();
[/sourcecode]

Once the Symphonize data generator has been created call the generate() method as shown.

[sourcecode language=”javascript”]
symphonize.generate();
[/sourcecode]

That’s basically it. But you say, it’s supposed to do X, Y or Z. Well that’s where the json configuration data comes into play. In the configuration data you can set the data fields and what they’ll generate, what type of data will be generated, the specific schema, how many records to create and more.

generate.json

The library comes with the generate.json file already setup with a working example. Currently the generation file looks like this:

[sourcecode language=”javascript”]
{
"schema": "keyvalue", /* keyvalue, graph, event, geo */
"count": 20, /* X values to generate. */
"write_source": "console", /* console, orchestrateio and whatever other data sources that might come up. */
"fields": {
/* generates a random name. */
"fieldName": "name",
/* generates a random dice roll of a d20. */
"fieldTwo": "d20",
/* A single lorum ipsum random statement is genereated. */
"fieldSentence": "sentence",
/* A random guid is generated. */
"fieldGuid": "guid" }
}
[/sourcecode]

Configuration File Definitions

Each of the configuration options that are available have a default in the configuration file. The default is listed in italics with each definition of the configuration option listed below.

  • schema” : This is used to select what type of data structure type is going to be generated. The default iskeyvalue for this option.
  • count” : This provides the total records that are to be generated by the library. The default is 1 for this option.
  • write_source” : This provides the location to output the generated data to. The default is console for this option.
  • fields” : This is a JSON field within the JSON configuration file that provides configuration options around the fields, number of fields and their respective data to generate. The default is one field, with a default data type of guid. Each of the respective entries in this JSON option is a self contained JSON name and value pair. This then looks simply like this (which is also shown above in part):[sourcecode language=”javascript”]{
    "someBoolean": "boolean",
    "someChar": "character",
    "aFloat": "float",
    "GetAnInt": "integer",
    "fieldTwo": "d20",
    "diceRollD10": "d10",
    "_string": {
    "fieldName": "NameOfFieldForString",
    "length": 5,
    "pool": "abcdefgh"
    },
    "_sentence": {
    "fieldName": "NameOfFiledOfSentences",
    "sentence": "5"
    },
    "fieldGuid": "guid"
    }
    [/sourcecode]
  • Fields Configuration: For each of the fields you can either set the field to a particular data type or leave it empty. If the field name and value pair is left empty then the field defaults to guid. The types of data to generate for fields are listed below. These listed are all simple field and data generation types. More complex nested generation types are listed below under Complex Field Configuration below the simple section.
    • boolean“: This generates a boolean value of true or false.
    • character“: This generates a single character, such as ‘1’, ‘g’ or ‘N’.
    • float“: This generates a float value, similar to something like -211920142886.5024.
    • integer“: This generates an integer value, similar to something like 1, 14 or 24032.
    • d4“: This generates a random integer value based on a dice roll of one four sided dice. The integer range being 1-10.
    • d6“: This generates a random integer value based on a dice roll of one six sided dice. The integer range being 1-10.
    • d8“: This generates a random integer value based on a dice roll of one eight sided dice. The integer range being 1-10.
    • d10“: This generates a random integer value based on a dice roll of one ten sided dice. The integer range being 1-10.
    • d12“: This generates a random integer value based on a dice roll of one twelve sided dice. The integer range being 1-10.
    • d20“: This generates a random integer value based on a dice roll of one twenty sided dice. The integer range being 1-20.
    • d30“: This generates a random integer value based on a dice roll of one thirty sided dice. The integer range being 1-10.
    • d100“: This generates a random integer value based on a dice roll of one hundred sided dice. The integer range being 1-10.
    • guid“: This generates a random globally unique identifier. This value would be similar to ‘F0D8368D-85E2-54FB-73C4-2D60374295E3’, ‘e0aa6c0d-0af3-485d-b31a-21db00922517’ or ‘1627f683-efeb-4db8-8174-a5f2e3378c87’.
  • Complex Field Configuration: Some fields require more complex configuration for data generation, simply because the data needs some baseline of what the range or length of the values need to be. The following list details each of these. It is also important to note that these complex field configurations do not have defaults, each value must be set in the JSON configuration or an error will be thrown detailing that a complex field type wasn’t designated. Each of these complex field types is a JSON name and value parameter. The name is the passed in data type with a preceding underscore ‘_’ to generate with the value having the configuration parameters for that particular data type.
    • _string“: This generates string data based on a length and pool parameters. Required fields for this include fieldNamelength and pool. The JSON would look like this:[sourcecode language=”javascript”]"_string": {
      "fieldName": "NameOfFieldForString",
      "length": 5,
      "pool": "abcdefgh"
      }
      [/sourcecode]

      Samples of the result would look like this for the field; ‘abdef’, ‘hgcde’ or ‘ahdfg’.

    • _hash“: This generates a hash based on the length and upper parameters. Required fields for this included fieldNamelength and upper. The JSON would look like this:[sourcecode language=”javascript”]"_hash": {
      "fieldName": "HashFieldName",
      "length": 25,
      "casing": ‘upper’
      }
      [/sourcecode]

      Samples of the result would look like this for the field: ‘e5162f27da96ed8e1ae51def1ba643b91d2581d8’ or ‘3F2EB3FB85D88984C1EC4F46A3DBE740B5E0E56E’.

    • _name”: This generates a name based on the middle, *middleinitial* and prefix parameters. Required fields for this included fieldNamemiddlemiddle_initial and prefix. The JSON would look like this:[sourcecode language=”javascript”]"_name": {
      "fieldName": "nameFieldName",
      "middle": true,
      "middle_initial": true,
      "prefix": true
      }
      [/sourcecode]

      Samples of the result would look like this for the field: ‘Dafi Vatemi’, ‘Nelgatwu Powuku Heup’, ‘Ezme I Iza’, ‘Doctor Suosat Am’, ‘Mrs. Suosat Am’ or ‘Mr. Suosat Am’.

So that covers the kick start of how eventually you’ll be able to setup, use and generate data. Until then, jump into the project and give us a hand.

After this, more examples on the way, cheers!