MariaDB & SkySQL DBAAS

This is a getting started guide for MariaDB SkySQL. Let's start with two prerequisite definitions:

  1. MariaDB – MariaDB is an open-source relational database management system (RDBMS) that is a fork of MySQL, another popular open-source database system. It was created by the original developers of MySQL after concerns arose about the acquisition of MySQL by Oracle Corporation in 2010. MariaDB is designed to be a drop-in replacement for MySQL, which means that many applications and tools developed for MySQL can also work seamlessly with MariaDB without requiring significant changes. It retains much of the same syntax, APIs, and commands as MySQL, making the transition relatively straightforward for users.
  2. MariaDB SkySQL – MariaDB SkySQL is a cloud-native Database as a Service (DBaaS) offering provided by MariaDB Corporation, the company behind the development of the MariaDB open-source database system. SkySQL is designed to simplify database management, deployment, and scaling by providing a fully managed and highly available MariaDB database solution in the cloud.

Some key features of MariaDB include:

  1. High Performance: MariaDB incorporates optimizations and improvements to enhance query execution speed and overall performance.
  2. Storage Engines: MariaDB supports multiple storage engines, including the popular InnoDB and Aria engines. Each engine has its own characteristics and performance attributes, allowing users to choose the one that best fits their requirements (see the quick example after this list).
  3. Security: MariaDB includes various security enhancements, such as data encryption at rest and in transit, improved authentication methods, and better access control mechanisms.
  4. Open Source: MariaDB is fully open source, which means its source code is available for anyone to inspect, modify, and contribute to.
  5. Community and Development: MariaDB has a vibrant and active community of developers and contributors who work on its continued development and improvement.
  6. Compatibility: As mentioned earlier, MariaDB aims for compatibility with MySQL, allowing applications developed for MySQL to work with minimal changes.
  7. Extensions: MariaDB introduces some features not present in MySQL, such as the Aria storage engine, thread pooling, and more advanced geographic information system (GIS) functionality.
  8. Replication and Clustering: Like MySQL, MariaDB supports various replication methods and clustering solutions for high availability and fault tolerance.
  9. Plugins: MariaDB offers a plugin architecture that allows users to add custom functionality and features to the database system.
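
To illustrate the storage engine point (item 2 above), here's a minimal SQL sketch, with hypothetical table names, showing how to inspect the available engines and pick one per table in MariaDB:

-- List the storage engines this MariaDB server supports.
SHOW ENGINES;

-- Hypothetical tables: one on InnoDB (transactional), one on Aria.
create table accounts
(
    account_id int PRIMARY KEY,
    balance decimal(12,2)
) ENGINE=InnoDB;

create table activity_log
(
    log_id int AUTO_INCREMENT PRIMARY KEY,
    message text
) ENGINE=Aria;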

To elaborate further on the specifics of MariaDB SkySQL, here are some of the features of the DBaaS (Database as a Service):

  1. Managed Service: SkySQL takes care of database administration tasks such as provisioning, backup, monitoring, maintenance, and security updates. This allows users to focus more on their applications and less on managing the underlying database infrastructure.
  2. High Availability: SkySQL offers built-in high availability configurations that ensure database uptime and data durability. This includes automatic failover and replication setups.
  3. Scalability: SkySQL supports both vertical and horizontal scaling. Vertical scaling involves adjusting the resources of a single database instance, while horizontal scaling involves distributing data across multiple nodes for improved performance and capacity.
  4. Security: Security features such as encryption at rest and in transit, role-based access control, and network security protocols are integrated to help protect sensitive data.
  5. Multi-Cloud Support: SkySQL is designed to work across various cloud providers, enabling users to choose the cloud environment that best suits their needs. It supports popular cloud platforms like Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform (GCP).
  6. Compatibility: SkySQL maintains compatibility with the MariaDB database, which means applications developed for MariaDB can run seamlessly on SkySQL with minimal modifications.
  7. Global Distributed Architecture: SkySQL offers the capability to deploy databases across multiple geographic regions for improved performance and data availability across different parts of the world.
  8. Managed Upgrades: Regular updates and improvements to the MariaDB database engine are managed by the SkySQL service, ensuring that your databases remain up to date without manual intervention.
  9. Pay-as-You-Go: SkySQL’s pricing model is typically based on usage, allowing users to pay for the resources they consume. This can be cost-effective for businesses as it eliminates the need to invest in and maintain dedicated database infrastructure.

Splitting a Postgres Timestamp with Generated Columns & GraphQL Query with Hasura

Recently I created a video short on how to split out a timestamp column for Hasura. This included the SQL for Postgres via a schema migration and also details on how this appears in the Hasura user interface. You can check out the video here.

The breakout of what I show in the video is also available in a GitHub repository.

https://github.com/Adron/graphql-relational-concept-mapping

Postgres Table Creation SQL

Here is the specific database query that creates the table, with the timestamp broken out into year, month, and day as generated column data.

create table standard_relational_model.users_data
(
    user_id uuid PRIMARY KEY,
    address_id uuid,
    signup_date timestamp DEFAULT now(),
    year int GENERATED ALWAYS AS (date_part('year', signup_date)) STORED,
    month int GENERATED ALWAYS AS (date_part('month', signup_date)) STORED,
    day int GENERATED ALWAYS AS (date_part('day', signup_date)) STORED,
    points int,
    details jsonb
);

In this SQL the signup_date column is the timestamp column that I want split out into year, month, and day. I've set it up with a default of now() just to seed the column so it doesn't require a value when inserting a new row. From that value, the generated columns year, month, and day use the date_part() function to extract the particular part of signup_date and store it in the respective column.

The other columns are just there for reference.
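
As a quick sketch of the generated columns in action (the uuid value here is just a placeholder), inserting a row without specifying year, month, or day shows them populated automatically:

insert into standard_relational_model.users_data (user_id, points)
values ('6f9619ff-8b86-d011-b42d-00cf4fc964ff', 100);

select signup_date, year, month, day
from standard_relational_model.users_data
where user_id = '6f9619ff-8b86-d011-b42d-00cf4fc964ff';

The signup_date defaults to now(), and year, month, and day are derived from it at write time.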

The Hasura Console

In the Hasura Console those columns would look something like this.

Notice the syntax displayed for these is different from that of the migration that created them.

date_part('day'::text, signup_date)

The above of course is for day, and each respective part is designated by month, year, etc.

When data is added to the table, the results return as follows, first via GraphQL and then via SQL.

GraphQL

The query.

query MyQuery {
   users_data {
    signup_date
    year
    month
    day
  }
}

The results.

{
  "data": {
    "users_data": [
      {
        "signup_date": "1999-04-21T00:00:00",
        "year": 1999,
        "month": 4,
        "day": 21
      },
      ... etc ...
      {
        "signup_date": "2007-01-02T00:00:00",
        "year": 2007,
        "month": 1,
        "day": 2
      },
      {
        "signup_date": "2021-06-29T00:09:48.359247",
        "year": 2021,
        "month": 6,
        "day": 29
      }
    ]
  }
}

SQL

The query.

select signup_date, year, month, day
from standard_relational_model.users_data;

The results.

1999-04-21 00:00:00.000000,1999,4,21
2012-07-04 00:00:00.000000,2012,7,4
2019-06-24 00:00:00.000000,2019,6,24
2013-03-07 00:00:00.000000,2013,3,7
2007-01-02 00:00:00.000000,2007,1,2
2021-06-29 00:09:48.359247,2021,6,29

That is how to build generated columns in Postgres and how Hasura exposes them via GraphQL!

Scaling PostgreSQL + Top 12 Curated Posts

A video introduction to the basics of scaling a relational database like PostgreSQL.

There are two main ways to scale data storage, especially databases, and the resources available to store and process that data.

Horizontal Scaling (scale-out): This is done by adding more individual machines in some way, creating a cluster for the database to operate across. Various databases handle this in different ways, ranging from distributed databases that are specifically built to scale horizontally, to relational databases that require different strategies, like sharding the data itself.

Vertical Scaling (scale-up): This includes increasing the individual system resources allocated to a database on a single vertically integrated machine, such as the CPU or number of CPUs, memory, and disk or disks.

For horizontal scaling the approach often requires sharding the data across multiple databases and then pooling those resources together via connection pooling, load balancers, and other infrastructure to direct the correct requests and traffic at the right sharded database resource.

Sometimes, from a purely process-centric horizontal scaling perspective, we just need more query and processing power while the data can be stored on a single machine. In this case, we might implement a proxy to direct traffic across connection bouncers (such as PgBouncer) that would then route requests and responses to and from the individual machines.

Vertical scaling often requires new hardware and a migration to the larger machine. Sometimes, however, it may just require more CPU, RAM, or drive storage. If it is a singular need, it can often be met by adding or swapping out one of those elements. For example, if a database is continually hitting the peak processing of its CPUs to handle queries, one might be able to add another processor, or change to a faster processor or one with more cores. If memory is overloaded, the same applies: simply increase the maximum memory in the system. Of course, the underlying limitations of vertical scaling come up when you already have the fastest CPU or the memory is already maxed out. In that situation one might need to think about horizontal scaling as described in the previous example.
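
To make the sharding idea a bit more concrete, here's a minimal sketch using Postgres declarative hash partitioning, with hypothetical table and column names. Partitioning splits rows across child tables by key on a single server, which is the same idea sharding applies across separate databases (often with tooling like Citus or postgres_fdw handling the distribution):

create table measurements
(
    device_id uuid not null,
    recorded_at timestamp not null,
    reading numeric
) partition by hash (device_id);

-- Each partition holds roughly a quarter of the rows, routed by hash of device_id.
create table measurements_p0 partition of measurements for values with (modulus 4, remainder 0);
create table measurements_p1 partition of measurements for values with (modulus 4, remainder 1);
create table measurements_p2 partition of measurements for values with (modulus 4, remainder 2);
create table measurements_p3 partition of measurements for values with (modulus 4, remainder 3);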

The following is a top 12 list of blog entries that cover and expand on the topic in many ways. I've found these (along with more than a few others) very useful in my own efforts around scaling Postgres (and other databases).

  1. Vertically Scaling PostgreSQL
  2. Vertically Scale Your PostgreSQL Infrastructure w/ pgpool -3- Adding Another Standby
  3. Vertically scale your PostgreSQL infrastructure with pgpool — 2 — Automatic failover and reconfiguration
  4. How well can PostgreSQL scale horizontally?
  5. How to Horizontally Scale Your Postgres Database Using Citus
  6. Horizontal scalability with Sharding in PostgreSQL — Where it is going Part 1 of 3
  7. Lessons learned scaling PostgreSQL database to 1.2bn records/month
  8. Ideas for Scaling PostgreSQL to Multi-Terabyte and Beyond
  9. Scaling PostgreSQL Using Connection Poolers & Load Balancers
  10. Vertically Scaling PostgreSQL
  11. Scaling PostgreSQL for Large Amounts of Data
  12. Scaling Postgres

Jonathan Ellis talks about Five Lessons in Distributed Databases

Notes on the talk…

  1. If it’s not SQL it’s not a database. Watch, you’ll get to hear why… ha!

Then Jonathan covers the recent history (sort of recent, the last ~20ish years) of the industry and how we’ve gotten to this point in database technology.

  2. It takes 5+ years to build a database.

Also tens of millions of dollars over that period of time. Both are needed, in droves: time and money.

…more below the video.

  3. The customer is always right.

Even when they’re clearly wrong, they’re largely right.

For numbers 4 and 5 you'll have to watch the video. Lots of good stuff in this video, including comparisons of Cosmos DB, DynamoDB, Apache Cassandra, and DataStax Enterprise, how these distributed databases work, their performance (third-party metrics are shown), and more details!

Distributed Systems: Cassandra, DataStax, a Short SITREP

SITREP = Situation Report. It’s military speak. 💂🏻‍♂️

Apache Cassandra is one of the most popular databases in use today. It has many characteristics and distinctive architectural details. In this post I'll provide a description and some details for a number of these features and characteristics. Then, toward the end (so skip there if you just want the differences), I'm going to summarize key differences with the latest release of the DataStax Enterprise 6 version of the database.

Cassandra Characteristics

Cassandra is a linearly scalable, highly available, fault-tolerant, distributed database, just to name a few of its most important characteristics. The Cassandra database is also cross-platform (runs on any major operating system), multi-cloud (runs on and across multiple clouds), and can survive regional data center outages or, in multi-cloud scenarios, even entire cloud provider outages!

Columnar Store, Column Based, or Column Family? What? Ok, so you might have read a number of things about what Cassandra actually is. Let's break this down. First off, a columnar (or column store, or column oriented) database guarantees data locality for a single column in a node on disk. The column may span some or all of the rows, depending on where or how you specify partitions. However, this isn't what the Cassandra database uses. Cassandra is a column-family database.

A column-family storage architecture makes sure the data is stored based on locality of the data at the partition level, not the column level. Cassandra partitions group rows and columns split by a partition key, then clustered together by a specified clustering column or columns. Because of this, to query Cassandra you must know the partition key in order to avoid full data scans!
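
To make that concrete, here's a minimal CQL sketch with hypothetical keyspace, table, and column names, showing a partition key plus a clustering column and a query that stays within a single partition:

create table sensor_data.readings
(
    device_id uuid,
    recorded_at timestamp,
    reading double,
    primary key ((device_id), recorded_at)
) with clustering order by (recorded_at desc);

-- Fast: the partition key is specified, so Cassandra knows which node holds the partition.
select * from sensor_data.readings
where device_id = 6f9619ff-8b86-d011-b42d-00cf4fc964ff;

-- Without the partition key, a query like this would require a full scan (and ALLOW FILTERING):
-- select * from sensor_data.readings where reading > 100;

All rows sharing a device_id land in the same partition, sorted by recorded_at, which is exactly the locality guarantee described above.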

Cassandra's partitions are guaranteed to be on the same node, stored in a sorted string table (referred to most commonly as an SSTable) in the same location within that file. Though, depending on the compaction strategy, this can change things and the partition can be split across multiple files on a disk. So really, data locality isn't strictly guaranteed.

Column-family stores are great for high-throughput writes and the ability to scale linearly and horizontally (ya know, getting lots and lots of nodes in the cloud!). Reads using the partition key are extremely fast since this key points to exactly where the data resides. However, this often (at least as far as I know) leads to a full scan of the data for any type of ad-hoc query.

A sort of historically trivial but important point: the column-family term comes from the storage engine originally used, which was based on a key-value store. The value was a set of column-value tuples, which were often referred to as a family; later this family was abstracted into partitions, and then the storage engine was matched to that abstraction. Whew, ok, so that's a lot of knowledge being coagulated into a solid, eh!  [scuse' my odd artful language use if you visualized that!]

With all of this described, and that little history sprinkled in, the description of Cassandra in the README.asc file of the actual Cassandra GitHub repo makes just a little more sense. The file starts off with a description,

Apache Cassandra is a highly-scalable partitioned row store. Rows are organized into tables with a required primary key.

Partitioning means that Cassandra can distribute your data across multiple machines in an application-transparent matter. Cassandra will automatically repartition as machines are added and removed from the cluster.

Row store means that like relational databases, Cassandra organizes data by rows and columns. The Cassandra Query Language (CQL) is a close relative of SQL.

Now that I've covered the 101 level of what Cassandra is, I'll take a look at DataStax and their respective offering.

DataStax

DataStax Enterprise at first glance might be a bit confusing, since immediate questions pop up like, "Doesn't DataStax make Cassandra?", "Isn't DataStax just selling support for Cassandra?", or "Eh, wha, who is DataStax and what does this have to do with Cassandra?". Well, I'm gonna tell ya all about how all of these things fit together today.

Performance

DataStax provides a whole selection of amenities around a database that is derived from the Cassandra distributed database system. The core product and these amenities are built into what we refer to as "DataStax Enterprise 6". Some of the specific differences are that the database engine itself has been modified out of band and now delivers 2x the performance of the standard Cassandra database engine. I was somewhat dubious when I joined, but after the third-party benchmarks were completed showing the difference, I grew more confident. My confidence in this speed increase has grown as I've worked with the latest version; in more than a few situations I can tell that it's faster.

Read Repair & NodeSync

If you already use Cassandra, read repair works a certain way, and that still works just fine in DataStax Enterprise 6. But one also has the option of using NodeSync, which can help eliminate scripting, manual intervention, and other repair operations.
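
As a quick sketch of what that looks like in practice (the keyspace and table names here are hypothetical), NodeSync is enabled per table with a CQL table option in DSE 6:

alter table my_keyspace.my_table
    with nodesync = { 'enabled' : 'true' };

Once enabled, the service continuously validates and repairs that table in the background, in place of scheduled repair jobs.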

Spark SQL Connectivity

There's also an always-on SQL engine for automated uptime for apps using DataStax Enterprise Analytics. This provides better service for analytics requests and end-user analytics. On a related note, DataStax Studio also has notebook support for Spark SQL now. Writing one's Spark SQL gets a little easier with this option.
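
For a sense of what that looks like, here's a hedged sketch of a Spark SQL query as one might run it from a DataStax Studio notebook, with hypothetical keyspace, table, and column names (Cassandra keyspaces and tables are exposed to Spark SQL as databases and tables):

-- Aggregate signups per year from a Cassandra table, via Spark SQL.
select year, count(*) as signups
from my_keyspace.users_data
group by year
order by year;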

Multi-Cloud / Hybrid-Cloud

Another huge advantage of DataStax Enterprise is going multi-cloud or hybrid-cloud with DataStax Enterprise Cassandra. Between Lifecycle Manager (LCM), OpsCenter, and related tooling, getting up and running with a cluster across a varying range of data centers, wherever they may be, is quick and easy.

Summary

I'll be providing deeper dives into the particular technology, the specific differences, and more in the future. For now I'll wrap up this post, as I've got a few others coming that are distinctly related to distributed database systems themselves, ranging from specific principles (like the CAP theorem) to operational (how to manage, and the best ways to) and development (patterns and practices of developing against them) topics.

Overall, the solutions that DataStax offers are solid advantages if you're stepping into any large-scale data needs (big data or whatever one would call their plethora of data). Over the coming months I've got a lot of material, from architectural research and guidance to tactical coding implementation work, that I'll be blogging about and providing. I'm really looking forward to exploring these capabilities, being the developer advocate for DataStax to the community of users, and learning a thing or three million.