The Exasperating Topic of Database Indexes

Previous posts on this theme, “Designing a Relational Database – Data Modeling“ and “Let’s Talk About Database Schema“.

Indexes aren’t necessary right? Of course not, they’re just peripherally superficial things that sit like leaches on a database table’s column!

Ok, bad database satire aside, let’s get real. Indexes are extremely useful for a plethora of reasons. They play a crucial role in optimizing the speed of database operations, especially retrieval operations.

Demystifying the Magic Behind Relational Database Indexes

Ever tried finding a specific recipe in a massive cookbook? Imagine if there wasn’t an index! Just as that cookbook index saves you hours of flipping through page after page, a relational database index works wonders by helping the system quickly locate rows within a table without the tedious process of examining each one.

So, How Does This Magic Work?

Continue reading “The Exasperating Topic of Database Indexes”

A Survey of 21 ETL Tools for Python

Here are summaries of each of the tools you’ve mentioned along with examples of how to implement the ETL (Extract, Transform, Load) process using each tool within a Python workflow:

  1. Apache Spark: Apache Spark is a powerful open-source cluster-computing framework that provides an interface for programming entire clusters with implicit data parallelism and fault tolerance. It’s commonly used for processing large-scale data and running complex ETL pipelines. Example Implementation:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("ETLExample") \
    .getOrCreate()

# Load data from source
source_data = spark.read.csv("source_data.csv", header=True, inferSchema=True)

# Apply transformations
transformed_data = source_data.select("column1", "column2").filter(source_data["column3"] > 10)

# Write data to destination
transformed_data.write.parquet("transformed_data.parquet")

spark.stop()
  1. Apache Airflow: Apache Airflow is an open-source platform to programmatically author, schedule, and monitor workflows. It allows you to define complex ETL workflows as directed acyclic graphs (DAGs) and manage their execution. Example Implementation: Define a DAG in a Python script:
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from datetime import datetime

def etl_process():
    # Your ETL logic here
    pass

default_args = {
    'start_date': datetime(2023, 8, 1),
    'schedule_interval': '0 0 * * *',  # Run daily at midnight
}

dag = DAG('etl_workflow', default_args=default_args)

etl_task = PythonOperator(
    task_id='etl_task',
    python_callable=etl_process,
    dag=dag,
)
Continue reading “A Survey of 21 ETL Tools for Python”

MariaDB & SkySQL DBAAS

This is a getting started guide for MariaDB SkySQL. Let’s start with two prerequisites definitions:

  1. MariaDB – MariaDB is an open-source relational database management system (RDBMS) that is a fork of MySQL, another popular open-source database system. It was created by the original developers of MySQL after concerns arose about the acquisition of MySQL by Oracle Corporation in 2010. MariaDB is designed to be a drop-in replacement for MySQL, which means that many applications and tools developed for MySQL can also work seamlessly with MariaDB without requiring significant changes. It retains much of the same syntax, APIs, and commands as MySQL, making the transition relatively straightforward for users.
  2. MariaDB SkySQL – MariaDB SkySQL is a cloud-native Database as a Service (DBaaS) offering provided by MariaDB Corporation, the company behind the development of the MariaDB open-source database system. SkySQL is designed to simplify database management, deployment, and scaling by providing a fully managed and highly available MariaDB database solution in the cloud.

Some key features of MariaDB include:

  1. High Performance: MariaDB incorporates optimizations and improvements to enhance query execution speed and overall performance.
  2. Storage Engines: MariaDB supports multiple storage engines, including the popular InnoDB and Aria engines. Each engine has its own characteristics and performance attributes, allowing users to choose the one that best fits their requirements.
  3. Security: MariaDB includes various security enhancements, such as data encryption at rest and in transit, improved authentication methods, and better access control mechanisms.
  4. Open Source: MariaDB is fully open source, which means its source code is available for anyone to inspect, modify, and contribute to.
  5. Community and Development: MariaDB has a vibrant and active community of developers and contributors who work on its continued development and improvement.
  6. Compatibility: As mentioned earlier, MariaDB aims for compatibility with MySQL, allowing applications developed for MySQL to work with minimal changes.
  7. Extensions: MariaDB introduces some features not present in MySQL, such as the Aria storage engine, thread pooling, and more advanced geographic information system (GIS) functionality.
  8. Replication and Clustering: Like MySQL, MariaDB supports various replication methods and clustering solutions for high availability and fault tolerance.
  9. Plugins: MariaDB offers a plugin architecture that allows users to add custom functionality and features to the database system.

To elaborate further on the specifics of MariaDB SkySQL, here are some of the features of the DBAAS (DataBase As A Service):

  1. Managed Service: SkySQL takes care of database administration tasks such as provisioning, backup, monitoring, maintenance, and security updates. This allows users to focus more on their applications and less on managing the underlying database infrastructure.
  2. High Availability: SkySQL offers built-in high availability configurations that ensure database uptime and data durability. This includes automatic failover and replication setups.
  3. Scalability: SkySQL supports both vertical and horizontal scaling. Vertical scaling involves adjusting the resources of a single database instance, while horizontal scaling involves distributing data across multiple nodes for improved performance and capacity.
  4. Security: Security features such as encryption at rest and in transit, role-based access control, and network security protocols are integrated to help protect sensitive data.
  5. Multi-Cloud Support: SkySQL is designed to work across various cloud providers, enabling users to choose the cloud environment that best suits their needs. It supports popular cloud platforms like Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform (GCP).
  6. Compatibility: SkySQL maintains compatibility with the MariaDB database, which means applications developed for MariaDB can run seamlessly on SkySQL with minimal modifications.
  7. Global Distributed Architecture: SkySQL offers the capability to deploy databases across multiple geographic regions for improved performance and data availability across different parts of the world.
  8. Managed Upgrades: Regular updates and improvements to the MariaDB database engine are managed by the SkySQL service, ensuring that your databases remain up to date without manual intervention.
  9. Pay-as-You-Go: SkySQL’s pricing model is typically based on usage, allowing users to pay for the resources they consume. This can be cost-effective for businesses as it eliminates the need to invest in and maintain dedicated database infrastructure.
Continue reading “MariaDB & SkySQL DBAAS”

Schedule Updates: Databases, Coding, Meetups, & More

Here is my updated schedule for the two weeks starting the 23rd for my Twitch streams, the basic topics, meetups (a little beyond two weeks), and related events coming up.

Series: Thrashing Code General Calamity

This is going to be a mix of tech this week. Probably some database hacking, code hacking, samples, and setup of even more examples for use throughout your coding week!

Series: Bunches of Databases in Bunches of Weeks

Focusing on DataStax Enterprise setup and configuration over the next few weeks.

Series: Meetups

This is a little further out than the next few weeks, but in August we’re having another meetup!

Series: Building the Geo App Trux (w/ Vue.js, Go, and DataStax Enterprise (Apache Cassandra). Navigate through to the DataStax Developers Channel to check out the event times.

Series: DataStax’s Apache Cassandra & C# Hour (C# Schema Migration builder, C# driver, and additional content). Navigate through to the DataStax Developers Channel to check out the event times.

Let’s Talk Top 7 Options for Database Gumbo

When one starts to dig into databases things get really complex really fast. There’s not only a whole plethora of database companies and projects, but database types, storage engines, and other options and functionality to choose from. One place to get a start is just to take a look at the crazy long list of databases on db-engines. In this post I’m going to take a look at a few of the top database engines to create a starting point – which I’ll reference – for future video streaming coding sessions (follow me @ twitch.tv/adronhall).

My Options for Database Gumbo

  1. Apache Cassandra / DataStax Enterprise
  2. Postgresql
  3. SQL Server
  4. Elasticsearch
  5. Redis
  6. SQLite
  7. Dynamo DB

The Reasons

Ok, so the list is as such, and as stated it’s my list. There are a lot of databases, and of course some are still more used such as Oracle. However here’s some of the logic and reasoning behind my choices above.

Oracle

First off I feel like I need to broach the Oracle topic. Mostly because of their general use in industry. I’m not doing anything with Oracle now, nor have I for years for a long, long, LONG list of reasons. Using their software tends to be buried in bureaucratic, oddly broken and unnecessary usage today anyway. They use predatory market tactics, completely dishonorable approach to sales and services, as well as threatening and suing people for doing benchmarks, and a host of other practices. In face to face experiences, Oracle tends to give off experiences, that Lawrence from Office Space would say, “naw man, I think you’d get your ass kicked for that!” and I agree. Oracle’s practices are too often disgusting. But even from the purely technical point of view, the Oracle Database and ecosystem itself really isn’t better than other options out there. It is indeed a better, more intelligently strategic and tactical option to use a number of alternatives.

Apache Cassandra / DataStax Enterprise

This combo has multiple reasons and logic to be on the list. First and foremost, much of my work today is using DataStax Enterprise (DSE) and Apache Cassandra since I work for DataStax. But it’s important to know I didn’t just go to DataStax because I needed a job, but because I chose them (and obviously they chose me by hiring me) because of the team and technology. Yes, they pay me, but it’s very much a two way street, I advocate Cassandra and DSE because I personally know the tech is top tier and solid.

On the fact that Apache Cassandra is top tier and solid, it is simply the remaining truly masterless distributed database that provides a linear path of scalability on the market that you can use, buy support for, and is actually actively and knowingly maintained not just by DataStax but by members of the community. One could make an argument for MongoDB but I’ll maybe elaborate on that in the future.

In addition to being a solid distributed database there are capabilities inherent in Apache Cassandra because of the data types and respective the CQL (Cassandra Query Language) that make it a great database to use too. DataStax Enterprise extends that to provide spatial (re: GIS/Geo Data/Queries), graph data, analytics engine, and more built on other components like SOLR and related technology. Overall a great database and great prospective combinations with the database.

Postgresql

Postgres is a relational database that has been around for a long time. It’s got some really awesome features like native JSON support, which I’m a big fan of. But I digress, there’s tons of other material that lays out thoroughly why to use Postgres which I very much agree with.

Just from the perspective of the extensive and rich data types Postgres is enough to be put on this list, but considering there are a lot of reasons around multi-tenancy, scalability, and related characteristics that are mostly unique to Postgres it’s held a solid position.

SQL Server

This one is on my list for a few reasons that have nothing to do with features or capabilities. This is the first database I was responsible for in its entirety. Administration, queries, query tuning, setup, and developer against with the application tier. I think of all my experience, this database I’ve spent the most time with, with Apache Cassandra being a close second, then Postgres and finally Riak.

Kind of a pattern there eh? Relational, distributed, relational, distributed!

The other thing about SQL Server however is the integrations, tooling, and related development ecosystem around SQL Server is above and beyond most options out there. Maybe, with a big maybe, Oracle’s ecosystem might be comparable but the pricing is insanely different. In that SQL Server basically can carry the whole workload, reporting, ETL, and other feature capabilities that the Oracle ecosystem has traditionally done. Combine SQL Server with SSIS (SQL Server Integration Services), SSRS (SQL Server Reporting Services), and other online systems like Azure’s SQL Database and the support, tooling, and ecosystem is just massive. Even though I’ve had my ins and outs with Microsoft over the years, I’ve always found myself enjoying working on SQL Server and it’s respective tooling options and such. It’s a feature rich, complete, solidly, and generally well performing relational database, full stop.

Elasticsearch

Ok, this is kind of a distributed database of sorts but focused more exclusively (not totally since it’s kind of expanded its roles) search engine. Overall I’ve had good experiences with Elasticsearch and it’s respective ELK (or Elastic ecosystem) of tooling and such, with some frustrating flakiness here and there over the years. Most of my experience has come from an operational point of view with Elasticsearch. I’ve however done a fair bit of work over the years in supporting teams that are doing actual software development against the system. I probably won’t write a huge amount about Elasticsearch in the coming months, but I’ll definitely bring it up at certain times.

Redis / SQLite / DynamoDB

These I’ll be covering in the coming months. For Redis and DynamoDB I have wanted to dig in for some comparison analysis from the perspective of implementing data tiers against these databases, where they are a good option, and determining where they’re just an outright bad option.

For SQLite I’ve used it on and off for many years, but have wanted to sit down and just learn it and try out some of its features a bit more.