Normalization in Relational Databases

This is a continuation of my posts on relational databases, started here. Previous posts on this theme, “Data Modeling“, “Let’s Talk About Database Schema“, “The Exasperating Topic of Database Indexes“, “The Keys & Relationships of Relational Databases“, and “Relational Database Query Optimization“.

Consider you’re a die-hard fan of progressive death metal, with a particular affinity for bands like Allegaeon. Over the years, you’ve accumulated a vast collection of CDs, vinyls, and other memorabilia.

Your collection has grown so much that you decide to document every item meticulously. Each piece of memorabilia contains information about the album, track titles, band members, and so forth. If you scribbled down every detail in one continuous list, you’d end up with a lot of repeated information. For instance, both “Proponent for Sentience” and “Formshifter” would mention the same band members like Riley McShane and Michael Stancel.

Normalization is akin to setting up separate lists or sections in your documentation. One section purely for “Band Members” where you detail members of Allegaeon over time. Another section for “Albums”, where instead of listing band members all over again, you simply refer back to the “Band Members” section. This kind of organization cuts down redundancy and ensures if, say, a band member leaves, you have only one spot to update.

Continue reading “Normalization in Relational Databases”

The Exasperating Topic of Database Indexes

Previous posts on this theme, “Designing a Relational Database – Data Modeling“ and “Let’s Talk About Database Schema“.

Indexes aren’t necessary right? Of course not, they’re just peripherally superficial things that sit like leaches on a database table’s column!

Ok, bad database satire aside, let’s get real. Indexes are extremely useful for a plethora of reasons. They play a crucial role in optimizing the speed of database operations, especially retrieval operations.

Demystifying the Magic Behind Relational Database Indexes

Ever tried finding a specific recipe in a massive cookbook? Imagine if there wasn’t an index! Just as that cookbook index saves you hours of flipping through page after page, a relational database index works wonders by helping the system quickly locate rows within a table without the tedious process of examining each one.

So, How Does This Magic Work?

Continue reading “The Exasperating Topic of Database Indexes”

A Survey of 21 ETL Tools for Python

Here are summaries of each of the tools you’ve mentioned along with examples of how to implement the ETL (Extract, Transform, Load) process using each tool within a Python workflow:

  1. Apache Spark: Apache Spark is a powerful open-source cluster-computing framework that provides an interface for programming entire clusters with implicit data parallelism and fault tolerance. It’s commonly used for processing large-scale data and running complex ETL pipelines. Example Implementation:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("ETLExample") \
    .getOrCreate()

# Load data from source
source_data = spark.read.csv("source_data.csv", header=True, inferSchema=True)

# Apply transformations
transformed_data = source_data.select("column1", "column2").filter(source_data["column3"] > 10)

# Write data to destination
transformed_data.write.parquet("transformed_data.parquet")

spark.stop()
  1. Apache Airflow: Apache Airflow is an open-source platform to programmatically author, schedule, and monitor workflows. It allows you to define complex ETL workflows as directed acyclic graphs (DAGs) and manage their execution. Example Implementation: Define a DAG in a Python script:
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from datetime import datetime

def etl_process():
    # Your ETL logic here
    pass

default_args = {
    'start_date': datetime(2023, 8, 1),
    'schedule_interval': '0 0 * * *',  # Run daily at midnight
}

dag = DAG('etl_workflow', default_args=default_args)

etl_task = PythonOperator(
    task_id='etl_task',
    python_callable=etl_process,
    dag=dag,
)
Continue reading “A Survey of 21 ETL Tools for Python”

MariaDB & SkySQL DBAAS

This is a getting started guide for MariaDB SkySQL. Let’s start with two prerequisites definitions:

  1. MariaDB – MariaDB is an open-source relational database management system (RDBMS) that is a fork of MySQL, another popular open-source database system. It was created by the original developers of MySQL after concerns arose about the acquisition of MySQL by Oracle Corporation in 2010. MariaDB is designed to be a drop-in replacement for MySQL, which means that many applications and tools developed for MySQL can also work seamlessly with MariaDB without requiring significant changes. It retains much of the same syntax, APIs, and commands as MySQL, making the transition relatively straightforward for users.
  2. MariaDB SkySQL – MariaDB SkySQL is a cloud-native Database as a Service (DBaaS) offering provided by MariaDB Corporation, the company behind the development of the MariaDB open-source database system. SkySQL is designed to simplify database management, deployment, and scaling by providing a fully managed and highly available MariaDB database solution in the cloud.

Some key features of MariaDB include:

  1. High Performance: MariaDB incorporates optimizations and improvements to enhance query execution speed and overall performance.
  2. Storage Engines: MariaDB supports multiple storage engines, including the popular InnoDB and Aria engines. Each engine has its own characteristics and performance attributes, allowing users to choose the one that best fits their requirements.
  3. Security: MariaDB includes various security enhancements, such as data encryption at rest and in transit, improved authentication methods, and better access control mechanisms.
  4. Open Source: MariaDB is fully open source, which means its source code is available for anyone to inspect, modify, and contribute to.
  5. Community and Development: MariaDB has a vibrant and active community of developers and contributors who work on its continued development and improvement.
  6. Compatibility: As mentioned earlier, MariaDB aims for compatibility with MySQL, allowing applications developed for MySQL to work with minimal changes.
  7. Extensions: MariaDB introduces some features not present in MySQL, such as the Aria storage engine, thread pooling, and more advanced geographic information system (GIS) functionality.
  8. Replication and Clustering: Like MySQL, MariaDB supports various replication methods and clustering solutions for high availability and fault tolerance.
  9. Plugins: MariaDB offers a plugin architecture that allows users to add custom functionality and features to the database system.

To elaborate further on the specifics of MariaDB SkySQL, here are some of the features of the DBAAS (DataBase As A Service):

  1. Managed Service: SkySQL takes care of database administration tasks such as provisioning, backup, monitoring, maintenance, and security updates. This allows users to focus more on their applications and less on managing the underlying database infrastructure.
  2. High Availability: SkySQL offers built-in high availability configurations that ensure database uptime and data durability. This includes automatic failover and replication setups.
  3. Scalability: SkySQL supports both vertical and horizontal scaling. Vertical scaling involves adjusting the resources of a single database instance, while horizontal scaling involves distributing data across multiple nodes for improved performance and capacity.
  4. Security: Security features such as encryption at rest and in transit, role-based access control, and network security protocols are integrated to help protect sensitive data.
  5. Multi-Cloud Support: SkySQL is designed to work across various cloud providers, enabling users to choose the cloud environment that best suits their needs. It supports popular cloud platforms like Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform (GCP).
  6. Compatibility: SkySQL maintains compatibility with the MariaDB database, which means applications developed for MariaDB can run seamlessly on SkySQL with minimal modifications.
  7. Global Distributed Architecture: SkySQL offers the capability to deploy databases across multiple geographic regions for improved performance and data availability across different parts of the world.
  8. Managed Upgrades: Regular updates and improvements to the MariaDB database engine are managed by the SkySQL service, ensuring that your databases remain up to date without manual intervention.
  9. Pay-as-You-Go: SkySQL’s pricing model is typically based on usage, allowing users to pay for the resources they consume. This can be cost-effective for businesses as it eliminates the need to invest in and maintain dedicated database infrastructure.
Continue reading “MariaDB & SkySQL DBAAS”

Schedule Updates: Databases, Coding, Meetups, & More

Here is my updated schedule for the two weeks starting the 23rd for my Twitch streams, the basic topics, meetups (a little beyond two weeks), and related events coming up.

Series: Thrashing Code General Calamity

This is going to be a mix of tech this week. Probably some database hacking, code hacking, samples, and setup of even more examples for use throughout your coding week!

Series: Bunches of Databases in Bunches of Weeks

Focusing on DataStax Enterprise setup and configuration over the next few weeks.

Series: Meetups

This is a little further out than the next few weeks, but in August we’re having another meetup!

Series: Building the Geo App Trux (w/ Vue.js, Go, and DataStax Enterprise (Apache Cassandra). Navigate through to the DataStax Developers Channel to check out the event times.

Series: DataStax’s Apache Cassandra & C# Hour (C# Schema Migration builder, C# driver, and additional content). Navigate through to the DataStax Developers Channel to check out the event times.