Dynamic Data Generation with JavaScript

This video shows the process detailed below in this blog entry, to provide the choice of video or a quick read! 👍🏻😁

I coded up some JavaScript to generate some data for a table recently and it seemed relatively useful, so here it is, ready to use as you like. (The complete js file is below the description of the individual code segments.) This simple data generation script is something I put together to create a csv for some quick data imports into a database (Postgres, SQL Server, or anything else you may want). With that in mind, I initialized the repo and installed the libraries I would need.

npm install faker

const faker = require('faker');
const fs = require('fs'); // fs is built into Node.js, no npm install needed

Next up I included the column header row for the csv. I decided to go ahead and set up the variable at this point, as it would be needed when I added the rest of the csv data to the variable itself. There is probably a faster way to do this, but this was the quickest path from the perspective of getting something working right now.

After the column row, I also set up the base 8 UUIDs that would relate to the project_id values to use randomly throughout data generation. The idea behind this is that the project_id values are the range of values that would be in the data that Subhendu would have, and all the ip and other recorded data would be recorded with and related to a specific project_id. I used a UUID generation site to generate these first 8 values; that site is available here.

After that I went ahead and added the for loop that would be used to step through and generate each record.

var data = "id,country,ip,created_at,updated_at,project_id\n";
let project_ids = [
    'c16f6dd8-facb-406f-90d9-45529f4c8eb7',
    'b6dcbc07-e237-402a-bf11-12bf2226c243',
    '33f45cab-0e14-4830-a51c-fd44a62d1adc',
    '5d390c9e-2cfa-471d-953d-f6727972aeba',
    'd6ef3dfd-9596-4391-b0ef-3d7a8a1a6d10',
    'e72c0ed8-d649-4c53-97c5-da793d7a8228',
    'bf020fd2-2514-4709-8108-a2810e61c503',
    'ead66a4a-968a-448c-a796-51c6a1da0c20'];

for (var i = 0; i < 500000; i++) {
    // TODO: Generation will go here.
}

The next thing that I wanted to sort out was the two dates. One would be the created_at value and the other the updated_at value. The updated_at date needed to show as occurring after the created_at date, for obvious reasons. To make sure I could get this calculated I added functions to perform the randomization: first two functions to get additions for days and hours, then getting the random value to add for each, then getting the calculated dates.

function addDays(datetime, days) {
    let date = new Date(datetime.valueOf());
    date.setDate(date.getDate() + days);
    return date;
}

function addHours(datetime, hours) {
    let time = new Date(datetime.valueOf())
    time.setTime(time.getTime() + (hours*60*60*1000));
    return time;
}

var days = faker.datatype.number({min:0, max:7})
var hours = faker.datatype.number({min:0, max:24})

var updated_at = new Date(faker.date.past())
var created_at = addHours(addDays(updated_at, -days), -hours)

With the date time stamps setup for the row data generation I moved on to selecting the specific project_id for the row.

var proj_id = project_ids[faker.datatype.number({min:0, max: 7})]

One other thing that I knew I’d need to do is filter out the ' or , characters located in the country names that would be selected. The way I clean that data to ensure it doesn’t break the SQL bulk import process is kind of cheap, and with production data I wouldn’t do this, but it works great for generated data like this.

var cleanCountry = faker.address.country().replace(",", " ").replace("'", " ")

If you’re curious why I’m calculating these before I do the general data generation and set the row up, it’s because I like to keep the row of actual data calls to either a set variable assignment or at most one dot level deep. You’ll see this in the row-level data being generated below.

data += 
    faker.datatype.uuid() + "," +
    cleanCountry + "," +
    faker.internet.ip() + "," +
    created_at.toISOString() + "," +
    updated_at.toISOString() + "," +
    proj_id + "\n"

Now the last step is to create the file for all these csv rows.

fs.writeFile('kundu_table_data.csv', data, function (err) {
  if (err) return console.log(err);
  console.log('Data file written.');
});
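Putting those segments together, the complete file looks like the following (the 500,000 row count and the kundu_table_data.csv file name are just what I used above; adjust as needed).

const faker = require('faker');
const fs = require('fs');

// Column header row for the csv.
var data = "id,country,ip,created_at,updated_at,project_id\n";

// The base 8 project_id UUIDs to randomly assign to rows.
let project_ids = [
    'c16f6dd8-facb-406f-90d9-45529f4c8eb7',
    'b6dcbc07-e237-402a-bf11-12bf2226c243',
    '33f45cab-0e14-4830-a51c-fd44a62d1adc',
    '5d390c9e-2cfa-471d-953d-f6727972aeba',
    'd6ef3dfd-9596-4391-b0ef-3d7a8a1a6d10',
    'e72c0ed8-d649-4c53-97c5-da793d7a8228',
    'bf020fd2-2514-4709-8108-a2810e61c503',
    'ead66a4a-968a-448c-a796-51c6a1da0c20'];

function addDays(datetime, days) {
    let date = new Date(datetime.valueOf());
    date.setDate(date.getDate() + days);
    return date;
}

function addHours(datetime, hours) {
    let time = new Date(datetime.valueOf());
    time.setTime(time.getTime() + (hours * 60 * 60 * 1000));
    return time;
}

for (var i = 0; i < 500000; i++) {
    // Random day/hour offsets so created_at lands before updated_at.
    var days = faker.datatype.number({min: 0, max: 7});
    var hours = faker.datatype.number({min: 0, max: 24});

    var updated_at = new Date(faker.date.past());
    var created_at = addHours(addDays(updated_at, -days), -hours);

    // Pick one of the 8 project ids for this row.
    var proj_id = project_ids[faker.datatype.number({min: 0, max: 7})];

    // Strip commas and apostrophes so the csv import doesn't break.
    var cleanCountry = faker.address.country().replace(",", " ").replace("'", " ");

    data +=
        faker.datatype.uuid() + "," +
        cleanCountry + "," +
        faker.internet.ip() + "," +
        created_at.toISOString() + "," +
        updated_at.toISOString() + "," +
        proj_id + "\n";
}

fs.writeFile('kundu_table_data.csv', data, function (err) {
  if (err) return console.log(err);
  console.log('Data file written.');
});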

The results.

Scaling PostgreSQL + Top 12 Curated Posts

A video introduction into the basics of scaling a relational database like PostgreSQL.

There are two main ways to scale data storage, especially databases, and the resources available to store and process that data.

Horizontal Scaling (scale-out): This is done by adding more individual machines in some way, creating a cluster for the database to operate across. Various databases handle this in different ways, ranging from distributed databases, which are specifically built to scale horizontally, to relational databases, which require different strategies like sharding the data itself.

Vertical Scaling (scale-up): This includes increasing the individual system resources allocated to a database on a single vertically integrated machine, such as the CPU or number of CPUs, memory, and disk or disks.

For horizontal scaling, the approach often requires sharding the data across multiple databases and then pooling those resources together via connection pooling, load balancers, and other infrastructure to direct the correct requests and traffic at the right sharded database resource.

Sometimes, from a purely process-centric horizontal scaling perspective, we just need more query and processing power while the data can still be stored on a single machine. In this case, we might implement a proxy to direct traffic across some bouncers (connection poolers such as PgBouncer) that would then get requests and responses to and from the individual machines.
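As a purely illustrative sketch of that traffic-directing idea (not from the video), assuming the node-postgres (pg) client and hypothetical primary/replica hosts and a hypothetical visits table, a simple read/write split in application code might look like this:

const { Pool } = require('pg');

// Hypothetical hosts; in practice these often sit behind a proxy or load balancer.
const primary = new Pool({ host: 'pg-primary.internal', database: 'appdb' });
const replica = new Pool({ host: 'pg-replica-1.internal', database: 'appdb' });

// Writes go to the primary.
async function recordVisit(ip, projectId) {
    await primary.query(
        'INSERT INTO visits (ip, project_id) VALUES ($1, $2)',
        [ip, projectId]);
}

// Read-only queries can be directed at a replica.
async function countVisits(projectId) {
    const result = await replica.query(
        'SELECT count(*) FROM visits WHERE project_id = $1',
        [projectId]);
    return result.rows[0].count;
}

In real deployments a pooler or load balancer usually handles this routing rather than application code, but the sketch shows the basic idea of sending each kind of request to the right resource.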

Vertical scaling often requires new hardware and a migration to the larger machine. Sometimes, however, it may just require more CPU, RAM, or more drive storage. If it is a singular need, it often can be met by adding or swapping out one of those elements. For example, if a database is continually hitting the peak processing capacity of the CPUs to handle queries, one might be able to just add another processor, or change to a faster processor or one with more cores. In the case of memory being overloaded, the same applies: simply increase the maximum memory in the system. Of course, the underlying limitations of vertical scaling always come up when you already have the fastest CPU or the memory is already maxed out. In that situation one might need to think about horizontal scaling as described in the previous example.

The following is a top 12 list of blog entries that expand on the topic in many ways. I’ve found these (along with more than a few others) very useful in my own efforts around scaling Postgres (and other databases).

  1. Vertically Scaling PostgreSQL
  2. Vertically Scale Your PostgreSQL Infrastructure w/ pgpool -3- Adding Another Standby
  3. Vertically scale your PostgreSQL infrastructure with pgpool — 2 — Automatic failover and reconfiguration
  4. How well can PostgreSQL scale horizontally?
  5. How to Horizontally Scale Your Postgres Database Using Citus
  6. Horizontal scalability with Sharding in PostgreSQL — Where it is going Part 1 of 3
  7. Lessons learned scaling PostgreSQL database to 1.2bn records/month
  8. IDEAS FOR SCALING POSTGRESQL TO MULTI-TERABYTE AND BEYOND
  9. Scaling PostgreSQL Using Connection Poolers & Load Balancers
  10. Vertically Scaling PostgreSQL
  11. Scaling PostgreSQL for Large Amounts of Data
  12. Scaling Postgres

Quick Answers: What is the UUID column type good for and what exactly is a UUID?

The question has come up a few times recently about what UUIDs are and what they’re good for. In this quick answers post I explain what they are and provide reference links to further material, as well as a Hasura Bit video on implementation and use in Postgres & Hasura.

A Hasura Bit

“A Hasura Bit – What is the UUID column type good for and what exactly is a UUID?” video.

UUID

UUID stands for universally unique identifier. Another common term for a UUID is the GUID, or globally unique identifier, which is the term Microsoft created for their UUIDs. Just know that a GUID is a UUID; it’s just a company naming convention vs the standard industry naming convention.

UUIDs are standardized by the Open Software Foundation (OSF) as part of the Distributed Computing Environment (DCE). Specifically, UUIDs are designed and used per an ISO/IEC spec; if you’d like to know more about the standards they’re based on, check out the Wikipedia page @ https://en.wikipedia.org/wiki/Universally_unique_identifier

UUID Format

In the canonical textual representation, the 16 octets of a UUID are represented as 32 hexadecimal (base-16) digits. They’re displayed across five groups separated by hyphens in 8-4-4-4-12 format. Even though there are only 32 hexadecimal digits, the four hyphens bring the total to 36 characters in the display format. If storing a UUID as a string, for example, the field needs to be able to hold 36 characters.
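As a quick illustration of that format (assuming the uuid npm package, which isn’t part of the original post):

const { v4: uuidv4 } = require('uuid');

const id = uuidv4();     // e.g. 'c16f6dd8-facb-406f-90d9-45529f4c8eb7'
console.log(id.length);  // 36 characters: 32 hex digits plus 4 hyphens (8-4-4-4-12)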

Uses for a UUID

The first key use case for a UUID is to have something generate the UUID and use it as a completely unique value for a subset of related data. UUIDs are perfect for primary keys in a database, or really any type of key where uniqueness across a system needs to be ensured.

UUIDs can be generated from many different origin points without any significant concern for collision (i.e. duplicate UUIDs). For example, the database itself has functions that enable the generation of a UUID at the time of a data row’s insertion, as a default value, which means a client inserting data wouldn’t need to generate that UUID. However, this responsibility can be flipped over to the client side, and the client-side development stack (i.e. Go UUID generation) can generate the UUID. That enables the creation of a primary key entity with a UUID as the primary key, which can then be used to create the related foreign key items and so on down the chain of a relationship. Once all of these are created on the client side they can be inserted in a batch, and if ordered appropriately can even be made transactional to ensure the integrity of the data.
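Here’s a minimal sketch of that client-side flow in JavaScript, assuming the uuid and pg npm packages; the projects and entries table and column names are made up purely for illustration:

const { v4: uuidv4 } = require('uuid');
const { Client } = require('pg');

async function insertProjectWithEntry() {
    const client = new Client();  // connection settings come from the PG* environment variables
    await client.connect();

    // Generate the primary key client side, then reuse it as the foreign key value.
    const projectId = uuidv4();
    const entryId = uuidv4();

    try {
        await client.query('BEGIN');
        await client.query(
            'INSERT INTO projects (id, name) VALUES ($1, $2)',
            [projectId, 'Example Project']);
        await client.query(
            'INSERT INTO entries (id, project_id, details) VALUES ($1, $2, $3)',
            [entryId, projectId, 'First entry']);
        await client.query('COMMIT');  // the whole chain lands, or none of it does
    } catch (err) {
        await client.query('ROLLBACK');
        throw err;
    } finally {
        await client.end();
    }
}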

Terrazura – A Build Out of an Azure based, Hasura GraphQL API on Postgres

I created this repo https://github.com/Adron/terrazura​ during a live stream on my Twitch Thrashing Code Channel 🤘 at 10am on the 30th of December, 2020. The VOD is now available on my YouTube Thrashing Code Channel https://youtube.com/thrashingcode​. A rough as hell year, but wanted to wrap it up with some solid content. In this stream I tackled a ton of specifics, in detail about getting Hasura deployed in Azure, Postgres backed, a database schema designed and created, using database schema migrations, and all sorts of tips n’ tricks along the way. 3 hours of solid how to get shit done material!

For live streams, check out and follow at https://www.twitch.tv/thrashingcode​ 👊🏻 or for VOD viewing check out https://youtube.com/thrashingcode

A point in coding during the video!

02:49​ – Shout out to the stream sponsor, Azure, and links to some collateral material.
14:50​ – In this first segment, I start but run into some troubleshooting needs around the provider versions for Terraform in regards to Azure. You can skip this part unless you want to see what issue I ran into.
18:24​ – Since I ran into issues with the current version of Terraform I had installed, at this time I show a quick upgrade to the latest version.
27:22 – After upgrading, I fight through trial-and-error execution of Terraform until I finally get the right combination of provider and Terraform versions.
27:53​ – Adding the first Terraform resource, the Azure resource group.
29:47 – Azure Portal oddness, just to take note of if/when you’re working through this. Workaround later in the stream.
32:00​ – Adding the Postgres server resource.
44:43​ – In this segment I switched over to Jetbrain’s Intellij to do the rest of the work. I also tweak the IDE to re-add the plugin for the material design themes and icons. If you use this IDE, it’s very much IMHO worth getting this to switch between themes.
59:32 – After getting leveled-up with the IDE, I wrap up the #Postgres server resource and terraform apply the overall set of resources. At this point I also move forward with the infrastructure as code, emphasizing additive changes to the immutable infrastructure through terraform apply and minimizing any terraform destroy use.
1:02:07 – At this time, I try figuring out the portal issue by running az logout and logging back into Azure with az login. Still no resources shown, but…
1:08:47 – eventually I realize I have to use the hack solution of pasting the subscription ID into the Azure portal to get resources for the particular subscription account, which seems highly counterintuitive since it’s the ONLY account. 🧐
1:22:54 – Next, since I now have variables that need to be passed in on every terraform execution, I add a script to do this for me.
1:29:35​ – Next up is adding the database to the database server and firewall rule. Also we get to see Jetbrains #Intellij​ HCL plugin introspection at work adding required properties to the firewall resource! A really useful feature.
1:38:24​ – Next up, creating the Azure container to deploy our Hasura GraphQL API for #Postgres​ to!
1:51:42​ – BAM! API Server is done and launched! I’ve got a live #GraphQL​ API up and running in Azure and we’re ready to start building a data model!
1:56:22 – In this segment I show how to turn off the public facing console and shift one’s development workflow to the local Hasura console, working against either a local or your live dev environment.
1:58:29 – Next segment I get into schema migrations, initializing a directory structure for Hasura CLI use, and metadata, migrations, and related data. Including an update to the latest CLI so you can see how to do that, after I run into a slight glitch. 😬
2:23:02​ – I also shift over to dbdiagram to graphically build out some of the schema via their markdown, then use the SQL export option for #postgres​ combined with Hasura’s option to execute plain ole SQL via migrations…
2:31:48​ – Getting a bit more in depth in this segment, I delve through – via the Hasura console – to build out relationships between the tables and data so the graphql queries can introspect accordingly.
2:40:30​ – Next segment, graphql time! I show some of the options of what is available immediately for queries and mutations via the console.
2:50:36 – Then some more details about metadata. I’m going to do a stream with further details in the very near future, since I was a little fuzzy on some of those details myself. However, this is a good introduction to what the metadata does for the #graphql API.
2:59:07​ – Then as a wrap up to all of this… I nuke EVERYTHING and deploy it all out to Azure again inclusive of schema migrations, metadata, etc. 🤘🏻
3:16:30​ – Final segment, I add some data to the database and get into a few basic queries and mutations in #graphql​ via the #graphiql​ console interface in #Hasura​.

Pragmatic Database Schema Naming Conventions, Practices, and Patterns

In this post I’ve put together some of the naming conventions, rules, and ideas that I tend to follow when creating database schemas. This also applies to schema-less databases, distributed systems databases, graph, time series, or whatever else I am working with. It’s always good to have solid conventions to work with, and the descriptions and ideas in this post are a good starting point.

What We’re Working With, Some of The Database Rules

Let’s start with a SQL Server rule: table names can be no more than 128 characters. To force SQL Server to accept non-standard table names one can use brackets, and then in scripts those names have to be delimited as well.

There are many other rules about naming things in SQL Server. But let’s talk about some other database specific rules for other databases.

Postgres folds names to lowercase (versus uppercase), which is different from many other databases. Throw in some double quotes, however, and you can use names like MyTable, MYTABLE, and mytable. Without double quotes these would all be the same name; add the double quotes around them and “MYTABLE” becomes different from “MyTable” and different from “mytable”.
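A quick illustrative snippet of that folding behavior (the table name here is just an example):

-- Unquoted identifiers fold to lowercase, so these all refer to the same table: mytable
CREATE TABLE MyTable (id int);
SELECT * FROM MYTABLE;
SELECT * FROM mytable;

-- Quoted identifiers keep their case, so this is a different table than the one above
CREATE TABLE "MyTable" (id int);
SELECT * FROM "MyTable";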

SQL identifiers and key words in Postgres must begin with a letter (a-z, including letters with diacritical marks and non-Latin letters) or an underscore. After the first character, an identifier can contain letters, underscores, digits, or dollar signs. If an identifier is double quoted, you can also use key words as identifiers, albeit I would very strongly recommend against that practice.

As these examples show, the rules are just different enough from one database to another that it is often very helpful to use a naming convention that works across databases. I’m often working with a variety of databases, so this post covers naming convention ideas and the respective patterns and practices around them that would work with every conceivable database I know to exist!

Table, Column, Tuple, or Related Naming Conventions

  • Table, column, and related object names should contain only letters and numbers, with numbers used only in the body of the name and never as the leading character, no underscores, and absolutely no spaces or special characters. In summary, use a casing scheme like Camel or Pascal Case, but do not use Snake or Kebab Case.
  • Use meaningful names from the business or organizational domain being modeled. Such as “BankingUsers”, “Transactions”, “railroads”, or “railroad_Systems”.
  • Use singular word names if at all possible, only moving to compound word naming if absolutely necessary. Ideally names would be single words like “User”, “Transactions”, “railroad”, or “system” and exclude compound names like “railroadSystem” until it is needed to prevent confusion or naming collisions.
  • Columns that are primary or foreign keys are often prefaced with PK_ and FK_ respectively; in my moderately humble opinion, stick to just PK or FK using Camel or Pascal Case. For other metadata, indexes, and related names use a respective preface or postfix convention.
  • It’s also important to choose either plural or singular for table names and stick with that choice, so that frameworks, Object Relationship/Relational Mappers/Mapping (ORM), and other tools can effectively name things when used. For example, when table names are singular, many ORM frameworks generating code will take a singular table name like Customer, name the collection Customers, and have objects of Customer.

Schema/Domain Naming Conventions

In many databases there are additional organizational and related structures that help us to set up tables, functions, stored procedures, compiled SQL/queries, and other objects in groupings. Naming these objects is easiest when it follows the same convention as the table naming convention.

For example, if the table naming convention is following Camel Case then continue that;

Table Names for Returns in an E-commerce Domain:

  • Purchase
  • Return
  • Shipped

Table Names for Pricing in an E-commerce Domain:

  • Price
  • Cost

Table Names for Core Tables, for multiple schemas, within the E-commerce Domain:

  • Item

This could be split out to three schemas;

  • Core
  • Prices
  • Returns

The names would then look like this:

Core.Item
Prices.Price
Prices.Cost
Returns.Purchase
Returns.Return
Returns.Shipped
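To ground that, here’s a quick sketch of what creating one of these schema-qualified tables might look like in Postgres (names quoted to preserve the casing, as discussed earlier; the columns are just placeholders):

CREATE SCHEMA "Prices";

CREATE TABLE "Prices"."Price" (
  "Id" uuid PRIMARY KEY,
  "Stamp" timestamp,
  "Amount" numeric
);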

An Example

Here’s an example I put together with some naming conventions I’ve found useful, that reduce confusion, and that manage to convey a reasonable amount of information about the domain space and schema of the database without conflicts.

The Database Schema

Throughout this schema I’ve used Camel Casing, with mostly single-word column and single-word table names. Keeping it simple, Id and Stamp are some of the recurring columns that are useful for retrieval, relationships, and determining the origins of data over time. In production settings there are default columns that are often needed, and one can rest assured a time stamp is most likely one of those needed everywhere for audits!

For my foreign key columns, one can determine the relationship by the name of the column itself. For example, in the Connection table there are two foreign keys, one to the Action table and one to the Source table, referencing the respective Id columns in each of those tables. The one exception is NoteJot, which I named that way because Note tends to conflict in certain systems. In that table I’ve added a relationship, for recursive data, back to itself with the use of the NoteId foreign key referencing the table’s primary key Id.

The diagram above, without any further details, can be used to create the schema with SQL code that would look like the following.

CREATE TABLE "Source" (
  "Id" uuid PRIMARY KEY,
  "Stamp" timestamp,
  "Name" text,
  "Uri" text,
  "Details" text
);

CREATE TABLE "SourceNotes" (
  "SourceId" uuid,
  "NotesId" uuid,
  "Details" text,
  "Stamp" timestamp
);

CREATE TABLE "NoteJot" (
  "Id" uuid PRIMARY KEY,
  "Stamp" timestamp,
  "NoteId" uuid,
  "Details" text
);

CREATE TABLE "Action" (
  "Id" uuid PRIMARY KEY,
  "Stamp" timestamp,
  "Action" json
);

CREATE TABLE "Connection" (
  "Id" uuid PRIMARY KEY,
  "Stamp" timestamp,
  "ActionId" uuid,
  "SourceId" uuid
);

CREATE TABLE "Formatter" (
  "Id" uuid PRIMARY KEY,
  "Stamp" timestamp,
  "ConnectionId" uuid,
  "FormatterMap" json
);

CREATE TABLE "Schema" (
  "Id" uuid PRIMARY KEY,
  "Stamp" timestamp,
  "ConnectionId" uuid,
  "SchemaMap" json
);

ALTER TABLE "SourceNotes" ADD FOREIGN KEY ("SourceId") REFERENCES "Source" ("Id");

ALTER TABLE "SourceNotes" ADD FOREIGN KEY ("NotesId") REFERENCES "NoteJot" ("Id");

ALTER TABLE "NoteJot" ADD FOREIGN KEY ("NoteId") REFERENCES "NoteJot" ("Id");

ALTER TABLE "Connection" ADD FOREIGN KEY ("ActionId") REFERENCES "Action" ("Id");

ALTER TABLE "Connection" ADD FOREIGN KEY ("SourceId") REFERENCES "Source" ("Id");

ALTER TABLE "Formatter" ADD FOREIGN KEY ("ConnectionId") REFERENCES "Connection" ("Id");

ALTER TABLE "Schema" ADD FOREIGN KEY ("ConnectionId") REFERENCES "Connection" ("Id");

We have the tables and keys, all following the Camel Case standard. Much of this could be switched to Pascal Case if you preferred; however, switching them to Snake or Kebab Case would cause a number of issues depending on the database.

Do NOT Use These

Unless you want to spend tons of time with errors, debugging, and related issues, skip these practices.

  • Generally, throughout databases it is best to skip Snake or Kebab Case. In various situations they’re fine, but overall they’re likely to run into conflicts, naming limitations, or other concerns. It’s best to just skip them and remove the concern.
  • When you’re using data that has variance in how it is represented, do not use a multitude of formats. For dates, location, geographic, or related data it is best to stick to a particular format that is repeated throughout the database. If you use mixed representations, for example dates in an MMDDYYYY format in one place and a DD-MM-YYYY format in another, the methods, functions, or other elements that consume or process this data will need to account for the variance, creating more time-consuming and error-prone code.
  • Once you pick a naming scheme for any particular database object type, stick to it. For example, if you go with Camel Casing for your tables, use Camel Casing for all of the tables and don’t switch to Pascal for some of them. Switching conventions between object types, however, is fine: for example, naming the tables Pascal Cased but using Camel Case for indexes works perfectly. Then, when reviewing a list of objects irrespective of kind, one can differentiate merely by the convention used.

Summary & Caveat

Working up a set of patterns, practices, rules, and general conventions for the database side of things is immensely useful. It helps Database Administrators, Data Scientists, Software Developers, and others that need to utilize the database, and it helps them communicate with each other in reference to the database and the data.


That’s it for the database schema topic for now. However, if you’re interested in joining me for more database and data oriented things, language stack setup, software development, patterns, practices, and more, in addition to writing some JavaScript, Go, Python, Terraform, and infrastructure, web dev, and all sorts of coding, I stream regularly on Twitch at https://twitch.tv/thrashingcode and post the VODs to YouTube along with entirely new tech and metal content at https://youtube.com/ThrashingCode. Feel free to check out a coding session, ask questions, interject, or just come and enjoy the tunes!