Effortless Data Generation for Developers

DataDiluvium is a web-based tool available at datadiluvium.com that helps developers, database administrators, and data engineers generate realistic test data from SQL schema definitions. Whether you’re setting up a development environment, creating test scenarios, or preparing data for demonstrations, DataDiluvium streamlines the process of data generation.

What is DataDiluvium?

Purpose

DataDiluvium serves several key purposes:

  1. Development Environment Setup: Quickly populate development databases with meaningful test data
  2. Testing: Generate consistent test data for automated testing scenarios
  3. Demonstrations: Create realistic data sets for product demonstrations
  4. Data Migration Testing: Validate data migration scripts with generated test data
  5. Schema Validation: Test database schema designs with realistic data

Key Features

  • SQL schema parsing and validation
  • Customizable data generation rules
  • Support for foreign key relationships
  • Multiple export formats (JSON, CSV, XML, Plain Text, SQL Inserts)
  • Real-time preview of generated data
  • Dark mode support
  • Responsive design

How to Use DataDiluvium

1. Accessing the Application

  1. Visit datadiluvium.com
  2. No account required – start using immediately
  3. Your data is processed locally in your browser

2. Defining Your Schema

  1. Navigate to the Schema page

  2. Enter your SQL schema definition in the text area Example:

    CREATE TABLE users (
        id INT PRIMARY KEY,
        username VARCHAR(50) NOT NULL,
        email VARCHAR(100) NOT NULL,
        created_at DATETIME DEFAULT CURRENT_TIMESTAMP
    );
    
    CREATE TABLE orders (
        id INT PRIMARY KEY,
        user_id INT,
        total_amount DECIMAL(10,2),
        created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
        FOREIGN KEY (user_id) REFERENCES users(id)
    );
    
  3. The application will automatically:

    • Parse your schema
    • Validate the structure
    • Suggest appropriate data generators
    • Show a preview of the parsed schema

3. Configuring Data Generation

  1. For each column, you can:

    • Select a data generator
    • Set custom parameters
    • Define relationships
  2. Available generators include:

    • Sequential Numbers
    • Usernames
    • Email addresses
    • Dates
    • Foreign Keys
    • Custom text
    • And more…
  3. Set the number of rows to generate:

    • Global row count for all tables
    • Table-specific row counts
    • Preview sample data before generation

4. Generating Data

  1. Click the “Generate” button
  2. Review the generation summary
  3. Confirm the generation
  4. Wait for the process to complete

5. Exporting Data

  1. Choose your preferred export format:

    • JSON: Standard JSON format with columns and rows
    • JSON (rich): Array of objects with column names as keys
    • CSV: Comma-separated values with headers
    • XML: Structured XML format
    • Plain Text: Human-readable format with numbered rows
    • SQL Inserts: Ready-to-use SQL INSERT statements
  2. Click the “Export” button

  3. Files will be downloaded automatically:

    • One file per table
    • Named according to the table name
    • Appropriate file extension based on format

Best Practices

1. Schema Design

  • Use clear, descriptive table and column names
  • Include appropriate constraints
  • Define foreign key relationships
  • Use appropriate data types

2. Data Generation

  • Start with a small number of rows for testing
  • Use appropriate generators for each column type
  • Consider data relationships when setting up foreign keys
  • Preview data before generating large sets

3. Export Selection

  • Choose JSON for application development
  • Use CSV for spreadsheet applications
  • Select SQL Inserts for direct database population
  • Consider Plain Text for human review

Example Workflow

Scenario: Setting up a Development Environment

  1. Define Schema

    CREATE TABLE products (
        id INT PRIMARY KEY,
        name VARCHAR(100) NOT NULL,
        price DECIMAL(10,2),
        category_id INT,
        created_at DATETIME DEFAULT CURRENT_TIMESTAMP
    );
    
    CREATE TABLE categories (
        id INT PRIMARY KEY,
        name VARCHAR(50) NOT NULL
    );
    
  2. Configure Generators

    • id: Sequential Number
    • name: Product Name
    • price: Random Decimal (10-1000)
    • category_id: Foreign Key to categories
    • created_at: Current Date
  3. Generate Data

    • Set 100 rows for products
    • Set 10 rows for categories
    • Generate and review
  4. Export

    • Choose SQL Inserts format
    • Download and execute in your development database

Tips and Tricks

1. Performance

  • Generate data in smaller batches for large schemas
  • Use appropriate generators for better performance
  • Preview data before large generations

2. Data Quality

  • Use meaningful generators for each column type
  • Consider data relationships
  • Validate generated data before use

3. Export Formats

  • JSON (rich) for application development
  • CSV for data analysis
  • SQL Inserts for database population
  • Plain Text for quick review

Support and Resources

  • Visit datadiluvium.com for the latest version
  • Check the documentation for detailed guides
  • Review sample schemas in the SQL samples section
  • Contact support for questions or feedback

Conclusion

DataDiluvium provides a user-friendly and powerful solution for generating test data from SQL schemas. Whether you’re a developer setting up a new project or a database administrator preparing test environments, DataDiluvium streamlines the process of data generation and helps ensure data quality and consistency.

Dynamic Data Generation with JavaScript

This video shows the process detailed below in this blog entry, to provide the choice of video or a quick read! 👍🏻😁

I coded up some JavaScript to generate some data for a table recently and it seemed relatively useful, so here it is ready to use as you may. (The complete js file is below the description of the individual code segments below). This file simple data generation is something I put together to create a csv for some quick data imports into a database (Postgres, SQL Server, or anything you may want). With that in mind, I added the libraries and initialized the repo with the libraries I would need.

npm install faker
npm install fs
faker = require('faker');
fs = require('fs');

Next up I included the column row of data for the csv. I decided to go ahead and setup the variable at this point, as it would be needed as I would add the rest of the csv data to the variable itself. There is probably a faster way to do this, but this was the quickest path from the perspective of getting something working right now.

After the colum row, I also setup the base 8 UUIDs that would related to the project_id values to randomly use throughout data generation. The idea behind this is that the project_id values are the range of values that would be in the data that Subhendu would have, and all the ip and other recorded data would be recorded with and related to a specific project_id. I used a UUID generation site to generate these first 8 values, that site is available here.

After that I went ahead and added the for loop that would be used to step through and generate each record.

var data = "id,country,ip,created_at,updated_at,project_id\n";
let project_ids = [
    'c16f6dd8-facb-406f-90d9-45529f4c8eb7',
    'b6dcbc07-e237-402a-bf11-12bf2226c243',
    '33f45cab-0e14-4830-a51c-fd44a62d1adc',
    '5d390c9e-2cfa-471d-953d-f6727972aeba',
    'd6ef3dfd-9596-4391-b0ef-3d7a8a1a6d10',
    'e72c0ed8-d649-4c53-97c5-da793d7a8228',
    'bf020fd2-2514-4709-8108-a2810e61c503',
    'ead66a4a-968a-448c-a796-51c6a1da0c20'];

for (var i = 0; i < 500000; i++) {
    // TODO: Generation will go here.
}

The next thing that I wanted to sort out are the two dates. One would be the created_at value and the other the updated_at value. The updated_at date needed to show as occurring after the created_at date, for obvious reasons. To make sure I could get this calculated I added a function to perform the randomization! First two functions to get additions for days and hours, then getting the random value to add for each, then getting the calculated dates.

function addDays(datetime, days) {
    let date = new Date(datetime.valueOf());
    date.setDate(date.getDate() + days);
    return date;
}

function addHours(datetime, hours) {
    let time = new Date(datetime.valueOf())
    time.setTime(time.getTime() + (hours*60*60*1000));
    return time;
}

var days = faker.datatype.number({min:0, max:7})
var hours = faker.datatype.number({min:0, max:24})

var updated_at = new Date(faker.date.past())
var created_at = addHours(addDays(updated_at, -days), -hours)

With the date time stamps setup for the row data generation I moved on to selecting the specific project_id for the row.

var proj_id = project_ids[faker.datatype.number({min:0, max: 7})]

One other thing that I knew I’d need to do is filter for the ' or , values located in the countries that would be selected. The way I clean that data to ensure it doesn’t break the SQL bulk import process is kind of cheap and in production data I wouldn’t do this, but it works great for generated data like this.

var cleanCountry = faker.address.country().replace(",", " ").replace("'", " ")

If you’re curious why I’m calculating these before I do the general data generation and set the row up, I like to keep the row of actual data calls to either a set variable assignment or at most one dot level deep in my calls. As you’ll see now in the row level data being generated below.

data2 += 
    faker.datatype.uuid() + "," +
    cleanCountry + "," +
    faker.internet.ip() + "," +
    created_at.toISOString() + "," +
    updated_at.toISOString() + "," +
    proj_id + "\n"

Now the last step is to create the file for all these csv rows.

fs.writeFile('kundu_table_data.csv', data, function (err) {
  if (err) return console.log(err);
  console.log('Data file written.');
});

The results.

New Live Coding Streams and Episodes!

I’ve been working away in Valhalla on the next episodes of Thrashing Code TV and subsequent content for upcoming Thrashing Code Sessions on Twitch (follow) and Youtube (subscribe). The following I’ve broken into the main streams and shows that I’ll be putting together over the next days, weeks, and months and links to sessions and shows already recorded. If you’ve got any ideas, questions, thoughts, just send them my way.

Colligere (Next Session)

Coding has been going a little slow, in light of other priorities and all, but it’ll still be one of the featured projects I’ll be working on. Past episodes are available here, however join in on Friday and I’ll catch everybody up, so you can skip past episodes if you aren’t after specific details and just want to join in on future work and sessions.

In this next session, this Friday the 9th at 3:33pm PST, I’m going to be working on reading in JSON, determining what type of structure the JSON should be unmarshalled into, and how best to make that determination through logic and flow.

Since Go needs something to unmarshall JSON into, a specific structure, I’ll be working on determining a good way to pre-read information in the schema configuration files (detailed in the issue listed below) so that a logic flow can be implemented that will then begin the standard Go JSON unmarshalling of the object. So this will likely end up including some hackery around reading in JSON without the assistance of the Go JSON library. Join in and check out what solution(s) I come up with.

The specific issue I’ll be working on is located on Github here. These sessions I’m going to continue working on, but will be a little vague and will start working on the Colligere CLI primarily on Saturday’s at 10am. So you can put that on your schedule and join me then for hacks. If you’d like to contribute, as always, reach out via here, @Adron, or via the Github Colligere Repository and let’s discuss what you’d like to add.

Getting Started with Go

This set of sessions, which I’ve detailed in “Getting Started with Go“, I’ll be starting on January 12th at 4pm PST. You can get the full outline and further details of what I’ll be covering on my “Getting Started with Gopage and of course the first of these sessions I’ve posted details on the Twitch event page here.

  • Packages & the Go Tool – import paths, package declarations, blank imports, naming, and more.
  • Structure – names, declarations, variables, assignments, scope, etc., etc.
  • Basic Types – integers, floats, complex numbers, booleans, strings, and constants.

Infrastructure as Code with Terraform and Apache Cassandra

I’ll be continuing the Terraform, bash, and related configuration and coding of using infrastructure as code practices to build out, maintain, and operate Apache Cassandra distributed database clusters. At some point I’ll likely add Kubernetes, some additional on the metal cluster systems and start looking at Kubernetes Operators and how one can manage distributed systems on Kubernetes using this on the metal environment. But for now, these sessions will continue real soon as we’ve got some systems to build!

Existing episodes of this series you can check out here.

Getting Started with Multi-model Databases

This set of sessions I’ve detailed in “Getting Started with a Multi-model Database“, and this one I’ll be starting in the new years also. Here’s the short run down of the next several streams. So stay tuned, subscribe or follow my Twitch and Youtube and of course subscribe to the Composite Code blog (should be to the left, or if on mobile click the little vertical ellipses button)

  1. An introduction to a range of databases: Apache Cassandra, Postgresql and SQL Server, Neo4j, and … in memory database. Kind of like 7 Databases in 7 Weeks but a bunch of databases in just a short session!
  2. An Introduction – Apache Cassandra and what it is, how to get a minimal cluster started, options for deploying something quickly to try it out.
  3. Adding to Apache Cassandra with DataStax Enterprise, gaining analytics, graph, and search. In this session I’ll dive into what each of these capabilities within DataStax Enterprise give us and how the architecture all fits together.
  4. Deployment of Apache Cassandra and getting a cluster built. Options around ways to effectively deploy and maintain Apache Cassandra in production.
  5. Moving to DataStax Enterprise (DSE) from Apache Cassandra. Getting a DSE Cluster up and running with OpsCenter, Lifecycle Manager (LCM), and getting some queries tried out with Studio.

Go Library Data Generation Timings

Recently I put together some quick code to give some timings on the various data generation libraries available for Go. For each library there were a few key pieces of data generation I wanted to time:

  • First Name – basically a first name of some sort, like Adam, Nancy, or Frank.
  • Full Name – something like Jason McCormick or Sally Smith.
  • Address – A basic street address, or whatever the generator might provide.
  • User Agent – Such as that which is sent along with the browser response.
  • Color – Something like red, blue, green, or other color beyond the basics.
  • Email – A fully formed, albeit faked email address.
  • Phone – A phone number, ideally with area code and prefix too.
  • Credit Card Number – Ideally a properly formed one, which many of the generators seem to provide based on VISA, Mastercard, or related company specifications.
  • Sentence – A stand multi-word lorem ipsum based sentence would be perfect.

I went through and searched for libraries that I wanted to try out. Of all the libraries I found I narrowed it down to three specific libraries. When I add the imports for these libraries, by way of how Go works, it gives you the repo locations:

  • “github.com/bxcodec/faker” – faker – Faker generates data based on a Struct, which is a pretty cool way to determine what type of data you want and to get it returned in a particularly useful format.
  • “github.com/icrowley/fake” – fake – Fake is a library inspired by the ffaker and forgery Ruby gems. Not that you’d be familiar with those, but if you are you have instant insight into how this library works.
  • “github.com/malisit/kolpa” – kolpa – This is another data generator that creates fake data for various types of data, structures, strings, sentences, and more.

Continue reading “Go Library Data Generation Timings”