Adding Time Range Generation to Data Diluvium

Following up on my previous posts about adding humidity and temperature data generation to Data Diluvium, I’m now adding a Time Range generator. Evenly spaced timestamps will make any graphing of the data look right, and this completes the trio of generators I needed for my TimeScale DB setup. While humidity and temperature provide the environmental data, the Time Range generator ensures we have properly spaced time points for our time-series analysis.

Why Time Range Generation?

When working with TimeScale DB for time-series data, having evenly spaced time points is crucial for accurate analysis. I’ve found that many of my experiments require data points that are:

  • Evenly distributed across a time window
  • Properly spaced for consistent analysis
  • Flexible enough to handle different sampling rates
  • Random in their starting point to avoid bias

The Implementation

I created a Time Range generator that produces timestamps based on a 2-hour window. Here’s what I considered:

  • Default 2-hour time window
  • Even distribution of points across the window
  • Random starting point within a reasonable range
  • Support for various numbers of data points

Here’s how I implemented this in Data Diluvium:

Continue reading “Adding Time Range Generation to Data Diluvium”

Generating Temperature Data with Data Diluvium for Time Series Data with TimeScaleDB

Following up on my previous post about adding humidity data generation to Data Diluvium, I’m now adding temperature data generation. This completes the pair of environmental data generators I needed for my TimeScale DB setup. Temperature data is crucial for time-series analysis and works perfectly alongside the humidity data we just implemented.

Why Temperature Data?

Continue reading “Generating Temperature Data with Data Diluvium for Time Series Data with TimeScaleDB”

Generating Realistic Humidity Data for TimeScale DB with Data Diluvium

As the next step after setting up TimeScale DB for local dev, and being able to just do things, I need two new data generators over on Data Diluvium: one for humidity, which will be this post, and one for temperature, which will be next.

Why Humidity Data?

When working with TimeScale DB for time-series data, having realistic environmental data is crucial. I’ve found that humidity is a particularly important parameter that affects everything from agriculture to HVAC systems. Having realistic humidity data is essential for:

  • Testing environmental monitoring systems
  • Simulating weather conditions
  • Developing IoT applications
  • Training machine learning models for climate prediction

The Implementation

I created a humidity generator that produces realistic values based on typical Earth conditions. Here’s what I considered:

  • Average humidity ranges (typically 30-70% for most inhabited areas)
  • Daily variations (higher in the morning, lower in the afternoon)
  • Seasonal patterns
  • Geographic influences
Continue reading “Generating Realistic Humidity Data for TimeScale DB with Data Diluvium”

Restarting Data Diluvium – Four Steps

I’ve got four steps I’m going through to reboot the Data Diluvium Project & the respective CLI app I started about a year ago. I got a little ways into the project and then got a bit distracted, it happens. Here are the next steps I’m taking, and for those interested in helping out I’ll be blogging the work here, and also sending out updates via my Thrashing Code Newsletter. You can sign up and select all the news or just the open source project news if you just want to follow the projects.

Step 0: Write Up the Ideas Behind the Project

Ok, so this write-up will arrive subsequently. So far I just wanted to get these notes and intentions written down. Previously I’d written about the idea here, and here. After many discussions with a number of people, there will be some twists and turns to the project to make it more useful and streamlined in the CLI & services.

Step 1: Cleanup The Repository

Currently the repository is kind of a mess. I’m going to aim to do the following over the next few days.

  • Write up contributor issues/files for the repo.
  • Rewrite the documentation (initial docs that is) to detail the intent of the data generator ideas.
  • Incorporate the CLI into a parallel repo that is designed specifically to work against this repo’s project.
  • Write up a README.md that will detail what Data Diluvium is exactly as well as point to the project site and provide installation and setup instructions.
  • Set up the first databases to target as Postgresql, Cassandra, and *maybe* one other database, but I’m not sure which one. Feel free to file an issue with a suggestion.

Step 2: Cleanup & Publish a new Project Website

This is a simple one: I need to write up copy with the details, specifically feature descriptions and intended examples. This will provide the starting point to base the work for the project on. It will be similar to a living document in that the documentation will, can, and should change as the project is developed.

Step 3: Get More Cats Coding!

I’ve pinged a few people I know are interested in helping out, but we’re always looking for others to help with PRs and related efforts around the project(s). If you’re game, the easiest way to get started would be to ping me directly via DM on Twitter @adron and to sign up on my Thrashing Code Newsletter and select Open Source Projects Only (unless you want all the things).

…anyway, getting to work on these tasks. Happy coding!

Data Diluvium Design Ideas

This post includes a collection of my thoughts on design and architecture of a data generation service project I’ve started called Data Diluvium. I’m very open to changes, new ideas, or completely different paradigms around these plans altogether. You can jump into the conversation thread. What kind of data do you often need? What systems do you want it inserted into?

Breakdown of Article Ideas:

  • Collected Systems API – This API service idea revolves around a request that accepts a schema type for a particular database type, an export source for inserting the data into, and generating an amount of data per the requested amount. The response then initiates that data generation, while responding with a received message and confirmation based on what it has received.
  • Individual Request API – This API service idea (thanks to Dave Curylo for this one, posted in the thread) revolves around the generation of data, requested at end points for a particular type of random data generation.

Alright, time to dive deeper into each of these.

Collected Systems APIs

  • https://datadiluvium.com/schema/generate – This API end point would take a schema with the various properties needed. For any that aren’t set, a default would be set. The generation process would then randomize, generate, and insert this data into any destination source specified. Here are some prospective examples I’ve created:

    A very basic sample JSON schema

    [
      {
        "schema": "relational",
        "database": "text"
      }
    ]
    

    In this particular example, I’ve created the simplest schema that could be sent into the service. For this particular situation I’d have (currently not decided yet) defaults that would randomly create a table with a single column, and generate one element of data in that table. Other properties could be set, which would give control over the structure in which to insert the data. An example would be the following.

    [
      {
        "schema": "relational",
        "database": "postgresql",
        "structure": [
          {
            "table": "Users",
            "columns": [
              {"name": "id", "type": "uuid"},
              {"name": "firstname", "type": "firstname"},
              {"name": "lastname", "type": "lastname"},
              {"name": "email_address", "type": "email"}
            ]
          },
          {
            "table": "Addresses",
            "columns": [
              {"name": "id", "type": "uuid"},
              {"name": "street", "type": "address"},
              {"name": "city", "type": "city"},
              {"name": "state", "type": "state"},
              {"name": "postalcode", "type": "zip"}
            ]
          },
          {
            "table": "Transactions",
            "columns": [
              { "name": "id", "type": "uuid" },
              { "name": "transaction", "type": "money" },
              { "name": "stamp", "type": "date" }
            ]
          }
        ]
      }
    ]
    

    In this example, the properties included define three tables: Users, Addresses, and Transactions. In the first table, Users, the columns would be id, firstname, lastname, and email_address. Each of these then has a type property which sets the type of data to be generated for the column. The same set of properties is then included for the Addresses and Transactions tables and their respective columns.

    Some additional questions remain, such as if the tables exist in the database, would the insertion build SQL to create the tables? Should it be assumed that the tables exist already and have the appropriate settings set to insert the data into the tables? Again, a great thing to discuss on the thread here.

  • https://datadiluvium.com/schema/validate – This could be used to validate a schema request body. Simply submit a schema and a validation response would be returned with “Valid” or “Invalid”. In the case of an invalid response, a list of prospective and known errors would be returned.
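The validate endpoint’s core logic could be sketched roughly as follows. This is a hypothetical illustration in Go, not the project’s code; the struct field names simply mirror the example JSON bodies above, and the specific checks are my guesses at what “known errors” might cover.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// These types mirror the example schema bodies shown earlier; they are an
// assumed shape, not a published Data Diluvium contract.
type Column struct {
	Name string `json:"name"`
	Type string `json:"type"`
}

type Table struct {
	Table   string   `json:"table"`
	Columns []Column `json:"columns"`
}

type Schema struct {
	Schema    string  `json:"schema"`
	Database  string  `json:"database"`
	Structure []Table `json:"structure"`
}

// validateSchema returns "Valid" with no errors, or "Invalid" with a list
// of known problems, echoing the proposed response behavior.
func validateSchema(body []byte) (string, []string) {
	var schemas []Schema
	if err := json.Unmarshal(body, &schemas); err != nil {
		return "Invalid", []string{"request body is not a JSON schema array: " + err.Error()}
	}
	var errs []string
	for i, s := range schemas {
		if s.Schema == "" {
			errs = append(errs, fmt.Sprintf("schema[%d]: missing \"schema\" property", i))
		}
		if s.Database == "" {
			errs = append(errs, fmt.Sprintf("schema[%d]: missing \"database\" property", i))
		}
		for _, t := range s.Structure {
			if len(t.Columns) == 0 {
				errs = append(errs, fmt.Sprintf("table %q has no columns", t.Table))
			}
		}
	}
	if len(errs) > 0 {
		return "Invalid", errs
	}
	return "Valid", nil
}

func main() {
	status, errs := validateSchema([]byte(`[{"schema":"relational","database":"postgresql"}]`))
	fmt.Println(status, errs)
}
```

The minimal two-property schema from the first example would validate, while a body missing the database property would come back Invalid with that error listed.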

These two API end points focus on building out large data sets to test systemic environments and the respective construction of those environments. The actual generation of the data is assumed for this API service, and the individual generation of data is discussed below in the individual request APIs.

Individual Request APIs

The following API calls could be implemented with fairly straightforward random data generation. A number can easily be randomized and returned, a word can be chosen from a dictionary, and a city returned from a list of cities. The following are prospective API calls to return data of this type.
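To make that concrete, here’s a small sketch of what such endpoints might look like. Everything here is hypothetical: the routes, the tiny word and city lists standing in for a real dictionary and city data source, and the response format are all my own placeholders.

```go
package main

import (
	"fmt"
	"math/rand"
	"net/http"
)

// Tiny stand-in data sources; a real service would load a dictionary and a
// proper city list.
var words = []string{"river", "flood", "delta", "basin", "torrent"}
var cities = []string{"Portland", "Seattle", "Olympia", "Vancouver"}

func randomNumber() int  { return rand.Intn(1000) }
func randomWord() string { return words[rand.Intn(len(words))] }
func randomCity() string { return cities[rand.Intn(len(cities))] }

// newMux wires each generator to a hypothetical endpoint path.
func newMux() *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/number", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintf(w, "%d", randomNumber())
	})
	mux.HandleFunc("/word", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, randomWord())
	})
	mux.HandleFunc("/city", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, randomCity())
	})
	return mux
}

func main() {
	// Print one sample of each generator.
	fmt.Println(randomNumber(), randomWord(), randomCity())
	// To actually serve the endpoints:
	// http.ListenAndServe(":8080", newMux())
}
```

Each endpoint just picks from its source and returns the value, which is why this first tier of generation is the easy part.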

The next level of complexity would be slightly structured data generation. Instead of picking from an arbitrary list of addresses, we could prospectively generate one. On the other hand, maybe we should just randomly create actual addresses that can be validated against a real address? That seems to have the possibility of issues in the real world, in spite of the fact that all the addresses out there are basically publicly accessible data. But questioning how or what the data would or could actually represent will be a great thing to discuss in the thread.

The next level of data generation complexity would be to generate sentences and other related data. This could be done a number of ways. If we wanted it to generate intelligent sentences that made sense, it would take a little more work than, for example, generating lorem ipsum.

TLDR;

This blog entry just details the starting point of features for the Data Diluvium Project. If you’d like to jump into the project too, let me know. I’m generally working on the project during the weekends and a little during the week. There’s already a project base that I’m starting with. If you’re interested in writing some F#, check out the work Dave Curylo has done here. I’ve been pondering breaking it out into another project and sticking to the idea of microservices, but with F# for the work he put in. Anyway, if you’ve got ideas on how to generate data, how you’d like to use it in your applications, or other related ideas, please dive into the conversation on the Github Thread here.