I’ve got three steps I’m going through to reboot the Data Diluvium Project & the respective CLI app I started about a year ago. I got a little ways into the project and then a bit distracted, it happens. Here’s the next steps I’m taking and for those interested in helping out I’ll be blogging the work here, and also sending out updates via my Thrashing Code Newsletter. You can sign up and select all the news or just the open source project news if you just want to follow the projects.
Step 0: Write Up the Ideas Behind the Project
Ok, so this will arrive subsequently. So far, just wanted to get these notes and intentions written down. Previously I’d written about the idea here, and here. Albeit after many discussion with a number of people, there will be some twists and turns to the project to make it more useful and streamlined in CLI & services.
Step 1: Cleanup The Repository
Currently the repository is kind of a mess. I’m going to aim to do the following over the next few days.
Write up contributor issues/files for the repo.
Rewrite the documentation (initial docs that is) to detail the intent of the data generator ideas.
Incorporate the CLI to a repo that is parallel to this repo that is designed specifically to work against this repo’s project.
Write up a README.md that will detail what Data Diluvium is exactly as well as point to the project site and provide installation and setup instructions.
Setup the first databases to target as Postgresql, Cassandra, and *maybe* one other database, but I’m not sure which one. Feel free to file an issue with a suggestion.
Step 2: Cleanup & Publish a new Project Website
This is a simple one, I need to write up copy with the details, specific with feature descriptions and intended examples. This will provide the start point to base the work for the project. It will be similar to one of those living documents in that the documentation will, can, and should change as the project is developed.
Step 3: Get More Cats Coding!
I’ve pinged a few people I know are interested in helping out, but we’re always looking for others to help with PRs and related efforts around the project(s). If you’re game, the easiest way to get started would be to ping me directly via DM on Twitter @adron and to sign up on my Thrashing Code Newsletter and select Open Source Projects Only (unless you want all the things).
…anyway, getting to work on these tasks. Happy coding!
This post includes a collection of my thoughts on design and architecture of a data generation service project I’ve started called Data Diluvium. I’m very open to changes, new ideas, or completely different paradigms around these plans altogether. You can jump into the conversation thread. What kind of data do you often need? What systems do you want it inserted into?
Breakdown of Article Ideas:
Collected Systems API – This API service idea revolves around a request that accepts a schema type for a particular database type, an export source for inserting the data into, and generating an amount of data per the requested amount. The response then initiates that data generation, while responding with a received message and confirmation based on what it has received.
Individual Request API – This API service idea (thanks to Dave Curylo for this one, posted in the thread) revolves around the generation of data, requested at end points for a particular type of random data generation.
Alright, time to dive deeper into each of these.
Collected Systems APIs
https://datadiluvium.com/schema/generate – This API end point would take a schema with the various properties needed. For any that aren’t set, a default would be set. The generation process would then randomize, generate, and insert this data into any destination source specified. Here are some prospective examples I’ve created:
A very basic sample JSON schema
In this particular example, I’ve created the simplist schema that could be sent into the service. For this particular situation I’d have (currently not decided yet) defaults that would randomly create a table, with a single column, and generate one element of data in that table. Other properties could be set, which would give control over the structure created in which to insert the data into. An example would be the following.
In this example, the properties included are three tables; Users, Addresses, and Transactions. In the first table, Users, the columsn would be; id, firstname, lastname, and email_address. Each of these then have a type property which sets the type of data to be generated for the columns. The same type of set of properties is then included for the Addressesand Transactions tables and their respective columns.
Some additional questions remain, such as if the tables exist in the database, would the insertion build SQL to create the tables? Should it be assumed that the tables exist already and have the appropriate settings set to insert the data into the tables? Again, a great thing to discuss on the thread here.
https://datadiluvium.com/schema/validate – This could be used to validate a schema request body. Simply submit a schema and a validation response would be returned with “Valid” or “Invalid”. In the case of an invalid response, a list of prospective and known errors would be returned.
These two API end points focus around building out large data to test systemic environments and the respective construction of those environments. The actual generation of the data is assumed for this API service and the individual generation of data is discussed below in the individual request APIs.
Individual Request APIs
The following API calls could be implemented with fairly straight forward random data generation. A number can easily be randomized and returned, a word can be chosen from a dictionary, and a city returned from a list of cities. The following are prospective API calls to return data of this type.
The next level of complexity for data generation would be the slightly structured data generation. Instead of having an arbitrary list of addresses, we could or would prospectively generate one. But on the other hand, maybe we should just randomly create actual addresses that can be validated against an actual real address? That seems to have the possiblity of issues in the real world in spite of the fact all the addresses out there in the world are basically publicly accessible data. But questioning how or what the data would or could actually represent will be a great thing to discuss in the thread.
The next level of data generation complexity would be to generate sentences and other related data. This could be done a number of ways. If we wanted to have it generate intelligent sentences that made sense, it would take a little bit more work then for example generating lorum ipsum.
This blog entry just details the starting point of features for the Data Diluvium Project. If you’d like to jump into the project too, let me know. I’m generally working on the project during the weekends and a little during the week. There’s already a building project base that I’m starting with. If you’re interested in writing some F#, check out the work Dave Curylo has done here. I’ve been pondering breaking it out to another project and sticking to the idea of microservice but with F# for the work he put in. Anyway if you’ve got ideas on how to generate data, how you’d like to use it in your applications, or other related ideas please dive into the conversation on the Github Thread here.