Sorry Database Nerds, Nobody Actually Gives a Shit…

So I’ve been in more than a few conversations about data structures lately: academic debates, practical arguments, and other notions about where and how data should be stored. I’ve worked on and managed projects where teams of people determine how to manage data so that other people don’t have to, so those people can focus on business use instead of the data mechanisms underneath. Everything around databases really boils down to a single question: how can we store X and retrieve X? Nobody actually trying to get business done or change the world is going to dig into the data storage mechanisms if they don’t have to. To summarize,

nobody actually gives a shit…

At least nobody does until the database breaks, or somebody has to be hired to manage it or tune queries, or some other problem comes up. In an ideal world we could just put data into the ether and have it come back when we ask for it. Unfortunately we have to keep caring about where the data lives, how it’s stored, the schema (even in a schema-less system you still need to know the shape of the data at some point; it’s just another abstraction that pushes off dealing with the database), backups, recovery, data gravity, proximity and a host of other concerns. Wouldn’t it be cool if we could just work on our app or business? Wouldn’t it be nice to just, well, focus on things we actually give a shit about?

Managed Data Systems!

The whole *aaS and PaaS world has been pushing to simplify operations to the point that the primary, if not the only, concern is the business itself. That’s a pretty big step in many ways, and it holds a lot of hope and promise for fixing the data gravity, proximity, management and related concerns. One provider with an interesting start in the NoSQL realm is Orchestrate.io. I’ll have more about them in the future, as I’ll actually be hacking on some code against their platform. They’re currently solving a number of the issues mentioned above, which is great: a solid starting point that takes us past the draconian nature of the old approach to NoSQL and relational databases in general.

There have been others, such as Mongo Labs, that have created a sort of DBaaS. That, however, doesn’t fill the gap Orchestrate.io is filling. So far almost every *aaS database solution has been a single type of database that a developer can throw data at in a single kind of way: not really flexible, and only abstracting away some manual work rather than adding value around using the actual data. Orchestrate.io bridges these together with search, replication and other features to provide a platform where multiple options are available through one API. Key-value, geo, time-series and others are all coming together for them nicely. Having all the options creates real added value, versus providing one single way to do one thing.
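To make that multi-model idea concrete, here’s a minimal sketch of what talking to a single multi-model HTTP API can look like from application code. This is illustrative Python using the requests library; the base URL, the endpoint shapes and API_KEY are my own assumptions loosely modeled on a REST style like Orchestrate.io’s, not their documented API.

```python
# Sketch: one HTTP API fronting several storage models.
# BASE, the endpoint shapes, and API_KEY are hypothetical,
# loosely modeled on an Orchestrate.io-like REST style.
import requests

BASE = "https://api.example-dbaas.com/v0"  # hypothetical base URL
AUTH = ("API_KEY", "")                     # API key as basic-auth username

def put_kv(collection, key, value):
    """Key-value: store a JSON document under a key."""
    r = requests.put(f"{BASE}/{collection}/{key}", json=value, auth=AUTH)
    r.raise_for_status()

def search(collection, query):
    """Search: query the same collection, no separate search cluster."""
    r = requests.get(f"{BASE}/{collection}", params={"query": query}, auth=AUTH)
    r.raise_for_status()
    return r.json().get("results", [])

def append_event(collection, key, event_type, payload):
    """Time series: append a timestamped event to an existing key."""
    r = requests.post(f"{BASE}/{collection}/{key}/events/{event_type}",
                      json=payload, auth=AUTH)
    r.raise_for_status()

# One record, three access patterns, zero database administration.
put_kv("users", "adron", {"name": "Adron", "city": "Portland"})
append_event("users", "adron", "login", {"ip": "10.0.0.1"})
print(search("users", "city:Portland"))
```

The point isn’t the particular endpoints; it’s that key-value writes, search and time-series events all land behind one API instead of three separately operated databases.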

Intelligent Data Systems?

After checking out and interviewing Orchestrate.io recently, I’ve stumbled onto a few other ideas that would be perfect for them to implement, or for the open source community to take a stab at. What would happen if the systems storing the data knew where to put things? What would it take to provide an intelligent indexing policy or architecture at the schema design decision layer, the area where a person usually has to intervene? Could it be done?

Imagine a decision tier that scans the data and decides how it should be stored: as key-value, geo, time-series or something else. Could it be done in real time? Would it have to go through some type of processing system? There are numerous ways to implement something like this, which leaves a lot of space for adding value around the data by reducing the complexity of that decision making.
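As a thought experiment, here’s a minimal sketch of what the skeleton of such a decision tier might look like. The classification rules are hypothetical placeholders (a real system would configure or learn them); the shape is what matters: inspect a record, pick a storage model, route it.

```python
# Sketch: skeleton of a decision tier that inspects a record and
# picks a storage model. The rules below are hypothetical placeholders.
from dataclasses import dataclass, field

@dataclass
class Record:
    key: str
    value: dict
    meta: dict = field(default_factory=dict)  # e.g. relations, query stats

def classify(record: Record) -> str:
    """Decide which storage model suits this record."""
    if "lat" in record.value and "lon" in record.value:
        return "geo"
    if "timestamp" in record.value:
        return "timeseries"
    if record.meta.get("edges"):   # the record participates in relations
        return "graph"
    return "keyvalue"              # safe default

class PrintBackend:
    """Stand-in backend that just logs where a record would go."""
    def __init__(self, name):
        self.name = name
    def store(self, key, value):
        print(f"[{self.name}] {key} -> {value}")

backends = {m: PrintBackend(m) for m in ("geo", "timeseries", "graph", "keyvalue")}

def route(record: Record) -> None:
    """Hand the record to whichever backend the classifier picked."""
    backends[classify(record)].store(record.key, record.value)

route(Record("sensor-1", {"timestamp": 1391212800, "temp": 21.5}))
```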

Imagine you have key-value data that needs to be associative, based on graph principles, and that you must store in a highly available system while serving pertinent real-time data based on those graph relations. A decision layer, making this an intelligent data system, could monitor the data and determine the frequent query paths against it. As data grows old it could move from the real-time store to key-value archival. Other decisions could push hot data segments up into a cache tier or some other mechanism that serves real-time graph connections to client queries. These are all decisions somebody working on the data would normally have to make, but they could be put into a set of rules that re-allocate the data into better storage options automatically. Why keep old data that’s never queried in the active in-memory graph store? Push it to the distributed key-value store. Why keep the graph data on disk when it can live in memory, with correlated keys in an in-memory key-value store backed by an on-disk key-value store? All valid decisions, all becoming better understood day by day. It’s about time some of this decision process started to be automated.
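Those last few decisions are exactly the kind that fit a declarative rule set. Here’s a minimal sketch under assumed names and numbers: the tiers (kv_archive, cache, graph_memory) and the 90-day and query-rate thresholds are all invented for illustration.

```python
# Sketch: declarative re-allocation rules for an intelligent data system.
# Tier names and thresholds are hypothetical; the point is that the
# placement decisions above can be expressed as data, not manual ops work.

# Each rule is (predicate over a record's access stats, target tier);
# the first match wins.
RULES = [
    # Old and rarely queried: out of the in-memory graph store,
    # into the distributed key-value archive.
    (lambda s: s["age_days"] > 90 and s["queries_per_day"] < 1, "kv_archive"),
    # On a hot query path: push up into the cache tier for
    # real-time graph traversals.
    (lambda s: s["queries_per_day"] > 1000, "cache"),
    # Everything else stays in the primary in-memory graph store.
    (lambda s: True, "graph_memory"),
]

def place(stats: dict) -> str:
    """Return the tier selected by the first matching rule."""
    for predicate, tier in RULES:
        if predicate(stats):
            return tier
    return "graph_memory"

# A nightly (or streaming) pass over access statistics could then migrate
# any record whose placement has changed since the last pass.
records = {
    "order:1001": {"age_days": 400, "queries_per_day": 0.2},
    "user:adron": {"age_days": 12, "queries_per_day": 4500},
}
for key, stats in records.items():
    print(key, "->", place(stats))
```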

What are your thoughts? Pro-intelligent data systems or anti-intelligent data systems? Think it’ll work or is it the wrong approach? Maybe the system should approach some other zenith or axiom point to become truly abstracted and transparent?

11 thoughts on “Sorry Database Nerds, Nobody Actually Gives a Shit…”

  1. I see your approach solving the huge impedance mismatch that has built up between front-end web-scale ecommerce (often NoSQL based) and back-end (often SQL based) business systems: ERP, healthcare.gov, et al. The *aaS abstraction the check signers really care about is enabling reactive business apps that can intermediate themselves in dynamic clusters.

    I thought this would happen via tree-shaking message-passing architectures. Companies like Whitepages are retrofitting Scala and Akka. But the bigger prize is that the whole Microsoft / Oracle back end is a horrible birdsnest full of crap. Salesforce patterned an Oracle-DB-on-an-Oracle-DB, yet Force.com is over 10 years old and still state of the art for delivering the value business customers really care about. What Force.com does not do is behave dynamically and automatically, and that is what businesses really care about: delivering on the back end what they promise customers on the web. Only six people were able to sign up on healthcare.gov’s first day (thanks, Oracle).

  2. Coda: forgot to mention that the Force.com Oracle-on-an-Oracle database is patterned pretty much as a NoSQL-on-a-SQL database. Lucene reverse indexes populate the bastardized web-facing Oracle database via update triggers from the back-end database (conceptually so; there are more details, of course). No one who buys the enterprise solution seems to care how it works as long as it’s reliable and business developers can build using declarative methods with very little coding. It scales database development for applications automatically.

    Abstracting away complexity seems to be running rings around Microsoft / Oracle. What Force.com doesn’t do well is scale. Multitenancy is governed to stop one tenant’s session degrading another user’s session. I imagine this is where Orchestrate.io can innovate in the *aaS space.

    1. Yeah, I’ve always heard the Force.com database was kind of a clunky, whacked-together thing. I’ve never had to deal with it myself though, so I can’t speak to it intimately at a code level. But just going on what I’ve read in the docs and elsewhere, it doesn’t seem like an ideal platform for building data-intensive applications.

  3. Another data design and storage paradigm. Hmm.

    See the latest database map, perhaps over 100 vendors now.
    http://blogs.the451group.com/information_management/2013/02/04/updated-database-lanscape-map-february-2013/

    The problem with so many in tech is that they aren’t so much reluctant
    to understand the RDBMS paradigm as that they flat-out refuse to learn it.

    They’ll dive deep into the OS, and into front ends like Java,
    but never the database, where their data actually comes from.

    There are bad designs in:
    – networks
    – java objects
    – user interfaces
    – websites
    – process flow
    – programs and algorithms

    And most in IT will understand their importance.

    But not database design.

    A good database design (and people who have the smarts to use it) will:
    – eliminate thousands, or hundreds of thousands, of lines of code
    – be seriously efficient
    – produce the correct output
    – be reliable
    – allow projects to be completed on time!
    – reduce the total expenditure on projects by orders of magnitude!

    See a series of blog posts I did:
    Database Design Mistakes To Avoid:
    http://rodgersnotes.wordpress.com/2010/09/14/database-design-mistakes-to-avoid/

    What’s utterly amazing is that all these incredibly pathetic “designs” actually made it into production!

    Part of my series covers my experiences using the 3-table / EAV / MUCK modeling paradigm:
    http://rodgersnotes.wordpress.com/2010/09/21/muck-massively-unified-code-key-generic-three-table-data-model/

    For some analysis of Oracle Apps under the covers, start here:
    http://rodgersnotes.wordpress.com/2010/12/17/oracle-apps-r12-schema-analysis/
    Especially see the part about primary keys!

    HTH

    1. I generally agree with a lot of what you said but a few comments… 😉

      “eliminate thousands, or hundreds of thousands, of lines of code” <- I’m assuming someone didn’t just misunderstand the database side, but also didn’t do a very good job on the client side if thousands and thousands of lines of code need to be removed…

      "reduce the total expenditure on projects by orders of magnitude!" http://rodgersnotes.wordpress.com/2010/09/21/muck-massively-unified-code-key-generic-three-table-data-model/

      1. EAV seduction scares me. Our former CEO insisted that all financial budgets and their matching double-entry accounting transactions, sometimes spanning hours, days, weeks, or months, exist in the same table to make reporting easier for customers. More than 20 indexes on the GBKMUT table strangled performance for larger businesses (>12GB). The thousands of lines of SQL Server queries and the knowledge built up around them made it too much risk for management to approve splitting the GBKMUT table out into 3 or more tables. http://forum.exactamerica.com/macola/forum/3-macola-es/1153-how-big-is-your-gbkmu

    2. I assume you mean RDBMS by database?

      Anyway, “a good database design” is subjective and there are always trade-offs. My experience is that “a good database design” (as a DBA sees it) tends to increase my code (and I have smarts). As for the rest… more of the same. That might be your experience. Not mine.

      Databases are part of the solution, not the center of it. Note that I am not equating data to database.

  4. cliveb -> That’s exactly why ya just gotta drop the RDBMS sometimes and go with something built for the specific purpose. The other issue is that the RDBMS is so tightly coupled in many of those scenarios, locking people into exactly what you point out, as in that thread’s nightmare of,

    “Wondering how many other users are having Financials Performance issues.My GBKMUT is 17gig
    I index nightly. Optimize regularly. I have a quardcore processor running sql 2k with 5 gig ram
    it currently takes 5 min to open GL>General Journal with processed/all selected.
    it takes 2 min Unprocessed/all
    it takes 5 min Void/all”

    …that is indeed just unreasonable, but it’s a common dilemma in RDBMS land. The RDBMS just isn’t really efficient for those scenarios.

    1. Funny thing is, @SQLServerMike told me that for years Oracle and DB2 have been first and second in RDBMS revenues, with SQL Server always in third place but creeping up. In the last quarter of 2013 this flipped: SQL Server is now in first place.

      …perhaps this means the NoSQL DMSes are biting into the larger RDBMS deployments, while plenty of smaller shops are staying with the RDBMS. Oh heck, more GBKMUTs ahead!

      1. Very possible, eh. It’s interesting to see which way that money is flowing. I know for a fact that Oracle is being displaced in many key-value situations. It doesn’t cut it in a truly distributed, highly available way anyway. For example, it can’t touch Riak for uptime on trillions of pieces of data per day cycle, especially unstructured key-value type data.
