The Solutions Architect's Guide to Choosing a (NoSQL) Data Store: Process Overview

When proposing a data store solution, just going with the flow (or hype, for that matter) is not a very safe approach. As a solutions architect, one needs to make sure one has a clear overview of the usage scenarios and business needs served by the data store, research and inventory potential candidates, benchmark the fully compliant ones, examine the results and then make an objective, argument-bound proposal.

Just saying “we’ll implement a cutting edge NoSQL data store” might earn you extra points in front of stakeholders at first, but it is clearly not enough to deliver a mature, robust solution, which is fit for use and fit for purpose and which the development and operations teams can feel in control of.

Let’s start with the diagram below.

Architectural Process: Choosing a Data Store (click to enlarge)

Analyze usage and load scenarios and general requirements for the data store. At this stage you list the features and capabilities you wish to have from your data store, such as scalability (partition tolerance), responsiveness (fast queries) or indexing. You should NOT put items on this list just because “it’s good to have it there”, “all the cool kids have it”, “I heard it’s important” or “I read about it in a magazine”. You should ONLY put items on this list because they serve a business purpose and a real usage scenario. You can mark each requirement as “mandatory” (must have) or “optional” (nice-to-have). Also, it’s a good idea to mark the business impact of not having such requirement – as this will help you discuss with business resources and stakeholders with a lot more ease. Finally, make sure you validate your assumptions with your team and with the beneficiary (client, stakeholders). All these requirements will serve as the header of the compliance matrix for candidate data stores. The compliance matrix can contain technical requirements or business requirements (“the data store should be open source” OR “the data store is commercially supported” OR “the data store is offered as a cloud service on infrastructure provider Amazon Web Services/Microsoft Azure/Google Cloud).
Rank candidate data stores on the compliance matrix. Do your research, make a list of all data stores you would like to consider. Typically, this list should contain 5 candidates; having more than 7 would mean your losing focus; having less than 3 would mean your jumping to conclusions too soon. Go in-depth for each one of the candidates and see if they are compliant with each of the requirements. Mark “I don’t know” where you are not sure or where further research is needed.
Decision gate: is any data store which is fully compliant all of your requirements? This step is critical in your process: either you have found one (but preferably, 2-3) candidates for which “the shoe seems to fit” or you need to go back to the drawing board. Typically, when you have too many requirements, it typically means you want to use the same solution to solve several problems at once (there are few data stores which are both transactional and analytical at the same time; SAP HANA would be an example and it is not cheap). So what you can do is to split your problem in two smaller ones (divide et impera in solutions architecture is known as separation of concerns), which can be independently be solved more efficiently. Only do this split is absolutely necessary. Remember: the more pieces you split in, the more integration work you’ll have to handle.
Benchmark and execute proof-of-concept. Yeah, it looks great on paper, the open source community thinks it’s great, the vendor says it’s great (especially that hot chick who is our account manager). So let’s test it. Pick a scenario. Let’s say 10K transactions per second. Match the number of columns in each record and the approximate type with what you imagine you’ll have in production. It doesn’t have to be an exact replica of the real scenarios – when you’re not sure what you will need, round up the requirement and benchmark something more aggresive. “We might need between 30 and 40 million records per table” translates to “Let’s benchmark it with 100 million”. When executing a benchmark, understand that it not important to match the functionality of the feature; rather, it is important to match (and outmatch) its aggressiveness in terms of performance. Make 50 concurrent requests from different machines altering the same record. Drop in 100 million records sequentially, read them randomly, delete some of them, and then read randomly again – is there any degradation in performance. Shut down n/2-1 nodes during a load test – is the data store still holding? And if so, with what kind of performance degradation? If you turn back on one of the dead nodes, does it start to take in some of the load? Does it reprovision with the data it lost? And so on… Use your imagination when you benchmark, spiced with a pinch of sadism. The main purpose of the benchmark is to confirm that the non-functional requirements of the scenarios you identified are met.
Rank candidate data stores based on the results of the benchmark. This is the first real world validation of your proposals. This will help you discern from those potential candidates which say they’re good from those who are actually good. This will also weed out any invalid assumptions you have made.
Evaluate operational cost based on benchmark. Bombarding a data store with requests will make you aware of just how much hardware and resource you will need for the real life production scenario. Based on the results, you can make a more educated guess about costs. Be sure to put it in writing in your proposal.
Consult with business resources on cost, benefit, risk. So far, this has been pretty much a technology exercise. Now it’s time to share your findings, inform stakeholders of any potential risk and tell them what kind of invoice they can expect for this, including capital expenses (hardware, licenses) and operational expenses (using cloud PaaS or SaaS, renting infrastructure).
Split scenarios on types so as to allow the usage of two or more integrated, purpose-specific data stores. Let’s say you want to build a data store which processes transactions with millisecond delay (OLTP), but is also able to produce complex reports on hundreds of millions of transactions in a few seconds (OLAP). While there are few data stores able to do both at once (at they are probably over your budget anyway), what you can do is propose and OLTP solution which periodically (every hour, let’s say) batch-provisions data into the OLAP solution. This way, you can have the best of both worlds, if you are able to accept some delay between them (i.e. the reports will not contain the last hour of data, they will not be real time).

It might seem like a bit of overkill, but following this process will make sure you don’t end up with loose ends and with things you discover 2 weeks before or 3 months after the go-live of the final project.

In the following articles I will publish I will drill-down into the first (defining requirements) and the second step (ranking candidate data stores).

The Solutions Architect’s Guide to Choosing a (NoSQL) Data Store: Process Overview

DeepVISS.org

One Comment

Leave a Reply