As discussed in the previous articles, the first step in the process of finding the right solution for a data store is having an in-depth understanding of the fundamentals problem at hand and of the business scenario(s) which it will serve. In what follows, I will expose some of the question that I consider an enterprise/solutions architect should know the answers to before proposing either a new data store or a data store replacement. After all, as stated by the CAP theorem, you can’t have all qualities in a data store, so you need to carefully pick the right tool for the job.
- Volumetry: What is the total size of the data store? It’s not important to get an exact value, but rather an order of magnitude (10GB, 200GB, 1TB, 20TB). Instead of concerning yourself with the exact value, it’s better to focus on the growth factor you expect year-over-year (is it 10%, 50%, 200% or 1000%). Depending on the volumetry and on its growth, you might be forced to opt for a scalable/distrbuted data store (which runs on several nodes).
- Atomic size: How many records (items, objects) are processed/retrieved/in any way touched by one query? Also, you need to focus on the order of magnitude, not on the exact value. Are you planning to retrieve/process/compute over up to 10 records in a query (this would be the case for transactional workloads, like updating customer data, records), is it more like 10K records (usually in short-term reporting and analytics workloads) or it is more like 10-100M records touched by each query (characteristic of an analytical data store or data warehouse, use for building more complex long-term reports)?
- Load: How many queries do you expect per second, on average and in spikes? Are we talking 10 operations/second, 1K operations/second or 100K operations/second? Depending on the load and on its growth, you might be forced to opt for a scalable/distrbuted data stores (which runs on several nodes).
- Responsiveness: How fast do you expect those queries to run? In some instances, you may need 1-2ms response time (real time systems), other scenarios might be OK with 50-500 ms (displaying, generating content) and other scenarios might be satisfactory to run in 1-60 seconds (usually analytical workloads, generating a complex report of all items sold in the last six months per geographic region and line of business)
- Immutability. Does your data ever change after you add it? For instance, if you’re storing a log of events (page views, user actions, application errors/warnings), it’s unlikely you ever want to change a particular data. And this assumption does wonders in terms of allowing you to choose a class of data stores that are fast, scalable, capable of running complex queries, but which are pretty averse to changes: column (or columnar) data stores / data warehouses. Of course, this does not mean that you cannot change data once it’s stored – it just means that changing comes with a big performance hit (i.e. not what the tool is built for). Note that for columnar data stores there are a lot of strategies of selectively deleting old data without denting performance (i.e. destroy data which is older than 24 months).
- Strict consistency. There are cases when you want all queries to the data store to receive the exact same result (assuming nothing changed between queries). In case you are running your data store on a single layer and on a single node (like, you know, MySQL) this is almost never an issue. If you are running the data base on distributed nodes (and all or some of the data is replicated), some nodes may get the updated version later than other, therefore they might give out different answers than the master node, at least until they get the update. Therefore, you may want the data store to be able to guarantee you the fact that all replicas have been updated before confirming the change (consensum) or that at least 2,3, n/2, n/2+1 (quorum) of the replicas received the update.
- Date Freshness/Staleness: How fresh do you expect the data to be? In order to scale, you may want to maintain copies of the master data. This means that when something is added or when something changes, it takes some times for all the copies (replicas) to be updated. Is this acceptable? And if so, would 10ms be ok, would you be OK with 1 second? Or would even 1-5 minutes be satisfactory? For instance, when reading and writing banking transactions, any sort of staleness is unacceptable (since it can raise risks of double spending). However, if running a content site, having an article or a picture refresh from the user’s perspective 5 or 10 seconds after if was updated by the content manager is pretty much OK. Going further, if you create a sales report for the last 6 months, it may even be acceptable that the data from the last hour (or even the last day) is not included (or is not guaranteed to be entirely accurate).
- Transactional ACID compliance. ACID stands for Atomicity-Consistency-Integrity-Durability it basically refers to the fact that transactions (groups of separate changes) either succeed together or fail together (while preserving the previous state in case of failure). This might be the case for bank statements, customer orders and online payments, but transactional compliance is most probably NOT needed for reporting, content delivery and ad delivery, tracking analytics.
- Query accuracy. For certain analytics tasks (i.e. number of unique users), especially for real time queries, having the absolute exact value is not a absolute necessity. If having 1-2% error in acceptable, you can consider using sketch techniques for approximate query processing, which make your systems run faster, with less resources/lower costs, while only guaranteeing the results with a specified error threshold (of course, less error->more processing->more time/more resources). Simple examples of approximate query processing include linear counter and LogLog counters. These methods of doing fast estimates for problems which are expensive to evaluate accurately rely on the less-popular probabilistic data structures. Data stores usually don’t have built-in support for this, but you can implement it in the application layer to make your life a lot easier when precision is not mandatory.
- Persistence and durability. Do you want to keep the data in case of adding/removing or replacing a node or in case of an application restart or power failure? In most cases, the answer is “of course I do! what are you, crazy?!”, but there are some use cases (such as caching or periodic recomputing) where wiping out the whole data store in case of node failure/cluster failure or maintenance work is acceptable. Imagine a memcached cluster is used to query database query results for up to 1 minute (i.e. to prevent congestion on the underlying database) – in this case, wiping out the cache, starting from scratch and then refilling it (known as cache warming) is acceptable, as it would only entail a small performance degradation during the 10 minutes (this negative effect can be further reduced by performing this cache wiping during maintenance hours, i.e. during the night).
- High availability (fault tolerance or partition tolerance). In some cases, it is important that a data store is never, never down (well, almost). Nobody likes downtime, but in some cases it’s more acceptable than in others (the way you can asses this is by looking at the business impact per hour: revenue loss and legal risk – you know, like people not paying or suing to ask for their money back plus damages). Assuming you are under strict (maybe even legal) high availability requirements, you want to make sure that you data store can take a hit or to; that is, I can go on functioning even if a few nodes go down. This is a way, you can reduce the probability of data store failing if a node fails and you can make sure that the service(s) it provides or supports do no suffer interruptions while you repair or replace the damaged node. So if your truly need to offer such guarantee, make sure you go for a data store which is fault tolerant.
- Note: As an exercise, try to compute the average failure rate of a cluster composed of three fully redundant nodes (they all store the same data), assuming each individual node has a failure rate of 1% per year (in the first year of operations) and that failures are isolated (i.e. not accounting for failures that affect all nodes simultaneously, like power outages or a meteor hitting your data center)
- Backups and disaster recovery. Sooooo, you remember I mentioned a meteor hitting your data center? Yeah, it just hit your data center. Head on, full on. There’s nothing left. Every bit wiped out of existence in 2.47 seconds. Do you want to be prepared for this scenario? If so, add backup and disaster recovery to your data store’s requirement. Remember that for a data store to be considered disaster recoverable, it needs to have an exact/almost exact replica in a geographically separate data center (different continent, different country, different city). Furthermore, you may even require that the replica is hot-swappable (passive backup) or load balance (active backup) with the master version, so that in case of disaster the downtime is non-existent or minimal.
There are other non-technical constraints which you need to have in mind, as some of them might prove to be show-stoppers for some of the candidate data stores you will consider:
- Infrastructure preference: on-premise / private cloud, public cloud / SaaS (software-as-a-service).
- Capital expenses (up-front investment).
- Operational expenses (recurring costs).
- Deployment complexity.
- Operating/maintenance complexity.
- Team’s knowledge/willingness and opportunity to expand that knowledge.
- Licensing concerns.
Pick those requirements which apply for your project/scenario and write them down as the header of a table.
That table will become the compliance matrix for your candidate solutions, which we will use and evaluate in the next article.