Before deciding to transition the apps that your company builds to Big Data / NoSQL solutions, there are a few things you need to understand:
- The CAP theorem, which states that a distributed system cannot be strictly consistent, highly available and partition tolerant all at the same time. Figure out which of these you need first. Tip: you might need several separate data stores for different purposes.
- “NoSQL” is just a marketing buzzword, not a concrete solution. Several types of non-relational and scalable data stores are labeled NoSQL, although they are very different in capability and performance.
- There is no silver bullet. “One [data store] to rule them all” is something that only Lord of the Rings fans would believe; and even they (most of them, anyway) know it’s fiction.
- There is no free lunch (or “there ain’t no such thing as a free lunch”), which means that a data store will perform wonderfully under the conditions for which it was engineered and will be a disaster in other scenarios. It’s your responsibility to pick the right tool for the job.
- Don’t do it just because it’s cool. Technology must serve a practical, objective-bound, business purpose. “Our company has to transition to Big Data (because everybody else is doing it)” does NOT constitute a valid reason.
- Is your data really that BIG? Rule of thumb: if you don’t have at least 1TB of data, you don’t really need big data. We all like to think that our department deploys and manages big data, and we all like to think that our company needs big data. You want to be one of the cool kids riding high on the new big data wave. But give serious thought to whether you actually are. Before you jump into the Big Data pool, you might want to check your current and future data storage needs (are they really growing that fast?) and look at ways to improve the performance of your current MySQL solution (Google the following: “master-slave replication”, “query result caching”, “memcached query caching”, “database partitions” and “sharding” – see if any ideas light up). Also, you might want to consider a hardware upgrade (servers with SSD drives can do magic, I’m told).
- Performance, capability and low cost: pick two. You can’t have all three (see “There is no free lunch” above). Maybe you are a small organization which is not that data intensive. Maybe you need all the query flexibility of SQL and don’t have a huge budget to get into a data warehouse BI solution. Understand your business needs, priorities and budget before you start blurting out words like “NoSQL”, “big data”, “lambda architecture”, “unlimited scalability” and “data driven business”.
- Training and support. Fine, let’s say you build the goddamn thing. It works. Passes all the tests. Goes live. The business cheers, the tech guys cheer, everyone’s happy. The OPS/DEVOPS/infrastructure guys: maybe not so much. You see, knowledge of MySQL and Tomcat is ubiquitous, so if you run into a production problem, either the team has the experience or Google and StackOverflow have a lot of things that can help. However, you won’t find a lot of 10-step tutorials on how to recover from a multiple-node Hadoop (HDFS) failure that occurs during an HBase compaction. For that, you need to make sure that your team is well trained (unlikely if you’re just adopting this tech stack in the company) or that you at least have a satisfactory level of support (with SLAs, not just best effort) from your software vendor, from your service provider or from a third party (one that specializes in support for open source).
- Not paying up-front ends up being more expensive over time. Every business guy is super-excited that all this big data magic is free, right? Cause it’s open source, right? I’m not going to get into the “free speech vs. free beer” argument. I’m just saying that if you factor in loss of revenue due to downtime, plus maintenance, operation and support costs, an open source solution might end up being a lot more expensive than paying for licensing, training and support. Whoever says too readily that using open source is cheaper clearly doesn’t understand the concept of TCO (Total Cost of Ownership). Make sure that your team has the knowledge and the practical experience to manage the solution you adopt, or that you have a solution or support vendor with acceptable SLAs.
- Do your homework, stay in control, don’t buy the bullshit. Big Data is not a solution to all your problems. It won’t make your business bloom overnight. And it’s a lot of knowledge for the technical team to take in. “Transitioning to big data is a key objective for our company. That’s why we hired this big data consultant.” Congrats, you just hired a guy who doesn’t know your apps, your business processes or your team and who is probably charging you $400-2000/day for Googling “how to install HBase on my laptop” – great investment, much successes.
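To put a number on the “are they really growing that fast?” question from the list above, here is a minimal sketch of a growth projection. It assumes steady compound monthly growth, which is a simplification, and it uses the 1TB rule of thumb as the threshold. The function name `months_until` and the sample figures are made up for illustration; plug in your own measurements.

```python
import math

# The 1TB rule-of-thumb threshold, expressed in bytes (binary terabyte).
TB = 1024 ** 4


def months_until(current_bytes, monthly_growth_rate, threshold_bytes=TB):
    """Estimate whole months until the data set reaches the threshold.

    Assumes compound growth at a constant monthly rate (0.05 means 5%).
    Returns 0 if the threshold is already met, and None if the data is
    empty or not growing (the threshold would never be reached).
    """
    if current_bytes >= threshold_bytes:
        return 0
    if current_bytes <= 0 or monthly_growth_rate <= 0:
        return None
    # Solve current * (1 + rate) ** months >= threshold for months.
    months = math.log(threshold_bytes / current_bytes) / math.log(1 + monthly_growth_rate)
    return math.ceil(months)


if __name__ == "__main__":
    # Hypothetical example: 200 GB today, growing 5% per month.
    current = 200 * 1024 ** 3
    print(months_until(current, 0.05))
```

If the answer comes out in years rather than months, the caching, replication and hardware options from the list are probably worth trying first.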
Big Data, scalable data stores and cloud infrastructure are no longer an “if” for IT; it’s just a matter of “when?”. All I’m saying is that for your business the answer might be “not this year”. And I’m also saying that if the answer is “right now!”, you should make sure you cover all the angles laid out above.
On a less serious note, you can always check out NoSQLBane for some consistency and fault tolerance humor. And for a mix of distributed computing insight and stand-up comedy, do watch James Mickens’ talk on big data, NoSQL, cloud, virtual infrastructure and bullshit.