As is clear from Gartner’s Magic Quadrant for Cloud Infrastructure as a Service, both Microsoft Azure and Amazon Web Services are well ahead of the curve in terms of ability to execute and completeness of vision.
But which one is right for you?
In this article, I will lay out some points you need to consider when signing up for one of the two leading cloud vendors. Finally, I will present a mapping between Azure services and AWS services, so it is easier to see whether one of them has practical out-of-the-box added value for you, although they are generally equivalent from a functional point of view.
Each argument below assumes that all else is equal.
Do you have a hybrid (both public cloud and on-premise) infrastructure?
If you need to operate a hybrid infrastructure that combines public cloud and on-premise resources (with enterprise SLAs), Azure is most likely the right choice for you. It has multiple out-of-the-box integrations that allow you to quickly ramp up a hybrid infrastructure. Of course, AWS is perfectly capable of being used in a hybrid scenario, but this would probably require more development, testing and configuration effort from your organization (or your software vendor).
Do you have a strong Microsoft footprint in your organization?
Reusing on-premise infrastructure that runs existing Microsoft applications or a Microsoft stack is easier to do in the Azure environment. Likewise, if your development and operations teams have a track record of using Microsoft products, the learning curve for Azure would be less steep than for Amazon.
Are you a startup or are you under pressure to deliver an MVP under very limited resources?
If you want to have something rolling fast, with a smoother learning curve, Amazon does a better job, since setting up basic services is more straightforward for less complex setups.
Do you have strong legal or regulatory requirements around data privacy and protection?
Some companies are under strong regulatory requirements to keep their data stored inside the borders of a country or within their own data centers. These scenarios require a hybrid cloud solution, which puts Azure at the top of your list again.
In terms of cost and SLAs (both providers guarantee 99.5% and 99.9% for storage), the differences are marginal and should be analyzed on a case-by-case basis, to conclude whether they make a significant difference.
Whichever you choose, know that the services offered by both providers are functionally equivalent (even if not interoperable).
|Azure Service||AWS Service||Level of Equivalence||Major differences|
|Active Directory||Identity and Access Management (IAM)||Full||Active Directory supports higher granularity (at application level) and out-of-the-box integration with on-premise applications and services.|
|API Management||Not out-of-the-box||None||AWS does not offer an integration gateway/API management service.|
|Application Insights||CloudWatch||Full|
|Redis Cache||Amazon ElastiCache||Full||No major functional difference.|
|Batch||Simple Queue Service; Simple Workflow Service|
|BizTalk Services||Not out-of-the-box||None||AWS does not offer an ESB (enterprise service bus) as a service.|
|Cloud Services||CloudFormation||Partial|
|Data Factory||Data Pipeline||Partial|
|Event Hubs||CloudWatch||Full|
|HDInsight||Elastic MapReduce (EMR)||Full|
|Machine Learning||Custom: Apache Mahout over Elastic MapReduce (EMR) service||Partial|
|Managed Cache||ElastiCache||Full|
|Media Services||Elastic Transcoder||Partial|
|Mobile Services||Mobile Services||Partial|
|Multi-Factor Authentication||Multi-Factor Authentication (MFA)||Full|
|Notifications Hub||Simple Notification Service||Full|
|Operational Insights||CloudTrail||Full|
|Azure Search||Elasticsearch - general full text search; CloudSearch - custom web search|
|Scheduler||Not out-of-the-box||None||Although systems and resources can be orchestrated from various services, AWS does not explicitly offer a scheduler service.|
|Service Bus||Simple Queue Service; Simple Workflow Service|
|Site Recovery||Not out-of-the-box||None||AWS supports disaster recovery scenarios, but it has no explicit service.|
|SQL Database||Relational Database Service||Partial|
|Storage||Simple Storage Service - for big objects; DynamoDB - for tables; Simple Queue Service - for queues|
|StorSimple||Not out-of-the-box||Limited||AWS can be used for hybrid cloud architectures, but it offers no explicit service for multi-modal data management. Limited compliance can be achieved in AWS by defining policies in Simple Storage Service and Glacier that revolve either around removing old data or moving it to long-term (cheaper) storage.|
|Traffic Manager||Route 53 - for DNS; Elastic Load Balancer - for balancing|
|Virtual Machines||Machine Images||Full|
|Virtual Network||Virtual Private Cloud||Full|
|Visual Studio Online||Elastic Beanstalk; CodeCommit (available in 2015)||None||AWS does not offer an integrated, native Visual Studio ALM/IDE, but|
|Websites||Not out-of-the-box||None||AWS does not offer a web CMS as a service, but it is relatively easy to deploy using several AWS Marketplace AMIs (Amazon Machine Images).|
The data store landscape today is huge and can be confusing. To get a grip on all the various solutions out there, I find the map below (courtesy of 451research.com) very helpful when making the long list of potential candidates for a project. It lets you make sure that you don’t leave out anything relevant from your selection process and that you are, quite literally, on the right track.
Find the line on the map that matches the very high-level purpose of your project and list all the data stores on that line. Research each one against your project-specific requirements. Finally, make a short list of the ones which seem fit for purpose.
As discussed in the previous articles, the first step in finding the right data store is an in-depth understanding of the fundamental problem at hand and of the business scenario(s) it will serve. In what follows, I will lay out some of the questions that I think an enterprise/solutions architect should know the answers to before proposing either a new data store or a data store replacement. After all, as stated by the CAP theorem, you can’t have all qualities in a data store, so you need to carefully pick the right tool for the job.
- Volumetry: What is the total size of the data store? It’s not important to get an exact value, but rather an order of magnitude (10GB, 200GB, 1TB, 20TB). Instead of concerning yourself with the exact value, it’s better to focus on the growth factor you expect year-over-year (is it 10%, 50%, 200% or 1000%?). Depending on the volumetry and on its growth, you might be forced to opt for a scalable/distributed data store (which runs on several nodes).
- Atomic size: How many records (items, objects) are processed, retrieved or in any way touched by one query? Again, you need to focus on the order of magnitude, not on the exact value. Are you planning to retrieve/process/compute over up to 10 records in a query (this would be the case for transactional workloads, like updating customer records), is it more like 10K records (usually in short-term reporting and analytics workloads), or is it more like 10-100M records touched by each query (characteristic of an analytical data store or data warehouse, used for building more complex long-term reports)?
- Load: How many queries do you expect per second, on average and in spikes? Are we talking 10 operations/second, 1K operations/second or 100K operations/second? Depending on the load and on its growth, you might be forced to opt for a scalable/distributed data store (which runs on several nodes).
- Responsiveness: How fast do you expect those queries to run? In some instances, you may need a 1-2 ms response time (real-time systems), other scenarios might be OK with 50-500 ms (displaying, generating content) and others might be satisfied with 1-60 seconds (usually analytical workloads, such as generating a complex report of all items sold in the last six months per geographic region and line of business).
- Immutability. Does your data ever change after you add it? For instance, if you’re storing a log of events (page views, user actions, application errors/warnings), it’s unlikely you will ever want to change a particular record. This assumption does wonders in terms of allowing you to choose a class of data stores that are fast, scalable and capable of running complex queries, but which are pretty averse to changes: column (or columnar) data stores / data warehouses. Of course, this does not mean that you cannot change data once it’s stored; it just means that changing it comes with a big performance hit (i.e. it is not what the tool is built for). Note that for columnar data stores there are a lot of strategies for selectively deleting old data without denting performance (e.g. destroying data which is older than 24 months).
- Strict consistency. There are cases when you want all queries to the data store to receive the exact same result (assuming nothing changed between queries). If you are running your data store on a single layer and on a single node (like, you know, MySQL), this is almost never an issue. If you are running the database on distributed nodes (and all or some of the data is replicated), some nodes may get the updated version later than others, so they might give out different answers than the master node, at least until they get the update. Therefore, you may want the data store to be able to guarantee that all replicas have been updated before confirming the change (consensus) or that at least a given number of replicas (2, 3, n/2 or n/2+1) received the update (quorum).
- Data freshness/staleness: How fresh do you expect the data to be? In order to scale, you may want to maintain copies of the master data. This means that when something is added or when something changes, it takes some time for all the copies (replicas) to be updated. Is this acceptable? And if so, would 10 ms be OK, would you be OK with 1 second? Or would even 1-5 minutes be satisfactory? For instance, when reading and writing banking transactions, any sort of staleness is unacceptable (since it can raise risks of double spending). However, if running a content site, having an article or a picture refresh from the user’s perspective 5 or 10 seconds after it was updated by the content manager is pretty much OK. Going further, if you create a sales report for the last 6 months, it may even be acceptable that the data from the last hour (or even the last day) is not included (or is not guaranteed to be entirely accurate).
- Transactional ACID compliance. ACID stands for Atomicity, Consistency, Isolation, Durability. It basically refers to the fact that transactions (groups of separate changes) either succeed together or fail together (while preserving the previous state in case of failure). This might be needed for bank statements, customer orders and online payments, but transactional compliance is most probably NOT needed for reporting, content delivery, ad delivery and tracking analytics.
- Query accuracy. For certain analytics tasks (e.g. the number of unique users), especially for real-time queries, having the absolute exact value is not an absolute necessity. If having a 1-2% error is acceptable, you can consider using sketch techniques for approximate query processing, which make your systems run faster, with fewer resources and lower costs, while only guaranteeing the results within a specified error threshold (of course, less error means more processing, which means more time and more resources). Simple examples of approximate query processing include linear counters and LogLog counters. These methods of doing fast estimates for problems which are expensive to evaluate accurately rely on the less popular probabilistic data structures. Data stores usually don’t have built-in support for this, but you can implement it in the application layer to make your life a lot easier when precision is not mandatory.
- Persistence and durability. Do you want to keep the data when adding, removing or replacing a node, or in case of an application restart or power failure? In most cases, the answer is “of course I do! what are you, crazy?!”, but there are some use cases (such as caching or periodic recomputing) where wiping out the whole data store in case of node failure, cluster failure or maintenance work is acceptable. Imagine a memcached cluster that is used to cache database query results for up to 1 minute (i.e. to prevent congestion on the underlying database). In this case, wiping out the cache, starting from scratch and then refilling it (known as cache warming) is acceptable, as it would only entail a small performance degradation for about 10 minutes (this negative effect can be further reduced by performing the cache wiping during maintenance hours, i.e. during the night).
- High availability (fault tolerance or partition tolerance). In some cases, it is important that a data store is never, never down (well, almost). Nobody likes downtime, but in some cases it’s more acceptable than in others (the way you can assess this is by looking at the business impact per hour: revenue loss and legal risk, you know, like people not paying or suing to ask for their money back plus damages). Assuming you are under strict (maybe even legal) high availability requirements, you want to make sure that your data store can take a hit or two; that is, it can go on functioning even if a few nodes go down. This way, you can reduce the probability of the data store failing if a node fails and you can make sure that the service(s) it provides or supports do not suffer interruptions while you repair or replace the damaged node. So if you truly need to offer such a guarantee, make sure you go for a data store which is fault tolerant.
- Note: As an exercise, try to compute the average failure rate of a cluster composed of three fully redundant nodes (they all store the same data), assuming each individual node has a failure rate of 1% per year (in the first year of operations) and that failures are isolated (i.e. not accounting for failures that affect all nodes simultaneously, like power outages or a meteor hitting your data center).
- Backups and disaster recovery. Sooooo, you remember I mentioned a meteor hitting your data center? Yeah, it just hit your data center. Head on, full on. There’s nothing left. Every bit wiped out of existence in 2.47 seconds. Do you want to be prepared for this scenario? If so, add backup and disaster recovery to your data store’s requirements. Remember that for a data store to be considered disaster recoverable, it needs to have an exact/almost exact replica in a geographically separate data center (different continent, different country, different city). Furthermore, you may even require that the replica is hot-swappable (passive backup) or load-balanced (active backup) with the master version, so that in case of disaster the downtime is non-existent or minimal.
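The linear counter mentioned under query accuracy can be sketched in a few lines at the application layer. This is a minimal illustration, not a production implementation: the class name `LinearCounter`, the bitmap size and the choice of BLAKE2b as hash function are all assumptions made for the example. Each item is hashed to one bit of a fixed-size bitmap, and the number of distinct items is estimated from the fraction of bits that are still zero.

```python
import hashlib
import math


class LinearCounter:
    """Approximate distinct counter backed by a fixed-size bitmap (illustrative sketch)."""

    def __init__(self, num_bits: int = 1 << 16) -> None:
        # num_bits is an assumed default; a larger bitmap gives a lower error.
        self.num_bits = num_bits
        self.bitmap = bytearray((num_bits + 7) // 8)

    def add(self, item: str) -> None:
        # Hash the item to a bit position and set that bit.
        digest = hashlib.blake2b(item.encode("utf-8"), digest_size=8).digest()
        position = int.from_bytes(digest, "big") % self.num_bits
        self.bitmap[position // 8] |= 1 << (position % 8)

    def estimate(self) -> float:
        # Linear counting estimate: n ~ -m * ln(zero_bits / m).
        set_bits = sum(bin(byte).count("1") for byte in self.bitmap)
        zero_bits = self.num_bits - set_bits
        if zero_bits == 0:
            return float("inf")  # bitmap saturated, it was sized too small
        return -self.num_bits * math.log(zero_bits / self.num_bits)


if __name__ == "__main__":
    counter = LinearCounter()
    for i in range(50_000):
        counter.add(f"user-{i % 10_000}")  # 10,000 distinct users, each seen 5 times
    print(round(counter.estimate()))
```

The memory footprint stays fixed (8 KB for the default bitmap here) no matter how many events are fed in, which is the trade-off the bullet describes: a small, bounded error in exchange for far fewer resources than keeping an exact set of user IDs.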
There are other non-technical constraints which you need to keep in mind, as some of them might prove to be show-stoppers for some of the candidate data stores you will consider:
- Infrastructure preference: on-premise / private cloud, public cloud / SaaS (software-as-a-service).
- Capital expenses (up-front investment).
- Operational expenses (recurring costs).
- Deployment complexity.
- Operating/maintenance complexity.
- Team’s knowledge/willingness and opportunity to expand that knowledge.
- Licensing concerns.
Pick those requirements which apply to your project/scenario and write them down as the header of a table.
That table will become the compliance matrix for your candidate solutions, which we will use and evaluate in the next article.
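As a rough illustration, the compliance matrix can be kept as a plain data structure long before any tooling is involved. In the sketch below, the requirement names, the candidate names (`store_a`, `store_b`, `store_c`) and the helper functions are all hypothetical; `None` marks a cell that still needs research, which is an assumption of this example.

```python
from typing import Dict, List, Optional

# Header of the table: the requirements picked for this (hypothetical) project.
REQUIREMENTS: List[str] = [
    "10TB volumetry",
    "1K queries/second",
    "strict consistency",
    "public cloud",
]

# One row per candidate: True = compliant, False = not compliant,
# None = not known yet, further research needed.
MATRIX: Dict[str, Dict[str, Optional[bool]]] = {
    "store_a": {"10TB volumetry": True, "1K queries/second": True,
                "strict consistency": True, "public cloud": True},
    "store_b": {"10TB volumetry": True, "1K queries/second": True,
                "strict consistency": False, "public cloud": True},
    "store_c": {"10TB volumetry": True, "1K queries/second": True,
                "strict consistency": True, "public cloud": None},
}


def fully_compliant(matrix: Dict[str, Dict[str, Optional[bool]]],
                    requirements: List[str]) -> List[str]:
    """Return the candidates that are known to satisfy every requirement."""
    return [name for name, row in matrix.items()
            if all(row.get(req) is True for req in requirements)]


def needs_research(matrix: Dict[str, Dict[str, Optional[bool]]],
                   requirements: List[str]) -> Dict[str, List[str]]:
    """Return, per candidate, the requirements whose answer is still unknown."""
    gaps: Dict[str, List[str]] = {}
    for name, row in matrix.items():
        unknown = [req for req in requirements if row.get(req) is None]
        if unknown:
            gaps[name] = unknown
    return gaps


if __name__ == "__main__":
    print(fully_compliant(MATRIX, REQUIREMENTS))
    print(needs_research(MATRIX, REQUIREMENTS))
```

Treating an unknown cell as "not yet compliant" keeps the short list conservative: a candidate only makes it through once every requirement has been positively confirmed.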
When proposing a data store solution, just going with the flow (or hype, for that matter) is not a very safe approach. As a solutions architect, one needs to make sure one has a clear overview of the usage scenarios and business needs served by the data store, research and inventory potential candidates, benchmark the fully compliant ones, examine the results and then make an objective, argument-bound proposal.
Just saying “we’ll implement a cutting edge NoSQL data store” might earn you extra points in front of stakeholders at first, but it is clearly not enough to deliver a mature, robust solution, which is fit for use and fit for purpose and which the development and operations teams can feel in control of.
Let’s start with the diagram below.
- Analyze usage and load scenarios and general requirements for the data store. At this stage you list the features and capabilities you wish to have from your data store, such as scalability (partition tolerance), responsiveness (fast queries) or indexing. You should NOT put items on this list just because “it’s good to have it there”, “all the cool kids have it”, “I heard it’s important” or “I read about it in a magazine”. You should ONLY put items on this list because they serve a business purpose and a real usage scenario. You can mark each requirement as “mandatory” (must have) or “optional” (nice-to-have). Also, it’s a good idea to note the business impact of not meeting each requirement, as this will make discussions with business resources and stakeholders a lot easier. Finally, make sure you validate your assumptions with your team and with the beneficiary (client, stakeholders). All these requirements will serve as the header of the compliance matrix for candidate data stores. The compliance matrix can contain technical requirements or business requirements (“the data store should be open source” OR “the data store is commercially supported” OR “the data store is offered as a cloud service on infrastructure provider Amazon Web Services/Microsoft Azure/Google Cloud”).
- Rank candidate data stores on the compliance matrix. Do your research and make a list of all the data stores you would like to consider. Typically, this list should contain about 5 candidates; having more than 7 would mean you’re losing focus; having fewer than 3 would mean you’re jumping to conclusions too soon. Go in-depth for each one of the candidates and see if they are compliant with each of the requirements. Mark “I don’t know” where you are not sure or where further research is needed.
- Decision gate: is there any data store which is fully compliant with all of your requirements? This step is critical in your process: either you have found one (but preferably 2-3) candidates for which “the shoe seems to fit” or you need to go back to the drawing board. When you have too many requirements, it typically means you want to use the same solution to solve several problems at once (there are few data stores which are both transactional and analytical at the same time; SAP HANA would be an example and it is not cheap). So what you can do is split your problem into two smaller ones (divide et impera is known in solutions architecture as separation of concerns), which can be solved independently and more efficiently. Only do this split if absolutely necessary. Remember: the more pieces you split into, the more integration work you’ll have to handle.
- Benchmark and execute proof-of-concept. Yeah, it looks great on paper, the open source community thinks it’s great, the vendor says it’s great (especially the account manager). So let’s test it. Pick a scenario. Let’s say 10K transactions per second. Match the number of columns in each record and their approximate types with what you imagine you’ll have in production. It doesn’t have to be an exact replica of the real scenarios. When you’re not sure what you will need, round up the requirement and benchmark something more aggressive: “We might need between 30 and 40 million records per table” translates to “Let’s benchmark it with 100 million”. When executing a benchmark, understand that it is not important to match the functionality of the feature; rather, it is important to match (and outmatch) its aggressiveness in terms of performance. Make 50 concurrent requests from different machines altering the same record. Drop in 100 million records sequentially, read them randomly, delete some of them, and then read randomly again: is there any degradation in performance? Shut down n/2-1 nodes during a load test: is the data store still holding? And if so, with what kind of performance degradation? If you turn one of the dead nodes back on, does it start to take in some of the load? Does it reprovision itself with the data it lost? And so on… Use your imagination when you benchmark, spiced with a pinch of sadism. The main purpose of the benchmark is to confirm that the non-functional requirements of the scenarios you identified are met.
- Rank candidate data stores based on the results of the benchmark. This is the first real world validation of your proposals. It will help you tell apart the candidates which say they’re good from those which actually are good. It will also weed out any invalid assumptions you have made.
- Evaluate operational cost based on benchmark. Bombarding a data store with requests will make you aware of just how much hardware and how many resources you will need for the real life production scenario. Based on the results, you can make a more educated guess about costs. Be sure to put it in writing in your proposal.
- Consult with business resources on cost, benefit, risk. So far, this has been pretty much a technology exercise. Now it’s time to share your findings, inform stakeholders of any potential risk and tell them what kind of invoice they can expect for this, including capital expenses (hardware, licenses) and operational expenses (using cloud PaaS or SaaS, renting infrastructure).
- Split scenarios by type so as to allow the usage of two or more integrated, purpose-specific data stores. Let’s say you want to build a data store which processes transactions with millisecond delay (OLTP), but is also able to produce complex reports on hundreds of millions of transactions in a few seconds (OLAP). While there are few data stores able to do both at once (and they are probably over your budget anyway), what you can do is propose an OLTP solution which periodically (every hour, let’s say) batch-provisions data into the OLAP solution. This way, you can have the best of both worlds, if you are able to accept some delay between them (i.e. the reports will not contain the last hour of data; they will not be real time).
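The "load sequentially, read randomly, delete some, read randomly again" scenario from the benchmark step can be sketched as a small harness. The example below is only a stand-in under loud assumptions: it uses an in-memory SQLite database from the Python standard library in place of a real candidate data store, the table layout and the 30% delete ratio are made up, and it runs in a single process, so it says nothing about concurrent writers or node failures.

```python
import random
import sqlite3
import time
from typing import Dict, List


def timed_random_reads(conn: sqlite3.Connection, ids: List[int],
                       samples: int, rng: random.Random) -> float:
    """Return the average seconds per point lookup over randomly picked ids."""
    picks = [rng.choice(ids) for _ in range(samples)]
    start = time.perf_counter()
    for record_id in picks:
        conn.execute("SELECT payload FROM records WHERE id = ?",
                     (record_id,)).fetchone()
    return (time.perf_counter() - start) / samples


def run_benchmark(num_records: int = 20_000, samples: int = 2_000,
                  seed: int = 42) -> Dict[str, float]:
    """Load, read randomly, delete 30%, read randomly again; report timings."""
    rng = random.Random(seed)
    conn = sqlite3.connect(":memory:")  # stand-in for the candidate data store
    conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, payload TEXT)")

    # Step 1: drop in the records sequentially.
    start = time.perf_counter()
    with conn:
        conn.executemany("INSERT INTO records (id, payload) VALUES (?, ?)",
                         ((i, f"payload-{i}") for i in range(num_records)))
    load_seconds = time.perf_counter() - start

    # Step 2: random reads on the full data set.
    ids = list(range(num_records))
    read_before = timed_random_reads(conn, ids, samples, rng)

    # Step 3: delete a random 30% of the records (assumed ratio).
    doomed = set(rng.sample(ids, num_records * 3 // 10))
    with conn:
        conn.executemany("DELETE FROM records WHERE id = ?",
                         ((i,) for i in doomed))

    # Step 4: random reads again, to compare against step 2.
    survivors = [i for i in ids if i not in doomed]
    read_after = timed_random_reads(conn, survivors, samples, rng)

    remaining = conn.execute("SELECT COUNT(*) FROM records").fetchone()[0]
    conn.close()
    return {
        "load_s": load_seconds,
        "read_before_s": read_before,
        "read_after_s": read_after,
        "records_remaining": remaining,
    }


if __name__ == "__main__":
    for key, value in run_benchmark().items():
        print(f"{key}: {value}")
```

Comparing `read_before_s` with `read_after_s` is the degradation check from the step above. For a real evaluation, the same four steps would be pointed at each short-listed data store, with record counts rounded up to the aggressive figures discussed earlier.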
It might seem like a bit of overkill, but following this process will make sure you don’t end up with loose ends and with things you discover 2 weeks before or 3 months after the go-live of the final project.
In the following articles, I will drill down into the first step (defining requirements) and the second step (ranking candidate data stores).
Before deciding to transition the apps that your company builds to Big Data / NoSQL solutions, there are a few things you need to understand:
- The CAP theorem, which states that a distributed system cannot be strictly consistent, highly available and partition tolerant (fault tolerant) at the same time. Figure out what you need first. Tip: you might need several separate data stores for different purposes.
- “NoSQL” is just a marketing buzzword; it is not a concrete solution. There are several types of non-relational and scalable data stores which are labeled NoSQL, although they are very different in capability and performance.
- There is no silver bullet. “One [data store] to rule them all” is something that only a Lord of the Rings fan would believe; and even they (most of them, anyway) know it’s fiction.
- There is no free lunch (or “there ain’t no such thing as a free lunch”), which means that a data store will perform wonderfully under the conditions for which it was engineered and may be a disaster in other scenarios. It’s your responsibility to pick the right tool for the job.
- Don’t do it just because it’s cool. Technology must serve a practical, objective-bound, business purpose. “Our company has to transition to Big Data (because everybody else is doing it)” does NOT constitute a valid reason.
- Is your data really that BIG? Rule of thumb: if you don’t have at least 1TB of data, you don’t really need big data. We all like to think that our department deploys and manages big data, we all like to think that our company needs big data. You want to be one of the cool kids who are riding high on the new big data wave. But give serious thought to whether you actually are. Before you jump in the Big Data pool, you might want to check out current and future data storage needs (are they really growing that fast?) and ways to improve the performance of your current MySQL solution (Google the following: “master-slave replication”, “query result caching”, “memcached query caching”, “database partitions” and “sharding”, and see if any ideas light up). Also, you might want to consider a hardware upgrade (servers with SSD drives can do magic, I’m told).
- Performance, capability and low cost: pick two. You can’t have all (see “There is no free lunch” above). Maybe you are a small organization which is not that data intensive. Maybe you need all the query flexibility of SQL and don’t have a huge budget to get into a data warehouse BI solution. Understand your business needs, priorities and budget before you start blurting out words like “NoSQL”, “big data”, “lambda architecture”, “unlimited scalability” and “data driven business”.
- Training and support. Fine, let’s say you build the goddamn thing. It works. Passes all the tests. Goes live. The business cheers, the tech guys cheer, everyone’s happy. The OPS/DEVOPS/infrastructure guys: maybe not so much. You see, knowledge of MySQL and Tomcat is ubiquitous, so if you run into a production problem, either the team has the experience or Google and StackOverflow have a lot of things that can help. However, you won’t find a lot of 10-step tutorials on how to recover from a multiple Hadoop (HDFS) node failure that occurs during an HBase compaction. For that, you need to make sure your team is either well trained (unlikely if you’re just adopting this tech stack in the company) or that you at least have a satisfactory level of support (with SLAs, not just best effort) from your software vendor, from your service provider or from a third party (that specializes in support for open source).
- Not paying up-front ends up being more expensive over time. Every business guy is super-excited that all this big data magic is free, right? Cause it’s open source, right? I’m not going to get into the “free speech vs. free beer” argument. I’m just saying that if you factor in loss of revenue due to downtime, plus maintenance, operation and support costs, an open source solution might end up being a lot more expensive than paying for licensing, training and support. Whoever says too easily that using open source is cheaper clearly doesn’t understand the concept of TCO (Total Cost of Ownership). Make sure your team either has the knowledge and the practical experience of managing the solution you adopt or that you have a solution or support vendor with acceptable SLAs.
- Do your homework, stay in control, don’t buy the bullshit. Big Data is not a solution to all your problems. It won’t make your business bloom overnight. And it’s a lot of knowledge for the technical team to take in. “Transitioning to big data is a key objective for our company. That’s why we hired this big data consultant.” Congrats, you just hired a guy who doesn’t know your apps, your business processes or your team and who is probably charging you $400-2000/day for Googling “how to install HBase on my laptop”. Great investment, much successes.
Big Data, scalable data stores and cloud infrastructure are no longer an “if” for IT; it’s just a matter of “when?”. All I’m saying is that for your business the answer might be “not this year”. And I’m also saying that if the answer is “right now!”, you should make sure you cover all the angles laid out above.
On a less serious note, you can always check out NoSQLBane for some consistency and fault tolerance humor. And for a mix of distributed computing insight and stand up comedy, do watch James Mickens’ talk on big data, NoSQL, cloud, virtual infrastructure and bullshit.
Traditionally, management theory has been based on hierarchy, structure and delegation of activities. However, in today’s business landscape, which is increasingly marked by flows of information and changing processes, there is a growing gap between power of knowledge and power of decision. In other words, the organizational (hierarchical) distance between the point where the relevant information is available and the point where such information is used to make a decision is large enough for relevant information to be lost.
Of course, old-school managers will tell you that as long as reporting lines are defined and KPIs/objectives are cascaded correctly, there is no problem in delegating efficiently. That would be true, except that in a business world that increasingly revolves around technology and services, “information” isn’t only about predefined metrics that are pivoted and rolled up in spreadsheets and reports. The information has become the change that happens to those spreadsheets and reports.
Ultimately, the gap between knowledge and decision affects an organization to the extent to which changes to the business processes become business as usual (i.e. a regular activity, that occurs more than once during a financial lifecycle).
Let us take an example
In a classical business world, the reporting format from the bottom up (sold units, best selling items, items with the best margin) and the decision format from the top down (targets, commissioning scheme) are pretty straightforward. But let’s imagine the following:
- Christine decides to partner up with a local partner/affiliate, which directly impacts the commissioning scheme and sales volume
- Blaine would also like to leverage online lead generation to boost sales, which impacts cost and customer visibility
- Claire thinks she could improve sales volume by engaging in a profit-sharing scheme by partnering up with a local services provider
Considering all of that, Jim has to prioritize the strategy for next year. To him, all projects seem like a good idea, because ultimately all of them boost sales and customer visibility. And to Jim, it’s all about the bottom line.
Furthermore, let’s assume there is enough time and budget to do both projects in every particular region. And even if Blaine’s idea (online lead generation) could be easily implemented across the other two regions, the affiliate and profit sharing schemes Christine and Claire have in the pipeline are pretty particular to their respective regions.
What none of them imagines is that implementing these projects (with end dates at several times throughout the year) will impact the reporting structure. You can’t directly compare in-house sales with affiliate sales. And online lead generation also involves additional cost, not just additional sales. All of these non-uniform changes in the reporting structure will also change the way budgeting is done for the following year and Jim has to take all that into account.
Moreover, Carla, Alma and Blythe, the persons in charge of the three respective projects in each region, have a deeper understanding of the details of each project, but they are not aware of each other’s projects and so cannot foresee the impact the projects will have on each other.
You cannot drive a car by just looking at speed and fuel gauge
However obvious that might be for cars, a lot of companies are governed by just looking at profit and capital / operational expenses. Even though budgets and project priorities are blurted out in endless Excel sheets, few companies have a truly process-centric approach that allows them to see how different processes and lines of business influence each other.
The truth is that a lot of people in management positions think, speak and act in lists. But in a dynamic business landscape where automation is business as usual, lists just don’t cut it anymore. Relationships (between projects, processes, features and requirements) are graphs, and mappings between changing entities are hash tables.
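To make the "graphs, not lists" point concrete, here is a minimal sketch in Python. The project names follow the example above, but the reporting areas assigned to each project are my own assumptions, chosen only for illustration. A hash table maps each project to the areas it touches, and the graph edges are the pairs of projects that touch the same area.

```python
from itertools import combinations

# Hypothetical mapping (a hash table): project -> reporting areas it touches.
# The areas are illustrative assumptions, not data from a real company.
IMPACTS = {
    "affiliate_scheme": {"sales_reporting", "budgeting"},
    "online_lead_generation": {"sales_reporting", "cost_reporting", "budgeting"},
    "profit_sharing": {"sales_reporting", "budgeting"},
}


def interacting_projects(impacts):
    """Return graph edges: pairs of projects mapped to the areas they share."""
    edges = {}
    for first, second in combinations(sorted(impacts), 2):
        shared = impacts[first] & impacts[second]
        if shared:
            edges[(first, second)] = shared
    return edges


if __name__ == "__main__":
    for pair, shared in interacting_projects(IMPACTS).items():
        print(pair, "interact through", sorted(shared))
```

A plain list of three projects hides these edges. The graph makes it visible that every project interacts with every other one through reporting and budgeting.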
One thing that cannot be automated is the process of changing processes
A lot of the work that can be automated (rolling up sales reports, balancing accounts and taking orders) has been or will soon be automated. There is less and less room for workers who execute a simple process, day in and day out. Which means that the workload itself tends to become increasingly unstructured, highly variable and less predictable. One of the things that cannot be automated (at least for now, if you believe Searle’s Chinese Room argument) is the process of changing other processes in order to achieve certain objectives.
This means that the knowledge work is less about punching in numbers while being on the phone and more about exploring implications, ramifications and impact of change to the work that is already being done (for the most part by machines). However, the workforce is largely unprepared for this mindset, and so is management.
For instance, the worker may not be willing or prepared to propose a change to a process that might (in the short term) negatively impact his/her KPIs, objectives or personal revenue. The manager or the executive, on the other hand, might have a zero-risk policy.
Global and local optimization
Let’s say you have a company with three departments: sales, tech and operations. The hierarchical structure present in most companies encourages the three respective managers/VPs of sales, tech and operations to seek the optimum for their silo/department. The fact of the matter is that seeking local optima (what’s good for my department) may often yield a strategy that is deeply sub-optimal for the organization. Of course, it would be the job of the CEO to balance the view and build a global optimum from the local optima, but the reality is that s/he often lacks the information required: partly because it was filtered out at a lower level as “not relevant to our department”, partly because s/he doesn’t have the patience to challenge things at a lower level. By taking the safe path towards local objectives, global objectives can be missed at a higher level.
In programming, choosing the solution that seems to best fit locally and/or in the short term is called a greedy algorithm. Although it might work for simple problems, which model linear relationships under certainty and with simple restrictions, it may often produce deeply sub-optimal results on harder ones. You see, this class of algorithms is not called “greedy” by chance: such algorithms seek immediate maximization of the outcome/benefit/revenue. Which brings me to my next point …
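Here is a small sketch of that difference in Python. The payoff numbers are made up for illustration: a "quick win" pays well now, while "invest" pays little now but makes the next period much more profitable. The greedy plan picks the best payoff at each step, while the exhaustive plan looks at the total over both periods.

```python
# Hypothetical payoffs, invented for illustration only.
# FIRST: payoff of each choice in period 1.
FIRST = {"quick_win": 5, "invest": 1}
# SECOND[previous_choice][choice]: payoff in period 2 given the period 1 choice.
SECOND = {
    "quick_win": {"quick_win": 5, "invest": 1},
    "invest": {"quick_win": 12, "invest": 3},
}


def greedy_plan():
    """Pick the best immediate payoff at every step."""
    first = max(FIRST, key=FIRST.get)
    second = max(SECOND[first], key=SECOND[first].get)
    return (first, second), FIRST[first] + SECOND[first][second]


def best_plan():
    """Evaluate every two-step plan and keep the one with the best total."""
    candidates = (
        ((first, second), FIRST[first] + SECOND[first][second])
        for first in FIRST
        for second in SECOND[first]
    )
    return max(candidates, key=lambda item: item[1])


if __name__ == "__main__":
    print("greedy:", greedy_plan())
    print("global:", best_plan())
```

With these numbers the greedy plan totals 10, while investing first and cashing in later totals 13. Optimizing each period separately leaves value on the table.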
Global optima take time to achieve. Which is just the opposite of the current business landscape which seeks immediate gratification. Bigger stock price, bigger sales, bigger bonus. When? By the end of the financial year! Heck, let’s have it this quarter – as a stretch target. We are all greedy. We want pay-offs now. The promotion, the raise, the stock price increase. We put pressure on ourselves, on our peers, on our direct reports. We put pressure to achieve things now. And we keep ignoring the complexity, the impact and ultimately the fact that achieving the best possible outcome every week of the year is not the same as achieving the best possible outcome this year.
We tell ourselves that achieving the best in each department every month will make the company achieve the best this year or for the next three years. And that might have been true when labor was manual and the market and processes and the opportunities changed infrequently. But that is no longer true.
As our world is getting more complex, uncertain, riddled with change and illusory local optima, we are becoming increasingly like the kids in the Stanford Marshmallow Experiment: surrounded by temptation to get our “fix” now and depleted of the discipline to seek long term goals.
People who have the information don’t get to decide; people who decide don’t always have the relevant information
You’ll probably think that if relevant objectives are cascaded from executive to management to worker, nothing can go wrong. Right?
Well, that used to be right. But nowadays, the complexity and inter-dependency of processes (especially automated ones) put the knowledge worker in the position to be the only one to spot or define what is “relevant” in some cases. As you’d expect, “cascading” this information upwards goes against the flow and oftentimes gets a lot of resistance – especially if you have to get through 7 layers of red tape until you can get to someone who has the authority to make a change.
Even if the hands-on guy at the bottom of the food chain who spotted a problem in the process or an opportunity for improvement somehow manages to get his point across to his manager’s manager’s manager, this will have taken 3 months. It will take another 6 months of meetings with people who have no knowledge or competency in the matter to get the project pushed through, budgeted, approved and scheduled. Most of this red tape will not improve the original idea, but it will riddle it with compromise. The guy on top won’t be willing to vouch for the idea (even if it’s a good one) out of fear of alienating his other direct reports.
In a shifting business context, the core idea of relevance (the “key” in Key Performance Indicator) is one that requires effort and input from throughout the organization. And most organizations are still severely top-down.
Flatten or shard: why hierarchy is dead
We used machines to speed up our processes, to scale them to high volumes of decisions and events, to make them more reliable and cheaper. We did this to such an extent that the bottleneck in organizations has become people’s ability to understand, plan and follow-up on change. Part of that is because our educational system still embeds our minds with the “assembly line” mentality; the other part is that both workers and managers prefer short-term (and short-sighted) gains and a risk-averse attitude.
The modern workplace needs to extend its mentality toolbox and means of interaction beyond lists and tables (towards charts, graphs, analytics and more scientifically founded decisions) to deal with increasing uncertainty and complexity.
Some ideas of improvement may include:
- Removing unnecessary overhead and flattening organizational structures.
- Rotating people between similar positions before promoting them.
- Creating cross-functional knowledge roles rather than cross-functional management roles.
- Making sure managers have hands-on experience.
There are two main trade-offs between flattening hierarchy (reducing subordination) and sharding business lines:
- Flattening reduces overhead, but it may also blur accountability
- Sharding clarifies boundaries, but reduces opportunities for cooperation and creativity
Ultimately, organizations have a choice between reduced risk and increased cooperation/innovation/creativity. And in today’s landscape, there is less and less of a clear recipe.
Instead of a conclusion
Organizations face the great challenge of transitioning from traditional hierarchical/command-and-control setups to flatter or matrix-like structures. In this transition, confusion is the highest risk. So ultimately, the best tool for keeping things under control is keeping organizational process knowledge closer to the point of decision, not only towards the place of execution.
Delegation works great for activities, but it fails miserably for knowledge tasks.
This summer I decided it was high time for slimming down, mostly because I got tired from the simplest things – like going up the stairs for 4 floors. So in a way worthy of a project manager with an engineering background, I set an objective, I made a plan and I started tracking the metrics. After all …
You cannot manage what you do not measure.
Start weight: 108 kg
Target weight: 93 kg
Delta: 15 kg
Budgeted time: 4 months
Targeted loss/month: about 4 kg/month (15 kg / 4 months = 3.75 kg)
Below, you can find the charts. I did not use any real time apps, or gadgets or wearables. I’m old fashioned like that: analog scale and Google Spreadsheets.
Above: real measured weight is painted in blue, while the (linearly) planned target weight is painted in red.
Above: the “ahead-of-plan” metric (also called “buffer”), measured as planned weight minus real measured weight.
Note that the points in the chart are not equally-spaced.
Above: the average daily loss. Note that the points in the chart are not equally-spaced.
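For the curious, the three metrics in the charts can be sketched in a few lines of Python. This is not the spreadsheet I actually used, just a minimal reconstruction under two assumptions: "4 months" is taken as 120 days, and the function names are made up for this sketch. Because the measurement points are not equally spaced, the daily loss is divided by the actual number of days between two measurements.

```python
from datetime import date

START_WEIGHT = 108.0   # kg, from the plan above
TARGET_WEIGHT = 93.0   # kg, from the plan above
PLAN_DAYS = 120        # assumption: "4 months" taken as 120 days


def planned_weight(days_elapsed):
    """Linearly planned weight after a given number of days."""
    lost = (START_WEIGHT - TARGET_WEIGHT) * days_elapsed / PLAN_DAYS
    return START_WEIGHT - lost


def buffer(days_elapsed, measured_weight):
    """Ahead-of-plan metric: planned weight minus real measured weight."""
    return planned_weight(days_elapsed) - measured_weight


def average_daily_loss(measurements):
    """Average loss per day between consecutive (date, kg) measurements.

    The measurements must be sorted by date; gaps may be unequal.
    """
    rates = []
    for (day0, kg0), (day1, kg1) in zip(measurements, measurements[1:]):
        rates.append((kg0 - kg1) / (day1 - day0).days)
    return rates


if __name__ == "__main__":
    # Made-up sample measurements, 7 and then 10 days apart.
    sample = [
        (date(2000, 1, 1), 108.0),
        (date(2000, 1, 8), 106.6),
        (date(2000, 1, 18), 105.6),
    ]
    print(average_daily_loss(sample))
    print(buffer(40, 102.0))
```

A positive buffer means being ahead of plan, a negative one means being behind.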
My conclusion from this 3 month+ experience:
- Measuring relatively often keeps you focused, as it allows you to reinforce a concrete, practical small target every few days (or once a week), rather than a big monthly target.
- No matter how disciplined one is, weight loss does not occur at a constant pace. Some weeks you exceed your target, some weeks you miss it. See the average daily variation chart.
- Weekly targets don’t matter that much on the long term, but they can motivate and drive your actions and choices (i.e. salad instead of pizza, orange squeeze instead of Cola) on the short term. Missing a target every once in a while is good if and only if it motivates you.
- Don’t obsess over daily micro-measurement. Some days you are better hydrated before you measure yourself and some days you are less so. Therefore, it can seem you suddenly gained 1 kg, when there is no actual change. Always assume there is an inherent daily “noise” in your measurement which evens out on the long term (weeks, months). To even out the noise, try to do all measurement at the same time of day, using the same scale. Note: I have not kept daily measurement, so that noise is not visible on my charts.
All in all, now I feel much better. And I’m very proud of my analytics.
One of my personal projects which had been put on the back burner for too long was publishing a book on Amazon. The somewhat controversial subject is a process-modelling engineer’s view of dating: what metrics to track, what behaviors to expect, what rules of thumb to apply and what action to take.
I wanted to avoid the whole how-to-get-laid-tonight approach altogether, as I believe in dating, as in other aspects of life, a balanced view is preferable to immediate results based on fake assumptions. Under the disclaimer that the stuff presented in the book should not be used as a recipe or as an ultimate truth, I aimed to give the readers some tips, a few new perspectives and some good laughs.
I leave you with an excerpt and a link to the book.
It is a fun and efficient way of displaying confidence without being (too) annoying. The
method is based on assuming the answer to “Would you like us to go out?” is a “yes”. So
instead of just asking it plain and simple, make it sound like
“So, are we going out on Friday or on Saturday?”
Rephrasing the question like this makes you seem sure you’ll get a positive answer. And, like
all self-fulfilling prophecies, they increase your odds. Make no confusion: looking confident
doesn’t eliminate the possibility of being turned down. Just have a backup for the negative
answer, go with something like “I had no idea you were that busy” or “Excuse me, I had no
intention of impeding your academic/professional endeavors”.
Enjoy and do share your thoughts.
The cloud is all the hype now, but most people don’t keep in mind that it is not a silver bullet or a panacea. The cloud (or, more technically speaking, the use of large-scale virtualization) is recommended for specific use cases. In this article, we’re going to go through the use cases for which the cloud is recommended. In the final chapter, I am going to emphasize some of the caveats, traps and risks associated with transitioning to the cloud.
Whether we realize it or not, we are already users of cloud services, every time we sign into Gmail, Picasa, Facebook, LinkedIn, Office 365 or Dropbox. This is the most frequent and wide-spread use case for the cloud. For the average consumer, freelancer, small and medium company it does not make sense to invest in and directly operate services such as:
- Email (Gmail for Business, Office 365)
- File Storage (Google Drive, Dropbox)
- Document Management (Gmail for Business, Confluence, Office 365)
Because of their scale, the providers of such services can afford to invest much more knowledge and effort into operating a reliable infrastructure. In other words, if developing service X (let’s say email) would cost $10 million, it would not make sense for a company with 20 employees, but it would make sense for a service provider which has 100,000 such companies as its customers. The advantages of using such cloud services revolve around:
- No up-front cost (like the cost of owning hardware)
- Predictable operating cost (for instance, Google Drive sells 1 TB for $120/year)
- Less downtime and better quality of the service
- Better data reliability (as big service providers can afford to store the same data in 3 or more places at any one time)
However, one also has to be aware of the risks and down-sides of such services:
- No Internet connection can mean no access to data or to the service. Although this risk can be partially mitigated by having local snapshots of the data, it is worth considering. Nevertheless, in today’s connected business, no Internet often means no business
- No physical ownership of the data. From an operating perspective this is rarely a problem, but this is worth considering from a legal and business continuity perspective.
- Potentially slower access to the data, especially for large files (studio size images, videos), as the Internet is still considerably slower than a local network.
To facilitate the transition from an on-site service to a cloud service, it is always a good idea to do a pilot program with a small team or for a smaller project, so as to have the chance to smooth out any bumps in the road with minimal business impact.
You might get an average traffic of 1 million page views/hour most of the year, but you might spike to somewhere between 10 and 15 million page views/hour during 5 or 10 days of the year. This is especially common for e-commerce sites around the holidays (winter, spring). The non-cloud solutions would entail either over-scaling your infrastructure just to cope with those 5-10 days of traffic spikes or settling for unsatisfactory performance during the most profitable time of the year. A cloud solution would allow you to grow or shrink your infrastructure depending on these needs, in one-hour steps.
Let’s say you can handle the “normal” traffic with 4 web servers and 4 application servers. Simple math would dictate you would need somewhere between 40-60 web servers and 40-60 application servers to handle the peak load (the exact number would depend on application type, your business process and average machine load during “normal” traffic). If you were to host this infrastructure on-site, roughly 90% of your cost would be wasted during the 355 days of the year when you don’t need the extra juice. What a cloud provider does is allow you to only activate and pay for this extra infrastructure when you need it.
In the case where spikes are expected to have a yearly seasonality, it is reasonable to rely on cloud services only during that time of the year. However, many services – such as content streaming – may have a daily seasonality (high traffic during the evenings). The cases where the exhibited traffic seasonality is finer-grained (daily) may require moving the entire web serving solution into the cloud.
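The “simple math” above can be sketched as follows. This is a back-of-the-envelope model in Python, assuming that capacity scales linearly with traffic (a simplification, as real applications rarely scale perfectly). The function names are mine, chosen for illustration.

```python
import math


def servers_needed(baseline_servers, baseline_traffic, peak_traffic):
    """Servers required at peak, assuming capacity scales linearly with traffic."""
    return math.ceil(baseline_servers * peak_traffic / baseline_traffic)


def idle_fraction(baseline_servers, peak_servers):
    """Share of a peak-sized fleet that sits idle during normal traffic."""
    return 1 - baseline_servers / peak_servers


def yearly_server_days(baseline_servers, peak_servers, peak_days, elastic):
    """Server-days paid for in a year, with or without elastic scaling."""
    if elastic:
        normal_days = 365 - peak_days
        return baseline_servers * normal_days + peak_servers * peak_days
    return peak_servers * 365


if __name__ == "__main__":
    low = servers_needed(4, 1_000_000, 10_000_000)
    high = servers_needed(4, 1_000_000, 15_000_000)
    print("peak servers:", low, "to", high)
    print("idle share off-peak:", idle_fraction(4, low))
    print("fixed fleet:", yearly_server_days(4, low, 10, elastic=False))
    print("elastic fleet:", yearly_server_days(4, low, 10, elastic=True))
```

With 4 baseline servers, a 10x peak and 10 peak days, the fixed fleet pays for 14,600 server-days a year, while the elastic one pays for 1,820. Cloud machines usually cost more per server-day than owned hardware, so the real saving is smaller than this ratio suggests.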
Once-in-a-while High Volume Data Crunching
For irregular, high volume workloads, the cloud is also your friend. If every three months you need to crunch 100 TB of data in one big scoop (in 12-48 hours), there’s no reason for you to keep and pay for the required infrastructure for the entire three months. This use case is common in research, where workloads are not periodic and tend to be intensive. Another use case would be reporting. However, you need to keep in mind that large volume reporting workloads (quarterly, yearly reports) can be split into smaller workloads (hourly/daily/weekly roll-ups), which can then be joined/summed up together fairly quickly even on a smaller infrastructure. This way, the effort is split up over time and the quarterly/yearly spikes at the reporting date can be a lot smaller.
Top three vendors:
- Amazon Web Services
- Microsoft Azure
- Google Cloud Platform
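The roll-up idea from the data crunching paragraph above can be sketched in Python as follows. This is a minimal illustration with hypothetical function names and made-up event data, not a reporting framework: each day’s raw events are reduced to one total when they arrive, and the quarterly report only has to add up the small daily totals.

```python
from collections import defaultdict


def daily_rollup(events):
    """Reduce raw (day, amount) events to one total per day."""
    totals = defaultdict(float)
    for day, amount in events:
        totals[day] += amount
    return dict(totals)


def merge_rollups(rollups):
    """Combine several roll-ups (e.g. one per batch) into a single one."""
    combined = defaultdict(float)
    for rollup in rollups:
        for day, total in rollup.items():
            combined[day] += total
    return dict(combined)


def period_total(rollup, days):
    """Report for a period: sum the pre-computed daily totals."""
    return sum(rollup.get(day, 0.0) for day in days)


if __name__ == "__main__":
    # Made-up events, processed in two small batches instead of one big scoop.
    batch_1 = daily_rollup([("day-1", 10.0), ("day-1", 5.0), ("day-2", 7.0)])
    batch_2 = daily_rollup([("day-2", 3.0), ("day-3", 20.0)])
    report = merge_rollups([batch_1, batch_2])
    print(period_total(report, ["day-1", "day-2", "day-3"]))
```

This only works for metrics that can be summed or merged (totals, counts, minima, maxima). Metrics such as distinct user counts or medians need more care.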
Static Assets, Geographic Spread and Content Delivery Networks
It just might be the case that a US-based company has a lot of users from Europe or from Asia. In this case, it is highly advisable to use a Content Delivery Network (CDN). A CDN customarily delivers to end-users the static assets from your site (stuff that rarely changes and that is the same regardless of the user): images, CSS files, JS files, videos. The CDN is basically a network of servers (called nodes or edges) around the world which copies the static content from your website (even if your website is NOT cloud-supported) and then serves this static content many times to different users in its geographic vicinity. This achieves several advantages for your website:
- Offloading your main servers: your main server(s) only have to serve the content a few times (once for each CDN node/edge), while that content would then be served thousands of times to the end users.
- Closer means faster: on average, the edge will be geographically closer to the end-user, which means that the round-trip of the data packets will be faster, which means your site will load faster.
- Many is better: most CDNs allow the different files belonging to the same page to be requested from different domains, which means that the end-user’s browser can request more stuff at the same time, which is yet another source of speed-up.
It is important to understand that your web servers do not need to be virtualized or cloud-hosted in order for you to benefit from a CDN. You can very well set up a CDN over your on-site hosted website. If you have a small website/blog without too many requirements around it, you can very well try a CDN such as CloudFlare for free. They have a tutorial around it (a bit technical, but you’ll live).
Backup and Disaster Recovery
This subject is a touchy one. Cloud storage services usually offer a good-to-great level of reliability/durability for storage (think many 9s after the decimal separator). Amazon S3 promises 99.999999999% (that’s nine nines after the decimal separator). This is far better than what you could possibly achieve on-site, so it makes cloud services an ideal candidate for backing up your data. However, entrusting your data (and customer data, including personally identifying information, credit cards and so on) off-premises may be perceived as too high of a risk or may prove not compliant with security standards for which your company is certified (such as PCI DSS, ISO/IEC 27001). One mitigation for such risk is to encrypt the data with a key you keep private before transmitting it to the cloud service. However, this raises the issue of securely and reliably storing the keys (in at least two geographic locations, in order to achieve disaster recovery capability).
Thus, in the light version of this use case, one can use cloud storage services for periodically backing up data in a reliable/durable way, preferably with a layer of added encryption. However, in case of disaster, retrieving the data from cloud storage and resuming operations can take days or even weeks, which may prove unacceptable for business continuity.
A use case more suitable for more mature companies is to have an up-to-date replica of their entire infrastructure ready for deployment (but not deployed) in the cloud. This cloud infrastructure could be activated in case of disaster so as to handle the load while the on-site infrastructure is being reinstated, bringing customer-facing downtime from days to hours. However, this scenario requires guaranteeing data freshness by more frequent snapshots and several complex disaster recovery tests to ensure the cloud infrastructure would be deployed and function as expected.
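To put the durability figure quoted above into perspective, here is a small Python sketch. It assumes the percentage can be read as the yearly chance that a single stored object survives, which is my interpretation for the purpose of the arithmetic. Exact fractions are used because floating point numbers cannot represent that many nines reliably.

```python
from fractions import Fraction


def annual_loss_probability(durability_percent):
    """Chance of losing one object in a year, given durability as a string."""
    return 1 - Fraction(durability_percent) / 100


def expected_objects_lost_per_year(object_count, durability_percent):
    """Expected number of objects lost per year across a whole archive."""
    return object_count * annual_loss_probability(durability_percent)


if __name__ == "__main__":
    loss = expected_objects_lost_per_year(10_000_000, "99.999999999")
    print("expected losses per year:", loss)
    print("years per lost object:", 1 / loss)
```

Under this reading, an archive of ten million objects would lose, on average, one object every ten thousand years. Note that this says nothing about availability, accidental deletion or losing your own encryption keys.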
As no analysis is complete without also emphasizing the reasons against a certain solution, one should carefully consider the following points when planning to use or move to the cloud:
- Cost. At a large scale (hundreds of machines), having an on-site infrastructure and/or a private cloud (on-site virtualization solution) can become more efficient than renting cloud services. However, most companies don’t reach such a scale. Nevertheless, one should always keep a close eye on cost, as the freedom of expanding and shrinking afforded by cloud providers comes with a price. Also, consider that considerable reductions in cost can be achieved by using reservation plans with cloud providers: making a commitment for a certain usage over 1-3 years in exchange for a reduction in cost.
- Data Ephemerality. Unlike physical machines, virtual machines get terminated (read “disappear”) all the time. By default, if they are not configured to use persistent storage (such as Amazon’s Elastic Block Store or Simple Storage Service), the data on them disappears with them. Make sure your persistent data is actually always stored on persistent media!
- Operational practices. Given the fact that the cloud encourages the use of several (often smaller) virtual machines as opposed to a few large ones, it quickly becomes impractical for the operations team to manually configure each machine. That is, operations needs to focus on automation processes (automatically deploying and configuring machines) and shift away from the “log on to the machine and configure it” view.
- Security. Although cloud providers offer a great array of tools for managing security (firewall, dynamic keys), most system administrators may not be familiar with these tools and best practices. Make sure your team(s) has a good understanding before moving business-critical data and apps to the cloud.
- Performance/Cost. The virtualization associated with cloud computing comes with a performance penalty, the degree of which may vary greatly depending on the type of application you’re using. That isn’t to say you cannot get the same performance from a cloud machine that you can get from a physical machine (dedicated, bare-metal hosting). It just means that you may end up paying more for it. In other words, you might end up getting less bang for your buck. Be sure to benchmark performance on several instance types and make at least a high-level cost projection. Otherwise, you might end up unpleasantly surprising your CFO.
- Legal/regulatory requirements. Several companies (usually enterprise/corporate size) come under legal and regulatory requirements of not sharing customer data with third parties or of not storing/transmitting customer data outside of the country. Be sure to triple-check with said legal and regulatory requirements or to find a technical solution which does not store/transmit customer data to the cloud provider (for instance, using a CDN for serving static assets from sessionless domains would be a good solution for separating cloud-delivered content from in-house stored customer data).
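For the cost point in the list above, a high-level projection of reservation plans can be as simple as the Python sketch below. The hourly rate and the discount are made-up numbers, not actual provider prices, and the model assumes the reservation is paid for the whole year whether or not the machine runs.

```python
HOURS_PER_YEAR = 8760


def on_demand_cost(hourly_rate, hours_used):
    """Pay only for the hours the machine actually runs."""
    return hourly_rate * hours_used


def reserved_cost(hourly_rate, discount_percent):
    """Assumed model: the full year is paid for, at a discounted rate."""
    return hourly_rate * HOURS_PER_YEAR * (100 - discount_percent) / 100


def break_even_hours(discount_percent):
    """Yearly usage above which the reservation becomes cheaper."""
    return HOURS_PER_YEAR * (100 - discount_percent) / 100


if __name__ == "__main__":
    rate = 0.10      # hypothetical price per hour
    discount = 40    # hypothetical reservation discount, in percent
    print("break-even hours/year:", break_even_hours(discount))
    print("reserved, full year:", reserved_cost(rate, discount))
    print("on demand, 2000 hours:", on_demand_cost(rate, 2000))
```

With a 40% discount, the reservation pays off only for machines that run more than 5,256 hours a year (60% of the time). This is why reservations fit the always-on baseline and not the seasonal spikes.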
The key take-away from this post is that cloud services provide a vast array of tools to help in today’s business and technology environment, without however being a one-size-fits-all solution. Either starting up on or transitioning to the cloud requires careful planning and an in-depth understanding of the processes to be implemented as well as of the technology landscape.
As always, any questions, comments and feedback are more than welcome. I’m also open to discussing specific use cases and integration scenarios.
or “The trap of positive thinking and how quitting should be an option”
Have drive, have perseverance. Give 110 percent. Go confidently in the direction of your dreams, because your dreams don’t have an expiry date. Live the life you have imagined. If at first you don’t succeed, try, try and try again.
Does all this motivational stuff sound familiar? About not giving up, about trying harder, about getting up stronger after every punch. Yes, being determined is good for your life and for your career and for your feeling of self-worth. Being focused and relentless helps you get the things you want. But at some point, after the second, or third or tenth attempt at changing something, achieving something – you gotta take a deep breath, stop and ask yourself …
Am I beating a dead horse?
Yeah, are you? Maybe you’re locking in. Maybe you’ve reached the top of a little hill and you’re wondering why you can’t go any higher.
Because regardless whether we’re talking about your job, or your business or your personal projects – there is the off chance that your vision does not match reality. No matter how much you want it, no matter how hard the Universe is conspiring to make your dreams come true, maybe someone else’s dreams are higher priority. Or I dunno, maybe you’re down the wrong path.
I’m just saying that your objectives or your aspirations need a review from time to time, just in case you’re stuck pushing against a dead end. That dead end might be your job – which provides too little opportunity, satisfaction or visibility. Or your business, which is in an industry with zero or negative growth.
The point is that this whole motivational culture actually adds pressure and negative stress to the decisions we make in our career. It makes it shameful to give up, to quit, to admit failure. “How will others see me?” or “What are they going to think?”. The worst part is that “positive thinking” forces us to feel guilt and take responsibility for stuff that isn’t always in our control.
Let’s take an example. You might think the reason why you didn’t get that promotion has to do with your not trying hard enough, not being a team player, not being smart enough, not reaching your objectives. True. Your being awful at what you do is definitely a possibility. Other possibilities include:
- You’re not good at that particular job (although you might be well above average in other jobs)
- You don’t have the same vision as your superior or maybe he just doesn’t like you
- You might not be a good fit in your team
This isn’t saying that you should blame external factors for each and every one of your failures/frustrations. Maybe you should just try something else. Roll the dice out of your comfort zone at least a little bit. The point is to know when to declare failure, when to throw in the towel – without feeling guilty or ashamed.
Of course, this whole post is not about giving up and blaming the Universe. This post is about how choosing the problems you solve is as much your responsibility as actually solving them. You should teach yourself how to tell stuff that’s up to you to change (generally what you read, what you eat, who you hang out with, how much money you spend or save and other habits of yours) from stuff that’s not up to you to change (generally what other people read, eat, who they hang out with, how much money they spend or save and other habits of theirs).
Yes, you should focus on doing one thing (job, project, business) and doing it well. Yes, you should try several times with several approaches before you give up. Patience is a virtue – up to a point – then it becomes pathology.
When I was in high school, I read about the great unsolved problems of math, physics, cryptography. I had these geekish dreams of, at some point, proving Riemann’s hypothesis or Goldbach’s conjecture. I thought it would be cool to find the exact general solutions to the Navier-Stokes equations. But then I realized two things: a) doing math problems on paper bored me to death and b) I only thought math was fun if it tackled practical problems using a computer (so I didn’t have to run the calculations myself). I could have been stubborn and forced myself into something that I hated. But I chose the easy way out: admitting my weakness (I hate repetitive work, I’d rather program a computer to do it) and taking advantage of my strength (I like to put real-world problems into mathematical/numerical models).
All in all, every once in a while you gotta check the pulse and be honest about it.
The last thing you want is to keep beating a dead horse.
Not to mention weird.