After building a mission-critical data production pipeline at ironSource that processes over 200 billion records every month, we’d like to share some of our rules written in blood.
Kinesis is an infinitely scalable stream-as-a-service that consists of shards. The service is commonly used for its ease of use and low operational overhead, alongside its competitive pricing — common differentiators between Kinesis Streams and Kafka.
Like any managed service, Amazon Kinesis has some limitations you should be familiar with — and ways to overcome them through scaling and throttling. It is wise to use the AWS-provided producers, consumers, and tools, as they embody these best practices.
At a large scale, it’s hard to change architecture once in production, and cost becomes a very big pain. The service is billed per 25 KB payload unit, so it makes sense to aggregate messages if your records are smaller in size.
When sending data into your Kinesis stream you should compress and aggregate several messages into one in order to reduce costs.
The Amazon Kinesis Producer Library (KPL) aggregates and compresses (using Protocol Buffers) multiple logical user records into a single Amazon Kinesis record for efficient puts into the stream. The library is built by AWS in C++ and has (only) Java bindings. An open-source version in Golang is available.
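If the KPL (or its Golang port) isn’t an option, the same idea can be hand-rolled with boto3. Here is a minimal sketch, assuming newline-delimited JSON records and gzip compression; the stream name and batching are illustrative, not a prescription:

```python
import gzip
import json
import uuid

import boto3

kinesis = boto3.client("kinesis")

def put_aggregated(records, stream_name="my-stream"):
    """Pack many small logical records into a single Kinesis record."""
    # Newline-delimited JSON keeps the payload trivial to split apart
    # on the consumer side.
    payload = "\n".join(json.dumps(r) for r in records).encode("utf-8")
    compressed = gzip.compress(payload)
    # Kinesis bills per 25 KB payload unit, so one aggregated record is
    # cheaper than many small ones.
    kinesis.put_record(
        StreamName=stream_name,
        Data=compressed,
        PartitionKey=str(uuid.uuid4()),  # random key spreads load across shards
    )
```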
The KCL, also written by AWS, supports automatic de-aggregation of KPL user records. The KCL takes care of many of the complex tasks associated with distributed computing — such as load-balancing across multiple instances, responding to instance failures, checkpointing processed records, and reacting to resharding.
While you can process a Kinesis stream using on-demand instances, we highly recommend leveraging AWS spot instances to process your stream — it is the most cost-effective method.
There is also a way of processing the data using AWS Lambda with Kinesis, and Kinesis Record Aggregation & Deaggregation Modules for AWS Lambda. It is very easy to hook up a Kinesis stream to a Lambda function — but you must take cost into consideration and see if it makes sense for your specific use-case.
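For the Lambda route, here is a minimal handler sketch that reverses the hand-rolled aggregation shown earlier. The event shape is the standard Kinesis-to-Lambda mapping; the record format (gzipped, newline-delimited JSON) is our own assumption, and KPL-aggregated records would instead go through the deaggregation modules mentioned above:

```python
import base64
import gzip
import json

def handler(event, context):
    for record in event["Records"]:
        # Kinesis delivers each blob base64-encoded inside the Lambda event.
        blob = base64.b64decode(record["kinesis"]["data"])
        for line in gzip.decompress(blob).decode("utf-8").splitlines():
            process(json.loads(line))

def process(message):
    # Replace with real business logic.
    print(message)
```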
There are two sets of metrics you should take into consideration when monitoring your Kinesis Streams with CloudWatch: basic stream-level metrics and enhanced shard-level metrics.
For the stream-level metric, it’s good practice to set up an alarm on the GetRecords.IteratorAgeMilliseconds to know if your workers are lagging behind on the stream.
However, sometimes there might be a specific worker/shard that is out of sync — but the state won’t be reflected at the stream level via the global IteratorAgeMilliseconds average. In order to overcome this, I recommend running a Lambda function every minute that queries each shard’s IteratorAgeMilliseconds and alerts if needed.
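A minimal sketch of that scheduled check with boto3, assuming enhanced (shard-level) monitoring is enabled on the stream; the stream name and alert threshold are illustrative:

```python
import datetime

import boto3

kinesis = boto3.client("kinesis")
cloudwatch = boto3.client("cloudwatch")

STREAM = "my-stream"
THRESHOLD_MS = 60_000  # one minute of lag

def handler(event, context):
    shards = kinesis.describe_stream(StreamName=STREAM)["StreamDescription"]["Shards"]
    now = datetime.datetime.utcnow()
    for shard in shards:
        stats = cloudwatch.get_metric_statistics(
            Namespace="AWS/Kinesis",
            MetricName="IteratorAgeMilliseconds",
            Dimensions=[
                {"Name": "StreamName", "Value": STREAM},
                {"Name": "ShardId", "Value": shard["ShardId"]},
            ],
            StartTime=now - datetime.timedelta(minutes=5),
            EndTime=now,
            Period=60,
            Statistics=["Maximum"],
        )
        for point in stats["Datapoints"]:
            if point["Maximum"] > THRESHOLD_MS:
                alert(shard["ShardId"], point["Maximum"])

def alert(shard_id, age_ms):
    # Hook up SNS, PagerDuty, etc. here.
    print(f"shard {shard_id} is {age_ms} ms behind")
```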
AWS recommends monitoring the following metrics:
GetRecords.IteratorAgeMilliseconds: Tracks the read position across all shards and consumers in the stream. Note that if an iterator’s age passes 50% of the retention period (24 hours by default, configurable up to 7 days), there is a risk of data loss due to record expiration. AWS advises the use of CloudWatch alarms on the maximum statistic to alert you before this loss is a risk. For an example scenario that uses this metric, see Consumer Record Processing Falling Behind.
ReadProvisionedThroughputExceeded: When your consumer-side record processing is falling behind, it is sometimes difficult to know where the bottleneck is. Use this metric to determine if your reads are being throttled due to exceeding your read throughput limits. The most commonly used statistic for this metric is average.
WriteProvisionedThroughputExceeded: This serves the same purpose as the ReadProvisionedThroughputExceeded metric, but for the producer (put) side of the stream. The most commonly used statistic for this metric is average.
PutRecord.Success / PutRecords.Success: AWS advises the use of CloudWatch alarms on the average statistic to indicate when records are failing to be put to the stream. Choose one or both put types depending on what your producer uses. If using the Kinesis Producer Library (KPL), use PutRecords.Success.
GetRecords.Success: AWS advises the use of CloudWatch alarms on the average statistic to indicate when records are failing to be read from the stream.
If you push it to the limit, Kinesis will start throttling your requests and you’ll have to re-shard your stream. There might be several reasons for throttling. For example, you may have sent more than 1 MB of payload / 1,000 records per second per shard. But you might have a throttling problem caused by DynamoDB limits.
As noted in Tracking Amazon Kinesis Streams Application State, the KCL tracks the shards in the stream using an Amazon DynamoDB table. When new shards are created as a result of re-sharding, the KCL discovers the new shards and populates new rows in the table. The workers automatically discover the new shards and create processors to handle the data from them. The KCL also distributes the shards in the stream across all the available workers and record processors. Make sure you have enough read/write capacity in your DynamoDB table.
When re-sharding a stream, scaling is much faster when it’s in multiples of 2 or halves. You can re-shard your stream using the UpdateShardCount API. Note that scaling a stream with more than 200 shards is unsupported via this API. Otherwise, you could use the Amazon Kinesis scaling utils.
Re-sharding a stream with hundreds of shards can take time. An alternative method involves spinning up another stream with the desired capacity, and then redirecting all the traffic to the new stream.
Developing Kinesis Producers & Consumers
A deep-dive into lessons learned using Amazon Kinesis Streams at scale was originally published in A Cloud Guru on Medium.
I write this following a particularly frustrating day of thumb twiddling and awaiting Slack messages from the AWS support team. Our Elasticsearch cluster was down for the better part of a day, and we were engaged with AWS support the whole time.
At my previous job working for Loggly, my team and I maintained a massive, multi-cluster Elasticsearch deployment. I learned many lessons and have a lot of tricks up my sleeve for dealing with Elasticsearch’s temperaments. I feel equipped to deal with most Elasticsearch problems, given access to administrative Elasticsearch APIs, metrics and logging.
AWS’s Elasticsearch, however, offers access to none of that. Not even APIs that are read-only, such as the /_cluster/pending_tasks API, which would have been really handy, given that the number of tasks in our pending task queue had steadily been climbing into the 60K+ region.
This accursed message has plagued me ever since AWS’s hosted Elasticsearch was foisted on me a few months ago:
"Message":"Your request: '/_cluster/pending_tasks' is not allowed."
Thanks, AWS. Thanks….
Without access to logs, without access to admin APIs, without node-level metrics (all you get is cluster-level aggregate metrics) or even the goddamn query logs, it’s basically impossible to troubleshoot your own Elasticsearch cluster. This leaves you with one option whenever anything starts to go wrong: get in touch with AWS’s support team.
9 times out of 10, AWS will simply complain that you have too many shards.
It’s bitterly funny that they chide you for this because by default any index you create will contain 5 shards and 1 replica. Any ES veteran will say to themselves: heck, I’ll just update the cluster settings and lower the default to 1 shard! Nope.
"Message": "Your request: '/_cluster/settings' is not allowed for verb: GET"
Well, fuck (although you can work around this by using index templates).
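For example, a template that lowers the default shard count for every new index can be applied over the regular index APIs, which AWS does allow. A minimal sketch; the domain endpoint is hypothetical, the body uses the Elasticsearch 5.x template syntax, and it assumes your domain’s access policy permits unsigned requests from your IP:

```python
import requests

ES = "https://my-domain.us-east-1.es.amazonaws.com"  # hypothetical endpoint

template = {
    "template": "*",              # match every new index
    "settings": {
        "number_of_shards": 1,    # instead of the default 5
        "number_of_replicas": 1,
    },
}

resp = requests.put(f"{ES}/_template/default_shards", json=template)
resp.raise_for_status()
```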
Eventually, AWS support suggested that we update the instance size of our master nodes, since they were not able to keep up with the growing pending-task queue. But they advised us to be cautious, because making any change at all will double the size of the cluster and copy every shard.
That’s right. Increasing the instance size of just the master nodes will actually cause AWS’s middleware to double the size of the entire cluster and relocate every shard in the cluster to new nodes. After which, the old nodes are taken out of the cluster. Why this is necessary is utterly beyond me.
Adding an entry to the list of IP addresses that have access to the cluster will cause the cluster to double in size and migrate every stinking shard.
In fact, even adding a single data node to the cluster causes it to double in size and all the data will move.
Don’t believe me? Here is the actual graph of our node count as we were dealing with yesterday’s issue:
Back at Loggly, we would never have considered doing this. Relocating every shard in any respectably sized cluster all-at-once obliterates the master nodes and would cause both indexing and search to come to a screeching halt. Which is precisely what happens whenever we make any change to our Elasticsearch cluster in AWS.
This is probably why AWS is always complaining about the number of shards we have… Like, I know Elasticsearch has an easy and simple way to add a single node to a cluster. There is no reason for this madness given the way Elasticsearch works.
I often wonder how much gratuitous complexity lurks in AWS’s Elasticsearch middleware. My theory is that their ES clusters are multi-tenant. Why else would the pending tasks endpoint be locked down? Why else would they not give you access to the ES logs? Why else would they gate so many useful administrative APIs behind the “not allowed” Cerberus?
I must admit though, it is awfully nice to be able to add and remove nodes from a cluster with the click of a button. You can change the instance sizes of your nodes from a drop-down; you get a semi-useful dashboard of metrics; when nodes go down, they are automatically brought back up; you get automatic snapshots; authentication works seamlessly within AWS’s ecosystem (but makes your ES cluster obnoxiously difficult to integrate with non-AWS libraries and tools, which I could spend a whole ‘nother blog post ranting about), and when things go wrong, all you have to do is twiddle your thumbs and wait on slack because you don’t have the power to do anything else.
Elasticsearch is a powerful but fragile piece of infrastructure. Its problems are nuanced. There are tons of things that can cause it to become unstable, most of which are caused by query patterns, the documents being indexed, the number of dynamic fields being created, imbalances in the sizes of shards, the ratio of documents to heap space, etc. Diagnosing these problems is a bit of an art, and one needs a lot of metrics, log files and administrative APIs to drill down and find the root cause of an issue.
AWS’s Elasticsearch doesn’t provide access to any of those things, leaving you no other option but to contact AWS’s support team. But AWS’s support team doesn’t have the time, skills or context to diagnose non-trivial issues, so they will just scold you for the number of shards you have and tell you to throw more hardware at the problem. Although hosting Elasticsearch on AWS saves you the trouble of needing a competent devops engineer on your team, it absolutely does not mean your cluster will be more stable.
So, if your data set is small, if you can tolerate endless hours of downtime, if your budget is too tight, if your infrastructure is too locked in to AWS’s ecosystem to buy something better than AWS’s hosted Elasticsearch: AWS Elasticsearch is for you. But consider yourself warned…
Some things you should know before using Amazon’s Elasticsearch Service on AWS was originally published in A Cloud Guru on Medium.
In a recent blog, we examined the performance difference between the runtimes of languages that AWS Lambda supports natively. Since that experiment was specifically interested in the runtime differences of a ‘warm’ function, the ‘cold start’ times were intentionally omitted.
A cold start occurs when an AWS Lambda function is invoked after not being used for an extended period of time resulting in increased invocation latency.
Since the cold start times of AWS Lambda are an important performance consideration, let’s take a closer look at some experiments designed to isolate the variables which may impact the first-time invocations of functions.
From my experience running Lambda functions in production environments, cold starts usually occur when an AWS Lambda function is idle for longer than five minutes. More recently, some of my functions didn’t experience a cold start until after 30 minutes of idle time. Even if you keep your function warm, a cold start will occur about every 4 hours when the host virtual machine is recycled — just check out the metrics by IOpipe.
For testing purposes, I needed a reliable method for consistently ensuring a cold start of an AWS Lambda function. The only surefire way to create a cold start is by deploying a new version of a function before invocation.
For the experiment, I created 45 variations of an AWS Lambda function. Using the Serverless framework, it was easy to create variants of the same function with different memory sizes.
I repeatedly deployed all 45 functions and invoked each of them programmatically using a simple script.
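The original script isn’t reproduced here, but a hedged reconstruction of the deploy-and-invoke loop might look like the following; the `coldstart-<memory>` naming convention is a hypothetical stand-in, and redeploying before each round is what forces the cold starts:

```python
import subprocess
import time

import boto3

lambda_client = boto3.client("lambda")
MEMORY_SIZES = [128, 256, 512, 1024, 1536]

while True:
    # Redeploying publishes fresh versions of every function, guaranteeing
    # that the next invocation of each one is a cold start.
    subprocess.run(["serverless", "deploy"], check=True)
    for mem in MEMORY_SIZES:
        name = f"coldstart-{mem}"
        start = time.time()
        lambda_client.invoke(FunctionName=name, Payload=b"{}")
        print(name, round((time.time() - start) * 1000), "ms")
```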
The deployment and invocation loop took about three minutes. To collect a meaningful amount of data points, I ran the experiment for over 24 hours.
I based the hypothesis on my knowledge that the amount of CPU resources is proportional to the amount of memory allocated to an AWS Lambda function.
Now it was time to see if the experiments supported my hypothesis.
To evaluate the impact of memory on cold starts, I created 20 functions: five memory-size variants for each language runtime. The supported languages are C#, Java, Node.js, and Python.
After running the experiment for a little over 24 hours, I collected the following data — here are the results:
Observation: C# and Java have much higher cold start times
The most obvious trend is that statically typed languages (C# and Java) have over 100 times higher cold start times. This clearly supports our hypothesis, although to a much greater extent than I originally anticipated.
Observation: Python has ridiculously low cold start time
I’m pleasantly surprised by how little cold start the Python runtime experiences. OK, there were some outlier data points that heavily influenced some of the 99th percentiles and standard deviations — but you can’t argue with a 0.41ms cold start time at the 95th percentile of a 128MB function.
Observation: memory size improves cold start time linearly
The more memory you allocate to your function, the smaller the cold start time — and the smaller the standard deviation. This is most obvious with the C# and Java runtimes, as the baseline (128MB) cold start times for both are very significant.
So far, the data from the first experiment supports the initial hypothesis.
To evaluate the impact of memory and package size on cold starts, I created 25 functions with various code and memory sizes. Node.js was the constant language for this experiment.
Here are the results from this experiment:
Observation: memory size improves cold start time linearly
As with the first experiment, the memory size improves the cold start time and standard deviation in a roughly linear fashion.
Observation #2: code size improves cold start time
Interestingly, the size of the deployment package does not increase the cold start time. I would have assumed that a bigger package would equate to more time to download and unzip. Instead, a larger deployment package seems to have a positive effect on decreasing the overall cold start time.
To see if the behavior is consistent, I would love for someone else to repeat this experiment using a different language runtime. The source code used for these experiments can be found here, including the scripts used to calculate the stats and generate the plot.ly box charts.
Here are a few things I learned from these experiments:
How does language, memory and package size affect cold starts of AWS Lambda? was originally published in A Cloud Guru on Medium.
Can I help you?
No, I just waited 30 minutes to say “hi”.
The cloud era will likely be the most disruptive experience IT departments have ever faced. There’s never been a more exciting — and painful — time to be involved in technology. Companies are rightfully expecting their technology organizations to add value, create products, solve problems, and not be so, you know, dreadful.
Every company is now a technology business — and that has a seismic impact on the structure of IT organizations. By predicting the outcome and taking steps to get there more quickly, we can leapfrog the intermediate pain.
In the new world, IT is a strategic organization responsible for generating revenue, and tactical technology teams exist in every division. So far you’ve primarily seen this in marketing departments, but it will quickly spread to finance, facilities, R&D, purchasing, HR, and everywhere else. To address the challenges of managing technology when it’s everywhere, CIOs will begin replacing CEOs. Why wait until then to get started? Let’s fix it now.
If you ask the denizens of corporate America about their IT staff, the responses echo sentiments of disappointment, rage and bewilderment. Workers have come to accept that IT departments are no longer the source of innovation and competitive advantage, but a bloated bureaucracy of interconnected silos that work to enforce the status quo. If you work in IT in Corporate Land — congratulations!— you currently suck.
For years this has been acceptable to companies that look to IT, a division they barely understand, to protect their castle and simply keep systems operational. IT has been about stopping people from crossing the moat, opening the drawbridge occasionally, and frowning when people ask questions. In summary, “No Chrome for you, Slack’s out of the question and let me brick your Android device while I’m here.”
But Winter’s Coming, my friends, and the Summer of Suckage that has defined corporate IT is ending. If mobile was the tremor, cloud is the earthquake.
Of course, some companies will botch this transition terribly and will pay a hefty price in the marketplace. Others will seize this moment to build a new, better organization around a cloud-first, customer-focused IT department. If you’re wondering what this will look like, think of the Be Our Guest number where the plates and cutlery start singing when a customer shows up.
You know who hates your silos? Your customers. And your employees. Everyone, actually. Silos are the enemy of agility, yet IT departments are usually quick to organize themselves into PowerPoint-friendly operational groups.
From database administrators to network engineers, each group acts as a gatekeeper to change with a collective slowing effect that brings change to a grinding halt. Silos are also the enemy of accountability, so when something truly horrific happens publicly, everyone can point at someone else and dodge the blame ball.
With cloud services and massive amounts of automation, we can outsource a major part of what different silos do on a daily basis. This means that 50–70% of your existing technical workforce won’t have to put the cap on the toothpaste tube and can be deployed for more useful jobs. “Being more useful” is my euphemism for “doing work for customers” — namely, shipping features, solving their problems, and making them happy.
Now silo-lovers will scoff at this suggestion, claiming that employee X could never learn the skills of employee Y because of blah blah blah. They’ll also argue that nobody will ever be allowed to go on vacation in a non-silo model. Well the answer to that is — Google, Facebook and all the other tech companies we admire, who cross-train merrily and vacation frequently.
If you create small, product-driven teams with a specific actual customer goal, you’ll be amazed how stuff gets done and quality improves. The cross-pollination of knowledge is a by-product of the increased morale and motivation of the group. Seriously, try it out — and if it doesn’t work, you can always go back.
Ignorance breeds strong opinions. People with the poorest understanding of a topic often have the most extreme views. The cloud is no different. I’ve seen over and over how IT staff can overestimate their knowledge of cloud technologies, and then express the loudest anti-cloud objections.
I used to tackle this head-on, but it becomes an exhausting and fruitless exercise as people dig in their heels to avoid being wrong. You can skip this pain by identifying the problem actors ahead of time, pulling them aside, and charging them with the responsibility of being the key expert in Google Big Query, Amazon Redshift, Azure IoT or whatever.
Critically, you must send these people for training and, most importantly, get them certified. You’d be shocked how you can convert ferocious anti-clouders to strong cloud warriors who will drive your goals. Unlike many certification programs (Oracle’s Java certification, OMG) cloud training generally has a surprising number of included aha moments that serve to motivate students.
Here’s a thing — software engineers tend to be smart, conscientious people and often enjoy working on hard problems. So it boggles my mind when companies use security teams, legal, and other silos to block their work and pose so many questions to the point where nothing gets done and motivation fizzles out.
I worked at one financial services company that didn’t trust any software coming from its prized developers “for security reasons”. So I asked why there wasn’t a security guard watching every single bank teller’s transaction to make sure they weren’t giving away the shop. “Oh, we trust them to safeguard our assets,” they replied. So we trust the minimum-wage, high-turnover front-end of the business but don’t trust our brain trust that produces the software they use? That’s ridiculous — and any developer working in that environment should find a better job immediately.
Any decent developer tends to be full-stack by nature. They often know more about security holes than the security teams, and they come pre-installed with skills like database management and testing automation. They are often good at the programming languages that don’t appear on their resume and have weekends where they submit pull requests for open source projects for fun.
These are skilled people. You need to foster an environment that feeds their natural curiosity, develops their skill sets, and demonstrates that you trust them to drive your business forward.
Moving to the cloud can seem like an insurmountable challenge without an obvious place to start. It’s as much a people problem as a technical issue, so you have to find a common goal to rally everyone around.
I’ve found that starting with the technical side can be a mistake. The effort usually gets its wheel stuck in the mud of upgrades that don’t help and arcane mini-projects. These are technical debt pet-projects that take six months, go nowhere, and there should be a drinking game for every time somebody wants to start one.
Like any massively overwhelming task from “reduce CO2 emissions” to “lose weight”, you have to look for immediate wins that have the most visible impact. This isn’t about looking good — it’s about building confidence to know that change is possible, change is beneficial, and change is happening.
I guarantee that your business has a stack of major customer problems that have been put on the back-burner for a long time. Let’s dust-off the list of issues and commit to fixing them at the beginning of the cloud-first initiative.
For example, I once worked at a company where the cloud was having trouble taking off. They were trying to “fix stability”, which was a compound problem riddled throughout their existing systems. We quietly moved away from this amorphous goal, talked to various customer-facing groups (and actual customers, wow) and settled on three of their biggest immediate issues:
Each of these problems were well-recognized and used to stand up three cloud-oriented teams of 8–12 people. Since these problems had existed since the creation of fire, the initial sprint involved the usual “this will never get solved” pessimism. Within 5–6 sprints, each one was fully addressed.
How could multi-year customer issues be fixed in 12 weeks? While the success was attributed to the cloud initiative, it was really the success of a team-oriented structure. The issues were well-defined, had C-level support, and ultimately had a customer’s smiling face as the reward for delivery. People like helping customers — who knew?
With any revolution, the beginning is the most difficult time. In cloud, the technology is not the most difficult problem — the people and corporate environment are the key factors for determining whether your implementation will ultimately succeed or fail. So let’s fix it now.
How to build a cloud-first IT organization that’s as much about people as technology was originally published in A Cloud Guru on Medium.
At face value, blockchain networks and serverless computing appear to have little in common. Serverless is stateless, blockchain is stateful; serverless is ephemeral, blockchain is persistent; serverless processing relies on trust between parties, blockchain is designed to be trustless; serverless has high scalability, blockchain has low scalability.
On closer examination, serverless and blockchain actually have a fair amount in common. For example, both are event-driven and highly distributed. Another key similarity is where they both process functions — in a layer above the infrastructure/server level. Even when the characteristics aren’t shared or similar, they are complementary. For example, serverless is stateless — but blockchain can be used as a state machine for validating transactions.
Taking a look at the similarities and differences helps to gain a better understanding of both serverless and blockchain. The deeper analysis also informs how each of the technologies might impact this next wave of computing cycles.
Over the past few months, I’ve spent a lot of time diving into the world of blockchain. Much like serverless computing, there are a lot of pieces to consume and absorb. It took a fair amount of effort to work through the basic workflows, understand the different platforms, and relate the various components to one another.
I started on my blockchain discovery journey a few months ago after a conversation with Lawrence Hecht from The New Stack. During our call, we discussed how a high percentage of Alexa Voice Service/Echo applications are using AWS Lambda as their primary processing engine. In other words, Alexa plus Lambda equals application.
His hypothesis — one that I wholeheartedly subscribe to — is that serverless processing in combination with emerging API-driven platforms will be leading the way in serverless adoption.
Alexa + Lambda = Application
Certainly serverless processing will touch on almost all areas of IT computing. But this theory implies that the most rapid growth and adoption will be in combination with new platforms. In other words, new platforms plus serverless processing equals higher rates of growth.
Since these applications are starting from scratch, there is no legacy code or existing architectures to worry about. Serverless processing supports rapid development and quick access to scale. As a result, it’s a smart choice for people wanting to develop applications quickly and scale them without much additional effort.
Another driver is that new applications are often event-driven — or at least have large event-processing needs. This characteristic lends itself to microservices and functions as a service (FaaS) architectures. The new tools that serverless processing provides are perfect for handling event-driven computing patterns.
New Platform + Serverless = Many Applications
For this reason, I believe that the combination of blockchain plus serverless will have a sum far larger than their parts. The combination will be a predominant method for building and supporting blockchain-related applications — especially when it comes to private blockchain networks.
While exploring the world of blockchain, I found that most of the published material was either too high-level, too effusive, or dived too deep into cryptography and consensus algorithms. What was missing was a clear explanation — directed at architects and developers — that addressed the practical questions about building blockchain-based applications for business use.
To lay the groundwork, readers should have some familiarity with the basics of serverless processing. If a refresher on serverless is needed, review my recent Thinking Serverless series, the blogs by Martin Fowler, or articles curated from the serverless community at A Cloud Guru.
Digital currency is based on blockchain — and the currency exists because of blockchain technology. Digital currency transactions are essentially notarized by the processing and storage nature of the blockchain. In other words, the currency’s value is protected by the chained hashed blocks and the distribution of the networks.
It is difficult, if not impossible, for digital currency to exist outside of a distributed blockchain network. For that to happen, digital currency holders would have to put faith in a single trusted entity — and hope that it wouldn’t game the system, get hacked, or inflate the monetary supply. These are risks that very few adherents of digital currency are willing to accept.
Blockchain networks are powered by digital currency. Organizations that operate nodes within the network are rewarded via the use of digital currency within the system. For example, processing transactions in the Bitcoin network is paid for in Bitcoin. Processing on the Ethereum platform is paid via Ether — the coin used in Ethereum.
Did you know that non-currency assets can also be traded and the transactions preserved within a blockchain?
Since blockchain technology can also be used for applications separate from digital currency, there is significant activity and investment in the space. Companies are using blockchain technology to create transactional-based solutions for trading and transferring physical and digital assets in a manner that is secure and readily verifiable.
With traditional marketplaces and trading platforms, transactions are stored in a centralized ledger owned by a single entity. Blockchain platforms allow these transactions to be digitally signed and preserved as an immutable record. It also stores the transactions in a distributed ledger across multiple independent and replicated nodes.
At its core, a fully formed blockchain network becomes a mechanism for designing the rules of a transactional relationship. The blockchain network acts as programmed adjudication or final settlement, reducing the need to appeal to human institutions. As a result, the blockchain become a programmable social contract which allows for trusted, validated, and documented interactions between parties at a very low cost.
Blockchain networks are composed of near-identical nodes operating in a distributed but independent manner. The network of nodes is used to validate sets of transactions and encapsulate them within blocks in a chain. At the core of the blockchain platforms is a distributed transaction processing engine that validates and cryptographically seals transactions.
These transactions are maintained in a distributed ledger that is replicated, shared, and synchronized within any participating nodes. Blockchains use cryptographic technology to create a permanent record of transactions. A set of transactions are cryptographically stored within a “block”. Successive blocks are added in a chain — secured and preserved in order using hashing algorithms.
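A toy illustration of that chaining in Python: each block’s hash covers the previous block’s hash, so tampering with any historical block invalidates every block after it. The block structure is deliberately simplified for the example:

```python
import hashlib
import json

def block_hash(block):
    """Deterministic SHA-256 over the block's serialized contents."""
    return hashlib.sha256(json.dumps(block, sort_keys=True).encode()).hexdigest()

chain = []
prev = "0" * 64  # genesis marker
for txs in (["alice->bob:5"], ["bob->carol:2"], ["carol->dan:1"]):
    block = {"transactions": txs, "previous_hash": prev}
    prev = block_hash(block)
    chain.append(block)

# Tamper with the first block and the chain no longer verifies:
chain[0]["transactions"] = ["alice->bob:500"]
assert block_hash(chain[0]) != chain[1]["previous_hash"]
```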
The ledgers for public blockchains, such as Bitcoin and Ethereum, are stored in thousands of nodes across the world. Private blockchain networks, on the other hand, may only have a few nodes. Typically, any full participant within a blockchain network would want to maintain an active and operating node — ensuring the validity of the ledger independently from anyone else.
A consistent record of truth is made possible by the cryptographic and shared nature of the ledger. This is critical for addressing several of the problems blockchains are trying to solve:
Blockchain is a consistent record of truth that is shared among participants in the network. It becomes the ultimate arbitrator — eliminating any version of “he said, she said” disputes.
As noted earlier, there are a number of blockchain platforms. While Bitcoin is the most popular cryptocurrency, Ethereum has one of the most popular blockchain platforms — especially for purposes that go beyond just storing digital currency. Hyperledger is a Linux Foundation project and has received significant support from large finance and technology institutions. It is also popular as measured by the number of projects using it and the level of community support.
Blockchain networks may be public, private, or hybrid. This means that a public transaction would be encrypted into the public ledger, while private transactions would be stored in a private ledger. Private transactions could also be stored in the public ledger, hence the hybrid designation.
To support hybrid use cases, the Enterprise Ethereum Alliance is working hard to keep the public Ethereum network and private platforms compatible. According to a source monitoring the effort, a big topic for the group is private transactions on a permissioned, or private, blockchain. JPMorgan Chase’s fork of the Ethereum Go client (Quorum) has added private transactions — where only the sender and receiver know the details of the transaction. Compatibility with the interactions of the public chain, though, is still a driving tenet.
Digital currencies are tied to particular blockchains. Transactions involving a particular currency are represented, denoted, and enshrined in the distributed blockchain for that currency.
Bitcoin transactions are handled on the Bitcoin platform and stored in a Bitcoin ledger. Ethereum transactions are handled on the Ethereum platform and stored in an Ethereum distributed ledger. Hyperledger transactions are handled on the Hyperledger platform and stored in a Hyperledger distributed ledger. And so on.
Some blockchains feature the concept of digital tokens as a secondary asset or currency. These assets are priced in the base digital currency. The digital tokens can most often be used for services on a particular application or sub-platform within that blockchain platform. A look at the listings in TokenMarket will show the digital assets that are available on the Ethereum blockchain.
For example, envision an Uber digital token that could be used as currency for any Uber blockchain-enabled service. The service could simply draw from any Uber digital tokens in your Uber account. The tokens could be tied to a digital currency such as Ether, or just be built on the platform and gain or lose value within it — as well as via speculative interest.
Digital tokens are being released in ICOs — initial coin offerings. In some ways, it’s similar to pre-registration stock offerings in the 1800s and early 1900s. Any entity can create a token and offer it for sale in an ICO. The transaction would be performed in the blockchain’s digital currency.
A report referenced in Blockchain News indicated that one quarter, or $250M, of the $1B of investment raised by blockchain companies was the result of ICOs.
Digital assets simply refer to the digital or physical items at the heart of a trade. Examples of assets that might be purchased and transferred include a house, a car, shares in a company, or a painting. These transactions would be registered within a blockchain, and the asset becomes the item referenced in the exchange. A shipment could also be considered an asset at the core of a set of blockchain transactions — a blockchain use case that is already in operation.
All blockchain platforms contain a processing component that is a critical part of transaction assurances. The blockchain networks are set up for “miners” that competitively “forge” each successive block in the chain.
The terms mining and forging are used to describe the process of validating and preserving transactions in blocks, as well as receiving new digital currency tokens in return for the work. The process of mining introduces new currency into the system at a reasonably fixed rate.
Miners are compensated for being the first to solve a block solution. This means being the first to calculate a hash for the block that is less than a threshold set by the mechanisms of the network. Blockchain platforms are self-arbitrating with respect to setting and adjusting the thresholds — this allows the aggregated set of miners to mine blocks within specific and regular time windows.
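A toy proof-of-work loop makes the threshold idea concrete: keep incrementing a nonce until the block’s hash falls below the target. The difficulty value here is illustrative; real networks adjust it automatically to hold the block interval steady:

```python
import hashlib

def mine(block_data: bytes, difficulty_bits: int = 20) -> int:
    """Search for a nonce whose hash lands under the target threshold."""
    target = 2 ** (256 - difficulty_bits)
    nonce = 0
    while True:
        digest = hashlib.sha256(block_data + str(nonce).encode()).digest()
        if int.from_bytes(digest, "big") < target:
            return nonce  # the first miner to find this wins the block reward
        nonce += 1

print(mine(b"block #1: alice->bob:5"))
```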
All miners have access to each transaction. As part of forging a block, or a set of transactions, they process the code for each transaction and attempt to arrive at the hash solution.
In the article, “How Bitcoin Mining Works”, the author provides a detailed explanation of the process. It describes how each block’s hash is produced using the hash of the block before it, becoming a digital version of a wax seal. The process confirms that this block — and every block after it — is legitimate, because if you tampered with it, everyone would know.
This distributed nature of blockchain processing means that each active node can theoretically execute each transaction within the system. For example, the transfer of a public digital currency from one person to another might get processed by every node in the entire network.
In the case of Bitcoin or Ether, this could mean over 10,000 nodes are executing the code for a given transaction. This same processing replication takes place for each type of transaction — whether it’s as simple as the transfer of a digital asset or a transaction with extremely complex processing logic.
The code for each transaction is referred to as a smart contract, and the processing is referred as on-chain processing.
Functions are uploaded as part of the process for creating an asset type in the blockchain. The uniformity of the language and the processing model ensures a high degree of determinism — meaning the outcome of executing a smart contract should be the same on each node within the system.
In the case of Hyperledger Fabric, smart contract processing is performed in Go. The OCaml language is being used for a new blockchain platform, Tezos, expected to be released in June 2017. According to CryptoAcademy, “OCaml [is] a powerful functional programming language offering speed, an unambiguous syntax and semantic, and an ecosystem making Tezos a good candidate for formal proofs of correctness.”
Within public blockchain networks, there is a charge for processing transactions — a cost borne by the party submitting the transaction. In the case of Ethereum, the processing charge is called the “gas” and it’s submitted as part of the transaction in the form of ether, the base currency in Ethereum.
Private blockchain networks, however, can be more flexible with respect to transaction costing methods. The operating costs could be borne by the network participants at little or no cost per transaction, or could be accounted for in a manner determined by the network participants.
Various blockchain platforms are working on improved forms for arriving at processing consensus. The current form is called Proof of Work. A proposed approach within the Ethereum network is called Proof of Stake. You can read about the proposal and debate here, here, and here.
The blockchain processing model differs significantly from serverless processing. Not only do most serverless platforms support multiple languages, but the goal for serverless processing is one-time processing.
Processing the same transaction on a concurrent basis is antithetical to the serverless processing philosophy. Within blockchain platforms, however, this concurrent identical processing model is a key feature — ensuring and maintaining the blockchain as a validated and trusted source of transaction history and current state.
Given the closed nature of blockchain processing, there is no need — nor any entry way — for serverless processing of on-chain transactions for execution of smart contracts. However, there is a tremendous amount of need for processing off-chain transactions — especially in terms of setting up transactions, helping to perfect transactions, and addressing post-transaction workflows.
Why is there such a need for off-chain transactions? Because 1) on-chain processing capabilities are severely limited, and 2) on-chain data storage is severely limited. As a result, off-chain data processing will need to take place for transactions that are complex and/or data-heavy.
With respect to the first issue of limited processing capabilities, on-chain transaction logic needs to be kept to a minimum in order to arrive at effective transaction throughputs. Cost mechanisms for processing transactions — ones that provide equitable rewards for processing transactions and operating the ledger — also impose costs for transactions.
Without the “gas” charged for transaction processing, transaction parties would get free rides on the network. In addition, they could easily overwhelm the processing capabilities of the network — blockchain DDoS attacks, anyone?
To arrive at the optimal balance of performance, cost, and consistency for each transaction, transaction logic for blockchain applications needs to adequately address both on-chain processing and off-chain processing. This also applies to addressing on-chain data and off-chain data. As a result, effective blockchain design means using the blockchain network for only the most minimal amount of data processing and data storage necessary to perfect and preserve the transaction.
The separation of on-chain versus off-chain is illustrated by a recent post that describes using the Ethereum network as a mechanism for validating a game of chess.
The writer, Paul Grau, describes how it’s not practical to submit every move as a transaction into the chain. Each transaction would not only take a significant amount of time to settle, but could also be cost prohibitive because of transaction costs. For example, a move by a player might take seconds, whereas committing the move onto a blockchain might take minutes.
In assessing the problem, the team realized each move did not need to be a part of the chain. Independent arbitration — blockchain processing — is only needed to establish the start of a game as well as resolve a dispute in the game. Once a game is established, each player submits their move along with their perceived state of the game to the opposing player as a signed transaction off-chain. If the opposing player signs and submits a subsequent move, the prior move is deemed accepted.
A transaction is submitted to the blockchain only in situation where there is a dispute — a player believes there is a checkmate, stalemate, or timeout condition.
In such a situation, the processing of the smart contract would determine whether such a condition was true, thereby dictating the outcome or next step — continued play, declared winner, or stalemate. It’s inevitable that questions will arise as to how complex the smart contract logic should and can be. In the case of chess, it’s proposed that a reasonably quick algorithm can be created to assess the condition within a smart contract. More complicated games such as Go, however, are another matter.
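As a sketch of the off-chain exchange, each signed move might look like the following. HMAC stands in for the public-key signatures a real Ethereum state channel would use, and the message format is invented for illustration:

```python
import hashlib
import hmac
import json

def sign_move(secret_key: bytes, game_id: str, move: str, state: dict) -> dict:
    """Player signs (move, perceived game state) and sends it to the opponent."""
    message = json.dumps(
        {"game": game_id, "move": move, "state": state}, sort_keys=True
    )
    signature = hmac.new(secret_key, message.encode(), hashlib.sha256).hexdigest()
    return {"message": message, "signature": signature}

def verify_move(secret_key: bytes, signed: dict) -> bool:
    """Signing and returning the next move implicitly accepts this one."""
    expected = hmac.new(
        secret_key, signed["message"].encode(), hashlib.sha256
    ).hexdigest()
    return hmac.compare_digest(expected, signed["signature"])
```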
The split between on-chain processing and off-chain processing means that off-chain processing needs must be able to set up transactions and manage any post processing needs. Remember, blockchain provides a trusted record of transactions — but parties that are making use of that transaction don’t need to do so in an on-chain manner.
For example, a voting application that uses blockchain to verify the eligibility of voters can access blockchain records for confirmation. Aside from registering a vote, none of its processing needs to happen within a blockchain platform. Likewise, using a blockchain ledger to preserve the accident and repair history of an automobile will have incidents committed to the blockchain, but any actions performed on those incidents do not need to be done in an on-chain manner.
The limitations on storing data on-chain also have implications that directly relate to serverless processing. Any data that supports a transaction will need to be digitally preserved and linked as part of the transaction — like the actual contract that is recorded in the blockchain as being signed. Services are already being developed to perform this capability within several public blockchains. Tierion, for example, is performing this for Ethereum-based applications. For private blockchain networks, potential candidates for serverless processing include preparing the data, validating it, and accessing it post-processing.
The difference between what is processed on-chain versus off-chain largely comes down to trust levels. On-chain processing is designed to be trustless — meaning the parties do not have to trust each other in order to perfect a transaction. Off-chain processing performed by one party in a serverless environment is situated for cases where 1) there is no transaction effected, 2) two or more parties trust each other to forego any sort of consensus algorithm, or 3) there is a consensus algorithm in place to verify the results of the off-chain processing.
Blockchain and serverless processing are two independent technical innovations which are markedly different, but they share a number of things in common. While serverless is intended to be stateless, blockchain provides a publicly and independently verifiable way to maintain transactional states.
As application patterns quickly evolve to event-driven architectures, the need for independently verifiable transactional states will increase — and the more likely serverless and blockchain will be used together. The use case for this combination is especially true in private and/or permissioned blockchains, where the trust level is higher and the allowances for use of outside components and services more tolerable.
The next article will explore blockchain platforms and components in more depth, as well as dive into some of the issues blockchain platforms are trying to address. In addition, we’ll cover a few areas where serverless platforms can build hooks to better support the development and operation of blockchain applications.
Special thanks to the following for assisting in the crafting of this post. Their collective insights and generosity in answering questions is greatly appreciated — James Poole, Paul Grau, and Robert Mccone.
How blockchain and serverless processing fit together to impact the next wave was originally published in A Cloud Guru on Medium.
After years of consuming a heavy diet of Microsoft Office and Oracle, we’ve been conditioned to think that software improvements are packaged in versions. This atomic ‘all or nothing’ upgrade comes from the days when vendors batched all the updates for a release, scripted the installation procedures, and sent you a CD that contained all the magic.
The user didn’t know whether the latest build contained a single change or 100% new code — it was just ‘the next version’. We also came to expect pain and unpleasantness from this method of upgrades. For most IT shops, upgrades are synonymous with outages, instability and irate calls — not fun.
This approach to upgrades is single-handedly the most unhelpful idea for making better applications for users. We must shift this way of thinking about software changes.
I’ve been trumpeting the idea of features over versions for ages — and corporate IT leaders still look at me like I’m crazy. But do you know which version of Google search you’re using? What’s the current version of Amazon.com’s homepage? And what are the odds that you’re using the exact same version of those pages as me?
In the world of cloud applications, nobody — nobody — is talking about versions. Features are all users care about — and with the cloud, we can now deliver without the tyranny of traditional deployment.
I’ve been busy again with another unicorn venture. This time I’ve built an incredibly useful Book Listing Service that allows you to add book titles and authors, and then list them. In expectation of its success and VC funding, I’m reserving my new Tesla.
Under the covers, the Book Listing Service looks like this:
As users are prone to do, they immediately provide feedback with a list of new feature requests which they never mentioned before. Tssk, users! And the developers also injected a couple of changes as well — so my backlog is already looking like this:
There are now hundreds of active paying users expecting me to start rolling out these changes — so what are my options? Traditionally, there are only a couple of ways to release changes into a production environment:
As a Product Manager, I want to roll all these features together into version 2 because multiple deployments are painful. But one of my best customers is unhappy and is demanding an immediate bug fix or she’ll stop using the product — you can only add book titles up to 50 characters. What to do!?!
Fortunately, our best developer is a master of micro-services, a curator of the cloud, and likes nothing more than using her services sorcery to solve problems like these. She decomposes the back-end design further:
We start mapping features to required changes in the three main components and realize we can do some clever things. It turns out that our two main services are actually Lambda functions, and our front-end talks to AWS’s API Gateway to reach them.
I call our star customer and ask if she’s interested in becoming a beta user — we’re going to make the change today and she will be the only person to receive the feature. If it looks good and doesn’t cause any issues, we’ll then roll it out to everyone.
We will allow a subset of our users to see a different version in one component of the system, while everyone else stays on the old version:
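On AWS today, one way to realize this kind of split is Lambda alias traffic shifting. In this hypothetical sketch, the “live” alias keeps serving version 1 while a small share of invocations goes to version 2; routing one named beta customer would instead be a check in API Gateway or application code:

```python
import boto3

lambda_client = boto3.client("lambda")

# Everyone stays on version 1, except a 5% canary slice routed to version 2.
lambda_client.update_alias(
    FunctionName="book-listing-service",   # illustrative function name
    Name="live",
    FunctionVersion="1",
    RoutingConfig={"AdditionalVersionWeights": {"2": 0.05}},
)
```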
Fast forward a week and this change is promoted to all active users — and I’m now choosing the color and interior finish of my Tesla. I work with my lead developer and very soon we are flowing testers and developers onto different versions of the components so everyone sees a slightly different ‘version’ of the system:
Our developers are now working on a feature that will check the book price using Amazon’s API, but it’s working against production versions of the other services. Needless to say, our QA people are doing back-flips down the hallway.
In this model of our coding factory, the features list — the backlog — is our list of customer orders, and we keep the conveyor belt full of code moving into production in a fairly constant, continuous state. We can easily roll back changes if needed, apply automated testing to the release process, and generally feel good that we are both delivering customer value and not hurting stability. Just so much winning.
Although my book listing example is very simple, you can see how we can do things very differently with service-oriented architectures and the cloud. By decomposing functionality into smaller and smaller units, we can move versioning from the overall product level down to the code level.
This approach has positive impacts on customers and product managers:
Though not new to the technical audience, I’m surprised how many business leaders have no idea we can do this.
The Books Listing application illustrates a micro-service approach, but there are several other types of upgrade that commonly affect cloud infrastructures.
1. Self-managed instance upgrades
If you’re managing your own code on EC2 instances, this isn’t too different from on-premise upgrades and has the same likelihood of success or failure. An upgrade script runs on the instance and if it works — profit! — and if not, we attempt to roll back and hope the rollback state is just like it was before.
Alternatively, and preferably, you have a stateless army of instances. You upgrade one, test it, and if it works as expected create an image to generate new clones. If it doesn’t work, you terminate it — a few pennies poorer but the lights are still on.
Amazon Machine Images (AMIs) are grossly underused among the clients I’ve worked with. You can have an inventory of different OSes, stacks and applications at different version numbers all stored as AMIs, waiting patiently to get spun up into live instances whenever you need them.
Whenever making a build change, always ask “Should I create an AMI?” before just doing it anyway.
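Scripting that habit is a one-liner with boto3; the instance ID and naming scheme here are illustrative:

```python
import datetime

import boto3

ec2 = boto3.client("ec2")

# Bake an image from the upgraded, tested instance so the new build
# can be cloned on demand.
image = ec2.create_image(
    InstanceId="i-0123456789abcdef0",
    Name="web-stack-v42-" + datetime.date.today().isoformat(),
    Description="Web stack after the v42 upgrade",
    NoReboot=True,  # avoids bouncing a live instance (snapshot may be less consistent)
)
print(image["ImageId"])
```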
2. Managed application upgrades
If you’re on Google’s App Engine or Amazon’s Elastic Beanstalk, these environments create a safe space to do tightly-controlled upgrades with reliable rollbacks and versioning baked in. These are almost always a better way to deploy apps than juggling EC2 instances and really handle a lot of the administrative pain for you. These are fantastic services that developers fall in love with quickly.
3. Complex project upgrades
This is where on-premise IT often scoffs at the cloud believers — “You could never perform an Exchange/PeopleSoft/Dynamics/[insert horrific software here] upgrade in the cloud!” they proclaim. Well, we can and we do, more quickly and with better results than the equivalent on-premise plate-spinning spectacle.
The reason is we have CloudFormation — which I’ve come to believe is the greatest thing since Python. Even if your upgrade has a mix of server types, databases, firewalls and security changes, we can script the upgrade in CloudFormation and safely test by spinning up entire cathedrals of virtual hardware with the new version. Once it passes the tests, we just point users to the new stack. It’s a beautiful thing.
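In boto3 terms, the pattern is roughly the following sketch: stand up a complete candidate environment from the same template, wait for it, test it, and only then repoint users. The stack name, template URL, and parameters are hypothetical:

```python
import boto3

cfn = boto3.client("cloudformation")

cfn.create_stack(
    StackName="erp-upgrade-candidate",
    TemplateURL="https://s3.amazonaws.com/my-bucket/erp-stack.yaml",
    Parameters=[{"ParameterKey": "AppVersion", "ParameterValue": "9.2"}],
    Capabilities=["CAPABILITY_IAM"],
)
# Wait for the full environment, run the test suite against it, and only
# then flip DNS (e.g. a Route 53 alias) from the old stack to the new one.
cfn.get_waiter("stack_create_complete").wait(StackName="erp-upgrade-candidate")
```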
The reason I’m fascinated by versioning is because the process is central to getting new software out the door. As a lifelong Product Manager and entrepreneur, ‘releasing stuff’ is the most important thing a development team can do for me and, by extension, my customers.
In most IT shops, the traditional versioning and upgrade cycle becomes a reason not to do things. But I’m more interested in working in a cycle that makes the lack of change the exception, not the rule. I want to ferret out anything that doesn’t change and increment its version number just to make a point. User feature requests never stop — if they do, our product stinks. And we need to keep delivering — if we don’t, our product will stink.
Technology managers often see a contention between change and stability. Their empire is a factory where widgets roll off an assembly line and anything that threatens the widget-rolling is bad. The problem is that just being operational doesn’t mean you’re producing anything of value to the customer.
We are standing up whole operations where virtual conveyor belts run with nothing on them. But in software, the feature releases are the widgets. We need to turn this thinking on its head — when you release nothing, you produce nothing.
Keeping the lights on and ‘being stable’ is expected — it’s not an excuse for releasing nothing.
Focus on features — not versions — when building products in the cloud was originally published in A Cloud Guru on Medium.
Hello Retail, by the team at Nordstrom, is a well-deserving winner of the competition.
Hello Retail is a proof-of-concept Serverless architecture for a retail store. The team at Nordstrom built the project to experiment with Serverless Event Sourcing.
Nordstrom is an early adopter of Serverless architectures. The team has built Serverless microservices for a number of production systems. The use cases for Serverless include API backends and stream processing.
Microservices allowed Nordstrom to create smaller, cohesive services. When these microservices need to interact, one service calls another’s API. But this approach creates code and operational dependencies between microservices.
Code dependencies created by calling other services create complexity. The caller has to know which dependent services to call and how to call them. This becomes complex to manage in code as the number of dependencies grows.
Operational dependencies between services can affect the performance and availability of the application. Services that call an API depend on the performance of that API. Increased latencies or failures in one service will impact other services.
The solution to these problems is to reverse these dependencies by using events. Creating services that produce and consume events allows you to decouple them.
Event Sourcing is a well understood solution to this problem. But applying this solution to a completely Serverless application is new.
The team at Nordstrom built Hello Retail with one scenario in mind: a merchant adding a product to their store.
When a product is added to the store, two things need to occur. A photographer needs to take a photo of the product. After this, customers should see the new product with the new photo in the product catalog.
The Hello Retail project solves this problem with events. The three major events in this scenario are:
Various microservices in the system produce and consume these events. A central Event Log stream connects these producers and consumers together.
The best way to understand a system that uses Event Sourcing is to follow the flow of events. Hello Retail has two main event flows: photographer registration and product creation.
Hello Retail requires a database of photographers. But, the system does not have a traditional Create Photographer API. Instead, the front-end creates a Register Photographer event.
To create the event, the front-end calls an API endpoint that triggers a function. This function writes the new event to the central event stream.
A second function is listening for the Register Photographer event. This function uses the event data to write a new photographer into the database.
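As a rough sketch, that producer/consumer pair might look like this in Python. The stream name, table name, and event shape here are my own placeholders, not necessarily what Hello Retail uses:

```python
import base64
import json

import boto3

kinesis = boto3.client("kinesis")
photographers = boto3.resource("dynamodb").Table("photographers")  # placeholder name

def register_photographer(event, context):
    """API-triggered producer: writes a Register Photographer event to the stream."""
    body = json.loads(event["body"])
    kinesis.put_record(
        StreamName="event-log",  # placeholder stream name
        Data=json.dumps({"type": "register-photographer", "data": body}),
        PartitionKey=body["id"],  # assumes the payload carries an id
    )
    return {"statusCode": 202}

def materialize(event, context):
    """Stream consumer: writes new photographers into the database."""
    for record in event["Records"]:
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        if payload["type"] == "register-photographer":
            photographers.put_item(Item=payload["data"])
```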
The product creation process takes this architecture a step further. This process spans multiple microservices and events.
As before, there is no Create Product API call; to create a new product, the front-end raises a New Product event. When a New Product event is written to the event stream, two functions are triggered.
The Product Service writes product information to the products and categories databases. This allows customers to view the new product in the product catalog.
The Photograph Management Service assigns a photographer to take a photo of the new product. It is important to note here that the Product Service did not make a direct call to the Photograph Management service to initiate this process.
So without a direct call, how does the Product Service know when a photo of the new product has been taken?
When the photo of the new product has been taken, the Photograph Management service creates a New Photo event. The event triggers a function in the Product Service which updates the database with the new photo.
This architecture has many benefits as previously discussed. But there are also a number of challenges that must be overcome.
Hello Retail uses a Kinesis stream as the central Event Log. The consumers of the stream process events in batches from the end of the stream.
If there is an error with the consumer, the failed batch will remain at the end of the stream. The consumer will retry processing the batch until it is fixed or the events expire (configurable up to 7 days).
In an active system, events will continue to be added to the stream while the consumer is not processing events. This creates a backlog of unprocessed events called a log jam.
Poison pill data is a common cause of log jams. These are malformed or unexpected events on the stream. These events need to be removed from the stream and stored for manual processing.
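A minimal way to guard a consumer against poison pills, assuming a hypothetical dead-letter bucket, is to catch failures per record and park the offender for manual processing rather than fail the whole batch:

```python
import base64
import json

import boto3

s3 = boto3.client("s3")
DEAD_LETTERS = "hello-retail-dead-letters"  # hypothetical bucket name

def process(payload):
    ...  # your real event handling

def handler(event, context):
    for record in event["Records"]:
        raw = base64.b64decode(record["kinesis"]["data"])
        try:
            process(json.loads(raw))
        except Exception:
            # Park the bad record for manual inspection instead of
            # failing the batch and jamming the stream.
            s3.put_object(
                Bucket=DEAD_LETTERS,
                Key=record["kinesis"]["sequenceNumber"],
                Body=raw,
            )
```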
Even with careful handling of events, sometimes log jams occur. When the consumer is fixed the unprocessed events will be processed automatically. But what happens when there is a logic error in the consumer?
In a system using Event Sourcing there are two sets of data, the Application State and the Event Log. Unlike a traditional system, it is the Event Log, not the Application state, that is the critical data to manage.
A system that is employing Event Sourcing should be able to rebuild the entire Application State from the Event Log at any time. Version control and an accounting ledger are examples of systems that use Event Sourcing.
So what happens when there is a logic error in a consumer? After the logic error is fixed, old events can be replayed through the consumer. The fixed consumer can then rectify the application state.
Hello Retail does not maintain a historical log of events. As a result, events cannot be replayed through consumers. This architecture needs a mechanism to persist events and replay events.
All events in Hello Retail are processed asynchronously. This introduces eventual consistency into all reads in the system.
Eventual consistency can be challenging to handle correctly. Systems with eventual consistency require a user experience that reflects this characteristic.
A central Event Log presents interesting security challenges. A central Event Log will include events that contain private information. In a production system, microservices may be authorised to access only a subset of events or event data.
A system to protect events and event data will be required to take this proof-of-concept to production. Nordstrom is investigating a system to encrypt data on the stream. Controlling the ability to decrypt data will allow Nordstrom to control which services can access events.
This project solves a common problem teams encounter when adopting microservices. It is a great starting point for Event Sourcing in a Serverless architecture.
The team at Nordstrom needs to solve three problems before this is production ready: persisting and replaying events, handling eventual consistency, and securing event data.
I am confident the great team at Nordstrom will be able to develop solutions to these problems.
I want to thank the team at Nordstrom for creating Hello Retail and sharing it with the community. It is a great example of applying a well understood architectural pattern to a Serverless project.
The Nordstrom serverless team is hiring talented developers with a passion for learning and trying new things. If you’re interested, drop them a line at firstname.lastname@example.org and let them know what you think of Hello, Retail!
If you want to read more on Serverless don’t forget to follow me @johncmckim on Twitter or Medium.
Imagine that you’ve secured VC funding for a new concept called Muscle Unbound. Silicon Valley refers to your concept as the AirBNB for fitness — homeowners can rent out their exercise equipment when it’s not being used.
In preparation for the big summer launch, you’ve started deploying your cloud architecture and finalizing the design of a mobile UX. The entire platform is coming together so quickly that Werner Vogels is calling you on Chime about a keynote presentation at re:Invent — and now is leaving messages on your Alexa.
Back in reality, you’re the technical lead for a national gym chain with 1000 locations. The company is planning to introduce the exact same concept — but you have to make it work with the existing on-premises systems.
Your company stores all transactional data in an Oracle nightmare, accounting data in PeopleSoft, member logins through a third-party application, your product data arrives lazily through mainframe batches, and there’s a security governance team approving code releases monthly.
Every one of these platforms will be touched as part of your project implementation. Can you hear that sound? That, my friend, is the airy sound of the candles burning on your retirement cake and all hope evaporating — long before version one is ever rolled out the door.
This experience is common to anyone who’s worked in a reasonably large company — with the added pleasure of being security-slapped and Docker-blocked to the point where all-day meetings seem productive. After a while, you become obsessed with the idea that there’s a better way. You’ve read about large organizations like Google, Amazon and Facebook shipping code like a start-up, although they have the advantage of employing more engineers than Starbucks has baristas.
While trying to find a better way, you might’ve heard about using micro-services to strangle the monolith. While it’s inspiring, it’s not at all clear how you get there. Even with the cloud’s latest suite of goodies, it’s hard to strangle something you can’t get your hands around — and the monolith fights back. Surprisingly, a monolith’s survival usually has more to do with the people than the technology involved.
The behavior and interaction of teams is a big driver of dysfunctional infrastructure design. When something fails, whatever the management group lacks in knowledge they often make up for with loud opinions. Based on how individuals and teams are incentivized, the technical teams will often decide that it’s much safer for their careers to release less often, resist change, and avoid failure at all costs. The dynamic of the two groups results in monolith-building.
Oh, how we laugh at those old school companies with their IBM contracts. We picture how they have one special room with tons of air conditioning housing one big computer. We mock how it’s nursed by an army of middle-aged, well-dressed engineers who still use a Casio FX calculator and a pencil. Too smart for a job at the Geek Squad and too scared of heights to become cable installers — they are the sworn protectors of the mainframe.
In reality, we all build mainframes everywhere we go — no matter how small an application starts. The team adds features, bolts on unexpected interfaces, and lassos crap around something that was once nimble. The monolith is a virtual mainframe — an unmovable black box of ordered chaos that always arises out of corporate systems.
You laughed at the actual mainframe programmer. But now you’re policing who gets to interface with your system, planning downtime — and dammit — you’ve got a pencil in the top corner pocket of your pressed white shirt. Richard Matheson would be proud.
When the tech moved from mainframe to client-server, then to n-tier and mobile, the premise was to move the work away from some central source. The process of decentralization itself supposedly breaks apart this rigidity, magic happens, and cue the end credits. But it doesn’t, it hasn’t, and it won’t.
The virtual mainframe of having one central system is now spread across lots of machines. It’s still there — hardening itself with every passing day. How did it survive when we thought we watched it die? How is this happening all over again?
I’ve pondered this question extensively while pretending to watch The Crown. There are three things I’d like to mention as background in my evolving theory of why monoliths occur:
This pattern repeats itself in businesses over and over. You raise the golden goose, it’s laying eggs, you keep feeding it. Over and over. The architecture always ends up looking the same — a hub-and-spoke diagram with little boxes emanating from the giant monolith in the middle. Here we go again.
While we’ve been feeding our own private monster, the IT world has gone from mainframe and in-house architecture to open source, cloud-based and disparate solutions, all while the rate of change has accelerated. The fundamental fact is that this monolith will never work with these newer paradigms, and your company will never be able to keep up with customer technical demands.
I’m convinced this is why new start-ups are effectively trouncing the old dinosaurs — it’s not that they have beanbags in the office and pajama days on Fridays.
Why did Lemonade think of AI-centric insurance claims that are processed in 3 seconds and not Geico? Geico has a monolith — managed by a British lizard — which prevents radical change.
Why isn’t a single major traditional retailer beating Amazon? You can rest assured that the heart beating in the middle of Sears, Macy’s and Nordstrom’s is a cold, concrete monolith that will never be delivering the hundreds of features a day that Amazon is shipping.
That’s why. And beanbags.
Getting past this problem requires some rethinking of how things work, because we cannot build truly distributed, agile systems this way.
1. Commit to starving the monolith
Don’t kick the can down the road and decide you’ll just add some more technical debt this one time. From now on, the monolith doesn’t get fed. That’s it. I know your Kanban board is growing relentlessly and you want a promotion, but we have to draw the line in the sand today.
Conway (brilliant, remember) also observed in systems design behavior that there’s never enough time to do it right but always enough time to do it over. So let’s just do it right for once.
2. “Two pizza teams”
Amazon is the only large company I’ve known that slayed the monolith violently and directly. And from the Amazonians I know, it sounds like the two-pizza team concept was key.
Let’s steal that. In your project, get the right 8–10 people together and own every single part of your solution. Don’t depend on a dozen other teams and getting prioritization in their queues because dinosaurs will be walking the earth again before your customers get any software. And let’s not wheel out the usual excuses of who’s going to be upset.
3. Build generically
When you’re building the shipping label system for your company, imagine it’s actually a start-up for shipping labels that will have thousands of external users. There is no existing monolith to connect into. You have to build everything you need to support your user base and their wide variety of systems. On your virtual private island of pristine code beaches, only a handful of APIs will connect to these systems of which you know nothing. Make those APIs rock.
If you’re not convinced, think about PayPal — they have a widget embedded on millions of websites and successfully manage payments with no idea how any of their customers operate technically. Make sure you are always building the consumable widget or service that doesn’t know how its consumers work. Be RESTful and use the standards out in the wild that will help you.
4. Learn to embrace eventual consistency
In science fiction, HAL, the Matrix and the Terminators were all monoliths — they were single systems that knew everything going on real time. But notice how they could never get any upgrades out? The T-800 series was excellent at pursuing the Connors, but in reality Skynet could never have deployed a successful mobile app.
In our new world, our independent systems will be slightly out of sync with each other but that’s okay. Only when we realize that we don’t need to know the exact number of sticks of spaghetti in every retail store can we allow the zen-like feeling to wash over us. We are going to be building lots of small systems with their own independent data stores that don’t always know the score … just yet.
5. Use your cloud superpowers
Working on-premise encourages you to repeat the same behaviors. Move your code to GitHub. Work remotely. Use Slack. Try decomposing into serverless functions. The sheer lack of compatibility between these approaches and monolithic behaviors is the beginning of the revolution. It will feel odd at first — but as you build more and more away from the old, it will start to decay and die.
Companies that have monoliths sporadically realize the technical noose is tightening and occasionally launch initiatives that clearly come from business people and not developers. Their 5-year plans to slowly migrate away lose steam after 6 months. The noose tightens. And the plan to document the old system and build the new either duplicates the problem in a different version and newer hardware, or it never gets funded because the consultant’s analysis was so expensive.
Starving the monolith ultimately leads to dozens, hundreds and thousands of other systems, functions and processes that slowly but surely take over. The King is Dead, but there’s no single point when the heart stopped, and no single point when the revolution started — we just put the monolith out to pasture and stopped feeding.
Don’t strangle your monolith when migrating to the cloud — starve it to death was originally published in A Cloud Guru on Medium, where people are continuing the conversation by highlighting and responding to this story.
What’s the Community Saying About Serverless?
The Top 5 Blogs: Written by the community, for the community
New Course! Learn how to build scalable and available serverless applications using The Serverless Framework with GraphQL.
When we talk about serverless architectures, we often talk about the natural fit between serverless and microservices. We’re already partitioning code into individually-deployed functions — and the current focus on RPC-based microservices will eventually change into event-driven microservices as serverless-native architectures become more mature.
We can generally draw nice microservice boundaries around components in our architecture diagrams. But that doesn’t mean we can actually implement those microservices in a way that achieves two important goals: 1) loose coupling between services, and 2) low end-to-end latency in our system.
This blog is the first part of a series exploring the missing pieces to achieve a vision for loosely-coupled, high-performance serverless architecture, using AWS as an avatar for all serverless platform providers.
In this post, I’ll focus on loose coupling. In particular, I propose that the lack of Service Discovery as a Service as part of a provider’s ecosystem causes customers to implement their own partial solutions.
I’ll define loose coupling as the ability to change the resources that a given function uses after deployment. There are two important use cases for this:
In serverless deployments without Service Discovery as a Service, the functions exist in the same namespace. They are connected to other functions within their deployment through environment variables, which are fixed at deployment time. Updating one function requires an update/deploy to all callers — and every function must be deployed with the full physical resource ids that it uses.
Service discovery allows us to keep our code from having to know exactly where to find the resources it depends on. An important part of this is the service registry, which gives us the ability to turn a logical name (e.g., UsersDatabase) into a physical resource id (arn:aws:dynamodb:us-east-1:123456789012:table/UsersDatabase-MVX3P).
If this mapping is known at deployment time, serverless platform providers generally have a way of including it in the deployment; for example, environment variables in AWS Lambda functions. But these mechanisms don’t allow for change without redeploying the function, so they don’t fulfill our need.
My experience has been that everybody who has implemented a serverless system has built their own way of solving this — which is pretty much the definition of undifferentiated heavy lifting. Any remote parameter store that is updatable at runtime will suffice. The EC2 Systems Manager parameter store is a good option.
At iRobot, we have solved this by using our tooling to inject a DynamoDB table into every deployment (i.e., it writes it into the CloudFormation template) to act as a runtime-updatable key-value store. The auto-generated name of this table is injected into each Lambda function’s environment variables using the CloudFormation resource support for env vars.
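A lookup against that kind of registry might look roughly like this. The env var name and the name/arn item shape are my illustrative choices, not iRobot’s actual library:

```python
import os

import boto3

# The table name is fixed at deploy time via an env var, but the
# mappings inside the table can change at runtime without a redeploy.
registry = boto3.resource("dynamodb").Table(os.environ["REGISTRY_TABLE"])

def lookup(logical_name):
    """Resolve a logical name into a physical resource id."""
    return registry.get_item(Key={"name": logical_name})["Item"]["arn"]

# e.g. lookup("UsersDatabase")
#   -> "arn:aws:dynamodb:us-east-1:123456789012:table/UsersDatabase-MVX3P"
```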
With service discovery in a traditional architecture, the service registry provides a mapping from logical name (e.g., “A”) to a physical resource id (e.g., v1.a.domain in Deployment 1, v2.a.domain in Deployment 2). The isolation by VPC or subnet provides separation between the deployments.
To step back a bit further, there’s another advantage provided by service discovery mechanisms in traditional microservice architectures: separation of environments.
Infrastructure as a Service offerings like EC2 have comprehensive mechanisms for separating groups of resources. On AWS, this is accomplished at the highest level with Virtual Private Clouds (VPCs), which, as the name implies, completely partition EC2 resources into separate silos. Within a VPC, subnets can be used to further isolate instances from each other.
This separation is leveraged to create independent sets of service discovery information, such that the service discovery information itself can have a well-known name, rather than also needing some sort of lookup. For example, it can be accomplished through DNS, which works because the networks of different VPCs are isolated, so the DNS lookups for the same name in each can have different results.
Another option is a configuration manager like ZooKeeper, etcd, or Consul — which works because the configuration manager deployments in different VPCs don’t know about each other. As a result they don’t conflict, but have a well-known name within each VPC/subnet.
As noted by Martin Fowler, this separation isn’t currently present in any provider’s offering. On AWS, Lambda functions can be run in a VPC, but that is heavy-handed and complicated just to gain logical separation between the functions. This means that, for whatever remote parameter store is being used, there still needs to be a mechanism for separating those parameters between deployments.
With EC2 Systems Manager parameter store, this means the Lambda functions need to understand prefixing, and that prefix needs to be delivered to the function through its env vars. For iRobot’s solution, we create a DynamoDB table with each deployment, inject its name into an environment variable in every Lambda, and we have a library, injected into each packaged Lambda code, that uses it as a parameter store.
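For the SSM variant, a sketch of a prefixed lookup could be as simple as the following. PARAM_PREFIX is a hypothetical variable name standing in for whatever your deployment tooling injects:

```python
import os

import boto3

ssm = boto3.client("ssm")
# Deployment-specific prefix delivered through an env var,
# e.g. "/deployments/env1" (both names are placeholders).
PREFIX = os.environ["PARAM_PREFIX"]

def lookup(logical_name):
    response = ssm.get_parameter(Name=f"{PREFIX}/{logical_name}")
    return response["Parameter"]["Value"]
```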
Azure actually provides this capability in Azure Service Fabric, but it is currently not available for use with Azure Functions.
With Service Discovery as a Service, functions are tagged and make a non-namespaced call to the service discovery service (e.g., Get(“A”)), which uses the tag to index into the right namespace (e.g., Env1). At deployment time, the functions need only be tagged with an immutable identifier.
The functionality that is really needed is a new feature or service as part of the providers’ platforms. We need Service Discovery as a Service (SDaaS) — or more precisely, Service Registry as a Service.
What would this look like? I see it as relatively simple; a key-value store with multiple distinct namespaces. But the crux is this: when making a Get call, the namespace is chosen based on some property of the caller, rather than selected explicitly. Of course, explicit selection would also be available.
For example, a standalone version of this service could use the IAM role of the caller. This would have the added advantage of being usable by server-based implementations as well. A version integrated into AWS Lambda could leverage the recently-added tagging functionality.
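To make the idea concrete, here is a toy, in-memory model of the proposed behavior. No such managed service exists; this is purely illustrative of selecting the namespace from the caller’s identity rather than from an argument:

```python
class ServiceRegistry:
    def __init__(self):
        self.values = {}   # (namespace, logical name) -> physical id
        self.callers = {}  # caller identity (IAM role, tag) -> namespace

    def get(self, caller, logical_name):
        namespace = self.callers[caller]  # implicit, never passed explicitly
        return self.values[(namespace, logical_name)]

registry = ServiceRegistry()
registry.callers["role/env1-functions"] = "Env1"
registry.callers["role/env2-functions"] = "Env2"
registry.values[("Env1", "A")] = "v1.a.domain"
registry.values[("Env2", "A")] = "v2.a.domain"

# Two deployments issue the same Get("A") and resolve differently:
assert registry.get("role/env1-functions", "A") == "v1.a.domain"
assert registry.get("role/env2-functions", "A") == "v2.a.domain"
```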
To be fully functional as SDaaS, the service would have to allow phased rollouts of changes to the namespace selections. That is, it should support blue-green updates to the values that a given caller receives.
Whatever form this service takes, it would eliminate the need for customers to build their own solutions, allowing them to focus on the tasks specific to their needs and reducing the barrier to entry in the serverless space. As a critical step towards feature parity with traditional architectures, Service Discovery as a Service is the missing lynchpin for serverless.
Update: Tim Wagner, the GM of AWS Lambda and API Gateway, asked some good questions and I wrote a long response that forms an appendix to this post.
Service Discovery as a Service: The missing serverless lynchpin was originally published in A Cloud Guru on Medium, where people are continuing the conversation by highlighting and responding to this story.
In this series of posts, I’ll explore the key missing pieces using AWS as an avatar for all serverless platform providers.
Service Discovery is an essential part of a modern microservice architecture. The lack of Service Discovery as a Service as part of a provider’s ecosystem causes customers to implement their own partial solutions.
Because FaaS is billed by execution time, time spent waiting is money wasted — and synchronous invocation of other functions means double billing. However, despite steps in the right direction, asynchronous call chains are not sufficiently supported by providers’ platforms.
Event-driven architectures are a more natural fit for FaaS and serverless, but there are key difficulties with existing services, such as limited fanout and lack of checkpointing support, that prevent robust implementations.
On the cloud side, a microservice should control the APIs it exposes to other services and to clients. On the client side, there should be one cloud endpoint exposing an API that brings together all the services. These two goals are in conflict; existing API gateways don’t facilitate a good solution.
The ability to perform a controlled, phased rollout of new code is essential to operations at scale. Existing serverless platforms don’t provide this functionality at either the FaaS or API gateway level, and we need it in both places.
Permissions in serverless architectures are highly dependent on the providers’ IAM systems, which may use some mix of role-based access control, policy-based access control, and perhaps other schemes. These can present difficulties by coupling together infrastructure components between microservices.
In IaaS, availability zones allow customers to build resiliency in the face of provider incidents without incurring the high overhead of cross-region architectures. Serverless platforms are usually region-wide and therefore resilient in the face of incidents in the underlying IaaS, but they need an availability-zone-like concept to allow customers to be resilient in the face of software problems in the serverless platform itself.
A vision for loosely-coupled, high-performance serverless architecture was originally published in A Cloud Guru on Medium, where people are continuing the conversation by highlighting and responding to this story.
“Those who are crazy enough to think they can change the world usually do.” ― Steve Jobs
A Cloud Guru will teach you how to build an Alexa skill so you can help change the world. Join the Alexa “Speak Up!” Challenge by simply publishing a skill that amplifies a positive message or cultivates awareness and understanding of a cause — Speak Up!
Get inspired by finding a cause that you care about — consider building a skill that advocates for a non-profit, connects the local community, or supports individuals fighting a personal battle.
Alexa is an incredible platform for amplifying messages of social change and justice into the homes of millions of users worldwide.
Amazon owns 70 percent of the voice-enabled speaker market, and analysts predict that more than 35 million people in the US will use one of these stand-alone devices at least monthly in 2017.
The Alexa ‘Speak Up!’ Challenge is open to developers worldwide.
The panel consists of several Alexa Champions — individuals formally recognized by Amazon as some of the most engaged developers and contributors in the community.
The winners will be announced on our Facebook page on August 31st!
Every month, Amazon offers developers of Alexa Skills a free T-shirt once they publish a skill. All you need to do is fill out their form with the name of your published skill and submit to Amazon!
There’s a new Alexa #developer t-shirt for May! Publish a skill, get a shirt. #nodejs #code templates available: https://t.co/PTAaYXt8IA 👕
Below are a few resources to help you publish your first Alexa Skill!
Don’t have an Echo? The Alexa Skill Testing Tool (EchoSim.io) is a browser-based interface that allows developers to test their skills in development.
A Free Introduction to Alexa: The “Alexa Course for Absolute Beginners” allows anyone to learn how to build skills for Alexa. The beginner guide to Alexa will walk you through setting up an AWS account, registering for a free Amazon Developer account, and then building and customizing two Alexa skills using templates.
Dive Deeper with Alexa Development: A Cloud Guru also offers an extended version of the course for developers that want to extend their skills. Learn how to make Alexa rap to Eminem, how to read Shakespeare, how to use iambic pentameter and rhyming couplets with Alexa, and more.
AWS Promotional Credits for Alexa Developers: Developers with a published Alexa skill can apply to receive a $100 AWS promotional credit and can also receive an additional $100 per month in AWS promotional credits if they incur AWS usage charges for their skill — making it free for developers to build and host most Alexa skills.
User Groups: Join a local user group! The AWS User Group in South Wales and the Alexa User Group in Richmond, Virginia (RVA) are both offering free workshops on building Alexa skills. As a bonus, anyone that attends their Alexa workshop and publishes 3 skills in 30 days will also receive a free Amazon device!
Build an Alexa Skill to “Speak Up!” for a social cause and win a lifetime subscription was originally published in A Cloud Guru on Medium, where people are continuing the conversation by highlighting and responding to this story.
In the wake of Serverlessconf 2017 in Austin, there’s been an increasing number of discussions about today’s cold reality of serverless. While we can see the glory of serverless manifesting in the not-too-distant future, the community still finds it difficult to test, deploy, debug, self-discover, and generally develop serverless applications.
The discussion has been amplified in recent days with tweet storms and the great threads on the Serverless Slack channel from Paul Johnston that prompted this post. The common sentiment is that the difficulty with serverless gets more acute when developing applications composed of multiple sets of functions, infrastructure pieces, and identities evolving over time.
On the one hand, the serverless approach to application architecture does implicitly address some of the high-availability aspects of service resiliency. For instance, you could assume — without empirical evidence — that AWS transparently migrates Lambda execution across Availability Zones in the face of localized outages. This is unlike a more traditional VM/container model, where you must explicitly distribute compute across isolated failure domains and load balance at a higher logical level (e.g. ELB and ALB).
While this intrinsic reliability is undoubtedly a good thing, overall resiliency isn’t so easily satisfied. Take for instance the canonical “Hello Serverless” application: an event-based thumbnailing workflow. Clients upload an image to an S3 bucket, a Lambda function handles the event, thumbnails the image, and posts it back to S3. Ship it.
Except, how do you actually test for the case when the S3 bucket is unavailable? Or can you? I’m not thinking of testing against a localhost mock API response, but the actual S3 bucket API calls — the bucket you’re accessing in production, via a dynamically injected environment variable.
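The closest you can get on a developer machine is forcing the error path through botocore’s Stubber, which is, of course, exactly the kind of mock that falls short of exercising the real bucket:

```python
import boto3
from botocore.exceptions import ClientError
from botocore.stub import Stubber

def handle_outage(err):
    ...  # the fallback path you actually want to exercise

s3 = boto3.client("s3")
stubber = Stubber(s3)
# Make the real client surface a 503, as if S3 itself were down.
stubber.add_client_error(
    "put_object",
    service_error_code="ServiceUnavailable",
    service_message="Service is unavailable.",
    http_status_code=503,
)
with stubber:
    try:
        s3.put_object(Bucket="thumbnails", Key="cat.jpg", Body=b"...")
    except ClientError as err:
        handle_outage(err)
```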
Another example is when you have two Lambda functions, loosely coupled. The functions are blissfully ignorant of one another, although they share a mutual friend: Kinesis. In this use case, “Function A” publishes a message, perhaps with an embedded field whose value is another service’s event format (like an S3 triggering event) that’s consumed by “Function B”. While there’s no physical coupling, there’s potentially a deep logical coupling between them — one which might only appear at some future time as message contents drift across three agents in the pipeline.
How can we guard against this? How can we be certain about the set of functions which ultimately defines our service’s public contract?
Serverless is an implementation detail, not an architectural pattern.
The great thing about non-functional requirements is that they’re … non-functional. They speak to a system’s characteristics — how it should be — not what it should do, or how it should be done. In that sense, non-functional requirements both have nothing and everything to do with serverless.
The slide above is from Peter Bourgon’s excellent presentation on the design decisions behind go-kit, a composable microservice toolkit for Go. The concerns listed apply equally to a JVM monolith, a Go-based set of microservices, or a NodeJS constellation supported by FaaS. If you’re running something in production, those *-ilities lurk in the shadows whether or not they’re explicitly named.
In that sense, serverless is less a discontinuity with existing practice and more the next stage in the computing continuum — a theme emphasized in Tim Wagner’s closing keynote. It’s a technique that embeds more of the *-ilities into the vendor platform itself, rather than requiring secondary tools. Serverless enables us to deliver software faster and with fewer known unknowns — at least those that are externally observable.
Although serverless offloads more of these characteristics to the vendor, we still own the service. At the end of the day, each one of us is responsible to the customer, even when conditions change. We need to own it. And that means getting better at Ops. Or more specifically — cloud-native development.
For many of us, the end result of our furious typing is a cloud-native application. In more mature organizations, our software constructs go through a structured CI/CD pipeline and produce an artifact ready to ship. This artifact has a well-defined membrane through which only the purest configuration data flows, and all dependencies are dynamic and well behaved.
On a day-to-day basis, though, there is often a lot of bash, docker-compose, DNS munging, and API mocks. There is also a lot of “works on my machine” — which may be true, at least at this instant — but probably doesn’t hold for everyone else on the team. And it definitely doesn’t provide a lot of confidence that it will work in the cloud.
The only way to gain confidence that a feature branch will work in the cloud is to run it in the cloud.
Operations is the sum of all of the skills, knowledge and values that your company has built up around the practice of shipping and maintaining quality systems and software. — Charity Majors, WTF is Serverless Operations
If everyone on the team is developing their service feature branch in the cloud, complete with its infrastructure, then we’re all going to get better at ops. Because it’s development and ops rolled together. And we’re all going to share a sense of Environmental Sympathy.
Environmental Sympathy, inspired by Mechanical Sympathy, is about applying awareness of our end goal of running in the cloud to the process of writing software.
While it’s always been possible to provision isolated single-developer clusters complete with VMs, log aggregators, monitoring systems, feature flags, and the like, in practice it’s pretty challenging and expensive. And perhaps most aggravating, it can be very slow. Short development cycles are critical to developer productivity and that’s not really a hallmark of immutable, VM-based deploys.
Serverless, precisely because it’s so heavily reliant on pre-existing vendor services and billed like a utility, makes it possible for every developer to exclusively develop their “service” in the cloud.
The service can have its own persistence engine, cache, queue, monitoring system, and all the other tools and namespaces needed to develop. Feature branches are the same as production branches and both are cloud-native by default. If during development the *-ilities tools prove too limiting, slow, or opaque, developer incentives and operational incentives are aligned. Together we build systems that make it easier to ship and maintain quality systems and software — which also helps to minimize MTTR.
Serverless, for both financial and infrastructure reasons, makes it possible to move towards cloud-native development and Environmental Sympathy. It represents a great opportunity to bring Dev and Ops (and QA, and SecOps) together. This allows us to move from “worked on my machine” to “works in the cloud — I’ll slack you the URL.”
From #NoOps to #WereAllOps.
Scaling the serverless summit requires environmental sympathy with dev & ops was originally published in A Cloud Guru on Medium, where people are continuing the conversation by highlighting and responding to this story.
There’s no self-help group for someone like me — technology is in my blood. I owned my first computer when I was 5, and from then onward I collected technology like most kids had stuffed toys.
I started writing for a local computer magazine at 12, then a national UK magazine. When I got to college, I had a regular column on a British platform called Oracle Teletext.
It would have been easier to do cool kid things and hang out at the mall but spending every waking hour thinking about all the problems that can be fixed with machines kept me too busy.
After earning a CS degree at college, starting work as a software developer was a jarring experience. Having to work in teams of developers is not something they taught at school, nor did they train anyone on how to make sense of ancient convoluted systems. At school, you start from a blank screen and build something beautiful all by yourself, which very rarely happens at work.
My group built electronic trading software for hedge funds, portfolio managers, and the like. Most trading happens at the open and close like a twice-daily Black Friday that stresses the back-end. Our infrastructure was killing us as the platform grew and the amount of data choked any attempt to solve the problem without taking the system offline.
Best of all, our users were traders — and they were assholes on a good day. Their needs changed constantly and were communicated telepathically. They completely lost it during an outage and had no tolerance for bad data, missing functions or sloppy UI.
Smart, volatile and demanding users — working for them ended up being the very best training for today’s user base, and it taught me several life lessons:
In the mid-2000s, I moved to the Bay Area and worked in startups for several years as a Technical Product Manager. This was during the phase when everybody who previously had an idea for a website now had ideas for mobile.
In the Bay Area, a Product Manager is a coder who is yelled at by customers and also produces road-maps nobody uses.
As we moved into mobile, our users were regular people with cell phones, and our competitors were either well-funded startups or the established technical luminaries. Our development teams were much smaller, budgets were tighter and yet our epic aspirations didn’t seem to notice we were horribly equipped for success.
Mobile made scaling problems insurmountable for start-ups — buying new servers sucked up budgets, and configuring load balancers and database replication wasted development time that should have been spent perfecting the UI. And investors and founders, usually bored with the grind of their real jobs and attracted to the gold rush, were on a mission to become the next billion-dollar app with no revenue and an army of users.
At the time, there was iOS, Android, Windows and BlackBerry, all using different frameworks and languages, and it looked like these could fragment further. We were trying to put together apps that were essentially a dozen screens — something that could have been built as a .NET desktop app in a day. And yet we did manage to release apps, solve problems and build some businesses.
Sometime around 2010 it became clear to me that as a development group, we could confidently write solid applications running on machines in the same building. But deployment was difficult — and once apps hit production they weren’t performing as well.
We had been using some cloud apps for a while but hadn’t seriously used AWS until it became absolutely necessary. A client app had started to gain momentum and we didn’t have the money to scale up on-premise, so we became AWS users very quickly. It was a fortuitous but mildly alarming moment to realize we didn’t have any alternatives — but it quickly became the de facto way to build our products.
I had some lightbulb moments during this time:
In 2012 I attended the very first AWS re:Invent conference in Vegas and that changed everything. Witnessing the entire ecosystem around the platform, it was obvious that many people had been grappling with the same issues and there were a slew of great solutions available.
There was a haunting question about why nobody else was offering this — Amazon was the only game in town and either they were incredibly prescient or we were all being gleefully over-optimistic about this whole cloud thing. This lag continued for years — it gave AWS a 6-year head start, which is why its capabilities still smoke the competition.
In our shop we weren’t the first to the cloud by any measure but we embraced it wholeheartedly. Within 6 months there were a number of unexpected side-effects:
In using cloud solutions as the backbone to all the products I’ve worked on, I’ve had to step up my technical game constantly. It’s not enough to be a Product Manager with road-maps and wire-frames — I need to know reliable patterns and trusted practices to create the best technical architecture.
This has meant constant training, taking on programming projects and learning new frameworks as the environment changes. It’s also meant making a commitment to conferences and workshops, which has become an automatic line-item in my budget.
On the business side, cloud has given me the confidence to assess viability and likely cost, predict timeframes more reliably and help business partners understand where the business ideas and the technology meet. In many ways, the concepts between agile, cloud and lean are so intertwined that I often think they are different views around the same thing.
Fail fast, waste little, learn constantly and always deliver customer value — cloud is central to making this work.
There are still plenty of naysayers. I worked for some more traditional companies after the California days and it was like jumping in the DeLorean and setting the clock to ‘Fail’.
They all grappled with an aging, fragile, expensive IT infrastructure that delivered limited business value and had no hope of helping them innovate or differentiate in the future. Those companies are waiting for a generation of executives to retire and competitive threats to reawaken the appetite that once made them giants.
There are also the fakers in the industry, the ones who for years dismissed cloud, laughed at Amazon and claimed it could never work. Now they scramble to promote their own clouds with the same limited tools and restrictive contracts they had on-premise.
The me-too players like Oracle serve to bring the laggards into the cloud ecosystem but they offer nothing fundamental or game-changing to the technology. 5 years ago they said cloud wasn’t secure and now they say only their clouds are safe, so I suppose fear can drive sales in anything.
But I live by the mantra “Go where you are celebrated, not tolerated.” I’m not here to convince yesteryear’s IT professionals that our industry’s change is accelerating geometrically. I’m here because I’m committed to using the cloud and its toolbox to build the next generation of software that solves the next round of problems. I want to get to machine learning and AI, and move from onClick to onPrediction — the cloud is where all of this will happen.
So that’s my story. Most of us geeky kids who grew up with computers didn’t become Steve Jobs or Jeff Bezos but it’s been an amazing ride. The opportunities are everywhere and the future has never been brighter. My name is James. I’ve been a self-confessed cloud-oholic for the last 7 years. I don’t think that’s ever going to change.
My personal journey to the cloud from my first job on a trading floor to startups was originally published in A Cloud Guru on Medium, where people are continuing the conversation by highlighting and responding to this story.
On April 26–28 we hosted the 4th conference on serverless technologies and architectures in Austin, TX. Serverlessconf was attended by 450 serverless aficionados who listened to 35 fantastic presentations.
You can watch all the presentations on our YouTube channel right now!
Here’s a small collection of photos from the conference. Be sure to check out the Imgur album to see all of the pictures from Serverlessconf!
So that brings us to the end of Serverlessconf Austin 2017. Our sincere thanks to our speakers, attendees, and sponsors who made this conference so interesting and exciting. We love the passion in our community. It makes Serverlessconf a lot of fun to organize and run.
I also want to thank the amazing A Cloud Guru team who worked extra hard to make Serverlessconf Austin ’17 a special event. Thank you all for your hard work and infectious enthusiasm.
With that, I hope to see you at the next Serverlessconf!
I’ve been thinking a lot about testing recently. At work we have significantly increased the number of our lambda functions, due to new client applications and new features. Developing the new features isn’t a massive deal, but something has started to bug me (if you’ll excuse the pun).
I’m all for creating tests. Whether it’s true “Test Driven Development” — or whatever the testing methodology du jour is now — is immaterial to me. Sometimes in a startup, you just have to deploy something fast, and write a test later (I know, I know — but I’m just giving people who’ve never worked in a startup the real world scenarios). And sometimes, the tests never get written because you think that your use case is already caught (it isn’t).
Often tests get written because of bugs occurring in the production environment. This will always occur unless you have endless money and time — which you won’t in a startup.
Tests are vitally important.
But if you’re using the prevailing testing wisdom — serverless is hard.
Serverless architecture uses a lot of services — hence why some prefer to call the architecture “service-full” instead of serverless. Those services are essentially elements of an application that are independent of your testing regime.
An external element.
A good external service will be tested for you. And that’s really important. Because you shouldn’t have to test the service itself. You only really need to test the effect of your interaction with it.
Here’s an example …
Let’s say you have a Function as a Service (e.g. Lambda function) and you utilise a database service (e.g. DynamoDB). You’ll want to test the interaction with the database service from the function to ensure your data is saved/read correctly, and that your function can deal with the responses from the service.
Now, the above scenario is relatively easy because you can utilise DynamoDB from your local machine, and run unit tests to check the values stored in the database. But have you spotted something with this scenario? It’s not the live service — it’s a copy of it. But the API is the same. So, as long as the API doesn’t change we’re ok, right?
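For reference, that local copy (DynamoDB Local) can be driven with the same SDK calls. A minimal sketch, assuming the table was created in your test setup:

```python
import boto3

# DynamoDB Local listens on port 8000 by default; the region and keys
# are required by the SDK but ignored by the local copy.
dynamodb = boto3.resource(
    "dynamodb",
    endpoint_url="http://localhost:8000",
    region_name="us-east-1",
    aws_access_key_id="local",
    aws_secret_access_key="local",
)

def test_round_trip():
    table = dynamodb.Table("products")  # assumes setup created this table
    table.put_item(Item={"id": "42", "name": "widget"})
    assert table.get_item(Key={"id": "42"})["Item"]["name"] == "widget"
```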
To be honest, I’ve reached a point where I’m realising that if we use an AWS service, the likelihood is that AWS have done a much better job of testing it than I have. So we mock the majority of our interactions with AWS (and other) services in unit tests. This makes it relatively simple to develop a function of logic and unit test it — with mocks for services required.
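A typical mocked unit test then looks something like this, where thumbnailer is a hypothetical module holding the handler and its boto3 client:

```python
from unittest import mock

import thumbnailer  # hypothetical module under test

def test_handler_uploads_thumbnail():
    # Patch the module's S3 client so only our logic runs.
    with mock.patch.object(thumbnailer, "s3") as fake_s3:
        thumbnailer.handler({"key": "cat.jpg"}, None)
        fake_s3.put_object.assert_called_once()
```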
This is similar to when using a framework such as Rails. You shouldn’t be testing that the ORM works. That’s the ORM maintainers job, not yours. So it stands to reason that if a service provides an interface and documentation about how the interface works, then it should be fine — right?
Here’s where there is a problem with serverless… sort of. Unit tests are easy with a FaaS function because the logic is often tiny. In my view there’s a tendency toward over-reliance on mocks, but it works.
All other forms of testing are hard. In fact, I’d say we’ve possibly moved into needing a different paradigm to discuss this.
Through years of building monolithic applications, we’ve become absolutely obsessed with the idea that certain types of testing are vital — and that if we don’t have them we’re “wrong”.
So let’s just step back a bit.
We’ve actually been having the discussion about distributed systems and testing for a while. The microservice patterns have shown us that it’s not always appropriate and often expensive to try to test everything in the way we do a monolith.
The key for integration testing with a microservice pattern is that you test the microservice and its integration with external components. Which is interesting, because you’re still imagining some sort of separation here.
In Lambda, in this context, every single Lambda then needs to be treated as a microservice for testing. Which means that your function’s unit tests (with mocks) need to be expanded to integration tests by removing the mocks, and using the actual service or stubbing the service in some way.
Unfortunately not every external service is easily testable in this way. And not every service provides a test interface for you to work with — nor do some services make it easy to stub themselves. I would suggest that if a service can’t provide you with a relatively easy way to test the interface in reality, then you should consider using another one.
This is especially true when a transaction is financial. You don’t want a test to actually cost you any real money at this point!
For me, the easiest way to test a serverless system as a whole is to generate a separate system in a non-linked AWS account (or other cloud provider). Then make every external service link to a “test” equivalent, or at least limit our exposure to cost as best we can.
This is how I’ve approached it — and it relies on Infrastructure as Code to make it happen. Hence, the use of something like Terraform or CloudFormation.
But interestingly, when you go beyond a single function like this in a microservice approach, you get onto things like component testing and then system testing. Essentially testing is about increasing the test boundary each time. Start with a small test boundary and work out.
Unit testing, then integration, and so on …
But interestingly, our unit tests are already doing double duty: they test each function’s logic and exercise its boundary with external services reasonably well. So the next step is to test a combination of the services together.
But since we’re using external services for the majority of our interactions, and not invoking functions from within functions very often, then the test boundaries are actually relatively uncoupled.
Hmm… so basically, the more uncoupled a function’s logic is from other functions’ logic, the less the test boundary has to grow as we move outwards in tests.
So after good unit and integration tests on a Function by Function basis, what comes next? Is it simply end to end testing next? This becomes really interesting, since that means testing the entire “distributed system” in a staging style environment with reasonable data.
Basically, what seems to happen with a Function as a Service approach is that the suite of tests is a lot simpler than it would normally be with a monolithic or even a microservice approach.
The test boundary for unit testing a FaaS function appears to be very close to that of an integration test, versus a component test within a microservice approach.
Quick caveat: if you do lots of function-to-function invocations, then you are coupling those functions and the test boundaries will change. Functions invoking functions create a separate test boundary to worry about.
Which comes back to something else very interesting. If you build functions, and develop an event driven approach utilising third party services (e.g. SNS, DynamoDB Triggers, Kinesis, SQS in the AWS world) as the event connecting “glue” — then you may be able to essentially limit yourself to testing the functions separately and then the system.
Not exactly, but close.
I would suggest the system testing is harder. If you’re purely using an API Gateway with Lambdas behind it, then you can use third party tools to test the HTTP endpoints and build a test suite that way. It’s relatively understood.
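Such a suite can be as plain as pytest plus the requests library pointed at a deployed stage. The URL here is a placeholder:

```python
import requests

BASE = "https://abc123.execute-api.us-east-1.amazonaws.com/test"  # placeholder stage URL

def test_create_and_list_products():
    created = requests.post(f"{BASE}/products", json={"name": "widget"})
    assert created.status_code in (200, 201)
    products = requests.get(f"{BASE}/products").json()
    assert any(p["name"] == "widget" for p in products)
```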
But if you’re doing a lot of internal event triggering, such as DynamoDB triggers setting off a chain of events across multiple Lambdas, then you have to do something different. This form of testing is harder, but since everything is a service — including the Lambda — it should be relatively simple to do.
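Part of that “something different” can be replaying a synthesized trigger event at each function, which keeps the test boundary per function. The module name here is hypothetical; the event shape follows the documented DynamoDB Streams record format:

```python
import my_trigger_handler  # hypothetical module under test

# A synthesized DynamoDB Streams event, trimmed to the fields the
# handler actually reads.
fake_event = {
    "Records": [
        {
            "eventName": "INSERT",
            "dynamodb": {
                "NewImage": {
                    "id": {"S": "42"},
                    "name": {"S": "widget"},
                },
            },
        }
    ]
}

def test_insert_triggers_downstream_event():
    my_trigger_handler.handler(fake_event, None)
    # assert on whatever side effects your handler should produce
```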
The person that builds the tool for this kind of system testing with serverless will do very well. At present, the CI/CD tools we have and testing tools around it are not (quite) good enough.
When I started thinking about this article, I was expecting to figure out a lot of things around how to fit better testing regimes into our workflow.
As this article has come together, what’s emerged instead is an identification of why serverless approaches are different to monolithic and microservice approaches. As a result, I’ve realised the inherent advantages for testing of smaller bits of uncoupled logic.
You can’t just drag your old “Testing for Monoliths Toolbox” into the serverless world and expect it to work any more.
Testing in serverless is different.
Testing in serverless may actually be easier.
In fact, testing in Serverless may actually be easier to maintain as well.
But we’re currently lacking the testing tools to really drive home the value — I’m looking forward to when they arrive.
I’m often a reluctant test writer. I like to hack and find things out as I go before building things to “work”. I’ve never been the kind of person to force testing in every scenario, so I may be missing the point in some of this. There are definitely people more qualified to talk about testing than me, but these are simply thoughts on testing.
The serverless approach to testing is different and may actually be easier was originally published in A Cloud Guru on Medium, where people are continuing the conversation by highlighting and responding to this story.
Cost is often the major driver for many cloud migrations but it’s usually poorly understood in the beginning. It’s fine for startups to demonstrate the value of cloud when they’re coming from nothing, but when you’re dragging 20 years of data centers and legacy software behind you, it’s not always clear what the price tag will be.
This FAQ is similar to asking an attorney “do I have a case?”, or asking a doctor “is it serious?”. The answer depends on knowing which of the cloud services you will use and how you’ll use them — and even then, comparing the costs between cloud providers is difficult.
Upon first glance it’s just so cheap, it feels like shopping a hundred years ago — your dollar is going really far. For cloud novices, some napkin math quickly reveals that their entire infrastructure can be run for just $20 a month with change left over for coffee. Wow, the CFO is going to love you.
For any serious cloud application, you have no idea what it’s going to cost until you start using it. No idea. First, different vendors have wildly different ways of measuring and charging that seem obvious at first — but you’ll quickly find the monthly bills are like deciphering Egyptian hieroglyphics.
I’ve always found that Google Cloud is particularly painful in this regard. I’ve been using it for ages and I still don’t understand their charging model. Take a look at my recent statement for a personal test environment that isn’t even doing very much:
… see — bird, ankh, scarab, Batman symbol. It makes the additional charges on my cell phone bill look like common sense. And because Google’s environment features a ton of ‘managed services’, I can’t even begin to tell you where some of this usage is coming from.
The first thing to know is that the cloud doesn’t cost pennies.
So clearly this isn’t ideal — what can we do? First, don’t panic and don’t pay attention to the “we bill by the second” promises you’ll hear. Also realize that AWS didn’t become a $15 billion annual business by charging you pennies.
When you’re first starting out on your cloud journey, I would recommend this approach to billing:
To answer the original question… it depends. For machine learning applications (heavily GPU-biased) with petabytes of data, Google Cloud might be the way to go. For a more traditional Microsoft business application in the cloud, Azure could be the answer. In my line of work, I’ve tended to find AWS pricing the most consistently reasonable, but that’s just me.
Also, pricing in the cloud is dropping all the time — AWS has had 52 consecutive price cuts the last time I checked. Although, occasionally second-tier players spring head-scratchingly-odd increases on their users. The net effect is that your provider of choice now may not be the most competitive long-term. So you’ll need to constantly monitor pricing options to get the best deal, and decide if switching over is worth the effort.
Part of the cloud migration rite of passage for many companies is suddenly realizing that something is wrong. Very, very wrong. Your friendly cloud salesmen promised you low cost, you promised your boss cost savings, he promised you a promotion and you promised your kids a trip to Disney World. Suddenly the invoices start arriving, promises are evaporating and getting a photo with Mickey Mouse is looking further away than ever. Sadness ensues.
Unfortunately, just because you understand on-premise doesn’t translate to an automatic grasp of the labyrinthine world of cloud billing. Here are some of the most common gotchas that ensnare cloud newbies:
Few companies use all three instance categories properly, so here’s the world’s fastest primer on the differences: On-Demand instances cost full price and run whenever you want them; Reserved instances trade a one- or three-year usage commitment for a significant discount; and Spot instances sell off spare capacity at the steepest discount, with the catch that they can be reclaimed at short notice.
Other providers have similar approaches. Basically, as you slide from immediacy and convenience towards guaranteed usage, it gets cheaper.
If you’ve ever worked in enterprise IT, you’ll be familiar with the CapEx vs OpEx battle that accountants get so excited about. The short version is that capital expenditure — which is buying hardware in our space — is good since it creates a tax write-off for a depreciating asset, whereas operating expenses can only be written off in the tax year they were incurred. Although I’m no expert in this area (seriously), I’ve noticed a tendency to write off servers over, say, 3 years and then not replace them for, say, ever. Accountants love this stuff.
Back in the non-accounting reality, if you’re managing on-premise IT infrastructure, your cost accounting is really tricky to the point of being imaginary. For instance, let’s say you are responsible for an inventory management system and a logistics platform. What is the percentage of hardware cost you assign to each system? And if you have personnel supporting both, how do you work out their cost? What about the data center real estate, property taxes, air conditioning and security?
As you drill down the physical stack, it gets progressively harder to figure out the costs of your operation, especially when a third system is added. And a fourth. And there’s so much overlap at different levels. Ultimately you create a model that satisfies accounting but isn’t particularly accurate or helpful.
In the cloud world, this is very different since it’s a metered service where you pay for what you use. In the same way you can calculate the amount of electricity used by a given store, piece of equipment or assembly line, you can attribute cloud costs by product, vertical, service or any other metric.
The method gives you a precise accounting for the cost of a development environment, cluster, region or tier. And since you can tag resources, you can apply internal categories — departments, silos or project codes — that make it very easy to compare apples to apples.
Here are five quick takeaways for getting a handle on the actual costs of cloud computing:
The cloud is on and the meter’s running — avoid the sticker shock of ‘pay as you go’ was originally published in A Cloud Guru on Medium.
When using DynamoDB, the API can take a little getting used to. One of the more annoying parts of the API is how attribute data types are specified. Each Item Attribute is itself a map with only one key, where the key signifies the datatype, like the following:
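For example, a raw item with a few attributes comes back shaped like this (the attribute names and values here are purely illustrative):

```js
// A raw DynamoDB item: every attribute is a one-key map whose key is the type.
const item = {
  userId:     { S: '42' },    // S = string
  loginCount: { N: '7' },     // N = number (note: numbers travel as strings!)
  isActive:   { BOOL: true }  // BOOL = boolean
};
```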
This means you can’t just access an attribute by saying item.attribute; you must say item.attribute.TYPE. Well, what if you don’t know the type? You could make the argument “this attribute is always a string, so let’s just assume it’s a string”, but you never know when someone in the future may change its type, either purposefully or in error.
Compare the two gists below:
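First, a minimal sketch of the DocumentClient version (the Users table and its attributes are hypothetical):

```js
const AWS = require('aws-sdk');
const docClient = new AWS.DynamoDB.DocumentClient();

docClient.get({
  TableName: 'Users',          // hypothetical table
  Key: { userId: '42' }        // plain JS values, no type keys
}, (err, data) => {
  if (err) throw err;
  console.log(data.Item.name); // just item.attribute
});
```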
The above is way easier to manage than the ‘raw API’ as you see below.
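A sketch of the same read through the raw API (same hypothetical table):

```js
const AWS = require('aws-sdk');
const dynamodb = new AWS.DynamoDB();

dynamodb.getItem({
  TableName: 'Users',
  Key: { userId: { S: '42' } }   // every value wrapped in its type key
}, (err, data) => {
  if (err) throw err;
  console.log(data.Item.name.S); // item.attribute.TYPE everywhere
});
```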
Now this is all well and good; there is never a time when you wouldn’t use the DocumentClient… unless you’re using DynamoDB Streams.
DynamoDB Streams is a feature where you can stream changes off your DynamoDB table. This allows you to use the table itself as a source for events in an asynchronous manner, with other benefits that you get from having a partition-ordered stream of changes from your DynamoDB table.
The problem is, when you use AWS Lambda to poll your streams, you lose the benefits of the DocumentClient! You are no longer calling DynamoDB at all from your code. Your Lambda is invoked with the body from the stream. You never are ‘given the chance’ to insert the DocumentClient in the right place to do this JS object translation for you.
The good news is … we can tap into this translator directly!
I did some digging and learned how to pull the nice object translation layer out of the DocumentClient and invoke it directly. See the gist below:
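In recent versions of the aws-sdk, that translation layer is exposed publicly as AWS.DynamoDB.Converter, so a sketch along these lines does the job inside a stream-triggered Lambda:

```js
const AWS = require('aws-sdk');

exports.handler = (event, context, callback) => {
  event.Records.forEach((record) => {
    // Unmarshall the typed stream image into a plain JS object, using the
    // same translator the DocumentClient uses under the hood.
    const item = AWS.DynamoDB.Converter.unmarshall(record.dynamodb.NewImage);
    console.log(item);
  });
  callback(null, `Processed ${event.Records.length} records`);
};
```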
That’s all there is to it! Your DynamoDB Streaming code running in Lambda now has the niceties of JS objects as provided by the DocumentClient.
That’s it for this post. I hope this helps you cut down on a lot of the boilerplate in your DynamoDB Streaming listeners.
Using the DynamoDB Document Client with DynamoDB Streams from AWS Lambda was originally published in A Cloud Guru on Medium.
The Fourthcast development team has been using AWS Lambda to host Alexa skills since the early days of the Alexa Skills Kit. Lambda, at times, can be like your neighbor’s pit bull: sure, it looks all cute and fluffy, but you know that at any time something vicious could happen. We’ve experienced many of those “Lambda bites”. Here’s what you should do to avoid them yourself.
You may already know that Lambda functions, when not used for a while, get recycled. The next invocation will require redeploying the function, which takes some extra time and adds latency for your users. We call this a “cold” invocation, as opposed to a “warm” invocation.
What you may not have known is that cold invocations are much worse if your Lambda function uses the network. The table below shows some timings from invoking a very simple Lambda skill. Invoking a cold non-networked function takes 7 times as long as a warm one, but a cold function that uses the network takes 15 times longer than a network-using warm function.
Even worse, if your function is inside a VPC, it can take more than 10 seconds to attach the Elastic Network Interface. We’ve had production skills time out without ever really doing a thing besides trying to talk to the network on a cold start.
To address this, Fourthcast uses a warming trigger on all of our Lambda functions. We attach the CloudWatch Events Schedule trigger with a five-minute period. Since Lambda functions go cold after around seven minutes of non-use, this keeps the function warm pretty much continually and significantly improves startup latency.
However, be warned that your function won’t stay active forever. Redeploying code or changing configuration will always cause a recycle. Also Alexa skills with heavy and concurrent use will require multiple deployments to run simultaneously. That second (or third) deployment will start cold. The warming function only keeps one deployment warm. Finally Lambda functions are automatically recycled periodically. We see a forced recycle about 7 times a day.
If you use the warming trigger, be sure to ignore events without the Alexa request key, and don’t rely on your invocation count to be meaningful for analytics anymore. Also don’t worry about the additional costs. Even at the biggest instance size, you’ll only be using up about 9,000 of your 266,667 free invocations allowed per month. If you use that much, you probably don’t need to warm your skill anyway.
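As a sketch, the guard can be as simple as checking for the request key at the top of the handler (the handler shape here assumes a standard custom-skill event, which carries a top-level request object):

```js
exports.handler = (event, context, callback) => {
  // Scheduled warming pings from CloudWatch Events carry no Alexa request.
  if (!event.request) {
    return callback(null, 'warm-up ping, nothing to do');
  }
  // ... normal Alexa request handling goes here ...
};
```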
If you’re not a Node.js team, great … move on your merry way! But if you’ve got skills using Node.js 4.3, it’s time to upgrade.
Node.js 4.3 had several annoying bugs, but the worst among them is an OpenSSL bug that you won’t discover until you’re running under load in production. This little doozy will put your entire function into a bad state: SSL connections will fail intermittently and without apparent cause, but only if you’re using DynamoDB. There’s a workaround, but mostly, just upgrade to Node.js 6.10.
In programming models that support async operations (here’s looking at you Node.js), it’s possible, and sometimes easy, to finish your function and hand a response back to Alexa before everything has finished processing.
Async operations that get caught up in the Lambda freeze/thaw are absolute death. They’ll pop back to life in some later invocation, but will likely have timed out. The tell-tale sign is bizarre timeouts with request IDs in CloudWatch that correspond to requests issued hours earlier. These kinds of errors can often push libraries or OpenSSL into odd failure states that can only be resolved by forcing a redeploy, such as by resizing the function.
You don’t want these kinds of Heisenbugs. Look carefully for anything that executes asynchronously but is not on the logical path to completing the Lambda function. The usual culprits for us have been cache puts and analytics. Since they’re not critical to the skill logic, you won’t notice when they don’t finish. One such error caused us to believe we were logging analytics for six months before we realized that maybe only one in three data points was actually being stored. Also avoid any async work done at startup, outside of the handler.
While you shouldn’t rely on this, make sure that you’re using the callback rather than any of the functions on the context object to complete the invocation. The callback makes a best effort to complete in-flight async operations before freezing your function.
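A minimal sketch of what that looks like (doWork is a hypothetical async step standing in for your real skill logic):

```js
// Hypothetical async step; replace with your actual skill logic.
const doWork = (event) => Promise.resolve({ speech: 'Hello' });

exports.handler = (event, context, callback) => {
  doWork(event)
    // By default, the Node 4.3+ runtime waits for the event loop to drain
    // after the callback fires, giving in-flight operations a chance to end.
    .then((response) => callback(null, response))
    .catch((err) => callback(err));
};
```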
Most libraries for async operations come along with long default timeout periods. In Node.js it is normally 2 minutes. With Alexa, if you’re not answering the user within 7.5 seconds, Alexa will respond with a failure message for you. A well-behaved skill should be much faster than that.
It’s much better to fail an async operation early and be able to tell the user that something is wrong in your own words than to get the dreaded “There was a problem with the requested skill’s response” message. Also, debugging long-running calls that have been frozen, as mentioned above, is a huge pain.
In the case of interacting with an AWS service via the SDK, make sure you also set a low value for maxRetries, since these are cumulative with timeouts.
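For example (the timeout values here are illustrative, not a recommendation):

```js
const AWS = require('aws-sdk');

// Fail fast: Node's default socket timeout is two minutes, and retries
// stack on top of that. Alexa only gives you about 7.5 seconds in total.
const dynamodb = new AWS.DynamoDB({
  maxRetries: 1,
  httpOptions: {
    connectTimeout: 500, // ms to establish the TCP connection
    timeout: 1500        // ms of inactivity before the request fails
  }
});
```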
Because of container reuse, it’s possible to stuff data into global memory and, with good probability, it will still be there on the next invocation. However, despite the fact that we’ve seen major tool libraries for Alexa leverage this, it should be strongly avoided. A deployment can be recycled at any time for many different reasons, and there is no guarantee that requests within the same session will be routed to the same deployment.
I have also seen some debate about whether it’s OK to store data in global state that will be used strictly within the same request. Normally in Node.js (e.g. in an Express-based website) this is a huge red flag, since that state could be clobbered by other interleaving requests. However, while I can’t find any documentation that guarantees this, in practice Lambda will not issue a request to an instance while another is in flight. Because of this, using global state within a single request is possible, but I wouldn’t rely on it; it’ll mean weird bugs if you ever migrate off of Lambda.
In short, avoid using global memory. At Fourthcast we use global state only as a first-level cache.
Most of our skills at Fourthcast catch any errors and return a user-friendly error message. If you do this, make sure that you log custom metrics to track these “soft errors”, since Lambda’s invocation error metrics won’t be relevant anymore. At Fourthcast we use a custom CloudWatch metric for soft errors, which allows us to attach an alarm and be alerted of high error rates.
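A sketch of what logging such a metric might look like (the namespace and metric name are our own conventions, shown purely for illustration):

```js
const AWS = require('aws-sdk');
const cloudwatch = new AWS.CloudWatch();

// Record one soft error; an alarm on this metric's sum flags high rates.
function recordSoftError(callback) {
  cloudwatch.putMetricData({
    Namespace: 'AlexaSkills',   // illustrative namespace
    MetricData: [{
      MetricName: 'SoftErrors', // illustrative metric name
      Unit: 'Count',
      Value: 1
    }]
  }, callback);
}
```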
Run your skill at one of the higher memory levels. The low-memory instances are also allocated a smaller slice of processor and have very slow file and network IO. Many mysterious errors cleared up when we suddenly had more memory. With Lambda’s very generous free tier, you’re not likely to incur costs anyway, so go ahead: set it to 1536 MB.
Lambda is the perfect tool for hosting Alexa skills, but you’ve got to watch out for these pitfalls and “Lambda bites”. At Fourthcast, we’ve hosted all of our skills using Lambda, and don’t miss fiddling with servers at all.
Fourthcast is a service that takes your podcast, and turns it into an Alexa skill. 33 million people will own a voice-first device in 2017. What will they listen to?
Put your podcast on Alexa in just a few clicks! Get Started!
We started the OpsGenie start-up journey in 2011 with three senior full stack developers who were very experienced in building Java enterprise, on-premise products that specialized in integration, data enrichment, and management of 1st generation infrastructure monitoring tools. We saw an opportunity in the market and decided to use our expertise to build an internet service for Alert/Incident Management.
But stepping into the SaaS world would bring many unknowns. Concepts like operational complexity, high availability, scalability, security, multi-tenancy, and much more would be our challenges. The first thing we decided was that sticking with AWS technologies would help us overcome many of those challenges. Even if there were better alternatives out there, we started to use fully or partially managed Amazon services for our computing, database, messaging, and other infrastructural needs.
As many start-ups do, we started coding in a single git repository. But somehow we didn’t end up with a monolithic architecture. It was still a monolith, of course, in the sense that everything was built from the same code repository. :) We separated customer-facing applications from the ones that did heavy calculations in the background. In the early days, the OpsGenie architecture was composed of the following components:
Rest API: A lightweight HTTP server and in-house built framework written on top of Netty, which provided the OpsGenie Rest API.
Engine: A standalone JSE application which calculated who should receive a notification — and when.
Sender: A standalone JSE application that talked to third party providers in order to send email, mobile push, SMS and phone notifications.
We were operating in two zones of Amazon’s Oregon region, and we designed the architecture so that all types of applications had one member alive in every zone, during deployments. We put front-end servers behind Amazon Elastic Load Balancers, and all inter-process communications were made via asynchronous message passing with SQS. That provided us with great elasticity, in terms of redundancy, scalability, and availability.
Then the same old story happened. We encountered the same obstacles and opportunities that every successful startup meets on its journey: the product aroused keen interest in its audience, which in turn led us to develop many more features, handle support requests, recruit new engineers, and so on! As a result, the complexity of our infrastructure and code base increased.
Our architecture began to look like the following:
Before I mention the problems that emerged with this architecture, I’d better talk a little bit about the engineering culture we were developing:
We had embraced the evolutionary and adaptive nature of Agile software development methodologies even before we started OpsGenie. We were already performing Test-Driven Development. We started to use a modified version of Scrum when our developer count exceeded eight or ten. We accepted the importance of lean startup methodologies and fast feedback loops. We committed to the work needed to continually evolve our organization, culture, and technology in order to serve better products to our customers.
Even though the term is relatively new, we embraced the technical aspects of DevOps from its earliest beginnings. We have been performing Continuous Integration and Continuous Delivery. We have continuously monitored our infrastructure, software, logs, web, and mobile applications. Also, as soon as a new developer joined the company, got his or her hands a little dirty with the code, and understood the flows, he or she began to participate in on-call rotations to solve problems before our customers noticed them. And we continue to honor our commitments to an engineering culture based on such ideas and practices.
After this brief insight, I hope that the problems that we have faced seem more understandable. Here they are:
As I mentioned before, we were not the first internet service company facing these kinds of challenges. Many more out there survived and succeeded on a massive scale. All we had to do was learn from their experiences and figure out the way that was most appropriate for us.
Much has been said and discussed about microservices. There are floods of articles, blogs, books, tutorials, and talks about them on the internet. So I have no need or desire to explain the term, “microservices.”
Although pioneering companies like Amazon and Netflix switched to this architectural style in the previous decade, I think use of the term “microservices” exploded when James Lewis and Martin Fowler published their blog post about the concept in 2014. Amazon CTO Werner Vogels had already described these patterns as SOA in an interview published in 2006.
Instead of giving a complete definition, Martin Fowler addressed nine common characteristics of a microservices architecture:
Componentization via services
Organized around business capabilities
Products not projects
Smart endpoints and dumb pipes
Decentralized governance
Decentralized data management
Infrastructure automation
Design for failure
Evolutionary design
When we looked at our architecture, we realized that we were not too far away from those ideals to move to a microservices-oriented architecture. Our most critical need was to organize as cross-functional teams in order to implement different business capabilities in different code bases. We already had at least some organizational expertise with the other characteristics Fowler described.
So, why are we moving to a serverless architecture instead of simply implementing microservices? There are a couple of advantages in using AWS Lambda instead of building Dockerized applications on top of AWS ECS — or deploying them to a PaaS platform like Heroku:
At the beginning of 2017, we recruited three senior engineers who had no previous knowledge of OpsGenie’s infrastructure and code base — and very little experience with cloud technologies. They started to code an entirely new product feature in a separate code base to be deployed to AWS Lambda service. In four months, they did an excellent job.
They prepared our development, testing, and deployment foundations, as well as implementing a brand-new product extension. What they accomplished was a full-blown application, not simple CRUD functions, database triggers, or any other officially-referenced architectural pattern. As I write these lines, they are shipping it to production and opening it up to some beta customers.
When we feel safe, and our delivery pipeline stabilizes, we plan to split our applications — domain by domain — and move them to serverless. And we will keep sharing our experiences in our engineering blog.
OpsGenie is on a journey to reap the benefits of serverless architecture was originally published in A Cloud Guru on Medium.
A major manufacturer recently announced a smart lamp that will set you back about $200. While GE’s lamp is cool, neither my budget (nor my wife) supports a purchase of that size, so I set out to design and build a “smarter lamp” for more budget-conscious consumers.
GE’s Alexa lamp will be available in September for $200 https://t.co/53gFWdMsny
In order to compete with GE, the design of my smart lamp needed to include voice activation by Alexa. Since my wife wanted the ability to control the lamp using a standard switch, the design also needed to be practical and functional.
To meet my needs, the smart lamp had to account for state, regardless of whether a voice command or a physical switch was used for power.
The Smart Lamp Prototype
To build the solution, I spent $35 on the required components.
For a full description of the process for building the smart lamp and the code to control the device, check out my project on Hackster.io.
For the initial prototype of a physical switch control, I used a push button. The controlling code, written in Python, simply waits to see if the button fires. When fired, the relay pin is set to High or Low depending on its current state.
The Alexa solution required two components: an Alexa skill, and a Python script that would take a command from the Alexa skill and change the state of the relay pin.
Amazon has a specific API for Smart Home skills, which don’t require the user to say the skill name to invoke them. There is a great five-part tutorial on their blog with very clear instructions; I created the first draft of my skill within 30 minutes of starting.
Once my base skill was completed, I updated the code for use with my lamp prototype. The Smart Home Skill must provide Alexa with a list of smart devices — typically this device code resides with a device’s manufacturer. Since I was building my own, I hard-coded the device information:
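A sketch of what that hard-coding might look like under the 2017-era (v2) Smart Home Skill API; every device detail below is illustrative rather than the actual value from my skill:

```js
// Respond to a discovery directive with a single hard-coded appliance.
exports.handler = (event, context, callback) => {
  if (event.header.namespace === 'Alexa.ConnectedHome.Discovery') {
    callback(null, {
      header: {
        namespace: 'Alexa.ConnectedHome.Discovery',
        name: 'DiscoverAppliancesResponse',
        payloadVersion: '2',
        messageId: event.header.messageId // echoed here for simplicity
      },
      payload: {
        discoveredAppliances: [{
          applianceId: 'smarter-lamp-001',     // illustrative values
          manufacturerName: 'DIY',
          modelName: 'SmarterLamp',
          version: '1',
          friendlyName: 'Lamp',
          friendlyDescription: 'Voice-controlled floor lamp',
          isReachable: true,
          actions: ['turnOn', 'turnOff']
        }]
      }
    });
  }
};
```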
AWS IoT Setup
The AWS IoT developer guide explains the process for creating an AWS IoT “thing” to receive and route the messages. The key part of the process is creating the certificates, which you’ll need to download and install on your LinkIt development board.
The Smart Home Skill also provides the ability to control the device, so I used AWS IoT to send MQTT messages to it. I then created a script on the LinkIt development board that waits for an MQTT message and sets the relay pin to High or Low based on the message.
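On the skill side, the publish can be a short call through the aws-sdk’s IotData client; the endpoint and topic below are placeholders, since your account’s IoT endpoint comes from the AWS IoT console:

```js
const AWS = require('aws-sdk');

// Placeholder endpoint; substitute your account's AWS IoT endpoint.
const iotData = new AWS.IotData({
  endpoint: 'example.iot.us-east-1.amazonaws.com'
});

// Publish the desired relay state ('HIGH' or 'LOW') to a topic the
// LinkIt board subscribes to. The topic name is a placeholder.
function setLamp(state, callback) {
  iotData.publish({
    topic: 'smarterlamp/relay',
    payload: JSON.stringify({ pin: state })
  }, callback);
}
```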
Enabling the Alexa Skill
Before I could use the skill, I had to enable it in my Alexa app and perform device discovery. Once that was complete, I was able to control my prototype via button and voice.
Once I confirmed the code was in working condition, I moved ahead to build a more robust solution.
Wire the Switch and Run the Wires
I decided to go with a pole-style floor lamp, which allowed me to snake the switch wires from the top of the lamp to the base. I had to drill two small holes in the back of the lamp to install the button, and then used heat wrap to cover the wires.
Solder the IoT Device to a Protoboard
Full disclosure: I haven’t soldered anything since my senior year in college, and I barely passed my Circuits 101 class. While my final solution was never going to be pretty… it works. I’d recommend reviewing the intro to soldering course by Alexa Glow on Hackster.io.
Connect the Relay and Assemble the Device
I cut the power cord to the lamp, soldered the ends, connected them to the relay, and then connected the relay to the LinkIt Duo. Once connected, I placed the LinkIt and relay inside the plastic project box.
The final steps of the process involved the creation of two services on the LinkIt board designed to automatically start the button and MQTT Python programs on reboot.
While the finished product might look like a “dumb” lamp, I’d like to consider it a “smarter lamp”, built for a total price of $35.
Darian Johnson is a technology consultant with deep experience implementing complex software architectures and leading large-scale software delivery programs. He currently works for Accenture as a member of their Amazon Web Services practice.
Darian has always enjoyed researching new technologies, so he eagerly used the tools and templates provided by the Alexa team to learn about skill development. His first skill combined his passion for fitness with his interest in machine learning. His Mystic Mirror skill won second place in Alexa’s Internet of Voice challenge on Hackster.io.
How to build your own Alexa voice activated IoT smart lamp for less than $35 was originally published in A Cloud Guru on Medium.
Over the past year, a lot of progress has been made since I set out on a mission to provide Spanish speakers — over 560 million worldwide — access to AWS technical content in our native language.
Our efforts to engage and grow the Hispanic AWS user community have resulted in the availability of a Spanish AWS Certification course, an AWS en Español LinkedIn group that’s grown to 700 members, and a live broadcast of our monthly meeting via YouTube!
To engage the Hispanic community at the local level, my recent focus has shifted to creating and organizing AWS User Groups in Latin America and Spain. The user groups provide professionals an opportunity to interact with local AWS practitioners, ask questions, and share ideas.
I’m excited to announce that during April, four new AWS User Groups were established in Latin America (LATAM) and Spain with sponsorship from A Cloud Guru. Please welcome these Hispanic user groups to the growing AWS community!
Join the AWS Community Day festivities in San Francisco on the 15th of June to celebrate! AWS Community Day is a free, community-organized event featuring technical discussions led by expert AWS users and community leaders from the Bay Area and throughout the West Coast.
“AWS Community Day will offer us a unique opportunity that we cannot get at AWS re:Invent or AWS Summit.” — John Varghese, Leader, Bay Area AWS User Group
If you are a passionate AWS user and are interested in joining or starting your own AWS User Group, check out the list of existing groups or learn how to get one started. Feel free to reach out to me via LinkedIn or Twitter if you are interested in creating an AWS User Group in LATAM or Spain!
Hispanic AWS User Groups are the newest addition to the growing global community was originally published in A Cloud Guru on Medium.
It’s easy to get the impression that the cloud is so alien to on-premise IT that you have to wait until a new project comes along to try it out. Fortunately for the patience-impaired like me, we can easily migrate existing workloads across and see immediate benefits.
Getting one migration project off the ground is the key to convincing the Powers That Be that 100% cloud is where the company wants to be in the future. This step is all about building credibility to achieve our ultimate goal of becoming cloud natives.
While cost savings are usually the hook, that’s not what cloud is really about. At its core, cloud provides three benefits that are either needlessly hard or impossible to do yourself:
Basically, is it working? Is it there? Do I have enough resources? Fault tolerance and high availability are snooze-inducing buzzwords for the average human, but flip the words around and they just mean your application can tolerate faults and will be available more than you would expect. In practice, pulling off this magic trick is all about finding bottlenecks and points of failure. In essence, you are creating a plan B for everything, always assuming plan A is going up in smoke.
But you also need to determine which website is worth the effort. For a corporate webpage that manages employees’ tennis court reservations, who really cares if it only works 95% of the time running on someone’s desktop PC? Big deal if it breaks (apologies to tennis fans). But if your site is streaming video for the Game of Thrones finale, you damned well better achieve 100% availability (I’m looking at you, HBONow). That is clearly a much better candidate for migration.
It’s not new, it’s not sexy, but your company’s website is important and it’s one of the few ways your customers get a glimpse into your internal technology horror show. It’s a good place to start for a cloud migration since the transition is well understood and your glorious success will be highly visible.
There are many ways to build a website on-premise, but here is one of the most common approaches:
Bad, bad, bad. This is a sorry design based largely on the ‘hope and pray’ approach that has an unhappy track record of disappointment. If one piece fails, it all fails. Cue screaming customers, mad executives, and pagers beeping at 3am.
Apart from the declining availability that happens when you multiply lots of 99% probabilities together (five components at 99% each leaves you with 0.99^5, or roughly 95%, overall), it also cannot be upgraded without downtime. This is just about the laziest setup for a website (though surprisingly common), and while it might be fine for a hobby blog, it would be a train wreck for anything remotely popular. Let’s make it work properly, a la cloud.
Pray that nobody kicks the server and the hard disk lasts forever because it scores low on our Big Three.
In practice there are just as many ways to cloudify a project but here’s my first sketch at using AWS to lift this website into the 21st century. Marvel at my graphic for a few moments and I’ll explain on the other side…
This isn’t as complicated as it looks but it was fun to draw. Piece by piece, this is what we have:
I know what you’re thinking. “I just wanted a Honda Civic and you gave me a Tesla delivered by a SpaceX rocket.” I did, but fortunately it’s cheaper and more reliable than the Honda (if that’s even possible, hey Honda fans?).
This is the sort of environment you can build out in a few hours on AWS and might easily have an average running cost of a few hundred dollars a month (depending on your usage). The reason it’s so fast and cheap isn’t because I’m the best cloud guy in the world with extremely reasonable rates and a great can-do attitude, it’s because cloud is code. Let’s repeat that together (the cloud part):
Cloud is code. Infrastructure is code. Build it up. Throw it away.
Nobody is ordering servers, racking hardware, or approving purchase orders. We simply build out a CloudFormation template (like a blueprint for your house), click “Create”, and automation happens. An army of bots builds exactly what we want and we’re done. The hardest part will be migrating files and content, and even that can be fairly simple with a few scripts.
OK, version one solved many of the problems presented in the Dire Stack of On-Premise Failure. It gave us much more availability, durability was effectively solved, and while scalability was impressive, it could still be improved.
Imagine you have a webpage that’s going to get massive amounts of traffic unpredictably across multiple geographic regions. Suddenly you get one million visitors from Australia when a TV ad runs during a national event, and then nothing for 24 hours. And now the traffic hits the West Coast, 10 million visitors during TV ads on cable in the evening, and then it goes quiet. How do you scale up fast enough or make sure the right regions are in place?
In order to accommodate this extreme traffic, I present for your consideration “version 2”:
In the classic website model, you need a web server, a database, and code to connect them together. In this new version:
You might remember from last week’s blog post on Mobile Apps that this is a serverless implementation that effectively handles the scaling for you. It’s exceptionally resilient to denial of service attacks and offers blazing performance for a distributed visitor base. It’s also much cheaper to implement than version 1. While it’s not going to work for CMS-based sites that rely on the more traditional stack, it’s an A+ alternative for high traffic landing pages with a spiky demand curve, such as those targeted by TV ads.
Sometimes the Big Bang migration can be alarming so it’s also worth mentioning a couple of small move alternatives that would greatly improve your overall website infrastructure with just a little cloud.
There are many ways to bring cloud into existing applications and workloads in your organization. The ephemeral nature of virtual hardware can be difficult to grasp and somewhat unsettling.
Pre-cloud, we built everything like it was poured concrete, a major production that was hard to change and move. Cloud providers gave us small ready-made pieces that are like Lego bricks. We can add, change, build up, tear down and have this enormous flexibility that takes a while to fully appreciate.
In most companies, comparing ‘on-premise’ servers to cloud servers is like comparing communist-era food rationing to Costco.
This simple example is just a website. Imagine what you could do for a distributed point of sale system in retail. You could create cloud-based services that securely support your cash registers, website and mobile e-commerce app. Write once, use everywhere. Mind blown.
Baby steps to the cloud: migrating your corporate website was originally published in A Cloud Guru on Medium.
Last week at Serverlessconf in Austin, I made a bold and inflammatory claim: Node is the wrong runtime for serverless FaaS.
Shots fired #ServerlessConf
As expected, some people… disagreed. The claim was meant to grab attention, and then to set up the argument I’ll explain below.
But is Node’s async-everywhere concurrency model useful in our brave new FaaS world?
Functions in serverless architecture should, in general, be single-threaded and single task. If you’re doing a lot of different things in a Lambda function, you’re probably doing it wrong. Yes, everybody’s got a fanout function somewhere that benefits from concurrency or other similar requirement. But as a general rule, if your function is taking on multiple unrelated tasks — you should probably redesign to split it apart.
Don’t build little webservers in your Lambda functions.
If our functions aren’t using asynchronous techniques, what happens to all the concurrency we need? Concurrency between calls/flows is provided by the FaaS platform’s scaling, but what about concurrency within a flow?
The answer is that it needs to move into our infrastructure. Serverless systems are more or less inherently event-driven, and event processing needs to be asynchronous. However — and this was the point in my talk — the existing FaaS providers do not yet have the functionality in place to build graphs of async FaaS. AWS Step Functions and Azure Logic Apps aren’t quite sufficient yet — and what that should look like is the subject of another post.
Note: IOPipe is leveraging Node to keep HTTP connections open between Lambda invocations. That’s actually a pretty huge deal, but I haven’t heard whether it’s possible with other runtimes.