Serverless Chats

Episode #61: The Well-Architected Serverless Lens with Heitor Lessa

About Heitor Lessa

Heitor Lessa is a Principal Specialist Solutions Architect at Amazon Web Services. He has spent the last 10 years in a number of roles, focusing on networking, infrastructure, and development. Since joining AWS in 2013, he’s been helping organizations of all sizes and segments across EMEA to design cloud native applications as well as software development best practices.

Watch this episode on YouTube: https://youtu.be/bFjT3TrpbZg

Transcript

Jeremy: Hi, everyone. I'm Jeremy Daly and this is Serverless Chats. Today I'm speaking with Heitor Lessa. Hey Heitor, thanks for joining me.

Heitor: Thanks for inviting me. It is a pleasure to be here.

Jeremy: I'm super excited to have you here. You are a principal specialist solutions architect at Amazon Web Services. Why don't you tell the listeners what you do at Amazon Web Services and sort of what a principal specialist solutions architect does.

Heitor: I know it's a long title. I guess we can just say I'm a solutions architect at AWS. My day to day is basically working with customers and enabling developer teams to find the best solutions on how to either build something on AWS or migrate, let's say, a monolith to microservices, or optimize something that they have.

But more recently, I'm also working with customers to help them build internal developer communities, similar to what we have at Amazon, which bring in pros and stuff like that.

Jeremy: Very cool. Now I know you're doing a million different things. I don't know how you're not running AWS yet. I think you're next in line, I think. But you're doing a million things there. And one of the things though that you've been working on in the past and I know you're still involved with it, is the Well-Architected Serverless Lens. And I want to get into this because this is one of those things where if you're trying to find best practices and you're reading all these blog posts and you're looking at anti-patterns and good patterns and all this kind of stuff, I think it gets really, really confusing.

And your team and a bunch of other people at AWS and a bunch of the community heroes and all kinds of people got together and put together this Serverless Lens. If people are familiar with the Well-Architected Framework, which is talked about quite a bit, there's also this thing called the Well-Architected Serverless Lens. What's the difference between those two things?

Heitor: Sure. So the Well-Architected started way back in 2016, even before that to be fairly honest, where customers were looking to use AWS but we had roughly 50 services back then, compared to today when we have a lot more. And basically those customers were asking, "How do I use X service versus the other service? How do I go to production with this critical application? How do I model from my on-premises applications to something more cloud native? How do I migrate?" And things like this. Or even specific questions like, "How do I set up a multi-account? How do I better protect my accounts from a security perspective or billing?"

Well-Architected brings all those best practices that are agnostic from a workload perspective and typically apply to many of them, whether you're using serverless or containers, so usually that would work really well. But the challenge with Well-Architected is that, as the platform evolved, we started to have more high-level services like serverless or services like AI/ML, which you have to treat slightly differently. The best practices still apply on how you set up AWS accounts, how you do backups, how you think about relational databases versus NoSQL databases.

But when you get to things like Multi-AZ and EBS volumes, for serverless they don't quite make sense. The Lens was a project to say, what are customers using where Well-Architected actually helped them, but where they still lack a lot of good practices that are very specific to the technology they chose? So serverless was one of them. IoT was also another one. And more recently, last month we also announced the Analytics Lens. If you're interested in big data, AI, those pieces, Analytics covers that pretty well. That's the difference.

The Lens is a... It doesn't replace Well-Architected; it's more of an add-on to all these best practices we've been sharing for the past few years.

Jeremy: Yeah, because as an add on, it makes sense. The original serverless or sorry, the original Well-Architected Framework has, I think, 47 questions or so that asked you about specific areas and there's the five pillars and we'll get into some of that because I do think it's interesting to think of it that way. But the Serverless Lens just has more questions. What's the reason for all those extra questions?

Heitor: Sure. Well-Architected, when we started the Lens, if I'm not mistaken, again, there were the 47 questions, but we just had a recent update where some of those questions might have changed. But the Serverless Lens, I think, if I'm not mistaken, we started with 31 questions because we were trying to get every single detail of serverless and every best practice. But that was primarily an academic paper. So the Lens started as, let's set up a document where you can go and find out: when do I use serverless? Is serverless a good thing for me? How do I choose between all these services? How do I know the operational best practices for serverless?

As we started digging into those best practices, we felt we needed a lot more questions to dive into, okay, what type of metrics do you need? What type of alarm do you need? When do you use containers versus Lambda functions? When do you use orchestration versus synchronous calls? We only started in 2017. We had all these questions that customers were asking us. We put together into a document and we started writing. That took us roughly six to 10 months to put together into 50 pages when we announced Serverless Lens.

Jeremy: Right. And then that was a white paper, like you said. That was just a document. But now that's been moved into the Well-Architected tool, which is pretty cool. If anyone's used that or hasn't, I suggest you go and try it out. But that just takes you through and asks you all those questions and you can kind of keep track of your progress. How did you go from the white paper to the tool?

Heitor: Yeah. In 2017, when we announced, Werner went on the stage and talked about this idea of getting those best practices for serverless. And in 2018, we got an immense amount of downloads for Lens. If I'm not mistaken, it was over 20,000 downloads in less than six months, specifically for serverless best practices.

But then customers started to ask for more. How about Alexa? How about X, Y, Z? What we found was that trying to keep writing those pieces into the document made it pretty difficult to keep up with the serverless space and how much it changes. So instead of continuing to add more questions and more pages of documents, we came together and thought, what if we evolved the Lens project into the console? The customer would go to the console and say, "I want to review my architecture and I'm also doing serverless. I'm also doing analytics. I'm doing IoT or I'm doing something specific for FSI," Financial Services for instance.

So we thought we would experiment first with serverless and that's exactly what we did. But the challenge of migrating to the console is that, as you have probably seen, when you go and review your architecture in the console, you have a very specific question and a few best practices you typically are doing or you're not doing yet.

That didn't map well with an academic white paper because you had to read what was the question, what was the best practice, how would you implement it? We went from those 30 plus questions down to nine questions with much more specific best practices. That we announced in February, just a few months ago.

Jeremy: Yeah. Yeah. It's a great tool. One of the things, you're talking about best practices. Yesterday's serverless best practice could be today's serverless anti-pattern, because it does. It changes very rapidly and things are always sort of changing. How do you actually figure out what the best practice is?

Because there's a lot of posts out there about best practices and anti-patterns and so forth. Especially whether Lambda should be calling Lambdas and stuff like that. How do you actually go about deciding whether or not something is a best practice or not? And how does it make it into the lens?

Heitor: Sure. I think that's a great question. That was probably the number one fear when trying to think about serverless best practices in 2017. Because you remember API Gateway was there back then, but it was very early days. SQS wasn't even an event source back in the day. There were so many questions where we were kind of unsure whether that would work or not. One of the two things that happens in Well-Architected is that we have the concept of the pillars, as you mentioned, the five pillars: operations, security, performance, reliability and cost. That helps a lot in breaking down those best practices, and thinking, does that fit in here or is that more of an opinion as of now?

And then the second is that we have a rule of 80% to 20%. If it's something that's been working for 80% of our customers, and our technical field (solutions architects, technical account managers, evangelists, developer advocates, all these people in those communities internally) knows that these are the things that are working for customers in production, then this fits into the 80%.

Things that are in the 20% are what we call the edge or leading practices. They're something that we know might work for certain customers who have a certain expertise already with AWS, but that eventually might become a common practice, like the Lambda-to-Lambda communication type of thing, which back in the day we weren't even discussing, or something like EventBridge, which until this year wasn't widely discussed. So it's something we definitely would get to.

Jeremy: Awesome. Yeah. Well, I was actually going to say something like EventBridge, which is probably my favorite service now that exists, maybe after Lambda. That's not in the lens right now. And you've had a bunch of new services that have come out like Lambda layers, Provisioned Concurrency, RDS Proxy. These are not in the Lens yet. Is it just because they're so new or there's certain things about those services? Have they not matured enough yet where they're considered to be best practices?

Heitor: It's a bit of both. I think it's always like... One of the things that we love at AWS is that we like to launch those features or those services early so we can iterate on them with customers as we hear from them. It's quite similar to the process of deriving best practices for Well-Architected. When we announce something... Like EFS was actually just announced. Immediately you would think, well, EFS for Lambda makes a lot of sense for AI/ML use cases, for some shared state if you will, but it's something where we have to observe how that's working out for customers.

When you look at the Lens, the white paper per se, we have the scenarios section, where we not only share the common architecture diagram for that specific use case, but we also share what we call configuration notes, which are the common gotchas and caveats that might be different depending on the use case and depending on how you use it.

One type of use case might be true to the vast majority, but if you have high throughput or high concurrency, the whole best practice landscape changes completely to you. That's one of them. For those new features, we are definitely listening, having our ears to the ground and hearing from customers how you're using, how is that working out for you?

Layers is one example. It's something that we've seen customers using for custom builds like FFmpeg or something very specific like a chromeless browserless if you will. It's working out really, really great. Everyone loves it and it works. But when you're trying to use layers in a very large enterprise, you have a couple of caveats. For instance, when you're trying to share dependencies that change very frequently, you have all those Lambda functions now being redeployed and that causes cold starts.

There are some caveats that we need to make sure we know exactly how to deal with, so when we write, we tell you, "This is why this is a good practice. These are the caveats and if you are in the caveat space, this is how you handle it."

Jeremy: Yeah. Yeah. And I know with layers too. I mean, I'm not a huge fan of the way the versioning works, where it's just an incremental version; version one, version two, version three. And I know you can use like SAR, for example, the Serverless Application Repository and you can wrap a layer up in a SAR app and give it semantic versioning and things like that.

There's just a lot of steps to go through, so working out what the best way to do it is would be really interesting. I do want to go back to EventBridge for a second though, because this is one of those services where when it was announced, and it was announced in the middle of the summer last year, I think it was. So it's about a year old now. And actually, it's just over a year old or it was just announced about a year ago, which is kind of crazy to think about.

There's not been a lot of literature on EventBridge. And I know that you see some people using it. You've got some dev advocates that are really pushing it. I love it. I think it's great, and so are the things that have been added to it. But are there reasons why that still hasn't made it into the Lens? I know that you don't have things like DLQs. Is that still one of the reasons why that's not there yet?

Heitor: Yeah, there are a couple of reasons. One, it's primarily time for me as well to distill all the feedback we get from customers and figure out what is a good practice versus what people learned by trial and error at some point, which is not something where we want to tell everyone, "Just go and use it." And there's also the fact of what you just mentioned about DLQs and some of those pieces. In Well-Architected, when we're about to suggest a service or suggest a way of doing something, we always have to keep in mind the five pillars.

For instance, we know EventBridge doesn't have a DLQ, but nothing stops you from having an SQS queue as a target first, and then a Lambda function which will handle that. When you're looking at it from a large enterprise, that's easily going to get to 300, 400, hundreds of queues, which then makes it more difficult for you to manage that piece at scale. Some of those pieces would come into play. But DLQs and some of those pieces, I think I would definitely add them; they're caveats. There are ways for you to work in production and it works really well. But these are more caveats. So it's something we're looking at right now to introduce as a scenario and not under a specific question.
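To illustrate the workaround Heitor describes (an SQS queue as the rule target, with a Lambda function consuming from it), here is a minimal sketch that builds a CloudFormation-style template as a Python dictionary. The resource names, the `com.example.orders` event source, and the `orders-consumer` function name are hypothetical, and the sketch assumes the consumer function already exists. The queue carries its own redrive policy, so messages the consumer repeatedly fails on end up in a dead-letter queue.

```python
import json


def build_template(max_receive_count: int = 3) -> dict:
    """Return a template sketch: EventBridge rule -> SQS queue (with DLQ) -> Lambda."""
    queue_arn = {"Fn::GetAtt": ["OrdersQueue", "Arn"]}
    return {
        "Resources": {
            # Dead-letter queue for messages the consumer keeps failing on.
            "OrdersDLQ": {"Type": "AWS::SQS::Queue"},
            # Buffer queue that the EventBridge rule delivers to.
            "OrdersQueue": {
                "Type": "AWS::SQS::Queue",
                "Properties": {
                    "RedrivePolicy": {
                        "deadLetterTargetArn": {"Fn::GetAtt": ["OrdersDLQ", "Arn"]},
                        "maxReceiveCount": max_receive_count,
                    }
                },
            },
            # Resource policy that lets EventBridge send messages to the queue.
            "OrdersQueuePolicy": {
                "Type": "AWS::SQS::QueuePolicy",
                "Properties": {
                    "Queues": [{"Ref": "OrdersQueue"}],
                    "PolicyDocument": {
                        "Statement": [{
                            "Effect": "Allow",
                            "Principal": {"Service": "events.amazonaws.com"},
                            "Action": "sqs:SendMessage",
                            "Resource": queue_arn,
                        }]
                    },
                },
            },
            # Rule matching a hypothetical event source, targeting the queue.
            "OrdersRule": {
                "Type": "AWS::Events::Rule",
                "Properties": {
                    "EventPattern": {"source": ["com.example.orders"]},
                    "Targets": [{"Id": "OrdersQueueTarget", "Arn": queue_arn}],
                },
            },
            # Lambda polls the queue; "orders-consumer" is assumed to exist.
            "OrdersConsumerMapping": {
                "Type": "AWS::Lambda::EventSourceMapping",
                "Properties": {
                    "EventSourceArn": queue_arn,
                    "FunctionName": "orders-consumer",
                    "BatchSize": 10,
                },
            },
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_template(), indent=2))
```

Each rule that needs this protection gets its own queue, dead-letter queue, and policy, which is how an enterprise ends up with the hundreds of queues mentioned above.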

The pieces that we're not entirely sure about yet, from a Well-Architected perspective on EventBridge, are some of the event modeling, some of those best practices, inter-service versus intra-service, the multi-account approach. It's very easy with EventBridge, which, by the way, is one of my favorite services too, although I shouldn't be saying that. It's great that EventBridge can give you that visibility of how these async communications are going, and there are schemas that let you download code bindings. This is fantastic. But there are also some of the tracing capabilities that you want, to know where these events went, who filtered, who routed them where.

We're trying to figure out those patterns and how we can tell customers to use EventBridge in a way that will work for the 80% of customers that may not have this extensive background in event modeling, DDD and all those good tools and good practices. For customers who are deep into doing event-driven microservices or events to some extent, it's kind of a no-brainer. EventBridge just fits the bill and it works like that.

But on our side from the Well-Architected, we also have to be mindful of customers who may not have used the cloud or are trying to use it for the first time and are looking for those good practices to jump straight in. Going from a monolith straight to something like events may be a bit too much for them. We're treading carefully on that.

Jeremy: Yeah. Well, I don't work for AWS. I will tell people to go use EventBridge because I think it's amazing. One other thing though maybe about this and not EventBridge specifically, but you mentioned this idea of leading practices or edge practices. How much of a risk is it? Because every service that goes out there from AWS is... Everything has its problems here and there. There's little caveats here and there, but for the most part those services are solid and you can use those services or at least I've always felt I can use those services in production with quite a bit of confidence.

Is there some sort of rule that I should follow as a developer or as an organization where I say, "Okay, I'm following the Serverless Lens 95% but I do want to introduce EventBridge or I do want to introduce the RDS Proxy or something like that." Is that okay for me to do? I'm sure you'll say it is, but I mean, is that... Where do I draw that line? I don't want to be all leading edge but at the same time, I want to be able to take advantage of some of these new services.

Heitor: Yeah, I think as you mentioned, it's hard to say to everyone, "Don't use these services, don't use these features," because if the service is out there, there is a need for it. As we've seen, LEGO going out and about explaining how to use EventBridge is a marvelous thing. It's something that I find amazing how customers are using EventBridge and many other services as well. And so it is with layers as well as an example.

The other piece is that I think I would divide that into two buckets. There are ones where we're not entirely sure it's going to work in production. For instance, before RDS Proxy, many, many customers that I worked with for the past four or five years on serverless were all basically implementing SQL proxy clusters on multiple availability zones to deal with the issue of connection pooling, or using other practices as well. But we also know that we didn't have a reference architecture that would go and show them how to do that piece.

For that reason, we refrained from referring to RDS Proxy because it was in preview. So if it's in preview, I wouldn't recommend production. Easy one, clear cut. The other pieces are like what we just announced... We recently announced Provisioned Concurrency. It's amazing specifically for Java applications on serverless or others that require some predictable latency.

Those are the pieces where, because it's Lambda but it's an additional feature, it's about trying it and figuring out if your KPIs, your requirements, still hold as you use that feature. It's not so much about AWS telling you, "Do not use that," or, "Perhaps use this instead." This 20%, 80% is more about our role to make sense of this plethora of announcements that we also have, and to make sure it works for the vast majority of customers.

If we don't hear from customers using it, it's difficult for us to prove it's working and to recommend it. One thing in the Lens, because I think it's a good point to bring that up now, is what we call general design principles, which in fact was the hardest thing to write. It took me, I don't know, months, with other people who are still figuring out how to do it.

The general design principles are what I use when I'm trying to use another service or trying to recommend something to someone that might not be in this best practices arena but that I know might work for them. The general design principles are seven principles that usually help you understand whether serverless is going to be good for you on the use case you have. Or maybe we need to update those principles; please let us know.

But also whether a service is going to help you out toward that direction, not only from the five pillars, but also from the principle perspective. I remember reading lots of articles from you about using Step Functions everywhere. And also now you discovered Lambda Destinations, which is a new feature. And there's this discussion about when do I use one versus the other?

In the design principles, we have one specifically that says orchestrate your application with state machines, not functions. Which back then made a lot of sense. But now with Lambda Destinations, this might not be entirely true anymore. That's kind of a, "Wow!" As long as I can orchestrate that with that feature, I'm still orchestrating that. It's not my code that's handling all of those pieces. Design principles help us to, I guess, navigate this fine line between what we call edge or leading practices versus best practices.
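As a rough illustration of the "orchestrate with state machines, not functions" principle, the sketch below builds a small Amazon States Language definition as a Python dictionary. Retries, error handling, and sequencing live in the state machine definition instead of inside function code. The state names and the three function ARNs are hypothetical, and `undefined_targets` is only an illustrative sanity check, not part of any AWS tooling.

```python
import json


def order_workflow(charge_arn: str, ship_arn: str, refund_arn: str) -> dict:
    """Build a three-state workflow: charge, ship, and refund if shipping fails."""
    return {
        "Comment": "Hypothetical order workflow (illustrative only)",
        "StartAt": "ChargeCard",
        "States": {
            "ChargeCard": {
                "Type": "Task",
                "Resource": charge_arn,
                # Retries are declared here, not hand-written inside the function.
                "Retry": [{
                    "ErrorEquals": ["States.TaskFailed"],
                    "IntervalSeconds": 2,
                    "MaxAttempts": 3,
                    "BackoffRate": 2.0,
                }],
                "Next": "ShipOrder",
            },
            "ShipOrder": {
                "Type": "Task",
                "Resource": ship_arn,
                # Any failure here routes to the compensating step.
                "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "RefundCard"}],
                "End": True,
            },
            "RefundCard": {"Type": "Task", "Resource": refund_arn, "End": True},
        },
    }


def undefined_targets(definition: dict) -> list:
    """Return transition targets that are referenced but never defined."""
    states = definition["States"]
    targets = {definition["StartAt"]}
    for state in states.values():
        if "Next" in state:
            targets.add(state["Next"])
        for catcher in state.get("Catch", []):
            targets.add(catcher["Next"])
    return sorted(targets - set(states))


if __name__ == "__main__":
    workflow = order_workflow("arn:charge", "arn:ship", "arn:refund")
    print(json.dumps(workflow, indent=2))
    print("Undefined targets:", undefined_targets(workflow))
```

With Lambda Destinations, the routing on success or failure is likewise configured outside the function code, which is why Heitor suggests it can still count as orchestration in the spirit of the principle.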

Jeremy: Yeah. Actually, from a choreography versus an orchestration standpoint, I actually am a huge fan of choreography. I think that's a better way to make microservices talk to one another than trying to orchestrate them. I love Step Functions, and for workflows that absolutely need to follow every step and where you might have to roll things back and so forth, I think Step Functions is the way to go.

But once EventBridge was introduced, and again, having some more visibility into what things do, for me, I always set up one route that catches everything and just dumps it all into a Kinesis Data Firehose and then that goes into S3 and then you can query it with Athena. So you sort of have that... It's not a DLQ, but at least it's a record you can see every event that came into the system.
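The catch-all route Jeremy describes could be sketched roughly as follows. The function only builds the keyword arguments for the boto3 EventBridge client's `put_rule` and `put_targets` calls and does not call AWS itself. The bus name, rule name, and ARNs are hypothetical, and the empty-prefix pattern on `source` is an assumption about how to match every event, so verify it against your own bus before relying on it.

```python
import json


def catch_all_archive(bus_name: str, firehose_arn: str, role_arn: str):
    """Build put_rule / put_targets arguments for a rule that archives every event."""
    rule_name = f"{bus_name}-archive-everything"
    put_rule_args = {
        "Name": rule_name,
        "EventBusName": bus_name,
        # Assumed catch-all: an empty prefix matches any non-empty "source" value.
        "EventPattern": json.dumps({"source": [{"prefix": ""}]}),
        "State": "ENABLED",
    }
    put_targets_args = {
        "Rule": rule_name,
        "EventBusName": bus_name,
        "Targets": [{
            "Id": "archive-firehose",
            "Arn": firehose_arn,  # Firehose delivery stream that writes to S3
            "RoleArn": role_arn,  # role EventBridge assumes to write to the stream
        }],
    }
    return put_rule_args, put_targets_args


if __name__ == "__main__":
    rule, targets = catch_all_archive(
        "app-bus",
        "arn:aws:firehose:us-east-1:123456789012:deliverystream/event-archive",
        "arn:aws:iam::123456789012:role/events-to-firehose",
    )
    # With boto3 installed and credentials configured, these could be applied as:
    #   events = boto3.client("events")
    #   events.put_rule(**rule)
    #   events.put_targets(**targets)
    print(json.dumps(rule, indent=2))
    print(json.dumps(targets, indent=2))
```

Once the delivery stream lands the events in S3, they can be queried with Athena as described above.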

But once EventBridge was introduced, the idea of choreography just becomes so much simpler to kind of work around as opposed to having to use something like SNS or something else that has to be sort of set up and could be disconnected between services. I always used to recommend the pattern of setting up just a microservice that was only an SNS topic that was your EventBus essentially. That was one of the... Because again, I'm a huge fan of that style.

All right. Let me ask you this question because this is something probably... You see a lot of people that just start building and I think if you just start building that's great. You want to build a monolithic Lambda function? If that's how you want to get started, great. Then you start realizing some of the benefits of breaking it up and some of that sort of stuff, whether it's scaling independently or it's the principle of least privilege and things like that, where you can really get very specific about individual routes and things like that.

But anyways, I do recommend people to start building. But at what point do you need to seriously look at the Serverless Lens and say, "Okay, we can't launch a production application until this is live." I guess the question maybe is what's in it for me as an organization, as a developer to follow this really strictly?

Heitor: Sure. That's a very great question and I'm happy you brought this up because there's a misconception in the field that you use Well-Architected once you've already finished your application and now you're thinking about going to production and what else you need to fix or implement. This would be a costly way of doing business for one particular reason. When you're creating your sprints and basically designing your backlog and deciding what features you're going to prioritize and things like this, by the time you get to the Well-Architected and you get these lots of questions and over 100 best practices, the reaction that most customers had, based on my own experience for the past seven years working at AWS, is, "Oh no, I won't have the time. I need to go live next week." And, "I will fix it later."

But actually that later may never come. And we know that. Other things get in the way and that happens and it's just a natural thing. I like to recommend people to use Well-Architected when you are thinking, when you are researching, when you are thinking about which service should I use, what common patterns should I use? Or what common things should I watch out for?

In the console, it might not be that obvious at first because we do that for a reason too. When you say review my architecture and you start answering those questions, even if you answer a single question or two questions, in the report or under the status of your application, it shows how many high risks you have and how many medium risks you have. I would only recommend you go to production if you have no high risks. Medium risk are something that for instance, if you have no CloudWatch alarms in your serverless application or no tracing or no structured logging or centralized logging at all, then it might be difficult for you to go to production.

But if you don't have, let's say, canary deployments as an example, or if you don't have idempotency in certain parts of your application, it's something where you can definitely go to production and improve as you go along. Looking at high risks first, if you get that nailed, absolutely go forward. And medium is something where there's always room for improvement, and we know that.
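The rule of thumb above (no high risks before going live, medium risks as an improvement backlog) can be expressed as a small release check. This is only a sketch: the findings format, the risk labels, and the sample entries are hypothetical and are not the output format of the Well-Architected Tool.

```python
from collections import Counter


def release_decision(findings: list) -> dict:
    """Summarize review findings: block on any HIGH risk, backlog the MEDIUM ones."""
    counts = Counter(item["risk"] for item in findings)
    return {
        "go_live": counts["HIGH"] == 0,
        "high": counts["HIGH"],
        "medium": counts["MEDIUM"],
        # Medium risks do not block the release; they become follow-up work.
        "backlog": [item["practice"] for item in findings if item["risk"] == "MEDIUM"],
    }


if __name__ == "__main__":
    # Hypothetical findings, for illustration only.
    sample = [
        {"practice": "canary deployments", "risk": "MEDIUM"},
        {"practice": "idempotency in the payment handler", "risk": "MEDIUM"},
        {"practice": "example high-risk finding", "risk": "HIGH"},
    ]
    print(release_decision(sample))
```

A check like this could run at any point during development, which fits the advice to use the review while researching and building instead of right before launch.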

Jeremy: Yeah. I like that. I like the strategy of serverless anyway, and applying that to the Well-Architected piece, it is very iterative. And actually, James Beswick just had a post that showed like, "Hey, I'm going to start by capturing this one thing, but then I'm going to send it to EventBridge or to SNS and then I'm going to process some secondary component and then I'm going to do something else." I love that idea of building incrementally, but I agree. If you're running npm and you see that you've got 900 high risk issues, you don't want to deploy that. So it's the same thing I would say with the Serverless Lens.

All right, so let's get into some of the details of the Serverless Lens itself. Because we've been talking about best practices and some of the services you can use. So let's actually get into those. We don't have to spend a lot of time on it. But I think it'd just be interesting to kind of review those so people know which services are available to them and what are the sort of current best practices. Like you said, I'm sure that's going to evolve over time. But anyways, so here's... Let's start with the compute layer. This is sort of in the white paper, there's a definition of all these.

And you should definitely go and read the white paper, by the way. If you haven't done that, that's just a really good resource to have all that data and you can kind of get that beforehand, maybe even before you start building your application so that when you start going through the tool, that those checkboxes are there for you. So let's start with the compute layer. From a serverless perspective, what are the compute options available to us from AWS?

Heitor: Specifically in the Lens, we have Lambda for doing the compute side of things for you, we also have API Gateway for doing some of the REST spaces, and we have Step Functions for doing some of the orchestration of that state. Ideally, AppSync would be there too. But we're trying to figure out whether that's going to be in the next updates for that.

But these are the primary ones, based on the best practices we have. The compute layer for us is any service within the Lens that we selected that processes your external requests, does some sort of computing, or does some sort of access control on those requests before they get to your business logic, for instance.

Jeremy: Right. Yeah. And I think it's actually interesting that API Gateway is in that compute section because it does actually do quite a bit. You can do throttling and you can do transformations. API Gateway is a very powerful service that has some really cool features around it. All right. What about the data layer?

Heitor: The data layer we have... Well, you basically are working with persistent storage; we're not dealing with the cache specifically yet. Here, we were looking at DynamoDB as one of the clear winners. DynamoDB is definitely being used by quite a lot of customers, specifically on serverless. And we call out not only Dynamo, but specific pieces of Dynamo like DynamoDB Streams, and more recently DAX, which we've added too. And we also cover other pieces like S3, and we cover Elasticsearch. And then that's where we cover AppSync as well.

The reason why we're undecided between AppSync being in the data layer and also in the compute layer is that a lot of customers are using AppSync for designing what we call schema first. So they're dealing with exactly how you model your application, which looks similar to database modeling if you will. I wouldn't say kind of but we had to make a decision. And so AppSync, in that case, actually could fit both criteria. On the compute, because there's a lot of authorization, a lot of logic, VTL, like API Gateway. But there are a lot more data aspects in AppSync and we tend to say that specifically in the Lens, if you're building data-driven applications, when you're trying to model things around your data, then AppSync from a GraphQL perspective, if it's something new, then it makes a lot more sense.

Jeremy: All right. Now, question for you. How did you let Elasticsearch creep into a Serverless Lens?

Heitor: One of the things that happened in the definition was we had all these questions first: do we add containers in there, specifically Fargate? Do we add something like Elasticsearch? Because even though it has servers, it's something that we know customers are... Specifically in 2017, it was the most common solution for analyzing your logs and that was kind of a best practice we needed to include.

I think it has to be updated this year to rehash some of those. But the definition for us is a way to introduce all of those services that we're going to talk about throughout the Lens. And we created categories to basically introduce what exactly that service does within the architecture you choose. For Elasticsearch, specifically, you wanted to... You have a scenario for mobile applications, so full-text search for those. Elasticsearch is actually the only option right now. Well, you could do that in Lambda in somewhat different ways. But that's kind of off the point now.

Jeremy: Yeah. But it's funny with Elasticsearch, because that is... It's been my go to for, I want to say since 2009, maybe 2008, well before Amazon even had the Elasticsearch service in place. That is one thing. That is definitely one of the missing pieces of serverless is to have a full-text search capability around that.

Also, I guess, caching as well for just a generic caching layer. I know some other providers have other things. What about Aurora Serverless, though? I know that's not in there now. Aurora Serverless still has problems, the same problems with exhausting connections and zombie connections just like you would if you're connecting to RDS. But actually it doesn't work with RDS Proxy. Which is interesting.

And if anybody wants to know why that is, my guess, and maybe you can answer this question, my guess is because you need to exceed a number of connections and CPU usage in order for the autoscaling to trigger. And if you had RDS Proxy in front of there, then that might not trigger your server to scale, which is actually part of the problem with my serverless MySQL package: if you put that in front of Aurora Serverless and you don't set it right, it doesn't scale up, which can be more of a problem. Anyways, where is Aurora Serverless on that list? How much of a risk is it to use that?

Heitor: I wouldn't call it a risk because we know customers are using that. The reason it's not in the Lens yet is exactly the reason you just mentioned about connection pooling, because we didn't have specific guidance on what the best ways are to tackle that. Now we have RDS Proxy. Now it's something where we could bring in not only RDS, but ElastiCache and a bunch of other services that are VPC specific, which previously had a lot of latency.

Now we can bring all these services in the new updates. And then we can add some call outs, if you are going to use Aurora Serverless for instance, here are some of the caveats that we know customers are using successfully in production. For now, it's not because it's from 2019 from the last re:Invent. But in upcoming updates, we do plan to have that.

Jeremy: And there's the data API too, which is very cool for, I would say, for asynchronous stuff. I don't know if it's ready for synchronous processes because it does have a higher sort of startup latency. But yeah, again, so many... I don't even know how you're going to fit all these things in a single Serverless Lens. There's just too many things to add. All right, so what about messaging? The messaging and the streaming layer? So we talked about EventBridge not being there yet. What do you have available to do that?

Heitor: Yeah, so before EventBridge, which is something, again, that we do plan to add in an upcoming update, especially now that so many customers are using it and James Beswick, Developer Advocate, has been doing a great job evangelizing some of the possible use cases. Before that, it was the classic ones you just mentioned: SNS, SQS. I didn't have SQS specifically called out there. But SNS is basically the go-to for low latency asynchronous communication between services. And that's still the case today. Customers are using EventBridge and SNS, and you can use SNS for very low latency.

Although, it's not the same feature set. It's a very different service. And then streaming, well, Kinesis is kind of the clear winner. And then we also added Kinesis Firehose. The only difference there is that we worked out a lot more, I guess, specifics about Kinesis and streaming. But we didn't add them because, well, now it's public, we have a specific Lens about analytics that dives into much greater length about Kinesis Firehose and some of the configuration pieces that you might want. For messaging, we basically kept it short with SNS, but we do plan to add more now with SQS as an event source, and EventBridge.

Jeremy: All right, so now even though you have to pay per shard for Kinesis, do you consider Kinesis to be serverless?

Heitor: I think we had this conversation when we wrote it in 2017. I can't tell you how much of a debate it was. Just to give an idea, with the first Serverless Lens, I had roughly 70 plus revisions before we got it out. And even before we went out publicly, I had over 700 revisions, 700 edits on things like, "Is this serverless? Is this serverless? Is this..." What I basically had to do is... The agreement we made was, there are things that are not serverless. Elasticsearch is definitely one of them. Kinesis, on the other hand, you definitely have this knob that you have to tweak about the shard counts.

But at the same time, it's something that empowers or is basically the backbone of many serverless applications doing streaming. For that reason, we decided to ask, "Is this something that is the backbone of a serverless application successfully running in production?" If it is, then what can we do to make it easier to manage and easier to operate following those best practices? So Kinesis falls into that bucket.

Jeremy: All right. You didn't answer my question. I wanted you to email Chris Munns and tell him that yes, it was serverless. All right. How about the user management and identity layer?

Heitor: Sure. Well, in this case, we only have Cognito. Cognito helps us and that is, I would say, serverless. Though I implicitly answered the previous one. So Cognito helps us to do the OAuth flow mechanisms or, more recently, a lot of customers are now using it for custom authentication mechanisms like passwordless sign-in, as in Slack or many other communication tools nowadays.

So Cognito falls into that bucket. There's not much to say there. We didn't spend too much time explaining too much about identity pools versus user pools. We briefly talk on the security pillar about ensuring that you're using identity metadata like Scopes in OAuth flows and many other mechanisms to do something more secure. But beyond that, it's plain simple Cognito integration and using Federation if you can too.

Jeremy: Yeah. And what about like JSON Web Tokens? I know that the new HTTP APIs, which we haven't talked about, but those are primarily just used... I think all they use is JWT tokens. Is that something where the Lens might eventually get to a point where it says, it's okay to use OAuth? That might be fine for the identity layer? Because you're already integrating with those. Or is it something where the Lens is going to stay very specific to just AWS services?

Heitor: No, not really. If you go to the security pieces, when you go to the very first question actually, when we ask about some of the security identity or throttling, if you will, we do a rundown in the paper specifically, of when to use IAM authentication with SigV4, when to use API keys, when to use custom authorizers, when to use something like OAuth with JWT. Previously with a REST API in API Gateway, it was a very contrived example of just validating that the token was valid. That works for simple use cases but it wouldn't work for something more enterprise where you need to verify a lot more logic on the JWT specifically.

We don't make that distinction about do not use JWT or use this instead. What we call out in the Lens is here are all these possible ways for you to do authorization, and specifically, this is authentication, this is authorization and these are the pros and cons of each. That's kind of the line we go with.
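
To make the point about "verifying a lot more logic" on a JWT concrete, here is a minimal Python sketch that checks the signature, the expiry and a required scope rather than only confirming the token parses. It is an illustration under stated assumptions, not what the Lens prescribes: it uses an HS256 shared secret so it stays self-contained (Cognito tokens are signed with RS256 and verified against published public keys), and the scope names and the `authorize` helper are hypothetical.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional


def b64url_decode(segment: str) -> bytes:
    # JWT segments drop base64 padding, so restore it before decoding.
    return base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4))


def b64url_encode(raw: bytes) -> str:
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()


def sign_hs256(claims: dict, secret: bytes) -> str:
    # Demo helper only: builds a token signed with a shared secret.
    header = b64url_encode(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = b64url_encode(json.dumps(claims).encode())
    sig = hmac.new(secret, f"{header}.{payload}".encode(), hashlib.sha256).digest()
    return f"{header}.{payload}.{b64url_encode(sig)}"


def authorize(token: str, secret: bytes, required_scope: str,
              now: Optional[float] = None) -> bool:
    """Return True only if signature, expiry and scope all check out."""
    try:
        header_b64, payload_b64, sig_b64 = token.split(".")
        header = json.loads(b64url_decode(header_b64))
        claims = json.loads(b64url_decode(payload_b64))
        signature = b64url_decode(sig_b64)
    except (ValueError, TypeError):
        return False  # malformed token
    if not isinstance(header, dict) or not isinstance(claims, dict):
        return False
    if header.get("alg") != "HS256":
        return False  # refuse unexpected algorithms, including "none"
    expected = hmac.new(secret, f"{header_b64}.{payload_b64}".encode(),
                        hashlib.sha256).digest()
    if not hmac.compare_digest(signature, expected):
        return False
    current = time.time() if now is None else now
    exp = claims.get("exp")
    if not isinstance(exp, (int, float)) or exp <= current:
        return False  # missing or expired
    # OAuth scopes are conventionally a space-separated string.
    return required_scope in str(claims.get("scope", "")).split()
```

A token that is perfectly valid can still be rejected here because it lacks the scope the route needs, which is the difference between authentication and authorization mentioned above.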

Jeremy: Right. All right, so what about the edge layer, because I think edge computing is getting extremely popular. I don't know if you've been following along, but Cloudflare just did this huge thing where now it's like nanosecond cold starts and expanding the workloads they can do, adding more languages. So I think the edge is going to be really interesting, especially from a compliance standpoint in terms of where you're processing data and handling workloads and things like that.

Where are we now with the edge in terms of the Serverless Lens, but where do you think... I'd actually asked you a question beyond the Serverless Lens, but where do you think AWS is going to go with that edge computing stuff?

Heitor: One thing I was... Actually first I would definitely agree. I think it's something that's becoming more and more popular. One thing that I was very surprised to see the uptick of customers using it and without naming customers specifically yet because they're not public, is the amount of customers going from single page applications which we've been seeing have been popular over the years to something like going back I guess, if you will, into the server side rendering and now more specifically something like Gatsby which is something super popular and super handy.

But that Lambda@Edge thing, or doing the compute at the edge, is becoming hugely popular in streaming. And more recently, customers are using the edge to do not only click streaming for those analytics pieces, but also data ingestion in multiple regions, which is something that has been quite popular now. For the edge layer, it's something I want to add in the new update as well. Specifically to cover the server side rendering.

There are customers doing hundreds of thousands of requests per second on server side rendering that it's something that we want to detail a bit more what we mean by server side rendering to do that at the edge, how do you do cache, so you reduce your cost, but you equally get the performance and SEO of that.

At the moment, from what I've been following on the edge pieces, customers are now more comfortable with server side rendering and, more recently, incremental static generation with something that Next.js just did. We have been following those pieces. The applications that live completely at the edge, I haven't seen that much yet. But as the edge makes more progress and lifts some of those limitations for timeouts and RPC calls, I think we might be seeing that shortly.

Jeremy: Yeah. No, I agree. I think that SPAs are great and they have their use, and I think if you're doing server side rendering, then with that rendered page the first paint happens very quickly, and then you can start interacting with it and so forth. I think it is a really popular way that we're seeing a lot, especially like you said, with Next.js and what Vercel is doing and some of these other companies. I think that's really, really interesting.

That'll be cool to see where that's going to fit into an overall serverless strategy, because certainly if your front end is going to be rendering web pages, then I think the edge is going to play a major part in that. And if AWS has a good strategy around that, that'll be very interesting to see. Okay, so system monitoring and deployments.

Heitor: In that case, it's a very simple one. We see CloudWatch as being like the backbone of all those metrics and logs and KPIs that customers use. X-Ray, which is our official mechanism for doing distributed tracing. And then SAM is what we call out as our official way of doing deployments as well. But we don't specifically say choose this framework over the other framework. What we recommend instead, to be timeless, because as you mentioned the landscape changes a lot, is do use a serverless framework.

In this case, we're basically outlining SAM as the official version from AWS. But we also recommend many other frameworks as well. It's all about what services to use for metrics, KPIs, logs and tracing, and how you should make sense of all these little Lambda functions that tend to grow organically. How do you handle those? In that case, it's a framework.

Jeremy: Right. And actually, I think that that category is probably one of the largest categories that's been or the category that's been affected most by third party services. So you have all of those monitoring tools that have launched. You have a number of different deployment engines and frameworks that help with that.

In the system now or in the current version of the whitepaper in the Lens, you're recommending SAM as you said. What do you think about the CDK though and also maybe SAR. Where are those going to fit in do you think in the future of the lens?

Heitor: The SAR, we have some references already as links but not specifically as a service because it wasn't something that, I think it acts more... It aids your deployments as opposed to, this is how you deploy things, as in SAM or Serverless Framework, if you will. And the CDK is different though. CDK, it wasn't GA until recently and some of the constructs... I think, if I'm not mistaken, API Gateway still is not GA, it's actually either preview or beta at the moment.

There are certain things that I wouldn't be able to recommend in the Lens. As in, we know customers could use L1 constructs like lower level constructs and make their way up because it's basically CloudFormation either way, but in this case, we're basically recommending SAM because we know it's working in production and we know they can just use it. CDK, it's more of a discussion. We need to think how we can frame this in the Serverless Lens.

I think what CDK enables today, it's something we couldn't do easily before. Like the likes of Liberty Mutual and many others. Alma Media, is one of the examples that I was blown away by how they use a CDK so effectively. When you're doing multiple best practices at a larger organization in multiple teams, CDK makes it so much easier to onboard those practices, internalize those blueprints than if you were to do SAM or Serverless Framework. While Serverless Framework has the components if I'm not mistaken, which can now do something similar, but doing it in an imperative way, it makes it easier.

But at the same time, I have recently found customers actually tripping up with the amount of abstractions. Developers are going to abstract. It's just the case. If you have the power of doing it, why would you not do it?

Jeremy: Exactly.

Heitor: But then at the same time, you get the problem of, I now have no idea what this line, this "build the best application possible serverlessly on AWS" constructor, does, and then you have to dig in and it gets kind of complicated. And this is not a new problem. This is something that we've been seeing.

Like in 2015 when I started doing microservices in production, when customers would have 7 deployment tools because one found a better way to abstract things. I think CDK has its place with those customers looking to use a programming language to easily deploy those applications, but in serverless, SAM is definitely predominant. CDK on containers, on the other side, has definitely made life so much easier for deploying containers on AWS.

Jeremy: Yeah. And I was sort of against the CDK initially, because I was like, "Oh, there's another layer of abstraction on top of every layer of abstraction on top of a layer of abstraction." But what I do like about what people are doing with the CDK, and you mentioned this about sort of baking in some of those best practices, and same thing with serverless components, is sort of, if you're a team and you say, all right, here's the bootstrap for a serverless microservice. And really all we want to do... All of my X-Ray and my logging and whatever my security best practices are and all that stuff, any layers that I need, if you can encapsulate that all into one construct and then be able to just add services or add routes or whatever it is that you're doing on top of that, I think that's a really powerful level of abstraction.

Because I think that could give us the ability to just say, all right, I don't have to have a 600 file bootstrap template that I use to start every new serverless project, that a lot of that stuff could just be baked in. And again, same thing with serverless components. Even Pulumi, and some of these other ones that are doing sort of similar stuff. I do like that idea of potentially being able to sort of encapsulate that. But anyway, so deployment approaches. You mentioned Canary deployments earlier. What are the best practices now for those deployment approaches?

Heitor: Deployment approaches haven't changed much. It's still the same as before. We still have all at once, which is basically for deploying a thing specifically on dev. You're deploying something, you're iterating fast, and you want to make sure whatever you deploy, it's working or it's not working. When you're going to production, you still have the mix between should I do Blue-Green? Should I do canaries? Canaries is an easier one to think about, to reason about, but you have to have a lot of traffic to be able to shift a percentage of your traffic to a newer version.

And that's kind of where most people get tripped up when they're doing server side rendering, which is why I want to have a dedicated piece on server side rendering at the edge. People were trying to use canary deployments with server side rendering when in fact more than 70% of the traffic was being cached.

You wouldn't be able to see any of that and then you would go and then you'd break. Blue-Green is kind of a classic one. You keep both of them and then you switch, either using DNS or using some other pieces. It hasn't changed much. It's not specific to serverless per se. But in the Lens, we cover how you could do that using SAM or using any other frameworks. I think there's a table. I'm just looking at the Lens paper now myself. There is a table where we basically tell you, "These are the differences between all these three and when to use each."
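
The caching caveat can be put into numbers. The Python sketch below estimates how many requests per second actually reach a canary version when a CDN serves part of the traffic from cache. The 70% cache hit ratio comes from the discussion above; the total request rate and the canary percentage in the usage note are hypothetical, and the function name is made up for illustration.

```python
def canary_requests_per_second(total_rps: float,
                               cache_hit_ratio: float,
                               canary_percent: float) -> float:
    """Estimate the request rate that reaches the canary version.

    Requests answered from the edge cache never reach the origin, so only
    the cache misses are split between the old and the new version.
    """
    if not 0.0 <= cache_hit_ratio <= 1.0:
        raise ValueError("cache_hit_ratio must be between 0 and 1")
    if not 0.0 <= canary_percent <= 100.0:
        raise ValueError("canary_percent must be between 0 and 100")
    origin_rps = total_rps * (1.0 - cache_hit_ratio)
    return origin_rps * canary_percent / 100.0
```

With 1,000 requests per second, a 70% cache hit ratio and a 10% canary, roughly 30 requests per second exercise the new version instead of the 100 you might expect, so error-rate KPIs take longer to become meaningful.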

It's more common for people to use linear deployments. So you're shifting a percentage of traffic over a period of time. And then you use KPIs to revert if something went wrong.
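
As a rough illustration of a linear deployment, the Python sketch below builds a traffic-shifting schedule and walks through it, rolling back as soon as a KPI breaches a threshold. The function names, the 1% error threshold and the `error_rate_at` callback are assumptions made for this example; in practice the shifting and the alarms would be handled by the deployment tooling rather than by hand-written code.

```python
from typing import Callable, List, Tuple


def linear_schedule(step_percent: int, interval_minutes: int) -> List[Tuple[int, int]]:
    """Return (minute, percent_on_new_version) pairs for a linear shift."""
    if not 0 < step_percent <= 100:
        raise ValueError("step_percent must be between 1 and 100")
    schedule = []
    minute, percent = 0, 0
    while percent < 100:
        percent = min(100, percent + step_percent)
        schedule.append((minute, percent))
        minute += interval_minutes
    return schedule


def run_deployment(schedule: List[Tuple[int, int]],
                   error_rate_at: Callable[[int], float],
                   max_error_rate: float = 0.01) -> Tuple[str, int]:
    """Walk the schedule and roll back when the KPI breaches the threshold.

    error_rate_at is a stand-in for reading a real metric at a given
    traffic percentage.
    """
    for _minute, percent in schedule:
        if error_rate_at(percent) > max_error_rate:
            return ("rolled_back", percent)
    return ("completed", 100)
```

For example, `linear_schedule(10, 10)` shifts 10% more traffic every 10 minutes and reaches 100% at minute 90.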

Jeremy: Yeah. That's interesting, because I think what you... The best thing you can take away from that is just do not do SAM deploy right into production from your laptop, for example. You should have some strategies in place, especially CI/CD and some of those other things that I think make a lot of sense. Read the paper, figure that stuff out. I do want to talk quickly though, before I let you go, about some of the use cases and that's one of the things that the paper does is it outlines a few scenarios.

This is what I think more people need to see because like you said, your best practices and the things that make it into these papers are based off of whether or not customers and technical specialists and evangelists and so forth are using these successfully. Let's just go through these quickly. Just give me an overview of it. And maybe you can outline some of the best practices for these. But like RESTful microservices, for example.

Heitor: The RESTful microservices... Well, actually one of the classic ones, once API Gateway came out, most customers were using this as the go-to use case. For the microservices, instead of creating a bigger picture of what the microservice might look like, which wouldn't fit into your diagram, we chose to do something more conservative, which we want to update now, in the next one, to show a bit of caching, a bit of other pieces that also come through. Now, a VPC that enables you to do more interesting things.

In the RESTful API, what we cover is how do you have an API that your client will interact with with a contract and then how your backend, in this case, using Lambda functions can interact with your persistent storage. We chose something very simple and then you basically just store something into DynamoDB.

However, in the caveats or configuration notes, we actually covered some of those pieces that are more specific. Like we talked about data being geographically close. How do you work with API Gateway access logs. Back in the days, people were just enabling logs for API Gateway and all of a sudden you have your incoming requests and your responses all in plain text in your logs. Customers who were more sensitive to security would be like, "Oh, no, there's got to be a better way."

In the configuration notes, we would basically talk you through some of those things. How do you do logging for the REST API in API Gateway the better way and how do you basically model some of those other pieces to do full-text search on logging operations. It's a very contrived example. It's more to show, this is how simple a RESTful microservice could be in serverless. But it's something that we want to update to include now DAX, which wasn't easily done before with DynamoDB and Lambda, or ElastiCache now and things like this.
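
One way to picture the logging concern is a function-side log line that records request metadata but masks sensitive headers and never writes the body. The Python sketch below is only an illustration of that idea, not the Lens's own guidance: the list of sensitive header names and the helper names are assumptions, and the event fields (`httpMethod`, `path`, `headers`, `body`) follow the REST API proxy integration event shape.

```python
import json

# Assumed list of header names to mask; adjust to your own application.
SENSITIVE_KEYS = {"authorization", "cookie", "x-api-key", "password"}


def redact(value):
    """Recursively mask values stored under sensitive keys."""
    if isinstance(value, dict):
        return {
            key: "***" if key.lower() in SENSITIVE_KEYS else redact(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [redact(item) for item in value]
    return value


def access_log_line(event: dict) -> str:
    """Build a structured log line from request metadata only.

    The request body is deliberately left out so payloads never land in
    the logs in plain text.
    """
    entry = {
        "method": event.get("httpMethod"),
        "path": event.get("path"),
        "headers": redact(event.get("headers") or {}),
    }
    return json.dumps(entry, sort_keys=True)
```

Emitting JSON rather than free text also keeps the log lines easy to search later.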

Jeremy: Yeah. And I love just the idea of RESTful microservices with serverless. Because again, it just... You don't need web servers anymore. It's just amazing what you can do with these APIs. And I know there are a lot of blog posts out there that say, "Oh, the cold starts and so forth. It's not ready for primetime." It is ready for primetime.

So there are lots of your customers using this. I know I've been using this for, I don't know, probably 25-30 projects at this point that are out there. Very good scenario and example and use case for that. All right, so another one that I know that Aleksandar Simovic would like is the Alexa skills.

Heitor: Yeah. The Alexa skills, it was a partnership with the Alexa team. We know many customers have been using Lambda for Alexa. There are also other use cases as well. But Alexa was one of them that Alex not only used a lot and advocates a lot, he did so many things with it as well. But one thing that we saw was missing for Alexa was that we were always explaining to customers how to use Alexa with serverless as, here's how you can choose a random number from one to 15. Or here's the Hello World example.

There wasn't anything about good practices or good design decisions, because this is also a very different way. You also need to think about UX of your audio and your transcript and how you're interacting with the customer. It basically gave a little bit more room for Alexa scenario to tell you when you are designing a skill, what are the things you have to keep in mind and what a good experience or what delightful experience actually means.

Those are some of the things you should keep in mind, and we also go into more detail about what a proper Alexa skill might look like. It's not going to be something like an Alexa skill talking to a Lambda that talks to a Dynamo. There are other things as well. We cover things like DynamoDB auto scaling, and what if you're using IoT with an Alexa smart home skill? How does that fit together? That Alexa scenario covers a common Alexa skill, how to design it, and when you expand an Alexa skill to do more things, what does that look like as a whole?

Jeremy: Yeah. All right. What about Mobile backends?

Heitor: Mobile backends is a... You probably have seen the serverless airline example. It's becoming one of my favorite ones nowadays. The Mobile backend, it's covering AppSync or GraphQL specifically, on how customers are building mobile applications these days. There have been some changes with DataStore, and Amplify changed a lot recently, so we have to update it. But it covers things like when you are dealing with SMS or multi-factor authentication or user registration or assets, or dealing with a single graph, as we typically call it in GraphQL, and handling multiple types of data sources, like Elasticsearch for full-text search in some part of your mobile application, parts of your data that could be in DynamoDB or NoSQL, parts of your data that could be in a relational database and some other third party communications that you want to use Lambda with.

The mobile covers all of these aspects on how we use all these different services to hydrate data that's in DynamoDB that now goes into Elasticsearch or how does your user use a single API that can talk to different data sources based on what your customer wants?

Jeremy: Yeah. I love the approach to Mobile backends especially with GraphQL and being able to avoid that overfetching and underfetching problem. That's all great stuff there. All right, what about stream processing?

Heitor: The stream processing is one that got a new update, not in the Serverless Lens, but specifically in the Analytics Lens. It covers things like stream processing and how you handle batch processing specifically. It doesn't go into as much detail as the Analytics Lens as of now, but we do cover things like best practices about using a single shard, and when you have to use the new parallelization factor, or how do you design a good streaming solution for your payload and stuff like that. How do you handle high throughput in DynamoDB with streaming or partition keys and stuff like that.
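
Partition keys matter because Kinesis hashes each key with MD5 into a 128-bit space and routes the record to the shard whose hash key range contains the result. The Python sketch below shows the effect of key choice on distribution. It assumes the shards split the hash space evenly, which is typical for a freshly created stream but not guaranteed after resharding, and the function names and device keys are hypothetical.

```python
import hashlib
from typing import Iterable, List

HASH_SPACE = 2 ** 128  # MD5 produces a 128-bit value


def shard_for(partition_key: str, shard_count: int) -> int:
    """Map a partition key to a shard index.

    Assumption: the stream's shards divide the hash space into equal,
    contiguous ranges.
    """
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_key = int.from_bytes(digest, "big")
    return hash_key * shard_count // HASH_SPACE


def distribution(keys: Iterable[str], shard_count: int) -> List[int]:
    """Count how many records land on each shard."""
    counts = [0] * shard_count
    for key in keys:
        counts[shard_for(key, shard_count)] += 1
    return counts
```

A single hot key sends every record to one shard no matter how many shards exist, while high-cardinality keys spread the load, which is why key design comes up alongside shard counts and the parallelization factor.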

Lambda doesn't have a specific library like the KPL, the Kinesis Producer Library, or the Kinesis Client Library like you normally have on EC2 or in a container. So it gives you some of the workarounds on how you can handle that, some other libraries that you could use, or dealing with duplicate records or idempotency, and things that evolved as well.
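
Because a stream can deliver the same record more than once, handlers are usually written to be idempotent. The Python sketch below shows the basic idea: derive a key from the record, and return the stored result instead of running the side effect again. The class name is hypothetical, and the in-memory dictionary is only a stand-in for a durable store, since Lambda execution environments do not share memory.

```python
import hashlib
import json
from typing import Any, Callable, Dict


class IdempotentProcessor:
    """Run a handler at most once per distinct record."""

    def __init__(self, handler: Callable[[dict], Any]) -> None:
        self._handler = handler
        # Stand-in for a durable store such as a database table.
        self._results: Dict[str, Any] = {}

    def process(self, record: dict) -> Any:
        # Hash the canonical JSON form so key order does not matter.
        key = hashlib.sha256(
            json.dumps(record, sort_keys=True).encode("utf-8")
        ).hexdigest()
        if key in self._results:
            return self._results[key]  # duplicate delivery: skip the work
        result = self._handler(record)
        self._results[key] = result
        return result
```

Hashing the whole payload is one choice among several; using a business identifier from the record as the key is often a better fit when payloads can differ slightly between retries.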

Like we used to say, because of the way stream processing works, if your Lambda function fails, it will block the stream and it would keep sending the same records. On queue, you'll probably have some data loss. Recently, we announced async controls that give you more flexibility. We added that recently too.

Jeremy: Awesome. All right. And then the final one here is the web application, which sort of goes beyond just, I guess, the RESTful microservice and adds S3 and some of the other stuff, edge computing. I know you said you want to update that with some SSR and some of that. But what do we have currently for that scenario?

Heitor: At the moment it's very similar to the mobile application. It learns from the mobile where you have the static assets into your S3 and you have a CDN on top. So you separate the two. You still use Cognito exactly the same way for using user authentication, user management. But you're now dealing with your API gateway, handling the authorization of the JWT token that you got from Cognito and then landing in DynamoDB, which is very similar to the REST API.

The difference is that you're now using a more sophisticated authorization mechanism than you probably would in a REST API, because that could be service to service communication where IAM would be a lot simpler. Or you could also use custom API keys when you're doing things like throttling, but not only throttling, but also tiers. Your application is in freemium, in premium or business or enterprise. It has a little bit more detail on, if you're going to go down that route, these are some of the good practices you have to follow.
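
Tiered throttling of the kind described here is commonly modeled as a token bucket per API key, with a steady rate and a burst allowance that differ by tier. The Python sketch below illustrates the mechanism only; the tier names mirror the ones mentioned above, but the rate and burst numbers and the `bucket_for_tier` helper are hypothetical.

```python
# Hypothetical limits per tier: (requests per second, burst size).
TIER_LIMITS = {
    "freemium": (1.0, 2),
    "premium": (10.0, 20),
}


class TokenBucket:
    """Token bucket: refills at a steady rate up to a burst capacity."""

    def __init__(self, rate: float, burst: int, now: float = 0.0) -> None:
        self.rate = rate
        self.capacity = float(burst)
        self.tokens = float(burst)  # start full so bursts are allowed
        self.updated = now

    def allow(self, now: float) -> bool:
        # Refill for the time elapsed since the last call, capped at capacity.
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # the caller would answer with HTTP 429


def bucket_for_tier(tier: str, now: float = 0.0) -> TokenBucket:
    rate, burst = TIER_LIMITS[tier]
    return TokenBucket(rate, burst, now)
```

A freemium key in this sketch can make two calls immediately, is then throttled, and earns one more call per second, while a premium key gets ten times the headroom.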

Jeremy: Right, perfect. All right. I wanted to talk to you about the five pillars, but I think we're running out of time. Maybe what we could do is just... So you mentioned them earlier. You mentioned operational excellence, security, reliability, performance efficiency and cost optimization. But I think what a really interesting aspect of serverless on this has to do with cost.

There's a whole bunch of things around reliability and security that are already baked in. But I'd like to talk to you just, use some time wisely here and talk to you about the cost aspect of it. Cost optimization in serverless applications. And I guess, in the Well-Architected Framework in general, what are your thoughts on that? Why is that so powerful, you think?

Heitor: There are many aspects that we could tackle. I think when customers think about cost, they typically would think about if I have the server running, how much it would cost versus having a serverless approach? They will try to do apples to apples when in fact it isn't. I think we have this discussion many times written in blogs like yourselves, or Yan Cui or Ben or even Lego as well on growing serverless teams, when in fact, one of the most costly aspect is actually developer hours.

Actually, those developers, one of the things that... I think the main reason I was so passionate about serverless was when I used to work with customers where we had to have a platform team, we have to have a SRE, if you like and many other people. Basically maintain a basic platform to run those services that they needed. And serverless, when I was working with British Gas, specifically or Centrica, as they went on to re:Invent three years ago, all we had was you know what, let's start with four developers, you add one architect to help us and you have someone from security as well and someone with ops so we can basically have a team that we can have people on and off. But this developer should be able to do.

Basically three months, less than three months to be honest, they had no AWS experience and they got off the ground and got something into production as well with those practices. That changed the cost perspective because there was one of the applications that they had over 100 people to maintain. When you're thinking about a server or a... We don't even have to go too deeply into the load balancer versus Fargate, versus all these nitty-gritty details, but on people specifically, instead of saying, "Oh, we don't need all these people anymore." It's actually quite the opposite.

You could train these people now to do a different role. And you now have this army of talented people already in your organization that could be retrained to add more features, and you can ditch the competition in a way. I think that cost is something that you've seen in the Serverless Lens as well. Which was also the hardest. How do you ask questions about costs when serverless is mostly cheap, if you will, inexpensive if you will.

Jeremy: Exactly. I think that to me is the biggest... The biggest cost factor is not how much does it cost to run a Lambda function or what does it cost for API gateway? There are certain services where you can start running up some bills, but for the most part, it's just how much does it cost to have those SREs like you said or to have those DevOps people or to have all these other ops people that have to constantly monitor servers and make sure things are up and running and then just the wasted processing time that you're... All that idle time that you're probably getting over provisioning, under provisioning, trying to set up autoscaling.

There's just so much that you can save by going serverless. Awesome. All right, I have one listener question for you and if anybody wants to ask questions to the guests here on Serverless Chats, go to serverlesschats.com/insiders, sign up to be an Insider, and you can ask questions to our wonderful guests, like Heitor here.

I have a question from Michael. And he... I'm not 100% sure I understand the question exactly, but maybe we can break it down. He said, "I'd like to learn about best practices on sharing models between services that use the same table." So talking about I'm assuming single table design in DynamoDB. "So in my case, I have an API service and an ETL service which share the same table, and I haven't found an approach that I'm happy with yet."

I don't know if he's talking about entity models or API Gateway models, but I don't know. What are your thoughts on that question?

Heitor: Yeah. I was going to ask you the same. I'm not sure if he means about API gateway models where you basically define your contract and how your client or your consumer is going to work with or if it's about the data entity model when you're trying to design a database or a single table, if you will.

In the case of API Gateway models, I think there's a great example of... We have an open source example, the serverless e-commerce platform, which I think is the most... It is the most comprehensive example that shows how you deal with API Gateway models, especially event schemas for EventBridge as well and the tooling around it. And then it shows you some of the design decisions and why we made the choices we made.

Have a look at that one to give you an idea about the tooling and how to share some of those models across services, when it makes sense, because there are some parts of the contract that you can share. From a database perspective, I think it goes into where we think, is that multiple services accessing the same database or should we... Maybe we need to think about that? Can you explain a little more? Or is it something just like an ETL, like in the serverless airline. Is another service adding new flights into the service that already handles those flights?

Then in that case, it's more about making sure you don't get into the situation where the ETL uses more of your database than your service can use at the time, because I've had many incidents in production like that as well, even in serverless. When the ETL function basically took over all the concurrency of the whole account. So you need to be mindful of both things. But from the models perspective, I don't know if that person means single table but it goes... As long as you have a way to protect against your ETL over consuming or over utilizing in a way that impacts your customer experience, I don't see that much of an issue.

The issues I saw with models are mostly about managing them with frameworks like Serverless Framework or SAM. How do I store these things that are stored in a file? How do I make sure it's easy to change? But also, especially in that single table context, it's quite complex to get it right. But once you get it right, it looks like a dream. But then you also need to think about when you make changes, how do you make those changes? I don't think I have an answer, if I understood that correctly. I think we're going in circles, I guess.

Jeremy: No. Yeah. Well, I'm wondering too, from the ETL perspective, and maybe this doesn't answer the question, but if you have an API service that's accessing a table and loading certain types of data, and then you have an ETL task that runs at night, it depends on what that ETL task is doing. If it's converting, or if it's doing aggregations maybe, so it's aggregating counts across logs or something like that, I wouldn't want that ETL task touching my production table directly.

I think I would want that ETL task, if it's doing some sort of roll-up, to operate maybe in its own table and then just send those aggregations through an API Gateway maybe, or through an event, maybe EventBridge. That way the API service could accept that event or those updates, but through a contract, maybe through an API or through something like an event schema. I don't know.

And maybe this is not answering the question. But I think it's an interesting debate too, because part of the problem with microservices in serverless is: where's the boundary of the microservice, and does every function get its own table? No. But I mean, that's the kind of thing. Like, how many functions should be interacting with it, and should functions ever cross between bounded contexts? I don't think they should. But I think there are still a lot of people trying to figure that out.
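Note (not part of the conversation): Jeremy's suggestion of having the ETL task publish its roll-ups through an event contract, rather than writing to the production table, could look roughly like the sketch below. The event source, detail type, bus name, and the `DailyRollup` fields are all hypothetical. Only the entry keys (`Source`, `DetailType`, `Detail`, `EventBusName`) and the delivered event's `detail-type` and `detail` fields follow the EventBridge event format.

```python
import json
from dataclasses import asdict, dataclass
from datetime import date

SOURCE = "etl.rollups"               # hypothetical event source
DETAIL_TYPE = "DailyRollupComputed"  # hypothetical detail type


@dataclass(frozen=True)
class DailyRollup:
    """The shared contract: the only shape the API service agrees to accept."""
    metric: str
    day: str    # ISO date, e.g. "2020-08-10"
    count: int

    def __post_init__(self):
        if not isinstance(self.metric, str) or not self.metric:
            raise ValueError("metric must be a non-empty string")
        if not isinstance(self.count, int) or self.count < 0:
            raise ValueError("count must be a non-negative integer")
        date.fromisoformat(self.day)  # raises ValueError on a malformed date


def to_event_entry(rollup: DailyRollup, bus_name: str = "analytics-bus") -> dict:
    """ETL side: build one entry for an EventBridge PutEvents call."""
    return {
        "Source": SOURCE,
        "DetailType": DETAIL_TYPE,
        "Detail": json.dumps(asdict(rollup)),
        "EventBusName": bus_name,
    }


def from_event(event: dict) -> DailyRollup:
    """API service side: accept the event only if it matches the contract."""
    if event.get("detail-type") != DETAIL_TYPE:
        raise ValueError("unexpected detail-type")
    try:
        return DailyRollup(**event["detail"])
    except (KeyError, TypeError) as exc:
        raise ValueError("event does not match the DailyRollup contract") from exc


# Simulate what the consumer would receive after the ETL publishes an entry.
entry = to_event_entry(DailyRollup(metric="log_lines", day="2020-08-10", count=42))
delivered = {
    "source": entry["Source"],
    "detail-type": entry["DetailType"],
    "detail": json.loads(entry["Detail"]),
}
accepted = from_event(delivered)
```

The point of the design is that the ETL and the API service share only the `DailyRollup` shape, so the ETL never needs access to the API service's table.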

Heitor: In fact, I was looking at the Analytics Lens. I was just searching for ETL, and they actually go into great length about how to choose your ETL: whether you need a nightly batch, or a streaming batch, or an on-demand batch, or high-frequency ETL. Have a look at that; hopefully it answers your question. In the design principles of the Analytics Lens, there are a bunch of scenarios specifically on how to use ETL and how to separate your queries, like you were just mentioning, Jeremy, where instead of an API you have a data lake, if you will, and you use Athena to search for that specific aggregation.

If that's what you're after. It's quite difficult to know from the question, but the Analytics Lens covers a lot specifically about ETL. Ping us on Twitter. We're happy to help and to have that debate publicly as well.

Jeremy: Awesome. All right. Well, listen Heitor, thank you so much for spending the time with me sharing all this knowledge. If people want to find out more about the Serverless Lens or they want to contact you, how do they do that?

Heitor: So the Serverless Lens, if you just search for Well-Architected Serverless Lens, you will find the whitepaper. But you can also go to the console, and if you search for Well-Architected, you will find the Well-Architected Tool right in the console. When you're creating a new application, you can select the Serverless Lens and you get all the questions and the best practices.

That's the best way to find the best practices if you don't want to read the 80 pages upfront, because it will give you a more summarized version of what the best practice is, why it is important for you to do, how exactly you do it step by step, and how you evolve it.

And if you want to find me, I'm also on Twitter, @heitor_lessa. If you need to message me, if you're doing something with the Serverless Lens and you want to share some updates, or if you have specific feedback, reach out to me by email as well. Use my last name, Lessa, L-E-S-S-A@amazon.com.

Jeremy: All right. Well, we will get all that into the show notes. Thanks again Heitor.

Heitor: Big pleasure. Thanks for having me Jeremy.

