
Serverless Chats

Episode #105: Building a Serverless Banking Platform with Patrick Strzelec

About Patrick Strzelec

Patrick Strzelec is a fullstack developer with a focus on building GraphQL gateways and serverless microservices. He is currently working as a technical lead at NorthOne making banking effortless for small businesses.


LinkedIn: Patrick Strzelec
NorthOne Careers: www.northone.com/about/careers


Watch this episode on YouTube: https://youtu.be/8W6lRc03QNU  

This episode sponsored by CBT Nuggets and Lumigo.

Transcript
Jeremy: Hi everyone. I'm Jeremy Daly, and this is Serverless Chats. Today, I'm joined by Patrick Strzelec. Hey, Patrick, thanks for joining me.

Patrick: Hey, thanks for having me.

Jeremy: You are a lead developer at NorthOne. I'd love it if you could tell the listeners a little bit about yourself, your background, and what NorthOne does.

Patrick: Yeah, totally. I'm a lead developer here at NorthOne, I've been focusing on building out our GraphQL gateway here, as well as some of our serverless microservices. What NorthOne does, we are a banking experience for small businesses. Effectively, we are a deposit account, with many integrations that act almost like an operating system for small businesses. Basically, we choose the best partners we can to do things like check deposits, just your regular transactions you would do, as well as any insights, and the use cases will grow. I'd like to call us a very tailored banking experience for small businesses.

Jeremy: Very nice. The thing that is fascinating, I think about this, is that you have just completely embraced serverless, right?

Patrick: Yeah, totally. We started off early on with this vision of being fully event driven, and we started off with a monolith, like a Python Django big monolith, and we've been experimenting with serverless all the way through, and somewhere along the journey, we decided this is the tool for us, and it just totally made sense on the business side, on the tech side. It's been absolutely great.

Jeremy: Let's talk about that because this is one of those things where I think you get a business and a business that's a banking platform. You're handling some serious transactions here. You've got a lot of transactions that are going through, and you've totally embraced this. I'd love to have you take the listeners through why you thought it was a good idea, what were the business cases for it? Then we can talk a little bit about the adoption process, and then I know there's a whole bunch of stuff that you did with event driven stuff, which is absolutely fascinating.

Then we could probably follow up with maybe a couple of challenges, and some of the issues you face. Why don't we start there. Let's start, like who in your organization, because I am always fascinated to know if somebody in your organization says, “Hey we absolutely need to do serverless," and just starts beating that drum. What was that business and technical case that made your organization swallow that pill?

Patrick: Yeah, totally. I think just at a high level we're a user experience company, we want to make sure we offer small businesses the best banking experience possible. We don't want to spend a lot of time on operations, and trying to, and also reliability is incredibly important. If we can offload that burden and move faster, that's what we need to do. When we're talking about who's beating that drum, I would say our VP, Blake, really early on, seemed to see serverless as this amazing fit. I joined about three years ago today, so I guess this is my anniversary at the company. We were just deciding what to build. At the time there was a lot of architecture diagrams, and Blake hypothesized that serverless was a great fit.

We had a lot of versions of the world, some with Apache Kafka, and a bunch of microservices going through there. There's other versions with serverless in the mix, and some of the tooling around that, and this other hypothesis that maybe we want GraphQL gateway in the middle of there. It was one of those things that we wanted to test our hypothesis as we go. That ties into this innovation velocity that serverless allows for. It’s very cheap to put a new piece of infrastructure up in serverless. Just the other day we wanted to test Kinesis for an event streaming use case, and that was just a half an hour to set up that config, and you could put it live in production and test it out, which is completely awesome.

I think that innovation velocity was the hypothesis. We could just try things out really quickly. They don't cost much at all. You only pay for what you use for the most part. We were able to try that out, and as well as reliability. AWS really does a good job of making sure everything's available all the time. Something that maybe a young startup isn't ready to take on. When I joined the company, Blake proposed, “Okay, let's try out GraphQL as a gateway, as a concept. Build me a prototype." In that prototype, there was a really good opportunity to try serverless. They just ... Apollo server launched the serverless package, that was just super easy to deploy.

It was a complete no-brainer. We tried it out, we built the case. We just started with this GraphQL gateway running on serverless, on AWS Lambda. It's funny because at first, it's like, we're just trying to sell them development. Nobody's going to be hitting our services. It was still a year out from when we were going into production. Once we went into prod, this Lambda's hot all the time, which is interesting. I think the cost case breaks down there because if you're running this thing, think forever, but it was this GraphQL server in front of our Python Django monolith, with this vision of event driven microservices, which has fit well for banking. If you just think about the banking world, everything is pretty much eventually consistent.

Just, that's the way the systems are designed. You send out a transaction, it doesn't settle for a while. We were always going to do event driven, but when you're starting out with a team of three developers, you're not going to build this whole microservices environment and everything. We started with that monolith with the GraphQL gateway in front, which scaled pretty nicely, because we were able to sort of, even today we have the same GraphQL gateway. We just changed the services backing it, which was really sweet. The adoption process was like, let's try it out. We tried it out with GraphQL first, and then as we were heading into launch, we had this monolith that we needed to manage. I mean, manually managing AWS resources, it's easier than back in the day when you're managing your own virtual machines and stuff, but it's still not great.

We didn't have a lot of time, and there was a lot of last-minute changes we needed to make. A big refactor to our scheduling transactions functions happened right before launch. That was an amazing serverless use case. And there's our second one, where we're like, “Okay, we need to get this live really quickly." We created this work performance pattern really quickly as a test with serverless, and it worked beautifully. We also had another use case come up, which was just a simple phone scheduling service. We just wrapped an API, and just exposed some endpoints, but it was just a lot easier to do with serverless. Just threw it off to two developers, figure out how you do it, and it was ready to be live. And then ...

Jeremy: I'm sorry to interrupt you, but I want to get to this point, because you're talking about standing up infrastructure, using infrastructure as code, or the tools you're using. How many developers were working on this thing?

Patrick: How many, I think at the time, maybe four developers on backend functionality before launch, when we were just starting out.

Jeremy: But you're building a banking platform here, so this is pretty sophisticated. I can imagine another business case for serverless is just the sense that we don't have to hire an operations team.

Patrick: Yeah, exactly. We were well through launching it. I think it would have been a couple of months where we were live, or where we hired our first DevOps engineer. Which is incredible. Our VP took a lot of that too, I'm sure he had his hands a little more dirty than he did like early on. But it was just amazing. We were able to manage all that infrastructure, and scale was never a concern. In the early stages, maybe it shouldn't be just yet, but it was just really, really easy.

Jeremy: Now you started with four, and I think, what are you now? Somewhere around 25 developers? Somewhere in that space now?

Patrick: About 25 developers now, we're growing really fast. We doubled this year during COVID, which is just crazy to think about, and somehow have been scaling somewhat smoothly at least, in terms of just being able to output as a dev team promote. We'll probably double again this year. This is maybe where I shamelessly plug that we're hiring, and we always are, and you could visit northone.com and just check out the careers page, or just hit me up for a warm intro. It's been crazy, and that’s one of the things that serverless has helped with us too. We haven't had this scaling bottleneck, which is an operations team. We don't need to hire X operations people for a certain number of developers.

Onboarding has been easier. There was one example of during a major project, we hired a developer. He was new to serverless, but just very experienced developer, and he had a production-ready serverless service ready in a month, which was just an insane ramp-up time. I haven't seen that very often. He didn't have to talk to any of our operation staff, and we'd already used serverless long enough that we had all of our presets and boilerplates ready, and permissions locked down, so it was just super easy. It's super empowering just for him to be able to just play around with the different services. Because we hit that point where we've invested enough that every developer when they opened a branch, that branch deploys its own stage, which has all of the services, AWS infrastructure deployed.

You might have a PR open that launches an instance of Kinesis, and five SQS queues, and 10 Lambdas, and a bunch of other things, and then tear down almost immediately, and the cost isn't something we really worry about. The innovation velocity there has been really, really good. Just being able to try things out. If you're thinking about something like Kinesis, where it's like a Kafka, that's my understanding, and if you think about the organizational buy-in you need for something like Kafka, because you need to support it, come up with opinions, and all this other stuff, you'll spend weeks trying it out, but for one of our developers, it's like this seems great.

We're streaming events, we want this to be real-time. Let's just try it out. This was for our analytics use case, and it's live in production now. It seems to be doing the thing, and we’re testing out that use case, and there isn't that roadblock. We could always switch off to a different design if you want. The experimentation piece there has been awesome. We’ve changed, during major projects we've changed the way we've thought about our resources a few times, and in the end it works out, and often it is about resiliency. It's just jamming queues into places we didn't think about in the first place, but that's been awesome.

Jeremy: I'm curious with that, though, with 25 developers ... Kinesis for the most part works pretty well, but you do have to watch those iterator ages, and make sure that they're not backing up, or that you're losing events. If they get flooded or whatever, and also sticking queues everywhere, sounds like a really good idea, and I'm a big fan of that, but it also, that means there's a lot of queues you have to manage, and watch, and set alarms and all that kind of stuff. Then you also talked about a pretty, what sounds like a pretty great CI/CD process to spin up new branches and things like that. There's a lot of dev ops-y ops work that is still there. How are you handling that now? Do you have dedicated ops people, or do you just have your developers looking after that piece of it?

Patrick: I would say we have a very spirited group of developers who are inspired. We do a lot of our code-sharing via internal packages. A few of our developers just figured out some of the patterns that we need, whether it's CI, or how we structure our event stores, or how we do our queue subscriptions. We manage these internal packages. This won't scale well, by the way. This is just us being inspired and trying to reduce some of this burden. It is interesting, I’ve listened to this podcast and a few others, and this idea of infrastructure as code being part of every developer's toolbox, it’s starting to really resonate with our team.

In our migration, or our swift shift to, I'd say, doing serverless properly, we’ve learned to really think in it: to think in terms of infrastructure when creating solutions. I'm not saying we're doing serverless the right way now, but we certainly did it the wrong way in the past, where we would spin up a bunch of API Gateways that would talk to each other, with a lot of REST calls going around in a spider web of communication. Also, what I'll call monster Lambdas, which have a whole procedure list that they need to get through, and a lot of points of failure. When we think about the way we do Lambda now, we try to keep one Lambda doing one thing, and then there are pieces of infrastructure stitching that together: EventBridge between domain boundaries, SQS for commands where we can, instead of using API Gateway. I think that transitions pretty well into our big break. I'm talking about this as our migration to serverless. I want to talk more about that.

Jeremy: Before we jump into that, I just want to ask this question, because again, some people call them fat Lambdas, I call them Lambdaliths. I think there's Lambdaliths, then fat Lambdas, then your single-purpose functions. It's interesting, again, moving towards that direction, and I think it's super important, just admitting that, like, we were definitely doing this wrong. Because I think so many companies find that adopting serverless is very much an evolution, and it's a learning thing where the teams have to figure out what works for them, and in some cases discover best practices on their own. The fact that you've gone through that process, I think is great, so definitely kudos to you for that.

Before we get into that adoption and the migration or the evolution process that you went through to get to where you are now, one other business or technical case for serverless, especially with something as complex as banking, I think I still don't understand why I can't transfer personal money or money from my personal TD Bank account to my wife's local checking account, why that's so hard to do. But, it seems like there's a lot of steps. Steps that have to work. You can't get halfway through five steps in some transaction, and then be like, oops we can't go any further. You get to roll that back and things like that. I would imagine orchestration is a huge piece of this as well.

Patrick: Yeah, 100%. Banking lends itself really well to these workflows, I'll call them. If you're thinking about even just the start of any banking process, there's this whole application process where you put in all your personal information, you send off a request to your bank, and then now there's this whole waterfall of things that needs to happen. All kinds of checks and making sure people aren't on any fraud lists, or money laundering lists, or even just getting a second dive from our compliance department. There's a lot of steps there, and even just keeping our own systems in sync, with our auth provider and other places. We definitely lean on using Step Functions a lot. I think they work really, really well for our use case. Just the visual, being able to see this is where a customer is in their onboarding journey, is very, very powerful.

Being able to restart at any point of their journey, or even just giving our compliance team a view into that process, or even adding a pause portion. I think that's one of the biggest wins there, is that we could process somebody through any one of our pipelines, and we may need a human eye there, at least for this point in time. That's one of the interesting things about the banking industry: there are still manual processes behind the scenes, and there are, I find this term funny, but there are wire rooms in banks where there are people reviewing things and all that. There are a lot of workflows that just lend themselves well to Step Functions. That pausing capability and being able to return later with a response, that allows you to build other internal applications for your compliance teams and other teams, or something behind the scenes just calls back and says, "Okay, resume this waterfall."
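The pause-and-resume idea described here can be sketched roughly as follows. AWS Step Functions does offer a callback pattern built around task tokens, but this sketch does not call AWS at all: `ManualReviewGate`, the `task-N` token format, and the step labels are hypothetical stand-ins, kept in memory only to show the shape of "pause for a human, resume later with a response."

```typescript
type ReviewDecision = "approved" | "rejected";

// What we remember about a workflow that is waiting for a human.
interface PausedStep {
  applicationId: string;
  stepLabel: string;
}

// In-memory stand-in for the orchestrator's "wait for callback" state.
class ManualReviewGate {
  private readonly paused = new Map<string, PausedStep>();
  private nextId = 0;

  // Called by the workflow when it reaches a step that needs a human eye.
  pause(step: PausedStep): string {
    this.nextId += 1;
    const taskToken = `task-${this.nextId}`; // hypothetical token format
    this.paused.set(taskToken, step);
    return taskToken;
  }

  // Called later, for example by an internal compliance tool, to resume.
  resume(taskToken: string, decision: ReviewDecision): string {
    const step = this.paused.get(taskToken);
    if (step === undefined) {
      throw new Error(`no paused step for ${taskToken}`);
    }
    this.paused.delete(taskToken);
    return decision === "approved"
      ? `${step.applicationId}: continue after ${step.stepLabel}`
      : `${step.applicationId}: stop at ${step.stepLabel}`;
  }

  // How many workflows are currently waiting on a reviewer.
  pendingCount(): number {
    return this.paused.size;
  }
}
```

A compliance tool would only need the token and a decision to call `resume`, which is what makes it possible to build separate internal applications around the paused workflow.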

I think that was the visualization, especially in an events world when you're talking about like sagas, I guess, we're talking about distributed transactions here in a way, where there's a lot of things happening, and a common pattern now is the saga pattern. You probably don't want to be doing two-phase commits and all this other stuff, but when we're looking at sagas, it's the orchestration you could do or the choreography. Choreography gets very messy because there's a lot of simplistic behavior. I'm a service and I know what I need to do when these events come through, and I know which compensating events I need to dump, and all this other stuff. But now there's a very limited view.

If a developer is trying to gain context in a certain domain, and understand the chain of events, although you are decoupled, there's still this extra coupling now, having to understand what's going on in your system, and being able to share it with external stakeholders. Using Step Functions, that's, I guess, the serverless way of doing orchestration. Just being able to share that view. We had this process where we needed to move a lot of accounts, or a lot of user data, to a different system. We were able to just use an orchestrator there as well, just to keep an eye on everything that's going on.
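To make the orchestration flavor of the saga pattern concrete, here is a minimal sketch in which one coordinator runs steps in order and, on failure, runs the compensating actions for the steps that already completed, in reverse. Everything here (`SagaStep`, `runSaga`, the step labels) is illustrative and not NorthOne's code; the steps are synchronous for brevity, whereas real steps would be asynchronous calls to services.

```typescript
// Hypothetical step shape: every step knows how to undo itself.
interface SagaStep<Ctx> {
  label: string;
  run: (ctx: Ctx) => void;
  compensate: (ctx: Ctx) => void;
}

// What the orchestrator can report to anyone watching the workflow.
interface SagaOutcome {
  completed: string[];
  compensated: string[];
  failedAt?: string;
}

// Runs steps in order. If one throws, the steps that already completed
// are compensated in reverse order and the failing step is reported.
function runSaga<Ctx>(steps: SagaStep<Ctx>[], ctx: Ctx): SagaOutcome {
  const completed: SagaStep<Ctx>[] = [];
  for (const step of steps) {
    try {
      step.run(ctx);
      completed.push(step);
    } catch {
      const compensated: string[] = [];
      for (const prev of [...completed].reverse()) {
        prev.compensate(ctx);
        compensated.push(prev.label);
      }
      return {
        completed: completed.map((s) => s.label),
        compensated,
        failedAt: step.label,
      };
    }
  }
  return { completed: completed.map((s) => s.label), compensated: [] };
}
```

Because a single function owns the sequence, the outcome object doubles as the shareable view of where a workflow stopped, which is the visibility that choreography makes harder to get.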

We might be paused in migrating, but let's say we’re moving over contacts, a transaction list, and one other thing, you could visualize which one of those are in the red, and which one we need to come in and fix, and also share that progress with external stakeholders. Also, it makes for fun launch parties I'd say. It's kind of funny because when developers do their job, you press a button, and everything launches, and there's not really anything to share or show.

Jeremy: There's no balloons or anything like that.

Patrick: Yeah. But it was kind of cool to look at these like, the customer is going through this branch of the logic. I know it's all green. Then I think one of the coolest things was just the retry ability as well. When somebody does fail, or when one of these workflows fails, you could see exactly which step, you can see the logs, and all that. I think one of the challenges we ran into there though, was because we are working in the banking space, we're dealing with sensitive data. Something I almost wish AWS solved out of the box, would be being able to obfuscate some of that data. Maybe you can't, I'm not sure, but we had to think of patterns for tokenization for instance.

Stripe does this a lot where certain parts of their platform, you just get it, you put in personal information, you get back a token, and you use that reference everywhere. We do tokenization, as well as we limit the amount of details flowing through steps in our orchestrators. We'll use an event store with identifiers flowing through, and we'll be doing reads back to that event store in between steps, to do what we need to do. You lose some of that debug-ability, you can't see exactly what information is flowing through, but we need to keep user data safe.
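A rough sketch of the tokenization idea: sensitive values are swapped for opaque references, only the references flow between workflow steps, and a step that truly needs the value reads it back just in time. `TokenVault`, `StepInput`, `fraudCheckStep`, and the counter-based token format are all assumptions made for illustration; a real vault would be an encrypted, access-controlled store issuing unguessable tokens.

```typescript
// In-memory stand-in for a secure token vault (assumption, see above).
class TokenVault {
  private readonly byToken = new Map<string, string>();
  private counter = 0;

  // Swap a sensitive value for an opaque reference.
  tokenize(sensitive: string): string {
    this.counter += 1;
    const token = `tok_${this.counter}`; // demo only: real tokens must be unguessable
    this.byToken.set(token, sensitive);
    return token;
  }

  // Resolve a token back to its value; only steps that need it call this.
  detokenize(token: string): string {
    const value = this.byToken.get(token);
    if (value === undefined) {
      throw new Error(`unknown token: ${token}`);
    }
    return value;
  }
}

// What flows between workflow steps: identifiers only, no personal data.
interface StepInput {
  applicationId: string;
  ssnToken: string;
}

// A step rehydrates the value just in time and returns only a verdict.
function fraudCheckStep(
  input: StepInput,
  vault: TokenVault,
): { applicationId: string; passed: boolean } {
  const ssn = vault.detokenize(input.ssnToken);
  // Placeholder rule standing in for a real check.
  return { applicationId: input.applicationId, passed: ssn.length === 11 };
}
```

The trade-off Patrick mentions shows up directly: anything logged from `StepInput` is safe to display, but it no longer tells you what data the step actually saw.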

Jeremy: Because it's the use case for it. I think that you mentioned a good point about orchestration versus choreography, and I'm a big fan of choreography when it makes sense. But I think one of the hardest lessons you learn when you start building distributed systems is knowing when to use choreography, and knowing when to use orchestration. Certainly in banking, orchestration is super important. Again, with those saga patterns built-in, that's the kind of thing where you can get to a point in the process and you don't even need to do automated rollbacks. You can get to a failure state, and then from there, that can be a pause, and then you can essentially kick off the unwinding of those things and do some of that.

I love that idea that the token pattern and using just rehydrating certain steps where you need to. I think that makes a ton of sense. All right. Let's move on to the adoption and the migration process, because I know this is something that really excites you and it should because it is cool. I always know, as you're building out applications and you start to add more capabilities and more functionality and start really embracing serverless as a methodology, then it can get really exciting. Let's take a step back. You had a champion in your organization that was beating the drum like, "Let's try this. This is going to make a lot of sense." You build an Apollo Lambda or a Lambda running Apollo server on it, and you are using that as a strangler pattern, routing all your stuff through now to your backend. What happens next?

Patrick: I would say when we needed to build new features, developers just gravitated towards using serverless, it was just easier. We were using TypeScript instead of Python, which we just tend to like as an organization, so it's just easier to hop into TypeScript land, but I think it was just easier to get something live. Now we had all these Lambdas popping up, and doing their job, but I think the problem that happened was we weren't using them properly. Also, there was a lot of difference between each of our serverless setups. We would learn each time and we'd be like, okay, we'll use this parser function here to simplify some of it, because it is very bare-bones if you're just pulling the Serverless Framework, and it took a little ...

Every service looked very different, I would say. Also, we never really took the time to sit back and say, “Okay, how do we think about this? How do we use what serverless gives us to enable us, instead of it just being an easy thing to spin up?" I think that's where it started. It was just easy to start. But we didn't embrace it fully. I remember having a conversation at some point with our VP being like, “Hey, how about we just put Express into one of our Lambdas, and we create this," now I know it's a Lambdalith. I was like, it was just easier. Everybody knows how to use Express, why don't we just do this? Why are we writing our own parsers for all these things? We have 10 versions of a make response helper function that was copy-pasted between repos, and we didn't really have a good pattern for sharing that code yet in private packages.

We realized that we liked serverless, but we realized we needed to do it better. We started with having a serverless chapter reading between some of our team members, and we made some moves there. We created a shared boilerplate at some point, so it reduced some of the differences you'd see between some of the repositories, but we needed a step-change difference in our thinking, when I look back, and we got lucky that opportunity came up. At this point, we probably had another six Lambda services, maybe more actually. I want to say around, we'd probably have around 15 services at this point, without a governing body around patterns.

At this time, we had this interesting opportunity where we found out we're going to be re-platforming. A big announcement we just made last month was that we moved on to a new bank partner called Bancorp. The bank partner that supports Chime, and they're like, I'll call them an engine boost. We put in a much larger, more efficient engine for our small businesses. If you just look at the capabilities they provide, they're just absolutely amazing. It's what we need to build forward. Their events API is amazing as well as just their base banking capabilities, the unit economics they can offer, the times on there, things were just better. We found out we're doing an engine swap. The people on the business side on our company trusted our technical team to do what we needed to do.

Obviously, we need to put together a case, but they trusted us to choose our technology, which was awesome. I think we just had a really good track record of delivering, so we had free rein to decide what we do. But the timeline was tight, so what we decided to do, and this was COVID times too, was a few of our developers got COVID tested, and we rented a house and we did a bubble situation. Like how in the NHL or NBA you have a bubble. We had a dev bubble.

Jeremy: The all-star team.

Patrick: The all-star team, yeah. We decided let's sit down, let's figure out what patterns are going to take us forward. How do we make the step-change at the same time as step-change in our technology stack, at the same time as we're swapping out this bank, this engine essentially for the business. In this house, we watched almost every YouTube video you can imagine on event driven and serverless, and I think leading up. I think just knowing that we were going to be doing this, I think all of us independently started prototyping, and watching videos, and reading a lot of your content, and Alex DeBrie and Yan Cui. We all had a lot of ideas already going in.

When we all got to this house, we started off with this exercise, an event storming exercise, just popular in the domain-driven design community, where we just threw down our entire business on a wall with sticky notes, and it would have been better to have every business stakeholder there, but luckily we had two people from our product team there as representatives. That's how invested we were in building this outright, that we have products sitting in the room with us to figure it out.

We slapped down our entire business on a wall, this took days, and then drew circles around it and iterated on that for a while. Then started looking at what the technology looks like. What are our domain boundaries, and what prototypes do we need to make? For a few weeks there, we were just prototyping. We built out what I'd call baby's first balance. That was the running joke: how do we get an account opened with a balance, with the transactions minimally, with some new patterns. We really embraced some of this domain-driven design thinking, as well as just event driven thinking. When we were rethinking architecture, three concepts became very important for us, not entirely new, but important. Idempotency was a big one, dealing with distributed transactions was another one of those, as well as the eventual consistency. The eventual consistency portion is kind of funny because we were already doing it a lot.
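Idempotency matters here because queue-based delivery is typically at-least-once (SQS standard queues, for example, can deliver a message more than once), so a handler has to tolerate seeing the same request twice. The sketch below is one common way to do that, deduplicating on a request identifier. `DepositCommand`, `IdempotentDepositHandler`, and the in-memory `Set` are hypothetical; a real system would keep the dedupe record in a durable store.

```typescript
// A command carries an ID that stays the same across redeliveries.
interface DepositCommand {
  commandId: string;
  accountId: string;
  amountCents: number;
}

class IdempotentDepositHandler {
  // Stand-in for a durable dedupe table (assumption for the sketch).
  private readonly processed = new Set<string>();
  private readonly balances = new Map<string, number>();

  // Applies the deposit once; a redelivered command is acknowledged
  // without changing the balance again.
  handle(cmd: DepositCommand): "applied" | "duplicate" {
    if (this.processed.has(cmd.commandId)) {
      return "duplicate";
    }
    const current = this.balances.get(cmd.accountId) ?? 0;
    this.balances.set(cmd.accountId, current + cmd.amountCents);
    this.processed.add(cmd.commandId);
    return "applied";
  }

  balanceOf(accountId: string): number {
    return this.balances.get(accountId) ?? 0;
  }
}
```

Delivering the same command twice leaves the balance unchanged after the first application, which is the property that makes retries safe.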

Our transactions wouldn't always settle very quickly. We didn't know about it, but now our whole system becomes eventually consistent typically if you now divide all of your architecture across domains, and decouple everything. We created some early prototypes, we created our own version of an event store, which is, I would just say an opinionated scheme around DynamoDB, where we keep track of revisions, payload, timestamp, all the things you'd want to be able to do event sourcing. That's another thing we decided on. Event sourcing seemed like the right approach for state, for a lot of our use cases. Banking, if you just think about a banking ledger, it is events or an accounting ledger. You're just adding up rows, add, subtract, add, subtract.
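The ledger analogy can be shown in a few lines: state is not stored, it is derived by folding over the events in revision order. The field names below follow the ones mentioned (revision, payload, timestamp), but the exact shapes, the `Credited`/`Debited` event kinds, and the `balanceOf` reducer are assumptions for illustration rather than the actual schema.

```typescript
// Hypothetical event payloads for an account ledger.
type LedgerEvent =
  | { kind: "Credited"; amountCents: number }
  | { kind: "Debited"; amountCents: number };

// One row in the event store: which stream, which revision, when, what.
interface StoredEvent {
  streamId: string;
  revision: number;
  timestamp: string;
  payload: LedgerEvent;
}

// Reducer: replay events in revision order, adding and subtracting.
function balanceOf(events: StoredEvent[]): number {
  return [...events]
    .sort((a, b) => a.revision - b.revision)
    .reduce(
      (sum, e) =>
        e.payload.kind === "Credited"
          ? sum + e.payload.amountCents
          : sum - e.payload.amountCents,
      0,
    );
}
```

For a simple sum the order does not change the result, but sorting by revision keeps the reducer correct once later events depend on earlier ones.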

We created a lot of prototypes for these things. Our event store pattern became basically just a DynamoDB table with opinions around the schema, as well as a shared code package with a simple dispatch function. One dispatch function that really looks at enforcing optimistic concurrency, and one that's a little bit more relaxed. Then we also had some reducer functions built into there. That was one of the packages that we created, and another prototype around that was how do we create the actual subscriptions to this event store? We landed on SNS to SQS fan-out, and it seems like fan-out first is the serverless way of doing a lot of things. We learned that along the way, and it makes sense. It was one of those things we read from a lot of these blogs and YouTube videos, and it really made sense in production, when all the data is streaming from one place, and then now you just add subscribers all over the place. Just new queues. Fan-out first, highly recommend. We just landed on there by following best practices.
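The difference between a strict and a relaxed dispatch can be sketched with an in-memory store. In the strict variant the writer states the revision it last saw, and a stale writer is rejected; in DynamoDB this kind of check would typically be expressed as a conditional write. `InMemoryEventStore`, `dispatchStrict`, and `dispatch` are hypothetical names, not the internal package's API.

```typescript
interface EventRecord<P> {
  streamId: string;
  revision: number;
  timestamp: string;
  payload: P;
}

class InMemoryEventStore<P> {
  private readonly streams = new Map<string, EventRecord<P>[]>();

  read(streamId: string): EventRecord<P>[] {
    return [...(this.streams.get(streamId) ?? [])];
  }

  // Strict: the caller passes the revision it last read (0 for a new
  // stream). If someone else appended in the meantime, the write fails.
  dispatchStrict(streamId: string, expectedRevision: number, payload: P): EventRecord<P> {
    const currentRevision = this.read(streamId).length;
    if (currentRevision !== expectedRevision) {
      throw new Error(
        `revision conflict on ${streamId}: expected ${expectedRevision}, found ${currentRevision}`,
      );
    }
    return this.append(streamId, payload);
  }

  // Relaxed: append at whatever the next revision happens to be.
  dispatch(streamId: string, payload: P): EventRecord<P> {
    return this.append(streamId, payload);
  }

  private append(streamId: string, payload: P): EventRecord<P> {
    const existing = this.streams.get(streamId) ?? [];
    const record: EventRecord<P> = {
      streamId,
      revision: existing.length + 1,
      timestamp: new Date().toISOString(),
      payload,
    };
    this.streams.set(streamId, [...existing, record]);
    return record;
  }
}
```

The strict path suits writes that depend on current state, such as a debit that must not overdraw; the relaxed path suits events where ordering relative to other writers does not matter.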

Jeremy: Great. You mentioned a bunch of different things in there, which is awesome, but so you get together in this house, you come up with all the events, you do this event storming session, which is always a great exercise. You get a pretty good visualization of how the business is going to run from an event standpoint. Then you start building out this event driven architecture, and you mentioned some packages that you built, we talked about Step Functions and the orchestration piece of this. Just give me a quick overview of the actual system itself. You said it's backed by DynamoDB, but then you have a bunch of packages that run in between there, and then there's a whole bunch of queues, and then you're using some custom packages. I think I already said that but you're using ... are you using EventBridge in there? What's some of the architecture behind all that?

Patrick: Really, really good question. Once we created these domain boundaries, we needed to figure out how do we communicate between domains and within domains. We landed on really differentiating milestone events and domain events. I guess milestone events in other terms might be called integration events, but this idea that these are key business milestones. An account was opened, an application was approved or rejected, things that every domain may need to know about. Then within our domains, or domain boundaries, we had these domain events, which might reduce to a milestone event, and we can maintain those contracts in the future and change those up. We needed to think about how do we message all these things across? How do we communicate? We landed on EventBridge for our milestone events. We have one event bus that we use to talk between domain boundaries, basically.
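One way to picture the split: fine-grained domain events stay inside a domain, and a small function reduces them to the coarse milestone event that other domains subscribe to. Only "application approved or rejected" comes from the conversation; the domain event names, the approval rule, and `toMilestone` itself are invented for the sketch.

```typescript
// Internal to the onboarding domain (hypothetical event names).
type OnboardingDomainEvent =
  | { kind: "IdentityVerified"; applicationId: string }
  | { kind: "FraudCheckPassed"; applicationId: string }
  | { kind: "ComplianceReviewFailed"; applicationId: string };

// Published across domain boundaries; this is the stable contract.
type MilestoneEvent =
  | { kind: "ApplicationApproved"; applicationId: string }
  | { kind: "ApplicationRejected"; applicationId: string };

// Reduce a domain's internal events to at most one milestone.
// Returns undefined while the application is still in progress.
function toMilestone(
  applicationId: string,
  events: OnboardingDomainEvent[],
): MilestoneEvent | undefined {
  const mine = events.filter((e) => e.applicationId === applicationId);
  if (mine.some((e) => e.kind === "ComplianceReviewFailed")) {
    return { kind: "ApplicationRejected", applicationId };
  }
  const verified = mine.some((e) => e.kind === "IdentityVerified");
  const cleared = mine.some((e) => e.kind === "FraudCheckPassed");
  return verified && cleared
    ? { kind: "ApplicationApproved", applicationId }
    : undefined;
}
```

Because other domains only ever see `MilestoneEvent`, the internal event names can change without breaking any subscriber.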

EventBridge there, and then each of our services now subscribes to that EventBridge bus, and maintains its own event store. That's backed by DynamoDB. Each of our services has its own data store. It's usually an event stream or a projection database, but it's almost all Dynamo, which is interesting because our old platform used Postgres, and we did have relational data. It was interesting. I was really scared at first, how are we going to maintain relations and things? It became a non-issue. I don't even know why now that I think about it. Just like every service maintains its nice projection through events, and builds its own view of the world, which brings its own problems. We have DynamoDB in there, and then SNS to SQS fan-out. Then when we're talking about packages ...

Jeremy: That's off of Streams?

Patrick: Exactly, yeah. We're Dynamo streams to SNS, to SQS. Then we use shared code packages to make those subscriptions very easy. If you're looking at doing that SNS to SQS fan-out, or just creating SQS queues, there is a lot of CloudFormation boilerplate that we were creating, and we needed to move really quick on this project. We got pretty opinionated quick, and we created our own subscription function that just generates all this CloudFormation with naming conventions, which was nice. I think the opinions were good because early on we weren't opinionated enough, I would say. When you look in your AWS dashboard, the read for these aren't prefixed correctly, and there's all this garbage. You're able to have consistent naming throughout, make it really easy to subscribe to an event.
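A helper of the kind described might look something like the sketch below: given a service, stage, and event name, it returns the CloudFormation resources for one SNS to SQS subscription with consistently derived names. The resource types and properties are standard CloudFormation, but the function name `subscribe`, the option names, and the `stage-service-event` naming convention are assumptions, not the actual internal package.

```typescript
interface SubscriptionOptions {
  service: string;   // e.g. "ledger"
  stage: string;     // e.g. "dev", or a per-branch stage name
  eventName: string; // e.g. "account-opened"
  topicArn: string;  // ARN of the SNS topic to subscribe to
}

type CfnResources = Record<string, { Type: string; Properties: Record<string, unknown> }>;

// "account-opened" -> "AccountOpened", for CloudFormation logical IDs.
function pascal(text: string): string {
  return text
    .split(/[^a-zA-Z0-9]+/)
    .filter((part) => part.length > 0)
    .map((part) => part.charAt(0).toUpperCase() + part.slice(1))
    .join("");
}

// Generates a queue, the policy letting SNS write to it, and the
// subscription itself, all named by one convention.
function subscribe(opts: SubscriptionOptions): CfnResources {
  const id = pascal(opts.service) + pascal(opts.eventName);
  const queueArn = { "Fn::GetAtt": [`${id}Queue`, "Arn"] };
  return {
    [`${id}Queue`]: {
      Type: "AWS::SQS::Queue",
      Properties: { QueueName: `${opts.stage}-${opts.service}-${opts.eventName}` },
    },
    [`${id}QueuePolicy`]: {
      Type: "AWS::SQS::QueuePolicy",
      Properties: {
        Queues: [{ Ref: `${id}Queue` }],
        PolicyDocument: {
          Version: "2012-10-17",
          Statement: [
            {
              Effect: "Allow",
              Principal: { Service: "sns.amazonaws.com" },
              Action: "sqs:SendMessage",
              Resource: queueArn,
              Condition: { ArnEquals: { "aws:SourceArn": opts.topicArn } },
            },
          ],
        },
      },
    },
    [`${id}Subscription`]: {
      Type: "AWS::SNS::Subscription",
      Properties: {
        TopicArn: opts.topicArn,
        Protocol: "sqs",
        Endpoint: queueArn,
        RawMessageDelivery: true,
      },
    },
  };
}
```

The returned object could be merged into a template's `Resources` section; including the stage in the queue name is what lets every branch deploy its own copy without collisions.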

We would publish packages to help with certain things. Our events store package was one of those. We also created a Lambda handlers package, which leverages, there's like a Lambda middlewares compose package out there, which is quite nice, and we basically, all the common functionality we're doing a lot of, like parsing a body from S3, or SQS or API Gateway. That's just the middleware that we now publish. Validation in and out. We highly recommend the library Zod, we really embrace the TypeScript first object validation. Really, really cool package. We created all these middlewares now. Then subscription packages. We have a lot of shared code in this internal NPM repository that we install across.
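The middleware idea can be sketched in a few lines of TypeScript. This is not the compose package or the internal handlers package mentioned here, and it uses a hand-written type guard as a stand-in for a Zod schema so the example has no dependencies. The names `parseSqsBody`, `validate`, and `AccountOpened` are hypothetical, and the handlers are synchronous only to keep the sketch short (real Lambda handlers are usually async).

```typescript
type Handler<I, O> = (input: I) => O;

// The slice of an SQS event this sketch cares about.
interface SqsEvent {
  Records: { messageId: string; body: string }[];
}

// Middleware: parse each SQS record body as JSON before calling the next handler.
function parseSqsBody<O>(next: Handler<unknown[], O>): Handler<SqsEvent, O> {
  return (event) => next(event.Records.map((r) => JSON.parse(r.body) as unknown));
}

// Middleware: validate every parsed item with a type guard
// (a stand-in for a schema library) and fail fast on bad input.
function validate<T, O>(isValid: (x: unknown) => x is T, next: Handler<T[], O>): Handler<unknown[], O> {
  return (items) => {
    const valid: T[] = [];
    for (const item of items) {
      if (!isValid(item)) throw new Error(`Invalid payload: ${JSON.stringify(item)}`);
      valid.push(item);
    }
    return next(valid);
  };
}

// Hypothetical payload type and its guard.
interface AccountOpened {
  accountId: string;
}
const isAccountOpened = (x: unknown): x is AccountOpened =>
  typeof x === "object" && x !== null && typeof (x as { accountId?: unknown }).accountId === "string";

// The business logic at the end of the chain only sees typed, validated payloads.
const handler = parseSqsBody(validate(isAccountOpened, (events) => events.map((e) => e.accountId)));
```

The design point is that parsing and validation live in the shared package, so each service's own code starts from a typed payload rather than a raw event.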

I think one challenge we had there was, eventually we abstracted away too much from the CloudFormation, and it's hard for new developers to ... It's easy for them to create events subscriptions, it's hard for them to evolve our serverless thinking because they're so far removed from it. I still think it was the right call in the end. I think this is the next step of the journey, is figuring out how do we share code effectively while not hiding away too much of serverless, especially because it's changing so fast.

Jeremy: It's also interesting though that you take that approach to hide some of that complexity, and bake in some of that boilerplate that someone mostly shouldn't have to write themselves anyways. Like you said, copying and pasting between services is not the best way to do it. I tried the whole shared packages thing one time, and it kind of worked. It's just like when you make a small change to that package and you have 14 services, that then you have to update to get the newest version. Sometimes that's a little frustrating. Lambda layers haven't been a huge help with some of that stuff either. But anyways, it's interesting, because again you've mentioned this a number of times about using queues.

You did mention resiliency in there, but I want to touch on that point a little bit because that's one of those things too, where I would assume in a banking platform, you do not want to lose events. You don't want to lose things, and so if something breaks, or something gets throttled or whatever, having to go and retry those events, having the alerts in place to know that a queue is backed up or whatever. Then just, I'm thinking ordering issues and things like that. What kinds of issues did you face, and tell me a little bit more about what you've done for reliability?

Patrick: Totally. Queues are definitely ... like SQS is a workhorse for our company right now. We use a lot of it. Dropping messages is one of the scariest things, so you're dead-on there. When we were moving to event driven, that was what scared me the most. What if we drop an event? A good example of that is if you're using EventBridge and you're subscribing Lambdas to it, I was under the impression early on that EventBridge retries forever. But I'm pretty sure it'll retry until it invokes twice. I think that's what we landed on.

Jeremy: Interesting.

Patrick: I think so, and don't quote me on this. That was an example of where a dropped message could be a problem. We put a queue in front of there, an SQS queue as the subscription there. That way, if there's any failure to deliver there, it's just going to retry all the time for a number of days. At that point we got to think about DLQs, and that's something we're still thinking about. But yeah, I think the reason we've been using queues everywhere is that now queues are in charge of all your retry abilities. Now that we've decomposed these Lambdas from one Lambda-lith into five Lambdas with queues in between, if anything fails in there, it just pops back into the queue, and it'll retry indefinitely. You can drop messages after a few days, and that's something we learned luckily in the prototyping stage, where there are a few places where we use dead letter queues. But one of the issues there as well was ordering. Ordering didn't play too well with ...

Jeremy: Not with DLQs. No, it does not, no.

Patrick: I think that's one lesson I'd want to share, is that only use ordering when you absolutely need it. We found ways to design some of our architecture where we didn't need ordering. There's places we were using FIFO SQS, which was something that just launched when we were building this thing. When we were thinking about messaging, we're like, "Oh, well we can't use SQS because they don't respect ordering, or it doesn't respect ordering." Then bam, the next day we see this blog article. We got really hyped on that and used FIFO everywhere, and then realized it's unnecessary in most use cases. So when we were going live, we actually changed those FIFO queues into just regular SQS queues in as many places as we can. Then so, in that use case, you could really easily attach a dead letter queue and you don't have to worry about anything, but with FIFO things get really, really gnarly.

Ordering is an interesting one. Another place we got burned I think on dead-letter queues, or a tough thing to do with dead letter queues, is when you're using state machines. We needed to limit the concurrency of our state machines, which is another wishlist item in AWS. I wish there was just at the top of the file, a limit concurrent executions of your state machine. Maybe it exists. Maybe we just didn't learn to use it properly, but we needed to. There's a few patterns out there. I've seen the [INAUDIBLE] pattern where you can use the actual state machine flow to look back at how many concurrent executions you have, and pause. We landed on setting reserved concurrency in a number of Lambdas, and throwing errors if we've hit the max concurrency, and it'll pause that Lambda, but the problem with DLQs there was, these are all errors. They're coming back as errors.

We're like, we're fine with them. This is a throttle error. That's fine. But it's hard to distinguish that from a poison message in your queue, so when do you dump those into DLQ? If it's just a throttling thing, I don't think it matters to us. That was another challenge we had. We're still figuring out dead letter queues and alerting. I think for now we just relied on CloudWatch alarms a lot for our alerting, and there's a lot you could do. Even just in the state machines, you can get pretty granular there. You know once certain things fail, and it's announced to your Slack channel. We use that Slack integration, it's pretty easy. You just go on a Slack channel, there's an email in there, you plop it into the console in AWS, and you have your very early alerting mechanism there.
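One way to think about the throttle-versus-poison distinction is to classify the failure before deciding whether a message belongs in a DLQ. The sketch below is an illustration, not what NorthOne built: `classifyFailure`, the list of error names, and the `maxPoisonReceives` threshold are all assumptions. `receiveCount` is meant to mirror the SQS `ApproximateReceiveCount` message attribute.

```typescript
// Error names treated as throttling. Example values only; adjust for the calls in use.
const THROTTLE_ERROR_NAMES = ["TooManyRequestsException", "ThrottlingException"];

type Disposition = "retry" | "dead-letter";

function isThrottle(error: unknown): boolean {
  return error instanceof Error && THROTTLE_ERROR_NAMES.indexOf(error.name) !== -1;
}

// Decide what should happen to a message whose processing failed.
// maxPoisonReceives is an assumed threshold, not an AWS default.
function classifyFailure(error: unknown, receiveCount: number, maxPoisonReceives = 3): Disposition {
  // Throttling says nothing about the message itself, so it never counts as poison.
  if (isThrottle(error)) return "retry";
  // Any other error that keeps recurring is treated as a poison message.
  return receiveCount >= maxPoisonReceives ? "dead-letter" : "retry";
}
```

A plain SQS redrive policy counts every receive, throttled or not, so acting on a "retry" verdict for a throttled message would need something extra, such as re-sending the message or parking it elsewhere. That part is deliberately left open here.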

Jeremy: The thing with Elasticsearch ... not Elasticsearch, I'm sorry. I'm totally off-topic here. The thing with EventBridge and Lambda, these are one of those things that, again, they’re nuances, but EventBridge, as long as it can deliver to the Lambda service, then the Lambda service kicks off and queues it automatically. Then that will retry a certain number of times. I think you can control that now. But then eventually if that retries multiple times and eventually fails, then that kicks it over to the DLQ or whatever. There's all different ways that it works like that, but that's why I always liked the idea of putting a queue in between there as well, because I felt you just had a little bit more control over exactly what happens.

As long as it gets to the queue, then you know you haven't lost the message, or you hope you haven't lost a message. That's super interesting. Let's move on a little bit about the adoption issues. You mentioned a few of these things, obviously issues with concurrency and ordering, and some of that other stuff. What about some of the other challenges you had? You mentioned this idea of writing all these packages, and it pulls devs away from the CloudFormation a little bit. I do like that in that it, I think, accelerates a lot of things, but what are some of the other maybe challenges that you've been having just getting this thing up and running?

Patrick: I would say IAM is an interesting one. Because we are in the banking space, we want to be very careful about what access do you give to what machines or developers, I think machines are important too. There've been cases where ... so we do have a separate developer set up with their own permissions, in development it's really easy to spin up all your services within reason. But now when we're going into production, there's times where our CI doesn't have the permissions to delete a queue or create a queue, or certain things, and there's a lot of tweaking you have to do there, and you got to do a lot of thinking about your IAM policies as an organization, especially because now every developer's touching infrastructure.

That becomes this shared operational overhead that serverless did introduce. We're still figuring that out. Right now we’re functioning on least privilege, so it's better to just not be able to deploy than deploy something you shouldn't or read the logs that you shouldn't, and that's where we're starting. But that's something that, it will be a challenge for a little while I think. There's all kinds of interesting things out there. I think temporary IAM permissions is a really cool one. There are times we're in production and we need to view certain logs, or be able to access a certain queue, and there's tooling out there where you can, or at least so I've heard, you can give temporary permissions. You have this queue permission for 30 minutes, and it expires and it's audited, and I think there's some CloudTrail tie-in you could do there. I'm speaking about my wishlist for our next evolution here. I hope my team is listening ...

Jeremy: Your team's listening to you.

Patrick: ... will be inspired as well.

Jeremy: What about ... because this is something too that I always found to be a challenge, especially when you start having multiple services, and you've talked about these domain events, but then milestone events. You've got different services that need to communicate across services, or across domains, and realize certain things like that. Service discovery in and of itself, and which queue are we mapping to, or which service am I talking to, and which version of the service am I talking to? Things like that. How have you been dealing with that stuff?

Patrick: Not well, I would say. Very, very ad hoc. I think like right now, at least we have tight communication between the teams, so we roughly know which service we need to talk to, and we output our URLs in the CloudFormation output, so at least you could reference the URLs across services a little easier. Really, GraphQL is one of the only services that really talks to a lot of our API gateways. At least there's less of that, knowing which endpoint to hit. Most of our services will read into EventBridge, and then within services, a lot of that's abstracted away, like the queue subscription's a little easier. Service discovery is a bit of a nightmare.

Once our services grow, it'll be, I don't know. It'll be a huge challenge to understand. Even which services are using older versions of Node, for instance. I saw that AWS is now deprecating version 10 and we'll have to take a look internally, are we using version 10 anywhere, and how do we make sure that's fine, or even things like just knowing which services now have vulnerabilities in their NPM packages because we're using Node. That's another thing. I don't even know if that falls in service discovery, but it's an overhead of ...

Jeremy: It's service management too. It's a lot there. That actually brings me to this idea of observability too. You mentioned doing some CloudWatch alerts and some of that stuff, but what about using some observability tool or tracing like X-Ray, and things like that? Have you been implementing any of that, and if you have, have you had any success and or problems with it?

Patrick: I wish we had a better view of some of the observability tools. I think we were just building so quickly that we never really invested the time into trying them out. We did use X-Ray, so we rolled our own tooling internally to at least do what we know. X-Ray was one of those, but the problem with X-Ray is, we do subscribe all of our services, but X-Ray isn't implemented everywhere internally in AWS, so we lose our trail somewhere in that Dynamo stream to SNS, or SQS. It's not a full trace. Also, just digesting that huge graph of information is just very difficult. I don't use it often, I think it's a really cool graphic to show, “Hey, look, how many services are running, and it's going so fast."

It's a really cool thing to look at, but it hasn't been very useful. I think our most useful tool for debugging and observability has been just our logging. We created a JSON logger package, so we output JSON logs and we can actually filter off of different properties, and we ship those to Elasticsearch. Now you can have a view of all of the functions within a given domain at any point in time. You could really see the story. Because I think early on when we were opening up CloudWatch and you'd have like 10 tabs, and you're trying to understand this flow of information, it was very difficult.

We also implemented our own trace ID pattern, and I think we just followed a Lumigo article where we introduced some properties in each of our Lambdas at a higher level, in one of our middlewares, and we were able to trace through. It's not ideal. Observability is something that we'll probably have to work on next. It’s been tolerable for now, but I can't see this scaling that long.
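A JSON logger that carries a trace ID on every line could be sketched as follows. This is not the internal logger package or the pattern from the Lumigo article. `createLogger`, `resolveTraceId`, and the `traceId` field name are hypothetical, chosen only to show the shape of the idea.

```typescript
// Fields stamped on every log line; traceId is the hypothetical correlation property.
interface LogContext {
  service: string;
  traceId: string;
}

type Level = "info" | "error";
type Extra = Record<string, unknown>;

// Build a logger bound to one invocation's context.
// `write` is injectable so the output can be captured in tests.
function createLogger(context: LogContext, write: (line: string) => void = console.log) {
  const log = (level: Level, message: string, extra: Extra = {}): void => {
    // Context is spread after `extra` so callers cannot overwrite the trace ID.
    write(JSON.stringify({ level, message, ...extra, ...context, timestamp: new Date().toISOString() }));
  };
  return {
    info: (message: string, extra?: Extra) => log("info", message, extra),
    error: (message: string, extra?: Extra) => log("error", message, extra),
  };
}

// Reuse the caller's trace ID when one arrived with the event, otherwise start a new trace.
function resolveTraceId(incoming: { traceId?: string } | undefined, generate: () => string): string {
  return incoming && incoming.traceId ? incoming.traceId : generate();
}
```

In a middleware, the trace ID would be resolved once from the incoming event and then attached to any message published downstream, so every function in the chain logs the same ID and the lines can be filtered together in the log store.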

Jeremy: That's the other thing too, is even the shared package issue. It's like when you have an observability tool, they'll just install a layer or something, where you don't necessarily have to worry about updating your own tool. I always find if you are embracing serverless and you want to get rid of all that undifferentiated heavy lifting, observability tools, there's a lot of really good ones out there that are doing some great stuff, and they're specializing in it. It might be worth letting someone else handle that for you than trying to do it yourself internally.

Patrick: Yeah, 100%. Do you have any that you've used that are particularly good? I know you work with serverless so-

Jeremy: I played around with all of them, because I love this stuff, so it's always fun, but I mean, obviously Lumigo and Epsagon, and Thundra, and New Relic. They’re all great. They all do things slightly differently, but they all follow a similar implementation pattern so that it’s very easy to install them. We can talk more about some recommendations. I think it's just one of those things where in a modern application not having that insight is really hard. It can be really hard to debug stuff. If you look at some of the tools that AWS offers, I think they’re there, it's just, they are maybe a little harder to implement, and not quite as refined and targeted as some of the observability tools. But still, you got to get there. Again, that's why I keep saying it's an evolution, it's a process. Maybe one time you get burned, and you're like, we really needed to have observability, then that's when it becomes more of a priority when you're moving fast like you are.

Patrick: Yeah, 100%. I think there's got to be a priority earlier than later. I think I'll do some reading now that you've dropped some of these options. I have seen them floating around, but it's one of those things that when it's too late, it's too late.

Jeremy: It's never too late to add observability though, so it should. Actually, a lot of them now, again, it makes it really, really easy. So I'm not trying to pitch any particular company, but take a look at some of them, because they are really great. Just one other challenge that I also find a lot of people run into, especially with serverless because there's all these artificial account limits in place. Even the number of queues you can create, and the number of concurrent Lambda functions in a particular region, and stuff like that. Have you run into any of those account limit issues?

Patrick: Yeah. I could give you the easiest way to run into an account limit issue, and that is replay your entire EventBridge archive to every subscriber, and you will find a bottleneck somewhere. That's something ...

Jeremy: Somewhere it'll fall over? Nice.

Patrick: 100%. It's a good way to do a quick check in development to see where you might need to buffer something, but we have run into that. I think the solution there, in a lot of places, was just really playing with concurrency where we needed to, and being thoughtful about where there is reserved concurrency in places that we absolutely needed to stay functioning. I think the challenge there is that it eats into your total account concurrency, which was an interesting learning there. Definitely playing around there, and just being thoughtful about where you are replaying. A couple of things. We use replays a lot. Because we are using these milestone events between service boundaries, now when you launch a new service, you want to replay that whole history all the way through.

We've done a lot of replaying, and that was one of the really cool things about EventBridge. It just was so easy. You just set up an archive, and it'll record everything coming through, and then you just press a button in the console, and it'll replay all of them. That was really awesome. But just being very mindful of where you're replaying to. If you replay to all of your subscriptions, you'll hit Lambda concurrency limits real quick. Even just like another case, early on we needed to replay ... we have our own domain events store. We wanted to replay some of those events, and those are coming off the Dynamo stream, so we were using Dynamo to kick those to a stream, to SNS, and fan-out to all of our SQS queues. But there would only be one or two queues that actually needed to subscribe to those events, so we created an internal utility just to dump those events directly into the SQS queue we needed. I think it's just about not being wasteful with your resources, because they are cheap. Sure.

Jeremy: But if you use them, they start to cost money.

Patrick: Yeah. They start to cost some money, and they can lock you out of other functionality. If you hit your Lambda limits, now our API Gateway is tapped.

Jeremy: That's a good point.

Patrick: You could take down your whole system if you just aren't mindful about those limits, and now you could call up AWS in a panic and be like, “Hey, can you update our limits?" Luckily we haven't had to do that yet, but it's definitely something in your back pocket if you need it, if you can make the case to AWS, that maybe you do need bigger limits than the default. I think just not being wasteful, being mindful of where you're replaying. I think another interesting thing there is dealing with partners too. It's really easy to scale in the Lambda world, but not every partner could handle that volume really quickly. If you're not buffering any event coming through EventBridge to your new service that hits a partner every time, you're going to hit their API rate limit really quickly, because they're just going to just go right through it.

You might be doing thousands of API calls when you're instantiating a new service. That's one of those interesting things that we have to deal with, and particularly in our orchestrators, because they are talking to different partners, that's why we need to really make sure we could limit the concurrent executions of the state machines themselves. In a way, some of our architecture is too fast to scale.
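As a rough rule of thumb for picking such a concurrency cap, Little's law relates the number of in-flight calls to the call rate and the time each call takes. The sketch below is an assumption-laden illustration rather than NorthOne's method: `maxConcurrencyForRateLimit` is a hypothetical helper, and it treats the partner's limit as a steady requests-per-second figure with a known average call duration.

```typescript
// Little's law: in-flight calls = calls per second x seconds per call.
// To stay at or under the partner's rate limit, cap concurrency at that product.
function maxConcurrencyForRateLimit(partnerLimitPerSecond: number, avgCallMs: number): number {
  if (partnerLimitPerSecond <= 0 || avgCallMs <= 0) {
    throw new Error("inputs must be positive");
  }
  const concurrency = Math.floor((partnerLimitPerSecond * avgCallMs) / 1000);
  // Never return less than 1 so work still drains. When the product is below 1,
  // a concurrency cap alone cannot enforce the limit and extra delay is needed.
  return Math.max(1, concurrency);
}
```

For example, a partner that allows 10 requests per second with calls averaging 500 ms suggests a cap of about 5 concurrent executions. Calls that finish faster than the average push the real rate above the estimate, so some headroom is sensible.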

Jeremy: It's too good.

Patrick: You still have to consider downstream. That, and even just, if you are using relational databases or anything else in your system, now you have to worry about connection limits and ...

Jeremy: I have a whole talk I gave on that.

Patrick: ... spikes in traffic.

Jeremy: Yes, absolutely.

Patrick: Really cool.

Jeremy: I know all about it. Any final advice for companies like you that are trying to bite off a piece of the serverless apple, I guess. That's really bad. Anyways, any advice for people looking to get into this?

Patrick: Yeah, totally. I would say start small. I think we were wise to just try it out. It might not land with your development team. If you don't really buy in, it's one of those things that could just end up unnecessarily messy, so start small, see if you like it in-shop, and then reevaluate, once you hit a certain point. That, and I would say shared boilerplate packages sooner than later. I know shared code is a problem, but it is nice to have an un-opinionated starter pack, that you're at least not doing anything really crazy. Even just things like having opinions around logging. In our industry, it's really important that you're not logging sensitive details.

For us doing things like wrapping our HTTP clients to make sure we're not logging sensitive details, or having shared Lambda packages that make sure out-of-the-box you're opinionated about not doing something terribly awful. I would say those two things. Start small and a boilerplate package, and maybe the third thing is just pay attention to the code smell of a growing Lambda. If you are doing three API calls in one Lambda, chances are you could probably break that up, and think about it in a more resilient way. If any one of those pieces fail, now you could have retry ability in each one of those. Those are the three things I would say. I could probably talk forever about the rest of our journey.
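The "don't log sensitive details" opinion is the kind of thing a shared package can enforce with a small redaction step before anything is written. The following is a minimal sketch under stated assumptions: the `redact` function and the key list are hypothetical, and it only handles JSON-like values (plain objects, arrays, primitives).

```typescript
// Hypothetical list of keys that must never reach the logs (compared case-insensitively).
const SENSITIVE_KEYS = ["password", "ssn", "accountnumber", "authorization"];

// Return a deep copy of a JSON-like value with sensitive fields masked.
function redact(value: unknown): unknown {
  if (Array.isArray(value)) {
    return value.map((item) => redact(item));
  }
  if (typeof value === "object" && value !== null) {
    const source = value as Record<string, unknown>;
    const result: Record<string, unknown> = {};
    for (const key of Object.keys(source)) {
      const isSensitive = SENSITIVE_KEYS.indexOf(key.toLowerCase()) !== -1;
      result[key] = isSensitive ? "[REDACTED]" : redact(source[key]);
    }
    return result;
  }
  return value; // Primitives pass through unchanged.
}
```

An HTTP client wrapper would call something like `redact` on request and response payloads before handing them to the logger, so an individual service cannot leak a field by accident. Matching on key names is only a first line of defence, since sensitive data can also appear inside free-text values.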

Jeremy: I think that was great advice, and I love hearing about how companies are going through this process, what that process looks like, and I hope, I hope, I hope that companies listen to this and can skip a lot of these mistakes. I don't want to call them all mistakes, and I think it's just evolution. The stuff that you've done, we've all made them, we've all gone through that process, and the more we can solidify these practices and stuff like that, I think that more companies will benefit from hearing stories like these. Thank you very much for sharing that. Again, thank you so much for spending the time to do this and sharing all of this knowledge, and this journey that you've been on, and are continuing to be on. It would be great to continue to get updates from you. If people want to contact you, I know you're not on Twitter, but what's the best way to reach out to you?

Patrick: I almost wish I had a Twitter. It's the developer thing to have, so maybe in the future. Just on LinkedIn would be great. LinkedIn would be great, as well as if anybody's interested in working with our team, and just figuring out how to take serverless to the next level, just hit me up on LinkedIn or look at our careers page at northone.com, and I could give you a warm intro.

Jeremy: That's great. Just your last name is spelled S-T-R-Z-E-L-E-C. How do you say that again? Say it in Polish, because I know I said it wrong in the beginning.

Patrick: I guess for most people it would just be Strzelec, but if there are any Slavs in the audience, it's "Strzelec." Very intense four consonants last name.

Jeremy: That is a lot of consonants. Anyways again, Patrick, thanks again. This was great.

Patrick: Yeah, thank you so much, Jeremy. This has been awesome.
