Serverless Chats

Episode #8: Observability in Modern Applications with Ran Ribenzaft

About Ran Ribenzaft:

Ran is a passionate developer, with vast experience in network, infrastructure, and cyber-security. He's constantly chasing new technologies, with a current focus on Serverless. He is an open source contributor and currently the co-founder and CTO at Epsagon, a tool for monitoring serverless applications.

Transcript:

Jeremy: Hi, everyone. I'm Jeremy Daly and you're listening to Serverless Chats. This week, I'm chatting with Ran Ribenzaft. Hi, Ran. Thanks for joining me.

Ran: Thank you very much for having me, Jeremy.

Jeremy: So you are the CTO at Epsagon. So why don't you tell the listeners a little bit about yourself and also what Epsagon is doing?

Ran: So I'm Ran, the co-founder and the CTO over at Epsagon, based in Tel Aviv, Israel, a fun place to be, very warm. In my previous roles, I've been doing mostly cybersecurity stuff. So mostly getting into kernels and things that I can't tell you about; you've probably heard of some of them in the news. But let's keep it discreet. And in my recent role, I'm the CTO here at Epsagon, a startup where I'm one of the co-founders, which mainly focuses on monitoring and troubleshooting modern applications. In general, the idea is to have a single platform where you get both the monitoring capabilities, which is "is my application working properly," "is it meeting the SLA performance," and so on, and the troubleshooting part, where something bad happens and you need to scroll through the logs, correlate between them, and do all this distributed tracing. That's Epsagon in a nutshell.

Jeremy: Alright, great. So I wanted to have you on because I want to talk about observability in modern applications, and by modern applications, we mean cloud native, or serverless, or distributed, or whatever sort of buzzword we might want to use to describe it. But as our applications grow beyond the traditional monoliths, being able to observe what is happening in your applications is a huge part of what we need to do when we are building these modern, interconnected systems. So maybe you could just give me a minute or so on what's the difference between your traditional monitoring and observability systems, and where we are now and what's changed?

Ran: Definitely. So it starts with the change in our infrastructure and in the way we code. If we used to have this monolithic application running on our on-prem servers, the things that you wanted to monitor were like: what is the network throughput, what is the CPU usage, hard disks, and so on. And, you know, just making sure the application, the process itself, is alive there. But shifting to a more modern application, which in my mind is something where you don't mess with the infrastructure around it. You get most of the services out of the box, working for you as a matter of configuration, where you can just, you know, tick some boxes to say I want this feature and so on. And you just build your own business logic on top of that, where it can run; I don't care whether it's on your server, something like function as a service, or something else. So this is a modern application, and in this kind of modern application, there's a big difference in what you want to monitor. Honestly, we're doing monitoring to make sure our business works. Our application is our business. I want to make sure it works, so things like how much CPU is being consumed or network throughput, or all these kinds of metrics that just show me charts about infrastructure, are getting irrelevant over time. For example, I used to have a chart of how much CPU my database is consuming, but now, if I'm going to a managed database, a fully managed one, not a semi-managed one, I don't really care about the CPU or anything else. I just want to make sure it works, and my application can speak to it at the right timing, at the right performance, and it gets the right results. So that's the first thing.
The second thing is mostly about the nature of these kinds of applications, which we broke from being a big, single monolith into multiple microservices. You can call them microservices, services, nanoservices, but the fact that there was one giant thing that broke into tens or hundreds of resources suddenly presents a different problem. A problem where you need to understand the interconnectivity between these resources, where you need to keep track of messages that [are] going from one service to another, and once something bad happens, you want to see the root cause analysis. This is a repetitive thing that you hear over and over, this root cause analysis: the ability to jump from the error, which can be a performance issue or an exception in the code, all the way to the beginning. The beginning can be the user who clicked a button on your business website and caused this chain of events. So these are the kinds of things that you want to see, and in traditional APMs, in traditional monitoring solutions, you don't have it. And in the future, you'll find it more and more like that.

Jeremy: Yes, so you mentioned, you said logs. You said metrics. You said interconnectivity. You talked about a couple of different things, and I think it's probably important for listeners who maybe aren't 100% familiar with what observability is. There's this thing called the three pillars of observability, which are logs, metrics and traces. So maybe we can talk about that and you can sort of tell us why each one of those things is important.

Ran: Yeah, I'll start first with the metrics. So metrics are like the key component that you can ask questions about. Let's say how [many] events I got per day, how [many] purchases, how [many] events of [this] kind that promote my business [there are] per day or per whatever timeframe you want to see. These kinds of metrics are the base unit that you want to monitor. Now, when something bad happens in this metric, sometimes you need to see something bigger than just a number that will tell you, "Hey, we found out that the amount of transactions that you're seeing per day is lower than 100." So you want to see a trace. In my opinion, a trace is more like a story. What happened? Tell me the exact event where this metric was below that 100 or the KPI that I've measured. I want to see what it talked with, which resources were involved, how long each of these operations took, and why or how it is different from a good trace. Now, when you want to dive even deeper, you need to get to your logs. Logs are like the ultimate developer utility to troubleshoot problems. I mean, regardless of what you'll see in metrics or in traces, logs are the core thing that developers put in their code in order to troubleshoot and debug their applications, and often you want to see them correlated to one another. So, for example, I want to ask this question: "How many purchases did I have on my website on a specific day?" Now we'll see there is a spike, or the opposite of a spike, some drop in registrations. It's probably going to be because of a problem. So I want to see all the traces that correlate to these specific events, and I want to see all the logs that correlate to the traces that I found for this metric. So all three are connected to each other, and all this observability, which is a nice buzzword, is just a translation of being able to monitor our business in production.
For me, logs, metrics, and traces are just different ways to look at my observability.

Jeremy: So maybe we can talk about why these things are a little bit different in monitoring a distributed system versus monitoring a traditional application. You had mentioned breaking things into microservices or nanoservices, which I'm not a huge fan of that word but it's okay, um, but breaking things down into smaller parts and they're disconnected or they're you know they're distributed, right? So what sort of the, maybe what are the options that you have with these modern applications to track that kind of stuff?

Ran: So let's offer some AWS alternative for each one of them. Probably the first thing that every serverless developer thinks of for logs is CloudWatch logs, which is great because it comes out of the box and you're getting the logs shipped. Every print that you'll do, every stdout and stderr, will come out to the logs, which is perfect. Honestly, it's great up to a certain scale, but once you're hitting millions of requests, it's really hard to navigate through and try to find the log that you're looking for. So searching in the logs might be a bit of a pain, but honestly, it works out of the box, so there's no reason not to start with it. The second thing we talked about is metrics. So for metrics, there's obviously CloudWatch metrics, which can be built on top of logs, but can also be built on top of custom metrics, which also is great; you get it out of the box. For example, for Lambda or for any resource that you'll use in AWS, you'll get CloudWatch metrics already defined. So, for example, for Lambda, you'll be able to see the amount of invocations, the amount of errors, duration statistics and so on. But honestly, they are distributed across hundreds of metrics. And sometimes you want a single dashboard that will just show you all of these metrics and will tell you when something bad happens. You don't want to configure "if I cross this threshold," "if I'm getting an error in here," and "if I'm getting something there." It's an endless amount of alerts that you'll need to configure, and you want something out of the box that will work for you. And the last one, regarding traces, or distributed traces in modern applications: we've got X-Ray, which is great for tracing AWS resources. It tracks down almost any request that you'll make using the AWS SDK. However, it doesn't track anything external to AWS, and it doesn't do distributed tracing yet.
Hopefully they'll get there soon, because that's X-Ray distributed tracing, but at the moment it's still limited. So these are the options that AWS provides. There are tons of other resources that you can find outside of AWS.

Jeremy: So with X-Ray, though, in terms of being able to get alerted on slow-running processes or resources that are taking a while, is that possible with X-Ray?

Ran: So it doesn't come out of the box. Like AWS does build the best infrastructure that you can build applications on top of it. So X-Ray will collect these traces for you, and then you can build your own application that says "Scan all my traces. Scan all the operations against my database." And when I find an event in my trace that crossed my threshold, send [me] a Slack alert, or on whichever platform I'm feeling most comfortable. So it doesn't come out of the box, but it gives you the infrastructure to build great things on top of it.
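
The "build your own application on top of the traces" idea Ran describes could look roughly like the Python sketch below. It assumes the traces have already been exported into a simple list; the `Operation` shape, the function names, and the `notify` callback are hypothetical and are not X-Ray's data model or API.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Operation:
    """One recorded call inside a trace (hypothetical shape, not X-Ray's schema)."""
    trace_id: str
    resource: str       # e.g. "database" or "stripe"
    duration_ms: float


def find_slow_operations(operations: Iterable[Operation], resource: str,
                         threshold_ms: float) -> List[Operation]:
    """Return operations against `resource` that took longer than the threshold."""
    return [op for op in operations
            if op.resource == resource and op.duration_ms > threshold_ms]


def alert_on_slow_operations(operations: Iterable[Operation], resource: str,
                             threshold_ms: float,
                             notify: Callable[[str], None]) -> int:
    """Call `notify` once per slow operation and return how many were found."""
    slow = find_slow_operations(operations, resource, threshold_ms)
    for op in slow:
        notify(f"Slow {op.resource} call in trace {op.trace_id}: {op.duration_ms:.0f} ms")
    return len(slow)


# Example data; in practice these would come from wherever the traces are stored.
sample = [
    Operation("t-1", "database", 35.0),
    Operation("t-2", "database", 420.0),
    Operation("t-2", "stripe", 900.0),
]
messages: List[str] = []
# `messages.append` stands in for a real notifier such as a chat webhook.
count = alert_on_slow_operations(sample, "database", 100.0, messages.append)
```

In a real setup, `notify` would post to Slack or whichever platform you prefer; here it only appends to a list so the sketch stays self-contained.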

Jeremy: And you also mentioned that it's great for tracing AWS resources, but what about third-party calls? Or if you're trying to interact with a third-party resource?

Ran: Exactly. So such as every application, it's almost completely hybrid. I, for my years in doing cloud, I haven't found any application that is only serverless or only containers or only on-prem. You find mixtures of all kinds of applications and you can find yourself using, like Redis, which is not part of the AWS. And instead of DynamoDB, MongoDB. And instead of Kinesis, Kafka, that you own because you needed to configure something specifically for you. X-Ray wouldn't be able to trace these kind of things and definitely not distributed kinds of these things. For example, a message that goes through Kafka, it wouldn't be able to trace you from one service to another.

Jeremy: All right, so you mentioned third party apps, and obviously, or third-party products, and obviously Epsagon is one of those, but there are others besides Epsagon. But in this context, how do these third-party monitoring or observability tools, how do they extend what the cloud provider does?

Ran: The main difference between SaaS services and infrastructure solutions is that a SaaS service comes out of the box prepared for you, with all the integrations and all the needed configurations already plug-and-play, so you can get running quickly and make it work. It's just like using managed Kinesis instead of building your own Kafka. You want a managed monitoring solution, instead of building your own Elasticsearch, building things on top of X-Ray, and building another dashboard outside of CloudWatch metrics. It comes much better out of the box for you, and sometimes it brings more application-wise value. For example, Epsagon provides cost monitoring, or monitoring of things that are not necessarily provided by the cloud provider. If you own other services which are not running on AWS, Epsagon will monitor them as well.

Jeremy: So let me ask you this question, because again you think about CloudWatch logs and you think about, you know, metrics and X-Ray and you're right. There's a little bit of setup involved there. The searching on CloudWatch logs can get kind of slow, but obviously, you know, some people just transport their logs to like Logz.io, or whatever it is, or put it in an ElasticSearch cluster or something like that. But if you're just shipping the logs or trying to aggregate some metrics, you're not getting the whole picture, right, because you're not seeing, like you said, where all these things sort of tie together. So maybe you can tell me or help me answer this question. Why is this such a hard problem with distributed systems?

Ran: That's actually a great point, Jeremy. I think that the main issue here is that just shipping out logs, unstructured logs, wouldn't tell you a lot of information about what you're looking for. Because logs are things that developers wrote for themselves in their code, so once there is a problem, they'll be able to investigate it. But it's not a thing that you can ask questions on top of. I can't ask how many transactions, how many purchases I had on my website today, because it's not a metric. It's a log line. So building things that will do instrumentation and distributed tracing, and will give you out of the box this ability to do custom alerts on timeframes that you would like, and this cost monitoring that I've mentioned, doesn't come out of the box, and building it is really hard. It takes a lot of time, especially when you're doing things at scale, so you need to manage that as well. So you want a service that will be managed for you to do all of these things.
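
A small Python sketch of the difference between a free-text log line and something you can ask questions of. The `event` field and the `log_event` and `count_events` helpers are hypothetical names chosen for this example; they do not come from any particular logging library.

```python
import json
from typing import Iterable


def log_event(event: str, **fields) -> str:
    """Build one structured log line as JSON (field names are illustrative)."""
    return json.dumps({"event": event, **fields}, sort_keys=True)


def count_events(lines: Iterable[str], event: str) -> int:
    """Count log lines whose "event" field matches; skip lines that are not JSON."""
    total = 0
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # free-text line: nothing reliable to query
        if isinstance(record, dict) and record.get("event") == event:
            total += 1
    return total


lines = [
    "user 42 bought something",                       # unstructured
    log_event("purchase", user_id=42, amount=19.99),  # structured
    log_event("purchase", user_id=7, amount=5.00),
    log_event("signup", user_id=8),
]
purchases = count_events(lines, "purchase")
```

The free-text line describes a purchase too, but the query cannot count it without guessing at its wording, which is the gap Ran is pointing to.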

Jeremy: All right, so let's move on to sort of the next, I think that's a good segue. So let's move on to this idea of actually enabling your application to do some of this tracing and logging and things like that. So you mentioned unstructured logs. So obviously, if I'm in Node and I just use console.log I can write texts to the log, or I can create my own structured JSON object and send that in. So I could do that. But that sort of requires me to manage that myself. Again, you mentioned the CloudWatch metric sort of captures the data after the fact and you can kind of parse through the logs and X-Ray, there is some instrumentation that needs to be done there in order to make sure that it's tracing calls to MySQL or other services. So maybe we could just kind of go into this whole idea of instrumenting services and code in general and maybe we can start by, you know, sort of explaining what exactly we mean by instrumentation.

Ran: Perfect. That's a technical question that I like to take. Instrumentation is a technique which allows a developer to, let's call it hijack, or add something to, every request that he wants to instrument. For example, say I'm making calls using Axios to a REST API, from my own code to an external or third-party API. I want to be able to capture each and every request and response that is coming in and out of that resource, from that Axios request. Why would I like to do that? Because I want to capture vital information that I'll be able to ask questions about later on. For example, if my Axios is calling Stripe to make a purchase or to send an invoice to my customer, I would want to know how long it takes, because I don't want my customer to wait on this purchase page or wait for his invoice to get into his email. I want to make sure of how long it takes so I can measure that, put that as a metric in CloudWatch metrics or in any other service. And then I'll be able to ask, "Well, was there any operation against Stripe that took more than 100 milliseconds?" If so, it's bad, and this is only accomplished using instrumentation. I mean, the other way around is just to wrap my own code every time that I'm calling Stripe or every time that I'm calling any other service. But the amount of annotations that you'll have to add to your code is almost unlimited, so you won't get away with it without proper instrumentation in your code.

Jeremy: Yeah, and I can imagine too if you're wrapping every request or you have to do something custom for every request, that's obviously an easy thing for developers to forget or potentially get wrong. So how do you do it then? So if you're using Python or you're using Node or, you know, one of the other languages that maybe Lambda supports, what exactly do you do? How do you instrument these HTTP requests and SNS requests and things like that?

Ran: Yeah. So you mentioned console.log, so I'll give the example in Node. In Node, there's a fantastic library called Shimmer. Since it's a dynamic language and it's not compiled to anything, Shimmer allows you to just alter a function in memory. So, for example, I'm altering Axios.get() to my own function. I'll make sure to get all the details that you've sent to Axios.get(). I'll extract information from it, whichever kind of information I want it to get for me. I'll send this request to the real Axios.get(), and then I'll capture the response, get everything needed from the response, and I'll pass the response back to you. So it's almost transparent to the actual operation, but in the meantime, I have collected information both from the request and from the response. This can be done, for example, for the AWS SDK library, for any HTTP request library like Axios, Got, Fetch, HTTP and so on, or anything. I can even instrument myself into console.log. So every time that you run console.log, I'll capture the log and I'll stream it to where it was originally going.

Jeremy: But how do you do that? Is that something you have to do manually for every call to Axios?

Ran: Yes. So I'm doing it once. I'm doing generic for every call that there will be to Axios and then I'm collecting from Axios, for example, if it's an HTTP request, the URL, the params, the headers, the status code from the response, the headers of the response and every metadata or fingerprint that I would like to collect as part of it that I will be able to ask questions or filter or seeing the trace afterwards.
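
The patching technique Ran describes can be sketched in a few lines. This is not Shimmer and not Epsagon's code; it is a Python analogue of the same idea. It assumes a hypothetical `http_client` object with a `get` function and swaps that function, once, for a wrapper that records the call and then delegates to the original.

```python
import functools
import time
from types import SimpleNamespace

captured = []  # collected events; a real tracer would ship these somewhere


def instrument(owner, name):
    """Replace owner.<name> with a wrapper that records request/response data."""
    original = getattr(owner, name)

    @functools.wraps(original)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        event = {"operation": name, "args": args}
        try:
            response = original(*args, **kwargs)   # call the real function
            event["status"] = getattr(response, "status", None)
            return response                        # hand the response back untouched
        except Exception as exc:
            event["error"] = repr(exc)
            raise
        finally:
            event["duration_ms"] = (time.perf_counter() - start) * 1000
            captured.append(event)

    setattr(owner, name, wrapper)
    return original  # returned so the patch can be undone later


# Hypothetical stand-in for an HTTP client library.
def fake_get(url):
    return SimpleNamespace(status=200, body="ok")


http_client = SimpleNamespace(get=fake_get)

instrument(http_client, "get")   # done once; applies to every later call
response = http_client.get("https://example.com/api")
```

The calling code is unchanged: it still calls `http_client.get(...)` and still receives the same response, while `captured` now holds the URL, the status, and the duration of the call.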

Jeremy: And so the information that you collect, what do we do with that?

Ran: Many kinds of things. One of them is to put it as a metric. As I mentioned, a metric can be how long it took, or how many error codes I get of type 501, or anything 500 and above in the HTTP response code. It could be something for traceability. So, for example, I want to capture the headers of the request because I know that there is a user ID there, and sometimes I would like to ask questions like: tell me or show me all the traces that belong to that user ID that I've sent to Stripe. So it's good for tracing. And it's also good for logging. I mean, everything that I capture can then be used to explore the logs themselves. Like show me all the headers. Show me the body that I've sent to Stripe because I know there has been some error, and now with the body, I can actually see what kind of payload was sent to Stripe and what the response was, and understand what went wrong in this specific request.
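
The three uses Ran lists (metrics, tracing by user ID, and log-style inspection) can all be read off the same captured events. A minimal Python sketch, assuming each captured call is a plain dictionary; the field names and the `x-user-id` header are hypothetical.

```python
from typing import Any, Dict, List

Event = Dict[str, Any]

# Hypothetical captured calls to a third-party API.
events: List[Event] = [
    {"target": "stripe", "status": 200, "duration_ms": 80,
     "headers": {"x-user-id": "u1"}},
    {"target": "stripe", "status": 502, "duration_ms": 310,
     "headers": {"x-user-id": "u2"}},
    {"target": "stripe", "status": 200, "duration_ms": 140,
     "headers": {"x-user-id": "u2"}},
]


def count_server_errors(items: List[Event]) -> int:
    """Metric: how many responses had an HTTP status of 500 or above."""
    return sum(1 for e in items if e["status"] >= 500)


def events_for_user(items: List[Event], user_id: str) -> List[Event]:
    """Tracing: every captured call that carried the given user ID header."""
    return [e for e in items if e["headers"].get("x-user-id") == user_id]


def slower_than(items: List[Event], threshold_ms: float) -> List[Event]:
    """Metric: calls that took longer than the threshold."""
    return [e for e in items if e["duration_ms"] > threshold_ms]


errors = count_server_errors(events)
user_calls = events_for_user(events, "u2")
slow_calls = slower_than(events, 100)
```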

Jeremy: So you take all that information and you write that into CloudWatch logs essentially, right? You're not making... I'm just thinking, obviously CloudWatch logs are asynchronous, so it doesn't slow down your Lambda function at all. Whereas if you were making synchronous calls and you had to write back somewhere, that could slow down the execution time. So you're just doing the asynchronous stuff.

Ran: Yeah.

Jeremy: Okay, that makes a ton of sense. So what about information? So you mentioned capturing like a Stripe API call. What if that contains a credit card number? Like we don't want to write that to a log, right?

Ran: Exactly. So when you're taking care of the instrumentation, you also need to take care of data scrubbing, or omitting sensitive information. Well, it's defined by the user. For example, for dev environments, I do want to capture that, because I want to be able to troubleshoot my dev or staging environment faster. But for production, for example, I don't want to capture any sensitive information: any passwords, any emails, anything that is, you know, PII or PHI, like information about the health or the identity of my user. So instrumentation needs to be aware of the data it's collecting, or allow the user to omit every [piece of] sensitive data, so I won't capture any headers. I won't capture any payloads. I just want to capture the metadata. I want to know that I had an operation to Stripe, that it took this amount of time, that the response code was this, like the status code of the response, and so on. So it's more about metadata. So it depends really on the scenario, but instrumentation needs to be aware that it can collect sensitive data and to give the user the ability to omit that data.
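
One way to picture the scrubbing step is the Python sketch below. It is an illustration only: the key list, the `[REDACTED]` marker, and the `metadata_only` switch are assumptions for the example, not Epsagon's configuration. With `metadata_only=True` the event keeps just target, status, and duration, as Ran describes for production; otherwise the payload is kept with the listed keys masked.

```python
from typing import Any, Dict, FrozenSet

REDACTED = "[REDACTED]"
# Assumed list of sensitive field names; a real setup would let the user define it.
DEFAULT_SENSITIVE_KEYS: FrozenSet[str] = frozenset(
    {"password", "card_number", "email", "authorization"}
)


def scrub(value: Any, sensitive_keys: FrozenSet[str] = DEFAULT_SENSITIVE_KEYS) -> Any:
    """Return a copy of `value` with sensitive fields masked, recursing into dicts and lists."""
    if isinstance(value, dict):
        return {
            key: REDACTED if str(key).lower() in sensitive_keys else scrub(item, sensitive_keys)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [scrub(item, sensitive_keys) for item in value]
    return value


def build_event(target: str, status: int, duration_ms: float,
                payload: Any, metadata_only: bool) -> Dict[str, Any]:
    """Build the record that leaves the runtime; payload is dropped when metadata_only."""
    event: Dict[str, Any] = {"target": target, "status": status, "duration_ms": duration_ms}
    if not metadata_only:
        event["payload"] = scrub(payload)
    return event


payload = {
    "amount": 1000,
    "card_number": "4242424242424242",
    "customer": {"email": "a@example.com", "name": "Ada"},
}
production_event = build_event("stripe", 200, 85.0, payload, metadata_only=True)
staging_event = build_event("stripe", 200, 85.0, payload, metadata_only=False)
```

The scrubbing works on a copy, so the payload that is actually sent to the third party is never modified.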

Jeremy: And I would think that if you were building, I mean, for some of these things too where you're maybe building an interaction into Stripe, you would want to sort of build your own module in between that. So when I wanted to charge a credit card or I want to send an invoice or I want to do some of these things I would write a module that kind of handled that for me like a data layer, right? Then I would wrap that so that my developers, when including the Stripe component in their system, or in their code or their scripts or their Lambda functions, wouldn't need to do this instrumentation again.

Ran: Exactly. So what we're doing in Epsagon is instrumenting each and every library in an essence that will give the developer the ability to omit any sensitive data so it won't be collected and won't be sent outside of the runtime.

Jeremy: Alright, so that's really cool stuff. But what about auto-instrumentation? So Lambda layers are a very cool feature that I think some people know about that allow you to include or run code before every Lambda function. And I'm pretty sure that's how Epsagon does some of the auto-instrumentation, but so what does that do? What does it mean to auto-instrument something like that?

Ran: So the layers, it's a pretty cool technique that applies currently only to Lambda. I wish we could do this in containers or in EC2, for example, so that every spawned EC2 would get some of this structure, some of this data. What we're doing behind the scenes is adding our layer, which includes Epsagon, already prepared for every invocation that the Lambda is running, and we're hooking ourselves into the runtime and changing the handler to us. So it means that the request that invokes the Lambda will first come to Epsagon and we will pass it on to the original code. So the auto-instrumentation gives us the ability to let developers or ops guys just mark a function in the Epsagon page and say, "I want to instrument it," instead of adding even the minimal amount of code, like two or three lines. Just say, "I wanna monitor that. I want to see traces out of this function because I had these metrics and it's not enough. I want to see traces, distributed traces," and instrumentation comes as well.
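
The "changing the handler to us" step can be pictured as a wrapper around the original handler. The Python sketch below shows only that wrapping idea; how a layer actually swaps the configured handler is not covered, and `wrap_handler` and `report` are hypothetical names rather than Epsagon's API. The `(event, context)` signature and the `function_name` attribute follow the standard Lambda Python handler convention.

```python
import time
from typing import Any, Callable, Dict

Handler = Callable[[Dict[str, Any], Any], Any]


def wrap_handler(handler: Handler, report: Callable[[Dict[str, Any]], None]) -> Handler:
    """Return a handler that runs first, delegates to the original, then reports."""
    def wrapped(event, context):
        started = time.perf_counter()
        outcome = "error"  # assume failure until the original handler returns
        try:
            result = handler(event, context)
            outcome = "ok"
            return result
        finally:
            report({
                "function": getattr(context, "function_name", "unknown"),
                "outcome": outcome,
                "duration_ms": (time.perf_counter() - started) * 1000,
            })
    return wrapped


# The developer's original code stays untouched.
def original_handler(event, context):
    return {"statusCode": 200, "body": f"hello {event['name']}"}


reports = []
handler = wrap_handler(original_handler, reports.append)
result = handler({"name": "Ran"}, None)  # no real Lambda context in this sketch
```

Because the wrapper re-raises nothing itself and reports in a `finally` block, a failing invocation still surfaces its original exception while being recorded with `outcome` set to `"error"`.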

Jeremy: And so if you're trying to instrument Axios or SNS or DynamoDB or any of the other AWS SDK components so you just include your layer and that wraps all of it. Now, does that automatically get instrumented? Or do I have to do something in my code to say, capture all the SNS calls, capture Axios, capture DynamoDB?

Ran: This auto-instrumentation that we offer comes totally automated, so you won't need to configure "I want to instrument PG" or "I want to instrument Axios or SNS requests." Everything will be instrumented for you. You can manually specify "I don't want to instrument this and that." But you know, for auto-instrumentation, it's just about frictionless onboarding, having the best experience in no time. So that's what we're aiming to do.

Jeremy: So that's really cool. But if you're capturing all of this tracing data, that's a huge amount of data that you're writing to CloudWatch logs. One, that sounds expensive to capture all that information, but maybe more importantly, what do you do with all the data? Like what happens next?

Ran: That's exactly where distributed tracing comes in. So the first part of traces is to get all the information via the instrumentation. And then comes the part of correlating all this data. Obviously [then] comes the third part of presenting all this data, which is a problem in itself. But the second part is to do tracing, or distributed analysis, of all these events, all these traces altogether, so we'll be able to correlate a message that's going from one service to another, or through a message queue, or through any other third party. You want to correlate between all these kinds of events, and that's exactly distributed tracing.

Jeremy: Perfect. So alright, so you mentioned distributed tracing. We talked about distributed tracing. I think we've covered a lot of these topics, but maybe we can just kind of go down this path of why it's necessary, right? So we kind of know how it works. We talked about a little bit of correlating the events and capturing all this log data, structured log data, putting it all together. I think that makes a ton of sense and I think most people kind of get that idea. But what are we gonna be able to know? Like, you know, why is it necessary to trace all these things when you're building these traditional, I say traditional, but when you're building serverless applications?

Ran: Up until recently, like recently, like two or three years, I would say that distributed tracing is not a mandatory thing that each R&D team needs to have as part of its arsenal of tools. Today, I think it's almost like a crucial or vital thing that you need to have. The main reason is that we already know that applications are becoming more and more distributed. So, for example, once a user is buying something at your store, you want to make sure that it gets the email to him with the receipt and the invoice and so on, as soon as possible because otherwise he's hanging there, waiting for confirmation or waiting for something to get to him. And in a monolithic way, it's been pretty easy because you had something specific, a single thing that will take care of everything. But now we've got, like between 3-300 services that might take care of this operation: one that will get the API request from the user, from the Web server, the other one that will parse the user request. The third might be something regarding billing that will charge through Stripe or through another service. The fourth one could be something that is mailing users, and all of them are connected to each other, with some messages that are running from one to another. It could be like a star, or it can be like 1-to-1, all the way up until it gets to the email service. And without distributed tracing, you wouldn't be able to ask yourself this question: how long does it take for a user once you buy something until the moment he gets his confirmation. Because if it takes, let's say, for example, a ridiculous number. Let's say one minute. It's not good. I don't want my user to wait one minute in my website for confirmation. I want it to be, let's say, sub-second or let's say sub-five seconds. Other than that, it doesn't meet my SLA. And only with distributed tracing can I really measure end-to-end traces and not just a single trace every time.
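
The end-to-end question Ran raises ("how long from purchase to confirmation?") only has an answer once spans from different services share a trace ID. A minimal Python sketch with made-up spans; the `Span` shape, the service names, and the five-second SLA are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass
class Span:
    """One service's share of a request (hypothetical shape)."""
    trace_id: str
    service: str
    start_ms: int
    end_ms: int


def end_to_end_ms(spans: Iterable[Span]) -> Dict[str, int]:
    """Per trace: time from the earliest span start to the latest span end."""
    bounds: Dict[str, Tuple[int, int]] = {}
    for span in spans:
        first, last = bounds.get(span.trace_id, (span.start_ms, span.end_ms))
        bounds[span.trace_id] = (min(first, span.start_ms), max(last, span.end_ms))
    return {trace_id: last - first for trace_id, (first, last) in bounds.items()}


def sla_violations(spans: Iterable[Span], sla_ms: int) -> List[str]:
    """Trace IDs whose end-to-end duration exceeds the SLA."""
    return sorted(t for t, total in end_to_end_ms(spans).items() if total > sla_ms)


spans = [
    Span("order-1", "api", 0, 120),
    Span("order-1", "billing", 150, 900),
    Span("order-1", "email", 950, 1400),
    Span("order-2", "api", 0, 100),
    Span("order-2", "billing", 130, 700),
    Span("order-2", "email", 60_000, 61_000),  # confirmation sent a minute later
]
durations = end_to_end_ms(spans)
late = sla_violations(spans, sla_ms=5_000)
```

Looked at one service at a time, every span in `order-2` is fast; only the combined view shows that the user waited about a minute for the confirmation.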

Jeremy: Well, yeah, and I think that part of the reason why this idea of distributed tracing is necessary is just because of the way that we're building serverless applications now. And even with microservices, things were still a little bit more contained, right? So what happened within a microservice? There were a lot of things working together, but it was still sort of a mini-monolith. I always call it that and I get criticized for it, but I'm going to say it anyways. It's sort of a mini-monolith, right? It does a lot of different things. It has a lot of subroutines that interact with different parts of the system. Whereas when you start breaking things up into serverless, now you've got all these small little functions that do one thing well, and you are using this event-driven approach where, like you said, somebody places an order on the website. You aren't going to then call this subroutine, then call this subroutine, then call this subroutine. You're most likely going to parallelize that, right? You're going to send it out in SNS. You're gonna have it queued with SQS queues. You're going to use something like the event fork pattern. You're going to have messages flying all over the place. Maybe you're going to use Step Functions to process them, you know, and now there's this new callback pattern with Step Functions where maybe something has to happen and confirm before it can move on. So you just have a lot of things happening that are all disconnected, and either they're orchestrated or choreographed. But either way, knowing how all that stuff flows through the system and, more importantly, whether all of those things succeeded, is a hugely important thing in order for you to run your business.

Ran: Yeah, as you mentioned, everything becomes event-driven, and on top of event-driven, it's an asynchronous type of event-driven. So I throw a message to SNS. I don't care about the response. And I know that someone will take care of charging the user for me. And I send a message to SQS that will trigger another Lambda that will send an email through SES. I don't care. I sent the message. Someone will take care of sending the receipt to this user at this email. So it's becoming more of a problem to do distributed tracing when everything is asynchronous.

Jeremy: Yeah, and if you subscribe to that idea of using asynchronous transactions or asynchronous messaging, splitting things up and then, you know, dividing up your teams, so one team is in charge of sort of managing that Stripe API and all the billing requests and things that happen there; another team that, you know, does the inventory; another team that does the ordering components and things like that. Being able to see that whole picture, especially when you're not familiar with some of these components in the system, I mean as things start to scale, it becomes very, very confusing if you don't have distributed tracing in place.

Ran: You'll hear blames flying out from one team to another. Everyone says it's okay for me. Maybe it's the other team's responsibility.

Jeremy: The "Works for me" trademark. You know, that's the one I love the most. Alright, so I think there's just a couple more things I want to talk to you about. Maybe things like OpenTracing and OpenCensus. So maybe you can give us a little bit of background about where those fit into this idea of distributed tracing.

Ran: Yeah, so OpenTracing, I think it was the initial draft for how to do distributed tracing, more of a specification on how to collect traces and what the protocol is between services to be able to build these distributed traces. And then came OpenCensus, which was like a mix between instrumentation and distributed tracing. The two have been bundled together to become OpenTelemetry, so it's no longer OpenTracing or OpenCensus. It's called OpenTelemetry, and the main thing that it brings is how to track a message when it spans multiple processes or multiple services, even if they're asynchronous. The main thing that it does is to let you know that you need to inject an ID once you're leaving the process, and to extract that ID once you're coming from another process. Say I'm sending a message through Kafka. I'll inject into that message, "Hey, I'm trace #123." So when the second service gets it, it will extract this information from the Kafka message and will see, "Oh, I'm part of trace 123." I know that everything is clear for me. I'll continue with that trace along the path. So even if it's asynchronous, this inject-and-extract mechanism will work along the way.
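The inject-and-extract idea can be sketched in a few lines of plain Python. This is an illustration only, not the OpenTelemetry API: the `trace-id` header name, the `inject` and `extract` helpers, the two service functions, and the in-memory `queue` standing in for Kafka are all assumptions made for the example.

```python
import uuid
from collections import deque

TRACE_HEADER = "trace-id"  # hypothetical header name, chosen for this sketch


def inject(headers, trace_id):
    """Return a copy of the message headers carrying the trace ID."""
    carrier = dict(headers)
    carrier[TRACE_HEADER] = trace_id
    return carrier


def extract(headers):
    """Read the trace ID from incoming headers, or start a new trace."""
    return headers.get(TRACE_HEADER) or uuid.uuid4().hex


queue = deque()  # in-memory stand-in for a Kafka topic


def order_service(order, trace_id):
    """Producer: inject the trace ID when the message leaves the process."""
    queue.append({"headers": inject({}, trace_id), "body": order})


def billing_service():
    """Consumer: extract the trace ID when the message arrives."""
    message = queue.popleft()
    trace_id = extract(message["headers"])
    return trace_id, message["body"]


order_service({"order_id": 1}, trace_id="123")
trace_id, body = billing_service()
print(trace_id)  # prints 123
```

Because the ID travels inside the message itself, the consumer can join the same trace no matter how long the message sat in the queue. A message that arrives without the header simply starts a new trace in this sketch.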

Jeremy: And these, OpenTracing and OpenCensus, or OpenTelemetry, now they don't actually do anything with the data though, right? It's just more of the standard. It's just sort of how they're supposed to interact with one another?

Ran: Right, so there are some implementations on top of them, some for more automation, some for more data collection and data shipping to somewhere, but out of the box, neither of the two, nor this single new one, will do anything. It's more about: This is the standard. This is the way you should capture traces and transfer them from one service to another. And actually, I want to say it's not that easy to build something on top of that, because all you've got is a standard. Now you need to work out your own way to do instrumentation, and you need to capture all the events, all the information that you need. These annotations might be endless. I mean, capturing every event in your system. You need to build it on your own. Then injecting and extracting the ID through HTTP requests, through message queues, through pub/subs, through API gateways, through anything. It's almost an endless list that you need to take care of. Then just handling all this data. I mean, if you're a company that's handling billions or tens of billions of events per month, you need somewhere to store all these events. And as I mentioned before, you need to build something that will present this data and let you ask all these kinds of questions on top of it, which is a problem in itself.

Jeremy: Awesome. So last question here because I love AWS. I'm an AWS Serverless Hero, big fan of the things that they do. But CloudWatch and X-Ray and CloudWatch metrics: they are not turnkey, right? I mean, there's a lot of things that I still have to do when I'm trying to do that. And I have used Epsagon and I really like the added functionality that it gives you with the ability to look at the logs and just putting everything together, seeing all your applications together. It's really good. And this is not an advertisement for Epsagon. But I do appreciate you coming on, and maybe you can tell everybody just what it is about Epsagon, and, you know, maybe third-party libraries in general or third-party services in general. How does it go beyond what CloudWatch and X-Ray do?

Ran: Yeah, I like the question because it's the broader question, regardless of Epsagon. Having a managed solution will give you peace of mind, because you know that you handed over the monitoring problem (or let's say a different problem, but specifically what we're talking about, the monitoring and troubleshooting and distributed tracing problem) to some third party. It will take care of giving all the information to you, to your teams, to be able to make the right decisions. This applies to any third party that is doing its service right. Now, you can build everything on top of AWS. We're built on top of AWS as well, so it means that anyone can build Epsagon. However, it's not that easy. It's not that easy to build something that scales to that much information, that comes out of the box with everything you need, that gives you all the ability to trace and instrument all these kinds of events, whether they're in AWS or external to it. And that's the difference between having an infrastructure solution and more application-wide solutions that are ready for you. Just plug and play.

Jeremy: Perfect. Alright, well, thank you so much, Ran, for being on the show today and for just sharing all of your knowledge with the community. How can people get in touch with you?

Ran: First is on Twitter. My Twitter handle is @ranrib. Feel free to ping me there, my direct messages are open, and I'm looking forward to hearing some of the interesting serverless, microservices and other cloud environment stories. I really love them. Also on Epsagon.com, obviously, which is where I work. And the last one is the Epsagon blog, and my biggest fetish is benchmarking, so I really love to benchmark resources and services and figure out how they work internally. So things like how AWS Lambda is built behind the scenes, or which is the best way to send a message from one service to another, like SNS, SQS, Kinesis, a direct call or HTTP. All of these kinds of things are things that I'm writing about, so feel free to go to Epsagon.com/blog. Other teammates write some cool things there as well, but mine are the best.

Jeremy: Well, at least you're modest about it. Alright, well we will put all of that in the show notes. Thanks again, Ran.

Ran: Thank you very much, Jeremy. It's been a pleasure.
