Serverless Chats

Episode #62: The New Relic One Platform with Nica Fee

About Nica Fee

Nica Fee is a Serverless Developer Advocate for New Relic. She's worked with and written about serverless for the last two and a half years. She recently spoke at Deserted Island DevOps, which you might know as the tech conference that happened in Animal Crossing. She writes regularly for The New Stack.


Watch this video on YouTube: https://youtu.be/yM4q0NSFz0M

About New Relic

New Relic One is an observability platform built to help engineers create more perfect software. From monoliths to serverless, you can instrument everything, then analyze, troubleshoot, and optimize your entire software stack. All from one place.

Transcript

Jeremy: Hi, everyone. I'm Jeremy Daly and this is Serverless Chats. Today, I'm speaking with Nica Fee. Hey, Nica. Thanks for joining me.

Nica: Hey. Thanks so much for having me, Jeremy. Longtime fan, so it's great to be on.

Jeremy: Well, thank you. So you are a developer advocate at New Relic, and I'd love it if you could tell the listeners a bit about your background and what led you to New Relic and sort of what's new with New Relic.

Nica: Sure, yeah. That's great. I was actually at New Relic back in the day. I was at New Relic as a support engineer until about 2015 I believe, and left to go and become a full-time developer and full-time coder. And my path took me back sort of... As I was sort of coding full-time and just clearing queues and writing features and fixing bugs, I really started to miss some of the community building that I'd done previously. Especially actually when I was at New Relic back in the day, I was one of the people who was starting meetups and doing that kind of community building. And so I started trying to pursue that as a job which is how I got into dev advocacy. Dev advocacy, you get to tinker and you get to play and build stuff, and you also get to try to get other people excited about it and try to show it to people.

So I was doing that for Stackery, which is a serverless deployment tool, for two years, and had some success there and built some skills and really enjoyed it, and that's where I got kind of very into AWS and cloud engineering. So yeah. Now I'm back at New Relic, and it's such an interesting time to be at New Relic, and be looking at how we can go and talk to developers. Something that is interesting about being here is that everybody is talking about the time I was last here. Everybody's talking about, "Hey. There was a time when New Relic was something that lone engineers would install on one server and something would go down." They'd be like, "Well, I can see right here what the problem is." And then some exec was saying, "Hey. Let's go use this tool. It sounds great."

And right now, the question is, can we get back there? Can we get back to the place where it's a tool that developers love and that they're the ones saying, "Hey. We got to use this," rather than... as so many developer tools are, being something where most people know it from the CTO coming in and being like, "We're using this tool. I met this guy on the golf course. He's told me great things about it. It's got a great spec sheet. We're using it. Everybody's going to use it now," right? So the question is, can we get back there to being in that space? So that's sort of what I'm doing at New Relic because I'm trying to go talk to actual engineers about what it does and how it can help them.

Jeremy: Awesome. Well, I mean, one thing about New Relic is that they just released the New Relic One platform. I want to say the new, New Relic One platform, but it seems kind of hard to say new twice. But first of all-

Nica: We actually do that every year. We should do New Relic but then a starburst at the side. It's new this time.

Jeremy: It's even newer. Well, first of all, I want to thank New Relic because they are sponsoring this episode which is amazing, which again shows their incredible amount of support towards the community as well. So I do think that this is a great opportunity.

Nica: Can I give a quick shout out on that one, actually?

Jeremy: Absolutely.

Nica: As a dev advocate, I am actually really actively looking for stuff that is exciting in the community that we can help support. And so, obviously, you were very high on my list. I said, "Hey. We got to do this." But I don't see everyone. I don't know everything. So if you're listening to this and you have either an open source project on observability or you're doing community events or running a podcast that maybe is a little bit less famous than this guy, get in touch with me. Hit me up on Twitter. Show me your stuff. I would love to hear about it. My situation right now is I don't know enough people to support, not that I can't do that. So yeah. I want to hear from you all, if possible.

Jeremy: Oh, that's awesome. That's a great offer, and anyone listening, please take up that offer because I think it could be quite amazing. So anyway. So-

Nica: ...keep expensing GitHub sponsorships. But for the moment, that's just one that's like, "Well, just do it." And I'll just fight with AR after I get it done. And I'm sorry. Go ahead.

Jeremy: No, no. Not at all. I appreciate that. All right, so let's get back to New Relic for a second and this New Relic One platform. I do want to go through this because it is actually pretty cool. I mean, the entire thing has been completely rewritten. It's all new, right? Not all rewritten. I shouldn't say it that way, but-

Nica: Yeah. I had hopped into... So I came on five months ago now, and I got into New Relic. And I had kind of... I was excited. There was obviously tons of new stuff since I'd last been there five years, but I was a little confused. There was some stuff that looked the same as what I'd seen five years ago. There was other stuff that was like, "Oh, this is kind of a nice little interface." And there were things in this new interface which was sort of part of the site at the time. You could do stuff like every chart, you could go see what is the actual query that's building this chart. You could go and edit it. You could go and facet it, and you can make it more sophisticated, save it out. Oh, it was really neat. But that was only kind of part of the site. And it was like, "Hey. This doesn't feel 100% cohesive." And I'm like, "Maybe it's just me. Maybe I'm not trained or something."

But what's been happening and what's been released in the last few weeks is the whole site is the same very clean, cohesive experience now so that you can do stuff like if you're monitoring AWS Lambda and you're maybe monitoring some other service, maybe something that's even self-hosted, but their performance is implicitly connected, you can tie them together very easily. You can even just rewrite the query to connect them both directly. Or I was just writing something that's just trying to do your own kind of basic cost estimation that just applies its own rate of, "Hey. We know for Lambda how much does it cost per request," but maybe for your self-hosted stuff, that has a certain cost per usage and so, it times them together and gives you a nice price dashboard just kind of out of thin air. That's pretty nice, yeah. So yeah. That's the New Relic One platform which... We're working on dark mode, but right now it has a lot of quality of life improvements for developers. So we're enjoying that.
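
A minimal sketch of the kind of basic cost estimation Nica describes here: multiply each service's usage by its own rate and add the results up. The rates, the `Usage` fields, and the `estimate_cost` helper are all assumptions made up for illustration. They are not real AWS pricing and not anything from New Relic.

```python
from dataclasses import dataclass

# Illustrative, assumed rates only. Real pricing will differ.
LAMBDA_COST_PER_REQUEST = 0.0000002      # assumed dollars per Lambda request
SELF_HOSTED_COST_PER_CPU_HOUR = 0.05     # assumed dollars per CPU-hour


@dataclass
class Usage:
    lambda_requests: int          # Lambda invocations in the period
    self_hosted_cpu_hours: float  # CPU-hours used by the self-hosted service


def estimate_cost(usage: Usage) -> dict:
    """Multiply each usage figure by its assumed rate, then total them."""
    lambda_cost = usage.lambda_requests * LAMBDA_COST_PER_REQUEST
    hosted_cost = usage.self_hosted_cpu_hours * SELF_HOSTED_COST_PER_CPU_HOUR
    return {
        "lambda": lambda_cost,
        "self_hosted": hosted_cost,
        "total": lambda_cost + hosted_cost,
    }


if __name__ == "__main__":
    print(estimate_cost(Usage(lambda_requests=1_000_000, self_hosted_cpu_hours=100)))
```

In a dashboard, the usage figures would come from queries over the collected telemetry rather than from hard-coded numbers.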

Jeremy: Yeah. And I think that if people don't know, I mean, New Relic... I always remember New Relic as being APMs and monitoring and that sort of stuff. Obviously, the buzzword of observability is the newer sort of thing. So maybe we take a step back and just in case people don't know, what is observability? How do you define observability?

Nica: Yeah. This is a really good one. It's that someone saying, "Well, you don't want monitoring. You want observability." Or you say, "APM, application performance monitoring," and now you want observability. What's the difference? And I think about how very often, it's very useful for our dashboards or for whatever else to kind of look at one metric that covers a lot of stuff. For example, we kind of want to combine how fast we're loading, how many requests are we serving, and is anybody seeing errors, and you want that in one number. Observability is kind of an attempt to do that as an organizational goal. Observability asks, "Hey. How fast can someone looking at a problem or a question come up with an explanation and a next step?"

So the classic of course is the service is failing or flopping or we have a bunch of complaints from users. Everybody's reporting a problem. We know something's wrong. If we take from that time to the time that we understand what the problem is, that's our measure of observability. So everything has, right, it has a certain amount of observability, right? I would say the only thing where you maybe have an observability score... Not a real score, but your observability is very, very poor is when stuff goes wrong and it keeps going wrong and you end up just resetting the service and then it works and you don't know why, right? You might have a very poor observability situation even if your resolution was relatively quick, right, because you have a black box inside your system and you don't understand how it works.

Now, New Relic can help with part of the observability picture, and obviously monitoring is part of the observability picture, right? How well do you go in, see how your code is performing, and send that metric back to some kind of data warehouse to say, "Hey, show me how well we're performing." That's part of the picture. Other stuff like a new interface actually affects observability, right? Because if you're struggling to click through and see what's really going on, right... If you're clicking through thousands of lines of CloudWatch logs or sitting there and trying to write a regex to sort of try and maybe find a pattern in these logs, you may have all the monitoring in the world, right? You could add a log line to every single line of your Lambda code, right?

So you have all the monitoring in the world. It's all there, but your observability is very poor because the time it takes you to actually figure out the problem is quite long. And maybe when you find the answer, it's in great detail. It's very interesting, right? But the time it takes is high. So that's why when you pursue observability, you have to think about everything from how data is being collected, obviously how it's stored, and how available it is, but then also just how it's displayed in a way that makes sense and can operate quickly.

Jeremy: Right. I do remember when I worked a help desk, a support desk, very, very, very long time ago, my favorite solution to everything was just reboot your machine. That always works really, really well. Unfortunately, we can't do that in the cloud quite as easily, especially with really large applications.

Nica: I was doing a practice for one of the AWS certs, and I noticed one of the questions involves how you might automate that on an EC2 instance. Well, I could answer that, but I hope that's not what's happening. I hope you're not just saying like, "Well, how would you automate rebooting every 24 hours to keep this, who knows what, from affecting you?"

Jeremy: Right, right. Yeah. No, and I think it's interesting too, you mentioned about making customers happy because that's one of those things for me too where it's great if you can get some alarms. And I mean, you can set up CloudWatch alarms. You can even do some interesting tracing with things like X-Ray and like you said, CloudWatch Insights. There's all kinds of things that can give you data, but your ability to kind of pinpoint what's wrong and act on that data quickly, that's a big thing because even if your reliability is really high and you have a lot of sort of nines up there, I think if you don't have that resiliency built in where you can keep customers happy, that's a big thing.

Nica: Yeah. This is the Charity Majors thing. Charity Majors puts it as nines don't matter if your customers aren't happy. There's a couple ways to look at that. One is right, you may have great observability. You may have great metrics for performance, but you're not really seeing errors that are happening. Another one that I see pretty frequently, and I do get to talk about another feature I genuinely like which is awesome, which is way back in the days, one of my first development jobs, I was working for an online classroom system. And a lot of our users were home schoolers or people with just two or three kids in their little online classroom. And for them, the service always performed great. And they represented, in all their usage, they represented none of the operating cost of the company.

The operating cost of the company was brought in by these people who often had 30, 40 kids in a room, but then often hundreds of kids in the same virtual classroom. And for them, the interface performed very poorly. They were a much tinier percentage of daily page loads, so our average page loads were great, but what happened... Well, the system that we had therefore created was, once you really started to love our tool and give us a lot of money, we just gave you garbage. That was when we started to treat you very badly, and that's just... All the data that that was happening was there, but until you could facet that data by for example, customer ID, or even, God forbid, by sales size, by how much money that... You have page load, sure. How much money does each page load represent in revenue, right?
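
The point about faceting can be sketched in a few lines: an overall average can look healthy while one large customer gets a slow experience. The records, customer IDs, and helper names below are hypothetical, invented only to illustrate the idea.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical data: three small customers with many fast page loads,
# and one large customer with a few slow ones.
page_loads = (
    [("small-1", 0.4)] * 30
    + [("small-2", 0.4)] * 30
    + [("small-3", 0.4)] * 30
    + [("big-1", 7.0)] * 2
)


def overall_average(records):
    """Average load time across every page load, ignoring who loaded it."""
    return mean(seconds for _, seconds in records)


def average_by_customer(records):
    """Facet the same data by customer ID before averaging."""
    grouped = defaultdict(list)
    for customer_id, seconds in records:
        grouped[customer_id].append(seconds)
    return {customer_id: mean(values) for customer_id, values in grouped.items()}


if __name__ == "__main__":
    print("overall:", overall_average(page_loads))
    print("by customer:", average_by_customer(page_loads))
```

With this made-up data the overall average stays well under a second, while the faceted view shows the large customer waiting several seconds per page.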

And so, that's something where again, you can have great instrumentation and great metrics, and I mean, that's in... For example, that's in your CloudWatch metrics somewhere, right? You have all those parameters on every transaction in CloudWatch, but the question is how do you get to that point? The other side of course, and this is the classic New Relic thing, and my title is actually, I'm a dev advocate just for serverless, right, so I'm very focused on the serverless stuff, is beyond, okay, yes. Here's how you're performing overall for whatever customer, right? Let's facet it or let's look at various metrics, is okay, how is this code inside of that actually performing?

That is an area where again, whatever great AWS or Google Cloud or whatever built-in metrics, they're just going to tell you how long that virtual code execution environment ran, right? When did it start? When did it stop? It matters for billing. It matters for everything, but the why of it, right, why are we hitting a problem here? Why are we not performing for some users? Or maybe you're doing something really complex and you're emitting a user response halfway through your runtime, and so climbing runtime maybe isn't a problem because the second part is you're just cleaning up data or gosh, no.

So for that, that's this thing. APM, it may be a classic, but it's a classic for a reason, right, where you say, "Hey. I want to know how much time is spent..." So you'll know how much time is spent inside the Express library, and how much is my own code, right? And these are questions that you want to have real instrumentation, right, like code level instrumentation, and ideally you want to not have to sit there and add a bunch of timing points and call points. Adding observability, you really hope that it's not that your whole team can't ship features for two or three weeks while they go and add a bunch of code points, right?

So yeah. New Relic of course, got famous for doing this right out of the box, and New Relic Serverless offers a very similar performance where it will go in. It'll tell you, "Hey, this thing's running really long." "Okay, why?" "It's this function call," right? "That's the one that's taking so long," or, "No. Hey, all the Lambda code's running really fast, but it's sitting here waiting for the DB to come back for quite a while." And you can see that very easily.

Jeremy: Right. So I mean, in terms of adding observability to your application. I mean, I remember back in the day as you said earlier, where you could just install something on the server and then it would just start doing all that stuff for you, right? Yeah.

Nica: Yeah. This is a really interesting area because there's some stuff when you set out to do this that just doesn't make a ton of sense, right? You're just like, "Okay." When something happens, it's maybe... Okay. You load some kind of wrapper and you wrap the serverless code, just as you can wrap any other code to say, "Okay. Kind of watch the function. Maybe you look for function calls, and then announce when something took a long time." Okay, announce where, right? Because you're on the serverless environment, right, so you don't have an agent, right? There are things that just don't translate over like a common call that I would write and explain to people a thousand times a week when I was in support was, here's how you increment a metric, right? You say, make a call to the New Relic agent and you say, "Okay. Increase whatever metric by one."

And sure, maybe there were a few instances of the agent running on different servers but they would... We could work that out, right? We could add that up. But that's not meaningful at all. There's no observer, right, to get all those little requests on the serverless environment, and if you're doing something... Someone told me recently that I use the term naively in a way that's... I don't mean it pejoratively but it's that... I just mean you're just trying something out as a prototype, right? And prototype, say, instrumentation for Lambda might be, "Okay. When you get done running, don't just end," right? You've returned whatever you returned. Go and report your data somewhere and then shut down, right? The problem is, I mean, the gift and the curse of serverless, right, is that it charges you by the second that that thing is running, or the millisecond that that thing is running.

So if you just... Oh, I just need to make a quick little call, well, that could very easily... Well, Lambdas run for a very short period of time, so that could easily double or triple the runtime up. So then your bill for Lambda has just shot up just to get observability, and that's not a great situation. You don't generally want to see your actual service cost shoot up to do observability. So what the New Relic agent does is it creates a wrapper which is the same code instrumentation that you see with our APM style. So if you're running a Node Lambda, you'll get the same level of code instrumentation, but instead of trying to phone home every time it runs, it writes it out to CloudWatch and then uses another agent to just snarf that up from every single one of your Lambdas. And so it's a very clean install experience and also has very, very tiny overheads so that's quite nice.
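
The general pattern described here, wrapping the handler and writing telemetry to the log stream instead of making a network call during the invocation, might look roughly like the sketch below. This is not the New Relic agent's actual code. The `instrumented` decorator, the `telemetry` marker field, and the `hello` handler are hypothetical names. The sketch relies on the fact that a Lambda function's standard output is sent to CloudWatch Logs, where a separate process could pick the lines up later.

```python
import functools
import json
import time


def instrumented(handler):
    """Wrap a Lambda-style handler and log one telemetry line per invocation."""

    @functools.wraps(handler)
    def wrapper(event, context):
        start = time.perf_counter()
        error = None
        try:
            return handler(event, context)
        except Exception as exc:
            error = type(exc).__name__
            raise
        finally:
            duration_ms = (time.perf_counter() - start) * 1000
            # print() writes to stdout; no extra network call is made
            # while the (billed) invocation is still running.
            print(json.dumps({
                "telemetry": True,  # hypothetical marker for a log shipper
                "function": handler.__name__,
                "duration_ms": round(duration_ms, 3),
                "error": error,
            }))

    return wrapper


@instrumented
def hello(event, context):
    """Example handler; stands in for real business logic."""
    return {"statusCode": 200, "body": "ok"}


if __name__ == "__main__":
    hello({}, None)
```

The design choice is the one from the conversation: the invocation pays only for a log write, and the cost of shipping the data is moved outside the function's runtime.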

Jeremy: Right. And then the performance of other things, I mean, you clearly know this with the way serverless works is that everything is so distributed, right? So you've got SQS queues and you've got EventBridge and you've got DynamoDB and you've got all these different things happening. Somebody uploads something into an S3 bucket and that kicks off. So what type of observability do you get with New Relic around those other components?

Nica: Yeah. I mean, before we even talk about New Relic, this is such an important thing. And I'm guilty of it. I'm sure if you run the tape back, you'll hear me do it here where I say, "Oh yeah, serverless." And then I say, "Oh, yeah. AWS Lambda or maybe Google Cloud Functions." And I'm like, "Those are..." I almost said those are synonyms like those are the same thing. In fact, right, serverless... AWS Lambda is not particularly new but it is still kind of new, but the oldest and best thing from AWS, AWS S3, that's also serverless, right? You don't initialize your storage server there, right? You just give it the object and expect it to figure it out, right? And so, first of all, even the term serverless applies to a whole bunch of cloud services, and then also no... I say this all the time. I say, "Who can tell me how to find your serverless function's IP address?" Okay, okay. Right. So that was unfair. "How do you find a URL, right?" You just want to go and get it, some of it. It's like, "Well, okay. It doesn't have those things, right?"

To even do that, you need to create an API gateway to have a connection even to your Lambda. So no Lambda exists in isolation, right? And of course, there's going to be at least, even for anything beyond Hello World, even for the to-do list app, right, you're going to need a gateway, probably some kind of file storage, and then you're going to need some kind of database probably, right? So one of the other things I hear all the time when I talk to developers who are working with serverless is that maybe they understand how their code is working, but when they go and hit their API gateway and get a 500 back, they don't know where the problem is, right? Is it permissions between components? Is it the Lambda code? Is it the database that's returning something wrong? It's just not obvious, right? And so AWS is sort of... Their in-house solution to that is X-Ray which is an effort to say, "Hey. Let's see what one request did all the way end to end," right, to give you that insight into saying, "Hey, let's see how this started and ended and what services it hit between."

So you might even have some surprises there that you say, "Hey. I didn't realize this is relying on this other maybe queuing system or this Lambda always calls this other one." Well, there's a problem with what I just said always, right, where X-Ray is just a little sampling of, "Hey. This is what one of these did." And very often, you'll be in the situation with X-Ray where you'll say, "Hey. All my X-Ray traces are pointing to the same problem but it's tracing only when something's going wrong." Or it's tracing in only certain situations so other stuff you're not seeing.

At New Relic, first of all, we do instrumentation on every single invocation. It's not sampled. You actually see every single invocation and what those code spans were inside of each invocation. And then we also integrate with X-Ray data. So we pull X-Ray data into our distributed tracing system to help you look at it in a unified place, and obviously, those are going to be sampled. We're not going to send you every single span for every single item because that would be a very wild amount of data, but it is going to give you a really even sampling that shows you really broadly across your stack what's going on inside of those stacks. And then you can sit there and see something we were talking about like, "Hey. This much DB time," or, "These are the services that were called."

Jeremy: Yeah. And I think that's interesting.

Nica: So yeah. Some... Sorry, go ahead.

Jeremy: No. I was just going to say, I think that's interesting too where you're sampling everything and I don't know if you can say sampling everything. You're just recording everything and part of the idea behind that, and this is something I actually talked with Erica Windisch about a couple of weeks ago on this show. We were talking about how it's sort of really interesting where it's almost like you want to be able to see when your application is behaving correctly. That's some of what you want to be able to see because then you can do things like performance tweaking and you can say, "Okay. The service is running just fine, but this Lambda function's taking 600 milliseconds to execute. Why is that?" And you can dig down and you can do some optimizations.

Nica: Some of the best leadership I ever got was I had an engineering manager. I'm blanking on her name. Embarrassing. But anyway, I'll remember it and shout it in the middle of the... later I promise, but she said, "We're hunting down these errors but let's just sit down and look at our total number of requests and the number of errors here." And if we're erroring at one kind of request every time then yeah, we have a systemic problem. But I think maybe it's a time out or something else. And if it's one half of 1%, maybe we should be looking at when things go right and seeing how we can improve that experience.

Jeremy: Right.

Nica: In that case, we have some little user response section where they're supposed to put in a percentage, and everybody was taking a minute or something to get through that. And so we really had a user experience problem, right, for all users that we wanted to look at that was much more important than once in a while when people put unescaped SQL into their username that it would error. Okay, great. But that really wasn't... That wasn't the problem that most users were having. So yeah, you want to capture data when things are going right. That's a very smart thing to sort of keep in mind.

Jeremy: Awesome.

Nica: And obviously that's all doable in the serverless world. Though there are these situations, right, and this is the thing where distributed tracing becomes a big issue where in some serverless tools, right, you can write some code and you can go in and get real insight. And then in Lambdas, you can even use layers to grab big, large code packages and say, "I want to use this sort of outside my code." In others, you can do some configuration to say, "Hey. Please log this over here. I'm trying to watch those logs." But in others like queuing services you can't do any of that, right?

So if that queuing service experienced a scenario like... Right? Where does that go? Or especially, hey, when things are going right, how long does it take you to de-queue certain stuff? Well, there's no endpoint to say, "Hey. Queuing service, I want you to tell me about this." So that's why stuff like X-Ray integration is super key because you have to figure out... You have to get insight into those things that by design, don't allow you to do any kind of that. There's no custom code you can run around the simple queuing service.

Jeremy: Right. Yeah. I mean, I like the idea too of trying to connect things automatically with tracing headers and correlation IDs and some of that stuff that you do not want to have to try to deal with yourself. And I know it's not perfect yet. I know it's getting there, but-

Nica: Yeah. We're actually, we're still sweating the AWS people to be like, "We want to implement some more open standards for these kind of tracing headers because we're trying to get to the point where it's a little bit easier." It's such a common request and of course I understand it to say, "I need to connect all this stuff. I don't just want to be sitting here looking at here's how all my DB calls performed and then over here is how all this queuing stuff performed. I want to be able to see these together."
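
To make the idea of tracing headers concrete, here is a small sketch that assumes the W3C Trace Context `traceparent` header as one example of an open standard. The conversation does not name a specific standard, so that choice is an assumption, and the helper names are hypothetical. The header has four dash-separated fields: a version, a 32-hex-character trace ID, a 16-hex-character parent span ID, and flags. Every service in a request keeps the same trace ID and swaps in its own span ID, which is what lets the DB calls and the queue work be viewed together.

```python
import secrets


def new_traceparent() -> str:
    """Start a new trace: fresh trace ID, fresh span ID, 'sampled' flag set."""
    trace_id = secrets.token_hex(16)  # 32 hex characters
    span_id = secrets.token_hex(8)    # 16 hex characters
    return f"00-{trace_id}-{span_id}-01"


def child_traceparent(incoming: str) -> str:
    """Build the header for a downstream call: same trace ID, new span ID."""
    version, trace_id, _parent_span_id, flags = incoming.split("-")
    return f"{version}-{trace_id}-{secrets.token_hex(8)}-{flags}"


if __name__ == "__main__":
    root = new_traceparent()
    downstream = child_traceparent(root)
    print(root)
    print(downstream)
```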

Jeremy: Yeah. Yeah, that's awesome. All right, so let's talk about the New Relic One platform for a minute because I do think there are some really interesting things in here and I'll read off the website right now. But essentially, it's one platform, three products. And I think that's interesting because... And we're going to get into a little more about this, but maybe we can just talk about each one of these things. So the first one is the telemetry data platform. So what's that all about?

Nica: Yeah. So we haven't gotten to talking much about open source which is probably a focus for me, and is certainly a focus for the entire company. And one of the things that we're seeing, especially since 2015 is that, you've seen a lot of great open tools to do some of the instrumentation that you need. Now, I'll be honest, in my experience, right, if you install a New Relic APM agent, you're going to get really detailed information about function calls, their names, their designations, right? But there are some open tools that can do similar or close to that performance, but also there's open instrumentation for stuff that we of course, never got to writing instrumentation for, right? So open instrumentation is a huge, huge component. There was a tool initially called OpenCensus for PHP but it's now I believe called OpenTelemetry. And you get great results with that.

Now, where does that data go, right? It's great to have open tools for instrumentation, but if you're then saying, "Okay. Well, now we've got to stand up a database and now we've got to stand up a data front end to show people what's in that database, right?" You said, "Oh, we used open tools, but we just put ourselves in a difficult situation." So the Telemetry data platform is an attempt to be this omnivore for that performance data, and have a place where those open source tools have a home where they can send data and display it in a really clean and useful way, so that your maybe sales enablement people or other people who have coding skills and they want to write a SQL query to show you the data, but they don't want to sit there and configure your database themselves. They don't want to handle database permissions. They just want to write a few lines of SQL and get a cool chart, right? So the Telemetry data platform is a place that you can send that data and pull it out in a really effective way. And hopefully that helps your observability.

Jeremy: I would hope so because I don't think anybody wants to be setting up databases just to store telemetry data, especially-

Nica: Yeah. I mean, that's the thing is you have a hard enough time running your own databases for customer data, right? You sort of get to this meta point where you're like, "Yeah. I don't want to be setting up services to observe the services that observe the services," right?

Jeremy: Right, yeah. Well, that's the other thing, right? Now, you're going to observe your database platform in order to make sure that that's still up and running which... I mean, if you think about just the promise of serverless or the idea of serverless in general, it's hand off that undifferentiated heavy lifting. Collecting telemetry data is probably something you don't want to try to manage yourself.

Nica: Yeah, exactly. Yeah. And that is actually something I love about... I landed at working at Stackery and I loved seeing it of course as a value of New Relic which is this whole serverless ethos, right, is you're supposed to be focusing on business problems, right? And if you're sitting there and saying, "I got to learn this config value because one of my Kubernetes clusters failed. I got to learn this because again..." It's like, "Well, okay. How did this help the customer?" It's like, "Well, the service is back up so I suppose that's good," right? But the idea is you're supposed to be saying, "Hey, I don't think we're going to differentiate on becoming a platform company," right?

And I think again, saying, "Hey. We're the best at measuring our own performance, our own service performance. We have people here who are great at engineering an observability platform," it's unlikely that that's what's going to differentiate you if you want to be selling shoes online, right? So New Relic can handle a lot of that heavy lifting, right, and present an incredibly clean and incredibly cheap, in my opinion... I'm not a sales person and I'm not deep on these sales numbers but we can present something very, very inexpensive to store and retrieve that data.

Then the second piece is full stack observability, and that's very much like... That's the stuff that... It's what you sort of know and love about New Relic, but very often, I will... The interviewer will say, "Hey. I'm dev advocate in serverless for New Relic," and people will sort of be like, "What do you mean? Doesn't New Relic just do APM?" And it's like, "Well, we still do and we're the best at that, but also, yeah. We'll observe your serverless stack super good." So this is the stuff that we're very familiar with, right, is that you get this really deep insight into what you're doing. It's kind of what we've been talking about. So maybe there's less to say about that piece.

But then the last piece is AI stuff. And when I try to explain internal to New Relic why this is important or why we should take the time to document this or that or talk about it, I say, "I've been doing..." So when did I come on at New Relic? It was like 2012. People used to say to me in 2012, they said, "You have all our data. Why do I have to set up the alerts? When I have something that sees 10,000 requests a minute and yesterday it saw 6 all day, why can't you just email me?" And no. And I heard about it in 2012. I'm old. I heard about it the year after, the year after, and then on a call with a customer, who was a very advanced customer of ours using a lot of data features, I heard them say it again. They said, "You have all our data. Why can't you see... Why couldn't you maybe even message us when errors are normally at 10% for this service because maybe user behavior creates an error, and now there's suddenly 0%? Why couldn't you email us about that because it's just so unusual?"

And another engineer at that company on the call said, "Yeah. We actually have that. Let's go look at the Slack channel." And the Slack channel was just, "Hey, this error rate dropped. I mean, this throughput dropped unusually low today." And you click through and you can go see a New Relic chart. That's pretty cool, right? That has real promise. And there are of course, as with any ML system or any linear algebra system, there are many times when it presents you things that maybe you don't care about, but just like with any well engineered system, you can go back and say, "Hey. I want to see less of these. I want to see more of that." Obviously there's a huge place for manual alerting. It's something I talk about all the time. But yeah, it can be very, very powerful.

Jeremy: Yeah. And I think the idea of even simple anomaly detection, right? When you have data that's collected over time and you can see your average error rate or your average throughput or whatever. And then also not sort of the cheap anomaly detection where you say, "Oh well, it averages this." Well, averages are great, but only for certain periods of time. Maybe in the morning it's higher. Maybe in the afternoon it's lower. Maybe we got a spike at lunchtime or whatever it is. Or-

Nica: If you're selling delivery food and you're getting just a sort of simple average getting three times a day, right, that says, "Oh my God," or at least twice a day. I don't know. Some people get delivery breakfast.

Jeremy: I don't know, maybe.

Nica: Let's not talk about that now, but yeah. At least lunch or dinner. It can't be a simple thing, right? It really needs to be at least, a second order system that can say, "Yeah, you..." For example, hopefully right, "Hey. You normally see a spike at lunchtime." And maybe you can go and say, "Oh well. It's Christmas day, so okay." It's Thanksgiving so it makes sense, not that people are going to order a pizza right now. But, yeah. You want that kind of at least a second order system that says, "Hey. It's not just the average. It's not just that you're breaking the average, but that something does seem off here."
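To make the "second order" idea concrete, here is a minimal Python sketch of a baseline that is kept per hour of day rather than as one global average. It is only an illustration under stated assumptions: the function names, the sample numbers, and the z-score threshold are all hypothetical, and this is not a description of how New Relic's anomaly detection works.

```python
from collections import defaultdict
from statistics import mean, stdev


def build_hourly_baseline(history):
    """Build a per-hour baseline from (hour_of_day, value) pairs seen on past days."""
    by_hour = defaultdict(list)
    for hour, value in history:
        by_hour[hour].append(value)
    # Keep a mean and standard deviation for each hour instead of one global average.
    return {h: (mean(v), stdev(v)) for h, v in by_hour.items() if len(v) >= 2}


def is_anomalous(baseline, hour, value, z_threshold=3.0):
    """Flag a value that is unusually high OR unusually low for this hour of day."""
    if hour not in baseline:
        return False  # no history for this hour, so make no claim
    mu, sigma = baseline[hour]
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > z_threshold


# Hypothetical order counts: a lunchtime spike at 12:00, almost nothing at 03:00.
history = [(12, 900), (12, 1000), (12, 1100), (3, 8), (3, 10), (3, 12)]
baseline = build_hourly_baseline(history)
```

With this baseline, 1,000 orders at noon is normal, while the same 1,000 orders at 3 a.m. or only 10 orders at noon would both be flagged, which a single daily average could not distinguish.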

Jeremy: Yeah. And I think the promise of AI and machine learning and all this kind of stuff, it's kind of funny because I think we're finally starting to see people implementing real AI/ML use cases. I think when you... In 2012 because I am also old, I remember every pitch deck having, "Oh, we do ML and AI," with no idea what that even meant. But I think if we go back to the conversation we were having earlier about observing your application when it is working, that this is the kind of thing where AI can really help because if you're not getting any errors but you are just seeing a huge slowdown for your lunchtime order spike, then there is a good reason to potentially go and look at that. There could be a reason why that's slowing down, right? I mean, especially what if all of a sudden your traffic dropped off but everything seems to be working correctly, that gives you insights where you can go and start investigating those other things. So it's not just about errors. It's also about just fluctuations in the normal operation of things.

Nica: Yeah, and that's actually... It's something I talk about a lot that I would argue... This is maybe a little extreme to actually implement but that everything you're setting a high alert for, you want to think about, would it be meaningful to set a low alert for, right? Now, it might be... And I think that some of the things that seem like they would be obvious no's, like total response time, I think that might make sense. Maybe it's a very, very low threshold, but you'll say, "Hey. We're reporting that your total runtime for your Lambdas is 0.01 milliseconds." Something is wrong at that point, right? You know that something is wrong. So obviously high and low throughput are classics, but another one that covers a lot of these is actually low cost. If your cost just suddenly drops by 30, 40%, something's probably... Unless you really did just push out a big release, something's probably up. And so that's something that I think is really interesting as far as you really can...
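The "low alert for every high alert" idea can be sketched in a few lines of Python. This is a hypothetical illustration only: the function names, metric name, and thresholds are assumptions chosen to mirror the examples in the conversation (a 0.01 millisecond runtime, a 30 to 40% cost drop), not settings from any real alerting product.

```python
def check_bounds(metric, value, low, high):
    """Return an alert message if value falls outside [low, high], else None."""
    if value < low:
        return f"{metric} suspiciously low: {value} < {low}"
    if value > high:
        return f"{metric} too high: {value} > {high}"
    return None


def cost_dropped(previous, current, max_drop=0.30):
    """True when cost fell by more than max_drop (a fraction) versus the prior period."""
    if previous <= 0:
        return False  # nothing to compare against
    return (previous - current) / previous > max_drop
```

For example, `check_bounds("duration_ms", 0.01, 1.0, 3000.0)` returns a "suspiciously low" message, and `cost_dropped(100.0, 60.0)` is true because the bill fell by 40%.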

When you're thinking about a crisis that is something that is not what you have predicted, something like, yeah... We talked about Charity Majors before but something I really thought about a lot that's stuck with me is how very often we create dashboards for problems that we've agreed we're not going to fix. And that's something like you have a huge system. It leaks memory sometimes, and you really just need to watch memory usage and reset the thing. And I don't think anyone would disagree with that. I said, yeah, this is a dashboard to monitor a problem that we're not going to fix, or whatever.

You've got a steam locomotive. It gets hot, right? You're not sitting there trying to make a cold steam locomotive. You're just saying, "Hey, yeah. It gets hot so we need to keep an eye on that," right? But then those are all the problems that you know about, right? So hopefully anomaly detection and some other observability tools can get you to a point where you get at least a clue, right? It's not that you're getting a text that says, "Hey. Steve just deployed some code and it used an incorrect sorting algorithm," and that's not what you're going to get on your phone, right? You're just going to get an email that says, "Hey. I actually need to start looking into this and see that there might be a problem here."

Jeremy: Yeah. And you know what the other funny thing is too is that we've been talking about these metrics and we said how long a function runs for and things like maybe, I don't know, errors and things like that. And we're talking a lot about application sort of level things, right? I mean, there are certainly infrastructure components underneath that, but that's another great thing about serverless too is you can start focusing on a different set of metrics which are not, is my server running? It's where are my performance issues. And I think that's just another thing that's really great about what you can do with observability.

Nica: Yeah. Something I love diving into is I'll just take people through... Maybe they've done some of the New Relic instrumentation on parts of their cloud stack and I'll say, "Hey. Let's look at some of the parameters that you're gathering for every single invocation that right now we're not doing anything with." They're the event parameters that are just available within AWS, and we'll step through them and there will be an awful lot there, right? Obviously there's stuff like the event source, where did this come from? Did API Gateway call this thing? Was it an event from some place else, right? And you can see stuff like stuff that would make my heart stop where it's sometimes this function is called by API Gateway. Sometimes that's just being triggered by DynamoDB...

Jeremy: Right. That's not good.
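As a rough illustration of reading the event source off a Lambda event, here is a minimal Python sketch. It relies on the documented shape of common AWS events (a `Records` list whose items carry `eventSource`, or `EventSource` for SNS, and a `requestContext` on API Gateway proxy events). The function names and the `"aws:apigateway"` label are hypothetical, since API Gateway events do not carry an `eventSource` field of their own.

```python
from collections import Counter


def classify_event_source(event):
    """Best-effort guess at which AWS service produced a Lambda event."""
    records = event.get("Records")
    if isinstance(records, list) and records:
        first = records[0]
        # Stream, queue, and storage events label each record with its source;
        # SNS capitalizes the key as "EventSource".
        source = first.get("eventSource") or first.get("EventSource")
        if source:
            return source  # e.g. "aws:dynamodb", "aws:s3", "aws:sqs"
    # API Gateway proxy events carry a requestContext plus HTTP routing details.
    if "requestContext" in event and ("httpMethod" in event or "routeKey" in event):
        return "aws:apigateway"  # hypothetical label chosen for this sketch
    return "unknown"


def sources_seen(events):
    """Count how many invocations of one function came from each source."""
    return Counter(classify_event_source(e) for e in events)
```

If `sources_seen` reports more than one source for a single function, that is the mixed-conveyor-belt situation described in the conversation.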

Nica: You see like cans of soup and balls of cotton on the conveyor belt, and you're like, "I don't think... This doesn't seem right." But you can really... Often there's so much available on each of those events. They're very rich data objects that you can start looking into actual business logic where you can say, "Hey. Let's look at how one organization or customer or one sort of use type." Like, "Hey. This is a person making some kind of update request," right? And a lot of that stuff is available on the front end often, right, in front end monitoring, but I like to see that coming in more and more on the back end. So instead of just looking at kind of... I often sort of when I'm thinking, I think of it as these engine metaphors where you're sort of seeing, ah, the engine's hot or the engine's cold. How much memory are we using? Physically, how hot is the CPU, right?

It gets you to the point of saying, "No, this service is very critical for people updating their accounts, and see how that's taking longer or that's performing differently. And let's look at what that might mean," right? Something I think about is looking at the weight of DB information that's coming back, right, because one of my side things is looking at GraphQL and trying to encourage people to do these fully parametrized queries, right? It's, "Hey. Look at how this kind of request we're always sending half a megabyte back every time someone tries to do this one thing." That can be very, very insightful and again, that's much more in the business logic world than it is the world of sort of yeah, as you say, looking how the server is doing. Is the server up or down?

Jeremy: Right. Exactly, exactly. So you mentioned cost in there too, and of course, when you're using on demand or pay-per-use services, especially in the serverless world... I mean, even in cloud in general, cost is one of those sort of first class metrics. And we can talk more-

Nica: Yeah. It's kind of fascinating. Sorry, but I'm trying to get a cert right now and looking at... There's whole classes of AWS stuff that exists because some of the tools you're using only do host based pricing, and you're like, "Oh, you have..." I mean, not to the point of well, I just can't do that, right? You can't go on this cores. But with EC2, people are like, "Oh, I need to own certain cores because that's exactly how I pay. I pay by core ID." It's like, "Wow. That sounds pretty old," right? If I'm going to do... If I'm paying for my Lambdas by how many requests I get and the billing is scaling smoothly, shouldn't that happen for everything that works with it, right? So it's been nice to see that. Again, I'll connect you with great sales enablement people who will tell you all about the exact cost structure, but it is nice to see, "Hey, we're doing usage based pricing," which is very helpful.

For me, the part that I'm super passionate about is I love going and talking to bootcampers and meeting people. I do a Twitch stream that's just for people who are totally new with AWS. It's like, "Hey. Let's get your first web app on AWS. Let's get you to Hello world." And there's just such a weird thing about observability. Observability is this buzzword, very much like test driven development was or object-oriented programming or anything that's like, "Hey. This is good. You do this, it's good." But if someone was trying to pitch test driven development and they said, "Well, what's step one?" Well, step one, you sign an $800 a month contract, right? Step one needs at least $20,000 in sales you have to sign up for. And most of the tools for observability, they're not cheap.

And so this thing, it's called the perpetual free tier, and again, I'm not going to break it down into gigabytes and MIPS and MOPS, but I will say, if you're running a little hobbyist app or for me, I'll be setting up eCommerce apps for people and I just kind of want to set it and not really think about it, that perpetual free tier, that will just carry you. That will gather plenty of data, plenty of usages, plenty of requests. If you have a few hundred or a thousand users, you can use that free tier forever. Hop into New Relic and see how this service is performing. So that's so nice for me because when I started 5 months ago, it was like, "Well, I could sign a bunch of bootcampers up to the free trial on that and in two and a half weeks, they're just going to be out of luck." And so, that's been really nice.

I think there's some real power in that in the idea that just like testing, it's not that necessarily every single person's going to do it, but it's much more about, if you take the time to do it, there is a service available that is affordable, right? Or when I started within web development, a lot of people were... They'd gone pretty far, but they were doing all their hosting on their laptop because they could not go out and just buy hosting, right? So that's what New Relic is trying to do with observability is say, "It either costs you nothing or it costs something that's just very, very negligible on top of launching your business or launching your web work."

Jeremy: Right, yeah. And I mean, and the free tier is... I mean, I don't want to get into the numbers, but you're right. The free tier is very generous. There's quite a bit that you can do in that free tier, and you couple the free tier with AWS and you could probably run a good size application for quite some time before you start getting hit with charges.

Nica: Yeah. It was something... It was in the millions of transactions you can be measuring and you're still on the free tier. I was-

Jeremy: Yeah. 100 million app transactions per month and 100 gigabytes of data transfer per month which is pretty big.

Nica: Yeah. Yeah. So you're right. Of course, I love talking to small teams or agencies and stuff, and agencies is a big one where I think about where it's you would like to set up some observability tools on there so that when the client calls you six months later and says, "Hey. I'm having some problem." You're not having to say, "Well, we got to start billing to even try and figure out what's up," right? Now you just got to click through to your dashboard, but then I also don't want to be bugging them about a $25 bill that has to be paid so that we can keep up the observability. So yeah, that's pretty powerful. That I think is... It really opens up who I get to talk to which is fun because that means I can go and talk about Arduino stuff or talk about goofy CLI stuff instead of having to have these enterprise conversations.

Jeremy: Right. And trying to sell people on that stuff too. Yeah. So I mean, I think what's really great is again, not only do you get the free tier, after that again, it is usage based pricing. And one of the things I love about that because just, I remember I started a startup back in 2010 at one point, and we were building facial recognition as a part of what we did. And I had to go and buy a software that then I had to write a PHP shared object for, the shared object module, and write that so that we could tap into that with a PHP call in order to run facial recognition.

I think it was $5,000 just to buy the software for that, plus all the engineering time, things like that. This is what I love about serverless and this idea of usage based pricing where you just say, "Hey. I need to run facial recognition. I can hit the AWS Rekognition service or something like that. I need to translate a document or I need to do that." I just hit this one thing and it costs me a few pennies here and there. Extending that idea to something like observability I think is amazing. It's very useful for those small teams.

Nica: And it's so much this thing. Often when people ask me to define serverless, I say it's a goal. It's a goal like Agile, right? You don't buy Agile in a box, right? You don't say, "Oh well, because we're all clicking this Kanban board 20 times a day, now we're agile, right?" And actually, one of the ways that it really is related is here, right? The ability to say... Slack's the classic example. It's like, "Hey, we have something here and we think it could really be big," right? Well, that's great but before you get that huge interest and have those huge sales and have that huge growth, how can you make something that still performs well, but does something sophisticated, that lets you just say, "Okay. This one was successful and these 12 were not," right? And lets you just scale with the success of that product, right? And serverless is so much about that.

And I tell people all the time to say, "Hey. Just write this microservice serverlessly," because very often, you're an engineer. You have a good idea. You don't want to start with having a conversation about how you need to pay this extra AWS bill or you need to do this extra thing. And it's quite wild. You can see people who make... They're taking home 12 grand a month and they're having a conversation about an $80 a month bill that's taking 15 emails to justify why. And very often I see teams where the real message they take back after that is don't experiment, right? Me asking all these questions, right, you know not to experiment.

Now you can say, "Hey. You start this out. It's free or it's very inexpensive," and then you say, "Oh hey. We got a big bill we got to pay because we're taking off," right? Tons of use. People love this area of the site, right? Maybe you want to add facial recognition. Image recognition is a good one, or say, you want to add image editing. You want to add video uploading, and you just don't know how big it's going to be, right? S3 and Lambda is a very powerful way to do that, right, using something like serverless-based video pre-processing and then storing in S3. If nobody uses it, you don't pay very much for that hosting.
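The S3 plus Lambda pattern mentioned here can be sketched as a small Python handler that reacts to S3 upload notifications. This is a minimal sketch, assuming the standard S3 notification event shape (`Records[].s3.bucket.name` and `Records[].s3.object.key`, with keys URL-encoded). The actual pre-processing step is a hypothetical placeholder, and the bucket and key names are made up.

```python
from urllib.parse import unquote_plus


def uploaded_objects(event):
    """Yield (bucket, key) for each object referenced in an S3 notification event."""
    for record in event.get("Records", []):
        s3 = record.get("s3")
        if not s3:
            continue  # not an S3 record, skip it
        bucket = s3["bucket"]["name"]
        key = unquote_plus(s3["object"]["key"])  # keys arrive URL-encoded
        yield bucket, key


def handler(event, context):
    """Lambda entry point: runs only when something is uploaded."""
    processed = []
    for bucket, key in uploaded_objects(event):
        # Hypothetical placeholder: real video pre-processing would happen here.
        processed.append(f"s3://{bucket}/{key}")
    return {"processed": processed}
```

Because the handler only runs when an upload event arrives, an unused feature generates no invocations, which is the pay-for-what-you-use property described above.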

Jeremy: Right. Right. Yeah. No, I always suggest that too where I say, "Look. Start with serverless especially from the prototyping phase and if something gets so amazingly big that for some reason, you can't optimize it anymore with serverless and you have to go down, as you said, the Kubernetes cluster path or something like that, then that's great. But you don't need to do that when you've got 10 users. You need to do that maybe when you have 10,000 users or more."

Nica: And a big part of that is what expertise are you building on your team?

Jeremy: Exactly.

Nica: When you're using any observability tool... And much like deployment tools, I often say like, "Hey. Go use an observability tool," right? "Go use Grafana," right? That's a great open source toolkit, right? Just do something because what you want to build in your team is expertise at building your product and observability should give you insight into your product. You really shouldn't be building expertise in other stuff, right? You could say, "Hey. I'm becoming an expert in running a metric server and doing my CI/CD config." Well, some of that stuff is maybe necessary in certain use cases, but ideally, right, you're becoming an expert in your actual product that you're giving your users what they want, right?

Jeremy: Yeah, exactly.

Nica: I think about how every time I sort of struggle with the time zones and time spans, I think how, oh boy, the ladies at Airbnb must be so good at this by this point. They've probably got a team that's like, "Yep. How many fortnights between Memorial Day and Labor Day weekend?" or whatever. They got that. They got all that down.

Jeremy: Totally. Totally agree. All right. So let's talk about open source for a second. So you mentioned open source a little bit in the beginning, things like OpenTelemetry and some of those other services. But New Relic has gone all open source on all their agents.

Nica: Yeah. This has been very exciting. Yeah. Yeah, so this is something that I think was maybe overdue, it's probably overdue everywhere, is that if you're using this tool to get insight into your own code, it seems nice if you could actually look into what it's doing. And then also of course, any instrumentation package, even New Relic's great instrumentation packages for all these different language web apps, you're going to want to extend it, right? And most of the conversations I have when I talk to customers is about extending it in some way. And so through open sourcing our agents, we've opened a lot of that up to say, "You can take a look at this logic. You can look at where it can be extended as is appropriate for your tool set."

And then a big piece of that is we're doing a ton of contributions to the OpenTelemetry project, previously OpenCensus, which I can't remember if I slipped in calling it OpenCensus at the start, but yeah. Tools are very important. Again, New Relic does great out-of-the-box instrumentation but of course, the open source community is going to build instrumentation for stuff that isn't even on our radar, right? There's a new web framework every week, right? So if someone's going to sit down and write some great Deno instrumentation, it would be great if they weren't doing that for just their own shop, right, if that was something that was shared everywhere, right? So yeah. So that's a big push and I want to plug again, if you have an open source project and it contributes to observability and has a code of conduct in its repo, get in touch with me because I would love for us to be backing that and helping build that.

So the other big piece of that, and I mentioned a little bit when we talked about Telemetry Data platform which is the name of the product is if you're using open source tool kit to do observation, we are going to be able to consume and display that data in a very, very powerful way. That part is not just this week. That part has been going on for a few months is, or sorry a couple years, is we've had really great end points to take that data in, and again, on this free tier, we can take and display a ton of that data without it really costing you anything to do that. And even once you grow beyond that, it's not expensive. So that's a very powerful set of tools to say, "Hey. Maybe there is an open project that gets you a lot of the data you need. Let us display that for you right in with the other stuff that we're instrumenting."

Jeremy: Right, yeah. And so besides just making the agents open source, which I think you're right, I think is really exciting, you're also contributing quite a bit to open source. I think you're the third biggest contributor to OpenTelemetry I think, right?

Nica: Yeah. I saw that at the chat the other day. That's really neat. There's some engineers who have gone so deep on how to truly instrument certain behaviors and do full instrumentation on some very big and complex applications that it's really great to see those contributions becoming more open source and seeing that stuff happen. That's been really cool. And there's also, there's some neat stuff happening with what we call programmability. This isn't my baby but there's a really smart guy, Jemiah on the team who... He runs New Relic Nerd Days which is coming up in October that we're all excited about, but where people can also create whole modules inside of New Relic.

So let's say you want to show our error rate or something but you want to do it in a fun way. You want to show it as a ring toss game, or you want to see an elephant that grows bigger and bigger for how many cars you sell or what have you. I'm just thinking of visual ones but whatever. But we have an open source tool kit to do that, so that you can actually build data components. So that's after all your data is already in New Relic and you're sort of in the New Relic architecture. So it's not a great first project but it is just fun to think about having them in your future.

Jeremy: Awesome. All right, so we've been talking a lot about the New Relic One platform and we've talked quite a bit about serverless too which is I think most interesting to the people listening to this podcast. So what are some of those features and some of the things that you can do with New Relic when you plug it into your serverless applications?

Nica: Oh, yeah. Yeah. That's great to talk about. That's where I start. So the first piece, again, is you're going to do a no code commit deploy to do your instrumentations. So you're not going to have to edit your code. You're not going to have to add... I say this with some instrumentation... Oh, just add a few lines of code at the top. Nope, not necessary. We'll do a tool layer and we're even working on better tools for that in the near future to deploy it in an even smoother way. But then what you're going to get is again, you're going to get that code level instrumentation on every single invocation.

So you're going to be able to see, at least for every single invocation like for example, how much time was spent in library code, how much time was spent in your own code, how much time was spent waiting for a database or another server to come back? So that's significant. And then with our distributed tracing, you're going to be able to zoom in to a large number of your transactions and see exactly which functions were taking so long and what were they waiting for, right? Did you have maybe API calls going out from that? You get this nice, I don't know what they call... It's called a waterfall chart or something?

Jeremy: Yeah, something like that.

Nica: Where you can see, "Hey. Was this maybe happening synchronously, so it was really... It was made asynchronously, but that you were holding up other requests to wait on it." And so, you can see that real detail. There's other stuff too which is just kind of quality of life stuff but it really matters is you can see cold starts and you can see memory usage and the memory cap on your Lambdas. So very often, that's critical. Sometimes it will reveal a problem. I actually just was talking to somebody who sure enough, they had tons of cold starts and so they really did have to think about how they were going to handle that.
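For readers who have not seen how a cold start can be spotted from inside a function, here is a minimal Python sketch of the common module-level flag technique. It is an illustration only and makes no claim about how New Relic's instrumentation detects cold starts. The handler body and the returned field names are hypothetical.

```python
import time

# Module scope runs once per execution environment, so this flag is True
# only for the first invocation handled after a cold start.
_cold_start = True


def handler(event, context):
    """Hypothetical Lambda handler that reports whether it ran cold."""
    global _cold_start
    was_cold = _cold_start
    _cold_start = False

    started = time.perf_counter()
    result = {"ok": True}  # hypothetical stand-in for the real work
    duration_ms = (time.perf_counter() - started) * 1000

    return {"cold_start": was_cold, "duration_ms": duration_ms, "result": result}
```

Recording `cold_start` alongside the duration makes it possible to separate slow requests caused by cold starts from slow requests caused by the code itself.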

But also it helps you eliminate that as cause. Hey, you're seeing this request time climb up. Do you need to dig into the tracing and the logging or can you just say, "Look. It's cold starts," right? So it's nice to be able to eliminate that. Using 50% of the memory, you're probably okay for CPU and IO as well. So now we can move on to the code performance. And then the last piece is because we're using this kind of CloudWatch step, you can actually have a Lambda sitting in your service that's grabbing that out of CloudWatch grabbing that logging and sending it up to New Relic, which means we can actually grab more logs if you want. You can define a pattern and grab really extensive logs and send them up. And so, New Relic logs is another way to connect those traces over and again, see really detailed performance information.
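The log-forwarding Lambda described here receives CloudWatch Logs subscription events, whose payload is base64-encoded, gzip-compressed JSON under `awslogs.data`. The following is a minimal Python sketch of decoding that payload and keeping only lines that match a pattern. The function names and the pattern are hypothetical, and the step that ships the lines to New Relic is deliberately left out rather than guessed at.

```python
import base64
import gzip
import json
import re


def decode_cloudwatch_logs(event):
    """Decode the base64 + gzip payload of a CloudWatch Logs subscription event."""
    compressed = base64.b64decode(event["awslogs"]["data"])
    return json.loads(gzip.decompress(compressed))


def matching_messages(event, pattern):
    """Return the log messages in the event that match a regular expression."""
    payload = decode_cloudwatch_logs(event)
    regex = re.compile(pattern)
    return [
        entry["message"]
        for entry in payload.get("logEvents", [])
        if regex.search(entry["message"])
    ]
```

A forwarder built this way could call `matching_messages(event, r"ERROR|REPORT")` and send only those lines on, which is one way to "define a pattern" for the logs you want to keep.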

Jeremy: All right. And you can actually add additional things. If you wanted to capture specific business KPIs and things like that, you can alter your code and add some of that stuff in there, right?

Nica: Yeah. Yeah. So you can absolutely add stuff as a custom value. You also, again because you get all the event parameters that AWS is sending around, often that stuff is already in there and we have this really clean data explorer where... That's a big stumbling block I've noticed for myself as well. I'll say, "Oh, well. Let me just log out this whole event." And maybe it'll... Okay. It'll log out but it wasn't a complete object so there'll be a few ones. I'm like, "Okay. I'm sure there's others. That's fine. I've got enough detail." But just having a little explorer.

We have this thing called the Data Explorer where you can click through and be like, "What parameters were available on this event and are they on every other event," right? Does every event have a customer ID or is it only some? You can just see that, and that just makes it so much easier to figure out what you might be charting or what you might need to do a code change to see. I can see that saves so many people so much time, and so it's always very nice to show off. And we can do that because we're doing that level of instrumentation on every single transaction, you'll see it for every single invocation in Lambda, so you'll get a really nice consistent smooth data graph on that.
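The question "does every event have a customer ID or is it only some?" can be approximated by hand with a few lines of Python. This is a hypothetical sketch, not the Data Explorer itself: `attribute_coverage` and the sample attribute names are assumptions made for illustration.

```python
from collections import Counter


def attribute_coverage(events):
    """Map each top-level attribute to the fraction of events that include it."""
    events = list(events)
    if not events:
        return {}
    # Iterating a dict yields its keys, so each event counts a key at most once.
    counts = Counter(key for event in events for key in event)
    return {key: counts[key] / len(events) for key in counts}
```

For two sample events where only one carries `customerId`, the result is `{"customerId": 0.5, "path": 1.0}`, which shows at a glance which attributes are safe to chart and which would need a code change first.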

Jeremy: Right. And then you get the benefit of the entire New Relic One platform. So you get the AI monitoring there and the ability to do those alerts, but you can also set custom alerts if you wanted to as well.

Nica: Mm-hmm (affirmative). So you could see for example... Because you can do this right at the query level, so we use a SQL syntax to make all these charts, you can write a query that says, "Hey. Is my container running out of memory or are my Lambdas running out of memory?" And combine those together and get a unified alert for that. Now, that one didn't make a ton of sense, but let's say maybe you're using some kind of EC2 instance to handle some requests and you're using a Lambda to handle other requests, but they're both checkout cart actions, right? So you really want to unify an alert on that. You don't want to see it from one or the other. You want to alert everybody.

And so because you're monitoring all this in one place, you can write an alert that covers both of those together which is pretty nice. And obviously, even if you're not doing that, combining alerts, if you get an alert, you can very quickly click through and say, "Hey. How's the Lambda site doing? How's our monolithic self-hosted application doing?" It's very consistently for me... Anybody who's really successful, they have all those levels of abstraction exist, right? Maybe they still own some bare metal some place. They definitely have virtual machines. They have EC2 instances and they have the Lambda and so being able to click around and see that stuff all together, ugh. Such a quality of life improvement.
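To illustrate why a unified alert across EC2 and Lambda can matter, here is a small Python sketch that combines per-source counts for the same checkout action into one error rate. It is not NRQL and not New Relic's alerting API. The function names, the sample numbers, and the 5% threshold are all hypothetical.

```python
def unified_error_rate(samples):
    """Combine request and error counts from several sources into one error rate."""
    total = sum(s["requests"] for s in samples)
    errors = sum(s["errors"] for s in samples)
    return errors / total if total else 0.0


def checkout_alert(samples, threshold=0.05):
    """Return one alert message for the whole checkout path, or None if healthy."""
    rate = unified_error_rate(samples)
    if rate > threshold:
        sources = sorted({s["source"] for s in samples if s["errors"]})
        return f"checkout error rate {rate:.1%} across {', '.join(sources)}"
    return None


# Hypothetical counts for one time window.
samples = [
    {"source": "ec2", "requests": 900, "errors": 20},
    {"source": "lambda", "requests": 100, "errors": 40},
]
```

In this made-up window the EC2 side alone sits near 2%, under the threshold, but the combined checkout error rate is 6%, so a single alert fires for everybody who owns the checkout path.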

Jeremy: Right, yeah. And I think you just actually made a really good point. I mean, this idea of having Lambda functions and EC2 functions running side by side, I don't know many companies that are 100% serverless. I mean, I know a lot are going that way and they want to go that way. I know I would love to be 100% serverless, but even some of my applications still have things that aren't serverless, and being able to put all of those into one platform is really powerful.

Nica: Yeah. And especially as you see ML tool kits get bigger. I mean, there's ways to implement that stuff serverlessly, but I mean, it's not straightforward. So right. That's going to be a really good example where it's like, "Okay. We're doing all this basic CRUD action serverlessly. That's great." But then when we need to... Image recognition and put a fun mask over everybody's face because it's St. Swithin's Day, yeah. That's going to happen inside an EC2 instance. There's going to be those exceptions, right? When we talk about step functions or other stateful ways to do serverless, it's like, "Should I just do this from the start because I like using states?" So it's like, "No." But sometimes you get stuck. Sometimes you have to use those tools, right? Yeah. Trying to get a picture of that whole map, right, that whole system quickly, hopefully that's where New Relic comes in.

Jeremy: Yeah. Awesome. All right. Well, Nica, listen. Thank you so much for taking the time-

Nica: Yeah. This has been a good one. I really enjoyed it.

Jeremy: ...to talk not only about serverless... I mean, or not only about New Relic, but obviously just your insight to serverless is really exciting too and really interesting. And if people wanted to go and learn more about what you're doing with serverless, maybe some of your side projects, but also in New Relic, how do they find out more about that?

Nica: So I'm still pretty bought in on Twitter. You can go follow me on TikTok too. If you go search Nica Fee, you'll see me over there. But most of my stuff is going to go up on Twitter. I am on Twitch twice a week. I'm on on Tuesdays and Fridays in the afternoon if you're a US Pacific time person, but I'll announce there. I do a lot of hands-on demos there. But yeah, those are two great places. You also see me on the New Relic blog and various New Relic stuff. I was just quoted in Forbes this week. That was fun.

Jeremy: Oh, wow.

Nica: Yeah. That was neat.

Jeremy: That's awesome.

Nica: It was about how you need to be able to throw stuff away in serverless. You need to be able to just say like, "This service is no longer running. You have to go through and delete stuff."

Jeremy: Right, right. That is a hard thing for some people to do.

Nica: Yeah. Hey, I have games that I've written in C# code where I have reams of lines that are commented out. It's just like, "I might need these some day."

Jeremy: Some day. I know. I always do that too. That's another bad habit of mine, but-

Nica: They're like my old style Apple white lightning connectors where I'm like, "I just-"

Jeremy: You never know when that old iPod mini's going to come-

Nica: Come back.

Jeremy: ... and you're going to need to charge again. So anyways. All right, Nica, thank you again. I will get all of this information to the show notes. I really appreciate it.

Nica: Ah, thank you so much. Thanks everybody for listening.

Jeremy: Awesome.
