Deep Dive into Forecast Accuracy to Drive Better Revenue Management Decisions
In this session, we revisit a question that revenue management (RM) researchers and practitioners have struggled with for decades: how do we best measure forecast accuracy? A PROS simulation study gives some surprising insights that will help RM researchers and managers make better decisions when designing their next forecast accuracy study. Also, learn from Etihad Airways Manager of RM Systems Hasan Abdul Tawab, who details a real-life application of forecast accuracy measurement.
Full Transcript
Jonas Rauch: Just a quick intro, I'm Jonas Rauch, I'm a principal scientist at PROS. And before that, I spent about ten years at Lufthansa working on revenue management methods and strategy.
Ross Winegar: Yes. And hello, I'm Ross Winegar, I'm a data scientist at PROS as well, and I've led forecast science at PROS for many years. I'm happy to be here. I know that we're the only thing standing between you and your flight, or between you and the bar, so I hope you enjoy this last presentation today.
Hasan Abdul Tawab: Hi. Good afternoon, everyone. I'm Hasan, I'm from Etihad, I'm the manager of the RM system there. I'm happy to be here and thank you all for staying and coming to the session.
...
Jonas Rauch: All right. Then let's get started. And quick logistics note, the sign outside says we are done at 4:15. We won't be, it's planned until 4:30, so we have 45 minutes. If anyone needs to rush out earlier, then that's fine. But we're happy to have a little bit more time to really get into the details here. All right. So, the topic we're going to talk about is forecast accuracy. And it's one that we get a lot of questions about in our daily work at PROS. And to, kind of, set the stage, we want to talk about, quickly, what are the different reasons why we would measure forecast accuracy. We don't do that for its own sake, we do it because we want to achieve something. And, depending on what we want to do, we need to look at it in a different way.
Jonas Rauch: So, some of the potential use cases are research, so that's me in my daily job. I'm doing research on forecasting, and I want to find a better model. So, I need to see, is my new model better than the old model? Or is this hyperparameter that I'm setting better than the other one? That's the research perspective, where you go to a pretty low level of detail. Model tuning is kind of related, just slight changes to the model parameters. Then you have success measurement, and that's where it gets more into practice, right? You have analysts working with the model and with the system. They make decisions, and they want to see, okay, was this decision actually a good one? Did it improve forecast accuracy, or should I have done something different?
Jonas Rauch: So, this is still relatively low level, because you go to individual decisions, and you want to know directionally, what do I need to change? And then you have the more high-level process control view, where, for example, executives and management want to see, is my whole process working the way it's intended? Is our forecast as good as it was last year? Is it better than last year? Not necessarily asking what specific tweak do I have to make to improve it; it's more like, how's it going, to kind of get an alert and track whether we are heading in the right direction.
Jonas Rauch: So that's some of the different perspectives. There are even more, but just keep in mind that forecast accuracy isn't something you do just for fun, it's for one of these specific goals, and then depending on what you want to do, you have to approach it differently.
Ross Winegar: Okay. Great. As I get started here, has anybody in the room ever done a forecast accuracy study before? Raise your hand. The majority of the room, nine out of 10 people in the room. Okay. All right. So, we get asked this question quite a bit on how to measure accuracy, and I actually like to think of it as efficacy. And, there's really two main ways to do it. The first one is to just look at the forecast, what is the history, what is the forecast? And the other one is what everybody likes to do, to put a number on it. And, actually, what is my error, what is that accuracy number, and such. But for the first one here, let me give an example. And this is real airline data. It's a little bit older but some things are timeless.
Ross Winegar: And here we have an example where you have the historical bookings. So, you have the history here, this dark blue line, and then I have shielded off what happened after that. But if we were to look at this picture here and ask what we think a good forecast would be... well, I've seen a lot of forecasts and analyzed them, so I hand drew this to be up here. It's going to take into account the recent trends, we've seen a rise in bookings, we have some seasonality components in there, and there are some wavy sines and cosines. But what happened was this customer said the forecast is too low, and they had snapshotted it at that date there. And what happened is the bookings had dropped down quite a bit since that time. And so, the green line here is where the forecast was at the time, and here is where the bookings came in.
Ross Winegar: And so, if you had snapshotted it at this point here, at this DCP, you would have thought the forecast was too high, and it's actually down here. I would say 90% of the time when there is a question about the forecast, by looking at the picture, you'll answer your own question. So, it's a good way to start. If you are going to go down and do a study, a lot of details go into the study. I see everyone's phones coming out for this photo here. Yeah. Get ready for Jonas and Hasan, who go into much more detail on all this. But to start off, your choice of metric matters. What you are actually measuring, and how you are measuring it, is important.
Ross Winegar: Here are some very classical methods that I'm sure a lot of people here have experimented with. Root Mean Squared Error (RMSE): for a very long time, this used to be our preferred metric, because it penalizes large errors by squaring them. But recent research has shown maybe it shouldn't be. MAPE (Mean Absolute Percent Error) is the most common measurement. It just looks at the percentage error. This is probably the worst metric to look at, for reasons we'll get into later, because it doesn't actually tell you whether you're making more revenue or not. And it's biased upwards: you can always have an error of more than 100%, but you can never have an error of less than zero, so it's typically biased. It's even undefined if you have zero bookings, because you're dividing by zero.
Ross Winegar: Mean Absolute Deviation (MAD) is a common one too. Minimizing it actually pulls the forecast toward the median of the bookings. And then BIAS, which is actually the simplest one. It's the KISS method: keep it simple, stupid. And interestingly enough, as Jonas will show, it actually seems to be the best predictor of revenue. If you just minimize the average error across the flight compartment, the level of the capacity constraint and the bid price, that is actually the best predictor of the revenue that you'll get. So, choice of metric matters, and there are many different options here. There's not necessarily a good one or a bad one, they're all a little bit different. But RMSE and BIAS are the ones that I would recommend looking at.
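To make the four classical metrics concrete, here is a minimal Python sketch of how they might be computed for a set of forecast/actual pairs. The array values and names are purely illustrative, not PROS code.

```python
import numpy as np

def rmse(forecast, actual):
    """Root Mean Squared Error: penalizes large errors by squaring them."""
    return np.sqrt(np.mean((forecast - actual) ** 2))

def mape(forecast, actual):
    """Mean Absolute Percent Error: undefined wherever actual == 0."""
    return np.mean(np.abs(forecast - actual) / actual)

def mad(forecast, actual):
    """Mean Absolute Deviation: minimized by forecasting the median."""
    return np.mean(np.abs(forecast - actual))

def bias(forecast, actual):
    """BIAS: average signed error; positive means over-forecasting."""
    return np.mean(forecast - actual)

# Illustrative per-class forecasts vs. observed bookings.
forecast = np.array([2.0, 1.5, 0.5, 4.0])
actual   = np.array([3.0, 1.0, 0.0, 5.0])
print(rmse(forecast, actual), mad(forecast, actual), bias(forecast, actual))
# mape(forecast, actual) would divide by zero here, because one actual is 0.
```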
Ross Winegar: Okay. So, you have gone off and you've done a big study. And you have come up with your BIAS, and it is 16.7, so what do you do with that? Well, what I would say is, go look at the forecast. So, is that high, is that low, is my error high, is it wrong? Well, the first thing you do is, "Okay, well, how does it look? What do I see, history forecast?" And this will typically answer your questions on what to do next. Okay. And so, I'm going to hand it off to Jonas now, to talk about the simulation studies on choosing the best metric, and then I'll come back and talk about doing a study.
Jonas Rauch: Thank you. So, this study gets a little bit more theoretical and technical here, but I'm trying not to do too much math. The question is, as Ross mentioned, you have at least four different metrics, and there are probably more. And there are a lot of more subtle choices that you have to make about what level to measure the forecast on, whether you use weights in your forecast error metric, and stuff like that. And so, depending on how you measure the forecast, maybe with one metric this model is better, and with the other metric the other model is better.
Jonas Rauch: So, there's actually a PhD thesis by one of my former colleagues at Lufthansa showing that there's no one metric that is best. Every metric has its upsides and downsides, and depending on the metric, one model is better than the other. So practically, which metric do we want to use when we do these studies? That was the question here. And so, we set up a simulation study, and the idea behind the study was that, again, we don't really want to measure accuracy for its own sake. We want to maximize revenue. That's what we are here for.
Jonas Rauch: So, why do we use forecast accuracy at all and not just look at revenue? Well, it's very hard to measure revenue, for two reasons. One is that you have to actually use those forecasts, and the bid prices you get from them, in real life and then see how well it works. And if it doesn't work, that's pretty bad, because you've lost a lot of money. So, it's pretty costly. Whereas with forecasts, you can just run 10 different models on the same data set, evaluate them on the test set, and compare. So, measuring forecast accuracy is free, and measuring revenue in the real world is certainly not free. That's one big issue.
Jonas Rauch: The other thing is measuring revenue in practice is very difficult anyway, even if you are willing to do it just because there's so much noise. There's competition, there's economic effects, there's all kinds of things that change your revenue that might have nothing to do with your forecast and your RM system. So, it's very hard to actually isolate a signal and say, okay, this change I made led to this change of revenue. Whereas forecast accuracy is much, much easier to pinpoint and say, this change I made led to this change in forecast. So, measuring forecast accuracy is much easier, but it's not really the end goal. But because revenue is our main goal at the end, basically the best forecast accuracy metric for our purpose is the one that is the most correlated with revenue.
Jonas Rauch: So, we want to find this metric that if we know this forecast metric from our RM system, it's the best predictor for what we think the effect on revenue is. And if we have a metric that's completely uncorrelated with revenue, and I'll show you in a minute an example, then it's not really helpful in solving our ultimate goal. But if we have a strong correlation, we can say, "Okay, we minimize this forecast error, that's going to lead to better revenue." If we see that strong relationship, that's a great metric to use. So that's the idea behind the study. Identify which of these forecast metrics are most correlated with revenue, that we are really interested in.
Jonas Rauch: So, the way the study was set up, we created a lot of simulation runs, in this case a quarter million runs. We created random scenarios of demand, and these are all single flights. So, keep that in mind: this is a very simple setup, all single flights, but with a bunch of different demand scenarios. Demand is high, demand is low, more demand in the higher classes, more demand in the lower classes, late arriving, early arriving, a huge mix of different scenarios. For each one, we create an artificial forecast error, so we don't actually run a forecast model in this study. We just add some random noise in various ways, and that gives us our forecast, so to say. It's not really a forecast, but it serves as one.
Jonas Rauch: And then we can compute the optimal revenue we could achieve in this scenario if we had a perfect forecast. If we know the true demand, forecast error zero, how much money can we make? And then we can compare that to how much revenue we get by using the flawed forecast that does have an error. So, the random error that we introduce in our forecast will make us lose some revenue, sometimes 1%, sometimes 2%, sometimes 5%, depending on how strong the error is and what it looks like. And then we can compute all of these different error metrics.
Jonas Rauch: So, then we want to see how these error metrics, across all these scenarios, relate to the revenue. And we want to find the one that's most directly related to the revenue that we are interested in. So that's the kind of relationship we want to find. And we do this by training another little machine learning model, in this case a so-called GBM (gradient boosting machine). It doesn't matter exactly what it does. The idea is we take the error metrics that we compute and we try to use them to predict the revenue loss that the forecast error introduces. If we have a good predictor there, then our metric is useful and informative. And if we can't use the metric to predict revenue, then it's not a useful metric. That's the idea.
Jonas Rauch: This is kind of what the data set looks like in abbreviated form. We have all these different error metrics, BIAS, Mean Squared Error, and so on, and there are different levels of aggregation on which we can compute them. And then we have the revenue difference, the actual revenue versus the optimal. And we want to use one of those, for example BIAS, to predict this response. We train a model that uses this feature to predict revenue, we create the prediction, and then on an out-of-sample test set, not in-sample but out-of-sample, we look at the correlation between the predicted revenue that we get from this model and the actual revenue that we see in the simulation. If that's a high correlation, we are happy. If it's a low correlation, the metric isn't useful for what we are trying to do. So that's how we set it up.
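A rough sketch of that evaluation loop, assuming the simulation output has already been collected into a table with one row per run. The column names, the scikit-learn gradient boosting model, and the 70/30 split are all illustrative assumptions, not the exact PROS setup.

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

def metric_usefulness(runs: pd.DataFrame, metric_col: str) -> float:
    """Train a GBM on a single error-metric column and return the
    out-of-sample R^2 of predicted vs. actual revenue loss.
    Higher means the metric is more informative about revenue."""
    X = runs[[metric_col]].values
    y = runs["revenue_loss"].values  # optimal revenue minus achieved revenue
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=0)
    model = GradientBoostingRegressor().fit(X_train, y_train)
    return model.score(X_test, y_test)

# Hypothetical usage, comparing candidate metrics across simulated runs:
# runs = pd.read_csv("simulation_runs.csv")
# for col in ["bias_flight", "mse_class_dcp", "mape_class_dcp"]:
#     print(col, metric_usefulness(runs, col))
```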
Jonas Rauch: And so, this is the high-level summary of the results, and we could spend an hour on this slide. There's so much going on, and I'm going to try to go through it quickly. The first thing to notice is that on the vertical axis, we have these different metrics. Here I'm looking at Mean Squared Error; MAPE, which is the Mean Absolute Percentage Error, like a 10% error, let's say; and then BIAS, which, like Ross said, is just the average. And then for each one, we have a weighted and an unweighted version, where the weighted version uses the fare amount of each fare class as a weight, essentially saying that getting the forecast right in the higher classes is more important than getting the forecast right in the lower classes, which intuitively makes sense. And we see here that weighting actually helps in all the cases.
Jonas Rauch: Then on the horizontal axis we have the level of aggregation on which we compute this. So, the very lowest would be like a DCP class level. You can also aggregate across the DCPs, just look at the class level, or the other way around, you can aggregate first across the classes. Just look at total forecast by DCP, ignore the class dimension, or you could aggregate across everything. Just look at total demand forecast versus total bookings. So, this is the highest level, and you see some cells here are empty, and the reason is that BIAS really is the same as aggregating everything to the highest level. So, once you aggregate, you compute total bookings and total forecast, and then you compare at the end, that's essentially what BIAS is.
Jonas Rauch: And so, any of the other metrics computed on that level is essentially the same thing. If the BIAS, for example, says I'm 10% over-forecasting on average, you can take the absolute value and you have a MAPE on that level as well; it's 10%, but it loses the directionality. So, it's important to note that as you aggregate things to higher and higher levels, you're getting closer to this BIAS metric by definition. And that's why it's kind of separated out here on the slide.
Jonas Rauch: And the important thing about BIAS and aggregation is that errors cancel out. So, if I overforecast one class and I under-forecast another class, then if I aggregate those first, I sum them up, those errors will cancel out. And in the BIAS I could say, Okay, I have a good prediction on average, maybe zero error on average, even though on the lower level I do have an error. Some forecasts are too high, some are too low, but they cancel out. And so, you could think that that's a bad thing, right? If the errors cancel out, that's bad because there's an error and you don't see it, that's bad. But it does turn out that you actually want that to happen, and I'll show you in a minute why that is.
Jonas Rauch: So, looking into this in a little more detail: first of all, we see that BIAS is the best predictor. This is again the R squared, the correlation between the predicted revenue and the actual one. BIAS, up here at the highest level of aggregation, is the best predictor for revenue. A higher aggregation level is better, so if you go from left to right, that is again the same effect: aggregating more makes things more similar to BIAS, so it's kind of the same finding. Weighting with price is always good, which again intuitively makes sense. If you look, for example, right here at MSE with weighting, it's still bad, but it's at least better than without the weight. So this intuitive idea, that I want to get the forecast in the high classes right, shows up in the data as well.
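One way to read those two findings in code is to compute the signed error, weight it by fare, and aggregate up to the flight level so that errors in opposite directions can cancel. This is a hypothetical sketch with illustrative column names, not the study's actual implementation.

```python
import pandas as pd

def fare_weighted_flight_bias(df: pd.DataFrame) -> pd.Series:
    """Fare-weighted BIAS per flight.

    Assumes one row per flight/class/DCP with columns
    'flight', 'fare', 'forecast', 'bookings' (illustrative names).
    Signed errors cancel within a flight, and higher-fare classes
    get more weight.
    """
    df = df.copy()
    df["weighted_error"] = (df["forecast"] - df["bookings"]) * df["fare"]
    grouped = df.groupby("flight")
    return grouped["weighted_error"].sum() / grouped["fare"].sum()

# Example: over-forecasting one class and under-forecasting another
# largely cancels out at the flight level.
data = pd.DataFrame({
    "flight":   ["F1", "F1"],
    "fare":     [800.0, 300.0],
    "forecast": [5.0, 10.0],
    "bookings": [7.0, 8.0],
})
print(fare_weighted_flight_bias(data))  # (-2*800 + 2*300) / 1100 ≈ -0.91
```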
Jonas Rauch: And then the last one, and this is kind of an artifact here: measuring MAPE on the lowest level looks really great. These numbers here look great; it's almost or exactly the same as BIAS for some reason, even though I said MAPE is bad and a lower level of detail is also bad, yet somehow the combination looks good. This is a technical artifact, and I could spend some time talking about it, because it's the kind of technical thing that can lead to very misleading results. So, this is really not a good thing; it's not a good metric, even though it looks good over here.
Jonas Rauch: So, to explain this in a little bit more detail: why is Mean Squared Error, for example, measured on the lowest level, such a bad metric? The reason is that if you plot this, with the Mean Squared Error on the horizontal axis and the revenue on the vertical axis, you have cases where your error is really high but you have really good revenue, right up here. That's really good. You also have cases where your error is really low but your revenue is terrible, which is pretty bad. So, there are cases where the forecast looks great by this metric but you're not making money, and cases where the forecast looks terrible measured by this metric but the outcome is actually okay.
Jonas Rauch: So, that means there's not a lot of information there; this MSE doesn't really tell us much about revenue. It's still generally the case that lower error means more revenue, and that's good. But in principle, the issue is that mostly what you're measuring is random noise. If your prediction is 0.1 passengers on average, most of the time you'll see zero passengers. Sometimes you'll see one passenger, and if you measure the difference between the 0.1 and the one passenger that arrived, that looks like a huge error. But actually it was just random chance that someone showed up. And that one passenger is kind of an outlier already.
Jonas Rauch: So, you're measuring the noise in the process of the customers themselves. The demand is random and you're measuring that, but it doesn't have anything to do with your forecast itself. That's the problem with MSE: you're measuring the noise and not the signal, and the Poisson variance from the process dominates the error. The opposite is true when you look at BIAS. You see this super nice picture where, if the BIAS is zero, we are making the most money. If we have a negative BIAS, we under-forecast and we lose money. If we have a positive BIAS, we also lose money. It's very intuitive, there's a very clear shape here, and you know that if my BIAS is close to zero, I'm really not too far off from the optimum, and if my BIAS is too negative or too positive, I am losing money.
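A tiny simulation, purely illustrative, makes the point about noise: for Poisson demand with mean 0.1 per cell, even a perfect forecast of 0.1 carries a large squared error, and a clearly biased forecast barely moves the MSE, while BIAS separates the two immediately.

```python
import numpy as np

rng = np.random.default_rng(0)
bookings = rng.poisson(0.1, size=100_000)  # simulated demand per class/DCP cell

for name, forecast in [("perfect forecast 0.1", 0.1), ("biased forecast 0.3", 0.3)]:
    mse = np.mean((forecast - bookings) ** 2)
    bias = np.mean(forecast - bookings)
    print(f"{name}: MSE={mse:.3f}, BIAS={bias:+.3f}")

# Approximate output:
#   perfect forecast 0.1: MSE=0.100, BIAS=+0.000  <- MSE is pure Poisson variance
#   biased forecast 0.3:  MSE=0.140, BIAS=+0.200  <- BIAS exposes the error cleanly
```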
Jonas Rauch: So, this is very clear: if this number is in a certain range, I'm happy, and if it's outside of that range, I'm not happy, which is very different from the MSE. This is very interesting. And the reason, like I said, is that the errors can cancel out in the BIAS. We can under-forecast here and over-forecast here, the errors cancel out, and BIAS looks good. The thing is that in the optimizer, technically, when you look at how we optimize, these things also cancel out. If I over-forecast one point of sale and I under-forecast another point of sale, in the process of optimization these also kind of average out. The bid price in the end is much more interested in the total demand versus your capacity, and not as much in how many people book in this individual class at this DCP at that very low level.
Jonas Rauch: So, that's kind of interesting, that in the optimizer things can also cancel out, and that's why this is fine. I don't want to spend too much time talking about the next one because it's kind of technical, but why does MAPE on the lowest level look good? The reason is what the absolute error does when you use it on very small numbers. In my example here, the true mean is 0.2, and the best prediction you can have, in terms of absolute error, is usually zero. If you have the time at some point to look at the presentation afterwards, you can calculate this by hand, but essentially what happens is that if you have a prediction of zero, you have a certain error. If you increase your forecast from zero to the true mean, which should be better, you actually increase this MAPE if you measure it on this level. And that's the bad thing, right?
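The hand calculation Jonas refers to can be sketched as follows; the numbers assume Poisson demand with mean 0.2, which is an assumption made here to illustrate the effect, not necessarily the exact distribution used in the study.

```python
from math import exp, factorial

def expected_abs_error(forecast, mean, kmax=20):
    """E|X - forecast| for X ~ Poisson(mean), truncated at kmax terms."""
    return sum(abs(k - forecast) * exp(-mean) * mean**k / factorial(k)
               for k in range(kmax + 1))

mean = 0.2
print(expected_abs_error(0.0, mean))   # ~0.20: forecasting zero
print(expected_abs_error(mean, mean))  # ~0.33: forecasting the true mean is "worse"
# Minimizing the absolute error at this tiny level pushes the forecast toward
# zero, even though the true mean is the forecast you actually want.
```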
Jonas Rauch: So, if you minimize this metric, you are somewhere on the left here, and you actually have very low revenue. And this picture looks the same as BIAS, because this MAPE essentially is a linear function of BIAS, so it looks great. But the problem is that you don't usually know where the maximum is. If you had this whole picture, you'd be happy. If you only know one point on it, you don't actually know whether you're good or bad, because you don't know where the maximum of the curve is. In this case, it happens to be at 1.8, but it could be that the maximum is at 2.3 or 1.0. So, if you measure this number and you see a value of 1.5, you have no idea whether that's good or bad, because you don't know where the maximum is.
Jonas Rauch: So, that's the difference to BIAS. It looks the same, but for BIAS, we know zero is good. Left of zero is bad, right of zero is bad, zero is good. Here, any number could be good, could be bad, depending on whatever, so it looks nice in the chart and technically it's okay, but in practice it doesn't help us, because we don't know the target where we want to be. So, it's a pretty technical thing, but it's important to know. Now again, MAPE, when you aggregate up, and Hasan is going to talk about that, this small numbers problem goes away, and this becomes a useful metric.
Jonas Rauch: So again, by aggregating things, allowing errors to cancel out, and getting rid of the noise, suddenly you're seeing some signal, and this small numbers problem goes away. So that's important to note again: MAPE on the lowest level is a bad idea. As you aggregate to higher levels and measure it, let's say on the whole flight level instead of class/DCP, then it becomes useful. So it's always important to keep that in mind.
Jonas Rauch: So to summarize: BIAS is much more important than variance, because that's also what matters in the optimizer. And we should measure the forecast error not on the level that it's estimated on, not on the tiny class/DCP/point-of-sale level, but on some more useful level where the numbers are actually meaningful, ideally the level on which the capacity constraint lives, like the flight, right? On the compartment level, for example.
Jonas Rauch: Absolute error on small numbers is very dangerous, for the reason I talked about. And if we do anything by class and we have the price available, we should use it as a weight, because it puts more emphasis on the classes that really matter, which is the top of the fare structure. That's generally a good thing, and I can always recommend doing it. And with that, I'll hand it back over to Ross.
Ross Winegar: Okay, thank you, Jonas. So, two really interesting things from what Jonas said. If you were just thinking, what's the right level to measure accuracy: is it the level at which the optimization consumes the forecast, or the level at which the optimization provides outputs? If you go back five or 10 years, PROS, and me especially, were saying you should measure accuracy at the level at which it's consumed by the optimization. And then Jonas did this study, and I was like, oh my gosh, maybe I've been saying the wrong thing for five years, so it was very interesting for me to hear that. And it's funny to hear that BIAS is the best predictor of revenue. We get lost in these complicated metrics and such, and sometimes the simplest way is really the best.
Ross Winegar: Hasan will also talk about the list of different metrics and the pluses and minuses of each of them. So, whatever works for you; everything is different, and it depends on what you're going for. I'm a big fan of looking at pictures. If you come and say my forecast is high, or my forecast is low, let's go look at the picture. But that's not practical to do for an entire network, especially on an ongoing basis. So it is necessary to have monitoring of the forecast: do the study, put numbers on it, quantify it, and help know where to go look.
Ross Winegar: So, the first thing to do if you're going to do a study is to look at who the audience for it is. You have the executive, who wants to know how accurate the forecast is. They're going to speak in conversational language; they're not necessarily going to understand the details of the math, though maybe some do, with engineering backgrounds and such. They're going to be looking for ways to improve the business at a more high-level, strategic and operational view. Whereas if you talk to an analyst, they are also deeply concerned about the forecast, but they're looking at it a completely different way. They're saying, here's my market, I know my market inside and out: where should I be making a change, and what should that change be? So, if you're designing a study for an analyst, you're going to be looking at lower levels: compartment, fare class, the time-of-day window, and things like that.
Ross Winegar: And then, over here on the side, we have the data scientists or the operations researchers, who oftentimes will do the study. They can get lost in calculating these metrics and errors and things like that, but they're doing it in order to help the other people on the team. So, a study done for data scientists, presented by data scientists, isn't necessarily the most useful for an analyst or an executive. Keep in mind that you're quantifying the error for others, to help with operations and the bigger picture, for the executive or for the analyst.
Ross Winegar: By the way, I created all these pictures using Microsoft Copilot. I put in "generate a picture of an airplane surrounded by apples and hold-back sets and a random world," and it comes up with this picture here. So, if you're doing a study, just make sure that you're doing it correctly. We have a lot of experience seeing how other airlines have done studies, and we've done a lot of studies ourselves on behalf of airlines. One of the most important things is to make sure you're comparing the same thing. If you are looking at an unconstrained forecast, make sure you're comparing it to unconstrained bookings, not net bookings. If you're doing a willingness-to-pay forecast, you want to look at the conditional forecast based on the price actually offered and the actual bookings that came in.
Ross Winegar: You want to make sure that what you're comparing is actually the same thing; mismatches happen quite often. For instance, you might be comparing a P6 forecast to an RME forecast, where P6 does the net pick-up bookings and RME does the incremental DCP at the gross unconstrained level, and so you have to make sure that it's the same.
Ross Winegar: The next thing is BIAS toward the hold-back set. In that example I gave, they snapshotted the forecast at the very top, the bookings came in low, and they said the forecast was off. Well, it was off because there was a change in the market. And if you just have that one snapshot, you're going to have a biased set; you're measuring at the very peak of the market. So we like to look at many different snapshots over time, monitor how the forecast is evolving and changing, and make sure that there's not some market change going on in there.
Ross Winegar: Something else very common is the unconstraining issue: the unconstraining algorithm is itself an output of the forecaster. And so airlines will say, well, I'm measuring the unconstrained bookings against the unconstrained forecast, so I'm measuring a forecast against a forecast. And oftentimes they'll say, oh, well, why don't I do it on an area where there's no unconstraining going on? Basically, where the classes are open, the cabin is not going to go full, the bid price is zero, and everything's open. But in that case, you're actually biasing your hold-back set as well, because you're only looking at low-demand flights, or you're only looking at the top fare classes of them.
Ross Winegar: And funnily enough, that's actually when your RM system is less important, because your RM system matters when your bid price is non-zero, when you are opening and closing classes. So it's not necessarily the best indication of how accurate your forecast is for your RM system.
Ross Winegar: Next is the level of detail of the comparison. At what level are you measuring the error? It sounds intuitively right that you want to measure the error at the level at which the optimization consumes the forecast, but maybe not. You actually want to do it at the level of the capacity constraint; the flight compartment level seems to be the best predictor of revenue for measuring error.
Ross Winegar: We live in a random world. So, if you have a great forecast and you've calculated your BIAS and it's 16.7, that sounds high, but that market might be very volatile and it might be fine, or it might have been good before and now it's no longer good. So, there's not necessarily a threshold or a number that you can define as the right one to look at. It's always changing, right? It's always variable, and something that might be good today might not be good tomorrow. And then lastly, metrics matter. There are lots of choices of metrics. I showed just the four most common, simple ones at the beginning, but as you tune what makes sense to look at for your business, it's not that one is right or wrong; just understand what you're looking at and how to work with it.
Ross Winegar: Okay, and so to tie it all together here. When we're doing a study, make sure that we address all the audiences, that we have done it in a way that for the executive, easy way to know if forecasts are accurate or not, conversational reports that are easy to interpret and understand, where the analyst is going to want something very different, very low-level recommendations, where to do it and how. And the data scientist works together as part of the ecosystem here to create the reports, maintain them, and make sure that everybody's getting value, and it all matches up together. Okay. All right. So, let me hand it to Hasan for an actual example.
Hasan Abdul Tawab: All right, thank you, Ross and Jonas. So, this is something we did in parallel to their simulation study. These are some of the lessons we learned, and you can maybe look at implementing some of these at your particular airline. We said the audience is critical, but before we get to the audience, it's important that all the audiences understand that forecasting is an iterative process: you have your bookings, which get unconstrained; those unconstrained bookings are used to create a forecast for the future; the user adjusts that forecast; that goes into the optimizer, which results in bid prices; there are some rules on top of that, which generate a final availability; and then that availability drives the intakes going forward.
Hasan Abdul Tawab: So, it's important to understand that this whole thing is cyclical. Now, talking about the audience: for us, there are three main audience groups, which closely parallels what Ross talked about. The first one is the demand analysts, to guide them on where they need to actually influence the forecast. For that, what we found is that the level of detail that worked for us was departure month, point of sale, OD, class block, DCP group, day of week, and departure time. The second audience is their managers, and executives to some extent, who want to understand which markets and which points of sale are doing better, to compare them and see who's doing a better job, which regions are doing a better job. And here, the level of detail was departure month and point of sale.
Hasan Abdul Tawab: And then the last audience is the data scientist or operations research person, and for them, what matters is how the system is doing overall at the network level, so you're just looking at departure month and network level. So, metrics matter again, but here we realized we needed something more robust, right? If you look at the common error metrics, BIAS, Mean Absolute Error, RMSE, MAPE, all of them have issues. Here are some of the common ones that, if you've done a forecast accuracy calculation, I'm sure you've also encountered: division by zero, and explosion of errors when the denominator is small.
Hasan Abdul Tawab: So, what we had to come up with was our own kind of custom metric, and for us, that was a weighted symmetric Mean Absolute Percentage Error. I'll talk about how that is calculated on the next slide. This is an example of how those metrics look for a given forecast entity, over here. But it's also important to remember that there's no perfect metric. I'm a huge fan of well-fried chicken, but I can say that there's no secret sauce here. You have to find something that works for you, and it's really challenging to design a perfect forecast error metric. It has to meet user expectations of what the error should look like, but at the same time it has to be mathematically rigorous, it has to be easy to interpret, and it has to be something that indicates revenue.
Hasan Abdul Tawab: I mean, when you look at all of these requirements, it's like a laundry list that's really, really difficult to satisfy with a single metric. But the metric that we came up with does okay on all of these, I think, and if you were to make it more complicated, that comes with its own set of problems as well. So, we decided to go with this weighted symmetric Mean Absolute Percentage Error. Okay, so how is this metric calculated? You are probably familiar with the Mean Absolute Percentage Error formula: just your forecast minus your actuals, divided by the actuals, and you take an average of that.
Hasan Abdul Tawab: To make it symmetric, and why do we need to make it symmetric? Because if you over-forecast by, say, two bookings, or you under-forecast by two bookings, with MAPE you'll get different errors depending on which direction the error is in. So, to make it symmetric, so that the error is the same whether you're over-forecasting or under-forecasting by the same amount, what we decided to do is divide by the sum of the unconstrained forecast plus the unconstrained actuals. That way it's balanced.
Hasan Abdul Tawab: So, we have the symmetric MAPE. Now how do we get to the weighted version? Simply multiply it by a weight. You can choose the weight you want; for us, it was the total demand level, by which I mean the sum of the unconstrained forecast and the unconstrained actuals. This controls for the cases where you forecasted high but your actuals are zero, or your forecast is zero and your actuals are high. By summing both of them, you're giving weight to both the forecast and the actuals.
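Based on that description, one reading of the weighted symmetric MAPE is sketched below; the exact aggregation and naming at Etihad may differ from this assumption.

```python
import numpy as np

def weighted_smape(unconstrained_forecast, unconstrained_actuals):
    """Weighted symmetric MAPE, one reading of the description in the talk.

    Per entity: symmetric error = |F - A| / (F + A), weight = (F + A).
    The weighted average then simplifies to sum|F - A| / sum(F + A), which
    is symmetric in over- vs. under-forecasting, avoids division by zero
    as long as some entity has demand, and down-weights tiny entities.
    """
    f = np.asarray(unconstrained_forecast, dtype=float)
    a = np.asarray(unconstrained_actuals, dtype=float)
    return np.abs(f - a).sum() / (f + a).sum()

# Illustrative example: over- and under-forecasting by 2 give the same error.
print(weighted_smape([10, 6], [8, 8]))  # (2 + 2) / (18 + 14) = 0.125
```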
Hasan Abdul Tawab: And then finally, accuracy. Calculating MAPE is good, but we found that, particularly for executives, error is not the best number to show; even MAPE is not the best. Sometimes they look for an accuracy, a number between zero and 100, because I think our brains are conditioned to think in terms of gauges that go from zero to 100, whereas error can be arbitrarily high. So, in order to convert your symmetric MAPE to accuracy, you have to set a threshold: if your error is above the threshold, you are inaccurate; if you are below it, you're accurate. How do we select this threshold? I will come to that in a bit. So, just to summarize, this approach reduces low-value explosions, it focuses on the entities that have the highest forecasts, and it's relatively balanced and scaled.
Hasan Abdul Tawab: Okay. So, the next thing is selecting the level of detail. This is important because of the different audiences; you don't want to calculate your error at the same level of detail for everyone. If you look at the lowest level, which is the 11 dimensions plus the DCP, so 12 dimensions, you have a lot of zeros. At least we do; we're not sure about other airlines. When you have that many zeros, you're not going to get something very useful when you aggregate it up. So, what we found is that for the demand analysts, who want guidance on where they need to influence the forecast, the level that works for us is departure month, point of sale, OD, DCP group and class block, where the zero rate is a little over half and you have a relatively okay number of bookings per entity, something like 10 to 20, which is good. It's not like 0.5 or 0.2.
Hasan Abdul Tawab: And then for the management, where we look at departure month, point of sale, DCP group, and class block, the zero rate is 13%, and here we have bigger numbers, so it's easier to interpret. This is just an illustrative example of what the numbers look like for a given market. You have to play with which dimensions to include, how to aggregate, and how to roll up your dimensions. Obviously, for us, we have class blocks based on our fare families; maybe that's different for you. We have DCP groups based on our typical booking intake cycle. It could be different: you could define different DCP groups for different markets if you have different booking cycles. We decided not to go that way, so we have the same DCP group structure everywhere.
Hasan Abdul Tawab: Okay, so the next thing I'll talk about is how we select that threshold when we get to accuracy. We are fortunate that in areas like on-time performance, there's an industry standard, right? If you're delayed by more than 15 minutes, you're delayed; if it's less than 15 minutes, you're on time. We don't have that luxury when it comes to measuring forecast accuracy, so we have to use some sort of heuristic, and the heuristic we used is looking at the distribution of the errors. If you think about it, the purpose of having an accuracy number is to convey information: what percentage of forecast entities are accurate, and what percentage are not?
Hasan Abdul Tawab: So, if you set your threshold very, very low, everything is going to come out as inaccurate, which you might take as a grand statement, but how is that useful? Where do you need to act? If you set your threshold too high, everything is going to be accurate, and then you might think, oh, amazing, I'm doing a great job, but again, that's not really useful either. So you want to set your threshold at the place where it gives you the maximum information gain. For us, when we plot the cumulative standard deviation of these errors, you want to hit near the peak where that gets maximized, and for us, that was around 35%.
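Hasan's exact procedure looks at the cumulative standard deviation of the errors; the sketch below is a simplified, hypothetical stand-in for the same idea of maximum information gain. It scores each candidate threshold by the variance of the resulting accurate/inaccurate flag, which peaks when the split is most informative, and is not necessarily the computation Etihad uses.

```python
import numpy as np

def choose_accuracy_threshold(errors, candidates=None):
    """Pick an accuracy threshold for per-entity wsMAPE errors.

    Hypothetical heuristic: score each candidate threshold by the variance
    of the accurate/inaccurate flag it induces (p * (1 - p), highest at a
    50/50 split) and return the best-scoring threshold.
    """
    errors = np.asarray(errors, dtype=float)
    if candidates is None:
        candidates = np.linspace(0.05, 0.95, 19)
    scores = []
    for t in candidates:
        p = (errors <= t).mean()   # share of entities flagged accurate
        scores.append(p * (1 - p))
    return candidates[int(np.argmax(scores))]

# Usage: flag entities and report an overall accuracy percentage.
# threshold = choose_accuracy_threshold(wsmape_per_entity)
# accuracy = (wsmape_per_entity <= threshold).mean() * 100
```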
Hasan Abdul Tawab: So, obviously what'll happen is if you do a good job and over time the analysts improve the errors, the distribution shifts to the left. Once the distribution shifts to the left, you need to move your threshold as well. Now, some analysts may not be happy about that, but that's not really my problem.
[laughter]
Hasan Abdul Tawab: I love to move goalposts. So yeah, that's basically what we found in our learnings over the last six or seven months while we were implementing this. If you were at my previous talk, you heard that by implementing this, we've noticed a significant improvement in our forecast accuracy when we calculate forecast error using this method, and we've also seen an increase in revenue. So that's all I have. Thank you very much.
[applause]