Error Prediction

        Next, we try to use these relationships to predict the value of the error. In this case, the inputs are the NWP model's predicted outputs for a specific time and location, which we feed into an "error prediction" model; this model then gives us an estimate of the NWP's anticipated error. To avoid naming confusion we call it the "bias prediction" model rather than the error prediction model. For this purpose we use a RandomForest of 450 regression trees to predict the signed value of the temperature error (the bias) from six attributes of the NWP model's output (such as the predicted temperature, precipitation, etc.), together with the height of the location and the number of hours after the start of the forecast.
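        As a rough illustration, here is a minimal sketch of such a model with scikit-learn. Random placeholder data stands in for the real NWP outputs and observations, and the exact feature layout is an assumption based on the description above:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Hypothetical feature layout: six NWP output attributes (e.g. predicted
# temperature, precipitation, ...) plus station height and forecast lead
# time in hours -- eight columns in total. Random data stands in for the
# real NWP outputs here.
rng = np.random.default_rng(0)
X = rng.normal(size=(12000, 8))
# Target: the signed temperature error (bias) = observed - predicted.
y = rng.normal(size=12000)

# 66% of the cases for training, the rest for testing (as in the text).
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.66)

rf = RandomForestRegressor(n_estimators=450)  # 450 regression trees
rf.fit(X_train, y_train)
pred_bias = rf.predict(X_test)  # the RandomForest's estimate of the NWP bias
```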
        To better understand the application of bias prediction to the NWP model and its consequences, consider the following formulas and sample numbers:

           ObsvTemp = PredTemp + Bias
           ErrorOfPredBias = Bias - PredBias
           ObsvTemp = (PredTemp + PredBias) + ErrorOfPredBias

           12 = 8 + 4
            1 = 4 - 3
           12 = (8 + 3) + 1

        As seen in the formulas, the NWP model's output temperature (shown in blue) for a specific time and location has a bias (red) relative to the actually observed temperature (black). The random forest uses the features mentioned above to predict the value of this bias (orange) for each particular forecast case. Of course, our prediction of the bias has an error of its own (green).


        Consequently, whenever the NWP produces an output, we can feed that output into the RandomForest and obtain a prediction of the bias as well. By doing this, we get a new prediction of the temperature: the sum of the value already predicted by the NWP and the bias predicted by the RandomForest (the parentheses in the last formula). And what is the error of this new method? It is exactly the error of the bias prediction model, as can be seen in the formulas and the numerical sample above.
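        The correction step and this error identity can be spelled out in a few lines, using the sample numbers from the formulas above (variable names mirror the formulas; in practice these would be per-case arrays):

```python
import numpy as np

# Sample numbers from the formulas above; in practice these are per-case arrays.
obs_temp  = np.array([12.0])  # ObsvTemp
pred_temp = np.array([8.0])   # PredTemp, the NWP output
pred_bias = np.array([3.0])   # PredBias, the RandomForest output

corrected_temp = pred_temp + pred_bias                # (PredTemp + PredBias) = 11
new_error = obs_temp - corrected_temp                 # 12 - 11 = 1
bias_pred_error = (obs_temp - pred_temp) - pred_bias  # Bias - PredBias = 4 - 3 = 1
assert np.allclose(new_error, bias_pred_error)        # identical by construction
```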


[Fig. 10: temperature error distributions of the NWP forecast (top) and after adding the RandomForest-predicted bias (bottom)]
        Based on this discussion, the RandomForest's error will be our target of analysis, since it is the error of the weather forecast when we use the RandomForest's output to correct the NWP's prediction. Fig. 10 shows the distribution of the NWP model's temperature forecast error; the new error distribution after adding the bias predicted by the RandomForest is shown at the bottom of the figure. For this evaluation, 66% of the data was used for training and the rest (about 4000 predictions) for testing.

        The NWP error distribution has a mean of -1.41 and a standard deviation of 3.84. However, when the bias prediction is used, the error distribution is centered around zero and has a much smaller standard deviation of 1.87. To get a better sense of this change we can return to the uncertainty goal, in which we would like to have a confidence interval for our forecast values.

        Statistical theory on prediction intervals gives the 95% confidence interval of the output as:    mean +/- 1.96*standard deviation
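        Plugging the reported error statistics into this formula shows the change concretely (a small sketch; the numbers are the means and standard deviations quoted above):

```python
# 95% confidence intervals from the reported error statistics.
nwp_mean, nwp_std = -1.41, 3.84  # raw NWP temperature error
rf_mean,  rf_std  =  0.00, 1.87  # error after RandomForest bias correction

nwp_ci = (nwp_mean - 1.96 * nwp_std, nwp_mean + 1.96 * nwp_std)
rf_ci  = (rf_mean  - 1.96 * rf_std,  rf_mean  + 1.96 * rf_std)
print(nwp_ci)  # (-8.94, 6.12)  -> width of about 15.1 degrees
print(rf_ci)   # (-3.67, 3.67)  -> width of about 7.3 degrees
```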

        The normal distribution curves shown in the figure confirm that the error distributions closely follow a normal distribution. In the RandomForest case the errors are even more concentrated around zero than a normal distribution would suggest.
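        Beyond the visual comparison, one could also check this normality assumption with a formal test; a minimal sketch with SciPy, where the `errors` array is a random stand-in for the real test-set errors:

```python
import numpy as np
from scipy import stats

# `errors` is a random stand-in for the ~4000 test-set errors.
errors = np.random.default_rng(0).normal(loc=0.0, scale=1.87, size=4000)

# D'Agostino-Pearson test: the null hypothesis is that the sample is normal.
stat, p_value = stats.normaltest(errors)
print(p_value > 0.05)  # True -> normality is not rejected at the 5% level
```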


[Fig. 11: 95% confidence intervals of the temperature forecast, before and after RandomForest bias correction]
        Finally, Fig. 11 depicts the 95% confidence intervals computed with the formula above. As can be seen, using the RandomForest to predict the bias and correcting the WRF output by adding this bias yields a considerably narrower confidence interval and hence a more reliable temperature forecast.


[Fig. 12: wind speed error distributions, before and after RandomForest bias correction]
        We have applied the same technique to the wind speed: with the RandomForest we obtain a zero mean instead of -1.77 and a standard deviation of 2.59 instead of 4.05. However, the wind speed errors (in both cases) do not follow a normal distribution, so the formula above cannot be applied (Fig. 12). Nevertheless, we can use Chebyshev's inequality to obtain the confidence intervals. The resulting intervals are both wider, but the RandomForest-based intervals are still about 36% narrower.
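        To see where the 36% figure comes from: Chebyshev's inequality states that P(|X - mean| >= k*std) <= 1/k^2, so covering 95% of the cases requires 1/k^2 = 0.05, i.e. k = sqrt(20), roughly 4.47 instead of the 1.96 used for normal errors. A small sketch with the reported wind speed statistics:

```python
import math

# Chebyshev: P(|X - mean| >= k*std) <= 1/k**2, so a 95% interval needs
# 1/k**2 = 0.05, i.e. k = sqrt(20) ~ 4.47 (instead of 1.96 for normal data).
k = math.sqrt(1 / 0.05)

nwp_std = 4.05  # raw NWP wind speed error (mean -1.77)
rf_std  = 2.59  # error after RandomForest bias correction (mean 0)

nwp_width = 2 * k * nwp_std      # about 36.2
rf_width  = 2 * k * rf_std       # about 23.2
print(1 - rf_width / nwp_width)  # about 0.36 -> roughly 36% narrower
```

        Since both intervals use the same k and are symmetric around their means, the width ratio depends only on the two standard deviations, which is why the 36% reduction mirrors the drop from 4.05 to 2.59.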