Data Exploration

        The data set contains weather attributes that are observed and for which we also have predictions from the NWP model. Each record belongs to a specific station, which has its own location and height. Every observation-prediction pair carries a timestamp at hourly resolution, together with the number of hours elapsed since the start of the forecast. We currently focus on the attributes that matter most for power grid optimization, namely temperature and wind. Temperature is given in degrees Celsius, and wind is given as north and east components in meters per second.
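Because the wind is stored as components while the analysis refers to wind speed and wind direction, a conversion step is needed. The following is a minimal sketch of one common way to do it, assuming the meteorological convention in which direction is the bearing the wind blows from, measured clockwise from north. The function name and the sample values are hypothetical and are not taken from the data set.

```python
import math

def wind_speed_and_direction(east, north):
    """Convert east/north wind components (m/s) to speed and source direction.

    Assumption: direction is the bearing the wind blows FROM, in degrees
    clockwise from north (0 = from the north, 90 = from the east).
    """
    speed = math.hypot(east, north)
    if speed == 0.0:
        return 0.0, float("nan")  # direction is undefined for calm wind
    # Negating both components turns the "blowing toward" vector into the
    # "blowing from" vector; atan2(x, y) then measures clockwise from north.
    direction = math.degrees(math.atan2(-east, -north)) % 360.0
    return speed, direction

# Hypothetical sample: wind blowing toward the south-west.
speed, direction = wind_speed_and_direction(east=-3.0, north=-4.0)
```

If the data set uses the opposite convention (the direction the wind blows toward), the two minus signs inside `atan2` would be dropped.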


        In the first column of Fig. 5, the scatter plots show how predictions and observations relate to each other for the three major attributes: temperature, wind speed, and wind direction. For example, temperature appears to be predicted more accurately, and its predictions are therefore more strongly correlated with the observations.

        The box plots in the second column of Fig. 5 give a more detailed comparison of predictions and observations. Each box shows the central 50% of the values (the interquartile range) around the median, for both predictions and observations. This plot also confirms the high accuracy of the temperature prediction: the medians and interquartile ranges of prediction and observation match well. For the other two attributes, wind speed (in meters per second) and wind direction (source direction, in degrees), the model is much more conservative: its forecasts do not spread as widely as the observations do. There is also a considerable negative bias in the median of the wind direction forecast.
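The two effects read off the box plots, a shifted median and a forecast spread that is too narrow, can be expressed as two numbers. The sketch below is only an illustration with made-up values; the helper name, the sample lists, and the choice of the "inclusive" quartile method are assumptions, not the procedure used to draw Fig. 5.

```python
import statistics

def box_summary(values):
    """Return (first quartile, median, third quartile) of a sample."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q1, median, q3

# Hypothetical wind speeds in m/s, for illustration only.
observed = [1.0, 2.0, 4.0, 6.0, 9.0]
predicted = [2.0, 3.0, 3.5, 4.0, 5.0]

obs_q1, obs_med, obs_q3 = box_summary(observed)
pre_q1, pre_med, pre_q3 = box_summary(predicted)

# Negative value: the forecast median lies below the observed median.
median_bias = pre_med - obs_med
# Value below 1: the forecast interquartile range is narrower than observed.
iqr_ratio = (pre_q3 - pre_q1) / (obs_q3 - obs_q1)
```

An `iqr_ratio` well below 1 corresponds to the conservative forecasts described above, and a negative `median_bias` corresponds to a negative bias in the median.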

        The histograms in the last column give some insight into how the forecast errors are distributed. The temperature error is very close to a Gaussian distribution, while the histograms for the wind attributes show biased error distributions.
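A rough numerical counterpart to reading the histograms is to bin the errors and check their asymmetry. The sketch below assumes fixed-width bins and uses the population skewness as a simple symmetry indicator; both helper functions and the sample values are hypothetical and only illustrate the idea.

```python
import math
from collections import Counter

def error_histogram(errors, bin_width):
    """Count errors per bin; each key is the lower edge of its bin."""
    counts = Counter(math.floor(e / bin_width) * bin_width for e in errors)
    return dict(sorted(counts.items()))

def skewness(errors):
    """Population skewness: 0 for a perfectly symmetric sample.

    Not defined for a constant sample (zero variance).
    """
    n = len(errors)
    mean = sum(errors) / n
    m2 = sum((e - mean) ** 2 for e in errors) / n
    m3 = sum((e - mean) ** 3 for e in errors) / n
    return m3 / m2 ** 1.5

# Hypothetical error samples, for illustration only.
symmetric = [-2.0, -1.0, 0.0, 1.0, 2.0]
skewed = [-1.0, -1.0, -1.0, 0.0, 5.0]

histogram = error_histogram(symmetric, bin_width=2.0)
```

A skewness near zero together with a mean error near zero is consistent with the roughly Gaussian shape described for temperature, although it does not prove normality on its own.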


        In Fig. 6, the temporal curves show how the errors and their 95 percent confidence intervals change as the forecast reaches further into the future from the start of the prediction. The curves show that the model becomes more stable, and hence more accurate, over the first hours. There is an evident spike from the 18th through the 21st hour. The model then improves, and a similar increase in error is repeated about 24 hours later. One possible explanation for this spike is that evening periods are unstable and hard to predict. The error trend of wind speed follows roughly the same pattern. For wind direction, however, the model behaves significantly differently; this will be investigated later in the project.
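Curves of this kind can be produced by grouping the errors by the number of hours since the forecast start and summarizing each group. The following is a minimal sketch under the assumption that the interval is the normal-approximation interval of the mean, mean ± 1.96 · s / √n; the text does not state how the intervals in Fig. 6 were computed, and the function name and sample data are hypothetical.

```python
import math
import statistics
from collections import defaultdict

def error_by_lead_hour(records):
    """Group errors by forecast lead hour and return mean with a 95% interval.

    `records` is an iterable of (hours_since_forecast_start, error) pairs.
    Assumption: normal-approximation interval, mean +/- 1.96 * s / sqrt(n).
    Returns {hour: (mean, lower, upper)} with hours in ascending order.
    """
    grouped = defaultdict(list)
    for hour, error in records:
        grouped[hour].append(error)

    summary = {}
    for hour in sorted(grouped):
        errors = grouped[hour]
        mean = statistics.fmean(errors)
        if len(errors) > 1:
            half_width = 1.96 * statistics.stdev(errors) / math.sqrt(len(errors))
        else:
            half_width = float("nan")  # no spread estimate from one sample
        summary[hour] = (mean, mean - half_width, mean + half_width)
    return summary

# Hypothetical (lead hour, temperature error) pairs, for illustration only.
sample = [(1, 0.5), (1, -0.5), (1, 0.0), (18, 2.0), (18, 3.0), (18, 4.0)]
summary = error_by_lead_hour(sample)
```

Plotting the mean and the two bounds against the lead hour would make a spike such as the one around hours 18 to 21 visible as a jump in the mean, a widening of the interval, or both.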


We use the following abbreviations for the data fields:

HGT: Height
PTM: Predicted Temperature
PWS: Predicted Wind Speed
PWD: Predicted Wind Direction
PGP: Predicted Grid Precipitation
HAF: Hours after the Prediction
PSP: Predicted Surface Pressure
PMR: Predicted Mixing Ratio
TER: Temperature Error
TAE: Temperature Absolute Error
WSE: Wind Speed Error
WSA: Wind Speed Absolute Error
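
The four error fields can be derived from the predicted and observed values of a record. The sketch below assumes that an error is defined as predicted minus observed; the text does not state the sign convention, so this is an assumption, and the function name and sample values are hypothetical.

```python
def error_fields(predicted_temp, observed_temp, predicted_speed, observed_speed):
    """Derive TER, TAE, WSE and WSA for one record.

    Assumption: error = predicted - observed. Swap the operands if the
    data set uses the opposite sign convention.
    """
    ter = predicted_temp - observed_temp
    wse = predicted_speed - observed_speed
    return {"TER": ter, "TAE": abs(ter), "WSE": wse, "WSA": abs(wse)}

# Hypothetical record: temperatures in degrees Celsius, speeds in m/s.
fields = error_fields(predicted_temp=12.5, observed_temp=14.0,
                      predicted_speed=6.0, observed_speed=4.5)
```

Under this convention a negative TER or WSE means the model under-predicted the observed value, while TAE and WSA discard the sign and measure only the size of the miss.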