Ozone Forecasting and the UH Statistical Model

Statistical models are based on the fact that day-to-day air quality in a region is mostly controlled by the weather. Because weather is hard to predict, ozone is even harder to predict. To forecast ozone perfectly, one has to know everything about the emissions (pollutants emitted into the atmosphere by factories, cars, vegetation, etc.) and the reactions and interactions of all chemical species, in addition to the weather.

Emissions are generally stable over a time frame of a few years. Although occasional emission "upsets" or "events" release extra pollutants into the atmosphere, the amounts are usually low and affect only a very limited area. Large emission upsets are rare, typically occurring once every few months. Another major factor that contributes to air quality is the transport of pollutants from other regions. Currently there is no reliable model that provides good estimates of transport, although there are models by NASA and by Harvard University that are designed to fill the void. Fortunately, short-term air quality forecasting (AQF) only needs recent air quality readings from surrounding monitoring stations to estimate the transport.

For Houston, the chance that ozone is transported into the region is quite low, simply because Houston itself is the biggest emission source in East Texas. It is known that ozone produced in Houston is frequently carried away by winds to other places, such as North Texas.

There are several types of ozone forecasts. The "alert" forecast advises on whether future ozone (typically one or two days ahead) will exceed the national standard. The forecast is simply a "Yes" or a "No". A "Yes" means ozone will rise above the acceptable level and "action" is needed. A "No" means the opposite: ozone stays within the standard and no "action" is necessary. For this reason it is also called the "Action Day" forecast. The current EPA standard is that the maximum 8-hr ozone concentration in a day cannot exceed 80 ppb. The 8-hr concentration is calculated by averaging the ozone concentration over a window of 8 consecutive hours. For example, the 8-hr concentration at noon (12 PM) is the average of the hourly values from 8 AM to 3 PM. Presently, both EPA and TCEQ give "alert" forecasts for all Texas metropolitan areas.
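
The 8-hr averaging and the Yes/No decision described above can be sketched in a few lines of Python. This is a minimal illustration, not the EPA or TCEQ procedure: the function names and the sample values are made up, the window placement simply follows the noon example in the text (8 AM to 3 PM), and windows that would cross midnight are skipped for brevity.

```python
from typing import List, Optional

STANDARD_PPB = 80.0  # 8-hr standard quoted in the text


def eight_hour_average(hourly: List[float], hour: int) -> Optional[float]:
    """Average the 8 hourly values from hour-4 through hour+3.

    Matches the text's example: the value at hour 12 (noon) averages
    hours 8 through 15 (8 AM to 3 PM). Returns None when the window
    would run past either end of the supplied day.
    """
    start, end = hour - 4, hour + 3
    if start < 0 or end >= len(hourly):
        return None
    window = hourly[start:end + 1]
    return sum(window) / len(window)


def daily_max_eight_hour(hourly: List[float]) -> float:
    """Largest 8-hr average among the complete windows in one day."""
    averages = [eight_hour_average(hourly, h) for h in range(len(hourly))]
    return max(a for a in averages if a is not None)


def action_day(hourly: List[float]) -> str:
    """'Yes' if the daily maximum 8-hr average exceeds the standard."""
    return "Yes" if daily_max_eight_hour(hourly) > STANDARD_PPB else "No"


# Made-up hourly ozone values (ppb) for hours 0..23, for illustration only.
sample_day = [30.0] * 8 + [60.0, 80.0, 100.0, 110.0, 110.0, 100.0, 90.0, 70.0] + [40.0] * 8

print(eight_hour_average(sample_day, 12))  # 90.0
print(action_day(sample_day))              # Yes
```

In this made-up day the noon window averages 90 ppb, which is above 80 ppb, so the alert forecast would be "Yes".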

Another type of forecast gives the actual number for the future ozone concentration. Because ozone can vary over a wide range in summer, typically from 20 ppb to 200 ppb for hourly ozone, the "number" forecast is more difficult. There are two kinds of "number" forecast: maximum 1-hr ozone and maximum 8-hr ozone. Right now, the 8-hr ozone forecast is the standard and most efforts are directed toward it. The 1-hr ozone forecast, though obsolete as a standard, is a valuable reference in developing sophisticated numerical air quality models, such as the forecasting systems that UH is using. Ideally, a good air quality model would tell us everything we need to know about air quality. In practice, there are many uncertainties in the numerical models, and people (like those at UH) are working diligently to resolve them. A statistical model can work in tandem with a numerical model to identify issues and find solutions.

Since weather is difficult to predict (though some elements, such as temperature, are easier), forecasting ozone is even riskier. Ozone is more closely tied to winds, clouds and precipitation, all of which are "tough nuts" to predict. All ozone forecasts therefore come with strings attached: uncertainties are inherent. For the 1-hr ozone forecast, the typical uncertainty is about 16 ppb in Houston. However, the uncertainty differs with the weather conditions, and it is highest when ozone is on the high side. When will ozone be high? Look out for a day with a sunny or partly cloudy sky and light or variable wind. It is also known that wind reversal plays an important role in high ozone events.

Currently, the UH statistical model gives a 1-hr ozone forecast for metro Houston only. The maximum 1-hr ozone is calculated using data from 40 CAMS monitors in HGB, so the forecast is valid only for the region those 40 sites cover. The statistical model is built on the maximum ozone data from CAMS and the forecast weather issued by the National Weather Service (NWS).
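
One plausible reading of "max 1-hr ozone" is the highest hourly value recorded at any of the monitors during the day. The sketch below assumes that reading; the site names and values are hypothetical, and missing hours are represented as None.

```python
from typing import Dict, List, Optional


def regional_max_one_hour(readings: Dict[str, List[Optional[float]]]) -> float:
    """Highest valid hourly ozone value (ppb) across all sites for one day."""
    valid = [v for site_values in readings.values() for v in site_values if v is not None]
    if not valid:
        raise ValueError("no valid hourly readings for this day")
    return max(valid)


# Hypothetical site names and made-up values; a real day has 24 hourly values per site.
sample_readings = {
    "SITE_A": [42.0, 95.0, None, 88.0],
    "SITE_B": [51.0, 120.0, 101.0, 77.0],
    "SITE_C": [None, None, None, None],  # monitor offline all day
}

print(regional_max_one_hour(sample_readings))  # 120.0
```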

Each day, NWS issues so-called point forecasts for more than a thousand weather stations across the US. Most of these stations are located at airports and measure weather conditions hourly. The NWS forecasts are generated from numerical weather prediction (NWP) model output. Since the direct output from the NWP models is not quite accurate, NWS uses a technique called Model Output Statistics (MOS) to post-process it. There are several NWP models currently operated by NCEP. For short-range forecasts, there are GFS, NGM and ETA (NAM). Accordingly, there are three versions of MOS forecasts based on the three NWP models, known to local forecasters (or meteorologists) as GFS-MOS, NGM-MOS and ETA-MOS. The MOS forecasts are sent out to forecasters in a compact, matrix-like form known as PFMs (Point Forecast Matrices).

To build the statistical model, the NWS GFS-MOS forecasts for KIAH (George Bush Intercontinental Airport, Houston), KHOU (Hobby Airport) and KGLS (Galveston regional airport) are used as the weather input. Since summer is the busiest season for ozone activity in Houston, the current model is constructed for forecasting summer ozone (June to September) only. Three years of data (2004-2006) are used to train the model. The model is essentially a MOS-type model because it is trained on model output. The benefit is that it can correct systematic errors in the NWS forecast. A "PP" (Perfect Prognosis) type model would instead be trained on the observed weather at the three stations. For a PP model to predict ozone well, a perfect weather prediction (hence the name PP) is needed. Given that model output is always biased one way or another, the "imperfect" model prediction makes PP a less attractive approach in weather forecasting.
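
The difference between the MOS and PP approaches can be seen in a toy example. The sketch below is not the UH model: it swaps in a one-predictor straight-line fit to keep things short, and all numbers are synthetic. The weather forecast is assumed to run a constant 2 degrees warm, and ozone is assumed to depend linearly on the true temperature.

```python
from typing import List, Tuple


def fit_line(x: List[float], y: List[float]) -> Tuple[float, float]:
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Synthetic training data (not real measurements).
observed_temp = [28.0, 30.0, 32.0, 34.0, 36.0, 38.0]   # deg C
forecast_temp = [t + 2.0 for t in observed_temp]        # forecast runs 2 deg warm
max_ozone = [3.0 * t - 40.0 for t in observed_temp]     # ppb, toy relationship

pp_intercept, pp_slope = fit_line(observed_temp, max_ozone)    # PP: trained on observations
mos_intercept, mos_slope = fit_line(forecast_temp, max_ozone)  # MOS: trained on forecasts

# A new day: the truth is 35 deg C, but only the warm-biased forecast is available.
true_temp, new_forecast = 35.0, 37.0
true_ozone = 3.0 * true_temp - 40.0

pp_prediction = pp_intercept + pp_slope * new_forecast
mos_prediction = mos_intercept + mos_slope * new_forecast

print(f"PP error:  {pp_prediction - true_ozone:+.1f} ppb")
print(f"MOS error: {mos_prediction - true_ozone:+.1f} ppb")
```

In this toy setup the PP-style fit inherits the full 2-degree bias (an error of about 6 ppb), while the MOS-style fit absorbs the bias into its intercept and is essentially exact. Real forecast errors are not constant, so the gain in practice is smaller.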

There are many ways to build a statistical model. The most popular ones include the GAM (Generalized Additive Model), Neural Networks (NN), Cluster Analysis and tree-based methods. A GAM is basically a souped-up regression model with non-linearity introduced. Neural Networks are powerful tools for detecting predictand-predictor relationships in a "black box" way. Classic cluster analysis is "unsupervised": a set of values is classified based on the closeness of the values, so it is more an exploratory tool than a predictive tool (though it can have some indirect predictive power). Newer developments in cluster analysis can handle "supervised" clustering (by UH/CS!). Tree-based methods excel in the interpretability of their end models, in which rules and nodes are labeled. As for predictive accuracy, all four are excellent, though some earlier experiments in ozone forecasting put GAM slightly ahead. Given the many variants within each of NN, Cluster Analysis and tree-based methods, however, those earlier findings are likely inconclusive. One thing is clear: all four methods should do well when properly implemented.

The UH statistical model is based on a regression tree, which belongs to the family of tree-based methods. The correlation of the initial model is about 0.85 (R-square = 0.72), which is decent considering it is a "pure" forecast model. For the 1-hr ozone "number" forecast, no comparison study has been found so far.
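
For readers unfamiliar with regression trees, here is a minimal sketch of the general technique: the data are split repeatedly on the predictor and threshold that most reduce the squared error, and each leaf predicts the mean of its training values. This is not the UH implementation; the two predictors (temperature and wind speed), the depth limit and the eight training rows are all invented for illustration.

```python
from typing import Any, List, Optional, Tuple

Row = Tuple[List[float], float]  # (predictors, max ozone in ppb)


def sse(targets: List[float]) -> float:
    """Sum of squared errors around the mean (0.0 for an empty list)."""
    if not targets:
        return 0.0
    m = sum(targets) / len(targets)
    return sum((t - m) ** 2 for t in targets)


def best_split(rows: List[Row]) -> Optional[Tuple[float, int, float]]:
    """Return (total_sse, feature_index, threshold) of the best split, or None."""
    best = None
    for j in range(len(rows[0][0])):
        values = sorted({x[j] for x, _ in rows})
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2.0  # midpoint between neighbouring values
            left = [y for x, y in rows if x[j] <= threshold]
            right = [y for x, y in rows if x[j] > threshold]
            total = sse(left) + sse(right)
            if best is None or total < best[0]:
                best = (total, j, threshold)
    return best


def build_tree(rows: List[Row], depth: int = 0, max_depth: int = 2) -> Any:
    """A leaf is a float; a node is (feature_index, threshold, left, right)."""
    targets = [y for _, y in rows]
    split = best_split(rows) if depth < max_depth else None
    if split is None or split[0] >= sse(targets):
        return sum(targets) / len(targets)  # leaf: mean of the training targets
    _, j, threshold = split
    left = [r for r in rows if r[0][j] <= threshold]
    right = [r for r in rows if r[0][j] > threshold]
    return (j, threshold,
            build_tree(left, depth + 1, max_depth),
            build_tree(right, depth + 1, max_depth))


def predict(tree: Any, x: List[float]) -> float:
    """Walk from the root to a leaf and return the leaf value."""
    while isinstance(tree, tuple):
        j, threshold, left, right = tree
        tree = left if x[j] <= threshold else right
    return tree


# Made-up training rows: [max temperature (deg C), wind speed (m/s)] -> max ozone (ppb).
rows = [
    ([35.0, 2.0], 130.0), ([34.0, 3.0], 120.0),  # hot and calm
    ([35.0, 8.0], 70.0), ([34.0, 9.0], 60.0),    # hot and windy
    ([26.0, 2.0], 50.0), ([27.0, 3.0], 45.0),    # cool and calm
    ([26.0, 8.0], 40.0), ([27.0, 9.0], 35.0),    # cool and windy
]

tree = build_tree(rows)
print(tree)
print(predict(tree, [36.0, 1.0]))   # 125.0 (hot, calm day)
print(predict(tree, [24.0, 10.0]))  # 37.5 (cool, windy day)
```

The printed tree can be read directly as rules (split on temperature first, then on wind speed), which is the interpretability advantage of tree-based methods.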

The statistical model forecasts are issued twice a day: first around 7 PM LST for the next day, followed by an update around 6 AM. The first forecast uses the NWS 1800 UTC GFS-MOS forecast, and the second forecast (the update) uses the 0000 UTC GFS-MOS forecast.
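
The timing can be laid out explicitly as follows. This sketch rests on two assumptions that are not stated in the text: that LST for Houston is Central Standard Time (UTC-6, no daylight saving shift), and that each issuance uses the most recent GFS-MOS cycle at the stated hour. The function name is hypothetical.

```python
from datetime import date, datetime, time, timedelta, timezone

# Assumption: LST for Houston is Central Standard Time, UTC-6, with no daylight saving shift.
LST = timezone(timedelta(hours=-6), "LST")


def issuance_schedule(target_day: date):
    """Return [(label, issue time in LST, GFS-MOS cycle in UTC), ...] for one target day."""
    previous_day = target_day - timedelta(days=1)
    return [
        ("first forecast",
         datetime.combine(previous_day, time(19, 0), tzinfo=LST),          # ~7 PM the evening before
         datetime.combine(previous_day, time(18, 0), tzinfo=timezone.utc)),  # 1800 UTC cycle
        ("update",
         datetime.combine(target_day, time(6, 0), tzinfo=LST),             # ~6 AM on the day itself
         datetime.combine(target_day, time(0, 0), tzinfo=timezone.utc)),     # 0000 UTC cycle
    ]


schedule = issuance_schedule(date(2007, 7, 18))
for label, issued, cycle in schedule:
    age_hours = (issued - cycle).total_seconds() / 3600.0
    print(f"{label}: issued {issued:%Y-%m-%d %H:%M} LST, "
          f"cycle {cycle:%Y-%m-%d %H:%M} UTC, {age_hours:.0f} h old")
```

Under these assumptions the evening forecast is built on weather guidance that is about 7 hours old, and the morning update on guidance that is about 12 hours old.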

Page last updated on 7/18/2007