How I used Classification to Predict Rain

EDA and Classification of Weather using Weather Station Data

Andrew Auyeung
5 min read · Nov 13, 2020

For my third project at Metis Data Science Bootcamp, I decided to try predicting rain with classification. The outcome of these two weeks is rainOne, a classification model whose predictions I compare against the forecast from OpenWeatherMap.org.

Meteorologists generally determine the forecast by looking at multiple sources of data. They track the movement of pressure systems, analyze cloud cover, and aggregate information from many different sensors and stations. What if they only had the information that one individual weather station provided? Can the weather still be predicted from a single station’s record of past conditions? rainOne compiles the data from one weather station to predict whether it will rain tomorrow, with about 70% accuracy.

To train rainOne, I collected data from the Weather Underground station at Newark Liberty International Airport (EWR). I built a web scraper to collect the daily summaries from 1990 to 2020, totaling about 11,200 rows. I also cross-referenced the days with data from the National Oceanic and Atmospheric Administration (NOAA) to confirm which days at EWR were rainy.

Once the data was collected, any information indicating whether it rained that day was shifted one day earlier. I am not interested in whether the current conditions align with rain today; rather, I am interested in whether today’s conditions align with rain tomorrow.
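Concretely, the shift looks something like this in pandas (the file and column names here are my own illustration, not the project’s actual schema):

```python
import pandas as pd

# Daily station summaries indexed by date; 'rained' flags whether any rain
# fell that day. (File and column names are illustrative.)
df = pd.read_csv("ewr_daily.csv", parse_dates=["date"], index_col="date").sort_index()

# Pair each day's conditions with whether it rained the *next* day.
df["rain_tomorrow"] = df["rained"].shift(-1)

# The last day has no "tomorrow" to predict, so drop it.
df = df.dropna(subset=["rain_tomorrow"])
df["rain_tomorrow"] = df["rain_tomorrow"].astype(int)
```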

From some light reading on weather indicators, I determined that a stationary average of the conditions was not a good predictor of rain. What matters, instead, is how the conditions change.

The first features I looked at were the one-day changes in humidity, wind speed, and pressure.

The data is separated by whether the current conditions are followed by rain or no rain the next day.
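Computing those one-day changes is a one-liner per column in pandas; here is a sketch, continuing the illustrative column names from above:

```python
# One-day change in each candidate predictor.
for col in ["humidity", "wind_speed", "pressure"]:
    df[f"{col}_diff"] = df[col].diff()

# Average day-before change, split by whether it rains the next day.
print(df.groupby("rain_tomorrow")[
    ["humidity_diff", "wind_speed_diff", "pressure_diff"]].mean())
```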

Humidity tends to increase by 7% on days before rain and wind speed tends to be nonvolatile.

Now you may look at this and notice that, whether or not it rains the next day, wind speed is still decreasing on average. How can wind speed always be decreasing?

Well, we need to look at the distribution of the change in wind speed. The distribution is skewed towards heavier daily decrements in wind speed.

In other words, when wind speed increases, it tends to do so gradually, but when it decreases, the one-day change is more dramatic.
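A quick way to confirm that skew, continuing the same sketch:

```python
# A heavier left tail (large one-day drops in wind speed) shows up as
# negative skew; pandas can report it directly.
print(df["wind_speed_diff"].skew())
print(df["wind_speed_diff"].describe())
```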

Another takeaway from the pressure analysis is that the raw, signed change in pressure does not separate rainy events from dry ones very well. However, even a small change in pressure is significant: a 5% change in pressure is equivalent to roughly a 1,500-foot difference in altitude. The separation improves if we compare the change in pressure to rain using the absolute change instead. Generally, once pressure drops by more than 0.5 in. Hg, a rainy event follows.
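As a sketch, the absolute pressure change and the 0.5 in. Hg rule of thumb can be expressed as derived columns (assuming pressure is recorded in inches of mercury):

```python
# Absolute one-day change in pressure, plus a flag for sharp drops
# (pressure assumed to be recorded in inches of mercury).
df["pressure_abs_diff"] = df["pressure_diff"].abs()
df["sharp_pressure_drop"] = df["pressure_diff"] < -0.5

# How often a sharp pressure drop is followed by rain the next day.
print(df.groupby("sharp_pressure_drop")["rain_tomorrow"].mean())
```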

Feature Engineering

I wanted to build on these one-day trends and capture information from the previous week as well. To do this, I looked at how much each day differed from the moving average of the previous days, which captures dramatic changes that are unusual for the season.
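Roughly, that means comparing each day to a trailing seven-day mean; a sketch with the same illustrative columns:

```python
# Deviation of each day's reading from the trailing 7-day mean
# (the mean is shifted so it only uses days before the current one).
for col in ["humidity", "wind_speed", "pressure"]:
    weekly_mean = df[col].rolling(window=7).mean().shift(1)
    df[f"{col}_vs_week"] = df[col] - weekly_mean
```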

Model

When building rainOne, I trained four different classification algorithms, each with its own strengths: KNearestNeighbors, Logistic Regression, Random Forest, and XGBoost.

KNearestNeighbors is a classification algorithm that classifies new data by its proximity to the data stored in the model. To train the model, it takes the features I had generated (wind speed, change in humidity, etc.) and stores them along with the associated label of rain or no rain. To classify a new day, the model takes in its weather information, compares it to the stored data, and makes a decision based on the stored points that share similar characteristics.

For example: if the new point’s closest matches include 5 past days with similar conditions where it rained and 3 where it did not, the model will predict rain. You can tune this by deciding how many of the closest points the model should consider when making its decision; in other words, you choose how many nearest neighbors to consult when classifying new data.
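In scikit-learn, that tuning knob is n_neighbors. A minimal sketch of this kind of model, using the illustrative features from earlier (not necessarily the exact feature set rainOne uses):

```python
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

features = ["humidity_diff", "wind_speed_diff", "pressure_abs_diff"]
data = df.dropna(subset=features)  # the first day has no one-day change

# Keep the split chronological so the test set is strictly "future" weather.
X_train, X_test, y_train, y_test = train_test_split(
    data[features], data["rain_tomorrow"], shuffle=False)

# Scaling matters for KNN: neighbors are found by distance in feature space.
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=8))
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))
```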

Logistic Regression is another method that can be used for binary classification. Given the training data, the model fits a coefficient to each feature and applies some form of regularization; there is an option to remove this bias by using no penalty. The result is a linear boundary that separates the rainy days from the dry days.
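In scikit-learn terms, that option is the penalty argument; a sketch reusing the split from the KNN example:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# L2 regularization is the default; penalty=None (or "none" in older
# scikit-learn releases) turns it off entirely.
logreg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
logreg.fit(X_train, y_train)
print(logreg.score(X_test, y_test))
```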

Random Forest is a decision-tree-based method that splits the training data at different nodes until a stopping condition is reached. As the name suggests, it samples random subsets of features when building each tree from the training set.
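A sketch of the same idea with scikit-learn’s RandomForestClassifier (hyperparameters are placeholders, not rainOne’s actual settings):

```python
from sklearn.ensemble import RandomForestClassifier

# max_features controls the random feature subset considered at each split;
# max_depth and min_samples_leaf act as the stopping conditions.
rf = RandomForestClassifier(n_estimators=300, max_features="sqrt",
                            max_depth=6, min_samples_leaf=20, random_state=42)
rf.fit(X_train, y_train)
print(rf.score(X_test, y_test))
```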

In terms of classifying non-rainy days, version 1 of rainOne was great! But each individual model performed poorly in different ways: Logistic Regression produced a larger proportion of false positives, while XGBoost produced a larger number of false negatives. I would not want to listen to a forecast that predicts rain on dry days, or vice versa.

The solution was to combine the models with a soft voting ensemble. This gave me the best of both worlds, and I was satisfied with the results.
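scikit-learn’s VotingClassifier with voting="soft" averages the models’ predicted probabilities of rain; here is a sketch of how the ensemble could be wired together, assuming the xgboost package is installed (again, the hyperparameters are placeholders):

```python
from sklearn.ensemble import VotingClassifier
from xgboost import XGBClassifier

# Soft voting averages each model's predicted probability of rain
# instead of counting hard yes/no votes.
ensemble = VotingClassifier(
    estimators=[
        ("knn", knn),
        ("logreg", logreg),
        ("rf", rf),
        ("xgb", XGBClassifier(n_estimators=300, max_depth=3)),
    ],
    voting="soft",
)
ensemble.fit(X_train, y_train)
print(ensemble.score(X_test, y_test))
```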

For the implementation, I built a Streamlit app to compare rainOne’s predictions to those of OpenWeatherMap.org. You can see a video of it in action here.
