This week I was fortunate enough to work with some peers on exploratory data analysis (EDA) looking at the publicly available Metropolitan Transportation Authority (MTA) Turnstile Data from New York City. Our mission was to analyze the data to provide insight to an imaginary entity WomenTechWomenYes (WTWY) on where to solicit email engagement for an upcoming Gala. Before we could do any recommendation, we had to engage with the data.
The dataset is reported in weekly updates containing roughly 200,000 rows of data. A quick trip to their site helps to decode the information from provided on each row.
Each row of the dataset represents the recorded by a singular turnstile. Individual turnstiles have unique identifiers built from a combination of its Station, Remote Unit, Control Area (C/A), and Subunit Channel Position. To enter or exit a subway station, one usually needs to pass through a turnstile. The turnstiles record a running tally of the number of entries and exits once every four hours.
- One assumption was that WTWY would be looking to target the busiest stations regardless of demographic or location. We used this to drive most of our analysis.
- Being an imaginary event, we made the assumption that the date of event would be on October 2020. We used this to narrow our focus on collecting only the last quarter data from the MTA.
- We made an assumption that the evening and morning time blocks still coincided with conventional rush hour times. In reality, COVID-19 has changed the professional demographic in NYC and in turn affected the commuter demographic.
- Subway riders also have the option of leaving the subway through elevators or emergency exits. The turnstiles will not account for this so that sample of our population will not be accounted for. We assumed this would not procure a large affect on our total traffic analysis.
Keeping in mind the goal for the engagement, we started to clean the data to determine the number of people who had entered and exited in the four hour window. The idea is to find the difference in the turnstile ticker between audits.
First we wanted to focus on regular audits rather than the recovered audits. To accomplish this, we apply a mask to the frame.
mta = pd.DataFrame(mta[mta.DESC == 'REGULAR'])
Next, we need to make sure the data was sorted in ascending time. To do so, the DATE and TIME columns were concatenated and converted into datetime format.
To get the difference in turnstile counts, we called on the pandas built in .diff() function to determine the difference between a certain row and the previous row.
Combining both the net in and out data would yield the total traffic that had gone through that specific turnstile in the past four hours.
A quick glance at our data to check on our work showed some peculiar outliers. In particular, we found some negative turnstile traffic counts and some turnstile traffic counts in the billions!
After doing some fishing, we realised that the large outliers occurred when the .diff() function reached different turnstiles. With the limited time for the project, rather than replace the data with averages or medians, we decided to omit them from our analysis. We decided to make an assumption that the maximum turnstile in and out traffic for four hours would be around 2400 people (10 people per minute). We also determined that the negative values contributed to less than 1% of the total data.
To omit the data, we simply applied a mask to the dataframe:
Analysis / Feature Engineering
Once the data had been cleaned, it was onto the analysis and visualization. Each of these needed similar functions of engineering the data to provide a necessary visualization.
The first step was to look at the distribution of the data. Out of the 379 reported stations, we decided to focus on the top turnstile traffic.
To address current COVID-19 affects on travel through the MTA, we compared the top ten stations of 2020 to the top ten stations of 2019. While we saw changes in the rank order, the top traveled stations showed little volatility.
With this in mind, we focused on the top 10 stations.
A team member mapped the GPS of the stations using Tableau and from a visual analysis, we decided to focus on the stations in Manhattan, south of Central Park. This would give us a more targeted analysis for our client.
Furthermore, we wanted to continue narrowing our scope by sorting by the day of the week. To do so, we called the .weekday() function on our datetime object we created earlier.
Leaving the value as an integer quantity allows for easier sorting of the data. When we wanted to graph the data with the daily analysis, we needed to apply a quick map with a dictionary to translate the numbers to their corresponding days.
We wanted to use a graphic to show the trend between the Station, day of the week, and the average daily ridership. To do this, we needed to create a pivot table and plot it as a heatmap.
The next step was to look closer with our scope and compare the morning and evening rush traffic. This was surprisingly simple to do with Seaborn’s Categorical Plot method (sns.catplot()). It was easy to pass in what the desired columns, rows and individual plot type was. The output was a facet grid filled with the appropriate graph. A must use for classification comparisons.
After discussing with the team, we decided that while the categorical plot was visually pleasing, it did not really get the point across. It is not as intuitive to compare morning and evening when looking going back and forth between rows.
We decided to pick one of the days towards the end of the week and map the station traffic, separated into the morning and evening time blocks.
As our final recommendation to WTWY, we were able to narrow down the top seven stations by morning and evening. At minimum, they would need to target the top stations towards the end of the week in the evenings as total in and out traffic were at their maximum. Based on any staffing restrictions WTWY may have, the code is easily scalable and is written to be nimble. A simple mask to select the station, day, or time of day can change the visualizations to show the historical patterns in question.
As a whole, this project was a unique experience and entry into EDA. I had fun playing with all the different ways to show the comparisons in ridership.
- Introduce a regression algorithm to predict which stations would be busy at future dates.
- Apply a classification algorithm to pair stations with target demographic data and predict engagement.
- Provide a weighting scheme to target based on demographic for desirable subway stations.