Project 1: Exploratory Data Analysis of MTA Turnstile
As of this week, we have completed our first project in the Istanbul Data Science Academy bootcamp.
According to the scenario in our project, the association named WomenTechWomenYes organizes a gala every summer. To increase participation in this gala, they have teams collecting e-mail addresses on the street. As the number of teams is limited, we will identify the most heavily used metro stations so that the teams can be placed where they reach the most people.
In this article, I explain the project under its main headings rather than walking through the full code. I will share the details on my GitHub profile.
MTA Turnstile Data
In this study, turnstile data from March to May 2022 was used as the main data source. New York subway turnstile data consists of cumulative entry and exit counts by station, turnstile, date and time. There are 2,739,787 rows and 11 columns in the date range we selected. Data is collected every 4 hours and shared at the address in the link below.
In addition to this data, Google Maps startup data and US Census Bureau data were also used.
Analysis Tools We Use
We used Python and several of its libraries to prepare our project: Pandas and NumPy to analyze the data, and Matplotlib and Seaborn to visualize it, all in a Jupyter notebook.
Project Flow
First of all, we determined the problem/needs according to the flow chart we visualized. Then, we identified solutions and resources. After analyzing the selected data sources, we shared our solution suggestions.
Our aims in solving the problem:
- Detect the most crowded stations.
- Identify the busiest days and times.
- Determine the busiest hour and day of the busiest station.
Let’s Explore Our Data Sources
We identified the heavily used stations with the turnstile data. In addition, we also made use of additional data sources.
We aimed to identify stations close to technology companies in order to reach their employees, who are assumed to use e-mail heavily, and especially female university graduates.
Using the Census data, which records social and economic information about the American people, we analyzed education and income levels to reach the people with the most potential.
Data Analysis Process
In this project, we analyzed the data in 5 main stages.
Brief Data Insights
After downloading the turnstile data, we combined it into a single dataframe and removed unnecessary data, performing the necessary analyses with the Pandas and NumPy libraries.
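The combining step can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the two in-memory "weekly files", their values, and the helper name `load_weeks` are made up for the example, and the column names are an assumption about the layout of the turnstile files. Stripping whitespace from the header is a defensive step that is harmless if the real header is already clean.

```python
import io
import pandas as pd

# Two tiny made-up "weekly files" standing in for the real downloads.
# Column names are an assumed layout; every number here is invented.
HEADER = "C/A,UNIT,SCP,STATION,LINENAME,DIVISION,DATE,TIME,DESC,ENTRIES,EXITS  \n"
WEEK_1 = HEADER + (
    "A002,R051,02-00-00,59 ST,NQR456W,BMT,03/05/2022,00:00:00,REGULAR,100,50\n"
    "A002,R051,02-00-00,59 ST,NQR456W,BMT,03/05/2022,04:00:00,REGULAR,110,58\n"
)
WEEK_2 = HEADER + (
    "A002,R051,02-00-00,59 ST,NQR456W,BMT,03/12/2022,00:00:00,REGULAR,900,470\n"
    # Exact repeat of the previous reading, to show de-duplication.
    "A002,R051,02-00-00,59 ST,NQR456W,BMT,03/12/2022,00:00:00,REGULAR,900,470\n"
    "A002,R051,02-00-00,59 ST,NQR456W,BMT,03/12/2022,04:00:00,REGULAR,915,481\n"
)

# One reading is identified by its turnstile plus its date and time.
READING_KEY = ["C/A", "UNIT", "SCP", "STATION", "DATE", "TIME"]


def load_weeks(sources):
    """Read each weekly file and stack them into a single dataframe."""
    frames = [pd.read_csv(src) for src in sources]
    df = pd.concat(frames, ignore_index=True)
    # Remove stray whitespace around column names.
    df.columns = df.columns.str.strip()
    # Build one timestamp column from the separate DATE and TIME strings.
    df["DATETIME"] = pd.to_datetime(
        df["DATE"] + " " + df["TIME"], format="%m/%d/%Y %H:%M:%S"
    )
    # Keep only the first copy of any repeated reading.
    return df.drop_duplicates(subset=READING_KEY).reset_index(drop=True)


turnstiles = load_weeks([io.StringIO(WEEK_1), io.StringIO(WEEK_2)])
print(turnstiles[["STATION", "DATETIME", "ENTRIES", "EXITS"]])
```

With real data, the `io.StringIO` objects would simply be replaced by file paths or URLs of the downloaded weekly files.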
Since the entries and exits at each turnstile are cumulative, we determined the traffic in each interval by taking the difference from the previous reading. In this way, we determined the most frequently used turnstiles.
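As a rough sketch of this differencing step, the example below computes per-interval traffic on a small invented dataframe. The column names, the helper `add_interval_traffic`, and the `max_per_interval` cutoff are assumptions made for illustration. The cutoff is one simple way to discard readings where a counter appears to have been reset; it is not necessarily what we did in the notebook.

```python
import pandas as pd

# Invented cumulative readings for two turnstiles (values are made up).
readings = pd.DataFrame({
    "C/A": ["A002"] * 4 + ["A006"] * 3,
    "UNIT": ["R051"] * 4 + ["R079"] * 3,
    "SCP": ["02-00-00"] * 4 + ["00-00-00"] * 3,
    "STATION": ["59 ST"] * 4 + ["5 AV/59 ST"] * 3,
    "DATETIME": pd.to_datetime([
        "2022-03-05 00:00", "2022-03-05 04:00",
        "2022-03-05 08:00", "2022-03-05 12:00",
        "2022-03-05 00:00", "2022-03-05 04:00", "2022-03-05 08:00",
    ]),
    "ENTRIES": [100, 110, 135, 150, 5000, 5020, 10],  # last value: counter reset
    "EXITS": [50, 58, 70, 90, 900, 905, 3],
})

# Columns that together identify one physical turnstile (assumed layout).
TURNSTILE = ["C/A", "UNIT", "SCP", "STATION"]


def add_interval_traffic(df, max_per_interval=10_000):
    """Turn cumulative counters into per-interval entries, exits and traffic."""
    df = df.sort_values(TURNSTILE + ["DATETIME"]).copy()
    for col in ["ENTRIES", "EXITS"]:
        # Difference from the previous reading of the same turnstile.
        delta = df.groupby(TURNSTILE)[col].diff()
        # Negative or implausibly large jumps are treated as missing.
        delta = delta.where((delta >= 0) & (delta <= max_per_interval))
        df[f"{col}_DELTA"] = delta
    df["TRAFFIC"] = df["ENTRIES_DELTA"] + df["EXITS_DELTA"]
    # The first reading of each turnstile has no previous value to compare to.
    return df.dropna(subset=["TRAFFIC"])


traffic = add_interval_traffic(readings)
print(traffic[["STATION", "DATETIME", "TRAFFIC"]])
```

Grouping by turnstile before calling `diff` matters: without it, the first reading of one turnstile would be subtracted from the last reading of another.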
We added a day-of-week column, as the main dataset has no such field.
In the next step, we determined the hour intervals with the busiest traffic.
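The day column and the busiest time slots can be derived along these lines. This is a sketch on invented numbers, and it rests on a stated assumption: each reading is taken to cover the 4 hours that end at its timestamp, so the day and slot are taken from the start of that interval. The column names `DAY_NAME` and `TIME_SLOT` are illustrative.

```python
import pandas as pd

# Invented per-interval traffic; each timestamp marks the END of a 4-hour interval.
traffic = pd.DataFrame({
    "DATETIME": pd.to_datetime([
        "2022-03-07 12:00",  # Monday 08:00-12:00
        "2022-03-07 20:00",  # Monday 16:00-20:00
        "2022-03-08 00:00",  # Monday 20:00-24:00
        "2022-03-09 20:00",  # Wednesday 16:00-20:00
        "2022-03-12 20:00",  # Saturday 16:00-20:00
    ]),
    "TRAFFIC": [100, 300, 150, 400, 120],
})

# Assumption: the interval started 4 hours before the reading was taken.
interval_start = traffic["DATETIME"] - pd.Timedelta(hours=4)

# Day of week of the interval, e.g. "Monday".
traffic["DAY_NAME"] = interval_start.dt.day_name()

# Snap the start hour down to a multiple of 4 and build a readable label.
slot_start = (interval_start.dt.hour // 4) * 4
traffic["TIME_SLOT"] = slot_start.map(lambda h: f"{h:02d}:00-{h + 4:02d}:00")

# Total traffic per time slot and per day, busiest first.
by_slot = traffic.groupby("TIME_SLOT")["TRAFFIC"].sum().sort_values(ascending=False)
by_day = traffic.groupby("DAY_NAME")["TRAFFIC"].sum().sort_values(ascending=False)

print(by_slot)
print(by_day)
```

Real turnstiles do not all report on the same hours, so snapping to fixed 4-hour slots is an approximation rather than an exact assignment.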
In the image below you can see the summary of our dataset.
Station Insights
There is a decrease in the number of passengers on weekends.
Most Used Station Insights
The top 10 most used stations in the date range we analyzed.
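A ranking like this one can be produced with a group-and-sum over stations. The sketch below uses a handful of invented rows and a hypothetical helper `top_stations`; the traffic figures are made up and do not reflect our results.

```python
import pandas as pd

# Invented per-interval traffic rows (numbers are made up).
traffic = pd.DataFrame({
    "STATION": ["34 ST-PENN STA", "23 ST", "34 ST-PENN STA",
                "FULTON ST", "23 ST", "86 ST"],
    "TRAFFIC": [900, 400, 850, 500, 350, 300],
})


def top_stations(df, n=10):
    """Sum traffic per station and keep the n busiest, largest first."""
    return df.groupby("STATION")["TRAFFIC"].sum().nlargest(n)


top3 = top_stations(traffic, n=3)
print(top3)
```

One caveat worth keeping in mind: grouping by station name alone merges stations that share a name on different lines. If that matters for the analysis, adding the line name to the grouping key is one possible way to keep them apart.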
Total Traffic Insights
At all stations, the number of passengers on weekends decreases compared to weekdays.
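The weekday and weekend comparison can be checked with a simple average per day type. The sketch below works on one invented week of daily totals; the figures and the `DAY_TYPE` column name are illustrative assumptions, not our measured values.

```python
import pandas as pd

# One invented week of daily traffic totals, Monday 7 to Sunday 13 March 2022.
daily = pd.DataFrame({
    "DATE": pd.date_range("2022-03-07", periods=7, freq="D"),
    "TRAFFIC": [1000, 1100, 1200, 1150, 1050, 600, 500],
})

# dayofweek is 0 for Monday through 6 for Sunday, so 5 and 6 are the weekend.
daily["DAY_TYPE"] = daily["DATE"].dt.dayofweek.map(
    lambda d: "weekend" if d >= 5 else "weekday"
)

# Average daily traffic for each day type.
avg = daily.groupby("DAY_TYPE")["TRAFFIC"].mean()
weekday_avg = avg.loc["weekday"]
weekend_avg = avg.loc["weekend"]

# Relative drop from an average weekday to an average weekend day.
drop_pct = 100 * (1 - weekend_avg / weekday_avg)
print(f"Weekend traffic is {drop_pct:.0f}% lower than weekday traffic")
```

Comparing averages rather than totals keeps the result fair, since there are five weekdays but only two weekend days in each week.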
Solution Proposals
Most crowded station: 34 ST Penn Station
Busiest Day: Wednesday
Time Slot: 20:00-00:00
Weekdays are busier than weekends.
Tuesday, Wednesday and Friday are the most crowded days.
The day, time and station to be selected to reach the maximum number of people: 34 ST Penn Station, Friday, 16:00-20:00
As can be seen from the graph, 34 ST Penn Station stands out as the ideal station.
In addition to MTA data, let’s take a look at Google Maps and Census data.
Google Maps & US Census Bureau Data
With Google Maps, we can detect crowded stations close to start-ups, fintechs and other big technology companies.
New York is home to more than 9,000 startups, and that number is constantly growing. Most of these startups are located in the Manhattan area.
The busiest stations, which we marked on the map with the blue train logo, are also very close to these startups and fintechs.
Below you can see the map where we marked the startup companies and metro stations.
5 of the 10 busiest stations we saw in our analysis are located in this region.
In particular, the 34 ST Penn Station and 23 ST stations are very close to workplaces.
A Brief Note on United States Census Bureau Data
The United States Census Bureau (USCB), officially the Bureau of the Census, is a principal agency of the U.S. Federal Statistical System, responsible for producing data about the American people and economy.
• The Bronx, Staten Island, Brooklyn, Manhattan, and Queens are the places we analyzed.
• Brooklyn is NYC’s most populous borough, with a population of 2.5 million.
• 1.3 million people live in Manhattan.
Although Brooklyn is the most populous borough, we focus on Manhattan in our analysis, as the busiest stations are located there.
• 807,536 women live in the Manhattan area.
• Average female earnings are $62,177; this average rises to $155,000 in the Manhattan borough.
• The most educated female population lives in the Manhattan and Brooklyn boroughs.
The Manhattan borough, where the busiest stations are located, also stands out as the region with the highest income and the most educated female employees.
"For this reason, better results can be obtained if a large part of the street crews are placed at the 34 ST Penn Station, 42 ST and 23 ST stations in this area."
This is how we completed our analysis. If we had more time, we would:
- Analyze the NYC MTA data with tourist visits separated out.
- Examine location reporting data in cafes, restaurants and hotels around metro stations.
- Check the usage hours of free Wi-Fi networks and the most used hotspots.
After making the project presentation, I will also share the notebook on my GitHub profile: https://github.com/mehmethasanalici
Thank you for reading. I hope you find it useful.