Washrooms in Vancouver Public Parks
The Question
In this project, we used open data from the City of Vancouver Open Data Portal to understand which amenities, features and attributes distinguish parks that have a washroom from those that do not.
The Pipeline
Before moving on to predictive modelling, we built a data validation pipeline using Pandera to enforce schema constraints on the input data. This included checks for missingness, expected data types, and valid category levels. By building data checks in at the ingestion stage, we were able to prevent data issues downstream and make our overall pipeline more robust.
Reproducibility
To ensure that our analysis could be rerun with the same results every time, we built a Docker container with a fixed Python version and pinned versions of each of our packages. We also included a docker-compose.yml file, which a GitHub Actions workflow automatically updates to the latest version of the container. This had the added benefit of letting us collaborate with fewer conflicts as we built out our model, since we were all coding in exactly the same environment. Furthermore, we wrote a Makefile that ties the steps of the analysis together. To run the analysis yourself, clone the repository and follow the setup instructions in the README; the entire pipeline can be run with a single make all command.
The Model and Findings
We built a binary classification model predicting the existence of a washroom in each park, comparing the performance of a k-nearest neighbors (k-NN) algorithm and a Support Vector Machine with a Radial Basis Function kernel (SVM RBF) against a baseline. As features, we included neighbourhood name, park size (hectares), whether the park is official, whether there are any advisories, whether there are additional facilities and whether the park has special features. Neither model was able to substantially improve on the baseline, but this was informative in and of itself: it suggests that washroom presence is more a function of historical infrastructure decisions or other neighbourhood characteristics that were not included in the dataset we used. However, we did find that park size and the presence of other amenities were stronger signals than the other features in predicting the presence of washrooms, supporting our initial intuition that washrooms would be associated with larger, more developed parks.
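A comparison of this shape can be sketched with scikit-learn as follows. This is only an illustration under stated assumptions: the data is randomly generated (so the scores say nothing about the real parks), and the column names, the most-frequent-class baseline, the default hyperparameters and the 5-fold cross-validation are choices made for the example rather than the project's confirmed setup.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data with hypothetical column names.
rng = np.random.default_rng(0)
n_parks = 200
parks = pd.DataFrame(
    {
        "NeighbourhoodName": rng.choice(["A", "B", "C"], size=n_parks),
        "Hectare": rng.gamma(2.0, 2.0, size=n_parks),
        "Facilities": rng.integers(0, 2, size=n_parks),
    }
)
has_washroom = rng.integers(0, 2, size=n_parks)  # random labels, not real data

# One-hot encode the categorical column, scale the numeric one,
# and pass the binary indicator column through unchanged.
preprocessor = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore"), ["NeighbourhoodName"]),
    (StandardScaler(), ["Hectare"]),
    remainder="passthrough",
)

models = {
    "baseline": make_pipeline(preprocessor, DummyClassifier(strategy="most_frequent")),
    "knn": make_pipeline(preprocessor, KNeighborsClassifier()),
    "svm_rbf": make_pipeline(preprocessor, SVC(kernel="rbf")),
}

# Mean cross-validated accuracy for each model.
cv_scores = {
    name: cross_val_score(model, parks, has_washroom, cv=5).mean()
    for name, model in models.items()
}

for name, score in cv_scores.items():
    print(f"{name}: {score:.3f}")
```

Putting the preprocessing inside each pipeline means the encoder and scaler are refit on every training fold, so no information from the validation fold leaks into the comparison with the baseline.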
Report: View the rendered Report
Code: View on GitHub
Tools: Python · scikit-learn · Pandera · Docker · matplotlib
Collaborators: William Song and Chung Ki (Harry) Yau