Linear Regression Model of King County Housing: Obtaining Geographical Data with Open GIS Software

For my second project as a new student of Data Science at Flatiron School, I was given a large dataset of King County, Washington houses and some associated features. Using this data, I was tasked with building a multiple linear regression model in order to explore what relationships affect the Seattle area housing prices. After a brief investigation I found the dataset to be the following popular dataset from Kaggle.com: House Sales In King County, USA.

Exploring the discussion section of the above website was very useful in solidifying my understanding of the data before diving into the project. That initial curiosity quickly turned into full-blown excitement and obsession when I discovered a blog post linked in the very discussion section I was casually scrolling through. Author Juco Bowley does an amazing job of introducing newcomers to the wonders of QGIS. From the QGIS website:

“QGIS is a user friendly Open Source Geographic Information System (GIS) licensed under the GNU General Public License. QGIS is an official project of the Open Source Geospatial Foundation (OSGeo). It runs on Linux, Unix, Mac OSX, Windows and Android and supports numerous vector, raster, and database formats and functionalities.” qgis.org.

This is the link to Juco Bowley’s blog on beginner’s GIS using the same dataset as I was using: Feature Engineering with QGIS

Once upon a time, I was a student of Anthropology/Archaeology while simultaneously pursuing my BS in chemistry. I was first introduced to the wonders of GIS software and its capabilities from the collection side rather than the exploration and analysis side. I spent nearly every summer I was a student at the University of Texas at Austin in the rainforest of Belize either digging stuff up, sifting through dirt, or mapping new locations to potentially dig and sift dirt at. This involved lugging around what could essentially be boiled down to a bunch of stop signs in order to easily reflect lasers and create geospatial relationships to process later with GIS software. However, the grad students and professors did the bulk of the processing and analysis while sending us grunts back into the literal rainforest with lightning poles to go collect more geodata.

I have to say, the other side of that relationship is pretty nice. As a younger man it was tons of fun to put in the leg work for field collection of geodata, but sitting behind a computer screen and making sense of the collected data is way more fascinating than I could have ever initially imagined. Downloading the free open source QGIS software felt like opening a Christmas present and got me excited to start my project. After graduating from undergrad, there was never a need for me to explore my old mapping ways as I spent all of my time conducting clinical trials as a benchtop pharmaceutical chemist. To be reintroduced to my old love in the context of my new field of Data Science was definitely a blessing and helped me speed through tons of work. Finding when to cut myself off was honestly the hardest part of the project.

The thing is, there is a ton of easily accessible open source data available with just a few clicks at kingcounty.gov. The possibilities seemed endless! Exploring it all in the scope of my project was not very realistic, so I had to pick and choose what made the most sense and what data was most likely correlated with King County housing prices. The following data collected from kingcounty.gov was utilized to create my linear regression model:

-school districts
-park locations
-museum locations
-golf course locations
-distance to shoreline
-Washington Environmental Health Disparities*
-Income information of surrounding neighborhood/area collected through survey**

*Information on health disparity rankings and what they mean can be found here: Health Disparity Index

**Information collected through household surveys over many years from 400 different subsections of King County was readily available on kingcounty.gov. This includes the median rent, median household income, and income per capita for each area.

The metadata for every dataset I used can be easily accessed at kingcounty.gov. For more information, please feel free to visit my GitHub, linked at the end of this blog, to see how this data was collected, transformed, and finally incorporated into a linear regression model.

In this blog, I want to discuss how I retrieved school district information and then processed it with a combination of QGIS software and Python in order to fit it into a linear regression model.

Here is all of the relevant metadata associated with the school district shapefile that I used in my project:
Metadata for School District Shapefile

Here is what the combined data looks like in the QGIS UI:
Map of Housing and Associated School District

I would now like to walk through, step by step, how I achieved this.

Open the QGIS software. It is free to use and free to download from QGIS.org.

On the left side, find the street map entry under the "XYZ Tiles" header. Double-click it to add a basemap.
street map

Once you have your basemap, it is easy to either pan to the area of interest or type in coordinates to jump there. For this project, I typed in the coordinates for downtown Seattle and zoomed out a bit to encompass King County, WA.

The next step is importing your data. This walkthrough assumes data in the form of a CSV file, but there are many more options available.

From the top menu bar, navigate to Layer > Add Layer > Add Delimited Text Layer
add CSV data

The following popup will display. Enter the file path for your data at the top, then expand the "Geometry Definition" section. Select the columns that the software should interpret as latitude and longitude values. You can preview the data you are importing at the bottom of the popup to verify that everything looks right.
import data popup
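In QGIS the X field is longitude and the Y field is latitude, and swapping them is an easy mistake to make. As a rough sanity check before importing, something like the following sketch can flag rows whose coordinates are not valid latitude/longitude values. The column names "lat" and "long", the helper name find_bad_coordinates, and the three sample rows are all assumptions made up for illustration.

```python
import csv
import io

# Hypothetical sample standing in for the housing CSV (values are made up).
# The column names "id", "lat" and "long" are assumptions; use your file's names.
sample_csv = io.StringIO(
    "id,price,lat,long\n"
    "1,300000,47.50,-122.25\n"
    "2,450000,47.70,-122.30\n"
    "3,275000,-122.10,47.60\n"  # latitude and longitude swapped on purpose
)


def find_bad_coordinates(rows, lat_col="lat", lon_col="long"):
    """Return the ids of rows whose coordinates are missing or out of range."""
    bad = []
    for row in rows:
        try:
            lat = float(row[lat_col])
            lon = float(row[lon_col])
        except (KeyError, ValueError):
            # Column missing or value not numeric
            bad.append(row.get("id"))
            continue
        # Valid latitudes lie in [-90, 90], valid longitudes in [-180, 180]
        if not (-90 <= lat <= 90 and -180 <= lon <= 180):
            bad.append(row.get("id"))
    return bad


rows = list(csv.DictReader(sample_csv))
bad_ids = find_bad_coordinates(rows)
print(bad_ids)
```

This only catches values that are impossible as coordinates. A swapped pair that happens to stay in range would still need a visual check against the basemap.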

You should now have a map of your designated area with your designated points of interest. In this case, King County houses.

Next up, importing my newly downloaded shapefile is as simple as dragging and dropping!
drag and drop shapefile

You may need to reorder your layers so that both are visible. You should have a layer of points of interest (houses) and a layer of polygons (school districts).
points and polygons

Now it is time to take the school district information stored with the shapefile's polygons and attach it to the points that fall inside them. This is extremely easy with the QGIS software.

Click the "toolbox" icon in the top menu bar. It looks like a gear. You can either browse to the tool you want to use or simply start typing its name in the search bar. In this case, we want to add polygon attributes to points.
Add poly attributes

Once you click on "add polygon attributes to points" the following popup tool will appear. Just select your appropriate points layer, polygon layer, and attribute you want to add. I chose "NAME" to add school district names to my property listings. After that, just hit run and let QGIS do its thing.
Add poly popup
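Conceptually, this tool answers one question for every house: which polygon contains this point? The sketch below shows that idea in plain Python using a ray-casting test. It is not how QGIS implements the tool, and the square "districts", the house coordinates, and the function names are all hypothetical.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: True if (x, y) falls inside the polygon.

    polygon is a list of (x, y) vertices; the last vertex connects to the first.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            # x coordinate where the edge crosses that horizontal line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            # Count crossings to the right of the point
            if x < x_cross:
                inside = not inside
    return inside


def add_polygon_attribute(points, polygons):
    """Copy the NAME of the first containing polygon onto each point (None if no match)."""
    result = []
    for point in points:
        name = None
        for poly in polygons:
            if point_in_polygon(point["long"], point["lat"], poly["vertices"]):
                name = poly["NAME"]
                break
        result.append({**point, "NAME": name})
    return result


# Hypothetical square "districts" and houses, purely for illustration
districts = [
    {"NAME": "District A", "vertices": [(0, 0), (2, 0), (2, 2), (0, 2)]},
    {"NAME": "District B", "vertices": [(2, 0), (4, 0), (4, 2), (2, 2)]},
]
houses = [
    {"id": 1, "long": 1.0, "lat": 1.0},
    {"id": 2, "long": 3.0, "lat": 0.5},
    {"id": 3, "long": 5.0, "lat": 5.0},  # outside both districts
]

joined = add_polygon_attribute(houses, districts)
for row in joined:
    print(row["id"], row["NAME"])
```

A house that falls outside every polygon ends up with no name, which is worth remembering when checking the output of the real tool.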

After the tool finishes running, you should have a new temporary layer with your intended result.
result

You can right-click this layer and click "Open Attribute Table" to peek at your new dataset with the new feature you just added.

From there, you simply export your result as a CSV file and load it into whatever tool you prefer for further manipulation. I chose to import my new CSV file into a Jupyter Notebook for easy manipulation with Python and the pandas library.
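Loading the exported file might look something like the sketch below. The in-memory CSV is a made-up stand-in for the real export, and the column names are assumptions ("NAME" is the attribute chosen in the tool above). Houses that did not land inside any district polygon may come through with an empty NAME, so it is worth counting those right away.

```python
import io

import pandas as pd

# Hypothetical stand-in for the CSV exported from QGIS (values are made up).
# With a real file you would pass its path to pd.read_csv instead.
exported_csv = io.StringIO(
    "id,price,lat,long,NAME\n"
    "1,300000,47.50,-122.25,District A\n"
    "2,450000,47.70,-122.30,District B\n"
    "3,275000,47.20,-121.90,\n"  # no district matched this house
)

df = pd.read_csv(exported_csv)

# Empty fields are read as NaN, so isna() finds houses without a district
missing = int(df["NAME"].isna().sum())
unmatched = df[df["NAME"].isna()]

print(f"{len(df)} rows, {missing} without a school district")
```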

Now I have a pandas DataFrame that has a school district name associated with every given property as well as all of the features I had downloaded from Kaggle. However, my school district data still needs further processing. While school district names are interesting on their own, they aren't entirely useful for creating a linear regression model. Multiple linear regression models require numerical data to run. Therefore, I decided to conduct some research to determine the quality of the school districts in order to convert the names into a numerical format based on school district rankings.

Luckily, a website called Niche.com has already done all of the heavy lifting for me. From the website:

"The School District Academics grade is based on rigorous analysis of academic data from the U.S. Department of Education along with test scores, college data, and ratings collected from millions of Niche users."

Please feel free to explore the information in the following links to get a better understanding of Niche's rating system and data collection processes.

Using the data from Niche.com, I created a school ratings dictionary that maps each school district name to a numerical rating.
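A minimal sketch of the idea, not the dictionary from my project: the district names and letter grades below are placeholders rather than Niche's actual ratings, and the 12-point scale is just one assumed way of turning letter grades into numbers. The column name "school_rating" is likewise made up for this example.

```python
import pandas as pd

# Placeholder grades for illustration only; these are NOT Niche's real ratings
district_grades = {
    "District A": "A+",
    "District B": "B",
    "District C": "A-",
}

# One possible letter-to-number scale (an assumption): A+ = 12 down to D- = 1
grades_best_to_worst = ["A+", "A", "A-", "B+", "B", "B-",
                        "C+", "C", "C-", "D+", "D", "D-"]
grade_to_score = {
    grade: len(grades_best_to_worst) - i
    for i, grade in enumerate(grades_best_to_worst)
}

# The ratings dictionary: school district name -> numeric rating
school_ratings = {
    name: grade_to_score[grade] for name, grade in district_grades.items()
}

# Hypothetical DataFrame standing in for the housing data with district names
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "NAME": ["District A", "District B", "District C", "District D"],
})

# map() looks each name up in the dictionary; unknown names become NaN
df["school_rating"] = df["NAME"].map(school_ratings)
print(df)
```

Any district missing from the dictionary shows up as NaN in the new column, which makes typos or mismatched names easy to spot before fitting the model.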

This is all I have time to cover in the scope of this blog. Please feel free to check out the full project by visiting my GitHub. Also, please reach out to me with any questions or suggestions and I will do my best to respond in a timely manner. Thank you!
