Complex Data Loading: Basic ETL Tasks with Redfin Real Estate Data and Python

Using precise data aggregated from across the US to develop better real estate predictions and a more accurate snapshot of a local market or region.

This project is a modification of a similar analytics project by Analytics with Ariel; the GitHub repository for her original project is linked here.

Python makes a great tool for data loading and processing, especially when the data arrives in a format that isn't easily readable or parsed into established categories. Because it integrates seamlessly with web scraping tools, Python lets users maintain consistent database connectivity and run cross-source analyses that keep market understanding up to date. Pandas excels at organizing and cleaning data and is well suited to the complex spreadsheets and databases common in mixed quantitative-qualitative analyses (prevalent in real estate research). Matplotlib and Seaborn facilitate the creation of powerful visualizations, enabling analysts to pinpoint trends and patterns in property values, rental rates, and market fluctuations. Finally, Python's statistical and machine learning libraries (Scikit-learn, TensorFlow, PyTorch, Statsmodels) provide the architecture needed for predictive modeling and forecasting, allowing for more accurate projections and risk analysis.

This project focuses primarily on ingesting data from Redfin's Data Center to draw insights about the market as a whole and to allow for multiregional drilldowns, so a user can assess a specific ZIP code, county, state, or region.

Redfin is a real estate brokerage, meaning it has direct access to data from local multiple listing services, as well as insight from a vast number of real estate agents across the US. Because of this reach, Redfin has the resources to aggregate data across hundreds of domestic real estate markets, allowing for precise and timely predictions. Using the data made available for free, users can visualize and download housing market data for hundreds of metropolitan areas, cities, neighborhoods, and ZIP codes.

To pull data from Redfin's Data Center, open the link here and follow these steps:

  • Choose the category of data you need for analysis.

  • Beneath each tab, refine the results by metropolitan area, property type, month-over-month change, year-over-year change, and date range.

  • Select the visualization, then press the download button located at the lower right-hand corner to obtain the displayed data.
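
Rather than clicking through the UI for every refresh, the same extracts can be pulled programmatically. Below is a minimal sketch using pandas; the URL points at the ZIP-code market tracker file that backs the Data Center download at the time of writing, and may change, so treat it as illustrative rather than a stable API:

```python
import time

import pandas as pd

# Redfin publishes its market-tracker extracts as gzipped TSVs. This URL
# backed the Data Center's ZIP-code download at the time of writing and
# may change, so treat it as illustrative.
URL = (
    "https://redfin-public-data.s3-us-west-2.amazonaws.com/"
    "redfin_market_tracker/zip_code_market_tracker.tsv000.gz"
)

start = time.time()
# The extract is tab-separated and gzip-compressed; pandas handles both.
df = pd.read_csv(URL, sep="\t", compression="gzip")
print(f"Loaded {len(df):,} rows in {time.time() - start:.1f}s")
```

On a memory-constrained machine, passing `chunksize=` to `read_csv` and filtering each chunk down to the target state keeps the footprint manageable.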

Below is the GitHub Gist for this project. Given that the Redfin data feed can be massive, I chose to filter down to my state's market (Virginia). While this may seem like a simple data ingestion task, extracting, transforming, and loading the data can require some serious computing power, and it can seriously hamper a machine's performance if the proper steps are not heeded and parameters are not set. At a high level, the code is organized as follows:

  • Import - bring in all necessary packages and set warning filters

  • Optional: add more RAM to the machine (if using Colab Pro)

  • Data - set the device and the applicable URL (from Redfin), and set timer parameters to measure data-load performance

  • Transformations - filter by state (in my case, VA), read in an external ZIP-code reference, check whether any invalid ZIP codes are throwing errors, filter to valid ZIP codes for the chosen state, and join the data (sketched just after this list)

  • Add Features - add timeframe variables, flag the most recent date, replace the Redfin city name with the external city name, and fix median days on market (second sketch below)

  • Output - name the file by date_time and download it locally (final sketch below)
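
A minimal sketch of the transformation step, assuming the feed loaded earlier sits in `df`. The Redfin column names (`region`, `state_code`) match the ZIP-code market tracker at the time of writing, and the external ZIP reference (`zip_code_database.csv` with `zip`, `primary_city`, and `state` columns) is a hypothetical stand-in for whatever source the Gist actually uses:

```python
# Filter the national feed down to one state's market (VA here).
va = df[df["state_code"] == "VA"].copy()

# Redfin labels ZIP-level regions like "Zip Code: 22314"; pull out the digits.
va["zip_code"] = va["region"].str.extract(r"(\d{5})", expand=False)

# Hypothetical external ZIP reference with zip, primary_city, state columns.
zips = pd.read_csv("zip_code_database.csv", dtype={"zip": str})
valid_va_zips = set(zips.loc[zips["state"] == "VA", "zip"])

# Surface unmatched ZIP codes before joining, so bad rows can be inspected
# rather than silently dropped.
invalid = va.loc[~va["zip_code"].isin(valid_va_zips), "zip_code"].unique()
print(f"{len(invalid)} unmatched ZIP codes:", invalid[:10])

# Keep only valid ZIPs for the chosen state, then join the external data.
va = va[va["zip_code"].isin(valid_va_zips)]
va = va.merge(
    zips[["zip", "primary_city"]], left_on="zip_code", right_on="zip", how="left"
)
```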
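The feature step can then look something like this. The column names (`period_begin`, `median_dom`, `city`) follow the Redfin tracker as of writing; the exact fixes applied in the Gist may differ:

```python
# Timeframe variables derived from the reporting period.
va["period_begin"] = pd.to_datetime(va["period_begin"])
va["year"] = va["period_begin"].dt.year
va["month"] = va["period_begin"].dt.month

# Flag the most recent reporting period for current-snapshot views.
va["is_most_recent"] = va["period_begin"] == va["period_begin"].max()

# Prefer the external city name over Redfin's label.
va = va.drop(columns=["city"], errors="ignore").rename(
    columns={"primary_city": "city"}
)

# Median days on market can arrive with stray values; coerce to numeric
# so bad entries become NaN instead of breaking downstream math.
va["median_dom"] = pd.to_numeric(va["median_dom"], errors="coerce")
```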
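Finally, the output step: naming the file by run timestamp keeps successive loads from overwriting each other. The Colab call is only needed when running in a hosted notebook:

```python
from datetime import datetime

# Name the output by date_time so repeated runs never collide.
filename = f"redfin_va_{datetime.now():%Y%m%d_%H%M%S}.csv"
va.to_csv(filename, index=False)

# In Colab, pull the file down to the local machine:
# from google.colab import files
# files.download(filename)
```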
