Python Automated Real Estate Data Pipeline Project: Part 4

In this, the final part of our real estate data pipeline project, we’re adding automation and historical tracking to ensure that our scraper runs on schedule and accumulates data over time.

This automation allows us to track long-term trends in the real estate market without manual intervention.

Recap: What We've Done So Far

Before diving into automation, let’s briefly recap what we’ve built:

- Part 1: Web Scraping – in part 1, we developed a web scraper using Selenium to extract real estate data from Redfin, collecting essential property details like price, address, beds/baths, and square footage.

- Part 2: Data Cleaning & Analysis – In part 2, after gathering the data, we cleaned and structured it using Pandas, handling missing values, converting data types, and performing exploratory data analysis to uncover real estate trends.

- Part 3: Interactive Dashboard – In part 3 we built a Streamlit-based dashboard to visualize property listings, allowing users to explore properties dynamically using filters, price distributions, and an interactive map.

Now, in Part 4, we’ll take our project to the next level by automating the entire process, enabling real-time tracking of historical price trends and listing activity.

Part 4: Automating and Tracking Historical Data

After completing Part 3 of our Python project, we successfully built an interactive Streamlit dashboard that allows users to explore and analyze Hollywood Hills real estate dynamically. However, to make our real estate pipeline truly powerful, we now introduce automation and historical tracking.

What We’ll Cover:

- Automating the scraper and analysis – Scheduling automated runs using APScheduler.

- Enhancing the scraper – Storing timestamped data to track property trends over time.

- Processing historical data – Cleaning and structuring multiple datasets for long-term analysis.

- Visualizing price trends – Adding historical price tracking to our dashboard.

- Integrating historical filtering – Allowing users to compare listings over time.

By the end of this tutorial, you’ll have a fully automated real estate data pipeline that scrapes data daily, processes and cleans it, and displays historical insights in an interactive dashboard.

Prerequisites

Don’t worry if you’re not a Python expert, but before diving into the code, checking that you have these skills under your belt will make this journey smoother and more enjoyable.

- Basic Python knowledge – Understanding functions, loops, and file handling.

- Familiarity with Pandas – Loading, cleaning, and analyzing datasets.

- Basic web scraping concepts – Using Selenium or BeautifulSoup.

- Understanding of Streamlit – Running a basic Streamlit app.

- Completed Parts 1-3 - it's going to be much easier to follow the final stage if you've completed the previous steps.

Required Libraries

We’ll use the following Python packages:

Library	Purpose
`pandas`	Load, clean, and process data
`numpy`	Handle numerical transformations and calculations
`matplotlib`	Create plots & data visualizations
`seaborn`	Enhance statistical plots
`streamlit`	Build an interactive web-based dashboard
`folium`	Render interactive maps for property locations
`streamlit-folium`	Integrate Folium maps into Streamlit
`selenium`	Automate web scraping of real estate data
`apscheduler`	Schedule automated scraping and analysis jobs

Install Dependencies

To install the required libraries, run:

pip install pandas numpy matplotlib seaborn streamlit folium streamlit-folium selenium apscheduler webdriver-manager

Set Up Your Project

Before we start coding, let’s set up the project:

1️- Ensure Python is installed on your system. If not, download it from the official Python website.

2️ - Open your favorite code editor or IDE (e.g., VS Code, PyCharm, Jupyter Notebook).

3️ - Create four new Python files inside your project folder:

scheduler.py (Handles automation and scheduling)
scraper.py (Scrapes and saves real estate data with timestamps)
analyze.py (Processes historical data for trends)
dashboard.py (Builds an interactive visualization with Streamlit)

4️ - Install the required libraries using the pip install command above.

Now that everything is set up, let's dive head first into our Python editor and build our fully automated real estate pipeline!

Why Automate?

Without automation, we would have to run the scraper and analysis scripts manually each day. Automation helps us:

- Ensure timely data collection – The scraper runs automatically at a set time.

- Process and clean data without intervention – The analysis script runs right after scraping.

- Enable long-term trend tracking – Data is collected daily for historical comparison.

- Reduce manual workload – No need to remember to run scripts daily.

With APScheduler, we can schedule our scraping and analysis jobs to run at predefined times every day.

How APScheduler Works

APScheduler (Advanced Python Scheduler) is a powerful Python library that allows us to schedule and run jobs at specific times. In scheduler.py, we use the BlockingScheduler, which runs indefinitely and executes our tasks at scheduled times.

To install APScheduler, run:

pip install apscheduler

Implementing the Scheduler

Below is the full scheduler.py script:

from apscheduler.schedulers.blocking import BlockingScheduler
import subprocess
import logging

# Setup logging
logging.basicConfig(
    filename="logs/scheduler.log",
    level=logging.INFO,
    format="%(asctime)s - %(levelname)s - %(message)s"
)

scheduler = BlockingScheduler()

# Run scraper daily at 6 AM
def scrape_job():
    logging.info("🚀 Running scraper job...")
    result = subprocess.run(["python3", "scraper.py"], capture_output=True, text=True)
    if result.stdout:
        logging.info(result.stdout)
    if result.stderr:
        logging.error(result.stderr)  # Logs only if an error occurs

# Run analysis daily at 7 AM (after scraping)
def analyze_job():
    logging.info("📊 Running analysis job...")
    result = subprocess.run(["python3", "analyze.py"], capture_output=True, text=True)
    if result.stdout:
        logging.info(result.stdout)
    if result.stderr:
        logging.error(result.stderr)

# Schedule tasks
scheduler.add_job(scrape_job, "cron", hour=6)
scheduler.add_job(analyze_job, "cron", hour=7)

# Start the scheduler
logging.info("🕒 Scheduler started...")
scheduler.start()

Breaking Down the Scheduler

1️ Setting Up Logging

The script initializes a logging system to record when the scheduler starts, when jobs run, and whether they succeed or fail.

logging.basicConfig(
    filename="logs/scheduler.log",
    level=logging.INFO,
    format="%(asctime)s - %(levelname)s - %(message)s"
)

- Logs events and errors for debugging.

- Saves logs in logs/scheduler.log for later inspection.

2️ Scheduling the Scraper Job

def scrape_job():
    logging.info("🚀 Running scraper job...")
    result = subprocess.run(["python3", "scraper.py"], capture_output=True, text=True)
    if result.stdout:
        logging.info(result.stdout)
    if result.stderr:
        logging.error(result.stderr)  # Logs only if an error occurs

- Uses subprocess.run() to execute scraper.py as an external process.

- Captures output (stdout) and errors (stderr) to log them.

3️ Scheduling the Analysis Job

def analyze_job():
    logging.info("📊 Running analysis job...")
    result = subprocess.run(["python3", "analyze.py"], capture_output=True, text=True)
    if result.stdout:
        logging.info(result.stdout)
    if result.stderr:
        logging.error(result.stderr)

- Runs analyze.py (ensuring scraping has completed first).

- Captures and logs output/errors.

4️ Adding Jobs to APScheduler

scheduler.add_job(scrape_job, "cron", hour=6)
scheduler.add_job(analyze_job, "cron", hour=7)

- scheduler.add_job() schedules both jobs using the cron trigger, which allows us to specify exact execution times.

- The scraper is set to run at 6AM daily

- The analyzer is set to run at 7AM daily

5️ Running the Scheduler

scheduler.start()

- Starts the BlockingScheduler, which runs indefinitely and executes jobs at scheduled times.

- Ensures that tasks run even if the system stays on for long periods.

Running the Scheduler

To start the scheduler, run:

python scheduler.py

The script will run indefinitely, executing the scraper at 6 AM and the analysis at 7 AM each day.

If you need to stop it, press CTRL + C.

Key Takeaways

- Automates data collection & analysis – No need to manually run scripts.

- Runs daily at specified times – Uses APScheduler to schedule scraping and analysis jobs.

- Logs all activity – Captures errors and outputs for debugging.

- Prepares data for historical tracking – Enables the next phase of our dashboard enhancements.

With scheduler.py in place, our project can now autonomously collect and analyze real estate data every day.

Enhancing the Scraper

Now that we’ve scheduled the automation, let’s circle back around to the scraper.py script that we created back in Part 1. The goal here is to enhance our scraping process by integrating logging, timestamps, and historical data storage.

Why Enhance the Scraper?

The standard scraper we built in Part 1 needs to be updated for:

- Logging – Capturing errors and successes for debugging.

- Timestamped Storage – Saving data with a date to track trends.

- Automation-Friendly Execution – Ensuring smooth integration with the scheduler.

Implementing the Enhanced Scraper

Below is the updated scraper.py with logging, timestamped storage, and automation-friendly execution.

import os
import random
import json
import logging
import pandas as pd
import time
from datetime import datetime
from selenium import webdriver
from selenium.webdriver.firefox.service import Service
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.firefox.firefox_profile import FirefoxProfile
from webdriver_manager.firefox import GeckoDriverManager

# Setup logging
logging.basicConfig(
    filename="logs/scraper.log",
    level=logging.INFO,
    format="%(asctime)s - %(levelname)s - %(message)s"
)

def scrape():
    try:
        logging.info("🔄 Starting Redfin Scraper...")
        
        # Define a list of User-Agents to rotate
        USER_AGENTS = [
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36",
            "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36",
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36",
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Firefox/119.0",
        ]

        # OPTIONAL - Update with your actual Firefox profile path
        profile_path = "/home/your_name/.mozilla/firefox/abcdefgh.your-profile"

        # Setup Firefox options for stealth mode
        options = Options()
        options.profile = FirefoxProfile(profile_path) # OPTIONAL
        options.add_argument("--headless")
        options.add_argument("--disable-gpu")
        options.set_preference("dom.webdriver.enabled", False)
        options.set_preference("useAutomationExtension", False)

        # Randomly select a User-Agent
        user_agent = random.choice(USER_AGENTS)
        options.set_preference("general.useragent.override", user_agent)

        # Use WebDriver Manager to install Geckodriver
        service = Service(GeckoDriverManager().install())
        driver = webdriver.Firefox(service=service, options=options)

        # Redfin search URL
        base_url = "https://www.redfin.com/neighborhood/547223/CA/Los-Angeles/Hollywood-Hills"
        scraped_data = []
        today = datetime.today().strftime("%Y-%m-%d")

        driver.get(base_url)
        time.sleep(random.uniform(5, 8))  # Random delay
        
        # Scraping logic (Extracting listings and saving data) from Part 1 
        #...

        df = pd.DataFrame(scraped_data)
        df["Date"] = today  # Add date for historical tracking

        # Save daily CSV
        daily_filename = f"data/redfin_hollywood_hills_{today}.csv"
        df.to_csv(daily_filename, index=False)
        logging.info(f"Saved daily data: {daily_filename}")

        # Append to master dataset
        master_filename = "data/redfin_hollywood_hills_master.csv"
        if os.path.exists(master_filename):
            master_df = pd.read_csv(master_filename)
            df = pd.concat([master_df, df]).drop_duplicates(subset=["Address", "Date"])
        df.to_csv(master_filename, index=False)
        logging.info(f"Updated master dataset: {master_filename}")

        driver.quit()
        exit(0)  # Exit successfully

    except Exception as e:
        logging.error(f"❌ Scraper failed: {e}")
        exit(1)  # Exit with failure

if __name__ == "__main__":
    scrape()

Key Takeaways

By enhancing our scraper with logging, timestamps, and historical storage, we’ve transformed it from a basic script into a robust data pipeline. Now, it:

- Tracks execution via logging for debugging and monitoring.

- Stores data with timestamps, allowing for trend analysis.

- Maintains a master dataset, making long-term analysis easier.

- Integrates seamlessly with automation, ensuring reliability.

With these improvements, our real estate data scraper is now better equipped for long-term, automated execution.

Let’s go through each of these enhancements in detail.

1️ Adding Logging

Logging is crucial for tracking the scraper’s execution, especially when running it in an automated environment where you may not always be watching it run. We’ve set up logging to capture both successes and errors.

import logging

# Setup logging
logging.basicConfig(
    filename="logs/scraper.log",
    level=logging.INFO,
    format="%(asctime)s - %(levelname)s - %(message)s"
)

Here’s how logging benefits us:

- Records every major event (start, successful data save, scraper failures)

- Captures errors when they occur, making debugging easier

- Saves log output to a file (logs/scraper.log) for future reference

2️ Saving Data with Timestamps

To enable trend analysis, we need to timestamp our data. Each day’s scraped data is saved in a separate file, ensuring we retain a historical record. We do this by:

Getting today’s date:

from datetime import datetime

today = datetime.today().strftime("%Y-%m-%d")

Naming the daily file dynamically based on the date:

daily_filename = f"data/redfin_hollywood_hills_{today}.csv"
df.to_csv(daily_filename, index=False)
logging.info(f"Saved daily data: {daily_filename}")

With this approach:

- Each day’s data is stored in a new file (redfin_hollywood_hills_YYYY-MM-DD.csv)

- We can track how property listings change over time

3️ Maintaining a Master CSV File for Historical Data

In addition to daily files, we maintain a master dataset to keep a cumulative record. If the file exists, we append new data to it while ensuring no duplicate entries.

master_filename = "data/redfin_hollywood_hills_master.csv"
if os.path.exists(master_filename):
    master_df = pd.read_csv(master_filename)
    df = pd.concat([master_df, df]).drop_duplicates(subset=["Address", "Date"])
df.to_csv(master_filename, index=False)
logging.info(f"Updated master dataset: {master_filename}")

This enhancement allows us to:

- Preserve all historical data in a single file

- Avoid duplicate listings (based on Address and Date)

- Ensure efficient data management for long-term trend analysis

4️ Ensuring Automation-Friendly Execution

Since the scraper will run on a schedule, we make sure it exits gracefully on success (exit(0)) or failure (exit(1)).

try:
    ...
    driver.quit()
    exit(0)
except Exception as e:
    logging.error(f"❌ Scraper failed: {e}")
    exit(1)

This ensures:

- The scraper properly signals success or failure

- Scheduled tasks can detect failures and retry if needed

Enhancing Data Analysis

Now that we’ve improved the scraper, let’s enhance our analyze.py script from part 2 where we process, clean, and analyze the scraped data for long-term insights.

Why Enhance Data Analysis?

With historical tracking enabled, we need to:

- Handle missing values – Ensure all data is structured correctly.

- Standardize formats – Convert price, square footage, and other attributes to numerical formats.

- Generate summary statistics – Provide insights into price trends and listing activity over time.

- Create trend visualizations – Help users understand long-term shifts in the real estate market.

Implementing the Enhanced Data Analysis

Below is the updated analyze.py with improved data cleaning, historical tracking, and trend visualizations.

import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import logging
from datetime import datetime

# Setup logging
logging.basicConfig(
    filename="logs/analyze.log",
    level=logging.INFO,
    format="%(asctime)s - %(levelname)s - %(message)s"
)

# Load dataset (specific date or full historical dataset)
def load_data(date=None):
    if date:
        file_path = f"data/redfin_hollywood_hills_{date}.csv"
    else:
        file_path = "data/redfin_hollywood_hills_master.csv"

    if os.path.exists(file_path):
        df = pd.read_csv(file_path)
        df["Date"] = pd.to_datetime(df["Date"], errors="coerce")
        logging.info(f"📊 Loaded data from {file_path}")
        return df
    else:
        logging.warning(f"⚠️ No data found for {date or 'historical records'}!")
        return pd.DataFrame()

# Handle missing values & clean data
def clean_data(df):
    logging.info("🛠 Cleaning Data...")

    # Replace invalid values
    df.replace({"—": np.nan, "N/A": np.nan, "": np.nan}, inplace=True)

    # Convert numeric columns
    df["Price"] = df["Price"].str.replace("[$,]", "", regex=True).astype(float)
    df["SqFt"] = df["SqFt"].str.replace(",", "", regex=True).astype(float)
    df["Beds"] = df["Beds"].str.extract("(\d+)").astype(float)
    df["Baths"] = df["Baths"].str.extract("(\d+)").astype(float)

    # Convert Latitude & Longitude to float
    df["Latitude"] = pd.to_numeric(df["Latitude"], errors="coerce")
    df["Longitude"] = pd.to_numeric(df["Longitude"], errors="coerce")

    # Drop rows missing essential values
    df.dropna(subset=["Price", "Beds", "Baths", "SqFt", "Latitude", "Longitude"], inplace=True)

    logging.info(f"✅ Cleaned data: {len(df)} valid listings remaining.")
    return df

# Generate summary statistics
def summarize_data(df):
    summary = df.describe()
    logging.info(f"\n📊 Summary Statistics:\n{summary}")

# Plot price trends over time
def plot_price_trend(df):
    plt.figure(figsize=(12, 6))
    sns.lineplot(x="Date", y="Price", data=df, estimator="mean", ci=None)
    plt.title("📈 Average Listing Price Over Time")
    plt.xlabel("Date")
    plt.ylabel("Average Price ($)")
    plt.xticks(rotation=45)
    plt.grid()
    plt.savefig("plots/price_trend.png")
    logging.info("📊 Saved price trend plot.")

# Show the number of listings over time
def plot_listings_trend(df):
    plt.figure(figsize=(12, 6))
    df.groupby("Date").size().plot(kind="line", marker="o")
    plt.title("📉 Number of Listings Over Time")
    plt.xlabel("Date")
    plt.ylabel("Listings Count")
    plt.xticks(rotation=45)
    plt.grid()
    plt.savefig("plots/listings_trend.png")
    logging.info("📊 Saved listings trend plot.")

# Show top & bottom listings
def show_extreme_listings(df):
    logging.info("\n💰 Top 5 Most Expensive Listings:")
    logging.info(df.nlargest(5, "Price").to_string())

    logging.info("\n💸 Top 5 Cheapest Listings:")
    logging.info(df.nsmallest(5, "Price").to_string())

# Save cleaned data & append to master dataset
def save_cleaned_data(df, date):
    cleaned_filename = f"data/redfin_hollywood_hills_cleaned_{date}.csv"
    df.to_csv(cleaned_filename, index=False)
    logging.info(f"✅ Saved cleaned daily data: {cleaned_filename}")

    # Append to master dataset
    master_filename = "data/redfin_hollywood_hills_master_cleaned.csv"
    if os.path.exists(master_filename):
        master_df = pd.read_csv(master_filename)
        df = pd.concat([master_df, df]).drop_duplicates(subset=["Address", "Date"])

    df.to_csv(master_filename, index=False)
    logging.info(f"✅ Updated master dataset: {master_filename}")

# Run analysis
def run_analysis(date=None):
    try:
        logging.info(f"\n🚀 Running Analysis for {date or 'historical data'}...\n")
        df = load_data(date)

        if df.empty:
            logging.warning("⚠️ No data available to analyze.")
            return False

        df = clean_data(df)
        summarize_data(df)

        if date:
            save_cleaned_data(df, date)

        if not date:
            plot_price_trend(df)
            plot_listings_trend(df)

        show_extreme_listings(df)
        logging.info("✅ Analysis complete.\n")
        return True

    except Exception as e:
        logging.error(f"❌ Analysis failed: {e}")
        return False

# Run the script automatically
if __name__ == "__main__":
    today = datetime.today().strftime("%Y-%m-%d")
    run_analysis(today)  # Run for today’s data
    run_analysis()       # Run historical analysis

Key Takeaways

The modifications we made to analyze.py include:

- Logging: To track execution and debugging

- Dynamic Data Loading: Allowing analysis for a specific date or the full dataset

- Data Cleaning: Handling missing values and standardizing formats

- Historical Data Tracking: Enabling long-term trend analysis

- Visualization: Generating graphical insights into real estate trends

- Extreme Listings: Identifying the most and least expensive properties

- Automated Data Saving: Storing cleaned data efficiently

Let’s go through each of these enhancements in detail.

1️ Adding Logging

Logging is essential for tracking the execution of our analysis. It captures important events such as data loading, cleaning, and processing steps.

import logging

# Setup logging
logging.basicConfig(
    filename="logs/analyze.log",
    level=logging.INFO,
    format="%(asctime)s - %(levelname)s - %(message)s"
)

With this setup:

- All actions are logged for debugging and monitoring.

- Errors are captured, allowing easier troubleshooting.

- Timestamps are included, giving a clear sequence of events.

2️ Handling Missing Values and Cleaning Data

Cleaning data ensures that only valid records are analyzed. We replace missing or invalid values and convert key columns into appropriate data types.

def clean_data(df):
    logging.info("🛠 Cleaning Data...")
    df.replace({"—": np.nan, "N/A": np.nan, "": np.nan}, inplace=True)
    df["Price"] = df["Price"].str.replace("[$,]", "", regex=True).astype(float)
    df["SqFt"] = df["SqFt"].str.replace(",", "", regex=True).astype(float)
    df["Beds"] = df["Beds"].str.extract("(\d+)").astype(float)
    df["Baths"] = df["Baths"].str.extract("(\d+)").astype(float)
    df.dropna(subset=["Price", "Beds", "Baths", "SqFt", "Latitude", "Longitude"], inplace=True)
    logging.info(f"✅ Cleaned data: {len(df)} valid listings remaining.")
    return df

This function:

- Replaces missing values with NaN.

- Converts prices and square footage to float values.

- Extracts numeric values from bedroom and bathroom columns.

- Removes incomplete listings, ensuring data integrity.

3️ Historical Data Tracking and Analysis

This function processes the data and generates key insights.

def run_analysis(date=None):
    logging.info(f"🚀 Running Analysis for {date or 'historical data'}...")
    df = load_data(date)
    if df.empty:
        logging.warning("⚠️ No data available to analyze.")
        return False
    df = clean_data(df)
    summarize_data(df)
    if date:
        save_cleaned_data(df, date)
    if not date:
        plot_price_trend(df)
        plot_listings_trend(df)
    show_extreme_listings(df)
    logging.info("✅ Analysis complete.")
    return True

This function:

- Loads and cleans the data before analysis.

- Generates summary statistics for quick insights.

- Saves cleaned data for future use.

- Visualizes trends if analyzing historical data.

4️ Trend Visualization

To make data easier to interpret, we generate interactive visualizations.

def plot_price_trend(df):
    plt.figure(figsize=(12, 6))
    sns.lineplot(x="Date", y="Price", data=df, estimator="mean", ci=None)
    plt.title("📈 Average Listing Price Over Time")
    plt.xlabel("Date")
    plt.ylabel("Average Price ($)")
    plt.xticks(rotation=45)
    plt.grid()
    plt.savefig("plots/price_trend.png")
    logging.info("📊 Saved price trend plot.")

def plot_listings_trend(df):
    plt.figure(figsize=(12, 6))
    df.groupby("Date").size().plot(kind="line", marker="o")
    plt.title("📉 Number of Listings Over Time")
    plt.xlabel("Date")
    plt.ylabel("Listings Count")
    plt.xticks(rotation=45)
    plt.grid()
    plt.savefig("plots/listings_trend.png")
    logging.info("📊 Saved listings trend plot.")

These functions:

- Show price trends over time, helping users see market movements.

- Track the number of listings, giving insights into market activity.

- Save plots for further analysis and reporting.

5️ Extreme Listings Identification

To highlight significant market points, we find the highest and lowest-priced listings.

def show_extreme_listings(df):
    logging.info("\n💰 Top 5 Most Expensive Listings:")
    logging.info(df.nlargest(5, "Price").to_string())
    logging.info("\n💸 Top 5 Cheapest Listings:")
    logging.info(df.nsmallest(5, "Price").to_string())

This function:

- Identifies the top 5 most expensive and cheapest properties.

- Provides quick insights into market extremes.

6️ Automated Data Saving

To maintain historical data, we save cleaned datasets efficiently.

def save_cleaned_data(df, date):
    cleaned_filename = f"data/redfin_hollywood_hills_cleaned_{date}.csv"
    df.to_csv(cleaned_filename, index=False)
    logging.info(f"✅ Saved cleaned daily data: {cleaned_filename}")
    master_filename = "data/redfin_hollywood_hills_master_cleaned.csv"
    if os.path.exists(master_filename):
        master_df = pd.read_csv(master_filename)
        df = pd.concat([master_df, df]).drop_duplicates(subset=["Address", "Date"])
    df.to_csv(master_filename, index=False)
    logging.info(f"✅ Updated master dataset: {master_filename}")

This function:

- Stores daily cleaned data in a structured format.

- Appends new data to a master dataset, maintaining historical records.

- Prevents duplicate entries, keeping the dataset clean.

Enhancing the Dashboard

Now that we have automated data collection and analysis, the final step is integrating historical tracking into our Streamlit dashboard from part 3. This will allow users to:

- Filter listings by date – View property listings from specific days in the past.

- Analyze long-term trends – Use historical data to see price changes and market shifts.

- Visualize price trends – Display historical price movements using interactive charts.

- Compare listings over time – Track individual properties and how their prices have changed.

Implementing the Enhanced Dashboard

Below is the updated dashboard.py, which includes a tabbed view, date selection dropdown, and historical price trends.

import streamlit as st
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick
from matplotlib.ticker import FuncFormatter
import seaborn as sns
import folium
from streamlit_folium import st_folium
import numpy as np
import glob
import os
import re

# Fetch available dates from stored CSV files
def get_available_dates():
    files = glob.glob("data/redfin_hollywood_hills_*.csv")
    
    # Extract valid dates from filenames (YYYY-MM-DD format)
    date_pattern = re.compile(r"redfin_hollywood_hills_(\d{4}-\d{2}-\d{2})\.csv")
    dates = sorted(
        {date_pattern.search(f).group(1) for f in files if date_pattern.search(f)},
        reverse=True
    )
    return dates

# Load selected date's data
@st.cache_data
def load_data(selected_date=None):
    if not selected_date:
        st.error("❌ No date selected.")
        return pd.DataFrame()
    
    file_path = f"data/redfin_hollywood_hills_cleaned_{selected_date}.csv"
    if not os.path.exists(file_path):
        st.error(f"❌ Cleaned data file not found: {file_path}")
        return pd.DataFrame()

    try:
        df = pd.read_csv(file_path)

        # 🛠 Handle missing values & clean numeric columns
        df.replace({'—': np.nan, 'N/A': np.nan, '': np.nan}, inplace=True)
        
        df["Price"] = pd.to_numeric(df["Price"], errors="coerce")
        df["Beds"] = pd.to_numeric(df["Beds"], errors="coerce")
        df["Baths"] = pd.to_numeric(df["Baths"], errors="coerce")
        df["SqFt"] = pd.to_numeric(df["SqFt"], errors="coerce")
        df["Latitude"] = pd.to_numeric(df["Latitude"], errors="coerce")
        df["Longitude"] = pd.to_numeric(df["Longitude"], errors="coerce")

        df.dropna(subset=["Price", "Beds", "Baths", "SqFt", "Latitude", "Longitude"], inplace=True)

        return df

    except FileNotFoundError:
        st.error(f"❌ Data file not found: {file_path}")
        return pd.DataFrame()

# Load historical dataset
@st.cache_data
def load_historical_data():
    master_file = "data/redfin_hollywood_hills_master_cleaned.csv"
    try:
        df = pd.read_csv(master_file)
        
        df["Price"] = pd.to_numeric(df["Price"], errors="coerce")
        df["Beds"] = pd.to_numeric(df["Beds"], errors="coerce")
        df["Baths"] = pd.to_numeric(df["Baths"], errors="coerce")
        df["SqFt"] = pd.to_numeric(df["SqFt"], errors="coerce")
        df["Date"] = pd.to_datetime(df["Date"], errors="coerce")
        
        df.dropna(subset=["Price", "Beds", "Baths", "SqFt", "Date"], inplace=True)

        return df

    except FileNotFoundError:
        st.error(f"❌ Historical data file not found.")
        return pd.DataFrame()

# Streamlit UI
st.title("🏡 Hollywood Hills Real Estate Dashboard")
st.write("Analyze real estate trends in Hollywood Hills using interactive visualizations.")

# Create Tabs
tab1, tab2 = st.tabs(["📆 Listings by Date", "📈 Historical Trends"])

# TAB 1: Listings by Date
with tab1:
    st.subheader("📆 View Listings by Date")

    # Dropdown to select a date
    available_dates = get_available_dates()
    selected_date = st.selectbox("Select Date", available_dates)

    # Load selected day's data
    df = load_data(selected_date)

    if df.empty:
        st.warning("⚠️ No data available for the selected date.")
        st.stop()

    # Sidebar Filters
    st.sidebar.header("🔍 Filter Listings")

    show_all = st.sidebar.checkbox("Show All Properties", value=False)

    if show_all:
        filtered_df = df
    else:
        min_price_value = max(1000, df["Price"].min()) if not np.isnan(df["Price"].min()) else 1000
        max_price_value = np.ceil(df["Price"].max() / 1_000_000) * 1_000_000 if not np.isnan(df["Price"].max()) else 1_000_000

        min_price, max_price = st.sidebar.slider(
            "Select Price Range ($)", 
            min_value=int(min_price_value), 
            max_value=int(max_price_value), 
            value=(int(min_price_value), int(max_price_value)),
            format="$%d",
            key="main_slider"
        )

        min_beds, max_beds = 1, int(np.ceil(df["Beds"].max() / 5) * 5)
        min_baths, max_baths = 1, int(np.ceil(df["Baths"].max() / 5) * 5)
        min_sqft, max_sqft = 100, int(np.ceil(df["SqFt"].max() / 10_000) * 10_000)

        selected_beds = st.sidebar.slider("Bedrooms", min_beds, max_beds, (1, 5))
        selected_baths = st.sidebar.slider("Bathrooms", min_baths, max_baths, (1, 5))
        selected_sqft = st.sidebar.slider("Square Footage", min_sqft, max_sqft, (500, 5000))

        filtered_df = df[
            (df["Price"] >= min_price) & (df["Price"] <= max_price) &
            (df["Beds"] >= selected_beds[0]) & (df["Beds"] <= selected_beds[1]) &
            (df["Baths"] >= selected_baths[0]) & (df["Baths"] <= selected_baths[1]) &
            (df["SqFt"] >= selected_sqft[0]) & (df["SqFt"] <= selected_sqft[1])
        ]

    st.subheader(f"📊 {len(filtered_df)} Listings Found")
    # Create an interactive table where users can select a row
    selected_rows = st.data_editor(
        filtered_df[["Price", "Beds", "Baths", "SqFt", "Address", "Link"]]
        .sort_values(by="Price", ascending=False)
        .reset_index(drop=True),
        use_container_width=True,  # Makes the table responsive
        height=400,
        num_rows="dynamic",
        hide_index=True,  # Hides the index column
        column_config={"Link": st.column_config.LinkColumn()},  # Make links clickable
        key="table_selection"
    )

    # Get selected address (if any)
    selected_address = selected_rows.iloc[0]["Address"] if not selected_rows.empty else None
    
    # Price Distribution
    st.subheader("💰 Price Distribution")
    with st.container():
        fig, ax = plt.subplots(figsize=(8, 4))
        
        # Plot histogram
        sns.histplot(filtered_df["Price"], bins=30, kde=True, ax=ax)

        # Format x-axis: Convert to millions and append "M"
        ax.xaxis.set_major_formatter(mtick.FuncFormatter(lambda x, _: f'${x/1_000_000:.0f}M'))

        ax.set_xlabel("Price ($)")
        ax.set_ylabel("Count")
        st.pyplot(fig)

    # Beds/Baths Analysis
    st.subheader("🛏️ Bedrooms & 🛁 Bathrooms Distribution")
    with st.container():
        fig, ax = plt.subplots(1, 2, figsize=(12, 5))

        sns.countplot(x="Beds", data=filtered_df, ax=ax[0], hue="Beds", palette="Blues", legend=False)
        ax[0].set_title("Number of Bedrooms")
        ax[0].set_xlabel("Beds")

        sns.countplot(x="Baths", data=filtered_df, ax=ax[1], hue="Baths", palette="Reds", legend=False)
        ax[1].set_title("Number of Bathrooms")
        ax[1].set_xlabel("Baths")

        st.pyplot(fig)

    st.subheader("📍 Property Locations")
    m = folium.Map(location=[34.1, -118.3], zoom_start=12)
    # Add markers for all filtered properties
    for _, row in filtered_df.iterrows():
        popup_info = f"""
        <b>{row['Address']}</b><br>
        Price: ${row['Price']:,.0f}<br>
        Beds: {row['Beds']}, Baths: {row['Baths']}<br>
        SqFt: {row['SqFt']:,}<br>
        <a href="{row['Link']}" target="_blank">View Listing</a>
        """

        # Check if the row matches the selected property
        icon_color = "red" if selected_address and row["Address"] == selected_address else "blue"

        folium.Marker(
            location=[row["Latitude"], row["Longitude"]],
            popup=popup_info,
            icon=folium.Icon(color=icon_color, icon="home"),
        ).add_to(m)

    # Display map with full width
    st_folium(m, width=800, height=500)

# TAB 2: Historical Trends
with tab2:
    st.subheader("📈 Historical Trends")

    historical_df = load_historical_data()
    if historical_df.empty:
        st.warning("⚠️ No historical data available.")
        st.stop()

    # Convert Pandas Timestamp to Python Date for Streamlit Slider
    min_date, max_date = historical_df["Date"].min().date(), historical_df["Date"].max().date()

    if min_date == max_date:
        st.warning(f"⚠️ Only one date available: {min_date}. No range to select.")
        selected_range = (min_date, max_date)
    else:
        selected_range = st.sidebar.slider(
            "Select Date Range",
            min_value=min_date,
            max_value=max_date,
            value=(min_date, max_date),
            format="YYYY-MM-DD",
            key="historical_slider"
        )

    # Ensure date filtering is using correct format
    filtered_historical_df = historical_df[
        (historical_df["Date"] >= pd.to_datetime(selected_range[0])) &
        (historical_df["Date"] <= pd.to_datetime(selected_range[1]))
    ]

    # Average Price Over Time
    st.subheader("📊 Average Price Over Time")
    avg_price_trend = filtered_historical_df.groupby("Date")["Price"].mean()
    fig, ax = plt.subplots(figsize=(10, 5))
    avg_price_trend.plot(ax=ax, marker="o", linestyle="-")
    ax.yaxis.set_major_formatter(FuncFormatter(lambda x, _: f"${x/1e6:.1f}M"))
    ax.set_ylabel("Average Price ($M)")
    ax.set_xlabel("Date")
    ax.set_title("Historical Trend: Average Price")
    st.pyplot(fig)

    # Number of Listings Over Time
    st.subheader("🏠 Number of Listings Over Time")
    listings_trend = filtered_historical_df.groupby("Date")["Address"].count()
    fig, ax = plt.subplots(figsize=(10, 5))
    listings_trend.plot(ax=ax, marker="o", linestyle="-", color="red")
    ax.set_ylabel("Number of Listings")
    ax.set_xlabel("Date")
    ax.set_title("Historical Trend: Number of Listings")
    st.pyplot(fig)

    # Filter Historical Trends
    st.sidebar.subheader("📊 Filter Historical Trends")
    min_date, max_date = historical_df["Date"].min().date(), historical_df["Date"].max().date()

    if min_date == max_date:
        st.warning(f"⚠️ Only one date available: {min_date}. No range to select.")
    else:
        selected_range = st.sidebar.slider(
            "Select Date Range",
            min_value=min_date,
            max_value=max_date,
            value=(min_date, max_date),
            format="YYYY-MM-DD"
        )

        # Convert selected_range back to Pandas Timestamp before filtering
        filtered_historical_df = historical_df[
            (historical_df["Date"] >= pd.to_datetime(selected_range[0])) &
            (historical_df["Date"] <= pd.to_datetime(selected_range[1]))
        ]

    # Summary Statistics
    st.subheader(f"📊 Summary for {selected_range[0]} to {selected_range[1]}")
    st.write(filtered_historical_df.describe())

st.write("Data Source: Redfin Scraper")

Key Takeaways

The modifications we made to dashboard.py include:

- Dynamic Date Selection: Allowing users to explore property listings from specific dates.

- Tab-Based Navigation: Introducing separate views for daily listings and historical trends.

- Interactive Filtering: Enabling users to refine searches by price, square footage, and more.

- Historical Trend Analysis: Providing long-term insights into price movements and listing activity.

- Interactive Visualizations: Enhancing data presentation with plots and interactive maps.

- Location-Based Insights: Displaying properties on an interactive map for better spatial analysis.

- User-Friendly Interface: Making data exploration seamless with responsive UI elements.

Let’s go through each of these enhancements in detail.

1️ Fetching Available Dates for Historical Data

To allow users to select data from different dates, we dynamically fetch available dates from stored CSV files.

Slider control to determine the date range shown for investigating historical real estate trends

import glob
import re

def get_available_dates():
    files = glob.glob("data/redfin_hollywood_hills_*.csv")
    date_pattern = re.compile(r"redfin_hollywood_hills_(\d{4}-\d{2}-\d{2})\.csv")
    dates = sorted(
        {date_pattern.search(f).group(1) for f in files if date_pattern.search(f)},
        reverse=True
    )
    return dates

This function:

Scans the data/ folder for all stored CSV files
Extracts dates from filenames using regex
Sorts the dates in descending order, making the most recent data available first

2️ Loading Data for a Selected Date

A dropdown in the dashboard allows users to select a date, dynamically loading the corresponding dataset.

import os
import pandas as pd
import numpy as np

def load_data(selected_date=None):
    if not selected_date:
        return pd.DataFrame()
    
    file_path = f"data/redfin_hollywood_hills_cleaned_{selected_date}.csv"
    if not os.path.exists(file_path):
        return pd.DataFrame()

    try:
        df = pd.read_csv(file_path)
        df.replace({'—': np.nan, 'N/A': np.nan, '': np.nan}, inplace=True)
        df["Price"] = pd.to_numeric(df["Price"], errors="coerce")
        df["Beds"] = pd.to_numeric(df["Beds"], errors="coerce")
        df["Baths"] = pd.to_numeric(df["Baths"], errors="coerce")
        df["SqFt"] = pd.to_numeric(df["SqFt"], errors="coerce")
        df["Latitude"] = pd.to_numeric(df["Latitude"], errors="coerce")
        df["Longitude"] = pd.to_numeric(df["Longitude"], errors="coerce")
        df.dropna(subset=["Price", "Beds", "Baths", "SqFt", "Latitude", "Longitude"], inplace=True)
        return df
    except FileNotFoundError:
        return pd.DataFrame()

This function ensures:

Correct file retrieval based on the user-selected date
Proper handling of missing values
Conversion of numerical columns for correct plotting and filtering

3️ Creating Tabs for Listings by Date and Historical Trends

The dashboard now includes two tabs:

Listings by Date: Users can explore listings for a selected day.
Historical Trends: Users can analyze long-term trends and market movements.

Tab controls to switch between listing data by date or historical trends

import streamlit as st

st.title("🏡 Hollywood Hills Real Estate Dashboard")
st.write("Analyze real estate trends in Hollywood Hills using interactive visualizations.")

# Create Tabs
tab1, tab2 = st.tabs(["📆 Listings by Date", "📈 Historical Trends"])

4️ Interactive Filtering for Listings by Date

Within the Listings by Date tab, users can filter properties based on:

Price Range
Number of Bedrooms and Bathrooms
Square Footage

with tab1:
    st.subheader("📆 View Listings by Date")
    available_dates = get_available_dates()
    selected_date = st.selectbox("Select Date", available_dates)
    df = load_data(selected_date)

    if df.empty:
        st.warning("⚠️ No data available for the selected date.")
        st.stop()

5️ Historical Trend Analysis

Within the Historical Trends tab, users can view market fluctuations over time.

Table summarizing historical statistics about real estate data

with tab2:
    st.subheader("📈 Historical Trends")
    historical_df = load_historical_data()
    if historical_df.empty:
        st.warning("⚠️ No historical data available.")
        st.stop()

Users can:

Select a date range to analyze trends.
View average price trends.
Analyze listing activity over time.

def load_historical_data():
    master_file = "data/redfin_hollywood_hills_master_cleaned.csv"
    try:
        df = pd.read_csv(master_file)
        df["Price"] = pd.to_numeric(df["Price"], errors="coerce")
        df["Beds"] = pd.to_numeric(df["Beds"], errors="coerce")
        df["Baths"] = pd.to_numeric(df["Baths"], errors="coerce")
        df["SqFt"] = pd.to_numeric(df["SqFt"], errors="coerce")
        df["Date"] = pd.to_datetime(df["Date"], errors="coerce")
        df.dropna(subset=["Price", "Beds", "Baths", "SqFt", "Date"], inplace=True)
        return df
    except FileNotFoundError:
        return pd.DataFrame()

This allows users to:

Analyze long-term price movements
View market shifts over time
Compare different periods for better insights

6️ Visualizing Trends with Interactive Charts

The dashboard now includes historical trend visualizations, allowing users to compare price trends and listing activity over time.

Average Price Over Time

Plot showing the historical trend of average real estate prices

st.subheader("📊 Average Price Over Time")
avg_price_trend = filtered_historical_df.groupby("Date")["Price"].mean()
fig, ax = plt.subplots(figsize=(10, 5))
avg_price_trend.plot(ax=ax, marker="o", linestyle="-")
ax.yaxis.set_major_formatter(FuncFormatter(lambda x, _: f"${x/1e6:.1f}M"))
ax.set_ylabel("Average Price ($M)")
ax.set_xlabel("Date")
ax.set_title("Historical Trend: Average Price")
st.pyplot(fig)

Number of Listings Over Time

st.subheader("🏠 Number of Listings Over Time")
listings_trend = filtered_historical_df.groupby("Date")["Address"].count()
fig, ax = plt.subplots(figsize=(10, 5))
listings_trend.plot(ax=ax, marker="o", linestyle="-", color="red")
ax.set_ylabel("Number of Listings")
ax.set_xlabel("Date")
ax.set_title("Historical Trend: Number of Listings")
st.pyplot(fig)

These plots help users:

Identify price trends over time
Understand fluctuations in listing activity
Make informed real estate investment decisions

Wrapping Up

Congratulations! You have now completed the full Automated Real Estate Data Pipeline tutorial series. Over four parts, we have built a fully functional pipeline that:

- Scrapes real estate listings from Redfin using Selenium.

- Cleans and structures the data to ensure accuracy and consistency.

- Stores historical data for long-term trend analysis.

- Visualizes key insights using an interactive Streamlit dashboard.

This project is an excellent portfolio piece, demonstrating your skills in:

Web scraping and automation
Data cleaning and preprocessing
Data visualization and dashboard development
Building an end-to-end data pipeline

By following this tutorial, you now have a solid foundation to extend the project further. Some next steps could include:

Deploying the dashboard online for real-time access.

Enhancing analytics with predictive models for price forecasting.

Integrating APIs for real-time real estate data feeds.

With this project in your portfolio, you can confidently showcase your ability to work with real-world data pipelines and interactive visualizations. Well done!

Now, go ahead and experiment with additional features, share your work, and continue building amazing data-driven applications.

Python Automated Real Estate Data Pipeline Project: Part 4 – Automation

Recap: What We've Done So Far

Part 4: Automating and Tracking Historical Data

Prerequisites

Required Libraries

Install Dependencies

Set Up Your Project

Why Automate?

How APScheduler Works

Implementing the Scheduler

Breaking Down the Scheduler

Enhancing the Scraper

Implementing the Enhanced Scraper

Enhancing Data Analysis

Why Enhance Data Analysis?

Implementing the Enhanced Data Analysis

Enhancing the Dashboard

Implementing the Enhanced Dashboard

Wrapping Up

Learn More

Always be in the loop.