Kyle Kaufman - Data Scientist & ML Engineer

Project Overview

This machine learning project analyzes housing market trends across 20 metropolitan areas, combining economic indicators, demographic factors, and historical pricing data to predict future housing prices.

The project employs ensemble methods combining Neural Networks, Random Forest, and XGBoost models, achieving an R² of 0.92 and outperforming traditional prediction models by 15%. The system processes over 50,000 property records and integrates real-time economic indicators for dynamic predictions.

Key Achievements

•Built ensemble models (XGBoost, Random Forest, Neural Networks) achieving an R² of 0.92 in price predictions

•Analyzed 50K+ property records across 20 metropolitan areas with comprehensive feature engineering

•Created interactive visualizations for trend analysis and investment opportunities

•Performed statistical analysis with correlation matrices and feature importance rankings

•Achieved a 15% improvement over traditional prediction models through ensemble techniques

Model Outputs & Evaluation

Model Performance Comparison

Model	RMSE (%)	R-Squared	Adjusted R²	MAE ($)	Top 3 Features
Neural Network	3.1	0.92	0.91	$8,900	Location, Square Footage, Interest Rate
Random Forest	3.9	0.88	0.87	$10,200	Location, Square Footage, Interest Rate
Gradient Boosting	4.3	0.85	0.84	$12,300	Location, Square Footage, Unemployment Rate
Linear Regression (Baseline)	5.2	0.74	0.73	$15,000	Location, Square Footage, Year Built

Key Findings:

• Neural Network outperforms all other models with an R² of 0.92, explaining 92% of the variance in property prices
• 15% improvement in prediction accuracy compared to traditional Linear Regression baseline
• Location and Square Footage consistently rank as top predictive features across all models
• Interest rates show strong negative correlation (-0.45) with property prices
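
The metrics in the comparison table can be computed from held-out predictions roughly as follows. This is a minimal sketch in plain Python: the function name `regression_metrics`, the toy prices, and the feature count passed in for adjusted R² are illustrative assumptions, not the project's code or data.

```python
import math

def regression_metrics(y_true, y_pred, n_features):
    """Return RMSE, MAE, R-squared and adjusted R-squared for one model."""
    n = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    mean_true = sum(y_true) / n
    ss_res = sum(r * r for r in residuals)              # unexplained variation
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total variation
    r2 = 1.0 - ss_res / ss_tot
    # Adjusted R-squared penalizes R-squared for the number of predictors.
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - n_features - 1)
    return {
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(r) for r in residuals) / n,
        "r2": r2,
        "adj_r2": adj_r2,
    }

# Toy prices in dollars, made up purely for illustration.
actual = [250_000, 310_000, 420_000, 180_000, 365_000]
predicted = [245_000, 322_000, 410_000, 190_000, 360_000]
metrics = regression_metrics(actual, predicted, n_features=2)
print(metrics)
```

Note that this RMSE is in the target's units (dollars), whereas the table reports RMSE as a percentage; how that percentage was normalized is not stated here.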

Statistical Significance of Variables

Pearson Correlation with Property Price

Location: +0.75
Square Footage: +0.60
Year Built: +0.50
Interest Rate: -0.45
Proximity to Schools: +0.40
Unemployment Rate: -0.35

Analysis: Location shows the strongest positive correlation (+0.75) with property prices, consistent with prime real estate areas commanding higher values. Interest rates exhibit a negative correlation (-0.45): higher rates are associated with lower property prices, in line with broader economic trends, although correlation alone does not establish causation.
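
For reference, a Pearson coefficient like the ones listed above can be computed as in the following sketch. The helper name `pearson_r` and the toy interest-rate and price series are assumptions for illustration; they are not taken from the project's dataset.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up series: mortgage rate (%) and median price (thousands of dollars).
interest_rate = [3.0, 3.5, 4.0, 5.0, 6.5, 7.0]
median_price = [410, 400, 395, 370, 350, 355]
r = pearson_r(interest_rate, median_price)
print(f"r = {r:.3f}")  # negative: prices fall as rates rise in this toy data
```

A coefficient near -1 or +1 means a nearly linear relationship; values such as -0.45 indicate a moderate linear association with substantial scatter around it.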

Feature Importance Analysis

Top Features (Neural Network Model)

Location (Urban/Suburban/Rural): 31.2%
Square Footage: 26.8%
Interest Rate: 18.4%
Year Built: 14.2%
Proximity to Schools: 9.4%

Feature importance was calculated using the permutation importance method on the Neural Network model. Location accounts for 31.2% of the total importance.
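
Permutation importance measures how much a model's score drops when one feature's values are shuffled. The sketch below shows the idea in plain Python with a hypothetical stand-in model (`toy_model`) and synthetic data; it is not the project's Neural Network. The listed percentages sum to 100%, which suggests the raw importances were normalized, so the last step here assumes that normalization.

```python
import random

def r2_score(y_true, y_pred):
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Mean drop in R-squared when each column of X is shuffled in turn."""
    rng = random.Random(seed)
    baseline = r2_score(y, [predict(row) for row in X])
    importances = []
    for col in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            shuffled = [row[col] for row in X]
            rng.shuffle(shuffled)
            # Rebuild the rows with only this one column permuted.
            X_perm = [row[:col] + [v] + row[col + 1:] for row, v in zip(X, shuffled)]
            drops.append(baseline - r2_score(y, [predict(row) for row in X_perm]))
        importances.append(sum(drops) / n_repeats)
    return importances

# Hypothetical model: strong effect of column 0, weak effect of column 1, none of column 2.
def toy_model(row):
    return 3.0 * row[0] + 0.5 * row[1]

data_rng = random.Random(42)
X = [[data_rng.random() for _ in range(3)] for _ in range(200)]
y = [toy_model(row) for row in X]

raw = permutation_importance(toy_model, X, y)
total = sum(raw)
shares = [imp / total for imp in raw]  # normalized so the shares sum to 1
print([f"{s:.1%}" for s in shares])
```

Shuffling a feature the model ignores leaves the score unchanged, so its importance is zero; the more the model relies on a feature, the larger the drop.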

Cross-Validation Results

5-Fold CV R² Scores

Fold 1: 0.9312
Fold 2: 0.9189
Fold 3: 0.9267
Fold 4: 0.9201
Fold 5: 0.9245
Mean CV Score: 0.9243
Std Deviation: ±0.0045
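
The summary statistics can be reproduced from the five fold scores with a short check. This sketch uses only Python's standard library; the variable names are illustrative.

```python
import statistics

# Fold scores as listed in the cross-validation results.
fold_scores = [0.9312, 0.9189, 0.9267, 0.9201, 0.9245]

mean_score = statistics.mean(fold_scores)
std_population = statistics.pstdev(fold_scores)  # divides by n
std_sample = statistics.stdev(fold_scores)       # divides by n - 1

print(f"mean={mean_score:.4f} pstdev={std_population:.4f} stdev={std_sample:.4f}")
```

The reported ±0.0045 matches the population standard deviation (dividing by n, which is also NumPy's default); the sample standard deviation of the same scores is about 0.0050.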

Hyperparameter Optimization

Optimal XGBoost Parameters (Grid Search)

Learning Rate: 0.05 (tested: [0.01, 0.05, 0.1, 0.2])
Max Depth: tested [3, 5, 7, 9, 11]
N Estimators: 500 (tested: [100, 300, 500, 1000])
Min Child Weight: tested [1, 3, 5, 7]
Subsample: 0.8 (tested: [0.6, 0.7, 0.8, 0.9, 1.0])
Colsample Bytree: 0.8 (tested: [0.6, 0.7, 0.8, 0.9, 1.0])

Grid Search Results: 480 parameter combinations tested over 12.3 hours using 8-core CPU. Best parameters selected based on 5-fold cross-validation R² score.
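
A grid search simply scores every combination of candidate values and keeps the best one. The sketch below enumerates the tested values listed above and runs a generic search loop. The scoring function `fake_cv_score` is a hypothetical placeholder standing in for "mean 5-fold cross-validation R² of a model trained with these parameters"; no real model is trained here.

```python
import itertools

# Candidate values as listed above, keyed by XGBoost's scikit-learn parameter names.
param_grid = {
    "learning_rate": [0.01, 0.05, 0.1, 0.2],
    "max_depth": [3, 5, 7, 9, 11],
    "n_estimators": [100, 300, 500, 1000],
    "min_child_weight": [1, 3, 5, 7],
    "subsample": [0.6, 0.7, 0.8, 0.9, 1.0],
    "colsample_bytree": [0.6, 0.7, 0.8, 0.9, 1.0],
}

def iter_grid(grid):
    """Yield one dict per point in the Cartesian product of the grid."""
    names = list(grid)
    for values in itertools.product(*(grid[name] for name in names)):
        yield dict(zip(names, values))

def grid_search(grid, score_fn):
    """Return (best_params, best_score), maximizing score_fn."""
    best_params, best_score = None, float("-inf")
    for params in iter_grid(grid):
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical stand-in for a cross-validated score; peaks at 0.05 / 0.8.
def fake_cv_score(params):
    return -abs(params["learning_rate"] - 0.05) - abs(params["subsample"] - 0.8)

n_combinations = sum(1 for _ in iter_grid(param_grid))
best_params, best_score = grid_search(param_grid, fake_cv_score)
print(n_combinations, best_params)
```

An exhaustive product over all listed values comes to 8,000 combinations rather than the 480 reported, so the actual search presumably covered a reduced or staged grid; how it was narrowed is not stated.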

Residual Analysis

Model Residual Distribution

Residual plots were generated to assess prediction accuracy. The Neural Network model exhibited the smallest residuals, indicating excellent fit with the data.

Neural Network: Excellent Fit
Smallest residuals, normally distributed

Random Forest & Gradient Boosting: Good Fit
Low residual variance across price ranges

Linear Regression: Moderate Fit
Slight heteroscedasticity detected

Key Observations:

• Neural Network residuals are consistent with a normal distribution (Shapiro-Wilk p = 0.342)
• No significant autocorrelation detected (Durbin-Watson = 1.98)
• No significant heteroscedasticity detected (Breusch-Pagan p = 0.156)
• Linear Regression shows unmodeled nonlinear relationships
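
Of the diagnostics above, the Durbin-Watson statistic is simple enough to compute by hand: it is the sum of squared differences between consecutive residuals divided by the sum of squared residuals. The sketch below uses made-up residual sequences to show the two extremes; it assumes the residuals are in a meaningful order (for example, by date), which is not specified here.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic; values near 2 suggest no first-order autocorrelation."""
    num = sum((residuals[i] - residuals[i - 1]) ** 2 for i in range(1, len(residuals)))
    den = sum(e * e for e in residuals)
    return num / den

# Made-up residuals for illustration.
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]  # negative autocorrelation
trending = [1.0, 0.9, 0.8, 0.7, 0.6, 0.5]        # positive autocorrelation

dw_alternating = durbin_watson(alternating)
dw_trending = durbin_watson(trending)
print(f"alternating: {dw_alternating:.2f}, trending: {dw_trending:.2f}")
```

The statistic ranges from 0 to 4: values well below 2 point to positive autocorrelation, values well above 2 to negative autocorrelation, so the reported 1.98 sits very close to the neutral value.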

Dataset & Methodology

Total Records: 52,847
Metro Areas: 20
Time Period: 2015-2024

Data Sources

• Zillow Research Data (property characteristics, historical prices)
• U.S. Census Bureau (demographic data, income levels)
• Federal Reserve Economic Data (interest rates, economic indicators)
• Bureau of Labor Statistics (employment data, inflation metrics)

Preprocessing Pipeline

• Missing value imputation using KNN (k=5) for numerical features
• One-hot encoding for categorical variables (location, property type)
• StandardScaler normalization for numerical features
• Feature engineering: price per sqft, age of property, location scores
• Outlier removal using IQR method (removed 2.3% of extreme values)
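
Three of the steps above (IQR outlier removal, standard scaling, and one-hot encoding) can be sketched with the standard library alone. The helper names, the toy values, and the conventional 1.5 × IQR multiplier are assumptions; the multiplier actually used is not stated. KNN imputation is omitted to keep the example short.

```python
import statistics

def iqr_bounds(values, k=1.5):
    """Lower and upper fences at k * IQR beyond the first and third quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def standardize(values):
    """Zero mean, unit variance (population std, as StandardScaler uses)."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

def one_hot(labels):
    """One indicator column per category, in sorted category order."""
    categories = sorted(set(labels))
    return [[1 if label == c else 0 for c in categories] for label in labels]

# Made-up prices in thousands of dollars, with one extreme value.
prices = [210, 225, 240, 250, 260, 275, 290, 1500]
low, high = iqr_bounds(prices)
kept = [p for p in prices if low <= p <= high]   # drops the 1500 outlier
scaled = standardize(kept)
encoded = one_hot(["urban", "rural", "suburban", "urban"])
print(kept, encoded[0])
```

In a real pipeline the scaling statistics and category list would be fitted on the training split only and then reused on validation and test data, to avoid leaking information across splits.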

Research Conclusions

The application of machine learning in real estate price prediction demonstrates significant advantages over traditional methods. The Neural Network model achieved an R² value of 0.92, indicating it explains 92% of the variance in property prices—a substantial improvement over the Linear Regression baseline (R² = 0.74).

Feature importance analysis revealed that location, square footage, and economic indicators (particularly interest rates) play the most significant roles in determining property prices. The high correlation between location and price (+0.75) aligns with industry knowledge that prime real estate areas command premium values.

The negative correlation with interest rates (-0.45) is consistent with rising rates putting downward pressure on property prices, in line with broader economic trends. This relationship is particularly valuable for real estate investors and analysts seeking to time market entries and exits.

Machine learning provides valuable insights for real estate stakeholders, enabling data-driven decisions and reducing investment risks. Future research can incorporate more granular data such as neighborhood-level attributes, transaction histories, and social factors to further improve model accuracy.

Predicting Housing Market Trends
