This project involves predicting house prices using TensorFlow Decision Forests. The dataset used is the Ames Housing dataset, which includes 79 explanatory variables describing various aspects of residential homes in Ames, Iowa. The goal is to predict the final price of each home.
Ensure you have the following libraries installed:
- TensorFlow
- TensorFlow Decision Forests
- Pandas
- Seaborn
- Matplotlib
You can install these libraries using pip:

```bash
pip install tensorflow tensorflow_decision_forests pandas seaborn matplotlib
```
The project is organized as follows:

```
project/
├── dataset.csv   # The dataset containing the housing data
├── test.csv      # Test data used for the final predictions
├── train.py      # The script to train and evaluate the model
└── README.md     # This file
```

The dataset used for this project is `dataset.csv`, which contains 80 columns (features and target).
Below is a step-by-step explanation of the code used for this project.
Import the required libraries:

```python
import tensorflow_decision_forests as tfdf
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
```
Load the dataset and display the shape and the first few rows.
```python
dataset = pd.read_csv("project/dataset.csv")
print("Full dataset shape:", dataset.shape)
print(dataset.head(3))
```
Drop the unnecessary `Id` column, which is only a row identifier.

```python
dataset = dataset.drop('Id', axis=1)
```
Inspect the types of the feature columns and display basic statistics of the target variable (`SalePrice`).

```python
dataset.info()
print(dataset['SalePrice'].describe())
```
Plot the distribution of the target variable.
```python
plt.figure(figsize=(9, 8))
sns.histplot(dataset['SalePrice'], color='g', bins=100, kde=True)
plt.show()
```
Split the dataset into training and testing datasets.
```python
def split_dataset(dataset, test_ratio=0.30):
    # Assign each row to the test set with probability `test_ratio`.
    test_indices = np.random.rand(len(dataset)) < test_ratio
    return dataset[~test_indices], dataset[test_indices]

train_ds_pd, valid_ds_pd = split_dataset(dataset)
print("{} examples in training, {} examples in testing.".format(
    len(train_ds_pd), len(valid_ds_pd)))
```
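Note that the split above is unseeded, so a different train/validation partition is drawn on every run. If you need reproducible splits, a seeded variant is a small change; the sketch below uses NumPy's `default_rng` and a toy DataFrame purely for illustration (in the project you would pass the Ames dataset instead):

```python
import numpy as np
import pandas as pd

def split_dataset_seeded(dataset, test_ratio=0.30, seed=42):
    # A seeded Generator makes the mask, and hence the split, deterministic.
    rng = np.random.default_rng(seed)
    test_mask = rng.random(len(dataset)) < test_ratio
    return dataset[~test_mask], dataset[test_mask]

# Tiny illustrative frame; the split sizes sum to len(toy).
toy = pd.DataFrame({"x": range(100), "y": range(100)})
train, test = split_dataset_seeded(toy)
print(len(train), len(test))  # lengths sum to 100; exact sizes depend on the seed
```

Because the mask is a Bernoulli draw per row, the test set size is only approximately `test_ratio * len(dataset)`, but with a fixed seed the same rows are selected every time.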
Convert the Pandas DataFrames to TensorFlow Datasets.
```python
label = 'SalePrice'
train_ds = tfdf.keras.pd_dataframe_to_tf_dataset(train_ds_pd, label=label, task=tfdf.keras.Task.REGRESSION)
valid_ds = tfdf.keras.pd_dataframe_to_tf_dataset(valid_ds_pd, label=label, task=tfdf.keras.Task.REGRESSION)
```
Create and train a Random Forest model.
```python
model = tfdf.keras.RandomForestModel(task=tfdf.keras.Task.REGRESSION)
model.compile(metrics=["mse"])
model.fit(train_ds)
```
Evaluate the model on the validation dataset and plot the training logs.
```python
evaluation = model.evaluate(valid_ds, return_dict=True)
for name, value in evaluation.items():
    print(f"{name}: {value:.4f}")

logs = model.make_inspector().training_logs()
plt.plot([log.num_trees for log in logs], [log.evaluation.rmse for log in logs])
plt.xlabel("Number of trees")
plt.ylabel("RMSE (out-of-bag)")
plt.show()
```
Display and plot the variable importances.
```python
inspector = model.make_inspector()
print("Variable importances:")
for importance in inspector.variable_importances().keys():
    print("\t", importance)

# Plot one importance measure: how often each feature is the root of a tree.
importance = inspector.variable_importances()["NUM_AS_ROOT"]
print(importance)

plt.figure(figsize=(12, 4))
feature_names = [vi[0].name for vi in importance]
feature_importances = [vi[1] for vi in importance]
plt.barh(feature_names, feature_importances)
plt.xlabel("NUM_AS_ROOT")
plt.title("Variable Importances")
plt.show()
```
Predict on the test dataset and save the results to a CSV file.
```python
test_data = pd.read_csv("project/test.csv")
ids = test_data.pop('Id')

test_ds = tfdf.keras.pd_dataframe_to_tf_dataset(test_data, task=tfdf.keras.Task.REGRESSION)
preds = model.predict(test_ds)

output = pd.DataFrame({'Id': ids, 'SalePrice': preds.squeeze()})
output.to_csv('submission.csv', index=False)
print(output.head())
```
This project demonstrates how to use TensorFlow Decision Forests to predict house prices on a rich tabular dataset. Tree-based models are robust, need little feature preprocessing, and perform well on tabular data. Further tuning and experimentation with different models and hyperparameters can improve prediction accuracy.