Scalable IV regression using sufficient statistics #588

Open
Jayhyung opened this issue Aug 24, 2024 · 0 comments

Jayhyung commented Aug 24, 2024

Inspired by the work of @apoorvalal and @s3alfisc (#574 and here), I suggest adding a new method that computes the first- and second-stage regressions of the 2SLS estimator from sufficient statistics. The trick is to regress the endogenous covariate on the discrete exogenous covariates and instruments in the first stage, and to estimate the second stage using the predicted values from that regression. Note that the predicted values of the endogenous covariate remain discrete: they take at most as many distinct values as there are unique combinations of the discrete controls and instruments, so the second-stage data can be collapsed as well. A rough sketch of the algebra is below, followed by a code example using duckreg by @apoorvalal.
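
To spell out why the second stage also compresses (my notation, not duckreg's internals): with first-stage regressors $w_i = (1, z_i, z_{2,i}, fx_{1,i}, fx_{2,i})'$ and estimates $\hat{\pi}$, the fitted values $\hat{d}_i = w_i'\hat{\pi}$ are constant within each cell $g$ of unique $(z, z_2, fx_1, fx_2)$ combinations. The second-stage regression of $y$ on $x_i = (1, \hat{d}_i, fx_{1,i}, fx_{2,i})'$ therefore only needs the cell counts $n_g$ and cell means $\bar{y}_g$:

$$\hat{\beta} = \Big(\sum_g n_g \, x_g x_g'\Big)^{-1} \sum_g n_g \, x_g \bar{y}_g,$$

i.e. a weighted least squares fit on the compressed table, which is the kind of problem duckreg already handles.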

import numpy as np
import pandas as pd
from duckreg.estimators import DuckRegression
import duckdb
import pyfixest as pf

# Generate sample data: two binary instruments (z, z2), a covariate d that
# is treated as endogenous, two discrete exogenous covariates (fx1, fx2),
# and the outcome y
def generate_sample_data(N=10_000_000, seed=12345):
    rng = np.random.default_rng(seed)

    z = rng.binomial(1, 0.5, size=(N, 1))
    z2 = rng.binomial(1, 0.5, size=(N, 1))

    d = 0.5 * z + 1.2 * z2 + rng.normal(size=(N, 1))

    fx = rng.choice(range(20), size=(N, 2), replace=True)

    y = 1.0 + 2.3 * d + fx @ np.array([1, 2]).reshape(2, 1) + rng.normal(size=(N, 1))

    df = pd.DataFrame(
        np.concatenate([y, d, z, z2, fx], axis=1),
        columns=["y", "d", "z", "z2", "fx1", "fx2"],
    ).assign(rowid=range(N))
    return df



# Function to create and populate DuckDB database
def create_duckdb_database(df, db_name="large_dataset.db", table="data"):
    conn = duckdb.connect(db_name)
    conn.execute(f"DROP TABLE IF EXISTS {table}")
    conn.execute(f"CREATE TABLE {table} AS SELECT * FROM df")
    conn.close()
    print(f"Data loaded into DuckDB database: {db_name}")


# Generate the data and load it into DuckDB
df = generate_sample_data()
db_name = "large_dataset.db"
create_duckdb_database(df, db_name)

# Preview the table and inspect the compressed (sufficient-statistic)
# representation of the first-stage data: cell counts, sums, and sums of
# squares of d by unique (z, z2, fx1, fx2) combination
conn = duckdb.connect(db_name)
conn.execute("SELECT * FROM data LIMIT 5").fetchdf()

q = """
    SELECT z, z2, fx1, fx2,
           COUNT(*) AS count,
           SUM(d) AS sum_d,
           SUM(POW(d, 2)) AS sum_d_sq
    FROM data
    GROUP BY z, z2, fx1, fx2
"""
compressed_df = conn.execute(q).fetchdf()
conn.close()

compressed_df.eval("mean_d = sum_d / count", inplace=True)

# Step 1: First-stage regression of d on the instruments and exogenous
# covariates (DuckRegression compresses the data internally)
m1 = DuckRegression(
    db_name='large_dataset.db',
    table_name='data',
    formula="d ~ z + z2 + fx1 + fx2",
    cluster_col="",
    n_bootstraps=0,
    seed=42,
)
m1.fit()
m1.fit_vcov()
results = m1.summary()
restab = pd.DataFrame(
    np.c_[results["point_estimate"], results["standard_error"]],
    columns=["point_estimate", "standard_error"],
)

# Step 2: Predict the endogenous variable from the first-stage estimates

# Extract the point estimates (intercept first, then coefficients in the
# order of the formula "d ~ z + z2 + fx1 + fx2")
constant = results["point_estimate"][0]
coef_z = results["point_estimate"][1]
coef_z2 = results["point_estimate"][2]
coef_fx1 = results["point_estimate"][3]
coef_fx2 = results["point_estimate"][4]

# Compute the predicted values d_hat on the full data
df['d_hat'] = (constant +
               coef_z * df['z'] +
               coef_z2 * df['z2'] +
               coef_fx1 * df['fx1'] +
               coef_fx2 * df['fx2'])
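
# (Equivalent, more compact prediction as a single dot product; just a
#  sketch, assuming summary() returns the coefficients as an array in the
#  formula order above, intercept first.)
pi_hat = np.asarray(results["point_estimate"])
X_first = np.column_stack([np.ones(len(df)), df[["z", "z2", "fx1", "fx2"]].to_numpy()])
df["d_hat"] = X_first @ pi_hat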

# Select the relevant columns to create the new DataFrame
result_df = df[['z', 'z2', 'fx1', 'fx2', 'd_hat', 'y']]

# Overwrite the DuckDB table with the second-stage data (d_hat in place of d)
db_name = "large_dataset.db"
create_duckdb_database(result_df, db_name)

# Preview the table and inspect the compressed representation of the
# second-stage data: d_hat is constant within each (z, z2, fx1, fx2) cell,
# so the grouped table stays small
conn = duckdb.connect(db_name)
conn.execute("SELECT * FROM data LIMIT 5").fetchdf()

q = """
    SELECT z, z2, fx1, fx2, d_hat,
           COUNT(*) AS count,
           SUM(y) AS sum_y,
           SUM(POW(y, 2)) AS sum_y_sq
    FROM data
    GROUP BY z, z2, fx1, fx2, d_hat
"""
compressed_df = conn.execute(q).fetchdf()
conn.close()

compressed_df.eval("mean_y = sum_y / count", inplace=True)
print(compressed_df.shape)
compressed_df.head()

# Step 3: Second-stage regression of y on the predicted covariate d_hat

m2 = DuckRegression(
    db_name='large_dataset.db',
    table_name='data',
    formula="y ~ d_hat + fx1 + fx2",
    cluster_col="",
    n_bootstraps=0,
    seed=42,
)
m2.fit()
m2.fit_vcov()
results = m2.summary()
restab = pd.DataFrame(
    np.c_[results["point_estimate"], results["standard_error"]],
    columns=["point_estimate", "standard_error"],
)
restab

The code above produces the following result.

Data loaded into DuckDB database: large_dataset.db
   point_estimate  standard_error
0        1.000021        0.002263
1        2.299646        0.001221
2        1.000025        0.000138
3        1.999973        0.000138

The point estimates match those from pyfixest (the standard errors differ, as expected, since the manual second stage does not account for the estimated first stage):

m_pf = pf.feols("y ~ 1 + fx1 + fx2 | d ~ z + z2", df, vcov="hetero")
m_pf.tidy()

# The above code produces the following result
             Estimate  Std. Error       t value  Pr(>|t|)      2.5%     97.5%
Coefficient
Intercept    1.000021    0.000903   1107.615507       0.0  0.998251  1.001790
d            2.299646    0.000487   4725.378807       0.0  2.298692  2.300599
fx1          1.000025    0.000055  18231.844575       0.0  0.999918  1.000133
fx2          1.999973    0.000055  36461.661487       0.0  1.999865  2.000080

This process requires two extra tasks: the data compression has to run twice, and the predicted values have to be generated. I'm not very familiar with duckdb, but excluding the time taken to create the DuckDB database, this clearly beats a naive in-memory implementation of the IV regression.
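
For reference, a minimal sketch of how I would time the comparison (my own harness, not part of the proposal; it only times the naive in-memory fit, and the duckreg route would wrap Steps 1-3 above and be timed the same way, excluding database creation):

import time

# time the naive in-memory IV regression on the full 10M-row DataFrame
start = time.perf_counter()
pf.feols("y ~ 1 + fx1 + fx2 | d ~ z + z2", df, vcov="hetero")
print(f"naive pf.feols IV: {time.perf_counter() - start:.1f}s")

# the compressed route would wrap Steps 1-3 (first stage, prediction,
# second stage) in one function and time it identically; database creation
# time is excluded, as in the comparison described above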

@apoorvalal and @s3alfisc, do you think this is worth adding to the codebase? If so, I would like to work out how to compute the different vcov types for the 2SLS estimator based on Wong et al. (2021), and how to optimize the data compression steps. I would open the PR after #574 is merged.
