Using storage information to model Henry Hub Gas prices

Note

Generative AI was used to fix grammar in the writing. The rest is done by me.

In the previous post I built a latent volatility model to predict Henry Hub gas prices. While it was a very simple model, it did produce meaningful results in the end. One of the fundamental problems with that kind of model is that it makes no assumptions about the nature of the underlying time series: it could just as easily be modeling sheep populations in New Zealand. Today I’m making the first step towards addressing that by including a fundamental macro indicator - reported levels of gas storage, collected by the US Energy Information Administration (EIA). I’d recommend spending 30 minutes or so on their website - the amount of historical information on energy is amazing and it’s a great resource all around.

The reason I decided to start with storage information is that it is a very strong indicator of volatility. Depending on weather forecasts, current storage levels and other events, volatility goes up and down. For example, in warmer months NG volatility tends to be lower (unless snow hits Texas), while in colder months, when sudden frosts hit, volatility increases because there is more demand for an already depleted commodity.

EIA gas report

On a weekly basis, the EIA collects information about gas storage levels. I will use the storage levels from the lower 48 states. While there are other massive storage facilities on other continents, the US on its own is a massive energy market, and its storage levels provide valuable insight into future volatility.

How to include storage levels

Within the Bayesian framework, there are a few ways to include the storage information. First of all, the EIA methodology document states that the parties reporting their storage levels are sampled, meaning that from week to week we do not get reports from the same parties. Also, while it is unclear how much evidence there is for this, it would be reasonable to assume that there are inaccuracies in those reports. With that in mind, my initial idea was to model the storage level as a noisily observed quantity, i.e. a latent variable that is only periodically observed. That would lead to a two-level Bayesian model. However, the NG prices dataset is daily and the storage reports are weekly, so it is unclear how exactly the two should be aligned if the storage information is treated as a latent variable.
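For the curious, here is a rough sketch of what that rejected two-level version might look like in NumPyro. This is not the model I use in this post; the function and argument names (latent_storage_sketch, report_idx and so on) are hypothetical and only illustrate the idea of a latent daily storage path observed noisily once a week.

import jax.numpy as jnp
import numpyro
import numpyro.distributions as dist


def latent_storage_sketch(n_days, report_idx, weekly_reports=None):
    # Latent daily storage path modelled as a Gaussian random walk
    drift = numpyro.sample("drift", dist.Normal(0.0, 1.0))
    step_sd = numpyro.sample("step_sd", dist.HalfNormal(1.0))
    steps = numpyro.sample("steps", dist.Normal(drift, step_sd).expand([n_days]))
    storage = jnp.cumsum(steps)

    # Weekly EIA reports are noisy measurements of the latent path
    report_sd = numpyro.sample("report_sd", dist.HalfNormal(1.0))
    numpyro.sample(
        "weekly_reports",
        dist.Normal(storage[report_idx], report_sd),
        obs=weekly_reports,
    )
    # The daily latent `storage` values would then feed the volatility equation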

I then decided to use it as an external regressor and leave the more complicated version for future trials. Additionally, I include the derivative of the storage level in the model, since I found that it helps to make the model more stable; conceptually it also makes sense that the speed and direction of storage changes carry important information.
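Concretely, the structure I am aiming for is a linear equation in the log-variance, roughly of the form below. This is a sketch of the structure rather than the exact parameterisation inside garch_like_sample_vol_model, but the coefficient names mirror the ones that appear in the MCMC summary later on:

log_var[t] = b_var
             + b_storage_quantile * storage_quantile[t-1]
             + b_storage_quantile_derivative * storage_quantile_derivative[t-1]
             + b_var_quantile * var_quantile[t-1]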

Data loading
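
Throughout the post I use polars, seaborn/matplotlib, NumPyro and ArviZ. Here is a minimal sketch of the imports the snippets below assume; FeatureEngineer, the transforms, Simulator and garch_like_sample_vol_model live in my own helper code, so I only list them in a comment rather than guessing the import path.

import arviz as az
import matplotlib.pyplot as plt
import numpyro
import polars as pl
import seaborn as sns
from jax import random
from numpyro.infer import MCMC, NUTS, Predictive
from tqdm import tqdm

# From my own code base (import path omitted):
# FeatureEngineer, LogReturn, Variance, QuantileTransformer, Derivative,
# Simulator, garch_like_sample_vol_model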

Start by loading the daily gas price data:

gas_data = (
    pl.read_csv("data/ng_daily.csv", try_parse_dates=True)
    .drop_nulls()
    .tail(3500)
    .rename({"Date": "date", "Price": "price"})
).with_columns(
    ret=pl.col("price") / pl.col("price").shift(1),
)
gas_data
shape: (3_500, 3)
date price ret
date f64 f64
2011-04-01 4.32 null
2011-04-04 4.21 0.974537
2011-04-05 4.22 1.002375
2011-04-06 4.17 0.988152
2011-04-07 4.12 0.98801
2025-02-04 3.25 0.984848
2025-02-05 3.22 0.990769
2025-02-06 3.31 1.02795
2025-02-07 3.32 1.003021
2025-02-10 3.48 1.048193

I then load storage data, focusing only on the lower 48 states. The relevant column is renamed to storage for convenience.

ng_storage_data = (
    pl.read_csv("data/ng_storage.csv", truncate_ragged_lines=True)
    .with_columns(
        pl.col("Week ending").str.strptime(pl.Datetime, "%d-%b-%y").alias("date"),
    )
    .with_columns(pl.col("date").cast(pl.Date))
    .rename({"Total Lower 48": "storage"})
)
ng_storage_data.head()
shape: (5, 10)
Week ending Source East Region Midwest Region Mountain Region Pacific Region South Central Region Salt NonSalt storage date
str str i64 i64 i64 i64 i64 i64 i64 date
"01-Jan-10" "Derived EIA Weekly Estimates" 769 900 195 268 985 159 826 2010-01-01
"08-Jan-10" "Derived EIA Weekly Estimates" 703 820 185 257 886 123 763 2010-01-08
"15-Jan-10" "Derived EIA Weekly Estimates" 642 750 176 246 793 91 702 2010-01-15
"22-Jan-10" "Derived EIA Weekly Estimates" 616 710 171 235 789 102 687 2010-01-22
"29-Jan-10" "Derived EIA Weekly Estimates" 582 661 164 221 779 108 671 2010-01-29

Here is what the storage dynamics look like:

sns.lineplot(ng_storage_data, x="date", y="storage")

Finally, I join both frames into one.

gas_storage_data = gas_data.join(
    ng_storage_data["date", "storage"], how="left", on="date"
).with_columns(pl.col("storage").forward_fill())
gas_storage_data.head()
shape: (5, 4)
date price ret storage
date f64 f64 i64
2011-04-01 4.32 null 578
2011-04-04 4.21 0.974537 578
2011-04-05 4.22 1.002375 578
2011-04-06 4.17 0.988152 578
2011-04-07 4.12 0.98801 578

I use my FeatureEngineer class to extract the following features:

  1. Log-return
  2. Log-variance
  3. Storage and variance quantiles
  4. Storage quantile derivative

I use quantiles pretty much everywhere, because it is a good way to standardise the data while still keeping a lot of useful information in the transformed variable. If a value is 0.99, you know it’s a rare event, which is also useful for debugging.
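The QuantileTransformer itself is part of my own feature-engineering code and is not shown here. As a rough illustration of the idea (not the actual implementation), mapping a column to its empirical quantiles in polars could look like this, where to_quantile is a hypothetical helper:

def to_quantile(df: pl.DataFrame, col: str) -> pl.DataFrame:
    # Map each value to its position in the empirical CDF, scaled to [0, 1]
    return df.with_columns(
        ((pl.col(col).rank(method="average") - 1) / (pl.len() - 1)).alias(f"{col}_quantile")
    )

# Example: add a storage_quantile column
# gas_storage_data = to_quantile(gas_storage_data, "storage")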

feature_engineer = FeatureEngineer(
    drop_nulls=False,
    transforms=[
        LogReturn(source_field="ret", feature_name="log_ret"),
        Variance(
            source_field="price",
            feature_name="var",
            rolling_variance_window=2,
        ),
        LogReturn(source_field="var", feature_name="log_var", requested_lag=0),
        QuantileTransformer(
            source_field="var", feature_name="var_quantile", requested_lag=1
        ),
        QuantileTransformer(
            source_field="storage",
            feature_name="storage_quantile",
            step_size=8,
            requested_lag=1,
        ),
        Derivative(
            source_field="storage_quantile",
            feature_name="storage_quantile_derivative",
            step_size=8,
            requested_lag=1,
        ),
    ],
    n_shifts=3,
)
df_with_features = feature_engineer.create_features(gas_storage_data)
df_with_features = df_with_features.drop_nulls()
df_with_features
shape: (3_484, 20)
date price ret storage log_ret var log_var var_quantile storage_quantile storage_quantile_derivative month prev_log_ret_1 prev_log_ret_2 prev_log_ret_3 prev_var_1 prev_var_2 prev_var_3 prev_var_quantile_1 prev_storage_quantile_1 prev_storage_quantile_derivative_1
date f64 f64 i64 f64 f64 f64 f64 f64 f64 i8 f64 f64 f64 f64 f64 f64 f64 f64 f64
2011-04-26 4.32 0.988558 610 -0.011508 0.00125 -6.684612 0.470952 0.22973 0.027528 3 0.009195 0.0 0.032867 0.0008 0.0001 0.0098 0.411746 0.202202 0.019019
2011-04-27 4.35 1.006944 610 0.00692 0.00045 -7.706263 0.340426 0.22973 0.027528 3 -0.011508 0.009195 0.0 0.00125 0.0008 0.0001 0.470952 0.202202 0.019019
2011-04-28 4.38 1.006897 610 0.006873 0.00045 -7.706263 0.340574 0.22973 0.0 3 0.00692 -0.011508 0.009195 0.00045 0.00125 0.0008 0.340426 0.22973 0.046547
2011-04-29 4.51 1.02968 640 0.029248 0.00845 -4.773589 0.770854 0.264765 0.035035 3 0.006873 0.00692 -0.011508 0.00045 0.00045 0.00125 0.340574 0.22973 0.046547
2011-05-02 4.6 1.019956 640 0.019759 0.00405 -5.509038 0.658659 0.264765 0.035035 4 0.029248 0.006873 0.00692 0.00845 0.00045 0.00045 0.770854 0.22973 0.046547
2025-02-04 3.25 0.984848 638 -0.015267 0.00125 -6.684612 0.47869 0.261762 -0.200701 1 0.11892 -0.062831 -0.080043 0.06845 0.01805 0.0338 0.951502 0.462462 -0.115616
2025-02-05 3.22 0.990769 638 -0.009274 0.00045 -7.706263 0.347916 0.261762 -0.074074 1 -0.015267 0.11892 -0.062831 0.00125 0.06845 0.01805 0.47869 0.335836 -0.242242
2025-02-06 3.31 1.02795 638 0.027567 0.00405 -5.509038 0.662417 0.261762 -0.074074 1 -0.009274 -0.015267 0.11892 0.00045 0.00125 0.06845 0.347916 0.335836 -0.242242
2025-02-07 3.32 1.003021 624 0.003017 0.0001 -9.21034 0.0 0.241742 -0.094094 1 0.027567 -0.009274 -0.015267 0.00405 0.00045 0.00125 0.662417 0.335836 -0.242242
2025-02-10 3.48 1.048193 624 0.047068 0.0128 -4.35831 0.831581 0.241742 -0.094094 1 0.003017 0.027567 -0.009274 0.0001 0.00405 0.00045 0.0 0.335836 -0.242242
# Cut-off point
T = 300
(
    present_value_log_ret,
    present_value_log_var,
    storage,
    month,
    present_value_test,
) = (
    df_with_features["log_ret"][:-T].to_numpy(),
    df_with_features["log_var"][:-T].to_numpy(),
    df_with_features["storage_quantile"][:-T].to_numpy(),
    df_with_features["month"][:-T].to_numpy(),
    df_with_features["log_ret"][-T:].to_numpy(),
)
past_values_train, past_values_test = (
    feature_engineer.to_numpy_dict(df_with_features[:-T]),
    feature_engineer.to_numpy_dict(df_with_features[-T:]),
)

Model definition

The model follows the structure described above: a GARCH-like volatility model in which the lagged storage quantile, its derivative, and the lagged variance quantile enter the log-variance equation as regressors.


garch_like_sample_vol_model

 garch_like_sample_vol_model (value_log_ret: numpy.ndarray = None,
                              log_var_value: numpy.ndarray = None,
                              storage: numpy.ndarray = None,
                              month: numpy.ndarray = None,
                              past_values: dict[str, numpy.ndarray] = None)
numpyro.render_model(
    garch_like_sample_vol_model,
    model_args=(
        present_value_log_ret,
        present_value_log_var,
        storage,
        month,
        past_values_train,
    ),
)

The first order of business is to do prior predictive checks (PPC). In my case the data live in the temporal domain, hence the checks will be autoregressive too. What I mainly want to check here is that with the current priors I do not get explosive time series, and that they stay within a reasonable range. On the image below, you can see that this is indeed the case.

Though PPCs are very useful in general, practically speaking, if you have a lot of observations, a bad prior will be overwhelmed by the data. This is the case for this model too: I tried setting unreasonable priors and the model would still converge to the same values again and again, simply because there are thousands of observations. Nonetheless, it is good practice and I do it.

simulator = Simulator(
    model=garch_like_sample_vol_model,
    feature_engineer=feature_engineer,
    exo_fixed_effects=["month"],
    inherit_vals=["storage"],
)
starting_sim_df = gas_storage_data[-T - feature_engineer.n_shifts * 5 - 100 : -T]
all_runs = simulator.simulate_paths(
    steps=7,
    starting_sim_df=starting_sim_df,
    posterior_samples=None,
    num_sims=10,
)

sns.lineplot(all_runs, x="date", y="price", hue="run_id")
plt.xticks(rotation=90);

Fitting and diagnostics

The model is fitted using MCMC. You can see that the estimates are relatively tight. Note how the derivative enters the linear equation with a negative sign: if the storage level is dropping (a negative change), the term makes a positive contribution to the volatility.
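As a quick worked example (assuming, as the parameter names and the value of b_var suggest, that these regressors add up linearly in the log-variance): with a posterior mean of about -3.25 for the derivative coefficient, a drop of 0.1 in the storage quantile over the derivative window contributes roughly -3.25 × (-0.1) ≈ 0.33 to the log-variance, i.e. it scales the variance by about exp(0.33) ≈ 1.39.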

I also show the MCMC diagnostics below. The effective sample sizes are decent, autocorrelation is low, and the chains appear to mix well.

rng_key = random.PRNGKey(0)
rng_key, rng_key_ = random.split(rng_key)

# Run NUTS.
kernel = NUTS(garch_like_sample_vol_model)
num_samples = 2000
mcmc = MCMC(kernel, num_warmup=1000, num_samples=num_samples)

mcmc.run(
    rng_key_,
    present_value_log_ret,
    present_value_log_var,
    storage,
    month,
    past_values_train,
)
mcmc.print_summary()
posterior_samples = mcmc.get_samples()
print(posterior_samples.keys())
sample: 100%|██████████| 3000/3000 [00:01<00:00, 1742.73it/s, 7 steps of size 2.36e-01. acc. prob=0.93] 

                                                mean       std    median      5.0%     95.0%     n_eff     r_hat
                                     b_var     -7.11      0.00     -7.11     -7.12     -7.10   1296.73      1.00
             param_prev_storage_quantile_1      0.19      0.01      0.19      0.18      0.20   1335.46      1.00
  param_prev_storage_quantile_derivative_1     -3.25      0.02     -3.25     -3.29     -3.21   1161.86      1.00
                 param_prev_var_quantile_1      1.23      0.01      1.23      1.22      1.24   1393.80      1.00

Number of divergences: 0
dict_keys(['b_var', 'param_prev_storage_quantile_1', 'param_prev_storage_quantile_derivative_1', 'param_prev_var_quantile_1'])
posterior_samples = mcmc.get_samples()
predictive = Predictive(
    garch_like_sample_vol_model,
    posterior_samples=posterior_samples,
    return_sites=["log_ret"],  # or whatever your observation site is called
)
rng_key, rng_key_ppc = random.split(rng_key)
ppc_samples = predictive(
    rng_key_ppc,
    present_value_log_ret,
    present_value_log_var,
    storage,
    month,
    past_values_train,
)
prior_samples = Predictive(garch_like_sample_vol_model, num_samples=2000)(
    rng_key,
    present_value_log_ret,
    present_value_log_var,
    storage,
    month,
    past_values_train,
)

idata = az.from_numpyro(mcmc, posterior_predictive=ppc_samples, prior=prior_samples)
az.plot_trace(idata, var_names=["~^log_ret_var"], filter_vars="regex");
/Users/sofeikov/work/bayesianfin/.venv/lib/python3.12/site-packages/arviz/utils.py:146: UserWarning: Items starting with ~: ['^log_ret_var'] have not been found and will be ignored
  warnings.warn(

Autoregressive quality checks

In this section I show how to check the quality of the model in this autoregressive setup. First I simulate a few dozen paths to see if, on average, the predictions stay within reasonable bounds. This is followed by inclusion error checks: simulate a few points into the future and see where the actual price lies within the predicted range.

simulator = Simulator(
    model=garch_like_sample_vol_model,
    feature_engineer=feature_engineer,
    exo_fixed_effects=["month"],
    inherit_vals=["storage"],
)
posterior_for_gen = {k: ps[0:1] for k, ps in posterior_samples.items()}
starting_sim_df = gas_storage_data[-T - feature_engineer.n_shifts * 5 - 50 : -T]
all_runs = simulator.simulate_paths(
    steps=30,
    starting_sim_df=starting_sim_df,
    posterior_samples=posterior_for_gen,
    num_sims=100,
)

As you can see on the image below, the paths look quite reasonable, with only a few deviating too much. This is, of course, a qualitative statement rather than a quantitative one, but the latter is coming too.

sns.lineplot(all_runs, x="date", y="price", hue="run_id")
plt.xticks(rotation=90);

last_prices = (
    all_runs.sort("date")  # Ensure dates are in correct order
    .group_by("run_id")
    .agg(pl.col("price").last())  # Get the last price for each run
    .rename({"price": "final_price"})
)
sns.histplot(last_prices, x="final_price")

Inclusion error checks

Here I do some inclusion checks to see how well the actual prices fall within the price spread predicted by the model.

target_window = 7
posterior_for_gen = {k: ps[0:1] for k, ps in posterior_samples.items()}
shifts = list(range(0, T, 10))
all_lasts = []
actual_prices = []
for ct, t in tqdm(enumerate(shifts), total=len(shifts)):
    starting_sim_df = gas_storage_data[
        -T - feature_engineer.n_shifts * 5 - 50 + t : -T + t
    ]
    this_sim = simulator.simulate_paths(
        steps=target_window,
        starting_sim_df=starting_sim_df,
        posterior_samples=posterior_for_gen,
        num_sims=50,
    )
    last_prices = (
        this_sim.sort("date")  # Ensure dates are in correct order
        .group_by("run_id")
        .agg(pl.col("price").last())  # Get the last price for each run
        .rename({"price": "final_price"})
        .with_columns(pl.lit(ct).alias("run_id"))
    )
    actual_prices.append(
        {"run_id": ct, "actual_price": gas_storage_data[-T + t]["price"][0]}
    )
    all_lasts.append(last_prices)
all_lasts = pl.concat(all_lasts)
actual_prices = pl.DataFrame(actual_prices)
100%|██████████| 30/30 [00:31<00:00,  1.04s/it]

From the image below it can be seen that all observed prices, except for one (which was a huge jump in the real data), lie within the predicted interval. However, it is always interesting to compare this with another version of the model. Let’s do that for the model that does not include storage as a regressor. Here is the result:

(Figure: inclusion check box plot and actual prices for the model without the storage regressors.)

If you compare that image with the one below, you can clearly see that the box plot below has fewer outliers, which means that the added storage information leads to improved precision and prediction stability - in other words, the storage data does contain a useful signal. Practically speaking, if your model does not overprice the commodity too much, you can offer more competitive prices on the derivatives you might be selling.

sns.boxplot(data=all_lasts, x="run_id", y="final_price")
sns.scatterplot(data=actual_prices, x="run_id", y="actual_price", zorder=10)

The tighter spread observed above is confirmed by a lower error (the median across windows of the mean squared final-price error: 0.065 vs 0.087).

all_lasts.join(
    actual_prices, left_on="run_id", right_on="run_id", how="inner"
).with_columns(error=(pl.col("final_price") - pl.col("actual_price")) ** 2).group_by(
    "run_id"
).agg(pl.col("error").mean())["error"].median()
0.06586484831089998

Conclusion

By including storage information as a regressor I was able to get a model with better predictive power and more realistic volatility dynamics. By using macro regressors like storage levels, I finally get to work on a model that is somewhat specific to gas and can reflect that market’s particular dynamics.

Despite the improved precision, this model still suffers from the following:

  1. The volatility window in the feature engineering is too short.
  2. The volatility-of-volatility is currently fixed at a very low number. So far I could not find a satisfactory way to model it; it just leads to volatility explosions.
  3. Seasonal effects are not included. Given that storage is depleted in the winter months and injected in the summer, this should explain a good deal of volatility. However, it introduces 12 additional parameters and it is unclear how to use them: as a bias entry into the volatility? As a vol-of-vol? As something affecting the storage levels? (See the sketch below for the bias-entry option.)
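
As an illustration of the first of those options, a minimal sketch of month-specific bias terms added to the log-variance could look like the snippet below. This is not part of the fitted model; seasonal_log_var_bias and its arguments are hypothetical names.

import numpyro
import numpyro.distributions as dist


def seasonal_log_var_bias(month, base_log_var):
    # month: integer array with values 1..12, one per observation
    with numpyro.plate("months", 12):
        month_effect = numpyro.sample("month_effect", dist.Normal(0.0, 0.5))
    # Shift the baseline log-variance by the effect of the observation's month
    return base_log_var + month_effect[month - 1]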

So far, adding seasonality effects has led to worse sampling efficiency in the MCMC, and while the predictions still look good, they do not improve, which feels wrong. Stay tuned: tackling seasonality is my next move.

Additional reading