
Issue on page /modules/generated/chainladder.IncrementalAdditive.html #377

Open
henrydingliu opened this issue Nov 23, 2022 · 13 comments

@henrydingliu
Collaborator

I have been mucking around with the IncrementalAdditive method for some payment pattern work. I found some bugs and noticed a couple of potential QoL improvements.

bugs:

  • transformer doesn't have incremental_
  • ldf_ doesn't give the actual incremental factors (not sure if intentional)

other enhancements:

  • add support for cdf_
  • store sample_weights for later retrieval
  • add support for tail
@jbogaardt
Collaborator

Thanks for the review! This is a lesser-used development estimator, so thanks for finding the poor coverage. Some of the secondary diagnostic properties have not been passed through to transform as consistently as they should be. We should be passing incremental_ through.

A general ambition for the package is to be able to mix and match development estimators, tail estimators, and IBNR methods. In order to do this, we standardize ldf_ to a multiplicative basis. For IncrementalAdditive, the multiplicative ldf_ should produce run-off consistent with the incremental_ property:

import chainladder as cl
tri = cl.load_sample("ia_sample")
ia_dev = cl.IncrementalAdditive().fit(
    X=tri['loss'], 
    sample_weight=tri['exposure'].latest_diagonal)
# These match
assert (
    cl.Chainladder().fit(ia_dev.transform(tri['loss'])).full_triangle_.loc[..., :72].round(0) == 
    ia_dev.incremental_.incr_to_cum().round(0))

Standardizing the ldf_ interface to always be multiplicative allows for user-specified pipelines that support tail estimation:

import chainladder as cl
tri = cl.load_sample("ia_sample")
pipe = cl.Pipeline(steps=[
    ('dev', cl.IncrementalAdditive()),
    ('tail', cl.TailCurve()), # Tail support provided via pipeline
    ('model', cl.BornhuetterFerguson(apriori=.6))]
)
pipe.fit(X=tri['loss'], sample_weight=tri['exposure'].latest_diagonal)

There are several "methods" in actuarial literature where development and tail extrapolation are rolled into one method. For these, we do not add tail options to the development estimator. Instead, we deliberately break them up to support compatibility with other user-selected options. The Clark growth curve method is a good example of this. To replicate the paper, one would need to build a pipeline that includes both ClarkLDF and TailClark:

import chainladder as cl
genins = cl.load_sample('genins')
pipe = cl.Pipeline(steps=[
    ('dev', cl.ClarkLDF()),
    ('tail', cl.TailClark()),
    ('model', cl.Chainladder()),])
pipe.fit(genins)

However, if a user so chooses, they can swap out TailClark for any other tail estimator to come up with their own approach. While doing this may invalidate some properties of stochastic approaches, we leave it up to the user to manage the appropriateness of their models.
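For example, a user could swap TailClark for a curve-fit tail like this (a minimal sketch; TailCurve is just one possible choice, and the resulting model departs from Clark's paper):

import chainladder as cl
genins = cl.load_sample('genins')
# Same development step as the paper, but with a user-selected curve-fit tail
# in place of TailClark (sketch only; Clark's stochastic assumptions may no longer hold)
pipe = cl.Pipeline(steps=[
    ('dev', cl.ClarkLDF()),
    ('tail', cl.TailCurve(curve='exponential')),
    ('model', cl.Chainladder())])
pipe.fit(genins)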

@henrydingliu
Collaborator Author

henrydingliu commented Nov 23, 2022

Thanks for the walkthrough! I think I understand a bit better.

With that new understanding, ldf_ and cdf_ make sense to me the way they are specified. I think what I was trying to ask for is a cumulative version of zeta_, since it looks like zeta_ thinks it's cumulative:

import chainladder as cl

genins = cl.load_sample("genins")
genins_dev = cl.Development().fit_transform(genins)
genins_cl = cl.Chainladder().fit(genins_dev)
genins_incr = cl.IncrementalAdditive().fit(genins, sample_weight=genins_cl.ultimate_ * 5)
print(genins_incr.zeta_)
print(genins_incr.zeta_.incr_to_cum())
print(genins_incr.zeta_.cum_to_incr())

I wonder if I may be able to persuade you to take a more relaxed stance on the relationship between the Development class and the IBNR Methods class. Kinda like how incr_to_cum() currently uses is_pattern to generalize between addition for triangles and multiplication for ldfs. What if there were additive Development classes and multiplicative Development classes? The way I think about the Incremental method is that it's a special case of the BF method, where the IBNR% is specified directly rather than through 1/(1 - IBNR%), aka development factors. If there were a boolean or subclasses built around additive/multiplicative, then the IBNR methods could be flexible in how they choose to interface with the Development class.
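To make that equivalence concrete, here is a small numeric sketch (plain Python, toy numbers only, no chainladder API) of the identity being relied on: specifying the IBNR% directly is the same assumption as specifying a development factor of 1/(1 - IBNR%):

# Toy numbers: an exposure (a priori) base and an assumed unreported percentage
exposure = 1000.0
ibnr_pct = 0.35                       # additive view: 35% of ultimate still unreported

# Additive (BF-style) view: apply the IBNR% to the exposure directly
ibnr_additive = exposure * ibnr_pct

# Multiplicative view: the same assumption expressed as a development factor
cdf = 1 / (1 - ibnr_pct)              # cdf = 1 / (1 - IBNR%)
ibnr_multiplicative = exposure * (1 - 1 / cdf)

assert abs(ibnr_additive - ibnr_multiplicative) < 1e-9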

I'm thinking about a 2-phase refresh effort. Please lemme know if it makes sense.

Phase 1 - QoL improvements

  • zeta_ and cumulative zeta_
  • store sample_weights, and allow retrieval
  • store incremental % triangle, and allow retrieval

Phase 2 - Enhanced integration with rest of chainladder package

  • for IBNR methods where it makes sense, allow the IBNR method to directly take zeta_ as the key input, rather than ldf_. (I'm using the old 0.8.13 full_triangle_ here. Maybe the new implementation already fixes it.)
import chainladder as cl
tri = cl.load_sample("ia_sample")
pipe = cl.Pipeline(steps=[
    ('dev', cl.IncrementalAdditive()),
    ('model', cl.BornhuetterFerguson(apriori=.6))]
)
test = pipe.fit(X=tri['loss'], sample_weight=tri['exposure'].latest_diagonal)
print(test.named_steps.model.full_triangle_.cum_to_incr())
print(test.named_steps.dev.incremental_)
  • add additive tail transformers, or add additive logic into existing tail transformers (the smoothing applied to additive-turned-multiplicative ldfs, which tend to 1, isn't quite the same as directly smoothing the additive factors, which tend to 0; see the sketch below)
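To illustrate that last point, here is a toy sketch (hypothetical numbers, plain numpy, not chainladder API) showing that an exponential-decay fit to the additive factors is not the same as a fit to ldf - 1, because the conversion between the two bases depends on the cumulative level at each age:

import numpy as np

ages = np.array([12, 24, 36, 48])
zeta = np.array([0.30, 0.15, 0.08, 0.04])         # additive incremental % of exposure
cum = np.cumsum(np.concatenate([[0.40], zeta]))   # cumulative % reported, starting at 40%
ldf = cum[1:] / cum[:-1]                          # equivalent multiplicative factors

# Exponential decay fitted to the additive factors (tend to 0)
slope_add = np.polyfit(ages, np.log(zeta), 1)[0]

# Exponential decay fitted to ldf - 1 (tends to 0 as ldf tends to 1)
slope_mult = np.polyfit(ages, np.log(ldf - 1), 1)[0]

print(slope_add, slope_mult)   # different decay rates imply different tails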

@jbogaardt
Collaborator

> I wonder if I may be able to persuade you to take a more relaxed stance on the relationship between the Development class and the IBNR Methods class. Kinda like how incr_to_cum() currently uses is_pattern

That makes sense. My design style has generally put the burden of conformity on the estimator outputs (e.g. impute a multiplicative ldf_) rather than conforming and validating inputs (e.g. having BornhuetterFerguson appropriately deal with additive approaches), mainly because it has been easier to manage OOP coupling that way. I recognize that there is a trade-off there that should be reevaluated.

Your phase 1 suggestions are great. I fully support them.

Your phase 2 example looks the same with the new full_triangle_ implementation. Something buggy is going on there. I'll create a new issue for this.

I support the direction you want to go in phase 2 as well. It's compatible with the end-user API and would improve the user experience. My biggest fear is that the software design skills needed to pull it off are pretty high. The way these classes interact today requires only novice-to-intermediate skill. To do it properly, I think we need some advanced software engineering, or at the very least careful planning and scoping of estimator compatibility.

@henrydingliu
Collaborator Author

Created #381 for phase 1 fixes.

Ideally I would have liked to create cum_zeta_ and fit_zeta_ as properties, but I think that would only work if I added them directly to Triangle?

@henrydingliu
Collaborator Author

In further testing, #181 is popping up again.

import chainladder as cl

raa = cl.load_sample('raa')
raa.values[0, 0, 0, 5:] = 18608
ult = cl.Chainladder().fit(raa)
incr = cl.IncrementalAdditive().fit(raa, sample_weight=ult.ultimate_)
print(incr.tri_zeta)
print(incr.zeta_)

Take the 108 column as an example. The 0 incremental from AY 1981 should really have contributed towards a lower average, but it's being ignored as NaN. I would take the position that 0 incrementals occur sufficiently frequently in real data. I will think through a solution and propose it here.
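A quick numeric sketch of the distortion (toy ratios, not the actual raa values):

import numpy as np

# Toy incremental-to-exposure ratios observed at one development age
ratios = np.array([0.10, 0.12, 0.00])

with_zero = ratios.mean()                                        # zero counts as an observation
zero_as_nan = np.nanmean(np.where(ratios == 0, np.nan, ratios))  # zero silently dropped

print(with_zero, zero_as_nan)   # ~0.073 vs 0.11: the nan treatment biases the average upward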

@henrydingliu
Collaborator Author

import numpy as np
import chainladder as cl

def nan_to_zero(tri, years=None):
    """Replace nan cells within the triangle's valid area with 0, optionally
    restricted to a list of origin years."""
    xp = tri.get_array_module()
    tric = tri.copy()
    if years is None:
        # Zero-fill everywhere, then re-apply the nan mask outside the triangle
        tric.values = xp.nan_to_num(tric.values) * tric.nan_triangle
    else:
        years_list = years if isinstance(years, list) else [years]
        years_np = np.array(years_list)
        # Locate the origin rows matching the requested years
        origin_ind = np.where(np.array([tri.origin.astype("string")]) == years_np.reshape(-1, 1))[1]
        helper = tric.nan_triangle * 0
        helper[origin_ind, :] = helper[origin_ind, :] * 0 + 1
        # Zero-fill only the nan cells on the selected origin rows
        tric.values = np.where(np.logical_and(np.isnan(tric.values), helper[None, None] == 1), 0.0, tric.values)
    return tric

raa = cl.load_sample('raa')
raa.values[:, :, 0:4, 5:] = 18608
print(nan_to_zero(raa.cum_to_incr()))
print(nan_to_zero(raa.cum_to_incr(), '1981'))
print(nan_to_zero(raa.cum_to_incr(), ['1981', '1983']))

Here is the function. The intent is to put it under Triangle as an internal function (tri turns into self). Lemme know what you think.

@henrydingliu
Collaborator Author

@jbogaardt thoughts?

@jbogaardt
Collaborator

Sorry for the late response. I didn't see the original proposal. This seems great. There are a lot of places in the package where 0s are coerced to nans, and we should be deliberate when using your function. As the package matures we do need to reimplement some backend functionality to recognize that nan and 0 are not the same. Your function is a step in that direction.

In my own usage of Triangles, I build claim-level triangles and rely heavily on the sparse array (fill_value=nan) for high throughput of data. For my use case, I store my triangles in incremental format and need the zeros to be nan so I don't blow out my machine's RAM. That is to say, there are some cases where we might want zeros represented by nan and other cases (like this issue) where we do not. The package currently always turns zeros into nans, and that's probably not the right answer. Similarly, turning all nans into zeros is also not effective, especially for large sparse triangles. All this to say that when coercing nans to zeros we should consider the consequences for very large triangles.
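For a rough sense of scale, a sketch (it assumes the 'prism' sample loads on the sparse backend, and only estimates the size of a dense copy rather than materializing one):

import numpy as np
import chainladder as cl

prism = cl.load_sample('prism')['Paid']

# Actual storage with the sparse backend: only non-missing cells are held
print(prism.values.nbytes)

# What a fully dense float64 copy of the same 4D triangle would require
print(int(np.prod(prism.shape, dtype=np.int64)) * 8)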

@henrydingliu
Collaborator Author

Lemme add a sparse incremental test and tinker around. What's a good sparse dataset I can use?

@jbogaardt
Collaborator

The "prism" dataset is a large sparse incremental triangle.

import chainladder as cl
prism = cl.load_sample('prism')['Paid']

model = cl.Chainladder().fit(prism.sum())
claim_level_ult = model.predict(prism).ultimate_
claim_level_ult.groupby('Limit').sum()

@henrydingliu
Collaborator Author

Based on my primitive understanding of sparse, would it be correct to conclude that these couple of lines in the class are essentially turning off sparse support?

I haven't made full sense of the Benktander implementation yet. My understanding is that much of it is there to enable sparse support (probably also why apriori doesn't support a vector?). Meaning, it might actually make sense to deprecate incremental_ and just focus on adding additive support in BF?

@jbogaardt
Collaborator

jbogaardt commented Dec 5, 2022

Yes, sorta. It's not well documented, but I went down the path of having any "development patterns" be dense. I couldn't think of a reason why we would ever have sparse development patterns. But I could think of plenty of reasons why we would want to multiply those development patterns by a sparse Triangle of individual claims.

As such, arithmetic between a sparse triangle and a dense triangle (such as multiplying the latest diagonal by CDFs) is designed to yield a sparse result. Indeed, much of the logic in methods/base.py, and even the methods themselves, has wonky-looking arithmetic to pare extraneous entries down to nans so that the sparse computations are more efficient.

So, it is perhaps okay to have the estimators themselves create "coefficients" that are stored as dense, but X and sample_weight should honor both sparse and dense formats.
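A quick way to see that split in practice (a sketch; it assumes prism reports a 'sparse' array_backend and that the prediction keeps it, per the description above):

import chainladder as cl

prism = cl.load_sample('prism')['Paid']
model = cl.Chainladder().fit(prism.sum())             # patterns fit on the dense aggregate

print(prism.array_backend)                            # expected: 'sparse'
print(model.ldf_.array_backend)                       # development patterns kept dense
print(model.predict(prism).ultimate_.array_backend)   # sparse X should yield a sparse result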

@henrydingliu
Collaborator Author

I do see how in most cases dense coefficients work just fine. I don't think dense coefficients currently work for IncrementalAdditive, though. The conversion from zeta_ to ldf_ is done densely and depends on the dense latest diagonal. This certainly makes the CL estimator give odd results, and I suspect BF as well. Once BF can take dense zeta_ directly, we should be able to get rid of the incremental_ attribute (nothing against incremental_, it just seems like extra work to enable sparse support) and rely solely on BF for sparse-supported estimation.
