Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Discussion: merging Survival packages #1

Open
piever opened this issue Aug 1, 2017 · 11 comments
Open

Discussion: merging Survival packages #1

piever opened this issue Aug 1, 2017 · 11 comments

Comments

@piever
Copy link
Contributor

piever commented Aug 1, 2017

Hi!
A few months back I had started developing a package for Survival analysis in Julia (it's here )
So far it mainly has Kaplan Meier, Cox proportional hazard model and Accelerated Failure Time models.
Cox model is well optimized (last time I benchmarked it was about 3x faster than matlab's version on some test dataset). Accelerated Failure Time models are not as polished (and way less common I guess).

I don't think I will have enough time to dedicate to this in the future to make one fully polished Survival package alone (plus, it makes limited sense to have 2 separate Survival packages). If you think that would be valuable I can polish my version up a bit (get it up to date with Julia v0.6 and so on), make it compatible with your type system/formalism and make a PR to your package to start unifying things. I guess the first thing to contribute would be Cox (as it's more polished/ it's more clear how to do it). Accelerated Failure Time models can wait until they are cleaner and it also depends whether you are interested in them.

Let me know what you think!
Pietro

@ararslan
Copy link
Member

ararslan commented Aug 1, 2017

Hey Pietro, that sounds like an awesome idea. If you'd be willing to contribute that functionality to this package, that would be awesome!

@piever
Copy link
Contributor Author

piever commented Aug 1, 2017

Cool! I have Cox working in your package. I'll polish it and make a PR in the next few days. One question though: how do you want to integrate your package with DataFrames (in order to use formulas in Cox regression)?

The way I had done it was to simply add a column DataArray{EventTime} (which is how I've also translated it to your package so far). However, I see that you also have EventTimeVector (why is this better than Vector{EventTime} ? Some performance advantages ?), but it can't be added to a DataFrame as it has no upgrade_vector method. Should I simply add one trivial identity upgrade_vector method or stick to Vector{EventTime} ?

@ararslan
Copy link
Member

ararslan commented Aug 1, 2017

EventTimeVector is indeed intended to be a performance improvement over Vector{EventTime}, since it stores the survival times as a contiguous array of scalars and censoring indicators as a BitArray, rather than storing an array of EventTime structs. (Granted I didn't do any real performance benchmarking, I just kind of assumed this approach would be faster.)

I hadn't intended to add a dependency on DataFrames, but it seems reasonable to add one on DataArrays. An even better solution may be to make everything here as generic as possible so as to allow arbitrary arrays (including DataArrays) but it's been long enough since I've looked at this package that I'm not really sure whether that's possible with what I have in place now.

@piever
Copy link
Contributor Author

piever commented Aug 2, 2017

I see. I got it to work without a dependency in DataFrames, in a way that

using Survival, DataFrames
StatsBase.fit(Survival.CoxModel, @formula(event ~ fin+age+race+wexp+mar+paro+prio) ,rossi)

works, except for one silly issue for which I need your help. Cox regression doesn't actually give a value for the intercept, so it should be excluded for the formula (otherwise this step fails) .

How can I specify in the call StatsBase.fit(Survival.CoxModel, @formula(event ~ fin+age+race+wexp+mar+paro+prio) ,rossi) that I do not want the intercept to be part of the formula?

@ararslan
Copy link
Member

ararslan commented Aug 2, 2017

Ah right, formulas are an important part of things like this... That functionality will eventually be split out of DataFrames but in the meantime that's where it lives so I guess a dependency on DataFrames makes sense. (Though IIRC GLM gets by without it somehow, haven't looked into that.)

There was talk about requiring the intercept in formulas to be explicit but in the meantime an intercept is implicitly added. I believe the way to get around that is to add 0 + ... in the formula.

@piever
Copy link
Contributor Author

piever commented Aug 3, 2017

I see, then with the hack that the formula needs to be called with the 0+... I think I can reproduce the same structure as GLM does to manage and use formulas without depending on DataFrames. Once formulas are out of DataFrames I guess we can add a dependency on it and make sure that the intercept is excluded from formula in case of Cox model.

@nalimilan
Copy link
Member

Cox models are indeed and interesting case where implicit intercepts do not make sense. Cf. JuliaData/DataFrames.jl#574. Though, would it be possible to handle this in the Cox model code, by dropping the column for the intercept? I guess that's what R does.

@piever
Copy link
Contributor Author

piever commented Aug 3, 2017

Almost, The issue is that I have to compute the coeftable for the CoxModel (leaving in my package and being independent from DataFrames), without having access to the formula/ModelFrame, so that the coeftable is generate with placeholders as regressor names (see for example here ). After that, the CoxModel gets embedded in a DataFrameModels here ) and finally when displaying the outer layer, i.e. the DataFramenModels, the coeftable is filled in with the correct labels (happening here ). So I got it to work with normal formulas by taking out one column but there are two issues:

  • The formula displayed is wrong
  • The last passage (replacing placeholders with correct column labels in the CoefTable) fails, because the two arrays have different names (due to the extra intercept)

I couldn't figure out how either can be fixed without a dependency to DataFrames. I'd recommend that for now we require an explicit 0+... in the formula and for the future we press for one of the 2 following solutions:

  • Take formulas out of DataFrames (so that the CoxModel exceptions can have a more specific dispatch that takes care of everything)
  • Do not add implicit intercept

At least for now, I'll try and make a PR with this design so it becomes easier for everybody see what are the issues and if there is some smarter solution that I'm missing.

@nalimilan
Copy link
Member

I see. Requiring 0 + for now is OK. My question was more about what we should do in the long term. Unfortunately, the code to handle formulas is quite complex so it cannot/shouldn't be reimplemented by each package. And one of the goals was that packages should not have to depend on DataFrames/StatsModels at all. So I'm not sure how to handle that other than never adding an implicit intercept, which isn't great either. We could add an implicit_intercept function to StatsModels, with a default to true, but which you could override to false when needed. Only packages which need it would depend on StatModels, and with optional dependencies that would be even less of an issue.

@piever
Copy link
Contributor Author

piever commented Aug 3, 2017

100% agreed. I think this only is an issue because a dependency on DataFrames is much heavier than a dependency on some lightweight package like StatsModels.

On a separate note, when the formula formalism gets an independent package, it will be important to document it: the only way for me to produce some CoxModel type that works was by following line by line in the code what is happening in GLM and trying to reproduce it.

@ararslan
Copy link
Member

ararslan commented Aug 3, 2017

On a separate note, when the formula formalism gets an independent package, it will be important to document it

I believe StatsModels has some docs, and I thought there was a section on formulas in the DataFrames docs as well.

the only way for me to produce some CoxModel type that works was by following line by line in the code what is happening in GLM and trying to reproduce it.

😕 Sorry about that, that kinda sucks. I really appreciate the effort though.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants