Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

get_default_state outputs quantities with wrong dimensions #65

Open
JoyMonteiro opened this issue Aug 11, 2018 · 14 comments
Open

get_default_state outputs quantities with wrong dimensions #65

JoyMonteiro opened this issue Aug 11, 2018 · 14 comments

Comments

@JoyMonteiro
Copy link
Member

The output of get_default_state is zero dimensional (only a scalar) if the field is uniform in all directions.
While this is ok for components since the extracted arrays are in the correct shape, the Monitors don't
seem to get the state in the correct shape, and therefore if you do a store at the first step, the dimensions
are not correct.

I think the correct approach would be to ensure that the initial state itself has the correct dimensions. This is also important to not throw off anyone inspecting the initial state.

@JoyMonteiro
Copy link
Member Author

Turns out this is more serious than I thought, since it is no longer possible to set initial conditions (as far as I can see). ConstantDefaultValue will have to be modified to be grid aware and create arrays of correct shape.

@JoyMonteiro
Copy link
Member Author

We should also allow for quantities to be defined on different grids. This is not a priority for this release due to lack of time, but land quantities (eg soil_surface_temperature) must be defined on a land grid, ocean quantities (eg sea_surface_temperature) on an ocean grid etc.,

this would mean that there must be a way to infer these dimensions given the grid_state. ideas?

@mcgibbon
Copy link
Member

mcgibbon commented Aug 11, 2018

What do you mean by "it is no longer possible to set initial conditions"? I will respond to the rest of this later.

@JoyMonteiro
Copy link
Member Author

The following line now errors:

state['air_temperature'].values[0, 0, :] = tp_profiles['air_temperature']

since air_temperature is a scalar. Same issue with other fields initialised by ConstantDefaultValue

@mcgibbon
Copy link
Member

I mean, it is possible to set initial values, you'd just have to use state['air_temperature'] = DataArray(my_array, dims=something, attrs={'units': something}).

@mcgibbon
Copy link
Member

In that example, tp_profiles['air_temperature'] should be a DataArray with a vertical axis, and the line should be state['air_temperature'] = tp_profiles['air_temperature'].

@mcgibbon
Copy link
Member

My initial feeling is that organizing the sizes and presence of dimensions shouldn't be the responsibility of get_default_state. There are better ways to handle this within the NetCDFMonitor and within user code like above. For example, NetCDFMonitor could cache the first store value and wait until the second one to decide it knows what the dimensions should be, at which time it would create output arrays that work for the first two values.

But it might make sense to shift this responsibility to get_default_state, with the understanding that it's giving you the dimensions for each quantity that will be present after you pass the state to the components you've given it. I have to think about it for a bit.

The main thing we lose if we put the responsibility of dimension creation on the initialization is then users can no longer very easily set a constant value as the initial value using state[quantity_name] = DataArray(constant, attrs={'units': units}). All user code that manually creates state quantities would need to also create those dimensions. If users are only using CliMT components this can be fine since they just modify the arrays already created by CliMT, but this problem comes up more strongly if they're using a component from another model or one they've written themselves.

For this reason, right now I feel like I'd prefer it to be that broadcasting along dimensions is the responsibility of the code taking in the state, and not the code that creates the state.

@JoyMonteiro
Copy link
Member Author

Not sure I understand the use case. Why would a component from another model have an issue with arrays created by climt? it would anyways get the correct dimensions in array_call.

I think the scenario where users are setting full 3D initial conditions is more realistic than them setting a constant value, and it must be easy to do so, without having to check dimension names, specifying units etc., and just using a numpy array if that is all that is available.

Since this is already possible in the released versions, this really feels restrictive and like a step backwards.

If NetCDFMonitor is initialised with write_on_store=True, then it won't be able (should'nt be) to cache the initial data.

Like I said, it is really easy for a user to inspect the state output by get_default_state and wonder why an array that they expect to be 3D (they specified dimensions, after all) is a scalar.

@mcgibbon
Copy link
Member

NetCDFMonitor can work in that case if it over-writes the stored single-step file when the second step is generated. I mean, it over-writes anyways each time, right?

Users having to check dimension names and specify units can save them a lot of work when they find out later the units aren't what they were expecting, or the dimensions are ordered differently than they were expecting. I'd say it's a good feature for us to have.

What I mean is that a component from another model has to specify its own initialization routines for quantities, which can't take advantage of our infrastructure for broadcasting dimensions. Maybe we can fix this by integrating initialization into Sympl and having registration of initializers.

A compromise may be to have this be an option for get_grid. We should also write all of our components so that they will work even when the initial arrays are not "full". This way we're only restricting our own code, and not user code. This allows users to write their own initialization code that uses non-full initial arrays, and it will work with our code.

I still may agree with you in the end about just having get_grid return full arrays, but I want to keep arguing this out to make sure we're making the right choice.

@JoyMonteiro
Copy link
Member Author

Nope, it appends at every time step, and errors if the dimension in one time step is different from the previous one.

I'm fully with you that climt components should work with quantities that have been partially specified. So if the state dictionary is modified by a different model or by hand, our components have no issues handling it.

I prefer things to be more predictable at the climt end, which is why i prefer that if the user specifies model dimensions, the model arrays are of that shape. This also eliminates the need to create dataarrays just to specify initial conditions.

@mcgibbon
Copy link
Member

That's how NetCDFMonitor currently works. I'm talking about how it should work. Do you agree about how it should work?

I disagree that being able to specify initial conditions without making DataArrays is a good feature to have. I'm alright with that being the case, but it's not something I care about having as a feature, since it can lead to as much of an issue (using incorrect dimensions or units) as it creates (having to explicitly define dimensions and units).

@JoyMonteiro
Copy link
Member Author

Nope. This means that write_on_store does not do what it says. I can also see other problems: say, for example, someone wants to update CO2 concentration from the 100th timestep, then store will fail again.

The same issue with PlotFunctionMonitor. I had to modify every script to not store on the first time step in climt/examples. This is definitely undesirable.

I'm definitely not saying that it should be advertised as a feature. Moreover, there is nothing that can prevent such issues from cropping up even now -- for example, I could replace the scalar in a DataArray with another one without keeping in mind units. It is harder to change dimensions as xarray
prevents assignments which changes array shapes.

@mcgibbon
Copy link
Member

Alright, there's a cleaner way to fix it which is that every time a new array is stored (not just the first time), NetCDFMonitor should check if the dimensions are compatible (if there are any new dimensions), and alter the array to be compatible if it must do so (by adding new dimensions). We're not using xarray, we're using netCDF4 as a backend directly for that component. However, xarray (and netCDF4) do allow changing array shapes/dimensions by replacing the variable with a new variable and copying the data over.

This would be fully compatible with write_on_store, the write operation would just be more expensive any time new dimensions are added to arrays.

At this point I agree that full arrays should be returned by get_default_state. But I think NetCDFMonitor should be fixed to work even when it's given sparse arrays that are filled out later. I'll open an issue in sympl for this.

@MTMUTH
Copy link

MTMUTH commented Aug 19, 2018

Build an API to pull multiple U.S. Department of(x) statistics for multiple specific values you want and pull, unless the problem has been fixed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants