Deal with missing values better #1032

Closed · NickCrews opened this issue Jun 2, 2022 · 9 comments

NickCrews (Contributor) commented Jun 2, 2022

Currently, if a scoring function returns np.nan, then at

dedupe/dedupe/datamodel.py, lines 104 to 106 in 6fc8724:

    missing_data = numpy.isnan(distances[:, :current_column])
    distances[:, :current_column][missing_data] = 0

we set those values arbitrarily to 0.

This isn't principled, and it randomly throws noise into the training, since 0 is the same score you would get from many scorers comparing two perfectly fine values. It's even inconsistent: for some scorers, like PriceType, a score of 0 signifies a match, while for ExactType, a 0 signifies a mismatch.

Hopefully the problem here makes sense. I'm pretty sure this is lowering my accuracy when I'm dealing with genders being Male/Female/None and comparing them with an ExactType. I'm sort of getting around this by treating a missing value as "neutral" with:

def comparator(v1, v2):
    if v1 is None or v2 is None:
        return 0
    if v1 == v2:
        return 1
    else:
        return -1

but this is not ideal.

Options:

  1. I think the simplest approach, and at least better than the current one, would be to impute with the mean of each column (see the sketch after this list). I don't think there is a downside to this, so it should be the bare minimum.
  2. or we expose a way to set a custom impute hook. Not actually too complicated, I don't think. In this case I think NaNs should be preserved in DataModel.distances (breaking for anyone relying on internals) and NaNs should be imputed in the Classifier.
  3. or, if we're switching away from the RLR classifier, then perhaps we just use a classifier that can natively handle missing values, like XGBoost's random forests: https://stackoverflow.com/a/54869392/5156887
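
A rough sketch of what option 1 could look like, just to make it concrete (a hypothetical helper, not the actual dedupe code; `distances` is the score matrix from datamodel.py above):

    import numpy

    def impute_column_means(distances: numpy.ndarray) -> numpy.ndarray:
        """Replace each NaN with its column's mean instead of zeroing it."""
        col_means = numpy.nanmean(distances, axis=0)  # warns on all-NaN columns
        col_means = numpy.nan_to_num(col_means)       # all-NaN columns fall back to 0
        missing = numpy.isnan(distances)
        return numpy.where(missing, col_means, distances)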

@fgregg can I open a PR for option 1 at the least, and maybe consider 2 and 3?

fgregg (Contributor) commented Jun 2, 2022

it may seem a little strange, but this relates to how dedupe handles missing data.

https://docs.dedupe.io/en/latest/Variable-definition.html#missing-data

you are right, however, that if the user does not declare that a field can be missing, then the current code is probably not the right thing.

options could be:

  1. throw an error that there is a nan, but the user did not set the has_missing flag
  2. do some form of imputation.

between these two options, i would prefer 1. it seems pretty confusing to have two approaches for handling missing data, where one is only triggered if the has_missing flag is not set.
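
a minimal sketch of option 1, with hypothetical names (`field_names` and `has_missing_flags` stand in for however the data model tracks this; not the actual implementation):

    import numpy

    def check_unexpected_nans(distances, field_names, has_missing_flags):
        # fail loudly if a comparator produced NaN for a field whose
        # variable definition did not set has_missing
        for col, (name, allows_missing) in enumerate(zip(field_names, has_missing_flags)):
            if not allows_missing and numpy.isnan(distances[:, col]).any():
                raise ValueError(
                    f"field {name!r} has missing values, "
                    "but 'has missing' was not set in its variable definition"
                )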

i'm open, btw, to using a different approach for missing_data besides the indicator variable approach we use now, but that would be a separate issue.

fgregg (Contributor) commented Jun 2, 2022

not directly related to the issue at hand, but you probably want to use the categorical variable type for your gender variable.

NickCrews (Contributor, Author)

> you probably want to use the categorical variable type for your gender variable.

Oops, definitely.

I might not be understanding it correctly (I didn't read the PhD thesis referenced in the docs; it looks way over my head), but I think even with the has_missing flag, the zeroing out is still problematic. Sure, you add some additional is_missing columns, which might help. But the original zeroed fields are still affecting the predictions.

My understanding is that the has_missing flag is a little poorly named. Really it is a "use is_missing as a feature" flag, which is related to, but not the same as, the problem I'm describing.

So I think even with supplying has_missing, imputation is still needed. Do you see a downside to using the mean for imputing? Exposing config to do custom imputation seems nice, but not needed. I'd imagine this always happens, regardless of whether has_missing is supplied. I can't think of a case where mean-imputing would be worse than the zeroing we're doing now, can you?

By always doing this, I don't think it is too confusing: Missing data is always imputed. Then, iff you give the has_missing flag, additional feature columns are added.

fgregg (Contributor) commented Jun 4, 2022

let's say we have data with a name field and an address field. the address field is missing not at random, or even conditionally at random, so the assumptions of imputation do not hold.

the best thing to do, if we had enough training data, would be to learn two models, one for when we didn't have missing data and one for when we did.

simplifying, let's assume that our models are simple linear regressions, so our models could be

$$ y_1= \beta_0 + \beta_1 \cdot d(\text{name}_i, \text{name}_j)+\beta_2 \cdot d(\text{address}_i, \text{address}_j) + \epsilon $$


$$ y_2 = \beta_3 + \beta_4 \cdot d(\text{name}_i, \text{name}_j) + \epsilon $$

we could split the data on record pairs with and without address missing and train the relevant model. for prediction, we would dispatch one or the other model, depending on whether address was missing.
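
a sketch of that dispatch, with hypothetical fitted models `model_full` and `model_name_only` (illustrative only):

    import numpy as np

    def predict(name_d: float, address_d: float) -> float:
        # choose the model trained on the matching missingness pattern
        if np.isnan(address_d):
            return model_name_only.predict([[name_d]])[0]
        return model_full.predict([[name_d, address_d]])[0]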

we could avoid this dispatching by combining them into one linear model if we use a function "$\operatorname{missing}$" which returns $1$ if one of its arguments is missing and $0$ otherwise.

$$ \begin{align} y_3 = &\beta_a + \beta_b \cdot d(\text{name}_i, \text{name}_j) + \beta_c \cdot d(\text{name}_i, \text{name}_j) \cdot \operatorname{missing}(\text{address}_i, \text{address}_j) + \\ &\beta_d \cdot d(\text{address}_i, \text{address}_j) \cdot \left(1 - \operatorname{missing}(\text{address}_i, \text{address}_j)\right) + \beta_e \cdot \operatorname{missing}(\text{address}_i, \text{address}_j) + \epsilon \end{align} $$

If we train the model (without regularization), the parameters will come out such that if there is no missing data then

$$ \begin{align} y_3 &= \beta_a + \beta_b \cdot d(\text{name}_i, \text{name}_j) + \beta_d \cdot d(\text{address}_i, \text{address}_j) + \epsilon \\ &= \beta_0 + \beta_1 \cdot d(\text{name}_i, \text{name}_j) + \beta_2 \cdot d(\text{address}_i, \text{address}_j) + \epsilon \end{align} $$

and if there is missing data then

$$ \begin{align} y_3 &= (\beta_a + \beta_e) + (\beta_b + \beta_c)\cdot d(\text{name}_i, \text{name}_j) + \epsilon \\ &= \beta_3 + \beta_4 \cdot d(\text{name}_i, \text{name}_j) + \epsilon \end{align} $$

You should be able to convince yourself of the above by matching coefficients term by term.
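
explicitly, matching coefficients between $y_3$ and the two separate models gives

$$ \beta_a = \beta_0, \quad \beta_b = \beta_1, \quad \beta_d = \beta_2, \quad \beta_a + \beta_e = \beta_3, \quad \beta_b + \beta_c = \beta_4 $$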


What we do is like this but not quite. Our model does not have the $\beta_c \cdot d(\text{name}_i, \text{name}_j) \cdot \operatorname{missing}(\text{address}_i, \text{address}_j)$ term. Instead, our model is like

$$ \begin{align} y_3 = &\beta_a + \beta_b \cdot d(\text{name}_i, \text{name}_j) + \beta_d \cdot d(\text{address}_i, \text{address}_j) \cdot \left(1 - \operatorname{missing}(\text{address}_i, \text{address}_j)\right) + \\ &\beta_e \cdot \operatorname{missing}(\text{address}_i, \text{address}_j) + \epsilon \end{align} $$

This is effectively making the assumption that the effect of differences in the name field is the same in the two models, i.e. $\beta_1 = \beta_4$.

This is a kind of compromise: we get some of the benefit of separate models without so much model complexity that training becomes very hard.

(another slight wrinkle is that the identities between the two forms of expressing the model don't really hold when you are doing regularization)


i hope this clears up the approach we are taking and the reasoning behind it.

as i said, i'm open to a different approach. when I investigated approaches many years ago, this beat imputation quite easily.

the paper, if i recall correctly, has some good comparisons with other approaches to handling missing data, including imputation

NickCrews (Contributor, Author) commented Jun 9, 2022

Thank you @fgregg for that detailed explanation; that's a lot of LaTeX! Nice thought experiment, starting with two different models and combining them. That makes sense.

With that all written out, I see how we use $\beta_d$ if the value isn't missing, and $\beta_e$ if the value is missing. Effectively, instead of arbitrarily setting missing values to 0 as I originally understood, we are really setting them to $\beta_e$, which is a learned parameter. Fears assuaged.

The one other thing I was thinking about: in my domain, one missing value can be semantically different from both values being missing. Right now, we treat them the same. But if I really wanted to treat them differently, I think I would put in a sentinel value that isn't None and write my own custom scoring function that takes this into account.
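
For example, a sketch of that workaround (the `MISSING` sentinel and the return values are made up for illustration):

    MISSING = "__missing__"  # hypothetical sentinel substituted for None upstream

    def comparator(v1, v2):
        # distinguish "both missing" from "exactly one missing"
        if v1 == MISSING and v2 == MISSING:
            return 0.5
        if v1 == MISSING or v2 == MISSING:
            return 0.0
        return 1.0 if v1 == v2 else -1.0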

The assumptions and simplifications seem reasonable, and it seems like they wouldn't throw anything off too much. If we moved away from a linear model to a random forest classifier (or really any other classifier), then I think all these assumptions still hold, right? The part I would be least confident about is "making the assumption that the effect of differences in the name field is the same in the two models".

Going back to your suggestion now that I understand things:

> throw an error that there is a nan, but the user did not set the has_missing flag

I agree, I think we should do this. Started #1050

I have one unlikely scenario where someone would have NaNs and NOT want the is_missing columns: Every column has missing data, so they effectively double the number of columns, and therefore there are more predicates and weights to learn, which takes more resources. Currently, I could get around this by using my own custom scoring function that takes into account missing values. But I think this situation is unlikely, and additionally they could keep their old behavior by using non-None sentinel values as I describe above.

NickCrews (Contributor, Author) commented Jun 9, 2022

Wait, just to confirm, I'm looking at

dedupe/dedupe/datamodel.py, lines 105 to 114 in dd8449d:

    is_missing = numpy.isnan(distances[:, :current_column])
    distances[:, :current_column][is_missing] = 0
    if self._missing_field_indices:
        distances[:, current_column:] = (
            1 - is_missing[:, self._missing_field_indices]
        )
    return distances

As it is right now, I think this code achieves:

  • if not missing, the $\beta_d$ term is non-zero, and the $\beta_e$ term is non-zero
  • if missing, the $\beta_d$ term is zero, and the $\beta_e$ term is zero

But don't we want:

  • if not missing, the $\beta_d$ term is non-zero, and the $\beta_e$ term is zero
  • if missing, the $\beta_d$ term is zero, and the $\beta_e$ term is non-zero

So should this instead be

        if self._missing_field_indices:
            distances[:, current_column:] = (
                is_missing[:, self._missing_field_indices]
            )

? Or is this not important, and $\beta_e$ and $\beta_a$ will just change to compensate for the fact that the boolean is inverted?

fgregg (Contributor) commented Jun 10, 2022

> The one other thing I was thinking about: in my domain, one missing value can be semantically different from both values being missing. Right now, we treat them the same. But if I really wanted to treat them differently, I think I would put in a sentinel value that isn't None and write my own custom scoring function that takes this into account.

This is precisely what the Exists variable is for.

> ? Or is this not important, and $\beta_e$ and $\beta_a$ will just change to compensate for the fact that the boolean is inverted?

Sharp observation! I think it shouldn't matter, but let me check that more carefully tomorrow.
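
a quick numerical sketch of why flipping the indicator shouldn't matter for an unregularized linear model (made-up data; the intercept and the indicator coefficient just reshuffle):

    import numpy as np

    rng = np.random.default_rng(0)
    n = 200
    name_d = rng.random(n)              # simulated name distances
    is_missing = rng.integers(0, 2, n)  # 1 if address is missing for the pair
    y = 0.5 - 1.2 * name_d + 0.8 * is_missing + rng.normal(0, 0.1, n)

    # current dedupe-style indicator (1 - is_missing) vs. the inverted one
    X_current = np.column_stack([np.ones(n), name_d, 1 - is_missing])
    X_flipped = np.column_stack([np.ones(n), name_d, is_missing])

    b_current, *_ = np.linalg.lstsq(X_current, y, rcond=None)
    b_flipped, *_ = np.linalg.lstsq(X_flipped, y, rcond=None)

    # identical fits: the intercept absorbs the flip and the indicator
    # coefficient changes sign, so the predictions are unchanged
    assert np.allclose(X_current @ b_current, X_flipped @ b_flipped)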

NickCrews (Contributor, Author)
> This is precisely what the Exists variable is for.

Oh, I didn't understand this from the documentation. I thought it was just a true (both values present) vs. false (one or both values missing) flag, like the flags above. I guess upon rereading the docs, that is what it's saying; I'll see if I can tweak it to be clearer, though.

NickCrews added a commit to NickCrews/dedupe that referenced this issue Jun 10, 2022
NickCrews (Contributor, Author)

Closing this because I think my original worries were addressed. Start a new issue if you have a related problem/idea.

@github-actions bot locked as resolved and limited conversation to collaborators Aug 8, 2022