Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Problems with feature column names exceeding Postgres' max identifier length #739

Open
adunmore opened this issue May 21, 2020 · 6 comments

Comments

@adunmore
Copy link
Contributor

adunmore commented May 21, 2020

Triage can try to create column names longer than Postgres' 63 character max identifier length. When this happens, Postgres truncates the candidate column name.

I've seen this causing problems in the feature generation code where an expected column name doesn't match the actual name in database. For example, when triage applies imputation rules, it first queries a feature table for the list of columns containing nulls. If one of the column names returned was truncated on creation, then triage doesn't have an imputation rule to handle it.

This might also cause problems where triage tries to create duplicate columns, as when creating columns for different aggregations of the same quantity:

'[some_63_char_identifier]_sum'
and
'[some_63_char_identifier]_max'
both get truncated to the same identifier:
'[some_63_char_identifier]'

@ecsalomon
Copy link
Contributor

Thanks for flagging this. The temporary workaround is to shorten the name of your columns in your from_obj or when you are selecting values for categorical columns (this is where I tend to run into it). For example, for categoricals, I will sometimes use something like left(6) to produce shortened names.

@thcrock
Copy link
Contributor

thcrock commented May 21, 2020

It's worth thinking about what Triage should do in this case. Just flag it in validation and throw an error as early as possible, or is there a reasonable way to autocorrect/autoshorten for the user?

In either case, it'll probably be easier to fix it if #608 is merged, because in it the full feature names (with intervals and aggregate functions, etc) are available much earlier in the process, before the database starts doing work.

@adunmore
Copy link
Contributor Author

Could we

  • implement an option to specify an alias for the source column name (the middle part in the generated name) in the config file
  • throw an error and require that the user specify a shorter name in the alias

This has the advantage of being easier than trying to figure out an approach to automatically shortening the name, and allows the user to ensure that the resulting name is meaningful/readable while being unique (compared to something like generating a hash of the source column name)

@thcrock
Copy link
Contributor

thcrock commented May 22, 2020

I think #1 is one of the things @ecsalomon was alluding to, if I read it right. The aliasing you can do now.

If you specify the quantity as a dict instead of a string you can call the output column whatever you want. Dirtyduckling uses this aliasing option:
https://github.com/dssg/triage/blob/master/example/config/dirty-duckling.yaml#L32

Instead of
quantity: <sourcecolname>
do

quantity:
   <alias>: <sourcecolname>

@thcrock
Copy link
Contributor

thcrock commented May 22, 2020

As far as the validation goes, this ticket exists for that.

#517
#514

A bunch of people have hit this problem, so it would be very nice to fix! As of yet it hasn't been done because it's a bit complicated to catch this before feature creation given how the feature creation code is structured right now (hence my bringing up #608)

@ecsalomon
Copy link
Contributor

To be slightly more useful, here is an example feature group with my categorical workaround:

 -
        prefix: 'dx'
        from_obj: |
            (SELECT entity_id,
                   knowledge_date,
                   left(diagnosis, 6) AS dx                     # shortened name
              FROM diagnosis_table) AS dx
        knowledge_date_column: 'knowledge_date'
        categoricals_imputation:
            all:
                type: 'zero'
        categoricals:
            - # first 6 letters of diagnosis with at least 150 occurrences
                column: dx
                choice_query: |
                    SELECT DISTINCT left(dx, 6)       # shorten each category to 6 letters
                      FROM (
                               SELECT left(diagnosis, 6) AS dx, 
                                      count(*)
                                 FROM diagnosis_table
                                GROUP BY dx
                           ) AS dx_counts
                     WHERE count > 150
                metrics: 
                    - max
                    - sum

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants