Updated dataframe interactions #29

Open · 1 task
0xMochan opened this issue May 29, 2023 · 1 comment
Labels: enhancement (New feature or request)

Comments

0xMochan (Collaborator) commented May 29, 2023

Is your feature request related to a problem? Please describe.
I cannot use modern dataframe techniques effectively with subgrounds.

Describe the solution you'd like
I would like to leverage modern pandas (2.0) and polars, alongside the Arrow data format and DuckDB, directly when using subgrounds.

This could be a query_arrow method, or even query(format="pandas"); perhaps a generic query interface, similar to how a PaginationStrategy is selected.

Theoretically, we could add a new argument to query and keep the existing query behaviour available through a default legacy_query callable or interface.
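
A rough sketch of what that could look like (every name below is hypothetical; none of this exists in subgrounds today):

```python
# Hypothetical sketch of a pluggable result format, mirroring how a
# PaginationStrategy is chosen today. None of these names exist in subgrounds yet.
from typing import Any, Protocol

import pyarrow as pa


class ResultFormatter(Protocol):
    def format(self, rows: list[dict[str, Any]]) -> Any:
        """Turn raw response rows into the caller's preferred container."""


class ArrowFormatter:
    def format(self, rows: list[dict[str, Any]]) -> pa.Table:
        return pa.Table.from_pylist(rows)


# Possible call sites:
#   sg.query(field_paths)                              # legacy/default behaviour
#   sg.query(field_paths, formatter=ArrowFormatter())  # generic interface
#   sg.query_arrow(field_paths)                        # thin convenience wrapper
```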

Describe alternatives you've considered
You can use query_json to implement a custom query, but it's undocumented and quite obtuse. It's also awkward to navigate the Python-ification of data types, which often has to be undone with polars, for example.
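
For context, the workaround today looks roughly like this. It is only a sketch: the subgraph URL and the pairs/reserveUSD fields are placeholders, and the exact shape of the JSON returned by query_json depends on the query, which is precisely the undocumented part.

```python
# Rough sketch of the current query_json workaround (placeholder subgraph and fields).
import polars as pl
from subgrounds import Subgrounds

sg = Subgrounds()
subgraph = sg.load_subgraph("https://api.thegraph.com/subgraphs/name/<org>/<subgraph>")

pairs = subgraph.Query.pairs(first=100)  # hypothetical entity on the loaded subgraph

# query_json returns nested, JSON-ish Python objects; flattening them and undoing
# the type conversions for polars is left entirely to the user.
raw = sg.query_json([pairs.id, pairs.reserveUSD])
df = pl.from_dicts(raw)  # still nested; needs manual unnesting and re-typing
```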

Additional context

  • The interface for this should be future-proof, since we'll likely want to promote it over query_df, which is what we currently push. Theoretically, query_df could be reimplemented as a shorthand on top of this new query interface to maintain backwards compatibility.
    • In a theoretical subgrounds 2.0, breaking changes could elevate this interface further.
  • The Arrow data format is an interesting standardization that could be rooted deeper in this interface.
    • Something like query_arrow could easily be converted to a pandas>=2.0 or polars dataframe without any conversion loss (see the sketch after this list).
  • We'll likely want to push an alpha build of this interface for testers.
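
A minimal sketch of the no-conversion-loss point, assuming a hypothetical query_arrow that returns a pyarrow.Table (the hand-built table below just stands in for its result):

```python
# Stand-in for a hypothetical query_arrow result; shows the lossless hand-off to
# pandas >= 2.0 (Arrow-backed dtypes) and to polars.
import pandas as pd
import polars as pl
import pyarrow as pa

table = pa.table({"block": [1, 2, 3], "reserveUSD": [10.5, 11.0, 12.25]})

pdf = table.to_pandas(types_mapper=pd.ArrowDtype)  # pandas 2.x with ArrowDtype columns
pldf = pl.from_arrow(table)                        # polars, zero-copy where possible
```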

Implementation checklist

  • Task 1
0xMochan added the enhancement (New feature or request) label on May 29, 2023
Evan-Kim2028 commented Sep 15, 2023

Just opened an issue that is similar to this one: #42

Closing out #42 and adding the comment to this issue to consolidate the discussion:

Subgrounds should offer dataframe GraphQL function support for multiple libraries as well, not just pandas. Currently, the only dataframe utility functions are pandas-based, found in dataframe_utils.py.

The current direction of Subgrounds is towards a multi-client world. One alternative to the base client would be one that uses polars instead of pandas dataframes. However, dataframe_utils.py currently only offers pandas helpers, which makes using polars with Subgrounds unnecessarily awkward.

To use subgrounds with polars, users currently have to keep redefining helper functions such as fmt_dict_cols and fmt_arr_cols:

  • fmt_dict_cols: required to convert GraphQL JSON dictionary data into separate polars dataframe columns.
  • fmt_arr_cols: required to split GraphQL JSON data fields that contain arrays into individual polars dataframe columns.

Example code:

```python
import polars as pl


def fmt_dict_cols(df: pl.DataFrame) -> pl.DataFrame:
    """
    Formats dictionary cols, which are 'structs' in a polars df, into separate columns
    and renames them accordingly.
    """
    for column in df.columns:
        if isinstance(df[column][0], dict):
            col_names = df[column][0].keys()
            # rename the struct fields to <column>_<field>
            struct_df = df.select(
                pl.col(column).struct.rename_fields([f"{column}_{c}" for c in col_names])
            )
            # unnest the struct into one column per field
            struct_df = struct_df.unnest(column)
            # drop the original struct column and append the unnested columns
            df = df.drop(column).hstack(struct_df)

    return df


def fmt_arr_cols(df: pl.DataFrame) -> pl.DataFrame:
    """
    Formats lists, which are arrays in a polars df, into separate columns and renames
    them accordingly. Since there isn't a direct way to convert array -> new columns,
    we convert the list to a struct and then unnest the struct into new columns.
    """
    # use this logic if the column is a list (rows show up as pl.Series)
    for column in df.columns:
        if isinstance(df[column][0], pl.Series):
            # convert the list column to a struct
            # (older polars versions used `.arr.to_struct()` instead of `.list`)
            struct_df = df.select([pl.col(column).list.to_struct()])
            # rename struct fields to <column>_<index>; assumes all rows have the
            # same list length as the first row
            n_fields = len(df[column][0])
            struct_df = struct_df.select(
                pl.col(column).struct.rename_fields([f"{column}_{i}" for i in range(n_fields)])
            )
            # unnest struct fields into their own columns
            struct_df = struct_df.unnest(column)
            # drop the original list column and append the unnested columns
            df = df.drop(column).hstack(struct_df)

    return df
```
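
For illustration, here is how those helpers behave on a toy frame (made-up data, not subgrounds output):

```python
import polars as pl

df = pl.DataFrame({
    "pair": [
        {"token0": "WETH", "token1": "USDC"},
        {"token0": "WBTC", "token1": "DAI"},
    ],
    "reserves": [[1.0, 2.0], [3.0, 4.0]],
})

df = fmt_dict_cols(df)  # pair -> pair_token0, pair_token1
df = fmt_arr_cols(df)   # reserves -> reserves_0, reserves_1
print(df.columns)       # ['pair_token0', 'pair_token1', 'reserves_0', 'reserves_1']
```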
