prodillo/Cleaning-Functions

Cleaning-Functions

This notebook contains the class cleaning, which provides several methods for cleaning categorical variables. It also includes scikit-learn implementations of the methods remove_nulls and group_categories, so they can be used in a pipeline of transformations. The methods are the following:

  • get_nulls(dataframe, columns): This method returns a dictionary with the percentage of nulls in each column of a dataframe.

    -Inputs:

      - dataframe: a pandas dataframe object.
      - columns: the columns of the dataframe to include in the calculation. If not specified, all 
        columns are taken into account.
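
As an illustration of the behaviour described above, here is a minimal sketch written as a standalone function rather than a method of the class. The body is an assumption, not the notebook's actual code, and it assumes percentages are expressed on a 0-100 scale.

```python
import pandas as pd


def get_nulls(dataframe, columns=None):
    """Return {column name: percentage of null values} (hypothetical sketch)."""
    # Assumption: when no columns are given, every column is included.
    if columns is None:
        columns = dataframe.columns
    # isnull().mean() is the fraction of nulls per column; scale it to 0-100.
    return (dataframe[list(columns)].isnull().mean() * 100).to_dict()


df = pd.DataFrame({
    "city": ["NY", None, "LA", None],
    "color": ["red", "blue", None, "red"],
})
null_pct = get_nulls(df)
print(null_pct)  # {'city': 50.0, 'color': 25.0}
```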
    
  • remove_nulls(dataframe, cut_off, columns): This method removes the columns of a dataframe whose percentage of nulls is higher than a given cut_off percentage.

    -Inputs:

      - dataframe: a pandas dataframe object.
      - cut_off: The maximum percentage of nulls a column may have and still be kept. If a column has a percentage of nulls higher 
        than the cut_off percentage, it will be removed.
      - columns: the columns of the dataframe to include in the operation. If not specified, all 
        columns are taken into account.
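
The cut-off rule above could be sketched as follows. This is a standalone, hypothetical version, not the notebook's actual code. It assumes cut_off is on a 0-100 scale and that a new dataframe is returned instead of modifying the input in place.

```python
import pandas as pd


def remove_nulls(dataframe, cut_off, columns=None):
    """Drop the selected columns whose null percentage exceeds cut_off (sketch)."""
    # Assumption: when no columns are given, every column is checked.
    if columns is None:
        columns = dataframe.columns
    null_pct = dataframe[list(columns)].isnull().mean() * 100
    # Only columns strictly above the cut-off are removed.
    to_drop = null_pct[null_pct > cut_off].index
    return dataframe.drop(columns=to_drop)


df = pd.DataFrame({
    "mostly_null": [None, None, None, "a"],  # 75% nulls
    "complete": ["x", "y", "z", "x"],        # 0% nulls
})
cleaned = remove_nulls(df, cut_off=50)
print(list(cleaned.columns))  # ['complete']
```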
    
  • fill_nulls(dataframe, label, columns): This method fills the null values in the columns of a dataframe with a desired label.

    -Inputs:

      - dataframe: a pandas dataframe object.
      - label: The text that will be used to replace nulls.
      - columns: the columns of the dataframe to include in the operation. If not specified, all 
        columns are taken into account.
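
A minimal sketch of this behaviour, again as a hypothetical standalone function and not the notebook's actual code. It assumes the input is left untouched and a filled copy is returned.

```python
import pandas as pd


def fill_nulls(dataframe, label, columns=None):
    """Replace nulls in the selected columns with label (hypothetical sketch)."""
    # Assumption: when no columns are given, every column is filled.
    if columns is None:
        columns = dataframe.columns
    result = dataframe.copy()
    for col in columns:
        result[col] = result[col].fillna(label)
    return result


df = pd.DataFrame({"color": ["red", None, "blue"]})
filled = fill_nulls(df, label="missing")
print(filled["color"].tolist())  # ['red', 'missing', 'blue']
```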
    
  • group_categories(dataframe, cut_off, label, columns): This method changes a category of a categorical variable to a desired label if the category's percentage of occurrence is less than a given cut_off percentage. This makes it possible to merge low-frequency categories into a single category.

    -Inputs:

      - dataframe: a pandas dataframe object.
      - cut_off: Categories with a percentage of occurrence less than the cut_off percentage will be relabeled.
      - label: The label for those categories that will be relabeled.
      - columns: the columns of the dataframe to include in the operation. If not specified, all 
        columns are taken into account.
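
The relabeling rule above can be sketched like this. The function is a hypothetical standalone version, not the notebook's actual code. It assumes cut_off is on a 0-100 scale and that frequencies are computed over non-null values only, which is the default of `value_counts`.

```python
import pandas as pd


def group_categories(dataframe, cut_off, label, columns=None):
    """Relabel categories rarer than cut_off percent with label (sketch)."""
    # Assumption: when no columns are given, every column is processed.
    if columns is None:
        columns = dataframe.columns
    result = dataframe.copy()
    for col in columns:
        # Percentage of occurrence of each category among non-null values.
        freq = result[col].value_counts(normalize=True) * 100
        rare = list(freq[freq < cut_off].index)
        # Keep values that are not rare; replace the rare ones with label.
        result[col] = result[col].where(~result[col].isin(rare), label)
    return result


# red: 60%, blue: 30%, green: 10%
df = pd.DataFrame({"color": ["red"] * 6 + ["blue"] * 3 + ["green"]})
grouped = group_categories(df, cut_off=20, label="other")
print(grouped["color"].value_counts().to_dict())
```

With a cut-off of 20, only "green" falls below the threshold in this example, so it is the only category that becomes "other".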
    
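
For use in a pipeline of transformations, a scikit-learn version of group_categories might look like the sketch below. The class name `GroupCategories`, its parameters and their defaults are hypothetical and do not come from the notebook. The sketch assumes the rare categories are learned in `fit` and reapplied in `transform`, so that new data is relabeled consistently with the training data.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline


class GroupCategories(TransformerMixin, BaseEstimator):
    """Hypothetical transformer: relabel categories rarer than cut_off percent."""

    def __init__(self, cut_off=5, label="other", columns=None):
        self.cut_off = cut_off
        self.label = label
        self.columns = columns

    def fit(self, X, y=None):
        cols = list(X.columns) if self.columns is None else list(self.columns)
        # Remember, per column, which categories were rare in the training data.
        self.rare_ = {}
        for col in cols:
            freq = X[col].value_counts(normalize=True) * 100
            self.rare_[col] = list(freq[freq < self.cut_off].index)
        return self

    def transform(self, X):
        result = X.copy()
        for col, rare in self.rare_.items():
            result[col] = result[col].where(~result[col].isin(rare), self.label)
        return result


train = pd.DataFrame({"color": ["red"] * 6 + ["blue"] * 3 + ["green"]})
test = pd.DataFrame({"color": ["green", "red"]})

pipeline = Pipeline([("group", GroupCategories(cut_off=20))])
train_out = pipeline.fit_transform(train)
test_out = pipeline.transform(test)
print(test_out["color"].tolist())  # ['other', 'red']
```

In this sketch, categories that appear only at transform time are left unchanged; whether they should also be mapped to the label is a design choice the README does not settle.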
