Fairness Datasets

To make it easy to try algorithms and metrics on various datasets, Fairness.jl bundles the popular fairness datasets.

These datasets can be easily accessed using macros.

COMPAS Dataset

Fairness.@load_compas (Macro)

Macro to load the COMPAS dataset. It is a reduced version of the COMPAS dataset with 8 features and 6907 rows. The protected attributes are sex and race. The available features are used to predict whether a criminal defendant will recidivate (reoffend).

Returns (X, y)

julia> using Fairness
julia> X, y = @load_compas;

Adult Dataset

Fairness.@load_adult (Macro)

Macro to load the Adult dataset. It has 14 features and 32561 rows. The protected attributes are race and sex. This dataset is used to predict whether income exceeds 50K dollars per year.

Returns (X, y)

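For example:

julia> X, y = @load_adult;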

German Credit Dataset

Fairness.@load_german (Macro)

Load the full version of the German Credit dataset. This dataset has 20 features and 1000 rows. The protected attributes are gender_status and age (>25 is privileged). Using the 20 features, it classifies each applicant as a good or bad credit risk.

Returns (X, y)

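For example:

julia> X, y = @load_german;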

Bank Marketing Dataset

Fairness.@load_bank_marketing (Macro)

The data is related to direct marketing campaigns of a Portuguese banking institution. The campaigns were based on phone calls; often more than one contact with the same client was required to assess whether the product (a bank term deposit) would be subscribed ('yes') or not ('no'). It has 20 features and 41188 rows. The protected attribute is marital.

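For example:

julia> X, y = @load_bank_marketing;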

Communities and Crime Dataset

Fairness.@load_communities_crime (Macro)

The per capita violent crimes variable was calculated using population and the sum of crime variables considered violent crimes in the United States: murder, rape, robbery, and assault. It has 127 features and 1994 rows. The protected attributes are ....?

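For example:

julia> X, y = @load_communities_crime;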

Student Performance Dataset

Fairness.@load_student_performance (Macro)

Macro to load the Student Performance dataset. It has 395 rows and 30 features. The target corresponds to grade G1 and indicates whether the student obtained a grade >= 12. The protected attribute is sex.

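For example:

julia> X, y = @load_student_performance;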

Synthetic Datasets

Fairness.genZafarData (Function)

genZafarData(n = 10000; d = pi/4)

Generate synthetic data from Zafar et al., 2017 Fairness Constraints: Mechanisms for Fair Classification.

Arguments

  • n=10000 : number of samples
  • d=pi/4 : discrimination factor

Returns

  • X : DataFrame containing features and protected attribute z {"A", "B"} where z="B" is the protected group.
  • y : Binary Target variable {-1, 1}
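For example, using illustrative argument values:

julia> X, y = genZafarData(1000; d = pi/8);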
Fairness.genSubgroupData (Function)

genSubgroupData(n=10000, setting="B00")

Generate synthetic data from Loh et al., 2019 : Subgroup identification for precision medicine: A comparative review of 13 methods

Arguments

  • n=10000 : number of samples
  • setting="B00" : Simulation data setting: one of "B00", ..., "B02", "B1", ... , "B8"

For "B00", ..., "B02" there is no "bias" in the data, i.e. group membership has no effect on y. whereas for "B1", ... , "B8", there is a direct effect of group membership z on y, usually mediated by one or more features.

Returns

  • X : DataFrame containing features and protected attribute z
  • y : Binary target variable
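For example, using an illustrative setting with a direct group effect:

julia> X, y = genSubgroupData(5000, "B1");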
Fairness.genZafarData2 (Function)

Generate synthetic data from Zafar et al., 2017 Fairness Beyond Disparate Treatment & Disparate Impact

Arguments

  • n=10000 : number of samples

Returns

  • X : DataFrame containing features and protected attribute z
  • y : Binary Target variable
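For example:

julia> X, y = genZafarData2(1000);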
Fairness.genBiasedSampleData (Function)

genBiasedSampleData(n=10000, sampling_bias=0.8)

Generate synthetic data: Biased sample

Arguments

  • n=10000 : number of samples
  • sampling_bias=0.8 : Proportion of data belonging to the majority group.

The idea behind this simulation is that algorithms might fit the process in the majority group while disregarding the process in the minority group.

Two different processes are used for the two groups d1 and d2:

d1: logit(y) = 0.5(X1 + X2 + 0.3 X4) + 2 I(X3 > 0)
d2: logit(y) = 0.5(0.3 X1 + X2 + X4) + 2 I(X3 > 0.2)

Returns

  • X : DataFrame containing features and protected attribute z
  • y : Binary Target variable
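For example, drawing 90% of the sample from the majority group (values illustrative):

julia> X, y = genBiasedSampleData(10000, 0.9);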

Inspecting Datasets

To see the columns in a dataset, along with their types and scientific types, you can use schema from MLJ.

julia> using Fairness, MLJ
julia> X, y = @load_adult;
julia> schema(X)
┌────────────────┬──────────────────────────────────┬────────────────┐
│ _.names        │ _.types                          │ _.scitypes     │
├────────────────┼──────────────────────────────────┼────────────────┤
│ age            │ Float64                          │ Continuous     │
│ workclass      │ CategoricalValue{String, UInt32} │ Multiclass{9}  │
│ fnlwgt         │ Float64                          │ Continuous     │
│ education      │ CategoricalValue{String, UInt32} │ Multiclass{16} │
│ education_num  │ Float64                          │ Continuous     │
│ marital_status │ CategoricalValue{String, UInt32} │ Multiclass{7}  │
│ occupation     │ CategoricalValue{String, UInt32} │ Multiclass{15} │
│ relationship   │ CategoricalValue{String, UInt32} │ Multiclass{6}  │
│ race           │ CategoricalValue{String, UInt32} │ Multiclass{5}  │
│ sex            │ CategoricalValue{String, UInt32} │ Multiclass{2}  │
│ capital_gain   │ Float64                          │ Continuous     │
│ capital_loss   │ Float64                          │ Continuous     │
│ hours_per_week │ Float64                          │ Continuous     │
│ native_country │ CategoricalValue{String, UInt32} │ Multiclass{42} │
└────────────────┴──────────────────────────────────┴────────────────┘
_.nrows = 32561

Toy Data

This is a 10-row dataset that was used by the authors of the Reweighing algorithm. It is intended for testing ideas and evaluating metrics without computing predictions. Unlike the other macros, it returns (X, y, ŷ) instead of (X, y).

Fairness.@load_toydata (Macro)

Macro to read the CSV file of job data (data/jobs.csv) and convert its columns to categorical. Returns the tuple (X, y, ŷ).

Fairness.@load_toyfairtensor (Macro)

Macro to create a fairness tensor for data/jobs.csv. The fairness tensor is created on the basis of the column Job Type, which has 3 different values for job types.

julia> X, y, ŷ = @load_toydata;
julia> ft = @load_toyfairtensor
Fairness.FairTensor{3}([2 2; 0 0; 0 2] [0 0; 2 1; 1 0], ["Board", "Education", "Healthcare"])

Other Datasets

You can try working with the vast range of datasets available through OpenML. Refer to MLJ's OpenML documentation for the OpenML API. The id to be passed to OpenML.load can be found on the OpenML site.

julia> using MLJ, Fairness
julia> using DataFrames
julia> data = OpenML.load(1480); # load Indian Liver Patient Dataset
julia> df = DataFrame(data);
julia> y, X = unpack(df, ==(:Class), name->true); # Unpack the data into features and target
julia> y = coerce(y, Multiclass); # Specifies that the target y is of type Multiclass. It is otherwise a string.
julia> coerce!(X, :V2 => Multiclass, Count => Continuous); # Specifies which columns are Multiclass in nature. Converting from Count to Continuous enables use of more models.
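The resulting table can then be inspected with schema, as shown above for the built-in datasets:

julia> schema(X)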

Helper Functions

Fairness.genGaussian (Function)

genGaussian(mean_in, cov_in, class_label, n)

Draw samples from a Gaussian distribution.

Arguments

  • mean_in : means
  • cov_in : covariances
  • class_label : class label assigned to the generated samples
  • n : number of samples to draw
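A minimal sketch of a call, assuming a 2-dimensional Gaussian; the exact form of the returned value is not documented here:

julia> out = genGaussian([0.0, 1.0], [1.0 0.0; 0.0 1.0], 1, 100);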
Fairness.logit_fun (Function)

logit_fun(X, z, setting)

Compute y from X and z according to a setting provided in Loh et al., 2019: Subgroup identification for precision medicine: A comparative review of 13 methods

Arguments

  • X : matrix of features
  • z : vector of group assignments
  • setting : Simulation data setting: one of "B00", ..., "B02", "B1", ... , "B8"
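A minimal sketch of a call; the exact shapes and group encoding of X and z below are assumptions, not part of the documented API:

julia> X = randn(100, 5);          # matrix of features
julia> z = rand(["A", "B"], 100);  # vector of group assignments
julia> y = logit_fun(X, z, "B1");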