Fairness Datasets
To make it easy to try algorithms and metrics on various datasets, Fairness.jl provides the most popular fairness datasets. Each dataset can be loaded with a single macro.
COMPAS Dataset
Fairness.@load_compas
— Macro
Macro to load the COMPAS dataset. It is a reduced version of the COMPAS dataset with 8 features and 6907 rows. The protected attributes are sex and race. The available features are used to predict whether a criminal defendant will recidivate (reoffend).
Returns (X, y)
julia> using Fairness
julia> X, y = @load_compas;
Adult Dataset
Fairness.@load_adult
— Macro
Macro to load the Adult dataset. It has 14 features and 32561 rows. The protected attributes are race and sex. This dataset is used to predict whether annual income exceeds 50K dollars.
Returns (X, y)
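As with COMPAS, loading is a one-liner (a minimal sketch, assuming Fairness.jl is installed):

```julia
using Fairness

# Load the Adult dataset; X holds the 14 features, y the income labels.
# The protected attributes race and sex are ordinary columns of X.
X, y = @load_adult;
```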
German Credit Dataset
Fairness.@load_german
— Macro
Load the full version of the German credit dataset. This dataset has 20 features and 1000 rows. The protected attributes are gender_status and age (>25 is privileged). Using the 20 features, it classifies a person's credit risk as good or bad.
Returns (X, y)
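A usage sketch (assuming Fairness.jl is installed):

```julia
using Fairness

# Load the German credit dataset (20 features, 1000 rows);
# y is the good/bad credit-risk label
X, y = @load_german;
```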
Bank Marketing Dataset
Fairness.@load_bank_marketing
— Macro
The data is related to the direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls; often, more than one contact with the same client was required to assess whether the product (a bank term deposit) would be subscribed ('yes') or not ('no'). It has 20 features and 41188 rows. The protected attribute is marital.
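A usage sketch (assuming Fairness.jl is installed):

```julia
using Fairness

# Load the Bank Marketing dataset; `marital` is the protected attribute
X, y = @load_bank_marketing;
```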
Communities and Crime Dataset
Fairness.@load_communities_crime
— Macro
The per capita violent crimes variable was calculated using population and the sum of crime variables considered violent crimes in the United States: murder, rape, robbery, and assault. It has 127 features and 1994 rows. The protected attributes are ....?
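A usage sketch (assuming Fairness.jl is installed):

```julia
using Fairness

# Load the Communities and Crime dataset (127 features, 1994 rows)
X, y = @load_communities_crime;
```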
Student Performance Dataset
Fairness.@load_student_performance
— Macro
Student Performance Dataset. It has 395 rows and 30 features. The target attribute corresponds to grade G1 and indicates whether the student's grade is >= 12. The protected attribute is sex.
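A usage sketch (assuming Fairness.jl is installed):

```julia
using Fairness

# Load the Student Performance dataset; y indicates whether G1 >= 12
X, y = @load_student_performance;
```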
Synthetic Datasets
Fairness.genZafarData
— Function
genZafarData(n = 10000; d = pi/4)
Generate synthetic data from Zafar et al., 2017, Fairness Constraints: Mechanisms for Fair Classification.
Arguments
n=10000: number of samples
d=pi/4: discrimination factor
Returns
X: DataFrame containing the features and the protected attribute z ∈ {"A", "B"}, where z="B" is the protected group
y: binary target variable {-1, 1}
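A usage sketch; per the signature above, the discrimination factor d is a keyword argument:

```julia
using Fairness

# 1000 samples with a smaller discrimination factor than the default pi/4
X, y = genZafarData(1000; d = pi/8);
```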
Fairness.genSubgroupData
— Function
genSubgroupData(n=10000, setting="B00")
Generate synthetic data from Loh et al., 2019, Subgroup identification for precision medicine: A comparative review of 13 methods.
Arguments
n=10000: number of samples
setting="B00": simulation data setting, one of "B00", ..., "B02", "B1", ..., "B8"
For "B00", ..., "B02" there is no "bias" in the data, i.e. group membership has no effect on y, whereas for "B1", ..., "B8" there is a direct effect of group membership z on y, usually mediated by one or more features.
Returns
X: DataFrame containing the features and the protected attribute z
y: binary target variable
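A usage sketch; both arguments are positional with the defaults shown in the signature:

```julia
using Fairness

# Setting "B1" introduces a direct effect of group membership z on y
X, y = genSubgroupData(1000, "B1");
```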
Fairness.genZafarData2
— Function
Generate synthetic data from Zafar et al., 2017, Fairness Beyond Disparate Treatment & Disparate Impact.
Arguments
n=10000: number of samples
Returns
X: DataFrame containing the features and the protected attribute z
y: binary target variable
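A usage sketch; only the sample size n is documented as a parameter, so only it is passed here:

```julia
using Fairness

# Generate 1000 samples from the second Zafar et al. simulation
X, y = genZafarData2(1000);
```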
Fairness.genBiasedSampleData
— Function
genBiasedSampleData(n=10000, sampling_bias=0.8)
Generate synthetic data: Biased sample
Arguments
n=10000: number of samples
sampling_bias=0.8: percentage of data belonging to the majority group
The idea behind this simulation is that algorithms might fit the process in the majority group while disregarding the process in the minority group.
Two different processes generate the outcomes for groups d1 and d2:
d1: logit(y) = 0.5(X1 + X2 + 0.3X4) + 2·I(X3 > 0)
d2: logit(y) = 0.5(0.3X1 + X2 + X4) + 2·I(X3 > 0.2)
Returns
X: DataFrame containing the features and the protected attribute z
y: binary target variable
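A usage sketch; both arguments are positional with the defaults shown in the signature:

```julia
using Fairness

# 90% of the 1000 samples are drawn from the majority group
X, y = genBiasedSampleData(1000, 0.9);
```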
Inspecting Datasets
To see the columns in a dataset, along with their types and scientific types, you can use schema from MLJ.
julia> using Fairness, MLJ
julia> X, y = @load_adult;
julia> schema(X)
┌────────────────┬──────────────────────────────────┬────────────────┐
│ _.names        │ _.types                          │ _.scitypes     │
├────────────────┼──────────────────────────────────┼────────────────┤
│ age            │ Float64                          │ Continuous     │
│ workclass      │ CategoricalValue{String, UInt32} │ Multiclass{9}  │
│ fnlwgt         │ Float64                          │ Continuous     │
│ education      │ CategoricalValue{String, UInt32} │ Multiclass{16} │
│ education_num  │ Float64                          │ Continuous     │
│ marital_status │ CategoricalValue{String, UInt32} │ Multiclass{7}  │
│ occupation     │ CategoricalValue{String, UInt32} │ Multiclass{15} │
│ relationship   │ CategoricalValue{String, UInt32} │ Multiclass{6}  │
│ race           │ CategoricalValue{String, UInt32} │ Multiclass{5}  │
│ sex            │ CategoricalValue{String, UInt32} │ Multiclass{2}  │
│ capital_gain   │ Float64                          │ Continuous     │
│ capital_loss   │ Float64                          │ Continuous     │
│ hours_per_week │ Float64                          │ Continuous     │
│ native_country │ CategoricalValue{String, UInt32} │ Multiclass{42} │
└────────────────┴──────────────────────────────────┴────────────────┘
_.nrows = 32561
Toy Data
This is a 10-row dataset that was used by the authors of the Reweighing algorithm. It is intended for testing ideas and evaluating metrics without computing predictions. Unlike the other macros, it returns (X, y, ŷ) instead of (X, y).
Fairness.@load_toydata
— Macro
Macro to read the csv file of job data (data/jobs.csv) and convert its columns to categorical. Returns the tuple (X, y, ŷ).
Fairness.@load_toyfairtensor
— Macro
Macro to create a fairness tensor for data/jobs.csv. The fairness tensor is created on the basis of the column Job Type, which has 3 different job-type values.
julia> X, y, ŷ = @load_toydata;
julia> ft = @load_toyfairtensor
Fairness.FairTensor{3}([2 2; 0 0; 0 2] [0 0; 2 1; 1 0], ["Board", "Education", "Healthcare"])
Other Datasets
You can also work with the vast range of datasets available through OpenML. Refer to MLJ's OpenML documentation for the OpenML API. The id to pass to OpenML.load can be found on the OpenML site.
julia> using MLJ, Fairness
julia> using DataFrames
julia> data = OpenML.load(1480); # load Indian Liver Patient Dataset
julia> df = DataFrame(data);
julia> y, X = unpack(df, ==(:Class), name->true); # Unpack the data into features and target
julia> y = coerce(y, Multiclass); # Specifies that the target y is of type Multiclass. It is otherwise a string.
julia> coerce!(X, :V2 => Multiclass, Count => Continuous); # Specifying which columns are Multiclass in nature. Converting from Count to Continuous enables use of more models.
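After these coercions, schema can be used again to confirm the scientific types are as intended (a sketch continuing the session above):

```julia
using MLJ

# Inspect the coerced columns: :V2 should now be Multiclass and the
# former Count columns Continuous
schema(X)
```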
Helper Functions
Fairness.ensure_download
— Function
Checks whether the dataset is already present in the data directory and downloads it if not.
Fairness.genGaussian
— Function
genGaussian(mean_in, cov_in, class_label, n)
Draw n samples from a Gaussian distribution.
Arguments
mean_in: means
cov_in: covariances
class_label: class label
n: number of samples to draw
Fairness.logit_fun
— Function
logit_fun(X, z, setting)
Compute y from X and z according to a setting provided in Loh et al., 2019: Subgroup identification for precision medicine: A comparative review of 13 methods
Arguments
X: matrix of features
z: vector of group assignments
setting: simulation data setting, one of "B00", ..., "B02", "B1", ..., "B8"