Fairness Datasets
To make it easy to try algorithms and metrics on various datasets, Fairness.jl provides the most popular fairness datasets. Each dataset can be loaded with a single macro.
COMPAS Dataset
Fairness.@load_compas
— Macro
Macro to load the COMPAS dataset. It is a reduced version of the COMPAS dataset with 8 features and 6907 rows. The protected attributes are sex and race. The available features are used to predict whether a criminal defendant will recidivate (reoffend).
Returns (X, y)
julia> using Fairness
julia> X, y = @load_compas;
Adult Dataset
Fairness.@load_adult
— Macro
Macro to load the Adult dataset. It has 14 features and 32561 rows. The protected attributes are race and sex. This dataset is used to predict whether annual income exceeds 50K dollars.
Returns (X, y)
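As with COMPAS, loading is a one-liner (a minimal sketch, assuming Fairness.jl is installed):

```julia
using Fairness

# Load the Adult dataset; X holds the 14 features, y the income labels.
# The protected attributes race and sex are ordinary columns of X.
X, y = @load_adult;
```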
German Credit Dataset
Fairness.@load_german
— Macro
Load the full version of the German credit dataset. This dataset has 20 features and 1000 rows. The protected attributes are gender_status and age (>25 is privileged). Using the 20 features, it classifies a person's credit risk as good or bad.
Returns (X, y)
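A usage sketch (assuming Fairness.jl is installed):

```julia
using Fairness

# Load the German credit dataset (20 features, 1000 rows);
# y is the good/bad credit-risk label
X, y = @load_german;
```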
Bank Marketing Dataset
Fairness.@load_bank_marketing
— Macro
The data is related to the direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls; often, more than one contact with the same client was required to assess whether the product (a bank term deposit) would be subscribed ('yes') or not ('no'). It has 20 features and 41188 rows. The protected attribute is marital.
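A usage sketch (assuming Fairness.jl is installed):

```julia
using Fairness

# Load the Bank Marketing dataset; `marital` is the protected attribute
X, y = @load_bank_marketing;
```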
Communities and Crime Dataset
Fairness.@load_communities_crime
— Macro
The per capita violent crimes variable was calculated using population and the sum of crime variables considered violent crimes in the United States: murder, rape, robbery, and assault. It has 127 features and 1994 rows. The protected attributes are ....?
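A usage sketch (assuming Fairness.jl is installed):

```julia
using Fairness

# Load the Communities and Crime dataset (127 features, 1994 rows)
X, y = @load_communities_crime;
```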
Student Performance Dataset
Fairness.@load_student_performance
— Macro
Student Performance Dataset. It has 395 rows and 30 features. The target attribute corresponds to grade G1 and indicates whether the student's grade is >= 12. The protected attribute is sex.
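A usage sketch (assuming Fairness.jl is installed):

```julia
using Fairness

# Load the Student Performance dataset; y indicates whether G1 >= 12
X, y = @load_student_performance;
```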
Synthetic Datasets
Fairness.genZafarData
— Function
genZafarData(n = 10000; d = pi/4)
Generate synthetic data from Zafar et al., 2017, Fairness Constraints: Mechanisms for Fair Classification.
Arguments
n=10000: number of samples
d=pi/4: discrimination factor
Returns
X: DataFrame containing the features and the protected attribute z ∈ {"A", "B"}, where z="B" is the protected group
y: binary target variable {-1, 1}
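A usage sketch; per the signature above, the discrimination factor d is a keyword argument:

```julia
using Fairness

# 1000 samples with a smaller discrimination factor than the default pi/4
X, y = genZafarData(1000; d = pi/8);
```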
Fairness.genSubgroupData
— Function
genSubgroupData(n=10000, setting="B00")
Generate synthetic data from Loh et al., 2019, Subgroup identification for precision medicine: A comparative review of 13 methods.
Arguments
n=10000: number of samples
setting="B00": simulation data setting, one of "B00", ..., "B02", "B1", ..., "B8"
For "B00", ..., "B02" there is no "bias" in the data, i.e. group membership has no effect on y, whereas for "B1", ..., "B8" there is a direct effect of group membership z on y, usually mediated by one or more features.
Returns
X: DataFrame containing the features and the protected attribute z
y: binary target variable
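A usage sketch; both arguments are positional with the defaults shown in the signature:

```julia
using Fairness

# Setting "B1" introduces a direct effect of group membership z on y
X, y = genSubgroupData(1000, "B1");
```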
Fairness.genZafarData2
— Function
Generate synthetic data from Zafar et al., 2017, Fairness Beyond Disparate Treatment & Disparate Impact.
Arguments
n=10000: number of samples
Returns
X: DataFrame containing the features and the protected attribute z
y: binary target variable
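A usage sketch; only the sample size n is documented as a parameter, so only it is passed here:

```julia
using Fairness

# Generate 1000 samples from the second Zafar et al. simulation
X, y = genZafarData2(1000);
```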
Fairness.genBiasedSampleData
— Function
genBiasedSampleData(n=10000, sampling_bias=0.8)
Generate synthetic data: Biased sample
Arguments
n=10000: number of samples
sampling_bias=0.8: percentage of data belonging to the majority group
The idea behind this simulation is that algorithms might fit the process in the majority group while disregarding the process in the minority group.
Two different processes generate the outcomes for groups d1 and d2:
d1: logit(y) = 0.5(X1 + X2 + 0.3X4) + 2·I(X3 > 0)
d2: logit(y) = 0.5(0.3X1 + X2 + X4) + 2·I(X3 > 0.2)
Returns
X: DataFrame containing the features and the protected attribute z
y: binary target variable
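A usage sketch; both arguments are positional with the defaults shown in the signature:

```julia
using Fairness

# 90% of the 1000 samples are drawn from the majority group
X, y = genBiasedSampleData(1000, 0.9);
```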
Inspecting Datasets
To see the columns in a dataset, along with their types and scientific types, you can use schema from MLJ.
julia> using Fairness, MLJ
julia> X, y = @load_adult;
julia> schema(X)
┌────────────────┬──────────────────────────────────┬────────────────┐
│ _.names        │ _.types                          │ _.scitypes     │
├────────────────┼──────────────────────────────────┼────────────────┤
│ age            │ Float64                          │ Continuous     │
│ workclass      │ CategoricalValue{String, UInt32} │ Multiclass{9}  │
│ fnlwgt         │ Float64                          │ Continuous     │
│ education      │ CategoricalValue{String, UInt32} │ Multiclass{16} │
│ education_num  │ Float64                          │ Continuous     │
│ marital_status │ CategoricalValue{String, UInt32} │ Multiclass{7}  │
│ occupation     │ CategoricalValue{String, UInt32} │ Multiclass{15} │
│ relationship   │ CategoricalValue{String, UInt32} │ Multiclass{6}  │
│ race           │ CategoricalValue{String, UInt32} │ Multiclass{5}  │
│ sex            │ CategoricalValue{String, UInt32} │ Multiclass{2}  │
│ capital_gain   │ Float64                          │ Continuous     │
│ capital_loss   │ Float64                          │ Continuous     │
│ hours_per_week │ Float64                          │ Continuous     │
│ native_country │ CategoricalValue{String, UInt32} │ Multiclass{42} │
└────────────────┴──────────────────────────────────┴────────────────┘
_.nrows = 32561
Toy Data
This is a 10-row dataset that was used by the authors of the Reweighing algorithm. It is intended for testing ideas and evaluating metrics without computing predictions. Unlike the other macros, it returns (X, y, ŷ) instead of (X, y).
Fairness.@load_toydata
— Macro
Macro to read the csv file of job data (data/jobs.csv) and convert its columns to categorical. Returns the tuple (X, y, ŷ).
Fairness.@load_toyfairtensor
— Macro
Macro to create a fairness tensor for data/jobs.csv. The fairness tensor is created on the basis of the column Job Type, which has 3 different job-type values.
julia> X, y, ŷ = @load_toydata;
julia> ft = @load_toyfairtensor
Fairness.FairTensor{3}([2 2; 0 0; 0 2] [0 0; 2 1; 1 0], ["Board", "Education", "Healthcare"])
Other Datasets
You can also work with the vast range of datasets available through OpenML. Refer to MLJ's OpenML documentation for the OpenML API. The id to pass to OpenML.load can be found on the OpenML site.
julia> using MLJ, Fairness
julia> using DataFrames
julia> data = OpenML.load(1480); # load Indian Liver Patient Dataset
julia> df = DataFrame(data);
julia> y, X = unpack(df, ==(:Class), name->true); # Unpack the data into features and target
julia> y = coerce(y, Multiclass); # Specifies that the target y is of type Multiclass. It is otherwise a string.
julia> coerce!(X, :V2 => Multiclass, Count => Continuous); # Specifying which columns are Multiclass in nature. Converting from Count to Continuous enables use of more models.
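After these coercions, schema can be used again to confirm the scientific types are as intended (a sketch continuing the session above):

```julia
using MLJ

# Inspect the coerced columns: :V2 should now be Multiclass and the
# former Count columns Continuous
schema(X)
```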
Helper Functions
Fairness.ensure_download
— Function
Checks whether the dataset is already present in the data directory and downloads it if not.
Fairness.genGaussian
— Function
genGaussian(mean_in, cov_in, class_label, n)
Draw n samples from a Gaussian distribution.
Arguments
mean_in: means
cov_in: covariances
class_label: class label
n: number of samples to draw
Fairness.logit_fun
— Function
logit_fun(X, z, setting)
Compute y from X and z according to a setting provided in Loh et al., 2019: Subgroup identification for precision medicine: A comparative review of 13 methods
Arguments
X: matrix of features
z: vector of group assignments
setting: simulation data setting, one of "B00", ..., "B02", "B1", ..., "B8"