Dashboard¶
Creating Dashboard instance¶
Features Descriptions Dictionary¶
A Dashboard instance can be created with just the X, y and output_directory arguments, but it can be customized further. The most notable optional argument that can augment the HTML Dashboard is the feature_descriptions_dict dictionary-like object. The structure of the dict should be:
feature_descriptions_dict = {
    "Feature1": {
        "description": "description of feature1, e.g. height in cm",
        "category": "cat",  # or "num"
        "mapping": {
            "value1": "better name or explanation for value1",
            "value2": "better name or explanation for value2"
        }
    },
    "Feature2": {
        # (...)
    }
}
- "description" describes what the feature is about - what's the story behind its numbers and values;
- "category" defines what kind of data you want the feature to be treated as:
  - "num" stands for Numerical values;
  - "cat" is for Categorical variables;
- "mapping" explains what every value in your data means - e.g. 0 - "didn't buy a product", 1 - "bought a product".
This external information is fed to the Dashboard and will be included in the HTML files (where appropriate) for your convenience. Last but not least, by providing a specific "category" you are also forcing the Dashboard to interpret a given feature the way you want. For example, you could provide "num" for a binary variable (only 0s and 1s) and the Dashboard will treat that feature as Numerical (which means that, for example, Normal Transformations will be applied).
Note
Please keep in mind that every argument in feature_descriptions_dict is optional: you can provide entries for all Features or only one, only "category" for a few of the Features and "description" for others, etc.
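A minimal, concrete example of such a dict (the feature names and values here are hypothetical, used only to illustrate the structure described above):

```python
# Every key inside each entry is optional - an entry may carry only a
# "description", only a "category", or only a "mapping".
feature_descriptions_dict = {
    "Height": {
        "description": "height of a person in cm",
        "category": "num",
    },
    "bought": {
        "description": "whether the customer bought the product",
        "category": "cat",
        "mapping": {0: "didn't buy a product", 1: "bought a product"},
    },
    "Sex": {
        "mapping": {"F": "Female", "M": "Male"},  # a mapping alone is also fine
    },
}
```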
Other Dashboard instance arguments¶
already_transformed_columns can be a list of features that are already transformed and won't need additional transformations from the Dashboard:
Dashboard(X, y, output_directory,
already_transformed_columns=["Feature1", "pre-transformed Feature4"]
)
classification_pos_label forces the Dashboard to treat the provided label as the positive (1) label (classification problem type):
Dashboard(X, y, output_directory,
classification_pos_label="Not Survived"
)
force_classification_pos_label_multiclass is a bool flag useful when you also provide classification_pos_label in a multiclass problem type (essentially turning it into a classification problem) - without it, classification_pos_label will be ignored:
Dashboard(X, y, output_directory,
classification_pos_label="Iris-Setosa",
force_classification_pos_label_multiclass=True
)
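Conceptually, providing classification_pos_label in a multiclass problem collapses the target into a binary one. The sketch below is illustrative only (not the library's actual code) and shows the effect such a relabeling has:

```python
def binarize_target(y, pos_label):
    """Illustrative sketch: map pos_label to 1 and every other value to 0."""
    return [1 if value == pos_label else 0 for value in y]

y = ["Iris-Setosa", "Iris-Versicolor", "Iris-Setosa", "Iris-Virginica"]
print(binarize_target(y, "Iris-Setosa"))  # [1, 0, 1, 0]
```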
random_state can be provided for reproducibility of results:
Dashboard(X, y, output_directory,
random_state=13,
)
Example¶
dsh = Dashboard(
    X=your_X,
    y=your_y,
    output_directory="path/output",
    feature_descriptions_dict={"petal width (cm)": {"description": "width of petal (in cm)"}},
    already_transformed_columns=["sepal length (cm)"],
    classification_pos_label=1,
    force_classification_pos_label_multiclass=True,
    random_state=10
)
Creating HTML Dashboard¶
To create the HTML Dashboard from a Dashboard instance, you need to call the create_dashboard method:
dsh.create_dashboard()
You can customize the process further by providing appropriate arguments to the method (see below).
Models¶
models stands for the collection of sklearn Models that will be fit on the provided data. They can be provided in different ways:
- list of Model instances;
- dict of Model class: param_grid pairs to do a GridSearch on;
- None - in which case default Models will be used.
# list of Models
models = [DecisionTreeClassifier(), SVC(C=100.0), LogisticRegression()]
# dict for GridSearch
models = {
DecisionTreeClassifier: {"max_depth": [1, 5, 10], "criterion": ["gini", "entropy"]},
SVC: {"C": [10, 100, 1000]}
}
# None
models = None
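The param_grid dicts follow sklearn's grid-search conventions: every combination of the listed parameter values is evaluated. A stdlib-only sketch of how such a grid expands (equivalent in spirit to sklearn's ParameterGrid):

```python
from itertools import product

# The same grid as in the GridSearch example above
param_grid = {"max_depth": [1, 5, 10], "criterion": ["gini", "entropy"]}

# Cartesian product of all parameter values -> one dict per candidate model
keys = list(param_grid)
combinations = [dict(zip(keys, values)) for values in product(*param_grid.values())]
print(len(combinations))  # 3 * 2 = 6 candidates to fit and score
```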
Scoring¶
scoring should be an sklearn scoring function appropriate for a given problem type (e.g. roc_auc_score for classification). It can also be None, in which case the default scoring for a given problem will be used:
scoring = precision_score
Note
Some scoring functions might not work for some problem types (e.g. roc_auc_score for multiclass).
Mode¶
mode should be provided as either the "quick" or "detailed" string literal. The argument matters only when models=None.
- if "quick", then the initial search is done only on default instances of Models (for example SVC(), LogisticRegression(), etc.), which are simply scored with the scoring function. The top scoring Models are then GridSearched;
- if "detailed", then all available combinations of the default Models are GridSearched.
Logging¶
logging is a bool flag indicating if you want .csv files (search logs) included in the logs subdirectory of your output directory.
Disabling Pairplots¶
Both the seaborn PairPlot in the Overview subpage and the ScatterPlot Grid in the Features subpage were identified as the biggest time/resource bottlenecks in creating the HTML Dashboard. If you feel like speeding up the process, set disable_pairplots=True.
Note
Pairplots are disabled by default when the number of features in your data crosses certain threshold. See also Forcing Pairplots.
Forcing Pairplots¶
When the number of features in X and y crosses a certain threshold, creation of both the seaborn PairPlot and the ScatterPlot Grid is disabled. This was a conscious decision: not only does it extremely slow down the process (and might even lead to raised Exceptions or running out of memory), but the PairPlots also become so enormous that the insight gained from them is minuscule.
If you know what you're doing, set force_pairplot=True.
Note
If disable_pairplots=True
and force_pairplot=True
are both provided, disable_pairplots
takes precedence and pairplots will be disabled.
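The interaction between the feature-count threshold and the two flags can be summarized as follows (an illustrative sketch, not the library's code - the argument names mirror the documented flags, the limit value is a stand-in):

```python
def pairplots_enabled(n_features, feature_limit, disable_pairplots, force_pairplot):
    """Conceptual precedence: disable always wins; force only overrides
    the feature-count limit."""
    if disable_pairplots:
        return False
    if n_features > feature_limit:
        return force_pairplot
    return True

print(pairplots_enabled(50, 15, disable_pairplots=True, force_pairplot=True))   # False
print(pairplots_enabled(50, 15, disable_pairplots=False, force_pairplot=True))  # True
print(pairplots_enabled(50, 15, disable_pairplots=False, force_pairplot=False)) # False
```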
Example¶
dsh.create_dashboard(
    models=None,
    scoring=sklearn.metrics.precision_score,
    mode="detailed",
    logging=True,
    disable_pairplots=False,
    force_pairplot=True
)
Setting Custom Preprocessors in Dashboard¶
set_custom_transformers is a method to provide your own Transformers to the Dashboard pipeline. Dashboard preprocessing is simple, so you are free to change it to your liking. There are 3 arguments (all optional):
categorical_transformers
numerical_transformers
y_transformer
Both categorical_transformers and numerical_transformers should be list-like objects of instantiated Transformers. As the names suggest, categorical_transformers will be used to transform Categorical features, whereas numerical_transformers will transform Numerical features. y_transformer should be a single Transformer.
dsh.set_custom_transformers(
categorical_transformers=[SimpleImputer(strategy="most_frequent")],
numerical_transformers=[StandardScaler()],
y_transformer=LabelEncoder()
)
Note
Keep in mind that in regression problems, Dashboard already wraps the target in a TransformedTargetRegressor object (with QuantileTransformer as the transformer). See also the sklearn documentation.
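For reference, this is what the wrapping mentioned in the Note looks like in plain sklearn (the data and the choice of LinearRegression here are stand-ins; only the TransformedTargetRegressor/QuantileTransformer pairing reflects what the docs describe):

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import QuantileTransformer

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = np.exp(X @ np.array([1.0, 0.5, -0.5]) + rng.normal(scale=0.1, size=50))

# The regressor is fit on a quantile-transformed y; predictions are
# automatically inverse-transformed back to the original target scale.
model = TransformedTargetRegressor(
    regressor=LinearRegression(),
    transformer=QuantileTransformer(n_quantiles=25, output_distribution="normal"),
)
model.fit(X, y)
preds = model.predict(X)
print(preds.shape)  # (50,)
```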
Using Dashboard as sklearn pipeline¶
Dashboard can also be used as a simpler version of an sklearn.pipeline - methods such as transform, predict, etc. are exposed and available. Please refer to the Documentation for more information.
Documentation¶
- class data_dashboard.dashboard.Dashboard(X, y, output_directory, feature_descriptions_dict=None, already_transformed_columns=None, classification_pos_label=None, force_classification_pos_label_multiclass=False, random_state=None)¶
Data Dashboard with Visualizations of original data, transformations and Machine Learning Models performance.
Dashboard analyzes provided data (summary statistics, correlations), transforms it and feeds it to Machine Learning algorithms to search for the best scoring Model. All steps are written down into the HTML output for end-user experience.
HTML output created is a set of ‘static’ HTML pages saved into provided output directory - there are no server - client interactions. Visualizations are still interactive, though, through the use of the Bokeh library.
Note
As files are static, data might be embedded in HTML files. Please be aware when sharing produced HTML Dashboard.
Dashboard object can also be used as a pipeline for transforming/fitting/predicting by using exposed methods loosely following sklearn API.
- output_directory¶
directory where HTML Dashboard will be placed
- Type
str
- already_transformed_columns¶
list of feature names that are already pre-transformed
- Type
list
- random_state¶
integer for reproducibility on fitting and transformations, defaults to None if not provided during __init__
- Type
int, None
- features_descriptions¶
FeatureDescriptor containing external information on features
- Type
FeatureDescriptor
- features¶
Features object with basic features information
- Type
Features
- analyzer¶
Analyzer object analyzing and performing calculations on features
- Type
Analyzer
- transformer¶
Transformer object responsible for transforming the data for ML algorithms, fit to all data
- Type
Transformer
- transformer_eval¶
Transformer object fit on train data only
- Type
Transformer
- model_finder¶
ModelFinder object responsible for searching for Models and assessing their performance
- Type
ModelFinder
- X¶
data to be analyzed
- Type
pandas.DataFrame, numpy.ndarray, scipy.csr_matrix
- y¶
target variable
- Type
pandas.Series, numpy.ndarray
- X_train¶
train split of X
- Type
pandas.DataFrame
- X_test¶
test split of X
- Type
pandas.DataFrame
- y_train¶
train split of y
- Type
pandas.Series
- y_test¶
test split of y
- Type
pandas.Series
- transformed_X¶
all X data transformed with transformer
- Type
numpy.ndarray, scipy.csr_matrix
- transformed_y¶
all y data transformed with transformer
- Type
numpy.ndarray
- transformed_X_train¶
X train split transformed with transformer_eval
- Type
numpy.ndarray, scipy.csr_matrix
- transformed_X_test¶
X test split transformed with transformer_eval
- Type
numpy.ndarray, scipy.csr_matrix
- transformed_y_train¶
y train split transformed with transformer_eval
- Type
numpy.ndarray
- transformed_y_test¶
y test split transformed with transformer_eval
- Type
numpy.ndarray
- __init__(X, y, output_directory, feature_descriptions_dict=None, already_transformed_columns=None, classification_pos_label=None, force_classification_pos_label_multiclass=False, random_state=None)
Create Dashboard object.
Provided X and y are checked and converted to pandas object for easier analysis and eventually split and transformed. classification_pos_label is checked if the label is present in y target variable. X is assessed for the number of features and appropriate flags are set. All necessary objects are created. Transformer and transformer_eval are fit to all and train data, appropriately.
- X
data to be analyzed
- Type
pandas.DataFrame, numpy.ndarray, scipy.csr_matrix
- y
target variable
- Type
pandas.Series, numpy.ndarray
- output_directory
directory where HTML Dashboard will be placed
- Type
str
- feature_descriptions_dict
dictionary of metadata on features in X and y, defaults to None
- Type
dict, None
- already_transformed_columns
list of feature names that are already pre-transformed
- Type
list
- classification_pos_label
value in target that will be used as positive label
- Type
Any
- force_classification_pos_label_multiclass
flag indicating if provided classification_pos_label in multiclass problem should be forced, de facto changing the problem to classification
- Type
bool
- random_state
integer for reproducibility on fitting and transformations, defaults to None
- Type
int, None
- create_dashboard(models=None, scoring=None, mode='quick', logging=True, disable_pairplots=False, force_pairplot=False)¶
Create several Views (Subpages) and join them together to form an interactive WebPage/Dashboard.
- Models can be:
list of initialized models
dict of ‘Model Class’: param_grid of a given model to do the GridSearch on
None - default Models collection will be used
scoring should be a sklearn scoring function. If None is provided, default scoring function will be used.
- mode can be:
  - “quick”: search is initially done on all models but with no parameter tuning, after which top Models are chosen and GridSearched with their param_grids
  - “detailed”: GridSearch is done on all default models and their params
Provided mode doesn’t matter when models are explicitly provided (not None).
Note
Some functions might not work as of now: e.g. roc_auc_score for multiclass problem as it requires probabilities for every class in comparison to regular predictions expected from other scoring functions.
Depending on logging flag, .csv logs might be created or not in the output directory.
force_pairplot flag forces the dashboard to create Pairplot and ScatterPlot Grid when it was assessed in the beginning not to plot it (as number of features in the data exceeded the limit).
disable_pairplots flag disables creation of Pairplot and ScatterPlot Grid in the Dashboard - it takes precedence over force_pairplot flag.
HTML output is created in output_directory attribute file path and opened in a web browser window.
- Parameters
models (list, dict, optional) – list of Models or ‘Model class’: param_grid dict pairs, defaults to None
scoring (func, optional) – sklearn scoring function, defaults to None
mode ("quick", "detailed", optional) – either “quick” or “detailed” string, defaults to “quick”
logging (bool, optional) – flag indicating if .csv logs should be created, defaults to True
disable_pairplots (bool, optional) – flag indicating if Pairplot and ScatterPlot Grid in the Dashboard should be created or not, defaults to False
force_pairplot (bool, optional) – flag indicating if PairPlot and ScatterPlot Grid in the Dashboard should be created when number of features in the data crossed the internal limit, defaults to False
- search_and_fit(models=None, scoring=None, mode='quick')¶
Search for the best scoring Model, fit it with all data and return it.
- Models can be:
  - list of initialized models
  - dict of ‘Model Class’: param_grid of a given model to do the GridSearch on
  - None - default Models collection will be used
scoring should be a sklearn scoring function. If None is provided, default scoring function will be used.
- mode can be:
  - “quick”: search is initially done on all models but with no parameter tuning, after which top Models are chosen and GridSearched with their param_grids
  - “detailed”: GridSearch is done on all default models and their params
Provided mode doesn’t matter when models are explicitly provided (not None).
Note
Some functions might not work as of now: e.g. roc_auc_score for multiclass problem as it requires probabilities for every class in comparison to regular predictions expected from other scoring functions.
- Parameters
models (list, dict, optional) – list of Models or ‘Model class’: param_grid dict pairs, defaults to None
scoring (func, optional) – sklearn scoring function, defaults to None
mode ("quick", "detailed", optional) – either “quick” or “detailed” string, defaults to “quick”
- Returns
best scoring Model already fit to X and y data
- Return type
sklearn.Model
- set_and_fit(model)¶
Set provided Model as a best scoring Model and fit it to all X and y data.
- Parameters
model (sklearn.Model) – instance of ML Model
- transform(X)¶
Transform provided X data with Transformer.
- Returns
transformed X
- Return type
numpy.ndarray, scipy.csr_matrix
- predict(transformed_X)¶
Predict target from provided X with the best scoring Model.
- Parameters
transformed_X (pandas.DataFrame, numpy.ndarray, scipy.csr_matrix) – transformed X feature space to predict target variable from
- Returns
predicted y target variable
- Return type
numpy.ndarray
- best_model()¶
Return best (chosen) Model used in predictions.
- Returns
best scoring Model
- Return type
sklearn.Model
- set_custom_transformers(categorical_transformers=None, numerical_transformers=None, y_transformer=None)¶
Set custom Transformers to be used in the problem pipeline.
Provided arguments should be lists of Transformers to be used with a given type of features. You can provide only one type of transformers if you wish (all arguments are optional).
Transformers are updated in both transformer and transformer_eval instances. ModelFinder and Output instances are recreated with the new transformed data and new Transformers.
- Parameters
categorical_transformers (list) – list of Transformers to be used on categorical features
numerical_transformers (list) – list of Transformers to be used on numerical features
y_transformer (sklearn.Transformer) – singular Transformer to be used on target variable