Dashboard¶
Creating Dashboard instance¶
Features Descriptions Dictionary¶
A Dashboard instance can be created with just the X, y and output_directory arguments, but it can be customized further. The most notable optional argument that can augment the HTML Dashboard is the feature_descriptions_dict dictionary-like object. The structure of the dict should be:
feature_descriptions_dict = {
    "Feature1": {
        "description": "description of feature1, e.g. height in cm",
        "category": "cat",  # or "num"
        "mapping": {
            "value1": "better name or explanation for value1",
            "value2": "better name or explanation for value2"
        }
    },
    "Feature2": {
        # (...)
    }
}
- "description" describes what the feature is about - what's the story behind its numbers and values;
- "category" defines what kind of data you want the feature to be treated as:
  - "num" stands for Numerical values;
  - "cat" is for Categorical variables;
- "mapping" explains what every value in your data means - e.g. 0 - "didn't buy a product", 1 - "bought a product".
This external information is fed to the Dashboard and will be included in the HTML files (where appropriate) for your convenience. Last but not least, by providing a specific "category" you are also forcing the Dashboard to interpret a given feature the way you want. For example, you could provide "num" for a binary variable (only 0s and 1s) and the Dashboard will treat that feature as Numerical (which means that, for example, Normal Transformations will be applied).
Note
Please keep in mind that every argument in feature_descriptions_dict is optional: you can provide entries for all Features or only one, only "category" for a few of the Features and "description" for others, etc.
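A minimal, concrete example of such a dict (the feature names and values here are hypothetical, used only to illustrate the structure described above):

```python
# Every key inside each entry is optional - an entry may carry only a
# "description", only a "category", or only a "mapping".
feature_descriptions_dict = {
    "Height": {
        "description": "height of a person in cm",
        "category": "num",
    },
    "bought": {
        "description": "whether the customer bought the product",
        "category": "cat",
        "mapping": {0: "didn't buy a product", 1: "bought a product"},
    },
    "Sex": {
        "mapping": {"F": "Female", "M": "Male"},  # a mapping alone is also fine
    },
}
```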
Other Dashboard instance arguments¶
already_transformed_columns can be a list of features that are already transformed and won't need additional transformations from the Dashboard:
Dashboard(X, y, output_directory,
already_transformed_columns=["Feature1", "pre-transformed Feature4"]
)
classification_pos_label forces the Dashboard to treat the provided label as the positive (1) label (classification problem type):
Dashboard(X, y, output_directory,
classification_pos_label="Not Survived"
)
force_classification_pos_label_multiclass is a bool flag useful when you also provide classification_pos_label in a multiclass problem type (essentially turning it into a classification problem) - without it, classification_pos_label will be ignored:
Dashboard(X, y, output_directory,
classification_pos_label="Iris-Setosa",
force_classification_pos_label_multiclass=True
)
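Conceptually, providing classification_pos_label in a multiclass problem collapses the target into a binary one. The sketch below is illustrative only (not the library's actual code) and shows the effect such a relabeling has:

```python
def binarize_target(y, pos_label):
    """Illustrative sketch: map pos_label to 1 and every other value to 0."""
    return [1 if value == pos_label else 0 for value in y]

y = ["Iris-Setosa", "Iris-Versicolor", "Iris-Setosa", "Iris-Virginica"]
print(binarize_target(y, "Iris-Setosa"))  # [1, 0, 1, 0]
```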
random_state can be provided for reproducibility of results:
Dashboard(X, y, output_directory,
random_state=13,
)
Example¶
dsh = Dashboard(
    X=your_X,
    y=your_y,
    output_directory="path/output",
    feature_descriptions_dict={"petal width (cm)": {"description": "width of petal (in cm)"}},
    already_transformed_columns=["sepal length (cm)"],
    classification_pos_label=1,
    force_classification_pos_label_multiclass=True,
    random_state=10
)
Creating HTML Dashboard¶
To create the HTML Dashboard from a Dashboard instance, you need to call the create_dashboard method:
dsh.create_dashboard()
You can customize the process further by providing appropriate arguments to the method (see below).
Models¶
models stands for the collection of sklearn Models that will be fit on the provided data. They can be provided in different ways:
- list of Model instances;
- dict of Model class: param_grid pairs to do a GridSearch on;
- None - in which case default Models will be used.
# list of Models
models = [DecisionTreeClassifier(), SVC(C=100.0), LogisticRegression()]
# dict for GridSearch
models = {
DecisionTreeClassifier: {"max_depth": [1, 5, 10], "criterion": ["gini", "entropy"]},
SVC: {"C": [10, 100, 1000]}
}
# None
models = None
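The param_grid dicts follow sklearn's grid-search conventions: every combination of the listed parameter values is evaluated. A stdlib-only sketch of how such a grid expands (equivalent in spirit to sklearn's ParameterGrid):

```python
from itertools import product

# The same grid as in the GridSearch example above
param_grid = {"max_depth": [1, 5, 10], "criterion": ["gini", "entropy"]}

# Cartesian product of all parameter values -> one dict per candidate model
keys = list(param_grid)
combinations = [dict(zip(keys, values)) for values in product(*param_grid.values())]
print(len(combinations))  # 3 * 2 = 6 candidates to fit and score
```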
Scoring¶
scoring should be an sklearn scoring function appropriate for a given problem type (e.g. roc_auc_score for classification). It can also be None, in which case the default scoring for a given problem will be used:
scoring = precision_score
Note
Some scoring functions might not work for some problem types (e.g. roc_auc_score for multiclass).
Mode¶
mode should be provided as either the "quick" or "detailed" string literal. The argument matters only when models=None.
- if "quick", then the initial search is done only on default instances of Models (for example SVC(), LogisticRegression(), etc.), which are simply scored with the scoring function. The top scoring Models are then GridSearched;
- if "detailed", then all available combinations of the default Models are GridSearched.
Logging¶
logging is a bool flag indicating if you want .csv files (search logs) included in the logs subdirectory of your output directory.
Disabling Pairplots¶
Both the seaborn PairPlot in the Overview subpage and the ScatterPlot Grid in the Features subpage were identified as the biggest time/resource bottlenecks in creating the HTML Dashboard. If you feel like speeding up the process, set disable_pairplots=True.
Note
Pairplots are disabled by default when the number of features in your data crosses certain threshold. See also Forcing Pairplots.
Forcing Pairplots¶
When the number of features in X and y crosses a certain threshold, creation of both the seaborn PairPlot and the ScatterPlot Grid is disabled. This was a conscious decision: not only does it extremely slow down the process (and might even lead to raised Exceptions or running out of memory), but the PairPlots also become so enormous that the insight gained from them is minuscule.
If you know what you're doing, set force_pairplot=True.
Note
If disable_pairplots=True
and force_pairplot=True
are both provided, disable_pairplots
takes precedence and pairplots will be disabled.
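The interaction between the feature-count threshold and the two flags can be summarized as follows (an illustrative sketch, not the library's code - the argument names mirror the documented flags, the limit value is a stand-in):

```python
def pairplots_enabled(n_features, feature_limit, disable_pairplots, force_pairplot):
    """Conceptual precedence: disable always wins; force only overrides
    the feature-count limit."""
    if disable_pairplots:
        return False
    if n_features > feature_limit:
        return force_pairplot
    return True

print(pairplots_enabled(50, 15, disable_pairplots=True, force_pairplot=True))   # False
print(pairplots_enabled(50, 15, disable_pairplots=False, force_pairplot=True))  # True
print(pairplots_enabled(50, 15, disable_pairplots=False, force_pairplot=False)) # False
```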
Example¶
dsh.create_dashboard(
    models=None,
    scoring=sklearn.metrics.precision_score,
    mode="detailed",
    logging=True,
    disable_pairplots=False,
    force_pairplot=True
)
Setting Custom Preprocessors in Dashboard¶
set_custom_transformers is a method to provide your own Transformers to the Dashboard pipeline. Dashboard preprocessing is simple, so you are free to change it to your liking. There are 3 arguments (all optional):
categorical_transformers
numerical_transformers
y_transformer
Both categorical_transformers and numerical_transformers should be list-like objects of instantiated Transformers. As the names suggest, categorical_transformers will be used to transform Categorical features, whereas numerical_transformers will transform Numerical features. y_transformer should be a single Transformer.
dsh.set_custom_transformers(
categorical_transformers=[SimpleImputer(strategy="most_frequent")],
numerical_transformers=[StandardScaler()],
y_transformer=LabelEncoder()
)
Note
Keep in mind that in regression problems, Dashboard already wraps the target in a TransformedTargetRegressor object (with QuantileTransformer as the transformer). See also the sklearn documentation.
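For reference, this is what the wrapping mentioned in the Note looks like in plain sklearn (the data and the choice of LinearRegression here are stand-ins; only the TransformedTargetRegressor/QuantileTransformer pairing reflects what the docs describe):

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import QuantileTransformer

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = np.exp(X @ np.array([1.0, 0.5, -0.5]) + rng.normal(scale=0.1, size=50))

# The regressor is fit on a quantile-transformed y; predictions are
# automatically inverse-transformed back to the original target scale.
model = TransformedTargetRegressor(
    regressor=LinearRegression(),
    transformer=QuantileTransformer(n_quantiles=25, output_distribution="normal"),
)
model.fit(X, y)
preds = model.predict(X)
print(preds.shape)  # (50,)
```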
Using Dashboard as sklearn pipeline¶
Dashboard can also be used as a simpler version of an sklearn.pipeline - methods such as transform, predict, etc. are exposed and available. Please refer to the Documentation for more information.
Documentation¶
- class data_dashboard.dashboard.Dashboard(X, y, output_directory, feature_descriptions_dict=None, already_transformed_columns=None, classification_pos_label=None, force_classification_pos_label_multiclass=False, random_state=None)¶
Data Dashboard with Visualizations of original data, transformations and Machine Learning Models performance.
Dashboard analyzes provided data (summary statistics, correlations), transforms it and feeds it to Machine Learning algorithms to search for the best scoring Model. All steps are written down into the HTML output for end-user experience.
HTML output created is a set of ‘static’ HTML pages saved into provided output directory - there are no server - client interactions. Visualizations are still interactive, though, through the use of the Bokeh library.
Note
As files are static, data might be embedded in HTML files. Please be aware when sharing produced HTML Dashboard.
Dashboard object can also be used as a pipeline for transforming/fitting/predicting by using exposed methods loosely following sklearn API.
- output_directory¶
directory where HTML Dashboard will be placed
- Type
str
- already_transformed_columns¶
list of feature names that are already pre-transformed
- Type
list
- random_state¶
integer for reproducibility on fitting and transformations, defaults to None if not provided during __init__
- Type
int, None
- features_descriptions¶
FeatureDescriptor containing external information on features
- Type
FeatureDescriptor
- features¶
Features object with basic features information
- Type
Features
- analyzer¶
Analyzer object analyzing and performing calculations on features
- Type
Analyzer
- transformer¶
Transformer object responsible for transforming the data for ML algorithms, fit to all data
- Type
Transformer
- transformer_eval¶
Transformer object fit on train data only
- Type
Transformer
- model_finder¶
ModelFinder object responsible for searching for Models and assessing their performance
- Type
ModelFinder
- X¶
data to be analyzed
- Type
pandas.DataFrame, numpy.ndarray, scipy.csr_matrix
- y¶
target variable
- Type
pandas.Series, numpy.ndarray
- X_train¶
train split of X
- Type
pandas.DataFrame
- X_test¶
test split of X
- Type
pandas.DataFrame
- y_train¶
train split of y
- Type
pandas.Series
- y_test¶
test split of y
- Type
pandas.Series
- transformed_X¶
all X data transformed with transformer
- Type
numpy.ndarray, scipy.csr_matrix
- transformed_y¶
all y data transformed with transformer
- Type
numpy.ndarray
- transformed_X_train¶
X train split transformed with transformer_eval
- Type
numpy.ndarray, scipy.csr_matrix
- transformed_X_test¶
X test split transformed with transformer_eval
- Type
numpy.ndarray, scipy.csr_matrix
- transformed_y_train¶
y train split transformed with transformer_eval
- Type
numpy.ndarray
- transformed_y_test¶
y test split transformed with transformer_eval
- Type
numpy.ndarray
- __init__(X, y, output_directory, feature_descriptions_dict=None, already_transformed_columns=None, classification_pos_label=None, force_classification_pos_label_multiclass=False, random_state=None)
Create Dashboard object.
Provided X and y are checked and converted to pandas object for easier analysis and eventually split and transformed. classification_pos_label is checked if the label is present in y target variable. X is assessed for the number of features and appropriate flags are set. All necessary objects are created. Transformer and transformer_eval are fit to all and train data, appropriately.
- X
data to be analyzed
- Type
pandas.DataFrame, numpy.ndarray, scipy.csr_matrix
- y
target variable
- Type
pandas.Series, numpy.ndarray
- output_directory
directory where HTML Dashboard will be placed
- Type
str
- feature_descriptions_dict
dictionary of metadata on features in X and y, defaults to None
- Type
dict, None
- already_transformed_columns
list of feature names that are already pre-transformed
- Type
list
- classification_pos_label
value in target that will be used as positive label
- Type
Any
- force_classification_pos_label_multiclass
flag indicating if provided classification_pos_label in multiclass problem should be forced, de facto changing the problem to classification
- Type
bool
- random_state
integer for reproducibility on fitting and transformations, defaults to None
- Type
int, None
- create_dashboard(models=None, scoring=None, mode='quick', logging=True, disable_pairplots=False, force_pairplot=False)¶
Create several Views (Subpages) and join them together to form an interactive WebPage/Dashboard.
- Models can be:
list of initialized models
dict of ‘Model Class’: param_grid of a given model to do the GridSearch on
None - default Models collection will be used
scoring should be a sklearn scoring function. If None is provided, default scoring function will be used.
- mode can be:
  - “quick”: search is initially done on all models but with no parameter tuning, after which top Models are chosen and GridSearched with their param_grids
  - “detailed”: GridSearch is done on all default models and their params
Provided mode doesn’t matter when models are explicitly provided (not None).
Note
Some functions might not work as of now: e.g. roc_auc_score for multiclass problem as it requires probabilities for every class in comparison to regular predictions expected from other scoring functions.
Depending on logging flag, .csv logs might be created or not in the output directory.
force_pairplot flag forces the dashboard to create Pairplot and ScatterPlot Grid when it was assessed in the beginning not to plot it (as number of features in the data exceeded the limit).
disable_pairplots flag disables creation of Pairplot and ScatterPlot Grid in the Dashboard - it takes precedence over force_pairplot flag.
HTML output is created in output_directory attribute file path and opened in a web browser window.
- Parameters
models (list, dict, optional) – list of Models or ‘Model class’: param_grid dict pairs, defaults to None
scoring (func, optional) – sklearn scoring function, defaults to None
mode ("quick", "detailed", optional) – either “quick” or “detailed” string, defaults to “quick”
logging (bool, optional) – flag indicating if .csv logs should be created, defaults to True
disable_pairplots (bool, optional) – flag indicating if Pairplot and ScatterPlot Grid in the Dashboard should be created or not, defaults to False
force_pairplot (bool, optional) – flag indicating if PairPlot and ScatterPlot Grid in the Dashboard should be created when number of features in the data crossed the internal limit, defaults to False
- search_and_fit(models=None, scoring=None, mode='quick')¶
Search for the best scoring Model, fit it with all data and return it.
- Models can be:
  - list of initialized models
  - dict of ‘Model Class’: param_grid of a given model to do the GridSearch on
  - None - default Models collection will be used
scoring should be a sklearn scoring function. If None is provided, default scoring function will be used.
- mode can be:
  - “quick”: search is initially done on all models but with no parameter tuning, after which top Models are chosen and GridSearched with their param_grids
  - “detailed”: GridSearch is done on all default models and their params
Provided mode doesn’t matter when models are explicitly provided (not None).
Note
Some functions might not work as of now: e.g. roc_auc_score for multiclass problem as it requires probabilities for every class in comparison to regular predictions expected from other scoring functions.
- Parameters
models (list, dict, optional) – list of Models or ‘Model class’: param_grid dict pairs, defaults to None
scoring (func, optional) – sklearn scoring function, defaults to None
mode ("quick", "detailed", optional) – either “quick” or “detailed” string, defaults to “quick”
- Returns
best scoring Model already fit to X and y data
- Return type
sklearn.Model
- set_and_fit(model)¶
Set provided Model as a best scoring Model and fit it to all X and y data.
- Parameters
model (sklearn.Model) – instance of ML Model
- transform(X)¶
Transform provided X data with Transformer.
- Returns
transformed X
- Return type
numpy.ndarray, scipy.csr_matrix
- predict(transformed_X)¶
Predict target from provided X with the best scoring Model.
- Parameters
transformed_X (pandas.DataFrame, numpy.ndarray, scipy.csr_matrix) – transformed X feature space to predict target variable from
- Returns
predicted y target variable
- Return type
numpy.ndarray
- best_model()¶
Return best (chosen) Model used in predictions.
- Returns
best scoring Model
- Return type
sklearn.Model
- set_custom_transformers(categorical_transformers=None, numerical_transformers=None, y_transformer=None)¶
Set custom Transformers to be used in the problem pipeline.
Provided arguments should be lists of Transformers to be used with a given type of features. You can provide only one type of transformers if you wish (all arguments are optional).
Transformers are updated in both transformer and transformer_eval instances. ModelFinder and Output instances are recreated with the new transformed data and new Transformers.
- Parameters
categorical_transformers (list) – list of Transformers to be used on categorical features
numerical_transformers (list) – list of Transformers to be used on numerical features
y_transformer (sklearn.Transformer) – singular Transformer to be used on target variable