Training
Data selector
When training models, it is common to try out different subsets of
features or subpopulations. DataSelector lets you express such a
subsetting pipeline succinctly as a series of dictionaries, each
describing one transformation to apply to your data.
- class sklearn_evaluation.training.DataSelector(*steps)
Subset a pandas.DataFrame by passing a series of steps
- Parameters:
*steps – Steps to apply to the data sequentially (order matters). Each step must be a dictionary with a “kind” key whose value is one of “column_drop”, “row_drop” or “column_keep”. The remaining key-value pairs must match the signature of the corresponding Step object (ColumnDrop, RowDrop or ColumnKeep).
- transform(df, return_summary: bool = False)
Apply steps
- Parameters:
df – Data frame to transform
return_summary – If False, the function returns only the output data frame; if True, it also returns a summary table
ColumnDrop
- class sklearn_evaluation.training.selector.ColumnDrop(names: list = None, prefix: str = None, suffix: str = None, contains: str = None, max_na_prop: float = None)
Drop columns
- Parameters:
names – List of columns to drop
prefix – Drop columns with this prefix (or list of prefixes)
suffix – Drop columns with this suffix (or list of suffixes)
contains – Drop columns if they contain this substring
max_na_prop – Drop columns whose proportion of NAs (a value in [0, 1]) is larger than this
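To make the selection criteria concrete, here is a rough plain-pandas sketch of what they describe. It is not the library's implementation: the helper names columns_to_drop and _as_tuple are hypothetical, the toy data is invented, and the sketch assumes that a column is dropped when it matches any one of the criteria.

```python
import pandas as pd


def _as_tuple(value):
    """Accept a single string or a list of strings."""
    return (value,) if isinstance(value, str) else tuple(value)


def columns_to_drop(df, names=None, prefix=None, suffix=None,
                    contains=None, max_na_prop=None):
    """Hypothetical helper: list the columns matching any criterion."""
    selected = set(names or [])
    for col in df.columns:
        if prefix is not None and col.startswith(_as_tuple(prefix)):
            selected.add(col)
        if suffix is not None and col.endswith(_as_tuple(suffix)):
            selected.add(col)
        if contains is not None and contains in col:
            selected.add(col)
    if max_na_prop is not None:
        na_prop = df.isna().mean()  # proportion of NAs per column
        # Strictly larger than the threshold, as in the description
        selected.update(na_prop[na_prop > max_na_prop].index)
    # Preserve the original column order
    return [col for col in df.columns if col in selected]


# Toy data, invented for this example
df = pd.DataFrame({
    "tmp_a": [1, 2, 3, 4],
    "score_raw": [0.1, 0.2, 0.3, 0.4],
    "mostly_missing": [None, None, None, 1.0],
    "age": [30, 41, 25, 52],
})

to_drop = columns_to_drop(df, prefix="tmp_", suffix="_raw", max_na_prop=0.5)
result = df.drop(columns=to_drop)
```

Here mostly_missing has an NA proportion of 0.75, which exceeds the 0.5 threshold, so it is selected along with the prefix and suffix matches.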
RowDrop
- class sklearn_evaluation.training.selector.RowDrop(if_nas: bool = False, query: str = None)
Drop rows
- Parameters:
if_nas – If True, drops all rows that contain at least one NA
query – Drops all rows matching the query (passed to pandas.DataFrame.query)
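The two parameters can be read as the following plain-pandas operations. This is a sketch of the described behavior on invented data, not the library's code; note in particular that rows matching the query are removed, not kept.

```python
import pandas as pd

# Toy data, invented for this example
df = pd.DataFrame({
    "age": [30.0, None, 45.0, 17.0],
    "income": [10.0, 20.0, 30.0, 40.0],
})

# if_nas=True: remove every row that has at least one NA
without_nas = df.dropna(axis="index", how="any")

# query="age < 18": find the rows that match, then remove them
matching = df.query("age < 18").index
without_minors = df.drop(index=matching)
```

A row whose age is NA does not match "age < 18", so it survives the query-based drop unless if_nas is also enabled.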
ColumnKeep
- class sklearn_evaluation.training.selector.ColumnKeep(names: Optional[list] = None, dotted_path: Optional[str] = None)
Subset columns
- Parameters:
names – List of columns to keep
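In plain pandas, keeping a list of columns amounts to the selection below. This is a sketch on invented data and only covers the names parameter; whether ColumnKeep preserves the order given in names is an assumption of the sketch.

```python
import pandas as pd

# Toy data, invented for this example
df = pd.DataFrame({"id": [1, 2], "age": [30, 41], "income": [10.0, 20.0]})

# Hypothetical value for the `names` parameter
names = ["age", "income"]

# Keep only the listed columns; everything else is discarded
kept = df[names]
```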