Processing data

You can chain filter-df command after the all-data command or from-file command to apply various finds of processing to the data obtained through your query.

For example, the following invocation will give you the data corresponding to the best runs (using metric best_validation_fixed_f1 as higher the better) in sweeps 5pfpcetn and abcd12345.

$ wandb-utils \
-e username_or_team \
-p project_name \
all-data \
--filters "{\"sweep\":{\"\$in\":[\"5pfpcetn\", \"abcd12345\"]}}" \
filter-df -f sweep_name -f sweep -f run -f path -f test_fixed_f1 -f best_validation_fixed_f1 -i path \
--query "df.sort_values('best_validation_fixed_f1', ascending=False).drop_duplicates(['sweep_name'])" \

You can also do fairly complex things using –pd-eval that uses pandas.eval function. For instance following command performs 4 processing steps

1. –pd-eval “test_CMAP=rmax(df.test_MAP_max_n, df.test_MAP_min_n)” creates a new column named test_MAP by taking the max of two existing columns.

2. The second and third steps create two new columns _dataset and _model by extracting strings from the tags column.

3. The last step –pd-eval “df.groupby([‘_model’, ‘_dataset’], as_index=False).mean()” performs grouby followed by taking the mean of groups.

$ wandb-utils -e USERNAME -p PROJECT \
all-data \
filter-df --pd-eval "test_CMAP=rmax(df.test_MAP_max_n, df.test_MAP_min_n)" \
filter-df --pd-eval "_model=df.tags.str.extract(r'model@([^\|]+)',expand=False)" \
filter-df --pd-eval "_dataset=df.tags.str.extract(r'dataset@([^\|]+)',expand=False)" \
filter-df --pd-eval "df.groupby(['_model', '_dataset'], as_index=False).mean()" \
filter-df -f test_MAP -f test_CMAP -f test_constraint_violation -f _model -f _dataset \

There even more general and powerful ways, –python-exec and –python-eval, to process the dataframe using python’s native exec() and eval() functions, respectively, that allow executing arbitrary python code.