Using the command line scripts

This package comes with several command line scripts for use with the csv data dumps produced by Panoptes.

Download your data from the project builder

You will need two files from your project for offline use:

  • The classification dump: obtained with the Request new classification export or Request new workflow classification export button on the lab’s Data Export tab

  • The workflow dump: obtained with the Request new workflow export button on the lab’s Data Export tab

Example: Penguin Watch

Penguin Watch has several workflows; for this example we will look at workflow number 6465 (time lapse cameras), version 52.76. The downloaded files for this project are:

  • penguin-watch-workflows.csv: the workflow file (contains the major version number as a column)

  • penguin-watch-classifications-trim.csv: the classification file for workflow 6465

This zip file contains both of these files.
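
Before running anything it can be handy to confirm the workflow ID and version numbers directly from the workflow export. A quick sketch with pandas (the column names are assumptions based on typical Panoptes workflow exports; check the file’s header row if yours differ):

import pandas

# load the workflow export
workflows = pandas.read_csv('penguin-watch-workflows.csv')

# see which columns the export provides
print(workflows.columns.tolist())

# assumed column names; adjust if your export uses different ones
print(workflows[['workflow_id', 'version', 'minorversion']].drop_duplicates())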


Scripts

All scripts are packaged under a single command, panoptes_aggregation, which has three sub-commands: config, extract, and reduce.

usage: panoptes_aggregation [-h] {config,extract,reduce} ...

Aggregate panoptes data files

positional arguments:
  {config,extract,reduce}
    config              Make configuration files for panoptes data extraction
                        and reduction based on a workflow export
    extract             Extract data from panoptes classifications based on
                        the workflow
    reduce              reduce data from panoptes classifications based on the
                        extracted data

optional arguments:
  -h, --help            show this help message and exit

Configure the extractors and reducers

Use the command line tool to make the configuration yaml files that are used to set up the extractors and reducers. These base files use the default settings for each task type; adjust them if the defaults do not fit your needs (e.g. you don’t need to extract all the tasks, or you need more control over the reducer’s settings).

usage: panoptes_aggregation config [-h] [-d DIR] [-v VERSION]
                                   [--min_version MIN_VERSION]
                                   [--max_version MAX_VERSION] [-k KEYWORDS]
                                   [-vv]
                                   workflow_csv workflow_id

Make configuration files for panoptes data extraction and reduction based on a
workflow export

optional arguments:
  -h, --help            show this help message and exit

Load Workflow Files:
  This file can be exported from a project's Data Export tab

  workflow_csv          The csv file containing the workflow data

Workflow ID and version numbers:
  Enter the workflow ID, major version number, and minor version number

  workflow_id           the workflow ID you would like to extract
  -v VERSION, --version VERSION
                        The workflow version to extract. If only a major
                        version is given (e.g. -v 3) all minor versions will
                        be extracted at once. If a minor version is provided
                        (e.g. -v 3.14) only that specific version will be
                        extracted.
  --min_version MIN_VERSION
                        The minimum workflow version to extract (inclusive).
                        This can be provided as either a major version (e.g.
                        --min_version 3) or a major version with a minor
                        version (e.g. --min_version 3.14). If this flag is
                        provided the --version flag will be ignored.
  --max_version MAX_VERSION
                        The maximum workflow version to extract (inclusive).
                        This can be provided as either a major version (e.g.
                        --max_version 3) or a major version with a minor
                        version (e.g. --max_version 3.14). If this flag is
                        provided the --version flag will be ignored.

Other keywords:
  Additional keywords to be passed into the configuration files

  -k KEYWORDS, --keywords KEYWORDS
                        keywords to be passed into the configuration of a task
                        in the form of a json string, e.g. '{"T0":
                        {"dot_freq": "line"} }' (note: double quotes must be
                        used inside the brackets)

Save Config Files:
  The directory to save the configuration files to

  -d DIR, --dir DIR     The directory to save the configuration files to

Other options:
  -vv, --verbose        increase output verbosity

Example: Penguin Watch

panoptes_aggregation config penguin-watch-workflows.csv 6465 -v 52.76

This creates four files:

  • Extractor_config_workflow_6465_V52.76.yaml: The configuration for the extractor code

  • Reducer_config_workflow_6465_V52.76_point_extractor_by_frame.yaml: The configuration for the reducer used for the point task

  • Reducer_config_workflow_6465_V52.76_question_extractor.yaml: The configuration for the reducer used for the question task

  • Task_labels_workflow_6465_V52.76.yaml: A lookup table to translate the column names used in the extractor/reducer output files into the text originally used on the workflow.
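
These configuration files are plain yaml, so they can also be inspected programmatically. As a small sketch (assuming PyYAML is installed and using the file names above), the task-label lookup table can be loaded and printed like this:

import yaml

# load the lookup table written by the `config` sub-command
with open('Task_labels_workflow_6465_V52.76.yaml') as f:
    task_labels = yaml.safe_load(f)

# map the shorthand keys used in the output files back to the original workflow text
for key, label in task_labels.items():
    print(key, '->', label)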

Note

If you have a recursive workflow, you will need to edit the configuration yaml file by hand. For example, if you have a workflow with three question tasks T0, T1, and T2, with an option in T2 leading back to T0, the extractor_config section of the yaml file would be:

extractor_config:
    question_extractor:
    -   task: T0
        recursive: true
    -   task: T1
        recursive: true
    -   task: T2
        recursive: true

In this setup every instance of T0 will be extracted with the same subject ID and will be reduced together (-F all will need to be set in the reducer for this to work).

Instead, if each recursion through the workflow should be treated as a different subject (e.g. each loop through the questions is for a different row in a data table), the config should be set up as:

extractor_config:
    question_extractor:
    -   task: T0
        recursive: true
        recursive_subject_ids: true
    -   task: T1
        recursive: true
        recursive_subject_ids: true
    -   task: T2
        recursive: true
        recursive_subject_ids: true

This will append a suffix to the subject ID identifying which “loop” it was classified in. In effect, each pass through the workflow will be reduced separately. It also makes the underlying assumption that every volunteer classified the subject in the same order (e.g. for the data table example, each volunteer classified the rows in the same order without skipping any).

These values can also be set with the panoptes_aggregation config command using the --keywords flag (e.g. --keywords '{"T0": {"recursive": true, "recursive_subject_ids": true}}').
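
Getting the quoting right for the --keywords string can be fiddly, so one option is to build it with json.dumps and paste the result into the command. A minimal sketch for the recursive example above:

import json

# build the keyword configuration as a regular Python dictionary
keywords = {'T0': {'recursive': True, 'recursive_subject_ids': True}}

# json.dumps produces the double-quoted string expected by the --keywords flag
print(json.dumps(keywords))
# {"T0": {"recursive": true, "recursive_subject_ids": true}}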


Extracting data

Note: this only works for some task types; see the documentation for a full list of supported task types.

Use the command line tool to extract your data into one flat csv file for each task type:

usage: panoptes_aggregation extract [-h] [-d DIR] [-o OUTPUT] [-O]
                                    [-c CPU_COUNT] [-vv]
                                    classification_csv extractor_config

Extract data from panoptes classifications based on the workflow

optional arguments:
  -h, --help            show this help message and exit

Load classification and configuration files:
  classification_csv    The classification csv file containing the panoptes
                        data dump
  extractor_config      The extractor configuration file

What directory and base name should be used for the extractions:
  -d DIR, --dir DIR     The directory to save the extraction file(s) to
  -o OUTPUT, --output OUTPUT
                        The base name for output csv file to store the
                        extractions (one file will be created for each
                        extractor used)

Other options:
  -O, --order           Arrange the data columns in alphabetical order before
                        saving
  -c CPU_COUNT, --cpu_count CPU_COUNT
                        How many cpu cores to use during extraction
  -vv, --verbose        increase output verbosity

Example: Penguin Watch

Before starting let’s take a closer look at the extractor configuration file Extractor_config_workflow_6465_V52.76.yaml:

extractor_config:
    point_extractor_by_frame:
    -   details:
            T0_tool3:
            - question_extractor
        task: T0
        tools:
        - 0
        - 1
        - 2
        - 3
    question_extractor:
    -   task: T6
    -   task: T1
workflow_id: 6465
workflow_version: '52.76'

This shows the basic setup for what extractor will be used for each task. From this configuration we can see that the point extractor will be used for each of the tools in task T0, the question extractor will be run on the sub-task of tool3 of that task, and a question extractor will be used for tasks T1 and T6. If any of these extractions are not desired, they can be deleted from this file before running the extractor. In this case task T4 was on the original workflow but was never used on the final project, so it has already been removed from the configuration above.
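
A text editor is all you need for this kind of edit, but the same change can be made programmatically. A sketch with PyYAML (the removal of T1 here is purely illustrative; keep it if you want that task extracted):

import yaml

# load the extractor configuration written by the `config` sub-command
with open('Extractor_config_workflow_6465_V52.76.yaml') as f:
    config = yaml.safe_load(f)

# drop a task you don't want extracted (T1 is used only as an example here)
unwanted_task = 'T1'
config['extractor_config']['question_extractor'] = [
    entry for entry in config['extractor_config']['question_extractor']
    if entry['task'] != unwanted_task
]

# write the trimmed configuration back out
with open('Extractor_config_workflow_6465_V52.76.yaml', 'w') as f:
    yaml.safe_dump(config, f)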

Note: If a workflow contains any task types that don’t have extractors or reducers they will not show up in this config file.

panoptes_aggregation extract penguin-watch-classifications-trim.csv Extractor_config_workflow_6465_V52.76.yaml -o example

This creates two csv files (one for each extractor listed in the config file):

  • question_extractor_example.csv

  • point_extractor_by_frame_example.csv
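
A quick way to sanity-check an extraction is to load one of these files with pandas and count the rows per task and per subject. The task and subject_id column names below are the ones typically present in extractor outputs; check the header row if your file differs:

import pandas

# load the question extractions
extracted = pandas.read_csv('question_extractor_example.csv')

# how many extractions were made for each task
print(extracted.groupby('task').size())

# how many extractions each subject received
print(extracted.groupby('subject_id').size().describe())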


Reducing data

Note: this only works for some task types; see the documentation for a full list of supported task types.

usage: panoptes_aggregation reduce [-h] [-F {first,last,all}] [-O]
                                   [-c CPU_COUNT] [-d DIR] [-o OUTPUT] [-s]
                                   extracted_csv reducer_config

reduce data from panoptes classifications based on the extracted data

optional arguments:
  -h, --help            show this help message and exit

Load extraction and configuration files:
  extracted_csv         The extracted csv file
  reducer_config        The reducer configuration file

What directory and base name should be used for the reductions:
  -d DIR, --dir DIR     The directory to save the reduction file to
  -o OUTPUT, --output OUTPUT
                        The base name for output csv file to store the
                        reductions
  -s, --stream          Stream output to csv after each reduction (this is
                        slower but is resumable)

Reducer options:
  -F {first,last,all}, --filter {first,last,all}
                        How to filter a user making multiple classifications
                        for one subject
  -O, --order           Arrange the data columns in alphabetical order before
                        saving
  -c CPU_COUNT, --cpu_count CPU_COUNT
                        How many cpu cores to use during reduction

Example: Penguin Watch

For this example we will do the point clustering for task T0. Let’s take a look at the default config file for that reducer, Reducer_config_workflow_6465_V52.76_point_extractor_by_frame.yaml:

reducer_config:
    point_reducer_dbscan:
        details:
            T0_tool3:
            - question_reducer

As we can see, the default reducer is point_reducer_dbscan and the only keyword specified is the one associated with the sub-task of tool3. To get better results we will add some clustering keywords to the DBSCAN configuration:

reducer_config:
    point_reducer_dbscan:
        eps: 5
        min_samples: 3
        details:
            T0_tool3:
            - question_reducer

However, the images in this project have a large depth of field, leading to a non-constant density of point clusters across each image (denser in the background and sparser in the foreground). This means that HDBSCAN will work better:

reducer_config:
    point_reducer_hdbscan:
        min_cluster_size: 4
        min_samples: 3
        details:
            T0_tool3:
            - question_reducer
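
To see why a density-adaptive clusterer helps here, the sketch below (not part of the aggregation code) compares DBSCAN and HDBSCAN on synthetic points whose cluster spread varies, roughly mimicking the depth-of-field effect. It assumes scikit-learn >= 1.3, which provides sklearn.cluster.HDBSCAN:

import numpy as np
from sklearn.cluster import DBSCAN, HDBSCAN

rng = np.random.default_rng(42)

# two tight "background" clusters and one loose "foreground" cluster
tight_1 = rng.normal(loc=(10, 10), scale=1.0, size=(30, 2))
tight_2 = rng.normal(loc=(30, 12), scale=1.0, size=(30, 2))
loose = rng.normal(loc=(20, 60), scale=6.0, size=(30, 2))
points = np.vstack([tight_1, tight_2, loose])

# a single eps cannot fit both densities: eps=2 recovers the tight clusters
# but tends to label many of the loose points as noise (-1)
dbscan_labels = DBSCAN(eps=2, min_samples=3).fit_predict(points)
print('DBSCAN noise points:', (dbscan_labels == -1).sum())

# HDBSCAN adapts to the local density, so it can recover all three clusters
hdbscan_labels = HDBSCAN(min_cluster_size=4, min_samples=3).fit_predict(points)
print('HDBSCAN clusters found:', len(set(hdbscan_labels) - {-1}))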

Now that the reducer configuration is set up we can run:

panoptes_aggregation reduce point_extractor_by_frame_example.csv Reducer_config_workflow_6465_V52.76_point_extractor_by_frame.yaml -o example

This will create one file:

  • point_reducer_hdbscan_example.csv: The clustered data points for task T0


Reading csv files in python

The resulting csv files typically contain arrays as values. These arrays are read in as strings by most csv readers. To make it easier to read these files in a “science ready” way, a utility function is provided in panoptes_aggregation.csv_utils for converting these columns after the file is loaded with pandas.read_csv:

import pandas
from panoptes_aggregation.csv_utils import unjson_dataframe

# the `data.*` columns are read in as strings instead of arrays
data = pandas.read_csv('point_reducer_hdbscan_example.csv')

# use unjson_dataframe to convert them to lists
# all values are updated in place leaving null values untouched
unjson_dataframe(data)
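
After unjson_dataframe has run, the data.* columns hold real Python lists that can be used directly. The exact column names depend on your task and tool numbers, so the frame0/T0_tool0 names below are illustrative only; check data.columns for the names in your own file:

# list the columns the reducer produced
print(data.columns.tolist())

# illustrative only: loop over the cluster positions for tool 0 on frame 0,
# skipping subjects where no clusters were found
for _, row in data.iterrows():
    x = row.get('data.frame0.T0_tool0_clusters_x')
    y = row.get('data.frame0.T0_tool0_clusters_y')
    if isinstance(x, list) and isinstance(y, list):
        print(row['subject_id'], list(zip(x, y)))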