Analysis of Sensitive Data in Secure Processing Environments (SPE)

This tutorial presents the implementation of an SPE in the de.NBI Cloud (ELIXIR-DE) using ELIXIR and open-source services.

The aim of this tutorial is to describe how to deploy and configure a Secure Processing Environment (SPE) for analyzing large volumes of sensitive data generated by biomedical and clinical research. Easy and secure access to such environments accelerates research and enables participation by researchers with limited resources.

Users of an SPE can run workflows on sensitive data, without ever gaining access to the actual data. The data is processed securely and the user can only access the results of the workflows.

Overview

The setup has three central components:
- Secure Execution Backend
- External Storage (S3) for result deposition
- User Authentication (LS Login)

The execution backend consists of two independent systems. The execution of workflows is managed by WESkit. It provides a REST interface to submit workflow runs and monitor progress. The actual execution of the workflow scripts takes place in a Slurm cluster. All sensitive data is stored and processed within the cluster. This tutorial assumes a Slurm cluster hosted in the de.NBI Cloud.

The results are stored in MinIO/S3-compatible storage that can be accessed by authorized users. Life Science Login is used to authenticate users to a registered service that allows them to request workflow executions and inspect the results in the external storage. Users can therefore read only the non-sensitive information resulting from workflow execution; sensitive data is never accessible to them.

Setup

Required infrastructure

1 VM for WESkit deployment
1-2 VMs for a Slurm cluster (more, depending on the workload)
1 VM for S3 Storage

Authorization

Data processing is permitted only for authorized users. LS-Login can be used to register a service/client. The client credentials issued on registration allow your service to obtain an access token. Potential users need to request authorization to use the service.

Execution

WESkit allows execution of Snakemake and Nextflow workflows by sending a request to the compute infrastructure (Cloud/Cluster). Find details in the WESkit docs.

A Slurm cluster can be deployed with little effort using BiBiGrid, a framework for creating and managing cloud clusters. BiBiGrid uses Ansible to configure cloud images and set up an on-demand Slurm cluster. Alternatively, use any other Slurm deployment.

Access to the SPE must be restricted to comply with national regulations and laws. Collaborators and foreign researchers need to obtain permission from the Identity Provider to use the SPE. This permission allows them to authenticate at the Identity Provider site and request workflow execution via WESkit on the Slurm cluster.

Results

Finally, results are written to storage that is mounted into the cluster and exposed through an interface that is accessible only via LS-Login. Sensitive data is neither managed by WESkit nor accessible in the result storage.

Step 1: WESkit

The SPE uses WESkit to execute workflows on the sensitive data. WESkit must therefore be installed on a machine that is reachable from the internet and has outbound internet access. This machine could be hosted by an institute compute center or by a cloud provider.

The deployment of WESkit involves the following steps:

Install WESkit: Simple deployment using Docker.
Set up compute environment: WESkit must be configured according to the compute environment.
Provide workflows: In this scenario, a data controller has to validate every workflow and provide it on the compute environment. Only then are the workflows available to researchers. WESkit provides instructions for workflow installation. Workflows are Snakemake or Nextflow scripts, along with all dependencies and additional data.
Configure workflow engine: Define workflow engine parameters.
Provide data: The workflows are executed on sensitive data within the compute environment. Therefore, the data should be available in the file system of the compute environment (e.g. Slurm).
Publish web service: We assume that the service will be available online. This requires configuration on the provider side.

Step 2: MinIO

The SPE uses MinIO/S3 to give researchers access to non-sensitive results data. Depending on the environment, there are several ways to deploy MinIO. To configure OpenID, please refer to the MinIO OIDC Documentation.

In this scenario, we create a bucket "results" in MinIO and give all authorized users read access to the results data.
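A read-only bucket policy for this scenario could look roughly like the sketch below. The helper name readonly_policy, the policy name results-readonly, the file name and the mc alias "local" are all hypothetical; the policy document itself uses the standard S3 policy syntax. How the policy is attached to users who log in via OIDC depends on the MinIO OIDC configuration and is not shown here.

```shell
#!/usr/bin/env bash
# Sketch: emit an S3 policy document that grants read-only access to one bucket.
# The function name and the example names further down are assumptions.
readonly_policy() {
    local bucket="$1"
    cat <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetBucketLocation", "s3:ListBucket"],
      "Resource": ["arn:aws:s3:::${bucket}"]
    },
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject"],
      "Resource": ["arn:aws:s3:::${bucket}/*"]
    }
  ]
}
EOF
}

# Example usage (assumes an mc alias named "local" has been configured):
#   mc mb --ignore-existing local/results
#   readonly_policy results > results-readonly.json
#   mc admin policy create local results-readonly results-readonly.json
# Older mc releases use "mc admin policy add" for the last step.
```

The policy deliberately omits s3:PutObject and s3:DeleteObject, so that only the crawler's own credentials can write to the bucket.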

Note: MinIO has removed its open source license, so it might be advisable to switch to a different storage solution. Refer to the legacy binary releases for the last open source release.

Results crawler

To make the non-sensitive results available, a crawler continuously checks for new results and copies them to MinIO. This can be implemented as a shell script running as a cron job.

A simple example script is given below:

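#!/usr/bin/env bash
# Upload finished, non-sensitive WESkit results to MinIO.

# Register the MinIO endpoint under the alias "local".
# (Older mc releases used `mc config host add` instead of `mc alias set`.)
mc alias set local http://localhost:9000 USERNAME PASSWORD

BASE_DIR=/minio_data/data

# Upload results.csv of one run directory, at most once.
process_directory() {
    local dir="${1%/}"
    local bucketname
    bucketname=$(basename "$dir")
    if [[ ! -f "$dir/upload_token" ]]; then
        # Only runs that produced the expected plot are uploaded.
        if [[ -f "$dir/plots/quals.svg" ]]; then
            mc mb --ignore-existing "local/results/$bucketname"
            mc cp "$dir/results.csv" "local/results/$bucketname/"
        fi
        # Tag the directory so it is skipped on the next crawl.
        touch "$dir/upload_token"
    fi
}

for dir in "$BASE_DIR"/*/*/; do
    for logsdir in "$dir".weskit/*/; do
        # log.json appears once the workflow execution has finished.
        if [[ -d "$logsdir" && -f "$logsdir/log.json" ]]; then
            process_directory "$dir"
        fi
    done
done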

Run periodically as a cron job, this script scans the WESkit results folder. WESkit writes information about a workflow execution to the file log.json once the execution has finished. For every run directory that contains a log.json and has not been processed yet, the script uploads the result file results.csv to the S3 bucket, provided the expected output plots/quals.svg exists. Processed run directories are tagged with an upload_token file to prevent redundant uploads.
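If a crawl takes longer than the cron interval, two instances of the crawler could overlap. One way to guard against this is a small wrapper that takes a lock before starting, sketched below. The function name run_exclusive, the lock path, the script paths and the five-minute schedule are assumptions for illustration, not part of the original setup.

```shell
#!/usr/bin/env bash
# Sketch: run a command only if no other instance currently holds the lock.
run_exclusive() {
    local lockdir="$1"
    shift
    # mkdir is atomic: it fails if the lock directory already exists.
    if ! mkdir "$lockdir" 2>/dev/null; then
        echo "another run is in progress, skipping" >&2
        return 1
    fi
    "$@"
    local rc=$?
    # Release the lock and pass on the exit status of the command.
    rmdir "$lockdir"
    return "$rc"
}

# Hypothetical wrapper call, e.g. in /usr/local/bin/crawl_with_lock.sh:
#   run_exclusive /tmp/results_crawler.lock /usr/local/bin/results_crawler.sh
#
# Hypothetical crontab entry that starts the wrapper every five minutes:
#   */5 * * * * /usr/local/bin/crawl_with_lock.sh >> /var/log/results_crawler.log 2>&1
```

Note that a crawler killed mid-run leaves the lock directory behind, so in this sketch a stale lock has to be removed manually.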

Step 3: User Interface

The simplest way to offer a user interface for the SPE is to use a customized version of the WESkit GUI, a lightweight web application that lets researchers run and monitor workflows. The WESkit GUI repository can be used as a blueprint to create a customized website.

Step 4: Authentication and Authorization

Authentication and authorization are implemented using OIDC. This setup uses the LS-Login infrastructure for OIDC integration. The LS-Login documentation contains a guide on how to register a new service.

In this tutorial, we assume a single LS-Login service for all the deployed tools (WESkit, MinIO, WebApp).

LS-Login can be activated in MinIO either through the OIDC configuration in the MinIO console or by setting environment variables, as described in the MinIO OIDC Documentation. The ELIXIR-on-Cloud documentation contains detailed instructions for using MinIO with LS-Login.
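As a rough illustration of the environment-variable route, the MinIO server could be started with settings along these lines. The client ID and secret are placeholders, and the discovery URL is an assumption that should be checked against the LS-Login documentation. The mapping from users to MinIO policies (claim-based or role-based) is omitted here and is covered in the documentation referenced above.

```shell
#!/usr/bin/env bash
# Sketch: OIDC settings read by the MinIO server at startup.
# All values are placeholders or assumptions; replace them with the values
# issued when the service was registered.

# OIDC discovery document of the identity provider (assumed LS-Login URL).
export MINIO_IDENTITY_OPENID_CONFIG_URL="https://login.aai.lifescience-ri.eu/oidc/.well-known/openid-configuration"

# Credentials of the registered service/client.
export MINIO_IDENTITY_OPENID_CLIENT_ID="my-client-id"
export MINIO_IDENTITY_OPENID_CLIENT_SECRET="my-client-secret"

# Scopes requested during login.
export MINIO_IDENTITY_OPENID_SCOPES="openid,profile,email"
```

Keeping these values in an environment file that only the MinIO service account can read avoids exposing the client secret in shell history or process listings.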

WESkit can be configured for OIDC. Once OIDC is enabled, WESkit requires an OAuth2 token with each request. Please refer to the WESkit documentation for configuration instructions.
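With OIDC enabled, a client could submit and monitor runs roughly as sketched below. The sketch assumes that WESkit exposes the standard GA4GH WES routes under /ga4gh/wes/v1 and that an access token has already been obtained and stored in ACCESS_TOKEN. The host name, the helper function names and the example argument values are hypothetical; the workflow type and version strings accepted by a given deployment should be taken from the WESkit documentation.

```shell
#!/usr/bin/env bash
# Sketch: call a WES endpoint with an OAuth2 bearer token.
# WES_URL is a placeholder; ACCESS_TOKEN must be set by the caller.
WES_URL="${WES_URL:-https://weskit.example.org/ga4gh/wes/v1}"

# Build the Authorization header for a given token.
auth_header() {
    printf 'Authorization: Bearer %s' "$1"
}

# Submit a workflow run.
# Arguments: workflow file, engine type, engine version, parameters as JSON.
wes_submit_run() {
    local workflow_url="$1" type="$2" version="$3" params="$4"
    # --form-string sends the values literally as multipart form fields.
    curl --silent --fail --show-error \
        -H "$(auth_header "$ACCESS_TOKEN")" \
        --form-string "workflow_url=$workflow_url" \
        --form-string "workflow_type=$type" \
        --form-string "workflow_type_version=$version" \
        --form-string "workflow_params=$params" \
        "$WES_URL/runs"
}

# Query the state of a previously submitted run by its run ID.
wes_run_status() {
    curl --silent --fail --show-error \
        -H "$(auth_header "$ACCESS_TOKEN")" \
        "$WES_URL/runs/$1/status"
}
```

A call such as wes_submit_run my_workflow/Snakefile SMK 7.30.2 '{"sample": "A"}' would then return a JSON document containing the run ID, which can be passed to wes_run_status; the argument values here are illustrative only.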