Page Last Updated: May 22, 2026

File-Based Data Processing Workflow

Part 1: Site Capture & BIDS Conversion

Data is collected from sites into LORIS (EEG, Axivity, and GABI) or FIONA (MRI and MRS). LORIS data is then transferred directly into the central S3 main PR bucket, which serves as the source for CBRAIN processing. MRI and MRS data must first be converted to BIDS format, and MRI data also undergoes extensive raw data QC (see details).


S3 Bucket Key

| Key Name in Diagram | S3 URL |
|---|---|
| JCVI | s3://midb-hbcd-ucsd-main-pr-dicoms/ |
| MRS BIDS | s3://midb-hbcd-main-pr-mrs/ |
| Main PR / BIDS | s3://midb-hbcd-main-pr/assembly_bids/ |
| QC Env | s3://midb-hbcd-lasso-hdcc-qc-ongoing-dccid/ |

Part 2: De-Identification, CBRAIN Processing, & Lasso Ingestion

For a detailed breakdown of de-identification, CBRAIN pipeline processing, re-identification, etc., see the UMN De-Identification & Pipeline Processing section below.
All of the processes below are performed by UMN/MSI:

S3 Bucket Key

| Key Name in Diagram | S3 URL |
|---|---|
| De-ID-List | s3://midb-hbcd-main-pr-deidentification-list/ |
| De-ID / BIDS | s3://midb-hbcd-main-deid/assembly_bids/ |
| De-ID / Derivatives | s3://midb-hbcd-main-deid/derivatives/ |
| De-ID / BrainSwipes | s3://midb-hbcd-main-deid/v2/brainswipes/ |
| Main PR / Derivatives | s3://midb-hbcd-main-pr/reid_derivatives/ |
| Main PR / BrainSwipes | s3://midb-hbcd-main-pr/reid_brainswipes/ |
| Lasso PR | s3://midb-hbcd-lasso-hdcc-qc-br/br{BETA RELEASE#}/hbcd/ |
| QC Environment | s3://midb-hbcd-lasso-hdcc-qc-ongoing-dccid/ |

BrainSwipes

The BrainSwipes workflow differs from that of other data because quality control (QC) is performed after CBRAIN processing, so the data must go through additional steps. Some details to note:

  • After the visual reports used for QC are transferred to the Prerelease Derivatives S3 URL (s3://midb-hbcd-prerelease-bids/derivatives/ses-V02/xcp_d/{{SUBJECT}}/figures/{{FILENAME}}.png), a query is run to identify out-of-date outputs and to remove or archive the records related to those files.
  • TBD: Participant sessions that fail structural QC (based on XCP-D derivative visual reports) are flagged for manual correction of the corresponding BIBSNet brain segmentations. The corrected segmentations will not be fed back into the main processing workflows; instead, they are integrated into the training set for future BIBSNet models.

Responsibility Assignment Matrices By Modality

Imaging
| Study Stage | Step | Location | Responsible | Accountable | Consulted [C] / Informed [I] |
|---|---|---|---|---|---|
| Data Collection | Participant Source Data Acquisition: DCMs, eCRF population (MRI Acquisition Form) | FIONA, LORIS | Site Staff (Varies by site) | Site Staff (Varies by site) | -/- |
| Data QC + Action | QC at Source Data Acquisition: eCRF populated properly, DCM header checks, naming convention checks | FIONA | Site Staff (Varies by site) | Site Staff (Varies by site) | -/- |
| Data QC + Action | QC at Source: Check acquisition/protocol adherence | JCVI | Josh Kuperman | Anders Dale | -/- |
| Data QC + Action | Transfer acquisition/protocol adherence (both DCM acquisition and MRI Acquisition Form) to DCC | JCVI | Ron Yang, Don Hagler | Don Hagler | -/- |
| Data Collection | Convert DICOMs to BIDS/NIfTI | UMN MSI | Cecile Madjar | Cecile Madjar | -/- |
| Data Collection | Convert MRS data to BIDS | UMN HST, UMN MSI | Reed McEwan, Cecile Madjar | Reed McEwan | [C] Helge Zoellner; [C] Erik Lee; [C] Georg Oeltzschner |
| Data QC + Action | QC of the DCM to BIDS Conversion: Correct BIDS errors | UMN MSI | Cecile Madjar | Cecile Madjar | [C] Erik Lee; [C] Lucille Moore; [C] Tim Hendrickson |
| Data QC + Action | Check for protocol deviations (based on BIDS) | JCVI | MRI QC Workgroup | Don Hagler | -/- |
| Data Ingestion | Ingestion and catalogue DICOMs in Lasso | | | | -/- |
| Data QC + Action | QC of ingestion | JCVI, HST | | | -/- |
| Data QC + Action | Initial QC raw data (e.g. manual, automated) | JCVI | MRI QC Workgroup | Don Hagler | -/- |

EEG
| Study Stage | Step | Location | Responsible | Accountable | Consulted [C] / Informed [I] |
|---|---|---|---|---|---|
| | Monthly Net Inventory/Equipment QC | Site | Site Staff (Varies Per Site), EEG Core, HDCC, LORIS | Site Staff (Varies Per Site) | -/- |
| | EEG Acquisition | Site | Site Staff (Varies Per Site), EEG Core, HDCC, LORIS | Site Staff (Varies Per Site) | EEG Core |
| | Populate EEG Acquisition Form | Site | Site Staff (Varies Per Site), EEG Core, HDCC, LORIS | Site Staff (Varies Per Site) | -/- |
| | QC Acquisition form population | Site | Site Staff (Varies Per Site), EEG Core, HDCC, LORIS | Site Staff (Varies Per Site) | EEG Core |
| | De-Identification & Flags (pre-LORIS) | Site | Site Staff (Varies Per Site), EEG Core, HDCC, LORIS | Site Staff (Varies Per Site) | -/- |
| | BIDs Wizard Population and Execution | LORIS | Site Staff (Varies Per Site), EEG Core, HDCC, LORIS | Site Staff (Varies Per Site) | [C] Laetitia Fesselier |
| | Convert EEG data to BIDS | UMN MSI | Laetitia Fesselier | Laetitia Fesselier | EEG Core |
| | Run MADE Pipeline | UMN MSI | Erik Lee | Erik Lee | EEG Core |
| | QC pre-processed EEG | UMD EEG Core | Kira Ashton, Dylan Gilbreath, Trisha Maheswari, Elise Harris | Santiago Morales | EEG Core |

Biospecimens
| Study Stage | Step | Location | Responsible | Accountable | Consulted [C] / Informed [I] |
|---|---|---|---|---|---|
| | Acquisition of sample | Site | Site Staff | Site Staff | -/- |
| | Population of meta data form | LORIS | Site Staff | Site Coordinator | -/- |
| | QC of form population | OHSU (Oregon Health and Science University) | WG Co-Chairs | Elinor Sullivan (Co-Chair) | -/- |
| | Shipment of sample | Site | Site Staff | Site Staff | -/- |
| | QC: Ensuring sample was received | Sampled/USDTL | Charles Hevi (Sampled); Priti Soni (USDTL) | Charles Hevi (Sampled); Priti Soni (USDTL) | -/- |
| | QC: deviation code of sample | Sampled/USDTL | Charles Hevi (Sampled); Priti Soni (USDTL) | Charles Hevi (Sampled); Priti Soni (USDTL) | -/- |
| | Analysis of sample | Sampled/USDTL | Charles Hevi (Sampled); Priti Soni (USDTL) | Charles Hevi (Sampled); Priti Soni (USDTL) | -/- |

Wearable Sensors
| Study Stage | Step | Location | Responsible | Accountable | Consulted [C] / Informed [I] |
|---|---|---|---|---|---|
| | Data Collection | Site | | | |
| | Convert Axivity data to BIDS | UMN MSI | Cecile Madjar | Cecile Madjar | [C] Jinseok Oh; [C] Beth Smith; [C] Erik Lee |
| | Run Axivity Pipeline | UMN MSI | Erik Lee | Erik Lee | [C] Jinseok Oh; [C] Beth Smith |
| | QC preprocessed data | | | | |

UMN De-Identification & Pipeline Processing

This documentation outlines how UMN processes imaging data after it has been curated by LORIS into BIDS format.
The workflow consists of eight interdependent components that handle de-identification, pipeline processing, synchronization, and cleanup of imaging data.

Workflow Summary

| # | Workflow | Core Function | Frequency |
|---|---|---|---|
| 1 | Release Candidate ID Creation | Uploads updated release ID mappings for new subjects | Daily |
| 2 | Raw BIDS De-Identification | Removes identifiers and uploads anonymized data to de-ID bucket | Daily |
| 3 | CBRAIN Subject Registration | Registers de-identified subjects in CBRAIN for processing | Daily |
| 4 | Post-Processing of De-ID Data | Runs pipelines (e.g., Nibabies, QSIPrep) on de-identified BIDS | Daily |
| 5 | CBRAIN Log Preservation | Archives failed task logs for permanent tracking | Daily |
| 6 | Raw BIDS Sync Cleanup | Removes outdated data when LORIS and de-ID buckets diverge | Daily |
| 7 | Re-ID for LORIS | Replaces Release Candidate IDs with DCCIDs for LORIS ingestion | Daily |
| 8 | Derivative Sync Cleanup | Removes outdated re-ID derivatives from LORIS bucket | Daily |

Primary Goals

  • Ensure only anonymized data (using Release Candidate IDs) is released publicly
  • Prevent overlap of Release Candidate IDs and DCCIDs/PSCIDs within the same dataset
  • The raw BIDS data curated by LORIS will be periodically updated. These updates also include updates to BIDS metadata and QC (which may impact processing pipelines).
  • The derived processing outputs released to the public must be from the same processing stream that internal HBCD investigators use for QC purposes.
  • Provide LORIS with access to derived outputs for facilitating internal QC with Workgroups and for tabulation of derivatives
  • Limit unnecessary reprocessing while still maintaining integrity between inputs and outputs. For example, if ses-V03 becomes available for a given subject, this should not initiate reprocessing of ses-V02 data. However, if new files or updated QC become available for ses-V02, then ses-V02 reprocessing should occur.
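
One way to reason about the last goal is to scope change detection to a single subject/session pair, so that a new session never invalidates an older one. The sketch below is illustrative only: the function names, the (key, version ID) input shape, and the SHA-256 fingerprint are assumptions made for this example, not the production logic. It relies on QC updates arriving as new file versions (for example, an updated scans.tsv).

```python
import hashlib


def session_fingerprint(files):
    """Return a stable digest for one session's (key, version_id) pairs."""
    digest = hashlib.sha256()
    for key, version_id in sorted(files):
        digest.update(f"{key}\x00{version_id}\n".encode("utf-8"))
    return digest.hexdigest()


def sessions_to_reprocess(previous, current):
    """Return the (subject, session) pairs that are new or have changed.

    `previous` and `current` map (subject, session) -> list of
    (key, version_id) tuples. Both shapes are hypothetical.
    """
    stale = []
    for session_id, files in current.items():
        old_files = previous.get(session_id)
        if old_files is None or session_fingerprint(old_files) != session_fingerprint(files):
            stale.append(session_id)
    return sorted(stale)


# Hypothetical example: ses-V03 arrives while ses-V02 is unchanged.
before = {("sub-01", "ses-V02"): [("anat/T1w.nii.gz", "v1")]}
after = {
    ("sub-01", "ses-V02"): [("anat/T1w.nii.gz", "v1")],
    ("sub-01", "ses-V03"): [("anat/T1w.nii.gz", "v7")],
}
print(sessions_to_reprocess(before, after))  # only ses-V03 is selected
```

Because the fingerprint is computed per session, adding ses-V03 leaves the ses-V02 fingerprint untouched, while any new or re-versioned ses-V02 file changes it.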

General Limitations

Incoming session data (MRI including initial scans and rescans, EEG, Axivity, GABI, and manual QC ratings) often arrive over several weeks. Automated QC and processing routines may be delayed until all expected elements for a session are available.

Individual Workflows

Creation of Release Candidate IDs for Anonymization

Goal: Maintain an up-to-date mapping of identifiers for de-identification workflows.
Contacts: Reed McEwan, Dan Duhon
Frequency: Daily (<1 hour)
Inputs: N/A
Outputs: s3://midb-hbcd-main-pr-deidentification-list/release_identifiers.csv
Notes: Phantom data may not yet be included.

Raw BIDS De-identification

Goal: De-identify and upload raw BIDS sessions


Criteria for De-identification:


  • Subject is listed in the release ID mapping
  • No existing session files in the de-ID bucket
  • Session files are available in the LORIS bucket and are ≥1 day old
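
The three criteria above can be expressed as a single predicate. This is a minimal sketch under stated assumptions: the function name and argument shapes are hypothetical, and "≥1 day old" is interpreted here as the newest file in the session being at least one day old, which the source does not specify.

```python
from datetime import datetime, timedelta, timezone


def is_eligible_for_deid(subject_id, release_ids, deid_file_count,
                         loris_last_modified, now=None):
    """Return True when a session meets all three de-identification criteria.

    release_ids: set of subject IDs present in the release ID mapping
    deid_file_count: number of session files already in the de-ID bucket
    loris_last_modified: list of timezone-aware datetimes, one per LORIS file
    """
    now = now or datetime.now(timezone.utc)
    if subject_id not in release_ids:      # subject must be in the mapping
        return False
    if deid_file_count > 0:                # de-ID bucket must be empty for the session
        return False
    if not loris_last_modified:            # LORIS bucket must contain session files
        return False
    newest = max(loris_last_modified)      # assumption: age is judged on the newest file
    return now - newest >= timedelta(days=1)
```

Judging age on the newest file means a session that is still receiving uploads is skipped until it has been quiet for a day.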

De-identification Procedure Overview:

  • De-identify and upload all supported session files to the de-ID bucket
  • Update session metadata (sessions.<tsv|json>) in de-ID bucket
  • Tag each file with its loris-versionid (corresponds to VersionId in original LORIS files) for traceability


REMOVED/RETAINED IDENTIFIERS:
Removed: PSCIDs, DCCIDs, Site IDs, and manually populated fields (prone to typos) that commonly contain these identifiers
Retained: Jittered patient age at acquisition, acquisition dates/times, and acquisition device serial numbers
FILE COVERAGE:
BIDS metadata (scans/session tsv & JSONs)
  • Remove PatientName and PatientBirthDate from JSONs
  • Replace site info with anonymized site IDs via mapping file
  • Remove InstitutionAddress, InstitutionalDepartmentName, and InstitutionName
EEG
  • sourcedata (eventlogs.txt): Anonymize entries for DataFile.Basename, DCCID, and Subject columns
  • .set files: Replace (1) All DCCIDs/PSCIDs with Release Candidate IDs & (2) Unapproved manual entries with “Anonymized”
MRS NIfTI files:
  • Remove InstitutionName, InstitutionAddress, PatientSex, and PatientWeight using spec2nii
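
For the BIDS JSON sidecar rules listed above, a scrubbing step might look like the following sketch. The field names to remove come from the list above; the `site_field` key, the shape of `site_map`, and the function name are hypothetical, since the source does not say how site information is stored or mapped. EEG files and MRS NIfTI headers need their own routines and are not covered here.

```python
import json

# Fields the list above names as removed from BIDS JSON sidecars.
FIELDS_TO_REMOVE = (
    "PatientName",
    "PatientBirthDate",
    "InstitutionAddress",
    "InstitutionalDepartmentName",
    "InstitutionName",
)


def scrub_sidecar(sidecar, site_map, site_field="Site"):
    """Return a copy of a sidecar dict with identifying fields removed.

    `site_field` is a hypothetical key holding the real site ID, and
    `site_map` is a hypothetical dict of real site ID -> anonymized site ID.
    """
    cleaned = {k: v for k, v in sidecar.items() if k not in FIELDS_TO_REMOVE}
    if site_field in cleaned:
        # Fail loudly on an unmapped site instead of leaking the real ID.
        cleaned[site_field] = site_map[cleaned[site_field]]
    return cleaned


example = json.loads(
    '{"PatientName": "placeholder", "InstitutionName": "placeholder",'
    ' "Site": "SITE_A", "RepetitionTime": 2.0}'
)
print(scrub_sidecar(example, {"SITE_A": "anon-site-01"}))
```

Raising on an unmapped site is a deliberate choice in this sketch: a missing mapping entry stops the upload rather than passing a real site ID through.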

Contacts: Sriharshitha Anuganti, Erik Lee
Frequency: Daily (PM CST; ensure completion within 24 hours)
Inputs: s3://midb-hbcd-main-pr/assembly_bids (raw BIDS data with DCCIDs)
Outputs: s3://midb-hbcd-main-deid/assembly_bids (with Release Candidate IDs)
Notes: EEG sourcedata files eventlogs.edat3 and eeg_flags.json are not yet supported.

Registration of Subjects from Raw BIDS Data into CBRAIN

Goal: Make CBRAIN aware of subjects available for processing.
Contacts: Monalisa Bilas, Erik Lee
Frequency: Daily (<1 hour)
Inputs: s3://midb-hbcd-main-deid/assembly_bids
Outputs: Internal CBRAIN records indicating existence of subject folder within BIDS directory
Notes: Each subject has a single CBRAIN BidsSubject File Collection linking all sessions, though sessions are processed independently.

Pipeline processing of De-identified Data

Goal: Run de-identified BIDS data through BIDS App pipelines in CBRAIN (e.g., `Nibabies`, `BIBSNet`, `QSIPrep`) to generate derivatives.

Workflow Steps:

  1. Detect available sessions in the BIDS directory and CBRAIN
  2. Check for existing outputs or prior processing attempts
  3. For sessions without outputs or attempted processing, verify that required prerequisite files exist and pass QC (from scans.tsv)
  4. Select files for processing based on modality-specific rules (e.g., best T1w image only, all fMRI images passing QC)
  5. Confirm dependencies between pipelines (e.g., BIBSNet outputs are required by Nibabies)
  6. Launch CBRAIN processing tasks using predefined settings and including only files selected for processing
  7. CBRAIN uploads successful job outputs to session-specific folders on S3 and records the corresponding processing tasks and generated file collections internally.
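
Steps 2 through 5 can be sketched as two small helpers. Everything here is a simplified assumption: the row keys (`filename`, `suffix`, `qc_pass`, `qc_score`), the "lower score is better" rule, and the `DEPENDS_ON` table are hypothetical stand-ins for the modality-specific rules held in the processing code referenced below.

```python
def select_files(scans):
    """Pick inputs for one session from scans.tsv-like rows.

    Each row is a dict with hypothetical keys: 'filename', 'suffix',
    'qc_pass' (bool), and 'qc_score' (lower is better).
    """
    passing = [row for row in scans if row["qc_pass"]]
    t1w = [row for row in passing if row["suffix"] == "T1w"]
    bold = [row for row in passing if row["suffix"] == "bold"]
    selected = []
    if t1w:
        # Best T1w image only.
        selected.append(min(t1w, key=lambda row: row["qc_score"])["filename"])
    # All fMRI runs that pass QC.
    selected.extend(row["filename"] for row in bold)
    return selected


# Hypothetical dependency table: a pipeline starts only after these finish.
DEPENDS_ON = {"bibsnet": [], "nibabies": ["bibsnet"]}


def ready_to_launch(pipeline, completed, attempted):
    """True if the pipeline has no prior outputs or attempts and its
    prerequisite pipelines have completed for this session."""
    if pipeline in completed or pipeline in attempted:
        return False
    return all(dep in completed for dep in DEPENDS_ON.get(pipeline, []))
```

Treating a prior attempt the same as a prior output keeps a failing session from being relaunched every day; in this sketch a session is retried only after its records are cleared.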

Contacts: Erik Lee, Monalisa Bilas
Frequency: Daily (initial routine <1 hour; processing jobs may take ~1 day)
Inputs: s3://midb-hbcd-main-deid/assembly_bids (raw BIDS data)
Outputs: s3://midb-hbcd-main-deid/derivatives/ses-{V0X} (session-specific subject folders with Release Candidate IDs)
Notes: See the GitHub repository and Documentation for the code that manages processing. CBRAIN logs and file collections are stored internally for traceability.

Saving stdout/stderr Logs for Failed CBRAIN Processing Tasks

Goal: Preserve CBRAIN processing logs for failed tasks before the jobs are deleted (a few weeks after completion). Logs from successful jobs are already archived in the .cbrain logs included with the S3 outputs. Note that CBRAIN only transfers outputs to S3 for successful jobs.
Contacts: Monalisa Bilas, Erik Lee
Frequency: Daily (<1 hour)
Inputs: CBRAIN task directories stored on MSI under /scratch.global
Outputs: s3://midb-hbcd-main-deid/cbrain_std_logs/ (Files named {CBRAIN_Task_ID}.<out|err>.out)
Notes: CBRAIN task IDs are unique, so duplicates pose no issue.

Cleanup for Out-of-Sync Raw BIDS Data (between LORIS & de-id buckets)

Goal: Detect and remove sessions where LORIS and de-ID data diverge.

    Process:
  1. Compare file counts between de-id and LORIS session folders
  2. If the file counts are the same, compare the loris-versionid of the de-id files to ensure they match
  3. If the file counts or loris-versionid values do not match, delete all associated derivatives, CBRAIN task records, and raw BIDS data. The next time the query scripts that look for new subjects to process are run, processing will be re-initiated for these subjects.
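
The comparison in steps 1 and 2 can be sketched as follows. The function name and input shapes are hypothetical. Because de-identification renames files, this sketch compares the collection of version IDs rather than matching files by path, and it assumes both listings are restricted to the file types that de-identification supports.

```python
def session_in_sync(loris_files, deid_files):
    """Compare one session across the LORIS and de-ID buckets.

    loris_files: dict of object key -> VersionId in the LORIS bucket
    deid_files:  dict of object key -> loris-versionid tag on the de-ID copy
    """
    # Step 1: file counts must match.
    if len(loris_files) != len(deid_files):
        return False
    # Step 2: every de-ID file must trace back to the current LORIS version.
    return sorted(loris_files.values()) == sorted(deid_files.values())


def sessions_to_clean(sessions):
    """Return the session IDs whose raw data, derivatives, and CBRAIN
    records would be removed. `sessions` maps a session ID to a
    (loris_files, deid_files) pair."""
    return sorted(
        session_id
        for session_id, (loris, deid) in sessions.items()
        if not session_in_sync(loris, deid)
    )
```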

Contacts: Erik Lee, Monalisa Bilas
Frequency: Daily (runtime varies by data volume)

    Inputs:
  • Raw BIDS data: s3://midb-hbcd-main-deid/assembly_bids and s3://midb-hbcd-main-pr/assembly_bids
  • Derivatives: s3://midb-hbcd-main-deid/derivatives
  • CBRAIN records of userfiles and tasks

Outputs: N/A
Notes: Cleanup enables the next de-ID workflow to rerun cleanly for that session.

Re-identification of Derivatives (i.e. re-insertion of DCCIDs) for LORIS

Goal: Re-identify de-identified derivatives by replacing Release Candidate IDs with DCCIDs, enabling upload to LORIS. Ensures derivatives are accurately linked back to participant DCCIDs to facilitate internal QC with Workgroups and tabulation of derivatives.

Process:

  • Download de-identified derivatives from s3://midb-hbcd-main-deid/derivatives.
  • Replace all Release Candidate IDs with the corresponding DCCIDs in both filenames and file contents (using file type–specific routines) for (1) text-based files (.csv, .html, .json, .txt, .toml, .tsv, .log) and (2) EEG .set files (.set, .mat)
  • Replace anonymized site IDs with real site IDs.
  • Upload re-identified files to s3://midb-hbcd-main-pr/reid_derivatives and set the metadata field cbrain-timestamp based on the original file’s LastModified value.
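
For the text-based file types, the ID replacement step can be sketched as below. This is an illustration under assumptions: the function names and the `id_map` shape (Release Candidate ID -> DCCID) are hypothetical, the ID values in the usage note are made up, and EEG .set/.mat files, site ID replacement, and the S3 upload with the cbrain-timestamp metadata are not covered.

```python
import re
from pathlib import PurePosixPath

# Text-based extensions listed above.
TEXT_SUFFIXES = {".csv", ".html", ".json", ".txt", ".toml", ".tsv", ".log"}


def build_replacer(id_map):
    """Return a function that swaps each Release Candidate ID for its DCCID."""
    if not id_map:
        return lambda text: text
    # Longest IDs first, so an ID that is a prefix of another cannot win.
    ordered = sorted(id_map, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in ordered))
    return lambda text: pattern.sub(lambda m: id_map[m.group(0)], text)


def reidentify(key, content, id_map):
    """Rename the object key and, for text files, rewrite the body.

    `content` is a str for text files; anything else is returned unchanged.
    """
    replace = build_replacer(id_map)
    new_key = replace(key)
    if PurePosixPath(key).suffix in TEXT_SUFFIXES:
        content = replace(content)
    return new_key, content
```

For example, with the made-up mapping `{"RC0001": "123456"}`, the key `sub-RC0001/figures/sub-RC0001_report.html` becomes `sub-123456/figures/sub-123456_report.html` and its HTML body is rewritten, while a `.nii.gz` file is renamed but its bytes are left alone.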

Re-identification is performed on derivatives for the following pipelines:

bibsnet
bme_x
hbcd_motion
made
mriqc
mrsqc
nibabies
osprey
qmri_postproc
qsiprep
qsirecon-DIPYDKI
qsirecon-DSIStudio
qsirecon-TORTOISE_model-<MAPMRI|tensor>
symri
xcp_d

Contacts: Sriharshitha Anuganti, Erik Lee
Frequency: Runs daily
Inputs: s3://midb-hbcd-main-deid/derivatives (de-identified derivatives)
Outputs: s3://midb-hbcd-main-pr/reid_derivatives (re-identified derivatives)

    Notes:
  • Update re-ID routines whenever pipeline filenames or formats change.
  • Previous documentation referenced VersionId metadata; this has been replaced with LastModified since the de-ID bucket is non-versioned.

Cleanup for Out-of-Sync Derivatives in LORIS Bucket

Goal: Remove re-identified derivatives from LORIS when they become out of sync with corresponding de-identified derivatives.

Process:

  • For each subject/session/pipeline, compare LastModified (de-ID) and cbrain-timestamp (re-ID) values between:
    • s3://midb-hbcd-main-deid/derivatives
    • s3://midb-hbcd-main-pr/reid_derivatives
  • If the number of files or timestamps differ, delete the corresponding re-identified data from s3://midb-hbcd-main-pr.
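
The grouping and comparison described above might be sketched like this. The function name and input shapes are hypothetical. The sketch assumes that subject IDs in both listings have already been mapped to a common identifier, and that LastModified and cbrain-timestamp values are normalized to the same string format.

```python
from collections import defaultdict


def stale_reid_groups(deid_objects, reid_objects):
    """Return the groups whose re-identified copies should be deleted.

    Each input is an iterable of (group, timestamp) pairs, where `group`
    is a (subject, session, pipeline) tuple. For de-ID objects the
    timestamp is LastModified; for re-ID objects it is cbrain-timestamp.
    """
    deid, reid = defaultdict(list), defaultdict(list)
    for group, stamp in deid_objects:
        deid[group].append(stamp)
    for group, stamp in reid_objects:
        reid[group].append(stamp)
    # Comparing sorted lists catches a differing file count, a differing
    # timestamp, and a group that no longer exists on the de-ID side.
    return sorted(g for g in reid if sorted(reid[g]) != sorted(deid.get(g, [])))
```

Groups that exist only in the de-ID bucket are not returned, since there is nothing to delete; they are picked up by the next re-identification run.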

Contacts: Sriharshitha Anuganti, Monalisa Bilas, Erik Lee
Frequency: Daily
Inputs: s3://midb-hbcd-main-pr/reid_derivatives and s3://midb-hbcd-main-deid/derivatives
Outputs: N/A
Notes: Ensures only synchronized derivatives remain in LORIS and prevents outdated or mismatched data from being used.