Page Last Updated: May 22, 2026

File-Based Data Processing Workflow

Part 1: Site Capture & BIDS Conversion

Data is collected from sites into LORIS (EEG, Axivity, and GABI) or FIONA (MRI and MRS). LORIS data is transferred directly into the central S3 main PR bucket, which is then the source for CBRAIN processing. MRI and MRS data must first be converted to BIDS format, and MRI data also undergoes extensive raw data QC (see details).

Key Name in Diagram	S3 URL
JCVI	`s3://midb-hbcd-ucsd-main-pr-dicoms/`
MRS BIDS	`s3://midb-hbcd-main-pr-mrs/`
Main PR / BIDS	`s3://midb-hbcd-main-pr/assembly_bids/`
QC Env	`s3://midb-hbcd-lasso-hdcc-qc-ongoing-dccid/`

Part 2: De-Identification, CBRAIN Processing, & Lasso Ingestion

For a detailed breakdown of de-identification, CBRAIN pipeline processing, re-identification, etc., see the UMN De-Identification & Pipeline Processing section below.
All of the processes below are performed by UMN/MSI:

Key Name in Diagram	S3 URL
De-ID-List	`s3://midb-hbcd-main-pr-deidentification-list/`
De-ID / BIDS	`s3://midb-hbcd-main-deid/assembly_bids/`
De-ID / Derivatives	`s3://midb-hbcd-main-deid/derivatives/`
De-ID / BrainSwipes	`s3://midb-hbcd-main-deid/v2/brainswipes/`
Main PR / Derivatives	`s3://midb-hbcd-main-pr/reid_derivatives/`
Main PR / BrainSwipes	`s3://midb-hbcd-main-pr/reid_brainswipes/`
Lasso PR	`s3://midb-hbcd-lasso-hdcc-qc-br/br{BETA RELEASE#}/hbcd/`
QC Environment	`s3://midb-hbcd-lasso-hdcc-qc-ongoing-dccid/`

The BrainSwipes workflow differs from that of other data because quality control (QC) is performed after CBRAIN processing, which requires additional steps. Some details to note:

After the visual reports used for QC are transferred to the Prerelease Derivatives S3 URL (s3://midb-hbcd-prerelease-bids/derivatives/ses-V02/xcp_d/{{SUBJECT}}/figures/{{FILENAME}}.png), a query is run to identify out-of-date outputs and either remove or archive the records related to those files.
TBD: Participant sessions that fail structural QC (based on XCP-D derivative visual reports) are flagged for manual correction of the corresponding BIBSNet brain segmentations. The corrected segmentations will not be fed back into the main processing workflows, but will instead be integrated into the training set for future BIBSNet models.

Responsibility Assignment Matrices By Modality

Study Stage	Step	Location	Responsible	Accountable	Consulted [C] / Informed [I]
Data Collection	Participant Source Data Acquisition: DCMs, eCRF population (MRI Acquisition Form)	FIONA, LORIS	Site Staff (Varies by site)	Site Staff (Varies by site)	-/-
Data QC + Action	QC at Source Data Acquisition: eCRF populated properly, DCM header checks, naming convention checks	FIONA	Site Staff (Varies by site)	Site Staff (Varies by site)	-/-
Data QC + Action	QC at Source: Check acquisition/protocol adherence	JCVI	Josh Kuperman	Anders Dale	-/-
Data QC + Action	Transfer acquisition/protocol adherence (both DCM acquisition and MRI Acquisition Form) to DCC	JCVI	Ron Yang, Don Hagler	Don Hagler	-/-
Data Collection	Convert DICOMs to BIDS/NIfTI	UMN MSI	Cecile Madjar	Cecile Madjar	-/-
Data Collection	Convert MRS data to BIDS	UMN HST, UMN MSI	Reed McEwan, Cecile Madjar	Reed McEwan	[C] Helge Zoellner [C] Erik Lee [C] Georg Oeltzschner
Data QC + Action	QC of the DCM to BIDS Conversion: Correct BIDS errors	UMN MSI	Cecile Madjar	Cecile Madjar	[C] Erik Lee [C] Lucille Moore [C] Tim Hendrickson
Data QC + Action	Check for protocol deviations (based on BIDS)	JCVI	MRI QC Workgroup	Don Hagler	-/-
Data Ingestion	Ingest and catalogue DICOMs in Lasso				-/-
Data QC + Action	QC of ingestion	JCVI, HST			-/-
Data QC + Action	Initial QC raw data (e.g. manual, automated)	JCVI	MRI QC Workgroup	Don Hagler	-/-

Study Stage Step	Location	Responsible	Accountable	Consulted [C] / Informed [I]
Monthly Net Inventory/Equipment QC	Site	Site Staff (Varies Per Site), EEG Core, HDCC, LORIS	Site Staff (Varies Per Site)	-/-
EEG Acquisition	Site	Site Staff (Varies Per Site), EEG Core, HDCC, LORIS	Site Staff (Varies Per Site)	EEG Core
Populate EEG Acquisition Form	Site	Site Staff (Varies Per Site), EEG Core, HDCC, LORIS	Site Staff (Varies Per Site)	-/-
QC Acquisition form population	Site	Site Staff (Varies Per Site), EEG Core, HDCC, LORIS	Site Staff (Varies Per Site)	EEG Core
De-Identification & Flags (pre-LORIS)	Site	Site Staff (Varies Per Site), EEG Core, HDCC, LORIS	Site Staff (Varies Per Site)	-/-
BIDS Wizard Population and Execution	LORIS	Site Staff (Varies Per Site), EEG Core, HDCC, LORIS	Site Staff (Varies Per Site)	[C] Laetitia Fesselier
Convert EEG data to BIDS	UMN MSI	Laetitia Fesselier	Laetitia Fesselier	EEG Core
Run MADE Pipeline	UMN MSI	Erik Lee	Erik Lee	EEG Core
QC pre-processed EEG	UMD EEG Core	Kira Ashton, Dylan Gilbreath, Trisha Maheswari, Elise Harris	Santiago Morales	EEG Core

Study Stage Step	Location	Responsible	Accountable	Consulted [C] / Informed [I]
Acquisition of sample	Site	Site Staff	Site Staff	-/-
Population of metadata form	LORIS	Site Staff	Site Coordinator	-/-
QC of form population	OHSU (Oregon Health and Science University)	WG Co-Chairs	Elinor Sullivan (Co-Chair)	-/-
Shipment of sample	Site	Site Staff	Site Staff	-/-
QC: Ensuring sample was received	Sampled/USDTL	Charles Hevi (Sampled), Priti Soni (USDTL)	Charles Hevi (Sampled), Priti Soni (USDTL)	-/-
QC: deviation code of sample	Sampled/USDTL	Charles Hevi (Sampled), Priti Soni (USDTL)	Charles Hevi (Sampled), Priti Soni (USDTL)	-/-
Analysis of sample	Sampled/USDTL	Charles Hevi (Sampled), Priti Soni (USDTL)	Charles Hevi (Sampled), Priti Soni (USDTL)	-/-

Study Stage Step	Location	Responsible	Accountable	Consulted [C] / Informed [I]
Data Collection	Site
Convert Axivity data to BIDS	UMN MSI	Cecile Madjar	Cecile Madjar	[C] Jinseok Oh [C] Beth Smith [C] Erik Lee
Run Axivity Pipeline	UMN MSI	Erik Lee	Erik Lee	[C] Jinseok Oh [C] Beth Smith
QC preprocessed data

UMN De-Identification & Pipeline Processing

This documentation outlines how UMN processes imaging data after it has been curated by LORIS into BIDS format.
The workflow consists of eight interdependent components that handle de-identification, pipeline processing, synchronization, and cleanup of imaging data.

Workflow Summary

#	Workflow	Core Function	Frequency
1	Release Candidate ID Creation	Uploads updated release ID mappings for new subjects	Daily
2	Raw BIDS De-Identification	Removes identifiers and uploads anonymized data to de-ID bucket	Daily
3	CBRAIN Subject Registration	Registers de-identified subjects in CBRAIN for processing	Daily
4	Post-Processing of De-ID Data	Runs pipelines (e.g., Nibabies, QSIPrep) on de-identified BIDS	Daily
5	CBRAIN Log Preservation	Archives failed task logs for permanent tracking	Daily
6	Raw BIDS Sync Cleanup	Removes outdated data when LORIS and de-ID buckets diverge	Daily
7	Re-ID for LORIS	Replaces Release Candidate IDs with DCCIDs for LORIS ingestion	Daily
8	Derivative Sync Cleanup	Removes outdated re-ID derivatives from LORIS bucket	Daily

Primary Goals

Ensure only anonymized data (using Release Candidate IDs) is released publicly
Prevent overlap of Release Candidate IDs and DCCIDs/PSCIDs within the same dataset
The raw BIDS data curated by LORIS will be periodically updated. These updates also include updates to BIDS metadata and QC (which may impact processing pipelines).
The derived processing outputs released to the public must be from the same processing stream that internal HBCD investigators use for QC purposes.
Provide LORIS with access to derived outputs to facilitate internal QC with Workgroups and tabulation of derivatives
Limit unnecessary reprocessing while still maintaining integrity between inputs and outputs. For example, if ses-V03 becomes available for a given subject, this should not initiate reprocessing of ses-V02 data. However, if new files or updated QC become available for ses-V02, then ses-V02 reprocessing should occur.

General Limitations

Incoming session data (MRI including initial scans and rescans, EEG, Axivity, GABI, and manual QC ratings) often arrive over several weeks. Automated QC and processing routines may be delayed until all expected elements for a session are available.

Individual Workflows

1. Release Candidate ID Creation

Goal: Maintain an up-to-date mapping of identifiers for de-identification workflows.
Contacts: Reed McEwan, Dan Duhon
Frequency: Daily (<1 hour)
Inputs: N/A
Outputs: s3://midb-hbcd-main-pr-deidentification-list/release_identifiers.csv
Notes: Phantom data may not yet be included.
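
As a minimal sketch of how downstream steps might consume this mapping, the snippet below parses CSV text into a DCCID-to-Release-Candidate-ID dictionary and rejects duplicate identifiers. The column names `dccid` and `release_candidate_id` and the sample values are assumptions made for illustration; the actual layout of release_identifiers.csv is not described on this page.

```python
import csv
import io


def load_release_mapping(csv_text, dccid_col="dccid", release_col="release_candidate_id"):
    """Parse mapping CSV text into {DCCID: Release Candidate ID}.

    Column names are hypothetical; adjust them to the real file header.
    """
    mapping = {}
    seen_release_ids = set()
    for row in csv.DictReader(io.StringIO(csv_text)):
        dccid = row[dccid_col].strip()
        release_id = row[release_col].strip()
        # Require a one-to-one mapping so re-identification stays reversible.
        if dccid in mapping or release_id in seen_release_ids:
            raise ValueError(f"duplicate identifier in mapping: {dccid} / {release_id}")
        mapping[dccid] = release_id
        seen_release_ids.add(release_id)
    return mapping


# Made-up example rows, not real identifiers.
sample = "dccid,release_candidate_id\n100001,RC0001\n100002,RC0002\n"
mapping = load_release_mapping(sample)
```

Rejecting duplicates at load time is one way to catch a corrupted mapping before any files are renamed.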

2. Raw BIDS De-Identification

Goal: De-identify and upload raw BIDS sessions

Criteria for De-identification:

Subject is listed in the release ID mapping
No existing session files in the de-ID bucket
Session files are available in the LORIS bucket and are ≥1 day old
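
The three criteria above can be sketched as a single predicate. This is an illustration rather than the production check: the function name, argument shapes, and sample identifiers are hypothetical, and it assumes that every file in the session (not just one) must be at least one day old.

```python
from datetime import datetime, timedelta, timezone


def is_eligible_for_deid(subject_id, release_mapping, deid_session_files,
                         loris_file_times, now, min_age=timedelta(days=1)):
    """Return True when a session meets all three de-identification criteria.

    loris_file_times maps each LORIS-bucket key to its last-modified time;
    deid_session_files lists keys already present in the de-ID bucket.
    """
    if subject_id not in release_mapping:   # criterion 1: subject is in the mapping
        return False
    if deid_session_files:                  # criterion 2: nothing in the de-ID bucket yet
        return False
    if not loris_file_times:                # criterion 3a: files exist in the LORIS bucket
        return False
    # criterion 3b (assumed): every file is at least min_age old
    return all(now - modified >= min_age for modified in loris_file_times.values())


now = datetime(2026, 5, 22, tzinfo=timezone.utc)
files = {"sub-100001/ses-V02/anat/T1w.nii.gz": now - timedelta(days=2)}
eligible = is_eligible_for_deid("100001", {"100001": "RC0001"}, [], files, now)
```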

De-identification Procedure Overview:

De-identify and upload all supported session files to the de-ID bucket
Update session metadata (sessions.<tsv|json>) in de-ID bucket
Tag each file with its loris-versionid (corresponds to VersionId in original LORIS files) for traceability

REMOVED/RETAINED IDENTIFIERS:

Removed: PSCIDs, DCCIDs, Site IDs, and manually populated fields (prone to typos) that commonly contain these identifiers

Retained: Jittered patient age at acquisition, acquisition dates/times, and acquisition device serial numbers

FILE COVERAGE:

BIDS metadata (scans/session tsv & JSONs)

Remove PatientName and PatientBirthDate from JSONs
Replace site info with anonymized site IDs via mapping file
Remove InstitutionAddress, InstitutionalDepartmentName, and InstitutionName

EEG

sourcedata (eventlogs.txt): Anonymize entries for DataFile.Basename, DCCID, and Subject columns
.set files: Replace (1) all DCCIDs/PSCIDs with Release Candidate IDs and (2) unapproved manual entries with “Anonymized”

MRS NIfTI files:

Remove InstitutionName, InstitutionAddress, PatientSex, and PatientWeight using spec2nii

Contacts: Sriharshitha Anuganti, Erik Lee
Frequency: Daily (PM CST; ensure completion within 24 hours)
Inputs: s3://midb-hbcd-main-pr/assembly_bids (raw BIDS data with DCCIDs)
Outputs: s3://midb-hbcd-main-deid/assembly_bids (with Release Candidate IDs)
Notes: EEG sourcedata files eventlogs.edat3 and eeg_flags.json are not yet supported.
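
The JSON field removals listed under BIDS metadata can be illustrated with a short sketch. The function name and sample sidecar are hypothetical, and the sketch only drops top-level keys; site ID replacement and the EEG and MRS handling described above are not shown.

```python
import json

# Fields the BIDS metadata rules above list for removal from JSON sidecars.
FIELDS_TO_REMOVE = (
    "PatientName",
    "PatientBirthDate",
    "InstitutionAddress",
    "InstitutionalDepartmentName",
    "InstitutionName",
)


def scrub_sidecar(sidecar):
    """Return a copy of a sidecar dict without the listed identifying fields."""
    return {key: value for key, value in sidecar.items() if key not in FIELDS_TO_REMOVE}


# Made-up sidecar content for illustration only.
raw = json.loads('{"PatientName": "x", "InstitutionName": "y", "RepetitionTime": 2.0}')
clean = scrub_sidecar(raw)
```

Returning a new dictionary instead of mutating the input keeps the original available for comparison during review.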

3. CBRAIN Subject Registration

Goal: Make CBRAIN aware of subjects available for processing.
Contacts: Monalisa Bilas, Erik Lee
Frequency: Daily (<1 hour)
Inputs: s3://midb-hbcd-main-deid/assembly_bids
Outputs: Internal CBRAIN records indicating existence of subject folder within BIDS directory
Notes: Each subject has a single CBRAIN BidsSubject File Collection linking all sessions, though sessions are processed independently.

4. Post-Processing of De-ID Data

Goal: Run de-identified BIDS data through BIDS App pipelines in CBRAIN (e.g., `Nibabies`, `BIBSNet`, `QSIPrep`) to generate derivatives.

Workflow Steps:

Detect available sessions in the BIDS directory and CBRAIN
Check for existing outputs or prior processing attempts
For sessions without outputs or attempted processing, verify that required prerequisite files exist and pass QC (from scans.tsv)
Select files for processing based on modality-specific rules (e.g., best T1w image only, all fMRI images passing QC)
Confirm dependencies between pipelines (e.g., BIBSNet outputs are required by Nibabies)
Launch CBRAIN processing tasks using predefined settings and including only files selected for processing
CBRAIN uploads successful job outputs to session-specific folders on S3 and records the corresponding processing tasks and generated file collections internally.
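
The file-selection step can be sketched as follows, as one possible reading of the modality-specific rules. The scans.tsv column names `qc_pass` and `qc_score`, the convention that a higher score is better, and the sample filenames are all assumptions; the real selection logic lives in the processing code referenced in the notes below.

```python
import csv
import io


def select_files(scans_tsv_text):
    """Pick processing inputs from scans.tsv rows using modality-specific rules."""
    rows = list(csv.DictReader(io.StringIO(scans_tsv_text), delimiter="\t"))
    passing = [r for r in rows if r["qc_pass"] == "1"]
    t1w = [r for r in passing if r["filename"].endswith("_T1w.nii.gz")]
    bold = [r for r in passing if r["filename"].endswith("_bold.nii.gz")]
    selected = [r["filename"] for r in bold]  # all fMRI runs passing QC
    if t1w:
        # Best T1w only (assumes a higher qc_score means better quality).
        best = max(t1w, key=lambda r: float(r["qc_score"]))
        selected.insert(0, best["filename"])
    return selected


# Made-up scans.tsv content for illustration only.
sample = (
    "filename\tqc_pass\tqc_score\n"
    "anat/sub-X_ses-V02_run-1_T1w.nii.gz\t1\t0.7\n"
    "anat/sub-X_ses-V02_run-2_T1w.nii.gz\t1\t0.9\n"
    "func/sub-X_ses-V02_task-rest_run-1_bold.nii.gz\t1\t0.8\n"
    "func/sub-X_ses-V02_task-rest_run-2_bold.nii.gz\t0\t0.2\n"
)
selected = select_files(sample)
```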

Contacts: Erik Lee, Monalisa Bilas
Frequency: Daily (initial routine <1 hour; processing jobs may take ~1 day)
Inputs: s3://midb-hbcd-main-deid/assembly_bids (raw BIDS data)
Outputs: s3://midb-hbcd-main-deid/derivatives/ses-{V0X} (session-specific subject folders with Release Candidate IDs)
Notes: See the GitHub repository and Documentation for the code that manages processing. CBRAIN logs and file collections are stored internally for traceability.

5. CBRAIN Log Preservation

Goal: Preserve CBRAIN processing logs for failed tasks before the jobs are deleted (a few weeks after completion). Logs from successful jobs are already archived in the .cbrain logs included with the S3 outputs. Note that CBRAIN only transfers outputs to S3 for successful jobs.
Contacts: Monalisa Bilas, Erik Lee
Frequency: Daily (<1 hour)
Inputs: CBRAIN task directories stored on MSI under /scratch.global
Outputs: s3://midb-hbcd-main-deid/cbrain_std_logs/ (Files named {CBRAIN_Task_ID}.<out|err>.out)
Notes: CBRAIN task IDs are unique, so duplicates pose no issue.

6. Raw BIDS Sync Cleanup

Goal: Detect and remove sessions where LORIS and de-ID data diverge.

Process:

Compare file counts between the de-ID and LORIS session folders
If the file counts are the same, compare the loris-versionid of each de-ID file against the VersionId of the corresponding LORIS file to ensure they match
If the file counts or loris-versionid values do not match, delete all associated derivatives, CBRAIN task records, and raw BIDS data. The next time the query scripts that look for new subjects to process are run, processing is re-initiated for these subjects.
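
A minimal sketch of the comparison, assuming the version information for each session folder has already been listed into dictionaries. The function name and argument shapes are hypothetical. Because filenames differ between the two buckets (DCCIDs versus Release Candidate IDs), the sketch compares the collections of version IDs rather than matching files by name.

```python
def session_diverged(loris_versions, deid_versions):
    """Return True when a de-ID session no longer mirrors its LORIS source.

    loris_versions: {relative_path: VersionId} for the LORIS session folder.
    deid_versions:  {relative_path: loris-versionid tag} for the de-ID folder.
    """
    # Step 1: a different number of files means the session is out of sync.
    if len(loris_versions) != len(deid_versions):
        return True
    # Step 2: with equal counts, every LORIS VersionId must appear as a tag.
    return sorted(loris_versions.values()) != sorted(deid_versions.values())


# Made-up keys and version IDs for illustration only.
in_sync = session_diverged({"a.json": "v1", "a.nii.gz": "v2"},
                           {"x.json": "v1", "x.nii.gz": "v2"})
updated = session_diverged({"a.json": "v3"}, {"x.json": "v1"})
```

A session for which this returns True would be the one whose derivatives, CBRAIN task records, and raw BIDS data are deleted.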

Contacts: Erik Lee, Monalisa Bilas
Frequency: Daily (runtime varies by data volume)

Inputs:

Raw BIDS data: s3://midb-hbcd-main-deid/assembly_bids and s3://midb-hbcd-main-pr/assembly_bids
Derivatives: s3://midb-hbcd-main-deid/derivatives
CBRAIN records of userfiles and tasks

Outputs: N/A
Notes: Cleanup enables the next de-ID workflow to rerun cleanly for that session.

7. Re-ID for LORIS

Goal: Re-identify de-identified derivatives by replacing Release Candidate IDs with DCCIDs, enabling upload to LORIS. Ensures derivatives are accurately linked back to participant DCCIDs to facilitate internal QC with Workgroups and tabulation of derivatives.

Process:

Download de-identified derivatives from s3://midb-hbcd-main-deid/derivatives.
Replace all Release Candidate IDs with the corresponding DCCIDs in both filenames and file contents (using file type–specific routines) for (1) text-based files (.csv, .html, .json, .txt, .toml, .tsv, .log) and (2) EEG files (.set, .mat).
Replace anonymized site IDs with real site IDs.
Upload re-identified files to s3://midb-hbcd-main-pr/reid_derivatives and set the metadata field cbrain-timestamp based on the original file’s LastModified value.
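
The ID replacement step for text-based files can be sketched as below. The helper names and sample identifiers are hypothetical, and the sketch leaves non-text content untouched; EEG .set and .mat files, site ID replacement, and the S3 upload with cbrain-timestamp would need their own routines.

```python
import re
from pathlib import PurePosixPath

TEXT_SUFFIXES = {".csv", ".html", ".json", ".txt", ".toml", ".tsv", ".log"}


def build_replacer(release_to_dccid):
    """Build a function that swaps every known Release Candidate ID for its DCCID."""
    if not release_to_dccid:
        return lambda text: text
    # Longest IDs first so an ID that is a prefix of another cannot shadow it.
    ids = sorted(release_to_dccid, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(i) for i in ids))
    return lambda text: pattern.sub(lambda m: release_to_dccid[m.group(0)], text)


def reidentify(key, content, replace):
    """Re-identify the object key always, and the content only for text-based files."""
    new_key = replace(key)
    if PurePosixPath(key).suffix in TEXT_SUFFIXES:
        content = replace(content)
    return new_key, content


# Made-up identifiers for illustration only.
replace = build_replacer({"RC0001": "100001"})
new_key, new_text = reidentify(
    "xcp_d/sub-RC0001/ses-V02/sub-RC0001_report.html", "<h1>sub-RC0001</h1>", replace
)
```

Compiling a single pattern for all IDs means each file is scanned once regardless of how many subjects are in the mapping.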

Re-identification is performed on derivatives for the following pipelines:

bibsnet
bme_x
hbcd_motion
made
mriqc
mrsqc
nibabies
osprey
qmri_postproc
qsiprep
qsirecon-DIPYDKI
qsirecon-DSIStudio
qsirecon-TORTOISE_model-<MAPMRI|tensor>
symri
xcp_d

Contacts: Sriharshitha Anuganti, Erik Lee
Frequency: Daily
Inputs: s3://midb-hbcd-main-deid/derivatives (de-identified derivatives)
Outputs: s3://midb-hbcd-main-pr/reid_derivatives (re-identified derivatives)

Notes:

Update re-ID routines whenever pipeline filenames or formats change.
Previous documentation referenced VersionId metadata; this has been replaced with LastModified since the de-ID bucket is non-versioned.

8. Derivative Sync Cleanup

Goal: Remove re-identified derivatives from LORIS when they become out of sync with corresponding de-identified derivatives.

Process:

For each subject/session/pipeline, compare LastModified (de-ID) and cbrain-timestamp (re-ID) values between:
- s3://midb-hbcd-main-deid/derivatives
- s3://midb-hbcd-main-pr/reid_derivatives
If the number of files or timestamps differ, delete the corresponding re-identified data from s3://midb-hbcd-main-pr.
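
As a sketch of this comparison, the function below takes timestamps already grouped per subject/session/pipeline and returns the groups whose re-identified copies should be deleted. The function name and data shapes are hypothetical. It assumes both sides are keyed by the same identifier (for example, after mapping DCCIDs back to Release Candidate IDs) and that timestamps are strings in one consistent format.

```python
def stale_groups(deid, reid):
    """Return the re-ID groups whose timestamps no longer match the de-ID bucket.

    deid: {(subject, session, pipeline): [LastModified, ...]} from the de-ID bucket.
    reid: {(subject, session, pipeline): [cbrain-timestamp, ...]} from the re-ID bucket.
    """
    stale = []
    for group, reid_stamps in reid.items():
        deid_stamps = deid.get(group, [])
        # Comparing sorted lists catches both a count difference and a changed value.
        if sorted(deid_stamps) != sorted(reid_stamps):
            stale.append(group)
    return stale


# Made-up groups and timestamps for illustration only.
deid = {
    ("RC0001", "ses-V02", "xcp_d"): ["2026-05-01T00:00:00Z", "2026-05-01T00:05:00Z"],
    ("RC0001", "ses-V02", "mriqc"): ["2026-05-02T00:00:00Z"],
}
reid = {
    ("RC0001", "ses-V02", "xcp_d"): ["2026-05-01T00:00:00Z", "2026-04-01T00:00:00Z"],
    ("RC0001", "ses-V02", "mriqc"): ["2026-05-02T00:00:00Z"],
}
stale = stale_groups(deid, reid)
```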

Contacts: Sriharshitha Anuganti, Monalisa Bilas, Erik Lee
Frequency: Daily
Inputs: s3://midb-hbcd-main-pr/reid_derivatives and s3://midb-hbcd-main-deid/derivatives
Outputs: N/A
Notes: Ensures only synchronized derivatives remain in LORIS and prevents outdated or mismatched data from being used.