Johannes Roth

I'm finishing my PhD in Computational Cognitive Neuroscience at the Max Planck Institute in Leipzig, where my research focuses on making vision neuroscience experiments more efficient using ML and large-scale image datasets.

Before academia, I spent several years as a data scientist - building recommendation systems, image processing pipelines, and ML infrastructure in production. What drives me is solving hard technical problems that actually matter: whether that's making neuroimaging experiments more efficient or providing a model that changes how a product works.

Open to opportunities starting September 2026

Experience

2022 - Present

PhD in Computational Cognitive Neuroscience

Max Planck Institute for Human Cognition and Brain Sciences & University of Gießen

Built a high-throughput filtering pipeline to curate 120M naturalistic images from a pool of 2 billion (LAION-2B). Developed an active learning framework for optimal stimulus selection in neuroimaging experiments. Contributed to open-source tools and datasets used across the research community.

Python · PyTorch · CLIP · scikit-learn · SLURM · Docker

2021 - 2022

Research Assistant - ML in Medicine

ScaDS.AI Dresden/Leipzig

Engineered a multi-plane UNet++ ensemble for brain tumor segmentation (0.90 Dice score). Built an attention-based mortality prediction model with epistemic uncertainty estimation (0.85 AUC-ROC).

PyTorch · FT-Transformer · SAINT · UNet++

2020 - 2021

Software Developer (Freelance)

Hebart Lab / MPI CBS

Built hebartlab.com, things-initiative.org, and re-vision-initiative.org. Full-stack development, Linux hosting, automated deployments.

Django · JavaScript · Nginx · Linux · CI/CD

2019 - 2021

Data Scientist (Working student)

CHECK24

Designed and deployed an image processing microservice handling 20M+ images - deduplication, retrieval, classification, quality scoring. Optimized the hotel recommendation system using Implicit and Bayesian hyperparameter tuning (Hyperopt). Built monitoring dashboards and outlier detection for API health.

Python · PyTorch · Implicit · Hyperopt · Flask · Redis · BigQuery · Grafana

2018 - 2019

Data Scientist (Working student)

Webdata Solutions (now Vistex)

Revamped a product-matching pipeline with a neural network approach, improving accuracy from <50% to 92%. Implemented Grad-CAM interpretability to verify model attention on relevant product features. Work formed the basis for an approved research grant.

Python · TensorFlow · PostgreSQL · AWS

2018

Data Scientist (Working student)

Mercateo (now Unite)

Built product classification and matching models for a B2B procurement marketplace.

Python · TensorFlow · PostgreSQL

2014 - 2021

B.Sc. Business Information Systems & M.Sc. Computer Science

Leipzig University

M.Sc. grade 1.2 (Distinction). Focused on ML, data analysis, and medical image processing. Thesis on using GANs to synthesize images that maximally activate specific brain regions.

Publications

  • 2025 How to sample the world for understanding the visual system Roth & Hebart / CCN Oral

    Vision neuroscience runs on large fMRI datasets, but nobody had checked whether the stimulus images in these datasets actually cover what humans see in the real world. We built LAION-natural, a reference distribution of ~120M naturalistic photographs filtered from 2 billion LAION images using a CLIP-based classifier trained on 25k actively sampled labels. Then we measured coverage: ~50% of the visual-semantic space is missing from the two most widely used datasets (NSD and THINGS).

    The good news: you don't need millions of images to fix this. In both simulations and real fMRI data, out-of-distribution generalization saturates at 5-10k samples - as long as you draw them from a diverse enough pool. We compared seven sampling strategies (including random, stratified, k-Means, Core-Set, effective dimensionality optimization, and active learning) and found that pool diversity matters far more than which algorithm you use to sample from it.

    The pipeline processes billions of images using CLIP embeddings, Annoy indices for nearest-neighbor search, mini-batch k-Means clustering, and Ridge regression encoding models - all at a scale that runs on a university HPC cluster, not a cloud budget.

  • 2025 Ten principles for reliable, efficient, and adaptable coding Roth et al. / Communications Psychology

    Most scientists learn to code informally - picking things up as they go, optimizing for "does it run?" over "will anyone else understand this?" This paper introduces a structured framework for writing better research code, built around the idea that researchers naturally switch between quick prototyping and careful development - and that being deliberate about which mode you're in makes all the difference.

    The ten principles span three tiers: organizing code (standardized project structures, version control, automation), writing reusable code (testing, documentation, clean interfaces), and collaborating (code review systems, shared knowledge bases, lab-wide standards). Already at 22k+ accesses, it clearly hit a nerve - these are problems every computational lab deals with but rarely talks about explicitly.

  • 2025 Fine-grained image and category information in ventral visual pathway Badwal, Bergmann, Roth et al. / J Neuroscience

Plus 3 more publications on fMRI methods, GAN-based neuroscience, and brain tumor segmentation.

Datasets & Tools

Dataset: ReLAION-2B Natural

Naturalness scores for 2.1B images, identifying ~500M photographic images for vision research

167 GB · CLIP ViT-H/14

Vision research needs naturalistic photographs, but web-scraped datasets like LAION are full of screenshots, memes, ads, and generated images. We scored all 2.1 billion images in ReLAION-2B for "naturalness" using a CLIP-based classifier, then extracted and published ViT-H/14 embeddings for the ~500M most photographic ones. The result is a 167GB dataset on Hugging Face that lets researchers query half a billion images by visual similarity without downloading a single pixel.
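To illustrate what querying by visual similarity can look like once embeddings are in memory, here is a minimal sketch using plain NumPy. It is not the dataset's official loading code: the function name `top_k_similar`, the random stand-in data, and the embedding width of 1024 are assumptions made for illustration, and the published file layout is not described here.

```python
import numpy as np

def top_k_similar(query, embeddings, k=5):
    """Return indices of the k rows of `embeddings` most cosine-similar to `query`.

    Hypothetical helper for illustration; not part of any published API.
    """
    # L2-normalize so that a dot product equals cosine similarity.
    q = query / np.linalg.norm(query)
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = e @ q
    # argpartition selects the top k without sorting everything;
    # the k winners are then sorted by descending similarity.
    k = min(k, len(sims))
    top = np.argpartition(-sims, k - 1)[:k]
    return top[np.argsort(-sims[top])]

# Toy stand-in for real embeddings (assumed shape: n_images x 1024).
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 1024)).astype(np.float32)

# A slightly perturbed copy of row 42 serves as the query.
query = embeddings[42] + 0.01 * rng.normal(size=1024).astype(np.float32)
neighbors = top_k_similar(query, embeddings, k=5)
```

A brute-force search like this is only practical for a subset that fits in memory. At the scale of hundreds of millions of vectors, an approximate nearest-neighbor index would be the more realistic choice.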

Library: thingsvision

Unified feature extraction API for 100+ vision models

Core contributor · Python · 460k+ downloads

An open-source Python toolbox for extracting and comparing image representations from deep neural networks. Supports 100+ models across torchvision, timm, CLIP, self-supervised models (DINO, MAE, SimCLR), and more. Also includes tools for aligning DNN representations with human similarity judgments via RSA and CKA. I'm the third-largest contributor to the project, which has 460k+ PyPI downloads and is used across vision and cognitive neuroscience labs.
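For readers unfamiliar with CKA, the sketch below shows linear centered kernel alignment between two feature matrices in plain NumPy. It does not use or reproduce the thingsvision API: the function `linear_cka` and the random test matrices are hypothetical and only illustrate the underlying computation.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two feature matrices whose rows are the same stimuli.

    X has shape (n_stimuli, d1) and Y has shape (n_stimuli, d2).
    Hypothetical helper for illustration; not a thingsvision function.
    """
    # Center each feature (column) across stimuli.
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    # Squared Frobenius norm of the cross-covariance, normalized by
    # the Frobenius norms of the two self-covariances.
    cross = np.linalg.norm(Y.T @ X, ord="fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, ord="fro")
    norm_y = np.linalg.norm(Y.T @ Y, ord="fro")
    return cross / (norm_x * norm_y)

rng = np.random.default_rng(0)
A = rng.normal(size=(200, 64))

# An orthogonal rotation of A: linear CKA is invariant to it.
Q, _ = np.linalg.qr(rng.normal(size=(64, 64)))
B = A @ Q

# Unrelated random features for comparison.
C = rng.normal(size=(200, 32))

cka_rotated = linear_cka(A, B)    # close to 1.0
cka_unrelated = linear_cka(A, C)  # lower than cka_rotated
```

The rotated copy scores close to 1 because linear CKA ignores orthogonal transformations of the feature space, which is what makes it useful for comparing layers or models with different feature bases.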

Tool: cvmanova_python

Cross-validated MANOVA for fMRI pattern analysis · Python

Tool: pysearchlight

Customizable searchlight analysis for fMRI data · Python

2025 CMBB Replication Award for reliable coding practices in neuroscience

Get in Touch

Happy to chat about research, potential collaborations, or opportunities. Email is best. Also on LinkedIn, GitHub, Hugging Face, and Google Scholar.