Meta AI Releases NeuralBench: A Unified Open-Source Framework to Benchmark NeuroAI Models Across 36 EEG Tasks and 94 Datasets

Source: MarkTechPost

Evaluating AI models trained on brain signals has long been a messy, inconsistent exercise. Different research groups use different preprocessing pipelines, train models on different datasets, and report results on a narrow set of tasks, making it nearly impossible to know which model actually works best, or for what. A new framework from the Meta AI team is designed to fix that.

Meta Researchers have released NeuralBench, a unified, open-source framework for benchmarking AI models of brain activity. Its first release, NeuralBench-EEG v1.0, is the largest open benchmark of its kind: 36 downstream tasks, 94 datasets, 9,478 subjects, 13,603 hours of electroencephalography (EEG) data, and 14 deep learning architectures evaluated under a single standardized interface.

https://ai.meta.com/research/publications/neuralbench-a-unifying-framework-to-benchmark-neuroai-models/

The Problem NeuralBench Solves

The broader field of NeuroAI, where deep learning meets neuroscience, has exploded in recent years. Self-supervised learning techniques originally developed for language, speech, and images are now being adapted to build brain foundation models: large models pretrained on unlabeled brain recordings and fine-tuned for downstream tasks ranging from clinical seizure detection to decoding what a person is seeing or hearing.

But the evaluation landscape has been badly fragmented. Existing benchmarks like MOABB cover up to 148 brain-computer interfacing (BCI) datasets but limit evaluation to just 5 downstream tasks. Other efforts — EEG-Bench, EEG-FM-Bench, AdaBrain-Bench — are each constrained in their own ways. For modalities like magnetoencephalography (MEG) and functional magnetic resonance imaging (fMRI), there is no systematic benchmark at all.

The result: claims that foundation models are “generalizable” or “foundational” often rest on cherry-picked tasks, with no common reference point.

What is NeuralBench?

NeuralBench is built on three core Python packages that form a modular pipeline.

NeuralFetch handles dataset acquisition, pulling curated data from public repositories including OpenNeuro, DANDI, and NEMAR. NeuralSet prepares data as PyTorch-ready dataloaders, wrapping existing neuroscience tools like MNE-Python and nilearn for preprocessing, and HuggingFace for extracting stimulus embeddings (for tasks involving images, speech, or text). NeuralTrain provides modular training code built on PyTorch-Lightning, Pydantic, and the exca execution and caching library.
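
For the stimulus-embedding step, any off-the-shelf vision or language model can in principle turn stimuli into vectors. As a rough illustration only, the sketch below extracts image embeddings with a HuggingFace CLIP checkpoint and a blank placeholder image; the specific models and preprocessing NeuralSet actually applies are not assumed here.

# Generic stimulus-embedding extraction with HuggingFace transformers.
# The CLIP checkpoint and the blank placeholder image are illustrative
# choices, not necessarily what NeuralSet uses internally.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.new("RGB", (224, 224))                # placeholder stimulus image
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    embedding = model.get_image_features(**inputs)  # shape: (1, 512)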

Once installed with pip install neuralbench, the framework is controlled through a command-line interface (CLI). Running a task takes three commands: download the data, prepare the cache, and execute. Every task is configured through a lightweight YAML file that specifies the data source, train/validation/test splits, preprocessing steps, target processing, training hyperparameters, and evaluation metrics.

What NeuralBench-EEG v1.0 Covers

The first release focuses on EEG and spans eight task categories: cognitive decoding (image, sentence, speech, typing, video, and word decoding), brain-computer interfacing (BCI), evoked responses, clinical tasks, internal state, sleep, phenotyping, and miscellaneous.

Three classes of models are compared:

  • Task-specific architectures (~1.5K–4.2M parameters, trained from scratch): ShallowFBCSPNet, Deep4Net, EEGNet, BDTCN, ATCNet, EEGConformer, SimpleConvTimeAgg, and CTNet.
  • EEG foundation models (~3.2M–157.1M parameters, pretrained and fine-tuned): BENDR, LaBraM, BIOT, CBraMod, LUNA, and REVE.
  • Handcrafted feature baselines: sklearn-style pipelines using symmetric positive definite (SPD) matrix representations fed into logistic or Ridge regression (a sketch of one such pipeline follows this list).
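
To make the third class concrete, here is a minimal sketch of an SPD-feature baseline, assuming pyriemann for the covariance and tangent-space steps and dummy data for illustration; the exact feature pipeline used in the paper may differ.

# Handcrafted SPD-feature baseline: trial covariances -> tangent space ->
# logistic regression. pyriemann and the dummy data are assumptions for
# illustration, not NeuralBench's actual implementation.
import numpy as np
from pyriemann.estimation import Covariances
from pyriemann.tangentspace import TangentSpace
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

X = np.random.randn(100, 32, 256)        # 100 EEG epochs, 32 channels, 256 samples
y = np.random.randint(0, 2, size=100)    # binary labels

clf = make_pipeline(
    Covariances(estimator="oas"),        # trial-wise SPD covariance matrices
    TangentSpace(),                      # map SPD matrices to a flat vector space
    LogisticRegression(max_iter=1000),
)
clf.fit(X, y)
print(clf.score(X, y))                   # training accuracy on the dummy data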

All foundation models are fine-tuned end-to-end using a shared training recipe: AdamW optimizer, learning rate of 10⁻⁴, weight decay of 0.05, cosine annealing with 10% warmup, and up to 50 epochs with early stopping (patience = 10). The sole exception is BENDR, for which the learning rate is lowered to 10⁻⁵ and gradients are clipped at 0.5 to obtain stable learning curves. This deliberate standardization removes model-specific optimization tricks such as layer-wise learning rate decay, two-stage probing, or LoRA, so that architecture and pretraining methodology are what actually get evaluated.
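
In plain PyTorch terms, that recipe corresponds to a fairly standard fine-tuning loop. The sketch below uses a tiny stand-in model and random data purely so it runs end to end; it is not NeuralBench's training code (which is built on PyTorch-Lightning), and the dimensions and batch size are arbitrary.

# Shared fine-tuning recipe: AdamW, lr 1e-4, weight decay 0.05, cosine
# annealing with 10% linear warmup, up to 50 epochs, early stopping with
# patience 10. Model and data are placeholders for illustration.
import math
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

X = torch.randn(256, 32 * 200)                       # 256 fake trials, flattened
y = torch.randint(0, 2, (256,))
train_loader = DataLoader(TensorDataset(X[:200], y[:200]), batch_size=32, shuffle=True)
val_loader = DataLoader(TensorDataset(X[200:], y[200:]), batch_size=32)

model = nn.Sequential(nn.Linear(32 * 200, 64), nn.ReLU(), nn.Linear(64, 2))
criterion = nn.CrossEntropyLoss()

max_epochs, patience = 50, 10
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.05)

total_steps = max_epochs * len(train_loader)
warmup_steps = int(0.10 * total_steps)               # 10% linear warmup

def lr_lambda(step):
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))  # cosine decay to zero

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

best_val, stale = float("inf"), 0
for epoch in range(max_epochs):
    model.train()
    for xb, yb in train_loader:
        optimizer.zero_grad()
        criterion(model(xb), yb).backward()
        # BENDR exception per the paper: lr 1e-5 and clip_grad_norm_(..., 0.5)
        optimizer.step()
        scheduler.step()
    model.eval()
    with torch.no_grad():
        val_loss = sum(criterion(model(xb), yb).item() for xb, yb in val_loader) / len(val_loader)
    if val_loss < best_val:                          # early stopping bookkeeping
        best_val, stale = val_loss, 0
    else:
        stale += 1
        if stale >= patience:
            break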

Data splitting is handled differently per task type to reflect real-world generalization constraints: predefined splits where the dataset authors provide them, leave-concept-out splits for cognitive decoding tasks (all subjects are seen during training, but a held-out set of stimuli is used for testing), cross-subject splits for most clinical and BCI tasks, and within-subject splits for datasets with very few participants. Each model is trained three times per task with three different random seeds.
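
The cross-subject case is the one most often gotten wrong in EEG evaluation, so here is a minimal illustration using scikit-learn's GroupShuffleSplit with subject IDs as group labels; this is generic scikit-learn with dummy data, not NeuralBench's splitting code.

# Cross-subject split: no subject may appear in both train and test.
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

n_trials = 1000
subjects = np.random.randint(0, 20, size=n_trials)    # dummy subject IDs
X = np.random.randn(n_trials, 32, 256)                # dummy EEG epochs
y = np.random.randint(0, 2, size=n_trials)            # dummy labels

splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=subjects))

# Sanity check: the two sets of subjects are disjoint.
assert set(subjects[train_idx]).isdisjoint(subjects[test_idx])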

Evaluation metrics are standardized by task type: balanced accuracy for binary and multiclass classification, macro F1-score for multilabel classification, Pearson correlation for regression, and top-5 accuracy for retrieval tasks. All results are additionally reported as normalized scores (s̃), where 0 corresponds to dummy-level performance and 1 corresponds to perfect performance, enabling fair cross-task comparisons regardless of metric scale.
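
The normalized score amounts to a linear rescaling between a dummy baseline and perfect performance. A minimal sketch, assuming that simple definition (the paper's exact formulation may differ in detail):

# Normalized score: 0 = dummy-level performance, 1 = perfect performance.
def normalized_score(score, dummy_score, perfect_score=1.0):
    return (score - dummy_score) / (perfect_score - dummy_score)

# Example: balanced accuracy of 0.62 on a balanced binary task (dummy = 0.5)
print(normalized_score(0.62, dummy_score=0.5))   # -> 0.24, still far from ceiling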

One important methodological note: some EEG foundation models were pretrained on datasets that overlap with NeuralBench’s downstream evaluation sets. Rather than discarding these results, the benchmark flags them with hashed bars in result figures so readers can identify potential pretraining data leakage — no strong trend suggesting leakage inflates performance was observed, but the transparency is preserved.

The benchmark offers two variants: NeuralBench-EEG-Core v1.0, which uses a single representative dataset per task for broad coverage, and NeuralBench-EEG-Full v1.0, which expands to up to 24 datasets per task to study within-task variability across recording hardware, labs, and subject populations. A Kendall’s τ of 0.926 (p < 0.001) between Core and Full rankings confirms that the Core variant is a reliable proxy — though a few model positions do shift, including CTNet overtaking LUNA when more datasets are included.
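
Kendall's τ simply measures how consistently two rankings order the same items. A small illustration with scipy (the rankings below are made up, not the benchmark's actual model rankings):

# Rank agreement between two benchmark variants via Kendall's tau.
from scipy.stats import kendalltau

core_rank = [1, 2, 3, 4, 5, 6, 7, 8]   # hypothetical model ranks on Core
full_rank = [1, 2, 4, 3, 5, 6, 7, 8]   # same models on Full, two positions swapped
tau, p_value = kendalltau(core_rank, full_rank)
print(tau, p_value)                    # tau near 1 means the rankings largely agree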

Two Key Findings

Finding 1: Foundation models only marginally outperform task-specific models. The top-ranked models overall are REVE (69.2M parameters, mean normalized rank 0.20), LaBraM (5.8M, rank 0.21), and LUNA (40.4M, rank 0.30). But several task-specific models trained from scratch — CTNet (150K parameters, rank 0.32), SimpleConvTimeAgg (4.2M, rank 0.35), and Deep4Net (146K, rank 0.43) — trail closely behind. CTNet actually overtakes the LUNA foundation model to rank third in the Full variant, despite having roughly 270× fewer parameters. This shows the gap between task-specific and foundation models is narrow enough that expanding dataset coverage alone is sufficient to change global rankings.

Finding 2: Many tasks remain genuinely hard. Cognitive decoding tasks — recovering dense representations of images, speech, sentences, video, or words from brain activity — are particularly challenging, with even the best models scoring well below ceiling. Tasks like mental imagery, sleep arousal, psychopathology decoding, and cross-subject motor imagery and P300 classification frequently yield performance close to dummy level. These tasks represent the best benchmarks for stress-testing the next generation of EEG foundation models.

Tasks approaching saturation include SSVEP classification, pathology detection, seizure detection, sleep stage classification, and phenotyping tasks like age regression and sex classification.

Beyond EEG: MEG and fMRI

Even in this initial EEG-focused release, NeuralBench already supports MEG and fMRI tasks as proof of concept. Notably, the REVE model — pretrained exclusively on EEG data — achieves the best performance among all tested models on the typing decoding task in MEG. This is a striking early signal that EEG-pretrained representations may transfer meaningfully across brain recording modalities, a hypothesis the framework is positioned to rigorously test in future releases.

The infrastructure is explicitly designed for expansion to intracranial EEG (iEEG), functional near-infrared spectroscopy (fNIRS), and electromyography (EMG).

How to Get Started

Installation takes a single command: pip install neuralbench. From there, running the audiovisual stimulus classification task on EEG looks like this:

neuralbench eeg audiovisual_stimulus --download   # Download data
neuralbench eeg audiovisual_stimulus --prepare    # Prepare cache
neuralbench eeg audiovisual_stimulus              # Run the task

To run all 36 tasks against all 14 EEG models, the -m all_classic all_fm flag handles the orchestration. Full benchmark storage requirements are substantial: approximately 11 TB total (~3.2 TB raw data, ~7.8 TB preprocessed cache, ~333 GB logged results), with one GPU of at least 32 GB VRAM per job — though average peak GPU usage measured across experiments is only ~1.3 GB (maximum ~30.3 GB).

The full NeuralBench-EEG-Full v1.0 run requires approximately 1,751 GPU-hours across 4,947 experiments.
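
For context, that averages out to roughly 1,751 / 4,947 ≈ 0.35 GPU-hours, or about 21 minutes, per experiment.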

Key Takeaways

  • Meta AI’s NeuralBench-EEG v1.0 is an open EEG benchmark — 36 tasks, 94 datasets, 9,478 subjects, and 14 deep learning architectures under one standardized interface.
  • Despite up to 270× more parameters, EEG foundation models like REVE only marginally outperform lightweight task-specific models like CTNet (150K params) across the benchmark.
  • Cognitive decoding tasks (speech, video, sentence, word decoding from brain activity) and clinical predictions remain highly challenging, with most models scoring near dummy level.
  • REVE, pretrained only on EEG data, outperformed all models on MEG typing decoding — an early signal of meaningful cross-modality transfer.
  • NeuralBench is MIT-licensed.

Check out the Paper and GitHub Repo.
