Research Data Scientist
Data Science
Barcelona, Spain
About Fundamental
Fundamental is an AI company pioneering the future of enterprise decision-making. Founded by DeepMind alumni, Fundamental has developed NEXUS – the world's most powerful Large Tabular Model (LTM) – purpose-built for the structured records that actually drive enterprise decisions. Backed by world-class investors and trusted by Fortune 100 companies, Fundamental unlocks trillions of dollars of value by giving businesses the Power to Predict.
At Fundamental, you'll work on unprecedented technical challenges in foundation model development and build technology that transforms how the world's largest companies make decisions. This is your opportunity to be part of a category-defining company from the ground up. Join the team defining the future of enterprise AI.
Key responsibilities
As part of the Research team, you will contribute to the development of breakthrough machine learning models by working on one of the most important frontiers in model training and evaluation: high-quality real and synthetic data.
This role is especially focused on synthetic data generation, Structural Causal Models (SCMs), and realistic simulation-based data sources. You will help us design, evaluate, and scale datasets that capture the structure, dependencies, and edge cases needed to train foundation models for enterprise tabular data.
The main responsibilities of this role are:
Identifying, characterizing, and evaluating high-value data sources for training and evaluating ML models, including real-world data, synthetic data, SCM-generated data, and physical or systems-based simulator outputs
Designing and analysing synthetic data generation approaches based on Structural Causal Models, probabilistic models, simulators, and other mechanisms that capture realistic relationships between variables
Working with researchers to define what makes a synthetic dataset useful, realistic, diverse, causally meaningful, and appropriate for model training or evaluation
Building tools and workflows to generate, validate, benchmark, and iterate on synthetic datasets at scale
Developing metrics and evaluation procedures for synthetic data quality
Transforming structured, unstructured, simulated, and causally generated data into formats suitable for training and evaluating large-scale ML models
Collaborating with the research team to maintain a reliable, efficient training pipeline where data quality, data diversity, and synthetic data generation are critical components
Collaborating with the wider engineering and infrastructure team to ensure data generation and processing workflows are scalable, reproducible, and robust
Must have
Experience with:
Synthetic data generation for machine learning, especially for structured or tabular data
Structural Causal Models, causal graphs, causal inference, probabilistic modelling, or simulation-based data generation
Identifying and evaluating high-quality data sources to train and evaluate ML models, including both real-world and realistic synthetic data sources
Bringing data from structured and unstructured sources, simulators, causal models, or generative processes into formats accessible by ML models
Designing quantitative analyses to assess data quality, realism, diversity, bias, coverage, and downstream model performance
Strong fundamentals in:
Statistics, probability, and applied machine learning
Data science workflows, including exploratory analysis, feature understanding, validation, and experimental design
Software engineering for research-grade and production-grade data workflows
Strong knowledge of:
The Python data processing and scientific computing stack, such as NumPy, pandas, SciPy, scikit-learn, or similar tools
Familiarity with:
Causal modelling, graphical models, probabilistic programming, agent-based simulation, discrete-event simulation, or physical / systems-based simulators
Data storage and data versioning solutions
Classical machine learning and deep learning methods, especially outside of purely LLM-based workflows
Nice to have
Contributions to open source ML, causal inference, synthetic data, simulation, or data science projects
BSc, MSc, or PhD in computer science, machine learning, statistics, mathematics, physics, engineering, economics, or another quantitative field
Experience working with tabular data, predictive analytics, or enterprise decision-making systems
Experience building or evaluating synthetic datasets for model training
Experience with SCM libraries, probabilistic programming frameworks, simulation environments, or custom data generation pipelines
Benefits
Competitive compensation with salary and equity
Comprehensive health coverage for you and your dependents
Paid parental leave for all new parents, inclusive of adoptive and surrogate journeys
Relocation support for employees moving to join the team in one of our office locations
A mission-driven, low-ego culture that values diversity of thought, ownership, and bias toward action