ML Researcher - Evaluations
Location
Barcelona, Spain
Employment Type
Full time
Location Type
On-site
Department
Research
About Fundamental
Fundamental is an AI company pioneering the future of enterprise decision-making. Founded by DeepMind alumni, Fundamental has developed NEXUS – the world's most powerful Large Tabular Model (LTM) – purpose-built for the structured records that actually drive enterprise decisions. Backed by world-class investors and trusted by Fortune 100 companies, Fundamental unlocks trillions of dollars of value by giving businesses the Power to Predict.
At Fundamental, you'll work on unprecedented technical challenges in foundation model development and build technology that transforms how the world's largest companies make decisions. This is your opportunity to be part of a category-defining company from the ground up. Join the team defining the future of enterprise AI.
We are looking for a Machine Learning Researcher - Evaluations to establish the ground truth for what our models can actually do. In this role, you will take ambiguous, real-world data challenges and translate them into concrete, defensible metrics that our researchers and leadership can trust.
Evaluation is not an afterthought here; it is the engine that drives our research roadmap. Working alongside our core researchers, you will be embedded in the entire lifecycle of model development. This means taking signals from our internal deployment teams to define what matters, tracking performance across live training runs, and interpreting the final results. If you are obsessed with empirically measuring exactly why a model fails and where it excels, this role is for you.
Key responsibilities
Develop Signal-Driven Evals: Design and implement rigorous evaluation frameworks. You will translate real-world requirements gathered internally into measurable metrics that accurately reflect downstream use cases.
Own the Evaluation Infrastructure: Build, scale, and maintain the internal Python pipelines and datasets used to stress-test our models on a day-to-day basis.
Explore External Benchmarks: Scout the industry for new, relevant external benchmarks for tabular data. You will evaluate our models against these public benchmarks and maintain those pipelines.
Maintain Competitive Baselines: The AI landscape moves fast. You will monitor external foundation models and classical ML baselines, integrating and updating them within our system so we always know exactly how we stack up against the state of the art.
Manage the Internal Leaderboard: Create and maintain a comprehensive leaderboard and characterization of our models. You will be responsible for reporting back to the research team exactly where our models are excelling and where they are falling short.
Must have
Proven experience in Machine Learning, Data Science, or AI Engineering, with a strong focus on model evaluation, testing, or benchmarking.
Strong programming skills in Python and relevant libraries such as pandas.
A solid understanding of traditional ML metrics as well as emerging methods for evaluating foundation model outputs.
Experience building and maintaining automated testing pipelines or evaluation harnesses.
Excellent internal communication skills. You need to be comfortable telling the research team hard truths about model regressions, and adept at translating field requirements into technical metrics.
Experience translating real-world problems into quantifiable metrics.
Nice to have
Experience with tabular data or time series forecasting.
Benefits
Competitive compensation with salary and equity
Comprehensive health coverage, including medical, dental, vision, and a 401(k)
Paid parental leave for all new parents, inclusive of adoptive and surrogate journeys
Relocation support for employees moving to join the team in one of our office locations
A mission-driven, low-ego culture that values diversity of thought, ownership, and bias toward action