ETH Zürich
Senior Storage & Data Engineer
📍 Lugano
Role and Responsibilities
Bridge ingestion and use. Design the pipelines and metadata that turn ingested data into something findable and consumable: catalogs, schemas, and access layers that match how training jobs and simulations actually read, not just where bytes sit.
Make data traceable. Build lineage and provenance so any dataset, checkpoint, or result can be traced back to its inputs and transformations. Reproducibility is a first-class requirement here, not a retrofit.
Tune for the workload. Optimise parallel filesystems (Lustre, GPFS) and object storage for the concurrency, small-file, and large-checkpoint patterns of distributed GPU training and HPC simulation.
Operate at scale, safely. Design and run multi-petabyte storage with the integrity and availability scientific work depends on: erasure coding, redundancy, hot-to-archival tiering.
Automate everything. Deploy and scale storage and data services as code. Hand-configured, one-off "snowflake" infrastructure doesn't survive at this scale.
Make it observable. Instrument storage health, capacity trends, and pipeline performance so problems surface before users feel them.
Translate. Turn real access patterns from domain scientists and ML engineers into technical requirements, and push back when a request would quietly break something downstream.
Team / Description
The Swiss National Supercomputing Centre (CSCS) develops and operates a high-performance computing and data research infrastructure that supports world-class science in Switzerland. Its user laboratory is available to domestic and international researchers in academia, industry, and the business sector. The centre is operated by ETH Zurich and has offices at its data centre in Lugano and in Zurich.
Qualifications and Skills
A technical degree (CS, engineering) or equivalent experience that demonstrates the same depth.
Solid storage grounding: file, block, and object storage; performance tuning; redundancy (RAID, erasure coding).
Python, and comfort automating infrastructure (Ansible, Terraform, or similar).
A working understanding of how ML and scientific workloads consume data — billions of small files, large checkpoints, sharding — and why naive layouts fall over.
A point of view on data lineage, provenance, or reproducibility — and ideally tooling you've used to enforce it.