Data Engineer - San Francisco
1 day ago

Job description
About Aldea
Aldea is a multi-modal foundational AI company reimagining the scaling laws of intelligence. We believe today's architectures create unnecessary bottlenecks for the evolution of software.
Our mission is to build the next generation of foundational models that power a more expressive, contextual, and intelligent human–machine interface.
The Role
We are hiring a Data Engineer to build the data infrastructure that powers Aldea's multi-modal AI research.
You will design and scale data pipelines for pretraining, midtraining, and post-training at trillion-token scale, process diverse data sources across language and speech domains, and generate high-quality synthetic data for model training.
This is a high-impact role where your work directly determines training quality and efficiency. If you're passionate about building data systems that power cutting-edge AI research, this role is for you.
What You'll Do
Build and scale data pipelines for pretraining, midtraining, and post-training at trillion+ token scale across language and speech domains
Process and curate large-scale datasets including cleaning, deduplication, quality filtering, and optimization for distributed training
Generate synthetic data for model training and evaluation across diverse tasks and domains
Design efficient data loading systems achieving high throughput across multi-node training clusters
Build data versioning and reproducibility systems to track dataset compositions and enable reproducible experiments
Collaborate with ML engineers and researchers to optimize pipelines and improve data quality
Minimum Qualifications
Bachelor's degree in Computer Science, Engineering, or related field, or equivalent practical experience
3+ years of experience building large-scale data pipelines for machine learning or data-intensive applications
Strong programming skills in Python and experience with data processing frameworks (Spark, Dask, Ray, or similar)
Experience with data quality techniques including deduplication, filtering, and validation at scale
Proven ability to optimize data pipelines for performance and throughput in distributed systems
Experience working with large datasets (100GB-10TB+) and understanding of storage systems and data formats
Preferred Qualifications
Experience building data pipelines for LLM pretraining or large-scale ML training
Hands-on experience with synthetic data generation for language or speech models
Experience with text processing at scale: tokenization, deduplication (MinHash, LSH), and quality assessment
Familiarity with audio/speech data processing and dataset curation
Knowledge of data contamination detection and dataset versioning best practices
Experience optimizing data loaders for PyTorch or TensorFlow at scale
Understanding of distributed storage systems (S3, GCS, HDFS) and data streaming patterns
Compensation & Benefits
Competitive base salary
Performance-based bonus aligned with research and model milestones
Equity participation
Comprehensive health, dental, and vision coverage
Flexible paid time off
Aldea is proud to be an equal-opportunity employer. We are committed to building a diverse and inclusive culture that celebrates authenticity to win as one.
We do not discriminate on the basis of race, religion, color, national origin, gender, gender identity, sexual orientation, age, marital status, disability, protected veteran status, citizenship or immigration status, or any other legally protected characteristics.
Aldea uses E-Verify to confirm employment eligibility in compliance with federal law. For more information please visit:
https://www.e-
Please note: We do not accept unsolicited resumes from recruiters or employment agencies and will not be responsible for any fees related to unsolicited resumes.