Delta Lake: The Definitive Guide: Modern Data Lakehouse Architectures with Data Lakes 1st Edition
Ready to simplify the process of building data lakehouses and data pipelines at scale? In this practical guide, learn how Delta Lake is helping data engineers, data scientists, and data analysts overcome key data reliability challenges with modern data engineering and management techniques.
Authors Denny Lee, Tristen Wentling, Scott Haines, and Prashanth Babu (with contributions from Delta Lake maintainer R. Tyler Croy) share expert insights on all things Delta Lake--including how to run batch and streaming jobs concurrently and accelerate the usability of your data. You'll also uncover how ACID transactions bring reliability to data lakehouses at scale.
This book helps you:
- Understand key data reliability challenges and how Delta Lake solves them
- Explain the critical role of Delta transaction logs as a single source of truth
- Learn the Delta Lake ecosystem with technologies like Apache Flink, Kafka, and Trino
- Architect data lakehouses with the medallion architecture
- Optimize Delta Lake performance with features like deletion vectors and liquid clustering
- ISBN-10: 1098151941
- ISBN-13: 978-1098151942
- Edition: 1st
- Publisher: O'Reilly Media
- Publication date: December 10, 2024
- Language: English
- Dimensions: 7 x 0.79 x 9.19 inches
- Print length: 380 pages
Customers also bought or read
- Building Medallion Architectures: Designing with Delta Lake and Spark (Paperback, $39.49)
- Apache Iceberg: The Definitive Guide: Data Lakehouse Functionality, Performance, and Scalability on the Data Lake (Paperback, $44.94)
- Practical Lakehouse Architecture: Designing and Implementing Modern Data Platforms at Scale (Paperback, $45.39)
- Fundamentals of Data Engineering: Plan and Build Robust Data Systems (Paperback, $40.99)
- Spark: The Definitive Guide: Big Data Processing Made Simple (Paperback, $50.97)
- Data Engineering Design Patterns: Recipes for Solving the Most Common Data Engineering Problems (Paperback, $57.96)
- Databricks Certified Data Engineer Associate Study Guide: In-Depth Guidance and Practice (Paperback, $59.51)
- Data Governance with Unity Catalog on Databricks: Implement Data and AI Governance with Databricks Data Intelligence Platform (Paperback, $56.30)
- Deciphering Data Architectures: Choosing Between a Modern Data Warehouse, Data Fabric, Data Lakehouse, and Data Mesh (Paperback, $50.99)
- Data Engineering with Databricks Cookbook: Build effective data and AI solutions using Apache Spark, Databricks, and Delta Lake (Paperback, $35.99)
- Delta Lake: Up and Running: Modern Data Lakehouse Architectures with Delta Lake (Paperback, $34.74)
- Kafka: The Definitive Guide: Real-Time Data and Stream Processing at Scale (Paperback, $43.99)
- The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling (Paperback, $33.75; #1 Best Seller in Database Storage & Design)
- Snowflake: The Definitive Guide: Architecting, Designing, and Deploying on the Snowflake Data Cloud (Paperback, $52.34)
- Database Internals: A Deep Dive into How Distributed Data Systems Work (Paperback, $36.33)
- Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems (Paperback, $59.99; #1 Best Seller in MySQL Guides)
- Stream Processing with Apache Spark: Mastering Structured Streaming and Spark Streaming (Paperback, $51.12)
- Data Governance: The Definitive Guide: People, Processes, and Tools to Operationalize Data Trustworthiness (Paperback, $45.99)
- The Enterprise Data Catalog: Improve Data Discovery, Ensure Data Governance, and Enable Innovation (Paperback, $41.13)
- High Performance Spark: Best Practices for Scaling and Optimizing Apache Spark (Paperback, $35.00)
- Implementing Data Mesh: Design, Build, and Implement Data Contracts, Data Products, and Data Mesh (Paperback, $45.20)
- Data Contracts: Developing Production-Grade Pipelines at Scale (Paperback, $62.35)
- Deep Learning (Adaptive Computation and Machine Learning series) (Hardcover, $61.00)
- Data Pipelines Pocket Reference: Moving and Processing Data for Analytics (Paperback, $16.93)
- Data Governance Handbook: A practical approach to building trust in data (Paperback, $39.99)
- Serverless ETL and Analytics with AWS Glue: Your comprehensive reference guide to learning about AWS Glue and its features (Paperback, $41.99)
- Apache Airflow Best Practices: A practical guide to orchestrating data workflow with Apache Airflow (Paperback, $44.99)
From the brand
Sharing the knowledge of experts
O'Reilly's mission is to change the world by sharing the knowledge of innovators. For over 40 years, we've inspired companies and individuals to do new things (and do them better) by providing the skills and understanding that are necessary for success.
Our customers are hungry to build the innovations that propel the world forward. And we help them do just that.
From the Publisher
From the Preface
Welcome to Delta Lake: The Definitive Guide! Since it became an open source project in 2019, Delta Lake has revolutionized how organizations manage and process their data. Designed to bring reliability, performance, and scalability to data lakes, Delta Lake addresses many of the inherent challenges traditional data lake architectures face.
Over the past five years, Delta Lake has undergone significant transformation. Originally focused on enhancing Apache Spark, Delta Lake now boasts a rich ecosystem with integrations across various platforms, including Apache Flink, Trino, and many more. This evolution has enabled Delta Lake to become a versatile and integral component of modern data engineering and data science workflows.
Who This Book Is For
As a team of production users and maintainers of the Delta Lake project, we’re thrilled to share our collective knowledge and experience with you. Our journey with Delta Lake spans from small-scale implementations to internet-scale production lakehouses, giving us a unique perspective on its capabilities and how to work around any complexities.
The primary goal of this book is to provide a comprehensive resource for both newcomers and experts in data lakehouse architectures. For those just starting with Delta Lake, we aim to elucidate its core principles and help you avoid the common mistakes we encountered in our early days. If you’re already well versed in Delta Lake, you’ll find valuable insights into the underlying codebase, advanced features, and optimization techniques to enhance your lakehouse environment.
Throughout these pages, we celebrate the vibrant Delta Lake community and its collaborative spirit! We’re particularly proud to highlight the development of the Delta Rust API and its widely adopted Python bindings, which exemplify the community’s innovative approach to expanding Delta Lake’s capabilities. Delta Lake has evolved significantly since its inception, growing beyond its initial focus on Apache Spark to embrace a wide array of integrations with multiple languages and frameworks. To reflect this diversity, we’ve included code examples featuring Flink, Kafka, Python, Rust, Spark, Trino, and more. This broad coverage ensures that you’ll find relevant examples regardless of your preferred tools and languages.
While we cover the fundamental concepts, we’ve also included our personal experiences and lessons learned. More importantly, we go beyond theory to offer practical guidance on running a production lakehouse successfully. We’ve included best practices, optimization techniques, and real-world scenarios to help you navigate the challenges of implementing and maintaining a Delta Lake–based system at scale.
Whether you’re a data engineer, architect, or scientist, our goal is to equip you with the knowledge and tools to leverage Delta Lake effectively in your data projects. We hope this guide serves as your companion in building robust, efficient, and scalable lakehouse architectures.
Editorial Reviews
About the Author
Tristen Wentling works in machine learning, data engineering, and statistical analysis using Python, Apache Spark, and Scala. He is a machine learning advocate who loves the flexibility of neural networks. Tristen holds an M.S. in Mathematics and a B.S. in Applied Mathematics.
Scott Haines is a Databricks Beacon and has been working with data systems and distributed systems and architectures for over 15 years. He recently wrote a book encapsulating his journey called Modern Data Engineering with Apache Spark: A Hands-on guide for building mission-critical streaming applications. He enjoys teaching people how to simplify data systems and data-intensive services and takes to the snow in the winter to pursue his love of snowboarding.
Prashanth Babu is a Databricks Certified Developer who helps guide the design and implementation of customer use cases by building out reference architectures, best practices, frameworks, MVPs, and prototypes that enable customers to turn their data into value.
Product details
- Publisher : O'Reilly Media
- Publication date : December 10, 2024
- Edition : 1st
- Language : English
- Print length : 380 pages
- ISBN-10 : 1098151941
- ISBN-13 : 978-1098151942
- Item Weight : 1.43 pounds
- Dimensions : 7 x 0.79 x 9.19 inches
- Best Sellers Rank: #988,460 in Books
- #140 in Data Warehousing (Books)
- #304 in Data Modeling & Design (Books)
- #390 in Data Processing
About the authors

Denny Lee is a long-time Apache Spark™ and MLflow contributor, Delta Lake maintainer, and a Sr. Staff Developer Advocate at Databricks. He is a hands-on distributed systems and data sciences engineer with extensive experience developing internet-scale data platforms and predictive analytics systems. He previously built enterprise DW/BI and big data systems at Microsoft, including Azure Cosmos DB, Project Isotope (HDInsight), and SQL Server, and was the Senior Director of Data Sciences Engineering at SAP Concur. He holds a Master's in Biomedical Informatics from Oregon Health and Sciences University and has implemented powerful data solutions for enterprise healthcare customers. His current technical focuses include distributed systems, Delta Lake, Apache Spark, deep learning, machine learning, and genomics.

Scott Haines is a seasoned software engineer with over 20 years of experience. He has worn many hats during his career, across the entire software stack, from front to back. He has worked for a wide variety of companies, from startups to global corporations, and across various industries, from video and telecommunications to news, sports, and gaming, as well as data, insights, and analytics. He has held positions at notable companies, including Hitachi Data Systems, Convo Communications, Yahoo!, and Twilio, and he joined Nike in early 2022. Scott has enjoyed working on distributed systems, real-time communications platforms, and enterprise-scale data platforms for over a decade and was foundational in helping to drive Apache Spark adoption for stream processing at Twilio. He is an active member of the Apache Spark community, a Databricks Beacon, and speaks regularly at conferences like the Data+AI Summit, Open Data Science Conference, and others.

Tristen Wentling is a Solutions Architect at Databricks, where he works with customers in the retail industry. Formerly a data scientist, he has also authored several blog posts covering topics like best practices for production stream applications and building generative AI applications for e-commerce. Outside of technical work, Tristen spends a great deal of free time reading or heading to the beach. Tristen holds an M.S. in Mathematics and a B.S. in Applied Mathematics and lives in Clearwater, FL.
Customer reviews
- 5 star: 82%
- 4 star: 0%
- 3 star: 18%
- 2 star: 0%
- 1 star: 0%
Top reviews from the United States
- 5 out of 5 stars
Learn the Ins and Outs of Delta Lake
Reviewed in the United States on February 20, 2025
For anyone wanting to kick-start their Delta Lake journey, this book is a must-have. Not only are you presented with the whys and hows of using this novel open table format, but you'll also learn a lot from the authors' own stories. Lastly, there is a rich set of accompanying code written in PySpark, Python, and even Scala and Rust, along with virtual environments running JupyterLab for some of the chapters.
- 5 out of 5 stars
Good book on Data Architecture on Modern Data Lakes
Reviewed in the United States on February 13, 2025
Good book on Data Architecture on Modern Data Lakes.
One person found this helpful













