Spark is a unified analytics engine for large-scale data processing. It provides high-level APIs in Scala, Java, Python, and R, and an optimised engine that supports general computation graphs for data analysis.
Tech tags:
Related shared contents:
-
tech1 2025-11-12
The article discusses the challenges of processing large datasets using single-node frameworks like Polars, DuckDB, and Daft compared to traditional Spark clusters. It highlights the concept of 'cluster fatigue' and the emotional and financial costs associated with running distributed systems. The author conducts a performance comparison of these frameworks on a 650GB dataset stored in Delta Lake on S3, demonstrating that single-node frameworks can effectively handle large datasets without the need for extensive resources. The findings suggest that modern Lake House architectures can benefit from these lightweight alternatives.
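To make the single-node idea concrete, here is an untested sketch of reading a Delta Lake table on S3 from one machine with Polars (the bucket path and column names are placeholders, not from the article):

```python
# Untested sketch: one-node lazy scan of a Delta table on S3 with Polars
# (assumes `polars` installed with the deltalake extra; path/columns illustrative).
import polars as pl

lf = pl.scan_delta("s3://my-bucket/events")        # lazy scan, no cluster needed
out = (lf.group_by("event_type")                   # aggregation is planned lazily
         .agg(pl.len().alias("n"))
         .collect())                               # executes on the local machine
```

The point of the article is that for many workloads this kind of single-process plan replaces an entire Spark cluster.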
-
tech2 2016-12-05
This article explores the join and aggregation operations in Spark's Catalyst optimization engine. It discusses how Spark generates execution plans for these operations, including SortMergeJoin and HashAggregate, and the underlying mechanisms that ensure efficient data processing. The author highlights the complexities of data shuffling and the importance of distribution and ordering in Spark plans. Overall, the article provides insights into the optimization strategies employed by Spark Catalyst for handling join and aggregation queries.
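The two physical operators the article names can be illustrated with a toy pure-Python sketch; the function names are mine, not Spark APIs, and real SortMergeJoin/HashAggregate operate on shuffled, sorted partitions rather than plain lists:

```python
from collections import defaultdict

def sort_merge_join(left, right):
    """Toy SortMergeJoin: sort both sides by key, then merge with two cursors.
    Each side is a list of (key, value) pairs; returns (key, lval, rval) tuples."""
    left, right = sorted(left), sorted(right)
    out, i = [], 0
    for lk, lv in left:
        # advance the right cursor past keys smaller than the current left key
        while i < len(right) and right[i][0] < lk:
            i += 1
        # emit every right row matching lk without moving the main cursor
        j = i
        while j < len(right) and right[j][0] == lk:
            out.append((lk, lv, right[j][1]))
            j += 1
    return out

def hash_aggregate(rows):
    """Toy HashAggregate: group-by-key sum using an in-memory hash map."""
    acc = defaultdict(int)
    for key, value in rows:
        acc[key] += value
    return dict(acc)
```

In Spark the expensive part is exactly what this toy hides: the shuffle that gives each task rows with the required distribution and ordering before the merge or hash step runs.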
-
tech2 2016-08-16
This article explores the performance benefits of using Spark SQL's Catalyst optimizer, particularly focusing on DataFrame transformations. It discusses the four stages of Catalyst optimization, emphasizing the Physical Plan stage and how caching DataFrames can significantly improve query performance. The author provides insights into the execution plans generated by Spark and the implications of using UnsafeRow for memory management. Ultimately, the article concludes that while simple queries may not benefit from Catalyst optimization without caching, performance can be enhanced when DataFrames are cached.
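The caching effect the article describes can be shown as a short PySpark-flavored sketch (untested; assumes a live SparkSession `spark`):

```python
# Untested sketch: how caching changes the physical plan (assumes `spark` exists).
df = spark.range(10**6).selectExpr("id", "id % 10 AS k")

df.groupBy("k").count().explain()   # plan re-reads and re-shuffles the source

df.cache()                          # mark for in-memory storage (lazy)
df.count()                          # an action materializes the cache
df.groupBy("k").count().explain()   # plan now typically starts from InMemoryTableScan
```

After the cache is materialized, later queries read the stored UnsafeRow-backed blocks instead of recomputing the upstream plan.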
-
tech2 2017-01-21
This article explores how Apache Spark interacts with YARN for resource management in a cluster environment. It details the roles of YARN's components: Resource Manager, Application Master, and Node Manager, and explains the communication process during Spark application execution. The author discusses common exceptions encountered when running Spark on YARN, emphasizing the importance of understanding these interactions for effective troubleshooting. The article serves as a guide for advanced users looking to optimize Spark applications on YARN.
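The moving parts above map onto a spark-submit invocation; all sizes below are illustrative and `my_app.py` is a hypothetical script:

```shell
# Illustrative spark-submit for YARN cluster mode (all values are examples).
# cluster mode: the driver runs inside YARN's ApplicationMaster container;
# executors are containers granted by the ResourceManager and started by NodeManagers.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 4 \
  --executor-memory 4g \
  --executor-cores 2 \
  --conf spark.yarn.am.memory=1g \
  my_app.py
```

Undersizing the executor or ApplicationMaster memory here is a common source of the container-killed exceptions the article walks through.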
-
product 2025-11-20
Google Cloud has announced the general availability of Iceberg REST Catalog support in BigLake metastore, enhancing open data interoperability across various data engines. This fully-managed, serverless metastore allows users to query data using their preferred engines, including Apache Spark and BigQuery, without the need for data duplication. The integration with Dataplex Universal Catalog provides comprehensive governance and lineage capabilities. Organizations like Spotify are already leveraging this technology to build modern lakehouse platforms.
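As a rough idea of what "query with your preferred engine" means on the Spark side, an Iceberg REST catalog is wired up through a few Spark properties (the catalog name and endpoint URI below are placeholders; check the BigLake documentation for the real endpoint):

```properties
# Illustrative Spark settings for an Iceberg REST catalog
# (catalog name "lakehouse" and the uri are placeholders).
spark.sql.catalog.lakehouse=org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.lakehouse.type=rest
spark.sql.catalog.lakehouse.uri=https://example.googleapis.com/iceberg/rest
```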
-
product 2025-07-23
Is Lightning Engine open source?
-
product 2025-06-24
I hadn't heard of the "OpenLineage standard" before; I guess DataHub should be able to support it as well.
-
poc 2025-02-28
-
project 2025-01-09
Very nice! Uber runs Ray instances inside Spark executors. This setup lets each Spark task spawn Ray workers for parallel computation, which boosts performance significantly.
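The pattern might look roughly like this untested sketch (assumes `pyspark` and `ray` are installed on the executors; `expensive_transform` and `df` are hypothetical):

```python
# Untested sketch of the Ray-inside-Spark-executor pattern.
def process_partition(rows):
    import ray
    ray.init(ignore_reinit_error=True)   # one local Ray runtime per executor process

    @ray.remote
    def work(row):
        return expensive_transform(row)  # hypothetical CPU-heavy per-row function

    futures = [work.remote(r) for r in rows]
    yield from ray.get(futures)          # Ray fans the rows out across local workers

results = df.rdd.mapPartitions(process_partition).collect()
```

Spark handles data distribution and scheduling across the cluster, while Ray parallelizes the work within each partition.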
-
project 2025-01-28
Running local-mode Spark in Kubernetes pods to process small files as they arrive; this mode is more efficient than running a big Spark cluster to process a huge number of files in batch.
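A minimal sketch of such a per-pod job (values are examples and `process_small_files.py` is a hypothetical script):

```shell
# Illustrative local-mode submit for one small batch of files.
# No cluster manager, no executor containers: driver and workers share one JVM,
# so per-pod startup cost stays small.
spark-submit \
  --master "local[2]" \
  --conf spark.driver.memory=2g \
  process_small_files.py
```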
-
project 2025-01-27
The State Reader API enables users to access and analyze Structured Streaming's internal state data. Readers will learn how to leverage the new features to debug, troubleshoot, and analyze state changes efficiently, making streaming workloads easier to manage at scale.
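An untested sketch of the API (Spark 4.0+ / recent Databricks runtimes; assumes a live `spark` session, and the checkpoint path is a placeholder):

```python
# Untested sketch of the State Reader API.
ckpt = "/checkpoints/my_stream"   # placeholder checkpoint location

# Overview of the stateful operators and state stores in the checkpoint
spark.read.format("state-metadata").load(ckpt).show()

# Row-level key/value state of the stateful operator
spark.read.format("statestore").load(ckpt).show()
```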
-
project 2024-12-23
JD.com has developed a comprehensive big data governance framework to manage its extensive data infrastructure, which includes thousands of servers, exabytes of storage, and millions of data models and tasks. The governance strategy focuses on cost reduction, stability, security, and data quality. Key initiatives involve the implementation of audit logs, full-link data lineage, and automated governance platforms. These efforts aim to enhance data management efficiency, ensure data security, and optimize resource utilization across the organization.
-
product 2025-01-16
-
spike 2024-12-03
Leveraging Iceberg tables: data is partitioned and stored in a way that aligns with the join keys, enabling highly efficient joins with minimal data movement for Spark jobs.
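One way to set this up (table and catalog names are illustrative): bucket both tables on the join key so Spark's storage-partitioned join can skip the shuffle entirely:

```sql
-- Sketch: bucket both tables on the join key (names are illustrative).
CREATE TABLE cat.db.orders (order_id BIGINT, customer_id BIGINT, amount DOUBLE)
USING iceberg PARTITIONED BY (bucket(16, customer_id));

CREATE TABLE cat.db.customers (customer_id BIGINT, name STRING)
USING iceberg PARTITIONED BY (bucket(16, customer_id));

-- Enable storage-partitioned joins (Spark 3.3+ with Iceberg)
SET spark.sql.sources.v2.bucketing.enabled = true;
```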
-
project 2024-11-22
Improving data processing efficiency by implementing Apache Iceberg's base-2 file layout for S3.
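This layout is switched on with an Iceberg table property (the table name below is illustrative): written data files get a hashed prefix in their S3 keys, which spreads objects across key ranges and avoids request throttling on hot prefixes:

```sql
-- Sketch: enable Iceberg's hashed object-storage file layout (name illustrative).
ALTER TABLE cat.db.events SET TBLPROPERTIES (
  'write.object-storage.enabled' = 'true'
);
```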
-
project 2024-11-06
-