Apache Spark

Spark is a unified analytics engine for large-scale data processing. It provides high-level APIs in Scala, Java, Python, and R, and an optimised engine that supports general computation graphs for data analysis.

Web site

GitHub repository

Tech tags:

Related shared contents:

  • project
    2026-02-01

    The article discusses Spotify's innovative multi-agent architecture designed to enhance its advertising platform. By addressing the fragmented decision-making processes across various advertising channels, the architecture aims to unify workflows and optimize campaign management through specialized AI agents. This approach allows for more efficient budget allocation, audience targeting, and overall campaign performance, leveraging historical data and machine learning. The article highlights the importance of a programmable decision layer and the challenges faced in implementing this system.

  • tech1
    2026-01-14

    The article discusses how Slack developed a comprehensive metrics framework to enhance the performance and cost-efficiency of their Apache Spark jobs on Amazon EMR. By integrating generative AI and custom monitoring tools, they achieved significant improvements in job completion times and cost reductions. The framework captures over 40 metrics, providing granular insights into application behavior and resource usage. The article outlines the architecture of their monitoring solution and the benefits of AI-assisted tuning for Spark operations.

  • tech1
    2025-11-12

    The article discusses the challenges of processing large datasets using single-node frameworks like Polars, DuckDB, and Daft compared to traditional Spark clusters. It highlights the concept of 'cluster fatigue' and the emotional and financial costs associated with running distributed systems. The author conducts a performance comparison of these frameworks on a 650GB dataset stored in Delta Lake on S3, demonstrating that single-node frameworks can effectively handle large datasets without the need for extensive resources. The findings suggest that modern Lake House architectures can benefit from these lightweight alternatives.

  • tech2
    2016-12-05

    This article explores the join and aggregation operations in Spark's Catalyst optimization engine. It discusses how Spark generates execution plans for these operations, including SortMergeJoin and HashAggregate, and the underlying mechanisms that ensure efficient data processing. The author highlights the complexities of data shuffling and the importance of distribution and ordering in Spark plans. Overall, the article provides insights into the optimization strategies employed by Spark Catalyst for handling join and aggregation queries.
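
    The SortMergeJoin plan the article analyzes can be illustrated with a toy pure-Python version (a sketch of the algorithm's idea, not Spark's implementation): both inputs are sorted on the join key, then merged with two cursors, which is why Catalyst demands a matching distribution and ordering on both sides before this operator runs.

```python
def sort_merge_join(left, right):
    """Toy sort-merge equi-join over (key, value) pairs.
    Both sides are sorted on the key, then merged with two cursors;
    matching key groups emit their cross product."""
    left, right = sorted(left), sorted(right)
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Emit the cross product of the two groups sharing this key.
            i2 = i
            j2 = j
            while i2 < len(left) and left[i2][0] == lk:
                j2 = j
                while j2 < len(right) and right[j2][0] == lk:
                    out.append((lk, left[i2][1], right[j2][1]))
                    j2 += 1
                i2 += 1
            i, j = i2, j2
    return out
```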

  • tech2
    2016-08-16

    This article explores the performance benefits of using Spark SQL's Catalyst optimizer, particularly focusing on DataFrame transformations. It discusses the four stages of Catalyst optimization, emphasizing the Physical Plan stage and how caching DataFrames can significantly improve query performance. The author provides insights into the execution plans generated by Spark and the implications of using UnsafeRow for memory management. Ultimately, the article concludes that while simple queries may not benefit from Catalyst optimization without caching, performance can be enhanced when DataFrames are cached.
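
    The recomputation point is the crux: without caching, every action re-executes the plan from the source. A minimal pure-Python stand-in (not Spark's API) makes the cost model visible by counting how often the transformation actually runs:

```python
class LazyData:
    """Tiny stand-in for a lazy DataFrame: the transform re-runs on every
    action unless the result was cached, mirroring Spark's recomputation."""
    def __init__(self, source, transform):
        self.source = source
        self.transform = transform
        self.compute_count = 0      # times the transform actually executed
        self._cached = None

    def cache(self):
        self._cached = [self.transform(x) for x in self.source]
        self.compute_count += 1
        return self

    def collect(self):              # an "action"
        if self._cached is not None:
            return list(self._cached)
        self.compute_count += 1
        return [self.transform(x) for x in self.source]

df = LazyData(range(5), lambda x: x * x)
df.collect(); df.collect()          # uncached: transform runs twice
uncached_runs = df.compute_count
df.cache()
df.collect(); df.collect()          # cached: no further recomputation
```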

  • tech2
    2017-01-21

    This article explores how Apache Spark interacts with YARN for resource management in a cluster environment. It details the roles of YARN's components: Resource Manager, Application Master, and Node Manager, and explains the communication process during Spark application execution. The author discusses common exceptions encountered when running Spark on YARN, emphasizing the importance of understanding these interactions for effective troubleshooting. The article serves as a guide for advanced users looking to optimize Spark applications on YARN.
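
    A common class of the exceptions the article covers is YARN's Node Manager killing containers that exceed their allocation ("Container killed by YARN for exceeding memory limits"). The settings usually involved are sketched below; the property names are standard Spark-on-YARN configuration keys, but the values are purely illustrative and must be tuned per cluster:

```python
# Illustrative values only; the keys are standard Spark-on-YARN settings.
spark_yarn_conf = {
    "spark.master": "yarn",
    "spark.submit.deployMode": "cluster",   # driver runs inside the YARN Application Master
    "spark.executor.instances": "10",       # containers requested from the Resource Manager
    "spark.executor.memory": "4g",          # JVM heap per executor container
    "spark.executor.memoryOverhead": "1g",  # off-heap headroom; too small -> container killed by YARN
    "spark.yarn.am.memory": "1g",           # Application Master container size (client mode)
    "spark.yarn.queue": "default",          # YARN scheduler queue to submit into
}
```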

  • product
    2025-11-20

    Google Cloud has announced the general availability of Iceberg REST Catalog support in BigLake metastore, enhancing open data interoperability across various data engines. This fully-managed, serverless metastore allows users to query data using their preferred engines, including Apache Spark and BigQuery, without the need for data duplication. The integration with Dataplex Universal Catalog provides comprehensive governance and lineage capabilities. Organizations like Spotify are already leveraging this technology to build modern lakehouse platforms.

  • product
    2025-07-23

    Is Lightning Engine open source?

  • product
    2025-06-24

    I hadn't heard of the "OpenLineage standard" before; I'd guess DataHub should be able to support it as well.

  • poc
    2025-02-28
  • project
    2025-01-09

    Very nice! Uber runs Ray instances inside Spark executors. This setup lets each Spark task spawn Ray workers for parallel computation, which boosts performance significantly.
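
    The pattern here, an outer distributed task fanning work out to local workers, can be sketched in pure Python. Threads stand in for Ray workers and a plain function for a Spark task; none of the real Uber, Ray, or Spark APIs are used:

```python
from concurrent.futures import ThreadPoolExecutor

def spark_task(partition):
    """Stand-in for one Spark task: fan this partition's records out to a
    local worker pool, the way Ray workers run inside an executor."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(lambda x: x * x, partition))

# The outer loop stands in for Spark scheduling one task per partition.
partitions = [[1, 2, 3], [4, 5], [6]]
results = [spark_task(p) for p in partitions]
```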

  • project
    2025-01-28

    Running local-mode Spark clusters in Kubernetes pods to process small files as they arrive; this mode is more efficient than running a big Spark cluster to process a huge number of files in batch.

  • project
    2025-01-27

    The State Reader API enables users to access and analyze Structured Streaming's internal state data. Readers will learn how to leverage the new features to debug, troubleshoot, and analyze state changes efficiently, making streaming workloads easier to manage at scale.

  • project
    2024-12-23

    JD.com has developed a comprehensive big data governance framework to manage its extensive data infrastructure, which includes thousands of servers, exabytes of storage, and millions of data models and tasks. The governance strategy focuses on cost reduction, stability, security, and data quality. Key initiatives involve the implementation of audit logs, full-link data lineage, and automated governance platforms. These efforts aim to enhance data management efficiency, ensure data security, and optimize resource utilization across the organization.

  • product
    2025-01-16
  • spike
    2024-12-03

    By leveraging Iceberg tables, data is partitioned and stored in a way that aligns with the join keys, enabling highly efficient joins with minimal data movement for Spark jobs.
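
    The effect being described, both tables bucketed identically on the join key so bucket i of one side only ever needs bucket i of the other, can be sketched in pure Python (a toy, not Iceberg's or Spark's implementation):

```python
def bucket(rows, n):
    """Hash-partition (key, value) rows into n buckets by join key."""
    buckets = [[] for _ in range(n)]
    for key, value in rows:
        buckets[hash(key) % n].append((key, value))
    return buckets

def partition_aligned_join(left, right, n=4):
    """Because both sides use the same bucketing, each left bucket joins
    only against the matching right bucket: no cross-bucket shuffle."""
    out = []
    for lb, rb in zip(bucket(left, n), bucket(right, n)):
        lookup = {}
        for key, value in rb:
            lookup.setdefault(key, []).append(value)
        for key, value in lb:
            for rvalue in lookup.get(key, []):
                out.append((key, value, rvalue))
    return out
```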

  • project
    2024-11-22

    Improving data processing efficiency by adopting Apache Iceberg's base-2 file layout for S3.
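
    The idea behind the base-2 layout is to prepend a binary hash string to each object key so writes and reads spread across many S3 key prefixes instead of hot-spotting one. A toy sketch of that idea (Iceberg's actual mechanism is its object-storage location provider; details differ):

```python
import hashlib

def base2_prefixed_key(path, bits=20):
    """Derive a deterministic binary-string prefix from the file path so
    object keys fan out across many S3 prefixes (toy illustration only)."""
    digest = hashlib.sha256(path.encode()).digest()
    value = int.from_bytes(digest[:4], "big") >> (32 - bits)
    prefix = format(value, f"0{bits}b")   # e.g. "01011001110001010110"
    return f"{prefix}/{path}"
```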

  • project
    2024-11-06

In production at:

Airbnb, Uber, Yelp, Funding Circle