Apache Spark

Spark is a unified analytics engine for large-scale data processing. It provides high-level APIs in Scala, Java, Python, and R, and an optimised engine that supports general computation graphs for data analysis.

Website

GitHub repository

Tech tags:

Related shared contents:

  • tech2
    2026-01-21

    The article discusses LinkedIn's transformation of its search technology stack, focusing on the integration of large language models (LLMs) to enhance search experiences. It details the challenges and innovations involved in deploying LLMs at scale, including query understanding, semantic retrieval, and ranking processes. The use of AI-driven job and people search features aims to provide more relevant and personalized results. Additionally, the article highlights the importance of continuous relevance measurement and quality evaluation in maintaining a high-quality search experience.

  • project
    2026-03-06

    This article discusses Pinterest's evolution from basic Text-to-SQL systems to a sophisticated Analytics Agent that leverages unified context-intent embeddings for improved SQL generation and table discovery. The system addresses the challenges of understanding analytical intent and provides a structured approach to data governance and documentation. By encoding historical query patterns and utilizing AI-generated documentation, the agent enhances the efficiency and reliability of data analytics at Pinterest. The article outlines the architecture and operational principles behind the agent's design, emphasizing the importance of context and governance in AI-driven analytics.

  • project
    2023-10-01

    The article discusses the development and implementation of Spot Balancer, a tool created by Notion in collaboration with AWS, which optimizes the use of Spark on Kubernetes by balancing cost and reliability. It highlights the challenges faced when using Spot Instances for Spark jobs and how Spot Balancer allows for better control over executor placement to prevent job failures. The article outlines the transition from Amazon EMR to EMR on EKS and the benefits of dynamic provisioning and efficient resource management. Ultimately, the tool has helped Notion reduce Spark compute costs by 60-90% without sacrificing reliability.

  • tech1
    2026-02-23

    The article introduces the Native Execution Engine for Microsoft Fabric, designed to enhance Apache Spark's performance without requiring code changes. It explains the challenges faced by traditional Spark execution due to increasing data volumes and real-time processing demands. The Native Execution Engine leverages C++ and vectorized execution to optimize Spark workloads, particularly for columnar data formats like Parquet and Delta Lake. The integration of open-source technologies Velox and Apache Gluten is highlighted, showcasing significant performance improvements and cost savings for users.

  • project
    2026-02-01

    The article discusses Spotify's innovative multi-agent architecture designed to enhance its advertising platform. By addressing the fragmented decision-making processes across various advertising channels, the architecture aims to unify workflows and optimize campaign management through specialized AI agents. This approach allows for more efficient budget allocation, audience targeting, and overall campaign performance, leveraging historical data and machine learning. The article highlights the importance of a programmable decision layer and the challenges faced in implementing this system.

  • tech1
    2026-01-14

    The article discusses how Slack developed a comprehensive metrics framework to enhance the performance and cost-efficiency of their Apache Spark jobs on Amazon EMR. By integrating generative AI and custom monitoring tools, they achieved significant improvements in job completion times and cost reductions. The framework captures over 40 metrics, providing granular insights into application behavior and resource usage. The article outlines the architecture of their monitoring solution and the benefits of AI-assisted tuning for Spark operations.

  • tech1
    2025-11-12

    The article discusses the challenges of processing large datasets using single-node frameworks like Polars, DuckDB, and Daft compared to traditional Spark clusters. It highlights the concept of 'cluster fatigue' and the emotional and financial costs associated with running distributed systems. The author conducts a performance comparison of these frameworks on a 650GB dataset stored in Delta Lake on S3, demonstrating that single-node frameworks can effectively handle large datasets without the need for extensive resources. The findings suggest that modern Lake House architectures can benefit from these lightweight alternatives.

  • tech2
    2016-12-05

    This article explores the join and aggregation operations in Spark's Catalyst optimization engine. It discusses how Spark generates execution plans for these operations, including SortMergeJoin and HashAggregate, and the underlying mechanisms that ensure efficient data processing. The author highlights the complexities of data shuffling and the importance of distribution and ordering in Spark plans. Overall, the article provides insights into the optimization strategies employed by Spark Catalyst for handling join and aggregation queries.

  • tech2
    2016-08-16

    This article explores the performance benefits of using Spark SQL's Catalyst optimizer, particularly focusing on DataFrame transformations. It discusses the four stages of Catalyst optimization, emphasizing the Physical Plan stage and how caching DataFrames can significantly improve query performance. The author provides insights into the execution plans generated by Spark and the implications of using UnsafeRow for memory management. Ultimately, the article concludes that while simple queries may not benefit from Catalyst optimization without caching, performance can be enhanced when DataFrames are cached.

  • tech2
    2017-01-21

    This article explores how Apache Spark interacts with YARN for resource management in a cluster environment. It details the roles of YARN's components: Resource Manager, Application Master, and Node Manager, and explains the communication process during Spark application execution. The author discusses common exceptions encountered when running Spark on YARN, emphasizing the importance of understanding these interactions for effective troubleshooting. The article serves as a guide for advanced users looking to optimize Spark applications on YARN.

  • product
    2025-11-20

    Google Cloud has announced the general availability of Iceberg REST Catalog support in BigLake metastore, enhancing open data interoperability across various data engines. This fully-managed, serverless metastore allows users to query data using their preferred engines, including Apache Spark and BigQuery, without the need for data duplication. The integration with Dataplex Universal Catalog provides comprehensive governance and lineage capabilities. Organizations like Spotify are already leveraging this technology to build modern lakehouse platforms.
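
For reference, wiring a Spark session to an Iceberg REST catalog generally looks like the configuration fragment below. The catalog name and endpoint URL are hypothetical, and the BigLake-specific endpoint and auth settings are not shown; it also assumes the Iceberg Spark runtime jar is on the classpath.

```python
from pyspark.sql import SparkSession

# Configuration fragment only: "lake" and the URL are placeholder values.
spark = (SparkSession.builder
         .appName("iceberg-rest-demo")
         .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
         .config("spark.sql.catalog.lake.type", "rest")
         .config("spark.sql.catalog.lake.uri", "https://example.com/iceberg/rest")
         .getOrCreate())

# Tables in the REST catalog are then addressable as lake.<namespace>.<table>:
# spark.sql("SELECT * FROM lake.analytics.events LIMIT 10").show()
```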

  • product
    2025-07-23

Is Lightning Engine open source?

  • product
    2025-06-24

I hadn't heard of the "OpenLineage standard" before; I guess DataHub should be able to support it as well.

  • poc
    2025-02-28
  • project
    2025-01-09

Very nice! Uber runs Ray instances inside Spark executors. This setup lets each Spark task spawn Ray workers for parallel computation, which boosts performance significantly.

  • project
    2025-01-28

Running a local-mode Spark cluster in Kubernetes pods to process small files as they arrive; this is more efficient than running a big Spark cluster to process a huge number of files in batch.

  • project
    2025-01-27

    The State Reader API enables users to access and analyze Structured Streaming's internal state data. Readers will learn how to leverage the new features to debug, troubleshoot, and analyze state changes efficiently, making streaming workloads easier to manage at scale.

  • project
    2024-12-23

    JD.com has developed a comprehensive big data governance framework to manage its extensive data infrastructure, which includes thousands of servers, exabytes of storage, and millions of data models and tasks. The governance strategy focuses on cost reduction, stability, security, and data quality. Key initiatives involve the implementation of audit logs, full-link data lineage, and automated governance platforms. These efforts aim to enhance data management efficiency, ensure data security, and optimize resource utilization across the organization.

  • product
    2025-01-16
  • spike
    2024-12-03

Leveraging Iceberg tables: data is partitioned and stored in a way that aligns with the join keys, enabling highly efficient joins with minimal data movement for Spark jobs.

  • project
    2024-11-22

Improving data processing efficiency by implementing Apache Iceberg's base-2 file layout for S3.

  • project
    2024-11-06

In production with:

  • Airbnb
  • Uber
  • Yelp
  • Funding Circle