Apache Spark

Spark is a unified analytics engine for large-scale data processing. It provides high-level APIs in Scala, Java, Python, and R, and an optimised engine that supports general computation graphs for data analysis.

Website

GitHub repository

Tech tags:

Related shared contents:

  • tech2
    2026-01-21

    The article discusses LinkedIn's transformation of its search technology stack, focusing on the integration of large language models (LLMs) to enhance search experiences. It details the challenges and innovations involved in deploying LLMs at scale, including query understanding, semantic retrieval, and ranking processes. The use of AI-driven job and people search features aims to provide more relevant and personalized results. Additionally, the article highlights the importance of continuous relevance measurement and quality evaluation in maintaining a high-quality search experience.

  • project
    2026-03-06

    This article discusses Pinterest's evolution from basic Text-to-SQL systems to a sophisticated Analytics Agent that leverages unified context-intent embeddings for improved SQL generation and table discovery. The system addresses the challenges of understanding analytical intent and provides a structured approach to data governance and documentation. By encoding historical query patterns and utilizing AI-generated documentation, the agent enhances the efficiency and reliability of data analytics at Pinterest. The article outlines the architecture and operational principles behind the agent's design, emphasizing the importance of context and governance in AI-driven analytics.

  • project
    2023-10-01

    The article discusses the development and implementation of Spot Balancer, a tool created by Notion in collaboration with AWS, which optimizes the use of Spark on Kubernetes by balancing cost and reliability. It highlights the challenges faced when using Spot Instances for Spark jobs and how Spot Balancer allows for better control over executor placement to prevent job failures. The article outlines the transition from Amazon EMR to EMR on EKS and the benefits of dynamic provisioning and efficient resource management. Ultimately, the tool has helped Notion reduce Spark compute costs by 60-90% without sacrificing reliability.

  • tech1
    2026-02-23

    The article introduces the Native Execution Engine for Microsoft Fabric, designed to enhance Apache Spark's performance without requiring code changes. It explains the challenges faced by traditional Spark execution due to increasing data volumes and real-time processing demands. The Native Execution Engine leverages C++ and vectorized execution to optimize Spark workloads, particularly for columnar data formats like Parquet and Delta Lake. The integration of open-source technologies Velox and Apache Gluten is highlighted, showcasing significant performance improvements and cost savings for users.

  • project
    2026-02-01

    The article discusses Spotify's innovative multi-agent architecture designed to enhance its advertising platform. By addressing the fragmented decision-making processes across various advertising channels, the architecture aims to unify workflows and optimize campaign management through specialized AI agents. This approach allows for more efficient budget allocation, audience targeting, and overall campaign performance, leveraging historical data and machine learning. The article highlights the importance of a programmable decision layer and the challenges faced in implementing this system.

  • tech1
    2026-01-14

    The article discusses how Slack developed a comprehensive metrics framework to enhance the performance and cost-efficiency of their Apache Spark jobs on Amazon EMR. By integrating generative AI and custom monitoring tools, they achieved significant improvements in job completion times and cost reductions. The framework captures over 40 metrics, providing granular insights into application behavior and resource usage. The article outlines the architecture of their monitoring solution and the benefits of AI-assisted tuning for Spark operations.

  • tech1
    2025-11-12

    The article discusses the challenges of processing large datasets using single-node frameworks like Polars, DuckDB, and Daft compared to traditional Spark clusters. It highlights the concept of 'cluster fatigue' and the emotional and financial costs associated with running distributed systems. The author conducts a performance comparison of these frameworks on a 650GB dataset stored in Delta Lake on S3, demonstrating that single-node frameworks can effectively handle large datasets without the need for extensive resources. The findings suggest that modern Lake House architectures can benefit from these lightweight alternatives.

  • tech2
    2016-12-05

    This article explores the join and aggregation operations in Spark's Catalyst optimization engine. It discusses how Spark generates execution plans for these operations, including SortMergeJoin and HashAggregate, and the underlying mechanisms that ensure efficient data processing. The author highlights the complexities of data shuffling and the importance of distribution and ordering in Spark plans. Overall, the article provides insights into the optimization strategies employed by Spark Catalyst for handling join and aggregation queries.

  • tech2
    2016-08-16

    This article explores the performance benefits of using Spark SQL's Catalyst optimizer, particularly focusing on DataFrame transformations. It discusses the four stages of Catalyst optimization, emphasizing the Physical Plan stage and how caching DataFrames can significantly improve query performance. The author provides insights into the execution plans generated by Spark and the implications of using UnsafeRow for memory management. Ultimately, the article concludes that while simple queries may not benefit from Catalyst optimization without caching, performance can be enhanced when DataFrames are cached.

  • tech2
    2017-01-21

    This article explores how Apache Spark interacts with YARN for resource management in a cluster environment. It details the roles of YARN's components: Resource Manager, Application Master, and Node Manager, and explains the communication process during Spark application execution. The author discusses common exceptions encountered when running Spark on YARN, emphasizing the importance of understanding these interactions for effective troubleshooting. The article serves as a guide for advanced users looking to optimize Spark applications on YARN.

  • product
    2025-11-20

    Google Cloud has announced the general availability of Iceberg REST Catalog support in BigLake metastore, enhancing open data interoperability across various data engines. This fully-managed, serverless metastore allows users to query data using their preferred engines, including Apache Spark and BigQuery, without the need for data duplication. The integration with Dataplex Universal Catalog provides comprehensive governance and lineage capabilities. Organizations like Spotify are already leveraging this technology to build modern lakehouse platforms.
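
For reference, wiring a Spark session to an Iceberg REST catalog generally looks like the configuration fragment below. The catalog name and endpoint URL are hypothetical, and the BigLake-specific endpoint and auth settings are not shown; it also assumes the Iceberg Spark runtime jar is on the classpath.

```python
from pyspark.sql import SparkSession

# Configuration fragment only: "lake" and the URL are placeholder values.
spark = (SparkSession.builder
         .appName("iceberg-rest-demo")
         .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
         .config("spark.sql.catalog.lake.type", "rest")
         .config("spark.sql.catalog.lake.uri", "https://example.com/iceberg/rest")
         .getOrCreate())

# Tables in the REST catalog are then addressable as lake.<namespace>.<table>:
# spark.sql("SELECT * FROM lake.analytics.events LIMIT 10").show()
```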

  • product
    2025-07-23

Is Lightning Engine open source?

  • product
    2025-06-24

I hadn't heard of the "OpenLineage standard" before; I guess DataHub should be able to support it as well.

  • poc
    2025-02-28
  • project
    2025-01-09

Very nice! Uber runs Ray instances inside Spark executors. This setup lets each Spark task spawn Ray workers for parallel computation, which boosts performance significantly.

  • project
    2025-01-28

Running a local-mode Spark cluster in Kubernetes pods to process small files as they arrive; this is more efficient than running a big Spark cluster to process a huge number of files in batch.

  • project
    2025-01-27

    The State Reader API enables users to access and analyze Structured Streaming's internal state data. Readers will learn how to leverage the new features to debug, troubleshoot, and analyze state changes efficiently, making streaming workloads easier to manage at scale.

  • project
    2024-12-23

    JD.com has developed a comprehensive big data governance framework to manage its extensive data infrastructure, which includes thousands of servers, exabytes of storage, and millions of data models and tasks. The governance strategy focuses on cost reduction, stability, security, and data quality. Key initiatives involve the implementation of audit logs, full-link data lineage, and automated governance platforms. These efforts aim to enhance data management efficiency, ensure data security, and optimize resource utilization across the organization.

  • product
    2025-01-16
  • spike
    2024-12-03

Leveraging Iceberg tables: data is partitioned and stored in a way that aligns with the join keys, enabling highly efficient joins with minimal data movement for Spark jobs.

  • project
    2024-11-22

Improving data processing efficiency by implementing Apache Iceberg's base-2 file layout for S3.

  • project
    2024-11-06

In production with:

  • Airbnb
  • Uber
  • Yelp
  • Funding Circle