Spark is a unified analytics engine for large-scale data processing. It provides high-level APIs in Scala, Java, Python, and R, and an optimised engine that supports general computation graphs for data analysis.
Tech tags:
Related shared contents:
-
poc2025-02-28
-
project2025-01-09
Very nice! Uber runs Ray instances inside Spark executors: each Spark task can spawn Ray workers for parallel computation, which boosts performance significantly.
-
project2025-01-28
Running a local-mode Spark cluster in k8s pods to process incoming small files; this mode is more efficient than running a big Spark cluster to process a huge number of files in batch.
-
project2025-01-27
The State Reader API enables users to access and analyze Structured Streaming's internal state data. The article shows how to leverage the new API to debug, troubleshoot, and analyze state changes efficiently, making streaming workloads easier to manage at scale.
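A fragment sketching the reader, assuming Spark 4.0+ and an existing Structured Streaming checkpoint (the path here is hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read a stateful streaming query's internal state from its checkpoint
# directory ("/tmp/checkpoint" is a hypothetical path).
state_df = (spark.read
            .format("statestore")
            .load("/tmp/checkpoint"))

# Inspect state keys/values per partition without touching the running query.
state_df.show(truncate=False)
```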
-
project2024-12-23
JD.com has developed a comprehensive big data governance framework to manage its extensive data infrastructure, which includes thousands of servers, exabytes of storage, and millions of data models and tasks. The governance strategy focuses on cost reduction, stability, security, and data quality. Key initiatives involve the implementation of audit logs, full-link data lineage, and automated governance platforms. These efforts aim to enhance data management efficiency, ensure data security, and optimize resource utilization across the organization.
-
product2025-01-16
-
spike2024-12-03
Leveraging Iceberg tables: data is partitioned and stored in a way that aligns with the join keys, enabling highly efficient joins with minimal data movement for Spark jobs.
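One way to get this effect is Spark's storage-partitioned join over Iceberg bucket partitioning; a sketch, assuming an Iceberg catalog named `demo` is already configured (table and column names are illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Let Spark exploit Iceberg's bucket layout (storage-partitioned join).
spark.conf.set("spark.sql.sources.v2.bucketing.enabled", "true")

# Both tables bucketed identically on the join key, so matching buckets
# are co-located and the join needs no shuffle.
spark.sql("""
  CREATE TABLE demo.db.orders (order_id BIGINT, customer_id BIGINT)
  USING iceberg PARTITIONED BY (bucket(16, customer_id))
""")
spark.sql("""
  CREATE TABLE demo.db.customers (customer_id BIGINT, name STRING)
  USING iceberg PARTITIONED BY (bucket(16, customer_id))
""")
spark.sql("""
  SELECT o.order_id, c.name
  FROM demo.db.orders o JOIN demo.db.customers c
  ON o.customer_id = c.customer_id
""")
```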
-
project2024-11-22
Improving data processing efficiency by implementing Apache Iceberg's base-2 file layout on S3.
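This layout is enabled per table via Iceberg's object-storage location provider, which prefixes data file paths with hash entropy so S3 spreads load across key prefixes; a config sketch assuming a hypothetical Iceberg table `demo.db.events`:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hash-prefixed (base-2 entropy) data file paths under S3, avoiding
# request hot-spotting on a single key prefix.
spark.sql("""
  ALTER TABLE demo.db.events SET TBLPROPERTIES (
    'write.object-storage.enabled' = 'true',
    'write.data.path' = 's3://my-bucket/data'
  )
""")
```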
-
project2024-11-06