Original URL: https://gradientflow.substack.com/p/paradigm-shifts-in-data-processing
Added Date: February 5, 2025
Memo: "AI-centric" data processing focuses on preparing and managing large-scale, multimodal datasets efficiently for AI model training, fine-tuning, and deployment, rather than traditional database queries. It involves optimizing computation across heterogeneous resources (CPUs/GPUs), improving data flow efficiency, and enabling scalability—all crucial for building next-generation AI models.
Original URL: https://arrow.apache.org/blog/2025/01/10/arrow-result-transfer/
Added Date: February 3, 2025
Memo:
Original URL: https://www.anthropic.com/research/building-effective-agents
Added Date: January 31, 2025
Memo: The evaluator-optimizer workflow is interesting.
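A minimal sketch of the evaluator-optimizer loop as I read it from the post: one model call drafts a response, a second call critiques it, and the draft is revised until the critique passes or a retry budget runs out. `call_llm` and the prompts are hypothetical placeholders, not Anthropic's API.

```python
# Evaluator-optimizer sketch: generate -> critique -> revise, in a loop.
# `call_llm` is a hypothetical placeholder for whatever model client you use.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your model client here")

def evaluator_optimizer(task: str, max_rounds: int = 3) -> str:
    draft = call_llm(f"Complete the task:\n{task}")
    for _ in range(max_rounds):
        verdict = call_llm(
            "Review the draft below. Reply 'PASS' if it fully solves the task, "
            f"otherwise list concrete issues.\nTask: {task}\nDraft:\n{draft}"
        )
        if verdict.strip().startswith("PASS"):
            break
        draft = call_llm(
            f"Revise the draft to address this feedback.\nTask: {task}\n"
            f"Draft:\n{draft}\nFeedback:\n{verdict}"
        )
    return draft
```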
Added Date: January 30, 2025
Memo: Running local-mode Spark inside Kubernetes pods to process small files as they arrive; this is more efficient than running a big Spark cluster to process a huge number of files in batch.
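A rough PySpark sketch of the per-pod idea, assuming each pod is handed one small batch of files; the paths and output location are made up for illustration.

```python
# Each Kubernetes pod runs a tiny local[*] Spark "cluster" that handles one
# small batch of files, instead of shipping everything to a large shared
# cluster. Paths below are illustrative assumptions.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[*]")          # all work stays inside this pod
    .appName("small-file-compactor")
    .getOrCreate()
)

# Read the handful of small files assigned to this pod and append them
# as a single compacted Parquet output.
df = spark.read.json("/data/incoming/batch-0001/*.json")
df.coalesce(1).write.mode("append").parquet("/data/compacted/")

spark.stop()
```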
Original URL: https://www.linkedin.com/blog/engineering/ai/automated-genai-driven-search-quality-evaluation
Added Date: January 29, 2025
Memo:
Original URL: https://engineering.fb.com/2025/01/22/security/how-meta-discovers-data-flows-via-lineage-at-scale/
Added Date: January 28, 2025
Memo: Explains how to efficiently collect and validate lineage metadata across three systems: API, data warehouse, and AI inference.
Original URL: https://www.databricks.com/blog/introducing-easier-change-data-capture-apache-spark-structured-streaming
Added Date: January 27, 2025
Memo: The State Reader API enables users to access and analyze Structured Streaming's internal state data. Readers will learn how to leverage the new features to debug, troubleshoot, and analyze state changes efficiently, making streaming workloads easier to manage at scale.
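A rough PySpark sketch of how the State Reader API is used, based on my reading of the post; the checkpoint path and option values are assumptions.

```python
# Inspecting Structured Streaming state with the State Reader API:
# the "state-metadata" source lists the stateful operators in a checkpoint,
# and the "statestore" source returns the key/value state itself.
# The checkpoint path and operatorId below are illustrative assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("state-inspect").getOrCreate()

checkpoint = "/checkpoints/orders_dedup"

# High-level view: which operators hold state, and over which batch range.
metadata = spark.read.format("state-metadata").load(checkpoint)
metadata.show(truncate=False)

# Drill into the state rows of a specific operator for debugging.
state = (
    spark.read.format("statestore")
    .option("operatorId", 0)     # operator id taken from the metadata output
    .load(checkpoint)
)
state.show(truncate=False)
```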
Original URL: https://aws.amazon.com/blogs/database/how-monzo-bank-reduced-cost-of-ttl-from-time-series-index-tables-in-amazon-keyspaces/
Added Date: January 27, 2025
Memo: Monzo Bank optimized their data retention strategy in Amazon Keyspaces by replacing the traditional Time to Live (TTL) approach with a bulk deletion mechanism. By partitioning time-series data across multiple tables, each representing a specific time bucket, they can efficiently drop entire tables of expired data. This method significantly reduces operational costs associated with per-row TTL deletions.
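A rough Python sketch of the bucketed-table idea, assuming daily buckets and a connected cassandra-driver session against Keyspaces; the keyspace, table naming scheme, and retention window are made up for illustration.

```python
# Write each day's time-series rows into a table named for that day, then
# drop whole tables once they age out, instead of relying on per-row TTL.
# `session` is assumed to be a connected cassandra-driver session.
from datetime import date, timedelta

RETENTION_DAYS = 30

def table_for(day: date) -> str:
    # One table per daily bucket, e.g. events_2025_01_27
    return f"events_{day:%Y_%m_%d}"

def write_event(session, day: date, event_id: str, payload: str) -> None:
    session.execute(
        f"INSERT INTO ts.{table_for(day)} (event_id, payload) VALUES (%s, %s)",
        (event_id, payload),
    )

def drop_expired(session, today: date) -> None:
    # Bulk deletion: dropping a table removes the whole expired bucket at once,
    # avoiding per-row TTL tombstones.
    expired_day = today - timedelta(days=RETENTION_DAYS)
    session.execute(f"DROP TABLE IF EXISTS ts.{table_for(expired_day)}")
```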
Original URL: https://netflixtechblog.com/introducing-configurable-metaflow-d2fb8e9ba1c6
Added Date: January 26, 2025
Memo:
Original URL: https://www.alibabacloud.com/blog/introducing-fluss-streaming-storage-for-real-time-analytics_601921
Added Date: January 25, 2025
Memo: