Filter articles by tags or search for specific topics:
Original URL: https://medium.astrafy.io/dynamic-data-pipelines-with-airflow-datasets-and-pub-sub-d91c81d75f51
Added Date: November 18, 2024
Memo: A good introduction to Airflow's "dataset" feature, available since 2.4.0 (released on September 19, 2022).
Original URL: https://services.google.com/fh/files/misc/google-cloud-data-vault_b.pdf
Added Date: November 16, 2024
Memo: This paper provides an overview of the Data Vault concept and the business benefits of leveraging it on the cloud-based enterprise database BigQuery.
Original URL: https://xie.infoq.cn/article/2e8ff60f54152bf3d6a76d283
Added Date: November 15, 2024
Memo: Comprehensive explanation of how Alluxio accelerates data access in the cloud.
Original URL: https://netflixtechblog.com/introducing-netflixs-key-value-data-abstraction-layer-1ea8a0a11b30
Added Date: November 14, 2024
Memo: This is mostly a Netflix-scale problem; building this KV data abstraction layer takes a huge engineering effort.
Original URL: https://cloud.google.com/blog/products/data-analytics/dataplex-discovers-and-catalogs-cloud-storage-data/
Added Date: November 12, 2024
Memo: I agree with the 'dark data' problem in large organizations, and tools like Dataplex can help by automating data discovery. However, with thousands of tables generated, it raises the question: who will sift through these massive results to identify truly valuable datasets? This process could be very time-consuming.
Original URL: https://engineering.grab.com/transforming-the-analytics-landscape-with-RAG-powered-LLM
Added Date: November 11, 2024
Memo: Using RAG with an LLM to fetch the right dataset, combined with automatically generated explanations and analysis for users, is a really good idea.
Original URL: https://www.linkedin.com/pulse/what-goes-bronze-silver-gold-layers-medallion-data-lakshmanan-r93nc/
Added Date: November 10, 2024
Memo: This article discusses an approach similar to the raw, curated, and delivery zones we've talked about before. The key concept is to process and manage data in distinct zones or stages to support data governance and optimize data usage. Most data teams will likely need to adopt some version of this architecture to efficiently handle and control large volumes of data assets.
Added Date: November 8, 2024
Memo: QuintoAndar's DAG Builder allows scalable management of 10,000+ Apache Airflow DAGs by using YAML configurations to generate DAGs, minimizing code duplication and standardizing data pipeline creation. By separating DAG structures from workflow-specific parameters, QuintoAndar enables data engineers to create new pipelines through declarative YAML files, streamlining the process and ensuring quality across pipelines. This system improves team productivity, simplifies code maintenance, and reduces the learning curve for new team members.
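To make the idea of separating DAG structure from workflow-specific parameters concrete, here is a minimal sketch in plain Python. It is not QuintoAndar's implementation and does not use Airflow: the `TEMPLATES` table, the `DagSpec` class, the `build_dag` function, and the config keys are all hypothetical, and the dictionaries stand in for what parsed YAML files might contain.

```python
from dataclasses import dataclass, field

# Hypothetical templates: each one fixes the task order (the DAG structure).
TEMPLATES = {
    "extract_load": ["extract", "validate", "load"],
}


@dataclass
class DagSpec:
    dag_id: str
    schedule: str
    tasks: list   # task ids in execution order
    edges: list   # (upstream, downstream) pairs
    params: dict = field(default_factory=dict)


def build_dag(config: dict) -> DagSpec:
    """Combine a structural template with workflow-specific parameters."""
    steps = TEMPLATES[config["template"]]
    tasks = [f"{step}_{config['name']}" for step in steps]
    # Chain the tasks linearly: each task depends on the previous one.
    edges = list(zip(tasks, tasks[1:]))
    return DagSpec(
        dag_id=f"{config['template']}__{config['name']}",
        schedule=config.get("schedule", "@daily"),
        tasks=tasks,
        edges=edges,
        params=config.get("params", {}),
    )


# Stand-in for the result of parsing one YAML file per pipeline.
configs = [
    {"template": "extract_load", "name": "orders", "schedule": "@hourly",
     "params": {"source_table": "orders"}},
    {"template": "extract_load", "name": "users"},
]

dags = {spec.dag_id: spec for spec in map(build_dag, configs)}
```

Under these assumptions, adding a pipeline means adding one more config entry; the task structure lives only in the template, so it is not duplicated per pipeline.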
Original URL: https://cloud.google.com/blog/products/data-analytics/synthetic-data-generation-with-gretel-and-bigquery-dataframes/
Added Date: November 5, 2024
Memo: This guide demonstrates integrating Gretel with BigQuery DataFrames for synthetic data generation. By leveraging BigQuery's pandas-compatible APIs and Gretel's machine learning tools, users can generate and de-identify high-quality synthetic data that maintains data privacy and regulatory compliance. The process includes data de-identification with Gretel's Transform v2 and synthetic data generation with Gretel Navigator Fine Tuning, optimized for handling patient records with complex data relationships.
Added Date: October 28, 2024
Memo: AWS introduces a visual designer in SageMaker Pipelines to simplify fine-tuning and deploying Llama 3.x models. This new UI allows users to create, manage, and automate workflows for continuous model updates using a no-code interface. The article details a sample pipeline for customizing LLMs with SEC financial data, enabling tasks like model evaluation, deployment, and conditional registration based on performance.