Big Data – NEO

Ensuring Exclusive Sub-Task Execution in Multiple Data Pipelines

In modern data engineering, it’s common to have multiple data pipelines running concurrently. However, when these pipelines share a common sub-task, it becomes crucial to ensure that the sub-task does not run simultaneously across multiple pipelines. This can prevent data corruption, race conditions, and other issues related to concurrent executions. To address this challenge, a…

Creating Read-Only External Table in Unity Catalog by Using Existing Delta Table in Azure Storage Account

In this tutorial, we’ll walk through the steps to create a read-only external table in Azure Databricks using an existing Delta table stored in an Azure Storage Account. This allows you to query the data in the Delta table without needing to copy it into your Databricks cluster. Prerequisites: Steps: 1. Create Access Connector for…

A simple mistake that occurred when leveraging Spark in Python multiprocessing

Look at this snippet first: It looks fine at first glance. However, validation showed that the output in the Delta table was incomplete. In the end, the issue turned out to be df_test, which is not a local variable inside a function. So, when the code ran on multiple cores, df_test was overwritten. The best way to avoid…

Failure Sensor Detection by Pattern Comparison on Time Series

Purpose: find interesting patterns in the time series data in order to detect failing sensors. Sections: Process, Pros, Cons, Future work.

Service Principal on Databricks and Trivial zip file Ingestion

For security reasons, we have to use a service principal instead of a personal token to control the Databricks cluster and run notebooks and queries. Neither the Azure nor the Databricks documentation explains the steps very well; the products evolve so quickly that the documentation has not kept up. After some digging, I found…

Summary@202205

It is getting harder and harder to write a long journal entry focused on a single topic, since most of my time these days is spent burning fat. After half a year without an update, maybe it is time to list some problems we have solved or that are still outstanding. Issue 1. Too many job clusters launched in…

A real case of optimizing a Spark notebook

At the beginning of my optimization, I tried to find some standard principles that could help me quickly and smoothly. Although lots of information online indicates where I can improve my Spark code, none of it gives an easy end-to-end solution for the key issues. So, I summarized what I understand through online…

Some working performance/cost improvement tips applying to ADF and Databricks recently

While switching to the cloud, we found some pipelines running slowly and costs increasing rapidly. To solve these problems, we took the following steps to optimize the pipelines or data structures. None of them is hard to implement. 1. Set different triggers for different recurring periods. No matter for what reason, it is very…

First Glance on GPU Accelerated Spark

Since I started to play with clusters, I thought there was no mission that a cluster could not complete. If there is, add another node. However, apart from CUDA on a stand-alone machine, I have rarely touched GPU-accelerated clusters as a data engineer. Yes, maybe Spark ML can utilize GPUs since Spark…

SCD II or Snapshot for Dimension

SCD II is widely used to process dimensional data with all historical information. Each change in a dimension is recorded as a new row with a configured validity period, usually at date granularity. Since SCD II only keeps the changes, it significantly reduces storage space in the database. Everything looks fine, until big…