From Local Scripts to Cloud Servers: Demystifying Docker for DataOps

DEV Community
Cliffe Okoth

"...But it works on my machine." If you spend enough time in data engineering or software development, you will inevitably hear this phrase. You might write a brilliant ETL script that works flawlessly on your laptop, but the moment you move that code to a cloud server, everything breaks. The server has the wrong version of Python, missing libraries or conflicting dependencies. This exact problem is why Docker exists. To understand how Docker works in the real world, we are going to break down its role in a live DataOps project: What is Docker? Instead of installing your code, libraries and tools directly onto a computer, you package them all into a template known as a Docker Image. When you run this image, it forms a Container which is an isolated environment. In this project, the orchestrator, Apache Airflow, is hosted on an Azure Virtual Machine. It is supposed to trigger a local worker to extract data, and then execute transformations using dbt SQL models inside Snowflake. This creates a massive dependency headache. Instead of manually installing Airflow on the Azure server and hoping for the best, Docker is initialized to create a container where Airflow is strictly pinned to version 2.10.0. The Dockerfile contains a set of instructions on how to build a an image. Think of it as a recipe. Here is the exact Dockerfile used to build the Airflow orchestrator for this NBA project: FROM apache/airflow:2.10.0-python3.10 # Step 1: Install system-level tools USER root RUN apt-get update && apt-get install -y --no-install-recommends build-essential # Step 2: Switch back to standard user for security USER airflow # Step 3: Install Python packages COPY --chown=airflow:root requirements.txt /requirements.txt RUN pip install --upgrade pip && \ pip install --no-cache-dir -r /requirements.txt # Step 4: Copy the dbt models into the container COPY --chown=airflow:root nba_analytics /opt/airflow/nba_analytics Let's break it down line by line: FROM apache/airflow:2.10.0... FROM command. This tells Docker what "base image" to start with. Instead of building an operating system from scratch, we are telling Docker to go grab the official Apache Airflow 2.10.0 blueprint from the Docker registry. This instantly guarantees we bypass the version conflict issues mentioned earlier. USER root & RUN apt-get...: We temporarily switch to the administrative root user to install system tools, then safely switch back to USER airflow. COPY & RUN pip install: We copy the requirements.txt file from our local computer into the container. The RUN command then executes a terminal command to install all our necessary libraries. The --no-cache-dir flag tells Docker not to save the leftover installation files, keeping the final container lightweight. COPY ... nba_analytics: By copying the nba_analytics folder directly into the container, we ensure our orchestrator has immediate access to the SQL models it needs to run. A Dockerfile is just the blueprint for a single service. However, enterprise tools like Apache Airflow are rarely just one service. Airflow, for instance, requires three separate services to function: a Scheduler, Webserver and Database. (More on Airflow here) To spin up all of these services on our Azure VM, the project utilizes Docker Compose. This requires a docker-compose.yml file, which acts as a master blueprint. services: postgres: image: postgres:13 environment: POSTGRES_DB: airflow airflow-webserver: build: . ports: - "8080:8080" depends_on: - postgres airflow-scheduler: build: . 
A Dockerfile is just the blueprint for a single service. However, enterprise tools like Apache Airflow are rarely just one service. Airflow, for instance, requires three separate services to function: a Scheduler, a Webserver and a Database. (More on Airflow here.)

To spin up all of these services on our Azure VM, the project uses Docker Compose. This requires a docker-compose.yml file, which acts as a master blueprint:

```yaml
services:
  postgres:
    image: postgres:13
    environment:
      POSTGRES_DB: airflow

  airflow-webserver:
    build: .
    ports:
      - "8080:8080"
    depends_on:
      - postgres

  airflow-scheduler:
    build: .
    depends_on:
      - postgres
```

Instead of running long, complex terminal commands to start each piece manually, Docker Compose reads this YAML file and handles the networking automatically. To build and start the whole stack, you only need to run one command:

```bash
docker compose up -d
```

Docker then pulls the Postgres image, builds your custom Airflow image using your Dockerfile, links them all together and boots up an isolated orchestration server. The -d flag simply tells it to run in "detached" mode, meaning it runs quietly in the background so you can continue using your terminal.

By containerizing the orchestrator, this data pipeline achieves perfect environment consistency. It doesn't matter whether you deploy this project on an Azure VM, a Google Cloud instance or your laptop: Docker ensures that Airflow 2.10.0 and every other pinned Python library are locked in and ready to orchestrate your data.
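As a closing practical note (not covered in the walkthrough above), a few standard Docker Compose commands handle most day-to-day checks on a stack like this; the service names below come from the docker-compose.yml shown earlier:

```bash
# List the services and confirm they are all up
docker compose ps

# Follow the scheduler's logs to watch DAGs being picked up
docker compose logs -f airflow-scheduler

# Confirm the pinned Airflow version inside the running webserver container
docker compose exec airflow-webserver airflow version

# Stop and remove the containers when you are done
docker compose down
```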