Linux and Shell Scripting for Data Engineers
The commands and scripts every DE uses daily — files, processes, cron, log analysis, and bash.
Every Data Pipeline Runs on Linux
Almost every server that runs a data pipeline — cloud VMs, Docker containers, Kubernetes pods, Airflow workers, Spark executors — runs Linux. When a pipeline fails at 3 AM, you SSH into a Linux box and diagnose it. When a disk fills up and kills a pipeline, you find the culprit with Linux commands. When you need to quickly inspect a 10 GB log file without loading it into Python, you use Linux tools that do it in seconds.
Linux proficiency for a data engineer is not about memorising every command. It is about being comfortable in a terminal, knowing which tools solve which problems, and being able to write shell scripts that automate the repetitive operational tasks that surround every data pipeline.
Navigation, Files, and Permissions
The Linux filesystem and file permission model are the foundation of everything else. A data engineer who does not understand permissions will spend hours debugging "Permission denied" errors that take seconds to fix once the model is understood.
Navigation and file operations
# ── NAVIGATION ───────────────────────────────────────────────────────────────
pwd # print working directory — where am I?
ls -lah # list files: -l long, -a hidden, -h human-readable sizes
ls -lt # sort by modification time (newest first)
cd /data/pipelines # change to absolute path
cd ../logs # change to relative path (one level up, then into logs)
cd ~ # go to home directory
cd - # go back to previous directory
# ── FINDING FILES ─────────────────────────────────────────────────────────────
find /data -name "*.csv" # find all .csv files under /data
find /data -name "orders_*.parquet" -newer /tmp/checkpoint.txt # newer than checkpoint
find /data -size +1G # find files larger than 1 GB
find /data -mtime -1 # modified in last 24 hours
find /data -empty # find empty files
find /tmp -name "*.tmp" -mtime +7 -delete # find and delete .tmp files older than 7 days
# ── FILE OPERATIONS ───────────────────────────────────────────────────────────
cp source.csv dest.csv # copy file
cp -r /data/raw/ /data/backup/ # copy directory recursively
mv orders_v1.csv orders_v2.csv # rename/move file
rm old_file.csv # delete file (no undo!)
rm -rf /data/tmp/ # delete directory recursively (CAREFUL)
mkdir -p /data/pipeline/2026/03/17 # create nested directories
ln -s /data/real_file.parquet /data/link.parquet # create symbolic link
# ── VIEWING FILE CONTENT ──────────────────────────────────────────────────────
cat small_file.txt # print entire file (only for small files)
less large_file.log # page through large file (q to quit)
head -n 20 orders.csv # first 20 lines
tail -n 50 pipeline.log # last 50 lines
tail -f pipeline.log # FOLLOW log file in real-time (stream new lines)
tail -f pipeline.log | grep ERROR # follow log, show only errors in real-time
wc -l orders.csv # count lines in file
wc -c orders.csv # count bytes in file
# ── DISK USAGE ────────────────────────────────────────────────────────────────
df -h # disk free — show all mounted filesystems
df -h /data # disk free for specific path
du -sh /data/raw/ # disk usage of directory (summary)
du -sh /data/raw/* # disk usage of each item in directory
du -sh /data/* | sort -rh | head -20 # top 20 largest directories under /data
du -sh /data/raw/2026/03/* | sort -rh # check which daily partition is largest
File permissions — understanding and setting them
# ── READING PERMISSIONS ──────────────────────────────────────────────────────
ls -lah /data/pipeline/
# Output:
# -rwxr-x--- 1 pipeline_user data_team 4.2G Mar 17 08:14 orders.parquet
# ^^^ ^^^ ^^^
# | | |__ others: no permissions (---)
# | |______ group (data_team): read+execute (r-x)
# |__________ owner (pipeline_user): read+write+execute (rwx)
#
# First character: - = file, d = directory, l = symbolic link
# Permission breakdown:
# r = read (4) — can read file contents / list directory
# w = write (2) — can modify file / create files in directory
# x = execute (1) — can run as program / enter directory
# ── CHANGING PERMISSIONS ──────────────────────────────────────────────────────
chmod 755 run_pipeline.sh # rwxr-xr-x owner=rwx, group=r-x, others=r-x
chmod 644 config.yaml # rw-r--r-- owner=rw, group=r, others=r
chmod +x run_pipeline.sh # add execute permission for all
chmod -x dangerous_script.sh # remove execute permission
chmod -R 750 /data/secrets/ # recursively set 750 on directory and contents
# Common permission patterns for data engineering:
# 755 — scripts that should be executable by all
# 644 — config files readable by all, writable only by owner
# 600 — secret files (API keys, passwords) — owner read/write only
# 700 — private directories — owner only
# ── OWNERSHIP ─────────────────────────────────────────────────────────────────
chown pipeline_user:data_team orders.parquet # change owner and group
chown -R pipeline_user:data_team /data/output/ # recursive ownership change
sudo chown root:root /etc/cron.d/pipeline # change to root ownership
# ── COMMON PERMISSION ERROR ───────────────────────────────────────────────────
# "Permission denied: /data/output/orders.parquet"
# Diagnose:
ls -lah /data/output/ # check directory permissions
id # check your user and groups
stat /data/output/orders.parquet # detailed file information
grep, awk, sed, cut — The Data Engineer's Log Analysis Toolkit
Linux text processing tools are the fastest way to investigate pipeline logs, inspect data files, and answer quick questions without writing Python. A data engineer who knows grep, awk, sed, and cut can diagnose most pipeline failures in minutes without loading a single file into a dataframe.
grep — search for patterns in files
# ── BASIC grep ───────────────────────────────────────────────────────────────
grep "ERROR" pipeline.log # find lines containing ERROR
grep -i "error" pipeline.log # case-insensitive search
grep -n "ERROR" pipeline.log # show line numbers
grep -c "ERROR" pipeline.log # count matching lines (not show them)
grep -v "DEBUG" pipeline.log # show lines NOT containing DEBUG
grep -w "ERROR" pipeline.log # whole word match (not ERRORS, ERRORED)
# ── CONTEXT LINES ────────────────────────────────────────────────────────────
grep -A 5 "ERROR" pipeline.log # show 5 lines AFTER each match
grep -B 3 "ERROR" pipeline.log # show 3 lines BEFORE each match
grep -C 5 "ERROR" pipeline.log # show 5 lines BEFORE and AFTER
# ── REGEX PATTERNS ────────────────────────────────────────────────────────────
grep -E "ERROR|CRITICAL" pipeline.log # multiple patterns (extended regex)
grep -E "order_id=[0-9]+" pipeline.log # match order_id with digits
grep -E "^2026-03-17" pipeline.log # lines starting with this date
grep -E "failed after [0-9]+ retries" pipeline.log
# ── SEARCHING ACROSS MULTIPLE FILES ──────────────────────────────────────────
grep -r "ORDER_FAILED" /var/log/pipelines/ # search recursively in directory
grep -l "ERROR" /var/log/pipelines/*.log # list files that contain ERROR (not lines)
grep -h "ERROR" /var/log/pipelines/*.log # suppress filename prefix
# ── REAL DE USE CASES ─────────────────────────────────────────────────────────
# Find all orders that failed in today's log
grep -E "order_id.*FAILED" /var/log/pipeline_$(date +%Y%m%d).log
# Count errors per hour
grep "ERROR" pipeline.log | grep -oE "^[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}" | sort | uniq -c
# Find which pipeline run produced a specific order
grep -r "order_id.*9284751" /var/log/pipelines/
# Check if any critical error occurred
if grep -q "CRITICAL" pipeline.log; then
echo "Critical error found — alerting team"
fi
awk — column-based text processing
# awk processes text column by column
# $1=first field, $2=second field, $NF=last field, NR=line number, NF=num fields
# ── BASIC COLUMN EXTRACTION ───────────────────────────────────────────────────
awk '{print $1}' access.log # print first column
awk '{print $1, $4}' access.log # print columns 1 and 4
awk -F',' '{print $3}' orders.csv # CSV: use comma as delimiter, print 3rd column
awk -F'\t' '{print $2}' data.tsv # TSV: tab as delimiter
# ── FILTERING WITH CONDITIONS ─────────────────────────────────────────────────
awk '$3 > 1000' orders.csv # rows where column 3 > 1000
awk -F',' '$4 == "delivered"' orders.csv # rows where status column = delivered
awk 'NR > 1' orders.csv # skip header row (print from line 2)
awk 'NR==1 || $3 > 1000' orders.csv # keep header (line 1) + rows where col3 > 1000
# ── CALCULATIONS ──────────────────────────────────────────────────────────────
# Sum of column 3 (order amounts):
awk -F',' 'NR>1 {sum += $3} END {print "Total:", sum}' orders.csv
# Count rows and calculate average:
awk -F',' 'NR>1 {sum += $3; count++} END {print "Avg:", sum/count, "Count:", count}' orders.csv
# Count occurrences of each status:
awk -F',' 'NR>1 {counts[$4]++} END {for (s in counts) print s, counts[s]}' orders.csv
# ── STRING OPERATIONS ─────────────────────────────────────────────────────────
awk '{print length($0)}' file.txt # print length of each line
awk '{print toupper($1)}' file.txt # uppercase first column
awk '{gsub(/old/, "new"); print}' file.txt # replace all occurrences in line
# ── REAL DE USE CASE: parse structured log file ───────────────────────────────
# Log format: 2026-03-17 08:14:32 INFO orders_pipeline rows_processed=48234 duration_s=92.4
# Extract pipeline name, rows processed, and duration:
awk '{
split($4, kv, "="); rows = kv[2]
split($5, kv, "="); dur = kv[2]
print $1, $2, rows, dur
}' pipeline.log | head -20
sed — stream editor for text transformation
# ── BASIC SUBSTITUTION ───────────────────────────────────────────────────────
sed 's/old/new/' file.txt # replace FIRST occurrence per line
sed 's/old/new/g' file.txt # replace ALL occurrences per line
sed 's/old/new/gi' file.txt # replace all, case-insensitive
sed -i 's/old/new/g' file.txt # IN-PLACE replacement (modifies file!)
sed -i.bak 's/old/new/g' file.txt # in-place with .bak backup
# ── DELETION ──────────────────────────────────────────────────────────────────
sed '/pattern/d' file.txt # delete lines matching pattern
sed '/^$/d' file.txt # delete empty lines
sed '/^#/d' file.txt # delete comment lines starting with #
sed '1d' file.txt # delete first line (strip CSV header)
# ── EXTRACTION ────────────────────────────────────────────────────────────────
sed -n '10,20p' file.txt # print only lines 10-20
sed -n '/START/,/END/p' file.txt # print lines between START and END markers
sed -n '/2026-03-17/p' pipeline.log # print lines from a specific date
# ── REAL DE USE CASES ─────────────────────────────────────────────────────────
# Fix a config file's old URL across an entire directory:
find /etc/pipelines/ -name "*.yaml" -exec sed -i 's/old-db-host/new-db-host/g' {} \;
# Strip CSV header before processing:
sed '1d' orders.csv | wc -l
# Remove Windows line endings (CRLF → LF) from vendor CSV:
sed -i 's/\r$//' vendor_orders.csv
# Extract just the timestamp from log lines:
sed -E 's/^([0-9-]+ [0-9:]+).*/\1/' pipeline.log
# ── cut: simpler column extraction than awk ───────────────────────────────────
cut -d',' -f1,3 orders.csv # columns 1 and 3 from CSV
cut -d',' -f2- orders.csv # columns 2 to end
cut -c1-10 pipeline.log # first 10 characters of each line
sort, uniq, and pipes — combining tools
# ── SORT ──────────────────────────────────────────────────────────────────────
sort file.txt # alphabetical sort
sort -n file.txt # numeric sort
sort -rn file.txt # reverse numeric (largest first)
sort -t',' -k3 -rn orders.csv # sort CSV by column 3, reverse numeric
sort -u file.txt # sort and remove duplicates (unique)
# ── UNIQ ──────────────────────────────────────────────────────────────────────
# uniq only removes ADJACENT duplicates — always sort first
sort file.txt | uniq # remove all duplicates
sort file.txt | uniq -c # count occurrences of each unique line
sort file.txt | uniq -d # show only lines that ARE duplicated
sort file.txt | uniq -u # show only lines that are NOT duplicated
# ── PIPE COMPOSITION: the real power ─────────────────────────────────────────
# Pipes connect commands: stdout of left → stdin of right
# Count unique order statuses in a CSV:
cut -d',' -f4 orders.csv | sort | uniq -c | sort -rn
# Output:
# 48234 delivered
# 8921 confirmed
# 3847 cancelled
# 1204 placed
# Top 10 IP addresses hitting a web server:
awk '{print $1}' access.log | sort | uniq -c | sort -rn | head -10
# Find lines that appear in file_a but not file_b:
comm -23 <(sort file_a.txt) <(sort file_b.txt)
# How many distinct order_ids in today's log:
grep "order_id" pipeline.log | grep -oE "order_id=[0-9]+" | sort -u | wc -l
# Largest files on the filesystem:
find /data -type f -print0 | xargs -0 du -h | sort -rh | head -20 # -print0/-0 survive spaces in names
# Check if a CSV has exactly the expected number of columns:
awk -F',' '{print NF}' orders.csv | sort | uniq -c
# Should show only one count — if multiple values, some rows have wrong column count
Process Management — Running, Monitoring, and Killing Pipelines
Data pipelines are processes. Understanding how Linux manages processes lets you run pipelines in the background, monitor their resource usage, kill stuck jobs, and diagnose why a machine is running slowly.
# ── VIEWING RUNNING PROCESSES ────────────────────────────────────────────────
ps aux # all processes: user, PID, CPU%, MEM%, command
ps aux | grep python # find all Python processes
ps aux | grep pipeline # find pipeline processes
ps -ef --forest # process tree — see parent-child relationships
# ── REAL-TIME MONITORING ──────────────────────────────────────────────────────
top # interactive process monitor
# sort: M=memory, P=CPU, T=runtime
# kill: k then enter PID
htop # better than top (if installed)
# use arrow keys, F9 to kill
# Key metrics in top:
# PID: process ID
# %CPU: CPU usage (can exceed 100% for multi-threaded)
# %MEM: memory as % of total RAM
# VIRT: virtual memory size (shown as VSZ in ps)
# RES: resident set size — actual RAM used (shown as RSS in ps)
# STAT: S=sleeping, R=running, Z=zombie, D=waiting for disk I/O
# ── KILLING PROCESSES ─────────────────────────────────────────────────────────
kill 12345 # send SIGTERM to PID 12345 (graceful stop)
kill -9 12345 # send SIGKILL (force kill — no cleanup)
kill -15 12345 # explicitly send SIGTERM (same as kill)
pkill -f "python pipeline.py" # kill by name pattern
killall python3 # kill all processes named python3
# Use SIGTERM first (allows cleanup), only use SIGKILL if SIGTERM fails
# SIGKILL (-9) does not allow the process to clean up open files or connections
# ── BACKGROUND AND FOREGROUND ─────────────────────────────────────────────────
python pipeline.py & # run in background (& at end)
# prints PID: [1] 12345
jobs # list background jobs in current shell
fg %1 # bring job 1 to foreground
bg %1 # send job 1 to background
Ctrl+Z # suspend current foreground process
Ctrl+C # interrupt (kill) foreground process
# ── NOHUP: run after logout ────────────────────────────────────────────────────
nohup python pipeline.py > output.log 2>&1 &
# nohup: don't terminate when shell closes
# > output.log: redirect stdout to file
# 2>&1: redirect stderr to same place as stdout
# &: run in background
echo $! # print PID of last background command
# ── SCREEN / TMUX: persistent terminal sessions ──────────────────────────────
# Start a named session:
screen -S pipeline_run
tmux new -s pipeline_run
# Detach (leave running): Ctrl+A then D (screen) / Ctrl+B then D (tmux)
# List sessions: screen -ls / tmux ls
# Reattach: screen -r pipeline_run / tmux attach -t pipeline_run
# ── RESOURCE MONITORING ───────────────────────────────────────────────────────
free -h # memory usage (total, used, free, cached)
vmstat 2 10 # system stats every 2 seconds, 10 times
iostat -x 2 # disk I/O stats every 2 seconds
lsof -p 12345 # files opened by process 12345
lsof /data/orders.parquet # which process has this file open
netstat -tulpn # open network connections and ports
ss -tulpn # modern replacement for netstat
Signals — what they mean for pipeline processes
# Signals are messages sent to processes
# Key signals for data engineers:
# SIGTERM (15) — graceful shutdown request
# The default signal from kill.
# Well-written pipelines catch SIGTERM and:
# - complete the current row/batch
# - flush write buffers
# - close database connections
# - write checkpoint state
# - exit cleanly
# Python: signal.signal(signal.SIGTERM, handler)
# SIGKILL (9) — immediate termination
# Cannot be caught or ignored by the process
# No cleanup — open files may be corrupted
# Database connections are abandoned (left in pg_stat_activity)
# Parquet files in progress are truncated and corrupt
# Use only as last resort
# SIGINT (2) — interrupt (Ctrl+C)
# Same as SIGTERM in most practical contexts
# Python raises KeyboardInterrupt when it receives SIGINT
# SIGHUP (1) — hangup
# Traditionally sent when terminal disconnects
# Many daemons use SIGHUP to reload configuration without restarting
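The SIGTERM behaviour described above can also be sketched directly in bash with `trap` (a minimal sketch: `process_batch` and the batch list are placeholders, and the mid-loop `kill` merely simulates an operator sending SIGTERM to the process):

```shell
#!/usr/bin/env bash
# Minimal sketch: a bash loop that finishes the current batch on SIGTERM.
# process_batch and the batch list are placeholders for real work.
shutdown_requested=0
trap 'shutdown_requested=1; echo "SIGTERM received — will stop after current batch"' TERM

process_batch() {
    last_batch=$1
    echo "processing batch $1"
}

for batch in 1 2 3 4 5; do
    if (( shutdown_requested )); then
        echo "shutdown requested — stopping cleanly"
        break
    fi
    process_batch "$batch"
    if [[ $batch -eq 3 ]]; then
        kill -TERM $$   # simulate an operator running: kill <pid>
    fi
done
```

Because the trap only sets a flag, the batch in flight completes before the loop exits, which is exactly the cleanup window SIGKILL would deny.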
# Example: pipeline that handles SIGTERM gracefully (Python):
# import signal, sys
# shutdown_requested = False
# def handle_sigterm(signum, frame):
# global shutdown_requested
# shutdown_requested = True
# logger.info("SIGTERM received — will stop after current batch")
# signal.signal(signal.SIGTERM, handle_sigterm)
#
# for batch in read_batches():
# if shutdown_requested:
# logger.info("Shutdown requested — stopping cleanly")
# break
# process_batch(batch)
Moving Data Between Machines — scp, rsync, curl, wget
Data engineering involves constant movement of files between machines — from source servers to data lakes, from one cloud region to another, from external SFTP servers to local processing nodes. The right tool depends on the use case.
# ── SCP: simple file copy over SSH ───────────────────────────────────────────
# Local to remote:
scp orders.csv user@server-01:/data/landing/
# Remote to local:
scp user@server-01:/data/output/results.csv ./local/
# Directory (recursive):
scp -r /data/2026/03/ user@server-01:/data/archive/2026/03/
# Specify SSH key:
scp -i ~/.ssh/pipeline_key.pem orders.csv ec2-user@54.1.2.3:/data/
# ── RSYNC: efficient sync with delta transfer ─────────────────────────────────
# rsync only transfers files that have changed — ideal for large directories
rsync -avz /data/local/ user@server:/data/remote/
# -a: archive mode (recursive + preserve permissions/timestamps)
# -v: verbose
# -z: compress during transfer
# Dry run (show what would be transferred without doing it):
rsync -avzn /data/local/ user@server:/data/remote/
# Delete files on destination that no longer exist on source:
rsync -avz --delete /data/local/ user@server:/data/remote/
# Exclude certain file patterns:
rsync -avz --exclude='*.tmp' --exclude='_temp/' /data/local/ user@server:/data/remote/
# Resume an interrupted transfer (use partial flag):
rsync -avz --partial /data/large_file.parquet user@server:/data/
# ── CURL: HTTP data transfer ──────────────────────────────────────────────────
# Download a file:
curl -O https://data.gov.in/storage/f/orders_sample.csv
curl -o output.csv https://api.example.com/export
# Download with authentication:
curl -H "Authorization: Bearer $API_TOKEN" https://api.example.com/data > data.json
# POST request (webhook, API call):
curl -X POST https://api.example.com/ingest -H "Content-Type: application/json" -d '{"batch_id": "2026-03-17", "status": "complete"}'
# Follow redirects and show progress:
curl -L --progress-bar -O https://example.com/large_file.gz
# Download only if remote file is newer than local:
curl -z local_file.csv -O https://example.com/data.csv
# ── WGET: downloading files ───────────────────────────────────────────────────
wget https://example.com/data.csv
wget -q https://example.com/data.csv # quiet (no progress output)
wget -r -np https://example.com/data/ # recursive download (spider directory)
wget --continue large_file.zip # resume interrupted download
# ── MOVING DATA TO/FROM S3 ────────────────────────────────────────────────────
# AWS CLI must be installed and configured
aws s3 cp orders.csv s3://my-bucket/raw/orders.csv
aws s3 cp s3://my-bucket/output/results.csv ./local/
aws s3 sync /data/local/ s3://my-bucket/data/ # sync directory
aws s3 sync s3://my-bucket/data/ /data/local/ # sync from S3 to local
aws s3 ls s3://my-bucket/data/ --recursive # list all files
aws s3 rm s3://my-bucket/tmp/ --recursive # remove all files in prefix
Cron Scheduling — Automating Pipelines on a Schedule
Cron is the standard Unix scheduling system. For simple pipelines that do not yet warrant a full orchestration tool like Airflow, cron is often the fastest and most reliable way to schedule recurring jobs. Even when using Airflow, knowing cron syntax is essential because Airflow uses it for schedule intervals.
# ── CRON SYNTAX ──────────────────────────────────────────────────────────────
# Format: minute hour day_of_month month day_of_week command
# 0-59 0-23 1-31 1-12 0-7
# (0 and 7 both = Sunday)
# ── COMMON PATTERNS ───────────────────────────────────────────────────────────
# Every minute:
* * * * * /path/to/script.sh
# Every hour at minute 0:
0 * * * * /path/to/script.sh
# Every day at 6:00 AM:
0 6 * * * /path/to/script.sh
# Every day at 6:30 AM IST (= 1:00 AM UTC):
0 1 * * * /path/to/script.sh
# Every weekday (Mon–Fri) at 8 AM:
0 8 * * 1-5 /path/to/script.sh
# Every Monday at 7 AM:
0 7 * * 1 /path/to/script.sh
# 1st of every month at 3 AM:
0 3 1 * * /path/to/script.sh
# Every 15 minutes:
*/15 * * * * /path/to/script.sh
# Every 6 hours:
0 */6 * * * /path/to/script.sh
# At 3:15 AM on the 1st and 15th of each month:
15 3 1,15 * * /path/to/script.sh
# Airflow schedule string examples (same cron syntax):
# schedule='0 6 * * *' → daily at 6 AM UTC
# schedule='0 */4 * * *' → every 4 hours
# schedule='@daily' → shorthand for 0 0 * * *
# schedule='@hourly' → shorthand for 0 * * * *
# ── EDITING THE CRONTAB ───────────────────────────────────────────────────────
crontab -e # edit current user's crontab (opens in $EDITOR)
crontab -l # list current user's crontab
crontab -r # REMOVE entire crontab (careful!)
sudo crontab -u pipeline_user -e # edit another user's crontab
# ── PRODUCTION CRONTAB BEST PRACTICES ────────────────────────────────────────
# 1. Always use absolute paths — cron has a minimal PATH:
0 6 * * * /usr/bin/python3 /data/pipelines/orders_pipeline.py
# 2. Redirect all output to a log file:
0 6 * * * /data/pipelines/run.sh >> /var/log/pipelines/orders.log 2>&1
# 3. Set environment variables explicitly — cron does not load .bashrc:
0 6 * * * . /etc/pipeline_env && /data/pipelines/run.sh # use '.' not 'source' — cron runs /bin/sh
# 4. Use a wrapper script with error handling:
0 6 * * * /data/pipelines/run_with_alerting.sh orders_pipeline
# 5. Add MAILTO to send failures by email:
MAILTO=data-team@company.com
0 6 * * * /data/pipelines/run.sh
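Point 4 above mentions a wrapper script with error handling; a minimal sketch of what such a wrapper might look like (the script name, log path, and alert hook are illustrative, not from any real system):

```shell
#!/usr/bin/env bash
# Minimal sketch of a cron wrapper with error handling and alerting.
# The pipeline command, log path, and alert hook are placeholders.
set -uo pipefail   # no -e here: we inspect the exit code ourselves

pipeline_name="${1:-unnamed_pipeline}"
log_file="/tmp/${pipeline_name}_$(date +%Y%m%d).log"

run_pipeline() {
    # Placeholder for the real invocation, e.g.:
    # /usr/bin/python3 "/data/pipelines/${pipeline_name}.py"
    true
}

if run_pipeline >> "$log_file" 2>&1; then
    status="SUCCESS"
else
    status="FAILED (exit $?)"
    # Alert hook: adapt to Slack / PagerDuty / email, e.g.:
    # curl -s -X POST "$SLACK_WEBHOOK_URL" --data "{\"text\": \"${pipeline_name} FAILED\"}" || true
fi
echo "$(date '+%Y-%m-%d %H:%M:%S') ${pipeline_name}: ${status}" >> "$log_file"
```

Cron then only ever runs the wrapper, so every pipeline gets logging and failure alerts for free.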
# ── SYSTEM CRONTABS ──────────────────────────────────────────────────────────
# /etc/crontab — system-wide crontab (has extra user field)
# /etc/cron.d/ — drop-in crontab files per application
# /etc/cron.daily/ — scripts run daily by anacron
# /etc/cron.hourly/ — scripts run hourly
# System crontab format (has username field):
# min hr dom month dow USER command
0 6 * * * pipeline_user /data/pipelines/run.sh
# ── DEBUGGING CRON JOBS ───────────────────────────────────────────────────────
# 1. Check if cron daemon is running:
sudo systemctl status cron # or: sudo service cron status
# 2. Check cron logs:
grep CRON /var/log/syslog | tail -50 # Ubuntu/Debian
journalctl -u cron --since "1 hour ago" # systemd
# 3. Common cron failure causes:
# - Wrong PATH: add PATH=/usr/local/bin:/usr/bin:/bin at top of crontab
# - Missing environment variables: source env file at start of script
# - Script not executable: chmod +x script.sh
# - Wrong timezone: cron uses the system timezone (often UTC on servers); some cron implementations honour TZ=Asia/Kolkata in the crontab
# - Output not captured: add >> /log/file.log 2>&1 to capture all output
Bash Scripting — Writing Production Shell Scripts
Bash scripts wrap data pipeline invocations with the operational logic that pure Python pipelines do not handle well: checking preconditions before running, logging start/end times, sending alerts on failure, locking to prevent duplicate runs, and cleaning up temporary files. Every production data pipeline is wrapped in at least a basic bash script.
The bash script template every pipeline should use
#!/usr/bin/env bash
# ─────────────────────────────────────────────────────────────────────────────
# orders_pipeline.sh — Daily orders ingestion
# Runs: 06:00 AM IST daily via cron
# Owner: data-team@company.com
# ─────────────────────────────────────────────────────────────────────────────
set -euo pipefail
# -e: exit immediately on any command failure
# -u: treat unset variables as errors (not empty string)
# -o pipefail: pipeline fails if any command in it fails (not just the last)
# This trio is the FIRST thing in every production bash script
# ── Configuration ──────────────────────────────────────────────────────────────
readonly SCRIPT_NAME="$(basename "$0")"
readonly SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
readonly LOG_DIR="/var/log/pipelines"
readonly LOG_FILE="${LOG_DIR}/${SCRIPT_NAME%.*}_$(date +%Y%m%d).log"
readonly LOCK_FILE="/tmp/${SCRIPT_NAME}.lock"
readonly PYTHON_BIN="/usr/bin/python3"
readonly PIPELINE_SCRIPT="${SCRIPT_DIR}/pipeline/orders_ingestion.py"
# ── Logging ────────────────────────────────────────────────────────────────────
log() {
local level="$1"; shift
echo "$(date '+%Y-%m-%d %H:%M:%S') [${level}] $*" | tee -a "$LOG_FILE"
}
info() { log "INFO" "$@"; }
warning() { log "WARNING" "$@"; }
error() { log "ERROR" "$@"; }
# ── Cleanup on exit ────────────────────────────────────────────────────────────
cleanup() {
local exit_code=$?
rm -f "$LOCK_FILE" # always remove lock file
if [[ $exit_code -ne 0 ]]; then
error "Script exited with code $exit_code"
send_alert "Pipeline FAILED: ${SCRIPT_NAME} (exit code ${exit_code})"
fi
info "Cleanup complete"
}
trap cleanup EXIT # trap: run cleanup() no matter how script exits
# ── Alert function ─────────────────────────────────────────────────────────────
send_alert() {
local message="$1"
# Send to Slack, PagerDuty, email — adapt to your alerting system:
curl -s -X POST "$SLACK_WEBHOOK_URL" -H 'Content-type: application/json' --data "{\"text\": \"🚨 ${message}\"}" || true
# '|| true' prevents curl failure from failing the script
}
# ── Lock file: prevent concurrent runs ────────────────────────────────────────
if [[ -f "$LOCK_FILE" ]]; then
pid=$(cat "$LOCK_FILE")
if kill -0 "$pid" 2>/dev/null; then
error "Another instance is already running (PID $pid). Exiting."
exit 1
else
warning "Stale lock file found (PID $pid no longer running). Removing."
rm -f "$LOCK_FILE"
fi
fi
echo $$ > "$LOCK_FILE" # write current PID to lock file
# ── Precondition checks ────────────────────────────────────────────────────────
main() {
info "==== Starting ${SCRIPT_NAME} ===="
# Check required env variables
: "${DATABASE_URL:?DATABASE_URL environment variable is required}"
: "${API_KEY:?API_KEY environment variable is required}"
# ':' is a no-op; '?' makes bash error with message if variable is unset
# Check required files exist
[[ -f "$PIPELINE_SCRIPT" ]] || { error "Pipeline script not found: $PIPELINE_SCRIPT"; exit 1; }
[[ -d "$LOG_DIR" ]] || mkdir -p "$LOG_DIR"
# Check disk space (require at least 10 GB free)
local free_gb
free_gb=$(df -BG /data | awk 'NR==2 {print $4}' | tr -d 'G')
if [[ $free_gb -lt 10 ]]; then
error "Insufficient disk space: only ${free_gb}GB free (need 10GB)"
send_alert "Pipeline blocked: disk space low (${free_gb}GB)"
exit 1
fi
# Determine run date (yesterday by default, or first argument)
local run_date="${1:-$(date -d 'yesterday' +%Y-%m-%d)}"
info "Processing date: $run_date"
# ── Run the pipeline ──────────────────────────────────────────────────────
info "Starting Python pipeline..."
local start_time=$SECONDS
"$PYTHON_BIN" "$PIPELINE_SCRIPT" --date "$run_date" --log-level INFO 2>&1 | tee -a "$LOG_FILE"
local duration=$(( SECONDS - start_time ))
info "Pipeline completed successfully in ${duration}s"
# ── Post-run validation ────────────────────────────────────────────────────
local expected_min_rows=1000
local actual_rows
actual_rows=$(psql "$DATABASE_URL" -t -c "SELECT COUNT(*) FROM silver.orders WHERE order_date = '$run_date'")
actual_rows=$(echo "$actual_rows" | tr -d ' ')
if [[ $actual_rows -lt $expected_min_rows ]]; then
warning "Row count suspicious: got $actual_rows, expected at least $expected_min_rows"
send_alert "Pipeline warning: low row count for $run_date (got $actual_rows)"
fi
info "==== Finished ${SCRIPT_NAME} — rows: $actual_rows ===="
}
main "$@"
Variables, conditionals, and loops
#!/usr/bin/env bash
set -euo pipefail
# ── VARIABLES ──────────────────────────────────────────────────────────────────
name="FreshMart"
count=42
today=$(date +%Y-%m-%d) # command substitution
files=$(find /data -maxdepth 1 -name '*.csv' | wc -l) # command substitution (avoids parsing ls output)
echo "Company: $name"
echo "Count: ${count}" # braces recommended for clarity
echo "Today: ${today}"
# Read-only variables:
readonly MAX_RETRIES=5
# Arrays:
stores=("ST001" "ST002" "ST003" "ST004" "ST005")
echo "${stores[0]}" # first element
echo "${stores[@]}" # all elements
echo "${#stores[@]}" # array length
# ── CONDITIONALS ──────────────────────────────────────────────────────────────
# Test file/directory existence:
if [[ -f "/data/orders.csv" ]]; then
echo "File exists"
elif [[ -d "/data/" ]]; then
echo "Directory exists but file does not"
else
echo "Neither exists"
fi
# Test string comparison:
status="delivered"
if [[ "$status" == "delivered" ]]; then
echo "Order delivered"
elif [[ "$status" == "cancelled" ]]; then
echo "Order cancelled"
fi
# Test numeric comparison:
row_count=48234
if [[ $row_count -gt 0 ]]; then # -gt greater than
echo "Has rows"
fi
if [[ $row_count -ge 1000 ]]; then # -ge greater than or equal
echo "Enough rows"
fi
# Comparison operators: -eq -ne -gt -ge -lt -le
# Test command success/failure:
if psql "$DATABASE_URL" -c "SELECT 1" > /dev/null 2>&1; then
echo "Database is reachable"
else
echo "Cannot reach database"
exit 1
fi
# ── LOOPS ─────────────────────────────────────────────────────────────────────
# For loop over array:
for store in "${stores[@]}"; do
echo "Processing store: $store"
python3 process_store.py --store "$store"
done
# For loop with range:
for i in {1..10}; do
echo "Attempt $i"
done
# C-style for loop:
for ((i=1; i<=5; i++)); do
echo "Step $i of 5"
done
# While loop:
retry=0
max_retries=5
while [[ $retry -lt $max_retries ]]; do
if python3 pipeline.py; then
echo "Success on attempt $((retry+1))"
break
fi
retry=$((retry+1))
echo "Attempt $retry failed, retrying in $((2**retry))s"
sleep $((2**retry))
done
if [[ $retry -eq $max_retries ]]; then
echo "All $max_retries attempts failed"
exit 1
fi
# Loop over files:
for file in /data/incoming/*.csv; do
[[ -f "$file" ]] || continue # skip if no files match (glob expands literally)
echo "Processing: $file"
process_file "$file"
done
# Loop over lines in a file:
while IFS= read -r line; do
echo "Line: $line"
done < /data/store_list.txt
# ── FUNCTIONS ──────────────────────────────────────────────────────────────────
check_disk_space() {
local path="${1:-/data}"
local min_gb="${2:-10}"
local free_gb
free_gb=$(df -BG "$path" | awk 'NR==2 {print $4}' | tr -d 'G')
if [[ $free_gb -lt $min_gb ]]; then
echo "ERROR: insufficient disk space at $path (${free_gb}GB < ${min_gb}GB)"
return 1
fi
echo "Disk OK: ${free_gb}GB available at $path"
return 0
}
# Call the function:
check_disk_space /data 10 || exit 1
String manipulation in bash
# ── STRING OPERATIONS ────────────────────────────────────────────────────────
filename="/data/orders_2026_03_17.csv"
# Extract filename from path (like basename):
echo "${filename##*/}" # orders_2026_03_17.csv
# Extract directory from path (like dirname):
echo "${filename%/*}" # /data
# Remove extension:
echo "${filename%.*}" # /data/orders_2026_03_17
# Get extension:
echo "${filename##*.}" # csv
# String length:
echo "${#filename}" # 27
# Substring (offset:length):
date_part="2026_03_17"
echo "${date_part:0:4}" # 2026 (year)
echo "${date_part:5:2}" # 03 (month)
# Replace:
echo "${filename/csv/parquet}" # replace first occurrence
echo "${filename//csv/parquet}" # replace all occurrences
# Uppercase / lowercase:
status="Delivered"
echo "${status^^}" # DELIVERED
echo "${status,,}" # delivered
# ── DATE MANIPULATION IN BASH ─────────────────────────────────────────────────
today=$(date +%Y-%m-%d) # 2026-03-17
yesterday=$(date -d 'yesterday' +%Y-%m-%d) # 2026-03-16 (Linux)
yesterday=$(date -v-1d +%Y-%m-%d) # 2026-03-16 (macOS)
last_week=$(date -d '7 days ago' +%Y-%m-%d) # 2026-03-10
first_of_month=$(date +%Y-%m-01) # 2026-03-01
# Formatted for log file names:
log_suffix=$(date +%Y%m%d_%H%M%S) # 20260317_081432
# ── ARITHMETIC ────────────────────────────────────────────────────────────────
a=10
b=3
echo $((a + b)) # 13
echo $((a - b)) # 7
echo $((a * b)) # 30
echo $((a / b)) # 3 (integer division)
echo $((a % b)) # 1 (modulo)
echo $((2 ** 8)) # 256 (exponentiation)
count=0
((count++)) # increment in place
((count += 5)) # add 5 in place
Environment Variables, PATH, and Shell Config
Data engineering pipelines rely heavily on environment variables for configuration — database URLs, API keys, output paths, and feature flags. Understanding how the Linux environment works prevents configuration bugs that manifest only in production (where cron runs) and not in development (where you manually source your dotfiles).
# ── READING AND SETTING ENV VARS ─────────────────────────────────────────────
echo $HOME # print variable
echo ${DATABASE_URL:-not_set} # print with default if unset
printenv # print all environment variables
printenv DATABASE_URL # print specific variable
env # same as printenv
# Set a variable (current shell only):
export DATABASE_URL="postgresql://user:pass@localhost:5432/orders"
export API_KEY="rzp_live_xxxx"
# Set for one command only (does not persist):
DEBUG=true python3 pipeline.py
# Unset a variable:
unset DATABASE_URL
# ── LOADING ENV FILES ─────────────────────────────────────────────────────────
# In development: load from .env file
# IMPORTANT: .env should be in .gitignore
# Method 1: source the file (variables become part of current shell):
source /etc/pipeline_environment
. /etc/pipeline_environment # same thing (source is bash's synonym for the POSIX . builtin)
# Method 2: export all variables from a file:
set -a # automatically export all variables
source .env
set +a # stop auto-exporting
# Method 3: read file in a script safely (ignore comments and blank lines):
while IFS='=' read -r key value; do
[[ "$key" =~ ^[[:space:]]*# ]] && continue # skip comments
[[ -z "$key" ]] && continue # skip blank lines
export "$key=$value"
done < .env
# ── UNDERSTANDING PATH ────────────────────────────────────────────────────────
echo $PATH # colon-separated list of directories to search for commands
which python3 # which binary will run when you type python3?
type python3 # same, with more detail
# Add a directory to PATH permanently (add to ~/.bashrc or ~/.bash_profile):
export PATH="/usr/local/bin:$PATH" # prepend (checked first)
export PATH="$PATH:/opt/custom_tools" # append (checked last)
# ── SHELL STARTUP FILES ───────────────────────────────────────────────────────
# ~/.bashrc — executed for every non-login interactive shell
# ~/.bash_profile — executed for login shells (SSH sessions)
# ~/.profile — fallback if .bash_profile not found
# /etc/profile — system-wide login shell config
# /etc/bash.bashrc — system-wide .bashrc equivalent
# IMPORTANT FOR DATA ENGINEERS:
# Cron jobs run in a minimal environment — they do NOT source ~/.bashrc
# Variables set in ~/.bashrc are NOT available to cron
# Solutions:
# 1. Source the file explicitly in your cron command (set SHELL=/bin/bash at the
#    top of the crontab first — cron defaults to /bin/sh, which lacks the source
#    builtin, and many ~/.bashrc files exit early for non-interactive shells):
# 0 6 * * * source ~/.bashrc && python3 pipeline.py
# 2. Set variables in a dedicated env file and source it in your script:
# source /etc/pipeline_environment
# 3. Set variables at the top of the crontab:
# DATABASE_URL=postgresql://...
# 0 6 * * * python3 pipeline.py
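# Example crontab applying these rules (paths and values are illustrative;
# edit with crontab -e):
#   SHELL=/bin/bash
#   PATH=/usr/local/bin:/usr/bin:/bin
#   DATABASE_URL=postgresql://user:pass@db:5432/orders
#   0 6 * * * /usr/bin/python3 /opt/pipelines/orders_pipeline.py >> /var/log/pipelines/orders.log 2>&1
# (caveat: % is special in crontab command fields and must be escaped as \%)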
# ── USEFUL ENVIRONMENT TRICKS ─────────────────────────────────────────────────
# Check if a command exists before using it:
if command -v aws &>/dev/null; then
echo "AWS CLI is installed"
aws s3 sync ...
else
echo "AWS CLI not found — install it first"
exit 1
fi
# Get script's own directory reliably (works when the script is sourced as well as
# executed; note it does not resolve symlinks to the script itself):
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
# Check if running as root:
if [[ $EUID -eq 0 ]]; then
echo "Running as root"
fi
# Check operating system:
if [[ "$(uname)" == "Darwin" ]]; then
echo "macOS"
elif [[ "$(uname)" == "Linux" ]]; then
echo "Linux"
fi
Diagnosing a Failed Pipeline at 7 AM Using Only Linux Commands
You receive a PagerDuty alert at 6:47 AM: "orders_pipeline has not completed by 06:45 AM SLA." You SSH into the pipeline server. Here is the exact sequence of Linux commands you run to diagnose and fix it.
# Step 1: Check if the process is still running or crashed
ps aux | grep orders_pipeline
# Output:
# pipeline 18734 98.2 4.1 python3 orders_pipeline.py --date 2026-03-17
# → It IS running but at 98% CPU — something is wrong
# Step 2: Check how long it has been running
ps -p 18734 -o pid,etime,pcpu,pmem,cmd
# Output:
# PID ELAPSED %CPU %MEM CMD
# 18734 02:14:32 98.2 4.1 python3 orders_pipeline.py
# → Running for 2 hours 14 minutes — it normally finishes in about 30 minutes
# Step 3: Check disk space (full disk is a common silent killer)
df -h /data
# Output:
# Filesystem Size Used Avail Use% Mounted on
# /dev/sdb1 500G 499G 512M 99% /data
# → DISK IS FULL! 512 MB free on /data
# Step 4: Find what is consuming the space
du -sh /data/* | sort -rh | head -10
# Output:
# 487G /data/raw
# 8G /data/processed
# 2G /data/logs
# 1G /data/tmp
du -sh /data/raw/* | sort -rh | head -10
# 312G /data/raw/2026
# 175G /data/raw/2025
du -sh /data/raw/2026/* | sort -rh
# 288G /data/raw/2026/03
# 24G /data/raw/2026/02
du -sh /data/raw/2026/03/* | sort -rh
# 288G /data/raw/2026/03/17
# → Today's date has 288 GB! A runaway process wrote too much data
# Step 5: Find the culprit files
ls -lth /data/raw/2026/03/17/ | head -20
# -rw-r--r-- 1 pipeline pipeline 288G Mar 17 06:28 orders_debug_dump.csv
# → A 288 GB debug dump file was accidentally enabled
# Step 6: Check the log to confirm
tail -100 /var/log/pipelines/orders_pipeline_20260317.log | grep -i debug
# 2026-03-17 04:32:14 WARNING DEBUG_MODE=true detected — writing full row dump
# Step 7: Kill the stuck pipeline gracefully
kill -15 18734 # SIGTERM first
sleep 5
kill -0 18734 2>/dev/null && kill -9 18734 # force if still alive
# Step 8: Free the disk — delete the debug dump
rm /data/raw/2026/03/17/orders_debug_dump.csv
df -h /data
# Now shows 212G available — problem resolved
# Step 9: Fix the config and restart
sed -i 's/DEBUG_MODE=true/DEBUG_MODE=false/' /etc/pipelines/orders.env
python3 /data/pipelines/orders_pipeline.py --date 2026-03-17 >> /var/log/pipelines/orders_pipeline_20260317.log 2>&1 &
echo "Restarted with PID $!"
# Step 10: Monitor the restart
tail -f /var/log/pipelines/orders_pipeline_20260317.log | grep -E "INFO|ERROR"
# Watch for normal progress messages...
# 2026-03-17 07:03:41 INFO Batch 1 complete: 10000 rows
# 2026-03-17 07:04:28 INFO Batch 2 complete: 10000 rows
Total time from alert to resolution: 22 minutes. Every command used in this diagnosis was from this module. A data engineer who knows these Linux tools reaches the root cause in minutes. One who does not might spend hours opening tickets, waiting for escalations, or guessing.
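The first two steps of this sequence can be condensed into a reusable first-pass helper. A sketch, not part of the original runbook: the function name, output format, and the 90% threshold are illustrative.

```shell
#!/usr/bin/env bash
set -euo pipefail

# triage: first-pass checks for a stuck pipeline (process + disk), illustrative
triage() {
    local proc_pattern="$1"   # process name or pattern to look for
    local data_path="$2"      # mount point to check

    echo "== Process check =="
    pgrep -af "$proc_pattern" || echo "No matching process running"

    echo "== Disk check =="
    local use_pct
    use_pct=$(df -P "$data_path" | awk 'NR==2 {gsub(/%/, ""); print $5}')
    echo "Usage on $data_path: ${use_pct}%"
    if [[ $use_pct -ge 90 ]]; then
        echo "Disk nearly full, top consumers:"
        du -sh "$data_path"/* 2>/dev/null | sort -rh | head -5 || true
    fi
}
```

Call it as `triage orders_pipeline /data` to run both checks in one command before digging deeper with `du`, `lsof`, and the log files.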
🎯 Key Takeaways
- ✓Every data pipeline runs on Linux. SSH, navigate, inspect files, check disk usage, and read logs — these are the first actions in every on-call incident. du -sh and df -h diagnose disk problems in seconds. tail -f follows logs in real time. find locates files by name, age, or size.
- ✓grep, awk, sed, and cut are the fastest tools for log analysis. grep -E with multiple patterns, -C for context lines, and -v for exclusion. awk processes column-by-column with calculations. sed substitutes and deletes in-place. Pipe them together: cut | sort | uniq -c | sort -rn gives frequency distributions in one line.
- ✓Linux file permissions are three sets of rwx for owner, group, and others. 755 for executable scripts, 644 for config files, 600 for secret files. chmod +x adds execute permission. Always diagnose "Permission denied" errors with ls -lah to read the permission bits.
- ✓kill -15 (SIGTERM) requests graceful shutdown — the pipeline can clean up. kill -9 (SIGKILL) forces immediate termination with no cleanup — open files may be corrupted. Always try SIGTERM first. nohup command & runs a process that survives shell logout.
- ✓Every production bash script must start with set -euo pipefail. -e exits on any command failure. -u errors on unset variables. -o pipefail fails the whole pipeline if any stage fails. Without these, bash silently continues through errors.
- ✓Cron runs in a minimal environment — it does not load .bashrc. Always use absolute paths in cron commands. Always redirect output: command >> /log/file.log 2>&1. Always set required environment variables explicitly in the script or at the top of the crontab. Use crontab -e to edit and grep CRON /var/log/syslog to debug.
- ✓The production bash script template includes: set -euo pipefail, a logging function, a trap for cleanup on exit, a lock file to prevent duplicate runs, precondition checks (disk space, env variables, file existence), output redirection, and post-run validation.
- ✓rsync is more efficient than scp for directory synchronisation — it only transfers changed files. Use --dry-run to preview before executing. Use --delete to keep destination in sync. curl -H "Authorization: Bearer $TOKEN" handles authenticated API downloads. aws s3 sync handles S3 directory synchronisation.
- ✓For text processing at scale: pipe composition is the key. Each tool does one thing well — grep filters, cut extracts columns, sort orders, uniq -c counts, head limits. Chaining them with pipes builds powerful one-line analyses without writing a single Python script.
- ✓The diagnostic sequence for a stuck pipeline: ps aux to check it is running, df -h to check disk, du -sh to find the space consumer, lsof -p PID to check open files, tail -f on the log file to see last activity, and pg_stat_activity to check for database lock waits. These six checks diagnose 90% of production pipeline failures.
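The frequency-distribution idiom from the takeaways above (cut | sort | uniq -c | sort -rn) in action. A minimal sketch against a fabricated log at /tmp/sample.log, where comma-separated field 3 is the log level:

```shell
# Fabricated sample log (illustrative data):
printf '%s\n' \
  '2026-03-17,orders,ERROR,timeout' \
  '2026-03-17,orders,INFO,batch done' \
  '2026-03-17,orders,ERROR,timeout' \
  '2026-03-17,orders,WARN,slow query' > /tmp/sample.log

# Frequency distribution of log levels, most frequent first:
cut -d',' -f3 /tmp/sample.log | sort | uniq -c | sort -rn
#   2 ERROR
#   1 WARN
#   1 INFO
```

The same four-stage pipe works unchanged on a multi-gigabyte log; each tool streams, so nothing is loaded into memory.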
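The permission modes from the takeaways (600 for secrets, 755 for scripts), sketched on throwaway files. Note stat -c is GNU coreutils; macOS uses stat -f '%A' instead:

```shell
touch /tmp/secrets.env
chmod 600 /tmp/secrets.env            # owner read/write only: right for secret files
stat -c '%a %n' /tmp/secrets.env      # → 600 /tmp/secrets.env

touch /tmp/run_pipeline.sh
chmod 755 /tmp/run_pipeline.sh        # owner rwx, group/others r-x: right for scripts
stat -c '%a %n' /tmp/run_pipeline.sh  # → 755 /tmp/run_pipeline.sh
```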
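The graceful-shutdown side of the SIGTERM takeaway: a process installs a trap so that kill -15 triggers cleanup before exit. A minimal sketch with a hypothetical scratch file standing in for open work:

```shell
# A job that cleans up on SIGTERM instead of dying mid-write (scratch path is hypothetical)
run_job() {
    local scratch="/tmp/pipeline_scratch_$$"
    trap 'echo "SIGTERM received, cleaning up"; rm -f "$scratch"; exit 0' TERM
    touch "$scratch"                 # simulate an open work file
    while true; do sleep 0.2; done   # simulate long-running work
}

run_job &                   # start in the background
job_pid=$!
sleep 1
kill -15 "$job_pid"         # SIGTERM: the trap runs and removes the scratch file
wait "$job_pid" 2>/dev/null # kill -9 would skip the trap and leave the file behind
```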
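What each set flag from the takeaways actually catches, demonstrated in throwaway bash -c subshells so the failures do not kill the current shell:

```shell
# -e: stop at the first failing command
bash -c 'set -e; false; echo "never printed"' || echo "-e stopped the script"

# -u: treat expansion of an unset variable as an error
bash -c 'set -u; echo "$SOME_UNSET_VAR"' 2>/dev/null || echo "-u caught the unset variable"

# -o pipefail: a failure in ANY stage fails the whole pipe, not just the last stage
bash -c 'set -o pipefail; false | true' || echo "pipefail caught the failed stage"
bash -c 'false | true' && echo "without pipefail the failure is silently masked"
```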