Git & Version Control — Track Every Change in Your Data Projects

Version control is essential for data analysts. It lets you track changes to scripts, notebooks, and SQL queries — roll back mistakes, collaborate with teammates, and maintain a clean project history.

Why Git for Data Analytics?

  • Reproducibility: Know exactly what code produced a report
  • Collaboration: Multiple analysts work on the same project without overwriting each other
  • Safety: Undo mistakes by reverting to any previous version
  • Accountability: See who changed what and when via git log and git blame

Core Git Workflow

Working Directory  →  Staging Area  →  Local Repository  →  Remote (GitHub)
   (edit files)       (git add)        (git commit)         (git push)

Essential Commands

Command        Purpose                   Example
git init       Initialize a new repo     git init my-analysis
git status     Check file states         git status
git add        Stage files for commit    git add analysis.py
git commit     Save a snapshot           git commit -m "Add Q1 analysis"
git log        View commit history       git log --oneline
git diff       See uncommitted changes   git diff analysis.py
git branch     Create/list branches      git branch feature-q2
git checkout   Switch branches           git checkout feature-q2
git merge      Combine branches          git merge feature-q2
git push       Upload to remote          git push origin main
git pull       Download latest changes   git pull origin main
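
As a rough sketch, the local commands above can be run end to end in a throwaway repository. The temporary directory, the file analysis.py and its contents, and the demo identity are placeholders chosen for illustration, not part of any real project.

```shell
set -eu
repo=$(mktemp -d)                          # throwaway directory (assumption)
cd "$repo"
git init -q                                # create the .git folder
git config user.name "Demo Analyst"        # placeholder identity, local to this repo
git config user.email "demo@example.com"

echo 'print("Q1 revenue")' > analysis.py   # hypothetical script
git status --short                         # analysis.py is listed as untracked (??)
git add analysis.py                        # stage it
git commit -q -m "Add Q1 analysis"         # save the snapshot

echo 'print("Q1 revenue by region")' > analysis.py
git diff analysis.py                       # show the uncommitted edit
git commit -q -am "Break Q1 revenue down by region"   # -a stages tracked files first

commit_count=$(git rev-list --count HEAD)  # number of commits reachable from HEAD
git log --oneline                          # newest commit first
```

git push and git pull are left out here because they need a remote to talk to.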

.gitignore for Data Projects

Data files, credentials, and environment folders should never be committed:

# .gitignore for data analytics projects
*.csv
*.xlsx
*.parquet
data/raw/
data/processed/
.env
__pycache__/
.ipynb_checkpoints/
*.pyc
venv/
.DS_Store
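
To confirm that the patterns behave as intended, git check-ignore reports which rule, if any, matches a path. The sketch below uses a shortened copy of the file above; sales.csv and analysis.py are hypothetical file names.

```shell
set -eu
repo=$(mktemp -d)                              # throwaway directory (assumption)
cd "$repo"
git init -q

# A shortened version of the .gitignore above
printf '%s\n' '*.csv' 'data/raw/' '.env' '.ipynb_checkpoints/' > .gitignore

mkdir -p data/raw
touch data/raw/sales.csv .env analysis.py      # hypothetical files

git check-ignore -v data/raw/sales.csv .env    # print the matching rule for each path
ignored=$(git check-ignore data/raw/sales.csv .env analysis.py || true)
visible=$(git status --short)                  # untracked paths that are NOT ignored
echo "$visible"
```

Anything that still shows up in git status would be picked up by a careless git add, so this is a cheap check to run before the first commit.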

Branching Strategy

  • main — production-ready code and final reports
  • feature/q2-analysis — work-in-progress analysis
  • fix/data-cleaning-bug — bug fixes

Merge into main only when work is complete and reviewed.

Detailed Theory

Git is a time machine for files. Every save (a *commit*) is a snapshot you can return to. Every parallel idea (a *branch*) is its own timeline you can develop without breaking the main one. For analysts this matters because notebooks, SQL files, and dashboards are real code — and code without version control gets lost, overwritten, or unreproducible.

What Git Actually Is

Git tracks the contents of a folder (repo) over time. Four places to know:

Place                  What lives there                     Move with
Working directory      files you're editing                 (edit files)
Staging area (index)   changes queued for the next commit   git add
Local repo (.git)      committed history                    git commit
Remote (GitHub)        shared copy for the team             git push / git pull

Almost every Git command is just moving content between these four places.
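
One way to watch a file travel through these places is to read git status --short after each step. This is a sketch only: a local bare repository stands in for GitHub, and query.sql is a hypothetical file.

```shell
set -eu
work=$(mktemp -d)
git init -q --bare "$work/remote.git"        # local stand-in for GitHub (assumption)
git init -q "$work/project"
cd "$work/project"
git config user.name "Demo Analyst"          # placeholder identity
git config user.email "demo@example.com"
git remote add origin "$work/remote.git"

echo "SELECT 1;" > query.sql                 # hypothetical file
in_worktree=$(git status --short)            # "?? query.sql": only in the working directory
git add query.sql
in_index=$(git status --short)               # "A  query.sql": queued in the staging area
git commit -q -m "Add placeholder query"
in_repo=$(git status --short)                # empty: recorded in the local repo
git branch -M main                           # make sure the branch is called main
git push -q -u origin main                   # copy the commit to the remote
on_remote=$(git ls-remote --heads origin main)   # the remote now lists refs/heads/main

printf '%s\n' "$in_worktree" "$in_index" "$on_remote"
```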

The 8 Commands That Cover Most Days

git init                 # start a repo (or git clone <url>)
git status               # what changed?
git add file             # stage a change
git commit -m "why"      # save the staged snapshot
git log --oneline        # see history
git branch new-thing     # branch off
git checkout new-thing   # switch branch  (or git switch)
git merge new-thing      # bring it back into main

Learn these well before chasing rebase/cherry-pick/reflog.

Beginner Mistakes to Skip

  1. Committing data, secrets, or huge files. Add a .gitignore *before* your first commit: *.csv, .env, __pycache__/, .ipynb_checkpoints/.
  2. Useless commit messages like "update" or "fix". Future-you cannot read your mind.
  3. Editing on main. Always branch for any non-trivial change.
  4. Force-pushing to a shared branch. git push --force rewrites history that others have. If you must force-push, use --force-with-lease, and never force-push main.
  5. Pulling without committing local work. Either commit or git stash first; otherwise Git may refuse the pull, or you end up resolving conflicts on top of unsaved edits.
  6. Committing notebooks with their output cells. Outputs make giant diffs; clear them or use nbstripout.

Intermediate: Branching Mental Model

A branch is just a *pointer to a commit*. Creating one is instant.

main:    A → B → C
                  \
feature:           D → E

Merging feature into main produces F, a merge commit:

main:    A → B → C →───────────F
                  \         /
feature:           D → E →──┘

A *fast-forward* merge happens when main hasn't moved — Git just slides the pointer with no merge commit.
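
The difference is easy to see by counting parents: a fast-forwarded tip still has one parent, while a merge commit has two. The sketch below uses empty commits and made-up branch names (feature-ff, feature-split) purely to keep it short.

```shell
set -eu
repo=$(mktemp -d)                          # throwaway directory (assumption)
cd "$repo"
git init -q
git config user.name "Demo Analyst"        # placeholder identity
git config user.email "demo@example.com"
git commit -q --allow-empty -m "Start project"
git branch -M main

# Case 1: main has not moved, so the merge is a fast-forward
git checkout -q -b feature-ff
git commit -q --allow-empty -m "Feature work"
git checkout -q main
git merge -q feature-ff                    # main's pointer slides forward
parents_ff=$(git cat-file -p HEAD | grep -c '^parent ')

# Case 2: main gains its own commit, so Git has to create a merge commit
git checkout -q -b feature-split
git commit -q --allow-empty -m "More feature work"
git checkout -q main
git commit -q --allow-empty -m "Hotfix on main"
git merge -q --no-edit feature-split       # creates the merge commit
parents_merge=$(git cat-file -p HEAD | grep -c '^parent ')

echo "fast-forward parents: $parents_ff, merge parents: $parents_merge"
git log --oneline --graph                  # draws the fork and the join
```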

Intermediate: Resolving Merge Conflicts

Git can't auto-merge changes that touch the same lines. It marks them:

<<<<<<< HEAD
revenue = sales * 1.10
=======
revenue = sales * 1.08
>>>>>>> feature

Fix: choose one (or combine), delete the markers, then git add + git commit. git merge --abort backs out if you panic.
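
The conflict above can be reproduced and resolved in a scratch repository. This is a sketch: model.py is a hypothetical file, and the resolution simply keeps main's version.

```shell
set -eu
repo=$(mktemp -d)                              # throwaway directory (assumption)
cd "$repo"
git init -q
git config user.name "Demo Analyst"            # placeholder identity
git config user.email "demo@example.com"

echo "revenue = sales" > model.py              # hypothetical file
git add model.py
git commit -q -m "Add revenue model"
git branch -M main

git checkout -q -b feature
echo "revenue = sales * 1.08" > model.py
git commit -q -am "Use 8% uplift"

git checkout -q main
echo "revenue = sales * 1.10" > model.py
git commit -q -am "Use 10% uplift"

git merge feature || true                      # stops on the conflict; keep the script going
conflicted=$(git diff --name-only --diff-filter=U)   # files that are still unmerged
cat model.py                                   # shows the conflict markers

echo "revenue = sales * 1.10" > model.py       # resolve: keep main's line, markers removed
git add model.py                               # mark the file as resolved
git commit -q --no-edit                        # conclude the merge
parents=$(git cat-file -p HEAD | grep -c '^parent ')
```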

Intermediate: Stash, Diff, Restore

git stash               # park dirty work
git stash pop           # bring it back
git diff                # unstaged changes
git diff --staged       # what's about to be committed
git restore file        # discard local edits to file
git restore --staged f  # un-stage f

These six commands save you from "oh no I changed the wrong file" panic.
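
A short sketch of these commands in sequence, assuming Git 2.23 or newer (the release that introduced git restore). report.sql and its contents are hypothetical.

```shell
set -eu
repo=$(mktemp -d)                      # throwaway directory (assumption)
cd "$repo"
git init -q
git config user.name "Demo Analyst"    # placeholder identity
git config user.email "demo@example.com"

echo "v1" > report.sql                 # hypothetical file
git add report.sql
git commit -q -m "Add report query"

echo "v2 draft" > report.sql           # dirty, uncommitted edit
git stash                              # park it; the file is back to the committed version
after_stash=$(cat report.sql)
git stash pop                          # bring the draft back
after_pop=$(cat report.sql)

git add report.sql
git diff --staged --stat               # what is about to be committed
git restore --staged report.sql        # un-stage; the edit stays on disk
git restore report.sql                 # discard the edit entirely
after_restore=$(cat report.sql)
```

Note that git restore on a file is not undoable: the discarded edit was never committed or stashed, so Git has no copy of it.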

Intermediate: GitHub PR Workflow

A widely used team flow:

  1. git checkout -b feat/add-region-filter
  2. Commit small, logical changes.
  3. git push -u origin feat/add-region-filter
  4. Open a Pull Request on GitHub. Describe *why*.
  5. Reviewers comment; you push more commits to the same branch.
  6. Squash-merge into main when approved.
  7. Delete the branch.

Protect main: required reviews, passing CI, no direct pushes.
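
The command-line side of this flow can be rehearsed without GitHub. In the sketch below a local bare repository plays the role of the remote, the squash-merge is done locally instead of through the PR button, and filter.sql is a hypothetical change.

```shell
set -eu
work=$(mktemp -d)
git init -q --bare "$work/remote.git"          # stand-in for the GitHub repo (assumption)
git init -q "$work/project"
cd "$work/project"
git config user.name "Demo Analyst"            # placeholder identity
git config user.email "demo@example.com"
git remote add origin "$work/remote.git"
git commit -q --allow-empty -m "Initial commit"
git branch -M main
git push -q -u origin main

git checkout -q -b feat/add-region-filter      # step 1
echo "WHERE region = 'EMEA'" > filter.sql      # hypothetical change
git add filter.sql
git commit -q -m "Add region filter to sales query"   # step 2
git push -q -u origin feat/add-region-filter   # step 3
# Steps 4 to 6 (open the PR, review, squash-merge) normally happen on GitHub.

# Local equivalent of the squash-merge, followed by the step 7 cleanup
git checkout -q main
git merge -q --squash feat/add-region-filter   # stage the branch's changes as one set
git commit -q -m "Add region filter to sales query"
git push -q origin main
git branch -D feat/add-region-filter           # -D: after a squash Git sees the branch as unmerged
git push -q origin --delete feat/add-region-filter
remaining=$(git ls-remote --heads origin)      # branches left on the remote
echo "$remaining"
```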

Intermediate: Good Commit Messages

<short imperative summary, ≤50 chars>

<optional body explaining WHY, wrapped at 72 chars>

Examples:

  • Good: Fix null handling in Q1 revenue calculation
  • Good: Add region slicer to sales dashboard
  • Bad: update, final, asdf

Many teams use Conventional Commits (feat:, fix:, chore:) to auto-generate changelogs.
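
The summary-line rules are simple enough to check mechanically. The function below is a hypothetical helper, not a Git built-in; the list of vague words is only an assumption drawn from the examples above.

```shell
# check_subject: hypothetical helper that applies the summary-line rules above
check_subject() {
  subject=$1
  if [ "${#subject}" -gt 50 ]; then            # longer than 50 characters
    echo "too long"
    return 1
  fi
  case "$subject" in
    ""|update|final|fix|asdf)                  # empty or one of the vague messages
      echo "too vague"
      return 1
      ;;
  esac
  echo "ok"
}

check_subject "Fix null handling in Q1 revenue calculation"
check_subject "update" || true                 # rejected, but keep going
```

A check like this could be called from a commit-msg hook (.git/hooks/commit-msg), which Git runs with the path of the message file as its first argument.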

Advanced: Rebase vs Merge

Both combine histories; they look different.

  • Merge preserves the actual timeline (with merge commits).
  • Rebase *replays* your branch on top of main, producing a linear history.

git checkout feature
git rebase main      # replay feature commits onto current main

Rule of thumb: rebase before pushing, merge after. Never rebase a branch others are working on: it replaces the commits their work is built on.
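
A sketch of a rebase on a private branch, using the commit labels A, C and D from the branching diagrams. The file names are placeholders.

```shell
set -eu
repo=$(mktemp -d)                      # throwaway directory (assumption)
cd "$repo"
git init -q
git config user.name "Demo Analyst"    # placeholder identity
git config user.email "demo@example.com"

echo "a" > base.txt
git add base.txt
git commit -q -m "A"
git branch -M main

git checkout -q -b feature
echo "d" > feature.txt
git add feature.txt
git commit -q -m "D"                   # feature-only commit

git checkout -q main
echo "c" > main.txt
git add main.txt
git commit -q -m "C"                   # main moves on in the meantime

git checkout -q feature
git rebase -q main                     # replay D on top of C
subjects=$(git log --format=%s | tr '\n' ' ')   # commit subjects, newest first
merges=$(git rev-list --merges --count HEAD)    # a linear history has no merge commits
git log --oneline --graph
```

After the rebase, D has a new hash because its parent changed, which is exactly why rebasing a shared branch causes trouble.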

Advanced: Undo Toolkit

git commit --amend            # fix the last commit
git reset HEAD~1              # un-commit, keep changes
git reset --hard HEAD~1       # un-commit AND discard (dangerous)
git revert <hash>             # safe undo: makes a new commit
git reflog                    # last-resort recovery of "lost" commits

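reflog is your safety net: with default settings, reflog entries are kept for 90 days, or 30 days once the commit is no longer reachable from the branch tip, so "lost" commits stay recoverable for at least about a month. The reflog is local only and is never pushed.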
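
A sketch of the "dangerous" reset followed by a reflog recovery. notes.txt is a hypothetical file.

```shell
set -eu
repo=$(mktemp -d)                      # throwaway directory (assumption)
cd "$repo"
git init -q
git config user.name "Demo Analyst"    # placeholder identity
git config user.email "demo@example.com"

echo "v1" > notes.txt
git add notes.txt
git commit -q -m "Add notes"
echo "v2" > notes.txt
git commit -q -am "Update notes with Q2 figures"
lost=$(git rev-parse HEAD)             # remember the commit that is about to be dropped

git reset -q --hard HEAD~1             # un-commit AND discard: notes.txt is "v1" again
after_reset=$(cat notes.txt)

git reflog -n 3                        # the dropped commit is still listed as HEAD@{1}
git reset -q --hard 'HEAD@{1}'         # jump back to where HEAD was before the reset
recovered=$(git rev-parse HEAD)
after_recover=$(cat notes.txt)
```

Only committed work can be recovered this way; edits that were never committed are gone after reset --hard.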

Advanced: Tags, Releases & Bisect

  • git tag v1.0.0 — mark a release point. Push with git push --tags.
  • GitHub Releases attach binaries / changelogs to a tag.
  • git bisect does a binary search through history to find which commit introduced a bug — magical when you need it.
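
A sketch of git bisect run, which automates the search with a test command: exit code 0 marks a commit good, and most non-zero codes mark it bad. The five-commit history, the file model.py and the "wrong rate" bug are all made up for the demonstration.

```shell
set -eu
repo=$(mktemp -d)                      # throwaway directory (assumption)
cd "$repo"
git init -q
git config user.name "Demo Analyst"    # placeholder identity
git config user.email "demo@example.com"

# Five commits; the "bug" (a wrong rate) sneaks in at revision 4
for i in 1 2 3 4 5; do
  if [ "$i" -ge 4 ]; then rate=1.80; else rate=1.08; fi
  echo "rate = $rate  # revision $i" > model.py
  git add model.py
  git commit -q -m "Revision $i"
  if [ "$i" -eq 1 ]; then git tag v1.0.0; fi   # mark the known-good release
done

git bisect start HEAD v1.0.0 >/dev/null        # bad = HEAD, good = the tag
# The test command succeeds (good) only when the wrong rate is absent
git bisect run sh -c '! grep -qF "1.80" model.py' >/dev/null
culprit=$(git log -1 --format=%s refs/bisect/bad)   # first bad commit found
git bisect reset >/dev/null 2>&1               # return to the original branch
echo "First bad commit: $culprit"
```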

Advanced: Notebooks in Git (Analyst-Specific)

Notebooks store outputs and execution counts inline — every re-run is a giant diff. Two fixes:

  1. nbstripout: a Git filter (also usable as a pre-commit hook) that strips outputs before they are committed.
  2. Jupytext: pairs the notebook with a clean .py or .md file that is the source of truth in Git.

For data files, use DVC or Git LFS instead of committing CSVs directly.

Advanced: Pre-commit Hooks & CI

Wire up automated checks so bad code never lands:

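# .pre-commit-config.yaml
# pre-commit requires a pinned rev for every repo. The tags below are
# example pins; run `pre-commit autoupdate` to move to current releases.
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.5.0
    hooks:
      - id: trailing-whitespace
      - id: end-of-file-fixer
  - repo: https://github.com/psf/black
    rev: 24.2.0
    hooks: [{ id: black }]
  - repo: https://github.com/kynan/nbstripout
    rev: 0.7.1
    hooks: [{ id: nbstripout }]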

GitHub Actions can run the same on every PR — lint, tests, dbt, whatever you need.

Practice Path

  1. Initialise a repo for a notebook project, add a .gitignore that excludes data and .ipynb_checkpoints.
  2. Create a feature branch, make 3 commits with proper messages, open a PR on GitHub.
  3. Trigger an intentional merge conflict, resolve it, finish the merge.
  4. Add nbstripout (or pre-commit) and confirm notebook outputs are no longer in diffs.