Version control is essential for data analysts. It lets you track changes to scripts, notebooks, and SQL queries — roll back mistakes, collaborate with teammates, and maintain a clean project history.
History can be inspected with git log and git blame.
Working Directory → Staging Area → Local Repository → Remote (GitHub)
(edit files)        (git add)      (git commit)       (git push)
| Command | Purpose | Example |
| git init | Initialize a new repo | git init my-analysis |
| git status | Check file states | git status |
| git add | Stage files for commit | git add analysis.py |
| git commit | Save a snapshot | git commit -m "Add Q1 analysis" |
| git log | View commit history | git log --oneline |
| git diff | See uncommitted changes | git diff analysis.py |
| git branch | Create/list branches | git branch feature-q2 |
| git checkout | Switch branches | git checkout feature-q2 |
| git merge | Combine branches | git merge feature-q2 |
| git push | Upload to remote | git push origin main |
| git pull | Download latest changes | git pull origin main |
Data files, credentials, and environment folders should never be committed:
# .gitignore for data analytics projects
*.csv
*.xlsx
*.parquet
data/raw/
data/processed/
.env
__pycache__/
.ipynb_checkpoints/
*.pyc
venv/
.DS_Store

- main: production-ready code and final reports
- feature/q2-analysis: work-in-progress analysis
- fix/data-cleaning-bug: bug fixes
- Merge into main only when work is complete and reviewed.

Git is a time machine for files. Every save (a *commit*) is a snapshot you can return to. Every parallel idea (a *branch*) is its own timeline you can develop without breaking the main one. For analysts this matters because notebooks, SQL files, and dashboards are real code, and code without version control gets lost, overwritten, or becomes impossible to reproduce.
Git tracks the contents of a folder (repo) over time. Four places to know:
| Place | What lives there | Move with |
| Working directory | files you're editing | — |
| Staging area (index) | changes queued for the next commit | git add |
| Local repo (.git) | committed history | git commit |
| Remote (GitHub) | shared copy for the team | git push / git pull |
Almost every Git command is just moving content between these four places.
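The "moving content between places" idea can be made concrete with a toy model. This is only a sketch for building intuition: the class ToyRepo and its methods are hypothetical names, and real Git stores objects very differently.

```python
from dataclasses import dataclass, field


@dataclass
class ToyRepo:
    """Toy model of Git's four places; not how Git stores data internally."""
    working: dict = field(default_factory=dict)   # path -> content being edited
    staging: dict = field(default_factory=dict)   # changes queued for the next commit
    local: list = field(default_factory=list)     # committed snapshots, oldest first
    remote: list = field(default_factory=list)    # shared copy of the history

    def add(self, path):
        # like `git add path`: copy the current content into the staging area
        self.staging[path] = self.working[path]

    def commit(self, message):
        # like `git commit -m message`: previous snapshot plus the staged changes
        previous = self.local[-1]["files"] if self.local else {}
        self.local.append({"message": message, "files": {**previous, **self.staging}})
        self.staging = {}

    def push(self):
        # like `git push`: send the commits the remote does not have yet
        self.remote.extend(self.local[len(self.remote):])


repo = ToyRepo()
repo.working["analysis.py"] = "print('Q1')"
repo.add("analysis.py")
repo.commit("Add Q1 analysis")
repo.working["analysis.py"] = "print('Q1 v2')"   # edited again, but not staged
repo.push()
```

After this runs, the remote holds one commit containing the first version of analysis.py, while the second edit exists only in the working directory. That is the situation git status reports as "modified, not staged".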
git init # start a repo (or git clone <url>)
git status # what changed?
git add file # stage a change
git commit -m "why" # save the staged snapshot
git log --oneline # see history
git branch new-thing # branch off
git checkout new-thing # switch branch (or git switch)
git merge new-thing # bring it back into main (run this while on main)
Learn these well before chasing rebase/cherry-pick/reflog.
1. Committing data, secrets, or huge files. Add a .gitignore *before* your first commit: *.csv, .env, __pycache__/, .ipynb_checkpoints/.
2. Useless commit messages like "update" or "fix". Future-you cannot read your mind.
3. Editing on main. Always branch for any non-trivial change.
4. Force-pushing to a shared branch. git push --force overwrites history that others have already pulled. If you must force-push your own branch, use --force-with-lease, and never force-push main.
5. Pulling without committing local work. Either commit or git stash first, so an incoming change cannot collide with edits that exist nowhere in history.
6. Committing notebooks with their outputs. Output cells make giant diffs; clear them before committing or use nbstripout.
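Pitfall 1 is easy to check for mechanically. The sketch below flags paths that look like data, secrets, or environment folders. The function name risky_paths and the pattern lists are illustrative assumptions, and fnmatch only approximates real .gitignore matching (no negation, no anchoring rules).

```python
from fnmatch import fnmatch

# Patterns borrowed from the .gitignore advice in this lesson (assumed list).
RISKY_PATTERNS = ["*.csv", "*.xlsx", "*.parquet", ".env"]
RISKY_DIRS = {"__pycache__", ".ipynb_checkpoints", "venv"}


def risky_paths(paths):
    """Return the paths that look like data, secrets, or environment folders."""
    flagged = []
    for path in paths:
        parts = path.split("/")
        name = parts[-1]
        # a file is risky if any parent folder is an environment/cache folder
        in_risky_dir = any(part in RISKY_DIRS for part in parts[:-1])
        if in_risky_dir or any(fnmatch(name, pattern) for pattern in RISKY_PATTERNS):
            flagged.append(path)
    return flagged


staged = ["analysis.py", "data/raw/sales.csv", ".env", "venv/lib/site.py", "README.md"]
print(risky_paths(staged))
```

In practice the list of paths could come from git diff --cached --name-only, which prints the files currently staged.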
A branch is just a *pointer to a commit*. Creating one is instant.
main:    A → B → C
                  \
feature:           D → E
If main has gained commits of its own since the branch point (or you pass --no-ff), merging feature into main produces F, a merge commit:
main:    A → B → C ─────────── F
                  \           /
feature:           D → E ────┘
A *fast-forward* merge happens when main hasn't moved: Git just slides the pointer forward with no merge commit.
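The difference between a fast-forward and a merge commit follows directly from "a branch is a pointer". Here is a minimal sketch under that model; the dictionaries parents and branches and the function names are hypothetical, and real Git also merges file contents, which this ignores.

```python
# Each commit records its parent(s); a branch is only a name pointing at a commit.
parents = {"A": [], "B": ["A"], "C": ["B"], "D": ["C"], "E": ["D"]}
branches = {"main": "C", "feature": "E"}


def is_ancestor(older, newer):
    """True if `older` is reachable from `newer` by following parent links."""
    stack, seen = [newer], set()
    while stack:
        commit = stack.pop()
        if commit == older:
            return True
        if commit not in seen:
            seen.add(commit)
            stack.extend(parents[commit])
    return False


def merge(target, source, new_id):
    """Merge `source` into `target` and report which kind of merge happened."""
    ours, theirs = branches[target], branches[source]
    if is_ancestor(theirs, ours):
        return "already up to date"
    if is_ancestor(ours, theirs):
        branches[target] = theirs          # fast-forward: slide the pointer
        return "fast-forward"
    parents[new_id] = [ours, theirs]       # merge commit with two parents
    branches[target] = new_id
    return "merge commit"


print(merge("main", "feature", "F"))       # main has not moved since C

# Now both sides move: one new commit on main, one on a branch cut from E.
parents["G"] = ["E"]
branches["main"] = "G"
parents["H"] = ["E"]
branches["fix"] = "H"
print(merge("main", "fix", "M"))
```

The first call only moves main from C to E, so the id F is never used. The second call has to create M with two parents because neither tip is an ancestor of the other.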
Git can't auto-merge changes that touch the same lines. It marks them:
<<<<<<< HEAD
revenue = sales * 1.10
=======
revenue = sales * 1.08
>>>>>>> feature
Fix: choose one (or combine), delete the markers, then git add + git commit. git merge --abort backs out if you panic.
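The marker layout is regular enough to read with a few lines of code, which helps when you want to list every conflict in a file before editing. This is a sketch that assumes the default two-way marker style shown above (not the diff3 style with a ||||||| base section); parse_conflicts is a hypothetical helper, not a Git command.

```python
def parse_conflicts(text):
    """Return a list of (ours, theirs) line lists, one pair per conflict block."""
    conflicts, ours, theirs, state = [], [], [], None
    for line in text.splitlines():
        if line.startswith("<<<<<<< "):
            ours, theirs, state = [], [], "ours"      # start of our side
        elif line.startswith("=======") and state == "ours":
            state = "theirs"                          # separator between sides
        elif line.startswith(">>>>>>> ") and state == "theirs":
            conflicts.append((ours, theirs))          # end of the block
            state = None
        elif state == "ours":
            ours.append(line)
        elif state == "theirs":
            theirs.append(line)
    return conflicts


sample = """<<<<<<< HEAD
revenue = sales * 1.10
=======
revenue = sales * 1.08
>>>>>>> feature
"""
print(parse_conflicts(sample))
```

For the sample, the result is one pair: the HEAD side with the 1.10 multiplier and the feature side with 1.08. Deciding which one is right remains a human judgment.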
git stash # park dirty work
git stash pop # bring it back
git diff # unstaged changes
git diff --staged # what's about to be committed
git restore file # discard local edits to file
git restore --staged f # un-stage f
These commands save you from "oh no I changed the wrong file" panic.
The industry-standard team flow:
1. git checkout -b feat/add-region-filter
2. Commit small, logical changes.
3. git push -u origin feat/add-region-filter.
4. Open a Pull Request on GitHub. Describe *why*.
5. Reviewers comment, you push more commits to the same branch.
6. Squash-merge into main when approved.
7. Delete the branch.
Protect main: required reviews, passing CI, no direct pushes.
<short imperative summary, ≤50 chars>

<optional body explaining WHY, wrapped at 72 chars>
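The length rules in this template can be checked automatically. The function below is a sketch, with check_commit_message as a hypothetical name; it tests only the mechanical parts (summary length, blank second line, body wrapping) and cannot judge whether the summary is imperative or explains why.

```python
def check_commit_message(message, summary_limit=50, body_limit=72):
    """Return a list of problems; an empty list means the message fits the template."""
    lines = message.splitlines()
    if not lines or not lines[0].strip():
        return ["missing summary line"]
    problems = []
    if len(lines[0]) > summary_limit:
        problems.append(f"summary is {len(lines[0])} chars (limit {summary_limit})")
    if len(lines) > 1 and lines[1].strip():
        problems.append("second line should be blank")
    # body starts on line 3 of the message
    for number, line in enumerate(lines[2:], start=3):
        if len(line) > body_limit:
            problems.append(f"line {number} exceeds {body_limit} chars")
    return problems


print(check_commit_message("Fix null handling in Q1 revenue calculation"))
print(check_commit_message("update\nchanged some stuff"))
```

A check like this could run from a commit-msg hook, which Git calls with the path of the file holding the proposed message.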
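Example messages:
- Good: Fix null handling in Q1 revenue calculation
- Good: Add region slicer to sales dashboard
- Bad: update, final, asdf
Consider a prefix convention (feat:, fix:, chore:) to auto-generate changelogs.
Merge and rebase both combine histories; they look different.
- Merge keeps both lines of history and joins them, usually with a merge commit.
- Rebase replays your commits on top of main, producing a linear history.
git checkout feature
git rebase main # replay feature commits onto current main
Rule of thumb: rebase before pushing, merge after. Never rebase a branch others are working on, because it rewrites commits they have already built on.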
git commit --amend # fix the last commit
git reset HEAD~1 # un-commit, keep changes
git reset --hard HEAD~1 # un-commit AND discard (dangerous)
git revert <hash> # safe undo: makes a new commit
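git reflog # last-resort recovery of "lost" commits
reflog is your safety net: with default settings, even "lost" commits stay recoverable for roughly 30 to 90 days (entries no longer reachable from the branch tip expire first).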
git tag v1.0.0 marks a release point. Push with git push --tags.
git bisect does a binary search through history to find which commit introduced a bug; magical when you need it.
Notebooks store outputs and execution counts inline, so every re-run is a giant diff. Two fixes:
1. nbstripout — a pre-commit hook that strips outputs.
2. Jupytext — pair the notebook with a clean .py or .md file that's the source of truth in Git.
For data files, use DVC or Git LFS instead of committing CSVs directly.
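To see why stripping outputs shrinks diffs, it helps to look at what gets removed. The sketch below illustrates the idea on a notebook in the nbformat 4 JSON layout. It is not nbstripout's actual implementation, and strip_outputs is a hypothetical helper.

```python
import json


def strip_outputs(notebook_json):
    """Return notebook JSON with code-cell outputs and execution counts removed."""
    notebook = json.loads(notebook_json)
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []              # drop rendered tables, plots, logs
            cell["execution_count"] = None    # drop the In [n] counter
    return json.dumps(notebook, indent=1, sort_keys=True)


# A tiny hand-written notebook used only for illustration.
raw = json.dumps({
    "cells": [
        {"cell_type": "code", "source": ["df.head()"], "execution_count": 7,
         "outputs": [{"output_type": "stream", "name": "stdout", "text": ["many rows"]}],
         "metadata": {}},
        {"cell_type": "markdown", "source": ["# Notes"], "metadata": {}},
    ],
    "metadata": {}, "nbformat": 4, "nbformat_minor": 5,
})
clean = json.loads(strip_outputs(raw))
```

The source of each cell is untouched, so re-running the notebook regenerates everything that was stripped. Running the function twice gives the same result, which is what keeps diffs stable.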
Wire up automated checks so bad code never lands:
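# .pre-commit-config.yaml
# pre-commit requires a rev for each repo. The pins below are example values
# (assumed); run "pre-commit autoupdate" to move them to current releases.
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.6.0
    hooks:
      - id: trailing-whitespace
      - id: end-of-file-fixer
  - repo: https://github.com/psf/black
    rev: 24.4.2
    hooks: [{ id: black }]
  - repo: https://github.com/kynan/nbstripout
    rev: 0.7.1
    hooks: [{ id: nbstripout }]

GitHub Actions can run the same checks on every PR: lint, tests, dbt, whatever you need.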
1. Initialise a repo for a notebook project, add a .gitignore that excludes data and .ipynb_checkpoints.
2. Create a feature branch, make 3 commits with proper messages, open a PR on GitHub.
3. Trigger an intentional merge conflict, resolve it, finish the merge.
4. Add nbstripout (or pre-commit) and confirm notebook outputs are no longer in diffs.