Git Basics for Data Analysts

2026-06-13 | Data Analysis | 4 min read


Git is version control: it records snapshots of your project so you can see what changed, go back to any earlier state, and work without fear of breaking things. For a data analyst, that means SQL scripts, Python notebooks, and configs with a real history — instead of analysis_final_v3_FINAL.sql.

You need a small subset of Git daily. This notebook covers that subset plus the recovery commands for when something goes wrong.

For the environment files you will be ignoring here (.venv, .env), see Python venv & Project Setup for Analysts.

One-Time Setup

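# identify yourself (stamped into every commit)
git config --global user.name  "Jake Wang"
git config --global user.email "you@example.com"
# name the first branch of every new repo "main", as the commands below assume (Git 2.28+)
git config --global init.defaultBranch main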

# start tracking an existing folder
cd my-analysis
git init

# or copy an existing repository from GitHub
git clone https://github.com/user/repo.git

How Git Thinks: Three Areas

Working directory  ──git add──►  Staging area  ──git commit──►  History
(your actual files)              (what goes in                 (permanent
                                  the next snapshot)            snapshots)

You edit files in the working directory, stage the changes you want to keep with git add, then commit the staged set as one snapshot with a message. Staging exists so one commit can contain exactly one logical change — not everything you happened to touch.
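To see the three areas in action, here is a minimal sketch in a throwaway repository. The temp directory, the file names, and the "Demo Analyst" identity are placeholders made up for the demo. Two files are created, only one is staged, and the commit contains just that one.

```shell
# throwaway repo so nothing here touches a real project
repo=$(mktemp -d) && cd "$repo"
git init -q -b main                        # -b needs Git 2.28+
git config user.name  "Demo Analyst"       # placeholder identity, local to this repo
git config user.email "demo@example.com"

echo "SELECT 1;" > report.sql
echo "SELECT 2;" > scratch.sql

git add report.sql                         # only report.sql enters the staging area
staged=$(git diff --cached --name-only)    # what the next commit will contain
git commit -q -m "Add report query"

committed=$(git ls-tree -r --name-only HEAD)   # files in the snapshot just made
leftover=$(git status --short)                 # scratch.sql is still untracked

echo "staged: $staged"
echo "committed: $committed"
echo "left in working directory: $leftover"
```

The untracked file stays in the working directory until you decide whether it belongs in a later commit.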

The Daily Loop

git status                        # what changed? always start here
git diff                          # line-by-line changes not yet staged
git add monthly_report.sql        # stage one file
git add .                         # stage everything (check status first!)
git commit -m "Add MoM comparison to monthly report"
git log --oneline                 # history, one line per commit

A good commit message completes the sentence "This commit will ..." — "Add station filter to revenue query", not "update" or "fix stuff". Future you is the reader.

How Often to Commit

One commit per finished thought: a query that now works, a chart that renders, a cleaning step that passes its checks. Small commits make history readable and rollbacks painless. If your message needs the word "and" twice, it should probably be two commits.

What NOT to Commit — .gitignore

Create a .gitignore file in the project root before the first commit:

# secrets — never in version control
.env

# virtual environment (rebuild it from requirements.txt instead)
.venv/

# data files — too big, often sensitive; keep the SCRIPT that creates them
*.csv
*.xlsx
data/

# python noise
__pycache__/
.ipynb_checkpoints/

# OS noise
.DS_Store

The rule: commit code and config, ignore data, secrets, and anything rebuildable. If a secret does get committed, deleting the file in a later commit does not help — the history still contains it. Rotate the key, then clean the history.
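If you are unsure whether a pattern actually covers a file, git check-ignore will tell you. The sketch below assumes a throwaway repository and made-up paths (exports/q1.csv, report.sql); the paths do not even need to exist for the check to work.

```shell
repo=$(mktemp -d) && cd "$repo"
git init -q -b main                        # -b needs Git 2.28+
printf '%s\n' '.env' '.venv/' '*.csv' 'data/' > .gitignore

# exit status 0 = ignored, 1 = not ignored
git check-ignore -q .env           && env_ignored=yes || env_ignored=no
git check-ignore -q exports/q1.csv && csv_ignored=yes || csv_ignored=no
git check-ignore -q report.sql     && sql_ignored=yes || sql_ignored=no

# -v also names the .gitignore file, line number, and pattern that matched
rule=$(git check-ignore -v exports/q1.csv)

echo ".env ignored: $env_ignored"
echo "exports/q1.csv ignored: $csv_ignored"
echo "report.sql ignored: $sql_ignored"
echo "matching rule: $rule"
```

Note that ignore rules only apply to files Git is not already tracking, which is why the untracking step matters.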

# started tracking a file before adding it to .gitignore?
git rm --cached .env        # untrack it but keep the file on disk, then commit

Branches

A branch is a parallel line of history — try an idea without touching the working version.

git branch                        # list branches; * marks the current one
git switch -c try-new-segments    # create a branch and move to it
# ...edit, add, commit as usual...
git switch main                   # back to the main line (files change on disk!)
git merge try-new-segments        # bring the branch's commits into main
git branch -d try-new-segments    # delete the merged branch

For solo analysis work, two habits cover most needs: keep main always working, and open a branch for anything experimental or multi-day.
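A minimal sketch of that habit, run in a throwaway repository with a made-up segments.sql and a placeholder identity: the experiment is committed on a branch, main stays untouched until the merge, and the branch is deleted afterwards.

```shell
repo=$(mktemp -d) && cd "$repo"
git init -q -b main                        # -b needs Git 2.28+
git config user.name  "Demo Analyst"       # placeholder identity
git config user.email "demo@example.com"

echo "SELECT * FROM trips;" > segments.sql
git add segments.sql && git commit -q -m "Add segments query"

git switch -q -c try-new-segments
echo "-- experimental: split by rider type" >> segments.sql
git commit -q -am "Try rider-type segments"

git switch -q main
on_main=$(grep -c experimental segments.sql)      # 0: main does not have the edit yet
git merge -q try-new-segments                     # fast-forward, main had no new commits
after_merge=$(grep -c experimental segments.sql)  # 1: the edit has arrived
git branch -q -d try-new-segments

branches=$(git branch --format='%(refname:short)')
echo "before merge: $on_main, after merge: $after_merge, branches left: $branches"
```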

Merge Conflicts

If both branches changed the same lines, the merge stops and marks the file:

<<<<<<< HEAD
WHERE entry_time >= '2025-01-01'
=======
WHERE entry_time >= '2024-06-01'
>>>>>>> try-new-segments

Edit the file to the version you actually want (delete the markers), then git add the file and git commit. A conflict is not an error; it is Git asking you to decide. If you would rather back out, git merge --abort returns the repository to its pre-merge state.
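The whole cycle can be reproduced safely in a throwaway repository. This is a sketch under assumptions: filter.sql and the placeholder identity are invented for the demo, and the resolution simply keeps the 2025 date.

```shell
repo=$(mktemp -d) && cd "$repo"
git init -q -b main                        # -b needs Git 2.28+
git config user.name  "Demo Analyst"       # placeholder identity
git config user.email "demo@example.com"

echo "WHERE entry_time >= '2024-01-01'" > filter.sql
git add filter.sql && git commit -q -m "Add date filter"

git switch -q -c try-new-segments
echo "WHERE entry_time >= '2024-06-01'" > filter.sql
git commit -q -am "Start segments from June 2024"

git switch -q main
echo "WHERE entry_time >= '2025-01-01'" > filter.sql
git commit -q -am "Start report from 2025"

# both branches changed the same line, so the merge stops with a non-zero exit
git merge try-new-segments > /dev/null 2>&1 || echo "merge stopped: conflict"
conflicted=$(git diff --name-only --diff-filter=U)   # files still unresolved
markers=$(grep -c '^<<<<<<<' filter.sql)             # conflict blocks in the file

# resolve: write the version you want, stage it, commit to finish the merge
echo "WHERE entry_time >= '2025-01-01'" > filter.sql
git add filter.sql
git commit -q -m "Merge try-new-segments, keep 2025 start date"

remaining=$(git diff --name-only --diff-filter=U)
merges=$(git rev-list --merges --count HEAD)
echo "conflicted: $conflicted ($markers block), merge commits now: $merges"
```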

Undoing Things

The commands you look up in a mild panic, in increasing order of severity:

| Situation | Command | Touches history? |
| --- | --- | --- |
| Discard unstaged edits in one file | `git restore file.sql` | No |
| Unstage a file (keep the edits) | `git restore --staged file.sql` | No |
| Fix the last commit message / add a forgotten file | `git commit --amend` | Rewrites last commit |
| Undo a pushed commit safely | `git revert <commit>` | No (adds an opposite commit) |
| Move the branch back, keep file changes | `git reset --soft <commit>` | Yes |
| Move back and discard everything | `git reset --hard <commit>` | Yes (destructive) |

# see an old version of a file, or bring it back
git log --oneline -- monthly_report.sql      # which commits touched this file
git show a1b2c3d:monthly_report.sql          # print that version
git restore --source a1b2c3d monthly_report.sql   # restore it into the working dir

Rule of thumb: restore for files, revert for shared history, reset only on commits that were never pushed. If you are not sure, git stash first: it shelves your uncommitted changes to tracked files, and git stash pop brings them back. New, untracked files are left where they are unless you use git stash -u.
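Here is a small sketch of the stash round trip in a throwaway repository (report.sql, todo.txt, and the identity are placeholders). It also shows that a plain git stash shelves edits to tracked files only and leaves a brand-new file alone.

```shell
repo=$(mktemp -d) && cd "$repo"
git init -q -b main                        # -b needs Git 2.28+
git config user.name  "Demo Analyst"       # placeholder identity
git config user.email "demo@example.com"

echo "SELECT 1;" > report.sql
git add report.sql && git commit -q -m "Add report"

echo "SELECT 2;" > report.sql              # uncommitted edit to a tracked file
echo "notes" > todo.txt                    # new file Git has never seen

git stash -q                               # shelve the edit
after_stash=$(cat report.sql)              # back to the committed version
still_here=$(git status --short)           # todo.txt was not stashed

git stash pop -q                           # bring the edit back
after_pop=$(cat report.sql)

echo "after stash: $after_stash | untouched: $still_here | after pop: $after_pop"
```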

Working with GitHub (Remotes)

# connect a local repo to an empty GitHub repo
git remote add origin https://github.com/user/my-analysis.git
git push -u origin main            # -u: remember the pairing; later pushes are just `git push`

# daily sync
git pull                           # fetch remote commits and merge them in
git push                           # upload your commits

# see what's on the remote without merging anything
git fetch
git log main..origin/main --oneline

pull = fetch (download) + merge (apply). When working alone on one machine, push at the end of every session — the remote doubles as your backup.
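The two halves of pull can be run separately. The sketch below uses a local bare repository as a stand-in for GitHub, plus two clones named laptop and desktop; all of these names and the placeholder identities are assumptions for the demo, not anything your project needs.

```shell
work=$(mktemp -d) && cd "$work"

# "laptop" is your repo; remote.git is a local stand-in for the GitHub repo
git init -q -b main laptop                 # -b needs Git 2.28+
cd laptop
git config user.name  "Demo Analyst"
git config user.email "demo@example.com"
echo "SELECT 1;" > report.sql
git add report.sql && git commit -q -m "Add report"
git clone -q --bare . "$work/remote.git"
git remote add origin "$work/remote.git"

# a second clone pushes a commit, as if made on another machine
git clone -q "$work/remote.git" "$work/desktop"
cd "$work/desktop"
git config user.name  "Demo Analyst"
git config user.email "demo@example.com"
echo "SELECT 2;" >> report.sql
git commit -q -am "Extend report"
git push -q origin main

cd "$work/laptop"
git fetch -q origin                                  # step 1: download only
incoming=$(git rev-list --count main..origin/main)   # commits waiting to be merged
before=$(grep -c '' report.sql)                      # file on disk is unchanged so far
git merge -q origin/main                             # step 2: apply them
after=$(grep -c '' report.sql)

echo "incoming commits: $incoming, lines before merge: $before, after: $after"
```

Running fetch on its own is a low-risk way to look at incoming work before it touches your files.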

Common DA Workflows

1. Start a New Analysis Project

mkdir revenue-analysis && cd revenue-analysis
git init
# create .gitignore FIRST (see above), then:
git add .gitignore
git commit -m "Initial commit with gitignore"

2. "It Worked Yesterday" — Find What Changed

git log --oneline -- daily_report.sql     # recent commits touching the file
git diff HEAD~1 -- daily_report.sql       # what changed since the previous commit
git restore --source HEAD~1 daily_report.sql   # bring back the previous commit's version (use a hash from git log to go further back)

3. Try a Risky Rewrite

git switch -c rewrite-cohort-query
# ...experiment freely; main is untouched...
# good result → merge it into main
# dead end → git switch main, then git branch -D rewrite-cohort-query
#   (-D, not -d: Git refuses to delete an unmerged branch with -d)

4. Notebooks in Git

Jupyter notebooks store their outputs (tables, charts, execution counts) inside the file, so every re-run shows up as a huge diff. Either clear outputs before committing (Edit → Clear All Outputs), or automate it with nbstripout — and keep heavy logic in .py / .sql files, which diff cleanly.

Common Mistakes

1. Committing Secrets

A .env or API key in history stays in history — even after you delete the file. Add .env to .gitignore before the first commit; if a key leaks anyway, rotate the key first, then rewrite history if needed.
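To make the point concrete, this sketch commits a fake key, untracks the file in a later commit, and then reads the key straight back out of history. The repository, the identity, and the key value are all invented for the demo.

```shell
repo=$(mktemp -d) && cd "$repo"
git init -q -b main                        # -b needs Git 2.28+
git config user.name  "Demo Analyst"       # placeholder identity
git config user.email "demo@example.com"

echo "API_KEY=not-a-real-key" > .env       # fake secret for the demo
git add .env && git commit -q -m "Add config"

git rm -q --cached .env                    # untrack it, keep it on disk
echo ".env" > .gitignore
git add .gitignore && git commit -q -m "Stop tracking .env"

tracked_now=$(git ls-files .env)                 # empty: no longer tracked
touched_by=$(git rev-list --count --all -- .env) # commits that added or removed it
first=$(git rev-list --max-parents=0 HEAD)       # the very first commit
leaked=$(git show "$first:.env")                 # the old content is still readable

echo "tracked now: '${tracked_now}', commits touching .env: $touched_by"
echo "recovered from history: $leaked"
```

Anyone with a clone can run the same git show, which is why rotating the key comes before any history cleanup.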

2. Committing Data Files

A 300 MB CSV makes every clone slow forever (history keeps every version). Commit the query or script that produces the data, ignore the data itself.
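If a repository already feels heavy, one way to check is to list every file version stored in history by size. The sketch below is an assumption-laden demo: it fabricates a 200 KB trips.csv, commits it, deletes it, and shows that the blob is still there. The awk step assumes paths without spaces.

```shell
repo=$(mktemp -d) && cd "$repo"
git init -q -b main                        # -b needs Git 2.28+
git config user.name  "Demo Analyst"       # placeholder identity
git config user.email "demo@example.com"

head -c 200000 /dev/zero > trips.csv       # 200 KB stand-in for a big export
echo "SELECT 1;" > report.sql
git add . && git commit -q -m "Add report and data"
git rm -q trips.csv && git commit -q -m "Remove data file"

# every file version ever committed, largest first: size in bytes, then path
largest=$(git rev-list --objects --all |
  git cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' |
  awk '$1 == "blob" { print $2, $3 }' |
  sort -rn | head -n 1)

echo "largest object in history: $largest"
```

The file is gone from the working directory, yet it still tops the list, because history keeps it.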

3. `git reset --hard` as a Reflex

It throws away uncommitted work with no confirmation, and that work cannot be recovered. Prefer git stash (recoverable) or git restore <file> (one file at a time). If you do lose committed work, git reflog lists every commit HEAD has recently pointed to, so committed work is almost always recoverable.
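A minimal sketch of a reflog rescue, in a throwaway repository with placeholder names: a commit is dropped with reset --hard and then brought back. It only works because the work was committed first.

```shell
repo=$(mktemp -d) && cd "$repo"
git init -q -b main                        # -b needs Git 2.28+
git config user.name  "Demo Analyst"       # placeholder identity
git config user.email "demo@example.com"

echo "SELECT 1;" > report.sql
git add report.sql && git commit -q -m "Add report"
echo "SELECT 2;" > report.sql
git commit -q -am "Add second query"
lost=$(git rev-parse HEAD)                 # remember the commit about to be dropped

git reset -q --hard HEAD~1                 # the commit disappears from git log
visible=$(git rev-list --count HEAD)       # only the first commit is reachable

git reflog -n 3                            # HEAD@{1} is where HEAD was before the reset
git reset -q --hard 'HEAD@{1}'             # move the branch back to it
recovered=$(git rev-parse HEAD)
content=$(cat report.sql)

echo "commits visible after reset: $visible, file after recovery: $content"
```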

4. One Giant "update" Commit per Week

History becomes useless — you cannot find when a bug appeared, and rollback means losing a week. Commit per logical change with a message that says what changed.

5. Pulling with Uncommitted Changes

git pull onto a dirty working directory can tangle your edits with incoming ones. Commit (or stash) first, then pull. Clean state before sync, always.
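One way to build the habit is a tiny guard that checks for a clean state before syncing. The function name safe_to_pull is hypothetical, and the demo repository has no remote, so the sketch only prints what it would do.

```shell
# hypothetical helper: succeeds only when nothing is modified, staged, or untracked
safe_to_pull() {
  # --porcelain prints one line per changed or untracked file; no output = clean
  [ -z "$(git status --porcelain)" ]
}

repo=$(mktemp -d) && cd "$repo"
git init -q -b main                        # -b needs Git 2.28+
git config user.name  "Demo Analyst"       # placeholder identity
git config user.email "demo@example.com"
echo "SELECT 1;" > report.sql
git add report.sql && git commit -q -m "Add report"

safe_to_pull && when_clean=yes || when_clean=no

echo "SELECT 2;" >> report.sql             # an uncommitted edit
safe_to_pull && when_dirty=yes || when_dirty=no

if safe_to_pull; then
  echo "clean: go ahead and git pull"
else
  echo "uncommitted changes: commit or stash first"
fi
```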

Where this fits: keep your SQL migration scripts (see SQL INSERT, UPDATE & DDL for Analysts) and your requirements.txt (see Python venv & Project Setup for Analysts) in the repo — the code and config layer of a project belongs in Git even when the data does not.
