None of this was specific to git, but I love git, so if this is the push (hehe) you needed, then that’s great!
I don’t like the “cheat sheet” style of most git tutorials, because it makes it seem like magic, and it seems like a bad idea to trust magic. And then one day something weird happens and the 3 magic spells you know don’t work here, and you declare git is broken, or at least too mysterious, and your data is gone. Which usually it isn’t.
Here’s my quick version: Git is a content addressable Merkle tree storage engine, and then they used that to build a version control system on top. What does that word salad mean?
Content Addressable means you want to store some data, like the contents of a file, so you hash that data (make a fingerprint number out of it, if you don’t know the word hash) and then store the data at that hash. Using a summary of the content itself, as an address. This allows you to store “stuff” in such a way that if you have two different “things” they’ll both get their own place to live in the datastore, because they have different content. And when the data is changed, the address changes too, so the new data gets stored somewhere else and the original data is still where it always lived. The only place that content can live, in fact.
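To make that concrete, here’s a minimal sketch in Python. With git’s default SHA-1 object format, a file’s contents are hashed with a small “blob <size>” header in front. The `store` dict and the `put` helper are made-up names for illustration, and real git also compresses objects and writes them under .git/objects, which this skips.

```python
import hashlib

def git_blob_id(content: bytes) -> str:
    """Hash content the way git hashes a file: a small header, then the bytes."""
    header = b"blob " + str(len(content)).encode() + b"\0"
    return hashlib.sha1(header + content).hexdigest()

# A toy content-addressable store: the address is derived from the content.
store = {}

def put(content: bytes) -> str:
    address = git_blob_id(content)
    store[address] = content  # storing identical content again changes nothing
    return address

a = put(b"hello\n")
b = put(b"hello, world\n")
c = put(b"hello\n")  # same content as `a`, so it lands at the same address

print(a)
print(a == c, a != b, len(store))
```

Two different contents get two addresses, and storing the same content twice doesn’t create a second copy.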
Merkle trees are a kind of tree where data lives at an address, and then references data at another address, which itself may reference other addresses, forming a tree. They’re a bit more specific than that, but this’ll do.
So I store my file’s contents using my content addressable storage, this means I get an address for it. How do I use this to represent a file list, though? Easy, I make a data format that’s a list of files’ names, their file permissions, and then their contents hash address. Like a link. This folder has these files with these names, and their contents are here, here, and here. But the magic is that this folder listing is itself data, so it can also be stuffed into the content store and given its own address that now represents the folder, and thus all files in that folder if you follow the links. And if I have subfolders, I can represent those by storing the folder listing in the hash store, and then just linking to a subfolder in the parent folder right alongside the other links to the file contents. Easy!
And here’s the key: if any of the files’ content changes, then the address of the new content will change, leaving the old content at the old address. But that means any directory listing will need to point at the new file’s address, which means its contents change too because of the new address, which means the directory with the new file contents will also get a new address, leaving the old directory with the old files where it was too. And any parent directories will also change by linking to this subdirectory at its new address etc., all the while any unmodified files, and directories containing unmodified files, will continue to have the same address, and so won’t need to be stored again and just continue to be referenced by both versions. Ultimately, this means the state of the entire tree of files in a folder can be boiled down to a single address, the hash of the listing of the root of the folder. Everything else can be followed from there. And if any of the files change, anywhere in the tree, they can all be stored in the store, and a new root address will describe this, different, tree with slightly different contents, and both trees can be stored alongside one another and share all the same files that didn’t change.
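Here’s a rough sketch of that sharing, under loud assumptions: the listing format below is something I made up (git’s real tree format is binary and also records file permissions), and `put` and `put_tree` are hypothetical helper names. The point is only to count how many new objects a one-file change creates.

```python
import hashlib

store = {}

def put(data: bytes) -> str:
    address = hashlib.sha1(data).hexdigest()
    store[address] = data
    return address

def put_tree(entries: dict) -> str:
    """Store a folder listing. Values are bytes (a file) or a dict (a subfolder)."""
    lines = []
    for name in sorted(entries):
        value = entries[name]
        if isinstance(value, dict):
            lines.append(f"tree {put_tree(value)} {name}")
        else:
            lines.append(f"blob {put(value)} {name}")
    # The listing is itself just data, so it goes in the same store.
    return put("\n".join(lines).encode())

v1 = put_tree({
    "README": b"hi\n",
    "src": {"a.py": b"print(1)\n", "b.py": b"print(2)\n"},
    "docs": {"intro.txt": b"intro\n"},
})
count_v1 = len(store)

# Change one file, deep in the tree, and store the whole thing again.
v2 = put_tree({
    "README": b"hi\n",
    "src": {"a.py": b"print(100)\n", "b.py": b"print(2)\n"},
    "docs": {"intro.txt": b"intro\n"},
})
new_objects = len(store) - count_v1

print(v1 != v2, count_v1, new_objects)
```

The second snapshot only adds the changed file, its folder listing, and the new root listing. Everything else is shared with the first snapshot.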
Okay, almost there. I assume by now it’s pretty clear how this could be relevant to a version control system, but we need one more thing. What this all gives us is the ability to have any snapshot of a directory structure be fetched by a single address, and the ability to have multiple such snapshots coexist in the same store. But what we need is a connection between those snapshots, and specifically a history of those snapshots, to make them proper “versions”.
The good news is it’s pretty easy! We just make a new kind of data, called a “commit” which links to the address of the root it represents, so we can get its files, and then because it’s data we can put that also into the datastore and get an address for it, called the commit hash because of how important it is. So now we can refer to the commit, and through the links inside it find all the files it represents. But importantly when I make a new commit, I make the new snapshot of the files as previously discussed, and then in the commit I link to this new snapshot, but also link to the address of the commit that came before this one. The previous version. Which now forms a tree of history! This commit hash not only allows me to get all the files it represents, but also to follow to the previous one which lets me see those files too. And if that previous one has a previous one, then I can follow it all the way down the chain to the initial commit. And since we’ve got this commit object anyway, we also allow people to type a human-readable message in there describing the changes, and we mark down the date and time, and their chosen identity, for historical purposes. Might as well, and now those are in the history too.
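A sketch of the commit idea, with the caveat that the JSON encoding, the field names, and the placeholder tree addresses are my own inventions for illustration and not git’s actual commit format:

```python
import hashlib
import json

store = {}

def put(obj: dict) -> str:
    data = json.dumps(obj, sort_keys=True).encode()
    address = hashlib.sha1(data).hexdigest()
    store[address] = obj
    return address

def commit(tree, parent, message, author="me", date="2024-01-01"):
    """A commit is just data: a root tree address, a parent link, and metadata."""
    return put({"tree": tree, "parent": parent, "message": message,
                "author": author, "date": date})

def history(commit_hash):
    """Follow the parent links back to the initial commit."""
    while commit_hash is not None:
        c = store[commit_hash]
        yield c["message"]
        commit_hash = c["parent"]

# The tree addresses here are placeholders standing in for real root hashes.
c1 = commit("tree-address-1", None, "initial commit")
c2 = commit("tree-address-2", c1, "add feature")
c3 = commit("tree-address-3", c2, "fix bug")

print(list(history(c3)))  # ['fix bug', 'add feature', 'initial commit']
```

Knowing only `c3` is enough to recover the entire history, because each commit carries the address of the one before it.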
That’s basically git! But there’s a few loose ends. Branches. Since all of history can be referred to by the address of just the newest commit in that history, this is all branches are. They’re a human name given to a commit meant to be the newest of its history. And when you make a new commit it will move the branch you’re on to point at the new commit, but leave the other branches alone, which allows history to be different on different branches, and you can switch between them freely by name. But because history is all pointing, at some point two branches may both point to the same parent commit, and from that point on history of those branches will be literally identical. This makes it easy to tell where branches diverge.
Tags are just branches that don’t move. They give a name to a particular commit in the history, by its commit hash of course, and are usually used for releases or other such things.
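Branches and tags as plain name-to-hash lookups can be sketched like this. It’s a toy model: the hashes come from a shortcut I made up (hashing the parent and message), and `branches`, `tags`, `head` and `merge_base` are illustrative names, not git internals.

```python
import hashlib

commits = {}    # commit hash -> parent hash
branches = {}   # branch name -> commit hash
tags = {}       # tag name -> commit hash (never moved)
head = "main"   # the branch we are currently on

def commit(message):
    parent = branches.get(head)
    h = hashlib.sha1(f"{parent}:{message}".encode()).hexdigest()
    commits[h] = parent
    branches[head] = h  # only the current branch moves
    return h

def ancestors(h):
    out = []
    while h is not None:
        out.append(h)
        h = commits[h]
    return out

def merge_base(a, b):
    """First commit that both histories have in common."""
    seen = set(ancestors(a))
    return next(h for h in ancestors(b) if h in seen)

c1 = commit("first")
c2 = commit("second")
tags["v1.0"] = c2                        # a name that stays put
branches["feature"] = branches["main"]   # making a branch is copying a pointer

head = "feature"
c3 = commit("feature work")
head = "main"
c4 = commit("main work")

print(branches["feature"] == c3, branches["main"] == c4)
print(merge_base(c3, c4) == c2, tags["v1.0"] == c2)
```

Both branches share everything from `c2` backwards, which is exactly where they diverge, and the tag is still sitting on `c2` no matter how many commits came after.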
Diffs are everywhere when interacting with git, so you’d be forgiven for thinking they’re part of what git stores, but they’re not really. What’s stored is, as I described above, the full contents of the tree. But I can choose to go through the trees at two different commits and compare their files to produce a patch describing their difference, and it’s often very useful to do so. The most common version of this is to compare the contents of this commit to the one that came before, its parent, to compute the “diff of the commit”. It’s not truly what’s in the commit, but it follows trivially from it.
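To show that a diff can be computed on demand from two full snapshots, here’s a small sketch using Python’s standard `difflib`. The snapshots are plain dicts of file name to text, which is my simplification, and `diff_snapshots` is a hypothetical helper rather than anything git exposes.

```python
import difflib

def diff_snapshots(old: dict, new: dict) -> str:
    """Compute a unified diff between two full snapshots (name -> text)."""
    out = []
    for name in sorted(set(old) | set(new)):
        a = old.get(name, "").splitlines(keepends=True)
        b = new.get(name, "").splitlines(keepends=True)
        if a != b:
            out.extend(difflib.unified_diff(
                a, b, fromfile=f"a/{name}", tofile=f"b/{name}"))
    return "".join(out)

parent = {"hello.txt": "hello\nworld\n", "same.txt": "unchanged\n"}
child = {"hello.txt": "hello\nthere\nworld\n", "same.txt": "unchanged\n"}

patch = diff_snapshots(parent, child)
print(patch)
```

Nothing patch-shaped was stored anywhere. Both snapshots are complete, and the patch only exists because we asked for a comparison.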
Collaboration (pushing and pulling) basically just works by sending the remote side any stored objects you have and it doesn’t, and then updating the branch pointers to point at this new stuff. Pulling is the same but I’m getting things from the remote instead.
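As a very rough sketch of that exchange: the function below copies over whatever objects the other side lacks and then moves its branch pointer. This is a simplification on my part. As far as I understand, real git negotiates and only sends objects reachable from what you’re pushing, and the object names and contents here are placeholders.

```python
def push(local_store, local_branches, remote_store, remote_branches, branch):
    """Send objects the remote lacks, then move its branch pointer."""
    missing = {h: obj for h, obj in local_store.items() if h not in remote_store}
    remote_store.update(missing)
    remote_branches[branch] = local_branches[branch]
    return sorted(missing)

# Placeholder object ids and contents, just to show the bookkeeping.
local_store = {"c1": "first commit", "c2": "second commit", "c3": "third commit"}
local_branches = {"main": "c3"}
remote_store = {"c1": "first commit"}
remote_branches = {"main": "c1"}

sent = push(local_store, local_branches, remote_store, remote_branches, "main")
sent_again = push(local_store, local_branches, remote_store, remote_branches, "main")

print(sent, sent_again, remote_branches["main"])
```

Pulling is the same function with the local and remote arguments swapped.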
The index. Ah, the index. So remember when I said you store a file’s contents in the store when making a commit? That would work, but what if I only want to store some of the changes in my working directory, but not all of them? Maybe some are relevant for the commit I’m making, but others are for a different commit? Or they’re just for testing. It might be useful to not have to store literally exactly what’s in the files’ contents. So instead there’s a staging area, called the index, that stores the contents of the files as we’d like to commit them, rather than how they really are. There are commands to add things to the index, and then during a commit it’s the index that informs what gets stored in the store and referenced, not what’s in the actual folder. This is confusing at first, because most tutorials skip it for being complicated and teach people to ignore it completely. But I think it’s useful.
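A toy model of the index, assuming everything is just dicts of path to contents (the real index stores hashes and file metadata, and `add` and `commit` here are stand-ins for the git commands, not their implementation):

```python
working_dir = {"app.py": "v1\n", "notes.txt": "scratch\n"}
index = {}     # what the next commit will contain
commits = []   # list of {"message", "snapshot"}

def add(path):
    """Copy the file's current contents from the working dir into the index."""
    index[path] = working_dir[path]

def commit(message):
    """Snapshot the index, not the working directory."""
    commits.append({"message": message, "snapshot": dict(index)})

add("app.py")
commit("first")

working_dir["app.py"] = "v2\n"
working_dir["debug.log"] = "noise\n"
add("app.py")                              # stage v2
working_dir["app.py"] = "v3 (not staged)\n"  # keep hacking after staging
commit("second")

print(commits[-1]["snapshot"])  # {'app.py': 'v2\n'}
```

The second commit holds `v2`, because that’s what was in the index, even though the file on disk had already moved on to `v3`. The scratch files never got added, so they never got committed.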
Okay, this is a monster, so I think I’ll cover “cheat sheet of commands given this context” in a reply to myself.
git add just adds things to the index. It also works to add new files to git, because git only ever works on files it already knows about, so the first time a new file is created, you have to add it so git knows to track it. Still goes in the index, though.
git add -p goes through the diff of your working directory and asks if you’d like this change in the index or not. Notably it doesn’t ask about new files, you’ll still have to add those.
git status, so useful, but also simple. Tells you what branch you’re on, what files have been changed since the version in the index, what files have been changed in the index (and so what’s going to be committed at the next commit), and what files exist that git doesn’t know about and you might want to add.
Speaking of which, having a bunch of files here that aren’t in git can be a hazard because it makes it really easy to forget about a new file that you actually did want to add. If there’s a file that will be sticking around for a while that you don’t want to add to the repository, you should tell git to ignore it. If it’s a file that everyone who uses this repo will encounter, like a build or some packages that get fetched, it should go in the .gitignore file, which then gets checked in and synced. If it’s something that only you will have, you can instead put it in .git/info/exclude and it will not be checked in and will just exist in your folder. This will help keep the git status relevant and actionable.
git commit stores the index and makes a commit out of it, asking you for a message to go along with it. It also moves your branch head to this new commit, if you’re on a branch, which you should be most of the time.
git commit -a is a useful shortcut for people who know what they’re doing, which is then taught in every intro tutorial to people who don’t know what they’re doing. It just adds all changes before doing a commit, which effectively skips the index as a concept. Which is fine if there’s no temporary or unrelated changes, but often ends up with people not looking over their changes and adding random test garbage to commits without realizing. See git add -p above. It also doesn’t add new files, which means it works without having to think about it 95% of the time, but then people create a new file and don’t check it in for 10 commits and everything is broken for everyone else. This is a mistake anyone can make, even git add -p folk, and the only cure is actually checking git status, and noticing when it’s warning you about new files.
git add . adds all files in the directory to the index. Also a kind of habit some people get into when they “just want all the changes” but also often ends up with a bunch of garbage being accidentally checked in, like API keys or downloads or patch files or whatever else is in their working dir. It does respect the ignore files, though, so it can be useful if you’re careful.
git diff on its own tells you the difference between the files actually in your working directory (the folder on disk) and the index. Not the last commit, like it may seem, but the index, which, when nothing is staged, is equivalent to the last commit. Basically, this tells you the changes you haven’t added yet, but doesn’t list new files.
git diff HEAD does the thing people think, which compares what’s in the working dir with the latest commit. Actually any git diff COMMIT compares the working dir against that commit, and HEAD is a pointer to the current commit.
git diff COMMIT1..COMMIT2 computes the diff between the trees pointed to by those two commits.
git diff --cached is unfortunately named, but this is what shows the diff between what’s in the index and what’s in the latest commit. This is what would be committed if you ran git commit right now. Useful for making sure you haven’t accidentally added a bunch of useless stuff.
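The three flavours of git diff above are the same comparison pointed at different pairs of the three places a file can live. Here’s a sketch with each place modelled as a dict of path to contents. The `changed` helper and the variable names are mine, and it reports only which paths differ, not the line-level patch.

```python
def changed(old: dict, new: dict, paths) -> list:
    """Paths (from `paths`) whose contents differ between two states."""
    return sorted(p for p in paths if old.get(p) != new.get(p))

head = {"a.txt": "1\n", "b.txt": "1\n"}       # last commit
index = {"a.txt": "2\n", "b.txt": "1\n"}      # a.txt has been staged
working = {"a.txt": "2\n", "b.txt": "3\n",    # b.txt edited but not added
           "new.txt": "x\n"}                  # git doesn't know this file

tracked = set(index)

unstaged = changed(index, working, tracked)             # like: git diff
staged = changed(head, index, set(head) | set(index))   # like: git diff --cached
vs_head = changed(head, working, tracked | set(head))   # like: git diff HEAD
untracked = sorted(set(working) - tracked)              # git status lists these

print(unstaged, staged, vs_head, untracked)
```

This is also more or less the information git status shows: staged changes, unstaged changes, and files git doesn’t know about.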
git log shows the commit history.
git log -p shows the commit history, but also computes the diff between each commit and its parent so you can see the changes.
Now for the elephant in the room, git merges and rebases. Given the data model I’ve explained to you, merges are easy. We have branches because multiple different commits can claim the same parent, which allows history to diverge. But someday we may want history to come back together again, like if I branch off to work on a feature, and now the feature is done and I want to merge to the main branch. The way this works is that we make a commit that references multiple parents, tying the two histories together. Simple! But the question is what snapshot do I store with this commit? If I pick the snapshot from either side, the other side’s changes won’t be present. What I want is to blend these snapshots, so git does what’s called a three-way merge. I first find the point where my two branches diverge, their shared common ancestor, and then I find the diff between each of the branch tips and this common ancestor. Then I try to apply these patches to the common ancestor and if both apply cleanly, then I’m done! I store that and point the commit at it, referencing both parents as I said, and now history is tied together.
If there are conflicts, though, git will dump the conflicts into the working directory and say “you figure this out” and then you manually merge what it couldn’t do automatically, and then use git add like normal to tell git “this is what my merge commit should contain”, and then it does.
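Here’s a sketch of the three-way idea at whole-file granularity. That’s a big simplification: git merges within files too, line by line, and this toy version just calls it a conflict whenever both sides changed the same file differently. The function and variable names are illustrative.

```python
def three_way_merge(base: dict, ours: dict, theirs: dict):
    """Merge two snapshots using their common ancestor. Returns (merged, conflicts)."""
    merged, conflicts = {}, []
    for path in sorted(set(base) | set(ours) | set(theirs)):
        b, o, t = base.get(path), ours.get(path), theirs.get(path)
        if o == t:        # both sides agree (or neither touched it)
            result = o
        elif o == b:      # only their side changed it
            result = t
        elif t == b:      # only our side changed it
            result = o
        else:             # both changed it differently: a human has to decide
            conflicts.append(path)
            continue
        if result is not None:  # None means the file was deleted
            merged[path] = result
    return merged, conflicts

base = {"a": "1", "b": "1", "c": "1"}
ours = {"a": "2", "b": "1", "c": "ours"}
theirs = {"a": "1", "b": "3", "c": "theirs", "d": "new"}

merged, conflicts = three_way_merge(base, ours, theirs)
print(merged, conflicts)
```

The common ancestor is what makes this work: without `base` there’s no way to tell “they changed it” apart from “I changed it”.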
So that’s merges. It’s great because it represents history, and only references previously existing commit hashes, but it’s also sometimes messy because the true history can be messy. The classic example is a feature branch that wanted to keep up with the main branch, and so has several merge commits from the main branch into the feature branch, which are still part of the history when that later gets merged to the main branch, leading to a commit graph that’s very noisy and has lots of crosses. It works, but people don’t like it.
So then there’s rebase. Before that, let’s talk about git cherry-pick. It has an easy job. It takes a commit, computes the diff between it and its parent to get the “patch”, or set of changes, this commit represented, and then tries to apply that patch, making those changes, here on the current branch. If it succeeds it makes a new commit that has the same message as the one that’s being cherry picked, and if there’s conflicts it asks the human to fix them like normal before doing the add and commit steps. So it’s trying to “pick-up that patch and put it here”, replicating its outcome in a new context. And it makes a new commit that looks like the old one for consistency. But this is important! It looks like the old one, but it is not the same as the old one. Remember, what gives a commit its identity is its hash, and its hash comes from its content. And the content is not the diff. That’s computed. The content is the commit message, which is the same, but also the parent commit which is totally different, and the snapshot of the entire set of files, which will also be totally different. Sure, the patch will be the same because it was based on the original, but presumably the other files on this branch aren’t the same, and maybe even other parts of the files this patch touches will be different. That’s the point of the cherry-pick, to take this change set and transplant it into a new context. Well, that new context has new file contents and a new parent, which means new hashes, which means this commit has a new commit hash and is effectively totally different, despite having the same message. And if there were conflicts, it might not even end up with the same patch, just a similar one.
Okay, so that’s git cherry-pick. But what if I’m on a branch with multiple commits that I want to “catch up” to the main branch. I can just find all the commits this branch has that the main branch doesn’t, switch to the main branch, and then cherry pick the old commits one after the other. Now I’ll be on a new branch, on a new commit, but it will “feel like” the old one, with the same changes, but updated to be “re-based” on the new main branch. As in, the branch branches off main at a different point. The base is different. It was rebased. Get it!?
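A sketch of cherry-pick and rebase in this model, under the usual caveats: commits are JSON blobs with a made-up layout, patches are whole-file replacements, and conflicts are ignored entirely. `make_commit`, `patch_of`, `cherry_pick` and `rebase` are hypothetical helpers, not how git implements any of this.

```python
import hashlib
import json

store = {}

def make_commit(snapshot, parent, message):
    obj = {"snapshot": snapshot, "parent": parent, "message": message}
    h = hashlib.sha1(json.dumps(obj, sort_keys=True).encode()).hexdigest()
    store[h] = obj
    return h

def patch_of(h):
    """Computed, not stored: what this commit changed relative to its parent."""
    new = store[h]["snapshot"]
    parent = store[h]["parent"]
    old = store[parent]["snapshot"] if parent else {}
    return {p: new.get(p) for p in set(old) | set(new) if old.get(p) != new.get(p)}

def cherry_pick(h, onto):
    """Apply one commit's patch on top of `onto`, producing a brand new commit."""
    snapshot = dict(store[onto]["snapshot"])
    for path, content in patch_of(h).items():
        if content is None:
            snapshot.pop(path, None)   # the commit deleted this file
        else:
            snapshot[path] = content
    return make_commit(snapshot, onto, store[h]["message"])

def rebase(commit_hashes, onto):
    """Cherry-pick each commit in order; return the new tip."""
    for h in commit_hashes:
        onto = cherry_pick(h, onto)
    return onto

base = make_commit({"a": "1"}, None, "base")
f1 = make_commit({"a": "1", "f": "x"}, base, "add f")      # feature branch
f2 = make_commit({"a": "1", "f": "y"}, f1, "edit f")
m1 = make_commit({"a": "2"}, base, "main moves on")       # main branch

new_tip = rebase([f1, f2], m1)
print(store[new_tip]["message"], new_tip != f2, f2 in store)
```

The rebased tip has the same message and the same patch as `f2`, but a different parent and a different snapshot, so it gets a different hash. The original `f2` is still sitting in the store, untouched.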
You can use git rebase -i to actually see what it’s about to do beforehand. It finds a bunch of commits and then gets ready to pick them.
This can be great, but can also be a nightmare. Mostly because the hashes of everything have changed. When collaborating with people, they’ll see a branch be at one commit, and then the next time they look it’ll have jumped to a completely different set of commits that don’t follow from the one they used to know. They’re not in the history of the new commits, it’s just different. This makes them grumpy.
And because the new commits are unique, if you’ve messed up your history before you can end up with the “same” commit multiple times in history, because actually they’re different rebased copies of each other. And rebasing a previous merge commit can be a real beast because it just makes things more complicated.
Anyway, it’s not a problem problem, it’s just something to be careful about.
And now I’m running out of time, but there’s one more thing I want to talk about, which is my best friend git reflog.
git reflog is just a log of all the commit hashes you’ve ever been at, and why it changed. Using this you can recover from almost anything you do within git. Bad rebase? That’s okay, branches are just pointers to commit hashes, and the old commit hashes are still there, same as they ever were. And the reflog remembers what those hashes were. Accidentally reset your branch to a bad place? Git reflog knows how to find your way home. Deleted a branch that still had a change on it you forgot to merge? The name may be gone, but the hash isn’t. Reflog knows its old address, and you can just point a new name there, or inspect its log by hash, or cherry-pick it.
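The reflog idea fits in a few lines. This is only a sketch: the hashes are fake placeholders, and the single `reflog` list and `move` helper are my simplification of what git records.

```python
branches = {"main": "aaa111"}   # fake, shortened hashes for illustration
reflog = []                     # (old hash, new hash, reason), oldest first

def move(branch, new_hash, reason):
    """Every time a pointer moves, write down where it was and why."""
    reflog.append((branches.get(branch), new_hash, reason))
    branches[branch] = new_hash

move("main", "bbb222", "commit: add feature")
move("main", "ccc333", "rebase (finish)")

# The rebase went badly. The old hash is right there in the log.
old_hash, _, _ = reflog[-1]
move("main", old_hash, "reset: moving to " + old_hash)

print(branches["main"], len(reflog))  # bbb222 3
```

Nothing had to be restored from anywhere. The old commit never left the store, and the log just remembered its address so the branch could point at it again.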
Git reflog loves you.
And now I have to go, but maybe I’ll say more later.
A lot of it is familiarity and opinions. I was never as familiar with mercurial and so I liked git better. Mercurial is a longer word so for the rest of this I’m going to call it hg. I had friends that liked hg, but it’s been years so some of what I say may be wrong or vibey.
I think the main thing hg has going for it is that it works closer to how people think git works. There’s no concept of the index, it just adds all the changes from your working dir like git commit -a. And I’m pretty sure that rather than storing the full contents of files like git, and then computing the diffs for display, hg actually stores the changes as a series of patches.
And if I remember correctly for that reason patches on hg “belong to” a branch rather than branches pointing at commits in git. This makes things like cherry-picks and rebases harder and thus less “normal” operations, and IIRC it was a bigger deal in hg to accidentally commit to the wrong branch, whereas with git you can use the reflog to reset the branch to where it was trivially, and that commit you made is still floating in the store with an address even with no branch pointing at it, so you can just point a branch at it still, or cherry-pick it to another branch or whatever. Nothing was lost.
But the main thing people talked about was the simplicity and intuitiveness of the commands. And I think a lot of that comes from the fact that hg worked the way people thought it did and the way people used it. So it was intuitive.
Whereas git, as I described in my main post and its follow-ups, is actually a content addressable tree storage system with a version control system built on top, which gives it immense power and flexibility, but only if you teach people what git really is. It is intuitive once you know what it is actually doing, but most git tutorials assume people can’t understand because it’s “too complicated”, or that they won’t bother to learn because it’s a side quest on the way to just tracking versions.
So the tutorials teach git as though it’s mercurial: like there isn’t an index, like changes are patches, like history is linear, and then yeah from that perspective the commands are unintuitive. Why do I have to add files with git add, but then commit with git commit -a all the time? Why would I need to pass a flag or it’ll do nothing? Shouldn’t that be the default? And then when fixing merge conflicts, I use git add for that too? The command I only use for new files? Why? What are all the flags to git reset? Why does that un-add stuff, but also roll back changes? Why when I check out a commit am I in a broken “detached head” state, and the thing I was meant to use was git reset again? That’s random. I did a rebase, it didn’t go well, and now git “broke my branch” and my changes are gone.
And so they’ll go for 15 years of their career not knowing how the tool they use every day works, running the same 4 command strings they learned from a tutorial for beginners, and then sometimes something “weird” will happen and they’ll be confused or angry. Because they didn’t take the 30 minutes it takes way back at the start to teach git as git, at which point the command names still are a smidge weird, but their operation is crystal clear and consistent.
And git reflog heals basically all wounds.
So yeah, that’s my impression of hg from way back. Simpler and more limited, which had the benefit of therefore also being easier to use and more intuitive because it implemented exactly what people thought it did, so there was a match-up between interface and implementation.
None of this was specific to git, but I love git, so if this is the push (hehe) you needed, then that’s great!
I don’t like the “cheat sheet” style of most git tutorials, because it makes it seem like magic, and it seems like a bad idea to trust magic. And then one day something weird happens and the 3 magic spells you know don’t work here, and you declare git is broken, or at least too mysterious, and your data is gone. Which usually it isn’t.
Here’s my quick version: Git is a content addressable Merkle tree storage engine, and then they used that to build a version control system on top. What does that word salad mean?
Content Addressable means you want to store some data, like the contents of a file, so you hash that data (make a fingerprint number out of it, if you don’t know the word hash) and then store the data at that hash. Using a summary of the content itself, as an address. This allows you to store “stuff” in such a way that if you have two different “things” they’ll both get their own place to live in the datastore, because they have different content. And when the data is changed, the address changes too, so the new data gets stored somewhere else and the original data is still where it always lived. The only place that content can live, in fact.
Merkle trees are a kind of tree where data lives at an address, and then references data at another address, which itself may reference other addresses, forming a tree. They’re a bit more specific than that, but this’ll do.
So I store my file’s contents using my content addressable storage, this means I get an address for it. How do I use this to represent a file list, though? Easy, I make a data format that’s a list of files’ names, their file permissions, and then their contents hash address. Like a link. This folder has these files with these names, and their contents are here, here, and here. But the magic is that this folder listing is itself data, so it can also be stuffed into the content store and given its own address that now represents the folder, and thus all files in that folder if you follow the links. And if I have subfolders, I can represent those by storing the folder listing in the hash store, and then just linking to a subfolder in the parent folder right alongside the other links to the file contents. Easy!
And here’s the key: if any of the files’ content changes, then the address of the new content will change, leaving the old content at the old address. But that means any directory listing will need to point at the new file’s address, which means its contents change too because of the new address, which means the directory with the new file contents will also get a new address, leaving the old directory with the old files where it was too. And any parent directories will also change by linking to this subdirectory at its new address etc., all the while any unmodified files, and directories containing unmodified files, will continue to have the same address, and so won’t need to be stored again and just continues to be referenced by both versions. Ultimately, this means the state of the entire tree of files in a folder can be boiled down to a single address, the hash of the listing of the root of the folder. Everything else can be followed from there. And if any of the files change, anywhere in the tree, they can all be stored in the store, and a new root address will describe this, different, tree with slightly different contents, and both trees can be stored alongside one another and share all the same files that didn’t change.
Okay, almost there. I assume by now it’s pretty clear how this could be relevant to a version control system, but we need one more thing. What this all gives us is the ability to have any snapshot of a directory structure be fetched by a single address, and the ability to have multiple such snapshots coexist in the same store. But what we need is a connection between those snapshots, and specifically a history of those snapshots, to make them proper “versions”.
The good news is it’s pretty easy! We just make a new kind of data, called a “commit” which links to the address of the root it represents, so we can get its files, and then because it’s data we can put that also into the datastore and get an address for it, called the commit hash because of how important it is. So now we can refer to the commit, and through the links inside it find all the files it represents. But importantly when I make a new commit, I make the new snapshot of the files as previous discussed, and then in the commit I link to this new snapshot, but also link to the address of the commit that came before this one. The previous version. Which now forms a tree of history! This commit hash not only allows me to get all the files it represents, but also to follow to the previous one which lets me see those files too. And if that previous one has a previous one, then I can follow it all the way down the chain to the initial commit. And since we’ve got this commit object anyway, we also allow people to type a human-readable message in there describing the changes, and we mark down the date and time, and their chosen identity, for historical purposes. Might as well, and now those are in the history too.
That’s basically git! But there’s a few loose ends. Branches. Since all of history can be referred to by the address of just the newest commit in that history, this is all branches are. They’re a human name given to a commit meant to be the newest of its history. And when you make a new commit it will move the branch you’re on to point at the new commit, but leave the other branches alone, which allows history to be different on different branches, and you can switch between them freely by name. But because history is all pointing, at some point two branches may both point to the same parent commit, and from that point on history of those branches will be literally identical. This makes it easy to tell where branches diverge.
Tags are just branches that don’t move. They give a name to a particular commit in the history, by its commit hash of course, and are usually used for releases or other such things.
Diffs are everywhere when interacting with git, so you’d be forgiven for thinking they’re part of what git stores, but they’re not really. What’s stored is, as I described above, the full contents of the tree. But I can choose to go through the trees at two different commits and compare their files to produce a patch describing their difference, and it’s often very useful to do so. The most common version of this is to compare the contents of this commit to the one that came before, its parent, to compute the “diff of the commit”. It’s not truly what’s in the commit, but it followed trivially from it.
Collaboration (pushing and pulling) basically just works by sending the remote side any stored objects you have and it doesn’t, and then updating the branch pointers to point at this new stuff. Pulling is the same but I’m getting things from the remote instead.
The index. Ah, the index. So remember when I said you store a file’s contents in the store when making a commit? That would work, but what if I only want to store some of the changes in my working directory, but not all of them? Maybe some are relevant for the commit I’m making, but others are for a different commit? Or they’re just for testing. It might be useful to not have to store literally exactly what’s in the files’ contents. So instead there’s a staging area, called the index, that stores the contents of the files as we’d like to commit them, rather than how they really are. There are commands to add things to the index, and then during a commit it’s the index that informs what gets stored in the store and referenced, not what’s in the actual folder. This is confusing at first, because most tutorials skip it for being complicated and teaches people to ignore it completely. But I think it’s useful.
Okay, this is a monster, so I think I’ll cover “cheat sheet of commands given this context” in a reply to myself.
Okay, cheat sheet time!
`git add` just adds things to the index. It also works to add new files to git, because git only ever works on files it already knows about, so the first time a new file is created, you have to `add` it so git knows to track it. Still goes in the index, though.
`git add -p` goes through the diff of your working directory and asks if you’d like this change in the index or not. Notably it doesn’t ask about new files, you’ll still have to add those.
`git status`, so useful, but also simple. Tells you what branch you’re on, what files have been changed since the version in the index, what files have been changed in the index (and so what’s going to be committed at the next commit), and what files exist that git doesn’t know about and you might want to add.
Speaking of which, having a bunch of files here that aren’t in git can be a hazard because it makes it really easy to forget about a new file that you actually did want to add. If there’s a file that will be sticking around for a while that you don’t want to add to the repository, you should tell git to ignore it. If it’s a file that everyone who uses this repo will encounter, like a build or some packages that get fetched, it should go in the `.gitignore` file, which then gets checked in and synced. If it’s something that only you will have, you can instead put it in `.git/info/exclude` and it will not be checked in and will just exist in your folder. This will help keep the git status relevant and actionable.
`git commit` stores the index and makes a commit out of it, asking you for a message to go along with it. It also moves your branch head to this new commit, if you’re on a branch, which you should be most of the time.
`git commit -a` is a useful shortcut for people who know what they’re doing, which is then taught in every intro tutorial to people who don’t know what they’re doing. It just adds all changes to files git already tracks before doing a commit, which effectively skips the index as a concept. Which is fine if there are no temporary or unrelated changes, but often ends up with people not looking over their changes and adding random test garbage to commits without realizing. See `git add -p` above. It also doesn’t add new files, which means it works without having to think about it 95% of the time, but then people create a new file and don’t check it in for 10 commits and everything is broken for everyone else. This is a mistake anyone can make, even `git add -p` folk, and the only cure is actually checking git status, and noticing when it’s warning you about new files.
`git add .` adds all files in the directory to the index. Also a kind of habit some people get into when they “just want all the changes” but also often ends up with a bunch of garbage being accidentally checked in, like API keys or downloads or patch files or whatever else is in their working dir. It does respect the ignore files, though, so it can be useful if you’re careful.
`git diff` on its own tells you the difference between the files actually in your working directory (the folder on disk) and the index. Not the last commit, like it may seem, but the index, which is equivalent to the last commit when you haven’t added anything to it yet. Basically, this tells you the changes you haven’t added yet, but doesn’t list new files.
`git diff HEAD` does the thing people think, which compares what’s in the working dir with the latest commit. Actually any `git diff COMMIT` compares the working dir against that commit, and `HEAD` is a pointer to the current commit.
`git diff COMMIT1..COMMIT2` computes the diff between the trees pointed to by those two commits.
`git diff --cached` is unfortunately named, but this is what shows the diff between what’s in the index and what’s in the latest commit. This is what would be committed if you ran `git commit` right now. Useful for making sure you haven’t accidentally added a bunch of useless stuff.
`git log` shows the commit history.
`git log -p` shows the commit history, but also computes the diff between each commit and its parent so you can see the changes.
Now for the elephant in the room, git merges and rebases. Given the data model I’ve explained to you, merges are easy. We have branches because multiple different commits can claim the same parent, which allows history to diverge. But someday we may want history to come back together again, like if I branch off to work on a feature, and now the feature is done and I want to merge to the main branch. The way this works is that we make a commit that references multiple parents, tying the two histories together. Simple! But the question is what snapshot do I store with this commit? If I pick the snapshot from either side, the other side’s changes won’t be present. What I want is to blend these snapshots, so git does what’s called a three-way merge. I first find the point where my two branches diverge, their shared common ancestor, and then I find the diff between each of the branch tips and this common ancestor. Then I try to apply these patches to the common ancestor and if both apply cleanly, then I’m done! I store that and point the commit at it, referencing both parents as I said, and now history is tied together.
If there are conflicts, though, git will dump the conflicts into the working directory and say “you figure this out” and then you manually merge what it couldn’t do automatically, and then use `git add` like normal to tell git “this is what my merge commit should contain”, and then it does.
So that’s merges. It’s great because it represents history, and only references previously existing commit hashes, but it’s also sometimes messy because the true history can be messy. The classic example is a feature branch that wanted to keep up with the main branch, and so has several merge commits from the main branch into the feature branch, which are still part of the history when that later gets merged to the main branch, leading to a commit graph that’s very noisy and has lots of crosses. It works, but people don’t like it.
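If it helps to see the three-way merge as code, here’s a rough Python sketch. It’s a toy under loud assumptions, not how git actually implements it: a snapshot is just a dict of path to contents, the merge works on whole files instead of lines within files, and the name `three_way_merge` is something I made up for illustration.

```python
def three_way_merge(base, ours, theirs):
    """Blend two snapshots using their common ancestor.

    Each snapshot is a dict of path -> contents (a toy stand-in for a tree).
    Returns (merged_snapshot, conflicting_paths).
    """
    merged, conflicts = {}, []
    for path in sorted(set(base) | set(ours) | set(theirs)):
        # None means "this file does not exist in that snapshot".
        b, o, t = base.get(path), ours.get(path), theirs.get(path)
        if o == t:
            result = o   # both sides agree (or neither touched it)
        elif o == b:
            result = t   # only their side changed it
        elif t == b:
            result = o   # only our side changed it
        else:
            # Both sides changed it differently: a human has to decide.
            conflicts.append(path)
            continue
        if result is not None:
            merged[path] = result
    return merged, conflicts


base = {"a.txt": "1", "b.txt": "2"}
ours = {"a.txt": "1 edited", "b.txt": "2"}
theirs = {"a.txt": "1", "b.txt": "2", "c.txt": "3"}

merged, conflicts = three_way_merge(base, ours, theirs)
print(merged)
print(conflicts)
```

Here our side edited `a.txt` and their side added `c.txt`, so both changes land in the merged snapshot with nothing for a human to resolve. If both sides had edited `a.txt` differently, it would come back in the conflicts list instead, which is the “you figure this out” case.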
So then there’s rebase. Before that, let’s talk about `git cherry-pick`. It has an easy job. It takes a commit, computes the diff between it and its parent to get the “patch”, or set of changes, this commit represented, and then tries to apply that patch, making those changes, here on the current branch. If it succeeds it makes a new commit that has the same message as the one that’s being cherry picked, and if there are conflicts it asks the human to fix them like normal before doing the `add` and `commit` steps. So it’s trying to “pick up that patch and put it here”, replicating its outcome in a new context. And it makes a new commit that looks like the old one for consistency. But this is important! It looks like the old one, but it is not the same as the old one. Remember, what gives a commit its identity is its hash, and its hash comes from its content. And the content is not the diff. That’s computed. The content is the commit message, which is the same, but also the parent commit which is totally different, and the snapshot of the entire set of files, which will also be totally different. Sure, the patch will be the same because it was based on the original, but presumably the other files on this branch aren’t the same, and maybe even other parts of the files this patch touches will be different. That’s the point of the cherry-pick, to take this change set and transplant it into a new context. Well, that new context has new file contents and a new parent, which means new hashes, which means this commit has a new commit hash and is effectively totally different, despite having the same message. And if there were conflicts, it might not even end up with the same patch, just a similar one.
Okay, so that’s `git cherry-pick`. But what if I’m on a branch with multiple commits that I want to “catch up” to the main branch? I can just find all the commits this branch has that the main branch doesn’t, switch to the main branch, and then cherry pick the old commits one after the other. Now I’ll be on a new branch, on a new commit, but it will “feel like” the old one, with the same changes, but updated to be “re-based” on the new main branch. As in, the branch branches off main at a different point. The base is different. It was rebased. Get it!?
You can use `git rebase -i` to actually see what it’s about to do beforehand. It finds a bunch of commits and then gets ready to pick them.
This can be great, but can also be a nightmare. Mostly because the hashes of everything have changed. When collaborating with people, they’ll see a branch be at one commit, and then the next time they look it’ll have jumped to a completely different set of commits that don’t follow from the one they used to know. They’re not in the history of the new commits, it’s just different. This makes them grumpy.
And because the new commits are unique, if you’ve messed up your history before you can end up with the “same” commit multiple times in history, because actually they’re different rebased copies of each other. And rebasing a previous merge commit can be a real beast because it just makes things more complicated.
Anyway, it’s not a problem problem, it’s just something to be careful about.
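Here’s a small Python sketch of why a cherry-picked or rebased commit gets a new hash. Everything in it is a made-up toy model rather than git’s real object format: `store`, `make_commit`, `patch_of`, `cherry_pick`, and `rebase` are names I invented, a commit’s id is just a SHA-1 over its message, parent id, and file snapshot, and the patch is applied by blindly overwriting whole files, so there’s no conflict handling at all.

```python
import hashlib

store = {}  # commit id -> commit; a toy stand-in for git's object store


def make_commit(message, parent, files):
    """Store a commit; its id is a hash of message + parent + snapshot."""
    snapshot = "\n".join(f"{p}={c}" for p, c in sorted(files.items()))
    cid = hashlib.sha1(f"{parent}\n{snapshot}\n{message}".encode()).hexdigest()
    store[cid] = {"message": message, "parent": parent, "files": dict(files)}
    return cid


def patch_of(cid):
    """The diff is computed, not stored: path -> new contents (None = deleted)."""
    commit = store[cid]
    old = store[commit["parent"]]["files"] if commit["parent"] else {}
    new = commit["files"]
    return {p: new.get(p) for p in set(old) | set(new) if old.get(p) != new.get(p)}


def cherry_pick(cid, onto):
    """Apply cid's patch on top of `onto`, producing a brand new commit."""
    files = dict(store[onto]["files"])
    for path, contents in patch_of(cid).items():
        if contents is None:
            files.pop(path, None)
        else:
            files[path] = contents
    return make_commit(store[cid]["message"], onto, files)


def rebase(commits, onto):
    """Cherry-pick each commit (oldest first) onto the new base, in order."""
    for cid in commits:
        onto = cherry_pick(cid, onto)
    return onto


root = make_commit("init", None, {"a.txt": "1"})
feat1 = make_commit("add b", root, {"a.txt": "1", "b.txt": "2"})
feat2 = make_commit("edit b", feat1, {"a.txt": "1", "b.txt": "2!"})
main1 = make_commit("add c", root, {"a.txt": "1", "c.txt": "3"})

new_tip = rebase([feat1, feat2], main1)
print(store[new_tip]["message"], new_tip != feat2)
```

The rebased tip has the same message and the same patch as `feat2`, but a different parent and a different snapshot (it now includes `c.txt`), so its id is different. And the original `feat2` is still sitting in `store`, untouched, which is what makes recovery possible later.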
And now I’m running out of time, but there’s one more thing I want to talk about, which is my best friend `git reflog`. `git reflog` is just a log of all the commit hashes you’ve ever been at, and why each change happened. Using this you can recover from almost anything you do within git. Bad rebase? That’s okay, branches are just pointers to commit hashes, and the old commit hashes are still there, same as they ever were. And the reflog remembers what those hashes were. Accidentally reset your branch to a bad place? Git reflog knows how to find your way home. Deleted a branch that still had a change on it you forgot to merge? The name may be gone, but the hash isn’t. Reflog knows its old address, and you can just point a new name there, or inspect its log by hash, or cherry-pick it.
Git reflog loves you.
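To make the idea concrete, here’s a tiny Python model of a reflog. It’s a sketch with invented pieces: the `Repo` class, its methods, and the short commit ids are all hypothetical, and the real reflog stores more than this. The one thing borrowed from actual git is the idea behind `HEAD@{n}`, meaning “where HEAD was n moves ago”.

```python
class Repo:
    """Toy model: HEAD is a commit id, and every move of HEAD gets logged."""

    def __init__(self, head):
        self.head = head
        # Oldest entry first; real `git reflog` prints newest first.
        self.reflog = [(head, "initial commit")]

    def move_head(self, commit_id, reason):
        """Point HEAD somewhere else and remember where and why."""
        self.head = commit_id
        self.reflog.append((commit_id, reason))

    def previous(self, n):
        """Where HEAD was n moves ago (the idea behind HEAD@{n})."""
        return self.reflog[-1 - n][0]


repo = Repo("aaa111")                 # placeholder ids, not real hashes
repo.move_head("bbb222", "commit")
repo.move_head("ccc333", "rebase")    # oops, this rebase went badly

# The old commit still exists; the reflog remembers its address.
before_rebase = repo.previous(1)
repo.move_head(before_rebase, "reset")
print(repo.head)
```

Note that undoing the rebase doesn’t erase anything from the log. The reset is just one more entry, so even the “bad” state stays reachable by its hash.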
And now I have to go, but maybe I’ll say more later.
What’s your opinion on mercurial? Was it worse than git?
A lot of it is familiarity and opinions. I was never as familiar with mercurial and so I liked git better. Mercurial is a longer word so for the rest of this I’m going to call it hg. I had friends that liked hg, but it’s been years so some of what I say may be wrong or vibey.
I think the main thing hg has going for it is that it works closer to how people think git works. There’s no concept of the index, it just adds all the changes from your working dir like `git commit -a`. I’m pretty sure that rather than storing the full contents of files like git, and then computing the diffs for display, hg actually stores the changes as a series of patches.
And if I remember correctly, for that reason patches on hg “belong to” a branch, rather than branches pointing at commits like in git. This makes things like cherry-picks and rebases harder and thus less “normal” operations, and IIRC it was a bigger deal in hg to accidentally commit to the wrong branch, whereas with git you can trivially use the reflog to reset the branch to where it was, and that commit you made is still floating in the store with an address even with no branch pointing at it, so you can just point a branch at it still, or cherry-pick it to another branch or whatever. Nothing was lost.
But the main thing people talked about was the simplicity and intuitiveness of the commands. And I think a lot of that comes from the fact that hg worked the way people thought it did and the way people used it. So it was intuitive.
Whereas git, as I described in my main post and its follow-ups, is actually a content addressable tree storage system with a version control system built on top, which gives it immense power and flexibility, but only if you teach people what git really is. It is intuitive once you know what it’s actually doing, but most git tutorials assume people can’t understand because it’s “too complicated”, or that they won’t bother to learn because it’s a side quest on the way to the goal of just tracking versions.
So the tutorials teach git as though it’s mercurial: like there isn’t an index, like changes are patches, like history is linear, and then yeah from that perspective the commands are unintuitive. Why do I have to add files with `git add`, but then commit with `git commit -a` all the time? Why would I need to pass a flag or it’ll do nothing? Shouldn’t that be the default? And then when fixing merge conflicts, I use `git add` for that too? The command I only use for new files? Why? What are all the flags to `git reset`? Why does that un-add stuff, but also roll back changes? Why when I checkout a commit am I in a broken “detached head” state, and the thing I was meant to use was `git reset` again? That’s random. I did a rebase, it didn’t go well, and now git “broke my branch” and my changes are gone.
And so they’ll go for 15 years of their career not knowing how the tool they use every day works, running the same 4 command strings they learned from a tutorial for beginners, and then sometimes something “weird” will happen and they’ll be confused or angry. Because the tutorials didn’t take the 30 minutes it takes, way back at the start, to teach git as git, at which point the command names are still a smidge weird, but their operation is crystal clear and consistent.
And `git reflog` heals basically all wounds.
So yeah, that’s my impression of hg from way back. Simpler and more limited, which had the benefit of therefore also being easier to use and more intuitive because it implemented exactly what people thought it did, so there was a match-up between interface and implementation.