This article explains the principles of version control systems, which are important in software development, and introduces how Git’s content-address file system efficiently manages the history of files. It also covers how commits and references make it easy to track and recover changes to files.
Principles of software versioning with Git
Modern software development is all about collaboration. With multiple developers modifying and updating code at the same time, it’s not uncommon for changes to conflict or have unwanted consequences. This is where a version control system becomes a key tool for collaboration. More than just recording changes to files, a version control system minimises collaborative conflicts and makes it possible to revert to an earlier point in time. Tracking and recording every change during the software development process is an essential function, without which large development projects would struggle to stay organised.
When you’re working on a long document on your computer, sometimes you want to go back to a previous state. In most word processors, you can do this by pressing the Ctrl+Z keyboard shortcut. But after you undo and keep writing, you may find that the version you had before undoing was actually better. At that point you can’t go back to it, because the history from that point has been lost.
This happens when developing software just as it does when writing documents, and the problem is exacerbated when software has to meet ever-changing requirements. To solve it, software engineers use version control systems. A version control system is software that manages the history of documents (files), and Git is one of the most popular among programmers. In this article, we’ll learn about the nature of software version control and take a look at how Git’s internal structure is designed.
Version control systems and file systems
To understand version control systems, you first need to know what software is. Software is written in a programming language that a computer can process. This written text is called code. It might start out in a single file, but as the software grows, it’s split into multiple files. At this point, you need to manage the history of many different files.
Computers use file systems to manage files. A file system is the part of a computer’s operating system that stores and manages files. In Windows, for example, you can see the structure of files and folders by looking at the desktop or Explorer. A file is located in a specific folder and has a name. The combination of a file’s location and name is called its path, and when you have multiple files, you can distinguish them by their paths. Such a file system is called a location-address file system, meaning that the location of a file acts as the address that points to that file.
Git’s content-address file system
Git introduces the content-address file system to overcome the limitations of the location-address file system described earlier. In a location-address file system, the location where a file is stored is the only way to identify that file. That is, the location of a file is fixed, while the contents stored there may change over time. When they do, the previous contents at that path are completely overwritten, leaving no history.
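The overwriting problem can be illustrated with a minimal sketch. The dictionary `files` and the function `write_file` are hypothetical names for this illustration, standing in for a real file system in which the path is the only key.

```python
# Toy location-addressed store: the path is the only key.
# Illustrative only; this is not how a real file system is implemented.
files = {}

def write_file(path, content):
    """Store content under its path, replacing whatever was there before."""
    files[path] = content

write_file("docs/report.txt", "first draft")
write_file("docs/report.txt", "second draft")

print(files["docs/report.txt"])  # second draft
print(len(files))                # 1 (the first draft is no longer stored anywhere)
```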
In contrast, in the content-address file system that Git uses, the content of the file itself serves as the ID that distinguishes the file. When the contents of a file change, a new ID is generated and each version is stored separately under its own ID. Because old content is never overwritten, every version of a file remains retrievable, which is what makes a complete history of changes possible.
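As a rough sketch of the idea, the snippet below keys each piece of content by a hash of that content. The names `objects` and `store` are assumptions made for this illustration, and the ID is a plain SHA-1 of the bytes, which is a simplification of what Git actually hashes.

```python
import hashlib

objects = {}  # hypothetical in-memory object store: ID -> content

def store(content: bytes) -> str:
    """Save content under an ID derived from the content itself."""
    object_id = hashlib.sha1(content).hexdigest()
    objects[object_id] = content
    return object_id

first_id = store(b"first draft")
second_id = store(b"second draft")
same_id = store(b"first draft")  # identical content maps to the same ID

print(first_id != second_id)  # True: changed content gets a new ID
print(same_id == first_id)    # True: unchanged content is not stored twice
print(objects[first_id])      # the first draft is still retrievable
```

Because the ID depends only on the content, storing the same content twice costs nothing extra, while every distinct version is kept.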
However, storing every version of every file can produce a large number of stored objects and take up a lot of storage space. To keep this under control, Git compresses the objects it stores (using zlib), so that the history occupies far less space than the raw contents would.
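The effect of compression can be seen with Python’s standard `zlib` module. The sample text below is made up for the demonstration, and real savings depend heavily on the content; this is a sketch of the principle rather than of Git’s storage code.

```python
import zlib

# Repetitive sample content, invented for this demonstration.
content = b"The quick brown fox jumps over the lazy dog.\n" * 200

compressed = zlib.compress(content)
restored = zlib.decompress(compressed)

print(len(content), "bytes before compression")
print(len(compressed), "bytes after compression")
print(restored == content)  # True: compression is lossless
```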
Hash functions and how Git works
Git implements a content-address file system using hash functions. A hash function takes an input of arbitrary size and returns a value of fixed length. By default, Git uses the SHA-1 hash function: it feeds the contents of a file, prefixed with a short header, into SHA-1 and uses the resulting 40-character hexadecimal string as the file’s ID. If this string differs between two versions, the contents of the file have changed.
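The ID that Git assigns to a file’s contents can be reproduced in a few lines. The helper name `git_blob_id` is an assumption for this sketch; the header format (`blob`, a space, the byte length, and a NUL byte) is the one Git uses for file contents when SHA-1 is the hash in use.

```python
import hashlib

def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 ID Git assigns to file contents (a "blob")."""
    # Git hashes a short header followed by the raw bytes of the file.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()

print(git_blob_id(b"hello\n"))  # ce013625030ba8dba906f756967f9e9ca394464a
```

Running `git hash-object` on a file containing the same bytes should print the same ID.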
Hash functions do more than just return a string: they return a completely different result when the input varies even slightly, so any change to a file can be detected reliably. For example, if you put the string Hello and the string hello into a hash function, you’ll get completely different results. This property allows Git to notice even a one-character change to a file.
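The Hello and hello comparison is easy to try. This sketch hashes the raw strings with plain SHA-1 from Python’s standard library, without Git’s header, and counts how many of the 40 hexadecimal characters differ.

```python
import hashlib

upper = hashlib.sha1(b"Hello").hexdigest()
lower = hashlib.sha1(b"hello").hexdigest()

print(upper)
print(lower)

# Count the positions at which the two 40-character digests disagree.
differing = sum(a != b for a, b in zip(upper, lower))
print(differing, "of", len(upper), "characters differ")
```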
Let’s take a look at how Git works. When you run git init in a specific folder, Git sets up a content-address file system in that folder, inside a hidden .git directory. When you write a file and save it with the git add command, Git puts the contents of the file into a hash function and stores the contents under the resulting ID. If you then modify the file and run git add again, the new contents are stored under a new ID, and in this way the file’s history is built up.
You can also bundle changes to multiple files together, and such a bundle is called a commit. The git commit command saves all the changes you’ve added, across any number of files, as a single bundle, and you can later restore your files to the state recorded in that commit.
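The add and commit steps can be sketched as a toy model. Everything here (`objects`, `staged`, `commits`, and the functions `add`, `commit`, and `checkout`) is a hypothetical simplification: real Git hashes commit objects rather than numbering them, and it stores far more metadata.

```python
import hashlib

objects = {}   # content-addressed store: ID -> bytes
staged = {}    # path -> ID of the content most recently added for that path
commits = []   # saved bundles, oldest first (toy model: index acts as the ID)

def add(path, content):
    """Store the content under its hash and remember it for the next commit."""
    blob_id = hashlib.sha1(content).hexdigest()
    objects[blob_id] = content
    staged[path] = blob_id

def commit(message):
    """Bundle the currently staged files into one commit."""
    commits.append({"message": message, "files": dict(staged)})
    return len(commits) - 1

def checkout(index):
    """Rebuild every file as it was recorded in the given commit."""
    files = commits[index]["files"]
    return {path: objects[blob_id] for path, blob_id in files.items()}

add("main.py", b"print('v1')\n")
add("README", b"Demo\n")
first = commit("initial version")

add("main.py", b"print('v2')\n")
second = commit("update main.py")

print(checkout(first)["main.py"])   # contents as of the first commit
print(checkout(second)["main.py"])  # contents as of the second commit
```

Note that the second commit still refers to the unchanged README through the same ID, so nothing is stored twice.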
Managing references and commits
Each commit records the changes bundled in it and is given a unique ID. The drawback is that a commit ID is a 40-character hexadecimal string, which is difficult to remember. To solve this problem, Git introduces the concept of a reference. A reference is a short, friendly name that points to a commit. The default reference has traditionally been named master, although newer setups often use main. This allows you to manage commits using an easily remembered name instead of the commit’s ID.
Git is also designed so that each commit references the commit that came before it (its parent). As a result, even if you only know the reference to the most recent commit, you can follow these links backwards to reach every earlier commit.
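References and parent links can be modelled together in a short sketch. The dictionaries `commits` and `refs` and the functions `commit` and `history` are assumptions for this illustration, and the way the commit ID is computed here is simplified compared with Git’s real commit format.

```python
import hashlib

commits = {}  # commit ID -> {"message": ..., "parent": ...}
refs = {}     # reference name -> ID of the most recent commit

def commit(message, branch="master"):
    """Create a commit whose parent is the commit the reference points to."""
    parent = refs.get(branch)  # None for the very first commit
    body = f"parent {parent}\n\n{message}".encode()
    commit_id = hashlib.sha1(body).hexdigest()
    commits[commit_id] = {"message": message, "parent": parent}
    refs[branch] = commit_id   # move the reference to the new commit
    return commit_id

def history(branch="master"):
    """Walk from the reference back through every parent commit."""
    messages = []
    commit_id = refs.get(branch)
    while commit_id is not None:
        messages.append(commits[commit_id]["message"])
        commit_id = commits[commit_id]["parent"]
    return messages

commit("first")
commit("second")
commit("third")

print(history())  # ['third', 'second', 'first']
```

Only the name master is needed to recover the whole chain, which is why a single reference is enough to reach every earlier commit.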
Git’s versatility and extensibility
Git can be used in many different areas, not just software development. For tasks that require collaboration with multiple people, such as writing research reports or managing project materials, the ability to record the changes in each version and revert to an earlier point in time when needed is invaluable. The same mechanisms that serve software developers apply equally well to documentation and data management.
Conclusion
Git overcomes the limitations of the location-address file system with its content-address file system, allowing you to efficiently manage the history of your files. It also provides features like commits and references to make versioning easy for users. It’s complex under the hood, but once you understand how it works, Git is a powerful and useful tool for software version control.