Modern internet services rely on the non-stop operation of servers to provide reliable service 24 hours a day, 365 days a year. Hardware and software technologies, load balancing, and proprietary technologies from platforms like Facebook make this possible.
Many internet services are actively operated and developed these days. Social networking services such as Facebook and Twitter are among the most prominent. These services provide users with a place to communicate and share information, and have become an essential tool of modern society. Online games and mobile games are also a type of Internet service. They have evolved beyond mere entertainment to become platforms for users around the world to compete and collaborate in real time. Mobile games, in particular, have exploded in popularity because they can be played anywhere, anytime.
Many of us have experienced slow site access, error pages, or a message that the server is under maintenance when using these services. This can be extremely frustrating for users, and can cause them to lose trust, especially if it happens at a critical moment. Users often refer to this situation as “the server is dead”. Why do servers suffer and die, and why do users lose access to the services they want as a result?
To answer this question, we first need to understand the role of servers at the heart of Internet services. A server is a centralized computer system that provides services to users. It handles requests from many users at the same time, loading web pages and transferring data. If your server isn’t working properly, your users won’t be able to use your services properly. For this reason, stable and reliable server operation is a critical factor in the success of internet services.
Non-stop operation technology is literally the ability to provide internet services 24 hours a day, 365 days a year, without stopping. Users of an internet service that is well equipped with non-stop operation technology can access the service at any time they want. This is essential for maximizing user convenience while also keeping the service provider’s revenue stable. The revenue of an internet service is proportional to the uptime of the service multiplied by the number of users connected at the same time. In other words, increasing the uptime of the service or increasing the number of users connected at the same time is the way to increase revenue for internet service providers. The latter depends on how you market or design your service, while the former is a challenge for engineers.
Non-stop operation technologies fall into two main categories: hardware and software. In an Internet service, the program running on the server computer is called a server application; the server computer is the hardware and the server application is the software. Hardware non-stop operation technology covers what is done to an ordinary server computer to keep it running without interruption, and software non-stop operation technology covers what is done to an ordinary server application to achieve the same.
How do you make a server computer that never stops? One way is to connect CPUs or hard disks in parallel. To see why this helps, recall that computers can only work with zeros and ones, which is why they use the binary system to represent numbers. Each character also has a corresponding number, defined by the ASCII code: the capital letter “A” is the number 65, and the letter “B” is the number 66. So every letter and number can be represented by 0s and 1s.
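As a small illustration in Python (the helper name `to_bits` is made up for this sketch), the following prints the standard ASCII numbers for “A” and “B” and the eight 0s and 1s that represent each of them:

```python
def to_bits(text: str) -> list[str]:
    """Return each character's ASCII code written as 8 binary digits."""
    return [format(ord(ch), "08b") for ch in text]


print(ord("A"), ord("B"))  # 65 66
print(to_bits("AB"))       # ['01000001', '01000010']
```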
Sometimes a bit inside a computer unintentionally flips from 0 to 1 or from 1 to 0. When that happens, the number or letter being represented changes, and the computer can freeze. The CPU and the hard disk are the two components where this most commonly happens. The CPU is the part that performs the calculations, and the hard disk is the part that stores the results. The data stored on the hard disk is sometimes used again by the CPU, so both parts need to work well to provide normal Internet service.
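To make the effect of a single flipped bit concrete, here is a minimal Python sketch. The function `flip_bit` is a hypothetical helper written for this illustration; it only imitates a fault and does not model how real hardware fails.

```python
def flip_bit(char: str, position: int) -> str:
    """Return the character obtained by flipping one bit of char's code.

    position 0 is the lowest (rightmost) bit.
    """
    return chr(ord(char) ^ (1 << position))


print(flip_bit("A", 1))  # C  (65 = 01000001 becomes 67 = 01000011)
print(flip_bit("7", 0))  # 6  (55 = 00110111 becomes 54 = 00110110)
```

One changed bit is enough to turn one letter or digit into another, which is why the program reading that value may no longer behave correctly.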
Two is better than one: connect two CPUs in parallel. Run each operation on both CPUs and compare the results. If the results differ, one of the CPUs has a problem and the operation is run again. Assume there is a 10% chance that an operation produces an incorrect result on a single CPU. The results differ in two cases: (correct, incorrect) and (incorrect, correct). Neither case is a problem under the scheme described above, because the operation is redone whenever the two results differ. If both results are incorrect, however, the computer will unfortunately freeze. But the probability of this happening is only 1%, which is 10% squared (0.1 × 0.1 = 0.01). In practice, the failure rate of a real CPU is far smaller than the figures used here, so with two, three, or more CPUs in parallel it is very unlikely that a CPU problem will cause the computer to freeze.
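The arithmetic and the compare-and-retry idea can be sketched in Python as follows. This is only an illustration under stated assumptions: `faulty_add` is a hypothetical simulated CPU that corrupts its result by flipping the lowest bit, the two CPUs are assumed to fail independently, and all function names are invented for the example.

```python
import random


def outcome_probabilities(p: float) -> dict[str, float]:
    """Probabilities of the three outcomes when two CPUs each fail with chance p."""
    return {
        "both_correct": (1 - p) * (1 - p),
        "results_differ": 2 * p * (1 - p),  # detected, so the operation is redone
        "both_incorrect": p * p,            # the case the comparison cannot rescue
    }


def faulty_add(a: int, b: int, p: float, rng: random.Random) -> int:
    """Simulated CPU (hypothetical): adds a and b, corrupting the sum with chance p."""
    total = a + b
    if rng.random() < p:
        total ^= 1  # simulate a fault by flipping the lowest bit
    return total


def redundant_add(a: int, b: int, p: float, rng: random.Random,
                  max_retries: int = 1000) -> int:
    """Run the addition on two simulated CPUs and retry until the results agree."""
    for _ in range(max_retries):
        first = faulty_add(a, b, p, rng)
        second = faulty_add(a, b, p, rng)
        if first == second:
            return first
    raise RuntimeError("the two CPUs never agreed")


probs = outcome_probabilities(0.10)
print({name: round(value, 4) for name, value in probs.items()})
# {'both_correct': 0.81, 'results_differ': 0.18, 'both_incorrect': 0.01}
print(redundant_add(2, 3, 0.10, random.Random(42)))
```

In this simplified model a mismatch is always caught and retried, while the rare case in which both simulated CPUs are wrong in the same way passes the comparison. That is the 1% case discussed above.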
The same is true for hard disks. Data stored on a hard disk can also change from 0 to 1 or from 1 to 0 at some point. A hard disk usually has a built-in function for determining whether data is normal or abnormal: for every 10 consecutive bits, it also stores the number of 1s among them. When the computer reads the data, it compares the current number of 1s with the stored count, and if they differ, it knows the data is abnormal. However, a normal hard disk has no way to recover the original data. Therefore, server computers with non-stop operation technology are equipped with multiple hard disks that store the same data. When the CPU needs the data, any abnormal copy is replaced with a normal copy from another disk before the data is handed over to the CPU.
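The check-and-repair idea can be sketched in Python roughly as follows. This is a simplified model, not how any particular disk or storage product works: the class `MirroredStore` and its methods are hypothetical, each "disk" is just a dictionary holding a string of bits and the stored count of 1s, and a simple count like this cannot notice two flips that cancel each other out.

```python
def count_ones(bits: str) -> int:
    """Count the 1s in a string of bits such as '1011001110'."""
    return bits.count("1")


class MirroredStore:
    """Keeps the same block of bits on several simulated disks (hypothetical model)."""

    def __init__(self, bits: str, copies: int = 2) -> None:
        # Each copy stores the data together with the number of 1s it contained.
        self.copies = [{"bits": bits, "ones": count_ones(bits)} for _ in range(copies)]

    def is_healthy(self, index: int) -> bool:
        """A copy is healthy if its current number of 1s matches the stored count."""
        copy = self.copies[index]
        return count_ones(copy["bits"]) == copy["ones"]

    def read(self) -> str:
        """Return data from a healthy copy and overwrite any abnormal copies with it."""
        healthy = [i for i in range(len(self.copies)) if self.is_healthy(i)]
        if not healthy:
            raise RuntimeError("every copy is abnormal; the data cannot be recovered")
        good_bits = self.copies[healthy[0]]["bits"]
        for i, copy in enumerate(self.copies):
            if i not in healthy:
                copy["bits"] = good_bits  # repair from the healthy disk
        return good_bits


store = MirroredStore("1011001110")
store.copies[0]["bits"] = "0011001110"  # simulate a flipped bit on the first disk
print(store.is_healthy(0))              # False
print(store.read())                     # 1011001110
print(store.is_healthy(0))              # True
```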
How do you create a server application that never crashes? First, use a verification program that can detect errors in your program early. Second, run your server application on multiple server machines and update them one at a time. These two measures address the two main reasons a server application stops. The first is a programming error. The second is an update to the server application that adds functionality to the Internet service. Programming errors are not unique to Internet services; they have been a problem since the first computers were developed. For this reason, special verification programs exist to detect programming errors early. To some extent, these verification programs can prevent server applications from freezing.
If you want to add a new feature to your Internet service, you need to stop the server application and start a version of the server application that includes the new feature. Because both versions cannot run on the same server machine at the same time, the old one must be stopped before the new one is started. Users cannot use the service during the time between the stop and the start. But what if you run the server application on multiple server machines and update them one after the other? This solves the problem, because the other server applications keep handling requests while one server application is down for the update. The challenge, however, is that you need to implement seamless communication between multiple server machines.
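The one-after-the-other update can be sketched in Python like this. It is a toy model under stated assumptions: the server names, the version strings, and the function `rolling_update` are all hypothetical, and a real deployment would also need health checks and the communication between machines mentioned above.

```python
def rolling_update(servers: dict[str, dict], new_version: str) -> list[int]:
    """Update servers one at a time; return how many stayed up during each step."""
    available_during_step = []
    for name in servers:
        servers[name]["running"] = False        # stop the old version
        still_up = sum(1 for s in servers.values() if s["running"])
        if still_up == 0:
            servers[name]["running"] = True     # abort and keep the old version
            raise RuntimeError("no server left to handle requests")
        available_during_step.append(still_up)
        servers[name]["version"] = new_version  # install the new version
        servers[name]["running"] = True         # start it again
    return available_during_step


servers = {name: {"version": "1.0", "running": True}
           for name in ("web1", "web2", "web3")}
print(rolling_update(servers, "1.1"))  # [2, 2, 2]
```

With three machines, two are always available while the third is being updated. With a single machine the sketch refuses to proceed, which mirrors the downtime described above.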
One of the techniques used to do this is called load balancing. Load balancing spreads incoming work evenly across multiple servers so that no particular server is overwhelmed. This technology is especially important for large-scale Internet services. For example, during an event in which millions of users access the service at the same time, poor load balancing increases the likelihood of a server going down. Load balancing is therefore an essential component of non-stop operation technology.
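One simple way to spread requests evenly is round-robin selection, sketched below in Python. Round-robin is only one of several possible strategies and is chosen here for illustration; the class `RoundRobinBalancer`, its methods, and the server names are hypothetical.

```python
import itertools


class RoundRobinBalancer:
    """Hands out servers in turn, skipping any that are marked as down."""

    def __init__(self, servers: list[str]) -> None:
        self.servers = list(servers)
        self.healthy = set(self.servers)
        self._cycle = itertools.cycle(self.servers)

    def mark_down(self, server: str) -> None:
        self.healthy.discard(server)

    def mark_up(self, server: str) -> None:
        if server in self.servers:
            self.healthy.add(server)

    def pick(self) -> str:
        """Return the next healthy server in rotation."""
        if not self.healthy:
            raise RuntimeError("no healthy server available")
        # At most one full pass over the list is needed to find a healthy server.
        for _ in range(len(self.servers)):
            server = next(self._cycle)
            if server in self.healthy:
                return server
        raise RuntimeError("no healthy server available")


balancer = RoundRobinBalancer(["s1", "s2", "s3"])
print([balancer.pick() for _ in range(4)])  # ['s1', 's2', 's3', 's1']
balancer.mark_down("s2")                    # e.g. s2 is being updated
print([balancer.pick() for _ in range(3)])  # ['s3', 's1', 's3']
```

Marking a server as down while it is updated, and back up afterwards, is how a balancer like this would keep requests away from a machine that is temporarily unavailable.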
All of the above techniques should be in place by default in order to achieve non-stop operation. Facebook, Twitter, Instagram, and other popular social networking sites already implement all of them. However, these techniques are not perfect. Even with parallel CPUs and hard disks, there is still a chance of error, albeit a lower one than before. Likewise, verification programs do not catch every problem in a server application. For this reason, beyond the techniques described above, many service operators have developed and implemented their own proprietary non-stop operation technologies. I’ve often experienced slow images or slow posts, but I’ve never experienced Facebook’s servers dying. That’s because of Facebook’s proprietary non-stop operation technology.
This kind of non-stop operation technology is directly related to a company’s competitiveness. An uninterrupted service earns users’ trust, which is important for gaining loyal users. Conversely, a service that is frequently down can accelerate user churn. Until Internet service providers can offer, and users can receive, truly uninterrupted service, non-stop operation technology needs further research. This research will help us build a more stable and reliable Internet environment.