This is the first in a series on server monitoring. The primary focus of these posts is monitoring in a *nix environment.
Monitoring servers is important. Whether it be finding an issue in a test environment prior to deploying or debugging an issue in production, we need access to information on our server to be able to tease apart what went wrong.
In much earlier days, monitoring a server was a very manual process that relied on humans to experience issues and other humans to log into a server, open individual logs, run individual commands, and see if the output of any of the above yielded results that explained the original situation. Luckily, tooling has evolved to make this process more streamlined and centralized, but it’s still up to us to figure out what server information we need to gather.
Before You Get Started
Although you could just dive right in and check your server’s every hiccup, for your own sanity and troubleshooting abilities, it’s important to take a step back and formulate what information you’re hoping to gain. Some important ideas to keep in mind:
- Know your application
- Be mindful of changing needs
Know your application
When you set up your application, make sure you have a fairly good idea of the behaviors you expect to see. Specifically, you should know how the application will perform when there’s too little memory, CPU, disk, or other system resources. Will it lag? Will it become unresponsive? Will it not load at all? What are the thresholds for these behaviors and what is the difference between a “blip” (e.g. CPU spike) and a real problem?
None of these are necessarily easy questions to answer, but you will need to know what to expect and what timeline to expect it on to prevent an oversensitive alerting system. (See my article on alert fatigue.)
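One way to tell a blip from a real problem is to alert only when a reading stays over its threshold for several samples in a row. Here is a minimal sketch of that idea. The class name `SustainedThreshold`, the 90% limit, and the three-sample window are all assumptions for illustration, not recommendations for your application.

```python
from collections import deque


class SustainedThreshold:
    """Report a breach only when the last `window` samples all exceed `limit`."""

    def __init__(self, limit, window):
        self.limit = limit
        self.samples = deque(maxlen=window)  # keeps only the newest samples

    def observe(self, value):
        """Record one sample and return True if the breach is sustained."""
        self.samples.append(value)
        window_full = len(self.samples) == self.samples.maxlen
        return window_full and all(s > self.limit for s in self.samples)


# Hypothetical CPU readings (percent): one isolated spike, then a real problem.
check = SustainedThreshold(limit=90.0, window=3)
readings = [40, 97, 42, 95, 96, 98]
alerts = [check.observe(r) for r in readings]
print(alerts)
```

The single spike to 97 never fires; only the final run of three high readings does.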
Be mindful of changing needs
As time marches forward, your needs will change. There might be new features implemented, others removed, or other shifting requirements that will change your application’s expected behavior. When this happens, you’ll need to re-evaluate how you expect your application to function on your server so you know what resource usages to expect and what behaviors you can anticipate when those resources dwindle.
Hit the Ground Running
When you are monitoring a *nix server, you’re generally going to be looking at the following:
- CPU usage
- Memory usage
- Disk usage
- Page faults
- Uptime
- Network activity
- Swap usage
CPU usage: checking the CPU usage allows you to see what percentage of your processor is being used. Depending on your specific needs, you’ll likely need to know the total usage and the breakdown by process or user. The additional granularity of looking at processes and users will make it a lot easier to troubleshoot when an issue arises.
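To make "percentage of your processor" concrete: on Linux the kernel exposes cumulative CPU time counters in /proc/stat, and a usage percentage is the busy share of the change between two samples. The sketch below assumes that file's usual layout (user, nice, system, idle, iowait, irq, softirq, steal). The function names and the sample numbers are made up for illustration.

```python
def parse_cpu_line(stat_text):
    """Return (idle, total) ticks from the aggregate 'cpu' line of /proc/stat."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            # user nice system idle iowait irq softirq steal; the guest
            # columns that follow are already counted in user/nice.
            values = [int(v) for v in fields[1:9]]
            idle = values[3] + values[4]  # idle + iowait
            return idle, sum(values)
    raise ValueError("no aggregate cpu line found")


def cpu_percent(before_text, after_text):
    """Busy share of CPU time between two /proc/stat samples, in percent."""
    idle_a, total_a = parse_cpu_line(before_text)
    idle_b, total_b = parse_cpu_line(after_text)
    total_delta = total_b - total_a
    if total_delta <= 0:
        return 0.0
    return 100.0 * (total_delta - (idle_b - idle_a)) / total_delta


# Made-up samples; on a live server read /proc/stat twice, a second or so apart.
BEFORE = "cpu  1000 0 500 8000 500 0 0 0 0 0\ncpu0 1000 0 500 8000 500 0 0 0 0 0\n"
AFTER = "cpu  1300 0 600 8500 600 0 0 0 0 0\ncpu0 1300 0 600 8500 600 0 0 0 0 0\n"
print(cpu_percent(BEFORE, AFTER))
```

The per-process and per-user breakdown comes from other sources (for example /proc/<pid>/stat), which this sketch does not cover.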
Memory usage: how much memory is being used in total as well as by individual processes and users. Depending on your needs you can monitor the memory usage as a percentage and/or by GB/MB.
Disk usage: how much of your disk is being used. Similar to memory usage, you can monitor your disk usage by percentage and/or by GB. You should also keep track of the inode usage. Inodes store the metadata for the filesystem objects (files, directories, and so on) in a *nix operating system, and a filesystem generally has a limited number of them. Running out of inodes isn’t necessarily common, but it is something to check if you happen to know that the application running on your server tends to have a lot of small files, as some CI/CD tools do.
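Both numbers can be read in one call. This is a rough sketch using Python's standard `os.statvfs`; the helper names are my own, and the percentages may differ a little from what df prints because df accounts for blocks reserved for root.

```python
import os


def percent_used(total, free):
    """Percentage of `total` that is not free; 0.0 if the total is unreported."""
    if total <= 0:
        return 0.0  # some filesystems report 0 inodes
    return 100.0 * (total - free) / total


def disk_and_inode_usage(path="/"):
    """Block and inode usage (percent) for the filesystem containing `path`."""
    st = os.statvfs(path)
    return {
        "blocks_pct": percent_used(st.f_blocks, st.f_bfree),
        "inodes_pct": percent_used(st.f_files, st.f_ffree),
    }


usage = disk_and_inode_usage("/")
print(usage)
```

A filesystem can show plenty of free blocks while `inodes_pct` sits near 100, which is exactly the "lots of small files" case.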
Page faults: simply put, a page is a fixed-size block of virtual memory. Like virtually every modern operating system, your server uses a paging system to map the virtual memory that each process sees to the physical memory of the machine, and this is true whether it is bare metal or an instance hosted by a cloud service provider such as AWS, GCP, Azure, DigitalOcean, or Linode. A page fault is what happens when a process accesses a page of virtual memory that is not currently mapped to physical memory. A minor fault is resolved without touching the disk, while a major fault means the page has to be read in from disk (from swap, for example) and is much slower. A steady background of page faults is normal, and the operating system has a built-in page fault handler to manage them, but if you see a spike in page faults, particularly major ones, you should definitely start looking for a deeper problem.
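Spotting a spike means comparing counters over time. On Linux, /proc/vmstat carries cumulative `pgfault` (all faults) and `pgmajfault` (major faults) counters, so a sketch like the following can turn two samples into a delta. The function names and sample values are illustrative assumptions.

```python
def parse_vmstat(vmstat_text):
    """Parse 'name value' lines from /proc/vmstat into a dict of ints."""
    counters = {}
    for line in vmstat_text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def fault_deltas(before_text, after_text):
    """Page faults that occurred between two /proc/vmstat samples."""
    before = parse_vmstat(before_text)
    after = parse_vmstat(after_text)
    return {
        "total": after["pgfault"] - before["pgfault"],
        "major": after["pgmajfault"] - before["pgmajfault"],
    }


# Made-up samples standing in for two reads of /proc/vmstat.
BEFORE = "nr_free_pages 100\npgfault 5000\npgmajfault 20\n"
AFTER = "nr_free_pages 90\npgfault 5600\npgmajfault 95\n"
print(fault_deltas(BEFORE, AFTER))
```

What counts as a spike depends on your baseline, so it is worth recording these deltas during normal operation first.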
Uptime: Uptime is how long your server has been running since its last boot. This is something to monitor mainly to see if your server has experienced an unexpected reboot.
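An unexpected reboot shows up as an uptime value that is lower than the last one you recorded. A small sketch, assuming the Linux /proc/uptime format (seconds up, then seconds idle); the function names are hypothetical.

```python
def parse_uptime(uptime_text):
    """Return seconds since boot from the contents of /proc/uptime."""
    return float(uptime_text.split()[0])


def rebooted(previous_uptime, current_uptime):
    """True if the uptime counter went backwards between two checks."""
    return current_uptime < previous_uptime


# Made-up readings: the second check happens shortly after a reboot.
earlier = parse_uptime("864000.25 3400000.10\n")
later = parse_uptime("120.50 400.00\n")
print(rebooted(earlier, later))
```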
Network activity: Everything is abuzz on the internet these days, including your server. How granularly you need to monitor your network traffic will depend a lot on what your server is hosting, but in general, even a server running a static site that isn’t actively being accessed will see some I/O for other processes running on the server – it shouldn’t drop to 0.
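For a coarse view of traffic, Linux keeps per-interface byte counters in /proc/net/dev. The sketch below turns two samples into bytes per second, assuming the usual 16-column layout of that file. The function names, the `eth0` interface, and the numbers are assumptions for illustration.

```python
def parse_net_dev(text):
    """Return {interface: {'rx_bytes', 'tx_bytes'}} from /proc/net/dev text."""
    stats = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # the two header lines contain no colon
        name, _, rest = line.partition(":")
        fields = rest.split()
        if len(fields) < 16:
            continue
        # Column 0 is received bytes, column 8 is transmitted bytes.
        stats[name.strip()] = {"rx_bytes": int(fields[0]), "tx_bytes": int(fields[8])}
    return stats


def throughput(before_text, after_text, interval_seconds):
    """Bytes per second per interface between two samples."""
    before = parse_net_dev(before_text)
    after = parse_net_dev(after_text)
    rates = {}
    for name, now in after.items():
        if name in before:
            rates[name] = {
                key: (now[key] - before[name][key]) / interval_seconds for key in now
            }
    return rates


HEADER = (
    "Inter-|   Receive                                   |  Transmit\n"
    " face |bytes packets errs drop fifo frame compressed multicast|"
    "bytes packets errs drop fifo colls carrier compressed\n"
)
# Made-up samples taken 60 seconds apart.
BEFORE = HEADER + "  eth0: 5000000 40000 0 0 0 0 0 0 2500000 30000 0 0 0 0 0 0\n"
AFTER = HEADER + "  eth0: 5600000 41000 0 0 0 0 0 0 2530000 30500 0 0 0 0 0 0\n"
print(throughput(BEFORE, AFTER, 60))
```

A rate that sits at exactly zero for an extended period is the "dropped to 0" case worth alerting on.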
Swap usage: Swap is disk space that is reserved to supplement memory usage when free memory is running low. How aggressively your system makes use of swap depends on how you’ve configured its “swappiness” on a scale of 0 (low) to 100 (high). It isn’t uncommon to use a value of 10 or less if you want to allow some swapping when there are memory issues within the system without going all the way down to 0, which tells the kernel to avoid swapping for as long as it possibly can. When you’re looking at “swap usage”, you’re generally going to want to look at the swap rate to see if there’s been a significant uptick in swap usage. If so, start taking a look at your memory and other resources to see if you can dig up a cause.
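The swap rate can be derived from the cumulative `pswpin` and `pswpout` counters (pages swapped in and out) in /proc/vmstat on Linux. A minimal sketch, with hypothetical function names and made-up samples:

```python
def swap_counters(vmstat_text):
    """Extract the pswpin/pswpout page counters from /proc/vmstat text."""
    wanted = {"pswpin", "pswpout"}
    found = {}
    for line in vmstat_text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] in wanted:
            found[parts[0]] = int(parts[1])
    return found


def swap_rate(before_text, after_text, interval_seconds):
    """Pages swapped in and out per second between two samples."""
    before = swap_counters(before_text)
    after = swap_counters(after_text)
    return {
        name: (after[name] - before[name]) / interval_seconds
        for name in ("pswpin", "pswpout")
    }


# Made-up samples taken 10 seconds apart.
BEFORE = "pgfault 5000\npswpin 100\npswpout 400\n"
AFTER = "pgfault 5600\npswpin 150\npswpout 1000\n"
print(swap_rate(BEFORE, AFTER, 10))
```

The current swappiness value itself can be read from /proc/sys/vm/swappiness.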
Using the Command Line
Frequently when managing a server, you will need to use the command line in addition to whatever monitoring / alerting tools you have brought into the mix. With that in mind, here are a few command line tools that you can use to view the above information. Note that not all of these are installed by default with all Linux distributions, so you may need to install them with your package installer or from source depending on your distribution.
top – A handy tool that allows you to view the uptime, memory usage, CPU usage, and swap, amongst other things. You can also view the running processes, the command used to run each process, etc. You can exit top by hitting the q key.
htop – Very similar to running top with the command combination zxcVm1t0 (adding a W to the end of that will save this configuration to your .toprc file), but more interactive. When in htop you can sort the provided information using the function keys indicated at the bottom of the screen. Similar to top, you can exit by hitting the q key.
tcpdump – This is a powerful tool for monitoring network packets. For example, you could use it to listen to all the network traffic on your server instance, or limit to only listening to the traffic from specific source and/or destination ports. You can exit by using ctrl+C.
netstat – Allows you to view the what and how of networking connections on your server, including routing table information as well as TCP/UDP connections and their processes. A common combination for the latter is -tlnpu, which lists listening TCP and UDP sockets with numeric addresses and the process that owns each one. If needed, you can also run the command with -c for continuous monitoring.
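If netstat is not installed, part of the same information is available from /proc/net/tcp on Linux. The sketch below lists listening IPv4 TCP sockets from that file's text. It assumes a little-endian host (x86 and most ARM systems) when decoding addresses, and the function names and sample rows are made up for illustration.

```python
import socket
import struct

LISTEN = "0A"  # state code for a listening socket in /proc/net/tcp


def decode_ipv4(hex_addr):
    """Turn '0100007F:0035' into ('127.0.0.1', 53)."""
    host_hex, port_hex = hex_addr.split(":")
    # Assumption: little-endian host, so the hex value is byte-reversed.
    packed = struct.pack("<I", int(host_hex, 16))
    return socket.inet_ntoa(packed), int(port_hex, 16)


def listening_sockets(proc_net_tcp_text):
    """Return (address, port) pairs for sockets in the LISTEN state."""
    result = []
    for line in proc_net_tcp_text.splitlines():
        fields = line.split()
        # fields: slot, local address, remote address, state, ...
        if len(fields) > 3 and fields[3] == LISTEN:
            result.append(decode_ipv4(fields[1]))
    return result


# Made-up sample: two listeners and one established connection.
SAMPLE = (
    "  sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode\n"
    "   0: 0100007F:0035 00000000:0000 0A 00000000:00000000 00:00000000 00000000   101        0 20000 1\n"
    "   1: 00000000:0016 00000000:0000 0A 00000000:00000000 00:00000000 00000000     0        0 20001 1\n"
    "   2: 0A00000A:0016 0B00000A:C350 01 00000000:00000000 00:00000000 00000000     0        0 20002 1\n"
)
print(listening_sockets(SAMPLE))
```

Unlike netstat -p, this does not map sockets to processes; doing so means matching the inode column against /proc/<pid>/fd.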
nmon – a.k.a. Nigel’s monitor, allows you to view incredibly detailed information about your server. Similar to htop, nmon is interactive, so you go through a series of menus to see information about your processor, disks, etc. To quit, use q or ctrl+C.
uptime – An easy way to see how long your server has been running. By default it prints the current time, how long the system has been up, the number of logged-in users, and the load averages. If you would like to see only the uptime as a duration, use --pretty; to see the timestamp the server started, use --since.
/proc/meminfo – You can cat or use your editor of choice to view this file, which provides a deep dive into the current active / free memory usage.
free – To view the total free and used memory with no process information, use free. By default the values are given in kibibytes, so I recommend appending -h if you’d like to see the output in mega- or gigabytes.
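Because /proc/meminfo is plain "Key: value kB" text, it is also easy to read from a script. Here is a rough sketch that computes a used-memory percentage from `MemTotal` and `MemAvailable` (the latter is present on reasonably recent kernels). The function names and the sample values are assumptions.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo text into {field: value}; most values are in kB."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            values[key.strip()] = int(parts[0])
    return values


def memory_used_percent(text):
    """Share of memory not available for new allocations, in percent."""
    info = parse_meminfo(text)
    total = info["MemTotal"]
    available = info["MemAvailable"]
    return 100.0 * (total - available) / total


# Made-up sample standing in for the contents of /proc/meminfo.
SAMPLE = (
    "MemTotal:        8000000 kB\n"
    "MemFree:         1000000 kB\n"
    "MemAvailable:    2000000 kB\n"
    "SwapTotal:       2097148 kB\n"
)
print(memory_used_percent(SAMPLE))
```

Using `MemAvailable` rather than `MemFree` avoids counting reclaimable cache as "used".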
df – Allows you to view the total disk usage of all volumes mounted on your system. Defaults to 1k block size, so I’d recommend appending -h to see output in mega- and/or giga- bytes.
du – Recursively prints the disk usage of each directory under a specified directory (add -a to include individual files as well). User beware: if no directory is specified then the current directory is used, which will quickly overtake your terminal if you happen to be somewhere like root ( / )! If you would just like to see the total disk used by a directory, use -sh.
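If you want a du -sh style total inside a script, something like the following sketch can work. Note that it sums apparent file sizes rather than allocated disk blocks, so its totals can differ from du's. The names `directory_size` and `human` are my own, and the temporary directory is only there to make the example self-contained.

```python
import os
import tempfile


def directory_size(path):
    """Sum the apparent size in bytes of regular files under `path`."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if not os.path.islink(full):  # skip symlinks, including broken ones
                total += os.path.getsize(full)
    return total


def human(num_bytes):
    """Format a byte count with a 1024-based unit suffix."""
    size = float(num_bytes)
    for unit in ("B", "K", "M", "G", "T"):
        if size < 1024 or unit == "T":
            return f"{size:.1f}{unit}"
        size /= 1024


# Build a throwaway directory tree so the example runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "a.log"), "wb") as fh:
        fh.write(b"x" * 1000)
    os.mkdir(os.path.join(tmp, "sub"))
    with open(os.path.join(tmp, "sub", "b.log"), "wb") as fh:
        fh.write(b"x" * 2048)
    demo_total = directory_size(tmp)

print(human(demo_total))
```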
Depending on your Linux distribution, you may need to install some of these tools with your package manager.
Reading the above should help you get started on your server monitoring journey. Keep an eye out for the next post in this series, Monitoring Server Security, where I’ll be covering some security considerations to be aware of when monitoring servers.