Thoughts from 20 years ago — Help! I’ve lost my memory! Sunworld Online
[Just for fun, posting this for the third time]
Originally published in Unix Insider 10/1/95
Stripped of adverts, URL references fixed and comments added to bring it up to date ten years later in 2006.
Dear Adrian, After a reboot I saw that most of my computer’s memory was
free, but when I launched my application it used up almost all the
memory. When I stopped the application the memory didn’t come back!
Take a look at my vmstat output:
% vmstat 5
 procs    memory             page             disk         faults        cpu
 r b w   swap  free  re  mf pi po fr de sr s0 s1 s2 s3   in   sy   cs us sy  id
This is before the program starts:
 0 0 0 330252 80708   0   2  0  0  0  0  0  0  0  0  1   18  107  113  0  1  99
 0 0 0 330252 80708   0   0  0  0  0  0  0  0  0  0  0   14   87   78  0  0  99
I start the program and it runs like this for a while:
 0 0 0 314204  8824   0   0  0  0  0  0  0  0  0  0  0  414  132   79 24  1  74
 0 0 0 314204  8824   0   0  0  0  0  0  0  0  0  0  0  411   99   66 25  1  74
I stop it, then almost all the swap space comes back, but the free memory does not:
 0 0 0 326776 21260   0   3  0  0  0  0  0  0  1  0  0  420  116   82  4  2  95
 0 0 0 329924 24396   0   0  0  0  0  0  0  0  0  0  0  414   82   77  0  0 100
 0 0 0 329924 24396   0   0  0  0  0  0  0  0  2  0  1  430   90   84  0  1  99
I checked that there were no application processes running. It looks like a huge memory leak in the operating system. How can I get my memory back?
— RAMless in Ripon
This remains one of the most frequently asked questions of all time. The original answer is still true for many Unix variants. However, while
writing his book on Solaris Internals, Richard McDougall worked out
how to fix Solaris to make it work better, and to make this apparent
problem go away. The result was one of the most significant
performance improvements in Solaris 8, but the first edition of his
book was written before Solaris 8 came out, so doesn’t describe the
The short answer
Launch your application again. Notice that it starts up more quickly than it did the first time, and with less disk activity. The application code and its data files are still in memory, even though they are not active. The memory they occupy is not “free.” If you restart the same application it finds
the pages that are already in memory. The pages are attached to the inode cache entries for the files. If you start a different application, and there is insufficient free memory, the kernel will scan for pages that have not been touched for a long time, and “free” them. Once you quit the first application, the memory it occupies is not being touched, so it will be freed quickly for use by other applications.
In 1988, Sun introduced this feature in SunOS 4.0. It still applies to all versions of Solaris 1 and 2. The kernel is trying to avoid disk reads by caching as many files as possible in memory. Attaching to a page in memory is around 1,000 times faster than reading it in from disk. The kernel figures that you paid good money for all of that RAM, so it will try to make good use of it by retaining files you might need.
Since Solaris 8, the memory in the file cache is actually also on the free list, so vmstat does show free memory increasing again when you quit a program. You also should expect large amounts of file I/O to cause high scan rates on older Solaris releases, and for there to be no scanning at all on Solaris 8 systems. If Solaris 8 scans at all, then it has truly run out of memory and is overloaded.
By contrast, memory leaks appear as a shortage of swap space after the misbehaving program runs for a while. You will probably find a process that has a larger than expected size. You should restart the program to free up the swap space, and check it with a debugger that offers a leak-finding feature (run it with the libumem version of the malloc library that instruments memory leaks).
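The distinction between cached file pages and memory that is still being held can be read straight off the swap and free columns of vmstat. Here is a minimal sketch using the figures from the letter. The function name and the 5 percent tolerance are my own assumptions for illustration; nothing like this ships with Solaris.

```python
def diagnose(before, after, tolerance=0.05):
    """Compare (swap_kb, free_kb) vmstat samples from before the program
    started and after it exited. The 5 percent tolerance is arbitrary."""
    swap_before, free_before = before
    swap_after, free_after = after
    swap_lost = (swap_before - swap_after) / swap_before
    free_lost = (free_before - free_after) / free_before
    if swap_lost > tolerance:
        return "swap still in use: look for a process that is holding memory"
    if free_lost > tolerance:
        return "swap returned, free still low: the pages are cached file data"
    return "swap and free both returned"


# Figures from the vmstat output in the letter, in kilobytes.
print(diagnose(before=(330252, 80708), after=(329924, 24396)))
```

For the letter's numbers the swap space is back to within a fraction of a percent while free memory is not, which is the file cache signature rather than a leak.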
The long (and technical) answer
To understand how Sun’s operating systems handle memory, I will explain how the inode cache works, how the buffer cache fits into the picture, and how the life cycle of a typical page evolves as the system uses it for several different purposes.
The inode cache and file data caching
Whenever you access a file, the kernel needs to know the size, the access permissions, the date stamps and the locations of the data blocks on disk. Traditionally, this information is known as the inode for the file. There are many filesystem types. For simplicity I will assume we are only interested in the Unix filesystem (UFS) on a local disk. Each filesystem type has its own inode cache.
The filesystem stores inodes on the disk; the inode must be read into memory whenever an operation is performed on an entity in the filesystem. The number of inodes read per second is reported as iget/s by the sar
-a command. The inode read from disk is cached in case it is needed again, and the number of inodes that the system will cache is influenced by a kernel parameter called ufs_ninode. The kernel keeps inodes on a linked list, rather than in a fixed-size table.
As I mention each command I will show you what the output looks like. In my case I’m collecting sar data automatically using cron, and sar defaults to reading the stored data for today. If you have no stored data, specify a time interval and sar will show you current activity.
% sar -a
SunOS hostname 5.4 Generic_101945-32 sun4c 09/18/95

00:00:01  iget/s namei/s dirbk/s
01:00:01       4       6       0
All reads or writes to UFS files occur by paging from the filesystem. All pages that are part of the file and are in memory will be attached to the inode cache entry for that file. When a file is not in use, its data is cached in memory, using an inactive inode cache entry. When the kernel reuses an inactive inode cache entry that has pages attached, it puts the pages on the free list; this case is shown by sar -g as %ufs_ipf. This number is the percentage of UFS inodes that were overwritten in the inode cache by iget and that had reusable pages associated with them. The kernel flushes the pages, and updates on disk any modified pages. Thus, this %ufs_ipf number is the percentage of igets with page flushes. Any non-zero values of %ufs_ipf reported by sar -g indicate that the inode cache is too small for the current workload.
% sar -g
SunOS hostname 5.4 Generic_101945-32 sun4c 09/18/95

00:00:01  pgout/s ppgout/s pgfree/s pgscan/s %ufs_ipf
01:00:01     0.02     0.02     0.08     0.12     0.00
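To make the definition of %ufs_ipf concrete, here is the arithmetic as a small function. The function and argument names are hypothetical, and the two counts are taken as given; this is only the percentage described above, not how sar itself gathers the data.

```python
def ufs_ipf(igets_with_page_flush, igets):
    """Percentage of igets that reused an inode which still had pages attached."""
    if igets == 0:
        return 0.0
    return 100.0 * igets_with_page_flush / igets


# Illustrative counts only: 3 of 200 igets had to flush attached pages.
print(ufs_ipf(3, 200))
```

Any result above zero means cached file pages were thrown away to recycle an inode, which is the sign that the inode cache is too small.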
For SunOS 4 and releases up to Solaris 2.3, the number of inodes that the kernel will keep in the inode cache is set by the kernel variable ufs_ninode. To simplify: when a file is opened, an inactive inode will be reused from the cache if the cache is full; when an inode becomes inactive, it is discarded if the cache is over-full. If the cache limit has not been reached then an inactive inode is placed at the back of the reuse list and invalid inodes (inodes for files that no longer exist) are placed at the front for immediate reuse. It is entirely possible for the number of open files in the system to cause the number of active inodes to exceed ufs_ninode; raising ufs_ninode allows more inactive inodes to be cached in case they are needed again.
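The simplified policy for these older releases can be sketched as a toy model. This is my own illustration, not kernel code: the class and method names are hypothetical, each file is opened at most once, and reference counts are ignored.

```python
from collections import deque


class InodeCache:
    """Toy model of the pre-2.4 policy; one open per file, no reference counts."""

    def __init__(self, ufs_ninode):
        self.limit = ufs_ninode
        self.active = set()
        self.reuse = deque()  # inactive inodes; the front is reused first

    def total(self):
        return len(self.active) + len(self.reuse)

    def open_file(self, name):
        if name in self.reuse:
            self.reuse.remove(name)      # still cached: reactivate it
        elif self.total() >= self.limit and self.reuse:
            self.reuse.popleft()         # cache full: overwrite an inactive inode
        self.active.add(name)            # active inodes may exceed the limit

    def close_file(self, name, deleted=False):
        self.active.discard(name)
        if self.total() >= self.limit:
            return                       # cache over-full: discard the inode
        if deleted:
            self.reuse.appendleft(name)  # invalid inode: first in line for reuse
        else:
            self.reuse.append(name)      # inactive inode: keep at the back


cache = InodeCache(ufs_ninode=3)
for name in "abcd":
    cache.open_file(name)        # four active inodes against a limit of three
cache.close_file("d")            # discarded, because the cache is over-full
cache.close_file("c")
cache.close_file("b")
cache.open_file("e")             # overwrites the entry that held "c"
print(sorted(cache.active), list(cache.reuse))
```

With a limit of three, the fourth open still succeeds, which is the point about active inodes exceeding ufs_ninode. The cost of a small limit shows up later, when the cached entry for "c" is overwritten.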
Solaris 2.4 uses a more clever inode cache algorithm. The kernel maintains a
reuse list of blank inodes for instant use. The number of active inodes is no longer constrained, and the number of idle inodes (inactive but cached in case they are needed again) is kept between ufs_ninode and 75 percent of ufs_ninode by a new kernel thread that scavenges the inodes to free them and maintains entries on the reuse list. If you use sar -v to look at the inode cache, you may see a larger number of existing inodes than the reported “size.”
% sar -v
SunOS hostname 5.4 Generic_101945-32 sun4c 09/18/95

00:00:01  proc-sz  ov  inod-sz    ov  file-sz  ov  lock-sz
01:00:01   66/506   0  2108/2108   0  353/353   0      0/0
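One way to picture the scavenger thread in Solaris 2.4 is as a high-water and low-water mark on the number of idle inodes. That reading is an assumption on my part: the text above only says the idle count is kept between ufs_ninode and 75 percent of ufs_ninode. The function name is hypothetical.

```python
def scavenge(idle_inodes, ufs_ninode):
    """Return how many idle inodes one scavenger pass would free.

    Assumed behavior: do nothing until the idle count passes ufs_ninode,
    then trim it back to 75 percent of ufs_ninode.
    """
    high_water = ufs_ninode
    low_water = (ufs_ninode * 3) // 4
    if idle_inodes <= high_water:
        return 0
    return idle_inodes - low_water


# ufs_ninode of 2108 matches the inod-sz figure in the sar -v output above;
# the idle count of 2500 is made up for the example.
print(scavenge(2500, 2108))
print(scavenge(2000, 2108))
```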
The buffer cache is used to cache filesystem data in SVR3 and BSD Unix. In SunOS 4, generic SVR4, and Solaris 2, it is used to cache inode, indirect block, and cylinder group blocks only. Although this change was introduced in 1988, many people still incorrectly think the buffer cache is used to hold file data. Inodes are read from disk to the buffer cache in 8-kilobyte blocks, then the individual inodes are read from the buffer cache into the inode cache.
Life cycle of a typical physical memory page
This section provides additional insight into the way memory is used. The sequence described is an example of some common uses of pages; many other possibilities exist.
1. Initialization — A page is born
When the system boots, it forms all free memory into pages, and allocates a kernel data structure to hold the state of every page in the system.
2. Free — An untouched virgin page
All the memory is put onto the free list to start with. At this stage the content of the page is undefined.
3. ZFOD — Joining an uninitialized data segment
When a program accesses data that is preset to zero for the very first time, a minor page fault occurs and a Zero Fill On Demand (ZFOD) operation takes place. The page is taken from the free list, block-cleared to contain all zeroes, and added to the list of anonymous pages for the uninitialized data segment. The program then reads and writes data to the page.
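Zero fill on demand is easy to mimic in a few lines. This sketch is an illustration only: the class name is hypothetical, the 4096-byte page size is an assumption, and a Python dictionary stands in for the list of anonymous pages.

```python
PAGE_SIZE = 4096  # assumed page size for the example


class ZeroFillSegment:
    """Toy uninitialized data segment: pages appear, zeroed, on first touch."""

    def __init__(self):
        self.pages = {}        # page number -> contents
        self.minor_faults = 0

    def _page(self, addr):
        number = addr // PAGE_SIZE
        if number not in self.pages:
            self.minor_faults += 1                    # first touch: minor fault
            self.pages[number] = bytearray(PAGE_SIZE) # block-cleared to zeroes
        return self.pages[number]

    def read(self, addr):
        return self._page(addr)[addr % PAGE_SIZE]

    def write(self, addr, value):
        self._page(addr)[addr % PAGE_SIZE] = value


seg = ZeroFillSegment()
first = seg.read(100)      # first touch of page 0
seg.write(100, 42)         # same page, no further fault
second = seg.read(100)
seg.read(5000)             # first touch of page 1
print(first, second, seg.minor_faults)
```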
4. Scanned — The pagedaemon awakes
When the free list gets below a certain size, the pagedaemon starts to look for memory pages to steal from processes. It looks at all pages in physical memory order; when it gets to the page, the page is synchronized with the memory management unit (MMU) and a reference bit is cleared.
5. Waiting — Is the program really using this page right now?
There is a delay that varies depending upon how quickly the pagedaemon scans through memory. If the program references the page during this period, the MMU reference bit is set.
6. Pageout Time — Saving the contents
The pagedaemon returns and checks the MMU reference bit, finding that the program has not used the page, so it can be stolen for reuse. The pagedaemon checks to see if anything had been written to the page; since the program did write data to it in this case, a page-out occurs. The page is moved to the pageout queue and marked as I/O pending. The swapfs code clusters the page together with other pages on the queue and writes the cluster to the swap space. The page is then free and is put on the free list again. It remembers that it still contains the program data.
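Steps 4 to 6 amount to a two-pass check of each page. The sketch below is a simplified illustration with hypothetical names, not the kernel's scanner: one pass clears the reference bit, and a later pass steals the page only if the bit is still clear, writing it out first if it has been modified.

```python
from dataclasses import dataclass


@dataclass
class Page:
    referenced: bool = True
    modified: bool = False
    on_free_list: bool = False


def first_pass(page):
    """Step 4: clear the reference bit."""
    page.referenced = False


def second_pass(page, swap_writes):
    """Step 6: steal the page if it was not touched while we were away."""
    if page.referenced:
        return "kept"
    if page.modified:
        swap_writes.append(page)   # stands in for the clustered write to swap
        page.modified = False
    page.on_free_list = True       # contents are still intact on the free list
    return "freed"


busy, idle = Page(modified=True), Page(modified=True)
swap_writes = []
first_pass(busy)
first_pass(idle)
busy.referenced = True             # Step 5: the program touches one page
print(second_pass(busy, swap_writes), second_pass(idle, swap_writes))
```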
7. Reclaim — Give me back my page!
Belatedly, the program tries to read the page and takes a page fault. If the page had been reused by someone else in the meantime, a major fault would occur and the data would be read from the swap space into a new page taken from the free list. In this case, the page is still waiting to be reused, so a minor fault occurs, and the page is moved back from the free list to the program’s data segment.
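The difference between the two kinds of fault comes down to whether the free list still holds a page with the right identity. A rough sketch, with a hypothetical function name and made-up identity strings and frame numbers, and assuming the free list is not empty:

```python
from collections import OrderedDict


def page_fault(identity, free_list):
    """free_list maps the identity of a page's old contents to its frame number.

    Returns the kind of fault and the frame used. Assumes free_list is not empty.
    """
    if identity in free_list:
        return "minor", free_list.pop(identity)     # reclaim: contents intact
    _old_identity, frame = free_list.popitem(last=False)
    return "major", frame                           # reuse oldest frame, then read from disk


free_list = OrderedDict([("libX:0x4000", 7), ("prog:data:0x2000", 17)])
reclaim = page_fault("prog:data:0x2000", free_list)
reread = page_fault("prog:data:0x6000", free_list)
print(reclaim, reread)
```

The second fault finds nothing to reclaim, so it overwrites the oldest page on the free list, and that page's previous identity is lost.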
8. Program Exit — Free again
The program finishes running and exits. The data segments are private to that particular instance of the program (unlike the shared-code segments), so all the pages in the data segment are marked as undefined and put onto the free list. This is the same state as Step 2.
9. Page-in — A shared code segment
A page fault occurs in the code segment of a window system shared library. The page is taken off the free list, and a read from the filesystem is scheduled to get the code. The process that caused the page fault sleeps until the data arrives. The page is attached to the inode of the file, and the segments reference the inode.
10. Attach — A popular page
Another process using the same shared-library page faults in the same place. It discovers that the page is already in memory and attaches to the page, increasing its inode reference count by one.
11. COW — Making a private copy
If one of the processes sharing the page tries to write to it, a copy-on-write (COW) page fault occurs. Another page is grabbed from the free list, and a copy of the original is made. This new page becomes part of a privately mapped segment backed by anonymous storage (swap space) so it can be changed, but the original page is unchanged and can still be shared. Shared libraries contain jump tables in the code that are patched, using COW as part of the dynamic linking process.
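Copy-on-write can be illustrated with two mappings of the same buffer. This is only a sketch of the idea, with a hypothetical class name and a bytearray standing in for a physical page; the kernel does this per page through the MMU.

```python
class Mapping:
    """One process's view of a page that starts out shared."""

    def __init__(self, shared_page):
        self.page = shared_page
        self.private = False

    def write(self, offset, value):
        if not self.private:
            self.page = bytearray(self.page)   # COW fault: take a private copy
            self.private = True
        self.page[offset] = value


original = bytearray(b"code")
a = Mapping(original)
b = Mapping(original)
a.write(0, ord("C"))                           # only this mapping sees the change
print(bytes(a.page), bytes(b.page), b.page is original)
```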
12. File Cache — Not free
The entire window system exits, and both processes go away. This time the page stays in use, attached to the inode of the shared library file. The inode is now inactive but will stay in the inode cache until it is reused, and the pages act as a file cache in case the user is about to restart the window system again. The change made in Solaris 8 was that the file cache is the tail of the free list, and any file cache page can be reused immediately for something else without needing to be scanned first.
13. fsflush — Flushed by the sync
Every 30 seconds all the pages in the system are examined in physical page order to see which ones contain modified data and are attached to a vnode. The details differ between SunOS 4 and Solaris 2, but essentially any modified pages will be written back to the filesystem, and the pages will be marked as clean. This example sequence can continue from Step 4 or Step 9 with minor variations. The fsflush process occurs every 30 seconds by default for all pages, and whenever the free list size drops below a certain value, the
pagedaemon scanner wakes up and reclaims some pages. A recent change in Solaris 10, backported to Solaris 8 and 9 patches, makes fsflush run much more efficiently on machines with very large amounts of memory. However, if you see fsflush using an excessive amount of CPU time you should increase “autoup” in /etc/system from its default of 30s, and you will see fsflush usage reduce proportionately.
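As a back-of-the-envelope check on that last point, here is the proportional estimate as a function. The function name is hypothetical, and it simply assumes that fsflush CPU usage scales inversely with autoup, as described above. The /etc/system entry mentioned in the comment uses the usual set syntax.

```python
def estimated_fsflush_cpu(current_cpu_percent, new_autoup, current_autoup=30):
    """Estimate fsflush CPU usage after changing autoup (in seconds).

    Assumes usage is inversely proportional to autoup. The value would be
    changed with a line such as "set autoup=120" in /etc/system.
    """
    return current_cpu_percent * current_autoup / new_autoup


# Illustrative figures: fsflush at 8 percent of a CPU with the default autoup.
print(estimated_fsflush_cpu(8.0, new_autoup=120))
```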
Now you know
I have seen this missing-memory question asked about once a month since 1988! Perhaps the manual page for vmstat should include a better explanation of what the values are measuring. This answer is based on some passages from my book Sun Performance and Tuning. The book explains in detail how the memory algorithms work and how to tune them. However, the book doesn’t cover the changes made in Solaris 8.
Originally published at perfcap.blogspot.com.