Tuesday 21 October 2008

Tuning NFS

This is a brief post outlining some general tuning guidelines for NFS. Because NFS itself is a relatively simple protocol, you will usually see the biggest performance gains by tuning your network environment and storage systems.
Below are a few ideas which will hopefully improve your NFS performance.

One of the ways to improve performance in an NFS environment is to limit the amount of data the file system must retrieve from the server. This limits the amount of network traffic generated, and, in turn, improves file system performance. Metadata to be retrieved on the client and updated on the server includes:

  • access time
  • modification time
  • change time
  • ownership
  • permissions
  • size

Under most local file systems this data is cached in RAM and written back to
disk when the operating system finds the time to do so. The conditions under which NFS runs are far more constrained. An enormous amount of excess network traffic would be generated by writing back file system metadata when a client node changes it. On the other hand, by waiting to write back metadata, other client nodes do not see updates from each other, and applications that rely on this data can experience excess latencies and race conditions. Depending on which applications use NFS, the attribute cache can be tuned to provide optimum performance. The following mount options affect how the attribute cache retains attributes:

  • acregmin – The minimum amount of time (in seconds) the attributes of a regular file are retained in the attribute cache.
  • acregmax – The maximum amount of time (in seconds) the attributes of a regular file may remain in cache before the next access to the file must refresh them from the server.
  • acdirmin – Same as acregmin, but applied to directory inodes.
  • acdirmax – Same as acregmax, but applied to directory inodes.

There are also two settings: actimeo, which sets all four of the above numbers to the same value, and noac, which completely disables the attribute cache. By increasing these values, one can increase the amount of time attributes remain in cache, and improve performance.
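As a sketch, for a workload where files rarely change, you might lengthen the cache lifetimes on the mount command line (the server name, export path, and values below are illustrative, not recommendations):

# mount server:/data /data -o acregmin=10,acregmax=120,acdirmin=30,acdirmax=120

or, more simply, set all four at once:

# mount server:/data /data -o actimeo=120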

One of the most important client optimization settings is the NFS data transfer buffer size, specified by the mount command options rsize and wsize. For example:

# mount server:/data /data -o rsize=8192,wsize=8192

This setting reduces the overhead of client-server communication by allowing the client to send larger transactions to the NFS server. By default, most NFS clients set their read and write size to 8KB, allowing a single read or write NFS transaction to transfer up to 8KB of file data. Each transaction consists of an NFS read/write request followed by a set of packets: in the case of a write, the payload is carried in data packets; in the case of a read, it is carried in response packets. By increasing the read and write size, fewer transactions are required, which means less network traffic and better performance.
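To make these options persistent across reboots, the same settings can be placed in /etc/fstab (a sketch; the server name and mount point are illustrative):

server:/data    /data    nfs    rsize=8192,wsize=8192    0 0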

However, setting rsize and wsize to a figure above your MTU (usually 1500 bytes) will cause IP fragmentation when using NFS over UDP. IP fragmentation and reassembly require a significant amount of CPU resources at both ends of a network connection.
In addition, packet fragmentation exposes your network traffic to greater unreliability, since a complete RPC request must be retransmitted if any UDP packet fragment is dropped for any reason. An increase in RPC retransmissions, along with the possibility of increased timeouts, is the single worst impediment to performance for NFS over UDP.
Packets may be dropped for many reasons. If your network is complex, fragment routes may differ, and the fragments may not all arrive at the server for reassembly. NFS server capacity may also be an issue, since the kernel has a limit on how many fragments it can buffer before it starts throwing packets away. With kernels that support the /proc filesystem, you can monitor the files /proc/sys/net/ipv4/ipfrag_high_thresh and /proc/sys/net/ipv4/ipfrag_low_thresh. Once the number of unprocessed, fragmented packets reaches the number specified by ipfrag_high_thresh (in bytes), the kernel will simply start throwing away fragmented packets until the number of incomplete packets reaches the number specified by ipfrag_low_thresh.
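On such kernels you can inspect, and if necessary raise, these thresholds directly (the values below are purely illustrative; check your distribution's defaults before changing anything):

# cat /proc/sys/net/ipv4/ipfrag_high_thresh
# cat /proc/sys/net/ipv4/ipfrag_low_thresh
# echo 524288 > /proc/sys/net/ipv4/ipfrag_high_thresh
# echo 393216 > /proc/sys/net/ipv4/ipfrag_low_thresh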

Two mount command options, timeo and retrans, control the behavior of UDP requests when the client encounters timeouts due to dropped packets, network congestion, and so forth. The -o timeo option sets the length of time, in tenths of a second, that the client will wait before deciding it will not get a reply from the server and must retransmit the request. The default value is 7 tenths of a second. The -o retrans option sets the number of timeouts allowed before the client gives up and displays the "server not responding" message. The default value is 3 attempts. Once the client displays this message, it will continue to try to send the request, but only once before displaying the error message again if another timeout occurs. When the client reestablishes contact, it will fall back to using the correct retrans value and will display the "server OK" message.
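As a sketch, both options can be supplied at mount time alongside the buffer sizes (the values here are illustrative, not recommendations):

# mount server:/data /data -o rsize=8192,wsize=8192,timeo=14,retrans=5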

If you are already encountering excessive retransmissions (see the output of the nfsstat command), or want to increase the block transfer size without encountering timeouts and retransmissions, you may want to adjust these values. The specific adjustment will depend upon your environment, and in most cases, the current defaults are appropriate.
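On the client, a quick way to check for this is the RPC statistics, where the retrans counter shows how many requests had to be retransmitted (the exact output fields vary between implementations):

# nfsstat -rc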
Most startup scripts, Linux and otherwise, start 8 instances of nfsd. In the early days of NFS, Sun decided on this number as a rule of thumb, and everyone else copied it. There are no good measures of how many instances are optimal, but a more heavily trafficked server may require more. You should use at the very least one daemon per processor, but four to eight per processor may be a better rule of thumb.
If you want to see how heavily each nfsd thread is being used, you can look at the th line in /proc/net/rpc/nfsd. The last ten numbers on that line indicate the number of seconds that thread usage was at that percentage of the maximum allowable. If you have large numbers in the top three deciles, you may wish to increase the number of nfsd instances. This is done by passing the number of instances to nfsd on the command line when it is started, and is specified in the NFS startup script (/etc/rc.d/init.d/nfs on Red Hat) as RPCNFSDCOUNT. See the nfsd(8) man page for more information.
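A minimal sketch of checking thread usage and raising the count on a Red Hat style system (the value 16 is illustrative):

# grep th /proc/net/rpc/nfsd
# vi /etc/rc.d/init.d/nfs      (set RPCNFSDCOUNT=16)
# /etc/rc.d/init.d/nfs restart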

In general, server performance and server disk access speed will have an important effect on NFS performance. However, the points above should help you start tweaking the system to improve the performance of your NFS resources.

These are just a few suggestions for improving the general performance of your NFS systems. The list above is nowhere near exhaustive, but it points you in the right direction. Tuning the underlying technologies is generally the easiest way to see a gain in performance, since NFS itself is not that complex a file sharing protocol.
