Wednesday, 21 October 2009

Birmingham Hippodrome

We are fortunate to have so many great customers, and this month we have been lucky enough to release a case study on Birmingham Hippodrome, a customer introduced to us by one of our partners, eSpida.

We have been working with eSpida for around seven years now, on many projects, implementing systems as diverse as SAP on Linux, Microsoft Exchange servers and Linux file servers. They have a great attitude and we thoroughly enjoy working with them.

To read more about the work we have done with Birmingham Hippodrome and eSpida, the case study is available here. For more info on eSpida visit: www.espida.co.uk

Thursday, 18 June 2009

Ensuring High Availability on a budget

In trying to meet the conflicting demands of costs and maintaining uptime, do you have to look for more creative ways to spend your IT budget?

Many organisations are turning to Linux for a proven and reliable solution. As the SteelEye Solution Centre for the UK, Ireland and the Gulf region, Open Minds have been providing organisations with Linux clustering and disaster recovery for over 10 years, delivering enterprise-grade clustering for the Linux market. SteelEye LifeKeeper for Linux provides enterprise-grade high availability for the most demanding deployments, supporting configurations built on commodity servers and storage whilst removing the need for shared storage and enterprise edition licences.

If you would like to find out more about how we provide enterprise-grade Linux solutions to our customers, then please reply to this email with your telephone number and we will contact you to discuss your requirements in greater detail. The first five enquiries will receive a copy of “Free Software, Free Society”, the selected essays of Richard M. Stallman, and all enquiries will be sent a free Linux distribution CD courtesy of the Linux Emporium. For more information on Linux adoption, click here.

Friday, 29 May 2009

Celebrate with us...

Open Minds are pleased to announce that the SteelEye Protection Suite for Windows was recently awarded the 2009 Global Product Excellence – Disaster Recovery Award by the Info Security Products Guide Awards. The Guide is the world’s most comprehensive guide on Info Security with the awards recognising and honouring excellence in all areas of information security.

The SteelEye Protection Suite for Windows provides a highly functional and cost effective combination of data replication and application recovery for your file servers and IIS servers.
To see for yourself why the SteelEye Protection Suite for Windows was awarded the 2009 Global Product Excellence – Disaster Recovery Award, please send an email to evalrequest@openminds.co.uk to request a free 30-day product evaluation licence.

Furthermore, to celebrate SteelEye’s success, if you order the award winning SPS for Windows before 18th June 2009, we will give you LifeKeeper Protection Suite for Bespoke Applications, normally costing £350.00, free of charge. To take advantage of this offer, please quote 11990f when ordering your SPS for Windows.

Warm regards

The Open Minds Team

Thursday, 30 April 2009

What is the cost to your business if your servers fail? FREE Calculator & free workshop

When you look at your most critical servers, can you immediately identify what the impact on the business would be if these servers went down?

What would be the lost revenue?

What would be the lost productivity?

How about the damage to your reputation?

Or all of the above?

Visit our Cost of Downtime Calculator to build a business case for a more resilient IT infrastructure.

By completing the Downtime Calculator, you now have an indication of the cost to your company should your servers fail, but you also need to identify two key objectives for your critical servers:

1. Recovery Time Objective (RTO):
How long does it take your business to recover from a server failure?

2. Recovery Point Objective (RPO):
To what point in time will you recover?

For example, suppose your current setup takes 5 hours to recover from a server failure (RTO) but can only recover to the latest back-up, which occurred at midnight (RPO).
If the server fails at 12 noon on Wednesday, it will be 5pm before recovery completes (RTO), but the latest back-up occurred at midnight on Tuesday night (RPO). This would result in up to a 17-hour window of downtime and lost data, depending on your company’s operating hours.
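The arithmetic above can be sketched in a short shell script. The dates and the 5-hour RTO are just the figures from this example (placed on an arbitrary Wednesday), so substitute your own:

```shell
#!/bin/sh
# Worked RTO/RPO example: failure at noon Wednesday, last back-up at
# midnight the night before, and a 5-hour recovery time.
last_backup="2009-05-06 00:00"   # midnight Tuesday night (RPO point)
failure="2009-05-06 12:00"       # noon Wednesday
rto_hours=5                      # hours needed to recover the server

backup_s=$(date -d "$last_backup" +%s)    # GNU date
failure_s=$(date -d "$failure" +%s)
recovered_s=$((failure_s + rto_hours * 3600))

window_h=$(( (recovered_s - backup_s) / 3600 ))
echo "Downtime/data-loss window: ${window_h} hours"   # 17 hours here
```

Twelve hours of lost data plus five hours of recovery gives the 17-hour window used in the rest of this example.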

If you apply the 17 hour window of Downtime to your Calculator within System Restoration Cost, you will identify your IT department’s internal costs. This is only a fraction of the cost and lost productivity that the company will experience during server failure.

Your business case is complete. In most cases, the cost of one hour’s server downtime will pay for a high availability or disaster recovery solution: replication helps you achieve a virtually immediate recovery point, and failover helps you achieve a minimal recovery time, with the added benefit that system users will be unaware that there has been any server failure.

If you are interested in finding out how to plan for server failure and supporting your business case argument, then, please reply to this email or call 0121 313 3943 before 15th May and we will offer you a free place on our Disaster Recovery and High Availability Planning Workshop, worth £250.

For further information about the content of the Disaster Recovery and High Availability Planning Workshop, please refer to the document available here.

Wednesday, 8 April 2009

Do you test your systems recovery?

Do you remember the investment of time and money made when you decided to implement your recovery solution?

Are you confident that your recovery systems will work should the worst happen?

To maintain the resilience of your LifeKeeper solution, we have devised a 5-Point Checklist:


  • When did you last complete a system test?
LifeKeeper systems need to be tested on a regular basis. It is recommended that you perform a failover test every month; as a minimum, tests should be performed every three months. This ensures that when recovery on the back-up server is invoked, it will run smoothly.
  • How often do you monitor the status of the mirror?
If the mirroring stops, it means that your data is no longer replicated. Therefore, it is critical to regularly monitor the status of your mirror. You can never really do this too often, so make it a regular practice to look at your mirror status at least daily.
  • Have you set up email notification for your servers?
Being notified by email or text every time a failover or switchover happens is extremely easy to set up. It is an easy way to be notified every time you need to pay some attention to your LifeKeeper servers.
  • Do you check your event logs regularly?
For the general health and housekeeping of the server, check event logs daily. All events such as hardware failures, LifeKeeper information, quotas, application-specific actions and logons are logged in your event logs. Failover can sometimes be prevented by detecting errors early enough and taking corrective action.
  • Do you verify all configuration changes, e.g. passwords?
If you change configurations such as passwords, settings or files, you will also need to ensure that these changes do not affect the functionality of LifeKeeper. To do this, check your event logs. It is also advisable to test changes by performing a switchover; of course, it is best to do this outside normal hours or in periods of low usage.


If you need assistance assessing the resilience of your high availability solution, Open Minds are able to perform a Health Check, either remotely or on-site, to ensure that your LifeKeeper servers are running smoothly and are ready should the worst happen.


For more details about the Health Check please read the Open Minds Health Check document or contact us on sales@openminds.co.uk or 0121 313 3943.

Monday, 1 December 2008

Hyper V and Xen virtualisation

There has been a certain buzz about the place here at Open Minds these last couple of weeks, as we eagerly anticipate the release of several new products. The main event, coming at the end of November, is the release of DataKeeper. DataKeeper includes not just data replication for Windows servers (which we have always been able to do with LifeKeeper Data Replication for Windows); it also includes replication of Hyper-V. We are finding that there is already a groundswell of demand for DataKeeper.

As far as virtualisation products go, we also have a recovery solution for VirtualCenter, usually the single point of failure in a VMware solution. We were the first in the market to produce this. For a preview video of the solution see the following link: http://www.steeleye.com/downloads/videos/datakeeper-and-hyper-v-wsfc/

Tuesday, 21 October 2008

Tuning NFS

This is a brief blog post outlining some general tuning guidelines for NFS. As NFS itself is relatively simple, you will notice the biggest change in performance by tuning your network environment and storage systems.
Below are a couple of ideas which will hopefully improve your NFS performance.

One of the ways to improve performance in an NFS environment is to limit the amount of data the file system must retrieve from the server. This limits the amount of network traffic generated, and, in turn, improves file system performance. Metadata to be retrieved on the client and updated on the server includes:

  • access time
  • modification time
  • change time
  • ownership
  • permissions
  • size

Under most local file systems this data is cached in RAM and written back to
disk when the operating system finds the time to do so. The conditions under which NFS runs are far more constrained. An enormous amount of excess network traffic would be generated by writing back file system metadata when a client node changes it. On the other hand, by waiting to write back metadata, other client nodes do not see updates from each other, and applications that rely on this data can experience excess latencies and race conditions. Depending on which applications use NFS, the attribute cache can be tuned to provide optimum performance. The following mount options affect how the attribute cache retains attributes:

  • acregmin – The amount of time (in seconds) the attributes of a regular file must be retained in the attribute cache.
  • acregmax – The amount of time (in seconds) the attributes of a regular file may remain in cache before the next access to the file must refresh them from the server.
  • acdirmin – Same as acregmin, but applied to directory inodes.
  • acdirmax – Same as acregmax, but applied to directory inodes.

There are also two settings: actimeo, which sets all four of the above numbers to the same value, and noac, which completely disables the attribute cache. By increasing these values, one can increase the amount of time attributes remain in cache, and improve performance.
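As a sketch, these options are passed to mount like any other NFS option. The server name, export path and timings below are assumptions for illustration, not recommendations:

```shell
#!/bin/sh
# Build an attribute-cache option string. Longer times mean fewer
# refreshes from the server; actimeo=120 would set all four at once.
opts="acregmin=10,acregmax=120,acdirmin=30,acdirmax=120"
echo "would run: mount server:/data /data -o $opts"
# On a live client (as root) you would run the mount itself:
#   mount server:/data /data -o "$opts"
# Or, where strict cache coherence matters more than performance:
#   mount server:/data /data -o noac
```

The trade-off is coherence: the longer attributes stay cached, the longer one client can go without seeing another client’s changes.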

One of the most important client optimisation settings is the pair of NFS data transfer buffer sizes, specified by the mount command options rsize and wsize. E.g.:

# mount server:/data /data -o rsize=8192,wsize=8192

This setting reduces the overhead of client-server communication by allowing the client to send larger transactions to the NFS server. By default, most NFS clients set their read and write size to 8KB, allowing a read or write NFS transaction to transfer up to 8KB of file data. The transaction consists of an NFS read/write request and a set of packets: in the case of a write, the packets carry the data; in the case of a read, they are response packets. By increasing the read and write size, fewer read/write transactions are required, which means less network traffic and better performance.

However, setting rsize and wsize to a figure above your MTU (usually 1500 bytes) will cause IP fragmentation when using NFS over UDP. IP fragmentation and reassembly require a significant amount of CPU resource at both ends of a network connection.
In addition, fragmentation exposes your network traffic to greater unreliability, since a complete RPC request must be retransmitted if any of its UDP packet fragments is dropped. The resulting increase in RPC retransmissions, along with the possibility of increased timeouts, is the single worst impediment to performance for NFS over UDP.
Packets may be dropped for many reasons. If your network is complex, fragments may take different routes and may not all arrive at the server for reassembly. NFS server capacity may also be an issue, since the kernel has a limit on how many fragments it can buffer before it starts throwing away packets. With kernels that support the /proc filesystem, you can monitor the files /proc/sys/net/ipv4/ipfrag_high_thresh and /proc/sys/net/ipv4/ipfrag_low_thresh. Once the amount of unprocessed, fragmented packet data reaches the number of bytes specified by ipfrag_high_thresh, the kernel will simply start throwing away fragmented packets until it falls back to the number specified by ipfrag_low_thresh.
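On kernels with /proc support you can read (and, as root, raise) these thresholds directly. A sketch, assuming the standard sysctl paths; the 512KB figure is purely illustrative:

```shell
#!/bin/sh
# Read the current fragment-reassembly thresholds (values in bytes).
high=$(cat /proc/sys/net/ipv4/ipfrag_high_thresh)
low=$(cat /proc/sys/net/ipv4/ipfrag_low_thresh)
echo "ipfrag_high_thresh=${high}  ipfrag_low_thresh=${low}"
# To raise the high-water mark (as root), for example:
#   sysctl -w net.ipv4.ipfrag_high_thresh=524288
```

Raising the thresholds only treats the symptom; if fragments are being dropped regularly, it is usually better to keep rsize/wsize at or below the MTU, or to mount over TCP instead of UDP.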

Two mount command options, timeo and retrans, control the behaviour of UDP requests when the client encounters timeouts due to dropped packets, network congestion and so forth. The -o timeo option sets the length of time, in tenths of a second, that the client will wait before deciding it will not get a reply from the server and must send the request again; the default is 7 (0.7 seconds). The -o retrans option sets the number of timeouts allowed before the client gives up and displays the “Server not responding” message; the default is 3 attempts. Once the client displays this message, it will continue to try to send the request, but only once before displaying the error message if another timeout occurs. When the client re-establishes contact, it will fall back to using the correct retrans value, and will display the “Server OK” message.
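A sketch of a mount using these options; the values shown (1.4 seconds and 5 retries) are illustrative assumptions, not recommendations:

```shell
#!/bin/sh
# timeo is in tenths of a second (14 = 1.4s); retrans is a retry count.
opts="timeo=14,retrans=5"
echo "would run: mount server:/data /data -o $opts"
# On a live client (as root):
#   mount server:/data /data -o "$opts"
# Check whether retransmissions are actually a problem first:
#   nfsstat -rc    # look at the retrans figure on the client
```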

If you are already encountering excessive retransmissions (see the output of the nfsstat command), or want to increase the block transfer size without encountering timeouts and retransmissions, you may want to adjust these values. The specific adjustment will depend upon your environment, and in most cases, the current defaults are appropriate.
Most startup scripts, Linux and otherwise, start 8 instances of nfsd. In the early days of NFS, Sun decided on this number as a rule of thumb, and everyone else copied it. There are no good measures of how many instances are optimal, but a more heavily trafficked server may require more. You should use at the very least one daemon per processor, but four to eight per processor may be a better rule of thumb. If you want to see how heavily the nfsd threads are being used, you can look at /proc/net/rpc/nfsd. The last ten numbers on the “th” line in that file indicate the number of seconds that thread usage was at that percentage of the maximum allowable. If you have large numbers in the top three deciles, you may wish to increase the number of nfsd instances. The count is passed to nfsd as a command-line option when it starts, and is specified in the NFS startup script (/etc/rc.d/init.d/nfs on Red Hat) as RPCNFSDCOUNT. See the nfsd(8) man page for more information.
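A sketch of reading those statistics. The “th” line below is a hypothetical sample; on a live server you would read /proc/net/rpc/nfsd itself, as shown in the trailing comment:

```shell
#!/bin/sh
# Hypothetical sample of the "th" line from /proc/net/rpc/nfsd:
# field 2 is the thread count, field 3 counts times all threads were
# busy, and the last ten fields are seconds spent at each decile of
# maximum thread usage.
sample="th 8 0 10.2 5.1 0.9 0.3 0.1 0.0 0.0 0.0 0.0 4.7"
threads=$(echo "$sample" | awk '{print $2}')
busiest=$(echo "$sample" | awk '{print $NF}')
echo "nfsd threads: $threads; seconds spent 90-100% busy: $busiest"
# On a live server:
#   awk '/^th/ {print $2}' /proc/net/rpc/nfsd
```

In this sample the top decile is non-zero (4.7 seconds at 90-100% usage), which by the rule above would suggest raising RPCNFSDCOUNT.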

In general, server performance and server disk access speed will have an important effect on NFS performance. However, the above will help you start tweaking the system to improve the performance of your NFS resources.

These are just a few suggestions for improving the general performance of your NFS systems. The list above is nowhere near exhaustive, but it points you in the right direction. Tuning the underlying technologies is generally the easiest way to see gains in performance, as NFS itself is not that complex a file sharing protocol.