I am writing this because I had a server logging this error for a very long time before I pinned down the cause. This post covers a lot of what I found on the internet, along with the fix that worked for me.
Example Event:
Event Source: Application Popup
Event Category: None
Event ID: 333
Date: 3/23/2009
Time: 2:44:53 PM
User: N/A
Computer: SERVER1
Description:
An I/O operation initiated by the Registry failed unrecoverably. The Registry could not read in, or write out, or flush, one of the files that contain the system’s image of the Registry.
For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.
Data:
0000: 00 00 00 00 01 00 6c 00 ......l.
0008: 00 00 00 00 4d 01 00 c0 ....M..À
0010: 00 00 00 00 4d 01 00 c0 ....M..À
0018: 00 00 00 00 00 00 00 00 ........
0020: 00 00 00 00 00 00 00 00 ........
Symptoms: Every third of a second or so an event ID 333 error was logged in the application event log on the server. This would start after the server had been up for anywhere from a few hours to a few days, and would stop for a while after the server was rebooted. The error occurred so often that I was reaching 30,000 instances in 24 hours. About 36 hours after the events started, no one was able to log in to Active Directory, and getting the server back up required a manual hard reboot.
Things I found on the internet: This error is often caused by a lack of resources on the server. Either there is not enough addressable memory, your disk speed is too low, or something is running well below optimal. Everything I found on the internet pointed to checking these:
- /3GB, /PAE and /USERVA switches in boot.ini. It is worth evaluating whether you have these set up right. More often than not they are not needed at all. See this post on /3GB and /PAE and this article on /USERVA.
- Check performance. Use Process Explorer to see if something is creating an excessive number of threads or handles. Use it to monitor tasks and make sure no process is being greedy with CPU time or memory usage.
- Disable Hot Add Memory. This has been useful to me on terminal servers before. It did not fix my issue, but I found a significant number of posts where it did. Check out this article.
- Out of date firmware and drivers. This especially goes for RAID controllers and hard disks. These can be really hard to upgrade without breaking the data on your disks. If you are using a RAID controller with 6 year old firmware, chances are that it is not performing optimally on your 2 TB RAID 5 configuration. I would upgrade in this order: 1. RAID controller driver 2. Motherboard BIOS 3. Motherboard chipset drivers 4. RAID controller firmware 5. Hard disk firmware. Before changing firmware on your disks or controller, be very careful to check with the vendors that the update is not data destructive.
- Update SQL. If you are running SQL, make sure you have all the updates, and that your memory usage in SQL is not set higher than 1/2 of your total available memory unless it is a dedicated SQL server.
- Disk Quotas. Turn them off, I really don't think anyone uses them anyway. If you do use them, make sure you don't have any service running under an account that is subject to disk quotas. For example, say you have a 500 MB disk quota for all users and your print spooler is running under a service account. If a few people print big things, it's not going to work, and you will get 333 errors among other errors.
- Page File. Your page file should be 1.5 times your total system RAM. On a server with 1 GB of RAM you would set this to 1.5 GB. Do not use a system managed page file.
- Ntbackup.exe. This can cause strange system hangs if you are having an issue with VSS, or are running SQL, Exchange, or any other application that is very disk write intensive. I found several posts that pointed to nightly system state backups as the culprit. You can disable this temporarily for troubleshooting.
- Old Antivirus. Make sure you have the latest scan engine on your antivirus. There were some examples of people who updated their antivirus and the issue was resolved like magic.
- Virus/spyware. This goes without saying in most cases, but make sure there is nothing sketchy running on your server.
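A few of the checks in the list come down to simple parsing and arithmetic, so here is a minimal Python sketch of them. It assumes you have copied the contents of boot.ini into a string. The function names and the sample boot.ini line are made up for illustration, and the numbers are only the rules of thumb from the list (1.5 times RAM for the page file, half of RAM for SQL unless the server is dedicated to it).

```python
import re

# Matches the three memory switches discussed above, e.g. /3GB, /PAE, /USERVA=3030
SWITCH_RE = re.compile(r"/(3GB|PAE|USERVA(?:=\d+)?)\b", re.IGNORECASE)

def boot_ini_switches(boot_ini_text):
    """Map each OS entry line in boot.ini to the memory switches found on it."""
    found = {}
    for line in boot_ini_text.splitlines():
        # OS entries look like: <ARC path>="Description" /switch /switch
        if "=" in line and '"' in line:
            switches = ["/" + s.upper() for s in SWITCH_RE.findall(line)]
            if switches:
                found[line.strip()] = switches
    return found

def pagefile_mb(ram_mb, factor=1.5):
    """Rule-of-thumb fixed page file size: 1.5 times installed RAM."""
    return int(ram_mb * factor)

def sql_max_memory_mb(ram_mb, dedicated=False):
    """Rule-of-thumb SQL memory cap: half of RAM unless the box is dedicated to SQL.

    Returns None for a dedicated server, meaning this rule imposes no cap.
    """
    return None if dedicated else ram_mb // 2

if __name__ == "__main__":
    # Hypothetical boot.ini entry, for illustration only
    sample = ('multi(0)disk(0)rdisk(0)partition(1)\\WINDOWS='
              '"Windows Server 2003" /fastdetect /3GB /userva=3030')
    print(list(boot_ini_switches(sample).values()))  # [['/3GB', '/USERVA=3030']]
    print(pagefile_mb(1024))        # 1536
    print(sql_max_memory_mb(4096))  # 2048
```

This only tells you which switches are present and what the rule-of-thumb sizes would be. Whether the switches are actually appropriate for your server still has to be judged by hand.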
My solution: For me the solution was the page file and disk quotas. The page file was system managed and had been moved to a disk off the primary partition. That disk had disk quota limits on it. When things got intense and the server wanted more disk space for the page file, quota management would slap its hand and make it put the space back. This was decreasing my performance and causing a ton of errors. I was effectively able to use only 1.5 GB of my page file on a server that was running Exchange and SQL. I turned off disk quotas and changed the page file to a set size, and since doing this I have had no issues at all.
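To make the combination that bit me easier to spot, here is a rough Python sketch that flags it. This is not a tool that reads real server settings: the PageFile class, the pagefile_warnings function and the RAM figure in the example are all hypothetical, and you would have to fill in the values by hand from the System control panel and the quota tab on each volume.

```python
from dataclasses import dataclass

@dataclass
class PageFile:
    """Hand-entered page file settings (hypothetical structure for this sketch)."""
    drive: str             # volume holding the page file, e.g. "D:"
    system_managed: bool   # True if Windows is allowed to size it
    initial_mb: int = 0
    maximum_mb: int = 0

def pagefile_warnings(pagefile, quota_drives, ram_mb):
    """Return a list of warnings for the setup described in this post."""
    warnings = []
    target_mb = int(ram_mb * 1.5)  # same 1.5 x RAM rule of thumb as above
    if pagefile.system_managed:
        warnings.append("page file is system managed; set a fixed size instead")
    else:
        if pagefile.initial_mb != pagefile.maximum_mb:
            warnings.append("initial and maximum size differ; the file can still grow")
        if pagefile.maximum_mb < target_mb:
            warnings.append(f"page file is smaller than 1.5 x RAM ({target_mb} MB)")
    # Compare drive letters case-insensitively
    if pagefile.drive.upper() in {d.upper() for d in quota_drives}:
        warnings.append(f"disk quotas are enabled on {pagefile.drive}")
    return warnings

if __name__ == "__main__":
    # Roughly my original setup; the 2048 MB RAM figure is made up for the example
    for w in pagefile_warnings(PageFile("D:", True), {"D:"}, 2048):
        print("WARNING:", w)
    # After the fix: fixed size, no quotas on the volume
    print(pagefile_warnings(PageFile("D:", False, 3072, 3072), set(), 2048))  # []
```

The first call prints two warnings (system managed, quotas on the page file volume), which is the pairing that caused my 333 errors. The second call returns an empty list.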