
Crashing XenServer 7 with Nvidia M60 (and Dell R730)

Over the last few weeks we have been facing a strange problem with the Dell R730 / Dell R7910, the NVIDIA M60 and Citrix XenServer 7.0. It is not solved yet. I will update this blog post when there are new findings (or solutions). Before I come to the problem, I would like to give you some background information about the whole project:
Our company is getting a new CAD program. For the training we decided to use virtual desktops with graphics acceleration instead of classical workstations. Around 40 users are trained at the same time. We discussed the topic with Dell and in the end bought (on Dell’s recommendation) the following hardware (twice):

Dell 7910 Rack Workstations
512GB RAM
2X Intel Xeon E5-2687W v4 (3.00GHz)
1X NVIDIA M60

We did some initial testing and everything worked without any problems. With ~10 key users we “simulated” a training to check how many users we could get on one machine and whether we would run into any problems. Luckily a small NVIDIA vGPU profile (M60-0B, yes B) was fast enough. So we can handle up to 64 users with the two workstations, which is more than enough for our trainings.
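The 64-user figure can be checked with a little arithmetic. The sketch below is only an illustration: the board figures (two physical GPUs per M60, 8 GB of frame buffer per GPU, 512 MB for the M60-0B profile) are assumptions taken from NVIDIA's public documentation, not from our own measurements, and the function name is made up.

```python
# Rough vGPU capacity estimate for Tesla M60 boards.
# All hardware figures are assumptions based on NVIDIA's public documentation.
GPUS_PER_M60 = 2                  # physical GPUs on one M60 board
FRAMEBUFFER_PER_GPU_MB = 8192     # frame buffer per physical GPU
PROFILE_FRAMEBUFFER_MB = {        # frame buffer reserved per vGPU
    "M60-0B": 512,
    "M60-1B": 1024,
    "M60-2Q": 2048,
}


def max_vgpus(profile: str, boards: int = 1) -> int:
    """Return how many vGPUs of one profile fit on the given number of boards."""
    per_gpu = FRAMEBUFFER_PER_GPU_MB // PROFILE_FRAMEBUFFER_MB[profile]
    return per_gpu * GPUS_PER_M60 * boards


if __name__ == "__main__":
    # Two workstations with one M60 each, using the smallest profile.
    print(max_vgpus("M60-0B", boards=2))
```

Larger profiles reduce the density quickly, which is why finding the smallest profile that still satisfies the users matters so much for sizing.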

Due to some problems in the project the trainings had to be delayed. That was fine for us because we thought everything was working and we had nothing to do in this area. In the meantime we checked the current CAD hardware. This check showed that around 30 clients must be replaced before the new CAD software goes live because it will not work on these clients. Another 30 need to be replaced during the year. A lot of discussions and calculations started about whether we should now also provide productive CAD users with a virtual desktop instead of a classical physical one. Again we had discussions with Dell and ended up with the following configuration:

Dell R730
576GB RAM (512 for VMs – 64 for Hypervisor)
2X Intel Xeon E5-2667 v4 (3.20GHz)
1X NVIDIA M60

The system was ordered three times (for redundancy). At this point it was not clear how many users we could get on one server. We planned to use these systems to do more tests and find the best NVIDIA vGPU profile that fulfills the user requirements.

After the hardware arrived it was installed directly. At the same time the trainings started, and so did our problems. After a few training days one of the R7910 suddenly froze: neither the VMs nor the hypervisor (XenServer 7) itself reacted any longer. We had to hard reboot the whole system. This continued to happen on both systems over the next days. There were no crash dumps on the XenServer and no events in the iDRAC server log. Thus we decided to activate one of the R730 to get a more stable training environment and to investigate the problem with less pressure. However, they also had a problem. Instead of freezing there was a hard hardware reset after some time. In the iDRAC server logs the following errors appeared:
01_dell_r730_nvidia_m60

Fri Dec 02 2016 10:43:16    CPU 2 machine check error detected.   
Fri Dec 02 2016 10:43:10    An OEM diagnostic event occurred.   
Fri Dec 02 2016 10:43:10    A fatal error was detected on a component at bus 0 device 2 function 0.   
Fri Dec 02 2016 10:43:10    A bus fatal error was detected on a component at slot 6.   

And shortly after that:
02_dell_r730_nvidia_m60

Fri Dec 02 2016 10:45:24    CPU 1 machine check error detected.   
Fri Dec 02 2016 10:45:24    Multi-bit memory errors detected on a memory device at location(s) DIMM_A3.   
Fri Dec 02 2016 10:45:24    Multi-bit memory errors detected on a memory device at location(s) DIMM_B1.   
Fri Dec 02 2016 10:45:24    A problem was detected related to the previous server boot.

Furthermore the boot screen showed the following message:
03_dell_r730_nvidia_m60

Ok – Time for a Ticket at Dell. The support person told me that there are two known problems when the M60 is used:

  1. The M60 is connected with a wrong cable and does not get enough power
  2. The power cable of the M60 is in the Airflow of the card

We checked both. The results were quite promising:
R7910:
A wrong cable was used – so the M60 can’t get enough power. The wrong one looks like this:
04_dell_r730_nvidia_m60

The correct one has two plugs instead of one. It’s a CPU 8-pin auxiliary power cable (see Nvidia M60 Product Brief, page 11):
05_dell_r730_nvidia_m60
The side with one plug is connected to the M60.

Of course this cannot be directly connected to the power-plug on the riser card. Thus you also need the following cable: 
06_dell_r730_nvidia_m60
As you can see, this one also has two plugs on one side. Connect these two plugs to the two plugs of the correct cable and the white one to the riser card.

Another thing you need to make sure of is that the power cable is not in the airflow of the M60 cooling. It seems to happen often that the cable is routed behind the M60 instead of under it (as shown in the picture).
07_dell_r730_nvidia_m60

The rest of the cable fits next to the card.
08_dell_r730_nvidia_m60

Furthermore Dell advised us to change the Power Supply Options. You can find them on the iDRAC: Overview => Hardware => Power Supplies. On the right side there is now an Option Power Configuration.
09_dell_r730_nvidia_m60

They suggested changing the Redundancy Policy to Not Redundant and Hot Spare to Disable.
10_dell_r730_nvidia_m60

After correcting this it first looked like the problem was fixed. Unfortunately it was not. After a few days in production one of the 7910 froze again. Later I discussed this topic again with another Dell engineer. After that we changed the setting back to the following:
Redundancy Policy:
11_dell_r730_nvidia_m60
You can find a detailed description of the settings in the iDRAC User Guide on page 161.

At this point Dell thought they had another customer with the same configuration (Dell R730 + Nvidia M60 + XenServer 7). Therefore they started to check for hardware differences between both systems. They found the following differences and replaced our parts with the ones used by the other customer:

Power Supply:
We had one from Lite-On Technology.
12_dell_r730_nvidia_m60

The other customer had one from Delta Electronics INC.
13_dell_r730_nvidia_m60

Furthermore their motherboard had a different revision number. Ours was swapped for one with the same revision as theirs (sorry, I didn’t take a picture of both). To be sure, even one M60 was replaced.

None of the replacements made any difference. Later we figured out that the other customer was using XenServer 6.5 and not 7. During all the tests I found that rebooting 32 VMs several times (often just one reboot was enough) led to the problem, so it was reproducible.

In the meantime the case was escalated. Dell did many hardware replacements (without any complications) but nothing changed. Interestingly, even now it didn’t look like Dell Engineering was involved. We created a case at NVIDIA and contacted system engineers from NVIDIA and Citrix. In parallel I created a post on the NVIDIA Grid Forums, where especially BJones gave some helpful feedback.

After discussing the problem we did a few more tests:

Testing different vGPU Drivers:

Host      VM
361.45    362.56
367.43    369.17
367.64    369.71

The problem was always the same, no matter which driver combination we used.

Test                                   Result
Switch to Passthrough GPU              Problem solved
Remove Driver from Guest               Problem solved
Use different vGPU                     Problem exists
Reduce Memory to 512GB (from 576GB)    Host freezes instead of crash
Reduce Memory to 128GB                 Problem solved
Replace Xeon v4 CPU with v3            Problem solved
Use XenServer 6.5                      Problem solved

There were more tests, but to be honest I don’t remember every detail.

One of the things we also changed was to add the iommu=dom0-passthrough parameter to the Xen boot line. Although we didn’t have the problem that the driver inside the VM did not start, the behavior changed a little. It first looked like the problem was fixed, but in the end it only took more reboots to crash the system. After that a crash dump appeared on the XenServer. This had never happened before. Unfortunately the crash dump didn’t contain any useful information.

At the moment we are working with Citrix Engineering and Nvidia (although the focus is mainly on Citrix because we don’t think the Nvidia driver is involved in the problem). Another thought was that a microcode patch from Intel could help to solve the problem. The current microcode version is 0x1F, but the current Dell BIOS only integrates 0x1E. Fujitsu updated its BIOS with 0x1F at the end of December. We installed the update manually, but it didn’t change anything. (If I have some time later I will publish a separate blog post about how to do that…).

Our current workaround is to use v3 CPUs. The performance is lower – but it was the option where we didn’t have to change anything in our environment (no reinstallation or reconfiguration) except the CPU.

Since yesterday we are running another test with v4 CPUs. We disabled a new feature from the CPU that was adopted in Xen 4.6 (=> XenServer 7). Until now we have no crashes *fingers crossed*. User tests will follow tomorrow.

One last (personal) thing at the end:
I really appreciate that Dell did many hardware replacements really fast and without any complications. Nevertheless it never felt like Dell Engineering was involved in solving the problem. Citrix and Nvidia engineers asked us a lot of questions, but we didn’t receive any from Dell Engineering.
The point I am more worried about is how Dell “continues” to support an environment with XenServer. During the discussions we often got statements like “We don’t support that because it’s not on the Citrix HCL”. Interestingly, on the XenServer HCL you can find the Dell R720 (a quite old system) in conjunction with the Nvidia M60. On the other hand the R730 is listed with the M10. Furthermore Dell sent an R730 to Citrix for testing and adding it to the HCL. The “normal” way I know is that the hardware vendor does all the tests for the HCL and just sends Citrix the results. We have asked our Dell representatives for an official statement. Hopefully we will receive it soon and it is positive about XenServer support from Dell (and they don’t try to replace XenServer with VMware…).

Nvidia vGPU Driver fails to load with error Code 43 on XenServer 7 Hosts with more than 512GB-Ram

When you try to start a VM with an attached vGPU and the XenServer itself has more than 512GB-Ram it might happen that the driver does not start. You just get an error message with Code 43:
01_nvidia_error_43
(Sorry for poor quality.)

In some situations it’s not even possible to boot the VM at all:

An emulator required to run this VM failed to start

02_nvidia_error_43

If you google for the problem you find the following Knowledge-Base-Article from Nvidia.

Unfortunately the information in this article only partly worked for us. The first command should automatically add the required parameter to the grub.cfg:

Command line: (/opt/xensource/libexec/xen-cmdline --set-dom0 iommu=dom0-passthrough)

Alternatively there is a second method described how to fix it manually:

Or by editing the bootloader (/etc/grub.conf) grub.conf to contain:iommu=Dom0-passthrough

To check the grub.cfg you have to open a console session and browse to one of these folders, depending on the server’s boot configuration (BIOS or UEFI boot):

BIOS:
/boot/grub
03_nvidia_error_43

UEFI:
/boot/efi/EFI/xenserver
04_nvidia_error_43

Open grub.cfg with vi:
vi grub.cfg
05_nvidia_error_43

Originally the file looks like this:
06_nvidia_error_43

Now run the following command:
07_nvidia_error_43

/opt/xensource/libexec/xen-cmdline --set-dom0 iommu=dom0-passthrough

After that it looks like this:
08_nvidia_error_43
The iommu parameter is added to module2.

Although it looked like the parameter was added correctly, the error still occurred. I discussed it with Ronald Graß (Citrix Engineer). He thought that the parameter should be added to the multiboot2 line (the “xen-line”) and not to module2. So I moved it up, and the problem was fixed. Later I talked with a XenServer engineer about it and he confirmed that the multiboot2 line is the correct one. After the change the grub.cfg looked like this:
09_nvidia_error_43

Later I had a second system with the same problem. So I opened the grub.cfg and added iommu=Dom0-passthrough (the second method from the article). This time it did not work. We did many tests and even reinstalled the whole server, but the error still occurred. When I compared the working system with the other one, there was only a really tiny difference between the grub.cfg files:
Working:

iommu=dom0-passthrough

Not working:

iommu=Dom0-passthrough

As you can see, the only difference was that the parameter in the non-working file was written with a capital D. So I replaced it with a lowercase d and rebooted the system. Booom, problem fixed. So the parameter is case-sensitive: write everything in lower case.
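Both pitfalls (wrong line and wrong case) are easy to overlook when you stare at a grub.cfg in vi, so a small check can help. The following is only a sketch under assumptions: it is written in Python for illustration, the function name is made up, and it assumes the Xen line starts with multiboot2 and the dom0 lines start with module2, as in the files shown above. It only reads text and never changes the file.

```python
from typing import List

EXPECTED = "iommu=dom0-passthrough"


def check_grub_cfg(text: str) -> List[str]:
    """Return a list of problems found for the iommu parameter in grub.cfg text."""
    problems = []
    found_on_xen_line = False
    for number, raw in enumerate(text.splitlines(), start=1):
        tokens = raw.split()
        if not tokens:
            continue
        kind = tokens[0]  # e.g. "multiboot2" or "module2"
        for token in tokens[1:]:
            if token.lower() != EXPECTED:
                continue
            if token != EXPECTED:
                # The parameter is case-sensitive.
                problems.append(f"line {number}: '{token}' must be all lower case")
            if kind == "multiboot2" and token == EXPECTED:
                found_on_xen_line = True
            elif kind == "module2":
                problems.append(f"line {number}: parameter is on a module2 line")
    if not found_on_xen_line:
        problems.append(f"no multiboot2 line contains {EXPECTED}")
    return problems


if __name__ == "__main__":
    # Hypothetical, shortened example; a real grub.cfg has more parameters.
    sample = "multiboot2 /boot/xen.gz iommu=Dom0-passthrough\nmodule2 /boot/vmlinuz\n"
    for problem in check_grub_cfg(sample):
        print(problem)
```

An empty result means the parameter sits on the multiboot2 line in the exact lower-case spelling. The check does not replace a reboot test.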

Scheduled reboots not working on Citrix XenDesktop 7.12

Just a short one:
We had the problem that our scheduled reboots for a Delivery Group under XenDesktop 7.12 (not the new tag-based rebooting) did not work any longer. If you have the same problem, contact Citrix Support and ask for the private fix LC6766. It contains a new BrokerAgent.exe for the VDA which should fix the problem.

Citrix XenServer 7 shows VMs created with XenDesktop 7.12 MCS as I/O not optimized

After upgrading to XenDesktop 7.12 VMs show I/O not optimized and Install I/O drivers and Management Agent when they are created using MCS.
01_io_not_optimized

When checking the master VM everything was fine:
02_io_not_optimized

If you try to google for the problem, you only find hints to reboot the VM several times to fix it – not helpful on pooled (random) machines. Thus I restarted the Master-VM several times, created a new Snapshot and applied it to the Machine Catalog. This did not solve the problem. I opened a Ticket at Citrix. Luckily they already knew the problem and gave me a hotfix to solve it. The hotfix replaces the following files:

C:\Program Files\Common Files\Citrix\HCLPlugins\Hypervisor\v2.20.0.0\XenServer
HypervisorsCommon.dll
XenServerPlugin

To replace them you need to stop the following services:

Citrix Host Service
Citrix Machine Creation Service
Broker Service

Don’t forget to Backup the existing ones before adding the Hotfix ones.
03_io_not_optimized
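The backup step can be scripted so that it is not forgotten. This is just a sketch in Python with a made-up function name. The folder and file names are passed in by you and are not hard-coded, and stopping and starting the services is deliberately left out.

```python
import shutil
from pathlib import Path
from typing import List


def backup_files(plugin_dir: str, names: List[str], backup_dir: str) -> List[Path]:
    """Copy the named files from plugin_dir to backup_dir before they are replaced."""
    target = Path(backup_dir)
    target.mkdir(parents=True, exist_ok=True)
    copied = []
    for name in names:
        source = Path(plugin_dir) / name
        if not source.is_file():
            # Stop early: a missing file means the path or version folder is wrong.
            raise FileNotFoundError(f"file to back up is missing: {source}")
        # copy2 keeps the timestamps, which helps to tell old and new files apart.
        copied.append(Path(shutil.copy2(source, target / name)))
    return copied
```

Call it with the plugin folder shown above, the list of files from the hotfix and an empty backup folder of your choice. It aborts if one of the files is not where you expect it.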

After replacing the files start the stopped services. Now you need to create a new Machine Catalog – otherwise the fix will not work.

That’s it. To get the fix, open a ticket at Citrix and reference the private fix LC6769.

Automatic Shortcut generation for local installed applications in a Citrix XenDesktop / XenApp 7.x environment

I think you all know that Citrix is still not able to detect whether an application is installed locally when you publish a desktop under XenApp / XenDesktop 7.x (in XenApp 6.5 and earlier this worked really well…). Instead you have to create a shortcut template folder containing shortcuts for all local applications and a Receiver registry key. Furthermore you need to add KEYWORDS:prefer=SHORTCUTNAME to each application. Receiver then checks whether a shortcut with the specified name exists. If it does, this shortcut is copied to the Start menu instead of creating one that starts another ICA session. The downside is that this way it is not detected that a published application was started, but that’s another story.
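To make the naming convention a bit more concrete, here is a small sketch of how the shortcut name can be pulled out of an application description. It is written in Python purely for illustration and is not part of the PowerShell script. The function name is made up, and the handling of quoted names and of additional keywords in front of prefer= is an assumption.

```python
import re
from typing import Optional

# Matches KEYWORDS:... prefer=Name or prefer="Name with spaces".
_PREFER = re.compile(r'KEYWORDS:.*?\bprefer=(?:"([^"]+)"|(\S+))', re.IGNORECASE)


def preferred_shortcut(description: str) -> Optional[str]:
    """Return the shortcut name from a KEYWORDS:prefer= setting, or None."""
    match = _PREFER.search(description)
    if match is None:
        return None
    # Group 1 is the quoted form, group 2 the unquoted one.
    return match.group(1) or match.group(2)


if __name__ == "__main__":
    print(preferred_shortcut("KEYWORDS:prefer=Notepad"))
```

An application without the setting returns None and would simply get a normal published-application shortcut.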

As announced during DCUGTecCon in Kassel, I have created a script that creates the required shortcuts. The script checks for all applications with the KEYWORDS:prefer= setting, exports the application icon and creates the shortcuts in a configured folder. The following script from Andreas Nick is required. You can use Andreas’ script to generate any shortcut with PowerShell. You must save it in the same folder as my script and name it create-shortcut.ps1. Besides that, you have to exclude the following line from it:

[ValidateSet("AllUsersDesktop", "AllUsersStartMenu", "AllUsersPrograms", "AllUsersStartup", "Desktop", "Favorites", "SendTo", "StartMenu", "Startup")]

The script must be executed on a Delivery Controller. If you only want to create the shortcut for a specific application, you can use the parameter -ApplicationName "APPLICATIONNAME".

You can download the script here.

A little bit of background information about the script idea:
During the preparation for the DCUGTecCon I had some discussions with Andreas. He told me about his script to generate shortcuts with PowerShell. At the same time, I had to create some shortcuts for some newly published applications which are also locally available on a published desktop. Thus I had the idea to automate this with Andreas’ script. A few days later the following script was born. I have already done some optimizations.

Microsoft Office shows Network Printers multiple times when using Citrix Universal Print Server

We had some users reporting that all Office programs (Word, Excel, Outlook,…) show each network printer multiple times in a XenDesktop 7 environment. The printers were connected from a central print server using the Citrix Universal Print Server. All affected users were using the same VDA catalog.
office_printer_duplicated_01

When you checked the Devices and Printers area, the network printers were only shown once.
office_printer_duplicated_02

Furthermore other applications only showed the printers one time.
office_printer_duplicated_03

After searching for some time I found a (hidden) Citrix blog (!) post from 2014 which describes the issue for XenApp 6.5. OK, let’s check the mentioned registry key:

HKLM\SYSTEM\CurrentControlSet\Control\Print\Providers

office_printer_duplicated_04

And look, it was the same problem. In the Order value the Universal Printer was listed three times. And guess what: the network printers were also shown three times.
office_printer_duplicated_05

So I removed the duplicate entries and only kept the first Universal Printer. I also removed the empty lines at the end of the list.
office_printer_duplicated_06

After removing the entries and restarting the Print Spooler (with the corresponding Citrix Service) the duplicate printers in Word were gone.
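If you have to clean up the list on several machines, the logic is simple enough to sketch. The Python snippet below only shows the clean-up of the list itself and does not touch the registry. The function name is made up, the provider names in the example are placeholders, and unlike my manual fix it drops every empty line, not only those at the end.

```python
from typing import Iterable, List


def clean_provider_order(entries: Iterable[str]) -> List[str]:
    """Keep the first occurrence of each provider and drop empty entries."""
    seen = set()
    cleaned = []
    for entry in entries:
        name = entry.strip()
        if not name:
            continue  # empty line in the multi-string value
        key = name.casefold()  # compare names without regard to case
        if key in seen:
            continue  # duplicate of an earlier entry
        seen.add(key)
        cleaned.append(name)
    return cleaned


if __name__ == "__main__":
    # Placeholder names for illustration only.
    order = ["LanMan Print Services", "Universal Printer", "Universal Printer",
             "Universal Printer", "", ""]
    print(clean_provider_order(order))
```

The original order of the remaining providers is preserved.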

Citrix PVS Reverse Imaging with Windows Backup

During the last weeks I had different problems with the “known” reverse imaging techniques. One of them was that after reverse imaging a Windows 2012 R2 server the Server Manager didn’t start any longer.
pvs_reverse_imaging_windows_backup_01

Faulting application name: ServerManager.exe, version: 6.3.9600.17238, time stamp: 0x53d0b3e7
Faulting module name: ntdll.dll, version: 6.3.9600.18233, time stamp: 0x56bb4ebb
Exception code: 0xc0000374
Fault offset: 0x00000000000f1b70
Faulting process id: 0x1214
Faulting application start time: 0x01d1de6cee400844
Faulting application path: E:\Windows\system32\ServerManager.exe
Faulting module path: E:\Windows\SYSTEM32\ntdll.dll
Report Id: 2c2107c1-4a60-11e6-80f6-0050568817e7
Faulting package full name:
Faulting package-relative application ID:

Another was that the Server Manager wasn’t able to refresh the Roles and Features. So I wasn’t able to see (or add) any installed Roles / Features.
pvs_reverse_imaging_windows_backup_02

I tried different options to fix the problems – e.g. dism – none of them worked.

Sometimes even the complete reverse imaging failed with the following error.

Failed to load registry key {_BCD_} on \Boot\BCD. The system cannot find the path specified. (0x00000003)
Volume to Volume stopped
Volume to Volume lasted  834,7 seconds
Failed to convert Boot Configuration Data. The system cannot find the path specified. (0x00000003)

Also it didn’t make a difference if I used the P2PVS Tool….
pvs_reverse_imaging_windows_backup_03

…or the old BNImage.
pvs_reverse_imaging_windows_backup_04
The errors were different – but the reversed image was not usable.

So I started to look for a different method and found one: The integrated Windows Backup.

Before you can start, you need to extend (or add) a separate drive on your Master-VM so that a backup of the whole system drive fits on it. I just extended the Cache Disk. You can’t use the local system drive for the backup.
pvs_reverse_imaging_windows_backup_05

As you can see my Cache Disk has now 200GB – more than enough for a 100GB System drive.
pvs_reverse_imaging_windows_backup_06

The next step is to install Windows Server Backup Feature from the Server Manager.
pvs_reverse_imaging_windows_backup_07

Confirm the installation.
pvs_reverse_imaging_windows_backup_08

And Close the installation when it’s finished.
pvs_reverse_imaging_windows_backup_09

Now start the Windows Server Backup and select Local Backup on the left side. On the right side choose Backup Once.
pvs_reverse_imaging_windows_backup_10

You can now only select Different options.
pvs_reverse_imaging_windows_backup_11

In the next step it’s important to select Custom – otherwise you can’t exclude e.g. the Cache Drive.
pvs_reverse_imaging_windows_backup_12

Now you have to Add Items for the backup.
pvs_reverse_imaging_windows_backup_13

Select System state, System Reserved and Local disk (C:).
pvs_reverse_imaging_windows_backup_14

Your selection should now look like this:
pvs_reverse_imaging_windows_backup_15

As the Backup Destination pick the local drive (which you extended or added at the beginning).
pvs_reverse_imaging_windows_backup_17

A last check if everything is correct before you can start the Backup.
pvs_reverse_imaging_windows_backup_18

Windows is now creating the backup which will take some time (like reverse imaging…).
pvs_reverse_imaging_windows_backup_19

After the backup it is important to clean the local system disk (not the PVS one the system is booted from).
pvs_reverse_imaging_windows_backup_20

You can try to just format it…
pvs_reverse_imaging_windows_backup_21
… but I sometimes had problems with disks that were just formatted. So I suggest removing the current disk from the VM and attaching a new one with the same size.

The next step is to boot the VM from a Windows-Installation-Iso (with the same OS you plan to reverse image). Select your Language/Time/Keyboard and continue.
pvs_reverse_imaging_windows_backup_22

Now don’t choose Install now. Instead, click on Repair your computer.
pvs_reverse_imaging_windows_backup_23

Go to Troubleshoot…
pvs_reverse_imaging_windows_backup_24

… and select System Image Recovery.
pvs_reverse_imaging_windows_backup_25

The shown OS Version must match your installed OS.
pvs_reverse_imaging_windows_backup_26

The wizard now detects the backup on the local disc – only if you created multiple ones you need to Select a system image.
pvs_reverse_imaging_windows_backup_27

No changes needed here.
pvs_reverse_imaging_windows_backup_28

Start the Restore with Finish.
pvs_reverse_imaging_windows_backup_29

You need to confirm that the local disc will be formatted and the system is restored.
pvs_reverse_imaging_windows_backup_30

The restoring starts…
pvs_reverse_imaging_windows_backup_31

.. after a successful restore an automatic reboot follows. The VM will still boot from the PVS.
pvs_reverse_imaging_windows_backup_32

So open the properties of your Master-Target Device and select Boot from Hard Disk.
pvs_reverse_imaging_windows_backup_33

That’s it. After another reboot your VM can start from the local Hard-Drive.

Hope the article was helpful for some of you.
