Xen DomU Networking Stops working under load

koan

Hello,

I have a 2.6.21 xen DomU running pv under a 2.6.21 Dom0.

In general everything works well, no problems. However, if I load the network card, it (almost) stops sending or receiving packets.

Load is normally very low. I am running a couple of database servers, an SNMP server and a few other things in the domU, but the traffic is extremely light: one or two users at most.

If I start something like bittorrent, it will kill the networking almost immediately. I was running a Samba server on the domu for a while, but whenever I would save a file of significant size, it would kill the network.

If I xm console in, the network card appears fine, and there are a small number of bytes ticking up on transmit and receive. Nothing in messages or dmesg.
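To put numbers on "a small number of bytes ticking up", the counters in /proc/net/dev can be sampled twice and compared. This is only a minimal sketch: the function names are made up for illustration, and it assumes a Linux guest with the usual /proc/net/dev layout.

```python
def parse_net_dev(text):
    """Parse the contents of /proc/net/dev into {iface: (rx_bytes, tx_bytes)}."""
    counters = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # the two header lines contain no colon
        name, data = line.split(":", 1)
        fields = data.split()
        # Column 0 is bytes received, column 8 is bytes transmitted.
        counters[name.strip()] = (int(fields[0]), int(fields[8]))
    return counters


def byte_rates(before, after, seconds):
    """Return {iface: (rx_bytes_per_s, tx_bytes_per_s)} between two samples."""
    rates = {}
    for iface, (rx0, tx0) in before.items():
        if iface in after:
            rx1, tx1 = after[iface]
            rates[iface] = ((rx1 - rx0) / seconds, (tx1 - tx0) / seconds)
    return rates
```

Reading /proc/net/dev, sleeping a few seconds, reading it again and passing both parsed samples to byte_rates() gives a rough throughput figure for the stalled interface.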

The dom0 network is fine, and I am using a single physical nic to bridge to. I have three other vms (all hvm) running on the same Xen box and they don't suffer any network issues.

I am not sure how to proceed with diagnosing this...

Cheers,

Paul

koan · Posted: Sun Aug 24, 2008 1:38 pm Post subject:

The domU can ping itself incidentally, and restarting the domU interface doesn't help.

Sometimes it seems to right itself, sometimes it needs a reboot. tcpdump on the dom0 doesn't show anything.

Also, I use this domU for asterisk, and when the network is working, there are no problems with calls. So the issue seems to be more about taxing the virtual nic than packet frequency.

koan · Posted: Sun Aug 31, 2008 11:51 am Post subject:

Well, ok.

I have changed the kernel in the domU from 2.6.21 to 2.6.25 - same problem. I have also swapped the network card in the dom0, with no improvement.

The domU doesn't report anything wrong in messages or dmesg - it just cannot connect to anything. If I shut it down, it hangs, and I can no longer connect to it via the xm console.

If I xm destroy, it destroys. If I then try to xm create the domain again, I get

bbgermany · Posted: Mon Sep 01, 2008 2:39 pm Post subject:

How do you create the xenbr interface? I'm doing it this way:

/etc/conf.d/net

koan · Posted: Mon Sep 01, 2008 11:48 pm Post subject:

I am running the network-bridge script for the bridge create - with a slight mod as I have multiple addresses on my physical nic, and these were not getting set up correctly on the bridge.

Your script sets up the bridge normally, but then also does this:

bbgermany · Posted: Tue Sep 02, 2008 6:17 am Post subject:

IIRC, I used the Gentoo wiki entry to configure my Xen; this was in there. I use multiple addresses on the bridge as well. iproute2 did the trick for me.

bb
_________________
Desktop: Ryzen 5 5600G, 32GB, 2TB, RX7600
Notebook: Dell XPS 13 9370, 16GB, 1TB
Server #1: Ryzen 5 Pro 4650G, 64GB, 16.5TB
Server #2: Ryzen 4800H, 32GB, 22TB

koan · Posted: Tue Sep 02, 2008 2:20 pm Post subject:

The forward delay and STP settings are the defaults in current Gentoo Xen installs. The hello interval controls how often BPDUs are sent, so it is unlikely to have any bearing on my issue.

I'll give it a test at some point, because right now I have exhausted the leads available to me - at least, the ones I can think of. Well I do have another, and that is to build Xen with a stock kernel from another distribution, to see if it helps. But that represents a whole new set of difficulties, as I couldn't find a stock kernel that did everything I wanted, which is why I came back to gentoo...

maslo64 · n00b Joined: 04 Sep 2008 Posts: 14 Location: Slovakia

Hello Koan,
I have exactly the same issue with Xen. When starting the domU I noticed this message:

koan · Posted: Thu Sep 04, 2008 10:35 pm Post subject:

Hi,

I am not using bonded nics, or IPv6.

Xen 3.3 came into unstable a couple of days ago, so I upgraded, but the problem still remains.

Someone on the Xensource mailing list suggested lowering the NIC rate so that the domU never transfers at a speed that breaks networking, but last time I tried to test, the break happened at 3.6Mbs. That is pretty slow!
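A sketch of what such a cap could look like in the domU config file, which xm reads as Python. As far as I know the vif entry accepts a rate= key; the bridge name xenbr0 and the 2Mb/s figure are assumptions picked for illustration, not values taken from a working setup.

```python
# Fragment of an xm domU configuration file (these files use Python syntax).
# Assumptions: the bridge is named xenbr0, and 2Mb/s is only a trial value
# chosen to sit below the rate at which the link was seen to break.
vif = ['bridge=xenbr0, rate=2Mb/s']
```

The domU has to be shut down and created again for a changed vif line to take effect.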

So the changes I have made are:

1) Change domU kernel (2.6.21, 2.6.24, 2.6.25)
2) Change domU userland (gentoo, ubuntu)
3) Change dom0 physical nic + driver (Realtek 8169 -> 8168)
4) Change Xen version (3.2.1 -> 3.3)

The only thing I haven't changed is the dom0 kernel. I am using a stock 2.6.21 gentoo kernel, so it would be great if anyone watching this that has pv domUs working under a 2.6.21 kernel would post their .config so I can compare it to mine.

Paul

maslo64 · n00b Joined: 04 Sep 2008 Posts: 14 Location: Slovakia

Hmm, so I switched back to the eth0 -> xenbr0 configuration and disabled IPv6, and everything is OK now.
I am going to try different bonding modes. If the problem persists, I think I will have to try NAT.

bbgermany · Posted: Fri Sep 05, 2008 10:09 am Post subject:

What kind of bond do ya use? Maybe your switch doesn't support the mode, and so packets get lost in transit.

bb

maslo64 · n00b Joined: 04 Sep 2008 Posts: 14 Location: Slovakia

I was using mode=1, but when I tested unplugging and replugging the cables, connections were not restored.
Then I tried mode=0, which worked fine from the dom0, but then the issue with the domU appeared.

bbgermany · Posted: Fri Sep 05, 2008 10:57 am Post subject:

Did you try mode 5 (balance-tlb) or 6 (balance-alb) as well? Mode 0 is round-robin (balance-rr) and 1 is active-backup. If your switch supports the Link Aggregation Control Protocol (LACP), you should consider mode 4 (802.3ad).
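For quick reference, the mode numbers of the Linux bonding driver can be written down as a small lookup table. The helper below is only an illustrative sketch (its name is made up); it encodes the usual rule of thumb that modes 0, 2, 3 and 4 need the switch ports to be configured to match, while 1, 5 and 6 do not.

```python
# Mode numbers and names as used by the Linux bonding driver.
BONDING_MODES = {
    0: "balance-rr",     # round-robin
    1: "active-backup",  # one active slave, the others on standby
    2: "balance-xor",
    3: "broadcast",
    4: "802.3ad",        # LACP
    5: "balance-tlb",    # adaptive transmit load balancing
    6: "balance-alb",    # adaptive load balancing
}

# Modes that expect matching configuration on the switch side
# (static port grouping for 0, 2 and 3; LACP for 4).
SWITCH_DEPENDENT = {0, 2, 3, 4}


def needs_switch_support(mode):
    """Return True if the bonding mode needs cooperation from the switch."""
    if mode not in BONDING_MODES:
        raise ValueError("unknown bonding mode: %r" % (mode,))
    return mode in SWITCH_DEPENDENT
```

For example, needs_switch_support(0) is True while needs_switch_support(5) is False, which is one reason to test modes 5 and 6 when the switch configuration is unknown.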

bb

maslo64 · n00b Joined: 04 Sep 2008 Posts: 14 Location: Slovakia

I am going to test this this evening, because I don't have console access to the server and would otherwise have to try again and again and again.

maslo64 · n00b Joined: 04 Sep 2008 Posts: 14 Location: Slovakia

Still no luck with bonding. I tried 6 bonding modes, but still no progress. I am going to downgrade the kernels for dom0 and domU from 2.6.21 -> 2.6.18-r12 and check if it helps. I also set "sethello 2", as I found this is a known bug for Xen, and you were right about this.

koan · Posted: Sat Sep 06, 2008 2:17 pm Post subject:

Ok,

It looks like the bonding issue isn't related to the original issue reported in this thread, but that's OK, we can share.

Anyway, in an effort to eliminate the nic as the source of the problem I used another with different drivers, but the problem remained. They were both realtek however, and it was pointed out that this would not necessarily discount a driver issue.

I am testing with a 10/100 tulip card and it is looking promising. No lockups yet and I have shifted a couple of gigs across the link.

maslo64 · n00b Joined: 04 Sep 2008 Posts: 14 Location: Slovakia

I am sorry if I messed up your thread with my own problem.

Anyway, it looks like I have reached a solution. My machine is an HP DL380 G5 and the network cards are:

03:00.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM5708 Gigabit Ethernet (rev 12)
05:00.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM5708 Gigabit Ethernet (rev 12)

After I recompiled the kernel with the driver for the network card as a module, it reported a "Call trace" in the dmesg log.
I tried to compile and install the drivers from the Broadcom website, but I could not work out how to enable
ZLIB_INFLATE in the kernel (all my attempts were unsuccessful), so I can't load that module.

Everything looks fine now that I have compiled the in-kernel driver as a module through 'make menuconfig' and set mode=5.

So these are my configs:

Hibbelharry · Posted: Sun Sep 07, 2008 7:49 pm Post subject:

You might try disabling checksum offloading to hardware by using ethtool. This solved some network dying problems with Xen for me.
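A minimal sketch of that, wrapped in Python so it can be dropped into a startup script. It relies on ethtool's -K option; the function names, the interface name eth0 and the choice of switching off only tx checksumming are assumptions for illustration, and which interface needs it (domU, dom0 or both) has to be found by testing.

```python
import subprocess


def offload_off_command(iface, features=("tx",)):
    """Build the ethtool command line that switches the given offloads off."""
    cmd = ["ethtool", "-K", iface]
    for feature in features:
        cmd += [feature, "off"]
    return cmd


def disable_offload(iface, features=("tx",)):
    """Run ethtool (needs root); raises CalledProcessError if it fails."""
    subprocess.run(offload_off_command(iface, features), check=True)
```

Calling disable_offload("eth0") runs the equivalent of `ethtool -K eth0 tx off`, and `ethtool -k eth0` shows the resulting settings.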

Greetz
Hibbelharry

maslo64 · n00b Joined: 04 Sep 2008 Posts: 14 Location: Slovakia

Hello Hibbelharry,
My problem is solved, as far as I can tell now. I can't add [solved] to the topic as this isn't my thread, and I messed up koan's thread.

BTW koan, did switching to different drivers help?

bbgermany · Posted: Mon Sep 08, 2008 8:39 am Post subject:

BlackEye · l33t Joined: 04 Dec 2002 Posts: 756 Location: Germany

Is there a solution for this problem?
I have exactly the same problem as the original poster!

Instead of Samba, I discovered the problem while using NFS. When copying large files (several MB) over NFS, the Xen virtual network dies and does not recover. I need to restart the whole dom0 before I can restart the domU and use the network again.
By reducing the NFS rsize and wsize I observed that the problem may not appear again. However, this could simply be because transfers are slower with these settings, so the bug is not triggered. The strange thing is that I could copy large files between domU and dom0 using netcat without any problems.
I'm afraid that this is a security issue. I could lower the rsize and wsize, but what happens if someone is able to send a large packet through the pipe to crash my connection?

Is there any real solution for this problem? New kernels? New bugfixes? Or any other hints?

I use xen 3.3 with 2.6.21-xen kernel sources (dom0 and domU).
Any help would be really appreciated!

Greetings,
Martin

koan · Posted: Fri Oct 17, 2008 7:54 am Post subject:

I am currently running with a non-Realtek based 10/100 card, and haven't experienced any issues with the network failing even at max.

However, I do want this to be a gigabit connection, so I have a D-Link gigabit card waiting to be tested, and I'll report back whether or not I get good results.

My other plan is to use the SUSE Xen patchset against the Gentoo 2.6.25 kernel to see if that helps. I have the kernel built, but it remains to be seen if it even boots - other people have working Gentoo installs with this mix of kernel, but I don't know whether this will address the networking problem.

What network card are you using?

BlackEye · l33t Joined: 04 Dec 2002 Posts: 756 Location: Germany