Author |
Message |
koan Apprentice
Joined: 01 May 2006 Posts: 169 Location: Melbourne
|
Posted: Sun Aug 24, 2008 1:10 pm Post subject: Xen DomU Networking Stops working under load |
|
|
Hello,
I have a 2.6.21 xen DomU running pv under a 2.6.21 Dom0.
In general everything works well, no problems. However, if I load the network card, it (almost) stops sending or receiving packets.
Usage is normally pretty low - I am running a couple of database servers, an SNMP server and a few other things in the domU, but the traffic is extremely light - one or two users max.
If I start something like bittorrent, it will kill the networking almost immediately. I was running a Samba server on the domu for a while, but whenever I would save a file of significant size, it would kill the network.
If I xm console in, the network card appears fine, and there are a small number of bytes ticking up on transmit and receive. Nothing in messages or dmesg.
The dom0 network is fine, and I am using a single physical nic to bridge to. I have three other vms (all hvm) running on the same Xen box and they don't suffer any network issues.
I am not sure how to proceed with diagnosing this...
Cheers,
Paul |
|
koan Apprentice
Joined: 01 May 2006 Posts: 169 Location: Melbourne
|
Posted: Sun Aug 24, 2008 1:38 pm Post subject: |
|
|
The domU can ping itself incidentally, and restarting the domU interface doesn't help.
Sometimes it seems to right itself, sometimes it needs a reboot. tcpdump on the dom0 doesn't show anything.
Also, I use this domU for asterisk, and when the network is working, there are no problems with calls. So the issue seems to be more about taxing the virtual nic than packet frequency. |
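One way to narrow down where packets stop is to sample the byte counters on both ends of the virtual link while the stall is happening. The following is only a sketch: the helper name if_bytes is made up for illustration, and it assumes the usual /proc/net/dev column layout (receive bytes first, transmit bytes ninth after the interface name).

```shell
# Print "rx_bytes tx_bytes" for one interface, reading /proc/net/dev-style
# text on stdin. if_bytes is a hypothetical helper name.
if_bytes() {
    awk -v ifc="$1" '
        { sub(/:/, " ") }              # some kernels print "eth0:1234" with no space
        $1 == ifc { print $2, $10 }    # field 2 = rx bytes, field 10 = tx bytes
    '
}
```

For example, running `if_bytes eth0 < /proc/net/dev` twice a few seconds apart in the domU, and the same for the backend interface in dom0 (assumed here to be named like vif11.0 for domain 11, device 0), shows whether either side is still moving traffic.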
|
koan Apprentice
Joined: 01 May 2006 Posts: 169 Location: Melbourne
|
Posted: Sun Aug 31, 2008 11:51 am Post subject: |
|
|
Well, ok.
I have changed the kernel in the domU from 2.6.21 to 2.6.25 - same problem. I have also swapped the network card in the dom0, with no change.
The domU doesn't log anything wrong in messages or dmesg - it just cannot connect to anything. If I shut it down, it hangs and can no longer be reached via the xm console.
If I xm destroy, it destroys. If I then try to xm create the domain again, I get
Code: | Error: Device 0 (vif) could not be connected. Hotplug scripts not working. |
Nothing appears in the xen-hotplug.log.
xend.log gives:
Code: |
...
[2008-08-31 19:14:11 5531] DEBUG (DevController:595) hotplugStatusCallback /local/domain/0/backend/vif/11/0/hotplug-status.
[2008-08-31 19:15:51 5531] DEBUG (XendDomainInfo:1897) XendDomainInfo.destroy: domid=11
...
|
So it tries for a while and then destroys the VM.
All the other VMs are working fine at this point, but if I shutdown and attempt to restart any, they will fail to get the vif too.
Adding interfaces to the bridge seems to work fine, so I guess the problem must be in the creation of the vif interface. Or at least, the vif breaks, and then xen is no longer able to create a new one.
I am not sure at what stage this takes place - prior to vif-script by the look of it... |
|
bbgermany Veteran
Joined: 21 Feb 2005 Posts: 1844 Location: Oranienburg/Germany
|
Posted: Mon Sep 01, 2008 2:39 pm Post subject: |
|
|
How do you create the xenbr interface? I'm doing it this way:
/etc/conf.d/net
Code: |
config_eth0=( "null" )
config_eth1=( "null" )
bridge_xenbr0="eth0 eth1"
config_xenbr0=( "192.168.23.252 netmask 255.255.255.0" )
routes_xenbr0=( "default via 192.168.23.1" )
dns_servers=( "192.168.23.20" )
dns_domain="xxx.xxx"
dns_search="xxx.xxx xxx.yyy"
brctl_xenbr0=(
"setfd 0"
"sethello 0"
"stp off"
)
|
/etc/xen/xend-config.sxp
Code: |
(network-script network-dummy)
|
This solved the script issues while creating the bridge.
bb _________________ Desktop: Ryzen 5 5600G, 32GB, 2TB, RX7600
Notebook: Dell XPS 13 9370, 16GB, 1TB
Server #1: Ryzen 5 Pro 4650G, 64GB, 16.5TB
Server #2: Ryzen 4800H, 32GB, 22TB |
|
koan Apprentice
Joined: 01 May 2006 Posts: 169 Location: Melbourne
|
Posted: Mon Sep 01, 2008 11:48 pm Post subject: |
|
|
I am running the network-bridge script to create the bridge - with a slight mod, as I have multiple addresses on my physical nic and these were not getting set up correctly on the bridge.
Your script sets up the bridge normally, but then also does this:
Code: |
"setfd 0"
"sethello 0"
"stp off"
|
My bridge has forward delay set to zero, and stp off. So the only difference is that you have the hello time set to zero, whereas mine is 2 seconds.
With stp off, I am not sure that hello time does anything - but googling it I have found a number of occasions where setting hello time to something other than zero fixed some Xen networking issues (high numbers of interrupts).
Can you remember why you have it set to zero? |
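For comparing bridge settings between machines, the three brctl_xenbr0 entries quoted above correspond to plain brctl calls. A minimal sketch, assuming the Gentoo net scripts pass each entry to brctl with the bridge name inserted after the subcommand; the helper name bridge_cmds is hypothetical, and it only prints the commands rather than running them.

```shell
# Print (do not run) the brctl commands matching the bridge settings
# discussed above. $1 = bridge name, $2 = hello time in seconds (default 2).
# bridge_cmds is a made-up name for illustration.
bridge_cmds() {
    br="$1"
    hello="${2:-2}"
    echo "brctl setfd $br 0"          # forward delay of zero
    echo "brctl sethello $br $hello"  # hello time, the one value that differs
    echo "brctl stp $br off"          # spanning tree disabled
}
```

`bridge_cmds xenbr0 2` prints the commands with a 2 second hello time, and `bridge_cmds xenbr0 0` the variant from the earlier post. The values currently active on a live bridge can be read with `brctl showstp xenbr0`.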
|
bbgermany Veteran
Joined: 21 Feb 2005 Posts: 1844 Location: Oranienburg/Germany
|
Posted: Tue Sep 02, 2008 6:17 am Post subject: |
|
|
IIRC, I used the Gentoo wiki entry to configure my Xen setup, and this setting was in there. I use multiple addresses on the bridge as well; iproute2 did the trick for me.
bb |
|
koan Apprentice
Joined: 01 May 2006 Posts: 169 Location: Melbourne
|
Posted: Tue Sep 02, 2008 2:20 pm Post subject: |
|
|
The forward delay and stp settings are default in current Gentoo Xen installs. The hello interval relates to how often BPDUs are sent, and so it is unlikely to have any bearing on my issue.
I'll give it a test at some point, because right now I have exhausted the leads available to me - at least, the ones I can think of. Well I do have another, and that is to build Xen with a stock kernel from another distribution, to see if it helps. But that represents a whole new set of difficulties, as I couldn't find a stock kernel that did everything I wanted, which is why I came back to gentoo... |
|
maslo64 n00b
Joined: 04 Sep 2008 Posts: 14 Location: Slovakia
|
Posted: Thu Sep 04, 2008 10:17 pm Post subject: |
|
|
Hello Koan,
I have exactly the same issue with Xen. When starting the domU I noticed this message:
Code: |
Bringing up eth0
* dhcp
* Running dhcpcd ...err, eth0: Failed to lookup hostname via DNS: Name or service not known
[ ok ]
* eth0 received address 192.168.1.122/24
|
After logging in to the system everything is fine, but when I do something like "emerge -eauDN world", some packages are transferred to the domU and after a while it looks like the bridge is down; then, after another while, the network interface is working again.
Below is my bonding setup for eth0 and eth1 in dom0
Code: |
config_eth0=( "null" )
config_eth1=( "null" )
RC_NEED_bond0=("net.eth0 net.eth1")
slaves_bond0="eth0 eth1"
config_bond0=( "null" )
RC_NEED_xenbr0="net.bond0"
bridge_xenbr0="bond0"
config_xenbr0=("dhcp")
brctl_xenbr0=(
"setfd 0"
"sethello 0"
"stp off"
)
|
I am now testing whether it is caused by bonding or IPv6.
Any help will be appreciated. |
|
koan Apprentice
Joined: 01 May 2006 Posts: 169 Location: Melbourne
|
Posted: Thu Sep 04, 2008 10:35 pm Post subject: |
|
|
Hi,
I am not using bonded nics, or IPv6.
Xen 3.3 came into unstable a couple of days ago, so I upgraded, but the problem still remains.
Someone on the Xensource mailing list suggested lowering the NIC rate so that the domU never transfers at a speed that breaks networking, but last time I tried to test, the break happened at 3.6Mbs. That is pretty slow!
So the changes I have made are:
1) Change domU kernel (2.6.21, 2.6.24, 2.6.25)
2) Change domU userland (gentoo, ubuntu)
3) Change dom0 physical nic + driver (Realtek 8169 -> 8168)
4) Change Xen version (3.2.1 -> 3.3)
The only thing I haven't changed is the dom0 kernel. I am using a stock 2.6.21 gentoo kernel, so it would be great if anyone watching this that has pv domUs working under a 2.6.21 kernel would post their .config so I can compare it to mine.
Paul |
|
maslo64 n00b
Joined: 04 Sep 2008 Posts: 14 Location: Slovakia
|
Posted: Fri Sep 05, 2008 10:02 am Post subject: |
|
|
Hmm, so I switched back to the eth0 -> xenbr0 configuration and disabled IPv6, and everything is OK now.
I am going to try different bonding modes. If the problem persists, I think I will have to try NAT. |
|
bbgermany Veteran
Joined: 21 Feb 2005 Posts: 1844 Location: Oranienburg/Germany
|
Posted: Fri Sep 05, 2008 10:09 am Post subject: |
|
|
What kind of bond do you use? Maybe your switch doesn't support the mode, so packets get lost in transfer.
bb |
|
maslo64 n00b
Joined: 04 Sep 2008 Posts: 14 Location: Slovakia
|
Posted: Fri Sep 05, 2008 10:27 am Post subject: |
|
|
I was using mode=1, but when I tested unplugging and replugging cables, the connections were not restored.
Then I tried mode=0, which worked fine from dom0, but the issue with the domU appeared. |
|
bbgermany Veteran
Joined: 21 Feb 2005 Posts: 1844 Location: Oranienburg/Germany
|
Posted: Fri Sep 05, 2008 10:57 am Post subject: |
|
|
Did you try mode 5 (balance-tlb) or 6 (balance-alb) as well? Mode 0 is round-robin and 1 is active-backup. If your switch supports Link Aggregation Control Protocol (LACP), you should consider mode 4 (802.3ad).
bb |
|
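Since the bonding modes are being referred to by number here, a small lookup may help when reading configs such as "mode=5". This is just an illustrative sketch; the function name bond_mode_name is made up, while the number-to-name mapping is the one used by the Linux bonding driver.

```shell
# Map a Linux bonding mode number to its name.
# bond_mode_name is a hypothetical helper, not part of any tool.
bond_mode_name() {
    case "$1" in
        0) echo balance-rr ;;      # round-robin
        1) echo active-backup ;;
        2) echo balance-xor ;;
        3) echo broadcast ;;
        4) echo 802.3ad ;;         # LACP, needs switch support
        5) echo balance-tlb ;;
        6) echo balance-alb ;;
        *) echo "unknown bonding mode: $1" >&2; return 1 ;;
    esac
}
```

The mode a running bond is actually using can be checked in /proc/net/bonding/bond0, which is worth doing after changing the module options.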
maslo64 n00b
Joined: 04 Sep 2008 Posts: 14 Location: Slovakia
|
Posted: Fri Sep 05, 2008 12:41 pm Post subject: |
|
|
I am going to test this in the evening, because I don't have console access to the server at the moment, and then I'll try again and again and again. |
|
maslo64 n00b
Joined: 04 Sep 2008 Posts: 14 Location: Slovakia
|
Posted: Sat Sep 06, 2008 10:20 am Post subject: |
|
|
Still no luck with bonding. I tried 6 bonding modes, but still no progress. I am going to downgrade the kernels for dom0 and domU from 2.6.21 -> 2.6.18-r12 and check if it helps. I also set "sethello 2", as I found that a hello time of zero is a known bug with Xen, so you were right about this. |
|
koan Apprentice
Joined: 01 May 2006 Posts: 169 Location: Melbourne
|
Posted: Sat Sep 06, 2008 2:17 pm Post subject: |
|
|
Ok,
It looks like the bonding issue isn't related to the original issue reported in this thread, but that's OK, we can share.
Anyway, in an effort to eliminate the nic as the source of the problem I used another with different drivers, but the problem remained. They were both realtek however, and it was pointed out that this would not necessarily discount a driver issue.
I am testing with a 10/100 tulip card and it is looking promising. No lockups yet and I have shifted a couple of gigs across the link. |
|
maslo64 n00b
Joined: 04 Sep 2008 Posts: 14 Location: Slovakia
|
Posted: Sat Sep 06, 2008 3:41 pm Post subject: |
|
|
I am sorry if I messed up your thread with my own problem.
Anyway, it looks like I have reached a solution. My configuration is an HP DL 380G5 and the network cards are:
03:00.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM5708 Gigabit Ethernet (rev 12)
05:00.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM5708 Gigabit Ethernet (rev 12)
After I recompiled the kernel with the driver for the network card as a module, it reported a "Call trace" in the dmesg log.
I tried to compile and install the drivers from the Broadcom website, but I couldn't work out how to enable ZLIB_INFLATE in the kernel (all my attempts were unsuccessful), so I couldn't load that module.
Everything looks fine when I compile the driver as a kernel module through 'make menuconfig' and set mode=5.
So these are my configs:
Code: |
master ~ # cat /etc/conf.d/net
config_eth0=( "null" )
config_eth1=( "null" )
RC_NEED_bond0=("net.eth0 net.eth1")
slaves_bond0="eth0 eth1"
config_bond0=( "null" )
RC_NEED_xenbr0="net.bond0"
bridge_xenbr0="bond0"
config_xenbr0=("dhcp")
brctl_xenbr0=(
"setfd 0"
"sethello 2"
"stp off"
)
|
Code: |
master ~ # cat /etc/modules.autoload.d/kernel-2.6
bnx2
bonding miimon=100 mode=5
loop max_loop=256
master ~ # uname -a
Linux 2.6.18-xen-r12 #9 SMP Sat Sep 6 14:41:34 CEST 2008 x86_64 Intel(R) Xeon(R) CPU E5420 @ 2.50GHz GenuineIntel GNU/Linux
master ~ # cat /xen/reference/gentoo.xen.cfg
kernel = "/boot/vmlinuz-2.6.21-xenU"
memory = 1024
name = "reference"
vif = [ 'mac=00:16:3E:6A:49:54, bridge=xenbr0' ]
dhcp = "dhcp"
disk = ['file:/xen/reference/gentoo64.img,sda1,w', \
'file:/xen/reference/gentoo64_lvm.img,sdc1,w', \
'file:/xen/reference/swap_disk.img,sds1,w', ]
root = "/dev/sda1 ro"
extra = "gentoo=nodevfsi"
master ~ #
|
app-emulation/xen-3.3.0
app-emulation/xen-tools-3.3.0
To me it looks like something in /usr/src/linux/drivers/net/bonding/* isn't quite right when using bnx2. |
|
Hibbelharry Tux's lil' helper
Joined: 27 May 2003 Posts: 88 Location: Bremen, Northern Germany
|
Posted: Sun Sep 07, 2008 7:49 pm Post subject: |
|
|
You might try disabling checksum offloading to hardware using ethtool. This solved some problems with the network dying under Xen for me.
Greetz
Hibbelharry |
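For anyone wanting to try this, a rough sketch of what the ethtool call could look like. The helper name offload_off_cmd is hypothetical and it only prints the command so it can be reviewed before being run as root; which features need to be off (tx only, or tx and rx) is an assumption to test, not something established in this thread.

```shell
# Print the ethtool command that switches off the given offload features
# on one interface. With no features listed, tx and rx checksumming are used.
# offload_off_cmd is a made-up name for illustration.
offload_off_cmd() {
    ifc="$1"
    shift
    [ "$#" -gt 0 ] || set -- tx rx
    cmd="ethtool -K $ifc"
    for feature in "$@"; do
        cmd="$cmd $feature off"
    done
    echo "$cmd"
}
```

`offload_off_cmd eth0` prints `ethtool -K eth0 tx off rx off`. The current state can be checked with `ethtool -k eth0`; the change does not survive a reboot, so it would need to go into a startup script if it helps.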
|
maslo64 n00b
Joined: 04 Sep 2008 Posts: 14 Location: Slovakia
|
Posted: Mon Sep 08, 2008 6:33 am Post subject: |
|
|
Hello Hibbelharry,
My problem is solved, as far as I can tell now. I can't add [solved] to the topic, as this isn't my thread and I have messed up koan's thread.
BTW koan, did switching to different drivers help? |
|
bbgermany Veteran
Joined: 21 Feb 2005 Posts: 1844 Location: Oranienburg/Germany
|
Posted: Mon Sep 08, 2008 8:39 am Post subject: |
|
|
Hibbelharry wrote: | You might try disabling checksum offloading to hardware by using ethtool. This solved some network dying problems with xen for me.
Greetz
Hibbelharry |
I have checksum offload disabled for tx but not rx. Did you disable both?
Code: |
zeus ~ # ethtool -k eth1
Offload parameters for eth1:
Cannot get device udp large send offload settings: Operation not supported
rx-checksumming: on
tx-checksumming: off
scatter-gather: off
tcp segmentation offload: off
udp fragmentation offload: off
generic segmentation offload: off
zeus ~ #
|
bb |
|
BlackEye l33t
Joined: 04 Dec 2002 Posts: 756 Location: Germany
|
Posted: Fri Oct 17, 2008 6:50 am Post subject: |
|
|
Is there a solution for this problem?
I have exactly the same problem as the original poster!
Instead of Samba, I discovered the problem using NFS. When copying large files (several MB) over NFS, the Xen virtual network dies unrecoverably. I need to restart the whole dom0 to be able to restart the domU and use the network again.
By reducing the rsize and wsize of NFS, I observed that the problem may not appear again. However, this could be because the transmission is slower with these changes, so the bug may just not be triggered. The strange thing is that I could copy large files using netcat between domU and dom0 without any problems.
I'm afraid that this is a security issue. I could lower the rsize and wsize, but what happens if someone is able to send a large packet through the pipe to crash my connection?
Is there any real solution for this problem? New kernels? New bugfixes? Or any other hints?
I use xen 3.3 with 2.6.21-xen kernel sources (dom0 and domU).
Any help would be really appreciated!
Greetings,
Martin |
|
koan Apprentice
Joined: 01 May 2006 Posts: 169 Location: Melbourne
|
Posted: Fri Oct 17, 2008 7:54 am Post subject: |
|
|
I am currently running with a non-Realtek based 10/100 card, and haven't experienced any issues with the network failing even at max.
However, I do want this to be a gigabit connection, so I have a dlink gig card waiting to test, and I'll report back if I get good results (or not).
My other plan is to use the SUSE Xen patchset against the Gentoo 2.6.25 kernel to see if that helps. I have the kernel built, but it remains to be seen if it even boots - other people have working Gentoo installs with this mix of kernel, but I don't know whether this will address the networking problem.
What network card are you using? |
|
BlackEye l33t
Joined: 04 Dec 2002 Posts: 756 Location: Germany
|
Posted: Fri Oct 17, 2008 9:59 am Post subject: |
|
|
koan wrote: | What network card are you using? |
02:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller (rev 01)
The Realtek Ethernet controller seems to have some issues with Xen. I read something about this on the net. Unfortunately I can't change the NIC, because this is a root server to which I don't have any direct access.
About the kernel source I use, maybe this link is interesting for you too -> https://forums.gentoo.org/viewtopic-t-709908.html
There you can get a new vanilla with xen patches. This is the kernel I currently use on my dom0.
If I use NFS with this kernel without setting rsize and wsize, I get horrible transfer rates. If I manually set rsize and wsize to 8192, I get a vastly better result (you can see my post in the other thread about this).
However, I don't know if this is the real solution for this problem or not.
About the NIC: although I found some reports of issues with Realtek in conjunction with Xen, I really don't know why it should have anything to do with this, because the whole transfer between dom0 and the domUs goes over the virtual devices. |
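For reference, the rsize/wsize workaround described above is set through NFS mount options. A minimal sketch, with a hypothetical helper name (nfs_mount_cmd) and placeholder server and path names; it prints the mount command instead of running it, and 8192 is simply the value reported in this post, not a recommended setting.

```shell
# Print an NFS mount command with explicit read/write block sizes.
# $1 = server:/export, $2 = mount point, $3 = size in bytes (default 8192).
# nfs_mount_cmd is a made-up name for illustration.
nfs_mount_cmd() {
    size="${3:-8192}"
    echo "mount -t nfs -o rsize=$size,wsize=$size $1 $2"
}
```

`nfs_mount_cmd server:/export /mnt/data` prints the command for 8192-byte blocks, and a third argument such as 4096 allows trying other sizes to see where the stalls start.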
|