Gentoo Forums
System is very slow since a few month (complex setup, RAID)
doublehp
Guru

Joined: 11 Apr 2005
Posts: 473
Location: FRANCE

Posted: Fri Dec 04, 2015 4:35 pm

Hello.

I installed the machine in 2010. It was fast and worked well.

In 2014 I changed the disks to get more space. I moved files and data without reinstalling any software. But I changed one detail: the HOME partition went from EXT4 to ZFS. I now have 5 HDDs.

Since that disk change, the system got a bit slower. But it is now significantly slower than before. And I don't understand why.

Long story short: Gentoo is on RAID6 over 5 disks. HOME is on ZFS raidz2 over 5. BIG (say, /mnt/tmp) is on a more complex setup; but in short, it's also raid6 over 5 disks.

1: all apps are getting slower and slower, to the point of being painfully slow. Rox-filer used to open a folder with 200 pictures in 3 to 5 s. Now it needs about 1 s to generate each picture's preview. That's about 3 minutes for a 200-picture folder, while it used to be below 10 s.

2: some apps freeze for a long time; E17 shows blinking red decorations.

3: sometimes the whole X session freezes (the mouse won't move) for 20 s, or up to 8 minutes.

When things are slow or frozen, the load increases hugely (from 0.6 on average to 2, 5, or 8). But the CPU usually remains 75% or even 92% idle, and the HDD LED blinks very slowly, about 3 blinks per second.
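
A load average that climbs while the CPU stays mostly idle usually points at tasks stuck in uninterruptible sleep (state D in ps), typically waiting on I/O, because Linux counts those tasks in the load average. A minimal sketch for listing them, assuming the procps version of ps; the function names are made up for illustration:

```shell
#!/bin/sh
# Keep only ps-style lines whose first column (state) starts with D,
# i.e. tasks in uninterruptible sleep. Reads from stdin so it also
# works on samples saved earlier.
filter_blocked() {
    awk '$1 ~ /^D/ { print }'
}

# List currently blocked tasks. The wchan column hints at the kernel
# function each task is waiting in.
list_blocked() {
    ps -eo state,pid,wchan:32,comm | filter_blocked
}
```

Running `list_blocked` a few times during a slow phase should show whether the same processes, and the same wait channel, keep coming back.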


Code:
519# for n in a d e ; do hdparm -tT /dev/sd$n ; done

/dev/sda:
 Timing cached reads:   6230 MB in  2.00 seconds = 3115.74 MB/sec
 Timing buffered disk reads:  402 MB in  3.01 seconds = 133.72 MB/sec

/dev/sdd:
 Timing cached reads:   6414 MB in  2.00 seconds = 3207.49 MB/sec
 Timing buffered disk reads:  454 MB in  3.00 seconds = 151.29 MB/sec

/dev/sde:
 Timing cached reads:   7010 MB in  2.00 seconds = 3506.11 MB/sec
 Timing buffered disk reads:  368 MB in  3.00 seconds = 122.50 MB/sec
0 0 2015-12-04_17-21-14 17:20:21 @pts/3 root@uranus:/tmp
520# hdparm -tT /dev/mapper/Big_2014-Big_2014

/dev/mapper/Big_2014-Big_2014:
 Timing cached reads:   8128 MB in  2.00 seconds = 4065.51 MB/sec
 Timing buffered disk reads:  750 MB in  3.03 seconds = 247.82 MB/sec
0 0 2015-12-04_17-22-09 17:21:48 @pts/3 root@uranus:/tmp


During these tests, the HDD led was 100% on.

These were raw hardware values.

Now, let's see what's after filesystem:

Code:
511# dd if=/dev/zero of=/tmp/plop conv=sync bs=1M count=1k   # SYSTEM
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 11.2483 s, 95.5 MB/s
0 0 2015-12-04_17-14-24 17:14:12 @pts/3 root@uranus:/tmp
512# o=/mnt/big/tmp/plop ; dd if=/dev/zero of="$o" conv=fsync bs=1M count=1k ; rm "$o"        # BIG
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 68.9786 s, 15.6 MB/s
0 0 2015-12-04_17-16-12 17:15:03 @pts/3 root@uranus:/tmp
513# o=/home/plop ; dd if=/dev/zero of="$o" conv=fsync bs=1M count=1k ; rm "$o"       # HOME
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 7.20762 s, 149 MB/s
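
Note that the SYSTEM run above used conv=sync while the BIG and HOME runs used conv=fsync. In GNU dd, conv=sync only pads input blocks with NULs, whereas conv=fsync flushes the written data to disk before dd reports its rate, so the three figures are not measured the same way. A minimal sketch that applies one and the same command to every target, assuming GNU dd; `write_bench` is a hypothetical helper name:

```shell
#!/bin/sh
# Write <count> MiB of zeros into <dir>, flushing to disk before dd
# prints its rate, then remove the temporary file.
write_bench() {  # usage: write_bench <dir> [count_MiB]
    target="$1/write_bench.$$"
    dd if=/dev/zero of="$target" bs=1M count="${2:-1024}" conv=fsync
    status=$?
    rm -f "$target"
    return "$status"
}
```

With the mount points from this post it could be run as `for d in /tmp /mnt/big/tmp /home; do write_bench "$d"; done`.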


Each disk can deliver around 150MB/s, or at least 120MB/s. With RAID6 over 5 disks (3 data disks per stripe), the theoretical speed should be between 360 and 450MB/s. BIG got 240MB/s at array level, and ... about 30MB/s after the file system (ext4).
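
The arithmetic behind those figures can be sketched as follows. It uses the per-disk rates quoted above, plus the hdparm read rate and the dd write rate measured on BIG; comparing a read rate with a flushed write rate is only a rough guide, since RAID6 writes also carry parity overhead. The function names are made up for illustration:

```shell
#!/bin/sh
# RAID6 over n disks keeps data on n-2 of them per stripe (two hold
# parity), so the best-case sequential rate is about (n-2) times the
# rate of one disk.
raid6_ideal_rate() {  # usage: raid6_ideal_rate <disks> <per_disk_MB_s>
    echo $(( ($1 - 2) * $2 ))
}

# Percentage lost between a reference rate and a measured rate.
# awk is used because the shell has no floating point.
loss_percent() {  # usage: loss_percent <reference_MB_s> <measured_MB_s>
    awk -v ref="$1" -v got="$2" 'BEGIN { printf "%.1f\n", (1 - got / ref) * 100 }'
}

raid6_ideal_rate 5 120     # prints 360
raid6_ideal_rate 5 150     # prints 450
loss_percent 247.82 15.6   # prints 93.7
```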

How can such a loss be possible?

In the end, ZFS is the fastest one, with 150MB/s (instead of 360-450!).

I have also boosted my RAM from 4GB to 20GB.

The system was far faster than this in 2010. Any filesystem could reach at least 180MB/s, and more generally 230MB/s. In 2010, no filesystem test was below 160MB/s.

My usage ratios (space / inodes):
system: 47% / 23%
BIG: 63% / 1%
Home: 45% / 1%

What the hell could make the system so slow, when ...
- it used to be 1000% faster in the past
- it's still reasonably fast at block level
- I have no SMART errors (or any other kind of error in any log I have)
- I have free space (and free inodes)
- I am not struggling on CPU
- I am not struggling on bus
- I have very low IO

- I don't know which apps may need to write, or read
- sometimes I have heavy disk access, but I have no clue which apps are causing it.

My typical use is: Thunderbird, Firefox, Pidgin, Chrome. Sometimes OpenOffice. The whole lot used to be fluent 5 years ago with 4GB RAM, and 4 disks in raid 5; and is now slow with 5 times more RAM.

I constantly look at top, and always have at least 8GB RAM free. Thunderbird has leaks, I restart it every 4 or 8h. And I reboot the whole system at least once a day, as soon as the free RAM comes down to 0.

This is not a 486 with 64k RAM; if the issue were apps eating resources, I should see low free memory, or low idle CPU.

I can run vmstat and most other IO monitoring apps, but not iotop: my kernel lacks the symbol it needs, and I can not update my kernel.
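
Without iotop, /proc/diskstats can at least show which block device is busy during a slow phase, although not which process is responsible. A minimal sketch that compares two snapshots of that file; the helper names are hypothetical, and the field positions assume the usual layout (field 3 device name, field 6 sectors read, field 10 sectors written, field 13 milliseconds spent doing I/O, with 512-byte sectors):

```shell
#!/bin/sh
# Compare two snapshots of /proc/diskstats and print, for every device
# that moved data, KiB read, KiB written and ms spent doing I/O.
diskstats_delta() {  # usage: diskstats_delta <before_file> <after_file>
    awk 'NR == FNR { r[$3] = $6; w[$3] = $10; b[$3] = $13; next }
         ($3 in r) && ($6 != r[$3] || $10 != w[$3]) {
             printf "%s read_KiB=%d written_KiB=%d busy_ms=%d\n",
                    $3, ($6 - r[$3]) / 2, ($10 - w[$3]) / 2, $13 - b[$3]
         }' "$1" "$2"
}

# Take two snapshots some seconds apart and report the difference.
sample_disks() {  # usage: sample_disks [seconds]
    before=$(mktemp) && after=$(mktemp) || return 1
    cat /proc/diskstats > "$before"
    sleep "${1:-5}"
    cat /proc/diskstats > "$after"
    diskstats_delta "$before" "$after"
    rm -f "$before" "$after"
}
```

A busy_ms value close to the sampling interval on one disk, with little data moved, would suggest that this disk is slow to answer rather than overloaded.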

I am going mad with this slow machine. When it freezes, even network traffic stops. Wait 3 minutes, and things come back to normal.

I need any help to track the root issue.
_________________
DEMAINE Benoît-Pierre (aka DoubleHP ) http://www.demaine.info/
>o_/ Coin coin coin \_o<
to contact me (MSN,ICQ, JABBER, Skype ... ) http://benoit.demaine.info/contact.png

Keruskerfuerst
Advocate

Joined: 01 Feb 2006
Posts: 2289
Location: near Augsburg, Germany

Posted: Fri Dec 04, 2015 7:28 pm

Detailed hardwareinfo?

eccerr0r
Watchman

Joined: 01 Jul 2004
Posts: 9884
Location: almost Mile High in the USA

Posted: Fri Dec 04, 2015 7:39 pm

And what kernel version?
Why can't you compile the needed code into your kernel for iotop? It should be there in recent kernels...
_________________
Intel Core i7 2700K/Radeon R7 250/24GB DDR3/256GB SSD
What am I supposed watching?

doublehp
Guru

Joined: 11 Apr 2005
Posts: 473
Location: FRANCE

Posted: Fri Dec 04, 2015 8:21 pm

Keruskerfuerst wrote:
Detailed hardwareinfo?


What details do you want ?

doublehp
Guru

Joined: 11 Apr 2005
Posts: 473
Location: FRANCE

Posted: Fri Dec 04, 2015 8:24 pm

eccerr0r wrote:
And what kernel version?
Why can't you compile the needed code into your kernel for iotop? It should be there in recent kernels...


2.6.34-xen

Rebuilding is very complicated. If I try to change the kernel, I have little chance of making the new kernel bootable, and a real risk of making the system stop booting at all. Fixing a broken system may take days of work. And I don't have days to lose on this.

eccerr0r
Watchman

Joined: 01 Jul 2004
Posts: 9884
Location: almost Mile High in the USA

Posted: Fri Dec 04, 2015 8:56 pm

Only guess now is fragmentation, if your disk has a lot of turnover from bittorrenting or something...?

doublehp
Guru

Joined: 11 Apr 2005
Posts: 473
Location: FRANCE

Posted: Fri Dec 04, 2015 10:11 pm

eccerr0r wrote:
Only guess now is fragmentation, if your disk has a lot of turnover from bittorrenting or something...?


ZFS can not fragment.

For ext4: 0.2%

TigerJr
Guru

Joined: 19 Jun 2007
Posts: 540

Posted: Fri Dec 04, 2015 11:59 pm

If you have a Xen kernel, then you can have Xen guests: many guests, many IOPS. Why can't you debug I/O rates with SNMP or other monitoring systems? Or even with sarg and a shell script, MRTG or gnuplot with a simple webserver, to understand what is causing the high I/O rate or latency bottleneck, or whether the RAID has degraded due to a disk failure?
_________________
Do not use gentoo, it die

eccerr0r
Watchman

Joined: 01 Jul 2004
Posts: 9884
Location: almost Mile High in the USA

Posted: Sat Dec 05, 2015 1:07 am

doublehp wrote:
eccerr0r wrote:
Only guess now is fragmentation, if your disk has a lot of turnover from bittorrenting or something...?


ZFS can not fragment.

For ext4: 0.2%

Yeah, right. Fragmentation resistance does not equate to "can not fragment". Unless there's an in-OS/in-kernel auto defragmenter there is always a degenerate write pattern that forces fragmentation.

Keruskerfuerst
Advocate

Joined: 01 Feb 2006
Posts: 2289
Location: near Augsburg, Germany

Posted: Sat Dec 05, 2015 8:06 am

ZFS has a defragmentation program.

Detailed hardware info means:
CPU(s)
Mainboard
RAM; type
Graphics card
Harddisk controller
Harddisks (HDD or SSD)

doublehp
Guru

Joined: 11 Apr 2005
Posts: 473
Location: FRANCE

Posted: Sat Dec 05, 2015 10:00 am

TigerJr wrote:
If you have a Xen kernel, then you can have Xen guests: many guests, many IOPS. Why can't you debug I/O rates with SNMP or other monitoring systems? Or even with sarg and a shell script, MRTG or gnuplot with a simple webserver, to understand what is causing the high I/O rate or latency bottleneck, or whether the RAID has degraded due to a disk failure?


I don't have any guests. Usually, just after boot, if I don't touch anything, the disk may stay idle for seconds, or tens of seconds (apart from the 5s sync of ext). I don't have heavy background work.

ZFS works in a way that does not fragment files. And it has a background daemon that cleans the filesystem when the system is idle (the daemon is always on, but only works when no writes are made for a long time). So, ARC is preventing fragmentation, and ARC has a design that can not make it responsible for system slowness.

Quote:
Yeah, right. Fragmentation resistance does not equate to "can not fragment". Unless there's an in-OS/in-kernel auto defragmenter there is always a degenerate write pattern that forces fragmentation.


Yes, ZFS is very special about this (and many more interesting aspects). I have asked ZFS people, and ZFS is correctly configured on my machine. Maybe I could increase the memory allotted to ARC (I could double it, since I have 5 times more RAM).

Quote:
Detailed hardware info means:
CPU(s)
Mainboard
RAM; type
Graphics card
Harddisk controller
Harddisks (HDD or SSD)


To make it quick, I take this from lshw (not the most detailed, but it gives short answers to your questions)

Quote:

*-core
description: Motherboard
product: GA-MA785GT-UD3H
vendor: Gigabyte Technology Co., Ltd.
physical id: 0
*-firmware
description: BIOS
vendor: Award Software International, Inc.
physical id: 0
version: F3 (09/16/2009)
size: 128KiB
capacity: 960KiB
capabilities: isa pci pnp apm upgrade shadowing cdboot bootselect socketedrom edd int13floppy360 int13floppy1200 int13floppy720 int13floppy2880 int5printscreen int9keyboard int14serial int17printer int10video acpi usb agp ls120boot zipboot biosbootspecification
*-cpu
description: CPU
product: AMD Phenom(tm) II X4 965 Processor
vendor: Advanced Micro Devices [AMD]
physical id: 4
bus info: cpu@0
version: AMD Phenom(tm) II X4 965 Processor
slot: Socket M2
size: 3400MHz
width: 64 bits
clock: 200MHz
capabilities: fpu fpu_exception wp vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp x86-64 3dnowext 3dnow constant_tsc rep_good nonstop_tsc extd_apicid pni monitor cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt npt lbrv svm_lock nrip_save
*-cache:0
description: L1 cache
physical id: a
slot: Internal Cache
size: 128KiB
capacity: 128KiB
capabilities: synchronous internal write-back
*-cache:1
description: L3 cache
physical id: c
slot: External Cache
size: 512KiB
capacity: 512KiB
capabilities: synchronous internal write-back
*-cache
description: L1 cache
physical id: b
slot: Internal Cache
size: 128KiB
capacity: 128KiB
capabilities: synchronous internal write-back
*-memory
description: System Memory
physical id: 29
slot: System board or motherboard
size: 20GiB
*-bank:0
description: DIMM 1333 MHz (0.8 ns)
product: None
vendor: None
physical id: 0
serial: None
slot: A0
size: 2GiB
width: 64 bits
clock: 1333MHz (0.8ns)
*-bank:1
description: DIMM 1333 MHz (0.8 ns)
product: None
vendor: None
physical id: 1
serial: None
slot: A1
size: 2GiB
width: 64 bits
clock: 1333MHz (0.8ns)
*-bank:2
description: DIMM 1333 MHz (0.8 ns)
product: None
vendor: None
physical id: 2
serial: None
slot: A2
size: 8GiB
width: 64 bits
clock: 1333MHz (0.8ns)
*-bank:3
description: DIMM 1333 MHz (0.8 ns)
product: None
vendor: None
physical id: 3
serial: None
slot: A3
size: 8GiB
width: 64 bits
clock: 1333MHz (0.8ns)

*-storage
description: SATA controller
product: SB7x0/SB8x0/SB9x0 SATA Controller [AHCI mode]
vendor: Advanced Micro Devices [AMD] nee ATI
physical id: 11
bus info: pci@0000:00:11.0
logical name: scsi0
logical name: scsi1
logical name: scsi2
logical name: scsi3
logical name: scsi4
version: 00
width: 32 bits
clock: 66MHz
capabilities: storage pm ahci_1.0 bus_master cap_list emulated
configuration: driver=ahci latency=32
resources: irq:22 ioport:ff00(size=8) ioport:fe00(size=4) ioport:fd00(size=8) ioport:fc00(size=4) ioport:fb00(size=16) memory:fe02f000-fe02f3ff
*-disk:0
description: ATA Disk
product: ST3000VX000-1CU1
vendor: Seagate
physical id: 0
bus info: scsi@0:0.0.0
logical name: /dev/sda
version: CV23
serial: XXXXXXXXXXXXX
size: 2794GiB (3TB)
capabilities: partitioned partitioned:dos
configuration: ansiversion=5



All 5 disks are identical, with consecutive serial numbers. They are plugged in hotplug SATA mode (not IDE compatible), all on the motherboard, and the motherboard's hardware RAID manager is not used.

Quote:

# lspci | grep VGA
01:05.0 VGA compatible controller: Advanced Micro Devices [AMD] nee ATI RS880 [Radeon HD 4200]
02:00.0 VGA compatible controller: Advanced Micro Devices [AMD] nee ATI RV710 [Radeon HD 4350/4550]
03:00.0 VGA compatible controller: Advanced Micro Devices [AMD] nee ATI RV710 [Radeon HD 4350/4550]


Configured with Xinerama, and partial Xrandr (on first card). I forgot details of Xcrossover configuration.

Xinerama is the part that took me 6 months to configure and get working, because the X team had still not implemented in 2010 what they promised to do in 2004. I spent 3 years finding cards that would work and let me have 6 monitors for a reasonable price. That's the point that makes the setup VERY FRAGILE. Any change in kernel configuration, BIOS configuration, or any file, driver or firmware change around VGA will break the system, from freezing (double panic) when starting X, to the kernel not booting at all. I won't try to rebuild the kernel because, if I forget any small detail during the rebuild process, X will refuse to start (and may even freeze the kernel, which ends up breaking the file systems: when you have more than 40 double panics per day and have to press the RESET button each time, the file system ends up corrupt). I reported all these issues against Linux, X, and the ATI drivers, and nobody cares, because I am obviously the only guy on earth who wants to have 6 monitors in Xinerama.

X will also break on udev or sysinit update.

Starting an alternate distribution also breaks the system. Booting Arch Linux or Debian usually messes up the MDADM chains. Those are easier to fix than X: it only takes half an hour, and there are no side effects.

And this was the short version. For the long story, search all the bugs I created, or joined, in Bugzilla ... and don't forget the closed ones. There you will find my detailed X configuration, logs, and crash logs.

But once again: the whole system was fast and fluent until 2013. Things slowed down after the disk change; and since then, they get worse and worse each month. I stopped updating Gentoo in October 2010. It was working; no reason to change stuff that works.

Right now, my load is quite low: 0.16, 0.25, 0.27. The system has been up for 90 minutes. Few things are running (TB, FF, Chrome), and the system is fluent.

TigerJr
Guru

Joined: 19 Jun 2007
Posts: 540

Posted: Sat Dec 05, 2015 11:33 am

mdadm? you didn't use hardware raid???

doublehp
Guru

Joined: 11 Apr 2005
Posts: 473
Location: FRANCE

Posted: Sat Dec 05, 2015 12:31 pm

I had an idea.

There is one thing that grows with time: my Gmail box. Each email sent is copied into the Sent folder locally, and, because the SMTP relay is Gmail, it also lands in the Sent folder of TWO Gmail accounts. I am cleaning the online account for the box used for outgoing messages (the account configured for the SMTP relay). The account (on local disk, for cache) was 18G, and is now 5G. But there are still surprises. After heavy cleaning, Thunderbird tells me the AllMail folder has a "size on disk 65MB", but, via terminal, after compacting ... it's rather ... 2.5G.

I have yet to understand why Thunderbird thinks that 2.5GB is 65MB ...

Quote:
2.5G [Gmail].sbd/All Mail
16M [Gmail].sbd/All Mail.msf
2.5G [Gmail].sbd/Sent Mail
13M [Gmail].sbd/Sent Mail.msf


Since my syncs are done at random times, since this is stored on ZFS, and since ZFS has a strange habit of duplicating files (to be able to take snapshots): for every single email sent and stored by Gmail online, if Thunderbird wants to add a message in the middle of an mbox file (MSF), it MAY be possible that ZFS was trying to duplicate a 9GB file ... for every single new message. This could be done in the background, when TB triggers a sync timeout (every 10 minutes, from memory).

I have tried "compacting" and "repair folder"

Also, panacea.dat refers to both
ImapMail/imap.gmail.com/[Gmail].sbd/All Mail.msf
and
ImapMail/imap.gmail.com/[Gmail]-1.sbd/All Mail.msf
and does not explain the difference.

But there is an account I can not clean up because Gmail will not let me filter (search) messages that are in AllMail, but not in any other directory. You can not search "unlabelled" messages.

And Even a thunderbird issue would not explain why my filesystem is so slow (when TB is closed).

Quote:
mdadm? you didn't use hardware raid???


No; hardware RAID is not flexible enough, has too many bugs, and is not portable at all. If your MB dies, you lose all data. You can not expand (move from RAID6/5 to RAID6/6) or change the RAID profile. If one SATA port dies, you can not move the disk to an IDE or USB port. And it masks the disks completely, so you can not monitor SMART anymore.

Right NOW, E17 shows Thunderbird with a red blinking decoration, but the rest of the system is fluent, and the disk blinks slowly: 1 flash per second. So the app is frozen, but one core of my CPU is at 99%; all other cores are 75% idle. I don't know what it's doing. That never happened before 2014. ... and 3 minutes later, hop, the app works normally.

Last edited by doublehp on Sat Dec 05, 2015 12:53 pm; edited 1 time in total

doublehp
Guru

Joined: 11 Apr 2005
Posts: 473
Location: FRANCE

Posted: Sat Dec 05, 2015 12:44 pm

It took 315 seconds to empty the bin, which contained 132,862 messages ... from Gmail Webmail.

Even the largest computer in the world had trouble flushing my account. No wonder my small desktop was slow managing it :)

Never saw Google spend more than 5s on any request.

One account left ... but don't know how to clean it. I have too many messages, in too many folders to make a copy of it.

schorsch_76
Guru

Joined: 19 Jun 2012
Posts: 452

Posted: Sat Dec 05, 2015 2:17 pm

It doesn't help to cry. If you want to do something, do it.

Take a full backup and keep your current kernel as a fallback. Then try a newer kernel and try to improve. If you can't do a full backup, then try to back up the system and keep your data safe.

It is always a bad sign if the one who installed/built something doesn't want to touch it because it could break. ;)
_________________
// valid again: I forgot about the git access. Now 1.2GB big. Start: 2015-06-25
git daily portage tree
Web: https://github.com/schorsch1976/portage
git clone https://github.com/schorsch1976/portage

doublehp
Guru

Joined: 11 Apr 2005
Posts: 473
Location: FRANCE

Posted: Sat Dec 05, 2015 2:51 pm

schorsch_76 wrote:
If you want to do something, do it.


- I want to know why the system is slower than before.
- I want to know why the system freezes from time to time (for 10 s to 10 minutes).
- I want to know why writing to a filesystem is significantly slower than writing to the block device (I am used to losing 10 or 20% speed; I have lost 90 or 95%).
- I want to know why the system may be unresponsive for about 5 minutes, when the CPU is 75% idle and the disk gives two or three short flashes per second (heavy writes produce a continuous light on the HDD LED).

I just have no clue how to dig into these issues.

I am wary of rebuilding a kernel, because the last 3 times I did it, it took me a full week to fix the system afterwards (a full week spending 12h per day on my computer), because of udev and X bugs (that have not been fixed for 4 to 15 years - yes, recent Xorg still triggers bugs I reported in 2002, despite a claimed complete rewrite ... twice).

And yes, I am out of patience. Some days I have work to do on the computer, and it's so exhausting to wait 5 minutes doing nothing, watching the mouse refuse to move, that I just go to bed without doing the work. And the next day ... the work is not done.

And I won't do a complete reinstall of the system, because it took me 6 months to set it up (2 full months doing nothing other than building packages and configuring /etc, 10 to 12h a day, 6 days per week).

A new kernel means ... a udev update, which means breaking the whole box. Reverting to the previous kernel will require a udev revert, and considering that a udev update implies touching many files in /etc ... you know it's not something that can be done easily. Otherwise, I would need to make a complete snapshot of the system at ZFS level, stop mounting user volumes, spend several days trying to make it work, and if it does not, revert to the previous snapshot of the filesystem. This means ... several days without logging in as user, and not going out for work. I have too much work outside the home to tell people I will take 3 days to fix my computer.

Yes, I have lost patience. I want to fix the issue, not shoot blindly with a tank and hope that some random action helps. I want to dig out the root cause of the problem, and fix the issue where it is. I am pretty certain there are a lot of things that can be done without rebuilding the kernel. I just don't know what. Not sure if pushing syslog to a network server could help; pushing kernel logs to a serial port is usually much more reliable (it talks directly to hardware, with fewer software layers; better chances to receive messages). But I don't know how to enter heavy debug mode, or how to debug modules or the scheduler. During system freezes the kernel is still up and running, because I see slow disk activity; so there are some things to do at least at kernel level.

TigerJr
Guru

Joined: 19 Jun 2007
Posts: 540

Posted: Sat Dec 05, 2015 3:39 pm

doublehp wrote:
- I want to know why the system is slower than before.
- I want to know why the system freezes from time to time (for 10 s to 10 minutes).
- I want to know why writing to a filesystem is significantly slower than writing to the block device (I am used to losing 10 or 20% speed; I have lost 90 or 95%).
- I want to know why the system may be unresponsive for about 5 minutes, when the CPU is 75% idle and the disk gives two or three short flashes per second (heavy writes produce a continuous light on the HDD LED).


Hm, I want to know too... but here I can only read =) But to diagnose the disease you have only an LED =)

Quote:
- it's still reasonably fast on block level
- I have no SMART error (or any other kind of error in any log I have)
- I have free space (and free inodes)
- I am not struggling on CPU
- I am not struggling on bus
- I have very low IO

- I don't know which apps may need to write, or read


OK - you have all these issues, but how can we help you? Are you in a panic?

You need handkerchief, doll whipping, girls, alcohol, atomic bomb, drugs??? Just ask...

You didn't do anything to monitor issues in past years, and then when a problem appears, you panic...

Keruskerfuerst
Advocate

Joined: 01 Feb 2006
Posts: 2289
Location: near Augsburg, Germany

Posted: Sat Dec 05, 2015 4:14 pm

1. You should update the bios of the mainboard. Here: http://www.gigabyte.com/products/product-page.aspx?pid=3154#bios
2. I only see one HDD in your setup.
3. ZFS does fragment and there is a utility to defragment the partitions.

doublehp
Guru

Joined: 11 Apr 2005
Posts: 473
Location: FRANCE

Posted: Sun Dec 06, 2015 7:03 pm

Quote:
Hm, I want to know too... but here I can only read =) But to diagnose the disease you have only an LED =)


Many times, I did not even have one LED to help debugging.

One LED is enough to send Morse. Those who have played with the Android boot sequence will understand.

Quote:
You didn't do anything to monitor issues in past years, and then a problem appears


You did not ask for anything other than the HW configuration. Maybe I have. But I did not spend hours mentioning every single thing I have done to track the issue. To mention the biggest tool I have: Munin; but it keeps logs for only 14 months, which is not enough. I started to use MRTG in 2005, and Munin around 2009. MRTG was too heavy. Munin has only 1 year's worth of logs.

I also have kept all my syslogs since 2008.

I also run top all day long. I have also tried various combinations of smartctl, vmstat, and lsof ... and got nothing conclusive. I did dozens of things to log, monitor, track, and analyse, and nothing was conclusive. I am running out of ideas.

I said I don't want to rebuild my kernel; but maybe I already have everything needed to dig into the issue. Up to now, nobody has mentioned any application, tool, or procedure. You don't know, because you didn't ask.

Keruskerfuerst wrote:
1. You should update the bios of the mainboard. Here: http://www.gigabyte.com/products/product-page.aspx?pid=3154#bios
2. I only see one HDD in your setup.
3. ZFS does fragment and there is a utility to defragment the partitions.


This is a forum; I am trying to make messages informative and short, and to remove all redundant parts. I said the 5 disks are identical with consecutive serial numbers. If I wanted to make very long messages, I could paste my kernel conf, and all my syslogs from the last 5 years, and merged logs, and why not ... a complete hexdump of my disks. I have answered every single question that was asked.

I have explored the ZFS possibility with ZFS people on IRC. They said my system is fine, but that I could update my ZFS when the next update is available and increase my ARC cache, though this may not help.

My biggest performance loss is on EXT4 anyway ...

Still waiting for suggestions, commands to run, or monitoring apps to install ... as long as installing them does not require updating too many things.

schorsch_76
Guru

Joined: 19 Jun 2012
Posts: 452

Posted: Sun Dec 06, 2015 7:27 pm

How about perf? [1]

[1] https://perf.wiki.kernel.org/index.php/Main_Page

Do you at least have the source and config of your kernel? With that you could include the functions for iotop.

Without iotop it is difficult to trace IO troubles. You don't even know when the HDD runs ... (no LED).

TigerJr
Guru

Joined: 19 Jun 2007
Posts: 540

Posted: Mon Dec 07, 2015 7:01 am

For analysis we need what you need: information about your running system. But the information must be helpful. You say that all the information you monitor isn't helpful, so now I don't know what information to request. IOPS monitoring when the error appears, CPU load, memory load, /proc/mdstat - but you say there is nothing useful.

MRTG is not so heavy... (alternatives: prtg, cacti, zabbix, watsup and even gnuplot) and can be run from a crontab script for each graph. It's quite easy: it generates HTML pages with PNG images that you view through an HTTP webserver. You can graph all the information you need with MRTG (disk load, network load, CPU load, IOPS, processes, memory, swap). If a server has faced a DoS problem, you can see when the error appeared, what the indicators were before the problem appeared, and even what the source of the DoS was. That is good for diagnosis. Analyzing only the LED gives you a small amount of information, and there is no LED history to tell you what the LED rates were an hour ago or the past day.

Did you check your disks for bad blocks?
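
The /proc/mdstat check mentioned in this post can be sketched as follows. It assumes the usual mdstat layout, where a member map such as [UU_UU] marks a missing disk with an underscore and a progress line names a running resync, recovery, reshape or check; the function names are made up for illustration:

```shell
#!/bin/sh
# Print the names of md arrays whose member map contains an underscore,
# meaning at least one member is missing. Reads mdstat text on stdin.
degraded_arrays() {
    awk '/^md/ { name = $1 }
         /\[[U_]*_[U_]*\]/ { print name }'
}

# Print the names of md arrays with a resync, recovery, reshape or
# check in progress, any of which competes with normal I/O.
busy_arrays() {
    awk '/^md/ { name = $1 }
         /(resync|recovery|reshape|check) *=/ { print name }'
}
```

Typical use would be `degraded_arrays < /proc/mdstat` and `busy_arrays < /proc/mdstat`; empty output from both means the arrays look complete and idle at that moment.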
_________________
Do not use gentoo, it die
Back to top
View user's profile Send private message
doublehp
Guru


Joined: 11 Apr 2005
Posts: 473
Location: FRANCE

Posted: Fri Dec 11, 2015 9:31 am

schorsch_76 wrote:
How about perf? [1]

[1] https://perf.wiki.kernel.org/index.php/Main_Page

Do you at least have the source and config of your kernel? With those you could enable the options that iotop needs.

Without iotop it is difficult to trace I/O trouble. You don't even know when the HDD is busy ... (no LED).


Perf is built. What's next?

Yes, I have my kernel tree. But I never understood how to compile drivers without rebuilding the whole tree. I need help with this.

I have one LED on the front of my tower for disks.

I have increased my ARC cache from 300-400 to 4GB.
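
One way to check whether the larger ARC limit really took effect is to read the ARC counters that ZFS on Linux exposes under /proc/spl/kstat/zfs/arcstats. The sketch below is only an illustration: arc_summary is a hypothetical helper name, and it assumes the usual "name type data" layout of that file. It prints the hit and miss counters, the configured maximum (c_max) and the current size, all raw values.

```shell
#!/bin/sh
# Sketch: print a few ARC counters from the ZFS on Linux kstat file.
# "arc_summary" is a hypothetical helper name; the optional first argument
# lets you point it at a saved copy of the file instead of the live one.
arc_summary() {
    stats=${1:-/proc/spl/kstat/zfs/arcstats}
    # Each data row is "name type data"; keep four rows and print name=data.
    awk '$1 == "hits" || $1 == "misses" || $1 == "c_max" || $1 == "size" {
        printf "%s=%s\n", $1, $3
    }' "$stats"
}
```

Calling `arc_summary` with no argument reads the live file. If c_max still shows the old value, the new limit was probably not applied when the module was loaded; on a running system the current limit should also be visible in /sys/module/zfs/parameters/zfs_arc_max.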

I updated from "ZFS: Loaded module v0.6.3-181_gb0cf067, ZFS pool version 5000, ZFS filesystem version 5" to the latest 0.6.5 (which requires a valid Linux tree), but I don't see any difference: I still have zpool version 28 and zfs version 5.

Quote:

You say that all information you monitor doesn't helpful, now i don't know what information to request. IOPS monitor than error appears, CPU load, memory load, /proc/mdstat - but you say there is nothing useful


When the slowness occurs, the system is so slow that I cannot launch any of those commands manually; when the system becomes fluid again, there is no trace of any issue. Things like Munin are just useless: they run once every 5 min and simply miss the train. In the best case I can see the load getting high, but even then the CPU remains 75% idle and the disk LED blinks slowly (1 to 3 short flashes per second). When I type anything in a console, letters take 20 s to 3 min to appear on screen; by the time the process is started, the whole system is fluid again, and the command has also missed the train. Commands that run long term, like top, just freeze for 1 to 10 min, and I can't see anything useful. Last time, I watched my screen carefully for 12 minutes; in that time the clocks (top and gkrellm) advanced 5 times. Top was also refreshed 5 times in 12 min. I could write pages about what I have tried to track down the issue. I am not here to talk about what failed, but to learn new methods for tracking problems.
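
Since nothing can be started once a freeze has begun, one option is a tiny sampler that is already running before it happens. A high load with an idle CPU usually means tasks stuck in uninterruptible sleep (state D), and those count toward the load average, so the sketch below logs exactly that: a timestamp, the load average, and every task currently in state D together with the kernel function it is waiting in. This is only a sketch: snapshot is a hypothetical helper name, it reads /proc directly, and the wchan column may be empty or 0 unless run as root.

```shell
#!/bin/sh
# Sketch: print one snapshot of the load average plus every task that is in
# uninterruptible sleep (state D), i.e. blocked inside the kernel, usually on I/O.
# "snapshot" is a hypothetical helper name chosen for this example.
snapshot() {
    # Timestamp and the three load averages from /proc/loadavg.
    printf '%s load: %s\n' "$(date '+%F %T')" "$(cut -d' ' -f1-3 /proc/loadavg)"

    # One awk pass over all /proc/PID/status files. "Name:" precedes "State:"
    # in each file, so name is already set when the state line is seen.
    # Names containing spaces are cut at the first space.
    awk '
        /^Name:/ { name = $2 }
        /^State:/ && $2 == "D" {
            split(FILENAME, part, "/")      # part[3] is the PID
            wfile = "/proc/" part[3] "/wchan"
            wchan = "?"
            if ((getline line < wfile) > 0) wchan = line
            close(wfile)
            printf "  blocked: pid=%s name=%s wchan=%s\n", part[3], name, wchan
        }
    ' /proc/[0-9]*/status 2>/dev/null

    # Processes can vanish while awk runs; do not treat that as a failure.
    return 0
}
```

It could be left running in a loop such as `while :; do snapshot; sleep 1; done >> /dev/shm/freeze-trace.log` (the log path is an assumption; a tmpfs location avoids writing to the disks that may be stalling). Even if the loop itself stalls during a freeze, the gaps between timestamps and the last blocked tasks logged before each gap are a starting point.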

I will try to rebuild the kernel to add iotop support, but it may take days to fix all the side effects that implies.

MRTG, Cacti, Munin and friends can't help. Anything based on cron is useless: as I said, they miss the train. I know Thunderbird is part of the problem, but freezes did not occur between Jan 2010 and May 2014. Giving up TB is not an option.

Hmm ... after re-emerging a few things, iotop now works. So, what are we looking for in there?

iotop only shows the process that is doing the work; what I would like to see is which file it is reading or writing. Lsof does not show that either.
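
A partial answer may be in /proc: for each open file descriptor of a process, /proc/PID/fd holds a link to the file and /proc/PID/fdinfo holds the current file offset ("pos:"). Taking two listings a few seconds apart and comparing them shows which files are being read or written sequentially. The sketch below assumes that approach; fd_offsets is a hypothetical helper name, and the offset only moves for plain read/write calls, not for pread/pwrite or mmap access, so it will not catch everything.

```shell
#!/bin/sh
# Sketch: list the regular files a process holds open, with the current
# file offset of each descriptor. "fd_offsets" is a hypothetical helper name.
# Usage: fd_offsets PID   (needs permission to read /proc/PID/fd)
fd_offsets() {
    pid=$1
    for fd in /proc/"$pid"/fd/*; do
        # Resolve the descriptor to the path it points at.
        target=$(readlink "$fd" 2>/dev/null) || continue
        # Keep only real paths; skip sockets, pipes and anonymous inodes.
        case $target in
            /*) ;;
            *) continue ;;
        esac
        # The "pos:" line in fdinfo is the current file offset in bytes.
        pos=$(awk '/^pos:/ { print $2 }' "/proc/$pid/fdinfo/${fd##*/}" 2>/dev/null)
        printf '%s\t%s\n' "${pos:-?}" "$target"
    done
}
```

For example, with the PID that iotop reports, `fd_offsets 1234 > a; sleep 5; fd_offsets 1234 > b; diff a b` (1234 is a placeholder) lists the descriptors whose offset changed in between.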

We have moved one step forward today.

Quote:
Did you check your disks for bad blocks?


A monthly cron job runs a zfs scrub and a full mdadm resync.
schorsch_76
Guru


Joined: 19 Jun 2012
Posts: 452

Posted: Fri Dec 11, 2015 9:52 am

You need to set the kernel options according to [1]

CONFIG_TASKSTATS
CONFIG_TASK_DELAY_ACCT
CONFIG_TASK_IO_ACCOUNTING

[1] http://linux.die.net/man/1/iotop

These changes require rebuilding your kernel and your out-of-tree kernel modules (maybe ZFS). As a safety net, back up your kernel, initrd and the modules folder.

Code:

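# copy everything in /boot that matches the running kernel release
mkdir /root/kernel-backup
cp /boot/*`uname -r`* /root/kernel-backup
# copy the matching module tree (note the space before the destination)
cp -R /lib/modules/`uname -r` /root/kernel-backup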


Please verify that vmlinuz, initrd, System.map and your module folder are in the kernel backup.
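
That check could be scripted roughly as follows. This is only a sketch: check_kernel_backup is a hypothetical helper name, and it assumes the common naming where the kernel, initrd/initramfs and System.map file names all contain the release string and the module tree was copied as a directory named after the release. Adjust the patterns if your files are named differently.

```shell
#!/bin/sh
# Sketch: check that a backup directory holds a kernel image, an initrd,
# a System.map and a module tree for the given release.
# "check_kernel_backup" is a hypothetical helper name.
# Usage: check_kernel_backup DIR RELEASE
check_kernel_backup() {
    dir=$1
    rel=$2
    missing=0
    for prefix in vmlinuz initr System.map; do
        # Pass if at least one file starts with the prefix and names the release.
        found=0
        for f in "$dir"/"$prefix"*"$rel"*; do
            [ -e "$f" ] && found=1
        done
        if [ "$found" -eq 0 ]; then
            echo "missing: $prefix*$rel*"
            missing=1
        fi
    done
    # cp -R /lib/modules/RELEASE DIR leaves the tree in DIR/RELEASE.
    [ -d "$dir/$rel" ] || { echo "missing: module directory $rel"; missing=1; }
    return "$missing"
}
```

For the backup above it would be called as `check_kernel_backup /root/kernel-backup "$(uname -r)"`; it prints nothing and returns 0 when all four pieces are present.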

After this, you can save the folder "kernel-backup" to a USB flash drive.

Then recompile the kernel with the above options, install it and its modules, build the out-of-tree modules (maybe ZFS), rebuild the initrd and reboot.

If something goes wrong, you can use SystemRescueCd to redeploy the saved kernel files and modules.

With this approach you can keep your current kernel version and don't need to update it. Just include the needed options.

perf itself is described on the linked page.
[2] https://perf.wiki.kernel.org/index.php/Main_Page
doublehp
Guru


Joined: 11 Apr 2005
Posts: 473
Location: FRANCE

Posted: Fri Dec 11, 2015 12:09 pm

Code:
# zcat /proc/config.gz  | grep -i TASK
CONFIG_TASKSTATS=y
CONFIG_TASK_DELAY_ACCT=y
CONFIG_TASK_XACCT=y
CONFIG_TASK_IO_ACCOUNTING=y
# CONFIG_IDE_TASK_IOCTL is not set


My setup is way more complex than a simple vmlinuz, due to RAID and other details. I give each kernel a specific version name, to prevent rebuilds from overwriting each other. Before making any change to the config, I manually edit the kernel name:

Code:
CONFIG_LOCALVERSION="-Gentoo-uranus-1-50"


Most of the rest is done by scripts; for example, due to RAID and other small details, I have four boot partitions. Build-time scripts put the kernel in /boot (on top of RAID); then shutdown scripts sync it to a BIOS-accessible non-RAID partition. This is designed to reduce the risk of breaking the MBR. And since 2010 my Gentoo has never suffered from a broken MBR. But it does break further along in the boot process, usually at startx.

Next?

iotop is running; waiting for the next freeze to see how it behaves.
schorsch_76
Guru


Joined: 19 Jun 2012
Posts: 452

Posted: Fri Dec 11, 2015 5:05 pm

You see, when I do something like this,

Code:
ls /boot/*`uname -r`*

I get

Code:
ls /boot/*`uname -r`*
/boot/System.map-4.1.3-slim      /boot/initramfs-genkernel-x86_64-4.1.3-slim
/boot/System.map-4.1.3-slim.old  /boot/initrd-4.1.3-slim.img.gz
/boot/config-4.1.3-slim          /boot/vmlinuz-4.1.3-slim
/boot/config-4.1.3-slim.old      /boot/vmlinuz-4.1.3-slim.old


That is why I posted those two lines to back up your whole kernel setup. Even when the kernel has a "more complex name", uname -r will report it.

It is not complex. It is just a kernel that follows the same rules, even if it dates from 2010.

My output is:
Code:
zcat /proc/config.gz | grep LOCALVERSION
CONFIG_LOCALVERSION="-slim"
# CONFIG_LOCALVERSION_AUTO is not set


And now guess what your uname -r will tell you?