Is my SSD dying ?

destroyedlolo · Posted: Sun Aug 06, 2023 11:07 am Post subject: Is my SSD dying ?

Hello,

For the 2nd time in a week, my BananaPI lost its disk.

Looking at the log, I can only see:

NeddySeagoon · Posted: Sun Aug 06, 2023 11:25 am Post subject:

destroyedlolo,

Step 1. Run

destroyedlolo · Posted: Sun Aug 06, 2023 11:33 am Post subject:

Can I run smartctl with the disk mounted (it's the system disk)?

Unfortunately, it doesn't appear in /dev: I have no /dev/sd??? anymore.

I will try again in a few minutes, in case keeping it unpowered for a while helps (it did after the first incident).

NeddySeagoon · Posted: Sun Aug 06, 2023 12:13 pm Post subject:

destroyedlolo,

Yes. You need its device node. Mounted is OK.
smartctl reads a couple of data blocks from the drive firmware, so it's harmless to filesystems.

You can do it on any system. USB/SATA converters can get in the way.

Some of my SSDs do trim at power up and do not come ready until trim completes.
That can be scary. I've had to leave one drive powered up for over 24 hours before it appeared in /dev.
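
If you would rather not keep checking /dev by hand while the drive sorts itself out, a small polling loop can do the waiting. This is only a sketch: the `wait_for_disk` name, the `/dev/sd?` pattern and the timings are assumptions for illustration, so adjust them to your own system.

```python
import glob
import time


def wait_for_disk(pattern="/dev/sd?", timeout_s=60.0, poll_s=1.0):
    """Poll until at least one path matches `pattern`.

    Returns the sorted list of matching device nodes, or an empty
    list if nothing showed up before `timeout_s` seconds elapsed.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        nodes = sorted(glob.glob(pattern))
        if nodes:
            return nodes
        if time.monotonic() >= deadline:
            return []
        time.sleep(poll_s)
```

For a drive that may take a day to come ready, something like `wait_for_disk(timeout_s=26 * 3600, poll_s=30)` would cover it. An empty result only means no node appeared in that window, not that the drive is dead.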
_________________
Regards,

NeddySeagoon

Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail.

destroyedlolo · Posted: Sun Aug 06, 2023 3:43 pm Post subject:

I left the disk powered off for 2 hours. At boot it survived for 1 minute, then disappeared again.

NeddySeagoon · Posted: Sun Aug 06, 2023 4:56 pm Post subject:

destroyedlolo,

Given the hint that points to an interface problem, try a different SATA data cable, or a different SATA port on the motherboard or both.

If you don't have a spare SATA data cable, try plugging and unplugging the data cable a couple of times to 'wipe' the contacts.
Do both ends.

destroyedlolo · Posted: Sun Aug 06, 2023 5:35 pm Post subject:

I tried another cable: no change.
The BananaPI has only one SATA port and I can't swap this device without a large impact. So I tried the easiest way: swapping the disk with another one (mechanical). It's working.

So, this disk is dead.

By the way, this system is running a very old Gentoo with a 3.14 kernel (I can't upgrade Gentoo as I'm stuck with PHP 5.4, and I don't even have a suitable DTB to upgrade the kernel). This disk contains only data (database, website, ...) and logs. I need to check whether this ancient kernel can run zram to save the disk's endurance.

NeddySeagoon · Posted: Sun Aug 06, 2023 6:43 pm Post subject:

destroyedlolo,

End-of-life SSDs normally go read-only, so I suspect it's not that.

Are you able to connect it to another system? Over USB if you need to.
USB3 with UAS support would be best.

Any arch will do as you do not want to run the code on the drive.

destroyedlolo · Posted: Sun Aug 06, 2023 7:00 pm Post subject:

That's what I was thinking too.

But my priority is to get the system running again (otherwise I'm without my home automation). So I'll buy another disk and run some stress tests on this one on another system.

destroyedlolo · Posted: Mon Aug 07, 2023 9:49 am Post subject:

Hi @NeddySeagoon,

I think you were right: it seems the problem was due to a weak contact on the bPI power supply.
I reinforced the connection and the disk survived for 10h (a record since this issue appeared).

Surprisingly, neither the SBC itself nor its PMU crashed or raised an alert; it was the SATA controller that failed.

Crossing my fingers for the coming hours :wink:

NeddySeagoon · Posted: Mon Aug 07, 2023 11:16 am Post subject:

destroyedlolo,

Get that SMART data while you can.

destroyedlolo · Posted: Mon Aug 07, 2023 7:45 pm Post subject:

So my system has been up for 23h now, and counting.

Here is the output of smartctl:

NeddySeagoon · Posted: Mon Aug 07, 2023 8:10 pm Post subject:

destroyedlolo,

The VALUE, WORST and THRESH columns are all normalised.
If VALUE or WORST <= THRESH, that parameter has failed.
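
That rule is easy to apply mechanically. The sketch below assumes the usual column layout of the `smartctl -A` attribute table (ID#, ATTRIBUTE_NAME, FLAG, VALUE, WORST, THRESH, ...). The `failed_attributes` helper is a made-up name and the sample rows are invented for illustration; they are not taken from the drive in this thread.

```python
def failed_attributes(table_text):
    """Return names of attributes where VALUE or WORST <= THRESH."""
    failed = []
    for line in table_text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID; skip headers and blanks.
        if len(fields) < 6 or not fields[0].isdigit():
            continue
        name = fields[1]
        try:
            value, worst, thresh = (int(f) for f in fields[3:6])
        except ValueError:
            # Skip rows whose normalised columns are not plain numbers.
            continue
        if value <= thresh or worst <= thresh:
            failed.append(name)
    return failed


# Invented sample rows, for illustration only.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       4321
177 Wear_Leveling_Count     0x0013   009   009   010    Pre-fail  Always       -       990
"""

print(failed_attributes(SAMPLE))  # ['Wear_Leveling_Count']
```

Only the last invented row trips the rule, because its normalised value of 9 is at or below its threshold of 10. RAW_VALUE is deliberately ignored here, since the rule only concerns the normalised columns.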

destroyedlolo · Posted: Mon Aug 07, 2023 10:20 pm Post subject:

Thanks for your detailed explanation. Much appreciated.

The high write vs read ratio is not due to portage (this system has been frozen for years because of PHP's lack of upward compatibility) but because this machine hosts a database containing all my smart home figures. With a 600-year life expectancy, I'm safe for a while :wink:
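
For anyone wanting to redo this kind of estimate on their own drive, one common approach is to divide the remaining rated write endurance by the average write rate so far. The sketch below is not necessarily how the figure above was derived. The `years_remaining` helper and every number in the example are hypothetical; take the real values from the drive's datasheet and its SMART output.

```python
def years_remaining(rated_tbw, written_tb, power_on_hours):
    """Estimate the years of writes left at the historical write rate.

    rated_tbw      -- endurance rating from the datasheet, in TB written
    written_tb     -- total TB written so far
    power_on_hours -- hours the drive has been powered on
    """
    if written_tb <= 0 or power_on_hours <= 0:
        raise ValueError("need positive written_tb and power_on_hours")
    hours_per_year = 24 * 365
    years_in_service = power_on_hours / hours_per_year
    tb_per_year = written_tb / years_in_service
    return (rated_tbw - written_tb) / tb_per_year


# Hypothetical drive: rated for 150 TBW, 1.5 TB written in 2 years of uptime.
print(years_remaining(rated_tbw=150, written_tb=1.5, power_on_hours=17520))  # 198.0
```

This only extrapolates flash wear from past writes. It says nothing about connectors, power supplies or controller faults.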

destroyedlolo · Posted: Sat Sep 09, 2023 10:10 pm Post subject:

Hello,

Back to this old issue: the problem was oxidation of the SATA connector.
After some cleaning, it has now been running for a month without issue.

It's the first time I've encountered this kind of issue.

Thanks for the help and the explanation.

NeddySeagoon · Posted: Sun Sep 10, 2023 9:35 am Post subject:

destroyedlolo,

That will fix it for 6 to 9 months. Thank you for the update.
It's probably the gold worn off the SATA data cable. Next time, replace the cable.