From: Vincent Olivier <vincent@up4.com>
To: linux-btrfs@vger.kernel.org
Subject: Re: RAID10 Balancing Request for Comments and Advices
Date: Wed, 17 Jun 2015 09:46:50 -0400
Message-ID: <74449C35-BA4E-4476-9EA1-EFE66312AFFA@up4.com>
In-Reply-To: <pan$966fb$92ec2f1c$291706f7$cf1b6c68@cox.net>



> On Jun 16, 2015, at 7:58 PM, Duncan <1i5t5.duncan@cox.net> wrote:
> 
> Vincent Olivier posted on Tue, 16 Jun 2015 09:34:29 -0400 as excerpted:
> 
> 
>>> On Jun 16, 2015, at 8:25 AM, Hugo Mills <hugo@carfax.org.uk> wrote:
>>> 
>>> On Tue, Jun 16, 2015 at 08:09:17AM -0400, Vincent Olivier wrote:
>>>> 
>>>> My first question is this: is it normal to have “single” blocks?
>>>> Why not only RAID10? I don’t remember the exact mkfs options I used,
>>>> but I certainly didn’t ask for “single”, so this is unexpected.
>>> 
>>> Yes. It's an artefact of the way that mkfs works. If you run a
>>> balance on those chunks, they'll go away. (btrfs balance start
>>> -dusage=0 -musage=0 /mountpoint)
>> 
>> Thanks! I did and it did go away, except for the "GlobalReserve, single:
>> total=512.00MiB, used=0.00B”. But I suppose this is a permanent fixture,
>> right?
> 
> Yes.  GlobalReserve is for short-term btrfs-internal use, reserved for
> times when btrfs needs to (temporarily) allocate some space in order to
> free space, etc.  It's always single, and you'll rarely see anything but
> 0 used except perhaps in the middle of a balance or something.


Got it. Thanks.
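For the record, the “single” line I pasted came from the usual space report; after the zero-usage balance, a quick check is just something like this (mount point is a placeholder):

    # confirm the empty "single" data/metadata chunks are gone
    btrfs filesystem df /mnt/array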

Is there any way to put that on another device, say an SSD? I am thinking of backing up this RAID10 onto a 2x8TB device-managed SMR RAID1, and I want to minimize random write operations (noatime et al.; see the mount sketch below). I will probably start a new thread for that, but first: is there anything substantial I can read about btrfs+SMR? Or should I avoid SMR+btrfs altogether?
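Just to make the write-minimizing part concrete, this is roughly the mount I have in mind (device name and mount point are placeholders):

    # mount the SMR backup pair without atime updates, so reads don't
    # trigger metadata writes
    mount -o noatime /dev/sdX /mnt/backup

    # or, equivalently, a permanent entry in /etc/fstab:
    # /dev/sdX  /mnt/backup  btrfs  noatime  0  0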


> 
>>> For maintenance, I would suggest running a scrub regularly, to
>>> check for various forms of bitrot. Typical frequencies for a scrub are
>>> once a week or once a month -- opinions vary (as do runtimes).
>> 
>> 
>> Yes. I cronned it weekly for now. Takes about 5 hours. Is it
>> automatically corrected on RAID10, since a copy of it exists within the
>> filesystem? What happens for RAID0?
> 
> For raid10 (and the raid1 I use), yes, it's corrected, from the other
> existing copy, assuming it's good, tho if there are metadata checksum
> errors, there may be corresponding unverified checksums as well, where
> the verification couldn't be done because the metadata containing the
> checksums was bad.  Thus, if there are errors found and corrected, and
> you see unverified errors as well, rerun the scrub, so the newly
> corrected metadata can now be used to verify the previously unverified
> errors.


OK then, rule of thumb: re-run the scrub on “unverified checksum error(s)”. I have yet to see any checksum errors, but I will keep it in mind. Concretely, I figure the loop looks roughly like the sketch below.
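A minimal sketch of what I have in mind (mount point is a placeholder, and I am assuming my btrfs-progs supports the -R raw-statistics flag for scrub status):

    # kick off a scrub and check the per-device error counters afterwards
    btrfs scrub start /mnt/array
    btrfs scrub status -R /mnt/array   # raw stats include unverified_errors

    # if unverified errors remain alongside corrected ones, run it again
    btrfs scrub start /mnt/array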

> 
> I'm presently getting a lot of experience with this as one of the ssds in
> my raid1 is gradually failing and rewriting sectors.  Generally what
> happens is that the ssd will take too long, triggering a SATA reset (30
> second timeout), and btrfs will call that an error.  The scrub then
> rewrites the bad copy on the unreliable device with the good copy from
> the more reliable device, with the write triggering a sector relocation
> on the bad device.  The newly written copy then checks out good, but if
> it was metadata, it very likely contained checksums for several other
> blocks, which couldn't be verified because the block containing their
> checksums was itself bad.  Typically I'll see dozens to a couple hundred
> unverified errors for every bad metadata block rewritten in this way.
> Rerunning the scrub then either verifies or fixes the previously
> unverified blocks, tho sometimes one of those in turn ends up bad and if
> it's a metadata block, I may end up rerunning the scrub another time or
> two, until everything checks out.
> 
> FWIW, on the bad device, smartctl -A reports (excerpted):
> 
> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
>  5 Reallocated_Sector_Ct   0x0032   098   098   036    Old_age   Always       -       259
> 182 Erase_Fail_Count_Total  0x0032   100   100   000    Old_age   Always       -       132
> 
> While on the paired good device:
> 
>  5 Reallocated_Sector_Ct   0x0032   253   253   036    Old_age   Always       -       0
> 182 Erase_Fail_Count_Total  0x0032   253   253   000    Old_age   Always       -       0
> 
> Meanwhile, smartctl -H has already warned once that the device is
> failing, tho it went back to passing status again, but as of now it's
> saying failing, again.  The attribute that actually registers as failing,
> again from the bad device, followed by the good, is:
> 
>  1 Raw_Read_Error_Rate     0x000f   001   001   006    Pre-fail  Always   FAILING_NOW 3081
> 
>  1 Raw_Read_Error_Rate     0x000f   160   159   006    Pre-fail  Always       -       41
> 
> When it's not actually reporting failing, the FAILING_NOW status is
> replaced with IN_THE_PAST.
> 
> 250 Read_Error_Retry_Rate is the other attribute of interest, with values
> of 100 current and worst for both devices, threshold 0, but a raw value
> of 2488 for the good device and over 17,000,000 for the failing device.
> But with the "cooked" value never moving from 100 and with no real
> guidance on how to interpret the raw values, while it's interesting,
> I am left relying on the others for indicators I can actually understand.
> 
> The 5 and 182 raw counts have been increasing gradually over time, and I
> scrub every time I do a major update, with another reallocated sector or
> two often appearing.  But as long as the paired good device keeps its zero
> count and I have backups (as I do!), btrfs is actually allowing me to
> continue using the unreliable device, relying on btrfs checksums and
> scrubbing to keep it usable.  And FWIW, I do have another device ready to
> go in when I decide I've had enough of this, but as long as I have
> backups and btrfs scrub keeps things fixed up, there's no real hurry
> unless I decide I'm tired of dealing with it.  Meanwhile, I'm having a
> bit of morbid fun watching as it slowly decays, getting experience of
> the process in a reasonably controlled setting without serious danger
> to my data, since it is backed up.


You sure have morbid inclinations! ;-)

Out of curiosity, what are the frequency and sequence of your smartctl long/short tests and btrfs scrubs? Is it all automated? For my own setup, I was picturing something like the crontab sketch below.
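This is only a sketch of what I mean by automated; the device, mount point and schedule are placeholders, and the binary paths may differ per distribution (entries as they would appear in root's crontab):

    # weekly btrfs scrub, Sunday 03:00 (-B: wait for completion, -q: quiet)
    0 3 * * 0  /sbin/btrfs scrub start -Bq /mnt/array

    # daily short SMART self-test, monthly long self-test
    0 2 * * *  /usr/sbin/smartctl -t short /dev/sdX
    0 4 1 * *  /usr/sbin/smartctl -t long  /dev/sdX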


> As for raid0 (and single), there's only one copy.  Btrfs detects checksum
> failure as it does above, but since there's only the one copy, if it's
> bad, well, for data you simply can't access that file any longer.  For
> metadata, you can't access whatever directories and files it referenced,
> any longer.  (FWIW for the truly desperate who hope that at least some of
> it can be recovered even if it's not a bit-perfect match, there's a btrfs
> command that wipes the checksum tree, which will let you access the
> previously bad-checksum files again, but it works on the entire
> filesystem so it's all or nothing, and of course with known corruption,
> there's no guarantees.)

But is it possible to manually correct the corruption by overwriting the corrupted files with a copy from a backup? I mean, is there enough information reported to identify which files are affected? Something like the sketch below is what I am imagining.
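This is what I am picturing, assuming the kernel log names the affected path the way scrub checksum errors are usually reported (paths are placeholders):

    # scrub (or a failed read) should log the offending file in the kernel log
    dmesg | grep -i 'checksum error'

    # then replace the corrupt copy with a known-good one from the backup
    cp -a /mnt/backup/path/to/file /mnt/array/path/to/file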

thanks!

v

