* RAID10 Balancing Request for Comments and Advices
@ 2015-06-16 12:09 Vincent Olivier
  2015-06-16 12:25 ` Hugo Mills
  0 siblings, 1 reply; 11+ messages in thread
From: Vincent Olivier @ 2015-06-16 12:09 UTC (permalink / raw)
  To: linux-btrfs

Hello,

I have a CentOS 7 machine running the latest EPEL kernel-ml (4.0.5), with a 6-disk 4TB HGST RAID10 btrfs volume mounted with the following options:

noatime,compress=zlib,space_cache 0 2


"btrfs filesystem df” gives :


Data, RAID10: total=7.08TiB, used=7.02TiB
Data, single: total=8.00MiB, used=0.00B
System, RAID10: total=7.88MiB, used=656.00KiB
System, single: total=4.00MiB, used=0.00B
Metadata, RAID10: total=9.19GiB, used=7.56GiB
Metadata, single: total=8.00MiB, used=0.00B
GlobalReserve, single: total=512.00MiB, used=0.00B


My first question is: is it normal to have “single” blocks? Why not RAID10 only? I don’t remember the exact mkfs options I used, but I certainly didn’t ask for “single”, so this is unexpected.

My second question is: what is the best device add / balance sequence to use if I want to add 2 more disks to this RAID10 volume? Also, is a balance necessary at all since I’m adding a pair?

My third question is: given that this file system is an offline backup for another RAID0 volume with SMB sharing, what is the best maintenance schedule as long as it stays offline? For now, I only have a weekly cron scrub, but I think the priority is to have it balanced after a send-receive or rsync, to optimize storage space availability (over performance). Is there a “light” balancing method recommended in this case?

My fourth question, still within the same context: are there best practices when using smartctl for periodically testing (long test, short test) btrfs RAID devices?

Thanks!

Vincent


* Re: RAID10 Balancing Request for Comments and Advices
  2015-06-16 12:09 RAID10 Balancing Request for Comments and Advices Vincent Olivier
@ 2015-06-16 12:25 ` Hugo Mills
  2015-06-16 13:34   ` Vincent Olivier
  0 siblings, 1 reply; 11+ messages in thread
From: Hugo Mills @ 2015-06-16 12:25 UTC (permalink / raw)
  To: Vincent Olivier; +Cc: linux-btrfs

On Tue, Jun 16, 2015 at 08:09:17AM -0400, Vincent Olivier wrote:
> Hello,
> 
> I have a Centos 7 machine with the latest EPEL kernel-ml (4.0.5) with a 6-disk 4TB HGST RAID10 btrfs volume. With the following mount options :
> 
> noatime,compress=zlib,space_cache 0 2
> 
> 
> "btrfs filesystem df” gives :
> 
> 
> Data, RAID10: total=7.08TiB, used=7.02TiB
> Data, single: total=8.00MiB, used=0.00B
> System, RAID10: total=7.88MiB, used=656.00KiB
> System, single: total=4.00MiB, used=0.00B
> Metadata, RAID10: total=9.19GiB, used=7.56GiB
> Metadata, single: total=8.00MiB, used=0.00B
> GlobalReserve, single: total=512.00MiB, used=0.00B

> My first question is this : is it normal to have “single” blocks ?
> Why not only RAID10? I don’t remember the exact mkfs options I used
> but I certainly didn’t ask for “single” so this is unexpected.

   Yes. It's an artefact of the way that mkfs works. If you run a
balance on those chunks, they'll go away. (btrfs balance start
-dusage=0 -musage=0 /mountpoint)

> My second question is : what is the best device add / balance sequence to use if I want to add 2 more disks to this RAID10 volume? Also is a balance necessary at all since I’m adding a pair?

   Add both devices first, then balance.

   For a RAID-1 filesystem, adding two devices wouldn't need a balance
to get full usage out of the new devices. However, you've got RAID-10,
so the most you'd be able to get on the FS without a balance is four
times the remaining space on one of the existing disks.

   The chunk allocator for RAID-10 will allocate as many chunks as it
can in an even number across all the devices, omitting the device with
the smallest free space if there's an odd number of devices. It must
have space on at least four devices, so adding two devices means that
it'll have to have free space on at least two of the existing ones
(and will try to use all of them).

   So yes, unless you're adding four devices, a rebalance is required
here.
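
   In concrete terms, the sequence would be something like this
(/dev/sdg and /dev/sdh below are just stand-ins for whatever names the
new disks actually get):

# add both new disks to the mounted filesystem
btrfs device add /dev/sdg /dev/sdh /mountpoint

# then spread the existing chunks across all eight devices
btrfs balance start /mountpoint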

> My third question is: given that this file system is an offline
> backup for another RAID0 volume with SMB sharing, what is the best
> maintenance schedule as long as it is offline? For now, I only have
> a weekly cron scrub now, but I think that the priority is to have it
> balanced after a send-receive or rsync to optimize storage space
> availability (over performance). Is there a “light” balancing method
> recommended in this case?

   You don't need to balance after send/receive or rsync. If you find
that you have lots of data space allocated but not used (the first
line in btrfs fi df, above), *and* metadata close to usage (within,
say, 700 MiB), *and* no unallocated space (btrfs fi show), then it's
worth running a filtered balance with -dlimit=3 or some similar small
value to free up some space that the metadata can expand into. Other
than that, it's pretty much entirely pointless.
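
   For example (again with /mountpoint standing in for the real path):

btrfs balance start -dlimit=3 /mountpoint

This rewrites at most three data block groups and then stops, which is
usually enough to give the metadata room to grow into.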

   For maintenance, I would suggest running a scrub regularly, to
check for various forms of bitrot. Typical frequencies for a scrub
are once a week or once a month -- opinions vary (as do runtimes).
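
   If you want the scrub in cron, something along these lines is a
reasonable sketch (the path is a placeholder; -B keeps it in the
foreground so cron mails you the summary, -d prints per-device stats):

#!/bin/sh
# /etc/cron.weekly/btrfs-scrub -- sketch
btrfs scrub start -Bd /mountpoint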

> My fourth question, still within the same context: are there best
> practices when using smartctl for periodically testing (long test,
> short test) btrfs RAID devices?

   I can't answer that one, I'm afraid.

   Hugo.

-- 
Hugo Mills             | Welcome to Rivendell, Mr Anderson...
hugo@... carfax.org.uk |
http://carfax.org.uk/  |
PGP: E2AB1DE4          |                            Machinae Supremacy, Hybrid


* Re: RAID10 Balancing Request for Comments and Advices
  2015-06-16 12:25 ` Hugo Mills
@ 2015-06-16 13:34   ` Vincent Olivier
  2015-06-16 23:58     ` Duncan
  0 siblings, 1 reply; 11+ messages in thread
From: Vincent Olivier @ 2015-06-16 13:34 UTC (permalink / raw)
  To: linux-btrfs

> 
> On Jun 16, 2015, at 8:25 AM, Hugo Mills <hugo@carfax.org.uk> wrote:
> 
> On Tue, Jun 16, 2015 at 08:09:17AM -0400, Vincent Olivier wrote:
>> 
>> "btrfs filesystem df” gives :
>> 
>> 
>> Data, RAID10: total=7.08TiB, used=7.02TiB
>> Data, single: total=8.00MiB, used=0.00B
>> System, RAID10: total=7.88MiB, used=656.00KiB
>> System, single: total=4.00MiB, used=0.00B
>> Metadata, RAID10: total=9.19GiB, used=7.56GiB
>> Metadata, single: total=8.00MiB, used=0.00B
>> GlobalReserve, single: total=512.00MiB, used=0.00B
> 
>> My first question is this : is it normal to have “single” blocks ?
>> Why not only RAID10? I don’t remember the exact mkfs options I used
>> but I certainly didn’t ask for “single” so this is unexpected.
> 
>  Yes. It's an artefact of the way that mkfs works. If you run a
> balance on those chunks, they'll go away. (btrfs balance start
> -dusage=0 -musage=0 /mountpoint)



Thanks! I did and it did go away, except for the “GlobalReserve, single: total=512.00MiB, used=0.00B”. But I suppose this is a permanent fixture, right?



> 
>> My second question is : what is the best device add / balance sequence to use if I want to add 2 more disks to this RAID10 volume? Also is a balance necessary at all since I’m adding a pair?
> 
>  Add both devices first, then balance.
> 
>  For a RAID-1 filesystem, adding two devices wouldn't need a balance
> to get full usage out of the new devices. However, you've got RAID-10,
> so the most you'd be able to get on the FS without a balance is four
> times the remaining space on one of the existing disks.
> 
>  The chunk allocator for RAID-10 will allocate as many chunks as it
> can in an even number across all the devices, omitting the device with
> the smallest free space if there's an odd number of devices. It must
> have space on at least four devices, so adding two devices means that
> it'll have to have free space on at least two of the existing ones
> (and will try to use all of them).
> 
>  So yes, unless you're adding four devices, a rebalance is required
> here.


It is perfectly clear and logical that 1+0 works on four devices at a time.


>> My third question is: given that this file system is an offline
>> backup for another RAID0 volume with SMB sharing, what is the best
>> maintenance schedule as long as it is offline? For now, I only have
>> a weekly cron scrub now, but I think that the priority is to have it
>> balanced after a send-receive or rsync to optimize storage space
>> availability (over performance). Is there a “light” balancing method
>> recommended in this case?
> 
>  You don't need to balance after send/receive or rsync. If you find
> that you have lots of data space allocated but not used (the first
> line in btrfs fi df, above), *and* metadata close to usage (within,
> say, 700 MiB), *and* no unallocated space (btrfs fi show), then it's
> worth running a filtered balance with -dlimit=3 or some similar small
> value to free up some space that the metadata can expand into. Other
> than that, it's pretty much entirely pointless.


Ok thanks. Is there a btrfs-utils way of automating the “if less than 1 GiB free, do balance -dlimit=3” check?


>  For maintenance, I would suggest running a scrub regularly, to
> check for various forms of bitrot. Typical frequencies for a scrub
> are once a week or once a month -- opinions vary (as do runtimes).


Yes. I cronned it weekly for now. Takes about 5 hours. Is it automatically corrected on RAID10, since a copy of it exists within the filesystem? What happens for RAID0?

Thanks!

V

* Re: RAID10 Balancing Request for Comments and Advices
  2015-06-16 13:34   ` Vincent Olivier
@ 2015-06-16 23:58     ` Duncan
  2015-06-17  0:14       ` Chris Murphy
  2015-06-17 13:46       ` Vincent Olivier
  0 siblings, 2 replies; 11+ messages in thread
From: Duncan @ 2015-06-16 23:58 UTC (permalink / raw)
  To: linux-btrfs

Vincent Olivier posted on Tue, 16 Jun 2015 09:34:29 -0400 as excerpted:


>> On Jun 16, 2015, at 8:25 AM, Hugo Mills <hugo@carfax.org.uk> wrote:
>> 
>> On Tue, Jun 16, 2015 at 08:09:17AM -0400, Vincent Olivier wrote:
>>> 
>>> My first question is this : is it normal to have “single” blocks ?
>>> Why not only RAID10? I don’t remember the exact mkfs options I used
>>> but I certainly didn’t ask for “single” so this is unexpected.
>> 
>>  Yes. It's an artefact of the way that mkfs works. If you run a
>> balance on those chunks, they'll go away. (btrfs balance start
>> -dusage=0 -musage=0 /mountpoint)
> 
> Thanks! I did and it did go away, except for the "GlobalReserve, single:
> total=512.00MiB, used=0.00B”. But I suppose this is a permanent fixture,
> right?

Yes.  GlobalReserve is for short-term btrfs-internal use, reserved for 
times when btrfs needs to (temporarily) allocate some space in order to 
free space, etc.  It's always single, and you'll rarely see anything but 
0 used except perhaps in the middle of a balance or something.

>>  You don't need to balance after send/receive or rsync. If you find
>> that you have lots of data space allocated but not used (the first line
>> in btrfs fi df, above), *and* metadata close to usage (within, say, 700
>> MiB), *and* no unallocated space (btrfs fi show), then it's worth
>> running a filtered balance with -dlimit=3 or some similar small value
>> to free up some space that the metadata can expand into. Other than
>> that, it's pretty much entirely pointless.
> 
> Ok thanks. Is there a btrfs-utils way of automating the "if less than
> 1Gb free do balance -dlimit=3” ?

On a current kernel, unlike older ones, btrfs automatically reclaims 
entirely empty chunks, so this problem doesn't occur nearly as often as 
it used to.  However, it's still possible to have mostly but 
not entirely empty chunks that btrfs won't automatically reclaim.  A 
balance can be used to rewrite and combine these mostly empty chunks, 
reclaiming the space saved.  This is what Hugo was recommending.  Mostly-
empty chunk rebalance and reclaim is not (yet?) entirely automated, but 
in most use-cases it's not something you need to worry about that much.  
You do it if you notice a huge difference in available vs used in btrfs 
fi df, or if btrfs fi show drops below several gigs available, or if you 
start getting nospace errors, but otherwise, don't worry about it.

Tho of course you could script the check if desired; you'd simply be 
scripting conditional logic to support your own special case.
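
For instance, a rough (and untested) sketch, assuming a btrfs-progs new 
enough to have "btrfs filesystem usage", with the mountpoint and the 
1 GiB threshold as placeholders:

#!/bin/sh
# rebalance a few data chunks only when unallocated space runs low
MNT=/mountpoint
THRESHOLD=$((1024 * 1024 * 1024))   # 1 GiB, in bytes

unalloc=$(btrfs filesystem usage -b "$MNT" |
          awk '/Device unallocated:/ {print $3}')

if [ "$unalloc" -lt "$THRESHOLD" ]; then
    btrfs balance start -dlimit=3 "$MNT"
fi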

>>  For maintenance, I would suggest running a scrub regularly, to
>> check for various forms of bitrot. Typical frequencies for a scrub are
>> once a week or once a month -- opinions vary (as do runtimes).
> 
> 
> Yes. I cronned it weekly for now. Takes about 5 hours. Is it
> automatically corrected on RAID10 since a copy of it exist within the
> filesystem ? What happens for RAID0 ?

For raid10 (and the raid1 I use), yes, it's corrected, from the other 
existing copy, assuming it's good, tho if there are metadata checksum 
errors, there may be corresponding unverified checksums as well, where 
the verification couldn't be done because the metadata containing the 
checksums was bad.  Thus, if there are errors found and corrected, and 
you see unverified errors as well, rerun the scrub, so the newly 
corrected metadata can now be used to verify the previously unverified 
errors.
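
In command terms (with the mountpoint as a placeholder), that's simply:

btrfs scrub start -Bd /mountpoint   # first pass: fixes what it can
btrfs scrub start -Bd /mountpoint   # second pass: re-checks the blocks whose
                                    # checksums couldn't be verified before

with the second pass only needed if the first one reported unverified 
errors alongside the corrected ones.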

I'm presently getting a lot of experience with this as one of the ssds in 
my raid1 is gradually failing and rewriting sectors.  Generally what 
happens is that the ssd will take too long, triggering a SATA reset (30 
second timeout), and btrfs will call that an error.  The scrub then 
rewrites the bad copy on the unreliable device with the good copy from 
the more reliable device, with the write triggering a sector relocation 
on the bad device.  The newly written copy then checks out good, but if 
it was metadata, it very likely contained checksums for several other 
blocks, which couldn't be verified because the block containing their 
checksums was itself bad.  Typically I'll see dozens to a couple hundred 
unverified errors for every bad metadata block rewritten in this way.  
Rerunning the scrub then either verifies or fixes the previously 
unverified blocks, tho sometimes one of those in turn ends up bad and if 
it's a metadata block, I may end up rerunning the scrub another time or 
two, until everything checks out.

FWIW, on the bad device, smartctl -A reports (excerpted):

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0032   098   098   036    Old_age   Always       -       259
182 Erase_Fail_Count_Total  0x0032   100   100   000    Old_age   Always       -       132

While on the paired good device:

  5 Reallocated_Sector_Ct   0x0032   253   253   036    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x0032   253   253   000    Old_age   Always       -       0

Meanwhile, smartctl -H has already warned once that the device is 
failing, tho it went back to passing status again, but as of now it's 
saying failing, again.  The attribute that actually registers as failing, 
again from the bad device, followed by the good, is:

  1 Raw_Read_Error_Rate     0x000f   001   001   006    Pre-fail  Always   FAILING_NOW 3081

  1 Raw_Read_Error_Rate     0x000f   160   159   006    Pre-fail  Always       -       41                                                                   

When it's not actually reporting failing, the FAILING_NOW status is 
replaced with IN_THE_PAST.

250 Read_Error_Retry_Rate is the other attribute of interest, with values 
of 100 current and worst for both devices, threshold 0, but a raw value 
of 2488 for the good device and over 17,000,000 for the failing device. 
But with the "cooked" value never moving from 100 and with no real
guidance on how to interpret the raw values, while it's interesting, 
I am left relying on the others for indicators I can actually understand.

The 5 and 182 raw counts have been increasing gradually over time, and I 
scrub every time I do a major update, with another reallocated sector or 
two often appearing.  But as long as the paired good device keeps its zero 
count and I have backups (as I do!), btrfs is actually allowing me to 
continue using the unreliable device, relying on btrfs checksums and 
scrubbing to keep it usable.  And FWIW, I do have another device ready to 
go in when I decide I've had enough of this, but as long as I have 
backups and btrfs scrub keeps things fixed up, there's no real hurry 
unless I decide I'm tired of dealing with it.  Meanwhile, I'm having a 
bit of morbid fun watching as it slowly decays, getting experience of
the process in a reasonably controlled setting without serious danger
to my data, since it is backed up.


As for raid0 (and single), there's only one copy.  Btrfs detects checksum 
failure as it does above, but since there's only the one copy, if it's 
bad, well, for data you simply can't access that file any longer.  For 
metadata, you can't access whatever directories and files it referenced, 
any longer.  (FWIW for the truly desperate who hope that at least some of 
it can be recovered even if it's not a bit-perfect match, there's a btrfs 
command that wipes the checksum tree, which will let you access the 
previously bad-checksum files again, but it works on the entire 
filesystem so it's all or nothing, and of course with known corruption, 
there's no guarantees.)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


* Re: RAID10 Balancing Request for Comments and Advices
  2015-06-16 23:58     ` Duncan
@ 2015-06-17  0:14       ` Chris Murphy
  2015-06-17 13:13         ` Vincent Olivier
  2015-06-17 13:46       ` Vincent Olivier
  1 sibling, 1 reply; 11+ messages in thread
From: Chris Murphy @ 2015-06-17  0:14 UTC (permalink / raw)
  To: Btrfs BTRFS

On Tue, Jun 16, 2015 at 5:58 PM, Duncan <1i5t5.duncan@cox.net> wrote:

> On a current kernel unlike older ones, btrfs actually automates entirely
> empty chunk reclaim, so this problem doesn't occur anything close to near
> as often as it used to.  However, it's still possible to have mostly but
> not entirely empty chunks that btrfs won't automatically reclaim.  A
> balance can be used to rewrite and combine these mostly empty chunks,
> reclaiming the space saved.  This is what Hugo was recommending.

Yes, as little as a -dusage=5 (data chunks that are 5% or less full)
can clear the problem and is very fast, seconds. Possibly a bit
longer, many seconds to single-digit minutes, is -dusage=15. I haven't
done a full balance in forever.


-- 
Chris Murphy

* Re: RAID10 Balancing Request for Comments and Advices
  2015-06-17  0:14       ` Chris Murphy
@ 2015-06-17 13:13         ` Vincent Olivier
  2015-06-17 13:27           ` Hugo Mills
  0 siblings, 1 reply; 11+ messages in thread
From: Vincent Olivier @ 2015-06-17 13:13 UTC (permalink / raw)
  To: linux-btrfs


> On Jun 16, 2015, at 8:14 PM, Chris Murphy <lists@colorremedies.com> wrote:
> 
> On Tue, Jun 16, 2015 at 5:58 PM, Duncan <1i5t5.duncan@cox.net> wrote:
> 
>> On a current kernel unlike older ones, btrfs actually automates entirely
>> empty chunk reclaim, so this problem doesn't occur anything close to near
>> as often as it used to.  However, it's still possible to have mostly but
>> not entirely empty chunks that btrfs won't automatically reclaim.  A
>> balance can be used to rewrite and combine these mostly empty chunks,
>> reclaiming the space saved.  This is what Hugo was recommending.
> 
> Yes, as little as a -dusage=5 (data chunks that are 5% or less full)
> can clear the problem and is very fast, seconds. Possibly a bit
> longer, many seconds to single-digit minutes, is -dusage=15. I haven't
> done a full balance in forever.


Yes, on this 80% full 6x4TB RAID10, -dusage=15 took 2 seconds and relocated "0 out of 3026 chunks".

Out of curiosity, I had to use -dusage=90 to get it to relocate even 1 chunk, and it took less than 30 seconds.

So I put a -dusage=25 in the weekly cron just before the scrub.
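
Concretely, the weekly job now looks roughly like this (script name and
mountpoint below are placeholders rather than my real paths):

#!/bin/sh
# weekly maintenance sketch: compact mostly-empty data chunks, then scrub
btrfs balance start -dusage=25 /mnt/raid10
btrfs scrub start -Bd /mnt/raid10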

FYI.

Thanks for your help.

* Re: RAID10 Balancing Request for Comments and Advices
  2015-06-17 13:13         ` Vincent Olivier
@ 2015-06-17 13:27           ` Hugo Mills
  2015-06-17 13:29             ` Vincent Olivier
  2015-06-18  4:37             ` Duncan
  0 siblings, 2 replies; 11+ messages in thread
From: Hugo Mills @ 2015-06-17 13:27 UTC (permalink / raw)
  To: Vincent Olivier; +Cc: linux-btrfs

On Wed, Jun 17, 2015 at 09:13:08AM -0400, Vincent Olivier wrote:
> 
> > On Jun 16, 2015, at 8:14 PM, Chris Murphy <lists@colorremedies.com> wrote:
> > 
> > On Tue, Jun 16, 2015 at 5:58 PM, Duncan <1i5t5.duncan@cox.net> wrote:
> > 
> >> On a current kernel unlike older ones, btrfs actually automates entirely
> >> empty chunk reclaim, so this problem doesn't occur anything close to near
> >> as often as it used to.  However, it's still possible to have mostly but
> >> not entirely empty chunks that btrfs won't automatically reclaim.  A
> >> balance can be used to rewrite and combine these mostly empty chunks,
> >> reclaiming the space saved.  This is what Hugo was recommending.
> > 
> > Yes, as little as a -dusage=5 (data chunks that are 5% or less full)
> > can clear the problem and is very fast, seconds. Possibly a bit
> > longer, many seconds to single-digit minutes, is -dusage=15. I haven't
> > done a full balance in forever.
> 
> 
> Yes, on this 80% full 6x4TB RAID10 -dusage=15 took 2 seconds and relocated "0 out of 3026 chunks”.
> 
> Out of curiosity, I had to use -dusage=90 to have it relocate only 1 chunk and it took less than 30 seconds.
> 
> So I put a -dusage=25 in the weekly cron just before the scrub.

   In most cases, all you need to do is clean up one data chunk to
give the metadata enough space to work in. Instead of manually
iterating through several values of usage= until you get a useful
response, you can use limit=<n> to stop after <n> successful block
group relocations.
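
   So something like this (with /mountpoint as a placeholder) should be
all you need:

btrfs balance start -dlimit=1 /mountpoint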

   Hugo.

-- 
Hugo Mills             | Alert status mauve ocelot: Slight chance of
hugo@... carfax.org.uk | brimstone. Be prepared to make a nice cup of tea.
http://carfax.org.uk/  |
PGP: E2AB1DE4          |


* Re: RAID10 Balancing Request for Comments and Advices
  2015-06-17 13:27           ` Hugo Mills
@ 2015-06-17 13:29             ` Vincent Olivier
  2015-06-18  4:37             ` Duncan
  1 sibling, 0 replies; 11+ messages in thread
From: Vincent Olivier @ 2015-06-17 13:29 UTC (permalink / raw)
  To: linux-btrfs


> On Jun 17, 2015, at 9:27 AM, Hugo Mills <hugo@carfax.org.uk> wrote:
> 
> On Wed, Jun 17, 2015 at 09:13:08AM -0400, Vincent Olivier wrote:
>> 
>>> On Jun 16, 2015, at 8:14 PM, Chris Murphy <lists@colorremedies.com> wrote:
>>> 
>>> On Tue, Jun 16, 2015 at 5:58 PM, Duncan <1i5t5.duncan@cox.net> wrote:
>>> 
>>>> On a current kernel unlike older ones, btrfs actually automates entirely
>>>> empty chunk reclaim, so this problem doesn't occur anything close to near
>>>> as often as it used to.  However, it's still possible to have mostly but
>>>> not entirely empty chunks that btrfs won't automatically reclaim.  A
>>>> balance can be used to rewrite and combine these mostly empty chunks,
>>>> reclaiming the space saved.  This is what Hugo was recommending.
>>> 
>>> Yes, as little as a -dusage=5 (data chunks that are 5% or less full)
>>> can clear the problem and is very fast, seconds. Possibly a bit
>>> longer, many seconds to single-digit minutes, is -dusage=15. I haven't
>>> done a full balance in forever.
>> 
>> 
>> Yes, on this 80% full 6x4TB RAID10 -dusage=15 took 2 seconds and relocated "0 out of 3026 chunks”.
>> 
>> Out of curiosity, I had to use -dusage=90 to have it relocate only 1 chunk and it took less than 30 seconds.
>> 
>> So I put a -dusage=25 in the weekly cron just before the scrub.
> 
>   In most cases, all you need to do is clean up one data chunk to
> give the metadata enough space to work in. Instead of manually
> iterating through several values of usage= until you get a useful
> response, you can use limit=<n> to stop after <n> successful block
> group relocations.


Nice! Will do that instead! Thanks.


* Re: RAID10 Balancing Request for Comments and Advices
  2015-06-16 23:58     ` Duncan
  2015-06-17  0:14       ` Chris Murphy
@ 2015-06-17 13:46       ` Vincent Olivier
  2015-06-18  8:00         ` Duncan
  1 sibling, 1 reply; 11+ messages in thread
From: Vincent Olivier @ 2015-06-17 13:46 UTC (permalink / raw)
  To: linux-btrfs


> On Jun 16, 2015, at 7:58 PM, Duncan <1i5t5.duncan@cox.net> wrote:
> 
> Vincent Olivier posted on Tue, 16 Jun 2015 09:34:29 -0400 as excerpted:
> 
> 
>>> On Jun 16, 2015, at 8:25 AM, Hugo Mills <hugo@carfax.org.uk> wrote:
>>> 
>>> On Tue, Jun 16, 2015 at 08:09:17AM -0400, Vincent Olivier wrote:
>>>> 
>>>> My first question is this : is it normal to have “single” blocks ?
>>>> Why not only RAID10? I don’t remember the exact mkfs options I used
>>>> but I certainly didn’t ask for “single” so this is unexpected.
>>> 
>>> Yes. It's an artefact of the way that mkfs works. If you run a
>>> balance on those chunks, they'll go away. (btrfs balance start
>>> -dusage=0 -musage=0 /mountpoint)
>> 
>> Thanks! I did and it did go away, except for the "GlobalReserve, single:
>> total=512.00MiB, used=0.00B”. But I suppose this is a permanent fixture,
>> right?
> 
> Yes.  GlobalReserve is for short-term btrfs-internal use, reserved for
> times when btrfs needs to (temporarily) allocate some space in ordered to
> free space, etc.  It's always single, and you'll rarely see anything but
> 0 used except perhaps in the middle of a balance or something.


Got it. Thanks.

Is there any way to put that on another device, say, an SSD? I am thinking of backing up this RAID10 to a 2x8TB device-managed SMR RAID1, and I want to minimize random write operations (noatime et al.). I will maybe start a new thread for that, but first, is there something substantial I can read about btrfs+SMR? Or should I avoid SMR+btrfs?


> 
>>> For maintenance, I would suggest running a scrub regularly, to
>>> check for various forms of bitrot. Typical frequencies for a scrub are
>>> once a week or once a month -- opinions vary (as do runtimes).
>> 
>> 
>> Yes. I cronned it weekly for now. Takes about 5 hours. Is it
>> automatically corrected on RAID10 since a copy of it exist within the
>> filesystem ? What happens for RAID0 ?
> 
> For raid10 (and the raid1 I use), yes, it's corrected, from the other
> existing copy, assuming it's good, tho if there are metadata checksum
> errors, there may be corresponding unverified checksums as well, where
> the verification couldn't be done because the metadata containing the
> checksums was bad.  Thus, if there are errors found and corrected, and
> you see unverified errors as well, rerun the scrub, so the newly
> corrected metadata can now be used to verify the previously unverified
> errors.


OK then, rule of thumb: re-run the scrub on “unverified checksum error(s)”. I have yet to see any checksum errors but will keep it in mind.

> 
> I'm presently getting a lot of experience with this as one of the ssds in
> my raid1 is gradually failing and rewriting sectors.  Generally what
> happens is that the ssd will take too long, triggering a SATA reset (30
> second timeout), and btrfs will call that an error.  The scrub then
> rewrites the bad copy on the unreliable device with the good copy from
> the more reliable device, with the write triggering a sector relocation
> on the bad device.  The newly written copy then checks out good, but if
> it was metadata, it very likely contained checksums for several other
> blocks, which couldn't be verified because the block containing their
> checksums was itself bad.  Typically I'll see dozens to a couple hundred
> unverified errors for every bad metadata block rewritten in this way.
> Rerunning the scrub then either verifies or fixes the previously
> unverified blocks, tho sometimes one of those in turn ends up bad and if
> it's a metadata block, I may end up rerunning the scrub another time or
> two, until everything checks out.
> 
> FWIW, on the bad device, smartctl -A reports (excerpted):
> 
> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
>  5 Reallocated_Sector_Ct   0x0032   098   098   036    Old_age   Always       -       259
> 182 Erase_Fail_Count_Total  0x0032   100   100   000    Old_age   Always       -       132
> 
> While on the paired good device:
> 
>  5 Reallocated_Sector_Ct   0x0032   253   253   036    Old_age   Always       -       0
> 182 Erase_Fail_Count_Total  0x0032   253   253   000    Old_age   Always       -       0
> 
> Meanwhile, smartctl -H has already warned once that the device is
> failing, tho it went back to passing status again, but as of now it's
> saying failing, again.  The attribute that actually registers as failing,
> again from the bad device, followed by the good, is:
> 
>  1 Raw_Read_Error_Rate     0x000f   001   001   006    Pre-fail  Always   FAILING_NOW 3081
> 
>  1 Raw_Read_Error_Rate     0x000f   160   159   006    Pre-fail  Always       -       41
> 
> When it's not actually reporting failing, the FAILING_NOW status is
> replaced with IN_THE_PAST.
> 
> 250 Read_Error_Retry_Rate is the other attribute of interest, with values
> of 100 current and worst for both devices, threshold 0, but a raw value
> of 2488 for the good device and over 17,000,000 for the failing device.
> But with the "cooked" value never moving from 100 and with no real
> guidance on how to interpret the raw values, while it's interesting,
> I am left relying on the others for indicators I can actually understand.
> 
> The 5 and 182 raw counts have been increasing gradually over time, and I
> scrub every time I do a major update, with another reallocated sector or
> two often appearing.  But as long as the paired good device keeps its zero
> count and I have backups (as I do!), btrfs is actually allowing me to
> continue using the unreliable device, relying on btrfs checksums and
> scrubbing to keep it usable.  And FWIW, I do have another device ready to
> go in when I decide I've had enough of this, but as long as I have
> backups and btrfs scrub keeps things fixed up, there's no real hurry
> unless I decide I'm tired of dealing with it.  Meanwhile, I'm having a
> bit of morbid fun watching as it slowly decays, getting experience of
> the process in a reasonably controlled setting without serious danger
> to my data, since it is backed up.


You sure have morbid inclinations! ;-)

Out of curiosity, what is the frequency and sequence of smartctl long/short tests + btrfs scrubs? Is it all automated?


> As for raid0 (and single), there's only one copy.  Btrfs detects checksum
> failure as it does above, but since there's only the one copy, if it's
> bad, well, for data you simply can't access that file any longer.  For
> metadata, you can't access whatever directories and files it referenced,
> any longer.  (FWIW for the truly desperate who hope that at least some of
> it can be recovered even if it's not a bit-perfect match, there's a btrfs
> command that wipes the checksum tree, which will let you access the
> previously bad-checksum files again, but it works on the entire
> filesystem so it's all or nothing, and of course with known corruption,
> there's no guarantees.)

But is it possible to manually correct the corruption by overwriting the corrupted files with a copy from a backup? I mean, is there enough information reported in order to do that?

thanks!

v


* Re: RAID10 Balancing Request for Comments and Advices
  2015-06-17 13:27           ` Hugo Mills
  2015-06-17 13:29             ` Vincent Olivier
@ 2015-06-18  4:37             ` Duncan
  1 sibling, 0 replies; 11+ messages in thread
From: Duncan @ 2015-06-18  4:37 UTC (permalink / raw)
  To: linux-btrfs

Hugo Mills posted on Wed, 17 Jun 2015 13:27:36 +0000 as excerpted:

>> Yes, on this 80% full 6x4TB RAID10 -dusage=15 took 2 seconds and
>> relocated "0 out of 3026 chunks”.
>> 
>> Out of curiosity, I had to use -dusage=90 to have it relocate only 1
>> chunk and it took less than 30 seconds.
>> 
>> So I put a -dusage=25 in the weekly cron just before the scrub.
> 
> In most cases, all you need to do is clean up one data chunk to
> give the metadata enough space to work in. Instead of manually iterating
> through several values of usage= until you get a useful response, you
> can use limit=<n> to stop after <n> successful block group relocations.

Thanks, Hugo.  It wasn't previously clear to me what the practical usage 
for the (relatively new) limit= filter was.  Very useful explanation. =:^)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


* Re: RAID10 Balancing Request for Comments and Advices
  2015-06-17 13:46       ` Vincent Olivier
@ 2015-06-18  8:00         ` Duncan
  0 siblings, 0 replies; 11+ messages in thread
From: Duncan @ 2015-06-18  8:00 UTC (permalink / raw)
  To: linux-btrfs

Vincent Olivier posted on Wed, 17 Jun 2015 09:46:50 -0400 as excerpted:

>> On Jun 16, 2015, at 7:58 PM, Duncan <1i5t5.duncan@cox.net> wrote:
>> 
>> Yes.  GlobalReserve is for short-term btrfs-internal use, reserved for
>> times when btrfs needs to (temporarily) allocate some space in ordered
>> to free space, etc.  It's always single, and you'll rarely see anything
>> but 0 used except perhaps in the middle of a balance or something.
> 
> 
> Get it. Thanks.
> 
> Is there anyway to put that on another device, say, a SSD?

Not (AFAIK) presently.  There are various btrfs feature suggestions 
involving selectively steering various btrfs components to faster or 
slower devices, etc., as can be seen on the wiki, but the btrfs chunk 
allocator isn't really customizable beyond basic raid level yet.  It 
does what it does and that's it.  For fancy features such as this, unless 
you're a company or individual with resources to invest in a specific 
feature of interest, I'd say give btrfs development another five years or 
so, and it may be tackling this sort of thing.

The two actually working alternatives I know of are bcached btrfs 
(there's someone on-list that actually does that and reports it working), 
and a more mature "btrfs-similar" solution such as zfs, tho of course zfs 
on Linux has its own issues, primarily licensing/legal.

> I am thinking
> of backing up this RAID10 on a 2x8TB device-managed SMR RAID1 and I want
> to minimize random write operations (noatime & al.). I will start a new
> thread for that maybe but first, is there something substantial I can
> read about btrfs+SMR? Or should I avoid SMR+btfs ?

I haven't the foggiest, but in case it spares someone looking up SMR like 
I just had to do, SMR = Shingled Magnetic Recording -- the new "shingled" 
drives that have been in the tech news since shortly before they started 
shipping in late 2013.

https://en.wikipedia.org/wiki/Shingled_magnetic_recording

> ok then, rule of the thumb re-run the scrub on “unverified checksum
> error(s)”. I have yet to see checksum errors yet but will keep it in
> mind..

FWIW, see my reply to Marc MERLIN from a few minutes ago in the "BTRFS: read 
error corrected: ino 1 off ...." thread, if you're interested in further 
discussion on this.

But regardless, based on my own experience, that's a good rule of thumb, 
yes. =:^)

>> Meanwhile, I'm having a bit of morbid fun watching as [a dying ssd]
>> slowly decays, getting experience of the process in a reasonably
>> controlled setting without serious danger to my data, since it is
>> backed up.

> You sure have morbid inclinations ! ;-)

=:^)

> Out of curiosity what is the frequency and sequence of smartctl
> long/short tests + btrfs scrubs ? Is it all automated ?

I haven't automated any of that, except that since this dying ssd thing 
started I created a small scriptlet (could be an alias, but I prefer 
scriptlets), "bscrub", that runs btrfs scrub start -Bd $*, to avoid 
typing in the full command.  All I have to add is the mountpoint to 
scrub, possibly preceded by -r to read-only scrub /, which I keep read-
only mounted by default.
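
The whole scriptlet amounts to something like this (a sketch, not a 
verbatim copy of mine):

#!/bin/sh
# bscrub: foreground scrub with per-device stats, arguments passed through
exec btrfs scrub start -Bd "$@"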

Perhaps to my harm I don't actually do the smart-tests regularly.  I'm 
not actually sure they're particularly useful on SSDs, particularly when 
using checksum-verified and raid-redundant filesystems such as btrfs in 
raid1/10 mode (and raid5/6 as it matures).  In practice btrfs scrub 
regularly reporting error corrected and/or nasty bus reset errors showing 
up in the logs are a pretty good advance indicators, better than smart 
status, from what I've seen.

I do check smartctl -AH regularly, particularly now, but (in the past at 
least, I think my habit may be changing for the better, now, one of the 
positive results of letting the dying ssd run for the moment) less 
frequently when no problems are evident.
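
For anyone who does want the self-tests automated, smartd can schedule 
them; a typical smartd.conf line (a sketch -- device name and mail target 
are placeholders) looks something like:

/dev/sda -a -o on -S on -s (S/../.././02|L/../../6/03) -m root

which runs a short self-test daily at 02:00 and a long one every Saturday 
at 03:00.  As noted, I don't bother with that myself.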

I actually have a pretty firm policy of splitting up my data onto 
separate filesystems (btrfs subvolumes don't cut it for me as all the 
data eggs are still in the same filesystem basket and if its bottom falls 
out, !!!!), keeping them of easily managed and easily backed up size.  My 
largest btrfs is actually under 50 gig.  Between that and the fact that 
I'm using ssds, whole-filesystem maintenance (btrfs scrub, balance, and 
check commands) time is on the order of seconds to a few minutes (single 
digits) per filesystem.  As a result, running them is relatively trivial 
-- it doesn't take the hours to days people report for their multi-
terabyte btrfs on spinning rust, and I can and do sometimes run them on a 
whim.  Scrubs are generally under a minute per filesystem, with only a 
handful of filesystems routinely used, so under 10 minutes, total, 
including repeat-runs, on all routinely mounted btrfs.

Given the trivial time factor I basically simply integrated the scrub 
into my update procedure (weekly on average, tho it can be daily if I'm 
waiting on a fix or 10-14 days if I'm lazy), since that's my biggest 
filesystem changes and thus most likely to trigger new bad blocks.  / is 
read-only mounted by default except for updates, and the packages 
partition is only mounted for updates, so that takes care of them.  I've 
lately taken to scrubbing home every couple of days, before a reboot or 
sometimes when I'm reading this list and thus thinking about it.  boot 
and log are both trivial, under a gig each so scrubbed about as fast as I 
lift my finger off enter.  And boot isn't mounted by default and can be 
scrubbed when I mount it to update kernels, while log isn't something I'm 
hugely worried about losing.  My big partition is the media partition, 
but that's still reiserfs on spinning rust, so is neither scrubbable nor 
endangered by the failing ssd.  Other than that there's the backup 
versions of all these filesystem partitions, but they too can be scrubbed 
on update (primary backup, btrfs, on the ssds) or are still on reiserfs 
on spinning rust (secondary backup).

>> As for raid0 (and single), there's only one copy.  Btrfs detects
>> checksum failure as it does above, but since there's only the one copy,
>> if it's bad, well, for data you simply can't access that file any
>> longer.  For metadata, you can't access whatever directories and files
>> it referenced,
>> any longer.  (FWIW for the truly desperate who hope that at least some
>> of it can be recovered even if it's not a bit-perfect match, there's a
>> btrfs command that wipes the checksum tree, which will let you access
>> the previously bad-checksum files again, but it works on the entire
>> filesystem so it's all or nothing, and of course with known corruption,
>> there's no guarantees.)
> 
> But is it possible to manually correct the corruption by overwriting the
> corrupted files with a copy from a backup ? I mean is there enough
> information reported in order to do that ?

In general, yes.  For data corruption, btrfs scrub prints the affected 
file, so deleting it and pulling a new copy over from backup shouldn't be 
an issue.
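
In other words, once scrub (or the kernel log) has named the file, 
recovery is just something along the lines of (paths made up):

rm /mountpoint/path/to/damaged.file
cp -a /backup/path/to/damaged.file /mountpoint/path/to/damaged.file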

Metadata is by nature a bit more difficult to trace down and correct, but 
(except for ssd) it's dup by default on single-device and raid1 by 
default on multi-device anyway, and I'd consider anyone playing games 
with single metadata without backups to be getting exactly the deal they 
negotiated for if they lose it all.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

