* What is the best way to combine 4 HDD's and 2 SSD's into a single filesystem?
@ 2021-01-17 20:35  
  2021-01-17 22:38 ` John Stoffel
                   ` (2 more replies)
  0 siblings, 3 replies; 5+ messages in thread
From:   @ 2021-01-17 20:35 UTC
  To: linux-raid

I have 4 slow, loud, big, power-hungry old hard drives, and 2 SSDs. I'm trying to come up with a way to combine them into a system that has the following characteristics:

A) The hard drives stop spinning 5 minutes after they were last used (see the hdparm sketch after this list).
B) The SSDs are used for read and write caching. Writes to the system are absorbed by the SSDs; the hard drives are only woken up when the SSDs are full of dirty data. (This means the SSDs can hold dirty data for a long time.)
C) When data is requested that's not present on the SSDs (a read cache miss), the hard drive that holds that data is woken up.
D) When a hard drive is woken up as a result of a read cache miss, the SSDs write out their dirty data to that drive.
E) If one drive fails, or starts to return random data, the system must still return the correct data to the user.
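
For characteristic A, something like this should work (untested; assuming the four HDDs are whole disks sdb..sde -- the sdaN names in my diagrams are partitions of a single test disk):

    # Spin down each HDD after 5 minutes idle: -S 60 means 60 * 5 s = 300 s, see hdparm(8).
    for d in /dev/sdb /dev/sdc /dev/sdd /dev/sde; do
        hdparm -S 60 "$d"
    done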

First idea is to use this stack of bcache and btrfs:
+-----------------------------------------------------------+
|               btrfs raid 1 (2 copies) /mnt                |
+--------------+--------------+--------------+--------------+
| /dev/bcache0 | /dev/bcache1 | /dev/bcache2 | /dev/bcache3 |
+--------------+--------------+--------------+--------------+
|                       Cache (SSD)                         |
|                       /dev/sda4                           |
+--------------+--------------+--------------+--------------+
| Data HDD     | Data HDD     | Data HDD     | Data HDD     |
| /dev/sda8    | /dev/sda9    | /dev/sda10   | /dev/sda11   |
+--------------+--------------+--------------+--------------+
The good:
Btrfs in raid 1 can handle a failing hard drive, both when it fails completely and when it returns corrupted data.
Bcache can use an SSD to cache both the read and the write requests coming from btrfs.
The not-so-good:
A bcache cache set can only use one SSD, so all dirty data would sit on a single device. To achieve characteristic E, bcache can then only be used as a read cache, but that prevents characteristic B from being achieved.
I can't get bcache to read ahead the data that is adjacent to the data that has just been accessed.
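
For reference, a rough, untested sketch of the commands I'd use to build this stack:

    # One cache set on the SSD partition, four backing devices on the HDDs.
    make-bcache -C /dev/sda4
    for d in /dev/sda8 /dev/sda9 /dev/sda10 /dev/sda11; do
        make-bcache -B "$d"
    done

    # Attach every backing device to the cache set.
    cset=$(bcache-super-show /dev/sda4 | awk '/cset.uuid/ {print $2}')
    for b in bcache0 bcache1 bcache2 bcache3; do
        echo "$cset" > /sys/block/$b/bcache/attach
    done

    mkfs.btrfs -d raid1 -m raid1 /dev/bcache[0-3]
    mount /dev/bcache0 /mnt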

Second idea is to use an SSD in front of each hard drive:
+-----------------------------------------------------------+
|                btrfs raid 1 (2 copies) /mnt               |
+--------------+--------------+--------------+--------------+
| /dev/bcache0 | /dev/bcache1 | /dev/bcache2 | /dev/bcache3 |
+--------------+--------------+--------------+--------------+
| Cache SSD    | Cache SSD    | Cache SSD    | Cache SSD    |
| /dev/sda5    | /dev/sda6    | /dev/sda7    | /dev/sda8    |
+--------------+--------------+--------------+--------------+
| Data         | Data         | Data         | Data         |
| /dev/sda9    | /dev/sda10   | /dev/sda11   | /dev/sda12   |
+--------------+--------------+--------------+--------------+
The good:
This setup achieves all the characteristics I'm after.
The not-so-good:
This requires more SSDs and more (SATA) ports than I have.
I can't get bcache to read ahead the data that is adjacent to the data that has just been accessed.

Third idea is to use mdadm to create a raid 0 array out of the 2 SSDs, to act as a fault-tolerant write cache:
+-----------------------------------------------------------+
|                 btrfs raid 1 (2 copies) /mnt              |
+--------------+--------------+--------------+--------------+
| /dev/bcache0 | /dev/bcache1 | /dev/bcache2 | /dev/bcache3 |
+--------------+--------------+--------------+--------------+
|                      bcache Cache                         |
|                         /dev/md0                          |
+-----------------------------------------------------------+
|               mdadm raid 0 array /dev/md0                 |
|             SSD /dev/sda4 and SSD /dev/sda5               |
+--------------+--------------+--------------+--------------+
| Data         | Data         | Data         | Data         |
| /dev/sda9    | /dev/sda10   | /dev/sda11   | /dev/sda12   |
+--------------+--------------+--------------+--------------+
The good:
This setup is capable of achieving all the characteristics I'm after. It can handle the abrupt failure of a single drive.
The not-so-good:
When one of the SSDs starts to return random data, mdadm has no way to know which SSD holds the correct data, and data is lost. (Both copies of the data btrfs writes to the underlying storage pass through the 2 SSDs.)
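
For completeness, the rough, untested commands behind this diagram:

    # Stripe the two SSD partitions and use the result as one shared cache set.
    mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/sda4 /dev/sda5
    make-bcache -C /dev/md0
    for d in /dev/sda9 /dev/sda10 /dev/sda11 /dev/sda12; do
        make-bcache -B "$d"
    done
    # ...then attach bcache0..bcache3 to the md0 cache set as in the first idea.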

Fourth idea is to use dm-cache instead of bcache, in the same layout as the second idea. dm-cache can only cache one backing device per cache, and it has no way to use 2 cache devices.
+-----------------------------------------------------------+
|                btrfs raid 1 (2 copies) /mnt               |
+--------------+--------------+--------------+--------------+
| /dev/bcache0 | /dev/bcache1 | /dev/bcache2 | /dev/bcache3 |
+--------------+--------------+--------------+--------------+
| Cache SSD    | Cache SSD    | Cache SSD    | Cache SSD    |
| /dev/sda5    | /dev/sda6    | /dev/sda7    | /dev/sda8    |
+--------------+--------------+--------------+--------------+
| Data         | Data         | Data         | Data         |
| /dev/sda9    | /dev/sda10   | /dev/sda11   | /dev/sda12   |
+--------------+--------------+--------------+--------------+
The good:
This setup is capable of achieving all the characteristics I'm after.
The not-so-good:
This requires more SSDs and more (SATA) ports than I have.
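
For one HDD+SSD pair, dm-cache is most easily driven through lvmcache; a rough, untested sketch with hypothetical VG/LV names:

    # One VG per HDD+SSD pair; cache the HDD LV with the SSD.
    pvcreate /dev/sda9 /dev/sda5
    vgcreate vg0 /dev/sda9 /dev/sda5
    lvcreate -n data -l 100%PVS vg0 /dev/sda9
    # Leave a little room on the SSD for the cache-pool metadata.
    lvcreate --type cache-pool -n cpool -l 95%PVS vg0 /dev/sda5
    lvconvert --type cache --cachepool vg0/cpool --cachemode writeback vg0/data

Repeat per pair, then put btrfs raid 1 across the four cached LVs.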

What options do I have to create the desired setup?
Is it feasible to add a checksum to mdadm, much like btrfs has, so it can tell which drive (if any) has returned the correct data?

Is this the correct mailing list to ask these questions?


* Re: What is the best way to combine 4 HDD's and 2 SSD's into a single filesystem?
  2021-01-17 20:35 What is the best way to combine 4 HDD's and 2 SSD's into a single filesystem?  
@ 2021-01-17 22:38 ` John Stoffel
  2021-01-17 23:32 ` Wols Lists
  2021-01-18  1:59 ` Chris Murphy
  2 siblings, 0 replies; 5+ messages in thread
From: John Stoffel @ 2021-01-17 22:38 UTC
  To:  ; +Cc: linux-raid

>>>>> <Cedric.dewijs@eclipso.eu> writes:

> I have 4 slow, loud, big, power-hungry old hard drives, and 2 SSDs. I'm trying to come up with a way to combine them into a system that has the following characteristics:
> A) The hard drives stop spinning 5 minutes after they were last used.
> B) The SSDs are used for read and write caching. Writes to the system are absorbed by the SSDs; the hard drives are only woken up when the SSDs are full of dirty data. (This means the SSDs can hold dirty data for a long time.)
> C) When data is requested that's not present on the SSDs (a read cache miss), the hard drive that holds that data is woken up.
> D) When a hard drive is woken up as a result of a read cache miss, the SSDs write out their dirty data to that drive.
> E) If one drive fails, or starts to return random data, the system must still return the correct data to the user.

> First idea is to use this stack of bcache and btrfs:
> +-----------------------------------------------------------+
> |               btrfs raid 1 (2 copies) /mnt                |
> +--------------+--------------+--------------+--------------+
> | /dev/bcache0 | /dev/bcache1 | /dev/bcache2 | /dev/bcache3 |
> +--------------+--------------+--------------+--------------+
> |                       Cache (SSD)                         |
> |                       /dev/sda4                           |
> +--------------+--------------+--------------+--------------+
> | Data HDD     | Data HDD     | Data HDD     | Data HDD     |
> | /dev/sda8    | /dev/sda9    | /dev/sda10   | /dev/sda11   |
> +--------------+--------------+--------------+--------------+
> The good:
> Btrfs in raid 1 can handle a failing hard drive, both when it fails completely and when it returns corrupted data.
> Bcache can use an SSD to cache both the read and the write requests coming from btrfs.
> The not-so-good:
> A bcache cache set can only use one SSD, so all dirty data would sit on a single device. To achieve characteristic E, bcache can then only be used as a read cache, but that prevents characteristic B from being achieved.
> I can't get bcache to read ahead the data that is adjacent to the data that has just been accessed.

> Second idea is to use an SSD in front of each hard drive:
> +-----------------------------------------------------------+
> |                btrfs raid 1 (2 copies) /mnt               |
> +--------------+--------------+--------------+--------------+
> | /dev/bcache0 | /dev/bcache1 | /dev/bcache2 | /dev/bcache3 |
> +--------------+--------------+--------------+--------------+
> | Cache SSD    | Cache SSD    | Cache SSD    | Cache SSD    |
> | /dev/sda5    | /dev/sda6    | /dev/sda7    | /dev/sda8    |
> +--------------+--------------+--------------+--------------+
> | Data         | Data         | Data         | Data         |
> | /dev/sda9    | /dev/sda10   | /dev/sda11   | /dev/sda12   |
> +--------------+--------------+--------------+--------------+
> The good:
> This setup achieves all the characteristics I'm after.
> The not-so-good:

> This requires more SSDs and more (SATA) ports than I have.  I can't
> get bcache to read ahead the data that is adjacent to the data that
> has just been accessed.

Why don't you just partition your SSD(s) into 4 partitions and use
each partition as a cache for a separate HDD?  The SSDs have more than
enough IOPS to handle the load.  But!  I would mirror a pair of SSDs
and mirror pairs of disks for even more redundancy and reliability
here, something like the untested sketch below.
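
A rough sketch of that (whole-disk names sdf/sdg for the SSDs are
assumed):

    # Mirror the two SSDs, then carve the mirror into four cache partitions.
    mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdf /dev/sdg
    parted -s /dev/md0 mklabel gpt \
        mkpart cache0 0% 25% mkpart cache1 25% 50% \
        mkpart cache2 50% 75% mkpart cache3 75% 100%
    # One bcache cache set per HDD, each on its own mirrored partition.
    for p in /dev/md0p1 /dev/md0p2 /dev/md0p3 /dev/md0p4; do
        make-bcache -C "$p"
    done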


> Third idea is to use mdadm to create a raid 0 array out of the 2
> SSD's to create a fault tolerant write cache:

No, use RAID1 (mirror), not a RAID 0 stripe across two SSDs.

> +-----------------------------------------------------------+
> |                 btrfs raid 1 (2 copies) /mnt              |
> +--------------+--------------+--------------+--------------+
> | /dev/bcache0 | /dev/bcache1 | /dev/bcache2 | /dev/bcache3 |
> +--------------+--------------+--------------+--------------+
> |                      bcache Cache                         |
> |                         /dev/md0                          |
> +-----------------------------------------------------------+
> |               mdadm raid 0 array /dev/md0                 |
> |             SSD /dev/sda4 and SSD /dev/sda5               |
> +--------------+--------------+--------------+--------------+
> | Data         | Data         | Data         | Data         |
> | /dev/sda9    | /dev/sda10   | /dev/sda11   | /dev/sda12   |
> +--------------+--------------+--------------+--------------+
> The good:
> This setup is capable of achieving all the characteristics I'm after. It can handle the abrupt failure of a single drive.
> The not-so-good:
> When one of the SSDs starts to return random data, mdadm has no way to know which SSD holds the correct data, and data is lost. (Both copies of the data btrfs writes to the underlying storage pass through the 2 SSDs.)

> Fourth idea is to use dm-cache instead of bcache, in the same layout as the second idea. dm-cache can only cache one backing device per cache, and it has no way to use 2 cache devices.
> +-----------------------------------------------------------+
> |                btrfs raid 1 (2 copies) /mnt               |
> +--------------+--------------+--------------+--------------+
> | /dev/bcache0 | /dev/bcache1 | /dev/bcache2 | /dev/bcache3 |
> +--------------+--------------+--------------+--------------+
> | Cache SSD    | Cache SSD    | Cache SSD    | Cache SSD    |
> | /dev/sda5    | /dev/sda6    | /dev/sda7    | /dev/sda8    |
> +--------------+--------------+--------------+--------------+
> | Data         | Data         | Data         | Data         |
> | /dev/sda9    | /dev/sda10   | /dev/sda11   | /dev/sda12   |
> +--------------+--------------+--------------+--------------+
> The good:
> This setup is capable of achieving all the characteristics I'm after.
> The not-so-good:
> This requires more SSDs and more (SATA) ports than I have.

Remember, a single SSD can handle loads more IOPS than all four of
your drives, so partitioning your SSDs might be the answer to your
problem.



* Re: What is the best way to combine 4 HDD's and 2 SSD's into a single filesystem?
  2021-01-17 20:35 What is the best way to combine 4 HDD's and 2 SSD's into a single filesystem?  
  2021-01-17 22:38 ` John Stoffel
@ 2021-01-17 23:32 ` Wols Lists
  2021-01-18  1:59 ` Chris Murphy
  2 siblings, 0 replies; 5+ messages in thread
From: Wols Lists @ 2021-01-17 23:32 UTC
  To: Cedric.dewijs, linux-raid

On 17/01/21 20:35, Cedric.dewijs@eclipso.eu wrote:
> Is it feasible to add a checksum to mdadm, much like btrfs has, so it can tell what drive (if any) has returned the correct data?

Put mdadm on top of dm-integrity:

https://raid.wiki.kernel.org/index.php/Dm-integrity
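
The short version of that page (untested here, member device names
assumed):

    # Give each member device its own checksums, then mirror on top.
    integritysetup format /dev/sda4
    integritysetup open /dev/sda4 int0
    integritysetup format /dev/sda5
    integritysetup open /dev/sda5 int1
    mdadm --create /dev/md0 --level=1 --raid-devices=2 \
        /dev/mapper/int0 /dev/mapper/int1
    # A read that fails its checksum now comes back as an I/O error,
    # and md reconstructs the block from the other mirror.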

Cheers,
Wol


* Re: What is the best way to combine 4 HDD's and 2 SSD's into a single filesystem?
  2021-01-17 20:35 What is the best way to combine 4 HDD's and 2 SSD's into a single filesystem?  
  2021-01-17 22:38 ` John Stoffel
  2021-01-17 23:32 ` Wols Lists
@ 2021-01-18  1:59 ` Chris Murphy
  2021-01-19 19:27   ` Phillip Susi
  2 siblings, 1 reply; 5+ messages in thread
From: Chris Murphy @ 2021-01-18  1:59 UTC
  To: Cedric.dewijs; +Cc: Linux-RAID

On Sun, Jan 17, 2021 at 1:43 PM <Cedric.dewijs@eclipso.eu> wrote:
>
> I have 4 slow, loud, big, power-hungry old hard drives, and 2 SSDs. I'm trying to come up with a way to combine them into a system that has the following characteristics:
>
> [...]
>
> What options do I have to create the desired setup?
> Is it feasible to add a checksum to mdadm, much like btrfs has, so it can tell which drive (if any) has returned the correct data?


I think all the options are complicated, and lack sufficient hardware
isolation to withstand SSD failure. They are also complicated in a
recovery/restore context. I think you're better off with two separate
storage setups; that isolates problems and makes recovery much simpler
and more robust.

A

HDD1+SSD1 =>
HDD2+SSD2 =>  btrfs raid1 profile for metadata; raid0 or raid1
              profile for data, depending on your risk tolerance

B

HDD3+HDD4 =>  btrfs raid1 profile for metadata and data


Set up btrbk to take snapshots and use btrfs send/receive to cheaply
replicate A to B, roughly as sketched below. Configure a scheduled
scrub of A and B once a month.
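
A minimal sketch of the replication step (subvolume paths and snapshot
names are assumed):

    # Read-only snapshot on A, then incremental send to B.
    btrfs subvolume snapshot -r /mnt/A/important /mnt/A/.snap/important.new
    btrfs send -p /mnt/A/.snap/important.old /mnt/A/.snap/important.new \
        | btrfs receive /mnt/B/backups
    # Monthly, e.g. from a systemd timer:
    btrfs scrub start /mnt/A
    btrfs scrub start /mnt/B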

Assuming A has more storage capacity than B, you can have important vs
throwaway subvolumes. Important is backed up, and throwaway isn't.

If A has problems and can't be repaired, just blow it away and
recreate it from scratch, then restore from the latest snapshot on B.

Still another variation would be to skip the cache and set up a
separate file system on the SSDs.

-- 
Chris Murphy


* Re: What is the best way to combine 4 HDD's and 2 SSD's into a single filesystem?
  2021-01-18  1:59 ` Chris Murphy
@ 2021-01-19 19:27   ` Phillip Susi
  0 siblings, 0 replies; 5+ messages in thread
From: Phillip Susi @ 2021-01-19 19:27 UTC
  To: Chris Murphy; +Cc: Cedric.dewijs, Linux-RAID


Chris Murphy writes:

> A
>
> HDD1+SSD1=>
>                             Btrfs raid1 profile for metadata, raid0 or
> raid1 profile for data depending on your risk tolerance
> HDD2+SSD2=>

But then sometimes reads will go to one leg, miss the cache, and have
to hit the HDD even though the data is already in the SSD cache of the
other leg.  Therefore, wouldn't it be preferable to keep the mirroring
underneath the cache?  Also, having a separate B with the other two
HDDs and using it purely as a backup further halves your storage
capacity (so now only 1/4th).  And if one disk gets spun up to satisfy
a read, the next read might start up the other disk rather than go to
the one that is already running.

I'd say make a raid5 of the 4 HDDs, a raid1 of the two SSDs, and use
the combined SSDs to cache the combined HDDs.  Something like the
untested sketch below.
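
    # Hypothetical device names: sdb..sde are the HDDs, sdf/sdg the SSDs.
    mdadm --create /dev/md1 --level=5 --raid-devices=4 \
        /dev/sdb /dev/sdc /dev/sdd /dev/sde
    mdadm --create /dev/md2 --level=1 --raid-devices=2 /dev/sdf /dev/sdg
    # One bcache device: the SSD mirror caches the whole HDD array.
    make-bcache -C /dev/md2 -B /dev/md1
    echo writeback > /sys/block/bcache0/bcache/cache_mode
    mkfs.btrfs /dev/bcache0   # or whatever filesystem you prefer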

