Date: Sat, 27 May 2023 09:45:02 +1000
From: Dave Chinner <david@fromorbit.com>
To: Joe Thornber
Cc: Brian Foster, Mike Snitzer, Jens Axboe, Christoph Hellwig,
	Theodore Ts'o, Sarthak Kukreti, dm-devel@redhat.com,
	"Michael S. Tsirkin", "Darrick J. Wong",
Wong" , Jason Wang , Bart Van Assche , linux-kernel@vger.kernel.org, linux-block@vger.kernel.org, Joe Thornber , Andreas Dilger , Stefan Hajnoczi , linux-fsdevel@vger.kernel.org, linux-ext4@vger.kernel.org, Alasdair Kergon Subject: Re: [PATCH v7 0/5] Introduce provisioning primitives Message-ID: References: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, May 26, 2023 at 12:04:02PM +0100, Joe Thornber wrote: > Here's my take: > > I don't see why the filesystem cares if thinp is doing a reservation or > provisioning under the hood. All that matters is that a future write > to that region will be honoured (barring device failure etc.). > > I agree that the reservation/force mapped status needs to be inherited > by snapshots. > > > One of the few strengths of thinp is the performance of taking a snapshot. > Most snapshots created are never activated. Many other snapshots are > only alive for a brief period, and used read-only. eg, blk-archive > (https://github.com/jthornber/blk-archive) uses snapshots to do very > fast incremental backups. As such I'm strongly against any scheme that > requires provisioning as part of the snapshot operation. > > Hank and I are in the middle of the range tree work which requires a > metadata > change. So now is a convenient time to piggyback other metadata changes to > support reservations. > > > Given the above this is what I suggest: > > 1) We have an api (ioctl, bio flag, whatever) that lets you > reserve/guarantee a region: > > int reserve_region(dev, sector_t begin, sector_t end); A C-based interface is not sufficient because the layer that must do provsioning is not guaranteed to be directly under the filesystem. We must be able to propagate the request down to the layers that need to provision storage, and that includes hardware devices. e.g. dm-thin would have to issue REQ_PROVISION on the LBA ranges it allocates in it's backing device to guarantee that the provisioned LBA range it allocates is also fully provisioned by the storage below it.... > This api should be used minimally, eg, critical FS metadata only. Keep in mind that "critical FS metadata" in this context is any metadata which could cause the filesystem to hang or enter a global error state if an unexpected ENOSPC error occurs during a metadata write IO. Which, in pretty much every journalling filesystem, equates to all metadata in the filesystem. For a typical root filesystem, that might be a in the range of a 1-200MB (depending on journal size). For larger filesytems with lots of files in them, it will be in the range of GBs of space. Plan for having to support tens of GBs of provisioned space in filesystems, not tens of MBs.... [snip] > Now this is a lot of work. As well as the kernel changes we'll need to > update the userland tools: thin_check, thin_ls, thin_metadata_unpack, > thin_rmap, thin_delta, thin_metadata_pack, thin_repair, thin_trim, > thin_dump, thin_metadata_size, thin_restore. Are we confident that we > have buy in from the FS teams that this will be widely adopted? Are users > asking for this? I really don't want to do 6 months of work for nothing. I think there's a 2-3 solid days of coding to fully implement REQ_PROVISION support in XFS, including userspace tool support. Maybe a couple of weeks more to flush the bugs out before it's largely ready to go. 
So if there's buy-in from the block layer and DM people for
REQ_PROVISION as described, then I'll definitely have XFS support
ready for you to test whenever dm-thinp is ready to go.

I can't speak for other filesystems, but I suspect the only other one
we care about is ext4.  btrfs and f2fs don't need dm-thinp, and there
aren't any other filesystems that are used in production on top of
dm-thinp, so I think only XFS and ext4 matter at this point in time.

I suspect that ext4 would be fairly easy to add support for as well.
ext4 has a lot more fixed-place metadata than XFS has, so much more
of its metadata is covered by mkfs-time provisioning.  Limiting
dynamic metadata to specific fully provisioned block groups and
provisioning new block groups for metadata when they are near full
would be equivalent to how I plan to provision metadata space in XFS.
Hence the implementation for ext4 looks to be broadly similar in
scope and complexity to XFS....

-Dave.
-- 
Dave Chinner
david@fromorbit.com