LKML Archive mirror
 help / color / mirror / Atom feed
* [PATCH 1/3] rust: add untrusted data abstraction
       [not found] <20240913112643.542914-1-benno.lossin@proton.me>
@ 2024-09-13 11:26 ` Benno Lossin
  2024-09-13 13:41   ` Finn Behrens
  2024-09-13 15:33   ` Simona Vetter
  2024-09-13 11:27 ` [RFC PATCH 2/3] WIP: rust: fs: mark data returned by inodes untrusted Benno Lossin
  2024-09-13 11:27 ` [RFC PATCH 3/3] WIP: rust: tarfs: use untrusted data API Benno Lossin
  2 siblings, 2 replies; 18+ messages in thread
From: Benno Lossin @ 2024-09-13 11:26 UTC (permalink / raw)
  To: Greg KH, Miguel Ojeda, Alex Gaynor, Wedson Almeida Filho,
	Boqun Feng, Gary Guo, Björn Roy Baron, Benno Lossin,
	Andreas Hindborg, Alice Ryhl, Trevor Gross
  Cc: rust-for-linux, linux-kernel

When reading data from userspace, hardware or other external untrusted
sources, the data must be validated before it is used for logic within
the kernel. This abstraction provides a generic newtype wrapper that
prevents direct access to the inner type. It does allow access through
the `untrusted()` method, but that should be a telltale sign to
reviewers that untrusted data is being used at that point.

Any API that reads data from an untrusted source should return
`Untrusted<T>` where `T` is the type of the underlying untrusted data.
This generic allows other abstractions to return their custom type
wrapped by `Untrusted`, signaling to the caller that the data must be
validated before use. This allows those abstractions to be used both in
a trusted and untrusted manner, increasing their generality.
Additionally, using the arbitrary self types feature, APIs can be
designed to explicitly read untrusted data:

    impl MyCustomDataSource {
        pub fn read(self: &Untrusted<Self>) -> &Untrusted<[u8]>;
    }

To validate data, the `Validator` trait is introduced, readers of
untrusted data should implement it for their type and move all of their
validation and parsing logic into its `validate` function. That way
reviewers and later contributors have a central location to consult for
data validation.

Signed-off-by: Benno Lossin <benno.lossin@proton.me>
---
 rust/kernel/types.rs | 248 ++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 247 insertions(+), 1 deletion(-)

diff --git a/rust/kernel/types.rs b/rust/kernel/types.rs
index 9e7ca066355c..20ef04b1b417 100644
--- a/rust/kernel/types.rs
+++ b/rust/kernel/types.rs
@@ -8,9 +8,10 @@
     cell::UnsafeCell,
     marker::{PhantomData, PhantomPinned},
     mem::{ManuallyDrop, MaybeUninit},
-    ops::{Deref, DerefMut},
+    ops::{Deref, DerefMut, Index},
     pin::Pin,
     ptr::NonNull,
+    slice::SliceIndex,
 };
 
 /// Used to transfer ownership to and from foreign (non-Rust) languages.
@@ -532,3 +533,248 @@ unsafe impl AsBytes for str {}
 // does not have any uninitialized portions either.
 unsafe impl<T: AsBytes> AsBytes for [T] {}
 unsafe impl<T: AsBytes, const N: usize> AsBytes for [T; N] {}
+
+/// Untrusted data of type `T`.
+///
+/// When reading data from userspace, hardware or other external untrusted sources, the data must
+/// be validated before it is used for logic within the kernel. To do so, the [`validate()`]
+/// function exists and uses the [`Validator`] trait. For raw bytes validation there also is the
+/// [`validate_bytes()`] function.
+///
+///
+/// [`validate()`]: Self::validate
+/// [`validate_bytes()`]: Self::validate_bytes
+#[repr(transparent)]
+pub struct Untrusted<T: ?Sized>(T);
+
+impl<T: ?Sized> Untrusted<T> {
+    /// Marks the given value as untrusted.
+    ///
+    /// # Examples
+    ///
+    /// ```
+    /// # mod bindings { unsafe fn read_foo_info() -> [u8; 4] { todo!() } };
+    /// fn read_foo_info() -> Untrusted<[u8; 4]> {
+    ///     // SAFETY: just an FFI call without preconditions.
+    ///     Untrusted::new_untrusted(unsafe { bindings::read_foo_info() })
+    /// }
+    /// ```
+    pub fn new_untrusted(value: T) -> Self
+    where
+        T: Sized,
+    {
+        Self(value)
+    }
+
+    /// Marks the value behind the reference as untrusted.
+    ///
+    /// # Examples
+    ///
+    /// In this imaginary example there exists the `foo_hardware` struct on the C side, as well as
+    /// a `foo_hardware_read` function that reads some data directly from the hardware.
+    /// ```
+    /// # mod bindings { struct foo_hardware; unsafe fn foo_hardware_read(_foo: *mut Foo, _len: &mut usize) -> *mut u8 { todo!() } };
+    /// struct Foo(Opaque<bindings::foo_hardware>);
+    ///
+    /// impl Foo {
+    ///     pub fn read(&mut self, mut len: usize) -> Result<&Untrusted<[u8]>> {
+    ///         // SAFETY: just an FFI call without preconditions.
+    ///         let data: *mut u8 = unsafe { bindings::foo_hardware_read(self.0.get(), &mut len) };
+    ///         let data = error::from_err_ptr(data)?;
+    ///         let data = ptr::slice_from_raw_parts(data, len);
+    ///         // SAFETY: `data` returned by `foo_hardware_read` is valid for reads as long as the
+    ///         // `foo_hardware` object exists. That function updated the
+    ///         let data = unsafe { &*data };
+    ///         Ok(Untrusted::new_untrusted_ref(data))
+    ///     }
+    /// }
+    /// ```
+    pub fn new_untrusted_ref<'a>(value: &'a T) -> &'a Self {
+        let ptr: *const T = value;
+        // CAST: `Self` is `repr(transparent)` and contains a `T`.
+        let ptr = ptr as *const Self;
+        // SAFETY: `ptr` came from a shared reference valid for `'a`.
+        unsafe { &*ptr }
+    }
+
+    /// Gives direct access to the underlying untrusted data.
+    ///
+    /// Be careful when accessing the data, as it is untrusted and still needs to be verified! To
+    /// do so use [`validate()`].
+    ///
+    /// [`validate()`]: Self::validate
+    pub fn untrusted(&self) -> &T {
+        &self.0
+    }
+
+    /// Gives direct access to the underlying untrusted data.
+    ///
+    /// Be careful when accessing the data, as it is untrusted and still needs to be verified! To
+    /// do so use [`validate()`].
+    ///
+    /// [`validate()`]: Self::validate
+    pub fn untrusted_mut(&mut self) -> &mut T {
+        &mut self.0
+    }
+
+    /// Unwraps the data and removes the untrusted marking.
+    ///
+    /// Be careful when accessing the data, as it is untrusted and still needs to be verified! To
+    /// do so use [`validate()`].
+    ///
+    /// [`validate()`]: Self::validate
+    pub fn into_inner_untrusted(self) -> T
+    where
+        T: Sized,
+    {
+        self.0
+    }
+
+    /// Dereferences the underlying untrusted data.
+    ///
+    /// Often, untrusted data is not directly exposed as bytes, but rather as a reader that can
+    /// give you access to raw bytes. When such a type implements [`Deref`], then this function
+    /// allows you to get access to the underlying data.
+    pub fn deref(&self) -> &Untrusted<<T as Deref>::Target>
+    where
+        T: Deref,
+    {
+        Untrusted::new_untrusted_ref(self.0.deref())
+    }
+
+    /// Validates and parses the untrusted data.
+    ///
+    /// See the [`Validator`] trait on how to implement it.
+    pub fn validate<V: Validator<Input = T>>(&self) -> Result<V::Output, V::Err> {
+        V::validate(self)
+    }
+}
+
+impl Untrusted<[u8]> {
+    /// Validate the given bytes directly.
+    ///
+    /// This is a convenience method to not have to implement the [`Validator`] trait to be able to
+    /// just parse some bytes. If the bytes that you are validating have some structure and/or you
+    /// will parse it into a `struct` or other rust type, then it is very much recommended to use
+    /// the [`Validator`] trait and the [`validate()`] function instead.
+    ///
+    /// # Examples
+    ///
+    /// ```
+    /// # fn get_untrusted_data() -> &'static Untrusted<[u8]> { &[0; 8] }
+    ///
+    /// let data: &Untrusted<[u8]> = get_untrusted_data();
+    /// let data: Result<&[u8], ()> = data.validate_bytes::<()>(|untrusted| {
+    ///     if untrusted.len() != 2 {
+    ///         return Err(());
+    ///     }
+    ///     if untrusted[0] & 0xf0 != 0 {
+    ///         return Err(());
+    ///     }
+    ///     if untrusted[1] >= 100 {
+    ///         return Err(());
+    ///     }
+    ///     Ok(())
+    /// });
+    /// match data {
+    ///     Ok(data) => pr_info!("successfully validated the data: {data}"),
+    ///     Err(()) => pr_info!("read faulty data from hardware!"),
+    /// }
+    /// ```
+    ///
+    /// [`validate()`]: Self::validate
+    pub fn validate_bytes<E>(
+        &self,
+        validator: impl FnOnce(&[u8]) -> Result<(), E>,
+    ) -> Result<&[u8], E> {
+        let raw_data = self.untrusted();
+        validator(raw_data).map(|()| raw_data)
+    }
+}
+
+impl<T, I> Index<I> for Untrusted<[T]>
+where
+    I: SliceIndex<[T]>,
+{
+    type Output = Untrusted<I::Output>;
+
+    fn index(&self, index: I) -> &Self::Output {
+        Untrusted::new_untrusted_ref(self.0.index(index))
+    }
+}
+
+impl<T> Untrusted<[T]> {
+    /// Gets the length of the underlying untrusted slice.
+    pub fn len(&self) -> usize {
+        self.0.len()
+    }
+}
+
+/// Validates untrusted data.
+///
+/// # Examples
+///
+/// ## Using an API returning untrusted data
+///
+/// Create the type of the data that you want to parse:
+///
+/// ```
+/// pub struct FooData {
+///     data: [u8; 4],
+/// }
+/// ```
+///
+/// Then implement this trait:
+///
+/// ```
+/// use kernel::types::{Untrusted, Validator};
+/// # pub struct FooData {
+/// #     data: [u8; 4],
+/// # }
+/// impl Validator for FooData {
+///     type Input = [u8];
+///     type Output = FooData;
+///     type Err = Error;
+///
+///     fn validate(untrusted: &Untrusted<Self::Input>) -> Result<Self::Output, Self::Err> {
+///         let untrusted = untrusted.untrusted();
+///         let untrusted = <[u8; 4]>::try_from(untrusted);
+///         for byte in &untrusted {
+///             if byte & 0xf0 != 0 {
+///                 return Err(());
+///             }
+///         }
+///         Ok(FooData { data: untrusted })
+///     }
+/// }
+/// ```
+///
+/// And then use the API that returns untrusted data:
+///
+/// ```ignore
+/// let result = get_untrusted_data().validate::<FooData>();
+/// ```
+///
+/// ## Creating an API returning untrusted data
+///
+/// In your API instead of just returning the untrusted data, wrap it in [`Untrusted<T>`]:
+///
+/// ```
+/// pub fn get_untrusted_data(&self) -> &Untrusted<[u8]> {
+///     todo!()
+/// }
+/// ```
+pub trait Validator {
+    /// Type of the input data that is untrusted.
+    type Input: ?Sized;
+    /// Type of the validated data.
+    type Output;
+    /// Validation error.
+    type Err;
+
+    /// Validate the given untrusted data and parse it into the output type.
+    ///
+    /// When implementing this function, you can use [`Untrusted::untrusted()`] to get access to
+    /// the raw untrusted data.
+    fn validate(untrusted: &Untrusted<Self::Input>) -> Result<Self::Output, Self::Err>;
+}
-- 
2.46.0



^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [RFC PATCH 2/3] WIP: rust: fs: mark data returned by inodes untrusted
       [not found] <20240913112643.542914-1-benno.lossin@proton.me>
  2024-09-13 11:26 ` [PATCH 1/3] rust: add untrusted data abstraction Benno Lossin
@ 2024-09-13 11:27 ` Benno Lossin
  2024-09-13 11:27 ` [RFC PATCH 3/3] WIP: rust: tarfs: use untrusted data API Benno Lossin
  2 siblings, 0 replies; 18+ messages in thread
From: Benno Lossin @ 2024-09-13 11:27 UTC (permalink / raw)
  To: Greg KH, Miguel Ojeda, Alex Gaynor, Wedson Almeida Filho,
	Boqun Feng, Gary Guo, Björn Roy Baron, Benno Lossin,
	Andreas Hindborg, Alice Ryhl, Trevor Gross
  Cc: rust-for-linux, linux-kernel

This is only meant to show how one would use the untrusted data API, not
to be actually used in production, as I just added `Untrusted` where I
thought it might fit. I have no idea if this is actually the correct
place where the untrusted data path starts.
---
 rust/kernel/fs/inode.rs | 33 +++++++++++++++++++++------------
 1 file changed, 21 insertions(+), 12 deletions(-)

diff --git a/rust/kernel/fs/inode.rs b/rust/kernel/fs/inode.rs
index b2b7d000080e..349f7f1ab420 100644
--- a/rust/kernel/fs/inode.rs
+++ b/rust/kernel/fs/inode.rs
@@ -11,7 +11,9 @@
     PageOffset, UnspecifiedFS,
 };
 use crate::error::{code::*, from_err_ptr, Result};
-use crate::types::{ARef, AlwaysRefCounted, Either, ForeignOwnable, Lockable, Locked, Opaque};
+use crate::types::{
+    ARef, AlwaysRefCounted, Either, ForeignOwnable, Lockable, Locked, Opaque, Untrusted,
+};
 use crate::{
     bindings, block, build_error, container_of, folio, folio::Folio, mem_cache::MemCache,
     str::CStr, str::CString, time::Timespec,
@@ -133,21 +135,24 @@ pub unsafe fn mapper(&self) -> Mapper<T> {
     pub unsafe fn mapped_folio(
         &self,
         offset: Offset,
-    ) -> Result<folio::Mapped<'_, folio::PageCache<T>>> {
+    ) -> Result<Untrusted<folio::Mapped<'_, folio::PageCache<T>>>> {
         let page_index = offset >> bindings::PAGE_SHIFT;
         let page_offset = offset & ((bindings::PAGE_SIZE - 1) as Offset);
         let folio = self.read_mapping_folio(page_index.try_into()?)?;
+        let folio = folio.into_inner_untrusted();
 
         // SAFETY: The safety requirements guarantee that there are no concurrent mutable mappings
         // of the folio.
-        unsafe { Folio::map_owned(folio, page_offset.try_into()?) }
+        Ok(Untrusted::new_untrusted(unsafe {
+            Folio::map_owned(folio, page_offset.try_into()?)?
+        }))
     }
 
     /// Returns the folio at the given page index.
     pub fn read_mapping_folio(
         &self,
         index: PageOffset,
-    ) -> Result<ARef<Folio<folio::PageCache<T>>>> {
+    ) -> Result<Untrusted<ARef<Folio<folio::PageCache<T>>>>> {
         let folio = from_err_ptr(unsafe {
             bindings::read_mapping_folio(
                 (*self.0.get()).i_mapping,
@@ -159,7 +164,7 @@ pub fn read_mapping_folio(
             .ok_or(EIO)?
             .cast::<Folio<folio::PageCache<T>>>();
         // SAFETY: The folio returned by read_mapping_folio has had its refcount incremented.
-        Ok(unsafe { ARef::from_raw(ptr) })
+        Ok(Untrusted::new_untrusted(unsafe { ARef::from_raw(ptr) }))
     }
 
     /// Iterate over the given range, one folio at a time.
@@ -171,7 +176,7 @@ pub unsafe fn for_each_page<U>(
         &self,
         first: Offset,
         len: Offset,
-        mut cb: impl FnMut(&[u8]) -> Result<Option<U>>,
+        mut cb: impl FnMut(&Untrusted<[u8]>) -> Result<Option<U>>,
     ) -> Result<Option<U>> {
         if first >= self.size() {
             return Ok(None);
@@ -183,8 +188,9 @@ pub unsafe fn for_each_page<U>(
         while remain > 0 {
             // SAFETY: The safety requirements of this function satisfy those of `mapped_folio`.
             let data = unsafe { self.mapped_folio(next)? };
+            let data = data.into_inner_untrusted();
             let avail = cmp::min(data.len(), remain.try_into().unwrap_or(usize::MAX));
-            let ret = cb(&data[..avail])?;
+            let ret = cb(Untrusted::new_untrusted_ref(&data[..avail]))?;
             if ret.is_some() {
                 return Ok(ret);
             }
@@ -297,7 +303,7 @@ impl<T: FileSystem + ?Sized, U: Deref<Target = INode<T>>> Locked<U, ReadSem> {
     pub fn mapped_folio<'a>(
         &'a self,
         offset: Offset,
-    ) -> Result<folio::Mapped<'a, folio::PageCache<T>>>
+    ) -> Result<Untrusted<folio::Mapped<'a, folio::PageCache<T>>>>
     where
         T: 'a,
     {
@@ -315,7 +321,7 @@ pub fn for_each_page<V>(
         &self,
         first: Offset,
         len: Offset,
-        cb: impl FnMut(&[u8]) -> Result<Option<V>>,
+        cb: impl FnMut(&Untrusted<[u8]>) -> Result<Option<V>>,
     ) -> Result<Option<V>> {
         if T::IS_UNSPECIFIED {
             build_error!("unspecified file systems cannot safely map folios");
@@ -750,14 +756,17 @@ pub fn split_at(mut self, offset: Offset) -> (Self, Self) {
     }
 
     /// Returns a mapped folio at the given offset.
-    pub fn mapped_folio(&self, offset: Offset) -> Result<folio::Mapped<'_, folio::PageCache<T>>> {
+    pub fn mapped_folio(
+        &self,
+        offset: Offset,
+    ) -> Result<Untrusted<folio::Mapped<'_, folio::PageCache<T>>>> {
         if offset < self.begin || offset >= self.end {
             return Err(ERANGE);
         }
 
         // SAFETY: By the type invariant, there are no other mutable mappings of the folio.
         let mut map = unsafe { self.inode.mapped_folio(offset) }?;
-        map.cap_len((self.end - offset).try_into()?);
+        map.untrusted_mut().cap_len((self.end - offset).try_into()?);
         Ok(map)
     }
 
@@ -766,7 +775,7 @@ pub fn for_each_page<U>(
         &self,
         first: Offset,
         len: Offset,
-        cb: impl FnMut(&[u8]) -> Result<Option<U>>,
+        cb: impl FnMut(&Untrusted<[u8]>) -> Result<Option<U>>,
     ) -> Result<Option<U>> {
         if first < self.begin || first >= self.end {
             return Err(ERANGE);
-- 
2.46.0



^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [RFC PATCH 3/3] WIP: rust: tarfs: use untrusted data API
       [not found] <20240913112643.542914-1-benno.lossin@proton.me>
  2024-09-13 11:26 ` [PATCH 1/3] rust: add untrusted data abstraction Benno Lossin
  2024-09-13 11:27 ` [RFC PATCH 2/3] WIP: rust: fs: mark data returned by inodes untrusted Benno Lossin
@ 2024-09-13 11:27 ` Benno Lossin
  2 siblings, 0 replies; 18+ messages in thread
From: Benno Lossin @ 2024-09-13 11:27 UTC (permalink / raw)
  To: Greg KH, Miguel Ojeda, Alex Gaynor, Wedson Almeida Filho,
	Boqun Feng, Gary Guo, Björn Roy Baron, Benno Lossin,
	Andreas Hindborg, Alice Ryhl, Trevor Gross
  Cc: linux-kernel, rust-for-linux

As an example here is how tarfs could use the untrusted data API. There
are two calls to `untrusted()` outside of `Validator` implementations. I
don't know if the names are used for some logic at a later point, and I
didn't want to mark the name as untrusted, because then I would have to
change more of the vfs API. I think if the names are not used for any
logic, then they can just stay as `Untrusted<CString>`.
---
 fs/tarfs/defs.rs |   1 +
 fs/tarfs/tar.rs  | 143 ++++++++++++++++++++++++++++++++---------------
 2 files changed, 99 insertions(+), 45 deletions(-)

diff --git a/fs/tarfs/defs.rs b/fs/tarfs/defs.rs
index 7481b75aaab2..db22e3299fef 100644
--- a/fs/tarfs/defs.rs
+++ b/fs/tarfs/defs.rs
@@ -70,6 +70,7 @@ pub struct DirEntry {
 
     /// The super-block of a tarfs instance.
     #[repr(C)]
+    #[derive(Clone)]
     pub struct Header {
         /// The offset to the beginning of the inode-table.
         pub inode_table_offset: LE<u64>,
diff --git a/fs/tarfs/tar.rs b/fs/tarfs/tar.rs
index a3f6e468e566..8ca255306c8b 100644
--- a/fs/tarfs/tar.rs
+++ b/fs/tarfs/tar.rs
@@ -9,7 +9,14 @@
     inode::Type, iomap, sb, sb::SuperBlock, Offset, Stat,
 };
 use kernel::types::{ARef, Either, FromBytes, Locked};
-use kernel::{c_str, prelude::*, str::CString, user};
+use kernel::{
+    c_str,
+    prelude::*,
+    str::CString,
+    types::{Untrusted, Validator},
+    user,
+};
+use std::fs::DirEntry;
 
 pub mod defs;
 
@@ -29,7 +36,8 @@
 
 static_assert!(SECTORS_PER_BLOCK > 0);
 
-struct INodeData {
+#[allow(missing_docs)]
+pub struct INodeData {
     offset: u64,
     flags: u8,
 }
@@ -41,26 +49,13 @@ struct TarFs {
     mapper: inode::Mapper,
 }
 
-impl TarFs {
-    fn iget(sb: &SuperBlock<Self>, ino: u64) -> Result<ARef<INode<Self>>> {
-        // Check that the inode number is valid.
-        let h = sb.data();
-        if ino == 0 || ino > h.inode_count {
-            return Err(ENOENT);
-        }
+impl Validator for Inode {
+    type Input = [u8];
+    type Output = inode::Params<INodeData>;
+    type Err = Error;
 
-        // Create an inode or find an existing (cached) one.
-        let mut inode = match sb.get_or_create_inode(ino)? {
-            Either::Left(existing) => return Ok(existing),
-            Either::Right(new) => new,
-        };
-
-        static_assert!((TARFS_BSIZE as usize) % size_of::<Inode>() == 0);
-
-        // Load inode details from storage.
-        let offset = h.inode_table_offset + (ino - 1) * u64::try_from(size_of::<Inode>())?;
-        let b = h.mapper.mapped_folio(offset.try_into()?)?;
-        let idata = Inode::from_bytes(&b, 0).ok_or(EIO)?;
+    fn validate(untrusted: &Untrusted<Self::Input>) -> Result<Self::Output, Self::Err> {
+        let idata = Inode::from_bytes(untrusted.untrusted(), 0).ok_or(EIO)?;
 
         let mode = idata.mode.value();
 
@@ -69,36 +64,21 @@ fn iget(sb: &SuperBlock<Self>, ino: u64) -> Result<ARef<INode<Self>>> {
             return Err(ENOENT);
         }
 
-        const DIR_FOPS: file::Ops<TarFs> = file::Ops::new::<TarFs>();
-        const DIR_IOPS: inode::Ops<TarFs> = inode::Ops::new::<TarFs>();
-        const FILE_AOPS: address_space::Ops<TarFs> = iomap::ro_aops::<TarFs>();
-
         let size = idata.size.value();
         let doffset = idata.offset.value();
         let secs = u64::from(idata.lmtime.value()) | (u64::from(idata.hmtime & 0xf) << 32);
         let ts = kernel::time::Timespec::new(secs, 0)?;
         let typ = match mode & fs::mode::S_IFMT {
-            fs::mode::S_IFREG => {
-                inode
-                    .set_fops(file::Ops::generic_ro_file())
-                    .set_aops(FILE_AOPS);
-                Type::Reg
-            }
-            fs::mode::S_IFDIR => {
-                inode.set_iops(DIR_IOPS).set_fops(DIR_FOPS);
-                Type::Dir
-            }
-            fs::mode::S_IFLNK => {
-                inode.set_iops(inode::Ops::simple_symlink_inode());
-                Type::Lnk(Some(Self::get_link(sb, doffset, size)?))
-            }
+            fs::mode::S_IFREG => Type::Reg,
+            fs::mode::S_IFDIR => Type::Dir,
+            fs::mode::S_IFLNK => Type::Lnk(None),
             fs::mode::S_IFSOCK => Type::Sock,
             fs::mode::S_IFIFO => Type::Fifo,
             fs::mode::S_IFCHR => Type::Chr((doffset >> 32) as u32, doffset as u32),
             fs::mode::S_IFBLK => Type::Blk((doffset >> 32) as u32, doffset as u32),
             _ => return Err(ENOENT),
         };
-        inode.init(inode::Params {
+        Ok(inode::Params {
             typ,
             mode: mode & 0o777,
             size: size.try_into()?,
@@ -115,13 +95,86 @@ fn iget(sb: &SuperBlock<Self>, ino: u64) -> Result<ARef<INode<Self>>> {
             },
         })
     }
+}
+
+impl Validator for Header {
+    type Input = [u8];
+    type Output = Self;
+    type Err = Error;
+
+    fn validate(
+        untrusted: &Untrusted<Self::Input>,
+    ) -> core::result::Result<Self::Output, Self::Err> {
+        Header::from_bytes(untrusted.untrusted(), 0)
+            .ok_or(EIO)
+            .cloned()
+    }
+}
+
+impl Validator for DirEntry {
+    type Input = [u8];
+    type Output = Self;
+    type Err = Error;
+
+    fn validate(
+        untrusted: &Untrusted<Self::Input>,
+    ) -> core::result::Result<Self::Output, Self::Err> {
+        let raw = DirEntry::from_bytes_to_slice(untrusted.untrusted()).ok_or(EIO)?;
+        // TODO: actually verify that the `raw` entry is correct.
+        Ok(raw)
+    }
+}
+
+impl TarFs {
+    fn iget(sb: &SuperBlock<Self>, ino: u64) -> Result<ARef<INode<Self>>> {
+        // Check that the inode number is valid.
+        let h = sb.data();
+        if ino == 0 || ino > h.inode_count {
+            return Err(ENOENT);
+        }
+
+        // Create an inode or find an existing (cached) one.
+        let mut inode = match sb.get_or_create_inode(ino)? {
+            Either::Left(existing) => return Ok(existing),
+            Either::Right(new) => new,
+        };
+
+        static_assert!((TARFS_BSIZE as usize) % size_of::<Inode>() == 0);
+
+        // Load inode details from storage.
+        let offset = h.inode_table_offset + (ino - 1) * u64::try_from(size_of::<Inode>())?;
+        let b = h.mapper.mapped_folio(offset.try_into()?)?;
+
+        const DIR_FOPS: file::Ops<TarFs> = file::Ops::new::<TarFs>();
+        const DIR_IOPS: inode::Ops<TarFs> = inode::Ops::new::<TarFs>();
+        const FILE_AOPS: address_space::Ops<TarFs> = iomap::ro_aops::<TarFs>();
+
+        let mut params = b.deref().validate::<Inode>()?;
+        match &params.typ {
+            Type::Reg => {
+                inode
+                    .set_fops(file::Ops::generic_ro_file())
+                    .set_aops(FILE_AOPS);
+            }
+            Type::Dir => {
+                inode.set_iops(DIR_IOPS).set_fops(DIR_FOPS);
+            }
+            Type::Lnk(_) => {
+                inode.set_iops(inode::Ops::simple_symlink_inode());
+                let doffset = params.value.offset;
+                params.typ = Type::Lnk(Some(Self::get_link(sb, doffset, params.size as u64)?));
+            }
+            _ => {}
+        }
+        inode.init(params)
+    }
 
     fn name_eq(sb: &SuperBlock<Self>, mut name: &[u8], offset: u64) -> Result<bool> {
         let ret =
             sb.data()
                 .mapper
                 .for_each_page(offset as Offset, name.len().try_into()?, |data| {
-                    if data != &name[..data.len()] {
+                    if data.untrusted() != &name[..data.len()] {
                         return Ok(Some(()));
                     }
                     name = &name[data.len()..];
@@ -135,7 +188,7 @@ fn read_name(sb: &SuperBlock<Self>, name: &mut [u8], offset: u64) -> Result {
         sb.data()
             .mapper
             .for_each_page(offset as Offset, name.len().try_into()?, |data| {
-                name[copy_to..][..data.len()].copy_from_slice(data);
+                name[copy_to..][..data.len()].copy_from_slice(data.untrusted());
                 copy_to += data.len();
                 Ok(None::<()>)
             })?;
@@ -179,7 +232,7 @@ fn fill_super(
         let tarfs = {
             let offset = (scount - 1) * SECTOR_SIZE;
             let mapped = mapper.mapped_folio(offset.try_into()?)?;
-            let hdr = Header::from_bytes(&mapped, 0).ok_or(EIO)?;
+            let hdr = mapped.deref().validate::<Header>()?;
             let inode_table_offset = hdr.inode_table_offset.value();
             let inode_count = hdr.inode_count.value();
             drop(mapped);
@@ -316,7 +369,7 @@ fn lookup(
             parent.data().offset.try_into()?,
             parent.size(),
             |data| {
-                for e in DirEntry::from_bytes_to_slice(data).ok_or(EIO)? {
+                for e in data.validate::<DirEntry>()? {
                     if Self::name_eq(sb, name, e.name_offset.value())? {
                         return Ok(Some(Self::iget(sb, e.ino.value())?));
                     }
@@ -368,7 +421,7 @@ fn read_dir(
             inode.data().offset as i64 + pos,
             inode.size() - pos,
             |data| {
-                for e in DirEntry::from_bytes_to_slice(data).ok_or(EIO)? {
+                for e in data.validate::<DirEntry>()? {
                     let name_len = usize::try_from(e.name_len.value())?;
                     if name_len > name.len() {
                         name.resize(name_len, 0, GFP_NOFS)?;
-- 
2.46.0



^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/3] rust: add untrusted data abstraction
  2024-09-13 11:26 ` [PATCH 1/3] rust: add untrusted data abstraction Benno Lossin
@ 2024-09-13 13:41   ` Finn Behrens
  2024-09-13 13:47     ` Benno Lossin
  2024-09-13 15:33   ` Simona Vetter
  1 sibling, 1 reply; 18+ messages in thread
From: Finn Behrens @ 2024-09-13 13:41 UTC (permalink / raw)
  To: Benno Lossin
  Cc: Greg KH, Miguel Ojeda, Alex Gaynor, Wedson Almeida Filho,
	Boqun Feng, Gary Guo, Björn Roy Baron, Andreas Hindborg,
	Alice Ryhl, Trevor Gross, rust-for-linux, linux-kernel



On 13 Sep 2024, at 13:26, Benno Lossin wrote:

> When reading data from userspace, hardware or other external untrusted
> sources, the data must be validated before it is used for logic within
> the kernel. This abstraction provides a generic newtype wrapper that
> prevents direct access to the inner type. It does allow access through
> the `untrusted()` method, but that should be a telltale sign to
> reviewers that untrusted data is being used at that point.
>
> Any API that reads data from an untrusted source should return
> `Untrusted<T>` where `T` is the type of the underlying untrusted data.
> This generic allows other abstractions to return their custom type
> wrapped by `Untrusted`, signaling to the caller that the data must be
> validated before use. This allows those abstractions to be used both in
> a trusted and untrusted manner, increasing their generality.
> Additionally, using the arbitrary self types feature, APIs can be
> designed to explicitly read untrusted data:
>
>     impl MyCustomDataSource {
>         pub fn read(self: &Untrusted<Self>) -> &Untrusted<[u8]>;
>     }
>
> To validate data, the `Validator` trait is introduced, readers of
> untrusted data should implement it for their type and move all of their
> validation and parsing logic into its `validate` function. That way
> reviewers and later contributors have a central location to consult for
> data validation.
>
> Signed-off-by: Benno Lossin <benno.lossin@proton.me>
> ---
>  rust/kernel/types.rs | 248 ++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 247 insertions(+), 1 deletion(-)
>
> diff --git a/rust/kernel/types.rs b/rust/kernel/types.rs
> index 9e7ca066355c..20ef04b1b417 100644
> --- a/rust/kernel/types.rs
> +++ b/rust/kernel/types.rs
> +
> +/// Validates untrusted data.
> +///
> +/// # Examples
> +///
> +/// ## Using an API returning untrusted data
> +///
> +/// Create the type of the data that you want to parse:
> +///
> +/// ```
> +/// pub struct FooData {
> +///     data: [u8; 4],
> +/// }
> +/// ```
> +///
> +/// Then implement this trait:
> +///
> +/// ```
> +/// use kernel::types::{Untrusted, Validator};
> +/// # pub struct FooData {
> +/// #     data: [u8; 4],
> +/// # }
> +/// impl Validator for FooData {
> +///     type Input = [u8];
> +///     type Output = FooData;
> +///     type Err = Error;
> +///
> +///     fn validate(untrusted: &Untrusted<Self::Input>) -> Result<Self::Output, Self::Err> {
> +///         let untrusted = untrusted.untrusted();
> +///         let untrusted = <[u8; 4]>::try_from(untrusted);
> +///         for byte in &untrusted {
> +///             if byte & 0xf0 != 0 {
> +///                 return Err(());
> +///             }
> +///         }
> +///         Ok(FooData { data: untrusted })
> +///     }
> +/// }
> +/// ```
> +///
> +/// And then use the API that returns untrusted data:
> +///
> +/// ```ignore
> +/// let result = get_untrusted_data().validate::<FooData>();
> +/// ```
> +///
> +/// ## Creating an API returning untrusted data
> +///
> +/// In your API instead of just returning the untrusted data, wrap it in [`Untrusted<T>`]:
> +///
> +/// ```
> +/// pub fn get_untrusted_data(&self) -> &Untrusted<[u8]> {
> +///     todo!()
> +/// }
> +/// ```
> +pub trait Validator {
> +    /// Type of the input data that is untrusted.
> +    type Input: ?Sized;

I would like to explore this trait with being generic over Input, instead of having Input as an associated type. Might be nicer to have Validators for different input types if the valid data is always the same?

> +    /// Type of the validated data.
> +    type Output;
> +    /// Validation error.
> +    type Err;
> +
> +    /// Validate the given untrusted data and parse it into the output type.
> +    ///
> +    /// When implementing this function, you can use [`Untrusted::untrusted()`] to get access to
> +    /// the raw untrusted data.
> +    fn validate(untrusted: &Untrusted<Self::Input>) -> Result<Self::Output, Self::Err>;
> +}
> -- 
> 2.46.0

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/3] rust: add untrusted data abstraction
  2024-09-13 13:41   ` Finn Behrens
@ 2024-09-13 13:47     ` Benno Lossin
  0 siblings, 0 replies; 18+ messages in thread
From: Benno Lossin @ 2024-09-13 13:47 UTC (permalink / raw)
  To: Finn Behrens
  Cc: Greg KH, Miguel Ojeda, Alex Gaynor, Wedson Almeida Filho,
	Boqun Feng, Gary Guo, Björn Roy Baron, Andreas Hindborg,
	Alice Ryhl, Trevor Gross, rust-for-linux, linux-kernel

On 13.09.24 15:41, Finn Behrens wrote:
> On 13 Sep 2024, at 13:26, Benno Lossin wrote:
>> +pub trait Validator {
>> +    /// Type of the input data that is untrusted.
>> +    type Input: ?Sized;
> 
> I would like to explore this trait with being generic over Input, instead of having Input as an associated type. Might be nicer to have Validators for different input types if the valid data is always the same?

I think I have changed my opinion on this from Kangrejos, I like the
idea. Then we would also be able to do something like this:

    impl<I> Validator<I> for Foo
    where
        I: Deref<Target = [u8]>,
    {
        /* ... */
    }

---
Cheers,
Benno

>> +    /// Type of the validated data.
>> +    type Output;
>> +    /// Validation error.
>> +    type Err;
>> +
>> +    /// Validate the given untrusted data and parse it into the output type.
>> +    ///
>> +    /// When implementing this function, you can use [`Untrusted::untrusted()`] to get access to
>> +    /// the raw untrusted data.
>> +    fn validate(untrusted: &Untrusted<Self::Input>) -> Result<Self::Output, Self::Err>;
>> +}
>> --
>> 2.46.0


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/3] rust: add untrusted data abstraction
  2024-09-13 11:26 ` [PATCH 1/3] rust: add untrusted data abstraction Benno Lossin
  2024-09-13 13:41   ` Finn Behrens
@ 2024-09-13 15:33   ` Simona Vetter
  2024-09-13 16:49     ` Benno Lossin
  1 sibling, 1 reply; 18+ messages in thread
From: Simona Vetter @ 2024-09-13 15:33 UTC (permalink / raw)
  To: Benno Lossin
  Cc: Greg KH, Miguel Ojeda, Alex Gaynor, Wedson Almeida Filho,
	Boqun Feng, Gary Guo, Björn Roy Baron, Andreas Hindborg,
	Alice Ryhl, Trevor Gross, rust-for-linux, linux-kernel

On Fri, Sep 13, 2024 at 11:26:57AM +0000, Benno Lossin wrote:
> When reading data from userspace, hardware or other external untrusted
> sources, the data must be validated before it is used for logic within
> the kernel. This abstraction provides a generic newtype wrapper that
> prevents direct access to the inner type. It does allow access through
> the `untrusted()` method, but that should be a telltale sign to
> reviewers that untrusted data is being used at that point.
> 
> Any API that reads data from an untrusted source should return
> `Untrusted<T>` where `T` is the type of the underlying untrusted data.
> This generic allows other abstractions to return their custom type
> wrapped by `Untrusted`, signaling to the caller that the data must be
> validated before use. This allows those abstractions to be used both in
> a trusted and untrusted manner, increasing their generality.
> Additionally, using the arbitrary self types feature, APIs can be
> designed to explicitly read untrusted data:
> 
>     impl MyCustomDataSource {
>         pub fn read(self: &Untrusted<Self>) -> &Untrusted<[u8]>;
>     }
> 
> To validate data, the `Validator` trait is introduced, readers of
> untrusted data should implement it for their type and move all of their
> validation and parsing logic into its `validate` function. That way
> reviewers and later contributors have a central location to consult for
> data validation.
> 
> Signed-off-by: Benno Lossin <benno.lossin@proton.me>

Yes please. Some thoughts below. Fair warning, my rust is basic at best,
I'll mix things up probably.

> ---
>  rust/kernel/types.rs | 248 ++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 247 insertions(+), 1 deletion(-)
> 
> diff --git a/rust/kernel/types.rs b/rust/kernel/types.rs
> index 9e7ca066355c..20ef04b1b417 100644
> --- a/rust/kernel/types.rs
> +++ b/rust/kernel/types.rs
> @@ -8,9 +8,10 @@
>      cell::UnsafeCell,
>      marker::{PhantomData, PhantomPinned},
>      mem::{ManuallyDrop, MaybeUninit},
> -    ops::{Deref, DerefMut},
> +    ops::{Deref, DerefMut, Index},
>      pin::Pin,
>      ptr::NonNull,
> +    slice::SliceIndex,
>  };
>  
>  /// Used to transfer ownership to and from foreign (non-Rust) languages.
> @@ -532,3 +533,248 @@ unsafe impl AsBytes for str {}
>  // does not have any uninitialized portions either.
>  unsafe impl<T: AsBytes> AsBytes for [T] {}
>  unsafe impl<T: AsBytes, const N: usize> AsBytes for [T; N] {}
> +
> +/// Untrusted data of type `T`.
> +///
> +/// When reading data from userspace, hardware or other external untrusted sources, the data must
> +/// be validated before it is used for logic within the kernel. To do so, the [`validate()`]
> +/// function exists and uses the [`Validator`] trait. For raw bytes validation there also is the
> +/// [`validate_bytes()`] function.
> +///
> +///
> +/// [`validate()`]: Self::validate
> +/// [`validate_bytes()`]: Self::validate_bytes
> +#[repr(transparent)]
> +pub struct Untrusted<T: ?Sized>(T);

I think we could make this a bit more strict and instead of encouraging
people to put all their validation code into a Validator<T>, force
them to. Which means assuming apis wrap all untrusted stuff in
Untrusted<T> reviewers only need to look at validate implementations and
nothing else.

If I'm not too wrong with my rust I think this would work with slitting
Untrusted<T> into two types:

- Untrusted<T> is just a box you can't access, and the type returned by
  apis for untrusted data. It doesn't have any of the functions except
  validate and new_untrusted so that you can't get at the data at all.

- UntrustedUnsafe<T> does have all the accessors needed for validation,
  but you can't construct that outside of this module. Maybe simplest to
  just nest this within the Untrusted<T> box.

- Untrusted::validate does the unboxing and passes a reference to
  UntrustedUnsafe<T> to Validator::validate, to guarantee that all the
  validation code is in there and nowhere else. Similar for
  validate_bytes.

Of course people can still bypass this easily, but it gets a bit harder to
do so, and easier for reviewers to verify everything.

> +
> +impl<T: ?Sized> Untrusted<T> {
> +    /// Marks the given value as untrusted.
> +    ///
> +    /// # Examples
> +    ///
> +    /// ```
> +    /// # mod bindings { unsafe fn read_foo_info() -> [u8; 4] { todo!() } };
> +    /// fn read_foo_info() -> Untrusted<[u8; 4]> {
> +    ///     // SAFETY: just an FFI call without preconditions.
> +    ///     Untrusted::new_untrusted(unsafe { bindings::read_foo_info() })
> +    /// }
> +    /// ```
> +    pub fn new_untrusted(value: T) -> Self
> +    where
> +        T: Sized,
> +    {
> +        Self(value)
> +    }
> +
> +    /// Marks the value behind the reference as untrusted.
> +    ///
> +    /// # Examples
> +    ///
> +    /// In this imaginary example there exists the `foo_hardware` struct on the C side, as well as
> +    /// a `foo_hardware_read` function that reads some data directly from the hardware.
> +    /// ```
> +    /// # mod bindings { struct foo_hardware; unsafe fn foo_hardware_read(_foo: *mut Foo, _len: &mut usize) -> *mut u8 { todo!() } };
> +    /// struct Foo(Opaque<bindings::foo_hardware>);
> +    ///
> +    /// impl Foo {
> +    ///     pub fn read(&mut self, mut len: usize) -> Result<&Untrusted<[u8]>> {
> +    ///         // SAFETY: just an FFI call without preconditions.
> +    ///         let data: *mut u8 = unsafe { bindings::foo_hardware_read(self.0.get(), &mut len) };
> +    ///         let data = error::from_err_ptr(data)?;
> +    ///         let data = ptr::slice_from_raw_parts(data, len);
> +    ///         // SAFETY: `data` returned by `foo_hardware_read` is valid for reads as long as the
> +    ///         // `foo_hardware` object exists. That function updated the
> +    ///         let data = unsafe { &*data };
> +    ///         Ok(Untrusted::new_untrusted_ref(data))
> +    ///     }
> +    /// }
> +    /// ```
> +    pub fn new_untrusted_ref<'a>(value: &'a T) -> &'a Self {
> +        let ptr: *const T = value;
> +        // CAST: `Self` is `repr(transparent)` and contains a `T`.
> +        let ptr = ptr as *const Self;
> +        // SAFETY: `ptr` came from a shared reference valid for `'a`.
> +        unsafe { &*ptr }
> +    }
> +
> +    /// Gives direct access to the underlying untrusted data.
> +    ///
> +    /// Be careful when accessing the data, as it is untrusted and still needs to be verified! To
> +    /// do so use [`validate()`].
> +    ///
> +    /// [`validate()`]: Self::validate
> +    pub fn untrusted(&self) -> &T {
> +        &self.0
> +    }
> +
> +    /// Gives direct access to the underlying untrusted data.
> +    ///
> +    /// Be careful when accessing the data, as it is untrusted and still needs to be verified! To
> +    /// do so use [`validate()`].
> +    ///
> +    /// [`validate()`]: Self::validate
> +    pub fn untrusted_mut(&mut self) -> &mut T {
> +        &mut self.0
> +    }
> +
> +    /// Unwraps the data and removes the untrusted marking.
> +    ///
> +    /// Be careful when accessing the data, as it is untrusted and still needs to be verified! To
> +    /// do so use [`validate()`].
> +    ///
> +    /// [`validate()`]: Self::validate
> +    pub fn into_inner_untrusted(self) -> T

I don't like the above two since they could be easily and accidentally
abused to leak untrusted data. And at least in your example I think
they're fallout from being a bit too eager with annotating data as
untrusted. The Folio and Mapped itself are just kernel structures, they
better be correct. So I don't think it's useful to annotate those as
untrusted and rely on Deref to also annotate the actual data as untrusted,
because it forces you to peak behind the box a few times.

Instead I think we only want to annotate the Folio Deref:

impl<S> Deref for Mapped<'_, S> {
    type Target = Untrusted<[u8]>;

Now there will be folios that we trust because they're kernel internal
data and not file cache, so we need more flexibility here. No idea how to
best make that happen, but in either case it's only the Deref data itself
we don't trust, not any of the structures around it. One example would be
gpu page tables. One way to implement this would be to make the Target
type a generic of the folio, or well an associated type of the existing
generics folio paramater:

pub struct Folio<S: FolioType>

pub trait FolioType {
	type Data;
}

impl<T> FolioType for PageCache<T> {
	type Data = Untrusted<[u8]>;
}

And gpu pagetables would use a something like this instead:

impl FolioType for GpuFolios {
	type Data = [struct GpuPTE];
}

All extremely untested.

Now I think there are cases you can't cover with just validate, where you
have multiple input data which needs to be cross-checked to ensure overall
validity. But I think that we can cover that by implementing a Validator
for tuples and making Untrusted a trait so that we can either Accept
Untrusted<(A, B)> or (Untrusted<A>, Untrusted<B>) interchangably when
calling validate().

At least with an hour of theorizing and your one example I couldn't come
up with anything else.

> +    where
> +        T: Sized,
> +    {
> +        self.0
> +    }
> +
> +    /// Dereferences the underlying untrusted data.
> +    ///
> +    /// Often, untrusted data is not directly exposed as bytes, but rather as a reader that can
> +    /// give you access to raw bytes. When such a type implements [`Deref`], then this function
> +    /// allows you to get access to the underlying data.
> +    pub fn deref(&self) -> &Untrusted<<T as Deref>::Target>
> +    where
> +        T: Deref,
> +    {
> +        Untrusted::new_untrusted_ref(self.0.deref())
> +    }
> +
> +    /// Validates and parses the untrusted data.
> +    ///
> +    /// See the [`Validator`] trait on how to implement it.
> +    pub fn validate<V: Validator<Input = T>>(&self) -> Result<V::Output, V::Err> {
> +        V::validate(self)
> +    }
> +}
> +
> +impl Untrusted<[u8]> {
> +    /// Validate the given bytes directly.
> +    ///
> +    /// This is a convenience method to not have to implement the [`Validator`] trait to be able to
> +    /// just parse some bytes. If the bytes that you are validating have some structure and/or you
> +    /// will parse it into a `struct` or other rust type, then it is very much recommended to use
> +    /// the [`Validator`] trait and the [`validate()`] function instead.
> +    ///
> +    /// # Examples
> +    ///
> +    /// ```
> +    /// # fn get_untrusted_data() -> &'static Untrusted<[u8]> { &[0; 8] }
> +    ///
> +    /// let data: &Untrusted<[u8]> = get_untrusted_data();
> +    /// let data: Result<&[u8], ()> = data.validate_bytes::<()>(|untrusted| {
> +    ///     if untrusted.len() != 2 {
> +    ///         return Err(());
> +    ///     }
> +    ///     if untrusted[0] & 0xf0 != 0 {
> +    ///         return Err(());
> +    ///     }
> +    ///     if untrusted[1] >= 100 {
> +    ///         return Err(());
> +    ///     }
> +    ///     Ok(())
> +    /// });
> +    /// match data {
> +    ///     Ok(data) => pr_info!("successfully validated the data: {data}"),
> +    ///     Err(()) => pr_info!("read faulty data from hardware!"),
> +    /// }
> +    /// ```
> +    ///
> +    /// [`validate()`]: Self::validate
> +    pub fn validate_bytes<E>(

I think this is a special case of a somewhat common in-place validation
pattern. For example in in complex syscall or ioctl we need to copy
structures from userspace anyway and doing yet another copy to put them
into a rust structure isn't great, instead we validate in-place. So I
think we need something like

impl Untrusted<T> {
	pub fn validate_inplace<E>(
		&self,
		validator: impl FnOnce(&T) -> Result<(), E>,
		) -> Result <&T, E> {
		...
	}
}

Eventually we might want a _mut version of this too, because often uapi
extensions means we need to correctly fill out default values for
extensions when being called by old userspace, and that requires
mutability. And it really is often an integral part of input validation.

Also I think we might need an Iterator for Untrusted<I: IntoIterator> so
that we can validate Untrusted<[T]> and things like that with standard map
and collect and do it all inplace.

> +        &self,
> +        validator: impl FnOnce(&[u8]) -> Result<(), E>,
> +    ) -> Result<&[u8], E> {
> +        let raw_data = self.untrusted();
> +        validator(raw_data).map(|()| raw_data)
> +    }
> +}
> +
> +impl<T, I> Index<I> for Untrusted<[T]>
> +where
> +    I: SliceIndex<[T]>,
> +{
> +    type Output = Untrusted<I::Output>;
> +
> +    fn index(&self, index: I) -> &Self::Output {
> +        Untrusted::new_untrusted_ref(self.0.index(index))
> +    }
> +}
> +
> +impl<T> Untrusted<[T]> {
> +    /// Gets the length of the underlying untrusted slice.
> +    pub fn len(&self) -> usize {
> +        self.0.len()
> +    }
> +}
> +
> +/// Validates untrusted data.
> +///
> +/// # Examples
> +///
> +/// ## Using an API returning untrusted data
> +///
> +/// Create the type of the data that you want to parse:
> +///
> +/// ```
> +/// pub struct FooData {
> +///     data: [u8; 4],
> +/// }
> +/// ```
> +///
> +/// Then implement this trait:
> +///
> +/// ```
> +/// use kernel::types::{Untrusted, Validator};
> +/// # pub struct FooData {
> +/// #     data: [u8; 4],
> +/// # }
> +/// impl Validator for FooData {
> +///     type Input = [u8];
> +///     type Output = FooData;
> +///     type Err = Error;
> +///
> +///     fn validate(untrusted: &Untrusted<Self::Input>) -> Result<Self::Output, Self::Err> {
> +///         let untrusted = untrusted.untrusted();
> +///         let untrusted = <[u8; 4]>::try_from(untrusted);
> +///         for byte in &untrusted {
> +///             if byte & 0xf0 != 0 {
> +///                 return Err(());
> +///             }
> +///         }
> +///         Ok(FooData { data: untrusted })
> +///     }
> +/// }
> +/// ```
> +///
> +/// And then use the API that returns untrusted data:
> +///
> +/// ```ignore
> +/// let result = get_untrusted_data().validate::<FooData>();
> +/// ```
> +///
> +/// ## Creating an API returning untrusted data
> +///
> +/// In your API instead of just returning the untrusted data, wrap it in [`Untrusted<T>`]:
> +///
> +/// ```
> +/// pub fn get_untrusted_data(&self) -> &Untrusted<[u8]> {
> +///     todo!()
> +/// }
> +/// ```
> +pub trait Validator {
> +    /// Type of the input data that is untrusted.
> +    type Input: ?Sized;
> +    /// Type of the validated data.
> +    type Output;

So I think the explicit Output makes sense if you have multiple different
untrusted input that validate to the same trusted structure, but I'm not
sure this makes sense as associated types. Instead I'd go with generics
and somethign like this:

pub trait Validator<Input: ?Sized> {
    type Err;

    fn validate(untrusted: &Untrusted<Input>) -> Result<Self, Self::Err>;
}

That means you can't implement validate for types from other modules
directly but need a newtype (I think at least, not sure). But I think
that's actually a good thing, since often that means you're validating
some generic state plus whatever your own code needs (like the
inode::Params<tarfs::INodeData> in your example), and both pieces need to
be consisted overall and not just individually (otherwise why does the
that other module not do the parsing for you). And so explicitly treating
the validated output as an explicit new type just makes sense to me. Plus
with derive(Deref) it's trivial to unbox after validation.

Cheers, Sima

> +    /// Validation error.
> +    type Err;
> +
> +    /// Validate the given untrusted data and parse it into the output type.
> +    ///
> +    /// When implementing this function, you can use [`Untrusted::untrusted()`] to get access to
> +    /// the raw untrusted data.
> +    fn validate(untrusted: &Untrusted<Self::Input>) -> Result<Self::Output, Self::Err>;
> +}
> -- 
> 2.46.0
> 
> 

-- 
Simona Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/3] rust: add untrusted data abstraction
  2024-09-13 15:33   ` Simona Vetter
@ 2024-09-13 16:49     ` Benno Lossin
  2024-09-16 15:49       ` Simona Vetter
  0 siblings, 1 reply; 18+ messages in thread
From: Benno Lossin @ 2024-09-13 16:49 UTC (permalink / raw)
  To: Simona Vetter
  Cc: Greg KH, Miguel Ojeda, Alex Gaynor, Wedson Almeida Filho,
	Boqun Feng, Gary Guo, Björn Roy Baron, Andreas Hindborg,
	Alice Ryhl, Trevor Gross, rust-for-linux, linux-kernel

On 13.09.24 17:33, Simona Vetter wrote:
> On Fri, Sep 13, 2024 at 11:26:57AM +0000, Benno Lossin wrote:
>>  /// Used to transfer ownership to and from foreign (non-Rust) languages.
>> @@ -532,3 +533,248 @@ unsafe impl AsBytes for str {}
>>  // does not have any uninitialized portions either.
>>  unsafe impl<T: AsBytes> AsBytes for [T] {}
>>  unsafe impl<T: AsBytes, const N: usize> AsBytes for [T; N] {}
>> +
>> +/// Untrusted data of type `T`.
>> +///
>> +/// When reading data from userspace, hardware or other external untrusted sources, the data must
>> +/// be validated before it is used for logic within the kernel. To do so, the [`validate()`]
>> +/// function exists and uses the [`Validator`] trait. For raw bytes validation there also is the
>> +/// [`validate_bytes()`] function.
>> +///
>> +///
>> +/// [`validate()`]: Self::validate
>> +/// [`validate_bytes()`]: Self::validate_bytes
>> +#[repr(transparent)]
>> +pub struct Untrusted<T: ?Sized>(T);
> 
> I think we could make this a bit more strict and instead of encouraging
> people to put all their validation code into a Validator<T>, force
> them to. Which means assuming apis wrap all untrusted stuff in
> Untrusted<T> reviewers only need to look at validate implementations and
> nothing else.
> 
> If I'm not too wrong with my rust I think this would work with slitting
> Untrusted<T> into two types:
> 
> - Untrusted<T> is just a box you can't access, and the type returned by
>   apis for untrusted data. It doesn't have any of the functions except
>   validate and new_untrusted so that you can't get at the data at all.
> 
> - UntrustedUnsafe<T> does have all the accessors needed for validation,
>   but you can't construct that outside of this module. Maybe simplest to
>   just nest this within the Untrusted<T> box.
> 
> - Untrusted::validate does the unboxing and passes a reference to
>   UntrustedUnsafe<T> to Validator::validate, to guarantee that all the
>   validation code is in there and nowhere else. Similar for
>   validate_bytes.
> 
> Of course people can still bypass this easily, but it gets a bit harder to
> do so, and easier for reviewers to verify everything.

This is a great idea!
I think we should also remove `validate_bytes`, then you really must
implement `Validator`. If we figure out later that people need it, we
can add it later again. I added it because I thought that implementing
`Validator` for something very small might be very annoying, but it
seems like you actually want to force people to implement it :)

I think we should discuss the name `UntrustedUnsafe` though, I don't
really like the `Unsafe` although I understand why you used it. What do
you think of renaming the current `Untrusted` (and the one that only
exposes `validate`) to `Unvalidated` and using `Untrusted` as the
parameter for `Validator::validate`?
Alternatively, we could use `Unverified`/`Untrusted`, because
unvalidated is not a "real English word". But then I think we also
should rename `Validator` to `Verifier` etc.

>> +    /// Gives direct access to the underlying untrusted data.
>> +    ///
>> +    /// Be careful when accessing the data, as it is untrusted and still needs to be verified! To
>> +    /// do so use [`validate()`].
>> +    ///
>> +    /// [`validate()`]: Self::validate
>> +    pub fn untrusted(&self) -> &T {
>> +        &self.0
>> +    }
>> +
>> +    /// Gives direct access to the underlying untrusted data.
>> +    ///
>> +    /// Be careful when accessing the data, as it is untrusted and still needs to be verified! To
>> +    /// do so use [`validate()`].
>> +    ///
>> +    /// [`validate()`]: Self::validate
>> +    pub fn untrusted_mut(&mut self) -> &mut T {
>> +        &mut self.0
>> +    }
>> +
>> +    /// Unwraps the data and removes the untrusted marking.
>> +    ///
>> +    /// Be careful when accessing the data, as it is untrusted and still needs to be verified! To
>> +    /// do so use [`validate()`].
>> +    ///
>> +    /// [`validate()`]: Self::validate
>> +    pub fn into_inner_untrusted(self) -> T
> 
> I don't like the above two since they could be easily and accidentally
> abused to leak untrusted data. And at least in your example I think
> they're fallout from being a bit too eager with annotating data as
> untrusted. The Folio and Mapped itself are just kernel structures, they
> better be correct. So I don't think it's useful to annotate those as
> untrusted and rely on Deref to also annotate the actual data as untrusted,
> because it forces you to peak behind the box a few times.

As I wrote in patch 2, I have no idea if I added the `Untrusted<T>` in
the correct places, as I don't know folios. I would disagree, that these
methods are necessary because of marking the entire folio as untrusted,
they are needed to access the data from within the `Validator::validate`
function, but with your above suggestion, we can move them to the
internal type.

> Instead I think we only want to annotate the Folio Deref:
> 
> impl<S> Deref for Mapped<'_, S> {
>     type Target = Untrusted<[u8]>;
> 
> Now there will be folios that we trust because they're kernel internal
> data and not file cache, so we need more flexibility here. No idea how to
> best make that happen, but in either case it's only the Deref data itself
> we don't trust, not any of the structures around it. One example would be
> gpu page tables. One way to implement this would be to make the Target
> type a generic of the folio, or well an associated type of the existing
> generics folio paramater:
> 
> pub struct Folio<S: FolioType>
> 
> pub trait FolioType {
> 	type Data;
> }
> 
> impl<T> FolioType for PageCache<T> {
> 	type Data = Untrusted<[u8]>;
> }
> 
> And gpu pagetables would use a something like this instead:
> 
> impl FolioType for GpuFolios {
> 	type Data = [struct GpuPTE];
> }
> 
> All extremely untested.

What I would like to avoid is having to do this for every possible data
source. Ideally we could just wrap trusted data sources with `Untrusted`
and be done with it.
If the wrapped type is not plain data, but rather a smart pointer or
other abstraction, then only the underlying data is marked untrusted
(otherwise how would you even know that the pointer can be used?). So
for example one might have an `Untrusted<ARef<[u8]>>`.

What I think we should do instead is make our APIs that return untrusted
data just return `Untrusted<Folio>` and implement the following method:

    impl Folio {
        pub fn read(self: &Untrusted<Self>) -> &Untrusted<[u8]>;
    }

I think that is the best of both worlds: we don't need to do excessive
type shenanigans for every type carrying potentially untrusted data and
we get to have methods specific to untrusted data.

However, I think implementing this method will be a bit difficult with
the `Untrusted`/`Unvalidated` split. Maybe we can have some `pub(crate)`
methods on `Unvalidated` to perform some mappings?

> Now I think there are cases you can't cover with just validate, where you
> have multiple input data which needs to be cross-checked to ensure overall
> validity. But I think that we can cover that by implementing a Validator
> for tuples and making Untrusted a trait so that we can either Accept
> Untrusted<(A, B)> or (Untrusted<A>, Untrusted<B>) interchangably when
> calling validate().

I could imagine us adding conversion functions that can combine
untrusted values. Additionally, we should probably add a `Context` type
to `Validator` that is an additional parameter.

> At least with an hour of theorizing and your one example I couldn't come
> up with anything else.

Yeah, we need more users of this to know the full way to express this
correctly. I would like to avoid huge refactorings in the future.

>> +impl Untrusted<[u8]> {
>> +    /// Validate the given bytes directly.
>> +    ///
>> +    /// This is a convenience method to not have to implement the [`Validator`] trait to be able to
>> +    /// just parse some bytes. If the bytes that you are validating have some structure and/or you
>> +    /// will parse it into a `struct` or other rust type, then it is very much recommended to use
>> +    /// the [`Validator`] trait and the [`validate()`] function instead.
>> +    ///
>> +    /// # Examples
>> +    ///
>> +    /// ```
>> +    /// # fn get_untrusted_data() -> &'static Untrusted<[u8]> { &[0; 8] }
>> +    ///
>> +    /// let data: &Untrusted<[u8]> = get_untrusted_data();
>> +    /// let data: Result<&[u8], ()> = data.validate_bytes::<()>(|untrusted| {
>> +    ///     if untrusted.len() != 2 {
>> +    ///         return Err(());
>> +    ///     }
>> +    ///     if untrusted[0] & 0xf0 != 0 {
>> +    ///         return Err(());
>> +    ///     }
>> +    ///     if untrusted[1] >= 100 {
>> +    ///         return Err(());
>> +    ///     }
>> +    ///     Ok(())
>> +    /// });
>> +    /// match data {
>> +    ///     Ok(data) => pr_info!("successfully validated the data: {data}"),
>> +    ///     Err(()) => pr_info!("read faulty data from hardware!"),
>> +    /// }
>> +    /// ```
>> +    ///
>> +    /// [`validate()`]: Self::validate
>> +    pub fn validate_bytes<E>(
> 
> I think this is a special case of a somewhat common in-place validation
> pattern. For example in in complex syscall or ioctl we need to copy
> structures from userspace anyway and doing yet another copy to put them
> into a rust structure isn't great, instead we validate in-place. So I
> think we need something like
> 
> impl Untrusted<T> {
> 	pub fn validate_inplace<E>(
> 		&self,
> 		validator: impl FnOnce(&T) -> Result<(), E>,
> 		) -> Result <&T, E> {
> 		...
> 	}
> }

I had thought about in-place validation as well, but I first wanted to
get a feel for how to do it, since I felt that in-place might make it
significantly more complicated.
In your proposal, you give a reference back, but maybe the data started
out as a `Box<[u8]>` (or how do you usually copy from userspace?), so it
would be better to be able to also handle owned data.

Also, it might be a good idea to just make the copy to kernel and
validate a single step.

> Eventually we might want a _mut version of this too, because often uapi
> extensions means we need to correctly fill out default values for
> extensions when being called by old userspace, and that requires
> mutability. And it really is often an integral part of input validation.

I see, will have to think about how to include this as well.

> Also I think we might need an Iterator for Untrusted<I: IntoIterator> so
> that we can validate Untrusted<[T]> and things like that with standard map
> and collect and do it all inplace.

Hmm, so a general iterator for `Unvalidated` might be a bad idea, but
for the `Untrusted`, it should be fine.

>> +pub trait Validator {
>> +    /// Type of the input data that is untrusted.
>> +    type Input: ?Sized;
>> +    /// Type of the validated data.
>> +    type Output;
> 
> So I think the explicit Output makes sense if you have multiple different
> untrusted input that validate to the same trusted structure, but I'm not
> sure this makes sense as associated types. Instead I'd go with generics
> and somethign like this:
> 
> pub trait Validator<Input: ?Sized> {
>     type Err;
> 
>     fn validate(untrusted: &Untrusted<Input>) -> Result<Self, Self::Err>;
> }
> 
> That means you can't implement validate for types from other modules
> directly but need a newtype (I think at least, not sure). But I think
> that's actually a good thing, since often that means you're validating
> some generic state plus whatever your own code needs (like the
> inode::Params<tarfs::INodeData> in your example), and both pieces need to
> be consisted overall and not just individually (otherwise why does the
> that other module not do the parsing for you). And so explicitly treating
> the validated output as an explicit new type just makes sense to me. Plus
> with derive(Deref) it's trivial to unbox after validation.

There might be the need to validate the same piece of data with
different ways and I am not convinced adding a newtype for every single
case is a good way to achieve it.
Although it would simplify the `Validator` trait... I will think a bit
about this.

> Cheers, Sima

Thanks a lot for taking a look!

---
Cheers,
Benno


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/3] rust: add untrusted data abstraction
  2024-09-13 16:49     ` Benno Lossin
@ 2024-09-16 15:49       ` Simona Vetter
  2024-09-18 15:40         ` Benno Lossin
  2024-09-21  7:45         ` Benno Lossin
  0 siblings, 2 replies; 18+ messages in thread
From: Simona Vetter @ 2024-09-16 15:49 UTC (permalink / raw)
  To: Benno Lossin
  Cc: Simona Vetter, Greg KH, Miguel Ojeda, Alex Gaynor,
	Wedson Almeida Filho, Boqun Feng, Gary Guo, Björn Roy Baron,
	Andreas Hindborg, Alice Ryhl, Trevor Gross, rust-for-linux,
	linux-kernel

On Fri, Sep 13, 2024 at 04:49:29PM +0000, Benno Lossin wrote:
> On 13.09.24 17:33, Simona Vetter wrote:
> > On Fri, Sep 13, 2024 at 11:26:57AM +0000, Benno Lossin wrote:
> >>  /// Used to transfer ownership to and from foreign (non-Rust) languages.
> >> @@ -532,3 +533,248 @@ unsafe impl AsBytes for str {}
> >>  // does not have any uninitialized portions either.
> >>  unsafe impl<T: AsBytes> AsBytes for [T] {}
> >>  unsafe impl<T: AsBytes, const N: usize> AsBytes for [T; N] {}
> >> +
> >> +/// Untrusted data of type `T`.
> >> +///
> >> +/// When reading data from userspace, hardware or other external untrusted sources, the data must
> >> +/// be validated before it is used for logic within the kernel. To do so, the [`validate()`]
> >> +/// function exists and uses the [`Validator`] trait. For raw bytes validation there also is the
> >> +/// [`validate_bytes()`] function.
> >> +///
> >> +///
> >> +/// [`validate()`]: Self::validate
> >> +/// [`validate_bytes()`]: Self::validate_bytes
> >> +#[repr(transparent)]
> >> +pub struct Untrusted<T: ?Sized>(T);
> > 
> > I think we could make this a bit more strict and instead of encouraging
> > people to put all their validation code into a Validator<T>, force
> > them to. Which means assuming apis wrap all untrusted stuff in
> > Untrusted<T> reviewers only need to look at validate implementations and
> > nothing else.
> > 
> > If I'm not too wrong with my rust I think this would work with slitting
> > Untrusted<T> into two types:
> > 
> > - Untrusted<T> is just a box you can't access, and the type returned by
> >   apis for untrusted data. It doesn't have any of the functions except
> >   validate and new_untrusted so that you can't get at the data at all.
> > 
> > - UntrustedUnsafe<T> does have all the accessors needed for validation,
> >   but you can't construct that outside of this module. Maybe simplest to
> >   just nest this within the Untrusted<T> box.
> > 
> > - Untrusted::validate does the unboxing and passes a reference to
> >   UntrustedUnsafe<T> to Validator::validate, to guarantee that all the
> >   validation code is in there and nowhere else. Similar for
> >   validate_bytes.
> > 
> > Of course people can still bypass this easily, but it gets a bit harder to
> > do so, and easier for reviewers to verify everything.
> 
> This is a great idea!
> I think we should also remove `validate_bytes`, then you really must
> implement `Validator`. If we figure out later that people need it, we
> can add it later again. I added it because I thought that implementing
> `Validator` for something very small might be very annoying, but it
> seems like you actually want to force people to implement it :)

See further down, I think there's a real use-case for validate_bytes, or
something really close to it at least.

> I think we should discuss the name `UntrustedUnsafe` though, I don't
> really like the `Unsafe` although I understand why you used it. What do
> you think of renaming the current `Untrusted` (and the one that only
> exposes `validate`) to `Unvalidated` and using `Untrusted` as the
> parameter for `Validator::validate`?

Much better, I didn't like my naming either.

> Alternatively, we could use `Unverified`/`Untrusted`, because
> unvalidated is not a "real English word". But then I think we also
> should rename `Validator` to `Verifier` etc.
> 
> >> +    /// Gives direct access to the underlying untrusted data.
> >> +    ///
> >> +    /// Be careful when accessing the data, as it is untrusted and still needs to be verified! To
> >> +    /// do so use [`validate()`].
> >> +    ///
> >> +    /// [`validate()`]: Self::validate
> >> +    pub fn untrusted(&self) -> &T {
> >> +        &self.0
> >> +    }
> >> +
> >> +    /// Gives direct access to the underlying untrusted data.
> >> +    ///
> >> +    /// Be careful when accessing the data, as it is untrusted and still needs to be verified! To
> >> +    /// do so use [`validate()`].
> >> +    ///
> >> +    /// [`validate()`]: Self::validate
> >> +    pub fn untrusted_mut(&mut self) -> &mut T {
> >> +        &mut self.0
> >> +    }
> >> +
> >> +    /// Unwraps the data and removes the untrusted marking.
> >> +    ///
> >> +    /// Be careful when accessing the data, as it is untrusted and still needs to be verified! To
> >> +    /// do so use [`validate()`].
> >> +    ///
> >> +    /// [`validate()`]: Self::validate
> >> +    pub fn into_inner_untrusted(self) -> T
> > 
> > I don't like the above two since they could be easily and accidentally
> > abused to leak untrusted data. And at least in your example I think
> > they're fallout from being a bit too eager with annotating data as
> > untrusted. The Folio and Mapped itself are just kernel structures, they
> > better be correct. So I don't think it's useful to annotate those as
> > untrusted and rely on Deref to also annotate the actual data as untrusted,
> > because it forces you to peak behind the box a few times.
> 
> As I wrote in patch 2, I have no idea if I added the `Untrusted<T>` in
> the correct places, as I don't know folios. I would disagree, that these
> methods are necessary because of marking the entire folio as untrusted,
> they are needed to access the data from within the `Validator::validate`
> function, but with your above suggestion, we can move them to the
> internal type.
> 
> > Instead I think we only want to annotate the Folio Deref:
> > 
> > impl<S> Deref for Mapped<'_, S> {
> >     type Target = Untrusted<[u8]>;
> > 
> > Now there will be folios that we trust because they're kernel internal
> > data and not file cache, so we need more flexibility here. No idea how to
> > best make that happen, but in either case it's only the Deref data itself
> > we don't trust, not any of the structures around it. One example would be
> > gpu page tables. One way to implement this would be to make the Target
> > type a generic of the folio, or well an associated type of the existing
> > generics folio paramater:
> > 
> > pub struct Folio<S: FolioType>
> > 
> > pub trait FolioType {
> > 	type Data;
> > }
> > 
> > impl<T> FolioType for PageCache<T> {
> > 	type Data = Untrusted<[u8]>;
> > }
> > 
> > And gpu pagetables would use a something like this instead:
> > 
> > impl FolioType for GpuFolios {
> > 	type Data = [struct GpuPTE];
> > }
> > 
> > All extremely untested.
> 
> What I would like to avoid is having to do this for every possible data
> source. Ideally we could just wrap trusted data sources with `Untrusted`
> and be done with it.
> If the wrapped type is not plain data, but rather a smart pointer or
> other abstraction, then only the underlying data is marked untrusted
> (otherwise how would you even know that the pointer can be used?). So
> for example one might have an `Untrusted<ARef<[u8]>>`.

Yeah I think for pure smart pointers this is great, hence why I didn't
object to your Deref implementation. But if there's an entire
datastructure around it (like with pagecache) I think we need to sprinkle
associated types around to make this work.

What imo breaks annotations like this is if you have a lot of false
positive cases that force people to jump through hoops and normalize
having random uncecessary "validate" code all over. That defeats the
point.

> What I think we should do instead is make our APIs that return untrusted
> data just return `Untrusted<Folio>` and implement the following method:
> 
>     impl Folio {
>         pub fn read(self: &Untrusted<Self>) -> &Untrusted<[u8]>;
>     }
> 
> I think that is the best of both worlds: we don't need to do excessive
> type shenanigans for every type carrying potentially untrusted data and
> we get to have methods specific to untrusted data.
> 
> However, I think implementing this method will be a bit difficult with
> the `Untrusted`/`Unvalidated` split. Maybe we can have some `pub(crate)`
> methods on `Unvalidated` to perform some mappings?

The thing is, folios are just a pile of contig pages, and there's nothing
in the rules that they only contain untrusted data. Currently in rust code
we have that's the case, but not in general. So we need that associated
type.

But I also think Folio here is special, a lot of the other places where I
want this annotation it's the case that the data returned is _always_
untrusted. So we don't need to place associated types all over the
codebase to make this work, it's just that the rfc example you've picked
needs it.

E.g. copy_from_user is _always_ untrusted, not exception. Network packets
we read are also always untrusted. When you have a device driver and want
to be robust against evil implementations (external bus like usb or cloud
virtual hw with confidential compute), then also everything you ever read
from that device is always untrusted until validated.

And the neat thing is if we get this right, there's a lot of cases where
the Untrusted<> wrapper doesn't matter, because we just pass from one
untrusted to another. E.g. for the write() syscall (or an implemenation of
that in an fs) we get a Untrusted<[u8]> from Folio and let copy_from_user
directly write into that. But that only works if the annotations are
exactly right, not too much, not too little.

Oh another one we need:

impl<T: AsBytes> AsBytes for Untrusted<AsBytes>

Otherwise you can't write untrusted stuff to userspace, which is really no
issue at all (e.g. networking, where the kernel only parses the headers
and shovels the data to userspace unchecked).

> > Now I think there are cases you can't cover with just validate, where you
> > have multiple input data which needs to be cross-checked to ensure overall
> > validity. But I think that we can cover that by implementing a Validator
> > for tuples and making Untrusted a trait so that we can either Accept
> > Untrusted<(A, B)> or (Untrusted<A>, Untrusted<B>) interchangably when
> > calling validate().
> 
> I could imagine us adding conversion functions that can combine
> untrusted values. Additionally, we should probably add a `Context` type
> to `Validator` that is an additional parameter.
> 
> > At least with an hour of theorizing and your one example I couldn't come
> > up with anything else.
> 
> Yeah, we need more users of this to know the full way to express this
> correctly. I would like to avoid huge refactorings in the future.

I think adding it to the copy_*_user functions we already have in
upstream, and then asking Alice to rebase binder should be a really solid
real-world testcase. And I think currently for the things in-flight
copy*user is going to be the main source of untrusted data anyway, not so
much page cache folios.

> >> +impl Untrusted<[u8]> {
> >> +    /// Validate the given bytes directly.
> >> +    ///
> >> +    /// This is a convenience method to not have to implement the [`Validator`] trait to be able to
> >> +    /// just parse some bytes. If the bytes that you are validating have some structure and/or you
> >> +    /// will parse it into a `struct` or other rust type, then it is very much recommended to use
> >> +    /// the [`Validator`] trait and the [`validate()`] function instead.
> >> +    ///
> >> +    /// # Examples
> >> +    ///
> >> +    /// ```
> >> +    /// # fn get_untrusted_data() -> &'static Untrusted<[u8]> { &[0; 8] }
> >> +    ///
> >> +    /// let data: &Untrusted<[u8]> = get_untrusted_data();
> >> +    /// let data: Result<&[u8], ()> = data.validate_bytes::<()>(|untrusted| {
> >> +    ///     if untrusted.len() != 2 {
> >> +    ///         return Err(());
> >> +    ///     }
> >> +    ///     if untrusted[0] & 0xf0 != 0 {
> >> +    ///         return Err(());
> >> +    ///     }
> >> +    ///     if untrusted[1] >= 100 {
> >> +    ///         return Err(());
> >> +    ///     }
> >> +    ///     Ok(())
> >> +    /// });
> >> +    /// match data {
> >> +    ///     Ok(data) => pr_info!("successfully validated the data: {data}"),
> >> +    ///     Err(()) => pr_info!("read faulty data from hardware!"),
> >> +    /// }
> >> +    /// ```
> >> +    ///
> >> +    /// [`validate()`]: Self::validate
> >> +    pub fn validate_bytes<E>(
> > 
> > I think this is a special case of a somewhat common in-place validation
> > pattern. For example in in complex syscall or ioctl we need to copy
> > structures from userspace anyway and doing yet another copy to put them
> > into a rust structure isn't great, instead we validate in-place. So I
> > think we need something like
> > 
> > impl Untrusted<T> {
> > 	pub fn validate_inplace<E>(
> > 		&self,
> > 		validator: impl FnOnce(&T) -> Result<(), E>,
> > 		) -> Result <&T, E> {
> > 		...
> > 	}
> > }
> 
> I had thought about in-place validation as well, but I first wanted to
> get a feel for how to do it, since I felt that in-place might make it
> significantly more complicated.
> In your proposal, you give a reference back, but maybe the data started
> out as a `Box<[u8]>` (or how do you usually copy from userspace?), so it
> would be better to be able to also handle owned data.

The code in upstream is just MaybUninit<[u8]>.

> Also, it might be a good idea to just make the copy to kernel and
> validate a single step.

Yeah that'd be a nice helper, but I think conceptually you want it to be
two steps: Often for efficiency complex structures are linearized into one
single memory block, so that you can pull it all in with one
copy_from_user. But validation would need to look at each piece (and it's
often mixed, not just an array), probably together with Alice's Range
datatype to make sure we're don't have index confusions.

> > Eventually we might want a _mut version of this too, because often uapi
> > extensions means we need to correctly fill out default values for
> > extensions when being called by old userspace, and that requires
> > mutability. And it really is often an integral part of input validation.
> 
> I see, will have to think about how to include this as well.
> 
> > Also I think we might need an Iterator for Untrusted<I: IntoIterator> so
> > that we can validate Untrusted<[T]> and things like that with standard map
> > and collect and do it all inplace.
> 
> Hmm, so a general iterator for `Unvalidated` might be a bad idea, but
> for the `Untrusted`, it should be fine.

Yup that was my thinking too, the idea being that you write a validator
for a single element, and then let Iterator magic handle things when you
need to validate an entire array.
 
> >> +pub trait Validator {
> >> +    /// Type of the input data that is untrusted.
> >> +    type Input: ?Sized;
> >> +    /// Type of the validated data.
> >> +    type Output;
> > 
> > So I think the explicit Output makes sense if you have multiple different
> > untrusted input that validate to the same trusted structure, but I'm not
> > sure this makes sense as associated types. Instead I'd go with generics
> > and somethign like this:
> > 
> > pub trait Validator<Input: ?Sized> {
> >     type Err;
> > 
> >     fn validate(untrusted: &Untrusted<Input>) -> Result<Self, Self::Err>;
> > }
> > 
> > That means you can't implement validate for types from other modules
> > directly but need a newtype (I think at least, not sure). But I think
> > that's actually a good thing, since often that means you're validating
> > some generic state plus whatever your own code needs (like the
> > inode::Params<tarfs::INodeData> in your example), and both pieces need to
> > be consisted overall and not just individually (otherwise why does the
> > that other module not do the parsing for you). And so explicitly treating
> > the validated output as an explicit new type just makes sense to me. Plus
> > with derive(Deref) it's trivial to unbox after validation.
> 
> There might be the need to validate the same piece of data with
> different ways and I am not convinced adding a newtype for every single
> case is a good way to achieve it.
> Although it would simplify the `Validator` trait... I will think a bit
> about this.

Hm, but unless I misunderstand you already need a random type to attach
your current trait too? So not worse if we require that for the
less-common type of multiple ways to validate the same, and simpler for
the common one.

> > Cheers, Sima
> 
> Thanks a lot for taking a look!

I _really_ want this :-)
-Sima
-- 
Simona Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/3] rust: add untrusted data abstraction
  2024-09-16 15:49       ` Simona Vetter
@ 2024-09-18 15:40         ` Benno Lossin
  2024-09-18 17:09           ` Greg KH
  2024-09-20 14:29           ` Simona Vetter
  2024-09-21  7:45         ` Benno Lossin
  1 sibling, 2 replies; 18+ messages in thread
From: Benno Lossin @ 2024-09-18 15:40 UTC (permalink / raw)
  To: Simona Vetter, Greg KH, Miguel Ojeda, Alex Gaynor,
	Wedson Almeida Filho, Boqun Feng, Gary Guo, Björn Roy Baron,
	Andreas Hindborg, Alice Ryhl, Trevor Gross, rust-for-linux,
	linux-kernel

On 16.09.24 17:49, Simona Vetter wrote:
> On Fri, Sep 13, 2024 at 04:49:29PM +0000, Benno Lossin wrote:
>> On 13.09.24 17:33, Simona Vetter wrote:
>>> On Fri, Sep 13, 2024 at 11:26:57AM +0000, Benno Lossin wrote:
>>>>  /// Used to transfer ownership to and from foreign (non-Rust) languages.
>>>> @@ -532,3 +533,248 @@ unsafe impl AsBytes for str {}
>>>>  // does not have any uninitialized portions either.
>>>>  unsafe impl<T: AsBytes> AsBytes for [T] {}
>>>>  unsafe impl<T: AsBytes, const N: usize> AsBytes for [T; N] {}
>>>> +
>>>> +/// Untrusted data of type `T`.
>>>> +///
>>>> +/// When reading data from userspace, hardware or other external untrusted sources, the data must
>>>> +/// be validated before it is used for logic within the kernel. To do so, the [`validate()`]
>>>> +/// function exists and uses the [`Validator`] trait. For raw bytes validation there also is the
>>>> +/// [`validate_bytes()`] function.
>>>> +///
>>>> +///
>>>> +/// [`validate()`]: Self::validate
>>>> +/// [`validate_bytes()`]: Self::validate_bytes
>>>> +#[repr(transparent)]
>>>> +pub struct Untrusted<T: ?Sized>(T);
>>>
>>> I think we could make this a bit more strict and instead of encouraging
>>> people to put all their validation code into a Validator<T>, force
>>> them to. Which means assuming apis wrap all untrusted stuff in
>>> Untrusted<T> reviewers only need to look at validate implementations and
>>> nothing else.
>>>
>>> If I'm not too wrong with my rust I think this would work with slitting
>>> Untrusted<T> into two types:
>>>
>>> - Untrusted<T> is just a box you can't access, and the type returned by
>>>   apis for untrusted data. It doesn't have any of the functions except
>>>   validate and new_untrusted so that you can't get at the data at all.
>>>
>>> - UntrustedUnsafe<T> does have all the accessors needed for validation,
>>>   but you can't construct that outside of this module. Maybe simplest to
>>>   just nest this within the Untrusted<T> box.
>>>
>>> - Untrusted::validate does the unboxing and passes a reference to
>>>   UntrustedUnsafe<T> to Validator::validate, to guarantee that all the
>>>   validation code is in there and nowhere else. Similar for
>>>   validate_bytes.
>>>
>>> Of course people can still bypass this easily, but it gets a bit harder to
>>> do so, and easier for reviewers to verify everything.
>>
>> This is a great idea!
>> I think we should also remove `validate_bytes`, then you really must
>> implement `Validator`. If we figure out later that people need it, we
>> can add it later again. I added it because I thought that implementing
>> `Validator` for something very small might be very annoying, but it
>> seems like you actually want to force people to implement it :)
> 
> See further down, I think there's a real use-case for validate_bytes, or
> something really close to it at least.

That's good to know, then I will keep it.

>> I think we should discuss the name `UntrustedUnsafe` though, I don't
>> really like the `Unsafe` although I understand why you used it. What do
>> you think of renaming the current `Untrusted` (and the one that only
>> exposes `validate`) to `Unvalidated` and using `Untrusted` as the
>> parameter for `Validator::validate`?
> 
> Much better, I didn't like my naming either.

While designing my LPC slides, I think that we actually don't need two
types. We can just have `Untrusted` that exposes nothing except
`validate[_bytes]` and have the `Validator::validate` function take the
input data directly unwrapped. Thoughts?

>> Alternatively, we could use `Unverified`/`Untrusted`, because
>> unvalidated is not a "real English word". But then I think we also
>> should rename `Validator` to `Verifier` etc.
>>
>>>> +    /// Gives direct access to the underlying untrusted data.
>>>> +    ///
>>>> +    /// Be careful when accessing the data, as it is untrusted and still needs to be verified! To
>>>> +    /// do so use [`validate()`].
>>>> +    ///
>>>> +    /// [`validate()`]: Self::validate
>>>> +    pub fn untrusted(&self) -> &T {
>>>> +        &self.0
>>>> +    }
>>>> +
>>>> +    /// Gives direct access to the underlying untrusted data.
>>>> +    ///
>>>> +    /// Be careful when accessing the data, as it is untrusted and still needs to be verified! To
>>>> +    /// do so use [`validate()`].
>>>> +    ///
>>>> +    /// [`validate()`]: Self::validate
>>>> +    pub fn untrusted_mut(&mut self) -> &mut T {
>>>> +        &mut self.0
>>>> +    }
>>>> +
>>>> +    /// Unwraps the data and removes the untrusted marking.
>>>> +    ///
>>>> +    /// Be careful when accessing the data, as it is untrusted and still needs to be verified! To
>>>> +    /// do so use [`validate()`].
>>>> +    ///
>>>> +    /// [`validate()`]: Self::validate
>>>> +    pub fn into_inner_untrusted(self) -> T
>>>
>>> I don't like the above two since they could be easily and accidentally
>>> abused to leak untrusted data. And at least in your example I think
>>> they're fallout from being a bit too eager with annotating data as
>>> untrusted. The Folio and Mapped itself are just kernel structures, they
>>> better be correct. So I don't think it's useful to annotate those as
>>> untrusted and rely on Deref to also annotate the actual data as untrusted,
>>> because it forces you to peak behind the box a few times.
>>
>> As I wrote in patch 2, I have no idea if I added the `Untrusted<T>` in
>> the correct places, as I don't know folios. I would disagree, that these
>> methods are necessary because of marking the entire folio as untrusted,
>> they are needed to access the data from within the `Validator::validate`
>> function, but with your above suggestion, we can move them to the
>> internal type.
>>
>>> Instead I think we only want to annotate the Folio Deref:
>>>
>>> impl<S> Deref for Mapped<'_, S> {
>>>     type Target = Untrusted<[u8]>;
>>>
>>> Now there will be folios that we trust because they're kernel internal
>>> data and not file cache, so we need more flexibility here. No idea how to
>>> best make that happen, but in either case it's only the Deref data itself
>>> we don't trust, not any of the structures around it. One example would be
>>> gpu page tables. One way to implement this would be to make the Target
>>> type a generic of the folio, or well an associated type of the existing
>>> generics folio paramater:
>>>
>>> pub struct Folio<S: FolioType>
>>>
>>> pub trait FolioType {
>>> 	type Data;
>>> }
>>>
>>> impl<T> FolioType for PageCache<T> {
>>> 	type Data = Untrusted<[u8]>;
>>> }
>>>
>>> And gpu pagetables would use a something like this instead:
>>>
>>> impl FolioType for GpuFolios {
>>> 	type Data = [struct GpuPTE];
>>> }
>>>
>>> All extremely untested.
>>
>> What I would like to avoid is having to do this for every possible data
>> source. Ideally we could just wrap trusted data sources with `Untrusted`
>> and be done with it.
>> If the wrapped type is not plain data, but rather a smart pointer or
>> other abstraction, then only the underlying data is marked untrusted
>> (otherwise how would you even know that the pointer can be used?). So
>> for example one might have an `Untrusted<ARef<[u8]>>`.
> 
> Yeah I think for pure smart pointers this is great, hence why I didn't
> object to your Deref implementation. But if there's an entire
> datastructure around it (like with pagecache) I think we need to sprinkle
> associated types around to make this work.
> 
> What imo breaks annotations like this is if you have a lot of false
> positive cases that force people to jump through hoops and normalize
> having random uncecessary "validate" code all over. That defeats the
> point.

Agreed.

>> What I think we should do instead is make our APIs that return untrusted
>> data just return `Untrusted<Folio>` and implement the following method:
>>
>>     impl Folio {
>>         pub fn read(self: &Untrusted<Self>) -> &Untrusted<[u8]>;
>>     }
>>
>> I think that is the best of both worlds: we don't need to do excessive
>> type shenanigans for every type carrying potentially untrusted data and
>> we get to have methods specific to untrusted data.
>>
>> However, I think implementing this method will be a bit difficult with
>> the `Untrusted`/`Unvalidated` split. Maybe we can have some `pub(crate)`
>> methods on `Unvalidated` to perform some mappings?
> 
> The thing is, folios are just a pile of contig pages, and there's nothing
> in the rules that they only contain untrusted data. Currently in rust code
> we have that's the case, but not in general. So we need that associated
> type.
> 
> But I also think Folio here is special, a lot of the other places where I
> want this annotation it's the case that the data returned is _always_
> untrusted. So we don't need to place associated types all over the
> codebase to make this work, it's just that the rfc example you've picked
> needs it.

I think we should try to make just wrapping stuff in `Untrusted` work. I
don't see how the associated types would help you any more than just
implementing stuff on `&Untrusted<Self`.

> E.g. copy_from_user is _always_ untrusted, not exception. Network packets
> we read are also always untrusted. When you have a device driver and want
> to be robust against evil implementations (external bus like usb or cloud
> virtual hw with confidential compute), then also everything you ever read
> from that device is always untrusted until validated.
> 
> And the neat thing is if we get this right, there's a lot of cases where
> the Untrusted<> wrapper doesn't matter, because we just pass from one
> untrusted to another. E.g. for the write() syscall (or an implemenation of
> that in an fs) we get a Untrusted<[u8]> from Folio and let copy_from_user
> directly write into that. But that only works if the annotations are
> exactly right, not too much, not too little.

Yeah this would be very nice, if you never need to look inside of the
untrusted data, then you can just move it along.

> Oh another one we need:
> 
> impl<T: AsBytes> AsBytes for Untrusted<AsBytes>
> 
> Otherwise you can't write untrusted stuff to userspace, which is really no
> issue at all (e.g. networking, where the kernel only parses the headers
> and shovels the data to userspace unchecked).

Sure.

>>> Now I think there are cases you can't cover with just validate, where you
>>> have multiple input data which needs to be cross-checked to ensure overall
>>> validity. But I think that we can cover that by implementing a Validator
>>> for tuples and making Untrusted a trait so that we can either Accept
>>> Untrusted<(A, B)> or (Untrusted<A>, Untrusted<B>) interchangably when
>>> calling validate().
>>
>> I could imagine us adding conversion functions that can combine
>> untrusted values. Additionally, we should probably add a `Context` type
>> to `Validator` that is an additional parameter.
>>
>>> At least with an hour of theorizing and your one example I couldn't come
>>> up with anything else.
>>
>> Yeah, we need more users of this to know the full way to express this
>> correctly. I would like to avoid huge refactorings in the future.
> 
> I think adding it to the copy_*_user functions we already have in
> upstream, and then asking Alice to rebase binder should be a really solid
> real-world testcase. And I think currently for the things in-flight
> copy*user is going to be the main source of untrusted data anyway, not so
> much page cache folios.

Sure. I chose tarfs as the use-case, because Greg mentioned to me that
it would benefit from adding this API. (I have no prior linux kernel
experience, so you giving me some pointers where this will be useful is
very helpful!)

>>>> +impl Untrusted<[u8]> {
>>>> +    /// Validate the given bytes directly.
>>>> +    ///
>>>> +    /// This is a convenience method to not have to implement the [`Validator`] trait to be able to
>>>> +    /// just parse some bytes. If the bytes that you are validating have some structure and/or you
>>>> +    /// will parse it into a `struct` or other rust type, then it is very much recommended to use
>>>> +    /// the [`Validator`] trait and the [`validate()`] function instead.
>>>> +    ///
>>>> +    /// # Examples
>>>> +    ///
>>>> +    /// ```
>>>> +    /// # fn get_untrusted_data() -> &'static Untrusted<[u8]> { &[0; 8] }
>>>> +    ///
>>>> +    /// let data: &Untrusted<[u8]> = get_untrusted_data();
>>>> +    /// let data: Result<&[u8], ()> = data.validate_bytes::<()>(|untrusted| {
>>>> +    ///     if untrusted.len() != 2 {
>>>> +    ///         return Err(());
>>>> +    ///     }
>>>> +    ///     if untrusted[0] & 0xf0 != 0 {
>>>> +    ///         return Err(());
>>>> +    ///     }
>>>> +    ///     if untrusted[1] >= 100 {
>>>> +    ///         return Err(());
>>>> +    ///     }
>>>> +    ///     Ok(())
>>>> +    /// });
>>>> +    /// match data {
>>>> +    ///     Ok(data) => pr_info!("successfully validated the data: {data}"),
>>>> +    ///     Err(()) => pr_info!("read faulty data from hardware!"),
>>>> +    /// }
>>>> +    /// ```
>>>> +    ///
>>>> +    /// [`validate()`]: Self::validate
>>>> +    pub fn validate_bytes<E>(
>>>
>>> I think this is a special case of a somewhat common in-place validation
>>> pattern. For example in in complex syscall or ioctl we need to copy
>>> structures from userspace anyway and doing yet another copy to put them
>>> into a rust structure isn't great, instead we validate in-place. So I
>>> think we need something like
>>>
>>> impl Untrusted<T> {
>>> 	pub fn validate_inplace<E>(
>>> 		&self,
>>> 		validator: impl FnOnce(&T) -> Result<(), E>,
>>> 		) -> Result <&T, E> {
>>> 		...
>>> 	}
>>> }
>>
>> I had thought about in-place validation as well, but I first wanted to
>> get a feel for how to do it, since I felt that in-place might make it
>> significantly more complicated.
>> In your proposal, you give a reference back, but maybe the data started
>> out as a `Box<[u8]>` (or how do you usually copy from userspace?), so it
>> would be better to be able to also handle owned data.
> 
> The code in upstream is just MaybUninit<[u8]>.

Where is that stored though? On the heap or the stack?

>> Also, it might be a good idea to just make the copy to kernel and
>> validate a single step.
> 
> Yeah that'd be a nice helper, but I think conceptually you want it to be
> two steps: Often for efficiency complex structures are linearized into one
> single memory block, so that you can pull it all in with one
> copy_from_user. But validation would need to look at each piece (and it's
> often mixed, not just an array), probably together with Alice's Range
> datatype to make sure we're don't have index confusions.

Yeah that makes sense.

>>> Eventually we might want a _mut version of this too, because often uapi
>>> extensions means we need to correctly fill out default values for
>>> extensions when being called by old userspace, and that requires
>>> mutability. And it really is often an integral part of input validation.
>>
>> I see, will have to think about how to include this as well.
>>
>>> Also I think we might need an Iterator for Untrusted<I: IntoIterator> so
>>> that we can validate Untrusted<[T]> and things like that with standard map
>>> and collect and do it all inplace.
>>
>> Hmm, so a general iterator for `Unvalidated` might be a bad idea, but
>> for the `Untrusted`, it should be fine.
> 
> Yup that was my thinking too, the idea being that you write a validator
> for a single element, and then let Iterator magic handle things when you
> need to validate an entire array.
> 
>>>> +pub trait Validator {
>>>> +    /// Type of the input data that is untrusted.
>>>> +    type Input: ?Sized;
>>>> +    /// Type of the validated data.
>>>> +    type Output;
>>>
>>> So I think the explicit Output makes sense if you have multiple different
>>> untrusted input that validate to the same trusted structure, but I'm not
>>> sure this makes sense as associated types. Instead I'd go with generics
>>> and somethign like this:
>>>
>>> pub trait Validator<Input: ?Sized> {
>>>     type Err;
>>>
>>>     fn validate(untrusted: &Untrusted<Input>) -> Result<Self, Self::Err>;
>>> }
>>>
>>> That means you can't implement validate for types from other modules
>>> directly but need a newtype (I think at least, not sure). But I think
>>> that's actually a good thing, since often that means you're validating
>>> some generic state plus whatever your own code needs (like the
>>> inode::Params<tarfs::INodeData> in your example), and both pieces need to
>>> be consisted overall and not just individually (otherwise why does the
>>> that other module not do the parsing for you). And so explicitly treating
>>> the validated output as an explicit new type just makes sense to me. Plus
>>> with derive(Deref) it's trivial to unbox after validation.
>>
>> There might be the need to validate the same piece of data with
>> different ways and I am not convinced adding a newtype for every single
>> case is a good way to achieve it.
>> Although it would simplify the `Validator` trait... I will think a bit
>> about this.
> 
> Hm, but unless I misunderstand you already need a random type to attach
> your current trait too? So not worse if we require that for the
> less-common type of multiple ways to validate the same, and simpler for
> the common one.

Yes, but you wouldn't have to unwrap the return type. For example with
your proposal we have:

    struct MyINodeParams(INodeParams);

    impl Validator<[u8]> for MyINodeParams {
        type Err = Error;

        fn validate(untrusted: &Untrusted<[u8]>) -> Result<Self> {
            /*...*/
            Ok(Self(params))
        }
    }

    impl MyINodeParams {
        fn into_inner(self) -> INodeParams {
            self.0
        }
    }

And then you would do:

    let params = untrusted.validate::<MyINodeParams>().into_inner();

I find the `into_inner()` a bit annoying (one could just use `.0`
instead, but I also don't like that). I find specifying the `Output` a
bit cleaner.
    
---
Cheers,
Benno

>>> Cheers, Sima
>>
>> Thanks a lot for taking a look!
> 
> I _really_ want this :-)
> -Sima
> --
> Simona Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/3] rust: add untrusted data abstraction
  2024-09-18 15:40         ` Benno Lossin
@ 2024-09-18 17:09           ` Greg KH
  2024-09-18 17:33             ` Benno Lossin
  2024-09-20 14:29           ` Simona Vetter
  1 sibling, 1 reply; 18+ messages in thread
From: Greg KH @ 2024-09-18 17:09 UTC (permalink / raw)
  To: Benno Lossin
  Cc: Simona Vetter, Miguel Ojeda, Alex Gaynor, Wedson Almeida Filho,
	Boqun Feng, Gary Guo, Björn Roy Baron, Andreas Hindborg,
	Alice Ryhl, Trevor Gross, rust-for-linux, linux-kernel

On Wed, Sep 18, 2024 at 03:40:54PM +0000, Benno Lossin wrote:
> >> Yeah, we need more users of this to know the full way to express this
> >> correctly. I would like to avoid huge refactorings in the future.
> > 
> > I think adding it to the copy_*_user functions we already have in
> > upstream, and then asking Alice to rebase binder should be a really solid
> > real-world testcase. And I think currently for the things in-flight
> > copy*user is going to be the main source of untrusted data anyway, not so
> > much page cache folios.
> 
> Sure. I chose tarfs as the use-case, because Greg mentioned to me that
> it would benefit from adding this API. (I have no prior linux kernel
> experience, so you giving me some pointers where this will be useful is
> very helpful!)

I just had tarfs as an easy example where we were reading data off the
disk and acting on it, in a way just like C where if the data is
corrupted we can do "not normal" things.  Sorry it got tied up with
folios, that is not the normal way drivers work, they either get data
from userspace through a char device node (ioctls) or from hardware
(memory copies/reads/something) and for them the "untrusted data"
abstraction should be much simpler than dealing with a folio.

We don't really have any other good examples of drivers in rust yet that
I could find other than maybe binder, but Alice has already posted her
solution for how to handle untrusted data there (comes in through a char
device node and/or a filesystem entry point) but it's much more complex
and possibly harder to use as a simple example of the api ideas.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/3] rust: add untrusted data abstraction
  2024-09-18 17:09           ` Greg KH
@ 2024-09-18 17:33             ` Benno Lossin
  2024-09-18 17:39               ` Greg KH
  0 siblings, 1 reply; 18+ messages in thread
From: Benno Lossin @ 2024-09-18 17:33 UTC (permalink / raw)
  To: Greg KH
  Cc: Simona Vetter, Miguel Ojeda, Alex Gaynor, Wedson Almeida Filho,
	Boqun Feng, Gary Guo, Björn Roy Baron, Andreas Hindborg,
	Alice Ryhl, Trevor Gross, rust-for-linux, linux-kernel

On 18.09.24 19:09, Greg KH wrote:
> On Wed, Sep 18, 2024 at 03:40:54PM +0000, Benno Lossin wrote:
>>>> Yeah, we need more users of this to know the full way to express this
>>>> correctly. I would like to avoid huge refactorings in the future.
>>>
>>> I think adding it to the copy_*_user functions we already have in
>>> upstream, and then asking Alice to rebase binder should be a really solid
>>> real-world testcase. And I think currently for the things in-flight
>>> copy*user is going to be the main source of untrusted data anyway, not so
>>> much page cache folios.
>>
>> Sure. I chose tarfs as the use-case, because Greg mentioned to me that
>> it would benefit from adding this API. (I have no prior linux kernel
>> experience, so you giving me some pointers where this will be useful is
>> very helpful!)
> 
> I just had tarfs as an easy example where we were reading data off the
> disk and acting on it, in a way just like C where if the data is
> corrupted we can do "not normal" things.  Sorry it got tied up with

No worries! I was just under the impression that this would be common
(maybe it's common for filesystems?), so just having that clarification
now makes more sense.

> folios, that is not the normal way drivers work, they either get data
> from userspace through a char device node (ioctls) or from hardware
> (memory copies/reads/something) and for them the "untrusted data"
> abstraction should be much simpler than dealing with a folio.
> 
> We don't really have any other good examples of drivers in rust yet that
> I could find other than maybe binder, but Alice has already posted her
> solution for how to handle untrusted data there (comes in through a char
> device node and/or a filesystem entry point) but it's much more complex
> and possibly harder to use as a simple example of the api ideas.

Yeah that's what I thought as well when you brought it up the first
time.

---
Cheers,
Benno


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/3] rust: add untrusted data abstraction
  2024-09-18 17:33             ` Benno Lossin
@ 2024-09-18 17:39               ` Greg KH
  0 siblings, 0 replies; 18+ messages in thread
From: Greg KH @ 2024-09-18 17:39 UTC (permalink / raw)
  To: Benno Lossin
  Cc: Simona Vetter, Miguel Ojeda, Alex Gaynor, Wedson Almeida Filho,
	Boqun Feng, Gary Guo, Björn Roy Baron, Andreas Hindborg,
	Alice Ryhl, Trevor Gross, rust-for-linux, linux-kernel

On Wed, Sep 18, 2024 at 05:33:33PM +0000, Benno Lossin wrote:
> On 18.09.24 19:09, Greg KH wrote:
> > On Wed, Sep 18, 2024 at 03:40:54PM +0000, Benno Lossin wrote:
> >>>> Yeah, we need more users of this to know the full way to express this
> >>>> correctly. I would like to avoid huge refactorings in the future.
> >>>
> >>> I think adding it to the copy_*_user functions we already have in
> >>> upstream, and then asking Alice to rebase binder should be a really solid
> >>> real-world testcase. And I think currently for the things in-flight
> >>> copy*user is going to be the main source of untrusted data anyway, not so
> >>> much page cache folios.
> >>
> >> Sure. I chose tarfs as the use-case, because Greg mentioned to me that
> >> it would benefit from adding this API. (I have no prior linux kernel
> >> experience, so you giving me some pointers where this will be useful is
> >> very helpful!)
> > 
> > I just had tarfs as an easy example where we were reading data off the
> > disk and acting on it, in a way just like C where if the data is
> > corrupted we can do "not normal" things.  Sorry it got tied up with
> 
> No worries! I was just under the impression that this would be common
> (maybe it's common for filesystems?), so just having that clarification
> now makes more sense.

Yes, this would be common for filesystems.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/3] rust: add untrusted data abstraction
  2024-09-18 15:40         ` Benno Lossin
  2024-09-18 17:09           ` Greg KH
@ 2024-09-20 14:29           ` Simona Vetter
  2024-09-20 15:28             ` Benno Lossin
  1 sibling, 1 reply; 18+ messages in thread
From: Simona Vetter @ 2024-09-20 14:29 UTC (permalink / raw)
  To: Benno Lossin
  Cc: Simona Vetter, Greg KH, Miguel Ojeda, Alex Gaynor,
	Wedson Almeida Filho, Boqun Feng, Gary Guo, Björn Roy Baron,
	Andreas Hindborg, Alice Ryhl, Trevor Gross, rust-for-linux,
	linux-kernel

On Wed, Sep 18, 2024 at 03:40:54PM +0000, Benno Lossin wrote:
> On 16.09.24 17:49, Simona Vetter wrote:
> > On Fri, Sep 13, 2024 at 04:49:29PM +0000, Benno Lossin wrote:
> >> On 13.09.24 17:33, Simona Vetter wrote:
> >>> On Fri, Sep 13, 2024 at 11:26:57AM +0000, Benno Lossin wrote:
> >>>>  /// Used to transfer ownership to and from foreign (non-Rust) languages.
> >>>> @@ -532,3 +533,248 @@ unsafe impl AsBytes for str {}
> >>>>  // does not have any uninitialized portions either.
> >>>>  unsafe impl<T: AsBytes> AsBytes for [T] {}
> >>>>  unsafe impl<T: AsBytes, const N: usize> AsBytes for [T; N] {}
> >>>> +
> >>>> +/// Untrusted data of type `T`.
> >>>> +///
> >>>> +/// When reading data from userspace, hardware or other external untrusted sources, the data must
> >>>> +/// be validated before it is used for logic within the kernel. To do so, the [`validate()`]
> >>>> +/// function exists and uses the [`Validator`] trait. For raw bytes validation there also is the
> >>>> +/// [`validate_bytes()`] function.
> >>>> +///
> >>>> +///
> >>>> +/// [`validate()`]: Self::validate
> >>>> +/// [`validate_bytes()`]: Self::validate_bytes
> >>>> +#[repr(transparent)]
> >>>> +pub struct Untrusted<T: ?Sized>(T);
> >>>
> >>> I think we could make this a bit more strict and instead of encouraging
> >>> people to put all their validation code into a Validator<T>, force
> >>> them to. Which means assuming apis wrap all untrusted stuff in
> >>> Untrusted<T> reviewers only need to look at validate implementations and
> >>> nothing else.
> >>>
> >>> If I'm not too wrong with my rust I think this would work with slitting
> >>> Untrusted<T> into two types:
> >>>
> >>> - Untrusted<T> is just a box you can't access, and the type returned by
> >>>   apis for untrusted data. It doesn't have any of the functions except
> >>>   validate and new_untrusted so that you can't get at the data at all.
> >>>
> >>> - UntrustedUnsafe<T> does have all the accessors needed for validation,
> >>>   but you can't construct that outside of this module. Maybe simplest to
> >>>   just nest this within the Untrusted<T> box.
> >>>
> >>> - Untrusted::validate does the unboxing and passes a reference to
> >>>   UntrustedUnsafe<T> to Validator::validate, to guarantee that all the
> >>>   validation code is in there and nowhere else. Similar for
> >>>   validate_bytes.
> >>>
> >>> Of course people can still bypass this easily, but it gets a bit harder to
> >>> do so, and easier for reviewers to verify everything.
> >>
> >> This is a great idea!
> >> I think we should also remove `validate_bytes`, then you really must
> >> implement `Validator`. If we figure out later that people need it, we
> >> can add it later again. I added it because I thought that implementing
> >> `Validator` for something very small might be very annoying, but it
> >> seems like you actually want to force people to implement it :)
> > 
> > See further down, I think there's a real use-case for validate_bytes, or
> > something really close to it at least.
> 
> That's good to know, then I will keep it.
> 
> >> I think we should discuss the name `UntrustedUnsafe` though, I don't
> >> really like the `Unsafe` although I understand why you used it. What do
> >> you think of renaming the current `Untrusted` (and the one that only
> >> exposes `validate`) to `Unvalidated` and using `Untrusted` as the
> >> parameter for `Validator::validate`?
> > 
> > Much better, I didn't like my naming either.
> 
> While designing my LPC slides, I think that we actually don't need two
> types. We can just have `Untrusted` that exposes nothing except
> `validate[_bytes]` and have the `Validator::validate` function take the
> input data directly unwrapped. Thoughts?

I think the only tricky case is when you want to validate multiple
untrusted things together. But we could sort that out by making untrusted
a trait (or at least the validate function that runs on it), and then
implement it for tuples.

Which is an entirely irrelevant detour out of the way of saying "yes I
think this works and is much simpler."
 
> >> Alternatively, we could use `Unverified`/`Untrusted`, because
> >> unvalidated is not a "real English word". But then I think we also
> >> should rename `Validator` to `Verifier` etc.
> >>
> >>>> +    /// Gives direct access to the underlying untrusted data.
> >>>> +    ///
> >>>> +    /// Be careful when accessing the data, as it is untrusted and still needs to be verified! To
> >>>> +    /// do so use [`validate()`].
> >>>> +    ///
> >>>> +    /// [`validate()`]: Self::validate
> >>>> +    pub fn untrusted(&self) -> &T {
> >>>> +        &self.0
> >>>> +    }
> >>>> +
> >>>> +    /// Gives direct access to the underlying untrusted data.
> >>>> +    ///
> >>>> +    /// Be careful when accessing the data, as it is untrusted and still needs to be verified! To
> >>>> +    /// do so use [`validate()`].
> >>>> +    ///
> >>>> +    /// [`validate()`]: Self::validate
> >>>> +    pub fn untrusted_mut(&mut self) -> &mut T {
> >>>> +        &mut self.0
> >>>> +    }
> >>>> +
> >>>> +    /// Unwraps the data and removes the untrusted marking.
> >>>> +    ///
> >>>> +    /// Be careful when accessing the data, as it is untrusted and still needs to be verified! To
> >>>> +    /// do so use [`validate()`].
> >>>> +    ///
> >>>> +    /// [`validate()`]: Self::validate
> >>>> +    pub fn into_inner_untrusted(self) -> T
> >>>
> >>> I don't like the above two since they could be easily and accidentally
> >>> abused to leak untrusted data. And at least in your example I think
> >>> they're fallout from being a bit too eager with annotating data as
> >>> untrusted. The Folio and Mapped itself are just kernel structures, they
> >>> better be correct. So I don't think it's useful to annotate those as
> >>> untrusted and rely on Deref to also annotate the actual data as untrusted,
> >>> because it forces you to peak behind the box a few times.
> >>
> >> As I wrote in patch 2, I have no idea if I added the `Untrusted<T>` in
> >> the correct places, as I don't know folios. I would disagree, that these
> >> methods are necessary because of marking the entire folio as untrusted,
> >> they are needed to access the data from within the `Validator::validate`
> >> function, but with your above suggestion, we can move them to the
> >> internal type.
> >>
> >>> Instead I think we only want to annotate the Folio Deref:
> >>>
> >>> impl<S> Deref for Mapped<'_, S> {
> >>>     type Target = Untrusted<[u8]>;
> >>>
> >>> Now there will be folios that we trust because they're kernel internal
> >>> data and not file cache, so we need more flexibility here. No idea how to
> >>> best make that happen, but in either case it's only the Deref data itself
> >>> we don't trust, not any of the structures around it. One example would be
> >>> gpu page tables. One way to implement this would be to make the Target
> >>> type a generic of the folio, or well an associated type of the existing
> >>> generics folio paramater:
> >>>
> >>> pub struct Folio<S: FolioType>
> >>>
> >>> pub trait FolioType {
> >>> 	type Data;
> >>> }
> >>>
> >>> impl<T> FolioType for PageCache<T> {
> >>> 	type Data = Untrusted<[u8]>;
> >>> }
> >>>
> >>> And gpu pagetables would use a something like this instead:
> >>>
> >>> impl FolioType for GpuFolios {
> >>> 	type Data = [struct GpuPTE];
> >>> }
> >>>
> >>> All extremely untested.
> >>
> >> What I would like to avoid is having to do this for every possible data
> >> source. Ideally we could just wrap trusted data sources with `Untrusted`
> >> and be done with it.
> >> If the wrapped type is not plain data, but rather a smart pointer or
> >> other abstraction, then only the underlying data is marked untrusted
> >> (otherwise how would you even know that the pointer can be used?). So
> >> for example one might have an `Untrusted<ARef<[u8]>>`.
> > 
> > Yeah I think for pure smart pointers this is great, hence why I didn't
> > object to your Deref implementation. But if there's an entire
> > datastructure around it (like with pagecache) I think we need to sprinkle
> > associated types around to make this work.
> > 
> > What imo breaks annotations like this is if you have a lot of false
> > positive cases that force people to jump through hoops and normalize
> > having random uncecessary "validate" code all over. That defeats the
> > point.
> 
> Agreed.
> 
> >> What I think we should do instead is make our APIs that return untrusted
> >> data just return `Untrusted<Folio>` and implement the following method:
> >>
> >>     impl Folio {
> >>         pub fn read(self: &Untrusted<Self>) -> &Untrusted<[u8]>;
> >>     }
> >>
> >> I think that is the best of both worlds: we don't need to do excessive
> >> type shenanigans for every type carrying potentially untrusted data and
> >> we get to have methods specific to untrusted data.
> >>
> >> However, I think implementing this method will be a bit difficult with
> >> the `Untrusted`/`Unvalidated` split. Maybe we can have some `pub(crate)`
> >> methods on `Unvalidated` to perform some mappings?
> > 
> > The thing is, folios are just a pile of contig pages, and there's nothing
> > in the rules that they only contain untrusted data. Currently in rust code
> > we have that's the case, but not in general. So we need that associated
> > type.
> > 
> > But I also think Folio here is special, a lot of the other places where I
> > want this annotation it's the case that the data returned is _always_
> > untrusted. So we don't need to place associated types all over the
> > codebase to make this work, it's just that the rfc example you've picked
> > needs it.
> 
> I think we should try to make just wrapping stuff in `Untrusted` work. I
> don't see how the associated types would help you any more than just
> implementing stuff on `&Untrusted<Self`.

I guess you could wrap it as Untrusted in each use site when you get the
data out of the Folio, but that makes the guarantees we get out of these
annotations much less stringent. Which is why I think for Folio<> (well
really for Pagecache) we need to go with the associated type or it's a bit
self-defeating.

> > E.g. copy_from_user is _always_ untrusted, not exception. Network packets
> > we read are also always untrusted. When you have a device driver and want
> > to be robust against evil implementations (external bus like usb or cloud
> > virtual hw with confidential compute), then also everything you ever read
> > from that device is always untrusted until validated.
> > 
> > And the neat thing is if we get this right, there's a lot of cases where
> > the Untrusted<> wrapper doesn't matter, because we just pass from one
> > untrusted to another. E.g. for the write() syscall (or an implemenation of
> > that in an fs) we get a Untrusted<[u8]> from Folio and let copy_from_user
> > directly write into that. But that only works if the annotations are
> > exactly right, not too much, not too little.
> 
> Yeah this would be very nice, if you never need to look inside of the
> untrusted data, then you can just move it along.
> 
> > Oh another one we need:
> > 
> > impl<T: AsBytes> AsBytes for Untrusted<AsBytes>
> > 
> > Otherwise you can't write untrusted stuff to userspace, which is really no
> > issue at all (e.g. networking, where the kernel only parses the headers
> > and shovels the data to userspace unchecked).
> 
> Sure.
> 
> >>> Now I think there are cases you can't cover with just validate, where you
> >>> have multiple input data which needs to be cross-checked to ensure overall
> >>> validity. But I think that we can cover that by implementing a Validator
> >>> for tuples and making Untrusted a trait so that we can either Accept
> >>> Untrusted<(A, B)> or (Untrusted<A>, Untrusted<B>) interchangably when
> >>> calling validate().
> >>
> >> I could imagine us adding conversion functions that can combine
> >> untrusted values. Additionally, we should probably add a `Context` type
> >> to `Validator` that is an additional parameter.
> >>
> >>> At least with an hour of theorizing and your one example I couldn't come
> >>> up with anything else.
> >>
> >> Yeah, we need more users of this to know the full way to express this
> >> correctly. I would like to avoid huge refactorings in the future.
> > 
> > I think adding it to the copy_*_user functions we already have in
> > upstream, and then asking Alice to rebase binder should be a really solid
> > real-world testcase. And I think currently for the things in-flight
> > copy*user is going to be the main source of untrusted data anyway, not so
> > much page cache folios.
> 
> Sure. I chose tarfs as the use-case, because Greg mentioned to me that
> it would benefit from adding this API. (I have no prior linux kernel
> experience, so you giving me some pointers where this will be useful is
> very helpful!)
> 
> >>>> +impl Untrusted<[u8]> {
> >>>> +    /// Validate the given bytes directly.
> >>>> +    ///
> >>>> +    /// This is a convenience method to not have to implement the [`Validator`] trait to be able to
> >>>> +    /// just parse some bytes. If the bytes that you are validating have some structure and/or you
> >>>> +    /// will parse it into a `struct` or other rust type, then it is very much recommended to use
> >>>> +    /// the [`Validator`] trait and the [`validate()`] function instead.
> >>>> +    ///
> >>>> +    /// # Examples
> >>>> +    ///
> >>>> +    /// ```
> >>>> +    /// # fn get_untrusted_data() -> &'static Untrusted<[u8]> { &[0; 8] }
> >>>> +    ///
> >>>> +    /// let data: &Untrusted<[u8]> = get_untrusted_data();
> >>>> +    /// let data: Result<&[u8], ()> = data.validate_bytes::<()>(|untrusted| {
> >>>> +    ///     if untrusted.len() != 2 {
> >>>> +    ///         return Err(());
> >>>> +    ///     }
> >>>> +    ///     if untrusted[0] & 0xf0 != 0 {
> >>>> +    ///         return Err(());
> >>>> +    ///     }
> >>>> +    ///     if untrusted[1] >= 100 {
> >>>> +    ///         return Err(());
> >>>> +    ///     }
> >>>> +    ///     Ok(())
> >>>> +    /// });
> >>>> +    /// match data {
> >>>> +    ///     Ok(data) => pr_info!("successfully validated the data: {data}"),
> >>>> +    ///     Err(()) => pr_info!("read faulty data from hardware!"),
> >>>> +    /// }
> >>>> +    /// ```
> >>>> +    ///
> >>>> +    /// [`validate()`]: Self::validate
> >>>> +    pub fn validate_bytes<E>(
> >>>
> >>> I think this is a special case of a somewhat common in-place validation
> >>> pattern. For example in in complex syscall or ioctl we need to copy
> >>> structures from userspace anyway and doing yet another copy to put them
> >>> into a rust structure isn't great, instead we validate in-place. So I
> >>> think we need something like
> >>>
> >>> impl Untrusted<T> {
> >>> 	pub fn validate_inplace<E>(
> >>> 		&self,
> >>> 		validator: impl FnOnce(&T) -> Result<(), E>,
> >>> 		) -> Result <&T, E> {
> >>> 		...
> >>> 	}
> >>> }
> >>
> >> I had thought about in-place validation as well, but I first wanted to
> >> get a feel for how to do it, since I felt that in-place might make it
> >> significantly more complicated.
> >> In your proposal, you give a reference back, but maybe the data started
> >> out as a `Box<[u8]>` (or how do you usually copy from userspace?), so it
> >> would be better to be able to also handle owned data.
> > 
> > The code in upstream is just MaybUninit<[u8]>.
> 
> Where is that stored though? On the heap or the stack?
> 
> >> Also, it might be a good idea to just make the copy to kernel and
> >> validate a single step.
> > 
> > Yeah that'd be a nice helper, but I think conceptually you want it to be
> > two steps: Often for efficiency complex structures are linearized into one
> > single memory block, so that you can pull it all in with one
> > copy_from_user. But validation would need to look at each piece (and it's
> > often mixed, not just an array), probably together with Alice's Range
> > datatype to make sure we're don't have index confusions.
> 
> Yeah that makes sense.
> 
> >>> Eventually we might want a _mut version of this too, because often uapi
> >>> extensions means we need to correctly fill out default values for
> >>> extensions when being called by old userspace, and that requires
> >>> mutability. And it really is often an integral part of input validation.
> >>
> >> I see, will have to think about how to include this as well.
> >>
> >>> Also I think we might need an Iterator for Untrusted<I: IntoIterator> so
> >>> that we can validate Untrusted<[T]> and things like that with standard map
> >>> and collect and do it all inplace.
> >>
> >> Hmm, so a general iterator for `Unvalidated` might be a bad idea, but
> >> for the `Untrusted`, it should be fine.
> > 
> > Yup that was my thinking too, the idea being that you write a validator
> > for a single element, and then let Iterator magic handle things when you
> > need to validate an entire array.
> > 
> >>>> +pub trait Validator {
> >>>> +    /// Type of the input data that is untrusted.
> >>>> +    type Input: ?Sized;
> >>>> +    /// Type of the validated data.
> >>>> +    type Output;
> >>>
> >>> So I think the explicit Output makes sense if you have multiple different
> >>> untrusted input that validate to the same trusted structure, but I'm not
> >>> sure this makes sense as associated types. Instead I'd go with generics
> >>> and somethign like this:
> >>>
> >>> pub trait Validator<Input: ?Sized> {
> >>>     type Err;
> >>>
> >>>     fn validate(untrusted: &Untrusted<Input>) -> Result<Self, Self::Err>;
> >>> }
> >>>
> >>> That means you can't implement validate for types from other modules
> >>> directly but need a newtype (I think at least, not sure). But I think
> >>> that's actually a good thing, since often that means you're validating
> >>> some generic state plus whatever your own code needs (like the
> >>> inode::Params<tarfs::INodeData> in your example), and both pieces need to
> >>> be consisted overall and not just individually (otherwise why does the
> >>> that other module not do the parsing for you). And so explicitly treating
> >>> the validated output as an explicit new type just makes sense to me. Plus
> >>> with derive(Deref) it's trivial to unbox after validation.
> >>
> >> There might be the need to validate the same piece of data with
> >> different ways and I am not convinced adding a newtype for every single
> >> case is a good way to achieve it.
> >> Although it would simplify the `Validator` trait... I will think a bit
> >> about this.
> > 
> > Hm, but unless I misunderstand you already need a random type to attach
> > your current trait too? So not worse if we require that for the
> > less-common type of multiple ways to validate the same, and simpler for
> > the common one.
> 
> Yes, but you wouldn't have to unwrap the return type. For example with
> your proposal we have:
> 
>     struct MyINodeParams(INodeParams);
> 
>     impl Validator<[u8]> for MyINodeParams {
>         type Err = Error;
> 
>         fn validate(untrusted: &Untrusted<[u8]>) -> Result<Self> {
>             /*...*/
>             Ok(Self(params))
>         }
>     }
> 
>     impl MyINodeParams {
>         fn into_inner(self) -> INodeParams {
>             self.0
>         }
>     }
> 
> And then you would do:
> 
>     let params = untrusted.validate::<MyINodeParams>().into_inner();
> 
> I find the `into_inner()` a bit annoying (one could just use `.0`
> instead, but I also don't like that). I find specifying the `Output` a
> bit cleaner.

Hm right. But I guess with your new plan to only support validate, which
gets the inner passed in explicitly and returns whatever the closure
returns?

Cheers, Sima
-- 
Simona Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/3] rust: add untrusted data abstraction
  2024-09-20 14:29           ` Simona Vetter
@ 2024-09-20 15:28             ` Benno Lossin
  0 siblings, 0 replies; 18+ messages in thread
From: Benno Lossin @ 2024-09-20 15:28 UTC (permalink / raw)
  To: Simona Vetter, Greg KH, Miguel Ojeda, Alex Gaynor,
	Wedson Almeida Filho, Boqun Feng, Gary Guo, Björn Roy Baron,
	Andreas Hindborg, Alice Ryhl, Trevor Gross, rust-for-linux,
	linux-kernel

On 20.09.24 16:29, Simona Vetter wrote:
> On Wed, Sep 18, 2024 at 03:40:54PM +0000, Benno Lossin wrote:
>> On 16.09.24 17:49, Simona Vetter wrote:
>>> On Fri, Sep 13, 2024 at 04:49:29PM +0000, Benno Lossin wrote:
>>>> What I think we should do instead is make our APIs that return untrusted
>>>> data just return `Untrusted<Folio>` and implement the following method:
>>>>
>>>>     impl Folio {
>>>>         pub fn read(self: &Untrusted<Self>) -> &Untrusted<[u8]>;
>>>>     }
>>>>
>>>> I think that is the best of both worlds: we don't need to do excessive
>>>> type shenanigans for every type carrying potentially untrusted data and
>>>> we get to have methods specific to untrusted data.
>>>>
>>>> However, I think implementing this method will be a bit difficult with
>>>> the `Untrusted`/`Unvalidated` split. Maybe we can have some `pub(crate)`
>>>> methods on `Unvalidated` to perform some mappings?
>>>
>>> The thing is, folios are just a pile of contig pages, and there's nothing
>>> in the rules that they only contain untrusted data. Currently in rust code
>>> we have that's the case, but not in general. So we need that associated
>>> type.
>>>
>>> But I also think Folio here is special, a lot of the other places where I
>>> want this annotation it's the case that the data returned is _always_
>>> untrusted. So we don't need to place associated types all over the
>>> codebase to make this work, it's just that the rfc example you've picked
>>> needs it.
>>
>> I think we should try to make just wrapping stuff in `Untrusted` work. I
>> don't see how the associated types would help you any more than just
>> implementing stuff on `&Untrusted<Self`.
> 
> I guess you could wrap it as Untrusted in each use site when you get the
> data out of the Folio, but that makes the guarantees we get out of these
> annotations much less stringent. Which is why I think for Folio<> (well
> really for Pagecache) we need to go with the associated type or it's a bit
> self-defeating.

Let's just implement both ways and see which one is better.

[...]

>>>>>> +pub trait Validator {
>>>>>> +    /// Type of the input data that is untrusted.
>>>>>> +    type Input: ?Sized;
>>>>>> +    /// Type of the validated data.
>>>>>> +    type Output;
>>>>>
>>>>> So I think the explicit Output makes sense if you have multiple different
>>>>> untrusted input that validate to the same trusted structure, but I'm not
>>>>> sure this makes sense as associated types. Instead I'd go with generics
>>>>> and somethign like this:
>>>>>
>>>>> pub trait Validator<Input: ?Sized> {
>>>>>     type Err;
>>>>>
>>>>>     fn validate(untrusted: &Untrusted<Input>) -> Result<Self, Self::Err>;
>>>>> }
>>>>>
>>>>> That means you can't implement validate for types from other modules
>>>>> directly but need a newtype (I think at least, not sure). But I think
>>>>> that's actually a good thing, since often that means you're validating
>>>>> some generic state plus whatever your own code needs (like the
>>>>> inode::Params<tarfs::INodeData> in your example), and both pieces need to
>>>>> be consisted overall and not just individually (otherwise why does the
>>>>> that other module not do the parsing for you). And so explicitly treating
>>>>> the validated output as an explicit new type just makes sense to me. Plus
>>>>> with derive(Deref) it's trivial to unbox after validation.
>>>>
>>>> There might be the need to validate the same piece of data with
>>>> different ways and I am not convinced adding a newtype for every single
>>>> case is a good way to achieve it.
>>>> Although it would simplify the `Validator` trait... I will think a bit
>>>> about this.
>>>
>>> Hm, but unless I misunderstand you already need a random type to attach
>>> your current trait too? So not worse if we require that for the
>>> less-common type of multiple ways to validate the same, and simpler for
>>> the common one.
>>
>> Yes, but you wouldn't have to unwrap the return type. For example with
>> your proposal we have:
>>
>>     struct MyINodeParams(INodeParams);
>>
>>     impl Validator<[u8]> for MyINodeParams {
>>         type Err = Error;
>>
>>         fn validate(untrusted: &Untrusted<[u8]>) -> Result<Self> {
>>             /*...*/
>>             Ok(Self(params))
>>         }
>>     }
>>
>>     impl MyINodeParams {
>>         fn into_inner(self) -> INodeParams {
>>             self.0
>>         }
>>     }
>>
>> And then you would do:
>>
>>     let params = untrusted.validate::<MyINodeParams>().into_inner();
>>
>> I find the `into_inner()` a bit annoying (one could just use `.0`
>> instead, but I also don't like that). I find specifying the `Output` a
>> bit cleaner.
> 
> Hm right. But I guess with your new plan to only support validate, which
> gets the inner passed in explicitly and returns whatever the closure
> returns?

The only thing that changes with my suggestion is the parameter type of
`validate` (and the names):


    struct MyINodeParams(INodeParams);

    impl Validate<[u8]> for MyINodeParams {
        type Err = Error;

        fn validate(untrusted: &[u8]) -> Result<Self> {
            /* ... */
            Ok(Self(params))
        }
    }

    let params = untrusted.validate::<MyINodeParams>().into_inner();

And with the `Output` type on the trait we would have:

    struct MyINodeParams;

    impl Validate<[u8]> for MyINodeParams {
        type Err = Error;
        type Output = INodeParams;

        fn validate(untrusted: &[u8]) -> Result<Self::Output> {
            // ...
        }
    }

    let params = untrusted.validate::<MyINodeParams>();

I don't think that it's a huge difference, but nonetheless it is
probably useful.

But, I just remembered a probably more important thing: returning
`Result<Self>` will make it possible to use type inference in places
wehre you *do* want your custom type, so

    struct Foo { /* ... */ }
    impl Validate for Foo { /* ... */ }
    fn use_my_foo(foo: Foo) { /* ... */ }
    
    use_my_foo(untrusted.validate()?)

Should work (ie the `.validate()` call doesn't need the generic
argument).

So I think I will go for no `Output` type in the next version.

---
Cheers,
Benno


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/3] rust: add untrusted data abstraction
  2024-09-16 15:49       ` Simona Vetter
  2024-09-18 15:40         ` Benno Lossin
@ 2024-09-21  7:45         ` Benno Lossin
  2024-09-23 16:08           ` Simona Vetter
  1 sibling, 1 reply; 18+ messages in thread
From: Benno Lossin @ 2024-09-21  7:45 UTC (permalink / raw)
  To: Simona Vetter, Greg KH, Miguel Ojeda, Alex Gaynor,
	Wedson Almeida Filho, Boqun Feng, Gary Guo, Björn Roy Baron,
	Andreas Hindborg, Alice Ryhl, Trevor Gross, rust-for-linux,
	linux-kernel

On 16.09.24 17:49, Simona Vetter wrote:
> On Fri, Sep 13, 2024 at 04:49:29PM +0000, Benno Lossin wrote:
>> On 13.09.24 17:33, Simona Vetter wrote:
>>> On Fri, Sep 13, 2024 at 11:26:57AM +0000, Benno Lossin wrote:
>>>>  /// Used to transfer ownership to and from foreign (non-Rust) languages.
>>>> @@ -532,3 +533,248 @@ unsafe impl AsBytes for str {}
>>>>  // does not have any uninitialized portions either.
>>>>  unsafe impl<T: AsBytes> AsBytes for [T] {}
>>>>  unsafe impl<T: AsBytes, const N: usize> AsBytes for [T; N] {}
>>>> +
>>>> +/// Untrusted data of type `T`.
>>>> +///
>>>> +/// When reading data from userspace, hardware or other external untrusted sources, the data must
>>>> +/// be validated before it is used for logic within the kernel. To do so, the [`validate()`]
>>>> +/// function exists and uses the [`Validator`] trait. For raw bytes validation there also is the
>>>> +/// [`validate_bytes()`] function.
>>>> +///
>>>> +///
>>>> +/// [`validate()`]: Self::validate
>>>> +/// [`validate_bytes()`]: Self::validate_bytes
>>>> +#[repr(transparent)]
>>>> +pub struct Untrusted<T: ?Sized>(T);
>>>
>>> I think we could make this a bit more strict and instead of encouraging
>>> people to put all their validation code into a Validator<T>, force
>>> them to. Which means assuming apis wrap all untrusted stuff in
>>> Untrusted<T> reviewers only need to look at validate implementations and
>>> nothing else.
>>>
>>> If I'm not too wrong with my rust I think this would work with slitting
>>> Untrusted<T> into two types:
>>>
>>> - Untrusted<T> is just a box you can't access, and the type returned by
>>>   apis for untrusted data. It doesn't have any of the functions except
>>>   validate and new_untrusted so that you can't get at the data at all.
>>>
>>> - UntrustedUnsafe<T> does have all the accessors needed for validation,
>>>   but you can't construct that outside of this module. Maybe simplest to
>>>   just nest this within the Untrusted<T> box.
>>>
>>> - Untrusted::validate does the unboxing and passes a reference to
>>>   UntrustedUnsafe<T> to Validator::validate, to guarantee that all the
>>>   validation code is in there and nowhere else. Similar for
>>>   validate_bytes.
>>>
>>> Of course people can still bypass this easily, but it gets a bit harder to
>>> do so, and easier for reviewers to verify everything.
>>
>> This is a great idea!
>> I think we should also remove `validate_bytes`, then you really must
>> implement `Validator`. If we figure out later that people need it, we
>> can add it later again. I added it because I thought that implementing
>> `Validator` for something very small might be very annoying, but it
>> seems like you actually want to force people to implement it :)
> 
> See further down, I think there's a real use-case for validate_bytes, or
> something really close to it at least.

Could you point me to it? I wasn't able to find the use-case you are
referring to here.

If I remove the `untrusted()` function, then I would think that I also
should remove the `validate_bytes` function, since you could just do:

    untrusted.validate_bytes(|_| true)

Which would give you access to the untrusted data without validation.
(of course one could also do this in the `Validate` trait, but I feel
like if we have exactly one place where this can happen, it will be a
lot easier to catch. We could even have some tool that looks for
`Validate` implementations that just return `Ok(untrusted)`)

---
Cheers,
Benno

>> I think we should discuss the name `UntrustedUnsafe` though, I don't
>> really like the `Unsafe` although I understand why you used it. What do
>> you think of renaming the current `Untrusted` (and the one that only
>> exposes `validate`) to `Unvalidated` and using `Untrusted` as the
>> parameter for `Validator::validate`?
> 
> Much better, I didn't like my naming either.
> 
>> Alternatively, we could use `Unverified`/`Untrusted`, because
>> unvalidated is not a "real English word". But then I think we also
>> should rename `Validator` to `Verifier` etc.
>>
>>>> +    /// Gives direct access to the underlying untrusted data.
>>>> +    ///
>>>> +    /// Be careful when accessing the data, as it is untrusted and still needs to be verified! To
>>>> +    /// do so use [`validate()`].
>>>> +    ///
>>>> +    /// [`validate()`]: Self::validate
>>>> +    pub fn untrusted(&self) -> &T {
>>>> +        &self.0
>>>> +    }
>>>> +
>>>> +    /// Gives direct access to the underlying untrusted data.
>>>> +    ///
>>>> +    /// Be careful when accessing the data, as it is untrusted and still needs to be verified! To
>>>> +    /// do so use [`validate()`].
>>>> +    ///
>>>> +    /// [`validate()`]: Self::validate
>>>> +    pub fn untrusted_mut(&mut self) -> &mut T {
>>>> +        &mut self.0
>>>> +    }
>>>> +
>>>> +    /// Unwraps the data and removes the untrusted marking.
>>>> +    ///
>>>> +    /// Be careful when accessing the data, as it is untrusted and still needs to be verified! To
>>>> +    /// do so use [`validate()`].
>>>> +    ///
>>>> +    /// [`validate()`]: Self::validate
>>>> +    pub fn into_inner_untrusted(self) -> T
>>>
>>> I don't like the above two since they could be easily and accidentally
>>> abused to leak untrusted data. And at least in your example I think
>>> they're fallout from being a bit too eager with annotating data as
>>> untrusted. The Folio and Mapped itself are just kernel structures, they
>>> better be correct. So I don't think it's useful to annotate those as
>>> untrusted and rely on Deref to also annotate the actual data as untrusted,
>>> because it forces you to peak behind the box a few times.
>>
>> As I wrote in patch 2, I have no idea if I added the `Untrusted<T>` in
>> the correct places, as I don't know folios. I would disagree, that these
>> methods are necessary because of marking the entire folio as untrusted,
>> they are needed to access the data from within the `Validator::validate`
>> function, but with your above suggestion, we can move them to the
>> internal type.
>>
>>> Instead I think we only want to annotate the Folio Deref:
>>>
>>> impl<S> Deref for Mapped<'_, S> {
>>>     type Target = Untrusted<[u8]>;
>>>
>>> Now there will be folios that we trust because they're kernel internal
>>> data and not file cache, so we need more flexibility here. No idea how to
>>> best make that happen, but in either case it's only the Deref data itself
>>> we don't trust, not any of the structures around it. One example would be
>>> gpu page tables. One way to implement this would be to make the Target
>>> type a generic of the folio, or well an associated type of the existing
>>> generics folio paramater:
>>>
>>> pub struct Folio<S: FolioType>
>>>
>>> pub trait FolioType {
>>> 	type Data;
>>> }
>>>
>>> impl<T> FolioType for PageCache<T> {
>>> 	type Data = Untrusted<[u8]>;
>>> }
>>>
>>> And gpu pagetables would use a something like this instead:
>>>
>>> impl FolioType for GpuFolios {
>>> 	type Data = [struct GpuPTE];
>>> }
>>>
>>> All extremely untested.
>>
>> What I would like to avoid is having to do this for every possible data
>> source. Ideally we could just wrap trusted data sources with `Untrusted`
>> and be done with it.
>> If the wrapped type is not plain data, but rather a smart pointer or
>> other abstraction, then only the underlying data is marked untrusted
>> (otherwise how would you even know that the pointer can be used?). So
>> for example one might have an `Untrusted<ARef<[u8]>>`.
> 
> Yeah I think for pure smart pointers this is great, hence why I didn't
> object to your Deref implementation. But if there's an entire
> datastructure around it (like with pagecache) I think we need to sprinkle
> associated types around to make this work.
> 
> What imo breaks annotations like this is if you have a lot of false
> positive cases that force people to jump through hoops and normalize
> having random uncecessary "validate" code all over. That defeats the
> point.
> 
>> What I think we should do instead is make our APIs that return untrusted
>> data just return `Untrusted<Folio>` and implement the following method:
>>
>>     impl Folio {
>>         pub fn read(self: &Untrusted<Self>) -> &Untrusted<[u8]>;
>>     }
>>
>> I think that is the best of both worlds: we don't need to do excessive
>> type shenanigans for every type carrying potentially untrusted data and
>> we get to have methods specific to untrusted data.
>>
>> However, I think implementing this method will be a bit difficult with
>> the `Untrusted`/`Unvalidated` split. Maybe we can have some `pub(crate)`
>> methods on `Unvalidated` to perform some mappings?
> 
> The thing is, folios are just a pile of contig pages, and there's nothing
> in the rules that they only contain untrusted data. Currently in rust code
> we have that's the case, but not in general. So we need that associated
> type.
> 
> But I also think Folio here is special, a lot of the other places where I
> want this annotation it's the case that the data returned is _always_
> untrusted. So we don't need to place associated types all over the
> codebase to make this work, it's just that the rfc example you've picked
> needs it.
> 
> E.g. copy_from_user is _always_ untrusted, not exception. Network packets
> we read are also always untrusted. When you have a device driver and want
> to be robust against evil implementations (external bus like usb or cloud
> virtual hw with confidential compute), then also everything you ever read
> from that device is always untrusted until validated.
> 
> And the neat thing is if we get this right, there's a lot of cases where
> the Untrusted<> wrapper doesn't matter, because we just pass from one
> untrusted to another. E.g. for the write() syscall (or an implemenation of
> that in an fs) we get a Untrusted<[u8]> from Folio and let copy_from_user
> directly write into that. But that only works if the annotations are
> exactly right, not too much, not too little.
> 
> Oh another one we need:
> 
> impl<T: AsBytes> AsBytes for Untrusted<AsBytes>
> 
> Otherwise you can't write untrusted stuff to userspace, which is really no
> issue at all (e.g. networking, where the kernel only parses the headers
> and shovels the data to userspace unchecked).
> 
>>> Now I think there are cases you can't cover with just validate, where you
>>> have multiple input data which needs to be cross-checked to ensure overall
>>> validity. But I think that we can cover that by implementing a Validator
>>> for tuples and making Untrusted a trait so that we can either Accept
>>> Untrusted<(A, B)> or (Untrusted<A>, Untrusted<B>) interchangably when
>>> calling validate().
>>
>> I could imagine us adding conversion functions that can combine
>> untrusted values. Additionally, we should probably add a `Context` type
>> to `Validator` that is an additional parameter.
>>
>>> At least with an hour of theorizing and your one example I couldn't come
>>> up with anything else.
>>
>> Yeah, we need more users of this to know the full way to express this
>> correctly. I would like to avoid huge refactorings in the future.
> 
> I think adding it to the copy_*_user functions we already have in
> upstream, and then asking Alice to rebase binder should be a really solid
> real-world testcase. And I think currently for the things in-flight
> copy*user is going to be the main source of untrusted data anyway, not so
> much page cache folios.
> 
>>>> +impl Untrusted<[u8]> {
>>>> +    /// Validate the given bytes directly.
>>>> +    ///
>>>> +    /// This is a convenience method to not have to implement the [`Validator`] trait to be able to
>>>> +    /// just parse some bytes. If the bytes that you are validating have some structure and/or you
>>>> +    /// will parse it into a `struct` or other rust type, then it is very much recommended to use
>>>> +    /// the [`Validator`] trait and the [`validate()`] function instead.
>>>> +    ///
>>>> +    /// # Examples
>>>> +    ///
>>>> +    /// ```
>>>> +    /// # fn get_untrusted_data() -> &'static Untrusted<[u8]> { &[0; 8] }
>>>> +    ///
>>>> +    /// let data: &Untrusted<[u8]> = get_untrusted_data();
>>>> +    /// let data: Result<&[u8], ()> = data.validate_bytes::<()>(|untrusted| {
>>>> +    ///     if untrusted.len() != 2 {
>>>> +    ///         return Err(());
>>>> +    ///     }
>>>> +    ///     if untrusted[0] & 0xf0 != 0 {
>>>> +    ///         return Err(());
>>>> +    ///     }
>>>> +    ///     if untrusted[1] >= 100 {
>>>> +    ///         return Err(());
>>>> +    ///     }
>>>> +    ///     Ok(())
>>>> +    /// });
>>>> +    /// match data {
>>>> +    ///     Ok(data) => pr_info!("successfully validated the data: {data}"),
>>>> +    ///     Err(()) => pr_info!("read faulty data from hardware!"),
>>>> +    /// }
>>>> +    /// ```
>>>> +    ///
>>>> +    /// [`validate()`]: Self::validate
>>>> +    pub fn validate_bytes<E>(
>>>
>>> I think this is a special case of a somewhat common in-place validation
>>> pattern. For example in in complex syscall or ioctl we need to copy
>>> structures from userspace anyway and doing yet another copy to put them
>>> into a rust structure isn't great, instead we validate in-place. So I
>>> think we need something like
>>>
>>> impl Untrusted<T> {
>>> 	pub fn validate_inplace<E>(
>>> 		&self,
>>> 		validator: impl FnOnce(&T) -> Result<(), E>,
>>> 		) -> Result <&T, E> {
>>> 		...
>>> 	}
>>> }
>>
>> I had thought about in-place validation as well, but I first wanted to
>> get a feel for how to do it, since I felt that in-place might make it
>> significantly more complicated.
>> In your proposal, you give a reference back, but maybe the data started
>> out as a `Box<[u8]>` (or how do you usually copy from userspace?), so it
>> would be better to be able to also handle owned data.
> 
> The code in upstream is just MaybUninit<[u8]>.
> 
>> Also, it might be a good idea to just make the copy to kernel and
>> validate a single step.
> 
> Yeah that'd be a nice helper, but I think conceptually you want it to be
> two steps: Often for efficiency complex structures are linearized into one
> single memory block, so that you can pull it all in with one
> copy_from_user. But validation would need to look at each piece (and it's
> often mixed, not just an array), probably together with Alice's Range
> datatype to make sure we're don't have index confusions.
> 
>>> Eventually we might want a _mut version of this too, because often uapi
>>> extensions means we need to correctly fill out default values for
>>> extensions when being called by old userspace, and that requires
>>> mutability. And it really is often an integral part of input validation.
>>
>> I see, will have to think about how to include this as well.
>>
>>> Also I think we might need an Iterator for Untrusted<I: IntoIterator> so
>>> that we can validate Untrusted<[T]> and things like that with standard map
>>> and collect and do it all inplace.
>>
>> Hmm, so a general iterator for `Unvalidated` might be a bad idea, but
>> for the `Untrusted`, it should be fine.
> 
> Yup that was my thinking too, the idea being that you write a validator
> for a single element, and then let Iterator magic handle things when you
> need to validate an entire array.
> 
>>>> +pub trait Validator {
>>>> +    /// Type of the input data that is untrusted.
>>>> +    type Input: ?Sized;
>>>> +    /// Type of the validated data.
>>>> +    type Output;
>>>
>>> So I think the explicit Output makes sense if you have multiple different
>>> untrusted input that validate to the same trusted structure, but I'm not
>>> sure this makes sense as associated types. Instead I'd go with generics
>>> and somethign like this:
>>>
>>> pub trait Validator<Input: ?Sized> {
>>>     type Err;
>>>
>>>     fn validate(untrusted: &Untrusted<Input>) -> Result<Self, Self::Err>;
>>> }
>>>
>>> That means you can't implement validate for types from other modules
>>> directly but need a newtype (I think at least, not sure). But I think
>>> that's actually a good thing, since often that means you're validating
>>> some generic state plus whatever your own code needs (like the
>>> inode::Params<tarfs::INodeData> in your example), and both pieces need to
>>> be consisted overall and not just individually (otherwise why does the
>>> that other module not do the parsing for you). And so explicitly treating
>>> the validated output as an explicit new type just makes sense to me. Plus
>>> with derive(Deref) it's trivial to unbox after validation.
>>
>> There might be the need to validate the same piece of data with
>> different ways and I am not convinced adding a newtype for every single
>> case is a good way to achieve it.
>> Although it would simplify the `Validator` trait... I will think a bit
>> about this.
> 
> Hm, but unless I misunderstand you already need a random type to attach
> your current trait too? So not worse if we require that for the
> less-common type of multiple ways to validate the same, and simpler for
> the common one.
> 
>>> Cheers, Sima
>>
>> Thanks a lot for taking a look!
> 
> I _really_ want this :-)
> -Sima
> --
> Simona Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/3] rust: add untrusted data abstraction
  2024-09-21  7:45         ` Benno Lossin
@ 2024-09-23 16:08           ` Simona Vetter
  2024-09-23 16:56             ` Benno Lossin
  0 siblings, 1 reply; 18+ messages in thread
From: Simona Vetter @ 2024-09-23 16:08 UTC (permalink / raw)
  To: Benno Lossin
  Cc: Simona Vetter, Greg KH, Miguel Ojeda, Alex Gaynor,
	Wedson Almeida Filho, Boqun Feng, Gary Guo, Björn Roy Baron,
	Andreas Hindborg, Alice Ryhl, Trevor Gross, rust-for-linux,
	linux-kernel

On Sat, Sep 21, 2024 at 07:45:44AM +0000, Benno Lossin wrote:
> On 16.09.24 17:49, Simona Vetter wrote:
> > On Fri, Sep 13, 2024 at 04:49:29PM +0000, Benno Lossin wrote:
> >> On 13.09.24 17:33, Simona Vetter wrote:
> >>> On Fri, Sep 13, 2024 at 11:26:57AM +0000, Benno Lossin wrote:
> >>>>  /// Used to transfer ownership to and from foreign (non-Rust) languages.
> >>>> @@ -532,3 +533,248 @@ unsafe impl AsBytes for str {}
> >>>>  // does not have any uninitialized portions either.
> >>>>  unsafe impl<T: AsBytes> AsBytes for [T] {}
> >>>>  unsafe impl<T: AsBytes, const N: usize> AsBytes for [T; N] {}
> >>>> +
> >>>> +/// Untrusted data of type `T`.
> >>>> +///
> >>>> +/// When reading data from userspace, hardware or other external untrusted sources, the data must
> >>>> +/// be validated before it is used for logic within the kernel. To do so, the [`validate()`]
> >>>> +/// function exists and uses the [`Validator`] trait. For raw bytes validation there also is the
> >>>> +/// [`validate_bytes()`] function.
> >>>> +///
> >>>> +///
> >>>> +/// [`validate()`]: Self::validate
> >>>> +/// [`validate_bytes()`]: Self::validate_bytes
> >>>> +#[repr(transparent)]
> >>>> +pub struct Untrusted<T: ?Sized>(T);
> >>>
> >>> I think we could make this a bit more strict and instead of encouraging
> >>> people to put all their validation code into a Validator<T>, force
> >>> them to. Which means assuming apis wrap all untrusted stuff in
> >>> Untrusted<T> reviewers only need to look at validate implementations and
> >>> nothing else.
> >>>
> >>> If I'm not too wrong with my rust I think this would work with slitting
> >>> Untrusted<T> into two types:
> >>>
> >>> - Untrusted<T> is just a box you can't access, and the type returned by
> >>>   apis for untrusted data. It doesn't have any of the functions except
> >>>   validate and new_untrusted so that you can't get at the data at all.
> >>>
> >>> - UntrustedUnsafe<T> does have all the accessors needed for validation,
> >>>   but you can't construct that outside of this module. Maybe simplest to
> >>>   just nest this within the Untrusted<T> box.
> >>>
> >>> - Untrusted::validate does the unboxing and passes a reference to
> >>>   UntrustedUnsafe<T> to Validator::validate, to guarantee that all the
> >>>   validation code is in there and nowhere else. Similar for
> >>>   validate_bytes.
> >>>
> >>> Of course people can still bypass this easily, but it gets a bit harder to
> >>> do so, and easier for reviewers to verify everything.
> >>
> >> This is a great idea!
> >> I think we should also remove `validate_bytes`, then you really must
> >> implement `Validator`. If we figure out later that people need it, we
> >> can add it later again. I added it because I thought that implementing
> >> `Validator` for something very small might be very annoying, but it
> >> seems like you actually want to force people to implement it :)
> > 
> > See further down, I think there's a real use-case for validate_bytes, or
> > something really close to it at least.
> 
> Could you point me to it? I wasn't able to find the use-case you are
> referring to here.
> 
> If I remove the `untrusted()` function, then I would think that I also
> should remove the `validate_bytes` function, since you could just do:
> 
>     untrusted.validate_bytes(|_| true)
> 
> Which would give you access to the untrusted data without validation.
> (of course one could also do this in the `Validate` trait, but I feel
> like if we have exactly one place where this can happen, it will be a
> lot easier to catch. We could even have some tool that looks for
> `Validate` implementations that just return `Ok(untrusted)`)

So the use-case I've meant is the in-place validation, without copying
stuff around. So it wouldn't be the specific validate_bytes, but a
validate_inplace. In my original email I did directly call it that, but it
took me a w/e to realize that validate_bytes is really just a special-case
of validate_inplace when T = [u8], and so really we only need
validate_inplace for every case where we do some in-place validation.

And yes validate_inplace would be a very easy minimal escape hatch to just
unbox the unvalidated data. But fundamentally we can't prevent that, all
we can do is make sure that the validation code is concentrated in very
specific places reviewers can easily catch, and |_| true probably not
being correct validation code is rather easy to spot.

I do think we need inplace validation for the fairly common case where the
copy_from_user is split from the validation, e.g. when you copy a big
chunk of memory which is composed itself of variable-length objects that
reside within: You'd do one big copy_from_user and then a pile of inplace
validations for each thing that's in there, maybe with the assistance of
alice's Range module to make sure you don't split up the overall memory
area badly into individual pieces.
-Sima


> 
> ---
> Cheers,
> Benno
> 
> >> I think we should discuss the name `UntrustedUnsafe` though, I don't
> >> really like the `Unsafe` although I understand why you used it. What do
> >> you think of renaming the current `Untrusted` (and the one that only
> >> exposes `validate`) to `Unvalidated` and using `Untrusted` as the
> >> parameter for `Validator::validate`?
> > 
> > Much better, I didn't like my naming either.
> > 
> >> Alternatively, we could use `Unverified`/`Untrusted`, because
> >> unvalidated is not a "real English word". But then I think we also
> >> should rename `Validator` to `Verifier` etc.
> >>
> >>>> +    /// Gives direct access to the underlying untrusted data.
> >>>> +    ///
> >>>> +    /// Be careful when accessing the data, as it is untrusted and still needs to be verified! To
> >>>> +    /// do so use [`validate()`].
> >>>> +    ///
> >>>> +    /// [`validate()`]: Self::validate
> >>>> +    pub fn untrusted(&self) -> &T {
> >>>> +        &self.0
> >>>> +    }
> >>>> +
> >>>> +    /// Gives direct access to the underlying untrusted data.
> >>>> +    ///
> >>>> +    /// Be careful when accessing the data, as it is untrusted and still needs to be verified! To
> >>>> +    /// do so use [`validate()`].
> >>>> +    ///
> >>>> +    /// [`validate()`]: Self::validate
> >>>> +    pub fn untrusted_mut(&mut self) -> &mut T {
> >>>> +        &mut self.0
> >>>> +    }
> >>>> +
> >>>> +    /// Unwraps the data and removes the untrusted marking.
> >>>> +    ///
> >>>> +    /// Be careful when accessing the data, as it is untrusted and still needs to be verified! To
> >>>> +    /// do so use [`validate()`].
> >>>> +    ///
> >>>> +    /// [`validate()`]: Self::validate
> >>>> +    pub fn into_inner_untrusted(self) -> T
> >>>
> >>> I don't like the above two since they could be easily and accidentally
> >>> abused to leak untrusted data. And at least in your example I think
> >>> they're fallout from being a bit too eager with annotating data as
> >>> untrusted. The Folio and Mapped itself are just kernel structures, they
> >>> better be correct. So I don't think it's useful to annotate those as
> >>> untrusted and rely on Deref to also annotate the actual data as untrusted,
> >>> because it forces you to peak behind the box a few times.
> >>
> >> As I wrote in patch 2, I have no idea if I added the `Untrusted<T>` in
> >> the correct places, as I don't know folios. I would disagree, that these
> >> methods are necessary because of marking the entire folio as untrusted,
> >> they are needed to access the data from within the `Validator::validate`
> >> function, but with your above suggestion, we can move them to the
> >> internal type.
> >>
> >>> Instead I think we only want to annotate the Folio Deref:
> >>>
> >>> impl<S> Deref for Mapped<'_, S> {
> >>>     type Target = Untrusted<[u8]>;
> >>>
> >>> Now there will be folios that we trust because they're kernel internal
> >>> data and not file cache, so we need more flexibility here. No idea how to
> >>> best make that happen, but in either case it's only the Deref data itself
> >>> we don't trust, not any of the structures around it. One example would be
> >>> gpu page tables. One way to implement this would be to make the Target
> >>> type a generic of the folio, or well an associated type of the existing
> >>> generics folio paramater:
> >>>
> >>> pub struct Folio<S: FolioType>
> >>>
> >>> pub trait FolioType {
> >>> 	type Data;
> >>> }
> >>>
> >>> impl<T> FolioType for PageCache<T> {
> >>> 	type Data = Untrusted<[u8]>;
> >>> }
> >>>
> >>> And gpu pagetables would use a something like this instead:
> >>>
> >>> impl FolioType for GpuFolios {
> >>> 	type Data = [struct GpuPTE];
> >>> }
> >>>
> >>> All extremely untested.
> >>
> >> What I would like to avoid is having to do this for every possible data
> >> source. Ideally we could just wrap trusted data sources with `Untrusted`
> >> and be done with it.
> >> If the wrapped type is not plain data, but rather a smart pointer or
> >> other abstraction, then only the underlying data is marked untrusted
> >> (otherwise how would you even know that the pointer can be used?). So
> >> for example one might have an `Untrusted<ARef<[u8]>>`.
> > 
> > Yeah I think for pure smart pointers this is great, hence why I didn't
> > object to your Deref implementation. But if there's an entire
> > datastructure around it (like with pagecache) I think we need to sprinkle
> > associated types around to make this work.
> > 
> > What imo breaks annotations like this is if you have a lot of false
> > positive cases that force people to jump through hoops and normalize
> > having random uncecessary "validate" code all over. That defeats the
> > point.
> > 
> >> What I think we should do instead is make our APIs that return untrusted
> >> data just return `Untrusted<Folio>` and implement the following method:
> >>
> >>     impl Folio {
> >>         pub fn read(self: &Untrusted<Self>) -> &Untrusted<[u8]>;
> >>     }
> >>
> >> I think that is the best of both worlds: we don't need to do excessive
> >> type shenanigans for every type carrying potentially untrusted data and
> >> we get to have methods specific to untrusted data.
> >>
> >> However, I think implementing this method will be a bit difficult with
> >> the `Untrusted`/`Unvalidated` split. Maybe we can have some `pub(crate)`
> >> methods on `Unvalidated` to perform some mappings?
> > 
> > The thing is, folios are just a pile of contig pages, and there's nothing
> > in the rules that they only contain untrusted data. Currently in rust code
> > we have that's the case, but not in general. So we need that associated
> > type.
> > 
> > But I also think Folio here is special, a lot of the other places where I
> > want this annotation it's the case that the data returned is _always_
> > untrusted. So we don't need to place associated types all over the
> > codebase to make this work, it's just that the rfc example you've picked
> > needs it.
> > 
> > E.g. copy_from_user is _always_ untrusted, not exception. Network packets
> > we read are also always untrusted. When you have a device driver and want
> > to be robust against evil implementations (external bus like usb or cloud
> > virtual hw with confidential compute), then also everything you ever read
> > from that device is always untrusted until validated.
> > 
> > And the neat thing is if we get this right, there's a lot of cases where
> > the Untrusted<> wrapper doesn't matter, because we just pass from one
> > untrusted to another. E.g. for the write() syscall (or an implemenation of
> > that in an fs) we get a Untrusted<[u8]> from Folio and let copy_from_user
> > directly write into that. But that only works if the annotations are
> > exactly right, not too much, not too little.
> > 
> > Oh another one we need:
> > 
> > impl<T: AsBytes> AsBytes for Untrusted<AsBytes>
> > 
> > Otherwise you can't write untrusted stuff to userspace, which is really no
> > issue at all (e.g. networking, where the kernel only parses the headers
> > and shovels the data to userspace unchecked).
> > 
> >>> Now I think there are cases you can't cover with just validate, where you
> >>> have multiple input data which needs to be cross-checked to ensure overall
> >>> validity. But I think that we can cover that by implementing a Validator
> >>> for tuples and making Untrusted a trait so that we can either Accept
> >>> Untrusted<(A, B)> or (Untrusted<A>, Untrusted<B>) interchangably when
> >>> calling validate().
> >>
> >> I could imagine us adding conversion functions that can combine
> >> untrusted values. Additionally, we should probably add a `Context` type
> >> to `Validator` that is an additional parameter.
> >>
> >>> At least with an hour of theorizing and your one example I couldn't come
> >>> up with anything else.
> >>
> >> Yeah, we need more users of this to know the full way to express this
> >> correctly. I would like to avoid huge refactorings in the future.
> > 
> > I think adding it to the copy_*_user functions we already have in
> > upstream, and then asking Alice to rebase binder should be a really solid
> > real-world testcase. And I think currently for the things in-flight
> > copy*user is going to be the main source of untrusted data anyway, not so
> > much page cache folios.
> > 
> >>>> +impl Untrusted<[u8]> {
> >>>> +    /// Validate the given bytes directly.
> >>>> +    ///
> >>>> +    /// This is a convenience method to not have to implement the [`Validator`] trait to be able to
> >>>> +    /// just parse some bytes. If the bytes that you are validating have some structure and/or you
> >>>> +    /// will parse it into a `struct` or other rust type, then it is very much recommended to use
> >>>> +    /// the [`Validator`] trait and the [`validate()`] function instead.
> >>>> +    ///
> >>>> +    /// # Examples
> >>>> +    ///
> >>>> +    /// ```
> >>>> +    /// # fn get_untrusted_data() -> &'static Untrusted<[u8]> { &[0; 8] }
> >>>> +    ///
> >>>> +    /// let data: &Untrusted<[u8]> = get_untrusted_data();
> >>>> +    /// let data: Result<&[u8], ()> = data.validate_bytes::<()>(|untrusted| {
> >>>> +    ///     if untrusted.len() != 2 {
> >>>> +    ///         return Err(());
> >>>> +    ///     }
> >>>> +    ///     if untrusted[0] & 0xf0 != 0 {
> >>>> +    ///         return Err(());
> >>>> +    ///     }
> >>>> +    ///     if untrusted[1] >= 100 {
> >>>> +    ///         return Err(());
> >>>> +    ///     }
> >>>> +    ///     Ok(())
> >>>> +    /// });
> >>>> +    /// match data {
> >>>> +    ///     Ok(data) => pr_info!("successfully validated the data: {data}"),
> >>>> +    ///     Err(()) => pr_info!("read faulty data from hardware!"),
> >>>> +    /// }
> >>>> +    /// ```
> >>>> +    ///
> >>>> +    /// [`validate()`]: Self::validate
> >>>> +    pub fn validate_bytes<E>(
> >>>
> >>> I think this is a special case of a somewhat common in-place validation
> >>> pattern. For example in in complex syscall or ioctl we need to copy
> >>> structures from userspace anyway and doing yet another copy to put them
> >>> into a rust structure isn't great, instead we validate in-place. So I
> >>> think we need something like
> >>>
> >>> impl Untrusted<T> {
> >>> 	pub fn validate_inplace<E>(
> >>> 		&self,
> >>> 		validator: impl FnOnce(&T) -> Result<(), E>,
> >>> 		) -> Result <&T, E> {
> >>> 		...
> >>> 	}
> >>> }
> >>
> >> I had thought about in-place validation as well, but I first wanted to
> >> get a feel for how to do it, since I felt that in-place might make it
> >> significantly more complicated.
> >> In your proposal, you give a reference back, but maybe the data started
> >> out as a `Box<[u8]>` (or how do you usually copy from userspace?), so it
> >> would be better to be able to also handle owned data.
> > 
> > The code in upstream is just MaybUninit<[u8]>.
> > 
> >> Also, it might be a good idea to just make the copy to kernel and
> >> validate a single step.
> > 
> > Yeah that'd be a nice helper, but I think conceptually you want it to be
> > two steps: Often for efficiency complex structures are linearized into one
> > single memory block, so that you can pull it all in with one
> > copy_from_user. But validation would need to look at each piece (and it's
> > often mixed, not just an array), probably together with Alice's Range
> > datatype to make sure we're don't have index confusions.
> > 
> >>> Eventually we might want a _mut version of this too, because often uapi
> >>> extensions means we need to correctly fill out default values for
> >>> extensions when being called by old userspace, and that requires
> >>> mutability. And it really is often an integral part of input validation.
> >>
> >> I see, will have to think about how to include this as well.
> >>
> >>> Also I think we might need an Iterator for Untrusted<I: IntoIterator> so
> >>> that we can validate Untrusted<[T]> and things like that with standard map
> >>> and collect and do it all inplace.
> >>
> >> Hmm, so a general iterator for `Unvalidated` might be a bad idea, but
> >> for the `Untrusted`, it should be fine.
> > 
> > Yup that was my thinking too, the idea being that you write a validator
> > for a single element, and then let Iterator magic handle things when you
> > need to validate an entire array.
> > 
> >>>> +pub trait Validator {
> >>>> +    /// Type of the input data that is untrusted.
> >>>> +    type Input: ?Sized;
> >>>> +    /// Type of the validated data.
> >>>> +    type Output;
> >>>
> >>> So I think the explicit Output makes sense if you have multiple different
> >>> untrusted input that validate to the same trusted structure, but I'm not
> >>> sure this makes sense as associated types. Instead I'd go with generics
> >>> and somethign like this:
> >>>
> >>> pub trait Validator<Input: ?Sized> {
> >>>     type Err;
> >>>
> >>>     fn validate(untrusted: &Untrusted<Input>) -> Result<Self, Self::Err>;
> >>> }
> >>>
> >>> That means you can't implement validate for types from other modules
> >>> directly but need a newtype (I think at least, not sure). But I think
> >>> that's actually a good thing, since often that means you're validating
> >>> some generic state plus whatever your own code needs (like the
> >>> inode::Params<tarfs::INodeData> in your example), and both pieces need to
> >>> be consisted overall and not just individually (otherwise why does the
> >>> that other module not do the parsing for you). And so explicitly treating
> >>> the validated output as an explicit new type just makes sense to me. Plus
> >>> with derive(Deref) it's trivial to unbox after validation.
> >>
> >> There might be the need to validate the same piece of data with
> >> different ways and I am not convinced adding a newtype for every single
> >> case is a good way to achieve it.
> >> Although it would simplify the `Validator` trait... I will think a bit
> >> about this.
> > 
> > Hm, but unless I misunderstand you already need a random type to attach
> > your current trait too? So not worse if we require that for the
> > less-common type of multiple ways to validate the same, and simpler for
> > the common one.
> > 
> >>> Cheers, Sima
> >>
> >> Thanks a lot for taking a look!
> > 
> > I _really_ want this :-)
> > -Sima
> > --
> > Simona Vetter
> > Software Engineer, Intel Corporation
> > http://blog.ffwll.ch
> 

-- 
Simona Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/3] rust: add untrusted data abstraction
  2024-09-23 16:08           ` Simona Vetter
@ 2024-09-23 16:56             ` Benno Lossin
  2024-09-24  8:05               ` Simona Vetter
  0 siblings, 1 reply; 18+ messages in thread
From: Benno Lossin @ 2024-09-23 16:56 UTC (permalink / raw)
  To: Greg KH, Miguel Ojeda, Alex Gaynor, Wedson Almeida Filho,
	Boqun Feng, Gary Guo, Björn Roy Baron, Andreas Hindborg,
	Alice Ryhl, Trevor Gross, rust-for-linux, linux-kernel

On 23.09.24 18:08, Simona Vetter wrote:
> On Sat, Sep 21, 2024 at 07:45:44AM +0000, Benno Lossin wrote:
>> On 16.09.24 17:49, Simona Vetter wrote:
>>> On Fri, Sep 13, 2024 at 04:49:29PM +0000, Benno Lossin wrote:
>>>> On 13.09.24 17:33, Simona Vetter wrote:
>>>>> On Fri, Sep 13, 2024 at 11:26:57AM +0000, Benno Lossin wrote:
>>>>>>  /// Used to transfer ownership to and from foreign (non-Rust) languages.
>>>>>> @@ -532,3 +533,248 @@ unsafe impl AsBytes for str {}
>>>>>>  // does not have any uninitialized portions either.
>>>>>>  unsafe impl<T: AsBytes> AsBytes for [T] {}
>>>>>>  unsafe impl<T: AsBytes, const N: usize> AsBytes for [T; N] {}
>>>>>> +
>>>>>> +/// Untrusted data of type `T`.
>>>>>> +///
>>>>>> +/// When reading data from userspace, hardware or other external untrusted sources, the data must
>>>>>> +/// be validated before it is used for logic within the kernel. To do so, the [`validate()`]
>>>>>> +/// function exists and uses the [`Validator`] trait. For raw bytes validation there also is the
>>>>>> +/// [`validate_bytes()`] function.
>>>>>> +///
>>>>>> +///
>>>>>> +/// [`validate()`]: Self::validate
>>>>>> +/// [`validate_bytes()`]: Self::validate_bytes
>>>>>> +#[repr(transparent)]
>>>>>> +pub struct Untrusted<T: ?Sized>(T);
>>>>>
>>>>> I think we could make this a bit more strict and instead of encouraging
>>>>> people to put all their validation code into a Validator<T>, force
>>>>> them to. Which means assuming apis wrap all untrusted stuff in
>>>>> Untrusted<T> reviewers only need to look at validate implementations and
>>>>> nothing else.
>>>>>
>>>>> If I'm not too wrong with my rust I think this would work with slitting
>>>>> Untrusted<T> into two types:
>>>>>
>>>>> - Untrusted<T> is just a box you can't access, and the type returned by
>>>>>   apis for untrusted data. It doesn't have any of the functions except
>>>>>   validate and new_untrusted so that you can't get at the data at all.
>>>>>
>>>>> - UntrustedUnsafe<T> does have all the accessors needed for validation,
>>>>>   but you can't construct that outside of this module. Maybe simplest to
>>>>>   just nest this within the Untrusted<T> box.
>>>>>
>>>>> - Untrusted::validate does the unboxing and passes a reference to
>>>>>   UntrustedUnsafe<T> to Validator::validate, to guarantee that all the
>>>>>   validation code is in there and nowhere else. Similar for
>>>>>   validate_bytes.
>>>>>
>>>>> Of course people can still bypass this easily, but it gets a bit harder to
>>>>> do so, and easier for reviewers to verify everything.
>>>>
>>>> This is a great idea!
>>>> I think we should also remove `validate_bytes`, then you really must
>>>> implement `Validator`. If we figure out later that people need it, we
>>>> can add it later again. I added it because I thought that implementing
>>>> `Validator` for something very small might be very annoying, but it
>>>> seems like you actually want to force people to implement it :)
>>>
>>> See further down, I think there's a real use-case for validate_bytes, or
>>> something really close to it at least.
>>
>> Could you point me to it? I wasn't able to find the use-case you are
>> referring to here.
>>
>> If I remove the `untrusted()` function, then I would think that I also
>> should remove the `validate_bytes` function, since you could just do:
>>
>>     untrusted.validate_bytes(|_| true)
>>
>> Which would give you access to the untrusted data without validation.
>> (of course one could also do this in the `Validate` trait, but I feel
>> like if we have exactly one place where this can happen, it will be a
>> lot easier to catch. We could even have some tool that looks for
>> `Validate` implementations that just return `Ok(untrusted)`)
> 
> So the use-case I've meant is the in-place validation, without copying
> stuff around. So it wouldn't be the specific validate_bytes, but a
> validate_inplace. In my original email I did directly call it that, but it
> took me a w/e to realize that validate_bytes is really just a special-case
> of validate_inplace when T = [u8], and so really we only need
> validate_inplace for every case where we do some in-place validation.

Ahh I see, then we don't need `validate_bytes`.

> And yes validate_inplace would be a very easy minimal escape hatch to just
> unbox the unvalidated data. But fundamentally we can't prevent that, all
> we can do is make sure that the validation code is concentrated in very
> specific places reviewers can easily catch, and |_| true probably not
> being correct validation code is rather easy to spot.

I don't think that it necessarily has to be an escape hatch. I think I
can design an API where you can choose if it will be done in-place vs
copied, while still using the `Validate` trait.

> I do think we need inplace validation for the fairly common case where the
> copy_from_user is split from the validation, e.g. when you copy a big
> chunk of memory which is composed itself of variable-length objects that
> reside within: You'd do one big copy_from_user and then a pile of inplace
> validations for each thing that's in there, maybe with the assistance of
> alice's Range module to make sure you don't split up the overall memory
> area badly into individual pieces.

Yes, will definitely add it.

---
Cheers,
Benno


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/3] rust: add untrusted data abstraction
  2024-09-23 16:56             ` Benno Lossin
@ 2024-09-24  8:05               ` Simona Vetter
  0 siblings, 0 replies; 18+ messages in thread
From: Simona Vetter @ 2024-09-24  8:05 UTC (permalink / raw)
  To: Benno Lossin
  Cc: Greg KH, Miguel Ojeda, Alex Gaynor, Wedson Almeida Filho,
	Boqun Feng, Gary Guo, Björn Roy Baron, Andreas Hindborg,
	Alice Ryhl, Trevor Gross, rust-for-linux, linux-kernel

On Mon, Sep 23, 2024 at 04:56:08PM +0000, Benno Lossin wrote:
> On 23.09.24 18:08, Simona Vetter wrote:
> > On Sat, Sep 21, 2024 at 07:45:44AM +0000, Benno Lossin wrote:
> >> On 16.09.24 17:49, Simona Vetter wrote:
> >>> On Fri, Sep 13, 2024 at 04:49:29PM +0000, Benno Lossin wrote:
> >>>> On 13.09.24 17:33, Simona Vetter wrote:
> >>>>> On Fri, Sep 13, 2024 at 11:26:57AM +0000, Benno Lossin wrote:
> >>>>>>  /// Used to transfer ownership to and from foreign (non-Rust) languages.
> >>>>>> @@ -532,3 +533,248 @@ unsafe impl AsBytes for str {}
> >>>>>>  // does not have any uninitialized portions either.
> >>>>>>  unsafe impl<T: AsBytes> AsBytes for [T] {}
> >>>>>>  unsafe impl<T: AsBytes, const N: usize> AsBytes for [T; N] {}
> >>>>>> +
> >>>>>> +/// Untrusted data of type `T`.
> >>>>>> +///
> >>>>>> +/// When reading data from userspace, hardware or other external untrusted sources, the data must
> >>>>>> +/// be validated before it is used for logic within the kernel. To do so, the [`validate()`]
> >>>>>> +/// function exists and uses the [`Validator`] trait. For raw bytes validation there also is the
> >>>>>> +/// [`validate_bytes()`] function.
> >>>>>> +///
> >>>>>> +///
> >>>>>> +/// [`validate()`]: Self::validate
> >>>>>> +/// [`validate_bytes()`]: Self::validate_bytes
> >>>>>> +#[repr(transparent)]
> >>>>>> +pub struct Untrusted<T: ?Sized>(T);
> >>>>>
> >>>>> I think we could make this a bit more strict and instead of encouraging
> >>>>> people to put all their validation code into a Validator<T>, force
> >>>>> them to. Which means assuming apis wrap all untrusted stuff in
> >>>>> Untrusted<T> reviewers only need to look at validate implementations and
> >>>>> nothing else.
> >>>>>
> >>>>> If I'm not too wrong with my rust I think this would work with slitting
> >>>>> Untrusted<T> into two types:
> >>>>>
> >>>>> - Untrusted<T> is just a box you can't access, and the type returned by
> >>>>>   apis for untrusted data. It doesn't have any of the functions except
> >>>>>   validate and new_untrusted so that you can't get at the data at all.
> >>>>>
> >>>>> - UntrustedUnsafe<T> does have all the accessors needed for validation,
> >>>>>   but you can't construct that outside of this module. Maybe simplest to
> >>>>>   just nest this within the Untrusted<T> box.
> >>>>>
> >>>>> - Untrusted::validate does the unboxing and passes a reference to
> >>>>>   UntrustedUnsafe<T> to Validator::validate, to guarantee that all the
> >>>>>   validation code is in there and nowhere else. Similar for
> >>>>>   validate_bytes.
> >>>>>
> >>>>> Of course people can still bypass this easily, but it gets a bit harder to
> >>>>> do so, and easier for reviewers to verify everything.
> >>>>
> >>>> This is a great idea!
> >>>> I think we should also remove `validate_bytes`, then you really must
> >>>> implement `Validator`. If we figure out later that people need it, we
> >>>> can add it later again. I added it because I thought that implementing
> >>>> `Validator` for something very small might be very annoying, but it
> >>>> seems like you actually want to force people to implement it :)
> >>>
> >>> See further down, I think there's a real use-case for validate_bytes, or
> >>> something really close to it at least.
> >>
> >> Could you point me to it? I wasn't able to find the use-case you are
> >> referring to here.
> >>
> >> If I remove the `untrusted()` function, then I would think that I also
> >> should remove the `validate_bytes` function, since you could just do:
> >>
> >>     untrusted.validate_bytes(|_| true)
> >>
> >> Which would give you access to the untrusted data without validation.
> >> (of course one could also do this in the `Validate` trait, but I feel
> >> like if we have exactly one place where this can happen, it will be a
> >> lot easier to catch. We could even have some tool that looks for
> >> `Validate` implementations that just return `Ok(untrusted)`)
> > 
> > So the use-case I've meant is the in-place validation, without copying
> > stuff around. So it wouldn't be the specific validate_bytes, but a
> > validate_inplace. In my original email I did directly call it that, but it
> > took me a w/e to realize that validate_bytes is really just a special-case
> > of validate_inplace when T = [u8], and so really we only need
> > validate_inplace for every case where we do some in-place validation.
> 
> Ahh I see, then we don't need `validate_bytes`.
> 
> > And yes validate_inplace would be a very easy minimal escape hatch to just
> > unbox the unvalidated data. But fundamentally we can't prevent that, all
> > we can do is make sure that the validation code is concentrated in very
> > specific places reviewers can easily catch, and |_| true probably not
> > being correct validation code is rather easy to spot.
> 
> I don't think that it necessarily has to be an escape hatch. I think I
> can design an API where you can choose if it will be done in-place vs
> copied, while still using the `Validate` trait.

Yeah if you can put both copy and in-place validation into the Validate
trait then might be cleaner, I just didn't come up with a way to do that.

> > I do think we need inplace validation for the fairly common case where the
> > copy_from_user is split from the validation, e.g. when you copy a big
> > chunk of memory which is composed itself of variable-length objects that
> > reside within: You'd do one big copy_from_user and then a pile of inplace
> > validations for each thing that's in there, maybe with the assistance of
> > alice's Range module to make sure you don't split up the overall memory
> > area badly into individual pieces.
> 
> Yes, will definitely add it.

Sounds good!
-Sima
-- 
Simona Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2024-09-24  8:05 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <20240913112643.542914-1-benno.lossin@proton.me>
2024-09-13 11:26 ` [PATCH 1/3] rust: add untrusted data abstraction Benno Lossin
2024-09-13 13:41   ` Finn Behrens
2024-09-13 13:47     ` Benno Lossin
2024-09-13 15:33   ` Simona Vetter
2024-09-13 16:49     ` Benno Lossin
2024-09-16 15:49       ` Simona Vetter
2024-09-18 15:40         ` Benno Lossin
2024-09-18 17:09           ` Greg KH
2024-09-18 17:33             ` Benno Lossin
2024-09-18 17:39               ` Greg KH
2024-09-20 14:29           ` Simona Vetter
2024-09-20 15:28             ` Benno Lossin
2024-09-21  7:45         ` Benno Lossin
2024-09-23 16:08           ` Simona Vetter
2024-09-23 16:56             ` Benno Lossin
2024-09-24  8:05               ` Simona Vetter
2024-09-13 11:27 ` [RFC PATCH 2/3] WIP: rust: fs: mark data returned by inodes untrusted Benno Lossin
2024-09-13 11:27 ` [RFC PATCH 3/3] WIP: rust: tarfs: use untrusted data API Benno Lossin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).