All the mail mirrored from lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH for-next V5 0/5] HW Device hot-removal support
@ 2015-06-22 14:47 Yishai Hadas
       [not found] ` <1434984438-21733-1-git-send-email-yishaih-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 17+ messages in thread
From: Yishai Hadas @ 2015-06-22 14:47 UTC (permalink / raw
  To: dledford-H+wXaHxf7aLQT0dZR+AlfA
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, yishaih-VPRAkNaXOzVWk0Htik3J/w,
	raindel-VPRAkNaXOzVWk0Htik3J/w, jackm-VPRAkNaXOzVWk0Htik3J/w,
	jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/

Currently, if there is any user space application using an IB device,
it is impossible to unload the HW device driver for this device.

Similarly, if the device is hot-unplugged or reset, the device driver
hardware removal flow blocks until all user contexts are destroyed.

This patchset removes the above limitations from both uverbs and ucma.

The IB-core and uverbs layers are still required to remain loaded as
long as there are user applications using the verbs API. However, the
hardware device drivers are not blocked any more by the user space
activity.

To support this, the hardware device needs to expose a new kernel API
named 'disassociate_ucontext'. The device driver is given a ucontext
to detach from, and it should block this user context from any future
hardware access. In the IB-core level, we use this interface to
deactivate all ucontext that address a specific device when handling a
remove_one callback for it.

In the RDMA-CM layer, a similar change is needed in the ucma module,
to prevent blocking of the remove_one operation. This allows for
hot-removal of devices with RDMA-CM users in the user space.

The first two patches are preparation for this series.
The first patch fixes a reference counting issue pointed by Jason Gunthorpe.
The second patch is a preparation step for deploying RCU for the device
removal flow.

The third patch introduces the new API between the HW device driver and
the IB core. For devices which implement the functionality, IB core
will use it in remove_one, disassociating any active ucontext from the
hardware device. Other drivers that didn't implement it will behave as
today, remove_one will block until all ucontexts referring the device
are destroyed before returning.

The fourth patch provides implementation of this API for the mlx4
driver.

The last patch extends ucma to avoid blocking remove_one operation in
the cma module. When such device removal event is received, ucma is
turning all user contexts to zombie contexts. This is done by
releasing all underlying resources and preventing any further user
operations on the context.

Changes from V4:
#patch #1,#3 - addressed Jason's comments.
#patch #2, #4 - rebased upon last stuff.

Changes from V3:
Add 2 patches as a preparation for this series, details above.
patch #3: Change the locking schema based on Jason's comments.

Changes from V2:
patch #1: Rebase over ODP patches.

Changes from V1:
patch #1: Use uverbs flags instead of disassociate support, drop fatal_event_raised flag.
patch #3: Add support in ucma for handling device removal.

Changes from V0:
patch #1: ib_uverbs_close, reduced mutex scope to enable tasks run in parallel.

Yishai Hadas (5):
  IB/uverbs: Fix reference counting usage of event files
  IB/uverbs: Explicitly pass ib_dev to uverbs commands
  IB/uverbs: Enable device removal when there are active user space
    applications
  IB/mlx4_ib: Disassociate support
  IB/ucma: HW Device hot-removal support

 drivers/infiniband/core/ucma.c        |  130 ++++++++++-
 drivers/infiniband/core/uverbs.h      |   13 +-
 drivers/infiniband/core/uverbs_cmd.c  |  114 ++++++----
 drivers/infiniband/core/uverbs_main.c |  406 +++++++++++++++++++++++++++------
 drivers/infiniband/hw/mlx4/main.c     |  139 +++++++++++-
 drivers/infiniband/hw/mlx4/mlx4_ib.h  |   13 +
 include/rdma/ib_verbs.h               |    1 +
 7 files changed, 690 insertions(+), 126 deletions(-)

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH for-next V5 1/5] IB/uverbs: Fix reference counting usage of event files
       [not found] ` <1434984438-21733-1-git-send-email-yishaih-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2015-06-22 14:47   ` Yishai Hadas
       [not found]     ` <1434984438-21733-2-git-send-email-yishaih-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  2015-06-22 14:47   ` [PATCH for-next V5 2/5] IB/uverbs: Explicitly pass ib_dev to uverbs commands Yishai Hadas
                     ` (3 subsequent siblings)
  4 siblings, 1 reply; 17+ messages in thread
From: Yishai Hadas @ 2015-06-22 14:47 UTC (permalink / raw
  To: dledford-H+wXaHxf7aLQT0dZR+AlfA
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, yishaih-VPRAkNaXOzVWk0Htik3J/w,
	raindel-VPRAkNaXOzVWk0Htik3J/w, jackm-VPRAkNaXOzVWk0Htik3J/w,
	jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/

Fix the reference counting usage to be handled in the event file
creation/destruction function, instead of being done by the caller.
This is done for both async/non-async event files.

Based on Jason Gunthorpe report at https://www.mail-archive.com/
linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org/msg24680.html:
"The existing code for this is broken, in ib_uverbs_get_context all
the error paths between ib_uverbs_alloc_event_file and the
kref_get(file->ref) are wrong - this will result in fput() which will
call ib_uverbs_event_close, which will try to do kref_put and
ib_unregister_event_handler - which are no longer paired."

Signed-off-by: Yishai Hadas <yishaih-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Signed-off-by: Shachar Raindel <raindel-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
 drivers/infiniband/core/uverbs.h      |    1 +
 drivers/infiniband/core/uverbs_cmd.c  |   11 +--------
 drivers/infiniband/core/uverbs_main.c |   41 ++++++++++++++++++++++++++++-----
 3 files changed, 37 insertions(+), 16 deletions(-)

diff --git a/drivers/infiniband/core/uverbs.h b/drivers/infiniband/core/uverbs.h
index ba365b6..60e6e3d 100644
--- a/drivers/infiniband/core/uverbs.h
+++ b/drivers/infiniband/core/uverbs.h
@@ -178,6 +178,7 @@ void idr_remove_uobj(struct idr *idp, struct ib_uobject *uobj);
 
 struct file *ib_uverbs_alloc_event_file(struct ib_uverbs_file *uverbs_file,
 					int is_async);
+void ib_uverbs_free_async_event_file(struct ib_uverbs_file *uverbs_file);
 struct ib_uverbs_event_file *ib_uverbs_lookup_comp_file(int fd);
 
 void ib_uverbs_release_ucq(struct ib_uverbs_file *file,
diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c
index bbb02ff..5720a92 100644
--- a/drivers/infiniband/core/uverbs_cmd.c
+++ b/drivers/infiniband/core/uverbs_cmd.c
@@ -367,16 +367,6 @@ ssize_t ib_uverbs_get_context(struct ib_uverbs_file *file,
 		goto err_file;
 	}
 
-	file->async_file = filp->private_data;
-
-	INIT_IB_EVENT_HANDLER(&file->event_handler, file->device->ib_dev,
-			      ib_uverbs_event_handler);
-	ret = ib_register_event_handler(&file->event_handler);
-	if (ret)
-		goto err_file;
-
-	kref_get(&file->async_file->ref);
-	kref_get(&file->ref);
 	file->ucontext = ucontext;
 
 	fd_install(resp.async_fd, filp);
@@ -386,6 +376,7 @@ ssize_t ib_uverbs_get_context(struct ib_uverbs_file *file,
 	return in_len;
 
 err_file:
+	ib_uverbs_free_async_event_file(file);
 	fput(filp);
 
 err_fd:
diff --git a/drivers/infiniband/core/uverbs_main.c b/drivers/infiniband/core/uverbs_main.c
index f6eef2d..ed54e4e 100644
--- a/drivers/infiniband/core/uverbs_main.c
+++ b/drivers/infiniband/core/uverbs_main.c
@@ -406,10 +406,9 @@ static int ib_uverbs_event_close(struct inode *inode, struct file *filp)
 	}
 	spin_unlock_irq(&file->lock);
 
-	if (file->is_async) {
+	if (file->is_async)
 		ib_unregister_event_handler(&file->uverbs_file->event_handler);
-		kref_put(&file->uverbs_file->ref, ib_uverbs_release_file);
-	}
+	kref_put(&file->uverbs_file->ref, ib_uverbs_release_file);
 	kref_put(&file->ref, ib_uverbs_release_event_file);
 
 	return 0;
@@ -541,13 +540,20 @@ void ib_uverbs_event_handler(struct ib_event_handler *handler,
 				NULL, NULL);
 }
 
+void ib_uverbs_free_async_event_file(struct ib_uverbs_file *file)
+{
+	kref_put(&file->async_file->ref, ib_uverbs_release_event_file);
+	file->async_file = NULL;
+}
+
 struct file *ib_uverbs_alloc_event_file(struct ib_uverbs_file *uverbs_file,
 					int is_async)
 {
 	struct ib_uverbs_event_file *ev_file;
 	struct file *filp;
+	int ret;
 
-	ev_file = kmalloc(sizeof *ev_file, GFP_KERNEL);
+	ev_file = kzalloc(sizeof(*ev_file), GFP_KERNEL);
 	if (!ev_file)
 		return ERR_PTR(-ENOMEM);
 
@@ -557,15 +563,38 @@ struct file *ib_uverbs_alloc_event_file(struct ib_uverbs_file *uverbs_file,
 	init_waitqueue_head(&ev_file->poll_wait);
 	ev_file->uverbs_file = uverbs_file;
 	ev_file->async_queue = NULL;
-	ev_file->is_async    = is_async;
 	ev_file->is_closed   = 0;
 
 	filp = anon_inode_getfile("[infinibandevent]", &uverbs_event_fops,
 				  ev_file, O_RDONLY);
 	if (IS_ERR(filp))
-		kfree(ev_file);
+		goto err;
+
+	kref_get(&uverbs_file->ref);
+	if (is_async) {
+		WARN_ON(uverbs_file->async_file);
+		uverbs_file->async_file = ev_file;
+		kref_get(&uverbs_file->async_file->ref);
+		INIT_IB_EVENT_HANDLER(&uverbs_file->event_handler,
+				      uverbs_file->device->ib_dev,
+				      ib_uverbs_event_handler);
+		ret = ib_register_event_handler(&uverbs_file->event_handler);
+		if (ret)
+			goto put_file;
+
+		/* At that point async file stuff was fully set */
+		ev_file->is_async = 1;
+	}
 
 	return filp;
+
+put_file:
+	fput(filp);
+	uverbs_file->async_file = NULL;
+	filp = ERR_PTR(ret);
+err:
+	kref_put(&ev_file->ref, ib_uverbs_release_event_file);
+	return filp;
 }
 
 /*
-- 
1.7.1

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH for-next V5 2/5] IB/uverbs: Explicitly pass ib_dev to uverbs commands
       [not found] ` <1434984438-21733-1-git-send-email-yishaih-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  2015-06-22 14:47   ` [PATCH for-next V5 1/5] IB/uverbs: Fix reference counting usage of event files Yishai Hadas
@ 2015-06-22 14:47   ` Yishai Hadas
  2015-06-22 14:47   ` [PATCH for-next V5 3/5] IB/uverbs: Enable device removal when there are active user space applications Yishai Hadas
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 17+ messages in thread
From: Yishai Hadas @ 2015-06-22 14:47 UTC (permalink / raw
  To: dledford-H+wXaHxf7aLQT0dZR+AlfA
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, yishaih-VPRAkNaXOzVWk0Htik3J/w,
	raindel-VPRAkNaXOzVWk0Htik3J/w, jackm-VPRAkNaXOzVWk0Htik3J/w,
	jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/

Done in preparation for deploying RCU for the device removal
flow. Allows isolating the RCU handling to the uverb_main layer and
keeping the uverbs_cmd code as is.

Signed-off-by: Yishai Hadas <yishaih-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Signed-off-by: Shachar Raindel <raindel-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
 drivers/infiniband/core/uverbs.h      |    3 +
 drivers/infiniband/core/uverbs_cmd.c  |  103 ++++++++++++++++++++++-----------
 drivers/infiniband/core/uverbs_main.c |   21 +++++--
 3 files changed, 88 insertions(+), 39 deletions(-)

diff --git a/drivers/infiniband/core/uverbs.h b/drivers/infiniband/core/uverbs.h
index 60e6e3d..dc002a1 100644
--- a/drivers/infiniband/core/uverbs.h
+++ b/drivers/infiniband/core/uverbs.h
@@ -177,6 +177,7 @@ extern struct idr ib_uverbs_rule_idr;
 void idr_remove_uobj(struct idr *idp, struct ib_uobject *uobj);
 
 struct file *ib_uverbs_alloc_event_file(struct ib_uverbs_file *uverbs_file,
+					struct ib_device *ib_dev,
 					int is_async);
 void ib_uverbs_free_async_event_file(struct ib_uverbs_file *uverbs_file);
 struct ib_uverbs_event_file *ib_uverbs_lookup_comp_file(int fd);
@@ -213,6 +214,7 @@ struct ib_uverbs_flow_spec {
 
 #define IB_UVERBS_DECLARE_CMD(name)					\
 	ssize_t ib_uverbs_##name(struct ib_uverbs_file *file,		\
+				 struct ib_device *ib_dev,              \
 				 const char __user *buf, int in_len,	\
 				 int out_len)
 
@@ -254,6 +256,7 @@ IB_UVERBS_DECLARE_CMD(close_xrcd);
 
 #define IB_UVERBS_DECLARE_EX_CMD(name)				\
 	int ib_uverbs_ex_##name(struct ib_uverbs_file *file,	\
+				struct ib_device *ib_dev,		\
 				struct ib_udata *ucore,		\
 				struct ib_udata *uhw)
 
diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c
index 5720a92..29443c0 100644
--- a/drivers/infiniband/core/uverbs_cmd.c
+++ b/drivers/infiniband/core/uverbs_cmd.c
@@ -282,13 +282,13 @@ static void put_xrcd_read(struct ib_uobject *uobj)
 }
 
 ssize_t ib_uverbs_get_context(struct ib_uverbs_file *file,
+			      struct ib_device *ib_dev,
 			      const char __user *buf,
 			      int in_len, int out_len)
 {
 	struct ib_uverbs_get_context      cmd;
 	struct ib_uverbs_get_context_resp resp;
 	struct ib_udata                   udata;
-	struct ib_device                 *ibdev = file->device->ib_dev;
 #ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
 	struct ib_device_attr		  dev_attr;
 #endif
@@ -313,13 +313,13 @@ ssize_t ib_uverbs_get_context(struct ib_uverbs_file *file,
 		   (unsigned long) cmd.response + sizeof resp,
 		   in_len - sizeof cmd, out_len - sizeof resp);
 
-	ucontext = ibdev->alloc_ucontext(ibdev, &udata);
+	ucontext = ib_dev->alloc_ucontext(ib_dev, &udata);
 	if (IS_ERR(ucontext)) {
 		ret = PTR_ERR(ucontext);
 		goto err;
 	}
 
-	ucontext->device = ibdev;
+	ucontext->device = ib_dev;
 	INIT_LIST_HEAD(&ucontext->pd_list);
 	INIT_LIST_HEAD(&ucontext->mr_list);
 	INIT_LIST_HEAD(&ucontext->mw_list);
@@ -340,7 +340,7 @@ ssize_t ib_uverbs_get_context(struct ib_uverbs_file *file,
 	ucontext->odp_mrs_count = 0;
 	INIT_LIST_HEAD(&ucontext->no_private_counters);
 
-	ret = ib_query_device(ibdev, &dev_attr);
+	ret = ib_query_device(ib_dev, &dev_attr);
 	if (ret)
 		goto err_free;
 	if (!(dev_attr.device_cap_flags & IB_DEVICE_ON_DEMAND_PAGING))
@@ -355,7 +355,7 @@ ssize_t ib_uverbs_get_context(struct ib_uverbs_file *file,
 		goto err_free;
 	resp.async_fd = ret;
 
-	filp = ib_uverbs_alloc_event_file(file, 1);
+	filp = ib_uverbs_alloc_event_file(file, ib_dev, 1);
 	if (IS_ERR(filp)) {
 		ret = PTR_ERR(filp);
 		goto err_fd;
@@ -384,7 +384,7 @@ err_fd:
 
 err_free:
 	put_pid(ucontext->tgid);
-	ibdev->dealloc_ucontext(ucontext);
+	ib_dev->dealloc_ucontext(ucontext);
 
 err:
 	mutex_unlock(&file->mutex);
@@ -392,11 +392,12 @@ err:
 }
 
 static void copy_query_dev_fields(struct ib_uverbs_file *file,
+				  struct ib_device *ib_dev,
 				  struct ib_uverbs_query_device_resp *resp,
 				  struct ib_device_attr *attr)
 {
 	resp->fw_ver		= attr->fw_ver;
-	resp->node_guid		= file->device->ib_dev->node_guid;
+	resp->node_guid		= ib_dev->node_guid;
 	resp->sys_image_guid	= attr->sys_image_guid;
 	resp->max_mr_size	= attr->max_mr_size;
 	resp->page_size_cap	= attr->page_size_cap;
@@ -434,10 +435,11 @@ static void copy_query_dev_fields(struct ib_uverbs_file *file,
 	resp->max_srq_sge		= attr->max_srq_sge;
 	resp->max_pkeys			= attr->max_pkeys;
 	resp->local_ca_ack_delay	= attr->local_ca_ack_delay;
-	resp->phys_port_cnt		= file->device->ib_dev->phys_port_cnt;
+	resp->phys_port_cnt		= ib_dev->phys_port_cnt;
 }
 
 ssize_t ib_uverbs_query_device(struct ib_uverbs_file *file,
+			       struct ib_device *ib_dev,
 			       const char __user *buf,
 			       int in_len, int out_len)
 {
@@ -452,12 +454,12 @@ ssize_t ib_uverbs_query_device(struct ib_uverbs_file *file,
 	if (copy_from_user(&cmd, buf, sizeof cmd))
 		return -EFAULT;
 
-	ret = ib_query_device(file->device->ib_dev, &attr);
+	ret = ib_query_device(ib_dev, &attr);
 	if (ret)
 		return ret;
 
 	memset(&resp, 0, sizeof resp);
-	copy_query_dev_fields(file, &resp, &attr);
+	copy_query_dev_fields(file, ib_dev, &resp, &attr);
 
 	if (copy_to_user((void __user *) (unsigned long) cmd.response,
 			 &resp, sizeof resp))
@@ -467,6 +469,7 @@ ssize_t ib_uverbs_query_device(struct ib_uverbs_file *file,
 }
 
 ssize_t ib_uverbs_query_port(struct ib_uverbs_file *file,
+			     struct ib_device *ib_dev,
 			     const char __user *buf,
 			     int in_len, int out_len)
 {
@@ -481,7 +484,7 @@ ssize_t ib_uverbs_query_port(struct ib_uverbs_file *file,
 	if (copy_from_user(&cmd, buf, sizeof cmd))
 		return -EFAULT;
 
-	ret = ib_query_port(file->device->ib_dev, cmd.port_num, &attr);
+	ret = ib_query_port(ib_dev, cmd.port_num, &attr);
 	if (ret)
 		return ret;
 
@@ -506,7 +509,7 @@ ssize_t ib_uverbs_query_port(struct ib_uverbs_file *file,
 	resp.active_width    = attr.active_width;
 	resp.active_speed    = attr.active_speed;
 	resp.phys_state      = attr.phys_state;
-	resp.link_layer      = rdma_port_get_link_layer(file->device->ib_dev,
+	resp.link_layer      = rdma_port_get_link_layer(ib_dev,
 							cmd.port_num);
 
 	if (copy_to_user((void __user *) (unsigned long) cmd.response,
@@ -517,6 +520,7 @@ ssize_t ib_uverbs_query_port(struct ib_uverbs_file *file,
 }
 
 ssize_t ib_uverbs_alloc_pd(struct ib_uverbs_file *file,
+			   struct ib_device *ib_dev,
 			   const char __user *buf,
 			   int in_len, int out_len)
 {
@@ -544,14 +548,13 @@ ssize_t ib_uverbs_alloc_pd(struct ib_uverbs_file *file,
 	init_uobj(uobj, 0, file->ucontext, &pd_lock_class);
 	down_write(&uobj->mutex);
 
-	pd = file->device->ib_dev->alloc_pd(file->device->ib_dev,
-					    file->ucontext, &udata);
+	pd = ib_dev->alloc_pd(ib_dev, file->ucontext, &udata);
 	if (IS_ERR(pd)) {
 		ret = PTR_ERR(pd);
 		goto err;
 	}
 
-	pd->device  = file->device->ib_dev;
+	pd->device  = ib_dev;
 	pd->uobject = uobj;
 	atomic_set(&pd->usecnt, 0);
 
@@ -591,6 +594,7 @@ err:
 }
 
 ssize_t ib_uverbs_dealloc_pd(struct ib_uverbs_file *file,
+			     struct ib_device *ib_dev,
 			     const char __user *buf,
 			     int in_len, int out_len)
 {
@@ -711,6 +715,7 @@ static void xrcd_table_delete(struct ib_uverbs_device *dev,
 }
 
 ssize_t ib_uverbs_open_xrcd(struct ib_uverbs_file *file,
+			    struct ib_device *ib_dev,
 			    const char __user *buf, int in_len,
 			    int out_len)
 {
@@ -769,15 +774,14 @@ ssize_t ib_uverbs_open_xrcd(struct ib_uverbs_file *file,
 	down_write(&obj->uobject.mutex);
 
 	if (!xrcd) {
-		xrcd = file->device->ib_dev->alloc_xrcd(file->device->ib_dev,
-							file->ucontext, &udata);
+		xrcd = ib_dev->alloc_xrcd(ib_dev, file->ucontext, &udata);
 		if (IS_ERR(xrcd)) {
 			ret = PTR_ERR(xrcd);
 			goto err;
 		}
 
 		xrcd->inode   = inode;
-		xrcd->device  = file->device->ib_dev;
+		xrcd->device  = ib_dev;
 		atomic_set(&xrcd->usecnt, 0);
 		mutex_init(&xrcd->tgt_qp_mutex);
 		INIT_LIST_HEAD(&xrcd->tgt_qp_list);
@@ -848,6 +852,7 @@ err_tree_mutex_unlock:
 }
 
 ssize_t ib_uverbs_close_xrcd(struct ib_uverbs_file *file,
+			     struct ib_device *ib_dev,
 			     const char __user *buf, int in_len,
 			     int out_len)
 {
@@ -925,6 +930,7 @@ void ib_uverbs_dealloc_xrcd(struct ib_uverbs_device *dev,
 }
 
 ssize_t ib_uverbs_reg_mr(struct ib_uverbs_file *file,
+			 struct ib_device *ib_dev,
 			 const char __user *buf, int in_len,
 			 int out_len)
 {
@@ -1034,6 +1040,7 @@ err_free:
 }
 
 ssize_t ib_uverbs_rereg_mr(struct ib_uverbs_file *file,
+			   struct ib_device *ib_dev,
 			   const char __user *buf, int in_len,
 			   int out_len)
 {
@@ -1127,6 +1134,7 @@ put_uobjs:
 }
 
 ssize_t ib_uverbs_dereg_mr(struct ib_uverbs_file *file,
+			   struct ib_device *ib_dev,
 			   const char __user *buf, int in_len,
 			   int out_len)
 {
@@ -1165,8 +1173,9 @@ ssize_t ib_uverbs_dereg_mr(struct ib_uverbs_file *file,
 }
 
 ssize_t ib_uverbs_alloc_mw(struct ib_uverbs_file *file,
-			 const char __user *buf, int in_len,
-			 int out_len)
+			   struct ib_device *ib_dev,
+			   const char __user *buf, int in_len,
+			   int out_len)
 {
 	struct ib_uverbs_alloc_mw      cmd;
 	struct ib_uverbs_alloc_mw_resp resp;
@@ -1247,8 +1256,9 @@ err_free:
 }
 
 ssize_t ib_uverbs_dealloc_mw(struct ib_uverbs_file *file,
-			   const char __user *buf, int in_len,
-			   int out_len)
+			     struct ib_device *ib_dev,
+			     const char __user *buf, int in_len,
+			     int out_len)
 {
 	struct ib_uverbs_dealloc_mw cmd;
 	struct ib_mw               *mw;
@@ -1285,6 +1295,7 @@ ssize_t ib_uverbs_dealloc_mw(struct ib_uverbs_file *file,
 }
 
 ssize_t ib_uverbs_create_comp_channel(struct ib_uverbs_file *file,
+				      struct ib_device *ib_dev,
 				      const char __user *buf, int in_len,
 				      int out_len)
 {
@@ -1304,7 +1315,7 @@ ssize_t ib_uverbs_create_comp_channel(struct ib_uverbs_file *file,
 		return ret;
 	resp.fd = ret;
 
-	filp = ib_uverbs_alloc_event_file(file, 0);
+	filp = ib_uverbs_alloc_event_file(file, ib_dev, 0);
 	if (IS_ERR(filp)) {
 		put_unused_fd(resp.fd);
 		return PTR_ERR(filp);
@@ -1322,6 +1333,7 @@ ssize_t ib_uverbs_create_comp_channel(struct ib_uverbs_file *file,
 }
 
 static struct ib_ucq_object *create_cq(struct ib_uverbs_file *file,
+					struct ib_device *ib_dev,
 				       struct ib_udata *ucore,
 				       struct ib_udata *uhw,
 				       struct ib_uverbs_ex_create_cq *cmd,
@@ -1370,14 +1382,14 @@ static struct ib_ucq_object *create_cq(struct ib_uverbs_file *file,
 	if (cmd_sz > offsetof(typeof(*cmd), flags) + sizeof(cmd->flags))
 		attr.flags = cmd->flags;
 
-	cq = file->device->ib_dev->create_cq(file->device->ib_dev, &attr,
+	cq = ib_dev->create_cq(ib_dev, &attr,
 					     file->ucontext, uhw);
 	if (IS_ERR(cq)) {
 		ret = PTR_ERR(cq);
 		goto err_file;
 	}
 
-	cq->device        = file->device->ib_dev;
+	cq->device        = ib_dev;
 	cq->uobject       = &obj->uobject;
 	cq->comp_handler  = ib_uverbs_comp_handler;
 	cq->event_handler = ib_uverbs_cq_event_handler;
@@ -1438,6 +1450,7 @@ static int ib_uverbs_create_cq_cb(struct ib_uverbs_file *file,
 }
 
 ssize_t ib_uverbs_create_cq(struct ib_uverbs_file *file,
+			    struct ib_device *ib_dev,
 			    const char __user *buf, int in_len,
 			    int out_len)
 {
@@ -1466,7 +1479,7 @@ ssize_t ib_uverbs_create_cq(struct ib_uverbs_file *file,
 	cmd_ex.comp_vector = cmd.comp_vector;
 	cmd_ex.comp_channel = cmd.comp_channel;
 
-	obj = create_cq(file, &ucore, &uhw, &cmd_ex,
+	obj = create_cq(file, ib_dev, &ucore, &uhw, &cmd_ex,
 			offsetof(typeof(cmd_ex), comp_channel) +
 			sizeof(cmd.comp_channel), ib_uverbs_create_cq_cb,
 			NULL);
@@ -1489,6 +1502,7 @@ static int ib_uverbs_ex_create_cq_cb(struct ib_uverbs_file *file,
 }
 
 int ib_uverbs_ex_create_cq(struct ib_uverbs_file *file,
+			 struct ib_device *ib_dev,
 			   struct ib_udata *ucore,
 			   struct ib_udata *uhw)
 {
@@ -1514,7 +1528,7 @@ int ib_uverbs_ex_create_cq(struct ib_uverbs_file *file,
 			     sizeof(resp.response_length)))
 		return -ENOSPC;
 
-	obj = create_cq(file, ucore, uhw, &cmd,
+	obj = create_cq(file, ib_dev, ucore, uhw, &cmd,
 			min(ucore->inlen, sizeof(cmd)),
 			ib_uverbs_ex_create_cq_cb, NULL);
 
@@ -1525,6 +1539,7 @@ int ib_uverbs_ex_create_cq(struct ib_uverbs_file *file,
 }
 
 ssize_t ib_uverbs_resize_cq(struct ib_uverbs_file *file,
+			    struct ib_device *ib_dev,
 			    const char __user *buf, int in_len,
 			    int out_len)
 {
@@ -1588,6 +1603,7 @@ static int copy_wc_to_user(void __user *dest, struct ib_wc *wc)
 }
 
 ssize_t ib_uverbs_poll_cq(struct ib_uverbs_file *file,
+			  struct ib_device *ib_dev,
 			  const char __user *buf, int in_len,
 			  int out_len)
 {
@@ -1639,6 +1655,7 @@ out_put:
 }
 
 ssize_t ib_uverbs_req_notify_cq(struct ib_uverbs_file *file,
+				struct ib_device *ib_dev,
 				const char __user *buf, int in_len,
 				int out_len)
 {
@@ -1661,6 +1678,7 @@ ssize_t ib_uverbs_req_notify_cq(struct ib_uverbs_file *file,
 }
 
 ssize_t ib_uverbs_destroy_cq(struct ib_uverbs_file *file,
+			     struct ib_device *ib_dev,
 			     const char __user *buf, int in_len,
 			     int out_len)
 {
@@ -1713,6 +1731,7 @@ ssize_t ib_uverbs_destroy_cq(struct ib_uverbs_file *file,
 }
 
 ssize_t ib_uverbs_create_qp(struct ib_uverbs_file *file,
+			    struct ib_device *ib_dev,
 			    const char __user *buf, int in_len,
 			    int out_len)
 {
@@ -1908,6 +1927,7 @@ err_put:
 }
 
 ssize_t ib_uverbs_open_qp(struct ib_uverbs_file *file,
+			  struct ib_device *ib_dev,
 			  const char __user *buf, int in_len, int out_len)
 {
 	struct ib_uverbs_open_qp        cmd;
@@ -2002,6 +2022,7 @@ err_put:
 }
 
 ssize_t ib_uverbs_query_qp(struct ib_uverbs_file *file,
+			   struct ib_device *ib_dev,
 			   const char __user *buf, int in_len,
 			   int out_len)
 {
@@ -2116,6 +2137,7 @@ static int modify_qp_mask(enum ib_qp_type qp_type, int mask)
 }
 
 ssize_t ib_uverbs_modify_qp(struct ib_uverbs_file *file,
+			    struct ib_device *ib_dev,
 			    const char __user *buf, int in_len,
 			    int out_len)
 {
@@ -2212,6 +2234,7 @@ out:
 }
 
 ssize_t ib_uverbs_destroy_qp(struct ib_uverbs_file *file,
+			     struct ib_device *ib_dev,
 			     const char __user *buf, int in_len,
 			     int out_len)
 {
@@ -2270,6 +2293,7 @@ ssize_t ib_uverbs_destroy_qp(struct ib_uverbs_file *file,
 }
 
 ssize_t ib_uverbs_post_send(struct ib_uverbs_file *file,
+			    struct ib_device *ib_dev,
 			    const char __user *buf, int in_len,
 			    int out_len)
 {
@@ -2514,6 +2538,7 @@ err:
 }
 
 ssize_t ib_uverbs_post_recv(struct ib_uverbs_file *file,
+			    struct ib_device *ib_dev,
 			    const char __user *buf, int in_len,
 			    int out_len)
 {
@@ -2563,6 +2588,7 @@ out:
 }
 
 ssize_t ib_uverbs_post_srq_recv(struct ib_uverbs_file *file,
+				struct ib_device *ib_dev,
 				const char __user *buf, int in_len,
 				int out_len)
 {
@@ -2612,6 +2638,7 @@ out:
 }
 
 ssize_t ib_uverbs_create_ah(struct ib_uverbs_file *file,
+			    struct ib_device *ib_dev,
 			    const char __user *buf, int in_len,
 			    int out_len)
 {
@@ -2704,6 +2731,7 @@ err:
 }
 
 ssize_t ib_uverbs_destroy_ah(struct ib_uverbs_file *file,
+			     struct ib_device *ib_dev,
 			     const char __user *buf, int in_len, int out_len)
 {
 	struct ib_uverbs_destroy_ah cmd;
@@ -2740,6 +2768,7 @@ ssize_t ib_uverbs_destroy_ah(struct ib_uverbs_file *file,
 }
 
 ssize_t ib_uverbs_attach_mcast(struct ib_uverbs_file *file,
+			       struct ib_device *ib_dev,
 			       const char __user *buf, int in_len,
 			       int out_len)
 {
@@ -2787,6 +2816,7 @@ out_put:
 }
 
 ssize_t ib_uverbs_detach_mcast(struct ib_uverbs_file *file,
+			       struct ib_device *ib_dev,
 			       const char __user *buf, int in_len,
 			       int out_len)
 {
@@ -2867,6 +2897,7 @@ static int kern_spec_to_ib_spec(struct ib_uverbs_flow_spec *kern_spec,
 }
 
 int ib_uverbs_ex_create_flow(struct ib_uverbs_file *file,
+			     struct ib_device *ib_dev,
 			     struct ib_udata *ucore,
 			     struct ib_udata *uhw)
 {
@@ -3027,6 +3058,7 @@ err_free_attr:
 }
 
 int ib_uverbs_ex_destroy_flow(struct ib_uverbs_file *file,
+			      struct ib_device *ib_dev,
 			      struct ib_udata *ucore,
 			      struct ib_udata *uhw)
 {
@@ -3069,6 +3101,7 @@ int ib_uverbs_ex_destroy_flow(struct ib_uverbs_file *file,
 }
 
 static int __uverbs_create_xsrq(struct ib_uverbs_file *file,
+				struct ib_device *ib_dev,
 				struct ib_uverbs_create_xsrq *cmd,
 				struct ib_udata *udata)
 {
@@ -3202,6 +3235,7 @@ err:
 }
 
 ssize_t ib_uverbs_create_srq(struct ib_uverbs_file *file,
+			     struct ib_device *ib_dev,
 			     const char __user *buf, int in_len,
 			     int out_len)
 {
@@ -3229,7 +3263,7 @@ ssize_t ib_uverbs_create_srq(struct ib_uverbs_file *file,
 		   (unsigned long) cmd.response + sizeof resp,
 		   in_len - sizeof cmd, out_len - sizeof resp);
 
-	ret = __uverbs_create_xsrq(file, &xcmd, &udata);
+	ret = __uverbs_create_xsrq(file, ib_dev, &xcmd, &udata);
 	if (ret)
 		return ret;
 
@@ -3237,6 +3271,7 @@ ssize_t ib_uverbs_create_srq(struct ib_uverbs_file *file,
 }
 
 ssize_t ib_uverbs_create_xsrq(struct ib_uverbs_file *file,
+			      struct ib_device *ib_dev,
 			      const char __user *buf, int in_len, int out_len)
 {
 	struct ib_uverbs_create_xsrq     cmd;
@@ -3254,7 +3289,7 @@ ssize_t ib_uverbs_create_xsrq(struct ib_uverbs_file *file,
 		   (unsigned long) cmd.response + sizeof resp,
 		   in_len - sizeof cmd, out_len - sizeof resp);
 
-	ret = __uverbs_create_xsrq(file, &cmd, &udata);
+	ret = __uverbs_create_xsrq(file, ib_dev, &cmd, &udata);
 	if (ret)
 		return ret;
 
@@ -3262,6 +3297,7 @@ ssize_t ib_uverbs_create_xsrq(struct ib_uverbs_file *file,
 }
 
 ssize_t ib_uverbs_modify_srq(struct ib_uverbs_file *file,
+			     struct ib_device *ib_dev,
 			     const char __user *buf, int in_len,
 			     int out_len)
 {
@@ -3292,6 +3328,7 @@ ssize_t ib_uverbs_modify_srq(struct ib_uverbs_file *file,
 }
 
 ssize_t ib_uverbs_query_srq(struct ib_uverbs_file *file,
+			    struct ib_device *ib_dev,
 			    const char __user *buf,
 			    int in_len, int out_len)
 {
@@ -3332,6 +3369,7 @@ ssize_t ib_uverbs_query_srq(struct ib_uverbs_file *file,
 }
 
 ssize_t ib_uverbs_destroy_srq(struct ib_uverbs_file *file,
+			      struct ib_device *ib_dev,
 			      const char __user *buf, int in_len,
 			      int out_len)
 {
@@ -3389,16 +3427,15 @@ ssize_t ib_uverbs_destroy_srq(struct ib_uverbs_file *file,
 }
 
 int ib_uverbs_ex_query_device(struct ib_uverbs_file *file,
+			      struct ib_device *ib_dev,
 			      struct ib_udata *ucore,
 			      struct ib_udata *uhw)
 {
 	struct ib_uverbs_ex_query_device_resp resp;
 	struct ib_uverbs_ex_query_device  cmd;
 	struct ib_device_attr attr;
-	struct ib_device *device;
 	int err;
 
-	device = file->device->ib_dev;
 	if (ucore->inlen < sizeof(cmd))
 		return -EINVAL;
 
@@ -3419,11 +3456,11 @@ int ib_uverbs_ex_query_device(struct ib_uverbs_file *file,
 
 	memset(&attr, 0, sizeof(attr));
 
-	err = device->query_device(device, &attr, uhw);
+	err = ib_dev->query_device(ib_dev, &attr, uhw);
 	if (err)
 		return err;
 
-	copy_query_dev_fields(file, &resp.base, &attr);
+	copy_query_dev_fields(file, ib_dev, &resp.base, &attr);
 	resp.comp_mask = 0;
 
 	if (ucore->outlen < resp.response_length + sizeof(resp.odp_caps))
diff --git a/drivers/infiniband/core/uverbs_main.c b/drivers/infiniband/core/uverbs_main.c
index ed54e4e..6dea148 100644
--- a/drivers/infiniband/core/uverbs_main.c
+++ b/drivers/infiniband/core/uverbs_main.c
@@ -79,6 +79,7 @@ static DEFINE_SPINLOCK(map_lock);
 static DECLARE_BITMAP(dev_map, IB_UVERBS_MAX_DEVICES);
 
 static ssize_t (*uverbs_cmd_table[])(struct ib_uverbs_file *file,
+				     struct ib_device *ib_dev,
 				     const char __user *buf, int in_len,
 				     int out_len) = {
 	[IB_USER_VERBS_CMD_GET_CONTEXT]		= ib_uverbs_get_context,
@@ -119,6 +120,7 @@ static ssize_t (*uverbs_cmd_table[])(struct ib_uverbs_file *file,
 };
 
 static int (*uverbs_ex_cmd_table[])(struct ib_uverbs_file *file,
+				    struct ib_device *ib_dev,
 				    struct ib_udata *ucore,
 				    struct ib_udata *uhw) = {
 	[IB_USER_VERBS_EX_CMD_CREATE_FLOW]	= ib_uverbs_ex_create_flow,
@@ -547,6 +549,7 @@ void ib_uverbs_free_async_event_file(struct ib_uverbs_file *file)
 }
 
 struct file *ib_uverbs_alloc_event_file(struct ib_uverbs_file *uverbs_file,
+					struct ib_device	*ib_dev,
 					int is_async)
 {
 	struct ib_uverbs_event_file *ev_file;
@@ -576,7 +579,7 @@ struct file *ib_uverbs_alloc_event_file(struct ib_uverbs_file *uverbs_file,
 		uverbs_file->async_file = ev_file;
 		kref_get(&uverbs_file->async_file->ref);
 		INIT_IB_EVENT_HANDLER(&uverbs_file->event_handler,
-				      uverbs_file->device->ib_dev,
+				      ib_dev,
 				      ib_uverbs_event_handler);
 		ret = ib_register_event_handler(&uverbs_file->event_handler);
 		if (ret)
@@ -630,9 +633,13 @@ static ssize_t ib_uverbs_write(struct file *filp, const char __user *buf,
 			     size_t count, loff_t *pos)
 {
 	struct ib_uverbs_file *file = filp->private_data;
+	struct ib_device *ib_dev = file->device->ib_dev;
 	struct ib_uverbs_cmd_hdr hdr;
 	__u32 flags;
 
+	if (!ib_dev)
+		return -ENODEV;
+
 	if (count < sizeof hdr)
 		return -EINVAL;
 
@@ -659,13 +666,13 @@ static ssize_t ib_uverbs_write(struct file *filp, const char __user *buf,
 		    command != IB_USER_VERBS_CMD_GET_CONTEXT)
 			return -EINVAL;
 
-		if (!(file->device->ib_dev->uverbs_cmd_mask & (1ull << command)))
+		if (!(ib_dev->uverbs_cmd_mask & (1ull << command)))
 			return -ENOSYS;
 
 		if (hdr.in_words * 4 != count)
 			return -EINVAL;
 
-		return uverbs_cmd_table[command](file,
+		return uverbs_cmd_table[command](file, ib_dev,
 						 buf + sizeof(hdr),
 						 hdr.in_words * 4,
 						 hdr.out_words * 4);
@@ -692,7 +699,7 @@ static ssize_t ib_uverbs_write(struct file *filp, const char __user *buf,
 		if (!file->ucontext)
 			return -EINVAL;
 
-		if (!(file->device->ib_dev->uverbs_ex_cmd_mask & (1ull << command)))
+		if (!(ib_dev->uverbs_ex_cmd_mask & (1ull << command)))
 			return -ENOSYS;
 
 		if (count < (sizeof(hdr) + sizeof(ex_hdr)))
@@ -733,6 +740,7 @@ static ssize_t ib_uverbs_write(struct file *filp, const char __user *buf,
 				       ex_hdr.provider_out_words * 8);
 
 		err = uverbs_ex_cmd_table[command](file,
+						   ib_dev,
 						   &ucore,
 						   &uhw);
 
@@ -748,11 +756,12 @@ static ssize_t ib_uverbs_write(struct file *filp, const char __user *buf,
 static int ib_uverbs_mmap(struct file *filp, struct vm_area_struct *vma)
 {
 	struct ib_uverbs_file *file = filp->private_data;
+	struct ib_device *ib_dev = file->device->ib_dev;
 
-	if (!file->ucontext)
+	if (!ib_dev || !file->ucontext)
 		return -ENODEV;
 	else
-		return file->device->ib_dev->mmap(file->ucontext, vma);
+		return ib_dev->mmap(file->ucontext, vma);
 }
 
 /*
-- 
1.7.1

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH for-next V5 3/5] IB/uverbs: Enable device removal when there are active user space applications
       [not found] ` <1434984438-21733-1-git-send-email-yishaih-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  2015-06-22 14:47   ` [PATCH for-next V5 1/5] IB/uverbs: Fix reference counting usage of event files Yishai Hadas
  2015-06-22 14:47   ` [PATCH for-next V5 2/5] IB/uverbs: Explicitly pass ib_dev to uverbs commands Yishai Hadas
@ 2015-06-22 14:47   ` Yishai Hadas
       [not found]     ` <1434984438-21733-4-git-send-email-yishaih-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  2015-06-22 14:47   ` [PATCH for-next V5 4/5] IB/mlx4_ib: Disassociate support Yishai Hadas
  2015-06-22 14:47   ` [PATCH for-next V5 5/5] IB/ucma: HW Device hot-removal support Yishai Hadas
  4 siblings, 1 reply; 17+ messages in thread
From: Yishai Hadas @ 2015-06-22 14:47 UTC (permalink / raw
  To: dledford-H+wXaHxf7aLQT0dZR+AlfA
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, yishaih-VPRAkNaXOzVWk0Htik3J/w,
	raindel-VPRAkNaXOzVWk0Htik3J/w, jackm-VPRAkNaXOzVWk0Htik3J/w,
	jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/

Enables the uverbs_remove_one to succeed despite the fact that there are
running IB applications working with the given ib device.  This functionality
enables a HW device to be unbind/reset despite the fact that there are running
user space applications using it.

It exposes a new IB kernel API named 'disassociate_ucontext' which lets a
driver detaching its HW resources from a given user context without
crashing/terminating the application. In case a driver implemented the above
API and registered with ib_uverb there will be no dependency between its device
to its uverbs_device. Upon calling remove_one of ib_uverbs the call should
return after disassociating the open HW resources without waiting to clients
disconnecting. In case driver didn't implement this API there will be no change
to current behaviour and uverbs_remove_one will return only when last client
has disconnected and reference count on uverbs device became 0.

In case the lower driver device was removed any application will continue
working over some zombie HCA, further calls will ended with an immediate error.

Signed-off-by: Yishai Hadas <yishaih-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Signed-off-by: Shachar Raindel <raindel-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
 drivers/infiniband/core/uverbs.h      |    9 +-
 drivers/infiniband/core/uverbs_main.c |  370 ++++++++++++++++++++++++++------
 include/rdma/ib_verbs.h               |    1 +
 3 files changed, 309 insertions(+), 71 deletions(-)

diff --git a/drivers/infiniband/core/uverbs.h b/drivers/infiniband/core/uverbs.h
index dc002a1..06c80ba 100644
--- a/drivers/infiniband/core/uverbs.h
+++ b/drivers/infiniband/core/uverbs.h
@@ -89,11 +89,15 @@ struct ib_uverbs_device {
 	int					num_comp_vectors;
 	struct completion			comp;
 	struct device			       *dev;
-	struct ib_device		       *ib_dev;
+	struct ib_device	__rcu	       *ib_dev;
 	int					devnum;
 	struct cdev			        cdev;
 	struct rb_root				xrcd_tree;
 	struct mutex				xrcd_tree_mutex;
+	struct srcu_struct			disassociate_srcu;
+	struct mutex				lists_mutex; /* protect lists */
+	struct list_head			uverbs_file_list;
+	struct list_head			uverbs_events_file_list;
 };
 
 struct ib_uverbs_event_file {
@@ -105,6 +109,7 @@ struct ib_uverbs_event_file {
 	wait_queue_head_t			poll_wait;
 	struct fasync_struct		       *async_queue;
 	struct list_head			event_list;
+	struct list_head			list;
 };
 
 struct ib_uverbs_file {
@@ -114,6 +119,8 @@ struct ib_uverbs_file {
 	struct ib_ucontext		       *ucontext;
 	struct ib_event_handler			event_handler;
 	struct ib_uverbs_event_file	       *async_file;
+	struct list_head			list;
+	int					is_closed;
 };
 
 struct ib_uverbs_event {
diff --git a/drivers/infiniband/core/uverbs_main.c b/drivers/infiniband/core/uverbs_main.c
index 6dea148..02a9b82 100644
--- a/drivers/infiniband/core/uverbs_main.c
+++ b/drivers/infiniband/core/uverbs_main.c
@@ -137,7 +137,12 @@ static void ib_uverbs_release_dev(struct kref *ref)
 	struct ib_uverbs_device *dev =
 		container_of(ref, struct ib_uverbs_device, ref);
 
-	complete(&dev->comp);
+	if (!dev->ib_dev) {
+		cleanup_srcu_struct(&dev->disassociate_srcu);
+		kfree(dev);
+	} else {
+		complete(&dev->comp);
+	}
 }
 
 static void ib_uverbs_release_event_file(struct kref *ref)
@@ -203,9 +208,6 @@ static int ib_uverbs_cleanup_ucontext(struct ib_uverbs_file *file,
 {
 	struct ib_uobject *uobj, *tmp;
 
-	if (!context)
-		return 0;
-
 	context->closing = 1;
 
 	list_for_each_entry_safe(uobj, tmp, &context->ah_list, list) {
@@ -309,8 +311,15 @@ static void ib_uverbs_release_file(struct kref *ref)
 {
 	struct ib_uverbs_file *file =
 		container_of(ref, struct ib_uverbs_file, ref);
-
-	module_put(file->device->ib_dev->owner);
+	struct ib_device *ib_dev;
+	int srcu_key;
+
+	srcu_key = srcu_read_lock(&file->device->disassociate_srcu);
+	ib_dev = srcu_dereference(file->device->ib_dev,
+				  &file->device->disassociate_srcu);
+	if (ib_dev && !ib_dev->disassociate_ucontext)
+		module_put(ib_dev->owner);
+	srcu_read_unlock(&file->device->disassociate_srcu, srcu_key);
 	kref_put(&file->device->ref, ib_uverbs_release_dev);
 
 	kfree(file);
@@ -333,9 +342,18 @@ static ssize_t ib_uverbs_event_read(struct file *filp, char __user *buf,
 			return -EAGAIN;
 
 		if (wait_event_interruptible(file->poll_wait,
-					     !list_empty(&file->event_list)))
+					     (!list_empty(&file->event_list) ||
+			/* The barriers built into wait_event_interruptible()
+			 * and wake_up() make this ib_dev check RCU protected
+			 */
+					     !file->uverbs_file->device->ib_dev)))
 			return -ERESTARTSYS;
 
+		/* If device was disassociated and no event exists set an error */
+		if (list_empty(&file->event_list) &&
+		    !file->uverbs_file->device->ib_dev)
+			return -EIO;
+
 		spin_lock_irq(&file->lock);
 	}
 
@@ -398,8 +416,11 @@ static int ib_uverbs_event_close(struct inode *inode, struct file *filp)
 {
 	struct ib_uverbs_event_file *file = filp->private_data;
 	struct ib_uverbs_event *entry, *tmp;
+	int closed_already = 0;
 
+	mutex_lock(&file->uverbs_file->device->lists_mutex);
 	spin_lock_irq(&file->lock);
+	closed_already = file->is_closed;
 	file->is_closed = 1;
 	list_for_each_entry_safe(entry, tmp, &file->event_list, list) {
 		if (entry->counter)
@@ -407,9 +428,14 @@ static int ib_uverbs_event_close(struct inode *inode, struct file *filp)
 		kfree(entry);
 	}
 	spin_unlock_irq(&file->lock);
+	if (!closed_already) {
+		list_del(&file->list);
+		if (file->is_async)
+			ib_unregister_event_handler(&file->uverbs_file->
+				event_handler);
+	}
+	mutex_unlock(&file->uverbs_file->device->lists_mutex);
 
-	if (file->is_async)
-		ib_unregister_event_handler(&file->uverbs_file->event_handler);
 	kref_put(&file->uverbs_file->ref, ib_uverbs_release_file);
 	kref_put(&file->ref, ib_uverbs_release_event_file);
 
@@ -574,6 +600,11 @@ struct file *ib_uverbs_alloc_event_file(struct ib_uverbs_file *uverbs_file,
 		goto err;
 
 	kref_get(&uverbs_file->ref);
+	mutex_lock(&uverbs_file->device->lists_mutex);
+	list_add_tail(&ev_file->list,
+		      &uverbs_file->device->uverbs_events_file_list);
+	mutex_unlock(&uverbs_file->device->lists_mutex);
+
 	if (is_async) {
 		WARN_ON(uverbs_file->async_file);
 		uverbs_file->async_file = ev_file;
@@ -633,12 +664,11 @@ static ssize_t ib_uverbs_write(struct file *filp, const char __user *buf,
 			     size_t count, loff_t *pos)
 {
 	struct ib_uverbs_file *file = filp->private_data;
-	struct ib_device *ib_dev = file->device->ib_dev;
+	struct ib_device *ib_dev;
 	struct ib_uverbs_cmd_hdr hdr;
 	__u32 flags;
-
-	if (!ib_dev)
-		return -ENODEV;
+	int srcu_key;
+	ssize_t ret;
 
 	if (count < sizeof hdr)
 		return -EINVAL;
@@ -646,6 +676,14 @@ static ssize_t ib_uverbs_write(struct file *filp, const char __user *buf,
 	if (copy_from_user(&hdr, buf, sizeof hdr))
 		return -EFAULT;
 
+	srcu_key = srcu_read_lock(&file->device->disassociate_srcu);
+	ib_dev = srcu_dereference(file->device->ib_dev,
+				  &file->device->disassociate_srcu);
+	if (!ib_dev) {
+		ret = -EIO;
+		goto out;
+	}
+
 	flags = (hdr.command &
 		 IB_USER_VERBS_CMD_FLAGS_MASK) >> IB_USER_VERBS_CMD_FLAGS_SHIFT;
 
@@ -653,26 +691,36 @@ static ssize_t ib_uverbs_write(struct file *filp, const char __user *buf,
 		__u32 command;
 
 		if (hdr.command & ~(__u32)(IB_USER_VERBS_CMD_FLAGS_MASK |
-					   IB_USER_VERBS_CMD_COMMAND_MASK))
-			return -EINVAL;
+					   IB_USER_VERBS_CMD_COMMAND_MASK)) {
+			ret = -EINVAL;
+			goto out;
+		}
 
 		command = hdr.command & IB_USER_VERBS_CMD_COMMAND_MASK;
 
 		if (command >= ARRAY_SIZE(uverbs_cmd_table) ||
-		    !uverbs_cmd_table[command])
-			return -EINVAL;
+		    !uverbs_cmd_table[command]) {
+			ret = -EINVAL;
+			goto out;
+		}
 
 		if (!file->ucontext &&
-		    command != IB_USER_VERBS_CMD_GET_CONTEXT)
-			return -EINVAL;
+		    command != IB_USER_VERBS_CMD_GET_CONTEXT) {
+			ret = -EINVAL;
+			goto out;
+		}
 
-		if (!(ib_dev->uverbs_cmd_mask & (1ull << command)))
-			return -ENOSYS;
+		if (!(ib_dev->uverbs_cmd_mask & (1ull << command))) {
+			ret = -ENOSYS;
+			goto out;
+		}
 
-		if (hdr.in_words * 4 != count)
-			return -EINVAL;
+		if (hdr.in_words * 4 != count) {
+			ret = -EINVAL;
+			goto out;
+		}
 
-		return uverbs_cmd_table[command](file, ib_dev,
+		ret = uverbs_cmd_table[command](file, ib_dev,
 						 buf + sizeof(hdr),
 						 hdr.in_words * 4,
 						 hdr.out_words * 4);
@@ -683,51 +731,72 @@ static ssize_t ib_uverbs_write(struct file *filp, const char __user *buf,
 		struct ib_uverbs_ex_cmd_hdr ex_hdr;
 		struct ib_udata ucore;
 		struct ib_udata uhw;
-		int err;
 		size_t written_count = count;
 
 		if (hdr.command & ~(__u32)(IB_USER_VERBS_CMD_FLAGS_MASK |
-					   IB_USER_VERBS_CMD_COMMAND_MASK))
-			return -EINVAL;
+					   IB_USER_VERBS_CMD_COMMAND_MASK)) {
+			ret = -EINVAL;
+			goto out;
+		}
 
 		command = hdr.command & IB_USER_VERBS_CMD_COMMAND_MASK;
 
 		if (command >= ARRAY_SIZE(uverbs_ex_cmd_table) ||
-		    !uverbs_ex_cmd_table[command])
-			return -ENOSYS;
+		    !uverbs_ex_cmd_table[command]) {
+			ret = -ENOSYS;
+			goto out;
+		}
 
-		if (!file->ucontext)
-			return -EINVAL;
+		if (!file->ucontext) {
+			ret = -EINVAL;
+			goto out;
+		}
 
-		if (!(ib_dev->uverbs_ex_cmd_mask & (1ull << command)))
-			return -ENOSYS;
+		if (!(ib_dev->uverbs_ex_cmd_mask & (1ull << command))) {
+			ret = -ENOSYS;
+			goto out;
+		}
 
-		if (count < (sizeof(hdr) + sizeof(ex_hdr)))
-			return -EINVAL;
+		if (count < (sizeof(hdr) + sizeof(ex_hdr))) {
+			ret = -EINVAL;
+			goto out;
+		}
 
-		if (copy_from_user(&ex_hdr, buf + sizeof(hdr), sizeof(ex_hdr)))
-			return -EFAULT;
+		if (copy_from_user(&ex_hdr, buf + sizeof(hdr), sizeof(ex_hdr))) {
+			ret = -EFAULT;
+			goto out;
+		}
 
 		count -= sizeof(hdr) + sizeof(ex_hdr);
 		buf += sizeof(hdr) + sizeof(ex_hdr);
 
-		if ((hdr.in_words + ex_hdr.provider_in_words) * 8 != count)
-			return -EINVAL;
+		if ((hdr.in_words + ex_hdr.provider_in_words) * 8 != count) {
+			ret = -EINVAL;
+			goto out;
+		}
 
-		if (ex_hdr.cmd_hdr_reserved)
-			return -EINVAL;
+		if (ex_hdr.cmd_hdr_reserved) {
+			ret = -EINVAL;
+			goto out;
+		}
 
 		if (ex_hdr.response) {
-			if (!hdr.out_words && !ex_hdr.provider_out_words)
-				return -EINVAL;
+			if (!hdr.out_words && !ex_hdr.provider_out_words) {
+				ret = -EINVAL;
+				goto out;
+			}
 
 			if (!access_ok(VERIFY_WRITE,
 				       (void __user *) (unsigned long) ex_hdr.response,
-				       (hdr.out_words + ex_hdr.provider_out_words) * 8))
-				return -EFAULT;
+				       (hdr.out_words + ex_hdr.provider_out_words) * 8)) {
+				ret = -EFAULT;
+				goto out;
+			}
 		} else {
-			if (hdr.out_words || ex_hdr.provider_out_words)
-				return -EINVAL;
+			if (hdr.out_words || ex_hdr.provider_out_words) {
+				ret = -EINVAL;
+				goto out;
+			}
 		}
 
 		INIT_UDATA_BUF_OR_NULL(&ucore, buf, (unsigned long) ex_hdr.response,
@@ -739,29 +808,43 @@ static ssize_t ib_uverbs_write(struct file *filp, const char __user *buf,
 				       ex_hdr.provider_in_words * 8,
 				       ex_hdr.provider_out_words * 8);
 
-		err = uverbs_ex_cmd_table[command](file,
+		ret = uverbs_ex_cmd_table[command](file,
 						   ib_dev,
 						   &ucore,
 						   &uhw);
-
-		if (err)
-			return err;
-
-		return written_count;
+		if (!ret)
+			ret = written_count;
+	} else {
+		ret = -ENOSYS;
 	}
 
-	return -ENOSYS;
+out:
+	srcu_read_unlock(&file->device->disassociate_srcu, srcu_key);
+	return ret;
 }
 
 static int ib_uverbs_mmap(struct file *filp, struct vm_area_struct *vma)
 {
 	struct ib_uverbs_file *file = filp->private_data;
-	struct ib_device *ib_dev = file->device->ib_dev;
+	struct ib_device *ib_dev;
+	int ret = 0;
+	int srcu_key;
 
-	if (!ib_dev || !file->ucontext)
-		return -ENODEV;
+	srcu_key = srcu_read_lock(&file->device->disassociate_srcu);
+	ib_dev = srcu_dereference(file->device->ib_dev,
+				  &file->device->disassociate_srcu);
+	if (!ib_dev) {
+		ret = -EIO;
+		goto out;
+	}
+
+	if (!file->ucontext)
+		ret = -ENODEV;
 	else
-		return ib_dev->mmap(file->ucontext, vma);
+		ret = ib_dev->mmap(file->ucontext, vma);
+out:
+	srcu_read_unlock(&file->device->disassociate_srcu, srcu_key);
+	return ret;
 }
 
 /*
@@ -778,7 +861,10 @@ static int ib_uverbs_open(struct inode *inode, struct file *filp)
 {
 	struct ib_uverbs_device *dev;
 	struct ib_uverbs_file *file;
+	struct ib_device *ib_dev;
 	int ret;
+	int module_dependent;
+	int srcu_key;
 
 	dev = container_of(inode->i_cdev, struct ib_uverbs_device, cdev);
 	if (dev)
@@ -786,15 +872,34 @@ static int ib_uverbs_open(struct inode *inode, struct file *filp)
 	else
 		return -ENXIO;
 
-	if (!try_module_get(dev->ib_dev->owner)) {
-		ret = -ENODEV;
+	srcu_key = srcu_read_lock(&dev->disassociate_srcu);
+	mutex_lock(&dev->lists_mutex);
+	ib_dev = srcu_dereference(dev->ib_dev,
+				  &dev->disassociate_srcu);
+	if (!ib_dev) {
+		ret = -EIO;
 		goto err;
 	}
 
-	file = kmalloc(sizeof *file, GFP_KERNEL);
+	/* In case IB device supports disassociate ucontext, there is no hard
+	 * dependency between uverbs device and its low level device.
+	 */
+	module_dependent = !(ib_dev->disassociate_ucontext);
+
+	if (module_dependent) {
+		if (!try_module_get(ib_dev->owner)) {
+			ret = -ENODEV;
+			goto err;
+		}
+	}
+
+	file = kzalloc(sizeof(*file), GFP_KERNEL);
 	if (!file) {
 		ret = -ENOMEM;
-		goto err_module;
+		if (module_dependent)
+			goto err_module;
+
+		goto err;
 	}
 
 	file->device	 = dev;
@@ -804,13 +909,18 @@ static int ib_uverbs_open(struct inode *inode, struct file *filp)
 	mutex_init(&file->mutex);
 
 	filp->private_data = file;
+	list_add_tail(&file->list, &dev->uverbs_file_list);
+	mutex_unlock(&dev->lists_mutex);
+	srcu_read_unlock(&dev->disassociate_srcu, srcu_key);
 
 	return nonseekable_open(inode, filp);
 
 err_module:
-	module_put(dev->ib_dev->owner);
+	module_put(ib_dev->owner);
 
 err:
+	mutex_unlock(&dev->lists_mutex);
+	srcu_read_unlock(&dev->disassociate_srcu, srcu_key);
 	kref_put(&dev->ref, ib_uverbs_release_dev);
 	return ret;
 }
@@ -818,8 +928,18 @@ err:
 static int ib_uverbs_close(struct inode *inode, struct file *filp)
 {
 	struct ib_uverbs_file *file = filp->private_data;
-
-	ib_uverbs_cleanup_ucontext(file, file->ucontext);
+	struct ib_ucontext *ucontext = NULL;
+
+	mutex_lock(&file->device->lists_mutex);
+	ucontext = file->ucontext;
+	file->ucontext = NULL;
+	if (!file->is_closed) {
+		list_del(&file->list);
+		file->is_closed = 1;
+	}
+	mutex_unlock(&file->device->lists_mutex);
+	if (ucontext)
+		ib_uverbs_cleanup_ucontext(file, ucontext);
 
 	if (file->async_file)
 		kref_put(&file->async_file->ref, ib_uverbs_release_event_file);
@@ -855,12 +975,21 @@ static struct ib_client uverbs_client = {
 static ssize_t show_ibdev(struct device *device, struct device_attribute *attr,
 			  char *buf)
 {
+	int ret = -ENODEV;
+	int srcu_key;
 	struct ib_uverbs_device *dev = dev_get_drvdata(device);
+	struct ib_device *ib_dev;
 
 	if (!dev)
 		return -ENODEV;
 
-	return sprintf(buf, "%s\n", dev->ib_dev->name);
+	srcu_key = srcu_read_lock(&dev->disassociate_srcu);
+	ib_dev = srcu_dereference(dev->ib_dev, &dev->disassociate_srcu);
+	if (ib_dev)
+		ret = sprintf(buf, "%s\n", ib_dev->name);
+	srcu_read_unlock(&dev->disassociate_srcu, srcu_key);
+
+	return ret;
 }
 static DEVICE_ATTR(ibdev, S_IRUGO, show_ibdev, NULL);
 
@@ -868,11 +997,19 @@ static ssize_t show_dev_abi_version(struct device *device,
 				    struct device_attribute *attr, char *buf)
 {
 	struct ib_uverbs_device *dev = dev_get_drvdata(device);
+	int ret = -ENODEV;
+	int srcu_key;
+	struct ib_device *ib_dev;
 
 	if (!dev)
 		return -ENODEV;
+	srcu_key = srcu_read_lock(&dev->disassociate_srcu);
+	ib_dev = srcu_dereference(dev->ib_dev, &dev->disassociate_srcu);
+	if (ib_dev)
+		ret = sprintf(buf, "%d\n", ib_dev->uverbs_abi_ver);
+	srcu_read_unlock(&dev->disassociate_srcu, srcu_key);
 
-	return sprintf(buf, "%d\n", dev->ib_dev->uverbs_abi_ver);
+	return ret;
 }
 static DEVICE_ATTR(abi_version, S_IRUGO, show_dev_abi_version, NULL);
 
@@ -912,6 +1049,7 @@ static void ib_uverbs_add_one(struct ib_device *device)
 	int devnum;
 	dev_t base;
 	struct ib_uverbs_device *uverbs_dev;
+	int ret;
 
 	if (!device->alloc_ucontext)
 		return;
@@ -924,6 +1062,13 @@ static void ib_uverbs_add_one(struct ib_device *device)
 	init_completion(&uverbs_dev->comp);
 	uverbs_dev->xrcd_tree = RB_ROOT;
 	mutex_init(&uverbs_dev->xrcd_tree_mutex);
+	mutex_init(&uverbs_dev->lists_mutex);
+	ret = init_srcu_struct(&uverbs_dev->disassociate_srcu);
+	if (ret)
+		goto err_init;
+
+	INIT_LIST_HEAD(&uverbs_dev->uverbs_file_list);
+	INIT_LIST_HEAD(&uverbs_dev->uverbs_events_file_list);
 
 	spin_lock(&map_lock);
 	devnum = find_first_zero_bit(dev_map, IB_UVERBS_MAX_DEVICES);
@@ -944,7 +1089,7 @@ static void ib_uverbs_add_one(struct ib_device *device)
 	}
 	spin_unlock(&map_lock);
 
-	uverbs_dev->ib_dev           = device;
+	rcu_assign_pointer(uverbs_dev->ib_dev, device);
 	uverbs_dev->num_comp_vectors = device->num_comp_vectors;
 
 	cdev_init(&uverbs_dev->cdev, NULL);
@@ -980,15 +1125,81 @@ err_cdev:
 		clear_bit(devnum, overflow_map);
 
 err:
+	cleanup_srcu_struct(&uverbs_dev->disassociate_srcu);
+
+err_init:
 	kref_put(&uverbs_dev->ref, ib_uverbs_release_dev);
 	wait_for_completion(&uverbs_dev->comp);
 	kfree(uverbs_dev);
 	return;
 }
 
+static void ib_uverbs_free_hw_resources(struct ib_uverbs_device *uverbs_dev,
+					struct ib_device *ib_dev)
+{
+	struct ib_uverbs_file *file;
+	struct ib_uverbs_event_file *event_file;
+	struct ib_event event;
+
+	/* Pending running commands to terminate */
+	synchronize_srcu(&uverbs_dev->disassociate_srcu);
+	event.event = IB_EVENT_DEVICE_FATAL;
+	event.element.port_num = 0;
+	event.device = ib_dev;
+
+	mutex_lock(&uverbs_dev->lists_mutex);
+	while (!list_empty(&uverbs_dev->uverbs_file_list)) {
+		struct ib_ucontext *ucontext;
+
+		file = list_first_entry(&uverbs_dev->uverbs_file_list,
+					struct ib_uverbs_file, list);
+		file->is_closed = 1;
+		ucontext = file->ucontext;
+		list_del(&file->list);
+		file->ucontext = NULL;
+		kref_get(&file->ref);
+		mutex_unlock(&uverbs_dev->lists_mutex);
+		/* We must release the mutex before going ahead and calling
+		 * disassociate_ucontext. disassociate_ucontext might end up
+		 * indirectly calling uverbs_close, for example due to freeing
+		 * the resources (e.g mmput).
+		 */
+		ib_uverbs_event_handler(&file->event_handler, &event);
+		if (ucontext) {
+			ib_dev->disassociate_ucontext(ucontext);
+			ib_uverbs_cleanup_ucontext(file, ucontext);
+		}
+
+		mutex_lock(&uverbs_dev->lists_mutex);
+		kref_put(&file->ref, ib_uverbs_release_file);
+	}
+
+	while (!list_empty(&uverbs_dev->uverbs_events_file_list)) {
+		event_file = list_first_entry(&uverbs_dev->
+					      uverbs_events_file_list,
+					      struct ib_uverbs_event_file,
+					      list);
+		spin_lock_irq(&event_file->lock);
+		event_file->is_closed = 1;
+		spin_unlock_irq(&event_file->lock);
+
+		list_del(&event_file->list);
+		if (event_file->is_async) {
+			ib_unregister_event_handler(&event_file->uverbs_file->
+						    event_handler);
+			event_file->uverbs_file->event_handler.device = NULL;
+		}
+
+		wake_up_interruptible(&event_file->poll_wait);
+		kill_fasync(&event_file->async_queue, SIGIO, POLL_IN);
+	}
+	mutex_unlock(&uverbs_dev->lists_mutex);
+}
+
 static void ib_uverbs_remove_one(struct ib_device *device)
 {
 	struct ib_uverbs_device *uverbs_dev = ib_get_client_data(device, &uverbs_client);
+	int wait_clients = 1;
 
 	if (!uverbs_dev)
 		return;
@@ -1002,9 +1213,28 @@ static void ib_uverbs_remove_one(struct ib_device *device)
 	else
 		clear_bit(uverbs_dev->devnum - IB_UVERBS_MAX_DEVICES, overflow_map);
 
+	if (device->disassociate_ucontext) {
+		/* We disassociate HW resources and immediately returning, not
+		 * pending to active userspace clients. Upon returning ib_device
+		 * may be freed internally and is not valid any more.
+		 * uverbs_device is still available, when all clients close
+		 * their files, the uverbs device ref count will be zero and its
+		 * resources will be freed.
+		 * Note: At that step no more files can be opened on that cdev
+		 * as it was deleted, however active clients can still issue
+		 * commands and close their open files.
+		 */
+		rcu_assign_pointer(uverbs_dev->ib_dev, NULL);
+		ib_uverbs_free_hw_resources(uverbs_dev, device);
+		wait_clients = 0;
+	}
+	/* ref count taken as part of add one is put back in both modes. */
 	kref_put(&uverbs_dev->ref, ib_uverbs_release_dev);
-	wait_for_completion(&uverbs_dev->comp);
-	kfree(uverbs_dev);
+	if (wait_clients) {
+		wait_for_completion(&uverbs_dev->comp);
+		cleanup_srcu_struct(&uverbs_dev->disassociate_srcu);
+		kfree(uverbs_dev);
+	}
 }
 
 static char *uverbs_devnode(struct device *dev, umode_t *mode)
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 986fddb..e30552f 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -1724,6 +1724,7 @@ struct ib_device {
 	int			   (*destroy_flow)(struct ib_flow *flow_id);
 	int			   (*check_mr_status)(struct ib_mr *mr, u32 check_mask,
 						      struct ib_mr_status *mr_status);
+	void			   (*disassociate_ucontext)(struct ib_ucontext *ibcontext);
 
 	struct ib_dma_mapping_ops   *dma_ops;
 
-- 
1.7.1

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH for-next V5 4/5] IB/mlx4_ib: Disassociate support
       [not found] ` <1434984438-21733-1-git-send-email-yishaih-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
                     ` (2 preceding siblings ...)
  2015-06-22 14:47   ` [PATCH for-next V5 3/5] IB/uverbs: Enable device removal when there are active user space applications Yishai Hadas
@ 2015-06-22 14:47   ` Yishai Hadas
  2015-06-22 14:47   ` [PATCH for-next V5 5/5] IB/ucma: HW Device hot-removal support Yishai Hadas
  4 siblings, 0 replies; 17+ messages in thread
From: Yishai Hadas @ 2015-06-22 14:47 UTC (permalink / raw
  To: dledford-H+wXaHxf7aLQT0dZR+AlfA
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, yishaih-VPRAkNaXOzVWk0Htik3J/w,
	raindel-VPRAkNaXOzVWk0Htik3J/w, jackm-VPRAkNaXOzVWk0Htik3J/w,
	jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/

Implements the IB core disassociate_ucontext API. The driver detaches the HW
resources for a given user context to prevent a dependency between application
termination and device disconnecting. This is done by managing the VMAs that
were mapped to the HW bars such as door bell and blueflame. When need to detach
remap them to an arbitrary kernel page returned by the zap API.

Signed-off-by: Yishai Hadas <yishaih-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Signed-off-by: Jack Morgenstein <jackm-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
 drivers/infiniband/hw/mlx4/main.c    |  139 +++++++++++++++++++++++++++++++++-
 drivers/infiniband/hw/mlx4/mlx4_ib.h |   13 +++
 2 files changed, 150 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/hw/mlx4/main.c b/drivers/infiniband/hw/mlx4/main.c
index 166da78..456c60c 100644
--- a/drivers/infiniband/hw/mlx4/main.c
+++ b/drivers/infiniband/hw/mlx4/main.c
@@ -691,7 +691,7 @@ static struct ib_ucontext *mlx4_ib_alloc_ucontext(struct ib_device *ibdev,
 		resp.cqe_size	      = dev->dev->caps.cqe_size;
 	}
 
-	context = kmalloc(sizeof *context, GFP_KERNEL);
+	context = kzalloc(sizeof(*context), GFP_KERNEL);
 	if (!context)
 		return ERR_PTR(-ENOMEM);
 
@@ -728,21 +728,143 @@ static int mlx4_ib_dealloc_ucontext(struct ib_ucontext *ibcontext)
 	return 0;
 }
 
+static void  mlx4_ib_vma_open(struct vm_area_struct *area)
+{
+	/* vma_open is called when a new VMA is created on top of our VMA.
+	 * This is done through either mremap flow or split_vma (usually due
+	 * to mlock, madvise, munmap, etc.). We do not support a clone of the
+	 * vma, as this VMA is strongly hardware related. Therefore we set the
+	 * vm_ops of the newly created/cloned VMA to NULL, to prevent it from
+	 * calling us again and trying to do incorrect actions. We assume that
+	 * the original vma size is exactly a single page that there will be no
+	 * "splitting" operations on.
+	 */
+	area->vm_ops = NULL;
+}
+
+static void  mlx4_ib_vma_close(struct vm_area_struct *area)
+{
+	struct mlx4_ib_vma_private_data *mlx4_ib_vma_priv_data;
+
+	/* It's guaranteed that all VMAs opened on a FD are closed before the
+	 * file itself is closed, therefore no sync is needed with the regular
+	 * closing flow. (e.g. mlx4_ib_dealloc_ucontext) However need a sync
+	 * with accessing the vma as part of mlx4_ib_disassociate_ucontext.
+	 * The close operation is usually called under mm->mmap_sem except when
+	 * process is exiting.  The exiting case is handled explicitly as part
+	 * of mlx4_ib_disassociate_ucontext.
+	 */
+	mlx4_ib_vma_priv_data = (struct mlx4_ib_vma_private_data *)
+				area->vm_private_data;
+
+	/* set the vma context pointer to null in the mlx4_ib driver's private
+	 * data to protect against a race condition in mlx4_ib_dissassociate_ucontext().
+	 */
+	mlx4_ib_vma_priv_data->vma = NULL;
+}
+
+static const struct vm_operations_struct mlx4_ib_vm_ops = {
+	.open = mlx4_ib_vma_open,
+	.close = mlx4_ib_vma_close
+};
+
+static void mlx4_ib_disassociate_ucontext(struct ib_ucontext *ibcontext)
+{
+	int i;
+	int ret = 0;
+	struct vm_area_struct *vma;
+	struct mlx4_ib_ucontext *context = to_mucontext(ibcontext);
+	struct task_struct *owning_process  = NULL;
+	struct mm_struct   *owning_mm       = NULL;
+
+	owning_process = get_pid_task(ibcontext->tgid, PIDTYPE_PID);
+	if (!owning_process)
+		return;
+
+	owning_mm = get_task_mm(owning_process);
+	if (!owning_mm) {
+		pr_info("no mm, disassociate ucontext is pending task termination\n");
+		while (1) {
+			/* make sure that task is dead before returning, it may
+			 * prevent a rare case of module down in parallel to a
+			 * call to mlx4_ib_vma_close.
+			 */
+			put_task_struct(owning_process);
+			msleep(1);
+			owning_process = get_pid_task(ibcontext->tgid,
+						      PIDTYPE_PID);
+			if (!owning_process ||
+			    owning_process->state == TASK_DEAD) {
+				pr_info("disassociate ucontext done, task was terminated\n");
+				/* in case task was dead need to release the task struct */
+				if (owning_process)
+					put_task_struct(owning_process);
+				return;
+			}
+		}
+	}
+
+	/* need to protect from a race on closing the vma as part of
+	 * mlx4_ib_vma_close().
+	 */
+	down_read(&owning_mm->mmap_sem);
+	for (i = 0; i < HW_BAR_COUNT; i++) {
+		vma = context->hw_bar_info[i].vma;
+		if (!vma)
+			continue;
+
+		ret = zap_vma_ptes(context->hw_bar_info[i].vma,
+				   context->hw_bar_info[i].vma->vm_start,
+				   PAGE_SIZE);
+		if (ret) {
+			pr_err("Error: zap_vma_ptes failed for index=%d, ret=%d\n", i, ret);
+			BUG_ON(1);
+		}
+
+		/* context going to be destroyed, should not access ops any more */
+		context->hw_bar_info[i].vma->vm_ops = NULL;
+	}
+
+	up_read(&owning_mm->mmap_sem);
+	mmput(owning_mm);
+	put_task_struct(owning_process);
+}
+
+static void mlx4_ib_set_vma_data(struct vm_area_struct *vma,
+				 struct mlx4_ib_vma_private_data *vma_private_data)
+{
+	vma_private_data->vma = vma;
+	vma->vm_private_data = vma_private_data;
+	vma->vm_ops =  &mlx4_ib_vm_ops;
+}
+
 static int mlx4_ib_mmap(struct ib_ucontext *context, struct vm_area_struct *vma)
 {
 	struct mlx4_ib_dev *dev = to_mdev(context->device);
+	struct mlx4_ib_ucontext *mucontext = to_mucontext(context);
 
 	if (vma->vm_end - vma->vm_start != PAGE_SIZE)
 		return -EINVAL;
 
 	if (vma->vm_pgoff == 0) {
+		/* We prevent double mmaping on same context */
+		if (mucontext->hw_bar_info[HW_BAR_DB].vma)
+			return -EINVAL;
+
 		vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
 
 		if (io_remap_pfn_range(vma, vma->vm_start,
 				       to_mucontext(context)->uar.pfn,
 				       PAGE_SIZE, vma->vm_page_prot))
 			return -EAGAIN;
+
+		mlx4_ib_set_vma_data(vma, &mucontext->hw_bar_info[HW_BAR_DB]);
+
 	} else if (vma->vm_pgoff == 1 && dev->dev->caps.bf_reg_size != 0) {
+		/* We prevent double mmaping on same context */
+		if (mucontext->hw_bar_info[HW_BAR_BF].vma)
+			return -EINVAL;
+
 		vma->vm_page_prot = pgprot_writecombine(vma->vm_page_prot);
 
 		if (io_remap_pfn_range(vma, vma->vm_start,
@@ -750,9 +872,18 @@ static int mlx4_ib_mmap(struct ib_ucontext *context, struct vm_area_struct *vma)
 				       dev->dev->caps.num_uars,
 				       PAGE_SIZE, vma->vm_page_prot))
 			return -EAGAIN;
+
+		mlx4_ib_set_vma_data(vma, &mucontext->hw_bar_info[HW_BAR_BF]);
+
 	} else if (vma->vm_pgoff == 3) {
 		struct mlx4_clock_params params;
-		int ret = mlx4_get_internal_clock_params(dev->dev, &params);
+		int ret;
+
+		/* We prevent double mmaping on same context */
+		if (mucontext->hw_bar_info[HW_BAR_CLOCK].vma)
+			return -EINVAL;
+
+		ret = mlx4_get_internal_clock_params(dev->dev, &params);
 
 		if (ret)
 			return ret;
@@ -765,6 +896,9 @@ static int mlx4_ib_mmap(struct ib_ucontext *context, struct vm_area_struct *vma)
 				       >> PAGE_SHIFT,
 				       PAGE_SIZE, vma->vm_page_prot))
 			return -EAGAIN;
+
+		mlx4_ib_set_vma_data(vma,
+				     &mucontext->hw_bar_info[HW_BAR_CLOCK]);
 	} else {
 		return -EINVAL;
 	}
@@ -2322,6 +2456,7 @@ static void *mlx4_ib_add(struct mlx4_dev *dev)
 	ibdev->ib_dev.detach_mcast	= mlx4_ib_mcg_detach;
 	ibdev->ib_dev.process_mad	= mlx4_ib_process_mad;
 	ibdev->ib_dev.get_port_immutable = mlx4_port_immutable;
+	ibdev->ib_dev.disassociate_ucontext = mlx4_ib_disassociate_ucontext;
 
 	if (!mlx4_is_slave(ibdev->dev)) {
 		ibdev->ib_dev.alloc_fmr		= mlx4_ib_fmr_alloc;
diff --git a/drivers/infiniband/hw/mlx4/mlx4_ib.h b/drivers/infiniband/hw/mlx4/mlx4_ib.h
index 7933adf..4d7f6ed 100644
--- a/drivers/infiniband/hw/mlx4/mlx4_ib.h
+++ b/drivers/infiniband/hw/mlx4/mlx4_ib.h
@@ -70,11 +70,24 @@ extern int mlx4_ib_sm_guid_assign;
 
 #define MLX4_IB_UC_STEER_QPN_ALIGN 1
 #define MLX4_IB_UC_MAX_NUM_QPS     256
+
+enum hw_bar_type {
+	HW_BAR_BF,
+	HW_BAR_DB,
+	HW_BAR_CLOCK,
+	HW_BAR_COUNT
+};
+
+struct mlx4_ib_vma_private_data {
+	struct vm_area_struct *vma;
+};
+
 struct mlx4_ib_ucontext {
 	struct ib_ucontext	ibucontext;
 	struct mlx4_uar		uar;
 	struct list_head	db_page_list;
 	struct mutex		db_page_mutex;
+	struct mlx4_ib_vma_private_data hw_bar_info[HW_BAR_COUNT];
 };
 
 struct mlx4_ib_pd {
-- 
1.7.1

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH for-next V5 5/5] IB/ucma: HW Device hot-removal support
       [not found] ` <1434984438-21733-1-git-send-email-yishaih-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
                     ` (3 preceding siblings ...)
  2015-06-22 14:47   ` [PATCH for-next V5 4/5] IB/mlx4_ib: Disassociate support Yishai Hadas
@ 2015-06-22 14:47   ` Yishai Hadas
  4 siblings, 0 replies; 17+ messages in thread
From: Yishai Hadas @ 2015-06-22 14:47 UTC (permalink / raw
  To: dledford-H+wXaHxf7aLQT0dZR+AlfA
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, yishaih-VPRAkNaXOzVWk0Htik3J/w,
	raindel-VPRAkNaXOzVWk0Htik3J/w, jackm-VPRAkNaXOzVWk0Htik3J/w,
	jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/

Currently, IB/cma remove_one flow blocks until all user descriptor managed by
IB/ucma are released. This prevents hot-removal of IB devices. This patch
allows IB/cma to remove devices regardless of user space activity. Upon getting
the RDMA_CM_EVENT_DEVICE_REMOVAL event we close all the underlying HW resources
for the given ucontext. The ucontext itself is still alive till its explicit
destroying by its creator.

Running applications at that time will have some zombie device, further
operations may fail.

Signed-off-by: Yishai Hadas <yishaih-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Signed-off-by: Shachar Raindel <raindel-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
 drivers/infiniband/core/ucma.c |  130 ++++++++++++++++++++++++++++++++++++----
 1 files changed, 119 insertions(+), 11 deletions(-)

diff --git a/drivers/infiniband/core/ucma.c b/drivers/infiniband/core/ucma.c
index ad45469..e35990f 100644
--- a/drivers/infiniband/core/ucma.c
+++ b/drivers/infiniband/core/ucma.c
@@ -74,6 +74,7 @@ struct ucma_file {
 	struct list_head	ctx_list;
 	struct list_head	event_list;
 	wait_queue_head_t	poll_wait;
+	struct workqueue_struct	*close_wq;
 };
 
 struct ucma_context {
@@ -89,6 +90,9 @@ struct ucma_context {
 
 	struct list_head	list;
 	struct list_head	mc_list;
+	int			closing;
+	int			destroying;
+	struct work_struct	close_work;
 };
 
 struct ucma_multicast {
@@ -107,6 +111,7 @@ struct ucma_event {
 	struct list_head	list;
 	struct rdma_cm_id	*cm_id;
 	struct rdma_ucm_event_resp resp;
+	struct work_struct	close_work;
 };
 
 static DEFINE_MUTEX(mut);
@@ -132,8 +137,12 @@ static struct ucma_context *ucma_get_ctx(struct ucma_file *file, int id)
 
 	mutex_lock(&mut);
 	ctx = _ucma_find_context(id, file);
-	if (!IS_ERR(ctx))
-		atomic_inc(&ctx->ref);
+	if (!IS_ERR(ctx)) {
+		if (ctx->closing)
+			ctx = ERR_PTR(-EIO);
+		else
+			atomic_inc(&ctx->ref);
+	}
 	mutex_unlock(&mut);
 	return ctx;
 }
@@ -144,6 +153,34 @@ static void ucma_put_ctx(struct ucma_context *ctx)
 		complete(&ctx->comp);
 }
 
+static void ucma_close_event_id(struct work_struct *work)
+{
+	struct ucma_event *uevent_close =  container_of(work, struct ucma_event, close_work);
+
+	rdma_destroy_id(uevent_close->cm_id);
+	kfree(uevent_close);
+}
+
+static void ucma_close_id(struct work_struct *work)
+{
+	struct ucma_context *ctx =  container_of(work, struct ucma_context, close_work);
+
+	/* Fence to ensure that ctx->closing was seen by all
+	 * ucma_get_ctx running
+	 */
+	mutex_lock(&mut);
+	mutex_unlock(&mut);
+
+	/* once all inflight tasks are finished, we close all underlying
+	 * resources. The context is still alive till its explicit destryoing
+	 * by its creator.
+	 */
+	ucma_put_ctx(ctx);
+	wait_for_completion(&ctx->comp);
+	/* No new events will be generated after destroying the id. */
+	rdma_destroy_id(ctx->cm_id);
+}
+
 static struct ucma_context *ucma_alloc_ctx(struct ucma_file *file)
 {
 	struct ucma_context *ctx;
@@ -152,6 +189,7 @@ static struct ucma_context *ucma_alloc_ctx(struct ucma_file *file)
 	if (!ctx)
 		return NULL;
 
+	INIT_WORK(&ctx->close_work, ucma_close_id);
 	atomic_set(&ctx->ref, 1);
 	init_completion(&ctx->comp);
 	INIT_LIST_HEAD(&ctx->mc_list);
@@ -242,6 +280,42 @@ static void ucma_set_event_context(struct ucma_context *ctx,
 	}
 }
 
+/* Called with file->mut locked for the relevant context. */
+static void ucma_removal_event_handler(struct rdma_cm_id *cm_id)
+{
+	struct ucma_context *ctx = cm_id->context;
+	struct ucma_event *con_req_eve;
+	int event_found = 0;
+
+	if (ctx->destroying)
+		return;
+
+	/* only if context is pointing to cm_id that it owns it and can be
+	 * queued to be closed, otherwise that cm_id is an inflight one that
+	 * is part of that context event list pending to be detached and
+	 * reattached to its new context as part of ucma_get_event,
+	 * handled separately below.
+	 */
+	if (ctx->cm_id == cm_id) {
+		ctx->closing = 1;
+		queue_work(ctx->file->close_wq, &ctx->close_work);
+		return;
+	}
+
+	list_for_each_entry(con_req_eve, &ctx->file->event_list, list) {
+		if (con_req_eve->cm_id == cm_id &&
+		    con_req_eve->resp.event == RDMA_CM_EVENT_CONNECT_REQUEST) {
+			list_del(&con_req_eve->list);
+			INIT_WORK(&con_req_eve->close_work, ucma_close_event_id);
+			queue_work(ctx->file->close_wq, &con_req_eve->close_work);
+			event_found = 1;
+			break;
+		}
+	}
+	if (!event_found)
+		printk(KERN_ERR "ucma_removal_event_handler: warning: connect request event wasn't found\n");
+}
+
 static int ucma_event_handler(struct rdma_cm_id *cm_id,
 			      struct rdma_cm_event *event)
 {
@@ -276,14 +350,21 @@ static int ucma_event_handler(struct rdma_cm_id *cm_id,
 		 * We ignore events for new connections until userspace has set
 		 * their context.  This can only happen if an error occurs on a
 		 * new connection before the user accepts it.  This is okay,
-		 * since the accept will just fail later.
+		 * since the accept will just fail later. However, we do need
+		 * to release the underlying HW resources in case of a device
+		 * removal event.
 		 */
+		if (event->event == RDMA_CM_EVENT_DEVICE_REMOVAL)
+			ucma_removal_event_handler(cm_id);
+
 		kfree(uevent);
 		goto out;
 	}
 
 	list_add_tail(&uevent->list, &ctx->file->event_list);
 	wake_up_interruptible(&ctx->file->poll_wait);
+	if (event->event == RDMA_CM_EVENT_DEVICE_REMOVAL)
+		ucma_removal_event_handler(cm_id);
 out:
 	mutex_unlock(&ctx->file->mut);
 	return ret;
@@ -442,9 +523,15 @@ static void ucma_cleanup_mc_events(struct ucma_multicast *mc)
 }
 
 /*
- * We cannot hold file->mut when calling rdma_destroy_id() or we can
- * deadlock.  We also acquire file->mut in ucma_event_handler(), and
- * rdma_destroy_id() will wait until all callbacks have completed.
+ * ucma_free_ctx is called after the underlying rdma CM-ID is destroyed. At
+ * this point, no new events will be reported from the hardware. However, we
+ * still need to cleanup the UCMA context for this ID. Specifically, there
+ * might be events that have not yet been consumed by the user space software.
+ * These might include pending connect requests which we have not completed
+ * processing.  We cannot call rdma_destroy_id while holding the lock of the
+ * context (file->mut), as it might cause a deadlock. We therefore extract all
+ * relevant events from the context pending events list while holding the
+ * mutex. After that we release them as needed.
  */
 static int ucma_free_ctx(struct ucma_context *ctx)
 {
@@ -452,8 +539,6 @@ static int ucma_free_ctx(struct ucma_context *ctx)
 	struct ucma_event *uevent, *tmp;
 	LIST_HEAD(list);
 
-	/* No new events will be generated after destroying the id. */
-	rdma_destroy_id(ctx->cm_id);
 
 	ucma_cleanup_multicast(ctx);
 
@@ -501,10 +586,20 @@ static ssize_t ucma_destroy_id(struct ucma_file *file, const char __user *inbuf,
 	if (IS_ERR(ctx))
 		return PTR_ERR(ctx);
 
-	ucma_put_ctx(ctx);
-	wait_for_completion(&ctx->comp);
-	resp.events_reported = ucma_free_ctx(ctx);
+	mutex_lock(&ctx->file->mut);
+	ctx->destroying = 1;
+	mutex_unlock(&ctx->file->mut);
+
+	flush_workqueue(ctx->file->close_wq);
+	/* At this point it's guaranteed that there is no inflight
+	 * closing task */
+	if (!ctx->closing) {
+		ucma_put_ctx(ctx);
+		wait_for_completion(&ctx->comp);
+		rdma_destroy_id(ctx->cm_id);
+	}
 
+	resp.events_reported = ucma_free_ctx(ctx);
 	if (copy_to_user((void __user *)(unsigned long)cmd.response,
 			 &resp, sizeof(resp)))
 		ret = -EFAULT;
@@ -1529,6 +1624,7 @@ static int ucma_open(struct inode *inode, struct file *filp)
 	INIT_LIST_HEAD(&file->ctx_list);
 	init_waitqueue_head(&file->poll_wait);
 	mutex_init(&file->mut);
+	file->close_wq = create_singlethread_workqueue("ucma_close_id");
 
 	filp->private_data = file;
 	file->filp = filp;
@@ -1543,16 +1639,28 @@ static int ucma_close(struct inode *inode, struct file *filp)
 
 	mutex_lock(&file->mut);
 	list_for_each_entry_safe(ctx, tmp, &file->ctx_list, list) {
+		ctx->destroying = 1;
 		mutex_unlock(&file->mut);
 
 		mutex_lock(&mut);
 		idr_remove(&ctx_idr, ctx->id);
 		mutex_unlock(&mut);
 
+		flush_workqueue(file->close_wq);
+		/* At that step once ctx was marked as destroying and workqueue
+		 * was flushed we are safe from any inflights handlers that
+		 * might put other closing task.
+		 */
+		if (!ctx->closing)
+			/* rdma_destroy_id ensures that no event handlers are
+			 * inflight for that id before releasing it.
+			 */
+			rdma_destroy_id(ctx->cm_id);
 		ucma_free_ctx(ctx);
 		mutex_lock(&file->mut);
 	}
 	mutex_unlock(&file->mut);
+	destroy_workqueue(file->close_wq);
 	kfree(file);
 	return 0;
 }
-- 
1.7.1

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH for-next V5 1/5] IB/uverbs: Fix reference counting usage of event files
       [not found]     ` <1434984438-21733-2-git-send-email-yishaih-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2015-06-24 17:57       ` Jason Gunthorpe
       [not found]         ` <20150624175738.GC21033-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  0 siblings, 1 reply; 17+ messages in thread
From: Jason Gunthorpe @ 2015-06-24 17:57 UTC (permalink / raw
  To: Yishai Hadas
  Cc: dledford-H+wXaHxf7aLQT0dZR+AlfA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, raindel-VPRAkNaXOzVWk0Htik3J/w,
	jackm-VPRAkNaXOzVWk0Htik3J/w

On Mon, Jun 22, 2015 at 05:47:14PM +0300, Yishai Hadas wrote:
>  	fd_install(resp.async_fd, filp);
> @@ -386,6 +376,7 @@ ssize_t ib_uverbs_get_context(struct ib_uverbs_file *file,
>  	return in_len;
>  
>  err_file:
> +	ib_uverbs_free_async_event_file(file);
>  	fput(filp);

This looks really weird. 

> +void ib_uverbs_free_async_event_file(struct ib_uverbs_file *file)
> +{
> +	kref_put(&file->async_file->ref, ib_uverbs_release_event_file);
> +	file->async_file = NULL;
> +}

So that put is supposed to pair with this get?

> +		uverbs_file->async_file = ev_file;
> +		kref_get(&uverbs_file->async_file->ref);

[..]

> +	fput(filp);
> +	uverbs_file->async_file = NULL;

But isn't it null?

Again again, WTF? async_file is a kref'd thing, if you copy or assign
to it you need to manipulate the kref, so the null assign should be
dropping the ref.

Whis looks good enough to remove ib_uverbs_free_async_event_file, if
the flip is created OK then the uverbs_file->async file can remain set
until the uverbs file is closed.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH for-next V5 3/5] IB/uverbs: Enable device removal when there are active user space applications
       [not found]     ` <1434984438-21733-4-git-send-email-yishaih-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2015-06-24 18:25       ` Jason Gunthorpe
       [not found]         ` <20150624182519.GD21033-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  0 siblings, 1 reply; 17+ messages in thread
From: Jason Gunthorpe @ 2015-06-24 18:25 UTC (permalink / raw
  To: Yishai Hadas
  Cc: dledford-H+wXaHxf7aLQT0dZR+AlfA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, raindel-VPRAkNaXOzVWk0Htik3J/w,
	jackm-VPRAkNaXOzVWk0Htik3J/w

On Mon, Jun 22, 2015 at 05:47:16PM +0300, Yishai Hadas wrote:
> +++ b/drivers/infiniband/core/uverbs_main.c
> @@ -137,7 +137,12 @@ static void ib_uverbs_release_dev(struct kref *ref)
>  	struct ib_uverbs_device *dev =
>  		container_of(ref, struct ib_uverbs_device, ref);
>  
> -	complete(&dev->comp);
> +	if (!dev->ib_dev) {
> +		cleanup_srcu_struct(&dev->disassociate_srcu);
> +		kfree(dev);
> +	} else {
> +		complete(&dev->comp);
> +	}

Oy.. It is so gross to see a kref now being simultaneously used for
actual memory accounting and also for general reference counting.

It is also locked wrong, for instance:

@@ -792,13 +889,18 @@  static int ib_uverbs_open(struct inode *inode, struct file *filp)
 err:
+	mutex_unlock(&dev->lists_mutex);
+	srcu_read_unlock(&dev->disassociate_srcu, srcu_key);
 	kref_put(&dev->ref, ib_uverbs_release_dev);

Is not holding the RCU lock while ib_uverbs_release_dev is reading
ib_dev. The barriers in kref are not strong enough to guarentee the RCU
protected data will be visible. (remember when I asked if you checked
all of these?)

Obviously you can't hold the disassociate_srcu and call kref_put, so
maybe grabbing it in release_dev would work. I didn't look that
closely.

But really, don't make a kref do two kinds of things, it just doesn't
make any sense. You should split it into a proper memory ownership
tracking kref that always does kfree and a counter used to cause
complete().

The rest looked OK now..

> +                       /* The barriers built into wait_event_interruptible()
> +                        * and wake_up() make this ib_dev check RCU protected
> +                        */

No..

 The barriers built into wait_event_interruptible() and wake_up()
 guarentee this will see the null set without using RCU

> +	if (device->disassociate_ucontext) {
> +		/* We disassociate HW resources and immediately returning, not
> +		 * pending to active userspace clients. Upon returning ib_device
> +		 * may be freed internally and is not valid any more.
> +		 * uverbs_device is still available, when all clients close
> +		 * their files, the uverbs device ref count will be zero and its
> +		 * resources will be freed.
> +		 * Note: At that step no more files can be opened on that cdev
> +		 * as it was deleted, however active clients can still issue
> +		 * commands and close their open files.
> +		 */

Clean up the grammer..

We disassociate HW resources and immediately return. Userspace will
see a EIO errno for all future access. Upon returning, ib_device may be
freed internally and is not valid any more.  uverbs_device is still
available until all clients close their files, then the uverbs device
ref count will be zero and its resources will be freed.  Note: At this
point no more files can be opened since the cdev was deleted, however
active clients can still issue commands and close their open files.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH for-next V5 1/5] IB/uverbs: Fix reference counting usage of event files
       [not found]         ` <20150624175738.GC21033-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2015-06-25 11:46           ` Yishai Hadas
       [not found]             ` <558BE9FA.9060402-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
  0 siblings, 1 reply; 17+ messages in thread
From: Yishai Hadas @ 2015-06-25 11:46 UTC (permalink / raw
  To: Jason Gunthorpe
  Cc: Yishai Hadas, dledford-H+wXaHxf7aLQT0dZR+AlfA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, raindel-VPRAkNaXOzVWk0Htik3J/w,
	jackm-VPRAkNaXOzVWk0Htik3J/w

On 6/24/2015 8:57 PM, Jason Gunthorpe wrote:
> On Mon, Jun 22, 2015 at 05:47:14PM +0300, Yishai Hadas wrote:
>>   	fd_install(resp.async_fd, filp);
>> @@ -386,6 +376,7 @@ ssize_t ib_uverbs_get_context(struct ib_uverbs_file *file,
>>   	return in_len;
>>
>>   err_file:
>> +	ib_uverbs_free_async_event_file(file);
>>   	fput(filp);
>
> This looks really weird.

We need to cleanup all by that step otherwise we might leak that async 
file, see my last comment below which clarifies that point.
As ib_uverbs_release_event_file is static function in uverbs_main it 
makes sense to expose this cleanup function similar to 
ib_uverbs_alloc_event_file which is exposed from uverbs_main and make 
the allocation.

>> +void ib_uverbs_free_async_event_file(struct ib_uverbs_file *file)
>> +{
>> +	kref_put(&file->async_file->ref, ib_uverbs_release_event_file);
>> +	file->async_file = NULL;
>> +}
>
> So that put is supposed to pair with this get?
Yes

>
>> +		uverbs_file->async_file = ev_file;
>> +		kref_get(&uverbs_file->async_file->ref);
>
> [..]
>
>> +	fput(filp);
>> +	uverbs_file->async_file = NULL;
>
> But isn't it null?
>

No, this NULL is set when ib_uverbs_alloc_event_file failed internally, 
the above flow occurs when file was created OK but later it fails via 
copy_to_user.

> Again again, WTF? async_file is a kref'd thing, if you copy or assign
> to it you need to manipulate the kref, so the null assign should be
> dropping the ref.
>

The logic in ib_uverbs_alloc_event_file if kref oriented, the original 
code which uses free didn't follow this convention, currently in case 
ib_uverbs_alloc_event_file failed internally it uses 
kref_put(&ev_file->ref, ib_uverbs_release_event_file) which internally 
frees the memory, in case we had an error after that the file was 
created (e.g ib_register_event_handler) need to use fput(filp) to make 
an extra cleanup which again uses ref count handling.
The uverbs_file->async_file = NULL command just cleans the previous 
setting of uverbs_file->async_file = ev_file that was done before to 
prevent it be cleanup when the uverbs_file itself is closed.

> Whis looks good enough to remove ib_uverbs_free_async_event_file, if
> the flip is created OK then the uverbs_file->async file can remain set
> until the uverbs file is closed.

This is wrong, as file->ucontext is NULL because of the failure in 
copy_to_user, application might recall to ib_uverbs_get_context in that 
case we leak that previous async_file, that's why the code must cleanup 
and put NULL by that step. To your request put 
WARN_ON(uverbs_file->async_file) as part of ib_uverbs_alloc_event_file 
to make sure that we don't hit that case.

This patch fixes a bug that currently exists upstream that you pointed 
on that may cause an oops, hope that above clarifies the issues and you 
can ACK on.



--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH for-next V5 3/5] IB/uverbs: Enable device removal when there are active user space applications
       [not found]         ` <20150624182519.GD21033-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2015-06-25 13:51           ` Yishai Hadas
       [not found]             ` <558C0775.4000104-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
  0 siblings, 1 reply; 17+ messages in thread
From: Yishai Hadas @ 2015-06-25 13:51 UTC (permalink / raw
  To: Jason Gunthorpe
  Cc: Yishai Hadas, dledford-H+wXaHxf7aLQT0dZR+AlfA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, raindel-VPRAkNaXOzVWk0Htik3J/w,
	jackm-VPRAkNaXOzVWk0Htik3J/w

On 6/24/2015 9:25 PM, Jason Gunthorpe wrote:
> On Mon, Jun 22, 2015 at 05:47:16PM +0300, Yishai Hadas wrote:
>> +++ b/drivers/infiniband/core/uverbs_main.c
>> @@ -137,7 +137,12 @@ static void ib_uverbs_release_dev(struct kref *ref)
>>   	struct ib_uverbs_device *dev =
>>   		container_of(ref, struct ib_uverbs_device, ref);
>>
>> -	complete(&dev->comp);
>> +	if (!dev->ib_dev) {
>> +		cleanup_srcu_struct(&dev->disassociate_srcu);
>> +		kfree(dev);
>> +	} else {
>> +		complete(&dev->comp);
>> +	}
>
> Oy.. It is so gross to see a kref now being simultaneously used for
> actual memory accounting and also for general reference counting.
>
> It is also locked wrong, for instance:
Not correct, see below.

>
> @@ -792,13 +889,18 @@  static int ib_uverbs_open(struct inode *inode, struct file *filp)
>   err:
> +	mutex_unlock(&dev->lists_mutex);
> +	srcu_read_unlock(&dev->disassociate_srcu, srcu_key);
>   	kref_put(&dev->ref, ib_uverbs_release_dev);
>
> Is not holding the RCU lock while ib_uverbs_release_dev is reading
> ib_dev. The barriers in kref are not strong enough to guarentee the RCU
> protected data will be visible. (remember when I asked if you checked
> all of these?)
>

This is not a locking problem, the only option that here the reference 
count becomes 0 is if ib_uverbs_remove_one was previously called and 
decreased the reference count that was taken upon load. However, it was 
done after that rcu_assign_pointer(uverbs_dev->ib_dev, NULL) was called 
so the check whether if (!dev->ib_dev) is fully protected and can't race 
with HW removal flow.


> Obviously you can't hold the disassociate_srcu and call kref_put, so
> maybe grabbing it in release_dev would work. I didn't look that
> closely.
>
> But really, don't make a kref do two kinds of things, it just doesn't
> make any sense. You should split it into a proper memory ownership
> tracking kref that always does kfree and a counter used to cause
> complete().

The kref is used to manage the uverbs_dev allocation, the internal code 
in ib_uverbs_release_dev depends on the state. Usually the natural place 
to free the memory is as part of the release function as done in other 
kernel places. In case that ib_device was previously removed it can be 
safely freed here as it's called when the last client disconnected, this 
logic is introduced by this patch. In case there is a need to wait 
clients as the driver doesn't support HW device removal the free can't 
be done internally but must be done externally in ib_uverbs_remove_one 
when last client disconnected and complete should be used instead. As of 
I believe that we can stay with only one kref that manage the uverbs_dev 
which is safe as I pointed above.

>
> The rest looked OK now..
Thanks

>
>> +                       /* The barriers built into wait_event_interruptible()
>> +                        * and wake_up() make this ib_dev check RCU protected
>> +                        */
>
> No..
>
>   The barriers built into wait_event_interruptible() and wake_up()
>   guarentee this will see the null set without using RCU
>
Will fix the comment

>> +	if (device->disassociate_ucontext) {
>> +		/* We disassociate HW resources and immediately returning, not
>> +		 * pending to active userspace clients. Upon returning ib_device
>> +		 * may be freed internally and is not valid any more.
>> +		 * uverbs_device is still available, when all clients close
>> +		 * their files, the uverbs device ref count will be zero and its
>> +		 * resources will be freed.
>> +		 * Note: At that step no more files can be opened on that cdev
>> +		 * as it was deleted, however active clients can still issue
>> +		 * commands and close their open files.
>> +		 */
>
> Clean up the grammer..
>
> We disassociate HW resources and immediately return. Userspace will
> see a EIO errno for all future access. Upon returning, ib_device may be
> freed internally and is not valid any more.  uverbs_device is still
> available until all clients close their files, then the uverbs device
> ref count will be zero and its resources will be freed.  Note: At this
> point no more files can be opened since the cdev was deleted, however
> active clients can still issue commands and close their open files.

OK
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH for-next V5 3/5] IB/uverbs: Enable device removal when there are active user space applications
       [not found]             ` <558C0775.4000104-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
@ 2015-06-25 17:09               ` Jason Gunthorpe
       [not found]                 ` <20150625170937.GE21033-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  0 siblings, 1 reply; 17+ messages in thread
From: Jason Gunthorpe @ 2015-06-25 17:09 UTC (permalink / raw
  To: Yishai Hadas
  Cc: Yishai Hadas, dledford-H+wXaHxf7aLQT0dZR+AlfA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, raindel-VPRAkNaXOzVWk0Htik3J/w,
	jackm-VPRAkNaXOzVWk0Htik3J/w

On Thu, Jun 25, 2015 at 04:51:49PM +0300, Yishai Hadas wrote:
> On 6/24/2015 9:25 PM, Jason Gunthorpe wrote:
> >Is not holding the RCU lock while ib_uverbs_release_dev is reading
> >ib_dev. The barriers in kref are not strong enough to guarentee the RCU
> >protected data will be visible. (remember when I asked if you checked
> >all of these?)
> 
> This is not a locking problem, the only option that here the reference count
> becomes 0 is if ib_uverbs_remove_one was previously called and decreased the
> reference count that was taken upon load. However, it was done after that
> rcu_assign_pointer(uverbs_dev->ib_dev, NULL) was called so the check whether
> if (!dev->ib_dev) is fully protected and can't race with HW removal flow.

The problem with RCU is not to do with instruction concurrancy, you
need to convince yourself the unlocked accesses have *data* coherency,
that is what rcu_derference and friends are all about.

So, looked too briefly yesterday (sorry, busy), but it turns out that
atomic_sub_return guarentees a full smp_mb(), so the data barrier in
kref_put *IS* strong enough for this use. Thus it is OK.

> The kref is used to manage the uverbs_dev allocation, the internal code in
> ib_uverbs_release_dev depends on the state. Usually the natural place to
> free the memory is as part of the release function as done in other kernel
> places. In case that ib_device was previously removed it can be safely freed
> here as it's called when the last client disconnected, this logic is
> introduced by this patch. In case there is a need to wait clients as the
> driver doesn't support HW device removal the free can't be done internally
> but must be done externally in ib_uverbs_remove_one when last client
> disconnected and complete should be used instead. As of I believe that we
> can stay with only one kref that manage the uverbs_dev which is safe as I
> pointed above.

Many other places in the kernel split the 'can i complete remove'
counter (ie the active count) from the kref. It just makes things
simpler and cleaner

When you changed the flow to do both things, it breaks the error
handling, ie this:

@@ -924,6 +1062,13 @@  static void ib_uverbs_add_one(struct ib_device *device)
	uverbs_dev = kzalloc(sizeof *uverbs_dev, GFP_KERNEL);
 	init_completion(&uverbs_dev->comp);
 	uverbs_dev->xrcd_tree = RB_ROOT;
 	mutex_init(&uverbs_dev->xrcd_tree_mutex);
+	mutex_init(&uverbs_dev->lists_mutex);
+	ret = init_srcu_struct(&uverbs_dev->disassociate_srcu);
+	if (ret)
+		goto err_init;

+err_init:
 	kref_put(&uverbs_dev->ref, ib_uverbs_release_dev);
 	wait_for_completion(&uverbs_dev->comp);
 	kfree(uverbs_dev);

Is now a double kfree.

So, to fix this, we need to have two different uses of kput, one that
is protected by the implicit locking of add_one/remove_one, and one
that is usable otherwise.

The former should always be the same, and wrapped in a function:

// ONLY callable from remove_one/add_one
static void free_uverbs_dev(..)
{
	bool need_wait = uverbs_dev->ib_dev;
	kref_put(&uverbs_dev->ref, ib_uverbs_release_dev);
	if (need_wait) {
		wait_for_completion(&uverbs_dev->comp);
 		cleanup_srcu_struct(&uverbs_dev->disassociate_srcu);
		kfree(uverbs_dev);
	}
}

Workable, but it sure makes the kref weird and ugly and abnormal.

99% of the time, seeing a dereference after a kref would be an error,
only when a kref is pretending to be an atomic does it matter. I am of
the opinion that abusing kref for something that is not memory
reference counting is wrong, it makes the code hard to audit and
understand because everyone *expects* standard rules to be followed
with kref.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH for-next V5 1/5] IB/uverbs: Fix reference counting usage of event files
       [not found]             ` <558BE9FA.9060402-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
@ 2015-06-25 17:52               ` Jason Gunthorpe
       [not found]                 ` <20150625175228.GF21033-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  0 siblings, 1 reply; 17+ messages in thread
From: Jason Gunthorpe @ 2015-06-25 17:52 UTC (permalink / raw
  To: Yishai Hadas
  Cc: Yishai Hadas, dledford-H+wXaHxf7aLQT0dZR+AlfA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, raindel-VPRAkNaXOzVWk0Htik3J/w,
	jackm-VPRAkNaXOzVWk0Htik3J/w

On Thu, Jun 25, 2015 at 02:46:02PM +0300, Yishai Hadas wrote:
> On 6/24/2015 8:57 PM, Jason Gunthorpe wrote:
> >On Mon, Jun 22, 2015 at 05:47:14PM +0300, Yishai Hadas wrote:
> >>  	fd_install(resp.async_fd, filp);
> >>@@ -386,6 +376,7 @@ ssize_t ib_uverbs_get_context(struct ib_uverbs_file *file,
> >>  	return in_len;
> >>
> >>  err_file:
> >>+	ib_uverbs_free_async_event_file(file);
> >>  	fput(filp);
> >
> >This looks really weird.
> 
> We need to cleanup all by that step otherwise we might leak that async file,
> see my last comment below which clarifies that point.
> As ib_uverbs_release_event_file is static function in uverbs_main it makes
> sense to expose this cleanup function similar to ib_uverbs_alloc_event_file
> which is exposed from uverbs_main and make the allocation.

Okay, sure, that makes sense.

> >Again again, WTF? async_file is a kref'd thing, if you copy or assign
> >to it you need to manipulate the kref, so the null assign should be
> >dropping the ref.
>
> The logic in ib_uverbs_alloc_event_file if kref oriented, the original code
> which uses free didn't follow this convention, currently in case

Yes, and you fixed it, this is good:

@@ -557,15 +563,38 @@  struct file *ib_uverbs_alloc_event_file(struct ib_uverbs_file *uverbs_file,
	ev_file = kmalloc(sizeof *ev_file, GFP_KERNEL);
	kref_init(&ev_file->ref);
        if (IS_ERR(filp))
+		goto err;
+err:
+	kref_put(&ev_file->ref, ib_uverbs_release_event_file);

The kref_init and the kref_put are a matching pair, great!

> had an error after that the file was created (e.g ib_register_event_handler)
> need to use fput(filp) to make an extra cleanup which again uses ref count
> handling.

Yep, that is a good idea too. But understand what it means,

When this happens:
 	filp = anon_inode_getfile("[infinibandevent]", &uverbs_event_fops,
 				  ev_file, O_RDONLY);
The unref pairs for these gets:
	kref_init(&ev_file->ref);
+	kref_get(&uverbs_file->ref);

Are logically moved into flip, and now fput(flip) will call
ib_uverbs_event_close() and both those krefs will be released.

For this reason, for clarity, I would also move the
'kref_get(&uverbs_file->ref);' to be before anon_inode_getfile()

Which means this:

@@ -557,15 +563,38 @@  struct file *ib_uverbs_alloc_event_file(struct ib_uverbs_file *uverbs_file,
+	kref_get(&uverbs_file->ref);
+	if (is_async) {
+		if (ret)
+			goto put_file;
+put_file:
+	fput(filp);
+err:
+	kref_put(&ev_file->ref, ib_uverbs_release_event_file);

Is a double put on ev_file, since fput -> ib_uverbs_event_close does one, and
then err does another. (remember, the ref was moved into the flip)

Next:

+	if (is_async) {
+		uverbs_file->async_file = ev_file;
+		kref_get(&uverbs_file->async_file->ref);
+		if (ret)
+			goto put_file;
+put_file:
+	uverbs_file->async_file = NULL;

Where is the paired kref_put? There are two I can see:

+void ib_uverbs_free_async_event_file(struct ib_uverbs_file *file)
+{
+	kref_put(&file->async_file->ref, ib_uverbs_release_event_file);
-----
static int ib_uverbs_close(struct inode *inode, struct file *filp)
		kref_put(&file->async_file->ref, ib_uverbs_release_event_file);

As far as I can tell, neither of these are called, so we have an
unbalanced put.

Which brings me back to my statement:

> >Again again, WTF? async_file is a kref'd thing, if you copy or assign
> >to it you need to manipulate the kref, so the null assign should be
> >dropping the ref.

So, the two possible fixes are
1) Make the call to ib_uverbs_free_async_event_file() happen even if
   ib_uverbs_alloc_event_file() fails
2) Do in ib_uverbs_alloc_event_file
+	kref_put(&uverbs_file->async_file->ref, ib_uverbs_release_event_file);
+	uverbs_file->async_file = NULL;
   Which may as well just be a call to ib_uverbs_free_async_event_file()

In either situation the naked async_file = NULL is fixed to have a
kref beside it.

This is basic invariant of how kref works. Read the 'rules' in
Documentation/kref.txt

So, as I reviewer, I see a kref'd variable (async_file) being
manipulated without corresponding kref code. I *KNOW* something is
wrong. You answer didn't address this, you needed to ID where the
pair'd kref_put was and justify having the '= NULL' so far away from
it.

> >Whis looks good enough to remove ib_uverbs_free_async_event_file, if
> >the flip is created OK then the uverbs_file->async file can remain set
> >until the uverbs file is closed.
> 
> This is wrong, as file->ucontext is NULL because of the failure in
> copy_to_user, application might recall to ib_uverbs_get_context in that case
> we leak that previous async_file, that's why the code must cleanup and put
> NULL by that step. To your request put WARN_ON(uverbs_file->async_file) as
> part of ib_uverbs_alloc_event_file to make sure that we don't hit that case.

Yes, that makes sense.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH for-next V5 1/5] IB/uverbs: Fix reference counting usage of event files
       [not found]                 ` <20150625175228.GF21033-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2015-06-28 14:33                   ` Yishai Hadas
       [not found]                     ` <559005A0.4070800-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
  0 siblings, 1 reply; 17+ messages in thread
From: Yishai Hadas @ 2015-06-28 14:33 UTC (permalink / raw
  To: Jason Gunthorpe
  Cc: Yishai Hadas, dledford-H+wXaHxf7aLQT0dZR+AlfA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, raindel-VPRAkNaXOzVWk0Htik3J/w,
	jackm-VPRAkNaXOzVWk0Htik3J/w

On 6/25/2015 8:52 PM, Jason Gunthorpe wrote:
> On Thu, Jun 25, 2015 at 02:46:02PM +0300, Yishai Hadas wrote:
>> On 6/24/2015 8:57 PM, Jason Gunthorpe wrote:
>>> On Mon, Jun 22, 2015 at 05:47:14PM +0300, Yishai Hadas wrote:
>>>>   	fd_install(resp.async_fd, filp);
>>>> @@ -386,6 +376,7 @@ ssize_t ib_uverbs_get_context(struct ib_uverbs_file *file,
>>>>   	return in_len;
>>>>
>>>>   err_file:
>>>> +	ib_uverbs_free_async_event_file(file);
>>>>   	fput(filp);
>>>
>>> This looks really weird.
>>
>> We need to cleanup all by that step otherwise we might leak that async file,
>> see my last comment below which clarifies that point.
>> As ib_uverbs_release_event_file is static function in uverbs_main it makes
>> sense to expose this cleanup function similar to ib_uverbs_alloc_event_file
>> which is exposed from uverbs_main and make the allocation.
>
> Okay, sure, that makes sense.
>
>>> Again again, WTF? async_file is a kref'd thing, if you copy or assign
>>> to it you need to manipulate the kref, so the null assign should be
>>> dropping the ref.
>>
>> The logic in ib_uverbs_alloc_event_file if kref oriented, the original code
>> which uses free didn't follow this convention, currently in case
>
> Yes, and you fixed it, this is good:
>
> @@ -557,15 +563,38 @@  struct file *ib_uverbs_alloc_event_file(struct ib_uverbs_file *uverbs_file,
> 	ev_file = kmalloc(sizeof *ev_file, GFP_KERNEL);
> 	kref_init(&ev_file->ref);
>          if (IS_ERR(filp))
> +		goto err;
> +err:
> +	kref_put(&ev_file->ref, ib_uverbs_release_event_file);
>
> The kref_init and the kref_put are a matching pair, great!
>
>> had an error after that the file was created (e.g ib_register_event_handler)
>> need to use fput(filp) to make an extra cleanup which again uses ref count
>> handling.
>
> Yep, that is a good idea too. But understand what it means,
>
> When this happens:
>   	filp = anon_inode_getfile("[infinibandevent]", &uverbs_event_fops,
>   				  ev_file, O_RDONLY);
> The unref pairs for these gets:
> 	kref_init(&ev_file->ref);
> +	kref_get(&uverbs_file->ref);
>
> Are logically moved into flip, and now fput(flip) will call
> ib_uverbs_event_close() and both those krefs will be released.
>
> For this reason, for clarity, I would also move the
> 'kref_get(&uverbs_file->ref);' to be before anon_inode_getfile()

Taking kref_get(&uverbs_file->ref) before anon_inode_getfile has 
succeeded is not correct, only when anon_inode_getfile succeeds we can 
call fput(filp) which internally makes the kref_get on the uverbs_file 
as part of ib_uverbs_event_close. In current code we take ref count on 
the ev_file as part kref_init(&ev_file->ref) which is put as part of the 
"err" label
err:
	kref_put(&ev_file->ref, ib_uverbs_release_event_file);
In case anon_inode_getfile has succeeded we take also the ref_count on 
the uverbs_file which will be put back upon closing the file and the 
kref put/get are fully symmetric. For the async case we also fully 
symmetric here, see next comment below.


> Which means this:
>
> @@ -557,15 +563,38 @@  struct file *ib_uverbs_alloc_event_file(struct ib_uverbs_file *uverbs_file,
> +	kref_get(&uverbs_file->ref);
> +	if (is_async) {
> +		if (ret)
> +			goto put_file;
> +put_file:
> +	fput(filp);
> +err:
> +	kref_put(&ev_file->ref, ib_uverbs_release_event_file);
>
> Is a double put on ev_file, since fput -> ib_uverbs_event_close does one, and
> then err does another. (remember, the ref was moved into the flip)
>
> Next:
>
> +	if (is_async) {
> +		uverbs_file->async_file = ev_file;
> +		kref_get(&uverbs_file->async_file->ref);
> +		if (ret)
> +			goto put_file;
> +put_file:
> +	uverbs_file->async_file = NULL;
>
> Where is the paired kref_put? There are two I can see:
>
> +void ib_uverbs_free_async_event_file(struct ib_uverbs_file *file)
> +{
> +	kref_put(&file->async_file->ref, ib_uverbs_release_event_file);
> -----
> static int ib_uverbs_close(struct inode *inode, struct file *filp)
> 		kref_put(&file->async_file->ref, ib_uverbs_release_event_file);
>
> As far as I can tell, neither of these are called, so we have an
> unbalanced put.

You are wrong here, we have here balanced put, the first is done as part 
of fput(filp) -> ib_uverbs_event_close_file -> kref_put(&file->ref, 
ib_uverbs_release_event_file) and the second at the end of this function 
as part of the err: label kref_put(&ev_file->ref, 
ib_uverbs_release_event_file);




> So, as I reviewer, I see a kref'd variable (async_file) being
> manipulated without corresponding kref code. I *KNOW* something is
> wrong. You answer didn't address this, you needed to ID where the
> pair'd kref_put was and justify having the '= NULL' so far away from
> it.

See my comment above, we have here pair kref_put, the 
uverbs_file->async_file = NULL comes to prevent a third in-correct 
kref_put as part of ib_uverbs_close.

Please review and let me know whether you can ACK on this patch.

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH for-next V5 1/5] IB/uverbs: Fix reference counting usage of event files
       [not found]                     ` <559005A0.4070800-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
@ 2015-06-29 17:40                       ` Jason Gunthorpe
       [not found]                         ` <20150629174038.GB2755-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  0 siblings, 1 reply; 17+ messages in thread
From: Jason Gunthorpe @ 2015-06-29 17:40 UTC (permalink / raw
  To: Yishai Hadas
  Cc: Yishai Hadas, dledford-H+wXaHxf7aLQT0dZR+AlfA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, raindel-VPRAkNaXOzVWk0Htik3J/w,
	jackm-VPRAkNaXOzVWk0Htik3J/w

On Sun, Jun 28, 2015 at 05:33:04PM +0300, Yishai Hadas wrote:

> You are wrong here, we have here balanced put, the first is done as
> part of fput(filp) -> ib_uverbs_event_close_file ->
> kref_put(&file->ref, ib_uverbs_release_event_file) and the second at
> the end of this function as part of the err: label
> kref_put(&ev_file->ref, ib_uverbs_release_event_file);

Ugh, that is so gross, seriously? Sure, the counts work out, but that
is *totally unreadable*.

Sigh, just use this:

 struct file *ib_uverbs_alloc_event_file(struct ib_uverbs_file *uverbs_file,
                                        int is_async)
 {
        struct ib_uverbs_event_file *ev_file;
        struct file *filp;
+       int ret;
 
-       ev_file = kmalloc(sizeof *ev_file, GFP_KERNEL);
+       ev_file = kzalloc(sizeof(*ev_file), GFP_KERNEL);
        if (!ev_file)
                return ERR_PTR(-ENOMEM);
 
@@ -555,15 +561,41 @@ struct file *ib_uverbs_alloc_event_file(struct ib_uverbs_file *uverbs_file,
        INIT_LIST_HEAD(&ev_file->event_list);
        init_waitqueue_head(&ev_file->poll_wait);
        ev_file->uverbs_file = uverbs_file;
+       kref_get(&ev_file->uverbs_file->ref);
        ev_file->async_queue = NULL;
-       ev_file->is_async    = is_async;
        ev_file->is_closed   = 0;
 
        filp = anon_inode_getfile("[infinibandevent]", &uverbs_event_fops,
                                  ev_file, O_RDONLY);
        if (IS_ERR(filp))
-               kfree(ev_file);
+               goto err_put_refs;
+
+       if (is_async) {
+               WARN_ON(uverbs_file->async_file);
+               uverbs_file->async_file = ev_file;
+               kref_get(&uverbs_file->async_file->ref);
+               INIT_IB_EVENT_HANDLER(&uverbs_file->event_handler,
+                                     uverbs_file->device->ib_dev,
+                                     ib_uverbs_event_handler);
+               ret = ib_register_event_handler(&uverbs_file->event_handler);
+               if (ret)
+                       goto err_put_file;
+
+               /* At that point async file stuff was fully set */
+               ev_file->is_async = 1;
+       }
+
+       return filp;
+
+err_put_file:
+       fput(filp);
+       kref_put(&uverbs_file->async_file->ref, ib_uverbs_release_file);
+       uverbs_file->async_file = NULL;
+       return ERR_PTR(ret);
 
+err_put_refs:
+       kref_put(&ev_file->uverbs_file->ref, ib_uverbs_release_file);
+       kref_put(&ev_file->ref, ib_uverbs_release_event_file);
        return filp;
 }

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH for-next V5 1/5] IB/uverbs: Fix reference counting usage of event files
       [not found]                         ` <20150629174038.GB2755-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2015-06-29 21:22                           ` Yishai Hadas
       [not found]                             ` <5591B6FA.4000400-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
  0 siblings, 1 reply; 17+ messages in thread
From: Yishai Hadas @ 2015-06-29 21:22 UTC (permalink / raw
  To: Jason Gunthorpe
  Cc: Yishai Hadas, dledford-H+wXaHxf7aLQT0dZR+AlfA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, raindel-VPRAkNaXOzVWk0Htik3J/w,
	jackm-VPRAkNaXOzVWk0Htik3J/w

On 6/29/2015 8:40 PM, Jason Gunthorpe wrote:
> On Sun, Jun 28, 2015 at 05:33:04PM +0300, Yishai Hadas wrote:
>
>> You are wrong here, we have here balanced put, the first is done as
>> part of fput(filp) -> ib_uverbs_event_close_file ->
>> kref_put(&file->ref, ib_uverbs_release_event_file) and the second at
>> the end of this function as part of the err: label
>> kref_put(&ev_file->ref, ib_uverbs_release_event_file);
>
> Ugh, that is so gross, seriously? Sure, the counts work out, but that
> is *totally unreadable*.
>
> Sigh, just use this:
>
>   struct file *ib_uverbs_alloc_event_file(struct ib_uverbs_file *uverbs_file,
>                                          int is_async)
>   {
>          struct ib_uverbs_event_file *ev_file;
>          struct file *filp;
> +       int ret;
>
> -       ev_file = kmalloc(sizeof *ev_file, GFP_KERNEL);
> +       ev_file = kzalloc(sizeof(*ev_file), GFP_KERNEL);
>          if (!ev_file)
>                  return ERR_PTR(-ENOMEM);
>
> @@ -555,15 +561,41 @@ struct file *ib_uverbs_alloc_event_file(struct ib_uverbs_file *uverbs_file,
>          INIT_LIST_HEAD(&ev_file->event_list);
>          init_waitqueue_head(&ev_file->poll_wait);
>          ev_file->uverbs_file = uverbs_file;
> +       kref_get(&ev_file->uverbs_file->ref);
>          ev_file->async_queue = NULL;
> -       ev_file->is_async    = is_async;
>          ev_file->is_closed   = 0;
>
>          filp = anon_inode_getfile("[infinibandevent]", &uverbs_event_fops,
>                                    ev_file, O_RDONLY);
>          if (IS_ERR(filp))
> -               kfree(ev_file);
> +               goto err_put_refs;
> +
> +       if (is_async) {
> +               WARN_ON(uverbs_file->async_file);
> +               uverbs_file->async_file = ev_file;
> +               kref_get(&uverbs_file->async_file->ref);
> +               INIT_IB_EVENT_HANDLER(&uverbs_file->event_handler,
> +                                     uverbs_file->device->ib_dev,
> +                                     ib_uverbs_event_handler);
> +               ret = ib_register_event_handler(&uverbs_file->event_handler);
> +               if (ret)
> +                       goto err_put_file;
> +
> +               /* At that point async file stuff was fully set */
> +               ev_file->is_async = 1;
> +       }
> +
> +       return filp;
> +
> +err_put_file:
> +       fput(filp);
> +       kref_put(&uverbs_file->async_file->ref, ib_uverbs_release_file);

It should be:
kref_put(&uverbs_file->async_file->ref, ib_uverbs_release_event_file)
instead of:
kref_put(&uverbs_file->async_file->ref, ib_uverbs_release_file);

Please note that in that approach we duplicate above line as it appears 
here and in below err_put_refs label as uverbs_file->async_file is 
really ev_file.

In my patch we used one line instead of those duplicated lines, however 
as you think that it clarifies things will go with your suggestion after 
fixing above note, thanks.


> +       uverbs_file->async_file = NULL;
> +       return ERR_PTR(ret);
>
> +err_put_refs:
> +       kref_put(&ev_file->uverbs_file->ref, ib_uverbs_release_file);
> +       kref_put(&ev_file->ref, ib_uverbs_release_event_file);
>          return filp;
>   }
>
> Jason
>

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH for-next V5 3/5] IB/uverbs: Enable device removal when there are active user space applications
       [not found]                 ` <20150625170937.GE21033-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2015-06-29 21:33                   ` Yishai Hadas
  0 siblings, 0 replies; 17+ messages in thread
From: Yishai Hadas @ 2015-06-29 21:33 UTC (permalink / raw
  To: Jason Gunthorpe
  Cc: Yishai Hadas, dledford-H+wXaHxf7aLQT0dZR+AlfA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, raindel-VPRAkNaXOzVWk0Htik3J/w,
	jackm-VPRAkNaXOzVWk0Htik3J/w

On 6/25/2015 8:09 PM, Jason Gunthorpe wrote:
> On Thu, Jun 25, 2015 at 04:51:49PM +0300, Yishai Hadas wrote:
>> On 6/24/2015 9:25 PM, Jason Gunthorpe wrote:
>>> Is not holding the RCU lock while ib_uverbs_release_dev is reading
>>> ib_dev. The barriers in kref are not strong enough to guarentee the RCU
>>> protected data will be visible. (remember when I asked if you checked
>>> all of these?)
>>
>> This is not a locking problem, the only option that here the reference count
>> becomes 0 is if ib_uverbs_remove_one was previously called and decreased the
>> reference count that was taken upon load. However, it was done after that
>> rcu_assign_pointer(uverbs_dev->ib_dev, NULL) was called so the check whether
>> if (!dev->ib_dev) is fully protected and can't race with HW removal flow.
>
> The problem with RCU is not to do with instruction concurrancy, you
> need to convince yourself the unlocked accesses have *data* coherency,
> that is what rcu_derference and friends are all about.
>
> So, looked too briefly yesterday (sorry, busy), but it turns out that
> atomic_sub_return guarentees a full smp_mb(), so the data barrier in
> kref_put *IS* strong enough for this use. Thus it is OK.
>
>> The kref is used to manage the uverbs_dev allocation, the internal code in
>> ib_uverbs_release_dev depends on the state. Usually the natural place to
>> free the memory is as part of the release function as done in other kernel
>> places. In case that ib_device was previously removed it can be safely freed
>> here as it's called when the last client disconnected, this logic is
>> introduced by this patch. In case there is a need to wait clients as the
>> driver doesn't support HW device removal the free can't be done internally
>> but must be done externally in ib_uverbs_remove_one when last client
>> disconnected and complete should be used instead. As of I believe that we
>> can stay with only one kref that manage the uverbs_dev which is safe as I
>> pointed above.
>
> Many other places in the kernel split the 'can i complete remove'
> counter (ie the active count) from the kref. It just makes things
> simpler and cleaner

OK, will use 2 krefs one will control the complete and the second will 
manage the memory, V6 will have this change in.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH for-next V5 1/5] IB/uverbs: Fix reference counting usage of event files
       [not found]                             ` <5591B6FA.4000400-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
@ 2015-06-29 21:48                               ` Jason Gunthorpe
  0 siblings, 0 replies; 17+ messages in thread
From: Jason Gunthorpe @ 2015-06-29 21:48 UTC (permalink / raw
  To: Yishai Hadas
  Cc: Yishai Hadas, dledford-H+wXaHxf7aLQT0dZR+AlfA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, raindel-VPRAkNaXOzVWk0Htik3J/w,
	jackm-VPRAkNaXOzVWk0Htik3J/w

On Tue, Jun 30, 2015 at 12:22:02AM +0300, Yishai Hadas wrote:
> It should be:
> kref_put(&uverbs_file->async_file->ref, ib_uverbs_release_event_file)
> instead of:
> kref_put(&uverbs_file->async_file->ref, ib_uverbs_release_file);

Right

> Please note that in that approach we duplicate above line as it appears here
> and in below err_put_refs label as uverbs_file->async_file is really
> ev_file.

Well, sort of. Yes, it is the same value, but no, it isn't the same
kref.

Just like locking always works on data, not code, krefs should always
be applied to the pointer not to the pointed data. The kref
manipulation should always be near the pointer value change it is
working for, and have a clear relation to the pointer it is krefing.

> In my patch we used one line instead of those duplicated lines, however as
> you think that it clarifies things will go with your suggestion after fixing
> above note, thanks.

You shouldn't think of them as duplicates, they are for different
things, even if the actual functionality is the same. It is
*documentation*

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2015-06-29 21:48 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2015-06-22 14:47 [PATCH for-next V5 0/5] HW Device hot-removal support Yishai Hadas
     [not found] ` <1434984438-21733-1-git-send-email-yishaih-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2015-06-22 14:47   ` [PATCH for-next V5 1/5] IB/uverbs: Fix reference counting usage of event files Yishai Hadas
     [not found]     ` <1434984438-21733-2-git-send-email-yishaih-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2015-06-24 17:57       ` Jason Gunthorpe
     [not found]         ` <20150624175738.GC21033-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2015-06-25 11:46           ` Yishai Hadas
     [not found]             ` <558BE9FA.9060402-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
2015-06-25 17:52               ` Jason Gunthorpe
     [not found]                 ` <20150625175228.GF21033-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2015-06-28 14:33                   ` Yishai Hadas
     [not found]                     ` <559005A0.4070800-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
2015-06-29 17:40                       ` Jason Gunthorpe
     [not found]                         ` <20150629174038.GB2755-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2015-06-29 21:22                           ` Yishai Hadas
     [not found]                             ` <5591B6FA.4000400-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
2015-06-29 21:48                               ` Jason Gunthorpe
2015-06-22 14:47   ` [PATCH for-next V5 2/5] IB/uverbs: Explicitly pass ib_dev to uverbs commands Yishai Hadas
2015-06-22 14:47   ` [PATCH for-next V5 3/5] IB/uverbs: Enable device removal when there are active user space applications Yishai Hadas
     [not found]     ` <1434984438-21733-4-git-send-email-yishaih-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2015-06-24 18:25       ` Jason Gunthorpe
     [not found]         ` <20150624182519.GD21033-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2015-06-25 13:51           ` Yishai Hadas
     [not found]             ` <558C0775.4000104-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
2015-06-25 17:09               ` Jason Gunthorpe
     [not found]                 ` <20150625170937.GE21033-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2015-06-29 21:33                   ` Yishai Hadas
2015-06-22 14:47   ` [PATCH for-next V5 4/5] IB/mlx4_ib: Disassociate support Yishai Hadas
2015-06-22 14:47   ` [PATCH for-next V5 5/5] IB/ucma: HW Device hot-removal support Yishai Hadas

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.