From: Brian Foster <bfoster@redhat.com>
To: Kuo Hugo <tonytkdk@gmail.com>
Cc: Hugo Kuo <hugo@swiftstack.com>,
	Eric Sandeen <sandeen@sandeen.net>,
	Darrell Bishop <darrell@swiftstack.com>,
	xfs@oss.sgi.com
Subject: Re: Data can't be wrote to XFS RIP [<ffffffffa041a99a>] xfs_dir2_sf_get_parent_ino+0xa/0x20
Date: Mon, 20 Jul 2015 11:12:56 -0400	[thread overview]
Message-ID: <20150720151256.GA17816@bfoster.bfoster> (raw)
In-Reply-To: <CA++_uhvwR1KucdHWnPzS5ysFuYyssFnUB95kS-piC_pRnq=dXw@mail.gmail.com>

On Mon, Jul 20, 2015 at 10:30:31PM +0800, Kuo Hugo wrote:
> Hi Brian,
> 
> > I don’t know much about the Swift bug. A BUG() or crash in the kernel
> > is generally always a kernel bug, regardless of what userspace is
> > doing. It certainly could be that whatever userspace is doing to
> > trigger the kernel bug is a bug in the userspace application, but
> > either way it shouldn’t cause the kernel to crash. By the same token,
> > if Swift is updated to fix the aforementioned bug and the kernel crash
> > no longer reproduces, that doesn’t necessarily mean the kernel bug is
> > fixed (just potentially hidden).
> 
> Understand.
> 
> [Previous Message]
> 
> The valid inode has an inode number of 13668207561.
> - The fsname for this inode is "sdb."
> - The inode does appear to have a non-NULL if_data:
> 
>     if_u1 = {
>       if_extents = 0xffff88084feaf5c0,
>       if_ext_irec = 0xffff88084feaf5c0,
>       if_data = 0xffff88084feaf5c0 "\004"
>     },
> 
>         find <mntpath> -inum 13668207561
> 
> Q1: Were you able to track down the directory inode mentioned in the
> previous message?
> 
> Ans: Yes, it’s the directory/file shown below. /srv/node/d224 is the
> mount point of /dev/sdb. This is the original location of the path.
> This directory now contains the file 1436266052.71893.ts, which is 0
> bytes in size.
> 
> 
> [root@r2obj01 ~]# find /srv/node/d224 -inum 13668207561
> /srv/node/d224/objects/45382/b32/b146865bf8034bfc42570b747c341b32
> 
> [root@r2obj01 ~]# ls -lrt
> /srv/node/d224/objects/45382/b32/b146865bf8034bfc42570b747c341b32
> -rw------- 1 swift swift 0 Jul 7 22:37 1436266052.71893.ts
> 
> Q2: Is it some kind of internal directory used by the application (e.g.,
> perhaps related to the quarantine mechanism mentioned in the bug)?
> 
> Ans: Yes, it’s a directory which is accessed by the application.
> 

Ok, so I take it that we have a directory per object based on some kind
of hash. The directory presumably contains the object along with
whatever metadata is tracked.
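
For reference, here's how I'm picturing that layout, based purely on the
paths quoted in this thread (the hash function, suffix handling and names
below are my guesses, not taken from the Swift source):

    # Sketch of the assumed on-disk layout, e.g.:
    #   /srv/node/d224/objects/45382/b32/b146865bf8034bfc42570b747c341b32
    # "b32" looks like the last three characters of the hash component.
    import hashlib
    import os

    def object_dir(mount, partition, name, hash_suffix=b""):
        # One directory per object, named by a hash of the object name;
        # md5 and the suffix are assumptions on my part.
        h = hashlib.md5(name.encode() + hash_suffix).hexdigest()
        return os.path.join(mount, "objects", str(partition), h[-3:], h)

    # A DELETE then apparently leaves a zero-length <timestamp>.ts
    # tombstone inside that directory, e.g. 1436266052.71893.ts.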

> 
>  37 ffff8810718343c0 ffff88105b9d32c0 ffff8808745aa5e8 REG  [eventpoll]
>  38 ffff8808713da780 ffff880010c9a900 ffff88096368a188 REG
> /srv/node/d224/quarantined/objects/b146865bf8034bfc42570b747c341b32/1436266042.57775.ts
>  39 ffff880871cb03c0 ffff880495a8b380 ffff8808a5e6c988 REG
> /srv/node/d224/tmp/tmpSpnrHg
> 
>  40 ffff8808715b4540 ffff8804819c58c0 ffff8802381f8d88 DIR
> /srv/node/d224/quarantined/objects/b146865bf8034bfc42570b747c341b32
> 
> The above operation in the swift-object-server was a Python function
> call to rename the file
> /srv/node/d224/objects/45382/b32/b146865bf8034bfc42570b747c341b32/1436266042.57775.ts
> to
> /srv/node/d224/quarantined/objects/b146865bf8034bfc42570b747c341b32/1436266042.57775.ts
> 
> os.rename(old, new)
> 
> And it crashed at this point. In Q1, we found that the inum points to
> the directory
> /srv/node/d224/objects/45382/b32/b146865bf8034bfc42570b747c341b32.
> 

The original stacktrace shows the crash in a readdir request. I'm sure
there are multiple things going on here (and there are a couple of
rename traces in the vmcore sitting on locks), of course, but where does
the information about the rename come from?

> We found multiple (over 10) DELETEs from the application against the
> target file at almost the same moment. Each DELETE removes the original
> file in the directory and creates a new empty .ts file in this
> directory. I suspect that multiple os.rename() calls on the same file
> in that directory cause the kernel panic.
> 
> And the file
> /srv/node/d224/quarantined/objects/b146865bf8034bfc42570b747c341b32/1436266042.57775.ts
> was not created.
> 

I'm not quite following here because I don't have enough context about
what the application server is doing. So far, it sounds like we somehow
have multiple threads competing to rename the same file..? Is there
anything else in this directory at the time this sequence executes
(e.g., a file with object data that also gets quarantined)?

Ideally, we'd ultimately like to translate this into a sequence of
operations as seen by the fs that hopefully trigger the problem. We
might have to start by reproducing through the application server.
Looking back at that bug report, it sounds like a 'DELETE' is a
high-level server operation that can consist of multiple sub-operations
at the filesystem level (e.g., list, conditional rename if *.ts file
exists, etc.). Do you have enough information through any of the above
to try and run something against Swift that might explicitly reproduce
the problem? For example, have one thread that creates and recreates the
same object repeatedly and many more competing threads that try to
remove (or whatever results in the quarantine) it? Note that I'm just
grasping at straws here, you might be able to design a more accurate
reproducer based on what it looks like is happening within Swift.
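
E.g., something along these lines, driving the filesystem directly
(purely a sketch -- all paths, counts and thread numbers here are made
up, and it may well need to go through the Swift code paths to hit the
right sequence):

    import os
    import threading

    OBJDIR = "/mnt/scratch/objects/45382/b32/hashdir"   # made-up paths
    QDIR = "/mnt/scratch/quarantined/objects/hashdir"

    def creator(count=100000):
        # recreate a zero-length .ts tombstone over and over
        for i in range(count):
            open(os.path.join(OBJDIR, "%d.ts" % i), "w").close()

    def quarantiner():
        # list the object dir (readdir, as in the crash stack) and race
        # to rename whatever shows up into the quarantine dir
        while True:
            for name in os.listdir(OBJDIR):
                try:
                    os.rename(os.path.join(OBJDIR, name),
                              os.path.join(QDIR, name))
                except OSError:
                    pass    # lost the race to another thread; keep going

    os.makedirs(OBJDIR)
    os.makedirs(QDIR)
    for _ in range(10):
        t = threading.Thread(target=quarantiner)
        t.daemon = True     # let the process exit when creator() returns
        t.start()
    creator()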

Brian

> Regards // Hugo

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

Thread overview: 23+ messages
2015-06-18 11:56 Data can't be wrote to XFS RIP [<ffffffffa041a99a>] xfs_dir2_sf_get_parent_ino+0xa/0x20 Kuo Hugo
2015-06-18 13:31 ` Brian Foster
2015-06-18 14:29   ` Kuo Hugo
2015-06-18 14:59     ` Eric Sandeen
2015-07-09 10:57       ` Kuo Hugo
2015-07-09 12:51         ` Brian Foster
2015-07-09 13:20           ` Kuo Hugo
2015-07-09 13:27             ` Kuo Hugo
2015-07-09 15:18             ` Brian Foster
2015-07-09 16:40               ` Kuo Hugo
2015-07-09 18:32                 ` Brian Foster
2015-07-10  5:36                   ` Kuo Hugo
2015-07-10 10:39                     ` Kuo Hugo
2015-07-10 16:25                       ` Kuo Hugo
2015-07-13 12:52                     ` Brian Foster
2015-07-13 14:06                       ` Kuo Hugo
2015-07-13 17:01                         ` Brian Foster
2015-07-13 18:10                           ` Kuo Hugo
2015-07-17 19:39                             ` Kuo Hugo
2015-07-20 11:46                               ` Brian Foster
2015-07-20 14:30                                 ` Kuo Hugo
2015-07-20 15:12                                   ` Brian Foster [this message]
2015-07-22  8:54                                     ` Kuo Hugo
