Subject: Re: Data can't be wrote to XFS RIP [] xfs_dir2_sf_get_parent_ino+0xa/0x20
From: Kuo Hugo
Date: Wed, 22 Jul 2015 16:54:11 +0800
To: Brian Foster
Cc: Hugo Kuo, Eric Sandeen, Darrell Bishop, xfs@oss.sgi.com

Hi Brian,

>The original stacktrace shows the crash in a readdir request. I'm sure
>there are multiple things going on here (and there are a couple rename
>traces in the vmcore sitting on locks), of course, but where does the
>information about the rename come from?

I traced the application's source code. It moves data to a quarantine
area (another directory on the same disk) under certain conditions. The
bug report describes a condition in which a DELETE of an object (which
creates an empty file in the directory) combined with a listing of that
directory causes the data to be MOVEd (via os.rename) to the quarantine
area. That os.rename call is the only place in the application that
touches the quarantine directory.

>I'm not quite following here because I don't have enough context about
>what the application server is doing. So far, it sounds like we somehow
>have multiple threads competing to rename the same file..? Is there
>anything else in this directory at the time this sequence executes
>(e.g., a file with object data that also gets quarantined)?

The behavior above (a bug in the application) should not by itself
trigger a kernel panic. But yes, there are multiple threads competing to
DELETE (create an empty file) in the same directory while also moving
the existing file to the quarantine area, and I think that race is the
root cause of the panic. In this scenario, 10 application workers each
spawn a thread doing the same thing at the same moment.
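To make that sequence concrete, here is a minimal sketch of those
filesystem-level operations as I understand them (plain Python; the
paths, file names, and quarantine condition are placeholders, not
Swift's actual code):

    import os

    # Placeholder paths -- Swift's real on-disk layout differs.
    obj_dir = "/srv/node/d1/objects/1234/abc/deadbeef"
    quarantine_dir = "/srv/node/d1/quarantined/objects"

    for d in (obj_dir, quarantine_dir):
        if not os.path.isdir(d):
            os.makedirs(d)

    # DELETE: the server records the deletion by creating an empty
    # tombstone (*.ts) file inside the object's directory.
    tombstone = os.path.join(obj_dir, "1437555251.00000.ts")
    open(tombstone, "w").close()

    # Listing the directory afterwards: when the contents look wrong,
    # the existing data file is moved to the quarantine area. This
    # os.rename is the only call that touches the quarantine directory.
    for name in os.listdir(obj_dir):
        if not name.endswith(".ts"):
            os.rename(os.path.join(obj_dir, name),
                      os.path.join(quarantine_dir, name))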
>Ideally, we'd ultimately like to translate this into a sequence of
>operations as seen by the fs that hopefully trigger the problem. We
>might have to start by reproducing through the application server.
>Looking back at that bug report, it sounds like a 'DELETE' is a
>high-level server operation that can consist of multiple sub-operations
>at the filesystem level (e.g., list, conditional rename if *.ts file
>exists, etc.). Do you have enough information through any of the above
>to try and run something against Swift that might explicitly reproduce
>the problem? For example, have one thread that creates and recreates the
>same object repeatedly and many more competing threads that try to
>remove (or whatever results in the quarantine) it? Note that I'm just
>grasping at straws here, you might be able to design a more accurate
>reproducer based on what it looks like is happening within Swift.

We observed this issue on a production cluster, and it's hard to free up
spare machines with 100% identical hardware to test on right now. I'll
try to work out an approach to reproduce it along those lines, and I'll
update this thread if I manage to.

Thanks // Hugo
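P.S. Along the lines of the reproducer you suggested, something like
the following might exercise the same race from outside Swift. This is
a rough sketch under stated assumptions: the paths, thread counts, and
file names are made up, and the quarantine step is simplified to a bare
os.rename of everything found by the listing.

    import os
    import threading
    import time

    OBJ_DIR = "/mnt/xfs-test/objects"       # placeholder mount point
    QUAR_DIR = "/mnt/xfs-test/quarantined"
    STOP = threading.Event()

    def writer():
        # One thread creates and recreates the same object repeatedly.
        path = os.path.join(OBJ_DIR, "object.data")
        while not STOP.is_set():
            with open(path, "w") as f:
                f.write("payload")

    def quarantiner():
        # Many competing threads list the directory and rename whatever
        # they find into the quarantine area, mimicking the DELETE +
        # listing path described earlier in the thread.
        while not STOP.is_set():
            for name in os.listdir(OBJ_DIR):
                try:
                    os.rename(os.path.join(OBJ_DIR, name),
                              os.path.join(QUAR_DIR, name))
                except OSError:
                    pass  # another thread won the race

    for d in (OBJ_DIR, QUAR_DIR):
        if not os.path.isdir(d):
            os.makedirs(d)

    threads = [threading.Thread(target=writer)]
    threads += [threading.Thread(target=quarantiner) for _ in range(10)]
    for t in threads:
        t.start()
    time.sleep(60)   # let the race run for a while
    STOP.set()
    for t in threads:
        t.join()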