Hi Brain,

>The original stacktrace shows the crash in a readdir request. I'm sure
>there are multiple things going on here (and there are a couple rename
>traces in the vmcore sitting on locks), of course, but where does the
>information about the rename come from?

I traced through the source code of the application. It moves data to a quarantined area (another folder on the same disk) under some conditions. The bug report describes one such condition: a DELETE of an object (which creates an empty file in a directory) followed by a listing of that directory causes the data to be moved (os.rename) to the quarantined area. That os.rename call is the only place in the application that touches the quarantined folder.

>I'm not quite following here because I don't have enough context about
>what the application server is doing. So far, it sounds like we somehow
>have multiple threads competing to rename the same file..? Is there
>anything else in this directory at the time this sequence executes
>(e.g., a file with object data that also gets quarantined)?

The behavior described above (a bug in the application) should not trigger a kernel panic. Yes, there are multiple threads competing to DELETE (create an empty file) in the same directory and also to move the existing file to the quarantined area. I think this is the root cause of the kernel panic. The scenario is 10 application workers raising 10 threads to do the same thing at the same moment.

>Ideally, we'd ultimately like to translate this into a sequence of
>operations as seen by the fs that hopefully trigger the problem. We
>might have to start by reproducing through the application server.
>Looking back at that bug report, it sounds like a 'DELETE' is a
>high-level server operation that can consist of multiple sub-operations
>at the filesystem level (e.g., list, conditional rename if *.ts file
>exists, etc.). Do you have enough information through any of the above
>to try and run something against Swift that might explicitly reproduce
>the problem? For example, have one thread that creates and recreates the
>same object repeatedly and many more competing threads that try to
>remove (or whatever results in the quarantine) it? Note that I'm just
>grasping at straws here, you might be able to design a more accurate
>reproducer based on what it looks like is happening within Swift.

We observed this issue on a production cluster. It's hard to free up a machine with exactly the same hardware to test on right now. I'll try to figure out an approach to reproduce it, and I'll update this mail thread if I manage to.

Thanks // Hugo
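
P.S. As a starting point for a filesystem-level reproducer, here is a minimal sketch of the sequence described above: several threads each create an empty *.ts file in the same directory, list the directory, and os.rename whatever else they find into a quarantine folder. This is not the application's actual code. The function names (worker, run), the directory names, the file naming scheme, and the worker and round counts are all hypothetical placeholders, and I have not confirmed that this triggers the panic.

```python
import os
import tempfile
import threading


def worker(worker_id, obj_dir, quarantine_dir, rounds):
    """One hypothetical worker: create a tombstone, list, quarantine the rest."""
    for i in range(rounds):
        # "DELETE": create an empty file in the shared object directory.
        # The naming scheme is an assumption; it only needs to be unique.
        name = "w%d-r%d.ts" % (worker_id, i)
        open(os.path.join(obj_dir, name), "w").close()

        # List the directory and move every other entry to quarantine.
        for entry in os.listdir(obj_dir):
            if entry == name:
                continue
            try:
                os.rename(os.path.join(obj_dir, entry),
                          os.path.join(quarantine_dir, entry))
            except FileNotFoundError:
                # Another thread renamed this entry first. That is the
                # competition we are trying to provoke, so ignore it.
                pass


def run(base_dir, workers=10, rounds=100):
    """Run the competing workers; return (remaining, quarantined) counts."""
    obj_dir = os.path.join(base_dir, "objects")
    quarantine_dir = os.path.join(base_dir, "quarantined")
    os.makedirs(obj_dir, exist_ok=True)
    os.makedirs(quarantine_dir, exist_ok=True)

    threads = [
        threading.Thread(target=worker,
                         args=(n, obj_dir, quarantine_dir, rounds))
        for n in range(workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(os.listdir(obj_dir)), len(os.listdir(quarantine_dir))


if __name__ == "__main__":
    # A temporary directory keeps the sketch safe to run anywhere. To test
    # a specific filesystem, pass a directory on that mount instead.
    with tempfile.TemporaryDirectory() as base:
        remaining, quarantined = run(base)
        print("remaining:", remaining, "quarantined:", quarantined)
```

Both folders live under the same base directory so that os.rename stays within one filesystem, as it does in the application. To get closer to the production setup, separate processes could replace the threads.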