<subbu>
chrisseaton[m]1, documenting? or something else?
<chrisseaton[m]1>
Just writing some notes at the moment
<chrisseaton[m]1>
Will share with you before I publish
_whitelogger has joined #jruby
<subbu>
sounds good! thanks.
ur5us has quit [Ping timeout: 256 seconds]
_whitelogger has joined #jruby
ur5us has joined #jruby
drbobbeaty has joined #jruby
drbobbeaty has quit [Ping timeout: 256 seconds]
ur5us has quit [Ping timeout: 256 seconds]
nirvdrum has joined #jruby
jmalves has joined #jruby
<jmalves>
Hey all,
<jmalves>
We are seing a very sporadic NPE in our testing pipeline that seems to happen when calling a ruby block in Java. This block might be called from multiple threads concurrently. I have not been able to reproduce this NPE locally.
<jmalves>
Our JRuby version is 9.2.7.0. The NPE stack trace is:
<jmalves>
I was looking at the code in InterpretedIRBlockBody, and I was wondering if the method `commonYieldPath` can be called concurrently? If so that might explain the cause of this NPE.
<jmalves>
(I can create an issue about this if that would be more appropriate, still trying to get a reproduction :/)
<headius[m]>
jmalves: yo
<jmalves>
Hey Headius!
<headius[m]>
hmm ok let's see
<headius[m]>
off the top of my head, that path should be safe
<headius[m]>
there are places in the IR logic where we do some set-up lazily, though and it has bitten us on concurrency once in a while
<headius[m]>
I'll roll my source back to 9.2.7 and have a look at that line, but you should open an issue now
<jmalves>
The way the method ensureInstrsReady() is used + commonYieldPath can make interpreterContext be set to nil if fullInterpreterContext can ever be nil
<headius[m]>
it may be something we fixed but if you hit it, others might hit it... best to have a record
<headius[m]>
oh nice, you have looked into it a bit already
<headius[m]>
I think I remember looking into this at one point myself
<headius[m]>
it's the promotion from "startup" interpreter context to the full one, and wasn't being done atomically
<headius[m]>
enebo: do you remember fixing anything here?
<jmalves>
Like there is the nil check here `if (interpreterContext == null) {` then `interpreterContext = closure.getInterpreterContext(); ` gets called. Next thread does not into the if because `interpreterContext` is not null any longer so when it runs ` interpreterContext = fullInterpreterContext; ` this sets it back to nil
<jmalves>
(in the example above the first thread got paused at `interpreterContext = closure.getInterpreterContext();` i.e.: never run `fullInterpreterContext = interpreterContext;`)
<jmalves>
Not sure if I was clear or what I said made sense
<headius[m]>
yeah I think you're on the right track
<headius[m]>
there are changes in here since 9.2.7 but I'm not sure if they fix the problem or not
<headius[m]>
I would say probably not
drbobbeaty has joined #jruby
<headius[m]>
ick, so many state changes
<headius[m]>
well this should be easy enough to reproduce... generate new blocks in a loop and try to call them concurrently
<headius[m]>
pretty aggressively threading there but no NPE
<headius[m]>
the eval might be messing with it... we do things slightly differently for evals in some cases
<headius[m]>
jmalves: thank you
<jmalves>
Why do you do `block = eval " ... "` ?
<headius[m]>
to make sure it's a new block every time... otherwise if it safely transitions to "full" it will never fail after that
<headius[m]>
you don't see it often in your app because you have to catch it just right
<jmalves>
@headius I was debugging your script and it does not seem to call `commonYieldPath`
<headius[m]>
oh good catch
<headius[m]>
it probably calls the "direct" logic
<jmalves>
Actually it is a little bit more under there in the stack
<jmalves>
at org.jruby.runtime.InterpretedIRBlockBody.ensureInstrsReady(InterpretedIRBlockBody.java:62)
<jmalves>
at org.jruby.runtime.InterpretedIRBlockBody.yieldDirect(InterpretedIRBlockBody.java:104)
<jmalves>
at org.jruby.runtime.IRBlockBody.yieldSpecific(IRBlockBody.java:85)
<jmalves>
at org.jruby.runtime.Block.yieldSpecific(Block.java:134)
<jmalves>
at org.jruby.RubyFixnum.times(RubyFixnum.java:278)
<jmalves>
at org.jruby.RubyInteger$INVOKER$i$0$0$times.call(RubyInteger$INVOKER$i$0$0$times.gen:-1)
<jmalves>
at org.jruby.internal.runtime.methods.JavaMethod$JavaMethodZeroBlock.call(JavaMethod.java:555)
<jmalves>
at org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:80)
<jmalves>
at org.jruby.runtime.callsite.CachingCallSite.callIter(CachingCallSite.java:89)
<jmalves>
at org.jruby.ir.instructions.CallBase.interpret(CallBase.java:537)
<headius[m]>
I see an unrelated concurrency issue... if the block is promoted to a full build at the same time a thread updates the ic, it may overwrite the full build
<jmalves>
at org.jruby.ir.interpreter.InterpreterEngine.processCall(InterpreterEngine.java:362)
<jmalves>
at org.jruby.ir.interpreter.StartupInterpreterEngine.interpret(StartupInterpreterEngine.java:72)
<jmalves>
at org.jruby.ir.interpreter.Interpreter.INTERPRET_BLOCK(Interpreter.java:128)
<jmalves>
at org.jruby.runtime.InterpretedIRBlockBody.commonYieldPath(InterpretedIRBlockBody.java:137)
<headius[m]>
I will just comment out the direct logic
<jmalves>
headius was able to reproduce locally using our own code and eval!
<headius[m]>
oh nice!
<jmalves>
Now I am not sure how I can generalize this :/
subbu is now known as subbu|away
<headius[m]>
can you show me the code where you create the block?
<headius[m]>
you have that ConcurrentHashMap snippit but is that it?
<headius[m]>
still no luck on my end... I'll let you play with it for a bit in your codebase and see if you can narrow it down
<headius[m]>
I'm auditing places where we set IRScope.interpreterContext to see why it would be null at this point
<jmalves>
Thanks for taking a look at this @headius. We have only seen this problem in our specs but I suspect it could happen also out of them. However outside of specs it is not too common to call these methods from ruby.
<headius[m]>
well I'm glad you haven't seen it in production yet
<headius[m]>
hopefully we'll get you on a fixed release soon
<headius[m]>
hey I have to run for a bit now but you could try it with JRuby 9.2.11.0 since the repro is small now
nirvdrum has quit [Ping timeout: 240 seconds]
<headius[m]>
not failing won't be conclusive but perhaps encouraging
<jmalves>
I also have to go now, I will try that tomorrow!
<headius[m]>
ok, I'll let my loop run
jmalves has quit [Remote host closed the connection]
<headius[m]>
jmalves: in the future, don't assume this is something you're doing wrong... NPE specifically always means a bug to me (treat it like a hard crash), and pretty much any other low-level Java exception probably means bug too
subbu|away is now known as subbu
jmalves has joined #jruby
jmalves has quit [Remote host closed the connection]
jmalves has joined #jruby
jmalves has quit [Ping timeout: 264 seconds]
jmalves has joined #jruby
jmalves has quit [Remote host closed the connection]
<enebo[m]>
HUZZAH...clearly IR serialization got fixed by mergingh ir.print option flag PR :)
<enebo[m]>
so something odd happened in that sequel run. Is it a special day like Mayan end of the world?
<headius[m]>
That's weird
<headius[m]>
I didn't look at the failures on the branch but maybe it was some network glitch that affected everything
<headius[m]>
I feel like we should continue moving jobs off of Travis
<chrisseaton[m]1>
Feels like travis is getting worse more quickly now
<headius[m]>
Well I think they have one person maintaining it now
<headius[m]>
I really don't understand how they couldn't make a business out of it
<headius[m]>
Somebody majorly screwed up their market position
<kalenp[m]>
headius Following up from that screenshot I shared on Friday, we aren't able to upgrade to 9.2.11 because of the IR serialization bug. But looks like that just got fixed! Looking forward to the next release so we can pick that up.
<headius[m]>
Well lucky you it looks like we will do a 9.2.12
<kalenp[m]>
I don't twitter, but I might have some coworkers (Looker) who could share it.
<headius[m]>
Surprising number of people have hit that issue
<kalenp[m]>
Some of them may have been coworkers ;)
<headius[m]>
Sounds great!
drbobbeaty has quit [Ping timeout: 265 seconds]
jmalves has joined #jruby
jmalves has quit [Ping timeout: 264 seconds]
snickers has joined #jruby
ur5us has joined #jruby
<headius[m]>
jmalves: one of our JVM friends appears to have figured out JVM flags to reproduce it 100%
<headius[m]>
I'm got my best people working on it 🧐
<headius[m]>
enebo: can't remember if you met Charlie Gracie from J9 team but he moved over to working on Hotspot at I think Microsoft
<jeremyevans>
headius[m]: failure could be rounding issue specific to jdbc-sqlite3, I guess
<headius[m]>
it passed on the subsequent build so something's flaky somewhere
<enebo[m]>
headius: ah I saw him comment and I was confused...ok he is on hotspot codebase now :)
<enebo[m]>
shadowing variable was removed from Ruby
<jeremyevans>
I'll see if I can reproduce in a loop
<enebo[m]>
looks like end of 2018
<headius[m]>
yeah it's been over a year
<enebo[m]>
Another rando jruby option in grey
<headius[m]>
I think he's still in Ottawa though
<headius[m]>
jmalves: oops sorry, I meant chrisseaton ... got my bugs crossed
<headius[m]>
jmalves: I let your stuff run for hours and did not reproduce
<chrisseaton[m]1>
My bug with inlining? Shouldn't need to let it run for hours - it only compiles once (unless there are other bugs) and it's either inlined or it isn't. Should be able to confirm or not in seconds.
<headius[m]>
chrisseaton: no, that was actually for jmalves
<headius[m]>
you get the Charlie Gracie flags which I think you might have seen already
<headius[m]>
and it repros without tiered compilation so there's something else wrong
<chrisseaton[m]1>
Oh sorry I understand now
<headius[m]>
two JDK bugs in two weeks, we're on a roll
<enebo[m]>
Any idea on how often this non-inlining happens?
<headius[m]>
we were able to reproduce it almost every time yesterday without any special flags
<enebo[m]>
I more mean how many sites end up being affectred
<headius[m]>
in theory it could be affecting every indy ruby to ruby call site
<headius[m]>
I doubt it affects ruby to java call sites
<enebo[m]>
That would be pretty amazing based on the fact indy seems to help
<headius[m]>
yeah see my last point... a very large percentage of calls are into core classes, and at the moment there's no evidence to indicate they're affected
<enebo[m]>
Ah yeah I guess so but if you are looking at Rails I wonder a bit...avticesupport pretty much wraps a lot of stuff
<headius[m]>
right, something like rails probably tips that scale a lot
<enebo[m]>
and goes through about 200 levels of ruby stack :)
<headius[m]>
so it could still be a very big opportunity
<headius[m]>
even when they call core classes they usually do it through some monkey patch
<enebo[m]>
but that is a funny observation...most microbenchmarks have a large amount of core being called
<headius[m]>
there's some classloader involvement here because it started showing up when Chris forced the methods to JIT instead of AOT
<enebo[m]>
well the monkey patch was my original AS comment but I don't know hwo many actually get consumed...at this point just thinking about AR there are many many levels of ruby to ruby calls
<headius[m]>
we do most of our microbenchmarking at CLI, which AOTs
<enebo[m]>
but we are tangenting a bit too much that it might be all ruby to ruby calls
<enebo[m]>
that is also true
<headius[m]>
in any case there's gobs of small Ruby methods getting hit very hard in Rails, and those not inlining is a big loss
<enebo[m]>
yeah
<enebo[m]>
ok this used warning stuff is almost working but dinner prep has come upon me
<headius[m]>
disabling tiered would have been a decent workaround but Charlie's flags threw water on that
<headius[m]>
so we have no workaround again
<headius[m]>
possibly disabling one shot classloaders but that's not really workable
<headius[m]>
all failures during the build, which doesn't use 9.3 🤔
jmalves has joined #jruby
snickers has quit [Quit: My MacBook has gone to sleep. ZZZzzz…]
jmalves has quit [Ping timeout: 256 seconds]
<headius[m]>
I'm going to restart this and see what happens
<chrisseaton[m]1>
enebo: a problem with my bug is that you don't see it if you do a single file test case, so anyone exploring JRuby assembly output would likely miss it
<headius[m]>
we could probably soften the CLI AOT since we can jit blocks properly now
<headius[m]>
that was the main reason we did AOT, since so many benchmarks are just a bag of toplevel blocks
<headius[m]>
sigh, of course the fails before passed now... I don't know whether to blame maven, JRuby 9.1, or travis
<chrisseaton[m]1>
It's benchmarks with while loops that are the real problem, I guess
<headius[m]>
yeah, OSR would be a pretty big challenge
<headius[m]>
but at that point we're adding optimization mostly for benchmarking
<chrisseaton[m]1>
Don't want to minimise work needed, but why? You already have a side-stack for the backtrace and local variables if needed don't you?
<headius[m]>
the interpreter that runs at boot doesn't execute things the same way as as bytecode, so the variables would be rather different
<chrisseaton[m]1>
We're looking at expanding kinds of OSR in TruffleRuby, actually not for benchmarking reasons but to we can get smaller compilation units that compile independently - like massive switches. I thought JRuby did that to some extent?
<headius[m]>
it's doable using the same sort of mechanism as deopt would use, but we don't have that yet either
<headius[m]>
we used to "chunk" large methods but it was only based on top-level lines... so nothing inside a loop or switch
<chrisseaton[m]1>
That's what I was thinking of
<headius[m]>
we don't have that currently for the IR-based pipeline
<headius[m]>
it may be easier to chunk things in IR... it's somewhere on the long list of opportunities
<headius[m]>
chunking also introduces challenges for stack traces, since we have to mine the Java stack trace and translate it into Ruby... chunking can't cause there to be double frames