rusk has quit [Write error: Connection reset by peer]
rusk has joined #jruby
rusk has quit [Read error: Connection reset by peer]
rusk has joined #jruby
drbobbeaty has quit [Ping timeout: 260 seconds]
nirvdrum has joined #jruby
shellac has joined #jruby
nirvdrum has quit [Ping timeout: 268 seconds]
_whitelogger has joined #jruby
nirvdrum has joined #jruby
_whitelogger has joined #jruby
nirvdrum has quit [Ping timeout: 268 seconds]
drbobbeaty has joined #jruby
lucasb has joined #jruby
shellac has quit [Quit: Computer has gone to sleep.]
shellac has joined #jruby
<headius[m]>
Good morning!
<headius[m]>
enebo: I just realized that compileAndLoadFile was still parsing
<enebo[m]>
ah ok I was hoping something was happening
<headius[m]>
I'm trying to refine things a bit before getting back to slides
<headius[m]>
also looking at adding the cache logic to non-compile loadFile
<headius[m]>
-X+C has other side effects like forcing early JIT
<enebo[m]>
yeah personally I think we want --dev to work with caching
<enebo[m]>
so I think caching should be somewhat orthogonal
<headius[m]>
right
<headius[m]>
this would allow that
nirvdrum has joined #jruby
sagax has quit [Remote host closed the connection]
sagax has joined #jruby
nirvdrum has quit [Ping timeout: 268 seconds]
xardion has quit [Remote host closed the connection]
xardion has joined #jruby
<headius[m]>
playing with some other tooling to dig deeper...
<headius[m]>
-XX:+TraceBytecodes in JDK debug builds prints out every bytecode executed
<headius[m]>
jruby --dev -e 1 executes over 33M
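A rough sketch of the kind of run being described, assuming a fastdebug JDK build (the flag only exists in debug builds) and JRuby's -J passthrough for JVM flags; counting trace lines is only a crude proxy for bytecodes executed:
    JAVA_HOME=/path/to/jdk-fastdebug jruby -J-XX:+TraceBytecodes --dev -e 1 | wc -l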
<enebo[m]>
so did decoupling improve things a bit?
<headius[m]>
flame graph puts most of the boot time in the preludes, which largely points at RubyGems
<enebo[m]>
-X+C will force all IR executed to do a full build
<headius[m]>
a bit, I'll try to get some details in a moment
<headius[m]>
driving today's exploration from a slide perspective
<headius[m]>
enebo: I will push remaining local changes
<enebo[m]>
ok
<headius[m]>
heh with cached classes we execute 35M
nirvdrum has joined #jruby
<enebo[m]>
hmm
<enebo[m]>
so colder since it is all new code
<headius[m]>
cold vs hot doesn't matter for this measure
<enebo[m]>
I believe one big issue though is that this will still do a filesystem search for all files
<headius[m]>
that's certainly true...the classes are based on the full path of a found file
<enebo[m]>
and it has to resolve via LOAD_PATH
<enebo[m]>
Not really sure I want to go there, but if we said "up to point X we just load this script of stuff" then all that filesystem activity would disappear
<enebo[m]>
I definitely feel that has issues we don't want
<enebo[m]>
I am pretty curious about the cost of continuous resolution though
<enebo[m]>
anyways I am going on a tangent since file activity really ends up being like 1 more File create
<enebo[m]>
both before and after are doing load_path resolution, and persist just checks one more file per resolution
<enebo[m]>
I think more broadly we could do a resolutionless core boot, but that would be a future thing
<enebo[m]>
So as future work it would be nice to see how expensive resolution is
<headius[m]>
resolving files is certainly a nontrivial cost
<headius[m]>
I'm not sure how it compares to actually parsing+compiling+interpreting those files though
<enebo[m]>
well at one point someone did look at that a bit (nirvdrum?) and it was a significant cost
<enebo[m]>
I don't recall how significant (I thought more than 10% but less than 20%)
<enebo[m]>
nirvdrum: do you recall...or are you even the one I am thinking of
<enebo[m]>
This obviously was a long time ago as well
<headius[m]>
emitting cached classes with indy increases the count to 48M
<headius[m]>
enebo: my changes are pushed
<headius[m]>
-X+C -Xcompile.class.cache to dump, then just -Xcompile.class.cache to use it (works with --dev)
<headius[m]>
oh not just that
<headius[m]>
also CLASSPATH
<headius[m]>
for making use of them that is
<enebo[m]>
so CLASSPATH=~/.ir
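A minimal sketch of that two-step flow; the property name was still in flux during this discussion (it also appears as compile.cache.classes below) and the script name is a placeholder:
    JRUBY_OPTS="-X+C -Xcompile.class.cache" jruby myscript.rb                    # full compile, dump classes to ~/.ir
    CLASSPATH=~/.ir JRUBY_OPTS="-Xcompile.class.cache" jruby --dev myscript.rb   # reuse the dumped classes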
<enebo[m]>
qualitatively, class loading of AOT'd code removes parse+build+interp but adds an extra file load + reflective construction + bytecode interp
<headius[m]>
I pushed the commit that uses system classloader also
<headius[m]>
we'll want to revisit that perhaps
<headius[m]>
enebo: yes that's about the size of it
<enebo[m]>
when using appcds we still have to check the files in ~/.ir, right?
<enebo[m]>
even though we will not load that
<enebo[m]>
err meaning we will check status of that but it will already be in VM
<headius[m]>
appcds only works with jars so you have to jar up everything in .ir first
<headius[m]>
but yes still need that jar on classpath with appcds
<enebo[m]>
yeah I get that, but does that mean you are then checking the jar, or ~/.ir, to see if it's up to date?
<headius[m]>
FWIW using a jar instead of dir in CLASSPATH also helps
<headius[m]>
filesystem access for classloading appears to be a good bit slower
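One way the jar-up step might look, assuming the dumped classes live under ~/.ir; jar cf is standard JDK tooling and the archive name is a placeholder:
    (cd ~/.ir && jar cf ~/jruby-ir-cache.jar .)
    CLASSPATH=~/jruby-ir-cache.jar jruby --dev myscript.rb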
<headius[m]>
the two flags you want for CDS are -XX:ArchiveClassesAtExit=blah.jsa and then -XX:SharedArchiveFile=blah.jsa to use it
<enebo[m]>
Could you maybe make a cheat sheet gist
<headius[m]>
I think that archive flag might have been introduced in 13...prior to that you had to do a three-step dance to generate a list of classes to cache
<headius[m]>
I'm testing 13 and 14 mostly
<enebo[m]>
ah also I am not trying 9+ yet but I should
<headius[m]>
yeah I'll do that for slides anyway but prioritize it today
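A cheat-sheet style sketch of the CDS steps above, using JRuby's -J passthrough (archive and script names are placeholders; per the note above, ArchiveClassesAtExit needs JDK 13+):
    jruby -J-XX:ArchiveClassesAtExit=jruby-app.jsa --dev myscript.rb   # trial run: record loaded classes into the archive at exit
    jruby -J-XX:SharedArchiveFile=jruby-app.jsa --dev myscript.rb      # subsequent runs: map classes from the archive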
<headius[m]>
ok this is a bit of good news
<headius[m]>
Using just --dev on JDK13, -e 1 takes 1.367s
<headius[m]>
With cached classes and CDS, that reduces to 1.267s
<headius[m]>
unfortunately with no cached classes (interp only) and CDS it drops to 1.081s
<headius[m]>
CDS is a win but cold bytecode hurts
<enebo[m]>
more good news
<enebo[m]>
cache.classes with --dev for 'echo exit | rails c' on a one-controller app is 14.6-15s, but with just --dev and no caching it is 17.2-17.6s
<enebo[m]>
yeah definite win...on Java 8
<enebo[m]>
whoa Java 13 is 10.8s
<enebo[m]>
without even enabling parallel...let's see what it is with just --dev
<headius[m]>
well that's promising
<enebo[m]>
13.7 with --dev alone
<headius[m]>
-e may be a poor test case for this, which is interesting in itself
<enebo[m]>
gem list is a poor one too apparently
<headius[m]>
are you passing those flags as env var?
<enebo[m]>
yeah since it is double invoke
<headius[m]>
pretty sure rails c spawns a second jruby
<headius[m]>
ok
<enebo[m]>
time echo exit | CLASSPATH=/home/enebo/.ir JRUBY_OPTS="-Xcompile.cache.classes" jruby --dev -S rails c >/dev/null 2>&1
<enebo[m]>
I realize I messed up here
<enebo[m]>
only --dev for first invoke
rusk has quit [Read error: Connection reset by peer]
<headius[m]>
oh I moved the logging output to another property for now
<enebo[m]>
7.8s with parGC
<headius[m]>
I'll push that change
<enebo[m]>
good...that should eliminate the /dev/null need
<enebo[m]>
I do find it weird how much faster this mangy 13 is than 8
<headius[m]>
yeah that is a bit surprising to me
<enebo[m]>
could we add jlink to the mix here and kill some crap on JVM load itself
<headius[m]>
when running on 9+ we have to load JRuby into system classloader, which means bytecode verification is back on
<headius[m]>
probably?
<enebo[m]>
This Java 13 is from my last machine so I will update that and probably update JVMs
<headius[m]>
that's a tool I have not learned yet
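For later reference, a minimal jlink sketch; the module list is only a guess (JRuby likely needs more, e.g. java.management) and the output path is a placeholder:
    jlink --module-path "$JAVA_HOME/jmods" --add-modules java.base,java.logging,java.sql,jdk.unsupported --output ./slim-jre
    JAVA_HOME=./slim-jre jruby --dev -e 1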
<headius[m]>
apparently JDK14 has some hidden enhancements to save some parts of heap as well
<headius[m]>
one thing does occur to me about -e 1 versus rails c
<headius[m]>
in the former nearly everything executes exactly once
<enebo[m]>
so why would 8 be so much slower than 13 if that is the case (verification)
<enebo[m]>
almost feels like the opposite
<headius[m]>
in the latter I would expect a lot more code is actually being used at runtime
<headius[m]>
yes that I don't know
<headius[m]>
13 is slower for -e 1
<headius[m]>
I wish we had a single-process example of the size of rails c
<headius[m]>
I would like to get a bytecode trace of something large and see what the flame graph looks like
<enebo[m]>
12.6 and 10.8 for Java 8 which is considerably slower for me but this 8 is u172 which is outdated
<headius[m]>
the flame graph for -e 1 is dominated by loading preludes as I mentioned above
<headius[m]>
I doubt 8 will have improved much on this over the past year
<enebo[m]>
So I bet we win on rails because some of the JIT loaded code is hot
<enebo[m]>
-e 1 literally does nothing
<headius[m]>
yeah that's what I'm thinking too
<enebo[m]>
I mean it is fine to focus on it to understand things, but it seems reasonable that having all that stuff as JITed code leads to other wins
<headius[m]>
-e boots a bunch of classes and does a little bit of work in RG but that's it
<headius[m]>
and the classes being loaded, compared with a hot interpreter, appear to be a loss
<enebo[m]>
This is a pretty big difference
<enebo[m]>
11.4s -> 8.3s without even doing appcds
<headius[m]>
oh yeah there's clearly other wins though I'm glad you're seeing startup improvement too
<headius[m]>
one big win will be warmup time, which we may or may not have time to try
<enebo[m]>
heap should also get better
<headius[m]>
but going straight to jitted code for almost everything should clearly reduce the warmup curve
<enebo[m]>
I think this will be a winner but removing reflection could end up being another tier for warming up faster
<headius[m]>
hmm I'm still having problems getting a full script to properly calculate flags
<headius[m]>
if you see some of my commits last night I'm now trying to force interpreter context to get created on parent + children and stuff like that
<headius[m]>
so much of this is lazy and it's not clear how to force everything to happen
<enebo[m]>
could you verify my single-controller 'echo exit | rails console' timings
<headius[m]>
I just ran into another file that apparently didn't see that it evals
<headius[m]>
rails 6?
<headius[m]>
can you push your app somewhere perhaps?
<enebo[m]>
good question
<enebo[m]>
hahah rails 6 rc2
<enebo[m]>
I should update that :)
<enebo[m]>
but honestly we should not see much difference
<enebo[m]>
ok so that is another thing I will do today is update some software
<enebo[m]>
latest 8,13,14 and rerun and make a current Rails 6 app
<enebo[m]>
The other fun part for this talk could be seeing how redmine changes, assuming we do not hit a bug
<enebo[m]>
Wouldn't it be grand to see a win like quicker warmup
<headius[m]>
enebo: I can confirm it once you have an app I can run...just want to make sure I'm doing the exact same thing you are
<nirvdrum>
enebo[m]: It was shaving seconds off for me at the time. That was pre 9k though.
<headius[m]>
I am feeling better about this stuff now and we haven't even started to look at heap and warmup
<headius[m]>
9k did remove .class searching among other things
<enebo[m]>
nirvdrum: yeah I do remember it being substantial
<headius[m]>
that cuts the number of file searches in half
<nirvdrum>
enebo[m]: A sticking point at the time was the file watcher APIs weren't available in whatever the latest JDK that Ruboto supported was.
<enebo[m]>
HAHAHA LOL
<headius[m]>
I'd love to get that cache updated and back in
<headius[m]>
and I need to revisit my load branch with all my copious free time
<nirvdrum>
Bootsnap does a lot of this now, too, I believe.
<enebo[m]>
ok I guess I have a lunch date...so I have to go, but we have something interesting and should get a first swipe at the talk today
<headius[m]>
I have started throwing some slides together
<headius[m]>
I will need to include some comparison with other Ruby impls so I might have to re-run a few things from my startup time post
<nirvdrum>
The basic idea is `require` needs to be responsive to changes to the filesystem to be compliant, but the filesystem likely doesn't change that much during execution. And it would never impact already loaded files. But, because of that feature, the `require` can't cache information about the fs contents and so has to do a full search every time. I added a file watcher so you could cache the contents and let the file watcher invalidate the cache.
<nirvdrum>
All of this gets compounded by RubyGems appending stuff to the $LOAD_PATH. So, if you require a gem and that then requires files within the gem, that could result in a full walk of $LOAD_PATH even though the file you want to load is almost certainly at the end of the $LOAD_PATH.
<headius[m]>
mmm good point
<headius[m]>
and I'm not sure there's any tool to condense an entire app into a single dir
<nirvdrum>
But, it's possible a bare `require` from a gem picks up a file from something earlier in the `$LOAD_PATH` and that needs to be honored. Fun stuff.
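A quick way to see how long the effective load path gets for a bundled app, which is why a single bare require can fan out into many filesystem checks (the app directory is hypothetical; bundler/setup assembles $LOAD_PATH from the Gemfile.lock):
    cd myapp && jruby -rbundler/setup -e 'puts $LOAD_PATH.length'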
<headius[m]>
bundler can relocate gems into a vendor dir but I believe they still get their own entries
<nirvdrum>
IIRC, this was all exacerbated running in a type 2 hypervisor because syscalls are relatively expensive there.
<headius[m]>
enebo: the other half of this we can talk about is that being able to run mostly from bytecode feeds into native+AOT options across the board
rusk has joined #jruby
<headius[m]>
so graalvm native is an extreme case but we can use it to compile an entire Ruby app to native code for almost no warmup and much faster startup
<headius[m]>
and openj9 JIT caching can get a big boost by saving C1-like native code
<headius[m]>
I am actually surprised I have not run into more scripts that are too big to compile
<headius[m]>
almost everything for rails compiles to bytecode ok
<fzakaria[m]>
The bundler maintainer broke down the classloading bug in JRuby really nicely:
<fzakaria[m]>
Ok, so this seems to be due to a weird inconsistency in jruby that should probably be fixed, where Pathname#expand_path with an argument always returns uri:classloader:// (double slash) as the canonical version of "uri:classloader" paths, while Pathname#expand_path without an argument always returns uri:classloader/ (single slash) as the canonical version of "uri:classloader" paths.
rusk has quit [Read error: Connection reset by peer]
<headius[m]>
enebo: so far I can't reproduce your result
rusk has joined #jruby
<headius[m]>
nirvdrum: not to mention filesystem access
rusk has quit [Remote host closed the connection]
<headius[m]>
filesystem access does indeed continue to show up in profiles
<headius[m]>
hard to know what percentage of actual wall clock time is wasted on filesystem access
<nirvdrum>
Yeah. If you're not on an SSD, you'll definitely feel the seek time. But even if you are on SSD, you might have problems in virtual/cloud environment. At least that was what I was seeing.
<enebo[m]>
it is possible for sure so I guess we need to figure out if there is a problem or not
<headius[m]>
it is counter-intuitive to me that the cached mode is running more bytecodes than pure interp
<headius[m]>
I'm trying your command line on a bare rails 6 app now
<enebo[m]>
my actual numbers will not match yours as I have no yarn installed currently
<enebo[m]>
but both modes hit the same point so that should not differ
<enebo[m]>
correction: I don't have yarn or nodejs right now
<enebo[m]>
ok so I totally nuked ~/.ir and did a full runner script to generate an app
<enebo[m]>
Now the difference between cache with and without --dev is 0.3s (14.3 vs 14.0)
<enebo[m]>
so cache is still faster but weirdly not dramatically so
<headius[m]>
that almost sounds like noise
<enebo[m]>
this is also with 6.0.2.1 of rails
<enebo[m]>
except it seems consistent
<enebo[m]>
so I am going to nuke ~/.ir and do a single rails c with -X+C and see if it is faster again
<headius[m]>
I'm getting 12.1s for --dev and 14.6s for --dev with cache
<enebo[m]>
I compiled soooo much code running all those rails commands
sagax has joined #jruby
<headius[m]>
ah hell
<headius[m]>
I thought this trace thing turned off jit but it doesn't
<headius[m]>
so that's interesting but I have to rerun numbers
<headius[m]>
it would make sense that cached would run more bytecodes because less stuff gets hot
<headius[m]>
in the interpreted case the interpreter itself gets hot really fast
<headius[m]>
at least I have an explanation now
<headius[m]>
I didn't notice this line in the readme:
<headius[m]>
To get the full picture of what's going on, especially on a longer running program, it might be interesting to run your program with -Xint, since as things start to get compiled they will disappear from view...
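Following that advice, a sketch that combines the trace with the JVM's interpreter-only mode so JIT-compiled methods don't drop out of view (again assumes a fastdebug build):
    jruby -J-Xint -J-XX:+TraceBytecodes --dev -e 1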
<enebo[m]>
ok so this is interesting
<enebo[m]>
I nuked ~/.ir and did rails c once with -X+C
<enebo[m]>
then timed the echo exit rails c thing
<enebo[m]>
boom the gain is back
<enebo[m]>
14.3s vs 12.4s
<enebo[m]>
So seemingly compiling only that much code AOT was helpful, but if I compile a lot more code by calling 5-6 rails commands a lot of that gain disappears
<headius[m]>
hmmm
<headius[m]>
when did you generate it last?
<headius[m]>
one thing we need to be careful of is that this does not do any sanity checking of the cached classes right now
<headius[m]>
so it could be that they're stale and it loads them up and blows up and then goes on to interpret anyway
<enebo[m]>
I nuked both times and started from scratch
<headius[m]>
my commits today did have bytecode changes
<enebo[m]>
before runner and then again for the rails console change
<headius[m]>
hmm ok
<headius[m]>
I have been wiping out too but not seeing gains
<enebo[m]>
but I did not have HEAD; I think I was lacking the last two commits
<enebo[m]>
I will retest again though
<headius[m]>
so the trace numbers I have now will still be interesting
<headius[m]>
since this does not reflect code that jits, it shows how much cold bytecode is executing, which is more in the cached case
<headius[m]>
the total bytecode execution count may take a while to generate but it should show the gain in overall execution of JIT compiled script versus interpreter
<headius[m]>
executing 33M bytecodes cold is pretty ugly and shows the problem pretty well
nirvdrum has joined #jruby
ur5us has joined #jruby
<headius[m]>
these new numbers better look great for cached classes or I'll be really confused
<headius[m]>
there's no way precompiled scripts should execute more bytecodes than the IR interpreter
ur5us has quit [Ping timeout: 245 seconds]
ur5us has joined #jruby
nirvdrum has quit [Ping timeout: 268 seconds]
nirvdrum has joined #jruby
ARTSIOM has joined #jruby
drbobbeaty has quit [Quit: My MacBook Pro has gone to sleep. ZZZzzz…]
ARTSIOM has quit [Client Quit]
ur5us has quit [Ping timeout: 246 seconds]
drbobbeaty has joined #jruby
lucasb has quit [Quit: Connection closed for inactivity]