temporalfox has quit [Read error: Connection reset by peer]
<nirvdrum>
We're affected, too, since we copied from you.
<nirvdrum>
But I think I'm going to make it "correct", otherwise strings go down the wrong path.
temporalfox has joined #jruby
<enebo>
nirvdrum: because you are roped will your patch help us too?
<enebo>
nirvdrum: I did not look for their fix to this
<nirvdrum>
I just opened the issue. It's been a bug since 1.9.2.
<nirvdrum>
But it's not a matter of ropes or not.
<nirvdrum>
The only real difference is we eagerly work out the code range. I added assertions today verifying we're actually creating ropes with the correct code range. And that's when I started tripping over these old bugs.
<enebo>
nirvdrum: well no doubt their patch will be very similar to ours
<nirvdrum>
I added a follow-up comment. I think the logic here is more complex.
<enebo>
nirvdrum: these are bugs I am ok with fixing as soon as discovered since nothing is gained from leaving behavior like this
temporalfox has quit [Quit: My MacBook has gone to sleep. ZZZzzz…]
<nirvdrum>
But JRuby isn't affected by that one.
<nirvdrum>
ISO-2022-JP is another one I tried.
<nirvdrum>
It's a problem for any non-ASCII compatible encoding.
<enebo>
nirvdrum: but that is not 7 bit ascii when ascii?
<nirvdrum>
Eh?
<enebo>
nirvdrum: err how should I say that sentence :)
<enebo>
nirvdrum: if [a-z] is not actually ascii a-z
rdubya has quit [Quit: Leaving.]
<nirvdrum>
I could have this very wrong, but CR_7BIT only applies to codepoints < 128 in an ASCII-compatible encoding.
<nirvdrum>
It's used all over the place for single-byte operations.
<enebo>
nirvdrum: no you have it right
<enebo>
nirvdrum: but UTF-16LE does not store those there
<enebo>
nirvdrum: well I mean they are all 2 bytes
<nirvdrum>
UTF-16LE uses 2 bytes for each codepoint.
<enebo>
nirvdrum: I assume ISO-2022-JP must as well
<nirvdrum>
And I guess surrogate pairs for grapheme clusters or whatever.
<nirvdrum>
All this shit is confusing.
<enebo>
nirvdrum: but 'a' in UTF-16 is a two bytes
<enebo>
so it cannot be 7bit
<nirvdrum>
Every CR_7BIT string is by definition CR_VALID, but if you incorrectly mark a string that *could be* CR_7BIT as CR_VALID, you might go down a bad path and get incorrect results.
<enebo>
although it happens to have a 7bit 'a' in it :)
<nirvdrum>
Because the helper functions for CR_VALID assume the source string is MBC.
<nirvdrum>
enebo: Right.
<enebo>
yeah I hate CR because it is less semantically driven and more specific to what they could opt at the time of 1.9 as an impl
<nirvdrum>
I hate ASCII-8BIT with a passion.
<enebo>
but at this point I doubt we could just remove it
<enebo>
that internal state now dictates external encoding in cases so it has become semantic
<nirvdrum>
singleByteOptimizable usually means it's CR_7BIT or it's ASCII-8BIT (or I suppose some other binary-based encoding).
<enebo>
yeah
<nirvdrum>
You just always have this corner case of dealing with arbitrary binary strings.
<enebo>
nirvdrum: yeah
<nirvdrum>
And it has a wildly different usage pattern.
<nirvdrum>
I posit most strings are read-only, but most byte buffers have many writes.
<enebo>
nirvdrum: yeah I guess in many languages bytes and chars are two things or chars come from byte storage
<nirvdrum>
Ropes take a beating when reading from IO, for instance.
<enebo>
nirvdrum: but notion that chars are made of bytes is more explicit at a language level
<enebo>
nirvdrum: fact that in 1.8 all strings were bytes I guess drove this lack of distinction
<nirvdrum>
I'm sure it was based on C strings.
<enebo>
yeah char*
<enebo>
C also is similarly weird
<nirvdrum>
I'd just like to see a byte[] or byte buffer type introduced and let that get adopted.
<nirvdrum>
ASCII-8BIT is here forever for compatibility reasons. But we could encourage people off it.
<enebo>
nirvdrum: I think Matz dislikes lots of types but it is a duck-typed language so Binary/String would have been reasonable to me
<enebo>
bin[1] would be second offset of bytes and string[1] would be second offset of a character
<nirvdrum>
But it's silly to check this for every byte after you've already determined it's CR_VALID.
<enebo>
nirvdrum: yeah the asciicompat check is invariant
<enebo>
nirvdrum: and once valid it won't go back right?
<nirvdrum>
The resulting string couldn't, as far as I can tell.
<nirvdrum>
This should be a function of the replacement string's encoding and whether the string to replace is ever matched.
<enebo>
nirvdrum: this is all gross in any case (not specifcially your code but the notion of how side-effecty cr is)
<nirvdrum>
I sorta took their macro that you guys seem to have inlined and blew it up to include the current encoding.
<enebo>
you could just look for non-ascii char as a boolean and set cr once at end or something too?
<enebo>
no doubt your ropes even know right
<enebo>
like you know if you have mbc in a rope already just not if result of tr will end up with mbc
<nirvdrum>
I think so, yeah. But this code has a half dozen different exit points.
<enebo>
nirvdrum: yeah no doubt this is squirrelly method too
<nirvdrum>
I have a boundary on it and just expect it to be slow. At some point I'm going to need to tackle it.
<enebo>
heh ours is massive
bbrowning is now known as bbrowning_away
<enebo>
nirvdrum: so you have all this c0, c, save state too right?
<nirvdrum>
Sorry, not sure what you're asking.
<enebo>
nirvdrum: massive trTansHelper
<enebo>
nirvdrum: just looking at the loop where those checks happen
<nirvdrum>
Oh. We literally copy & pasted from you.
<enebo>
nirvdrum: I am guessing someone decided to overload c to contain whether translation worked or not and the new value
<nirvdrum>
The only thing that's changed is ByteList was replaced with something that looks like ByteList called RopeBuilder.
<enebo>
who would be nobu probably :)
<enebo>
so to save a boolean he added a second int
<enebo>
well I guess I should say original c, translated c, and whether translation worked
<enebo>
since translated c might not work then it is set to -1
<enebo>
this is probably fine but the names and -1 checks make this challenging to read
<enebo>
nirvdrum: if (c < TRANS_SIZE) {
<enebo>
nirvdrum: this confuses me does utf16-le translate 'a' to be a value < 256?
<enebo>
nirvdrum: c is the codepoint of 'a'
<enebo>
as an example
<enebo>
nirvdrum: I naively assumed everything in utf-16le would be above 256
<enebo>
nirvdrum: this is where I feel weak in my knowledge
<enebo>
heh actually ignore that question it probably i > trans_size
<enebo>
everything has more than one path here
<nirvdrum>
I'm getting dinner ready. In and out at the moment.
<nirvdrum>
I've spent approximately 0 amount of time trying to understand this code, though.
<enebo>
nirvdrum: np. I will probably stop working fairly soon...fighting jet lag and feelign a bit weird
<nirvdrum>
I'm not sure I even understand many of the Ruby usages I see of it.
<enebo>
nirvdrum: my main issue with code like this is we kept parity with C version to make future fixes simpler but it is in this we must have all logic paths in one method C-like build around it
<nirvdrum>
Yeah.
<nirvdrum>
The approach is sound, if not ugly.
<enebo>
nirvdrum: so grokking all paths sometimes means you only follow one branch everywhere through one call but maybe not :)
<enebo>
nirvdrum: yeah I am sure it works it is just trying to handle all cases in one place
<enebo>
nirvdrum: which I think speaks volumes to lack of polymorphism