Discussion:
Bug#818616: luajit: laujit segfaults on arm64
Wookey
2016-03-18 18:42:24 UTC
Permalink
Source: luajit
Version: 2.1.0~beta2+dfsg-1
Severity: important

Dear Maintainer,

Yay - luajit with arm64 support uploaded!

However, whilst it builds, it doesn't appear to work at all.

Running it gives an immediate segfault

I installed the debug version (thank you debian for automatic dbgsym
builds!) and got:
$ luajit
GNU gdb (Debian 7.10-1+b1) 7.10
...
Reading symbols from luajit...Reading symbols from /usr/lib/debug/.build-id/e0/82e0a7597824d1de104daad7beb534c4280041.debug...done.
done.
(gdb) directory src
(gdb) run
Starting program: /usr/bin/luajit

Program received signal SIGSEGV, Segmentation fault.
getcurrenv (L=0xffffb7d80378, L=0xffffb7d80378) at lj_api.c:76
76 return fn->c.gct == ~LJ_TFUNC ? tabref(fn->c.env) : tabref(L->env);

(gdb) info stack
#0 getcurrenv (L=0xffffb7d80378, L=0xffffb7d80378) at lj_api.c:76
#1 cpcall (L=0xffffb7d80378, func=0x404c08 <pmain>, ud=0x0) at lj_api.c:1062
#2 0x00000000004379d8 in lj_vm_cpcall () at buildvm_arm64.dasc:1182
#3 0x000000000042bc3c in lua_cpcall (L=***@entry=0xffffb7d80378, func=***@entry=0x404c08 <pmain>, ud=***@entry=0x0)
at lj_api.c:1079
#4 0x000000000040406c in main (argc=1, argv=0xfffffffff518) at luajit.c:564

I will investigate this further, but filing this bug for now in case
it's obvious to anyone else what's up.


----
Kernel: Linux 3.18.0-rc3 (SMP w/6 CPU cores)
Locale: LANG=C, LC_CTYPE=C (charmap=ANSI_X3.4-1968)
Shell: /bin/sh linked to /bin/dash
Init: unable to detect

Versions of packages luajit depends on:
ii libc6 2.22-3
ii libgcc1 1:5.3.1-8
ii libluajit-5.1-2 2.1.0~beta2+dfsg-1
ii libluajit-5.1-common 2.1.0~beta2+dfsg-1

luajit recommends no packages.

luajit suggests no packages.

-- no debconf information
Wookey
2016-03-22 14:59:11 UTC
Permalink
+++ Wookey [2016-03-18 18:42 +0000]:

A bit more detail from the gdb session:
(gdb) bt full
#0 getcurrenv (L=0xffffb7d80378, L=0xffffb7d80378) at lj_api.c:76
fn = 0x7fffb7d80378
#1 cpcall (L=0xffffb7d80378, func=0x404c08 <pmain>, ud=0x0) at lj_api.c:1062
fn = 0x7fffb7d80378
top = <optimized out>
#2 0x00000000004379d8 in lj_vm_cpcall () at buildvm_arm64.dasc:1182
No locals.
#3 0x000000000042bc3c in lua_cpcall (L=***@entry=0xffffb7d80378, func=***@entry=0x404c08 <pmain>, ud=***@entry=0x0) at lj_api.c:1079
g = 0xffffb7d803d8
oldh = 0 '\000'
status = -1210580104
#4 0x000000000040406c in main (argc=1, argv=0xfffffffff518) at luajit.c:564
status = <optimized out>
L = 0xffffb7d80378
(gdb) list
71 }
72
73 static GCtab *getcurrenv(lua_State *L)
74 {
75 GCfunc *fn = curr_func(L);
76 return fn->c.gct == ~LJ_TFUNC ? tabref(fn->c.env) : tabref(L->env);
77 }
78
79 /* -- Miscellaneous API functions ----------------------------------------- */
80
(gdb) p fn->c
Cannot access memory at address 0x7fffb7d80378

So that top 48th bit missing from the pointers looks wrong. Aarch64
has 48 bits, not the 47 of x86_64.

And we see things like this in the code:
lj_def.h:#define LJ_MAX_MEM64 ((uint64_t)1<<47) /* Max. 64 bit memory allocation. */
lj_def.h:#define checkptr47(x) (((uint64_t)(x) >> 47) == 0)


Wookey
--
Principal hats: Linaro, Debian, Wookware, ARM
http://wookware.org/
Wookey
2016-03-22 18:41:48 UTC
Permalink
+++ Debian Bug Tracking System [2016-03-22 15:03 +0000]:

OK, so the issue is indeed the 47/48 bit pointers. The mozilla jit had
the exact same issue. https://bugzilla.mozilla.org/show_bug.cgi?id=1143022

I've hacked up a patch (attached), which changes the pointer length to
48 bits, to demonstrate that the top bit is no longer masked and the
segfault goes away. However this patch is not a proper fix (and
crashes a bit later) because it does not properly deal with the fact
that moving up one bit pushes the 4 tags further up and impinges on
the fff8 'NaN' value currently used.

My understanding of the code is not sufficient to work out whether
everything can simply be shoved up by one bit to make this work. Can
the NaN change, or is it set by the IEEE specs? Does it need to change in
fact, or is it only use in the 32bit mode?

And I don't properly understand the distinction between LJ_64 and
LJ_GC64 which affects how things are laid out. So I'm well aware that
my patch is half-a-bodge, and merely to prove that I was barking up
the right tree.

So the issue is nicely described in src/lj_obj.h 'Internal object tags'
** Format for 32 bit GC references (!LJ_GC64): **
** Internal tags overlap the MSW of a number object (must be a double).
** Interpreted as a double these are special NaNs. The FPU only generates
** one type of NaN (0xfff8_0000_0000_0000). So MSWs > 0xfff80000 are available
** for use as internal tags. Small negative numbers are used to shorten the
** encoding of type comparisons (reg/mem against sign-ext. 8 bit immediate).
**
** ---MSW---.---LSW---
** primitive types | itype | |
** lightuserdata | itype | void * | (32 bit platforms)
** lightuserdata |ffff| void * | (64 bit platforms, 47 bit pointers)
** GC objects | itype | GCRef |
** int (LJ_DUALNUM)| itype | int |
** number -------double------
**
** Format for 64 bit GC references (LJ_GC64):
**
** The upper 13 bits must be 1 (0xfff8...) for a special NaN. The next
** 4 bits hold the internal tag. The lowest 47 bits either hold a pointer,
** a zero-extended 32 bit integer or all bits set to 1 for primitive types.
**
** ------MSW------.------LSW------
** primitive types |1..1|itype|1..................1|
** GC objects/lightud |1..1|itype|-------GCRef--------|
** int (LJ_DUALNUM) |1..1|itype|0..0|-----int-------|
** number ------------double-------------

So can we just change that to:
** The upper 12 bits must be 1 (0xfff...) for a special NaN. The next
** 4 bits hold the internal tag. The lowest 48 bits either hold a pointer,
** a zero-extended 32 bit integer or all bits set to 1 for primitive types.
?

Or will that just break and we need to keep fff8, leaving only 3 bits for the tags?

Ultimately this design needs to change, because ARM64 pointers are
going to change to 52 bits in the future and x86 pointers will be
getting bigger too as they are subject to the same pressures in new hardware.

So putting the tags somewhere else would be smart, or maybe down at
the bottom end if you know things are aligned to large enough boundaries?

But perhaps we can make a smaller change for now to get this running?

I'll file this upstream too, referring back here.

Wookey
--
Principal hats: Linaro, Debian, Wookware, ARM
http://wookware.org/
Wookey
2016-03-22 19:10:56 UTC
Permalink
+++ Debian Bug Tracking System [2016-03-22 18:45 +0000]:

Issue is tracked upstream on https://github.com/LuaJIT/LuaJIT/issues/49

There is also lots of quality info on
https://collaborate.linaro.org/display/TCWG/The+State+of+LuaJit+for+ARM+64+-+March+2016
but its behind a login for no good reason so people outside Linaro
can't read it. (Yay for open development Linaro!)


Wookey
--
Principal hats: Linaro, Debian, Wookware, ARM
http://wookware.org/
Wookey
2016-03-23 16:22:06 UTC
Permalink
+++ Debian Bug Tracking System [2016-03-22 19:12 +0000]:

The linaro page is now accessible at:
https://collaborate.linaro.org/display/TCWGPUB/The+State+of+LuaJit+for+ARM+64+-+March+2016

(thanks to Ryan Arnold for fixing that)

Wookey
--
Principal hats: Linaro, Debian, Wookware, ARM
http://wookware.org/
Wookey
2016-05-10 14:08:01 UTC
Permalink
This issue has been fixed upstream. The issue is tracked here:
https://github.com/LuaJIT/LuaJIT/issues/49
and the fix is here:
https://github.com/LuaJIT/LuaJIT/commit/0c6fdc1039a3a4450d366fba7af4b29de73f0dc6

The chosen fix is to allocate only low-enough memory that 47-bit pointers will work.

There is a longer-term plan for the kernel to support applications
asking for allocations that are guaranteed to be under particular
sizes so they don't have to do this themselves by asking over and over
until they get something that works for them.

So the net result is that if we apply this patch, luajit should at
least run on ARM64, and be consistent with upstream on their next release.

I guess I should test this.

Wookey
--
Principal hats: Linaro, Debian, Wookware, ARM
http://wookware.org/
Ben Hutchings
2016-12-08 22:03:07 UTC
Permalink
Control: severity -1 serious

Raising severity; this should be a release blocker.
Post by Wookey
Source: luajit
Version: 2.1.0~beta2+dfsg-1
Severity: important
 
Dear Maintainer,
 
Yay - luajit with arm64 support uploaded!
 
However, whilst it builds, it doesn't appear to work at all.
[...]

Does this bug exist in the version in testing/unstable (2.0.4+dfsg-1)?
Currently the BTS records it as affecting experimental only.

Ben.
--
Ben Hutchings
When in doubt, use brute force. - Ken Thompson
dann frazier
2016-12-09 10:18:38 UTC
Permalink
Post by Ben Hutchings
Control: severity -1 serious
Raising severity; this should be a release blocker.
Post by Wookey
Source: luajit
Version: 2.1.0~beta2+dfsg-1
Severity: important
 
Dear Maintainer,
 
Yay - luajit with arm64 support uploaded!
 
However, whilst it builds, it doesn't appear to work at all.
[...]
Does this bug exist in the version in testing/unstable (2.0.4+dfsg-1)?
Currently the BTS records it as affecting experimental only.
hey Ben!

The version in unstable hadn't yet turned on arm64 builds.

-dann
Wookey
2016-12-09 11:35:10 UTC
Permalink
Post by Ben Hutchings
Control: severity -1 serious
Raising severity; this should be a release blocker.
Post by Wookey
Source: luajit
Version: 2.1.0~beta2+dfsg-1
Severity: important
 
Dear Maintainer,
 
Yay - luajit with arm64 support uploaded!
 
However, whilst it builds, it doesn't appear to work at all.
[...]
Does this bug exist in the version in testing/unstable (2.0.4+dfsg-1)?
Currently the BTS records it as affecting experimental only.
well, version 2.0.4+dfsg-1 does not have any arm64 support and thus is not built for arm64.

So no, it doesn't have the bug that you can build it for arm64 but it doesn't work.

But on the other hand it does have 'doesn't work on arm64'

So I'm really not sure whether one should mark this bug as applying to
2.0.4.

Can we get laujit 2.1-with-arm64-support into unstable for stretch?
There are serious people that would really like to see that (because
important stuff like nginx need luajit), but presumably the transition
freeze applies here?

Wookey
--
Principal hats: Linaro, Debian, Wookware, ARM
http://wookware.org/
Ben Hutchings
2016-12-09 12:00:47 UTC
Permalink
On Fri, 2016-12-09 at 11:35 +0000, Wookey wrote:
[...]
Post by Wookey
So I'm really not sure whether one should mark this bug as applying to
2.0.4. 
Right.
Post by Wookey
Can we get laujit 2.1-with-arm64-support into unstable for stretch?
There are serious people that would really like to see that (because
important stuff like nginx need luajit), but presumably the transition
freeze applies here?
They both build libluajit-5.1-2 so I don't see that there's any
transition needed.

Ben.
--
Ben Hutchings
Logic doesn't apply to the real world. - Marvin Minsky
Loading...