Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 2 DOI reference(s) in README
  • Academic publication links
    Links to: arxiv.org
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (12.0%) to scientific vocabulary

Repository

Basic Info
  • Host: GitHub
  • Owner: giordano
  • License: mit
  • Language: Julia
  • Default Branch: main
  • Size: 1.2 MB
Statistics
  • Stars: 13
  • Watchers: 2
  • Forks: 0
  • Open Issues: 2
  • Releases: 0
Created about 4 years ago · Last pushed 11 months ago

README.md

Julia on Fugaku (2022-07-23)

Note: many links refer to internal documentation which is accessible only to Fugaku users.

Read the paper

Benchmarks present in this repository have been published in the paper Productivity meets Performance: Julia on A64FX, presented at the 2022 IEEE International Conference on Cluster Computing (CLUSTER22), as part of the Embracing Arm for High Performance Computing Workshop (pre-print available on arXiv: 2207.12762). See the CITATION.bib file for a BibTeX entry to cite the paper.

Storage

Before doing anything on Fugaku, be aware that there are tight limits on both the size (20 GiB) and the number of inodes (200k) of your home directory. If you use many Julia Pkg artifacts, you are very likely to hit these limits. You'll notice when you do, because any disk I/O operation will fail with a Disk quota exceeded error like this:

```console
[user@fn01sv03 ~]$ touch foo
touch: cannot touch 'foo': Disk quota exceeded
```

You can check the quota of your home directory with accountd for the size, and accountd -i for the number of inodes.
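To get a rough idea of how much the Julia depot contributes to these limits, something like the following sketch may help. It uses only standard tools; the helper names `count_entries` and `size_kib` are made up for illustration, and the number of file-system entries is only a proxy for the inode count (hard links are counted more than once).

```sh
# Rough proxy for the number of inodes used below a directory:
# count every file-system entry, including the directory itself.
count_entries() {
    find "$1" -xdev | wc -l | tr -d ' '
}

# Disk usage of a directory in KiB.
size_kib() {
    du -sk "$1" | cut -f1
}

# First entry of JULIA_DEPOT_PATH, falling back to Julia's default depot.
depot="${JULIA_DEPOT_PATH:-${HOME}/.julia}"
depot="${depot%%:*}"

if [ -d "${depot}" ]; then
    echo "${depot}: $(size_kib "${depot}") KiB, $(count_entries "${depot}") entries"
fi
```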

Using the data directory

In order to avoid clogging up the home directory you may want to move the Julia depot to the data directory:

```sh
DATADIR="/data/<YOUR GROUP>/${USER}"
export JULIA_DEPOT_PATH="${DATADIR}/julia-depot"
```
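A slightly more defensive variant of the same idea, as a sketch: it refuses to continue if the data directory does not exist (for instance because the group placeholder was not replaced) and creates the depot directory up front. The function name `use_julia_depot_in` is hypothetical.

```sh
# use_julia_depot_in DATADIR
# Point JULIA_DEPOT_PATH to DATADIR/julia-depot, creating it if needed.
use_julia_depot_in() {
    if [ ! -d "$1" ]; then
        echo "data directory '$1' does not exist" >&2
        return 1
    fi
    mkdir -p "$1/julia-depot" || return 1
    export JULIA_DEPOT_PATH="$1/julia-depot"
}

# Usage on Fugaku, with the layout described above (replace the placeholder):
#   use_julia_depot_in "/data/<YOUR GROUP>/${USER}"
```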

Interactive usage

The login nodes you access via login.fugaku.r-ccs.riken.jp (connection instructions) have Cascade Lake CPUs, so they aren't very useful if you want to run an aarch64 Julia.

You can submit jobs to the queue to run Julia code on the A64FX compute nodes, but this can be cumbersome if you need quick feedback during development or debugging. Alternatively, you can request an interactive node, for example with:

pjsub --interact -L "node=1" -L "rscgrp=int" -L "elapse=30:00" --sparam "wait-time=600" --mpi "max-proc-per-node=4"

Available software

Fugaku uses the Spack package manager. For more information about how to use it, see the Fugaku Spack User Guide.

Note that Spack is installed in /vol0004. This means that, if your home directory isn't mounted on this volume, you have to explicitly request the partition in your job scripts or submission commands, for example by adding -x PJM_LLIO_GFSCACHE=/vol0004 to the pjsub command, or the line

```sh
#PJM -x PJM_LLIO_GFSCACHE=/vol0004
```

in a job script.
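Putting the pieces together, a batch job script might look like the sketch below, written to disk with a here-document. The resource options simply mirror the interactive `pjsub` command from the previous section, expressed as `#PJM` directives. The resource group name `small`, the file name `job.sh` and the Julia script `script.jl` are assumptions for illustration, so check the Fugaku documentation for the values appropriate to your project.

```sh
# Write a minimal job script; "small" is an assumed resource group name
# and script.jl is a placeholder for your own program.
cat > job.sh <<'EOF'
#!/bin/bash
#PJM -L "node=1"
#PJM -L "rscgrp=small"
#PJM -L "elapse=30:00"
#PJM --mpi "max-proc-per-node=4"
#PJM -x PJM_LLIO_GFSCACHE=/vol0004

julia --project=. script.jl
EOF

# Submit it with: pjsub job.sh
```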

Using Julia on the compute nodes

There is a Julia module built with Spack available on the compute nodes, but as of this writing (2022-07-23) the version of Julia provided is 1.6.3, so you may want to download a more recent version from the official website. Use the aarch64 builds for Glibc Linux, preferably the latest stable release, or even the nightly build if you feel confident.

To enable full vectorisation you may need to set the environment variable JULIA_LLVM_ARGS="-aarch64-sve-vector-bits-min=512". Example: https://github.com/JuliaLang/julia/issues/40308#issuecomment-901478623. However, note that there are a couple of severe bugs when using 512-bit vectors.

Note: Julia v1.9, which is based on LLVM 14, is able to natively autovectorise code for A64FX without having to set JULIA_LLVM_ARGS, sidestepping the issues above altogether.
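If you script your environment set-up for several Julia versions and decide to enable the flag despite the bugs mentioned above, you could apply it only where it is needed. The sketch below assumes the threshold stated in the note (v1.9); the function name `needs_sve_flag` is hypothetical, and the version parsing assumes that `julia --version` prints a line of the form `julia version X.Y.Z`.

```sh
# needs_sve_flag VERSION
# Succeeds if VERSION (e.g. "1.8.5") is older than 1.9, i.e. if the
# JULIA_LLVM_ARGS workaround may still be needed.
needs_sve_flag() {
    major="${1%%.*}"
    rest="${1#*.}"
    minor="${rest%%.*}"
    [ "${major}" -lt 1 ] || { [ "${major}" -eq 1 ] && [ "${minor}" -lt 9 ]; }
}

# Only do anything if a julia executable is on the PATH.
if command -v julia >/dev/null 2>&1; then
    version="$(julia --version | awk '{print $3}')"
    if needs_sve_flag "${version}"; then
        export JULIA_LLVM_ARGS="-aarch64-sve-vector-bits-min=512"
    fi
fi
```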

MPI.jl

MPI.jl with default JLL-provided MPICH works out of the box! In order to configure MPI.jl v0.19 to use system-provided Fujitsu MPI (based on OpenMPI) you have to specify the MPI C compiler for A64FX with

julia --project -e 'ENV["JULIA_MPI_BINARY"]="system"; ENV["JULIA_MPICC"]="mpifcc"; using Pkg; Pkg.build("MPI"; verbose=true)'

Note #1: mpifcc is available only on the compute nodes. On the login nodes the equivalent is mpifccpx, but this is the cross compiler running on Intel architecture, and it's unlikely you'll run an aarch64 Julia there. Preliminary tests show that MPI.jl should work mostly fine with Fujitsu MPI, but custom error handlers may not be available (read: trying to use them causes segmentation faults).

Note #2: in MPI.jl v0.20 Fujitsu MPI is a known ABI (it's the same as OpenMPI) and there is nothing special to do to configure it apart from choosing the system binaries.

Note #3: we recommend using MPI.jl's wrapper of mpiexec to run MPI applications with Julia: mpiexecjl.
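The distinction made in Note #1 between the native and the cross compiler can be captured in a small helper if the same set-up script is shared between login and compute nodes. This is only a sketch: the function name `fujitsu_mpicc` is hypothetical, and it assumes that `uname -m` reports `aarch64` on the compute nodes and `x86_64` on the login nodes.

```sh
# fujitsu_mpicc ARCH
# Print the Fujitsu MPI C compiler wrapper for the given machine type:
# the native compiler on A64FX, the cross compiler on the Intel login nodes.
fujitsu_mpicc() {
    case "$1" in
        aarch64) echo "mpifcc" ;;
        x86_64)  echo "mpifccpx" ;;
        *)       echo "unsupported architecture: $1" >&2; return 1 ;;
    esac
}

echo "MPI C compiler for this node: $(fujitsu_mpicc "$(uname -m)" 2>/dev/null || echo unknown)"

# Example, for the MPI.jl v0.19 build on a compute node:
#   export JULIA_MPI_BINARY="system" JULIA_MPICC="$(fujitsu_mpicc "$(uname -m)")"
```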

File system latency

Fugaku has an advanced system to handle parallel file system latency. In order to speed up parallel applications run through MPI, you may want to distribute the executable to the cache area of the second-layer storage on the first-layer storage using llio_transfer. In particular, if you're using Julia, you likely want to distribute the julia executable itself together with its installation bundle.

For example, assuming that you are using the official binaries from the website, instead of the Julia module provided by Spack, you can do the following:

```sh
# Directory for log of `llio_transfer` and its wrapper `dir_transfer`
LOGDIR="${TMPDIR}/log"
# Create the log directory if necessary
mkdir -p "${LOGDIR}"

# Get directory where Julia is placed
JL_BUNDLE="$(dirname $(julia --startup-file=no -O0 --compile=min -e 'print(Sys.BINDIR)'))"

# Move Julia installation to fast LLIO directory
/home/system/tool/dir_transfer -l "${LOGDIR}" "${JL_BUNDLE}"

# Do not write empty stdout/stderr files for MPI processes.
export PLE_MPI_STD_EMPTYFILE=off

mpiexecjl --project=. -np ... julia ...

# Remove Julia installation directory from the cache.
/home/system/tool/dir_transfer -p -l "${LOGDIR}" "${JL_BUNDLE}"
```
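In the script above the final clean-up step is skipped if the job script stops early, for example under `set -e` when the MPI run fails. One possible way to keep the transfer and the purge paired is sketched below. The function name `run_with_cache` and the `DIR_TRANSFER` variable are hypothetical; only the `dir_transfer` path and its `-l` and `-p` options are taken from the script above.

```sh
# Site-provided wrapper used above; overridable, e.g. for testing elsewhere.
DIR_TRANSFER="${DIR_TRANSFER:-/home/system/tool/dir_transfer}"

# run_with_cache LOGDIR DIR COMMAND [ARGS...]
# Distribute DIR to the cache, run COMMAND, then purge DIR again even if
# COMMAND failed. Returns the exit status of COMMAND.
run_with_cache() {
    logdir="$1"
    dir="$2"
    shift 2
    mkdir -p "${logdir}" || return 1
    "${DIR_TRANSFER}" -l "${logdir}" "${dir}" || return 1
    status=0
    "$@" || status=$?
    "${DIR_TRANSFER}" -p -l "${logdir}" "${dir}"
    return "${status}"
}

# Example (inside a job script):
#   run_with_cache "${TMPDIR}/log" "${JL_BUNDLE}" mpiexecjl --project=. -np ... julia ...
```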

Reverse engineering Fujitsu compiler using LLVM output

The Fujitsu compiler has two operation modes: "trad" (for "traditional") and "clang" (enabled by the flag -Nclang). In clang mode it's based on LLVM (version 7 at the moment). This means you can get it to emit LLVM IR with -emit-llvm. For example, with

```console
$ echo 'int main(){}' | fcc -Nclang -x c - -S -emit-llvm -o -
```

you get

```llvm
; ModuleID = '-'
source_filename = "-"
target datalayout = "e-m:e-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128"
target triple = "aarch64-unknown-linux-gnu"

; Function Attrs: norecurse nounwind readnone uwtable
define dso_local i32 @main() local_unnamed_addr #0 !dbg !8 {
  ret i32 0, !dbg !11
}

attributes #0 = { norecurse nounwind readnone uwtable "correctly-rounded-divide-sqrt-fp-math"="false" "disable-tail-calls"="false" "less-precise-fpmad"="false" "no-frame-pointer-elim"="true" "no-frame-pointer-elim-non-leaf" "no-infs-fp-math"="false" "no-jump-tables"="false" "no-nans-fp-math"="false" "no-signed-zeros-fp-math"="false" "no-trapping-math"="false" "stack-protector-buffer-size"="8" "target-cpu"="a64fx" "target-features"="+crc,+crypto,+fp-armv8,+lse,+neon,+ras,+rdm,+sve,+v8.2a" "unsafe-fp-math"="false" "use-soft-float"="false" }

!llvm.dbg.cu = !{!0}
!llvm.module.flags = !{!3, !4, !5}
!llvm.ident = !{!6}
!llvm.compinfo = !{!7}

!0 = distinct !DICompileUnit(language: DW_LANG_C99, file: !1, producer: "clang: Fujitsu C/C++ Compiler 4.7.0 (Nov 4 2021 10:55:52) (based on LLVM 7.1.0)", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, enums: !2)
!1 = !DIFile(filename: "-", directory: "/home/ra000019/a04463")
!2 = !{}
!3 = !{i32 2, !"Dwarf Version", i32 4}
!4 = !{i32 2, !"Debug Info Version", i32 3}
!5 = !{i32 1, !"wchar_size", i32 4}
!6 = !{!"clang: Fujitsu C/C++ Compiler 4.7.0 (Nov 4 2021 10:55:52) (based on LLVM 7.1.0)"}
!7 = !{!"C::clang"}
!8 = distinct !DISubprogram(name: "main", scope: !9, file: !9, line: 1, type: !10, isLocal: false, isDefinition: true, scopeLine: 1, isOptimized: true, unit: !0, retainedNodes: !2)
!9 = !DIFile(filename: "<stdin>", directory: "/home/ra000019/a04463")
!10 = !DISubroutineType(types: !2)
!11 = !DILocation(line: 1, column: 12, scope: !8)
```
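When comparing what the compiler targets under different flags, the interesting parts of this output are usually the string attributes such as `target-cpu` and `target-features`. As a sketch, they can be pulled out of the IR with `sed`. The helper name `ir_attribute` is made up for illustration, and it assumes the attributes appear on a single `attributes` line as in the output above.

```sh
# ir_attribute NAME < file.ll
# Print the value of the string attribute NAME (e.g. target-cpu) found on
# the "attributes" lines of LLVM IR read from standard input.
ir_attribute() {
    sed -n "/^attributes /s/.*\"$1\"=\"\([^\"]*\)\".*/\1/p"
}

# Example, on a compute node:
#   echo 'int main(){}' | fcc -Nclang -x c - -S -emit-llvm -o - | ir_attribute target-cpu
```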

SystemBenchmark.jl

I ran SystemBenchmark.jl on a compute node. Here are the results: https://github.com/IanButterworth/SystemBenchmark.jl/issues/8#issuecomment-1039775968.

BLAS

OpenBLAS seems to have poor performance:

```julia
julia> using LinearAlgebra

julia> peakflops()
2.589865257047898e10
```

Up to v1.7, Julia uses OpenBLAS v0.3.17, which doesn't support A64FX at all, so it's probably using the generic kernels. v0.3.19 and v0.3.20 improved support for this chip. You can find a build of v0.3.20 at https://github.com/JuliaBinaryWrappers/OpenBLAS_jll.jl/releases/download/OpenBLAS-v0.3.20%2B0/OpenBLAS.v0.3.20.aarch64-linux-gnu-libgfortran5.tar.gz, but sadly there isn't a great performance improvement:

```julia
julia> BLAS.lbt_forward("lib/libopenblas64_.so")
4856

julia> peakflops()
2.6362952057793587e10
```

There is an optimised BLAS provided by Fujitsu, with support for SVE (with both LP64 and ILP64). In order to use it, install FujitsuBLAS.jl:

```julia
julia> using FujitsuBLAS, LinearAlgebra

julia> BLAS.get_config()
LinearAlgebra.BLAS.LBTConfig
Libraries:
└ [ILP64] libfjlapackexsve_ilp64.so

julia> peakflops()
4.801227630694119e10
```
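To put the numbers above side by side, the following sketch converts the `peakflops()` results to GFLOPS and computes the ratio between the Fujitsu BLAS and the default OpenBLAS. The helper names `gflops` and `speedup` are made up for illustration; the input values are copied from the sessions above.

```sh
# gflops FLOPS: convert a peakflops() result to GFLOPS with two decimals.
gflops() {
    awk -v f="$1" 'BEGIN { printf "%.2f\n", f / 1e9 }'
}

# speedup NEW OLD: ratio between two peakflops() results.
speedup() {
    awk -v new="$1" -v old="$2" 'BEGIN { printf "%.2f\n", new / old }'
}

openblas="2.589865257047898e10"   # OpenBLAS v0.3.17, from above
fujitsu="4.801227630694119e10"    # Fujitsu BLAS, from above

echo "OpenBLAS:     $(gflops "${openblas}") GFLOPS"
echo "Fujitsu BLAS: $(gflops "${fujitsu}") GFLOPS"
echo "Speed-up:     $(speedup "${fujitsu}" "${openblas}")x"
```

With these inputs the Fujitsu BLAS comes out at a bit less than twice the throughput of the default OpenBLAS.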

The package BLISBLAS.jl similarly forwards BLAS calls to the blis library, which has optimised kernels for A64FX.

Building Julia from source

With GCC

Building Julia from source with GCC (which is the default if you don't set CC and CXX) works fine; it's just slow:

```
[...]
    JULIA usr/lib/julia/corecompiler.ji
Core.Compiler ──── 903.661 seconds
[...]
Base ─────────────271.257337 seconds
ArgTools ───────── 50.348227 seconds
Artifacts ──────── 1.193792 seconds
Base64 ─────────── 1.057241 seconds
CRC32c ─────────── 0.097865 seconds
FileWatching ───── 1.169747 seconds
Libdl ──────────── 0.026215 seconds
Logging ────────── 0.411966 seconds
Mmap ───────────── 0.972844 seconds
NetworkOptions ─── 1.159094 seconds
SHA ────────────── 2.067851 seconds
Serialization ──── 2.942512 seconds
Sockets ────────── 3.568797 seconds
Unicode ────────── 0.814165 seconds
DelimitedFiles ─── 1.121546 seconds
LinearAlgebra ────109.560774 seconds
Markdown ───────── 7.977584 seconds
Printf ─────────── 1.635409 seconds
Random ─────────── 13.843395 seconds
Tar ────────────── 3.146368 seconds
Dates ──────────── 16.694863 seconds
Distributed ────── 8.163152 seconds
Future ─────────── 0.060472 seconds
InteractiveUtils ─ 5.245523 seconds
LibGit2 ────────── 15.469061 seconds
Profile ────────── 5.399918 seconds
SparseArrays ───── 42.660136 seconds
UUIDs ──────────── 0.165799 seconds
REPL ───────────── 40.149298 seconds
SharedArrays ───── 5.476926 seconds
Statistics ─────── 2.130843 seconds
SuiteSparse ────── 16.849304 seconds
TOML ───────────── 0.714203 seconds
Test ───────────── 3.538098 seconds
LibCURL ────────── 3.547585 seconds
Downloads ──────── 3.657012 seconds
Pkg ────────────── 54.053634 seconds
LazyArtifacts ──── 0.019103 seconds
Stdlibs total ────427.178257 seconds
Sysimage built. Summary:
Total ─────── 698.447219 seconds
Base: ─────── 271.257337 seconds 38.8372%
Stdlibs: ──── 427.178257 seconds 61.1611%
[...]
Precompilation complete. Summary:
Total ─────── 1274.714700 seconds
Generation ── 886.445205 seconds 69.5407%
Execution ─── 388.269495 seconds 30.4593%
```
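To see at a glance where the time goes, a saved copy of such a build log can be sorted by duration. The sketch below assumes timing lines of the form shown above (a name, a run of `─` characters, a number, and the word `seconds`); the function name `slowest_steps` and the file name `build.log` are hypothetical.

```sh
# slowest_steps N < build.log
# Print the N longest-running entries among the per-module timing lines,
# slowest first, skipping the "total" summary lines.
slowest_steps() {
    grep ' seconds$' \
        | grep -v -i 'total' \
        | sed 's/─/ /g' \
        | awk '{ print $(NF-1), $1 }' \
        | LC_ALL=C sort -rn \
        | head -n "$1"
}

# Example:
#   make 2>&1 | tee build.log
#   slowest_steps 5 < build.log
```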

With Fujitsu compiler

For reference, the Julia commit used for the last build I attempted was 1ad2396f.

Compiling Julia from source with the Fujitsu compiler is complicated. In particular, it's an absolute pain to use the Fujitsu compiler in trad mode. You may have more luck with clang mode.

Preparation. Create the Make.user file with this content (I'm not sure this file is actually necessary when using Clang mode, but it definitely is with trad mode):

```makefile
override ARCH := aarch64
override BUILD_MACHINE := aarch64-unknown-linux-gnu
```

Then you can compile with (-Nclang is to select clang mode)

make -j50 CC="fcc -Nclang" CFLAGS="-Kopenmp" CXX="FCC -Nclang" CXXFLAGS="-Kopenmp"

The compiler in trad mode doesn't define the macro __SIZEOF_POINTER__, so compilation would fail in https://github.com/JuliaLang/julia/blob/1ad2396f05fa63a71e5842c814791cd7c7715100/src/support/platform.h#L114-L115. The solution is to set the macro -D__SIZEOF_POINTER__=8 in the CFLAGS (or just not use trad mode). Then, you may get errors like

```
/vol0003/ra000019/a04463/repo/julia/src/jltypes.c:2000:13: error: initializer element is not a compile-time constant
 jl_typename_type,
 ^~~~~~~~~~~~~~~~
./julia_internal.h:437:41: note: expanded from macro 'jl_svec'
 n == sizeof((void *[]){ __VA_ARGS__ })/sizeof(void *), \
 ^~~~~~~~~~~
/usr/include/sys/cdefs.h:439:53: note: expanded from macro '_Static_assert'
 [!!sizeof (struct { int __error_if_negative: (expr) ? 2 : -1; })]
 ^~~~
/vol0003/ra000019/a04463/repo/julia/src/jltypes.c:2025:43: error: initializer element is not a compile-time constant
 jl_typename_type->types = jl_svec(13, jl_symbol_type, jl_any_type /*jl_module_type*/,
 ^~~~~~~~~~~~~~
./julia_internal.h:437:41: note: expanded from macro 'jl_svec'
 n == sizeof((void *[]){ __VA_ARGS__ })/sizeof(void *), \
 ^~~~~~~~~~~
/usr/include/sys/cdefs.h:439:53: note: expanded from macro '_Static_assert'
 [!!sizeof (struct { int __error_if_negative: (expr) ? 2 : -1; })]
 ^~~~
```

This is the compiler's fault, since it is supposed to be able to handle this code, but you can just delete the assertions at lines https://github.com/JuliaLang/julia/blob/1ad2396f05fa63a71e5842c814791cd7c7715100/src/julia_internal.h#L427-L429, https://github.com/JuliaLang/julia/blob/1ad2396f05fa63a71e5842c814791cd7c7715100/src/julia_internal.h#L436-L438, https://github.com/JuliaLang/julia/blob/1ad2396f05fa63a71e5842c814791cd7c7715100/src/julia_internal.h#L444-L446.

If you're lucky enough, with all these changes, you may be able to build usr/bin/julia. Unfortunately, last time I tried, running this executable caused a segmentation fault in dl_init:

```
(gdb) run
Starting program: /vol0003/ra000019/a04463/repo/julia/julia
Missing separate debuginfos, use: yum debuginfo-install glibc-2.28-151.el8.aarch64
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".

Program received signal SIGSEGV, Segmentation fault.
0x000040000000def4 in _dl_init () from /lib/ld-linux-aarch64.so.1
Missing separate debuginfos, use: yum debuginfo-install FJSVxoslibmpg-2.0.0-25.14.1.el8.aarch64 elfutils-libelf-0.182-3.el8.aarch64
(gdb) bt
#0  0x000040000000def4 in _dl_init () from /lib/ld-linux-aarch64.so.1
#1  0x000040000020adb0 in _dl_catch_exception () from /lib64/libc.so.6
#2  0x00004000000125e4 in dl_open_worker () from /lib/ld-linux-aarch64.so.1
#3  0x000040000020ad54 in _dl_catch_exception () from /lib64/libc.so.6
#4  0x0000400000011aa8 in _dl_open () from /lib/ld-linux-aarch64.so.1
#5  0x0000400000091094 in dlopen_doit () from /lib64/libdl.so.2
#6  0x000040000020ad54 in _dl_catch_exception () from /lib64/libc.so.6
#7  0x000040000020ae20 in _dl_catch_error () from /lib64/libc.so.6
#8  0x00004000000917f0 in _dlerror_run () from /lib64/libdl.so.2
#9  0x0000400000091134 in dlopen@@GLIBC_2.17 () from /lib64/libdl.so.2
#10 0x0000400000291f34 in load_library (rel_path=0x400001e900c6 "libjulia-internal.so.1", src_dir=<optimized out>, err=1) at /vol0003/ra000019/a04463/repo/julia/cli/loader_lib.c:65
#11 0x0000400000291c78 in jl_load_libjulia_internal () at /vol0003/ra000019/a04463/repo/julia/cli/loader_lib.c:200
#12 0x000040000000de04 in call_init.part () from /lib/ld-linux-aarch64.so.1
#13 0x000040000000df08 in _dl_init () from /lib/ld-linux-aarch64.so.1
#14 0x0000400000001044 in _dl_start_user () from /lib/ld-linux-aarch64.so.1
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
```

Owner

  • Name: Mosè Giordano
  • Login: giordano
  • Kind: user
  • Location: London, UK
  • Company: @UCL-ARC

Citation (CITATION.bib)

@INPROCEEDINGS{2022clus.confE...1G,
       author = {{Giordano}, Mos{\`e} and {Kl{\"o}wer}, Milan and {Churavy}, Valentin},
        title = "{Productivity meets Performance: Julia on A64FX}",
     keywords = {Computer Science - Distributed, Parallel, and Cluster Computing},
    booktitle = {2022 {IEEE} International Conference on Cluster Computing ({CLUSTER})},
    publisher = {{IEEE}},
         year = 2022,
        month = oct,
          eid = {1},
        pages = {1},
          doi = {10.1109/CLUSTER51413.2022.00072},
archivePrefix = {arXiv},
       eprint = {2207.12762},
 primaryClass = {cs.DC},
       adsurl = {https://ui.adsabs.harvard.edu/abs/2022clus.confE...1G},
      adsnote = {Provided by the SAO/NASA Astrophysics Data System}
}

GitHub Events

Total
  • Watch event: 3
  • Push event: 1
Last Year
  • Watch event: 3
  • Push event: 1

Committers


All Time
  • Total Commits: 53
  • Total Committers: 1
  • Avg Commits per committer: 53.0
  • Development Distribution Score (DDS): 0.0
Past Year
  • Commits: 0
  • Committers: 0
  • Avg Commits per committer: 0.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
Mosè Giordano m****e@g****g 53
Committer Domains (Top 20 + Academic)
gnu.org: 1

Issues and Pull Requests


All Time
  • Total issues: 2
  • Total pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Total issue authors: 1
  • Total pull request authors: 0
  • Average comments per issue: 4.5
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • carstenbauer (2)