Lichee Pi 4A: Serious RISC-V Desktop Computing [video]

imtringued · on Aug 20, 2023

I want to correct a recent post I made about the StarFive VisionFive. That board is only competitive against the Raspberry Pi 3. This one actually is on a level playing field with the Raspberry Pi 4.

https://sipeed.com/licheepi4a

In practice, the lack of vector instructions significantly limits the performance of these processors, so they tend to perform poorly in Geekbench and the Phoronix benchmarks.

Anyway, the reason RISC-V will beat ARM has little to do with the merits of RISC-V itself. ARM is a sclerotic company that is falling behind its competition. RISC-V as an architecture does nothing to beat ARM; what will finish off ARM is the fact that everyone has become an ARM competitor.

x86 is still a significant threat to both ARM and RISC-V in the datacenter, partly because there is at least a semblance of competition between Intel and AMD.

brucehoult · on Aug 20, 2023

Note that this board, the Lichee Pi 4A, has pretty good vector units, though they implement the draft version of RVV (0.7.1) that was current when the core was designed in 2019. Significant incompatible changes were made before RVV 1.0 was ratified in November 2021 -- the general design is the same, and most instructions and binary opcodes are the same, but both source-code and binary incompatibilities exist.

But at the same time, much useful library code such as memcpy(), strlen(), strcpy(), strcmp(), etc. can be binary compatible (and optimal) for both.

Most toolchain and library effort is going into 1.0, of course, but if you buy one of these and want to write vectorised code you can do so using assembly language, which is much easier than using typical SIMD ISAs.

For example, here is an optimised memcpy (dst, src, len in a0, a1, a2):

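    memcpy:                         # a0 = dst, a1 = src, a2 = len
        mv      a3, a0              # a3 = dst cursor; a0 stays as the return value
    1:
        vsetvli a4, a2, e8, m4, d1  # a4 = bytes this pass, at most VLMAX (RVV 0.7.1 syntax)
        vlb.v   v0, (a1)            # load a4 bytes from src
        add     a1, a1, a4          # advance src
        sub     a2, a2, a4          # decrement remaining length
        vsb.v   v0, (a3)            # store a4 bytes to dst
        add     a3, a3, a4          # advance dst
        bnez    a2, 1b              # loop until nothing remains
        ret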
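
For readers who don't write RVV, here is a scalar C model of the same strip-mining loop. It is only a sketch of the control flow, not of how the hardware works: `model_vsetvli`, `memcpy_stripmined` and `VLMAX_E8_M4` are hypothetical names, and the 64-byte limit assumes VLEN = 128 bits with LMUL = 4 (real hardware reports its own limit through vsetvli). Each statement is annotated with the instruction it stands in for.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumption: VLEN = 128 bits, so with e8 (byte elements) and m4 (four
 * registers grouped) one pass moves at most 4 * 16 = 64 bytes. */
#define VLMAX_E8_M4 64

/* Models vsetvli a4,a2,e8,m4: grant min(requested, VLMAX) elements. */
static size_t model_vsetvli(size_t requested) {
    size_t vl = requested < VLMAX_E8_M4 ? requested : VLMAX_E8_M4;
    assert(vl <= VLMAX_E8_M4);
    return vl;
}

/* Scalar model of the loop: d = a3, s = a1, len = a2, vl = a4. */
void *memcpy_stripmined(void *dst, const void *src, size_t len) {
    unsigned char *d = dst;             /* mv a3,a0 */
    const unsigned char *s = src;
    do {
        size_t vl = model_vsetvli(len); /* vsetvli a4,a2,e8,m4,d1 */
        unsigned char v0[VLMAX_E8_M4];  /* stands in for the v0..v3 group */
        memcpy(v0, s, vl);              /* vlb.v v0,(a1) */
        s += vl;                        /* add a1,a1,a4 */
        len -= vl;                      /* sub a2,a2,a4 */
        memcpy(d, v0, vl);              /* vsb.v v0,(a3) */
        d += vl;                        /* add a3,a3,a4 */
    } while (len != 0);                 /* bnez a2,1b */
    return dst;                         /* a0 still holds dst */
}
```

Because vsetvli returns however many elements fit, the final short chunk goes through the same loop body. There is no separate scalar tail loop of the kind fixed-width SIMD code usually needs.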

RetroTechie · on Aug 20, 2023

The performance issues are twofold:

a) As mentioned, vector extensions. Without them, some benchmarks take a big performance hit.

b) General software optimizations (compilers and GPU drivers in particular). A lot of these are still on the table, so the situation should improve as software support matures.

Getting boards out there helps a lot. Developers can't fix software support without hardware to test on.

RetroTechie · on Aug 20, 2023

Nice to see more RISC-V boards in the wild!

Just the other day I got a VisionFive 2. :-)) In comparison:

The Lichee Pi should be faster (+50%? 2x? 3x?), and it offers up to 16 GB RAM vs. 4/8 GB on the VF2.

The Lichee Pi uses a compute module that requires a carrier board. That is more flexible, but if you only get/use one, module + carrier is more expensive than a single board.

VF2 GPU: IMG BXE-4-32. Lichee Pi: ?? Anyone?

Much will depend on driver support, though. If there is no (working) driver for feature X, then feature X might as well be absent.

brucehoult · on Aug 20, 2023

TH1520 (Lichee Pi 4A, OoO cores similar to Arm A72) beats JH7110 (VisionFive 2 and others, in-order cores similar to Arm A55) on all micro-benchmarks.

e.g. by about 40% on my branchy primes benchmark, which exercises the CPU and L1 cache only: https://hoult.org/primes.txt

    10.430 sec Sipeed LM4A TH1520 4x C910 @1.848 GHz 216 bytes  19.3 billion clocks
    10.851 sec Sophon SG2042 64x C910 RV64 @1.8? GHz 216 bytes  19.3 billion clocks
    12.115 sec Pi4 Cortex A72 @ 1.5 GHz A64          300 bytes  18.2 billion clocks
    14.885 sec VisionFive 2 U74 _zba_zbb @ 1.5 GHz   214 bytes  22.3 billion clocks
    19.500 sec Odroid C2 A53 @ 1.536 GHz A64         276 bytes  30.0 billion clocks
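
The "billion clocks" column is just run time multiplied by clock frequency. As a sanity check, the sketch below recomputes it from the table's own numbers, along with each board's wall-clock speedup over the VisionFive 2. The struct and function names are hypothetical, and the SG2042 row is left out because its clock is listed as uncertain ("1.8?").

```c
#include <assert.h>
#include <stdio.h>

/* One row of the primes benchmark table above. */
struct result {
    const char *board;
    double seconds; /* wall-clock run time */
    double ghz;     /* core clock */
};

/* seconds * (ghz * 1e9 cycles/s) / 1e9 = billions of cycles */
static double billion_clocks(const struct result *r) {
    return r->seconds * r->ghz;
}

/* Wall-clock speedup of `fast` over `slow` (1.0 = equal). */
static double speedup(const struct result *fast, const struct result *slow) {
    assert(fast->seconds > 0.0);
    return slow->seconds / fast->seconds;
}

/* SG2042 omitted: its clock is listed as "1.8?". */
static const struct result results[] = {
    {"LM4A TH1520",  10.430, 1.848},
    {"Pi4 A72",      12.115, 1.5},
    {"VisionFive 2", 14.885, 1.5},
    {"Odroid C2",    19.500, 1.536},
};

void print_table(void) {
    const struct result *vf2 = &results[2];
    for (size_t i = 0; i < sizeof results / sizeof results[0]; i++)
        printf("%-13s %5.1f billion clocks, %.2fx vs VisionFive 2\n",
               results[i].board, billion_clocks(&results[i]),
               speedup(&results[i], vf2));
}
```

Calling print_table() reproduces the table's clock column (19.3, 18.2, 22.3, 30.0) and shows the TH1520 at about 1.43x the VisionFive 2, which matches the "about 40%" figure.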

The TH1520 has much faster memcpy speeds at every level of cache and DRAM.

https://hoult.org/JH7110_memcpy.txt

And yet ... both Richard Jones at Fedora and I have found that the VisionFive 2 is actually slightly faster at building software packages!

My result was that building the same binutils + gcc + newlib snapshot (an old one with RVV 0.7.1 support)...

https://github.com/brucehoult/riscv-gnu-toolchain

... the VisionFive 2 takes 108 minutes while the Lichee Pi 4A takes 122 minutes.

That's with the supplied fan on the LPi4A (and I confirmed it's not throttling) and no cooling at all on the VisionFive 2. I used the same Samsung external USB3 SSD on both. The VisionFive 2 gets slightly faster transfer speeds with it (IIRC 190 MB/s vs 160), but that's not enough to matter: just a 12 s difference in the time to tar up the source directory, compared to a 14-minute difference in build time. Both have enough RAM to cache everything anyway.
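
For scale, here is a quick check of the numbers above. The 122- vs 108-minute build makes the Lichee Pi 4A about 13% slower, and the 12 s tar difference accounts for under 1.5% of the 14-minute gap. The function names are hypothetical.

```c
#include <assert.h>

/* How much slower `slow_min` is than `fast_min`, in percent. */
double percent_slower(double slow_min, double fast_min) {
    assert(fast_min > 0.0);
    return (slow_min - fast_min) / fast_min * 100.0;
}

/* Share of a build-time gap (minutes) explained by an I/O gap (seconds), in percent. */
double share_of_gap(double io_gap_s, double build_gap_min) {
    assert(build_gap_min > 0.0);
    return io_gap_s / (build_gap_min * 60.0) * 100.0;
}
```

With the figures from the post, percent_slower(122.0, 108.0) is about 12.96 and share_of_gap(12.0, 14.0) is about 1.43.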

> VF2 GPU: IMG BXE-4-32. Lichee Pi: ?? Anyone?

BXM-4-64

mycall · on Aug 22, 2023

Have you done any watt measurements on the LPi4A? I'm curious what it pulls at idle or other use cases (excluding most of the ports)

EDIT: I found that information here [1]

[1] https://wiki.sipeed.com/hardware/en/lichee/th1520/lpi4a/10_t...

snvzz · on Aug 22, 2023

Remarkably, it is way higher than the VisionFive 2 / JH7110.

camel-cdr · on Aug 20, 2023

Here is the source code* for the CPU:

https://github.com/T-head-Semi/openc910

* AFAIK they didn't open-source the pre-ratification vector extension implementation that ships in the taped-out chip.

gardenfelder · on Aug 20, 2023

https://liliputing.com/sipeed-lichee-pi-4a-is-a-modular-risc...

fithisux · on Aug 23, 2023

Is it serious to use non-standard vector extensions?

Unless you can upgrade your CPU :-)