Building for the Milk-V Duo

I use a Nix-based cross-compilation environment. The upstream SDK V2 images use a statically-compiled system (no glibc available), so to build for it you need to build static binaries. You can do this by setting up cross-compilation for the riscv64-unknown-linux-musl target and then passing -static to gcc/g++ for static compilation

Vector extension support

Compilation

The SG2002 CPU used by the Milk-V uses the T-Head Vector extensions, which are a proprietary extension of the pre-standard RISC-V Vector (RVV) extension 0.7.1. Yes, this is as bad an idea as it sounds, but I guess it’s hard to complain for a 20 Euro board.

If you would normally compile using -march=rv64gcv for RVV 1.0, you should compile with -march=rv64gc_xtheadvector for T-Head CPUs like the SG2002.

Once you go beyond the basics of RVV, you will find that the T-Head Vector extensions has some missing functionality/idiosynchrasies.

Inverse square root

One thing that is missing is support for the vfrsqrt7.v instruction for reciprocal square root estimations . Unfortunately, the regular square root instruction vfsqrt.v is pretty slow, so square root + division is out for some applications. Luckily, the good old fast inverse square root estimation still works:

// Newton-Raphson iteration for inverse square root:
// y' = y * (1.5 - 0.5 * a * y * y)
inline vfloat32m8_t rsqrt_newton_raphson(vfloat32m8_t y, vfloat32m8_t a, size_t vl) {
  auto tmp = __riscv_vfmul_vv_f32m8(y, y, vl);
  tmp = __riscv_vfmul_vv_f32m8(a, tmp, vl);
  tmp = __riscv_vfmul_vf_f32m8(tmp, 0.5f, vl);
  tmp = __riscv_vfrsub_vf_f32m8(tmp, 1.5f, vl);
  return __riscv_vfmul_vv_f32m8(y, tmp, vl);
}
 
// https://en.wikipedia.org/wiki/Fast_inverse_square_root
inline vfloat32m8_t fast_rsqrt(vfloat32m8_t x, size_t vl) {
  auto i = __riscv_vreinterpret_v_f32m8_u32m8(x);
  i = __riscv_vsrl_vx_u32m8(i, 1, vl);
  i = __riscv_vrsub_vx_u32m8(i, 0x5f3759df, vl);
  auto rsqrt = __riscv_vreinterpret_v_u32m8_f32m8(i);
 
  // Adjust number of Newton-Raphson iterations depending on the
  // required accuracy.
  rsqrt = rsqrt_newton_raphson(rsqrt, x, vl);
  rsqrt = rsqrt_newton_raphson(rsqrt, x, vl);
 
  return rsqrt;
}