_Daily_ hit pieces on Elon Musk (or Musk companies), going for something like a decade. These have petered out somewhat since he left DOGE. But they started way back before he should have had that much notoriety.
They have rightly been calling out the grift at Tesla. On the SpaceX front they've been his biggest cheerleader (even dismissing other stories, like the sexual harassment).
`_mm_alignr_epi8` is a compile-time known shuffle that gets optimized well by LLVM [1].
If you need the exact behavior of `pshufb` you can use asm or the llvm intrinsic [2]. iirc, I once got the compiler to emit a `pshufb` for a runtime shuffle... that always guaranteed indices in the 0..15 range?
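For reference, the exact `pshufb` behavior mentioned above can be modeled in scalar Rust (a sketch, not the intrinsic itself): each index byte with the high bit set zeroes that lane, otherwise the low 4 bits select a source byte.

```rust
/// Scalar model of SSSE3 `pshufb` semantics: a lane whose index byte has the
/// high bit set produces 0; otherwise the low 4 bits pick a source byte.
fn pshufb_model(src: [u8; 16], idx: [u8; 16]) -> [u8; 16] {
    let mut out = [0u8; 16];
    for i in 0..16 {
        out[i] = if idx[i] & 0x80 != 0 {
            0
        } else {
            src[(idx[i] & 0x0F) as usize]
        };
    }
    out
}

fn main() {
    let src: [u8; 16] = *b"abcdefghijklmnop";
    // Reverse the first 15 bytes; 0x80 in the last lane zeroes it.
    let idx: [u8; 16] = [15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0x80];
    let out = pshufb_model(src, idx);
    assert_eq!(&out[..15], b"ponmlkjihgfedcb");
    assert_eq!(out[15], 0);
    println!("ok");
}
```

This is also why a runtime `@shuffle`-style mask can only be lowered to a single `pshufb` when the compiler can prove the indices stay in range (or that the high-bit-zeroing behavior is acceptable).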
Ironically, I also wanted to try zig by doing a StreamVByte implementation, but got derailed by the lack of SSE/AVX intrinsics support.
Oh, that's actually quite neat, it did not occur to me that you can use @shuffle with a compile time mask and it will optimize it to a specialized instruction.
Massena Hospital is a 25-bed hospital. You might have to go to Canton or Ogdensburg for a family doctor (45 minutes by car). Most serious cases get referred to Syracuse or Burlington (3 hours away by car).
AFAIK, cost[1] is "theoretically" nothing if annual income is less than the federal poverty line ($15,650 for an individual), and might as well be free for an income up to $39,125.
Alcoa (Aluminum Smelter, *cheap electricity*) was the major industry in the area. Massena plant now produces 85% less aluminum compared to ~15 years ago (AFAICT), leading to something of a ghost town (and cheap housing).
Limited internet connections (speed and/or data caps). Something like HughesNet (satellite ISP) couldn't stream more than 240p from YouTube during peak times. The data cap coerced users into doing their downloads between 2am and 6am.
The wrapping version uses vpandn + vpaddb (i.e. `acc += 1 & ~elt`). On Intel since Haswell (2013), on ymm inputs, that can manage 1.5 iterations per cycle if unrolled 2x to reduce the dependency chain.
Whereas vpsadbw would limit it to 1 iteration per cycle on Intel.
On AMD, vpsadbw is still worse on Zen 2 and earlier, but on Zen 3 and later the two approaches are equal.
On AVX-512 the two approaches are equivalent everywhere as far as uops.info data goes.
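The vpandn + vpaddb idiom above can be sketched per byte lane in scalar Rust (names hypothetical): comparison masks are 0x00 or 0xFF per lane, so `1 & !mask` is 1 exactly when the mask is clear, and each lane accumulates with a wrapping byte add (the byte counters must be flushed to a wider accumulator before 256 iterations).

```rust
/// Scalar sketch of the vpandn + vpaddb counting idiom over 32-byte "vectors":
/// each input is a per-lane comparison mask (0x00 or 0xFF). `1 & !mask` adds 1
/// for lanes where the mask is clear, via a wrapping byte add.
fn count_lanes_clear(masks: &[[u8; 32]]) -> [u8; 32] {
    let mut acc = [0u8; 32];
    for m in masks {
        for lane in 0..32 {
            // vpandn: !m & 1; vpaddb: wrapping byte add.
            acc[lane] = acc[lane].wrapping_add(1 & !m[lane]);
        }
    }
    acc
}

fn main() {
    let all_set = [0xFFu8; 32];
    let all_clear = [0x00u8; 32];
    let acc = count_lanes_clear(&[all_set, all_clear, all_clear]);
    assert!(acc.iter().all(|&c| c == 2));
    println!("ok");
}
```

The vpsadbw alternative instead sums the byte lanes into 64-bit counters every iteration, which is what caps it at 1 iteration per cycle on Intel.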
It has no need for that. count_if is a fold/reduce operation where the accumulator is simply incremented by `(int)some_condition(x)` for all x. In Rust:
let arr = [1, 3, 4, 6, 7, 0, 9, -4];
let n_evens = arr.iter().fold(0, |acc, i| acc + (i & 1 == 0) as usize);
assert_eq!(n_evens, 4);
I know that. But that’s still a different interface. If you have a predicate you now have to wrap that in a different closure that conforms it to a new pattern.
This is the same argument as why have count_if if I can write a for loop.
Sure. But at least I interpreted the GP as just saying that the "count-if" operation can be implemented in terms of `reduce` if the latter is available.
Why not use regular rejection sampling when `limit` is known at compile time?
Does fastrange[1] have fewer rejections due to any excess random bits[2]?
Fastrange is slightly biased because, as Steve Canon observes in that Swift PR, it is just Knuth’s multiplicative reduction. The point of this post is that it’s possible to simplify Lemire’s nearly-divisionless debiasing when the limit is known at compile time.
Lemire’s algorithm rejects the fewest possible samples from the random number generator, so it’s generally the fastest. The multiplication costs very little compared to the RNG.
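For concreteness, Lemire's nearly-divisionless method can be sketched like this (a sketch under my reading of the paper; `next_u32` is an assumed RNG callback, not any particular library API). The expensive `%` only runs on the rare slow path, which is what makes it nearly divisionless:

```rust
/// Sketch of Lemire's nearly-divisionless bounded sampling: returns a value
/// uniform in [0, limit). The modulo only runs on the rare slow path.
fn bounded(limit: u32, mut next_u32: impl FnMut() -> u32) -> u32 {
    let mut m = next_u32() as u64 * limit as u64;
    if (m as u32) < limit {
        // Slow path: t = (2^32 - limit) % limit = 2^32 mod limit.
        let t = limit.wrapping_neg() % limit;
        while (m as u32) < t {
            m = next_u32() as u64 * limit as u64;
        }
    }
    (m >> 32) as u32
}

fn main() {
    // Deterministic stand-in RNG (hypothetical constants), just to exercise it.
    let mut state = 0x9E37_79B9u32;
    let mut rng = move || {
        state = state.wrapping_mul(747_796_405).wrapping_add(2_891_336_453);
        state
    };
    for _ in 0..1000 {
        assert!(bounded(10, &mut rng) < 10);
    }
    println!("ok");
}
```

When `limit` is a compile-time constant, the `% limit` on the slow path becomes a constant threshold, which is the simplification the post is about.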
That snippet would not reject bad chars; are non-digits rejected somewhere else?
A simple scalar loop that multiplies an accumulator by 10, atoi() style, would be faster.
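Such a loop, with the non-digit rejection the sibling comment asks about, might look like this (a minimal sketch; the function name is mine):

```rust
/// Minimal atoi-style parser: multiplies an accumulator by 10 per digit,
/// rejecting non-digit input and overflow via checked arithmetic.
fn parse_u32(s: &str) -> Option<u32> {
    if s.is_empty() {
        return None;
    }
    let mut acc: u32 = 0;
    for b in s.bytes() {
        if !b.is_ascii_digit() {
            return None; // reject bad chars here
        }
        acc = acc.checked_mul(10)?.checked_add((b - b'0') as u32)?;
    }
    Some(acc)
}

fn main() {
    assert_eq!(parse_u32("12345"), Some(12345));
    assert_eq!(parse_u32("12a45"), None);
    assert_eq!(parse_u32(""), None);
    println!("ok");
}
```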