Earlier this year, I took a month to reexamine my coding habits and rethink some past design choices. I hope to rewrite and improve my FOSS libraries this year, and I needed answers to a few questions first. Perhaps some of these questions will resonate with others in the community, too.
- Are coroutines viable for high-performance work?
- Should I use SIMD intrinsics for clarity or drop to assembly for easier library distribution?
- Has hardware caught up with vectorized scatter/gather in AVX-512 & SVE?
- How do secure enclaves & pointer tagging differ on Intel, Arm, & AMD?
- What's the throughput gap between CPU and GPU Tensor Cores (TCs)?
- How costly are misaligned memory accesses & split-loads, and what gains do non-temporal loads/stores offer?
- Which parts of the standard library hit performance hardest?
- How do error-handling strategies compare overhead-wise? (A benchmark sketch follows this list.)
- What's the compile-time vs. run-time trade-off for lazily evaluated ranges?
- What practical, non-trivial use cases exist for meta-programming?
- How challenging is Linux Kernel bypass with io_uring vs. POSIX sockets?
- How close are we to effectively using Networking TS or heterogeneous Executors in C++?
- What are best practices for propagating stateful allocators in nested containers, and which libraries support them?
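As a taste of how such questions get answered in benchmark form, here is a hedged sketch of the error-handling comparison using Google Benchmark: a failure path signalled by a plain return code versus a thrown-and-caught exception. The function and benchmark names are mine for illustration, not the repository's, and real results vary wildly with how often the failure path is actually taken.

```cpp
#include <benchmark/benchmark.h>
#include <stdexcept>

// Failure reported through a return code.
static int parse_with_codes(int x, int &out) {
    if (x < 0) return -1; // failure path
    out = x * 2;
    return 0;
}

// Failure reported through an exception.
static int parse_with_exceptions(int x) {
    if (x < 0) throw std::invalid_argument("negative");
    return x * 2;
}

static void bm_error_codes(benchmark::State &state) {
    int out = 0;
    for (auto _ : state) {
        int rc = parse_with_codes(-1, out);
        benchmark::DoNotOptimize(rc); // keep the call from being optimized away
    }
}
BENCHMARK(bm_error_codes);

static void bm_exceptions(benchmark::State &state) {
    for (auto _ : state) {
        int result = 0;
        try { result = parse_with_exceptions(-1); }
        catch (std::invalid_argument const &) { result = -1; }
        benchmark::DoNotOptimize(result);
    }
}
BENCHMARK(bm_exceptions);

BENCHMARK_MAIN();
```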
These questions span from micro-kernel optimizations (nanoseconds) to distributed systems (micro/millisecond latencies). Rather than tackling them all in one post, I compiled my explorations into a repository, extending my previous Google Benchmark tutorial (https://ashvardanian.com/posts/google-benchmark), to serve as a sandbox for performance experimentation.

Some fun observations:
- Compilers now vectorize 3x3x3 and 4x4x4 single/double-precision matrix multiplications well! The 3x3x3 kernel is only ~60% slower than the 4x4x4 one despite needing ~70% fewer operations, and the auto-vectorized code outperforms my vanilla SSE/AVX and comes within 10% of AVX-512 (a plain-loop sketch follows this list).
- Nvidia TCs vary dramatically across generations in numeric types, throughput, tile shapes, thread synchronization (thread/quad-pair/warp/warp-groups), and operand storage. Post-Volta, manual PTX is often needed (as intrinsics lag), though the new TileIR (introduced at GTC) promises improvements for dense linear algebra kernels.
- The AI wave drives CPUs and GPUs to converge in mat-mul throughput & programming complexity. It took me a day to debug TMM register initialization (a setup sketch follows this list), and SME is equally odd. Sierra Forest packs 288 cores per socket, and AVX10.2 drops the 256-bit-only option in favor of mandatory 512-bit vectors... I wonder if discrete Intel GPUs are even needed, given CPU advances?
- In common floating-point ranges, scalar sine approximations can be up to 40x faster than standard-library implementations, even without SIMD (a toy polynomial version follows this list). It's a bit hand-wavy, though; I wish more projects documented error bounds and shipped 1 & 3.5 ULP variants like Sleef.
- Meta-programming tools like CTRE can outperform typical RegEx engines by 5x and simplify building parsers compared to hand-crafted FSMs (a small usage example follows this list).
- Kernel-bypass stacks (DPDK/SPDK) and io_uring were once clearly distinct in complexity and performance, but the gap is narrowing. Even pre-5.5 io_uring can boost UDP throughput by 4x on loopback IO, though the newer zero-copy and concurrency optimizations remain challenging to use (a minimal receive loop follows this list).
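For context on the auto-vectorization claim, this is roughly the kind of plain-loop 4x4 kernel in question; the function is my own illustration, not code from the repository, and the result depends on something like `-O3 -march=native` and the specific compiler.

```cpp
// A fixed-size 4x4x4 single-precision multiplication with plain loops.
// Modern GCC/Clang fully unroll and vectorize this nest; no intrinsics needed.
void matmul_4x4(float const (&a)[4][4], float const (&b)[4][4], float (&c)[4][4]) {
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j) {
            float sum = 0.f;
            for (int k = 0; k < 4; ++k)
                sum += a[i][k] * b[k][j];
            c[i][j] = sum;
        }
}
```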
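On the TMM-initialization pain: below is a minimal sketch of the setup dance AMX requires, assuming Linux 5.16+ on a Sapphire-Rapids-class CPU and compilation with `-mamx-tile`. The constants follow the kernel UAPI and Intel's intrinsics guide; the helper name and tile layout are only for illustration.

```cpp
#include <cstdint>
#include <immintrin.h>
#include <sys/syscall.h>
#include <unistd.h>

constexpr int ARCH_REQ_XCOMP_PERM = 0x1023; // arch_prctl: request extended-state permission
constexpr int XFEATURE_XTILEDATA = 18;      // AMX tile-data component

// 64-byte tile configuration consumed by `_tile_loadconfig`.
struct alignas(64) tile_config_t {
    std::uint8_t palette_id;
    std::uint8_t start_row;
    std::uint8_t reserved[14];
    std::uint16_t colsb[16]; // bytes per row of each tile
    std::uint8_t rows[16];   // rows of each tile
};

bool setup_amx() {
    // Without this permission request the first tile instruction raises SIGILL —
    // exactly the kind of thing that eats a day of debugging.
    if (syscall(SYS_arch_prctl, ARCH_REQ_XCOMP_PERM, XFEATURE_XTILEDATA) != 0)
        return false;

    tile_config_t cfg {};
    cfg.palette_id = 1;          // palette 1: up to eight 16x64-byte tiles
    for (int t = 0; t < 3; ++t) {
        cfg.rows[t] = 16;
        cfg.colsb[t] = 64;
    }
    _tile_loadconfig(&cfg);      // program TMM0..TMM2
    return true;
    // ...later: _tile_loadd / _tile_dpbssd / _tile_stored, then _tile_release().
}
```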
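To make the sine-approximation point concrete, here is a toy odd-polynomial version. It uses plain Taylor coefficients rather than the minimax ones a real library like Sleef ships, is only sensible on a narrow range such as [-π/4, π/4], and its error bound is exactly the kind of thing the bullet above argues should be documented.

```cpp
// sin(x) ≈ x - x³/3! + x⁵/5! - x⁷/7!, evaluated with Horner's scheme.
// No range reduction, no NaN/Inf handling — a sketch, not a library routine.
inline float sine_poly(float x) {
    float const x2 = x * x;
    return x * (1.f + x2 * (-1.f / 6 + x2 * (1.f / 120 - x2 / 5040)));
}
```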
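And the CTRE flavor of meta-programming, paraphrasing the library's documented `ctre::match` API: the pattern is compiled into a finite-state machine at build time, so a malformed pattern is a compile error rather than a runtime exception. The function below is a made-up example, not code from the repository; it needs C++20 and the single-header `ctre.hpp`.

```cpp
#include <ctre.hpp>
#include <optional>
#include <string_view>

// Extract the digits from strings like "issue-1234".
std::optional<std::string_view> issue_number(std::string_view s) {
    if (auto m = ctre::match<"issue-([0-9]+)">(s))
        return m.get<1>().to_view(); // first capture group as a string_view
    return std::nullopt;
}
```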
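Finally, a bare-bones liburing receive loop to show what the io_uring side of that comparison looks like: queue setup, one SQE per receive, a single submit syscall, and completions reaped from the shared ring. Buffer registration, batching, and the newer zero-copy paths are omitted, and the function is illustrative rather than taken from the benchmarks.

```cpp
#include <liburing.h>
#include <cstdio>

int drain_udp_socket(int sock_fd) {
    io_uring ring;
    if (io_uring_queue_init(256, &ring, 0) < 0) return -1;

    char buffer[2048];
    for (;;) {
        io_uring_sqe *sqe = io_uring_get_sqe(&ring);      // grab a submission slot
        io_uring_prep_recv(sqe, sock_fd, buffer, sizeof(buffer), 0);
        io_uring_submit(&ring);                           // one syscall for the batch

        io_uring_cqe *cqe;
        if (io_uring_wait_cqe(&ring, &cqe) < 0) break;    // block for a completion
        int received = cqe->res;                          // bytes read, or -errno
        io_uring_cqe_seen(&ring, cqe);                    // mark the CQE consumed
        if (received <= 0) break;
        std::printf("received %d bytes\n", received);
    }
    io_uring_queue_exit(&ring);
    return 0;
}
```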
The repository is loaded with links to favorite CppCon lectures, GitHub snippets, and tech blog posts. Recognizing that many high-level concepts are handled differently across languages, I've also started porting examples to Rust & Python in separate repos. Coroutines look bad everywhere :(

Overall, this research project was rewarding! Most questions found answers in code; only pointer tagging and secure enclaves still elude me in the public cloud. I'd love to hear from others, especially on comparing High-Level Synthesis for small matrix multiplications on FPGAs versus hand-written VHDL/Verilog for integral types. Let me know if you have ideas for other cool, obscure topics to cover!
Comments URL: https://news.ycombinator.com/item?id=43727743
Points: 37
# Comments: 2