Earlier this year, I took a month to reexamine my coding habits and rethink some past design choices. I hope to rewrite and improve my FOSS libraries this year, and I needed answers to a few questions first. Perhaps some of these questions will resonate with others in the community, too.
- Are coroutines viable for high-performance work?
- Should I use SIMD intrinsics for clarity or drop to assembly for easier library distribution?
- Has hardware caught up with vectorized scatter/gather in AVX-512 & SVE?
- How do secure enclaves & pointer tagging differ on Intel, Arm, & AMD?
- What's the throughput gap between CPU and GPU Tensor Cores (TCs)?
- How costly are misaligned memory accesses & split-loads, and what gains do non-temporal loads/stores offer?
- Which parts of the standard library hit performance hardest?
- How do error-handling strategies compare overhead-wise? (A benchmark sketch follows this list.)
- What's the compile-time vs. run-time trade-off for lazily evaluated ranges?
- What practical, non-trivial use cases exist for meta-programming?
- How challenging is Linux Kernel bypass with io_uring vs. POSIX sockets?
- How close are we to effectively using Networking TS or heterogeneous Executors in C++?
- What are best practices for propagating stateful allocators in nested containers, and which libraries support them?
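As a taste of how such questions get answered in benchmark form, here is a hedged sketch of the error-handling comparison using Google Benchmark: a failure path signalled by a plain return code versus a thrown-and-caught exception. The function and benchmark names are mine for illustration, not the repository's, and real results vary wildly with how often the failure path is actually taken.

```cpp
#include <benchmark/benchmark.h>
#include <stdexcept>

// Failure reported through a return code.
static int parse_with_codes(int x, int &out) {
    if (x < 0) return -1; // failure path
    out = x * 2;
    return 0;
}

// Failure reported through an exception.
static int parse_with_exceptions(int x) {
    if (x < 0) throw std::invalid_argument("negative");
    return x * 2;
}

static void bm_error_codes(benchmark::State &state) {
    int out = 0;
    for (auto _ : state) {
        int rc = parse_with_codes(-1, out);
        benchmark::DoNotOptimize(rc); // keep the call from being optimized away
    }
}
BENCHMARK(bm_error_codes);

static void bm_exceptions(benchmark::State &state) {
    for (auto _ : state) {
        int result = 0;
        try { result = parse_with_exceptions(-1); }
        catch (std::invalid_argument const &) { result = -1; }
        benchmark::DoNotOptimize(result);
    }
}
BENCHMARK(bm_exceptions);

BENCHMARK_MAIN();
```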
These questions span from micro-kernel optimizations (nanoseconds) to distributed systems (micro/millisecond latencies). Rather than tackling them all in one post, I compiled my explorations into a repository, extending my previous Google Benchmark tutorial (https://ashvardanian.com/posts/google-benchmark), to serve as a sandbox for performance experimentation.

Some fun observations:
- Compilers now vectorize 3x3x3 and 4x4x4 single/double-precision matrix multiplications well! The 3x3x3 kernel is only ~60% slower than the 4x4x4 one despite needing ~70% fewer operations, and the auto-vectorized code outperforms my vanilla SSE/AVX and comes within 10% of AVX-512 (a plain-loop sketch follows this list).
- Nvidia TCs vary dramatically across generations in numeric types, throughput, tile shapes, thread synchronization (thread/quad-pair/warp/warp-groups), and operand storage. Post-Volta, manual PTX is often needed (as intrinsics lag), though the new TileIR (introduced at GTC) promises improvements for dense linear algebra kernels.
- The AI wave drives CPUs and GPUs to converge in mat-mul throughput & programming complexity. It took me a day to debug TMM register initialization (a setup sketch follows this list), and SME is equally odd. Sierra Forest packs 288 cores per socket, and AVX10.2 drops the 256-bit-only option in favor of mandatory 512-bit vectors... I wonder if discrete Intel GPUs are even needed, given CPU advances?
- In common floating-point ranges, scalar sine approximations can be up to 40x faster than standard-library implementations, even without SIMD (a toy polynomial version follows this list). It's a bit hand-wavy, though; I wish more projects documented error bounds and shipped 1 & 3.5 ULP variants like Sleef.
- Meta-programming tools like CTRE can outperform typical RegEx engines by 5x and simplify building parsers compared to hand-crafted FSMs (a small usage example follows this list).
- Kernel-bypass stacks (DPDK/SPDK) and io_uring were once clearly distinct in complexity and performance, but the gap is narrowing. Even pre-5.5 io_uring can boost UDP throughput by 4x on loopback IO, though the newer zero-copy and concurrency optimizations remain challenging to use (a minimal receive loop follows this list).
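For context on the auto-vectorization claim, this is roughly the kind of plain-loop 4x4 kernel in question; the function is my own illustration, not code from the repository, and the result depends on something like `-O3 -march=native` and the specific compiler.

```cpp
// A fixed-size 4x4x4 single-precision multiplication with plain loops.
// Modern GCC/Clang fully unroll and vectorize this nest; no intrinsics needed.
void matmul_4x4(float const (&a)[4][4], float const (&b)[4][4], float (&c)[4][4]) {
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j) {
            float sum = 0.f;
            for (int k = 0; k < 4; ++k)
                sum += a[i][k] * b[k][j];
            c[i][j] = sum;
        }
}
```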
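On the TMM-initialization pain: below is a minimal sketch of the setup dance AMX requires, assuming Linux 5.16+ on a Sapphire-Rapids-class CPU and compilation with `-mamx-tile`. The constants follow the kernel UAPI and Intel's intrinsics guide; the helper name and tile layout are only for illustration.

```cpp
#include <cstdint>
#include <immintrin.h>
#include <sys/syscall.h>
#include <unistd.h>

constexpr int ARCH_REQ_XCOMP_PERM = 0x1023; // arch_prctl: request extended-state permission
constexpr int XFEATURE_XTILEDATA = 18;      // AMX tile-data component

// 64-byte tile configuration consumed by `_tile_loadconfig`.
struct alignas(64) tile_config_t {
    std::uint8_t palette_id;
    std::uint8_t start_row;
    std::uint8_t reserved[14];
    std::uint16_t colsb[16]; // bytes per row of each tile
    std::uint8_t rows[16];   // rows of each tile
};

bool setup_amx() {
    // Without this permission request the first tile instruction raises SIGILL —
    // exactly the kind of thing that eats a day of debugging.
    if (syscall(SYS_arch_prctl, ARCH_REQ_XCOMP_PERM, XFEATURE_XTILEDATA) != 0)
        return false;

    tile_config_t cfg {};
    cfg.palette_id = 1;          // palette 1: up to eight 16x64-byte tiles
    for (int t = 0; t < 3; ++t) {
        cfg.rows[t] = 16;
        cfg.colsb[t] = 64;
    }
    _tile_loadconfig(&cfg);      // program TMM0..TMM2
    return true;
    // ...later: _tile_loadd / _tile_dpbssd / _tile_stored, then _tile_release().
}
```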
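To make the sine-approximation point concrete, here is a toy odd-polynomial version. It uses plain Taylor coefficients rather than the minimax ones a real library like Sleef ships, is only sensible on a narrow range such as [-π/4, π/4], and its error bound is exactly the kind of thing the bullet above argues should be documented.

```cpp
// sin(x) ≈ x - x³/3! + x⁵/5! - x⁷/7!, evaluated with Horner's scheme.
// No range reduction, no NaN/Inf handling — a sketch, not a library routine.
inline float sine_poly(float x) {
    float const x2 = x * x;
    return x * (1.f + x2 * (-1.f / 6 + x2 * (1.f / 120 - x2 / 5040)));
}
```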
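And the CTRE flavor of meta-programming, paraphrasing the library's documented `ctre::match` API: the pattern is compiled into a finite-state machine at build time, so a malformed pattern is a compile error rather than a runtime exception. The function below is a made-up example, not code from the repository; it needs C++20 and the single-header `ctre.hpp`.

```cpp
#include <ctre.hpp>
#include <optional>
#include <string_view>

// Extract the digits from strings like "issue-1234".
std::optional<std::string_view> issue_number(std::string_view s) {
    if (auto m = ctre::match<"issue-([0-9]+)">(s))
        return m.get<1>().to_view(); // first capture group as a string_view
    return std::nullopt;
}
```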
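Finally, a bare-bones liburing receive loop to show what the io_uring side of that comparison looks like: queue setup, one SQE per receive, a single submit syscall, and completions reaped from the shared ring. Buffer registration, batching, and the newer zero-copy paths are omitted, and the function is illustrative rather than taken from the benchmarks.

```cpp
#include <liburing.h>
#include <cstdio>

int drain_udp_socket(int sock_fd) {
    io_uring ring;
    if (io_uring_queue_init(256, &ring, 0) < 0) return -1;

    char buffer[2048];
    for (;;) {
        io_uring_sqe *sqe = io_uring_get_sqe(&ring);      // grab a submission slot
        io_uring_prep_recv(sqe, sock_fd, buffer, sizeof(buffer), 0);
        io_uring_submit(&ring);                           // one syscall for the batch

        io_uring_cqe *cqe;
        if (io_uring_wait_cqe(&ring, &cqe) < 0) break;    // block for a completion
        int received = cqe->res;                          // bytes read, or -errno
        io_uring_cqe_seen(&ring, cqe);                    // mark the CQE consumed
        if (received <= 0) break;
        std::printf("received %d bytes\n", received);
    }
    io_uring_queue_exit(&ring);
    return 0;
}
```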
The repository is loaded with links to favorite CppCon lectures, GitHub snippets, and tech blog posts. Recognizing that many high-level concepts are handled differently across languages, I've also started porting examples to Rust & Python in separate repos. Coroutines look bad everywhere :(

Overall, this research project was rewarding! Most questions found answers in code; only pointer tagging and secure enclaves still elude me in the public cloud. I'd love to hear from others, especially on comparing High-Level Synthesis for small matrix multiplications on FPGAs versus hand-written VHDL/Verilog for integral types. Let me know if you have ideas for other cool, obscure topics to cover!
Comments URL: https://news.ycombinator.com/item?id=43727743
Points: 37
# Comments: 2