We implemented a sophisticated matrix multiplication engine in CubeCL that rivals the performance of cuBLAS and CUTLASS while supporting a wider range of GPUs. Leveraging double buffering, tensor cores, and vectorization, it compiles seamlessly to CUDA, ROCm, WebGPU, Metal, and Vulkan backends without relying on proprietary or third-party binaries. Matrix multiplication is central to modern AI workloads, especially transformers, and optimizing it ourselves was essential to enable kernel fusion and achieve state-of-the-art performance across platforms within our deep learning framework.
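To give a feel for one of the techniques mentioned above, here is a minimal, generic CUDA sketch of shared-memory tiling with double buffering for C = A × B. This is not CubeCL code and not the engine's actual implementation; it assumes row-major matrices whose dimensions are multiples of the tile size, and it omits the tensor-core and vectorized-load paths the engine also uses.

```cuda
#include <cuda_runtime.h>

constexpr int TILE = 32;

// Double-buffered tiled matmul: while one shared-memory buffer is being
// consumed by the inner product loop, the next K-tile is loaded into the
// other buffer, helping hide global-memory latency behind compute.
__global__ void matmul_double_buffered(const float* A, const float* B,
                                       float* C, int M, int N, int K) {
    __shared__ float As[2][TILE][TILE];
    __shared__ float Bs[2][TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;

    float acc = 0.0f;
    int numTiles = K / TILE;  // assumes K is a multiple of TILE

    // Preload the first K-tile into buffer 0.
    As[0][threadIdx.y][threadIdx.x] = A[row * K + threadIdx.x];
    Bs[0][threadIdx.y][threadIdx.x] = B[threadIdx.y * N + col];
    __syncthreads();

    for (int t = 0; t < numTiles; ++t) {
        int cur = t & 1;
        int nxt = 1 - cur;

        // Prefetch the next tile into the idle buffer while the current
        // buffer is used for computation below.
        if (t + 1 < numTiles) {
            int kOff = (t + 1) * TILE;
            As[nxt][threadIdx.y][threadIdx.x] = A[row * K + kOff + threadIdx.x];
            Bs[nxt][threadIdx.y][threadIdx.x] = B[(kOff + threadIdx.y) * N + col];
        }

        // Partial dot product over the current tile.
        for (int k = 0; k < TILE; ++k) {
            acc += As[cur][threadIdx.y][k] * Bs[cur][k][threadIdx.x];
        }

        // Make the prefetched tile visible (and the consumed one reusable)
        // before the buffers swap roles in the next iteration.
        __syncthreads();
    }

    C[row * N + col] = acc;
}
```

The engine expresses this kind of pipelining once in CubeCL and lets the compiler lower it to each backend, rather than hand-writing a kernel like the above per platform.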