For the LLM community, Google is rescuing the NVIDIA N1X (RTX Spark laptops), Strix Halo, and possibly the M5 Pro. It's the first of its kind and has its problems, but it will work out. It exploits the fact that locally you don't have a queue of requests from lots of users, so on a single prompt you compute 256 tokens at once, by diffusion. That means if you have hardware whose chip compute performance is significantly higher than its memory speed, you get more out of that compute and spend less time shuttling data back and forth to memory. It helps significantly both on my beloved 5090 and on the RTX Spark.
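A rough back-of-the-envelope sketch of that argument, roofline style: each forward pass costs whichever is larger, the time to read the weights from memory or the time to do the math. Every number below is made up for illustration (an 8 GB model, 1 TB/s memory, 100 TFLOP/s compute, 16 denoising passes per 256-token block); none of them are measurements or specs of the hardware mentioned here, and the function names are hypothetical.

```python
def pass_time(weight_bytes, flops_per_token, tokens_per_pass, mem_bw, peak_flops):
    """Time for one forward pass: the slower of memory traffic and compute."""
    mem_time = weight_bytes / mem_bw  # weights are streamed once per pass
    compute_time = flops_per_token * tokens_per_pass / peak_flops
    return max(mem_time, compute_time)


def tokens_per_second(tokens_per_pass, passes_per_block, **hw):
    """Tokens produced per second when a block needs several passes."""
    t = pass_time(tokens_per_pass=tokens_per_pass, **hw)
    return tokens_per_pass / (passes_per_block * t)


# Hypothetical hardware and model, chosen only to make the arithmetic easy.
HW = dict(
    weight_bytes=8e9,      # assumed 8 GB of weights
    flops_per_token=1.6e10,  # assumed ~2 FLOPs per weight per token
    mem_bw=1e12,           # assumed 1 TB/s memory bandwidth
    peak_flops=1e14,       # assumed 100 TFLOP/s compute
)

# One token per pass: the pass is dominated by reading the weights.
autoregressive = tokens_per_second(tokens_per_pass=1, passes_per_block=1, **HW)

# 256 tokens per pass, with an assumed 16 denoising passes per block.
diffusion = tokens_per_second(tokens_per_pass=256, passes_per_block=16, **HW)

print(f"autoregressive: {autoregressive:.1f} t/s")
print(f"diffusion:      {diffusion:.1f} t/s")
```

With these toy numbers the single-token case is limited by memory, while the 256-token block is limited by compute, which is the whole point: the more compute you have relative to memory bandwidth, the more the block approach pays off. The real gain depends heavily on how many denoising passes a block actually needs.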
On NVIDIA, support is of course there from day 0, thanks to the collaboration with Google.
Run DiffusionGemma on NVIDIA for Developer-Ready, High-Throughput Text Generation | NVIDIA Technical Blog
https://developer.nvidia.com/blog/run-diffusiongemma-on-nvidia-for-developer-ready-high-throughput-text-generation/
The numbers, for inference:
5090: 700 t/s
DGX (+RTX) Spark: 150 t/s
DGX Station: 2000 t/s
Details:
A Visual Guide to DiffusionGemma - by Maarten Grootendorst
https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-diffusiongemma