- LLMs are best evaluated in tokens per second: input and output determine latency.
- Databricks provisions endpoints by TPS and autoscales; MLPerf standardizes the metrics.
- New benchmarks (DeepSeek-R1, Whisper, Llama 3.1-8B) tighten TTFT/TPOT.

If you work with language models, you have heard the phrase "tokens per second" a thousand times, but it is rarely explained in depth what it means in real-world settings and, above all, how MLPerf measures it. In this article we lay out clearly what tokens are, why tokens per second is the key metric for inference, and how platforms like Databricks and the MLPerf benchmark use it for sizing, comparison, and scaling. We also include concrete figures from vendors and clouds, down to the performance you can expect at the low end.
The matter is no small thing: the industry has standardized on tokens per second to evaluate LLM performance in data centers and at the edge. MLPerf, the peer-reviewed MLCommons suite, has become the benchmark for comparing hardware and software. In parallel, operators such as Databricks already provision their model endpoints directly based on tokens-per-second bands. Let's break all of this down, with numbers and hands-on use cases.
What is a token and why does it matter in an LLM?
Language models don't process individual characters or words as such; they work with units called tokens. A token is typically about 4 characters long or, on average, 0.75 words. This ratio varies with the language and the model's tokenizer, but it serves as a quick reference: a 10-word text comes out to roughly 13–14 tokens.
The exact split depends on the model: each LLM uses its own tokenizer and breaks words into whole tokens or sub-words. Online tools let you see, for example, how Llama tokenizes a given phrase. This variation, which looks like a minor detail, affects both latency and compute cost.
When talking about generation rate, it is usually expressed in tokens per second rather than words per second. This makes the metric uniform across languages, context lengths, and output styles, and allows inference costs and required throughput to be calculated precisely.
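The rules of thumb above can be sketched as a quick estimator. The 4-characters and 0.75-words ratios are the article's averages; real counts depend on each model's tokenizer, so treat the results as rough references only.

```python
# Rough token estimates from the article's rules of thumb: a token is
# ~4 characters or ~0.75 words on average. Real counts depend on the
# model's tokenizer, so these are quick references, not exact figures.

def tokens_from_words(n_words: float) -> float:
    return n_words / 0.75   # ~0.75 words per token

def tokens_from_chars(n_chars: float) -> float:
    return n_chars / 4      # ~4 characters per token

print(round(tokens_from_words(10)))   # a 10-word text -> ~13 tokens
print(round(tokens_from_chars(400)))  # 400 characters -> ~100 tokens
```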
Why measure performance in tokens per second and not RPS?
Traditional API services focus on RPS (requests per second). With LLMs, that approach falls short: two requests can take wildly different times depending on their input and output tokens. In other words, the real cost comes in tokens, not in "number of calls."
There are two main sources of variation. First, input context length: a short prompt may be a handful of tokens, but a document to summarize can climb into the hundreds or thousands. On the other side, output length: summarization usually produces few tokens, while generating a long article or explanation stretches the time, because emitting output tokens is the most expensive part.
Therefore, to size an inference endpoint honestly, it helps to think in tokens. Databricks, for example, provisions its serving endpoints by tokens-per-second bands, with hourly credits based on that sizing. This way you can match capacity to the real load without being misled by an RPS figure that doesn't tell the whole story.
How Databricks and MLPerf measure tokens per second
Databricks takes a RAG-representative workload as its reference and pins it down: 2048 input tokens and 256 output tokens. It combines both phases (prefill and decode) and, by default, optimizes the balance between throughput and latency at a batch size of 1 per request, simulating many simultaneous requests.
Under that convention, the numbers read like this: if you provision an endpoint at 2304 tokens per second (2048 + 256), a request of those dimensions takes roughly one second. If you set it to 5600 tokens per second, the same request drops to about 0.5 s and you can process two such requests per second.
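Read as code, that convention looks like this. It assumes, per the article's framing, that the provisioned TPS is spent linearly on a request's combined input and output tokens.

```python
# Latency implied by a Databricks-style TPS band for the reference
# workload (2048 input + 256 output tokens), assuming the provisioned
# rate is consumed linearly by a request's total token count.

def request_latency_s(input_tokens: int, output_tokens: int, provisioned_tps: float) -> float:
    return (input_tokens + output_tokens) / provisioned_tps

print(request_latency_s(2048, 256, 2304))  # -> 1.0 second
print(request_latency_s(2048, 256, 5600))  # ~0.41 s, which the text rounds to ~0.5 s
```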
If your workload changes shape, latency changes too. Generating more output tokens is penalized more than growing the input tokens. If you run batch inference, compute the average input and output token counts over your dataset and compare them against the reference benchmark to estimate times.
A concrete example: with 1000 rows, an average of 3000 input tokens and 500 output tokens, and a provisioned throughput of 3500 tokens per second, it will take more than 1000 seconds, because your averages exceed the reference. If you instead average 1500 input and 100 output tokens with 1600 tokens per second provisioned, you will come in under 1000 seconds total for those 1000 rows.
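That estimate can be sketched as a first-order calculation using the same linear token accounting. As the text notes, output-heavy mixes land above this baseline and output-light mixes below it, since output tokens cost more than input tokens.

```python
# First-order batch-inference time: rows x (avg input + avg output
# tokens) / provisioned TPS. This is only a linear baseline; output
# tokens are more expensive than input tokens, so real times skew
# above it for output-heavy mixes and below it for output-light ones.

def batch_time_estimate_s(rows: int, avg_in: float, avg_out: float, tps: float) -> float:
    return rows * (avg_in + avg_out) / tps

print(batch_time_estimate_s(1000, 3000, 500, 3500))  # -> 1000.0 s (real time lands above)
print(batch_time_estimate_s(1000, 1500, 100, 1600))  # -> 1000.0 s (real time lands below)
```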
Built-in autoscaling and calculating effective scale
Databricks Model Serving includes fast autoscaling that grows or shrinks resources according to tokens-per-second demand. The system scales in blocks of capacity, and the extra capacity is only billed when it is used. In tests with increasing concurrent requests, provisioned throughput climbs until it plateaus around 8000 tokens per second as resources saturate, which increases queueing latency.
If you observe fewer tokens per second than you provisioned, check two things: the provisioned concurrency shown in the endpoint metrics and the configured minimum burst size. With that data, the effective scale is estimated with the formula: provisioned concurrency × minimum burst size / 4.
A tangible example: with a provisioned concurrency of 8 and a minimum burst size of 850 tokens per second, the effective throughput would be 1700 tokens per second (8 × 850 / 4). Understanding this calculation prevents surprises and helps you tune your configuration to your latency SLOs.
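The effective-scale formula above is a one-liner in code; both inputs are read from the endpoint's metrics, per the article.

```python
# Effective throughput per the article's formula:
# provisioned concurrency x minimum burst size / 4,
# with both values read from the endpoint's metrics.

def effective_tps(provisioned_concurrency: int, min_burst_tps: float) -> float:
    return provisioned_concurrency * min_burst_tps / 4

print(effective_tps(8, 850))  # -> 1700.0 tokens per second
```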
MLPerf Inference: what it is and what it measures today
MLPerf, developed by MLCommons, is the open, standardized suite for measuring AI performance in the data center and at the edge, from vision to LLMs. Its goal is to compare platforms fairly and reproducibly in order to drive efficiency across the ecosystem. In recent years the focus has visibly shifted to GenAI and large LLMs.
In the fifth round, Llama 2 70B was incorporated as the flagship benchmark, displacing ResNet50, and tokens-per-second results improved up to 3.3x in the best case within a year, with median performance multiplying by 5 thanks to hardware and software optimizations. The presence of CPUs such as Intel Xeon 6 in the official results also showed that there is room for efficient generalist solutions in certain scenarios.
Version 5.1 of MLPerf Inference has taken another step forward: it added three new key benchmarks, reasoning with DeepSeek-R1, speech-to-text with Whisper Large v3, and a small LLM based on Llama 3.1 8B. In total, the consortium reported 27 participants, hit a record of 90,000 results, and tightened latency metrics in interactive scenarios.
Metrics and targets in the new benchmarks
The reasoning benchmark with DeepSeek-R1, a 671B-parameter MoE, reflects that these models generate long chains of reasoning before the answer. It supports outputs of up to 20,000 tokens, with an average of 3880 output tokens per dataset sample, the largest in inference to date.
The rules measure throughput in offline mode and in a server mode with strict limits: time to first token of 2 seconds and per-token latency of 80 ms at p99. This is intended to balance the "thinking" budget against the responsiveness required for the model to be usable.
The small-LLM benchmark with Llama 3.1-8B replaces GPT-J 6B as the entry point. It supports contexts of up to 128,000 tokens and evaluates summarization on CNN-DailyMail with 778 input tokens and 73 output tokens. Accuracy is validated with ROUGE and, in the closed division, must match 99% of the high-accuracy reference.
For latency, two indicators are used: TTFT (time to first token) and TPOT (time per output token). In Server, 2 s TTFT and 100 ms TPOT are enforced (roughly 480 words per minute), and in the new Interactive scenario they are tightened to 0.5 s and 30 ms respectively (roughly 1600 words per minute), for scenarios like chat, coding assistants, or agent tools.
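Those words-per-minute figures can be sanity-checked from the TPOT limits with the ~0.75 words-per-token average quoted earlier; the exact numbers depend on the tokenizer, which presumably explains why the text rounds them up slightly.

```python
# Words-per-minute implied by a TPOT (time per output token) limit,
# using the article's ~0.75 words-per-token average. The text quotes
# ~480 and ~1600 wpm; the tokenizer-dependent ratio explains the gap.

def tpot_to_wpm(tpot_s: float, words_per_token: float = 0.75) -> float:
    tokens_per_minute = 60 / tpot_s
    return tokens_per_minute * words_per_token

print(round(tpot_to_wpm(0.100)))  # -> 450 (Server: 100 ms TPOT)
print(round(tpot_to_wpm(0.030)))  # -> 1500 (Interactive: 30 ms TPOT)
```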
Vendor and workload highlights
- NVIDIA led again, this time with Blackwell Ultra on the GB300 NVL72 system, scoring a reasoning record on DeepSeek-R1 45% above the GB200 NVL72, reaching 5842 tokens per second per GPU offline and 2907 in server, an improvement of close to 5x over Hopper-based systems.
- In the new interactive Llama 3.1 405B benchmark, NVIDIA used disaggregated serving with Dynamo, splitting context processing and generation across different GPUs and transferring the KV cache over NVLink, achieving 1.5× more throughput per GPU than traditional serving on Blackwell and more than 5× over Hopper-based systems.
- For the smaller models, NVIDIA reported over 18,000 tokens per second per GPU on Llama 3.1 8B offline and 5667 tokens per second per GPU on Whisper, keeping per-GPU leadership across all scenarios (offline, server, and interactive).
- AMD expanded its presence with the first submissions of the Instinct MI355X GPU, debuting on Llama 2-70B. It demonstrated multi-node scaling and a 2.7x tokens-per-second uplift over the MI325X in FP8. In the open division, structured pruning was applied to Llama 3.1-405B (FP4), with an 82% throughput gain for a model pruned 21% in depth and 90% for a fine-tuned model pruned 33%, while preserving accuracy.
- It also published submissions for Llama 2-70B Interactive, Mixtral-8×7B, and Stable Diffusion XL, and presented mixed MI300X/MI325X results: scaling to 4 nodes, the MI355X achieved 3.4x more throughput than the MI300X, extending to 8 nodes with good scalability.
- HPE, combining ProLiant and Cray, reported 14 number-1 results. The DL380a Gen12 stood out on DLRM and Llama 3.1-8B (Server) among 8-GPU PCIe systems; the DL385 Gen11 marked the best per-GPU performance on Whisper with the H200 NVL; and the Cray XD670 (8× H200) took six first-place scores on RetinaNet, Llama 3.1-8B, Mixtral, and Whisper, plus first results with the RTX Pro 6000 Blackwell SE and GH200 NVL2 results on DLRM.
- CoreWeave was the first cloud to report results with the GB300, delivering 6005 tokens per second per GPU on DeepSeek-R1 offline and showcasing orchestration and scaling with Slurm on Kubernetes, plus topology-aware scheduling to get the most out of NVLink.
- Dell submitted 12 systems with AMD and NVIDIA accelerators, shining on LLaMA 2 70B Interactive with the PowerEdge XE9680L and B200, LLaMA 3.1-8B Server on the XE9685L+B200, SDXL on the XE9685L, and Whisper on the XE9680L, showing versatility from image to speech to LLMs.
- Intel stressed that it remains the only vendor submitting CPU-only server results and showed that Xeon 6 with P-cores improves 1.9× over 5th Gen Xeon across all five benchmarks, reinforcing its role in general-purpose inference. It also presented workstations with 8 Arc Pro B60 GPUs and 192 GB of VRAM to serve Llama2-70B to multiple users, along with unified drivers and frameworks to simplify multi-GPU deployment.
- Among integrators and partners, ASUSTeK showed improved latency and throughput through batching, kernels, and its stack; Broadcom demonstrated VCF virtualization with minimal overhead versus bare metal across many workloads (Whisper, SDXL, Llama 3.1-405B, Llama2-70B, RGAT, RetinaNet); Cisco measured near parity between the UCS C885A M8 (8× H200 SXM) and the UCS C845A M8 (8× H200 NVL or L40S), backed by G200 networking.
- KRAI, using the OpenAI API with realistic overheads, compared SGLang and vLLM on Llama3.1-70B: 31,391 tokens per second offline with SGLang 0.4.9 and 26,319 with vLLM 0.9.2 on a single 8× H200 server; with aggressive quantization it reached 27,697 with SGLang and 30,893 with vLLM, and in multi-node it hit 87,334 tokens per second across three servers.
- Lambda, with 8× B200 180 GB SXM, showed performance gains of up to 7% on SDXL and 15% on Llama 3.1-405B over the previous round, and offers clusters from 16 to 1536 GPUs with managed Kubernetes or Slurm.
- MiTAC, with its G8825Z5 series, shone on LLaMA 2 70B Interactive with 18,846.1 tokens per second and solid results on Server and Mixtral; Nebius validated virtualized performance at near bare-metal parity on GB200 NVL72, HGX B200, and HGX H200, with 596.11 tokens per second in server and 855.82 tokens per second offline on Llama 3.1-405B with 4 GB200 GPUs.
- Red Hat showcased vLLM as the supported runtime in its AI Inference Server, with FP8 CUTLASS kernels, FlashAttention-3, and the upgraded vLLM v1 engine, powering Llama-3.1-8B on H100 and L40S with an excellent cost-performance ratio.
- Supermicro submitted leading results with HGX-B200 8-GPU (air- and liquid-cooled) and with both Intel and AMD CPUs, highlighting Llama 3.1-8B and Llama 2-70B in server/offline/interactive plus Whisper; in interactive, it showed very good scaling with 32× H100-SXM and other configurations with MI325X.
- Vultr debuted with a Supermicro AS-8126GS-TNMR and 8× MI325X, validating competitive performance as a Cloud GPU; GATEOverflow promoted reproducibility with MLCFlow on RTX 4090s and AMD/Intel CPUs; Giga Computing submitted 8U air-cooled EPYC+MI325X and Xeon+HGX B200 systems; QCT combined Xeon 6 configurations with H200 NVL (4 GPUs) and 8× H200 SXM5 platforms with NVLink and GPUDirect Storage, alongside 8× MI325X systems.
Academia had its moment too. The University of Florida, with its DGX B200 SuperPOD integrated into HiPerGator, became the first academic center to submit results, marking interactive server latency under the closed division, using Apptainer without Docker/sudo, and fitting into a multi-user SLURM environment. Separately, a single submission on an M1 MacBook Pro, with ONNX Runtime and CoreML on the GPU and Neural Engine, exceeded the target accuracy in the edge category and showed that quality inference can be validated on consumer hardware.
User-perceived speed and practical limits
User experience isn't measured by benchmarks alone; in day-to-day use, the feeling of fluidity arrives once you cross a certain tokens-per-second threshold. One user commented that their threshold for chat is 4 tokens per second, and for story writing, 10 tokens per second; below that, the interaction feels sluggish.
If you try to run an LLM locally, there are three realities. On a desktop CPU, it is common to crawl along at 1–2 tokens per second, making long answers impractical. With a high-end gaming GPU, you can approach 5 tokens per second. With an NVIDIA H100, yes, we are already talking about 60 tokens per second, but that is data-center hardware, not desktop hardware.
What about the cloud? The most powerful providers crush those numbers thanks to specialized hardware and optimized inference stacks. Estimated rates of around 119 tokens per second have been reported for ChatGPT-4 and 168 for Gemini, while popular open-source models such as DeepSeek hover near 21 tokens per second. Translating that into words, 119 tokens per second is roughly 90 words per second.
The practical conclusion: for most users, running AI on your own computer is possible, but impractical because of the latency. For comfortable speed and tight latency, managed services remain the sensible option.
How to size your endpoint with TPS and what to expect in latency
Actionable sizing steps. First, define your use case: average input and output token counts, the length distribution, and the expected concurrency. Second, run load tests with a representative dataset, capturing TTFT and the sustained tokens per second per request.
Next, match the configuration to your pattern. If your workload resembles the Databricks reference (2048 in, 256 out), pick the tokens-per-second band so that a request falls within your desired latency budget. Remember that doubling the output tends to cost more than doubling the input, and that effective concurrency depends on real autoscaling behavior.
Monitor and adjust. Watch the metrics for instantaneous concurrency, queues, TTFT, and TPOT, and compare them against your SLOs. If you fall short on throughput, widen the band; if you have excess resources, scale down and tune the blocks to save. The effective-scale formula will help you understand why an endpoint may not perform as configured if it hasn't generated enough parallelism.
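The three steps can be tied together in a small sizing check. The workload numbers below are illustrative, not a recommendation, and the effective-scale formula is the one given earlier in the article.

```python
# Minimal sizing sketch: estimate the TPS band needed for a target
# latency and concurrency, then compare it against the effective
# throughput (provisioned concurrency x minimum burst size / 4).
# The workload figures here are illustrative only.

def required_tps(avg_in: int, avg_out: int, latency_budget_s: float, concurrency: int) -> float:
    """TPS needed so `concurrency` average-sized requests meet the latency budget."""
    return concurrency * (avg_in + avg_out) / latency_budget_s

def effective_tps(provisioned_concurrency: int, min_burst_tps: float) -> float:
    return provisioned_concurrency * min_burst_tps / 4

need = required_tps(avg_in=2048, avg_out=256, latency_budget_s=1.0, concurrency=2)
have = effective_tps(provisioned_concurrency=8, min_burst_tps=850)
print(need, have, have >= need)  # if False, widen the band or lower concurrency
```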
Finally, mind the scenario. In interactive chatbot-style mode, aim for a TTFT of 0.5 s and 30 ms per token; that will give a premium user experience. In server mode, 2 s and 100 ms per token are reasonable guidelines, and offline mode demands maximum throughput while keeping the accuracy the benchmark requires.
Looking at MLPerf's trends, the vector is clear: more context, more tokens, and ever-better efficiency techniques (disaggregated serving, FP4/FP8, structured pruning, custom kernels, KV-cache scheduling) keep raising the tokens-per-second ceiling year after year, both per chip and per system.
The overall picture painted by Databricks and MLPerf is consistent: thinking in tokens per second is the right way to reason about cost, latency, and scale for LLMs. With a good representative profile, TTFT/TPOT metrics, and well-calibrated autoscaling, it is possible to deliver fast, stable responses without overprovisioning infrastructure.
