From 8f37ef326092b7980130cd0b86e482ca62f10c6c Mon Sep 17 00:00:00 2001 From: damion10155114 Date: Tue, 11 Feb 2025 00:49:02 +0800 Subject: [PATCH] Add Understanding DeepSeek R1 --- Understanding-DeepSeek-R1.md | 92 ++++++++++++++++++++++++++++++++++++ 1 file changed, 92 insertions(+) create mode 100644 Understanding-DeepSeek-R1.md diff --git a/Understanding-DeepSeek-R1.md b/Understanding-DeepSeek-R1.md new file mode 100644 index 0000000..d4c8022 --- /dev/null +++ b/Understanding-DeepSeek-R1.md @@ -0,0 +1,92 @@ +
DeepSeek-R1 is an open-source language model built on DeepSeek-V3-Base that has been making waves in the AI community. Not only does it match, or even surpass, OpenAI's o1 model on many benchmarks, but it also ships with fully MIT-licensed weights. This makes it the first non-OpenAI/Google model to deliver strong reasoning capabilities in an open and accessible way.
+
What makes DeepSeek-R1 particularly exciting is its transparency. Unlike the less-open approaches of some industry leaders, DeepSeek has published a detailed training methodology in their paper.
The model is also remarkably cost-effective, with input tokens costing just $0.14-0.55 per million (vs o1's $15) and output tokens at $2.19 per million (vs o1's $60).
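
As a quick sanity check on those prices, here is a small worked comparison. The per-million rates are the ones quoted above; the workload size is an arbitrary illustration, not a benchmark.

```python
# Rough API cost comparison using the per-million-token prices quoted above.
# The workload size below is an arbitrary example, not a measured benchmark.
PRICES = {
    "deepseek-r1": {"input": 0.55, "output": 2.19},   # USD per 1M tokens (cache miss)
    "o1":          {"input": 15.00, "output": 60.00},
}

def cost(model: str, input_tokens: int, output_tokens: int) -> float:
    p = PRICES[model]
    return (input_tokens / 1e6) * p["input"] + (output_tokens / 1e6) * p["output"]

# Example workload: 5M input tokens, 1M output tokens.
for model in PRICES:
    print(f"{model}: ${cost(model, 5_000_000, 1_000_000):.2f}")
```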
+
Until roughly GPT-4, the conventional wisdom was that better models required more data and compute. While that still holds, models like o1 and R1 demonstrate an alternative: inference-time scaling through reasoning.
+
The Essentials
+
The DeepSeek-R1 paper presented multiple models, but the main ones are R1 and R1-Zero. Alongside these is a series of distilled models that, while interesting, I won't cover here.
+
DeepSeek-R1 relies on two major ideas:
+
1. A multi-stage pipeline where a small set of cold-start data kickstarts the model, followed by large-scale RL.
2. Group Relative Policy Optimization (GRPO), a reinforcement learning method that relies on comparing multiple model outputs per prompt to avoid the need for a separate critic.
+
R1 and R1-Zero are both reasoning models. In essence, this means they perform Chain-of-Thought before answering. For the R1 series of models, this takes the form of thinking inside a `<think>` tag before answering with a final summary.
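
To make the output format concrete, here is a minimal sketch of how you might separate the reasoning trace from the final answer in an R1-style completion. The example completion string is invented, and the exact tag handling of a given serving stack may differ.

```python
import re

# Invented example of an R1-style completion: reasoning inside <think> tags,
# followed by the final answer/summary.
completion = (
    "<think>The user asks for 17 * 24. 17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408.</think>"
    "17 * 24 = 408."
)

match = re.search(r"<think>(.*?)</think>(.*)", completion, re.DOTALL)
if match:
    reasoning, answer = match.group(1).strip(), match.group(2).strip()
    print("Reasoning trace:", reasoning)
    print("Final answer:", answer)
```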
+
R1-Zero vs R1
+
R1-Zero applies Reinforcement Learning (RL) directly to DeepSeek-V3-Base with no supervised fine-tuning (SFT). RL is used to optimize the model's policy to maximize reward.
R1-Zero achieves excellent accuracy but sometimes produces confusing outputs, such as mixing multiple languages in a single response. R1 fixes that by incorporating limited supervised fine-tuning and multiple RL passes, which improves both correctness and readability.
+
It is interesting how some languages may express certain ideas better, which leads the model to pick the most expressive language for the task.
+
Training Pipeline
+
The training pipeline that DeepSeek published in the R1 paper is immensely interesting. It shows how they developed such strong reasoning models, and what you can expect from each stage. This includes the problems that the resulting models from each stage have, and how they solved them in the next stage.
+
It's interesting that their training pipeline deviates from the usual one:
+
The usual training recipe: Pretraining on a large dataset (trained to predict the next word) to get the base model → supervised fine-tuning → preference tuning via RLHF
R1-Zero: Pretrained → RL
R1: Pretrained → Multistage training pipeline with multiple SFT and RL stages
+
Cold-Start Fine-Tuning: Fine-tune DeepSeek-V3-Base on a few thousand Chain-of-Thought (CoT) samples to ensure the RL process has a decent starting point. This gives a good model to begin RL with.
First RL Stage: Apply GRPO with rule-based rewards to improve reasoning correctness and formatting (such as forcing chain-of-thought into thinking tags). When they were close to convergence in the RL process, they moved to the next step. The result of this step is a strong reasoning model but with weak general capabilities, e.g., poor formatting and language mixing.
Rejection Sampling + general data: Create new SFT data through rejection sampling on the RL checkpoint (from step 2), combined with supervised data from the DeepSeek-V3-Base model. They collected around 600k high-quality reasoning samples.
Second Fine-Tuning: Fine-tune DeepSeek-V3-Base again on 800k total samples (600k reasoning + 200k general tasks) for broader capabilities. This step resulted in a strong reasoning model with general capabilities.
Second RL Stage: Add more reward signals (helpfulness, harmlessness) to refine the final model, in addition to the reasoning rewards. The result is DeepSeek-R1.
They also did model distillation for several Qwen and Llama models on the reasoning traces to get distilled-R1 models.
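
To summarize the flow, here is a highly schematic sketch of the stages just described. The stage functions are hypothetical placeholders that only return descriptive strings, so this illustrates the ordering of the stages, not real training code.

```python
# Schematic sketch of the staged pipeline described above. All functions are
# hypothetical placeholders, not any real DeepSeek or library API.

def supervised_finetune(model, data):   # placeholder
    return f"SFT[{model} on {data}]"

def grpo_train(model, rewards):         # placeholder
    return f"GRPO[{model} with rewards={rewards}]"

def rejection_sample(model, n):         # placeholder
    return f"{n} reasoning samples from {model}"

base = "DeepSeek-V3-Base"
m = supervised_finetune(base, "a few thousand cold-start CoT samples")        # cold-start SFT
m = grpo_train(m, ["accuracy", "format"])                                     # first RL stage
sft_mix = [rejection_sample(m, "600k"), "200k general-task samples"]          # rejection sampling + general data
m = supervised_finetune(base, " + ".join(sft_mix))                            # second fine-tuning (800k total)
r1 = grpo_train(m, ["accuracy", "format", "helpfulness", "harmlessness"])     # second RL stage
print(r1)
```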
+
Model distillation is a technique where you use a teacher model to improve a student model by generating training data for the student model.
The teacher is typically a larger model than the student.
+
Group Relative Policy Optimization (GRPO)
+
The fundamental idea behind using reinforcement learning for LLMs is to fine-tune the model's policy so that it naturally produces more accurate and useful answers.
They used a reward system that checks not only for correctness but also for proper formatting and language consistency, so the model gradually learns to favor responses that meet these quality criteria.
+
In this paper, they encourage the R1 model to generate chain-of-thought reasoning through RL training with GRPO.
Rather than adding a separate module at inference time, the training process itself nudges the model to produce detailed, step-by-step outputs, making the chain-of-thought an emergent behavior of the optimized policy.
+
What makes their approach particularly interesting is its reliance on straightforward, rule-based reward functions.
Instead of depending on expensive external models or human-graded examples as in traditional RLHF, the RL used for R1 uses simple criteria: it might give a higher reward if the answer is correct, if it follows the expected `<think>`/`<answer>` format, and if the language of the answer matches that of the prompt.
Not relying on a reward model also means you don't have to spend time and effort training it, and it doesn't take memory and compute away from your main model.
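
As an illustration, a rule-based reward along these lines can be as simple as the sketch below. DeepSeek has not released their reward code, so the exact checks, weights, and tag handling here are my own guesses, not their implementation.

```python
import re

# Hedged sketch of a rule-based reward combining correctness, format, and a
# crude language-consistency check. Weights and checks are illustrative only.
def rule_based_reward(prompt: str, completion: str, reference_answer: str) -> float:
    reward = 0.0

    # 1) Format: reasoning must be wrapped in <think>...</think> before the answer.
    has_format = re.search(r"<think>.+?</think>", completion, re.DOTALL) is not None
    reward += 0.5 if has_format else -0.5

    # 2) Correctness: compare the text after </think> against a reference answer.
    answer = completion.split("</think>")[-1].strip()
    reward += 1.0 if reference_answer in answer else 0.0

    # 3) Language consistency: for an English prompt, require a mostly-ASCII
    #    answer (a stand-in for a real language classifier).
    if prompt.isascii():
        ascii_ratio = sum(c.isascii() for c in answer) / max(len(answer), 1)
        reward += 0.25 if ascii_ratio > 0.95 else -0.25

    return reward

print(rule_based_reward(
    "What is 17 * 24?",
    "<think>17 * 24 = 408</think>The answer is 408.",
    "408",
))
```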
+
GRPO was introduced in the DeepSeekMath paper. Here's how GRPO works:
+
1. For each input prompt, the model generates a group of different responses.
2. Each response receives a scalar reward based on factors like accuracy, formatting, and language consistency.
3. Rewards are normalized relative to the group's performance, essentially measuring how much better each response is compared to the others.
4. The model updates its policy slightly to favor responses with higher relative advantages. It only makes small adjustments, using techniques like clipping and a KL penalty, to ensure the policy doesn't drift too far from its original behavior.
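
Step 3 is the heart of GRPO: each response's advantage is simply its reward minus the group mean, divided by the group standard deviation, so no learned critic is needed. A minimal sketch of that computation (the reward values are made up):

```python
from statistics import mean, pstdev

# Group-relative advantages as in GRPO: normalize each response's reward by
# the mean and standard deviation of its own group.
def group_relative_advantages(rewards, eps=1e-6):
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: 4 sampled responses to the same prompt, scored by rule-based rewards.
rewards = [1.75, 0.5, -0.25, 1.75]
print(group_relative_advantages(rewards))
```

The policy update in step 4 then looks like PPO's clipped objective, but with these group-normalized advantages standing in for critic-based ones.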
+
A neat aspect of GRPO is its flexibility. You can use simple rule-based reward functions, for instance, awarding a reward when the model correctly uses the `<think>` syntax, to guide the training.
+
While DeepSeek used GRPO, you could use alternative methods instead (PPO or PRIME).
+
For those wanting to dive deeper, Will Brown has written quite a nice implementation of training an LLM with RL using GRPO. GRPO has also already been added to the Transformer Reinforcement Learning (TRL) library, which is another good resource.
Finally, Yannic Kilcher has a great video explaining GRPO by going through the DeepSeekMath paper.
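
For reference, using GRPO via TRL looks roughly like the sketch below. This reflects my understanding of TRL's `GRPOTrainer` at the time of writing, so check the TRL documentation for the current API; the model, dataset, and toy reward function are placeholders.

```python
# Rough sketch of GRPO fine-tuning with TRL. Model, dataset, and the toy
# reward function are placeholders; consult the TRL docs for the current API.
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

dataset = load_dataset("trl-lib/tldr", split="train")  # any prompt dataset works

# Toy rule-based reward: favor completions that wrap reasoning in <think> tags.
def think_format_reward(completions, **kwargs):
    return [1.0 if "<think>" in c and "</think>" in c else 0.0 for c in completions]

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    reward_funcs=think_format_reward,
    args=GRPOConfig(output_dir="grpo-think-demo", logging_steps=10),
    train_dataset=dataset,
)
trainer.train()
```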
+
Is RL on LLMs the path to AGI?
+
As a final note on explaining DeepSeek-R1 and the methodologies they've presented in their paper, I want to highlight a passage from the DeepSeekMath paper, based on a point Yannic Kilcher made in his video.
+
These findings indicate that RL boosts the model's overall performance by rendering the output distribution more robust; in other words, it seems that the improvement is attributed to boosting the correct response from TopK rather than the enhancement of fundamental capabilities.
+
In other words, RL fine-tuning tends to shape the output distribution so that the highest-probability outputs are more likely to be correct, even though the overall capability (as measured by the diversity of correct answers) is largely already present in the pretrained model.
+
This suggests that reinforcement learning on LLMs is more about refining and "shaping" the existing distribution of responses rather than endowing the model with entirely new capabilities.
Consequently, while RL techniques such as PPO and GRPO can produce substantial performance gains, there appears to be an inherent ceiling determined by the underlying model's pretrained knowledge.
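
One way to make the "sharpening vs. new capabilities" distinction concrete is to compare pass@1 with pass@k using the standard unbiased pass@k estimator. The sample counts below are invented purely to illustrate the effect the quote describes.

```python
from math import comb

# Unbiased pass@k estimator: n samples per problem, c of them correct.
def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Invented numbers: suppose the base model gets 3/16 samples right on some
# problem, while the RL-tuned model gets 12/16 right on the same problem.
for name, correct in [("base model", 3), ("RL-tuned model", 12)]:
    print(name, "pass@1 =", round(pass_at_k(16, correct, 1), 2),
          "pass@16 =", round(pass_at_k(16, correct, 16), 2))
# The claim in the quote is that RL mostly lifts pass@1 (the top of the
# distribution), while pass@k for large k stays close to the base model's.
```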
+
It is unclear to me how far RL will take us. Perhaps it will be the stepping stone to the next big milestone. I'm excited to see how it plays out!
+
Running DeepSeek-R1
+
I've used DeepSeek-R1 via the official chat interface for various problems, which it seems to solve well enough. The additional search functionality makes it even nicer to use.
+
Interestingly, o3-mini(-high) was released while I was writing this post. From my initial testing, R1 seems stronger at math than o3-mini.
+
I also rented a single H100 via Lambda Labs for $2/h (26 CPU cores, 214.7 GB RAM, 1.1 TB SSD) to run some experiments.
The main goal was to see how the model would perform when deployed on a single H100 GPU, not to extensively test the model's capabilities.
+
671B via Llama.cpp
+
DeepSeek-R1 1.58-bit (UD-IQ1_S) quantized version by Unsloth, with a 4-bit quantized KV cache and partial GPU offloading (29 layers running on the GPU), running via llama.cpp:
+
29 layers seemed to be the sweet spot given this configuration.
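
For reference, a roughly comparable setup through the llama-cpp-python bindings might look like the sketch below. The GGUF filename, context size, and prompt are placeholders, the original run used llama.cpp directly, and details such as the 4-bit KV cache are omitted here.

```python
# Hedged sketch: partial GPU offload of a quantized DeepSeek-R1 GGUF via
# llama-cpp-python. The model path is a placeholder, not the actual file used.
from llama_cpp import Llama

llm = Llama(
    model_path="DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf",  # placeholder filename
    n_gpu_layers=29,   # partial offload: 29 layers on the GPU, the rest on CPU
    n_ctx=2048,
)

out = llm(
    "A chat between a user and an assistant.\nUser: What is 17 * 24?\nAssistant:",
    max_tokens=256,
)
print(out["choices"][0]["text"])
```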
+
Performance:
+
A r/localllama user described getting over 2 tok/sec with DeepSeek R1 671B, without using their GPU, on their local gaming rig.
Digital Spaceport wrote a full guide on how to run DeepSeek R1 671B fully locally on a $2000 EPYC server, on which you can get ~4.25 to 3.5 tokens per second.
+
As you can see, the tokens/s isn't really usable for any serious work, but it's fun to run these huge models on accessible hardware.
+
What matters most to me is a combination of usefulness and time-to-usefulness in these models. Since reasoning models need to think before answering, their time-to-usefulness is usually higher than that of other models, but their usefulness is often higher too.
We need to both maximize usefulness and minimize time-to-usefulness.
+
70B via Ollama
+
70.6B params, 4-bit KM quantized DeepSeek-R1 running via Ollama:
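
For reference, querying the same model through Ollama's Python client might look like the sketch below. This assumes the `deepseek-r1:70b` tag from the Ollama library has been pulled and the local Ollama server is running; client API details may differ across versions.

```python
# Hedged sketch: chatting with a locally served DeepSeek-R1 70B via the
# `ollama` Python package. Assumes `ollama pull deepseek-r1:70b` was run first.
import ollama

response = ollama.chat(
    model="deepseek-r1:70b",
    messages=[{"role": "user", "content": "What is 17 * 24? Think step by step."}],
)
print(response["message"]["content"])  # includes the <think> reasoning trace
```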
+
GPU utilization shoots up here, as expected compared to the mostly CPU-powered run of the 671B model that I showed above.
+
Resources
+
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
[2402.03300] DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
DeepSeek R1 - Notion (Building a fully local "deep researcher" with DeepSeek-R1 - YouTube)
DeepSeek R1's recipe to replicate o1 and the future of reasoning LMs
The Illustrated DeepSeek-R1 - by Jay Alammar
Explainer: What's R1 & Everything Else? - Tim Kellogg
DeepSeek R1 Explained to your grandma - YouTube
+
DeepSeek
+
- Try R1 at chat.deepseek.com.
- GitHub - deepseek-ai/DeepSeek-R1.
- deepseek-ai/Janus-Pro-7B · Hugging Face (January 2025): Janus-Pro is a novel autoregressive framework that unifies multimodal understanding and generation. It can both understand and generate images.
- DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning (January 2025): This paper introduces DeepSeek-R1, an open-source reasoning model that rivals the performance of OpenAI's o1. It provides a detailed methodology for training such models using large-scale reinforcement learning techniques.
- DeepSeek-V3 Technical Report (December 2024): This report discusses the implementation of an FP8 mixed-precision training framework validated on an extremely large-scale model, achieving both accelerated training and reduced GPU memory usage.
- DeepSeek LLM: Scaling Open-Source Language Models with Longtermism (January 2024): This paper delves into scaling laws and presents findings that facilitate the scaling of large-scale models in open-source configurations. It introduces the DeepSeek LLM project, dedicated to advancing open-source language models with a long-term perspective.
- DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence (January 2024): This research introduces the DeepSeek-Coder series, a range of open-source code models trained from scratch on 2 trillion tokens. The models are pre-trained on a high-quality project-level code corpus and employ a fill-in-the-blank task to enhance code generation and infilling.
- DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model (May 2024): This paper introduces DeepSeek-V2, a Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference.
- DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence (June 2024): This research introduces DeepSeek-Coder-V2, an open-source Mixture-of-Experts (MoE) code language model that achieves performance comparable to GPT-4 Turbo in code-specific tasks.
+
Interesting events
+
- Hong Kong University replicates R1 results (Jan 25, '25).
- Huggingface announces huggingface/open-r1: Fully open reproduction of DeepSeek-R1 to replicate R1, fully open source (Jan 25, '25).
- OpenAI researcher confirms the DeepSeek team independently found and used some core ideas the OpenAI team used on the way to o1.
+
Liked this post? Join the newsletter.
\ No newline at end of file