commit 1ece440a4008d7702e117ab5d920f8c23b6f4b63 Author: maria26032243 Date: Mon Feb 10 19:33:20 2025 +0800 Add Understanding DeepSeek R1 diff --git a/Understanding-DeepSeek-R1.md b/Understanding-DeepSeek-R1.md new file mode 100644 index 0000000..faa4761 --- /dev/null +++ b/Understanding-DeepSeek-R1.md @@ -0,0 +1,92 @@ +
DeepSeek-R1 is an open-source language [model built](https://remefernandez.com) on DeepSeek-V3-Base that's been making waves in the [AI](http://cloudlandsgallery.helium.ie) community. Not only does it match (or even surpass) OpenAI's o1 model on many benchmarks, but it also comes with fully MIT-licensed weights. This marks it as the first non-OpenAI/Google model to deliver strong reasoning capabilities in an open and accessible manner.
+
What makes DeepSeek-R1 particularly exciting is its openness. Unlike the less-open approaches from some industry leaders, [DeepSeek](https://kmatsudajuku.com) has published a detailed training methodology in their paper. +The model is also [remarkably](https://rulestheynevertoldus.com) cost-effective, with input tokens costing just $0.14-0.55 per million (vs o1's $15) and output tokens at $2.19 per million (vs o1's $60).
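+
To make the price gap concrete, here is a small sketch that turns the per-million-token prices quoted above into a per-workload cost. The function name and the example workload are made up for illustration, and the upper end of R1's input price range is assumed.

```python
# Per-million-token prices quoted in the text (USD).
# Assumption: R1's input price is a range, so the upper bound is used here.
PRICES = {
    "deepseek-r1": {"input": 0.55, "output": 2.19},
    "o1": {"input": 15.00, "output": 60.00},
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the cost in USD for a given number of input and output tokens."""
    price = PRICES[model]
    total = input_tokens * price["input"] + output_tokens * price["output"]
    return total / 1_000_000


if __name__ == "__main__":
    # Hypothetical workload: 2M input tokens and 1M output tokens.
    for model in PRICES:
        print(model, round(request_cost(model, 2_000_000, 1_000_000), 2))
```

Reasoning models tend to emit many output tokens, so the output price usually dominates the bill.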
+
Until ~GPT-4, the conventional wisdom was that better [models](https://www.maisonberton.it) required more data and compute. While that's still valid, models like o1 and R1 demonstrate an alternative: [inference-time scaling](http://bauen-mit-massa.de) through reasoning.
+
The Essentials
+
The DeepSeek-R1 paper presented multiple models, but the main ones among them were R1 and R1-Zero. Following these is a series of [distilled models](https://visscabeleireiros.com) that, while interesting, I won't discuss here.
+
DeepSeek-R1 uses two major concepts:
+
1. A multi-stage pipeline where a small set of cold-start data kickstarts the model, followed by large-scale RL. +2. Group Relative Policy Optimization (GRPO), a reinforcement learning method that relies on comparing multiple model outputs per prompt to avoid the need for a separate critic.
+
R1 and R1-Zero are both reasoning models. This essentially means they do Chain-of-Thought before answering. For the R1 series of models, this takes the form of thinking within a `<think>` tag before answering with a final [summary](https://www.internationalrevivalcampaigns.org).
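+
As a rough illustration of that output shape, the sketch below separates the reasoning from the final answer. It assumes the reasoning is wrapped in a single `<think>...</think>` pair, as in the released R1 models; the helper name `split_reasoning` is hypothetical.

```python
import re

# Assumption: the chain-of-thought is wrapped in one <think>...</think> pair.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text: str) -> tuple[str, str]:
    """Split a completion into (reasoning, final_answer)."""
    match = THINK_RE.search(text)
    if match is None:
        # No thinking block found: treat the whole text as the answer.
        return "", text.strip()
    reasoning = match.group(1).strip()
    answer = text[match.end():].strip()
    return reasoning, answer


if __name__ == "__main__":
    raw = "<think>2 + 2 is 4.</think>The answer is 4."
    print(split_reasoning(raw))
```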
+
R1-Zero vs R1
+
R1-Zero applies Reinforcement Learning (RL) directly to DeepSeek-V3-Base with no supervised fine-tuning (SFT). RL is used to optimize the model's policy to maximize reward. +R1-Zero achieves excellent accuracy but sometimes produces confusing outputs, such as mixing multiple languages in a single response. R1 fixes that by incorporating limited [supervised](http://47.100.23.37) fine-tuning and multiple RL passes, which improves both accuracy and readability.
+
It is interesting how some [languages](http://tagami.com) may express certain ideas better, which leads the model to choose the most expressive language for the task.
+
Training Pipeline
+
The training pipeline that DeepSeek published in the R1 paper is [immensely interesting](http://web.unhas.ac.id). It showcases how they created such strong reasoning models, and what you can expect from each phase. This includes the problems that the resulting models from each phase have, and how they solved them in the next phase.
+
It's interesting that their training pipeline differs from the usual one:
+
The usual training strategy: Pretraining on a large [dataset](https://expatspousesinitiative.org) (train to predict the next word) to get the base model → supervised fine-tuning → preference tuning via RLHF +R1-Zero: Pretrained → RL +R1: Pretrained → Multistage training pipeline with multiple SFT and RL stages
+
1. Cold-Start Fine-Tuning: Fine-tune DeepSeek-V3-Base on a few thousand Chain-of-Thought (CoT) [samples](https://expatspousesinitiative.org) to [ensure](https://omproductions.pk) the RL process has a decent starting point. The result is a good model from which to start RL. +2. First RL Stage: Apply GRPO with rule-based [rewards](http://1024kt.com3000) to [improve reasoning](https://eprintex.jp) correctness and formatting (such as forcing chain-of-thought into thinking tags). When they were near convergence in the RL process, they moved to the next step. The result of this step is a strong reasoning model but with weak general capabilities, e.g., poor formatting and language mixing. +3. Rejection Sampling + general data: Create new SFT data through [rejection sampling](https://nusaeiwyj.com) on the RL checkpoint (from step 2), combined with supervised data from the DeepSeek-V3-Base model. They collected around 600k high-quality reasoning samples. +4. Second Fine-Tuning: Fine-tune DeepSeek-V3-Base again on 800k total samples (600k reasoning + 200k general tasks) for broader capabilities. This [step resulted](https://www.yardedge.net) in a strong reasoning model with general capabilities. +5. Second RL Stage: Add more reward signals (helpfulness, harmlessness) to refine the final model, in addition to the reasoning rewards. The result is DeepSeek-R1. +They also did [model distillation](http://storemango.com) for several Qwen and Llama models on the reasoning traces to get distilled-R1 models.
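+
The rejection sampling step in the pipeline above boils down to "sample several candidates, keep only the ones that pass a check". The sketch below shows that idea in isolation. The generator and the acceptance check are toy stand-ins (hypothetical names), not DeepSeek's actual sampler or filtering criteria.

```python
from itertools import cycle
from typing import Callable, Iterator, List


def rejection_sample(
    prompt: str,
    generate: Callable[[str], str],
    accept: Callable[[str, str], bool],
    num_candidates: int = 8,
) -> List[str]:
    """Draw several candidates for one prompt and keep only the accepted ones."""
    candidates = [generate(prompt) for _ in range(num_candidates)]
    return [c for c in candidates if accept(prompt, c)]


def make_toy_generator(answers: List[str]) -> Callable[[str], str]:
    """Toy 'model' (hypothetical) that cycles through canned completions."""
    stream: Iterator[str] = cycle(answers)
    return lambda prompt: next(stream)


def toy_accept(prompt: str, completion: str) -> bool:
    """Toy checker (hypothetical): accept completions ending in the known answer."""
    return completion.endswith("4")


if __name__ == "__main__":
    generate = make_toy_generator(["2 + 2 = 4", "2 + 2 = 5", "It is 4"])
    kept = rejection_sample("What is 2 + 2?", generate, toy_accept, num_candidates=6)
    print(kept)  # the accepted completions become new SFT examples
```

In a real setup, `generate` would sample from the RL checkpoint and `accept` would be a correctness or quality filter.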
+
Model distillation is a technique where you use a teacher model to improve a student model by generating training data for the student model. +The teacher is typically a larger model than the student.
+
Group Relative Policy Optimization (GRPO)
+
The basic idea behind using reinforcement learning for LLMs is to [fine-tune](https://www.lagostekne.it) the model's policy so that it naturally produces more accurate and useful answers. +They used a reward system that checks not only for correctness but also for proper formatting and language consistency, so the model gradually learns to favor responses that meet these quality criteria.
+
In this paper, they encourage the R1 model to generate chain-of-thought reasoning through RL training with GRPO. +Rather than adding a separate module at inference time, the training process itself nudges the model to produce detailed, step-by-step outputs, making the chain-of-thought an emergent behavior of the optimized policy.
+
What makes their approach particularly interesting is its reliance on straightforward, [rule-based reward](http://itececuador.org) functions. +Instead of relying on expensive external models or human-graded examples as in traditional RLHF, the RL used for R1 uses simple criteria: it might give a higher reward if the answer is correct, if it follows the expected `<think>`/`<answer>` formatting, and if the [language](https://www.chinatio2.net) of the answer [matches](http://atticconsultants.co.ke) that of the prompt. +Not relying on a [reward model](https://fff.cl) also [means](https://120pest.com) you don't have to spend time and effort training it, and it doesn't take memory and compute away from your main model.
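+
A rule-based reward of this kind can be only a few lines long. The following is a minimal sketch, not DeepSeek's actual reward code. It assumes a `<think>...</think><answer>...</answer>` output template and exact-match answer checking, uses arbitrary illustrative weights, and leaves out the language-consistency term.

```python
import re

# Assumed output template: <think>...</think> followed by <answer>...</answer>.
FORMAT_RE = re.compile(
    r"^<think>.+?</think>\s*<answer>(.+?)</answer>$", re.DOTALL
)


def rule_based_reward(completion: str, reference_answer: str) -> float:
    """Score a completion with simple rules; the weights are illustrative."""
    reward = 0.0
    match = FORMAT_RE.match(completion.strip())
    if match:
        reward += 0.5  # format reward: the expected tags are present
        if match.group(1).strip() == reference_answer.strip():
            reward += 1.0  # accuracy reward: exact match with the reference
    return reward


if __name__ == "__main__":
    good = "<think>2 + 2 = 4</think><answer>4</answer>"
    print(rule_based_reward(good, "4"))
```

Because the reward is a plain function, it costs almost nothing to evaluate and needs no extra GPU memory during training.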
+
GRPO was introduced in the DeepSeekMath paper. Here's how GRPO works:
+
1. For each input prompt, the model generates multiple responses. +2. Each response receives a scalar reward based on factors like accuracy, formatting, and language consistency. +3. Rewards are adjusted relative to the group's performance, [essentially measuring](http://www.werbeagentur-petong.de) how much better each response is compared to the others. +4. The model updates its strategy slightly to favor responses with higher [relative](https://www.massimobonfatti.it) advantages. It only makes slight adjustments, using techniques like clipping and a KL penalty, to ensure the policy doesn't stray too far from its original behavior.
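+
Steps 3 and 4 above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the full GRPO objective: advantages are the group's rewards normalized by mean and (population) standard deviation, the update term is the PPO-style clipped surrogate for a single response, and the KL penalty is left out. The function names are hypothetical.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward against the other responses to the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population standard deviation
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_term(ratio: float, advantage: float, clip_eps: float = 0.2) -> float:
    """PPO-style clipped surrogate for one response (to be maximized).

    `ratio` is new_policy_prob / old_policy_prob for that response.
    """
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


if __name__ == "__main__":
    rewards = [1.0, 0.0, 0.0, 1.0]  # four sampled responses for one prompt
    print(group_relative_advantages(rewards))
    print(clipped_term(ratio=1.5, advantage=1.0))
```

Because the baseline is the group mean, no separate critic network is needed to estimate how good a response is.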
+
A cool aspect of GRPO is its flexibility. You can use simple rule-based reward functions, for instance awarding a bonus when the model correctly uses the `<think>` syntax, to guide the training.
+
While DeepSeek used GRPO, you could use alternative methods instead (PPO or PRIME).
+
For those looking to dive deeper, Will Brown has written quite a nice implementation of [training](http://communikationsclownsev.apps-1and1.net) an LLM with RL using GRPO. GRPO has also already been added to the Transformer Reinforcement Learning (TRL) library, which is another good resource. +Finally, Yannic Kilcher has a [great](https://janowiak.com.pl) video explaining GRPO by going through the DeepSeekMath paper.
+
Is RL on LLMs the path to AGI?
+
As a final note on explaining DeepSeek-R1 and the methodologies they've presented in their paper, I want to highlight a passage from the DeepSeekMath paper, based on a point Yannic Kilcher made in his video.
+
These findings indicate that RL enhances the model's overall performance by rendering the output distribution more robust, in other words, it seems that the improvement is attributed to boosting the correct response from TopK rather than the enhancement of fundamental capabilities.
+
In other words, RL fine-tuning tends to shape the output distribution so that the highest-probability outputs are more likely to be correct, even though the overall capability (as measured by the diversity of correct answers) is largely present in the pretrained model.
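+
One way to picture this is with two metrics over K sampled answers: Pass@K (is a correct answer anywhere among the samples?) and Maj@K (is the most frequent answer correct?). The sample lists below are invented for illustration and are not data from either paper.

```python
from collections import Counter
from typing import List


def pass_at_k(samples: List[str], correct: str) -> bool:
    """True if at least one of the K samples is correct."""
    return correct in samples


def maj_at_k(samples: List[str], correct: str) -> bool:
    """True if the most frequent answer among the K samples is correct."""
    most_common_answer, _ = Counter(samples).most_common(1)[0]
    return most_common_answer == correct


if __name__ == "__main__":
    correct = "15"
    before_rl = ["12", "15", "12", "9"]   # hypothetical samples, correct answer is rare
    after_rl = ["15", "15", "12", "15"]   # hypothetical samples, correct answer dominates
    for name, samples in [("before RL", before_rl), ("after RL", after_rl)]:
        print(name, pass_at_k(samples, correct), maj_at_k(samples, correct))
```

In this toy case the correct answer is already reachable before RL (Pass@K holds), and RL only changes how often it comes out on top (Maj@K).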
+
This suggests that reinforcement learning on LLMs is more about refining and "shaping" the existing distribution of responses rather than [endowing](https://minchi.co.za) the model with entirely new capabilities. +Consequently, while RL techniques such as PPO and GRPO can produce substantial performance gains, there appears to be an inherent ceiling determined by the underlying model's pretrained knowledge.
+
It is unclear to me how far RL will take us. Perhaps it will be the [stepping stone](http://marsonslaw.com) to the next big milestone. I'm excited to see how it unfolds!
+
Running DeepSeek-R1
+
I've used DeepSeek-R1 via the official chat interface for various problems, which it seems to solve well enough. The additional search functionality makes it even nicer to use.
+
Interestingly, o3-mini(-high) was released as I was writing this post. From my initial testing, R1 [seems](https://ghanainnovationhub.com) stronger at [math](http://www.amrstudio.cn33000) than o3-mini.
+
I also rented a single H100 via Lambda Labs for $2/h (26 CPU cores, 214.7 GB RAM, 1.1 TB SSD) to run some experiments. +The main goal was to see how the model would perform when [deployed](http://newmediacaucus.org) on a single H100 GPU, not to extensively test the model's capabilities.
+
671B via Llama.cpp
+
DeepSeek-R1 1.58-bit (UD-IQ1_S) [quantized model](http://www.vivazabogados.com) by Unsloth, with a 4-bit quantized KV-cache and partial GPU offloading (29 layers running on the GPU), running via llama.cpp:
+
29 [layers seemed](http://neuronadvisers.com) to be the sweet spot given this setup.
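+
A back-of-the-envelope estimate shows why only part of the model fits on the GPU. The sketch multiplies parameter count by bits per weight, which ignores the KV-cache, activations, and the fact that real quantized files mix precisions across tensors. The 80 GB GPU memory figure is an assumption about the H100 variant used.

```python
def weight_size_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    gpu_memory_gb = 80.0  # assumption: 80 GB H100 variant
    size = weight_size_gb(671e9, 1.58)  # 671B parameters at ~1.58 bits each
    print(f"approx. weights: {size:.1f} GB, GPU memory: {gpu_memory_gb:.0f} GB")
    print("fits entirely on GPU:", size <= gpu_memory_gb)
```

Since the estimate exceeds the GPU's memory, the remaining layers have to run from system RAM on the CPU, which is what partial offloading does.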
+
Performance:
+
A r/localllama user described that they were able to get over 2 tok/sec with DeepSeek R1 671B, without using their GPU on their local gaming setup. +[Digital Spaceport](https://rememberyournotes.com) wrote a full guide on how to run Deepseek R1 671b fully locally on a $2000 EPYC server, on which you can get ~4.25 to 3.5 tokens per second.
+
As you can see, the tokens/s isn't quite bearable for any serious work, but it's [fun](https://medimark.gr) to run these large models on accessible hardware.
+
What matters most to me is a combination of usefulness and time-to-usefulness in these models. Since reasoning models need to think before answering, their time-to-usefulness is usually higher than other models, but their usefulness is also usually higher. +We need to both maximize usefulness and minimize time-to-usefulness.
+
70B via Ollama
+
70.6b params, 4-bit KM quantized DeepSeek-R1 running via Ollama:
+
GPU utilization shoots up here, as expected when compared to the mostly CPU-powered run of 671B that I showcased above.
+
Resources
+
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning +[2402.03300] DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models +DeepSeek R1 - Notion (Building a fully local "deep researcher" with DeepSeek-R1 - YouTube). +DeepSeek R1's recipe to replicate o1 and the future of reasoning LMs. +The Illustrated DeepSeek-R1 - by Jay Alammar. +Explainer: What's R1 & Everything Else? - Tim Kellogg. +[DeepSeek](http://www.v3fashion.de) R1 Explained to your grandma - YouTube
+
DeepSeek
+
- Try R1 at chat.deepseek.com. +- GitHub - deepseek-[ai](https://argotravel.ge)/DeepSeek-R1. +- deepseek-[ai](https://www.valentinourologo.it)/Janus-Pro-7B · Hugging Face (January 2025): Janus-Pro is a novel autoregressive framework that unifies multimodal understanding and generation. It can both understand and generate images. +- DeepSeek-R1: Incentivizing Reasoning Capability in Large [Language](https://swatisaini.com) Models via Reinforcement [Learning](https://cphallconstlts.com) (January 2025) This paper introduces DeepSeek-R1, an [open-source reasoning](https://iptargeting.com) model that rivals the performance of OpenAI's o1. It presents a detailed methodology for [training](https://gitea.lolumi.com) such models using large-scale reinforcement learning techniques. +- DeepSeek-V3 Technical Report (December 2024) This report discusses the implementation of an FP8 mixed precision training framework validated on an extremely large-scale model, achieving both accelerated training and reduced GPU memory usage. +- DeepSeek LLM: [Scaling Open-Source](http://redsnowcollective.ca) Language Models with Longtermism (January 2024) This paper delves into scaling laws and presents findings that facilitate the scaling of [large-scale models](https://cfarrospide.com) in open-source configurations. It introduces the DeepSeek LLM project, dedicated to advancing open-source language models with a [long-term](https://unimdiaspora.ro) perspective. +- DeepSeek-Coder: When the Large Language Model Meets Programming - The Rise of Code Intelligence (January 2024) This research introduces the DeepSeek-Coder series, a range of open-source code models trained from scratch on 2 trillion tokens. The models are pre-trained on a high-quality project-level code corpus and employ a fill-in-the-blank task to enhance code generation and infilling.
+- DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model (May 2024) This paper presents DeepSeek-V2, a Mixture-of-Experts (MoE) language model characterized by [economical training](http://kaminskilukasz.com) and efficient inference. +- DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence (June 2024) This research introduces DeepSeek-Coder-V2, an open-source Mixture-of-Experts (MoE) code language model that achieves performance comparable to GPT-4 Turbo in code-specific tasks.
+
Interesting events
+
- Hong Kong University replicates R1 results (Jan 25, '25). +- Huggingface [announces](https://www.konstrukt.com.br) huggingface/open-r1: Fully open reproduction of DeepSeek-R1 to replicate R1, fully open source (Jan 25, '25). +- OpenAI researcher confirms the DeepSeek team [independently found](http://www.profecogest.fr) and used some core ideas the OpenAI team used on the way to o1
+
\ No newline at end of file