Basic knowledge
Differences among DeepSeek Models
Source: Baidu search, "Differences between DeepSeek models"

**The DeepSeek series mainly includes DeepSeek-R1, DeepSeek-V3, DeepSeek-VL, DeepSeek-V2, and DeepSeek-R1-Zero. These models differ significantly in architecture, training method, parameter scale, and application scenario.**
### Architectural differences
- DeepSeek-R1: Based on the Transformer architecture; reportedly optimized for reasoning and trained with reinforcement learning.
- DeepSeek-V3: Adopts a Mixture-of-Experts (MoE) architecture, combining Multi-head Latent Attention (MLA) with the DeepSeekMoE design, at a parameter scale of 685 billion.
- DeepSeek-VL: Based on a decoder-only, LLaVA-style architecture with three core modules: a vision encoder, a vision-language adapter, and a Mixture-of-Experts language model.
- DeepSeek-V2: Adopts the Transformer architecture, introducing MLA and a self-developed sparse structure.
- DeepSeek-R1-Zero: Similar to DeepSeek-R1, but reportedly optimized for training without labeled data.
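The key idea behind the MoE architectures mentioned above (DeepSeekMoE and the sparse structure in V2/V3) is that a router sends each token to only a few "expert" feed-forward networks, so most parameters stay idle per token. The following is a toy numpy sketch of top-k expert routing under simplified assumptions (single token, linear experts, no load balancing); the function and variable names are illustrative, not DeepSeek's actual implementation:

```python
import numpy as np

def moe_layer(x, gate_w, expert_ws, top_k=2):
    """Minimal sketch of a top-k Mixture-of-Experts (MoE) layer.

    x:         (d_model,) input token vector
    gate_w:    (d_model, n_experts) router weights
    expert_ws: list of (d_model, d_model) expert weight matrices
    """
    scores = x @ gate_w                    # router logits, one per expert
    top = np.argsort(scores)[-top_k:]      # indices of the top-k experts
    probs = np.exp(scores[top] - scores[top].max())
    probs /= probs.sum()                   # softmax over the selected experts
    # Sparse activation: only the chosen experts run for this token;
    # their outputs are combined, weighted by normalized router scores.
    return sum(p * (x @ expert_ws[i]) for p, i in zip(probs, top))

rng = np.random.default_rng(0)
d, n_experts = 8, 4
out = moe_layer(rng.normal(size=d),
                rng.normal(size=(d, n_experts)),
                [rng.normal(size=(d, d)) for _ in range(n_experts)])
print(out.shape)  # (8,)
```

With `top_k=2` of 4 experts, only half the expert parameters are touched per token, which is how MoE models reach very large total parameter counts (e.g. hundreds of billions) while keeping per-token compute modest.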
### Training methods
- DeepSeek-R1: Uses large-scale reinforcement learning in post-training, combining two core models, DeepSeek-R1-Zero and DeepSeek-R1.
- DeepSeek-V3: Uses conventional deep-learning training, relying on large amounts of data to build general capabilities.
- DeepSeek-VL: Trained in three stages: vision-language alignment, vision-language pretraining, and supervised fine-tuning.
- DeepSeek-V2: Trained on the efficient, lightweight HAI-LLM framework, using 16-way zero-bubble pipeline parallelism and ZeRO-1 data parallelism.
- DeepSeek-R1-Zero: Relies almost entirely on machine-generated data for reinforcement-learning training, with essentially no human-annotated data.
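ZeRO-1 data parallelism, mentioned for DeepSeek-V2, saves memory by sharding the *optimizer state* across data-parallel workers while every worker still holds the full parameters and gradients. Below is a toy single-process simulation of one such step (SGD with momentum standing in for the real optimizer; all names are illustrative), not the actual HAI-LLM or DeepSpeed implementation:

```python
import numpy as np

def zero1_step(params, worker_grads, momenta, lr=0.1, beta=0.9):
    """Toy simulation of one ZeRO-1 data-parallel optimizer step.

    Every worker holds the full parameter vector, but optimizer state
    (here, momentum) is sharded: worker i stores momentum only for its
    1/N slice of the parameters -- the source of ZeRO-1's memory saving.
    """
    n = len(worker_grads)
    # 1) All-reduce: average the gradients across data-parallel workers.
    grad = np.mean(worker_grads, axis=0)
    shards = np.array_split(np.arange(len(params)), n)
    new_params = params.copy()
    for i, idx in enumerate(shards):
        # 2) Worker i updates only its own parameter shard, using the
        #    momentum state that only it stores.
        momenta[i] = beta * momenta[i] + grad[idx]
        new_params[idx] -= lr * momenta[i]
    # 3) All-gather: updated shards are broadcast so every worker again
    #    holds the full parameter vector (trivially true here).
    return new_params

params = np.zeros(8)
momenta = [np.zeros(2) for _ in range(4)]               # 4 workers, 2 params each
grads = [np.ones(8) * w for w in (0.5, 1.0, 1.5, 2.0)]  # per-worker gradients
params = zero1_step(params, grads, momenta)
print(params)  # every parameter moves by -lr * mean gradient = -0.125
```

Pipeline parallelism (the "16-way zero-bubble" part) is orthogonal: it splits the model's layers across devices, and zero-bubble scheduling interleaves micro-batches so no device sits idle waiting for others.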
### Parameter scale and application scenarios
- DeepSeek-R1: 660B parameters; suited to mathematics, coding, and complex logical-reasoning tasks.
- DeepSeek-V3: 671 billion parameters; suited to chat, coding, multilingual translation, and multimodal scenarios such as image generation and AI painting.
- DeepSeek-VL: Suited to multimodal tasks such as VQA, OCR, document/table/chart understanding, and visual grounding.
- DeepSeek-V2: 236 billion parameters; excels at comprehensive Chinese-language tasks and suits a wide range of natural-language-processing tasks.
- DeepSeek-R1-Zero: 660B parameters; suited to complex reasoning tasks, and performs better in scenarios without human-annotated data.