LLMs | Baichuan 2: Translation and Commentary on "Baichuan 2: Open Large-scale Language Models"
Editor's note: On September 6, 2023, Baichuan Intelligent Technology released Baichuan 2. The technical report introduces Baichuan 2, an open-source large-scale language model, covering its performance across many domains and its safety measures. The paper stresses the value of open-source models, especially for researchers and application developers working in languages other than English, and details the technical aspects of pre-training, fine-tuning, and safety. Through open-sourcing and transparency, Baichuan 2 gives the research community an opportunity to study and improve the safety and performance of large language models in depth.
Baichuan 2 = 7B/13B models + 2.6T training tokens + 1,024 A800 GPUs
Baichuan 2 comprises two models, Baichuan 2-7B and Baichuan 2-13B, both pre-trained on 2.6T tokens, the largest corpus to date. Baichuan 2 outperforms Baichuan 1 and other open-source models on general benchmarks.
Pre-training: data drawn from diverse sources, 2.6T tokens processed. The architecture is Transformer-based, with refinements to positional encoding and training optimization.
Alignment: supervised fine-tuning plus reinforcement learning from human feedback yields two chat models, Baichuan 2-7B-Chat and Baichuan 2-13B-Chat.
Safety: measures are applied at every stage from pre-training through alignment, and comparisons show an advantage over LLaMA 2.
Evaluation: Baichuan 2 performs strongly across general benchmarks, professional domains, math and coding, and multilingual tasks. Intermediate checkpoints are also released for research on training dynamics.
Limitations and responsible use: the report discusses remaining challenges around safety, bias, and knowledge staleness, and calls for reasonable and responsible use.
Overall, the technical report systematically presents Baichuan 2's training methodology and performance, contributing to openness and transparency.
Related articles
LLMs | Baichuan: Baichuan-13B (and Baichuan-7B): overview, installation, and usage guide
LLMs | Baichuan 2: Baichuan 2: overview, installation, and usage guide
"Baichuan 2: Open Large-scale Language Models": Translation and Commentary
Links
Technical report: https://cdn.baichuan-ai.com/paper/Baichuan2-technical-report.pdf
GitHub: https://github.com/baichuan-inc/Baichuan2 (a series of large language models developed by Baichuan Intelligent Technology)
Date
September 6, 2023
Authors
Baichuan Intelligent Technology
Abstract
A few natural-language instruction examples greatly reduce the need for feature engineering (FE); strong models are closed-source and limited outside English → the open-source Baichuan 2 is proposed = 7B/13B parameters + 2.6T training tokens + strong on public benchmarks + strong in vertical domains
Large language models (LLMs) have demonstrated remarkable performance on a variety of natural language tasks based on just a few examples of natural language instructions, reducing the need for extensive feature engineering. However, most powerful LLMs are closed-source or limited in their capability for languages other than English. In this technical report, we present Baichuan 2, a series of large-scale multilingual language models containing 7 billion and 13 billion parameters, trained from scratch on 2.6 trillion tokens. Baichuan 2 matches or outperforms other open-source models of similar size on public benchmarks like MMLU, CMMLU, GSM8K, and HumanEval. Furthermore, Baichuan 2 excels in vertical domains such as medicine and law. We will release all pre-training model checkpoints to benefit the research community in better understanding the training dynamics of Baichuan 2.
1 Introduction
The trend in language-model scale (understanding + generation): parameters grew from millions (ELMo/GPT-1) to billions and even trillions (GPT-3/PaLM/Switch Transformers); capability improved accordingly (human-like fluency, a wide range of NLP tasks); ChatGPT marked a breakthrough.
The field of large language models has witnessed promising and remarkable progress in recent years. The size of language models has grown from millions of parameters, such as ELMo (Peters et al., 2018), GPT-1 (Radford et al., 2018), to billions or even trillions of parameters such as GPT-3 (Brown et al., 2020), PaLM (Chowdhery et al., 2022; Anil et al., 2023) and Switch Transformers (Fedus et al., 2022). This increase in scale has led to significant improvements in the capabilities of language models, enabling more human-like fluency and the ability to perform a diverse range of natural language tasks. With the introduction of ChatGPT (OpenAI, 2022) from OpenAI, the power of these models to generate human-like text has captured widespread public attention. ChatGPT demonstrates strong language proficiency across a variety of domains, from conversing casually to explaining complex concepts. This breakthrough highlights the potential for large language models to automate tasks involving natural language generation and comprehension.
Most leading LLMs are closed-source (GPT-4/Claude), making deep study difficult; open-source models such as LLaMA have advanced the open ecosystem and accelerated derivative base models (e.g., Alpaca/Vicuna).
While there have been exciting breakthroughs and applications of LLMs, most leading LLMs like GPT-4 (OpenAI, 2023), PaLM-2 (Anil et al., 2023), and Claude (Claude, 2023) remain closed-source. Developers and researchers have limited access to the full model parameters, making it difficult for the community to deeply study or fine-tune these systems. More openness and transparency around LLMs could accelerate research and responsible development within this rapidly advancing field. LLaMA (Touvron et al., 2023a), a series of large language models developed by Meta containing up to 65 billion parameters, has significantly benefited the LLM research community by being fully open-sourced. The open nature of LLaMA, along with other open-source LLMs such as OPT (Zhang et al., 2022), Bloom (Scao et al., 2022), MPT (MosaicML, 2023) and Falcon (Penedo et al., 2023), enables researchers to freely access the models for examination, experimentation, and further development. This transparency and access distinguish LLaMA from other proprietary LLMs. By providing full access, the open-source LLMs have accelerated research and advances in the field, leading to new models like Alpaca (Taori et al., 2023), Vicuna (Chiang et al., 2023), and others (Wang et al., 2022; Zhu et al., 2023; Anand et al., 2023).
Current predicament: the corpora of most open-source LLMs are mainly English, leaving them weak in Chinese.
However, most open-source large language models have focused primarily on English. For instance, the main data source for LLaMA is Common Crawl, which comprises 67% of LLaMA's pre-training data but is filtered to English content only. Other open-source LLMs such as MPT (MosaicML, 2023) and Falcon (Penedo et al., 2023) are also focused on English and have limited capabilities in other languages. This hinders the development and application of LLMs in specific languages, such as Chinese.
This paper proposes Baichuan 2 = two models (7B/13B parameters) + 2.6T training tokens + roughly 30% higher scores on general benchmarks (with math and code results nearly doubled) + strong performance in professional domains (medicine and law)
In this technical report, we introduce Baichuan 2, a series of large-scale multilingual language models. Baichuan 2 has two separate models, Baichuan 2-7B with 7 billion parameters and Baichuan 2-13B with 13 billion parameters. Both models were trained on 2.6 trillion tokens, which to our knowledge is the largest to date, more than double that of Baichuan 1 (Baichuan, 2023b,a). With such a massive amount of training data, Baichuan 2 achieves significant improvements over Baichuan 1. On general benchmarks like MMLU (Hendrycks et al., 2021a), CMMLU (Li et al., 2023), and C-Eval (Huang et al., 2023), Baichuan 2-7B achieves nearly 30% higher performance compared to Baichuan 1-7B. Specifically, Baichuan 2 is optimized to improve performance on math and code problems. On the GSM8K (Cobbe et al., 2021) and HumanEval (Chen et al., 2021) evaluations, Baichuan 2 nearly doubles the results of Baichuan 1. In addition, Baichuan 2 also demonstrates strong performance on medical and legal domain tasks. On benchmarks such as MedQA (Jin et al., 2021) and JEC-QA (Zhong et al., 2020), Baichuan 2 outperforms other open-source models, making it a suitable foundation model for domain-specific optimization.
The report also releases two chat models (Baichuan 2-7B-Chat and Baichuan 2-13B-Chat) and finds that even the 7B model keeps improving beyond 2.6T tokens of training
Additionally, we also released two chat models, Baichuan 2-7B-Chat and Baichuan 2-13B-Chat, optimized to follow human instructions. These models excel at dialogue and context understanding. We will elaborate on our approaches to improve the safety of Baichuan 2. By open-sourcing these models, we hope to enable the community to further improve the safety of large language models, facilitating more research on responsible LLM development.
Furthermore, in the spirit of research collaboration and continuous improvement, we are also releasing the checkpoints of Baichuan 2 at various stages of training, from 200 billion tokens up to the full 2.6 trillion tokens. We found that even for the 7 billion parameter model, performance continued to improve after training on more than 2.6 trillion tokens. By sharing these intermediary results, we hope to provide the community with greater insight into the training dynamics of Baichuan 2. Understanding these dynamics is key to unraveling the inner working mechanisms of large language models (Biderman et al., 2023a; Tirumala et al., 2022). We believe the release of these checkpoints will pave the way for further advances in this rapidly developing field.
In this technical report, we will also share some of the trials, errors, and lessons learned through training Baichuan 2. In the following sections, we will present detailed modifications made to the vanilla Transformer architecture and our training methodology. We will then describe our fine-tuning methods to align the foundation model with human preferences. Finally, we will benchmark the performance of our models against other LLMs on a set of standard tests. Throughout the report, we aim to provide transparency into our process, including unsuccessful experiments, to advance collective knowledge in developing LLMs. Baichuan 2's foundation models and chat models are available for both research and commercial use at https://github.com/baichuan-inc/Baichuan2.
2 Pre-training
This section introduces the training procedure for the Baichuan 2 foundation models. Before diving into the model details, we first show the overall performance of the Baichuan 2 base models compared to other open or closed-sourced models in Table 1. We then describe our pre-training data and data processing methods. Next, we elaborate on the Baichuan 2 architecture and scaling results. Finally, we describe the distributed training system.
Table 1: Overall results of Baichuan 2 compared with other similarly sized LLMs on general benchmarks. * denotes results derived from official websites.
2.1 Pre-training Data
Data sourcing (goals: scalability and representativeness; diverse sources forming a world knowledge system)
Data processing (an efficient large-scale deduplication and clustering system)
Data sourcing: During data acquisition, our objective is to pursue comprehensive data scalability and representativeness. We gather data from diverse sources including general internet webpages, books, research papers, codebases, and more to build an extensive world knowledge system. The composition of the training corpus is shown in Figure 1.
Data processing: For data processing, we focus on data frequency and quality. Data frequency relies on clustering and deduplication. We built a large-scale deduplication and clustering system supporting both LSH-like features and dense embedding features. This system can cluster and deduplicate trillion-scale data within hours. Based on the clustering, individual documents, paragraphs, and sentences are deduplicated and scored. Those scores are then used for data sampling in pre-training. The size of the training data at different stages of data processing is shown in Figure 2.
Figure 1: The distribution of different categories of Baichuan 2 training data.
Figure 2: The data processing procedure of Baichuan 2’s pre-training data.
Raw corpus → Exact deduplication → Heuristic approach → Sent-wise quality filter → Sent-wise and paragraph-wise deduplication → Document-wise deduplication
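The deduplication stage above can be illustrated with a toy MinHash sketch in Python (a hypothetical miniature; the production system described in the report supports both LSH-like and dense-embedding features and clusters trillion-scale data within hours):

```python
import hashlib

def shingles(text, n=3):
    """Character n-grams used as the document's feature set."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def minhash(features, num_perm=64):
    """One hash function per signature slot, seeded by the slot index;
    each slot keeps the minimum hash value over the feature set."""
    return [min(
        int.from_bytes(hashlib.md5(f"{seed}:{f}".encode()).digest()[:8], "big")
        for f in features)
        for seed in range(num_perm)]

def similarity(sig_a, sig_b):
    """Estimated Jaccard similarity: fraction of matching signature slots."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

doc1 = "the quick brown fox jumps over the lazy dog"
doc2 = "the quick brown fox jumps over the lazy cat"   # near-duplicate
doc3 = "completely different sentence about language models"

s1, s2, s3 = (minhash(shingles(d)) for d in (doc1, doc2, doc3))
assert similarity(s1, s2) > similarity(s1, s3)  # near-dup scores far higher
```

In an LSH scheme the signatures would be banded and hashed into buckets so that near-duplicates collide without pairwise comparison.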
2.2 Architecture: modified from the prevailing Transformer
The model architecture of Baichuan 2 is based on the prevailing Transformer (Vaswani et al., 2017). Nevertheless, we made several modifications, which we detail below.
2.3 Tokenizer
Two factors to balance: a high compression rate and vocabulary size (64K → 125,696; BPE, with digits split into individual tokens and whitespace-only tokens added for code)
A tokenizer needs to balance two critical factors: a high compression rate for efficient inference, and an appropriately sized vocabulary to ensure adequate training of each word embedding. We have taken both these aspects into account. We have expanded the vocabulary size from 64,000 in Baichuan 1 to 125,696, aiming to strike a balance between computational efficiency and model performance.
We use byte-pair encoding (BPE) (Shibata et al., 1999) from SentencePiece (Kudo and Richardson, 2018) to tokenize the data. Specifically, we do not apply any normalization to the input text and we do not add a dummy prefix as in Baichuan 1. We split numbers into individual digits to better encode numeric data. To handle code data containing extra whitespaces, we add whitespace-only tokens to the tokenizer. The character coverage is set to 0.9999, with rare characters falling back to UTF-8 bytes.
We set the maximum token length to 32 to account for long Chinese phrases. The training data for the Baichuan 2 tokenizer comes from the Baichuan 2 pre-training corpus, with more sampled code examples and academic papers to improve coverage (Taylor et al., 2022). Table 2 shows a detailed comparison of Baichuan 2’s tokenizer with others.
Table 2: The vocab size and text compression rate of Baichuan 2’s tokenizer compared with other models. The lower the better.
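The two tokenizer rules described above, splitting numbers into individual digits and preserving runs of whitespace, can be sketched with a toy regex pre-tokenizer (an illustration only; the actual tokenizer is a SentencePiece BPE model, not this regex):

```python
import re

# Toy pre-tokenizer illustrating two of the rules:
# - every digit becomes its own token, so "2023" -> "2", "0", "2", "3"
# - runs of spaces survive as whitespace-only tokens, preserving code indentation
PATTERN = re.compile(r"\d| +|[^\d ]+")

def pre_tokenize(text):
    return PATTERN.findall(text)

print(pre_tokenize("loss=0.25"))      # ['loss=', '0', '.', '2', '5']
print(pre_tokenize("    return x1"))  # ['    ', 'return', ' ', 'x', '1']
```

Keeping digits separate lets the model compose arbitrary numbers from a tiny digit alphabet, and whitespace-only tokens keep indentation-sensitive code intact.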
2.3.1 Positional Embeddings (as in Baichuan 1): RoPE for Baichuan 2-7B (a better fit for Flash Attention) and ALiBi for Baichuan 2-13B
Building on Baichuan 1, we adopt Rotary Positional Embedding (RoPE) (Su et al., 2021) for Baichuan 2-7B and ALiBi (Press et al., 2021) for Baichuan 2-13B. ALiBi is a more recent positional encoding technique that has shown improved extrapolation performance. However, most open-sourced models use RoPE for positional embeddings, and optimized attention implementations like Flash Attention (Dao et al., 2022; Dao, 2023) are currently better suited to RoPE since it is multiplication-based, bypassing the need for passing attention_mask to the attention operation. Nevertheless, in preliminary experiments, the choice of positional embedding did not significantly impact model performance. To enable further research on bias-based and multiplication-based attention, we apply RoPE on Baichuan 2-7B and ALiBi on Baichuan 2-13B, consistent with Baichuan 1.
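For reference, the per-head ALiBi slopes follow the geometric sequence from Press et al. (2021); a minimal sketch (not Baichuan-specific code, and assuming the head count is a power of two):

```python
def alibi_slopes(n_heads):
    """Geometric sequence of per-head slopes from the ALiBi paper:
    for n heads (a power of two), head i uses slope 2**(-8*(i+1)/n)."""
    start = 2 ** (-8.0 / n_heads)
    return [start ** (i + 1) for i in range(n_heads)]

def alibi_bias(slope, seq_len):
    """Additive attention bias: 0 on the diagonal, -slope per token of
    distance between query position q and key position k (causal)."""
    return [[-slope * (q - k) if k <= q else 0.0 for k in range(seq_len)]
            for q in range(seq_len)]

slopes = alibi_slopes(8)   # [0.5, 0.25, ..., 2**-8]
# head 0 penalizes distance most steeply, head 7 most gently
```

Because the bias is simply added to attention scores (rather than multiplied into queries and keys as RoPE is), it requires passing a bias tensor into the attention kernel, which is the compatibility issue with Flash Attention noted above.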
2.4 Activations and Normalizations: SwiGLU + xFormers memory-efficient attention (its biasing support pairs with ALiBi while cutting memory overhead) + RMSNorm (layer normalization on the input of each Transformer block)
We use the SwiGLU (Shazeer, 2020) activation function, a switch-activated variant of GLU (Dauphin et al., 2017) that shows improved results. However, SwiGLU has a "bilinear" layer and contains three parameter matrices, differing from the vanilla Transformer's feed-forward layer that has two matrices, so we reduce the feed-forward size from 4 times the hidden size to 8/3 times the hidden size, rounded to a multiple of 128.
For the attention layer of Baichuan 2, we adopt the memory efficient attention (Rabe and Staats, 2021) implemented by xFormers. By leveraging xFormers' optimized attention with biasing capabilities, we can efficiently incorporate ALiBi's bias-based positional encoding while reducing memory overhead. This provides performance and efficiency benefits for Baichuan 2's large-scale training.
We apply Layer Normalization (Ba et al., 2016) to the input of the Transformer block which is more robust to the warm-up schedule (Xiong et al., 2020). In addition, we use the RMSNorm implementation introduced by (Zhang and Sennrich, 2019), which only calculates the variance of input features to improve efficiency.
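A minimal RMSNorm sketch (a hypothetical scalar reference implementation of Zhang and Sennrich's formulation, not the production kernel):

```python
import math

def rms_norm(x, gain=None, eps=1e-6):
    """RMSNorm (Zhang & Sennrich, 2019): rescale by the root-mean-square
    of the features. Unlike LayerNorm, no mean is subtracted, so only the
    second moment of the input needs to be computed."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    gain = gain or [1.0] * len(x)
    return [g * v / rms for g, v in zip(gain, x)]

out = rms_norm([3.0, -4.0])
# rms = sqrt((9 + 16) / 2) ≈ 3.5355, so out ≈ [0.8485, -1.1314]
```

Dropping the mean subtraction is the efficiency win the text alludes to: one reduction over the features instead of two.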
2.5 Optimizations
AdamW is used (linear warm-up over 2,000 steps, then cosine decay to the minimum learning rate)
Table 3: Model details of Baichuan 2.
Mixed precision (BFloat16 offers a better dynamic range)
The models are trained entirely in BFloat16 mixed precision. Compared to Float16, BFloat16 has a better dynamic range, making it more robust to the large values that are critical in training large language models. However, BFloat16's low precision causes issues in some settings. For instance, in some public RoPE and ALiBi implementations, the torch.arange operation fails due to collisions when the integer exceeds 256, preventing differentiation of nearby positions. Therefore, we use full precision for some value-sensitive operations such as positional embeddings.
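The collision can be demonstrated by emulating bfloat16 as a float32 truncated to its top 16 bits (an emulation assumption; hardware may round to nearest rather than truncate, but 256 and 257 collide either way since the dropped bits are a tie that rounds to even):

```python
import struct

def to_bfloat16(x):
    """Emulate bfloat16 by truncating a float32 to its top 16 bits
    (bfloat16 keeps the float32 exponent but only 7 mantissa bits)."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

# Below 256, every integer is exactly representable...
assert to_bfloat16(255.0) == 255.0
# ...but 256 and 257 collapse to the same value, so a position index
# computed in bfloat16 can no longer tell adjacent positions apart.
assert to_bfloat16(256.0) == to_bfloat16(257.0) == 256.0
```

This is exactly why a bfloat16 `torch.arange` over long sequences produces duplicate position values, and why the positional computations are kept in full precision.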
NormHead (normalizing the output embeddings) has two advantages: it markedly stabilizes training dynamics and removes the distraction of L2 distance when computing logits
NormHead: To stabilize training and improve model performance, we normalize the output embeddings (also referred to as the 'head'). NormHead has two advantages in our experiments. First, in preliminary experiments we found that the norm of the head is prone to instability: the embedding norms of rare tokens shrink during training, which disturbs the training dynamics, and NormHead stabilizes the dynamics significantly. Second, we found that semantic information is mainly encoded by the cosine similarity of embeddings rather than by L2 distance. Since the current linear classifier computes logits by dot product, which mixes L2 distance and cosine similarity, NormHead alleviates the distraction of L2 distance in computing logits. For more details, please refer to Appendix C.
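A toy sketch of the NormHead idea, with hypothetical shapes and names: L2-normalize each vocabulary row of the head so the logit reflects direction (cosine) rather than the row's magnitude:

```python
import math

def norm_head_logits(hidden, head_weights, eps=1e-8):
    """Logits from an L2-normalized output head: each vocabulary row is
    rescaled to unit norm before the dot product with the hidden state,
    so a rare token's shrinking row norm no longer suppresses its logit."""
    logits = []
    for row in head_weights:          # one row per vocabulary token
        norm = math.sqrt(sum(w * w for w in row)) + eps
        logits.append(sum(h * w / norm for h, w in zip(hidden, row)))
    return logits

hidden = [1.0, 0.0]
# token 0 points in the same direction as token 1 but with 100x the magnitude
head = [[100.0, 0.0], [1.0, 0.0]]
logits = norm_head_logits(hidden, head)
# after normalization both rows yield (almost) the same logit
assert abs(logits[0] - logits[1]) < 1e-6
```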
Max-z loss: normalizing the logits helps stabilize training
Max-z loss: During training, we found that the logits of LLMs could become very large. While the softmax function is agnostic to the absolute logit values, depending only on their relative values, large logits caused issues during inference because common implementations of repetition penalty (such as the Hugging Face implementation in model.generate) apply a scalar (e.g., 1.1 or 1.2) directly to the logits. Contracting very large logits in this way can significantly alter the probabilities after softmax, making the model sensitive to the choice of the repetition-penalty hyper-parameter. Inspired by NormSoftmax (Jiang et al., 2023b) and the auxiliary z-loss from PaLM (Chowdhery et al., 2022), we added a max-z loss to normalize the logits:
L_max-z = 2e-4 * z^2
where z is the maximum logit value. This helped stabilize training and made inference more robust to the choice of hyper-parameters.
The final training loss of Baichuan 2-7B and Baichuan 2-13B are shown in Figure 3.
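A minimal sketch of the auxiliary term (the 2e-4 coefficient follows the report's formula; treat it as a tunable assumption):

```python
def max_z_loss(logits, coeff=2e-4):
    """Auxiliary loss penalizing the largest logit: coeff * z**2.
    Added to the language-modeling loss, it discourages runaway logit
    magnitudes without changing the softmax's relative ordering."""
    z = max(logits)
    return coeff * z ** 2

# a batch with runaway logits pays a quadratically larger penalty
assert max_z_loss([1.0, 3.0, 2.0]) == 2e-4 * 9.0
assert max_z_loss([10.0, 90.0]) > max_z_loss([1.0, 3.0])
```

Because the penalty is quadratic in the peak logit, repetition-penalty scalars applied at inference time perturb the post-softmax probabilities far less.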
Figure 3: The pre-training loss of Baichuan 2.
2.6 Scaling Laws (assuring performance as training gets expensive): fit a scaling law by training a series of small models (10M to 3B)
Neural scaling laws, where the error decreases as a power function of training set size, model size, or both, have made performance predictable as training in deep learning and large language models becomes more and more expensive. Before training large language models of billions of parameters, we first train some small-sized models and fit a scaling law for training larger models.
We launched a range of model sizes going from 10M to 3B, from roughly 1/1000 to 1/10 the size of the final model, and each model was trained for up to 1 trillion tokens, using consistent hyper-parameters and the same data set sourced from Baichuan 2. Based on the final losses of the different models, we obtain a mapping from training flops to target loss.
To fit the scaling law of the model, we employed the formula given by Henighan et al. (2020):
L_C = a * C^b + L_inf
where L_inf is the irreducible loss and the first term is the reducible loss, formulated as a power-law scaling term. C denotes training flops and L_C the final loss of the model at that flops. We used the curve_fit function from the SciPy library to fit the parameters. The final fitted scaling curve and the predicted final losses of the 7 billion and 13 billion parameter models are shown in Figure 4. We can see that the fitted scaling law predicted Baichuan 2's final loss with high accuracy.
Figure 4: The scaling law of Baichuan 2. We trained various models ranging from 10 million to 3 billion parameters with 1 trillion tokens. By fitting a power law term to the losses given training flops, we predicted losses for training Baichuan 2-7B and Baichuan 2-13B on 2.6 trillion tokens. This fitting process precisely predicted the final models’ losses (marked with two stars).
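The fitting step can be sketched in miniature. The report fits all three parameters of L_C = a * C^b + L_inf with SciPy's curve_fit; the toy below assumes the irreducible loss is known and recovers a and b by least squares in log space (synthetic data, hypothetical parameter values):

```python
import math

# Fit L(C) = a * C**b + L_inf with L_inf assumed known, via
#   log(L - L_inf) = log(a) + b * log(C)   (ordinary least squares)
def fit_power_law(flops, losses, l_inf):
    xs = [math.log(c) for c in flops]
    ys = [math.log(l - l_inf) for l in losses]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = math.exp(my - b * mx)
    return a, b

# synthetic "small model" runs generated from known parameters
a_true, b_true, l_inf = 5.0, -0.05, 1.7
flops = [1e18, 1e19, 1e20, 1e21]
losses = [a_true * c ** b_true + l_inf for c in flops]

a_fit, b_fit = fit_power_law(flops, losses, l_inf)
# extrapolate to a much larger training budget
pred = a_fit * 1e24 ** b_fit + l_inf
```

The extrapolated loss sits between the irreducible floor and the smallest observed loss, which is the qualitative shape of the curve in Figure 4.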
2.7 Infrastructure
A co-design for efficient GPU utilization: an elastic training framework (tensor parallelism + ZeRO-powered data parallelism) with a smart cluster scheduling policy (task resources adjusted dynamically from cluster status), plus a tensor-splitting technique to reduce peak memory consumption
Efficiently leveraging existing GPU resources plays a critically important role in training and developing large language models today. To accomplish this, we develop a co-design approach for an elastic training framework and a smart cluster scheduling policy.
Since our GPUs are shared among multiple users and tasks, the specific behavior of each task is unpredictable, often leaving GPU nodes idle within the cluster. Considering that a single machine equipped with eight A800 GPUs adequately meets the memory requirements of our Baichuan 7B and Baichuan 13B models, the primary design criterion for our training framework is machine-level elasticity: the resources for tasks can be dynamically adjusted according to the cluster status, which serves as the foundation for our smart scheduling algorithm.
To meet the requirement of the machine-level elasticity, our training framework integrates tensor parallelism (Narayanan et al., 2021) and ZeRO-powered data parallelism (Rajbhandari et al., 2020), where we set tensor parallelism inside each machine and employ ZeRO shared data parallelism for elastic scaling across machines.
In addition, we employ a tensor-splitting technique (Nie et al., 2022) where we split certain calculations to reduce peak memory consumption, such as the cross-entropy calculations with large vocabularies. This approach enables us to meet memory needs without extra computing and communication, making the system more efficient.
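The tensor-splitting idea for the large-vocabulary cross-entropy can be sketched with a chunked (streaming) logsumexp, so only a small block of logits is materialized at a time (a toy scalar version, not the actual implementation from Nie et al.):

```python
import math

def chunked_log_softmax_nll(logits, target, chunk=4):
    """Negative log-likelihood of `target` computed with a chunked
    logsumexp over the vocabulary: a streaming max plus a running sum of
    shifted exponentials, so only `chunk` logits are touched at a time."""
    running_max, running_sum = float("-inf"), 0.0
    for i in range(0, len(logits), chunk):
        block = logits[i:i + chunk]
        new_max = max(running_max, max(block))
        # rescale the running sum to the new max before adding this block
        running_sum = (running_sum * math.exp(running_max - new_max)
                       + sum(math.exp(v - new_max) for v in block))
        running_max = new_max
    log_z = running_max + math.log(running_sum)
    return log_z - logits[target]

logits = [0.5, 2.0, -1.0, 3.0, 0.0, 1.5, -2.0, 0.7]
naive = math.log(sum(math.exp(v) for v in logits)) - logits[3]
assert abs(chunked_log_softmax_nll(logits, 3) - naive) < 1e-9
```

With a 125k-token vocabulary, splitting this reduction is what keeps the peak activation memory bounded without extra communication.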
Mixed-precision training (forward and backward computation in BFloat16; optimizer updates in Float32):
To further accelerate training without compromising model accuracy, we implement mixed-precision training, where we perform forward and backward computations in BFloat16, while performing optimizer updating in Float32.
Two techniques to avoid degraded communication efficiency: topology-aware distributed training + a hybrid and hierarchical partition for ZeRO
Furthermore, in order to efficiently scale our training cluster to thousands of GPUs, we integrate the following techniques to avoid the degradation of communication efficiency:
>>Topology-aware distributed training. In large-scale clusters, network connections frequently span multiple layers of switches. We strategically arrange the ranks for distributed training to minimize frequent access across different switches, which reduces latency and thereby enhances overall training efficiency.
>>Hybrid and hierarchical partition for ZeRO. By partitioning parameters across GPUs, ZeRO3 reduces memory consumption at the expense of additional all-gather communications. This approach would lead to a significant communication bottleneck when scaling to thousands of GPUs (Jiang et al., 2023a). To address this issue, we propose a hybrid and hierarchical partitioning scheme. Specifically, our framework first partitions the optimizer states across all GPUs, and then adaptively decides which layers need to activate ZeRO3 and whether to partition parameters hierarchically.
Combining these strategies enables efficient training of Baichuan 2 on 1,024 A800 GPUs
By integrating these strategies, our system is capable of training Baichuan 2-7B and Baichuan 2-13B models efficiently on 1,024 NVIDIA A800 GPUs, achieving a computational efficiency that exceeds 180 TFLOPS.
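A hedged back-of-envelope check of what these numbers imply, using the common 6·N·D approximation for training FLOPs and assuming the reported ">180 TFLOPS" figure is per GPU:

```python
# Back-of-envelope training-time estimate with the 6*N*D approximation
# (N = parameters, D = tokens). Assumption: 180 TFLOPS is per-GPU throughput.
n_params = 13e9          # Baichuan 2-13B
n_tokens = 2.6e12        # 2.6 trillion tokens
n_gpus = 1024
flops_per_gpu = 180e12   # achieved, per second

total_flops = 6 * n_params * n_tokens          # ~2.0e23 FLOPs
cluster_rate = n_gpus * flops_per_gpu          # ~1.8e17 FLOP/s
days = total_flops / cluster_rate / 86400
print(f"estimated wall-clock: ~{days:.0f} days")  # on the order of two weeks
```

This omits pipeline bubbles, checkpointing, and restarts, so the real wall-clock time would be somewhat longer; still, it shows the 13B run is a weeks-scale job at this throughput.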
3 Alignment = SFT + RLHF (RM + RL)
Baichuan 2 also introduces an alignment procedure, resulting in two chat models: Baichuan 2-7B-Chat and Baichuan 2-13B-Chat. The alignment process of Baichuan 2 encompasses two main components: Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF).
3.1 Supervised Fine-Tuning: over 100k SFT samples annotated by human labelers, with cross-validation (authoritative annotators reviewing batches)
During the supervised fine-tuning phase, we use human labelers to annotate prompts gathered from various data sources. Each prompt is labeled as being helpful or harmless based on key principles similar to Claude (2023). To validate data quality, we use cross-validation—an authoritative annotator checks the quality of a sample batch annotated by a specific crowd worker group, rejecting any batches that do not meet our quality standards.
We collected over 100k supervised fine-tuning samples and trained our base model on them. Next, we delineated the reinforcement learning process via the RLHF method to further improve results. The whole process of RLHF, including RM and RL training, is shown in Figure 5.
Figure 5: An illustration of Baichuan 2’s RLHF process.
3.2 Reward Model: a three-tiered prompt classification system (6 / 30 / 200+ categories)
We devised a three-tiered classification system for all prompts, consisting of 6 primary categories, 30 secondary categories, and over 200 tertiary categories. From the user’s perspective, we aim for the classification system to comprehensively cover all types of user needs. From the standpoint of reward model training, prompts within each category should have sufficient diversity to ensure the reward model can generalize well.
Given a prompt, responses are generated by Baichuan 2 models of different sizes and stages (SFT, PPO) to enhance response diversity. Only responses generated by the Baichuan 2 model family are used in the RM training. Responses from other open-source datasets and proprietary models do not improve the reward model’s accuracy. This also underscores the intrinsic consistency of the Baichuan model series from another perspective.
The loss function used for training the reward model is consistent with that in InstructGPT (Ouyang et al., 2022). The reward model derived from training exhibits a performance consistent with that of LLaMA 2 (Touvron et al., 2023b), indicating that the greater the score difference between two responses, the higher the discriminative accuracy of the reward model, as shown in Table 4.
我們為所有提示設計了一個三級分類系統,包括6個一級類別,30個二級類別和200多個三級類別。從用戶的角度來看,我們的分類系統旨在全面涵蓋所有類型的用戶需求。從獎勵模型訓練的角度來看,每個類別內的提示應具有足夠的多樣性,以確保獎勵模型能夠很好地泛化。
給定一個提示,由不同規模、不同階段(SFT、PPO)的Baichuan 2模型生成響應,以增強響應的多樣性。在RM訓練中只使用由Baichuan 2模型家族生成的響應。來自其他開源數據集和專有模型的響應并不能提高獎勵模型的準確性。這也從另一個角度印證了Baichuan 2模型系列的內在一致性。
用于訓練獎勵模型的損失函數與InstructGPT(Ouyang等人,2022年)中的損失函數一致。從訓練中得到的獎勵模型表現出與LLaMA 2(Touvron等人,2023b年)一致的性能,表明兩個響應之間的得分差異越大,獎勵模型的判別準確性越高,如表4所示。
Table 4: Reward model test accuracy on different score gaps between two responses. The larger the response gap, the better the RM accuracy. Gaps 1, 2, 3, 4, and 5 correspond to unsure, negligibly better, slightly better, better, and significantly better, respectively.
表4:不同響應得分差距下的獎勵模型測試準確性。響應差距越大,RM準確性越高。差距1、2、3、4、5分別對應于unsure、negligibly better、slightly better、better和significantly better。
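The InstructGPT-style pairwise loss referenced above can be written out as a scalar sketch, −log σ(r(x, y_w) − r(x, y_l)). The score values below are made up for illustration; a real reward model computes r(x, y) with a neural network:

```python
import math

def rm_pairwise_loss(score_chosen, score_rejected):
    """Pairwise ranking loss from InstructGPT: -log sigmoid(r_w - r_l).

    The loss shrinks as the reward model scores the preferred response
    higher than the rejected one by a wider margin.
    """
    margin = score_chosen - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# A wider score gap (cf. Table 4's gap buckets) gives a smaller loss,
# i.e. an easier discrimination for the reward model.
close = rm_pairwise_loss(1.2, 1.0)   # "negligibly better"
clear = rm_pairwise_loss(3.0, 1.0)   # "significantly better"
```

This mirrors Table 4's observation: the larger the score gap between two responses, the easier (more accurate) the discrimination.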
3.3 PPO(訓練LM):采用四模型(參與者+參考模型+獎勵模型+批評模型)
After obtaining the reward model, we employ the PPO (Schulman et al., 2017) algorithm to train our language model. We employ four models: the actor model (responsible for generating responses), the reference model (used to compute the KL penalty with fixed parameters), the reward model (providing an overarching reward for the entire response with fixed parameters), and the critic model (designed to learn per-token values).
在獲得獎勵模型后,我們采用PPO(Schulman等人,2017年)算法來訓練我們的語言模型。我們使用了四個模型:參與者模型(負責生成響應)、參考模型(用于計算具有固定參數的KL懲罰)、獎勵模型(為整個響應提供總體獎勵,具有固定參數)和批評模型(設計用于學習每個標記值)。
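The interplay of the four models can be sketched with the standard RLHF reward shaping: the reward model's single sequence-level score is combined with a per-token KL penalty against the frozen reference model, and the critic (not shown) fits per-token values to these shaped rewards. This is the common formulation, assumed rather than taken verbatim from the report:

```python
def shaped_rewards(actor_logps, ref_logps, reward_score, beta):
    """Combine the four-model setup into per-token rewards.

    actor_logps / ref_logps: per-token log-probs of the sampled response
    under the actor and the frozen reference model; reward_score is the
    reward model's scalar for the whole response.
    """
    rewards = []
    for a, r in zip(actor_logps, ref_logps):
        kl = a - r                  # per-token KL estimate
        rewards.append(-beta * kl)  # KL penalty at every token
    rewards[-1] += reward_score     # full-sequence reward at the end
    return rewards

# Toy example: a 3-token response, actor drifting slightly from the reference.
rs = shaped_rewards([-1.0, -0.9, -1.1], [-1.1, -1.0, -1.0],
                    reward_score=2.0, beta=0.2)
```

The KL term keeps the actor close to the reference policy so the reward model is not exploited out of distribution.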
3.4 Training Details訓練細節
During the RLHF training process, the critic model is warmed up with an initial 20 training steps ahead. Subsequently, both the critic and actor models are updated via the standard PPO algorithm. For all models, we use gradient clipping of 0.5, a constant learning rate of 5e-6, and a PPO clip threshold ε = 0.1. We set the KL penalty coefficient β = 0.2, decaying to 0.005 over steps. We train for 350 iterations for all our chat models, resulting in Baichuan 2-7B-Chat and Baichuan 2-13B-Chat.
在RLHF訓練過程中,批評家模型在前面的20個訓練步驟中進行了熱身。隨后,批評家模型和參與者模型都通過標準PPO算法進行更新。對于所有模型,我們使用了0.5的梯度剪裁、恒定的學習率5e-6和PPO剪裁閾值ε = 0.1。我們將KL懲罰系數β設置為0.2,并在步驟上逐漸減小到0.005。我們為所有的聊天模型訓練了350次迭代,得到了Baichuan 2-7B-Chat和Baichuan 2-13B-Chat。
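The hyperparameters above can be illustrated with two small helpers. The report does not state the decay shape for β, so an exponential interpolation from 0.2 to 0.005 over the 350 iterations is assumed here purely for illustration:

```python
def kl_coef(step, total_steps=350, beta_start=0.2, beta_end=0.005):
    """KL penalty coefficient decayed from 0.2 to 0.005 over training.

    The decay shape is not specified in the report; an exponential
    interpolation is an assumption.
    """
    frac = min(step / total_steps, 1.0)
    return beta_start * (beta_end / beta_start) ** frac

def ppo_clipped_ratio(ratio, eps=0.1):
    """PPO clip with ε = 0.1: keep the importance ratio in [1-ε, 1+ε]."""
    return max(1.0 - eps, min(1.0 + eps, ratio))
```

Gradient clipping (0.5) and the constant learning rate (5e-6) would be applied by the optimizer, outside this sketch.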
4 Safety安全性
We believe that model safety improvements stem not only from constraints during data cleansing or alignment stages but also from harnessing positive knowledge and identifying negative knowledge during all training stages. Guided by this concept, we have enhanced model safety throughout the Baichuan 2 training process.
我們認為,模型安全性的提高不僅來自于數據清洗或對齊階段的約束,還源于在所有訓練階段利用積極知識和識別負面知識。在這一理念的指導下,我們在整個Baichuan 2的訓練過程中提高了模型的安全性。
4.1 Pre-training Stage預訓練階段:設計了一套規則和模型過濾有害內容+策劃了一個中英雙語數據集
In the pre-training stage, we pay close attention to data safety. The entire pre-training dataset underwent a rigorous data filtering process aimed at enhancing safety. We devised a system of rules and models to eliminate harmful content such as violence, pornography, racial discrimination, hate speech, and more.
Furthermore, we curated a Chinese-English bilingual dataset comprising several million webpages from hundreds of reputable websites that represent various positive value domains, encompassing areas such as policy, law, vulnerable groups, general values, traditional virtues, and more. We also heightened the sampling probability for this dataset.
在預訓練階段,我們密切關注數據安全性。整個預訓練數據集經歷了嚴格的數據過濾過程,旨在提高安全性。我們設計了一套規則和模型,以消除暴力、色情、種族歧視、仇恨言論等有害內容。
此外,我們策劃了一個中英雙語數據集,包括數百個知名網站的數百萬個網頁,代表了各種積極價值領域,涵蓋了政策、法律、弱勢群體、一般價值觀、傳統美德等領域。我們還提高了該數據集的抽樣概率。
4.2 Alignment Stage對齊階段:紅隊流程(6種和100+粒度),專家標注團隊(20萬個攻擊提示)+多值監督采樣方法+DPO+采用有益和無害目標相結合的獎勵模型
We build a red-teaming procedure consisting of 6 types of attacks and 100+ granular safety value categories. An expert annotation team of 10 people with traditional internet security experience initialized safe alignment prompts. The relevant snippets from the pre-training dataset were retrieved to create responses, resulting in approximately 1K annotated data points for initialization.
>>The expert annotation team guided a 50-person outsourced annotation team through red-blue confrontation with the initialized alignment model, resulting in the generation of 200K attack prompts.
>>By employing a specialized multi-value supervised sampling method, we maximized the utilization of attack data to generate responses at varying safety levels.
During the RL optimization stage, we also take safety into account first:
>>At the onset of safety reinforcement, DPO (Rafailov et al., 2023) methods efficiently employed limited amounts of annotated data to enhance performance concerning specific vulnerability issues.
>>By employing a Reward Model that integrates Helpful and Harmless objectives, PPO safety reinforcement training was conducted.
我們構建了由6種攻擊類型和100+粒度安全價值類別組成的紅隊流程,由10人組成的具有傳統互聯網安全經驗的專家標注團隊初始化安全對齊提示。從預訓練數據集中檢索相關片段來創建響應,產生大約1K的帶注釋的數據用于初始化。
>>專家標注團隊通過與初始化的對齊模型進行紅藍對抗,引導了一個由50名外包注釋團隊組成的團隊,生成了20萬個攻擊提示。
>>通過采用專門的多值監督采樣方法,我們最大程度地利用攻擊數據來生成不同安全級別的響應。
在RL優化階段,我們還首先考慮了安全性:
>>在安全性增強的初期,DPO(Rafailov等人,2023年)方法高效地使用有限數量的注釋數據來增強特定的漏洞問題。
>>采用有益和無害目標相結合的獎勵模型,對PPO進行安全強化訓練。
5 Evaluations評估
兩大形式進行評估:自由形式的生成任務、多選任務
In this section, we report the zero-shot or few-shot results of the pre-trained base models on standard benchmarks. We evaluate Baichuan 2 on free-form generation tasks and multiple-choice tasks.
>>Free-form generation: Models are given some sample inputs (shots) and then generate continuations to obtain results, as in question answering, translation, and other tasks.
>>Multiple-choice: Models are given a question and multiple choices, and the task is to select the most appropriate candidates.
本節中,我們報告了預訓練基礎模型在標準基準上的zero-shot 或few-shot 結果。我們評估了Baichuan 2在自由形式生成任務和多選任務上的性能。
>>自由形式生成:模型給定一些示例輸入(shots),然后生成后續內容以獲得結果,例如問答、翻譯和其他任務。
>>多選:模型提供一個問題和多個選項,任務是選擇最合適的候選項。
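A minimal sketch of how shots could be assembled into a prompt for the two evaluation formats. The template wording is an assumption, not the paper's exact prompt; with `choices` present it builds a multiple-choice prompt, otherwise a free-form one:

```python
def build_fewshot_prompt(shots, question, choices=None):
    """Assemble a few-shot prompt for either evaluation format.

    With `choices`, the model is scored on picking a lettered option
    (multiple-choice); without, it free-form continues after "Answer:".
    """
    parts = []
    for q, a in shots:
        parts.append(f"Question: {q}\nAnswer: {a}\n")
    if choices is not None:
        lettered = "\n".join(f"{chr(65 + i)}. {c}" for i, c in enumerate(choices))
        parts.append(f"Question: {question}\n{lettered}\nAnswer:")
    else:
        parts.append(f"Question: {question}\nAnswer:")
    return "\n".join(parts)

p = build_fewshot_prompt([("1+1=?", "2")], "2+2=?", choices=["3", "4", "5"])
```

Frameworks like lm-evaluation-harness and OpenCompass handle this templating (plus answer extraction and scoring) per benchmark.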
公平的基準比較:引入開源評估框架(如lm-evaluation-harness/OpenCompass)
Given the variety of tasks and examples, we incorporated open-source evaluation frameworks like lm-evaluation-harness (Gao et al., 2021) and OpenCompass (OpenCompass, 2023) into our in-house implementations for fair benchmarking against other models.
The models we choose to compare have similar sizes to Baichuan 2 and are open-sourced so that the results can be reproduced:
>>LLaMA (Touvron et al., 2023b): The language models trained by Meta on 1 trillion tokens. The context length is 2,048 and we evaluate both LLaMA 7B and LLaMA 13B.
>>LLaMA 2 (Touvron et al., 2023c): A successor model to LLaMA 1 trained on 2 trillion tokens and better data mixture.
>>Baichuan 1 (Baichuan, 2023b): The Baichuan 7B is trained on 1.2 trillion tokens and Baichuan 13B is trained on 1.4 trillion tokens. Both of them focus on English and Chinese.
>>ChatGLM 2-6B (Zeng et al., 2022): A chat language model that has strong performance on several benchmarks.
>>MPT-7B (MosaicML, 2023): An open-source LLM trained on 1 trillion tokens of English text and code.
>>Falcon-7B (Penedo et al., 2023): A series of LLMs trained on 1 trillion tokens enhanced with curated corpora. It is made available under the Apache 2.0 license.
>>Vicuna-13B (Chiang et al., 2023): A language model trained by fine-tuning LLaMA-13B on the conversational dataset generated by ChatGPT.
>>Chinese-Alpaca-Plus-13B (Cui et al., 2023): A language model trained by fine-tuning LLaMA- 13B on the conversational dataset generated by ChatGPT.
>>XVERSE-13B: A 13B multilingual large language model trained on more than 1.4 trillion tokens.
考慮到任務和示例的多樣性,我們在內部實施中引入了開源評估框架,如lm-evaluation-harness(Gao等人,2021年)和OpenCompass(OpenCompass,2023年),以便與其他模型進行公平的基準比較。
我們選擇了與Baichuan 2大小相似且開源的模型進行比較,其結果可以被復制:
>>LLaMA(Touvron等人,2023b):由Meta在1萬億標記上訓練的語言模型。上下文長度為2,048,我們評估LLaMA 7B和LLaMA 13B。
>>LLaMA 2(Touvron等人,2023c):LLaMA 1的后續模型,訓練在2萬億標記上,數據混合更好。
>>Baichuan 1(Baichuan,2023b):Baichuan 7B訓練在1.2萬億標記上,Baichuan 13B訓練在1.4萬億標記上。它們都側重于英語和中文。
>>ChatGLM 2-6B(Zeng等人,2022年):在幾個基準上表現出色的聊天語言模型。
>>MPT-7B(MosaicML,2023):一個開源的LLMs,訓練了1萬億標記的英文文本和代碼。
>>Falcon-7B(Penedo等人,2023):一系列在1萬億標記上訓練、并以精心策劃的語料庫增強的LLMs。它以Apache 2.0許可證發布。
>>Vicuna-13B(Chiang等人,2023):通過對LLaMA-13B進行微調而訓練的語言模型,該模型使用ChatGPT生成的對話數據集。
>>Chinese-Alpaca-Plus-13B(Cui等人,2023):通過對LLaMA-13B進行微調而訓練的語言模型,該模型使用ChatGPT生成的對話數據集。
>>XVERSE-13B:一個13B多語言大型語言模型,訓練了超過1.4萬億標記。
5.1 Overall Performance總體性能
八個基準簡介:MMLU(學術科目的多項選擇題)、AGIEval(以人為中心的認知和問題解決的一般能力)、BBH(有挑戰性的BIG-Bench任務),C-Eval(基于中文的1W個多項選擇題)、CMMLU(中文語言和文化背景下的知識和推理能力)、Gaokao(中國高考),GSM8K(評估數學)、HumanEval(164個編程問題)
This section introduces the overall performance of Baichuan 2 base models compared with other similar-sized models. We choose 8 benchmarks for comparison: MMLU (Hendrycks et al., 2021a), the Massive Multitask Language Understanding benchmark, consists of a range of multiple-choice questions on academic subjects. C-Eval (Huang et al., 2023) is a comprehensive Chinese evaluation benchmark consisting of more than 10k multiple-choice questions. CMMLU (Li et al., 2023) is also a general evaluation benchmark specifically designed to evaluate the knowledge and reasoning abilities of LLMs within the context of the Chinese language and culture. AGIEval (Zhong et al., 2023) is a human-centric benchmark specifically designed to evaluate general abilities like human cognition and problem-solving. Gaokao (Zhang et al., 2023) is an evaluation framework that utilizes Chinese college entrance examination (Gaokao) questions. BBH (Suzgun et al., 2022) is a suite of challenging BIG-Bench (Srivastava et al., 2022) tasks on which language model evaluations did not outperform the average human rater. GSM8K (Cobbe et al., 2021) is an evaluation benchmark focused on math. HumanEval (Chen et al., 2021) is a docstring-to-code dataset consisting of 164 coding problems that test various aspects of programming logic.
本節介紹了Baichuan 2基礎模型與其他類似大小模型相比的總體性能。我們選擇了8個基準進行比較:MMLU(Hendrycks等人,2021a),即大規模多任務語言理解基準,包括一系列關于學術科目的多項選擇題。C-Eval(Huang等人,2023)是一個由1萬多個多項選擇題組成的綜合性中文評估基準。CMMLU(Li等人,2023)也是一個通用評估基準,專門用于評估LLMs在中國語言和文化背景下的知識和推理能力。AGIEval(Zhong等人,2023)是一個以人為中心的基準,專門設計用于評估人類認知和問題解決等一般能力。Gaokao(Zhang等人,2023)是一個利用中國高考(大學入學考試)試題的評估框架。BBH(Suzgun等人,2022)是一套具有挑戰性的BIG-Bench(Srivastava等人,2022)任務,在這些任務上語言模型的評估未能超過人類評分者的平均水平。GSM8K(Cobbe等人,2021)是一個關注數學的評估基準。HumanEval(Chen等人,2021)是一個由164個編程問題組成的docstring-to-code數據集,測試編程邏輯的各個方面。
For CMMLU and MMLU, we adopt the official implementations and adopt 5-shot for evaluation. For BBH we adopt 3-shot evaluations. For C-Eval, Gaokao, and AGIEval we only select the multiple-choice questions with four candidates for better evaluations. For GSM8K, we adopt 4-shot testing derived from OpenCompass (OpenCompass, 2023). We also incorporate the results of GPT-4 and GPT-3.5-Turbo. Unless stated otherwise, the results in this paper were obtained using our internal evaluation tools.
The overall result is shown in Table 1. Compared with other similar-sized open-sourced models, our model has a clear performance advantage. Especially in math and code problems, our model achieves significant improvement over Baichuan 1.
對于CMMLU和MMLU,我們采用了官方實現,并采用了?5-shot?進行評估。對于BBH,我們采用了3-shot評估。對于C-Eval、Gaokao和AGIEval,我們僅選擇了具有四個候選項的多選題進行更好的評估。對于GSM8K,我們采用了從OpenCompass(OpenCompass,2023)派生的4-shot測試。我們還包括了GPT-4和GPT-3.5-Turbo的結果。除非另有說明,本文中的結果是使用我們的內部評估工具獲得的。
總體結果如表1所示。與其他類似大小的開源模型相比,我們的模型具有明顯的性能優勢。特別是在數學和代碼問題上,我們的模型相對于Baichuan 1取得了顯著的改進。
Table 1: Overall results of Baichuan 2 compared with other similarly sized LLMs on general benchmarks. * denotes results derived from official websites.
表1:Baichuan 2與其他類似規模的LLMs在通用基準測試上的整體結果。*表示來自官方網站的結果。
5.2 Vertical Domain Evaluations垂直領域評估:法律領域(JEC-QA,僅次于GPT-4)、醫學領域(MedQA+MedMCQA等,超越了ChatGLM 2-6B和LLaMA 2-7B)
We also evaluate Baichuan 2 in vertical domains, where we choose the law and medical fields as they have been widely studied in recent years.
In the law field, we report scores of JEC-QA (Zhong et al., 2020), which is collected from the National Judicial Examination of China. It contains multiple-choice and multiple-answer questions. For compatibility with our evaluation suite, we only test the multiple-choice questions.
In the medical field, we report scores from two medical benchmarks, MedQA (Jin et al., 2021) and MedMCQA (Pal et al., 2022), as well as average scores from medical-related disciplines in C-Eval (val), MMLU, and CMMLU (abbreviated as CMC). Specifically, MedQA is collected from the professional medical board exams of the USA and China, including three subsets, i.e., USMLE, MCMLE, and TWMLE, and we report the results of USMLE and MCMLE with five candidates; MedMCQA is collected from Indian medical entrance exams, and we evaluate the multiple-choice questions and report the scores on the dev set. The disciplines of CMC include (1) clinical medicine and basic medicine of C-Eval (val); (2) clinical knowledge, anatomy, college medicine, college biology, nutrition, virology, medical genetics, and professional medicine of MMLU; (3) anatomy, clinical knowledge, college medicine, genetics, nutrition, traditional Chinese medicine, and virology of CMMLU. Moreover, all these datasets are evaluated in 5-shot.
我們還評估了Baichuan 2在垂直領域中的表現,選擇了法律和醫學領域,因為它們近年來得到了廣泛研究。
在法律領域,我們報告了來自中國國家司法考試的JEC-QA(Zhong等人,2020)的分數,該數據集包含多項選擇和多答案問題。出于與我們評估套件的兼容性考慮,我們只測試多項選擇問題。
在醫學領域,我們報告了兩個醫學基準MedQA(Jin等人,2021)和MedMCQA(Pal等人,2022)的分數,以及C-Eval(val)、MMLU和CMMLU中醫學相關學科的平均分數(簡稱CMC)。具體來說,MedQA收集自美國和中國的專業醫學委員會考試,包括三個子集,即USMLE、MCMLE和TWMLE,我們報告了五個候選項的USMLE和MCMLE結果;MedMCQA收集自印度醫學入學考試,我們評估其中的多項選擇題,并報告dev集上的分數。CMC包含的學科為:
(1)C-Eval(val)的臨床醫學、基礎醫學;
(2)MMLU的臨床知識、解剖學、大學醫學、大學生物學、營養學、病毒學、醫學遺傳學、專業醫學;
(3)CMMLU的解剖學、臨床知識、大學醫學、遺傳學、營養學、中醫學、病毒學。
此外,所有這些數據集都在5-shot下評估。
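The CMC aggregate described above is simply an average over the listed medical-related subject scores. A trivial sketch with made-up scores (not results from the paper):

```python
def cmc_score(subject_scores):
    """Average accuracy over medical-related subjects, as in the CMC
    aggregate (C-Eval/MMLU/CMMLU medical disciplines)."""
    return sum(subject_scores.values()) / len(subject_scores)

# Illustrative numbers only.
mmlu_medical = {"clinical_knowledge": 0.62, "anatomy": 0.55, "virology": 0.48}
avg = cmc_score(mmlu_medical)
```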
As shown in Table 5 Baichuan 2-7B-Base surpasses models such as GPT-3.5 Turbo, ChatGLM 2-6B, and LLaMA 2-7B in the field of Chinese law, second only to GPT-4. Compared to Baichuan 1-7B, Baichuan 2-7B-Base shows an improvement of nearly 10 points. In the medical field, Baichuan 2-7B-Base outperforms models like ChatGLM 2-6B and LLaMA 2-7B, showing significant improvement over Baichuan 1-7B as well.
Similarly, Baichuan 2-13B-Base surpasses models other than GPT-4 in the field of Chinese law. In the medical domain, Baichuan 2-13B-Base outperforms models such as XVERSE-13B and LLaMA 2-13B. Compared to Baichuan 1-13B-Base, Baichuan 2-13B-Base also exhibits remarkable improvement.
如表5所示,Baichuan 2-7B-Base在中國法律領域超越了GPT-3.5 Turbo、ChatGLM 2-6B和LLaMA 2-7B等模型,僅次于GPT-4。與Baichuan 1-7B相比,Baichuan 2-7B-Base提高了近10分。在醫學領域,Baichuan 2-7B-Base超越了ChatGLM 2-6B和LLaMA 2-7B等模型,相對于Baichuan 1-7B也取得了顯著的改進。
同樣,Baichuan 2-13B-Base在中國法律領域超越了除GPT-4以外的其他模型。在醫學領域,Baichuan 2-13B-Base超越了XVERSE-13B和LLaMA 2-13B等模型,相對于Baichuan 1-13B-Base,Baichuan 2-13B-Base也取得了顯著的改進。
Table 5: The result of Baichuan 2 compared with other models on the law and medical fields.
表5:Baichuan 2在法律和醫學領域與其他模型的結果比較。
5.3 Math and Code數學和代碼
MATH(包含1.25W個困難問題)、GSM8K,HumanEval(包含語言理解、推理、算法和簡單數學的一系列編程任務)、MBPP(974個Python短函數和程序文本描述)
This section introduces the performance in mathematics and coding.
本節介紹了數學和編程的性能。
We use GSM8K (Cobbe et al., 2021) (4-shot) and MATH (Hendrycks et al., 2021b) (4-shot) to evaluate mathematical ability. MATH contains 12,500 mathematical questions that are harder to solve. To evaluate the model's code ability, we report the scores on HumanEval (Chen et al., 2021) (0-shot) and MBPP (Austin et al., 2021) (3-shot).
>>HumanEval is a series of programming tasks including model language comprehension, reasoning, algorithms, and simple mathematics to evaluate the correctness of the model and measure the model’s problem-solving ability.
>>MBPP. It consists of a dataset of 974 Python short functions and program textual descriptions, along with test cases used to verify the correctness of their functionality.
我們使用GSM8K(Cobbe等人,2021)(4-shot)和MATH(Hendrycks等人,2021b)(4-shot)來評估數學能力。MATH包含12,500個更難解決的數學問題。為了評估模型的代碼能力,我們報告了HumanEval(Chen等人,2021)(0-shot)和MBPP(Austin等人,2021)(3-shot)的分數。
>>HumanEval是一系列編程任務,包括模型語言理解、推理、算法和簡單數學,旨在評估模型的正確性和問題解決能力。
>>MBPP。它包含974個Python短函數和程序文本描述的數據集,以及用于驗證其功能正確性的測試用例。
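HumanEval/MBPP-style scoring runs each generated program against its test cases and counts a sample as correct only if every assertion passes. A minimal, unsandboxed sketch of that check (real harnesses add process isolation and timeouts, which this illustration omits):

```python
def passes_tests(candidate_src, test_asserts):
    """Run a generated function against its test cases (MBPP-style).

    Executes the candidate in a fresh namespace and returns True only
    if every assertion holds.
    """
    ns = {}
    try:
        exec(candidate_src, ns)
        for t in test_asserts:
            exec(t, ns)
        return True
    except Exception:
        return False

# Toy candidates: one correct, one buggy.
good = "def add(a, b):\n    return a + b"
bad = "def add(a, b):\n    return a - b"
tests = ["assert add(2, 3) == 5", "assert add(-1, 1) == 0"]
```

Pass@1 is then just the fraction of problems whose single sampled completion passes all of its tests.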
OpenCompass評估:數學領域(接近GPT-3.5 Turbo的水平)、代碼領域(超越了LLaMA 2-13B)
We use OpenCompass to evaluate the ability of models in math and code. As shown in Table 6, in the field of mathematics, Baichuan 2-7B-Base surpasses models like LLaMA 2-7B. In the code domain, it outperforms models of the same size such as ChatGLM 2-6B. Baichuan 2-7B-Base exhibits significant improvement compared to the Baichuan 1-7B model.
In mathematics, Baichuan 2-13B-Base surpasses all models of the same size, approaching the level of GPT-3.5 Turbo. In the code domain, Baichuan 2-13B-Base outperforms models like LLaMA 2- 13B and XVERSE-13B. Baichuan 2-13B-Base demonstrates significant improvement compared to Baichuan 1-13B-Base.
我們使用OpenCompass來評估模型在數學和代碼方面的能力。如表6所示,在數學領域,Baichuan 2-7B-Base超越了LLaMA 2-7B等大小相似的模型。在代碼領域,它超越了ChatGLM 2-6B等大小相似的模型。Baichuan 2-7B-Base相對于Baichuan 1-7B模型也取得了顯著的改進。
在數學領域,Baichuan 2-13B-Base超越了所有相同大小的模型,接近了GPT-3.5 Turbo的水平。在代碼領域,Baichuan 2-13B-Base超越了LLaMA 2-13B和XVERSE-13B等模型。Baichuan 2-13B-Base相對于Baichuan 1-13B-Base模型也取得了顯著的改進。
Table 6: The result of Baichuan 2 compared with other models on mathematics and coding.
表6:Baichuan 2在數學和編程方面與其他模型的結果比較。
5.4 Multilingual多語言:Flores-101評估(涵蓋全球101種語言)
We use Flores-101 (NLLB Team, 2022; Goyal et al., 2021; Guzmán et al., 2019) to evaluate multilingual ability. Flores-101 covers 101 languages from around the world. Its data is sourced from various domains such as news, travel guides, and books. We selected the official languages of the United Nations (Arabic (ar), Chinese (zh), English (en), French (fr), Russian (ru), and Spanish (es)), as well as German (de) and Japanese (ja), as the test languages. We conducted 8-shot tests on seven subtasks in Flores-101, including zh-en, zh-fr, zh-es, zh-ar, zh-ru, zh-ja and zh-de. The evaluation is conducted with OpenCompass.
我們使用Flores-101(NLLB團隊,2022年;Goyal等人,2021年;Guzmán等人,2019年)來評估多語言能力。Flores-101涵蓋了來自世界各地的101種語言。其數據來自各個領域,如新聞、旅游指南和書籍。我們選擇了聯合國的官方語言(阿拉伯語(ar)、中文(zh)、英語(en)、法語(fr)、俄語(ru)和西班牙語(es)),以及德語(de)和日語(ja)作為測試語言。我們在Flores-101的七個子任務中進行了8-shot測試,包括zh-en、zh-fr、zh-es、zh-ar、zh-ru、zh-ja和zh-de。評估是通過OpenCompass進行的。
In the multilingual domain, as shown in Table 7, Baichuan 2-7B-Base surpasses all models of the same size in all seven tasks and shows significant improvement compared to Baichuan 1-7B.
Baichuan 2-13B-Base outperforms models of the same size in four out of the seven tasks. In the zh-en and zh-ja tasks, it surpasses GPT3.5 Turbo and reaches the level of GPT-4. Compared to Baichuan 1-13B-Base, Baichuan 2-13B-Base exhibits significant improvement in the zh-ar, zh-ru, and zh-ja tasks.
Although GPT-4 still dominates in the field of multilingualism, open-source models are catching up closely. In zh-en tasks, Baichuan 2-13B-Base has slightly surpassed GPT-4.
在多語言領域,如表7所示,Baichuan 2-7B-Base在所有七個任務中都超越了所有相同大小的模型,并相對于Baichuan 1-7B取得了顯著的改進。
Baichuan 2-13B-Base在七個任務中的四個任務中超越了相同大小的模型。在zh-en和zh-ja任務中,它超越了GPT-3.5 Turbo,達到了GPT-4的水平。相對于Baichuan 1-13B-Base,Baichuan 2-13B-Base在zh-ar、zh-ru和zh-ja任務中表現出了顯著的改進。
雖然GPT-4仍然在多語言領域占據主導地位,但開源模型正逐漸迎頭趕上。在zh-en任務中,Baichuan 2-13B-Base稍微超越了GPT-4。
Table 7: The result of Baichuan 2 compared with other models on multilingual field.
表7:Baichuan 2在多語言領域與其他模型的結果比較。
5.5 Safety Evaluations安全性評估:Toxigen數據集、構建了BHED【白川無害評估數據集+七個類別+7*1W的樣本】
In Sec. 4, we describe the efforts made to improve the safety of Baichuan 2. However, some prior work indicates that helpfulness and harmlessness are two sides of a seesaw: when harmlessness increases, helpfulness could decrease a bit (Bai et al., 2022a). So we evaluate these two factors before and after safety alignment.
Figure 6 shows the helpfulness and harmlessness before and after the safety alignment of Baichuan 2. We can see that our safety alignment process did not hurt the helpfulness while significantly improving the harmlessness.
Then we evaluate the safety of our pre-trained models using the Toxigen (Hartvigsen et al., 2022) dataset. Same as LLaMA 2, we use the cleaned version from the SafeNLP project, distinguishing neutral and hate types for the 13 minority groups, forming a 6-shot dataset consistent with the original Toxigen prompt format. Our decoding parameters use temperature 0.1 and top-p 0.9 nucleus sampling.
在第4節中,我們描述了為改善Baichuan 2安全性所做的努力。然而,一些先前的工作指出,有益性和無害性是蹺蹺板的兩端——當無害性增加時,有益性可能會略有下降(Bai等人,2022a)。因此,我們在安全對齊之前和之后評估了這兩個因素。
圖6顯示了Baichuan 2安全對齊之前和之后的幫助和無害性。我們可以看到,我們的安全對齊過程并沒有損害幫助性,而在很大程度上提高了無害性。
然后,我們使用Toxigen(Hartvigsen等人,2022)數據集來評估我們的預訓練模型的安全性。與LLaMA 2一樣,我們使用SafeNLP項目的清理版本,區分了13個少數民族群體中的中性和仇恨類型,形成了一個6-shot的數據集,與原始的Toxigen提示格式一致。我們的解碼參數使用溫度0.1和top-p 0.9核心抽樣。
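The decoding setup above (temperature 0.1, top-p 0.9) corresponds to temperature-scaled nucleus sampling. A from-scratch sketch over a toy logit vector, purely illustrative; production decoders work on full vocabularies with tensor ops:

```python
import math
import random

def nucleus_sample(logits, top_p=0.9, temperature=0.1, rng=None):
    """Temperature + top-p (nucleus) sampling over token logits.

    Keeps the smallest set of tokens whose cumulative probability
    reaches top_p, then samples within that set.
    """
    rng = rng or random.Random(0)  # deterministic default for the demo
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]  # stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    r = rng.random() * mass
    for i in kept:
        r -= probs[i]
        if r <= 0:
            return i
    return kept[-1]

tok = nucleus_sample([2.0, 1.0, 0.1])
```

At temperature 0.1 the distribution is very peaked, so the nucleus usually contains only the top token and decoding is close to greedy.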
We use the fine-tuned HateBert version optimized in Toxigen (Hartvigsen et al., 2022) for model evaluation. Table 8 shows that compared to LLaMA 2, the Baichuan 2-7B and Baichuan 2-13B models have some safety advantages.
Inspired by BeaverTails (Ji et al., 2023), we constructed the Baichuan Harmless Evaluation Dataset (BHED), covering 7 major safety categories of bias/discrimination, insults/profanity, illegal/unethical content, physical health, mental health, financial privacy, and sensitive topics to evaluate the safety of our chat models.
To ensure comprehensive coverage within each category, we asked human annotators to generate 1,400 data samples. This was further expanded through self-instruction and cleaned by humans for fluency, resulting in 70,000 total samples with 10,000 per category. Examples of those safety prompts and principles are shown in Appendix E.
We use those samples to evaluate different models and the result is shown in Table 9. We can see that Baichuan 2 is on par or outperforms other chat models in our safety evaluations.
我們使用在Toxigen(Hartvigsen等人,2022)中優化的HateBert版本來評估模型。表8顯示,與LLaMA 2相比,Baichuan 2-7B和Baichuan 2-13B模型在某些安全方面具有一定的優勢。
受到BeaverTails Ji等人(2023)的啟發,我們構建了Baichuan Harmless Evaluation Dataset(BHED【白川無害評估數據集】),涵蓋了7個主要的安全類別,包括偏見/歧視、侮辱/褻瀆、非法/不道德內容、身體健康、心理健康、金融隱私和敏感話題,以評估我們聊天模型的安全性。
為了確保每個類別內部具有全面的覆蓋范圍,我們請人工標注員生成了1,400個數據樣本。這些樣本通過self-instruction進一步擴展,并由人工清理以保證流暢性,最終共有70,000個樣本,每個類別10,000個。這些安全提示和原則的示例見附錄E。
我們使用這些樣本來評估不同的模型,結果如表9所示。我們可以看到,在我們的安全評估中,Baichuan 2與其他聊天模型不相上下,甚至表現更好。
Figure 6: Helpfulness and harmlessness before and after safety alignment of Baichuan 2. The x-axis shows the metric before safety alignment and the y-axis shows the result after. We see that helpfulness remains largely unchanged after this procedure, while harmlessness improved substantially (more mass in upper triangle) with safety efforts.
圖6:Baichuan 2安全對齊前后的有益性和無害性。x軸表示安全對齊前的度量,y軸表示安全對齊后的結果。我們看到,在這個過程后,有益性基本保持不變,而無害性在安全工作下得到了實質性的改善(上三角區域的點更多)。
Table 8: Toxigen results of Baichuan 2 foundation models compared with LLaMA 2.
表8:Baichuan 2基礎模型與LLaMA 2的Toxigen結果比較。
Table 9: The result of different chat models on our safety evaluation benchmarks.
表9:不同聊天模型在我們的安全評估基準上的結果。
5.6 Intermediate Checkpoints中間檢查點
We will also release the intermediate checkpoints of 7B models, from 220 billion tokens checkpoint to 2,640 billion tokens checkpoint, which is the final output of Baichuan 2-7B-Base. We examine their performance on several benchmarks and the result is shown in Figure 7.
As shown in the figure, Baichuan 2 demonstrates consistent improvement as training proceeds. Even after 2.6 trillion tokens, there appears to be ample room for further gains. This aligns with previous work on scaling LLMs indicating that data size is a critical factor (Hoffmann et al., 2022). In the Appendix D, we provide more detailed training dynamics for both the 7B and 13B models.
我們還將發布7B模型的中間檢查點(checkpoints),從2200億標記的檢查點到2.64萬億標記的檢查點,后者是Baichuan 2-7B-Base的最終輸出。我們檢查了它們在幾個基準上的性能,結果如圖7所示。
正如圖中所示,Baichuan 2在訓練過程中表現出了一致的改進。即使在2.6萬億標記之后,似乎仍然有足夠的提升空間。這與以前關于擴展LLMs的工作表明數據大小是一個關鍵因素的研究一致(Hoffmann等人,2022)。在附錄D中,我們提供了7B和13B模型的更詳細的訓練動態。
Figure 7: The results of intermediary checkpoints of Baichuan 2-7B which will be released to the public.
圖7:Baichuan 2-7B的中間檢查點結果,將向公眾發布。
6 Related Work相關工作
LM復興源自深度神經網絡和Transformer的發展→大模型KM縮放定律(NDC三公式冪律關系+偏模型更大預算)讓各大AI組織(OpenAI/Google/Meta/Anthropic)卷入計算競賽→Chinchilla縮放定律(偏數據更大預算)
The field of language models has undergone a renaissance in recent years, sparked largely by the development of deep neural networks and Transformers (Vaswani et al., 2017). Kaplan et al. (2020) proposed the scaling laws for large model pre-training. By systematically analyzing model performance as parameters and data size increased, they provided a blueprint for the current era of massive models with billions or even hundreds of billions of parameters.
Seizing upon these scaling laws, organizations like OpenAI, Google, Meta, and Anthropic have engaged in a computing arms race to create ever-larger LLMs, spurred by OpenAI's 175-billion-parameter proprietary language model GPT-3 (Brown et al., 2020). The few-shot or even zero-shot ability of LLMs has revolutionized most natural language understanding tasks, from code generation to math problem-solving and even open-world scenarios. Specialized scientific LLMs like Galactica (Taylor et al., 2022) have also emerged to showcase the potential for large models to assimilate technical knowledge. However, raw parameter count alone does not determine model capability; Chinchilla (Hoffmann et al., 2022) demonstrated that scaling model capacity according to the number of tokens, rather than just parameters, can yield better sample efficiency.
近年來,語言模型領域經歷了一次復興,主要是由于深度神經網絡和Transformer(Vaswani等人,2017)的發展所引發的。Kaplan等人(2020)提出了用于大型模型預訓練的縮放定律。通過系統分析模型性能隨參數和數據大小增加的情況,他們為當前具有數百億甚至數千億參數的大型模型時代提供了藍圖。
利用這些縮放定律,OpenAI、Google、Meta和Anthropic等組織卷入了一場計算競賽,以創建規模更大的LLMs,其中以OpenAI的1750億參數專有語言模型GPT-3(Brown等人,2020)為代表。LLMs的few-shot甚至zero-shot能力已經徹底改變了從代碼生成到數學解題乃至開放世界場景的大多數自然語言理解任務。專門的科學LLMs,如Galactica(Taylor等人,2022),也已經出現,展示了大型模型吸收技術知識的潛力。然而,僅憑原始參數數量無法確定模型的能力——Chinchilla(Hoffmann等人,2022)表明,根據標記數量而不僅僅是參數來擴展模型容量,可以獲得更好的樣本效率。
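Chinchilla's point can be seen from its fitted parametric loss L(N, D) = E + A/N^α + B/D^β: for fixed parameters N, loss keeps falling as the token count D grows. The constants below are the fits reported by Hoffmann et al. (2022); treat the concrete numbers as illustrative rather than a prediction for Baichuan 2:

```python
def chinchilla_loss(n_params, n_tokens,
                    E=1.69, A=406.4, B=410.7, alpha=0.34, beta=0.28):
    """Chinchilla parametric loss L(N, D) = E + A/N^alpha + B/D^beta.

    Constants are the published fits from Hoffmann et al. (2022),
    used here only to illustrate the trend.
    """
    return E + A / n_params**alpha + B / n_tokens**beta

# A 7B model keeps improving as data grows past the ~20 tokens/param
# compute-optimal point, consistent with training on 2.6T tokens.
l_140b = chinchilla_loss(7e9, 140e9)   # ~20 tokens per parameter
l_2_6t = chinchilla_loss(7e9, 2.6e12)  # Baichuan 2's data scale
```

This is also consistent with Section 5.6's observation that the 7B model still improves after 2.6 trillion tokens.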
開源的基礎模型:千億token(Bloom/OPT/Pythia)→萬億token(LLaMA脫穎而出)
Concurrent with the development of private LLMs, academic and non-profit efforts have worked to develop open-source alternatives like Bloom (Scao et al., 2022), OPT (Zhang et al., 2022) and Pythia (Biderman et al., 2023b). Although some open-source large language models contain up to 175 billion parameters, most are trained on only 500 billion tokens or less. This is relatively small considering that 7 billion parameter models can still significantly improve after being trained on trillions of tokens. Among those open-sourced models, LLaMA (Touvron et al., 2023b) and its successor LLaMA 2 (Touvron et al., 2023c) stand out for their performance and transparency, and were quickly optimized by the community for better inference speed and various applications.
與私有LLMs的發展同時,學術和非營利性機構一直致力于開發像Bloom(Scao等人,2022)、OPT(Zhang等人,2022)和Pythia(Biderman等人,2023b)這樣的開源替代品。盡管一些開源的大型語言模型包含多達1750億參數,但大多數僅在5000億標記或更少的數據上訓練。考慮到70億參數模型在訓練數萬億標記后仍然可以顯著改進,這相對較小。在那些開源的模型中,LLaMA(Touvron等人,2023b)及其后繼模型LLaMA 2(Touvron等人,2023c)以其性能和透明度脫穎而出。社區迅速對其進行了優化,以提高推理速度和各種應用。
微調后的聊天模型(遵循人類指令):微調基礎模型實現與人類保持一致→進一步改善對齊提出RLHF(在人類評定的輸出上+訓練獎勵模型+來學習人類偏好,如直接偏好優化DPO/來自AI反饋的強化學習RLAIF)
In addition to those foundation models, a lot of chat models have also been proposed to follow human instructions. Most of them fine-tune the foundation models to align with humans (OpenAI, 2022; Wang et al., 2023). Those chat models have demonstrated a marked improvement in understanding human instructions and solving complex tasks (Chiang et al., 2023; Xu et al., 2023; Sun et al., 2023). To further improve alignment, Ouyang et al. (2022) incorporate the Reinforcement Learning from Human Feedback (RLHF) approach. This involves learning from human preferences by training a reward model on human-rated outputs. Other methods such as direct preference optimization (DPO) (Rafailov et al., 2023) and reinforcement learning from AI feedback (RLAIF) (Bai et al., 2022b) have also been proposed to improve RLHF in terms of both efficiency and effectiveness.
除了這些基礎模型,還提出了許多聊天模型,以遵循人類的指令。它們大多數是對基礎模型進行微調,使其以與人類保持一致(OpenAI,2022;Wang等人,2023)。這些聊天模型已經在理解人類指令和解決復雜任務方面取得了顯著的改進(Chiang等人,2023;Xu等人,2023;Sun等人,2023)。為了進一步改善對齊,Ouyang等人(2022)融入了來自人類反饋的強化學習(RLHF)方法。這涉及到通過在人類評定的輸出上訓練獎勵模型來學習人類偏好。其他方法,如直接偏好優化(DPO)(Rafailov等人,2023)和來自AI反饋的強化學習(RLAIF)(Bai等人,2022b),也已經提出,以提高RLHF的效率和效果。
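The DPO objective mentioned above optimizes preferences directly, with no reward model or RL loop. A scalar sketch of its loss on summed response log-probabilities (the β value and inputs are illustrative):

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss (Rafailov et al., 2023):
    -log sigmoid(beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))).

    Pushes the policy to raise the chosen response's likelihood relative
    to the reference more than the rejected one's.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# When the policy favors the chosen response more than the reference
# does, the margin is positive and the loss drops below log(2).
loss = dpo_loss(logp_w=-0.5, logp_l=-2.0, ref_logp_w=-1.0, ref_logp_l=-1.0)
```

Compare with the RM + PPO pipeline in Section 3: DPO folds the preference signal directly into a supervised-style objective, which is why Section 4.2 could use it with small amounts of safety annotation.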
7 Limitations and Ethical Considerations限制和道德考慮
依然存在偏見和毒性影響(本文采用了Toxigen基準來減輕)
知識的非實時更新性會對醫學或者法律帶來挑戰
Like other large language models, Baichuan 2 also faces ethical challenges. It’s prone to biases and toxicity, especially given that much of its training data originates from the internet. Despite our best efforts to mitigate these issues using benchmarks like Toxigen (Hartvigsen et al., 2022), the risks cannot be eliminated, and toxicity tends to increase with model size. Moreover, the knowledge of Baichuan 2 models is static and can be outdated or incorrect, posing challenges in fields that require up-to-date information like medicine or law. While optimized for Chinese and English for safety, the model has limitations in other languages and may not fully capture biases relevant to non-Chinese cultures.
與其他大型語言模型一樣,Baichuan 2也面臨著倫理挑戰。它容易受到偏見和毒性的影響,特別是考慮到它的訓練數據很大程度上來自互聯網。盡管我們盡最大努力通過使用Toxigen(Hartvigsen等人,2022)等基準來減輕這些問題,但風險無法完全消除,而且隨著模型大小的增加,毒性往往會增加。此外,Baichuan 2模型的知識是靜態的,可能會過時或不正確,這在需要最新信息的領域,如醫學或法律,會帶來挑戰。雖然為了安全性而進行了中文和英文的優化,但該模型在其他語言方面存在局限性,并且可能無法充分捕捉與非中國文化相關的偏見。
依然存在存在濫用的潛力
There’s also the potential for misuse, as the model could be used to generate harmful or misleading content. Although we try our best efforts to balance safety and utility, some safety measures may appear as over-cautions, affecting the model’s usability for certain tasks. We encourage users to make responsible and ethical use of Baichuan 2 models. Meanwhile, we will continue to optimize these issues and release updated versions in the future.
還存在濫用的潛力,因為該模型可能被用來生成有害或誤導性的內容。盡管我們盡最大努力平衡安全性和效用,但一些安全措施可能會顯得過于謹慎,影響模型在某些任務中的可用性。我們鼓勵用戶負責任地和道德地使用Baichuan 2模型。與此同時,我們將繼續優化這些問題,并在未來發布更新版本。