School of Mathematical Sciences, Peking University

Seminar

Machine Learning and Data Science PhD Student Forum Series (Session 105): Pretraining Data Strategy for Large Language Models

Speaker: 孙昱洋 (Sun Yuyang), Peking University

Time: 2026-06-25 16:00-17:00

Venue: Tencent Meeting, ID 928-6293-8217

Abstract:
Pretraining data strategy has become a central problem in large language model development, where data quality is not only about filtering noisy documents, but also about allocating a finite training budget across topics, sources, quality levels, repetitions, and training stages.

This talk reviews how pretraining data can be organized and optimized: from topic-based web mixing and quality-score fusion, to multi-source mixture search and repetition-aware modeling. We then discuss a stage-wise view of pretraining data design, where the key unit shifts from a static corpus to an executable training recipe. In this setting, high-quality data should be informative, diverse, capacity-aware, and aligned with the objectives of each training stage.

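About the forum: This online forum is organized by the machine learning lab of Professor 张志华 (Zhang Zhihua) and is held every two weeks, except on public holidays. Each session invites one PhD student to give a systematic, in-depth introduction to a frontier topic. Topics include, but are not limited to, machine learning, high-dimensional statistics, operations research and optimization, and theoretical computer science.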
