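Machine Learning and Data Science PhD Student Forum Series (Session 105): Pretraining Data Strategy for Large Language Models
Speaker: 孙昱洋 (Sun Yuyang), Peking University
Time: 2026-06-25 16:00-17:00
Location: Tencent Meeting: 928-6293-8217
Abstract: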
Pretraining data strategy has become a central problem in large language model development, where data quality is not only about filtering noisy documents, but also about allocating a finite training budget across topics, sources, quality levels, repetitions, and training stages.
This talk reviews how pretraining data can be organized and optimized: from topic-based web mixing and quality-score fusion, to multi-source mixture search and repetition-aware modeling. We then discuss a stage-wise view of pretraining data design, where the key unit shifts from a static corpus to an executable training recipe. In this setting, high-quality data should be informative, diverse, capacity-aware, and aligned with the objectives of each training stage.
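About the forum: This online forum is organized by the machine learning lab of Professor 张志华 (Zhang Zhihua) and is held every two weeks, except on public holidays. Each session invites one PhD student to give a systematic, in-depth introduction to a frontier topic. Topics include, but are not limited to, machine learning, high-dimensional statistics, operations research and optimization, and theoretical computer science.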