[1] |
BROWN T, MANN B, RYDER N, et al. Language Models are Few-Shot Learners[C]. In Proceedings of the 34th International Conference on Neural Information Processing Systems, 2020: 1877-1901.
|
[2] |
KAPLAN J, MCCANDLISH S, HENIGHAN T, et al. Scaling Laws for Neural Language Models[Z]. ArXiv, 2020: abs/2001.08361.
|
[3] |
DEAN J, CORRADO G, MONGA R, et al. Large Scale Distributed Deep Networks[C]. In Proceedings of the 25th International Conference on Neural Information Processing Systems, 2012: 1223-1231.
|
[4] |
NARAYANAN D, SHOEYBI M, CASPER J, et al. Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM[C]. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2021: 1-15.
|
[5] |
YU E, DONG D, LIAO X. Communication Optimization Algorithms for Distributed Deep Learning Systems: A Survey[J]. IEEE Transactions on Parallel and Distributed Systems, 2023, 34(12): 3294-3308.
|
[6] |
SEIDE F, FU H, DROPPO J, et al. 1-Bit Stochastic Gradient Descent and Its Application to Data-Parallel Distributed Training of Speech DNNs[C]. Interspeech, 2014: 1058-1062.
|
[7] |
ALISTARH D, GRUBIC D, LI J, et al. QSGD: Communication-Efficient SGD via Gradient Quantization and Encoding[C]. Advances in Neural Information Processing Systems, 2017: 1707-1718.
|
[8] |
WU J, HUANG W, HUANG J, et al. Error Compensated Quantized SGD and Its Applications to Large-Scale Distributed Optimization[C]. International Conference on Machine Learning, 2018: 5325-5333.
|
[9] |
ZHANG H, LI J, KARA K, et al. ZipML: Training Linear Models with End-to-End Low Precision, and a Little Bit of Deep Learning[C]. International Conference on Machine Learning, 2017: 4035-4043.
|
[10] |
KARIMIREDDY S, REBJOCK Q, STICH S, et al. Error Feedback Fixes SignSGD and Other Gradient Compression Schemes[C]. International Conference on Machine Learning, 2019: 3252-3261.
|
[11] |
HUANG J, DI S, YU X, et al. An Optimized Error-controlled MPI Collective Framework Integrated with Lossy Compression[C]. IEEE International Parallel and Distributed Processing Symposium, 2024: 752-764.
|
[12] |
FENG H, ZHANG B, YE F, et al. Accelerating Communication in Deep Learning Recommendation Model Training with Dual-Level Adaptive Lossy Compression[C]. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2024: 17-22.
|
[13] |
VASWANI A, SHAZEER N, PARMAR N, et al. Attention Is All You Need[C]. In Proceedings of the 31st International Conference on Neural Information Processing Systems, 2017: 6000-6010.
|
[14] |
DEVLIN J, CHANG M, LEE K, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding[C]. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2019: 4171-4186.
|
[15] |
ZINKEVICH M, WEIMER M, LI L, et al. Parallelized Stochastic Gradient Descent[C]. In Proceedings of the 23rd International Conference on Neural Information Processing Systems, 2010: 2595-2603.
|
[16] |
HUANG Y, CHENG Y, BAPNA A, et al. GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism[C]. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, 2019: 103-112.
|
[17] |
NARAYANAN D, HARLAP A, PHANISHAYEE A, et al. PipeDream: Generalized Pipeline Parallelism for DNN Training[C]. In Proceedings of the 27th ACM Symposium on Operating Systems Principles, 2019: 1-15.
|
[18] |
SHOEYBI M, PATWARY M, PURI R, et al. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism[Z]. ArXiv, 2019: abs/1909.08053.
|
[19] |
SMITH S, PATWARY M, NORICK B, et al. Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model[Z]. ArXiv, 2022: abs/2201.11990.
|
[20] |
TOUVRON H, LAVRIL T, IZACARD G, et al. LLaMA: Open and Efficient Foundation Language Models[Z]. ArXiv, 2023: abs/2302.13971.
|
[21] |
YU X, DI S, ZHAO K, et al. Ultrafast Error-bounded Lossy Compression for Scientific Datasets[C]. In Proceedings of the 31st International Symposium on High-Performance Parallel and Distributed Computing, 2022: 159-171.
|
[22] |
ZHOU Q, CHU C, KUMAR N, et al. Designing High-Performance MPI Libraries with On-the-fly Compression for Modern GPU Clusters[C]. IEEE International Parallel and Distributed Processing Symposium, 2021: 444-453.
|