Anyscale's activity

Anyscale reposted

Robert Nishihara

Co-founder at Anyscale

ByteScale is a new LLM training framework from ByteDance:
- Evaluated on 7B to 141B parameter models
- 256K to 2048K context lengths
- 12,000 GPUs
- Optimized for mixed long and short sequences

The crux of it is a much more dynamic parallelism strategy (as opposed to a static mesh) to account for heterogeneity in sequence length. They call this strategy Hybrid Data Parallelism (HDP), which combines regular data parallelism with context parallelism in a dynamic manner.

Their data loading strategy is very network- and CPU-memory-intensive and requires global coordination across workers (as opposed to each worker doing its own thing). They use Ray actors for this coordination. There are:
- Servers to fetch and preprocess raw data from HDFS and generate metadata
- A scheduler to collect global metadata from all servers, figure out the loading plan, and broadcast the plan to clients
- Clients (on GPUs), which read partial data from servers based on the loading plan
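To make the three-role layout concrete, here is a minimal sketch of server/scheduler/client coordination with Ray actors. This is not ByteScale's actual code: all names (DataServer, Scheduler, make_plan, read) are hypothetical, the in-memory "sequences" stand in for data fetched from HDFS, and the least-loaded planner is a toy stand-in for ByteScale's real scheduling logic.

```python
# Sketch only: hypothetical names, toy data, toy planner.
import ray

ray.init()

@ray.remote
class DataServer:
    """Fetches and preprocesses raw data; exposes per-shard metadata."""
    def __init__(self, shard_id):
        self.shard_id = shard_id
        # ByteScale reads from HDFS here; we fabricate a few sequences
        # of varying length to mimic mixed long/short data.
        self.sequences = [list(range((shard_id + 1) * (i + 1))) for i in range(4)]

    def metadata(self):
        # What the scheduler needs globally: sequence lengths per shard.
        return {"shard": self.shard_id,
                "lengths": [len(s) for s in self.sequences]}

    def read(self, indices):
        # Serve only the slices a client was assigned by the plan.
        return [self.sequences[i] for i in indices]

@ray.remote
class Scheduler:
    """Collects global metadata and produces a loading plan."""
    def make_plan(self, all_metadata, num_clients):
        # Toy balancing: assign longest sequences first to the
        # least-loaded client, so long and short sequences spread
        # evenly across GPUs.
        items = [(m["shard"], i, length)
                 for m in all_metadata
                 for i, length in enumerate(m["lengths"])]
        items.sort(key=lambda t: -t[2])
        plan = {c: [] for c in range(num_clients)}
        loads = [0] * num_clients
        for shard, idx, length in items:
            c = loads.index(min(loads))
            plan[c].append((shard, idx))
            loads[c] += length
        return plan

servers = [DataServer.remote(s) for s in range(2)]
scheduler = Scheduler.remote()

# Scheduler gathers metadata from every server, then the plan is
# shared with all clients (here, just returned to the driver).
metadata = ray.get([s.metadata.remote() for s in servers])
plan = ray.get(scheduler.make_plan.remote(metadata, num_clients=4))

# Each client (one per GPU) reads only its assigned slices.
for client, assignment in plan.items():
    per_server = {}
    for shard, idx in assignment:
        per_server.setdefault(shard, []).append(idx)
    batches = ray.get([servers[s].read.remote(idxs)
                       for s, idxs in per_server.items()])
    print(f"client {client} got {sum(len(b) for b in batches)} sequences")
```

The point of centralizing the plan in one scheduler actor is that it sees all sequence lengths at once, which is what lets it balance heterogeneous long and short sequences globally rather than having each worker load its own shard blindly.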

Srini Vemula

Building NeXT Gen AI & Quantum Leaders | ?A|Q?MATiCS | {igebra.ai} | Ex-Databricks

2 days ago

Very interesting
