Progress Towards an LLM That Can Handle a Billion Context Tokens
Thanks to Frank Odom (https://fkodom.substack.com/) and his implementation of LongNet (https://arxiv.org/abs/2307.02486) at https://github.com/fkodom/dilated-attention-pytorch, Vieira Santos and I were able to benchmark 288 million context tokens on a single A100, using our fork at https://github.com/DarcStar-Solutions-Tech/dilated-attention-pytorch. The benchmark results are under the doc folder: https://github.com/DarcStar-Solutions-Tech/dilated-attention-pytorch/tree/main/doc
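For readers unfamiliar with the mechanism being benchmarked, here is a minimal single-branch sketch of dilated attention in plain PyTorch. It illustrates the idea from the paper rather than reproducing the repository's implementation (which mixes several segment-length/dilation-rate pairs and offsets the selection across heads so every position is covered); the function name, shapes, and example sizes below are my own.

import torch
import torch.nn.functional as F

def dilated_segment_attention(q, k, v, segment_length: int, dilation_rate: int):
    """One (segment_length, dilation_rate) branch of dilated attention.

    q, k, v: (batch, seq_len, num_heads, head_dim), with seq_len divisible
    by segment_length and segment_length divisible by dilation_rate.
    """
    b, n, h, d = q.shape
    w, r = segment_length, dilation_rate

    # Fold the sequence into non-overlapping segments of length w, then
    # keep every r-th token inside each segment (the dilation). Each
    # segment attends over only w // r positions, so the attention cost
    # per segment is (w // r)**2 and the total cost is linear in n.
    def sparsify(x):
        x = x.view(b, n // w, w, h, d)[:, :, ::r]          # (b, n/w, w/r, h, d)
        return x.reshape(b * (n // w), w // r, h, d).transpose(1, 2)

    qs, ks, vs = sparsify(q), sparsify(k), sparsify(v)     # (b*n/w, h, w/r, d)
    o = F.scaled_dot_product_attention(qs, ks, vs)         # dense attention per sparse segment
    o = o.transpose(1, 2).reshape(b, n // w, w // r, h, d)

    # Scatter the outputs back to the positions the dilation selected;
    # other positions stay zero and would be filled by the other branches.
    out = torch.zeros_like(q)
    out.view(b, n // w, w, h, d)[:, :, ::r] = o
    return out

# Example: 65,536 tokens with tiny model dims; runs on CPU or GPU.
q = k = v = torch.randn(1, 2**16, 4, 32)
out = dilated_segment_attention(q, k, v, segment_length=4096, dilation_rate=4)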
288 million tokens was chosen because it was the largest context that fit within the 80GB of GPU memory on the single A100. Running 256 million (2**28) tokens required 40GB of VRAM on the GPU. Benchmarking 512 million (2 x 2**28) context tokens was attempted, but it would have required an additional 64GB of memory. At 1.5 x 2**28 tokens, an additional 48GB of VRAM was needed, and at 1.25 x 2**28 tokens, an additional 40GB was requested, slightly more than remained available on the GPU. So 288 million (1.125 x 2**28) tokens was chosen, as it was the largest size that ran successfully in empirical tests of the provided implementation.
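For reference, the token counts above are consistent if "million" is read in binary units (1 Mi = 2**20 tokens), so that 256 million is exactly 2**28 tokens. A quick script reproduces the arithmetic:

# The benchmark sizes expressed as multiples of 2**28 tokens,
# assuming binary "millions" (1 Mi = 2**20 tokens).
for mult in (1.0, 1.125, 1.25, 1.5, 2.0):
    tokens = int(mult * 2**28)
    print(f"{mult:>5} x 2**28 = {tokens:>11,} tokens = {tokens // 2**20} Mi")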
Additional benchmark results are available for 256 million, 128 million, and 64 million context tokens.
Next steps are to extend the LongNet/dilated attention implementation to support distributed workloads, splitting token processing across devices so that multiple GPUs can be used to benchmark larger context windows.
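As a rough sketch of that direction, the LongNet paper's distributed algorithm splits the sequence across devices and exchanges only the dilated (sparsified) keys and values, which keeps communication volume small. Below is a hypothetical torch.distributed fragment of that key step; the function name and shapes are my own, not the repository's API, and it assumes equal-length sequence chunks per rank.

import torch
import torch.distributed as dist

def gather_dilated_kv(k_local, v_local, dilation_rate: int):
    """Exchange sparsified keys/values between ranks, LongNet-style.

    k_local, v_local: (batch, local_seq_len, heads, head_dim), the chunk
    of the sequence owned by this rank. Only every dilation_rate-th token
    crosses the wire, so communication shrinks as the dilation grows.
    """
    k_sparse = k_local[:, ::dilation_rate].contiguous()
    v_sparse = v_local[:, ::dilation_rate].contiguous()
    world_size = dist.get_world_size()
    k_parts = [torch.empty_like(k_sparse) for _ in range(world_size)]
    v_parts = [torch.empty_like(v_sparse) for _ in range(world_size)]
    dist.all_gather(k_parts, k_sparse)
    dist.all_gather(v_parts, v_sparse)
    # Concatenate along the sequence axis: every rank now holds the
    # dilated K/V for the full context and can attend locally with its
    # own queries.
    return torch.cat(k_parts, dim=1), torch.cat(v_parts, dim=1)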