T5 (language model)

Text-to-Text Transfer Transformer (T5)
Author: Google AI
Latest release version: T5X
Repository: https://github.com/google-research/text-to-text-transfer-transformer
License: Apache-2.0

T5 (Text-to-Text Transfer Transformer) is a series of large language models developed by Google AI and introduced in 2019.[1] Like the original Transformer model,[2] T5 models are encoder-decoder Transformers, where the encoder processes the input text, and the decoder generates the output text.

T5 models are usually pretrained on a massive dataset of text and code, after which they can perform text-based tasks similar to those seen during pretraining. They can also be finetuned to perform other tasks.

T5 models have been employed in various applications, including chatbots, machine translation systems, text summarization tools, code generation, and robotics.[3]

Training

The original T5 models are pre-trained on the Colossal Clean Crawled Corpus (C4), containing text and code scraped from the internet. This pre-training process enables the models to learn general language understanding and generation abilities. T5 models can then be fine-tuned on specific downstream tasks, adapting their knowledge to perform well in various applications.
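As a rough illustration of working with this corpus, a cleaned English split of C4 is publicly mirrored on the Hugging Face Hub. The following sketch assumes the datasets library and the "allenai/c4" dataset id (neither is part of the original T5 release) and streams a few documents without downloading the full corpus.

```python
# Minimal sketch: stream a few documents from a public C4 mirror.
# Assumes the Hugging Face `datasets` library and the "allenai/c4" dataset id;
# neither is prescribed by the T5 paper itself.
from datasets import load_dataset

c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)

for i, doc in enumerate(c4):
    # each record holds the raw web text plus metadata such as the source URL
    print(doc["text"][:80].replace("\n", " "), "...")
    if i == 2:  # peek at three documents only
        break
```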

The T5 models were pretrained on many tasks, all in the format of <input text> -> <output text>. Some examples are:

Restoring corrupted spans of text, where sentinel tokens stand in for the removed spans: "Thank you <X> me to your party <Y> week." -> "<X> for inviting <Y> last <Z>"
Translation: "translate English to German: That is good." -> "Das ist gut."
Judging the grammatical acceptability of a sentence (CoLA): "cola sentence: The course is jumping well." -> "not acceptable"
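Written out literally, all of these tasks look identical to the model: a pair of input and target strings. The snippet below is a minimal sketch of that representation; the sentinel spelling <extra_id_0>, <extra_id_1>, ... follows the released Hugging Face checkpoints and is an assumption here, since the paper writes the sentinels abstractly as <X>, <Y>, <Z>.

```python
# Minimal sketch of the text-to-text training format: every task is reduced
# to an (input text, target text) pair of strings.
train_pairs = [
    # span corruption (the unsupervised pretraining objective);
    # <extra_id_N> is the sentinel spelling used by the Hugging Face checkpoints
    ("Thank you <extra_id_0> me to your party <extra_id_1> week.",
     "<extra_id_0> for inviting <extra_id_1> last <extra_id_2>"),
    # translation, selected by a task prefix
    ("translate English to German: That is good.",
     "Das ist gut."),
    # grammatical acceptability (CoLA), also selected by a task prefix
    ("cola sentence: The course is jumping well.",
     "not acceptable"),
]

for source, target in train_pairs:
    print(f"{source!r} -> {target!r}")
```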

Architecture

The T5 series encompasses several models with varying sizes and capabilities, all encoder-decoder Transformers, where the encoder processes the input text, and the decoder generates the output text.

These models are often distinguished by their parameter count, which indicates the complexity and potential capacity of the model. The original paper reported the following 5 models:

T5 properties
Name     Total parameters    Encoder parameters    Decoder parameters    n_layer    d_model    d_ff     d_kv    n_head
Small        76,956,160          35,330,816            41,625,344          6          512       2048      64       8
Base        247,577,856         109,628,544           137,949,312         12          768       3072      64      12
Large       770,567,168         334,939,648           435,627,520         24         1024       4096      64      16
3B         2,884,497,408       1,240,909,824         1,643,587,584        24         1024      16384     128      32
11B       11,340,220,416       4,864,791,552         6,475,428,864        24         1024      65536     128     128

In the above table,

n_layer: Number of layers in the encoder; also the number of layers in the decoder (they always have the same number of layers).
n_head: Number of attention heads in each attention block.
d_model: Dimension of the embedding vectors.
d_ff: Dimension of the feedforward network within each encoder and decoder layer.
d_kv: Dimension of the key and value vectors used in the self-attention mechanism. Note that, unlike typical Transformers, the 3B and 11B models do not satisfy d_model = d_kv × n_head.[4]
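These hyperparameters correspond directly to fields in the configuration files of the released checkpoints (see [4]). The following sketch, which assumes the Hugging Face transformers library and the "google-t5/t5-11b" checkpoint id, reads them back and checks the d_model = d_kv × n_head relation.

```python
# Sketch: read the architecture hyperparameters from a released T5 config.
# Assumes the Hugging Face `transformers` library and the checkpoint id
# "google-t5/t5-11b"; only the small config JSON is fetched, not the weights.
from transformers import T5Config

cfg = T5Config.from_pretrained("google-t5/t5-11b")

print("n_layer =", cfg.num_layers)   # encoder layers (the decoder matches)
print("d_model =", cfg.d_model)
print("d_ff    =", cfg.d_ff)
print("d_kv    =", cfg.d_kv)
print("n_head  =", cfg.num_heads)

# For the 3B and 11B models this relation does NOT hold (see the table above):
print("d_kv * n_head == d_model ?", cfg.d_kv * cfg.num_heads == cfg.d_model)
```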

Compared to the original Transformer, T5 uses a few minor modifications: layer normalization with no additive bias, layer normalization placed outside the residual path, and relative positional embeddings.
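The bias-free layer normalization amounts to rescaling the activations by their root mean square with a learned scale, with no mean subtraction and no bias, applied to the input of each sub-layer so that the residual connection skips it. Below is a minimal PyTorch sketch of that idea, written as a standalone module rather than taken from any particular T5 codebase.

```python
# Minimal PyTorch sketch of T5-style layer normalization: activations are
# only rescaled (no mean subtraction, no additive bias), and the normalization
# sits outside the residual path, i.e. it is applied before each sub-layer.
import torch
import torch.nn as nn

class T5StyleLayerNorm(nn.Module):
    def __init__(self, d_model: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(d_model))  # scale only, no bias
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        variance = x.pow(2).mean(dim=-1, keepdim=True)   # mean of squares, no centering
        return self.weight * x * torch.rsqrt(variance + self.eps)

def pre_norm_residual(x, sublayer, norm):
    # the sub-layer sees normalized input; the residual connection skips the norm
    return x + sublayer(norm(x))
```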

For all experiments, the T5 authors used a WordPiece tokenizer with a vocabulary size of 32,000. The tokenizer is shared across both the input and output of each model. It was trained on a mixture of English, German, French, and Romanian data from the C4 dataset, at a ratio of 10:1:1:1.
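A short sketch of loading and using that shared tokenizer through the Hugging Face transformers library follows; the checkpoint id "google-t5/t5-small" is just one of the released sizes, and the hosted vocabulary additionally contains sentinel tokens on top of the 32,000 pieces.

```python
# Sketch: tokenize a task-prefixed input with the shared T5 tokenizer.
# Assumes the Hugging Face `transformers` library and the "google-t5/t5-small"
# checkpoint; the hosted vocabulary also includes sentinel tokens
# (<extra_id_0>, <extra_id_1>, ...) on top of the 32,000 pieces.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("google-t5/t5-small")

batch = tok("translate English to German: That is good.", return_tensors="pt")
print(batch["input_ids"])                               # token ids fed to the encoder
print(tok.convert_ids_to_tokens(batch["input_ids"][0])) # the underlying subword pieces
```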

Variants

Several subsequent models used the T5 architecture, with non-standardized naming conventions used to differentiate them. This section attempts to collect the main ones. An exhaustive list of the variants released by Google Brain is on the GitHub repo for T5X.[5]

Some models are trained from scratch, while others are trained starting from a previously trained model. Unless otherwise noted, each model listed here is trained from scratch.

T5 v1.1 properties
Name     Total parameters    Encoder parameters    Decoder parameters    n_layer    d_model    d_ff     d_kv    n_head
Small        76,961,152          35,332,800            41,628,352          8          512       1024      64       6
Base        247,577,856         109,628,544           137,949,312         12          768       2048      64      12
Large       783,150,080         341,231,104           441,918,976         24         1024       2816      64      16
XL         2,849,757,184       1,223,527,424         1,626,229,760        24         2048       5120      64      32
XXL       11,135,332,352       4,762,310,656         6,373,021,696        24         4096      10240      64      64

Applications

The T5 model itself is an encoder-decoder model, allowing it to be used for instruction following. The encoder encodes the instruction, and the decoder autoregressively generates the reply.
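A minimal sketch of this encode-then-decode loop using the Hugging Face transformers library is shown below; the instruction-tuned checkpoint "google/flan-t5-small" is chosen purely for illustration and is an assumption, not something the original T5 paper prescribes.

```python
# Sketch: the encoder reads the instruction, the decoder generates the reply.
# Assumes the Hugging Face `transformers` library; "google/flan-t5-small" is an
# instruction-tuned T5 variant used here only as a small illustrative checkpoint.
from transformers import AutoTokenizer, T5ForConditionalGeneration

name = "google/flan-t5-small"
tok = AutoTokenizer.from_pretrained(name)
model = T5ForConditionalGeneration.from_pretrained(name)

inputs = tok("Translate to German: The house is wonderful.", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=32)  # autoregressive decoding
print(tok.decode(output_ids[0], skip_special_tokens=True))
```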

The T5 encoder can be used as a text encoder, much like BERT. It encodes text into a sequence of real-valued vectors, which can be used for downstream applications. For example, Google Imagen[17] uses T5-XXL as its text encoder, and the encoded text vectors are used to condition a diffusion model. As another example, the AuraFlow diffusion model[18] uses Pile-T5-XL.
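A sketch of using only the encoder stack to obtain such vectors: in the Hugging Face transformers library, T5EncoderModel loads just the encoder weights of a chosen checkpoint (the checkpoint id below is again assumed for illustration).

```python
# Sketch: use the T5 encoder alone as a text encoder, BERT-style.
# Assumes the Hugging Face `transformers` library; T5EncoderModel loads only
# the encoder stack of the chosen checkpoint.
import torch
from transformers import AutoTokenizer, T5EncoderModel

name = "google-t5/t5-small"
tok = AutoTokenizer.from_pretrained(name)
encoder = T5EncoderModel.from_pretrained(name)

with torch.no_grad():
    inputs = tok("A photo of an astronaut riding a horse.", return_tensors="pt")
    hidden = encoder(**inputs).last_hidden_state  # shape: (batch, seq_len, d_model)

print(hidden.shape)  # one d_model-dimensional vector per input token
```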

References

  1. Raffel, Colin; Shazeer, Noam; Roberts, Adam; Lee, Katherine; Narang, Sharan; Matena, Michael; Zhou, Yanqi; Li, Wei; Liu, Peter J. (2020). "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer". Journal of Machine Learning Research. 21 (140): 1–67. arXiv:1910.10683. ISSN 1533-7928.
  2. Vaswani, Ashish; Shazeer, Noam; Parmar, Niki; Uszkoreit, Jakob; Jones, Llion; Gomez, Aidan N.; Kaiser, Łukasz; Polosukhin, Illia (2017). "Attention Is All You Need". Advances in Neural Information Processing Systems. 30. Curran Associates, Inc.
  3. Jiang, Yunfan; Gupta, Agrim; Zhang, Zichen; Wang, Guanzhi; Dou, Yongqiang; Chen, Yanjun; Fei-Fei, Li; Anandkumar, Anima; Zhu, Yuke (2022-10-06). "VIMA: General Robot Manipulation with Multimodal Prompts". arXiv:2210.03094 [cs.RO].
  4. "config.json · google-t5/t5-11b at main". huggingface.co. 2020-04-24. Retrieved 2024-09-17.
  5. "t5x/docs/models.md at main · google-research/t5x". GitHub. Retrieved 2024-08-05.
  6. "config.json · google/t5-v1_1-xl at main". huggingface.co. 2020-11-19. Retrieved 2024-09-17.
  7. "config.json · google/t5-v1_1-xxl at main". huggingface.co. 2020-11-19. Retrieved 2024-09-17.
  8. "SwitchTransformers". huggingface.co. Retrieved 2024-08-05.
  9. "bigscience/T0 · Hugging Face". huggingface.co. 2024-03-04. Retrieved 2024-08-21.
  10. Xue, Linting; Barua, Aditya; Constant, Noah; Al-Rfou, Rami; Narang, Sharan; Kale, Mihir; Roberts, Adam; Raffel, Colin (2022-03-25). "ByT5: Towards a Token-Free Future with Pre-trained Byte-to-Byte Models". Transactions of the Association for Computational Linguistics. 10: 291–306. doi:10.1162/tacl_a_00461. arXiv:2105.13626. ISSN 2307-387X.
  11. Chung, Hyung Won; Hou, Le; Longpre, Shayne; Zoph, Barret; Tay, Yi; Fedus, William; Li, Yunxuan; Wang, Xuezhi; Dehghani, Mostafa; Brahma, Siddhartha; Webson, Albert; Gu, Shixiang Shane; Dai, Zhuyun; Suzgun, Mirac; Chen, Xinyun (2024). "Scaling Instruction-Finetuned Language Models". Journal of Machine Learning Research. 25 (70): 1–53. arXiv:2210.11416. ISSN 1533-7928.
  12. Longpre, Shayne; Hou, Le; Vu, Tu; Webson, Albert; Chung, Hyung Won; Tay, Yi; Zhou, Denny; Le, Quoc V.; Zoph, Barret; Wei, Jason; Roberts, Adam (2023-07-03). "The Flan Collection: Designing Data and Methods for Effective Instruction Tuning". Proceedings of the 40th International Conference on Machine Learning. PMLR: 22631–22648. arXiv:2301.13688.
  13. "google/flan-t5-xl · Hugging Face". huggingface.co. 2024-01-04. Retrieved 2024-08-05.
  14. Roberts, Adam; Chung, Hyung Won; Mishra, Gaurav; Levskaya, Anselm; Bradbury, James; Andor, Daniel; Narang, Sharan; Lester, Brian; Gaffney, Colin; Mohiuddin, Afroz; Hawthorne, Curtis; Lewkowycz, Aitor; Salcianu, Alex; van Zee, Marc; Austin, Jacob (2023). "Scaling Up Models and Data with t5x and seqio". Journal of Machine Learning Research. 24 (377): 1–8. ISSN 1533-7928.
  15. Tay, Yi. "Training great LLMs entirely from ground up in the wilderness as a startup". Retrieved 2024-10-18.
  16. Sutawika, Lintang; Komatsuzaki, Aran; Raffel, Colin (2024-04-15). "Pile-T5". EleutherAI Blog. Retrieved 2024-05-05.
  17. "Imagen: Text-to-Image Diffusion Models". imagen.research.google. Retrieved 2024-08-23.
  18. "AuraFlow". huggingface.co. Retrieved 2024-08-23.
