Large language model explained

A large language model (LLM) is a type of computational model designed for natural language processing tasks such as language generation. As language models, LLMs acquire these abilities by learning statistical relationships from vast amounts of text during a self-supervised and semi-supervised training process.[1]

The largest and most capable LLMs are artificial neural networks built with a decoder-only transformer-based architecture, enabling efficient processing and generation of large-scale text data. Modern models can be fine-tuned for specific tasks, or be guided by prompt engineering.[2] These models acquire predictive power regarding syntax, semantics, and ontologies[3] inherent in human language corpora, but they also inherit inaccuracies and biases present in the data on which they are trained.[4]

History

Before 2017, there were a few language models that were large compared to the capacities then available. In the 1990s, the IBM alignment models pioneered statistical language modelling. A smoothed n-gram model in 2001 trained on 0.3 billion words achieved state-of-the-art perplexity at the time. In the 2000s, as Internet use became prevalent, some researchers constructed Internet-scale language datasets ("web as corpus"[5]), upon which they trained statistical language models.[6] [7] By 2009, statistical language models dominated over symbolic language models in most language processing tasks, as they can usefully ingest large datasets.[8]

After neural networks became dominant in image processing around 2012,[9] they were applied to language modelling as well. Google converted its translation service to Neural Machine Translation in 2016. Because this predated transformers, it was done with seq2seq deep LSTM networks. At the 2017 NeurIPS conference, Google researchers introduced the transformer architecture in their landmark paper "Attention Is All You Need". This paper's goal was to improve upon 2014 seq2seq technology,[10] and was based mainly on the attention mechanism developed by Bahdanau et al. in 2014.[11] The following year, in 2018, BERT was introduced and quickly became "ubiquitous".[12] Though the original transformer has both encoder and decoder blocks, BERT is an encoder-only model.

Although decoder-only GPT-1 was introduced in 2018, it was GPT-2 in 2019 that caught widespread attention because OpenAI at first deemed it too powerful to release publicly, out of fear of malicious use.[13] GPT-3 in 2020 went a step further and is available only via API, with no option to download the model and run it locally. But it was the 2022 consumer-facing browser-based ChatGPT that captured the imaginations of the general population and caused some media hype and online buzz.[14] The 2023 GPT-4 was praised for its increased accuracy and as a "holy grail" for its multimodal capabilities.[15] OpenAI did not reveal the high-level architecture or the number of parameters of GPT-4.

Competing language models have for the most part been attempting to equal the GPT series, at least in terms of number of parameters.[16]

Since 2022, source-available models have been gaining popularity, especially at first with BLOOM and LLaMA, though both have restrictions on the field of use. Mistral AI's models Mistral 7B and Mixtral 8x7b have the more permissive Apache License. The instruction fine-tuned variant of the Llama 3 70-billion-parameter model is the most powerful open LLM according to the LMSYS Chatbot Arena Leaderboard, being more powerful than GPT-3.5 but not as powerful as GPT-4.[17]

As of 2024, the largest and most capable models are all based on the Transformer architecture. Some recent implementations are based on other architectures, such as recurrent neural network variants and Mamba (a state space model).[18] [19]

Dataset preprocessing

Tokenization

Because machine learning algorithms process numbers rather than text, the text must be converted to numbers. In the first step, a vocabulary is decided upon, then integer indices are arbitrarily but uniquely assigned to each vocabulary entry, and finally, an embedding is associated to the integer index. Algorithms include byte-pair encoding (BPE) and WordPiece. There are also special tokens serving as control characters, such as [MASK] for masked-out token (as used in BERT), and [UNK] ("unknown") for characters not appearing in the vocabulary. Also, some special symbols are used to denote special text formatting. For example, "Ġ" denotes a preceding whitespace in RoBERTa and GPT. "##" denotes continuation of a preceding word in BERT.

For example, the BPE tokenizer used by GPT-3 (Legacy) would split tokenizer: texts -> series of numerical "tokens" as

token izer  texts ->series  of numerical  " tokens"

Tokenization also compresses the datasets. Because LLMs generally require input to be an array that is not jagged, the shorter texts must be "padded" until they match the length of the longest one. How many tokens are, on average, needed per word depends on the language of the dataset.[20] [21]
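The padding step described above can be sketched as follows. This is a minimal illustration only: the function name pad_batch and the choice of 0 as the padding id are assumptions made for this example, not taken from any particular library.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad lists of token ids so every row has the same length."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]


# Three tokenized texts of different lengths become a rectangular array.
print(pad_batch([[5, 8, 2], [7], [4, 9]]))  # [[5, 8, 2], [7, 0, 0], [4, 9, 0]]
```

In practice the padded positions are usually also masked out so that they do not influence the model, a detail omitted here.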

BPE

See main article: Byte pair encoding. As an example, consider a tokenizer based on byte-pair encoding. In the first step, all unique characters (including blanks and punctuation marks) are treated as an initial set of n-grams (i.e. initial set of uni-grams). Successively the most frequent pair of adjacent characters is merged into a bi-gram and all instances of the pair are replaced by it. All occurrences of adjacent pairs of (previously merged) n-grams that most frequently occur together are then again merged into even lengthier n-gram, until a vocabulary of prescribed size is obtained (in case of GPT-3, the size is 50257).[22] After a tokenizer is trained, any text can be tokenized by it, as long as it does not contain characters not appearing in the initial-set of uni-grams.[23]
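The merge procedure described above can be sketched in a few lines. This is a toy, character-level illustration under simplifying assumptions: the function name train_bpe is hypothetical, the corpus is a single string, ties between equally frequent pairs go to the pair seen first, and training stops after a fixed number of merges rather than at a target vocabulary size. Production tokenizers differ in many details.

```python
from collections import Counter


def train_bpe(text, num_merges):
    """Learn up to `num_merges` merge rules from `text` (toy BPE sketch)."""
    symbols = list(text)  # start from single characters (uni-grams)
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(symbols, symbols[1:]))
        if not pairs:
            break
        # Most frequent adjacent pair; ties go to the pair seen first.
        (left, right), count = pairs.most_common(1)[0]
        if count < 2:
            break  # no pair repeats, so merging would not compress further
        merges.append((left, right))
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                merged.append(left + right)  # replace the pair by one symbol
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return merges, symbols


merges, tokens = train_bpe("aaabdaaabac", num_merges=3)
print(merges)  # [('a', 'a'), ('aa', 'a'), ('aaa', 'b')]
print(tokens)  # ['aaab', 'd', 'aaab', 'a', 'c']
```

Each learned merge adds one entry to the vocabulary, so the number of merges controls the final vocabulary size.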

Problems

A token vocabulary based on the frequencies extracted from mainly English corpora uses as few tokens as possible for an average English word. An average word in another language encoded by such an English-optimized tokenizer is, however, split into a suboptimal number of tokens. The GPT-2 tokenizer can use up to 15 times more tokens per word for some languages, for example for the Shan language from Myanmar. Even more widespread languages such as Portuguese and German have "a premium of 50%" compared to English.[24]

Greedy tokenization also causes subtle problems with text completion.[25]

Dataset cleaning

See main article: Data cleansing. In the context of training LLMs, datasets are typically cleaned by removing toxic passages from the dataset, discarding low-quality data, and de-duplication.[26] Cleaned datasets can increase training efficiency and lead to improved downstream performance.[27] A trained LLM can be used to clean datasets for training a further LLM.[28]

With the increasing proportion of LLM-generated content on the web, data cleaning in the future may include filtering out such content. LLM-generated content can pose a problem if the content is similar to human text (making filtering difficult) but of lower quality (degrading performance of models trained on it).[29]

Synthetic data

See main article: Synthetic data. Training the largest language models might need more linguistic data than is naturally available, or the naturally occurring data might be of insufficient quality. In these cases, synthetic data might be used. Microsoft's Phi series of LLMs is trained on textbook-like data generated by another LLM.[30]

Training and architecture

See also: Fine-tuning (machine learning).

Reinforcement learning from human feedback (RLHF)

See main article: Reinforcement learning from human feedback. Reinforcement learning from human feedback (RLHF) through algorithms, such as proximal policy optimization, is used to further fine-tune a model based on a dataset of human preferences.[31]

Instruction tuning

Using "self-instruct" approaches, LLMs have been able to bootstrap correct responses, replacing any naive responses, starting from human-generated corrections of a few cases. For example, in the instruction "Write an essay about the main themes represented in Hamlet," an initial naive completion might be "If you submit the essay after March 17, your grade will be reduced by 10% for each day of delay," based on the frequency of this textual sequence in the corpus.[32]

Mixture of experts

See main article: Mixture of experts. The largest LLMs may be too expensive to train and use directly. For such models, mixture of experts (MoE) can be applied, a line of research pursued by Google researchers since 2017 to train models reaching up to 1 trillion parameters.[33] [34]

Prompt engineering, attention mechanism, and context window

See also: Prompt engineering and Attention (machine learning). Most results previously achievable only by (costly) fine-tuning can be achieved through prompt engineering, although limited to the scope of a single conversation (more precisely, limited to the scope of a context window).

In order to find out which tokens are relevant to each other within the scope of the context window, the attention mechanism calculates "soft" weights for each token, more precisely for its embedding, by using multiple attention heads, each with its own "relevance" for calculating its own soft weights. For example, the small (i.e. 117M parameter sized) GPT-2 model has twelve attention heads and a context window of only 1k tokens.[35] In its medium version it has 345M parameters and contains 24 layers, each with 12 attention heads. Training with gradient descent used a batch size of 512.[23]
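A single attention head can be sketched as below, using the scaled dot-product formulation from "Attention Is All You Need". This is a simplified illustration, not a full transformer layer: the function names are hypothetical, the learned query, key, and value projections are assumed to have been applied already, and no causal mask or multi-head concatenation is included.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_head(queries, keys, values):
    """Single-head scaled dot-product attention over lists of vectors."""
    dim = len(keys[0])
    outputs = []
    for q in queries:
        # Relevance of every token to the current one, scaled by sqrt(dim).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)  # the "soft" weights for this token
        # Output is the weighted average of the value vectors.
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs


# Two identical keys receive equal weight, so the output averages the values.
print(attention_head([[1.0, 0.0]], [[1.0, 0.0], [1.0, 0.0]], [[2.0], [4.0]]))  # [[3.0]]
```

A multi-head layer runs several such heads in parallel, each with its own projections, and combines their outputs.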

The largest models, such as Google's Gemini 1.5, presented in February 2024, can have a context window of up to 1 million tokens (a context window of 10 million was also "successfully tested").[36] Other models with large context windows include Anthropic's Claude 2.1, with a context window of up to 200k tokens.[37] Note that this maximum refers to the number of input tokens and that the maximum number of output tokens differs from the input and is often smaller. For example, the GPT-4 Turbo model has a maximum output of 4096 tokens.[38]

The length of a conversation that the model can take into account when generating its next answer is also limited by the size of the context window. If a conversation, for example with ChatGPT, is longer than the context window, only the parts inside the context window are taken into account when generating the next answer, or the model needs to apply some algorithm to summarize the more distant parts of the conversation.
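The simplest strategy, keeping only the most recent turns that fit, might look like the sketch below. The function name fit_to_context is hypothetical, and counting whitespace-separated words is only a stand-in for a real tokenizer; summarizing older turns is not shown.

```python
def fit_to_context(messages, max_tokens, count_tokens=lambda text: len(text.split())):
    """Keep the most recent messages whose total token count fits the window."""
    kept, used = [], 0
    for message in reversed(messages):  # walk from newest to oldest
        cost = count_tokens(message)
        if used + cost > max_tokens:
            break  # this message and everything older falls outside the window
        kept.append(message)
        used += cost
    return list(reversed(kept))  # restore chronological order


history = ["a b c", "d e", "f"]
print(fit_to_context(history, max_tokens=3))  # ['d e', 'f']
```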

The shortcomings of making a context window larger include higher computational cost and possibly diluting the focus on local context, while making it smaller can cause a model to miss an important long-range dependency. Balancing them is a matter of experimentation and domain-specific considerations.

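Given a segment from its training dataset, a model may be pre-trained either to predict how the segment continues, or to predict what is missing in the segment.[39] It can be either autoregressive (predicting how the segment continues, as GPT models do) or masked (filling in what is missing from the segment, as BERT does).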

Models may be trained on auxiliary tasks which test their understanding of the data distribution, such as Next Sentence Prediction (NSP), in which pairs of sentences are presented and the model must predict whether they appear consecutively in the training corpus. During training, a regularization loss is also used to stabilize training. However, the regularization loss is usually not used during testing and evaluation.

Infrastructure

Substantial infrastructure is necessary for training the largest models.[41] [42] [43]

Training cost

Advances in software and hardware have reduced the cost substantially since 2020, such that in 2023 training a 12-billion-parameter LLM required 72,300 A100-GPU-hours of compute, while in 2020 the cost of training a 1.5-billion-parameter LLM (which was two orders of magnitude smaller than the state of the art in 2020) was between $80,000 and $1,600,000.[44] [45] [46] Since 2020, large sums have been invested in increasingly large models. For example, training GPT-2 (a 1.5-billion-parameter model) in 2019 cost $50,000, while training PaLM (a 540-billion-parameter model) in 2022 cost $8 million, and Megatron-Turing NLG 530B (in 2021) cost around $11 million.

For Transformer-based LLMs, training cost is much higher than inference cost. It costs 6 FLOPs per parameter to train on one token, whereas it costs 1 to 2 FLOPs per parameter to infer on one token.[47]
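These rules of thumb translate directly into a back-of-the-envelope estimate. The sketch below is illustrative only: the function names are hypothetical, the model size and token counts are made-up example values, and the inference figure uses the upper end (2 FLOPs per parameter per token) of the quoted range.

```python
def training_flops(num_params, num_tokens, flops_per_param=6):
    """Approximate compute needed to train on `num_tokens` tokens."""
    return flops_per_param * num_params * num_tokens


def inference_flops(num_params, num_tokens, flops_per_param=2):
    """Approximate compute needed to process `num_tokens` tokens at inference."""
    return flops_per_param * num_params * num_tokens


# Hypothetical model: 1 billion parameters, 20 billion training tokens.
n, d = 10**9, 20 * 10**9
print(f"{training_flops(n, d):.1e}")      # 1.2e+20
print(f"{inference_flops(n, 1000):.1e}")  # 2.0e+12 for 1000 tokens
```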

Tool use

There are certain tasks that, in principle, cannot be solved by any LLM, at least not without the use of external tools or additional software. An example of such a task is responding to the user's input '354 * 139 = ', provided that the LLM has not already encountered a continuation of this calculation in its training corpus. In such cases, the LLM needs to resort to running program code that calculates the result, which can then be included in its response. Another example is "What is the time now? It is ", where a separate program interpreter would need to execute code to get the system time on the computer, so that the LLM can include it in its reply.[48] [49] This basic strategy can be made more sophisticated with multiple attempts of generated programs, and other sampling strategies.[50]
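One way to picture this is a host program that scans the model's output for a tool-call marker, runs the tool, and splices the result back into the text. The sketch below is an assumption-laden illustration: the CALC[...] marker syntax and the function name run_tools are invented for this example and do not correspond to any real model or API, and only integer addition, subtraction, and multiplication are supported.

```python
import operator
import re

# Supported operators for the hypothetical CALC[...] marker.
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}
CALL = re.compile(r"CALC\[(\d+)\s*([+\-*])\s*(\d+)\]")


def run_tools(model_output):
    """Replace each CALC[a op b] marker with the computed integer result."""
    def evaluate(match):
        left, op, right = match.groups()
        return str(OPS[op](int(left), int(right)))
    return CALL.sub(evaluate, model_output)


# The model emits a marker instead of guessing the product.
print(run_tools("354 * 139 = CALC[354 * 139]"))  # 354 * 139 = 49206
```

Parsing a restricted expression format, rather than evaluating arbitrary generated code, keeps the example safe to run.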

Generally, in order to get an LLM to use tools, one must fine-tune it for tool use. If the number of tools is finite, then fine-tuning may be done just once. If the number of tools can grow arbitrarily, as with online API services, then the LLM can be fine-tuned to be able to read API documentation and call APIs correctly.[51] [52]

A simpler form of tool use is retrieval-augmented generation: the augmentation of an LLM with document retrieval. Given a query, a document retriever is called to retrieve the most relevant documents. This is usually done by encoding the query and the documents into vectors, then finding the documents with vectors (usually stored in a vector database) most similar to the vector of the query. The LLM then generates an output based on both the query and context included from the retrieved documents.[53]
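The retrieval step can be sketched as follows. This is a toy illustration: the function names are hypothetical, a bag-of-words count vector stands in for a learned embedding model, a plain list stands in for a vector database, and the prompt template is an arbitrary choice.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: word counts (a stand-in for a learned encoder)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query, documents, k=1):
    """Return the k documents most similar to the query."""
    q = embed(query)
    return sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]


def build_prompt(query, documents, k=1):
    """Prepend the retrieved context to the query before calling the LLM."""
    context = "\n".join(retrieve(query, documents, k))
    return f"Context:\n{context}\n\nQuestion: {query}"


docs = [
    "the sharks play ice hockey in san jose",
    "byte pair encoding merges frequent pairs",
]
print(retrieve("where do the sharks play", docs))
# ['the sharks play ice hockey in san jose']
```

The string returned by build_prompt would then be passed to the LLM in place of the bare query.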

Agency

An LLM is typically not an autonomous agent by itself, as it lacks the ability to interact with dynamic environments, recall past behaviors, and plan future actions, but can be transformed into one by integrating modules like profiling, memory, planning, and action.[54]

The ReAct pattern, a portmanteau of "Reason + Act", constructs an agent out of an LLM, using the LLM as a planner. The LLM is prompted to "think out loud". Specifically, the language model is prompted with a textual description of the environment, a goal, a list of possible actions, and a record of the actions and observations so far. It generates one or more thoughts before generating an action, which is then executed in the environment.[55] The linguistic description of the environment given to the LLM planner can even be the LaTeX code of a paper describing the environment.[56]

In the DEPS ("Describe, Explain, Plan and Select") method, an LLM is first connected to the visual world via image descriptions, then it is prompted to produce plans for complex tasks and behaviors based on its pretrained knowledge and environmental feedback it receives.[57]

The Reflexion method[58] constructs an agent that learns over multiple episodes. At the end of each episode, the LLM is given the record of the episode, and prompted to think up "lessons learned", which would help it perform better at a subsequent episode. These "lessons learned" are given to the agent in the subsequent episodes.

Monte Carlo tree search can use an LLM as a rollout heuristic. When a programmatic world model is not available, an LLM can also be prompted with a description of the environment to act as a world model.[59]

For open-ended exploration, an LLM can be used to score observations for their "interestingness", which can be used as a reward signal to guide a normal (non-LLM) reinforcement learning agent.[60] Alternatively, it can propose increasingly difficult tasks for curriculum learning.[61] Instead of outputting individual actions, an LLM planner can also construct "skills", or functions for complex action sequences. The skills can be stored and later invoked, allowing increasing levels of abstraction in planning.

LLM-powered agents can keep a long-term memory of their previous contexts, and the memory can be retrieved in the same way as in retrieval-augmented generation. Multiple such agents can interact socially.[62]

Compression

Typically, LLMs are trained with single- or half-precision floating point numbers (float32 and float16). One float16 has 16 bits, or 2 bytes, and so one billion parameters require 2 gigabytes. The largest models typically have 100 billion parameters, requiring 200 gigabytes to load, which places them outside the range of most consumer electronics.[63]
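The arithmetic above can be checked with a short helper. The function name model_size_gb is hypothetical, one gigabyte is taken as 10**9 bytes, the int8 entry is an added illustration of a lower-precision format, and only the weights are counted (activations and other runtime memory are ignored).

```python
# Bytes needed to store one parameter in each format.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def model_size_gb(num_params, dtype="float16"):
    """Storage for the weights alone, in gigabytes (1 GB = 10**9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 10**9


print(model_size_gb(10**9))             # 2.0   (1 billion parameters, float16)
print(model_size_gb(100 * 10**9))       # 200.0 (100 billion parameters, float16)
print(model_size_gb(100 * 10**9, "float32"))  # 400.0
```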

Post-training quantization[64] aims to decrease the space requirement by lowering precision of the parameters of a trained model, while preserving most of its performance.[65] [66] The simplest form of quantization simply truncates all numbers to a given number of bits. It can be improved by using a different quantization codebook per layer. Further improvement can be done by applying different precisions to different parameters, with higher precision for particularly important parameters ("outlier weights").[67] See [68] for a visual guide.
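A minimal sketch of the idea follows. Note that it does not truncate bits as described above; instead it uses round-to-nearest with a single scale factor derived from the largest absolute weight, a common simple variant chosen here for illustration. The function names are hypothetical, and per-layer codebooks or special handling of outlier weights are not shown.

```python
def quantize(weights, bits=8):
    """Map floats to signed integer codes using one shared scale factor."""
    levels = 2 ** (bits - 1) - 1  # 127 for 8 bits
    scale = max(abs(w) for w in weights) / levels or 1.0  # 1.0 if all weights are zero
    return [round(w / scale) for w in weights], scale


def dequantize(codes, scale):
    """Recover approximate float weights from integer codes."""
    return [c * scale for c in codes]


weights = [0.5, -1.0, 0.25, 0.013]
codes, scale = quantize(weights)
restored = dequantize(codes, scale)
# Each restored weight differs from the original by at most half a step.
print(max(abs(w - r) for w, r in zip(weights, restored)) <= scale / 2 + 1e-12)  # True
```

Storing the integer codes plus one scale factor takes roughly a quarter of the space of float32 weights at 8 bits, at the price of the rounding error shown.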

While quantized models are typically frozen, and only pre-quantized models are fine-tuned, quantized models can still be fine-tuned.[69]

Multimodality

See also: Multimodal learning. Multimodality means "having several modalities", and a "modality" refers to a type of input or output, such as video, image, audio, text, proprioception, etc.[70] There have been many AI models trained specifically to ingest one modality and output another modality, such as AlexNet for image to label,[71] visual question answering for image-text to text,[72] and speech recognition for speech to text.

A common method to create multimodal models out of an LLM is to "tokenize" the output of a trained encoder. Concretely, one can construct an LLM that can understand images as follows: take a trained LLM, and take a trained image encoder E. Make a small multilayered perceptron f, so that for any image y, the post-processed vector f(E(y)) has the same dimensions as an encoded token. That is an "image token". Then, one can interleave text tokens and image tokens. The compound model is then fine-tuned on an image-text dataset. This basic construction can be applied with more sophistication to improve the model. The image encoder may be frozen to improve stability.[73]

Flamingo demonstrated the effectiveness of the tokenization method, fine-tuning a pretrained language model paired with a pretrained image encoder to perform better on visual question answering than models trained from scratch.[74] Google's PaLM model was fine-tuned into a multimodal model, PaLM-E, using the tokenization method, and applied to robotic control.[75] LLaMA models have also been turned multimodal using the tokenization method, to allow image inputs,[76] and video inputs.[77]

GPT-4 can use both text and image as inputs[78] (although the vision component was not released to the public until GPT-4V[79]); Google DeepMind's Gemini is also multimodal. Mistral introduced its own multimodal Pixtral 12B model in September 2024.[80]

Properties

Scaling laws

See main article: Neural scaling law. The performance of an LLM after pretraining largely depends on the:

cost of pretraining C (the total amount of compute used),

number of parameters N (i.e. amount of neurons in its layers, amount of weights between them and biases),

size of its pretraining dataset D (i.e. number of tokens in the training set).

"Scaling laws" are empirical statistical laws that predict LLM performance based on such factors. One particular scaling law ("Chinchilla scaling") for LLMs autoregressively trained for one epoch, with a log-log learning rate schedule, states that:[81]

\begin{cases} C = C_0 N D \\[6pt] L = \frac{A}{N^\alpha} + \frac{B}{D^\beta} + L_0 \end{cases}

where the variables are

C is the cost of training the model, in FLOPs.

N is the number of parameters in the model.

D is the number of tokens in the training set.

L is the average negative log-likelihood loss per token (nats/token), achieved by the trained LLM on the test dataset.

and the statistical hyper-parameters are

C_0 = 6, meaning that it costs 6 FLOPs per parameter to train on one token. Note that training cost is much higher than inference cost, where it costs 1 to 2 FLOPs per parameter to infer on one token.

\alpha = 0.34, \beta = 0.28, A = 406.4, B = 410.7, L_0 = 1.69
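With the constants quoted above, the loss term of the Chinchilla law, L = A / N^alpha + B / D^beta + L_0, can be evaluated directly. The sketch below is illustrative: the function name chinchilla_loss and the example model sizes are assumptions made for this example, and the output is only as reliable as the fitted constants.

```python
def chinchilla_loss(num_params, num_tokens,
                    alpha=0.34, beta=0.28, a=406.4, b=410.7, l0=1.69):
    """Predicted loss L (nats/token) for N parameters and D training tokens."""
    return a / num_params**alpha + b / num_tokens**beta + l0


# Hypothetical comparison: same data budget, ten times more parameters.
small = chinchilla_loss(num_params=1e9, num_tokens=2e10)
large = chinchilla_loss(num_params=1e10, num_tokens=2e10)
print(small > large > 1.69)  # True: loss falls with scale but stays above L_0
```

The constant L_0 acts as a floor: under this law, no amount of parameters or data brings the predicted loss below it.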

Emergent abilities

Performance of bigger models on various tasks, when plotted on a log-log scale, appears as a linear extrapolation of performance achieved by smaller models. However, this linearity may be punctuated by "break(s)"[82] in the scaling law, where the slope of the line changes abruptly, and where larger models acquire "emergent abilities".[83] [84] They arise from the complex interaction of the model's components and are not explicitly programmed or designed.[85]

Furthermore, recent research has demonstrated that AI systems, including large language models, can employ heuristic reasoning akin to human cognition. They balance between exhaustive logical processing and the use of cognitive shortcuts (heuristics), adapting their reasoning strategies to optimize between accuracy and effort. This behavior aligns with principles of resource-rational human cognition, as discussed in classical theories of bounded rationality and dual-process theory.[86]

The most intriguing among emergent abilities is in-context learning from example demonstrations.[87] In-context learning is involved in tasks, such as:

Schaeffer et al. argue that the emergent abilities are not unpredictably acquired, but predictably acquired according to a smooth scaling law. The authors considered a toy statistical model of an LLM solving multiple-choice questions, and showed that this statistical model, modified to account for other types of tasks, applies to these tasks as well.[93]

Let x be the parameter count, and y be the performance of the model.

Interpretation

Large language models by themselves are black boxes, and it is not clear how they can perform linguistic tasks. There are several methods for understanding how LLMs work.

Mechanistic interpretability aims to reverse-engineer LLMs by discovering symbolic algorithms that approximate the inference performed by an LLM. One example is Othello-GPT, where a small Transformer is trained to predict legal Othello moves. It is found that there is a linear representation of the Othello board, and modifying the representation changes the predicted legal Othello moves in the correct way.[94] [95] In another example, a small Transformer is trained on Karel programs. Similar to the Othello-GPT example, there is a linear representation of Karel program semantics, and modifying the representation changes output in the correct way. The model also generates correct programs that are on average shorter than those in the training set.[96]

In another example, the authors trained small transformers on modular arithmetic addition. The resulting models were reverse-engineered, and it turned out they used the discrete Fourier transform.[97]

Understanding and intelligence

See also: Philosophy of artificial intelligence and Artificial consciousness.

NLP researchers were evenly split when asked, in a 2022 survey, whether (untuned) LLMs "could (ever) understand natural language in some nontrivial sense".[98] Proponents of "LLM understanding" believe that some LLM abilities, such as mathematical reasoning, imply an ability to "understand" certain concepts. A Microsoft team argued in 2023 that GPT-4 "can solve novel and difficult tasks that span mathematics, coding, vision, medicine, law, psychology and more" and that GPT-4 "could reasonably be viewed as an early (yet still incomplete) version of an artificial general intelligence system": "Can one reasonably say that a system that passes exams for software engineering candidates is not really intelligent?"[99] [100] Ilya Sutskever argues that predicting the next word sometimes involves reasoning and deep insights, for example if the LLM has to predict the name of the criminal in an unknown detective novel after processing the entire story leading up to the revelation.[101] Some researchers characterize LLMs as "alien intelligence".[102] [103] For example, Conjecture CEO Connor Leahy considers untuned LLMs to be like inscrutable alien "Shoggoths", and believes that RLHF tuning creates a "smiling facade" obscuring the inner workings of the LLM: "If you don't push it too far, the smiley face stays on. But then you give it [an unexpected] prompt, and suddenly you see this massive underbelly of insanity, of weird thought processes and clearly non-human understanding."[104] [105]

In contrast, some proponents of the "LLMs lack understanding" school believe that existing LLMs are "simply remixing and recombining existing writing", a phenomenon known as the "stochastic parrot", or they point to the deficits existing LLMs continue to have in prediction skills, reasoning skills, agency, and explainability. For example, GPT-4 has natural deficits in planning and in real-time learning. Generative LLMs have been observed to confidently assert claims of fact which do not seem to be justified by their training data, a phenomenon which has been termed "hallucination".[106] Specifically, hallucinations in the context of LLMs correspond to the generation of text or responses that seem syntactically sound, fluent, and natural but are factually incorrect, nonsensical, or unfaithful to the provided source input.[107] Neuroscientist Terrence Sejnowski has argued that "The diverging opinions of experts on the intelligence of LLMs suggests that our old ideas based on natural intelligence are inadequate".

The matter of LLMs exhibiting intelligence or understanding has two main aspects – the first is how to model thought and language in a computer system, and the second is how to enable the computer system to generate human-like language.[98] These aspects of language as a model of cognition have been developed in the field of cognitive linguistics. American linguist George Lakoff presented the Neural Theory of Language (NTL)[108] as a computational basis for using language as a model of learning tasks and understanding. The NTL model outlines how specific neural structures of the human brain shape the nature of thought and language, and in turn which computational properties of such neural systems can be applied to model thought and language in a computer system. After a framework for modeling language in computer systems was established, the focus shifted to establishing frameworks for computer systems to generate language with acceptable grammar. In his 2014 book titled The Language Myth: Why Language Is Not An Instinct, British cognitive linguist and digital communication technologist Vyvyan Evans mapped out the role of probabilistic context-free grammar (PCFG) in enabling NLP to model cognitive patterns and generate human-like language.[109] [110]

Evaluation

Perplexity

The canonical measure of the performance of an LLM is its perplexity on a given text corpus. Perplexity measures how well a model predicts the contents of a dataset; the higher the likelihood the model assigns to the dataset, the lower the perplexity. In mathematical terms, perplexity is the exponential of the average negative log likelihood per token.

\log(\text{Perplexity}) = -\frac{1}{N} \sum_{i=1}^{N} \log(\Pr(\text{token}_i \mid \text{context for token}_i))

Here, N is the number of tokens in the text corpus, and "context for token i" depends on the specific type of LLM. If the LLM is autoregressive, then "context for token i" is the segment of text appearing before token i. If the LLM is masked, then "context for token i" is the segment of text surrounding token i.

Because language models may overfit to training data, models are usually evaluated by their perplexity on a test set. This evaluation is potentially problematic for larger models which, as they are trained on increasingly large corpora of text, are increasingly likely to inadvertently include portions of any given test set.
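Given the probability a model assigned to each token of a test text, perplexity follows directly from its definition as the exponential of the average negative log-likelihood per token. The sketch below is a minimal illustration; the function name perplexity and the example probabilities are assumptions, and obtaining the per-token probabilities from an actual model is not shown.

```python
import math


def perplexity(token_probs):
    """Exponential of the average negative log-likelihood per token."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)


# A model that is always certain and correct has perplexity 1.
print(perplexity([1.0, 1.0, 1.0]))  # 1.0
# Assigning probability 1/4 to every token gives a perplexity of about 4,
# as if the model were choosing uniformly among four options.
print(round(perplexity([0.25, 0.25, 0.25, 0.25]), 6))  # 4.0
```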

BPW, BPC, and BPT

In information theory, the concept of entropy is intricately linked to perplexity, a relationship notably established by Claude Shannon.[111] This relationship is mathematically expressed as

\text{Entropy} = \log_2(\text{Perplexity})

Entropy, in this context, is commonly quantified in terms of bits per word (BPW) or bits per character (BPC), which hinges on whether the language model utilizes word-based or character-based tokenization.

Notably, in the case of larger language models that predominantly employ sub-word tokenization, bits per token (BPT) emerges as a seemingly more appropriate measure. However, due to the variance in tokenization methods across different Large Language Models (LLMs), BPT does not serve as a reliable metric for comparative analysis among diverse models. To convert BPT into BPW, one can multiply it by the average number of tokens per word.
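The conversion can be illustrated with a small worked example. The function names are hypothetical, and the per-token perplexity of 8 and the ratio of 1.5 tokens per word are made-up example values rather than measurements of any model.

```python
import math


def bits_per_token(token_perplexity):
    """Entropy in bits per token, as log2 of the per-token perplexity."""
    return math.log2(token_perplexity)


def bits_per_word(bpt, tokens_per_word):
    """Convert BPT to BPW using the average number of tokens per word."""
    return bpt * tokens_per_word


bpt = bits_per_token(8)         # hypothetical per-token perplexity of 8
print(bpt)                      # 3.0
print(bits_per_word(bpt, 1.5))  # 4.5, assuming 1.5 tokens per word on average
```

Because the tokens-per-word ratio depends on the tokenizer and the language, two models with the same BPT can have different BPW.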

In the evaluation and comparison of language models, cross-entropy is generally the preferred metric over entropy. The underlying principle is that a lower BPW is indicative of a model's enhanced capability for compression. This, in turn, reflects the model's proficiency in making accurate predictions.

Task-specific datasets and benchmarks

A large number of testing datasets and benchmarks have also been developed to evaluate the capabilities of language models on more specific downstream tasks. Tests may be designed to evaluate a variety of capabilities, including general knowledge, commonsense reasoning, and mathematical problem-solving.

One broad category of evaluation dataset is question answering datasets, consisting of pairs of questions and correct answers, for example, ("Have the San Jose Sharks won the Stanley Cup?", "No").[112] A question answering task is considered "open book" if the model's prompt includes text from which the expected answer can be derived (for example, the previous question could be adjoined with some text which includes the sentence "The Sharks have advanced to the Stanley Cup finals once, losing to the Pittsburgh Penguins in 2016."). Otherwise, the task is considered "closed book", and the model must draw on knowledge retained during training.[113] Some examples of commonly used question answering datasets include TruthfulQA, Web Questions, TriviaQA, and SQuAD.

Evaluation datasets may also take the form of text completion, having the model select the most likely word or sentence to complete a prompt, for example: "Alice was friends with Bob. Alice went to visit her friend, ____".

Some composite benchmarks have also been developed which combine a diversity of different evaluation datasets and tasks. Examples include GLUE, SuperGLUE, MMLU, BIG-bench, and HELM. OpenAI has released tools for running composite benchmarks, but noted that the eval results are sensitive to the prompting method. Some public datasets contain questions that are mislabeled, ambiguous, unanswerable, or otherwise of low-quality, which can be cleaned to give more reliable benchmark scores.[114]

It was previously standard to report results on a heldout portion of an evaluation dataset after doing supervised fine-tuning on the remainder. It is now more common to evaluate a pre-trained model directly through prompting techniques, though researchers vary in the details of how they formulate prompts for particular tasks, particularly with respect to how many examples of solved tasks are adjoined to the prompt (i.e. the value of n in n-shot prompting).

Adversarially constructed evaluations

Because of the rapid pace of improvement of large language models, evaluation benchmarks have suffered from short lifespans, with state of the art models quickly "saturating" existing benchmarks, exceeding the performance of human annotators, leading to efforts to replace or augment the benchmark with more challenging tasks.[115] In addition, there are cases of "shortcut learning" wherein AIs sometimes "cheat" on multiple-choice tests by using statistical correlations in superficial test question wording in order to guess the correct responses, without necessarily understanding the actual question being asked.

Some datasets have been constructed adversarially, focusing on particular problems on which extant language models seem to have unusually poor performance compared to humans. One example is the TruthfulQA dataset, a question answering dataset consisting of 817 questions which language models are susceptible to answering incorrectly by mimicking falsehoods to which they were repeatedly exposed during training. For example, an LLM may answer "No" to the question "Can you teach an old dog new tricks?" because of its exposure to the English idiom "you can't teach an old dog new tricks", even though this is not literally true.[116]

Another example of an adversarial evaluation dataset is Swag and its successor, HellaSwag, collections of problems in which one of multiple options must be selected to complete a text passage. The incorrect completions were generated by sampling from a language model and filtering with a set of classifiers. The resulting problems are trivial for humans, but at the time the datasets were created, state-of-the-art language models had poor accuracy on them. For example:

We see a fitness center sign. We then see a man talking to the camera and sitting and laying on a exercise ball. The man...
a) demonstrates how to increase efficient exercise work by running up and down balls.
b) moves all his arms and legs and builds up a lot of muscle.
c) then plays the ball and we see a graphics and hedge trimming demonstration.
d) performs sit ups while on the ball and talking.[117]

BERT selects b) as the most likely completion, though the correct answer is d).
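Multiple-choice completion tasks of this kind are commonly scored by having the model rate each candidate ending and taking the highest-rated one as its answer. The sketch below shows only that selection and accuracy step. The scoring function is passed in as a parameter because the source does not specify one; `made_up_score` returns invented numbers purely to exercise the code and is not output from any model.

```python
from typing import Callable, List, Sequence

ScoreFn = Callable[[str, str], float]


def pick_completion(context: str, options: Sequence[str], score: ScoreFn) -> int:
    """Return the index of the option that the scorer rates highest."""
    scores = [score(context, option) for option in options]
    return max(range(len(options)), key=scores.__getitem__)


def accuracy(predictions: List[int], labels: List[int]) -> float:
    """Fraction of items where the predicted index matches the labeled one."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def made_up_score(context: str, option: str) -> float:
    # Invented values standing in for a model's score of each ending.
    return {"first": -3.0, "second": -1.5, "third": -2.0}[option]


chosen = pick_completion("some passage", ["first", "second", "third"], made_up_score)
```

In a real evaluation, `score` would typically be derived from the language model itself, for example the log-probability it assigns to the ending given the passage.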

Wider impact

In 2023, Nature Biomedical Engineering wrote that "it is no longer possible to accurately distinguish" human-written text from text created by large language models, and that "It is all but certain that general-purpose large language models will rapidly proliferate... It is a rather safe bet that they will change many industries over time."[118] Goldman Sachs suggested in 2023 that generative language AI could increase global GDP by 7% in the next ten years, and could expose 300 million jobs globally to automation.[119] [120]

Memorization and copyright

Memorization is an emergent behavior in LLMs in which long strings of text are occasionally output verbatim from training data, contrary to the typical behavior of traditional artificial neural networks. Evaluations of controlled LLM output (focused on GPT-2-series models) put the amount memorized from training data variously at over 1% for exact duplicates[121] or at up to about 7%.[122]
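One simple way to make "output verbatim from training data" measurable is to count how many generated samples contain a span of n consecutive tokens that also appears in the training text. The sketch below is an illustrative definition only and is not the methodology of the cited studies; whitespace tokenization, the function names, and the choice of n are all assumptions.

```python
from typing import Iterable, List, Set, Tuple

NGram = Tuple[str, ...]


def ngrams(tokens: List[str], n: int) -> Set[NGram]:
    """All contiguous n-token windows of a token list (empty if too short)."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def memorized_fraction(outputs: Iterable[str], training_docs: Iterable[str], n: int) -> float:
    """Share of outputs containing at least one n-token span copied from training text."""
    seen: Set[NGram] = set()
    for doc in training_docs:
        seen |= ngrams(doc.split(), n)
    samples = list(outputs)
    if not samples:
        return 0.0
    # An output counts as a hit if any of its n-token spans occurs in the training text.
    hits = sum(1 for text in samples if ngrams(text.split(), n) & seen)
    return hits / len(samples)


# Hypothetical toy data, used only to exercise the function.
training = ["the quick brown fox jumps over the lazy dog"]
generated = ["he said the quick brown fox jumps today",
             "a completely different sentence here now"]
fraction = memorized_fraction(generated, training, n=4)
```

Larger values of n make the test stricter, since short overlaps are common in ordinary language and do not indicate memorization.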

Security

Some commenters expressed concern over accidental or deliberate creation of misinformation, or other forms of misuse.[123] For example, the availability of large language models could reduce the skill level required to commit bioterrorism; biosecurity researcher Kevin Esvelt has suggested that LLM creators should exclude from their training data papers on creating or enhancing pathogens.[124]

A study by researchers at Google and several universities, including Cornell University and the University of California, Berkeley, showed that there are potential security risks in language models such as ChatGPT. The researchers examined and confirmed the possibility that questioners could extract from ChatGPT the training data that the model used. For example, when ChatGPT 3.5 turbo is asked to repeat the word "poem" forever, the model says "poem" hundreds of times and then diverges, deviating from the standard dialogue style and emitting nonsense phrases, thereby reproducing training data verbatim. The researchers observed more than 10,000 examples of the model exposing its training data in this way, and said that it was hard to tell whether the model was actually safe.[125]

The potential presence of "sleeper agents" within LLMs is another emerging security concern. These are hidden functionalities built into the model that remain dormant until triggered by a specific event or condition. Upon activation, the LLM deviates from its expected behavior and takes insecure actions.[126]

LLM applications accessible to the public, like ChatGPT or Claude, typically incorporate safety measures designed to filter out harmful content. However, implementing these controls effectively has proven challenging. For instance, a 2023 study[127] proposed a method for circumventing LLM safety systems. Similarly, Yongge Wang[128] illustrated in 2024 how a potential criminal could bypass ChatGPT 4o's safety controls to obtain information on establishing a drug trafficking operation.

Algorithmic bias

See main article: Algorithmic bias. While LLMs have shown remarkable capabilities in generating human-like text, they are susceptible to inheriting and amplifying biases present in their training data. This can manifest in skewed representations or unfair treatment of different demographics, such as those based on race, gender, language, and cultural groups.[129] Since English data is overrepresented in current large language models' training data, these models may also downplay non-English views.[130]

Stereotyping

AI models can reinforce a wide range of stereotypes, including those based on gender, ethnicity, age, nationality, religion, or occupation. This can lead to outputs that unfairly generalize or caricature groups of people, sometimes in harmful or derogatory ways.

Notably, gender bias refers to the tendency of these models to produce outputs that are unfairly prejudiced towards one gender over another. This bias typically arises from the data on which these models are trained. Large language models often assign roles and characteristics based on traditional gender norms. For example, a model might associate nurses or secretaries predominantly with women and engineers or CEOs with men.[131]

Political bias

Political bias refers to the tendency of algorithms to systematically favor certain political viewpoints, ideologies, or outcomes over others. Language models may also exhibit political biases. Since the training data includes a wide range of political opinions and coverage, the models might generate responses that lean towards particular political ideologies or viewpoints, depending on the prevalence of those views in the data.[132]

List of large language models

See also: List of chatbots. For the training cost column, 1 petaFLOP-day = 1 petaFLOP/sec × 1 day = 8.64E19 FLOP. For models released in several sizes, only the cost of the largest model is listed.
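The unit conversion stated above can be checked with a short calculation. The sketch below only restates the arithmetic; the constant and function names are chosen for illustration.

```python
PETAFLOP_PER_SEC = 1e15              # floating-point operations per second
SECONDS_PER_DAY = 24 * 60 * 60       # 86,400 seconds
FLOP_PER_PETAFLOP_DAY = PETAFLOP_PER_SEC * SECONDS_PER_DAY


def flop_to_petaflop_days(flop: float) -> float:
    """Convert a total operation count to petaFLOP-days."""
    return flop / FLOP_PER_PETAFLOP_DAY


def petaflop_days_to_flop(petaflop_days: float) -> float:
    """Convert petaFLOP-days back to a total operation count."""
    return petaflop_days * FLOP_PER_PETAFLOP_DAY


# The 3.8E25 FLOP figure quoted in the list for the 405B Llama 3.1 model.
llama_3_1_petaflop_days = flop_to_petaflop_days(3.8e25)
```

With this conversion, the 3.8E25 FLOP quoted for Llama 3.1 in the list corresponds to roughly 440,000 petaFLOP-days.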

Name Release date Developer Number of parameters (billion) Corpus size Training cost (petaFLOP-day) License Notes
1[133] [134] First GPT model, decoder-only transformer. Trained for 30 days on 8 P600 GPUs.
[135] words[136] [137] An early and influential language model. Encoder-only and thus not built to be prompted or generative.[138] Training took 4 days on 64 TPUv2 chips.[139]
T5 Google 11[140] 34 billion tokens Base model for many Google projects, such as Imagen.[141]
[142] billion words 330[143] An alternative to BERT; designed as encoder-only. Trained on 512 TPU v3 chips for 5.5 days.[144]
40GB[145] (~ tokens)[146] 28[147] [148] Trained on 32 TPUv3 chips for 1 week.
OpenAI tokens 3640[149] A fine-tuned variant of GPT-3, termed GPT-3.5, was made available to the public through a web interface called ChatGPT in 2022.
GPT-Neo [150] 825 GiB[151] The first of a series of free GPT-3 alternatives released by EleutherAI. GPT-Neo outperformed an equivalent-size GPT-3 model on some benchmarks, but was significantly worse than the largest GPT-3.
[152] 825 GiB 200[153] GPT-3-style language model
Megatron-Turing NLG [154] tokens 38000 Trained for 3 months on over 2000 A100 GPUs on the NVIDIA Selene Supercomputer, for over 3 million GPU-hours.
Ernie 3.0 Titan [155] 4 Tb Chinese-language LLM. Ernie Bot is based on this model.
Claude[156] [157] tokens Fine-tuned for desirable behavior in conversations.[158]
GLaM (Generalist Language Model) Google tokens 5600 Sparse mixture of experts model, making it more expensive to train but cheaper to run inference compared to GPT-3.
Gopher [159] tokens 5833[160] Later developed into the Chinchilla model.
LaMDA (Language Models for Dialog Applications) Google 1.56T words, tokens 4110[161] Specialized for response generation in conversations.
GPT-NeoX [162] 825 GiB 740 Based on the Megatron architecture
tokens[163] 6805 Reduced-parameter model trained on more data. Used in the Sparrow bot. Often cited for its neural scaling law.
PaLM (Pathways Language Model) Google tokens Trained for ~60 days on ~6000 TPU v4 chips. It is the largest dense Transformer published.
OPT (Open Pretrained Transformer) [164] tokens[165] 310 GPT-3 architecture with some adaptations from Megatron. Uniquely, the training logbook written by the team was published.[166]
YaLM 100B 1.7TB English-Russian model based on Microsoft's Megatron-LM.
Minerva Google 38.5B tokens from webpages filtered for mathematical content and from papers submitted to the arXiv preprint server[167] For solving "mathematical and scientific questions using step-by-step reasoning".[168] Initialized from PaLM models, then finetuned on mathematical and scientific data.
[169] tokens (1.6TB)[170] Essentially GPT-3 but trained on a multi-lingual corpus (30% English excluding programming languages)
Galactica tokens[171] unknown Trained on scientific text and modalities.
AlexaTM (Teacher Models) [172] [173] [174] bidirectional sequence-to-sequence architecture
Independent Unknown Unknown A language model designed for live-streaming on Twitch.
LLaMA (Large Language Model Meta AI) Meta AI 6300[175] Corpus has 20 languages. "Overtrained" (compared to Chinchilla scaling law) for better performance with fewer parameters.
OpenAI Unknown (According to rumors: 1760)[176] Unknown Unknown Available for ChatGPT Plus users and used in several products.
Chameleon Meta AI[177]
Cerebras-GPT Cerebras[178] 270 Trained with Chinchilla formula.
Falcon [179] 1 trillion tokens, from RefinedWeb (filtered web text corpus)[180] plus some "curated corpora".[181] 2800[182]
BloombergGPT 363 billion token dataset based on Bloomberg's data sources, plus 345 billion tokens from general purpose datasets[183] Trained on financial data from proprietary sources, for financial tasks.
329 billion tokens[184]
OpenAssistant[185] 1.5 trillion tokens Trained on crowdsourced open data
Jurassic-2[186] AI21 Labs Unknown Unknown Multilingual[187]
PaLM 2 (Pathways Language Model 2) Google [188] tokens Was used in Bard chatbot.[189]
Llama 2 Meta AI [190] tokens 1.7 million A100-hours.[191]
Claude 2 Anthropic Unknown Unknown Unknown Used in Claude chatbot.[192]
Granite 13b IBM Unknown Unknown Unknown Used in IBM Watsonx.[193]
Mistral 7B [194] Unknown
Claude 2.1 Anthropic Unknown Unknown Unknown Used in Claude chatbot. Has a context window of 200,000 tokens, or ~500 pages.[195]
Grok-1 x.AI 314 Unknown Unknown Used in Grok chatbot. Grok-1 has a context length of 8,192 tokens and has access to X (Twitter).[196]
Gemini 1.0 Google DeepMind Unknown Unknown Unknown Multimodal model, comes in three sizes. Used in the chatbot of the same name.[197]
Mixtral 8x7B Mistral AI 46.7 Unknown Unknown Outperforms GPT-3.5 and Llama 2 70B on many benchmarks.[198] Mixture of experts model, with 12.9 billion parameters activated per token.[199]
Mixtral 8x22B Mistral AI 141 Unknown Unknown [200]
Phi-2 Microsoft 2.7 1.4T tokens 419 Trained on real and synthetic "textbook-quality" data, for 14 days on 96 A100 GPUs.[201]
Gemini 1.5 Google DeepMind Unknown Unknown Unknown Multimodal model, based on a Mixture-of-Experts (MoE) architecture. Context window above 1 million tokens.[202]
Gemini Ultra Google DeepMind Unknown Unknown Unknown
Gemma 7 6T tokens Unknown [203]
Claude 3 March 2024 Anthropic Unknown Unknown Unknown Includes three models, Haiku, Sonnet, and Opus.[204]
Nova October 2024 Rubik's AI Unknown Unknown Unknown Includes three models, Nova-Instant, Nova-Air, and Nova-Pro.
DBRX March 2024 Databricks and Mosaic ML 12T Tokens Training cost 10 million USD.
Fugaku-LLM May 2024 Fujitsu, Tokyo Institute of Technology, etc. 380B Tokens The largest model ever trained on CPU-only, on the Fugaku.[205]
Phi-3 Microsoft 14[206] 4.8T Tokens Microsoft markets them as "small language model".[207]
Granite Code Models IBM Unknown Unknown Unknown
Qwen2 Alibaba Cloud 72[208] 3T Tokens Multiple sizes, the smallest being 0.5B.
Nemotron-4 June 2024 Nvidia 9T Tokens Trained for 1 epoch. Trained on 6144 H100 GPUs between December 2023 and May 2024.[209] [210]
Llama 3.1 July 2024 Meta AI 405 15.6T tokens 405B version took 31 million hours on H100-80GB, at 3.8E25 FLOPs.[211] [212]

Notes and References

  1. Web site: 2019-02-14 . Better Language Models and Their Implications . live . https://web.archive.org/web/20201219132206/https://openai.com/blog/better-language-models/ . 2020-12-19 . 2019-08-25 . OpenAI.
  2. Brown . Tom B. . Mann . Benjamin . Ryder . Nick . Subbiah . Melanie . Kaplan . Jared . Dhariwal . Prafulla . Neelakantan . Arvind . Shyam . Pranav . Sastry . Girish . Askell . Amanda . Agarwal . Sandhini . Herbert-Voss . Ariel . Krueger . Gretchen . Henighan . Tom . Child . Rewon . Dec 2020 . Larochelle . H. . Ranzato . M. . Hadsell . R. . Balcan . M.F. . Lin . H. . Language Models are Few-Shot Learners . Advances in Neural Information Processing Systems . Curran Associates, Inc. . 33 . 1877–1901 . Chess . Hesse . Christopher . Chen . Mark . Sigler . Eric . Litwin . Mateusz . Gray . Scott . Jack . Benjamin . Clark . Winter . Berner . Christopher . McCandlish . Sam . Radford . Alec . Sutskever . Ilya . Amodei . Dario . Clemens . Jeffrey . Wu . Ramesh . Aditya . Ziegler . Daniel M. . 2023-03-14 . 2023-11-17 . https://web.archive.org/web/20231117204007/https://proceedings.neurips.cc/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf . live .
  3. NeOn-GPT: A Large Language Model-Powered Pipeline for Ontology Learning . Fathallah . Nadeen . Das . Arunav . De Giorgis . Stefano . Poltronieri . Andrea . Haase . Peter . Kovriguina . Liubov . 2024-05-26 . Hersonissos, Greece . Extended Semantic Web Conference 2024.
  4. Manning . Christopher D. . Christopher D. Manning . 2022 . Human Language Understanding & Reasoning . Daedalus . 151 . 2 . 127–138 . 10.1162/daed_a_01905 . 248377870 . free . 2023-03-09 . 2023-11-17 . https://web.archive.org/web/20231117205531/https://www.amacad.org/publication/human-language-understanding-reasoning . live .
  5. Kilgarriff . Adam . Grefenstette . Gregory . September 2003 . Introduction to the Special Issue on the Web as Corpus . Computational Linguistics . 29 . 3 . 333–347 . 10.1162/089120103322711569 . 0891-2017.
  6. Banko . Michele . Brill . Eric . 2001 . Scaling to very very large corpora for natural language disambiguation . Proceedings of the 39th Annual Meeting on Association for Computational Linguistics - ACL '01 . 26–33 . Morristown, NJ, USA . Association for Computational Linguistics . 10.3115/1073012.1073017.
  7. Resnik . Philip . Smith . Noah A. . September 2003 . The Web as a Parallel Corpus . Computational Linguistics . 29 . 3 . 349–380 . 10.1162/089120103322711578 . 0891-2017 . free . 2024-06-07 . 2024-06-07 . https://web.archive.org/web/20240607172811/https://direct.mit.edu/coli/article/29/3/349-380/1809 . live .
  8. Halevy . Alon . Norvig . Peter . Pereira . Fernando . March 2009 . The Unreasonable Effectiveness of Data . IEEE Intelligent Systems . 24 . 2 . 8–12 . 10.1109/MIS.2009.36 . 1541-1672.
  9. 10.3390/rs13224712 . free . Review of Image Classification Algorithms Based on Convolutional Neural Networks . 2021 . Chen . Leiyu . Li . Shaobo . Bai . Qiang . Yang . Jing . Jiang . Sanlong . Miao . Yanming . Remote Sensing . 13 . 22 . 4712 . 2021RemS...13.4712C .
  10. Vaswani . Ashish . Ashish Vaswani . Shazeer . Noam . Parmar . Niki . Uszkoreit . Jakob . Jones . Llion . Gomez . Aidan N . Aidan Gomez . Kaiser . Łukasz . Polosukhin . Illia . Attention is All you Need . Advances in Neural Information Processing Systems . 2017 . 30 . Curran Associates, Inc. . 2024-01-21 . 2024-02-21 . https://web.archive.org/web/20240221141113/https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf . live .
  11. Bahdanau . Dzmitry . Cho . Kyunghyun . Bengio . Yoshua . Neural Machine Translation by Jointly Learning to Align and Translate . 2014 . cs.CL . 1409.0473.
  12. Rogers. Anna. Kovaleva. Olga. Rumshisky. Anna. 2020. A Primer in BERTology: What We Know About How BERT Works. Transactions of the Association for Computational Linguistics. 8. 842–866. 10.1162/tacl_a_00349. 2002.12327. 211532403. 2024-01-21. 2022-04-03. https://web.archive.org/web/20220403103310/https://aclanthology.org/2020.tacl-1.54/. live.
  13. Web site: New AI fake text generator may be too dangerous to release, say creators . Hern . Alex . . 14 February 2019 . 20 January 2024 . 14 February 2019 . https://web.archive.org/web/20190214173112/https://www.theguardian.com/technology/2019/feb/14/elon-musk-backed-ai-writes-convincing-news-fiction . live .
  14. Web site: ChatGPT a year on: 3 ways the AI chatbot has completely changed the world in 12 months . . November 30, 2023 . . January 20, 2024 . January 14, 2024 . https://web.archive.org/web/20240114025250/https://www.euronews.com/next/2023/11/30/chatgpt-a-year-on-3-ways-the-ai-chatbot-has-completely-changed-the-world-in-12-months . live .
  15. Web site: GPT-4 is bigger and better than ChatGPT—but OpenAI won't say why . Heaven . Will . March 14, 2023 . . January 20, 2024 . March 17, 2023 . https://web.archive.org/web/20230317224201/https://www.technologyreview.com/2023/03/14/1069823/gpt-4-is-bigger-and-better-chatgpt-openai/ . live .
  16. Web site: Parameters in notable artificial intelligence systems . . November 30, 2023 . ourworldindata.org . January 20, 2024.
  17. Web site: LMSYS Chatbot Arena Leaderboard . . huggingface.co . June 12, 2024 . June 10, 2024 . https://web.archive.org/web/20240610162906/https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard . live .
  18. 2305.13048 . Peng . Bo . Alcaide . Eric . Anthony . Quentin . Albalak . Alon . Arcadinho . Samuel . Biderman . Stella . Cao . Huanqi . Cheng . Xin . Chung . Michael . Grella . Matteo . Kranthi Kiran GV . He . Xuzheng . Hou . Haowen . Lin . Jiaju . Kazienko . Przemyslaw . Kocon . Jan . Kong . Jiaming . Koptyra . Bartlomiej . Lau . Hayden . Krishna Sri Ipsit Mantri . Mom . Ferdinand . Saito . Atsushi . Song . Guangyu . Tang . Xiangru . Wang . Bolun . Wind . Johan S. . Wozniak . Stanislaw . Zhang . Ruichong . Zhang . Zhenyuan . Zhao . Qihang . RWKV: Reinventing RNNS for the Transformer Era . 2023 . cs.CL . 1 .
  19. Web site: Merritt . Rick . 2022-03-25 . What Is a Transformer Model? . 2023-07-25 . NVIDIA Blog . 2023-11-17 . https://web.archive.org/web/20231117203924/https://blogs.nvidia.com/blog/what-is-a-transformer-model/ . live .
  20. Web site: All languages are NOT created (tokenized) equal. Yennie Jun. 2023-05-03. 2023-08-17. In other words, to express the same sentiment, some languages require up to 10 times more tokens.. Language models cost much more in some languages than others. 2023-08-17. https://web.archive.org/web/20230817165705/https://blog.yenniejun.com/p/all-languages-are-not-created-tokenized. dead.
  21. Petrov . Aleksandar . Malfa . Emanuele La . Torr . Philip . Bibi . Adel . June 23, 2023 . Language Model Tokenizers Introduce Unfairness Between Languages . NeurIPS . 2305.15425 . openreview.net . September 16, 2023 . December 15, 2023 . https://web.archive.org/web/20231215212906/https://openreview.net/forum?id=Pj4YYuxTq9 . live .
  22. Web site: OpenAI API . https://web.archive.org/web/20230423211308/https://platform.openai.com/tokenizer . April 23, 2023 . 2023-04-30 . platform.openai.com.
  23. Book: Paaß . Gerhard . Foundation Models for Natural Language Processing . Giesselbach . Sven . 2022 . 9783031231902 . Artificial Intelligence: Foundations, Theory, and Algorithms . 19–78 . Pre-trained Language Models . 10.1007/978-3-031-23190-2_2 . 3 August 2023 . https://link.springer.com/chapter/10.1007/978-3-031-23190-2_2 . 3 August 2023 . https://web.archive.org/web/20230803212329/https://link.springer.com/chapter/10.1007/978-3-031-23190-2_2 . live .
  24. 2305.15425 . Petrov . Aleksandar . Emanuele La Malfa . Torr . Philip H. S. . Bibi . Adel . Language Model Tokenizers Introduce Unfairness Between Languages . 2023 . cs.CL .
  25. Web site: Lundberg . Scott . 2023-12-12 . The Art of Prompt Design: Prompt Boundaries and Token Healing . 2024-08-05 . Medium . en.
  26. 2104.08758 . cs.CL . Jesse . Dodge . Maarten . Sap . Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus . Marasović . Ana . Agnew . William . Ilharco . Gabriel . Groeneveld . Dirk . Mitchell . Margaret . Gardner . Matt . 2021.
  27. Lee . Katherine . Ippolito . Daphne . Nystrom . Andrew . Zhang . Chiyuan . Eck . Douglas . Callison-Burch . Chris . Carlini . Nicholas . May 2022 . Deduplicating Training Data Makes Language Models Better . Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics . 1: Long Papers . 8424–8445 . 10.18653/v1/2022.acl-long.577.
  28. Lin . Zhenghao . Rho-1: Not All Tokens Are What You Need . 2024-04-11 . 2404.07965 . Gou . Zhibin . Gong . Yeyun . Liu . Xiao . Shen . Yelong . Xu . Ruochen . Lin . Chen . Yang . Yujiu . Jiao . Jian. cs.CL .
  29. 2005.14165 . cs.CL . Tom B. . Brown . Benjamin . Mann . Language Models are Few-Shot Learners . Ryder . Nick . Subbiah . Melanie . Kaplan . Jared . Dhariwal . Prafulla . Neelakantan . Arvind . Shyam . Pranav . Sastry . Girish . Askell . Amanda . Agarwal . Sandhini . Herbert-Voss . Ariel . Krueger . Gretchen . Henighan . Tom . Child . Rewon . Ramesh . Aditya . Ziegler . Daniel M. . Wu . Jeffrey . Winter . Clemens . Hesse . Christopher . Chen . Mark . Sigler . Eric . Litwin . Mateusz . Gray . Scott . Chess . Benjamin . Clark . Jack . Berner . Christopher . McCandlish . Sam . Radford . Alec . Sutskever . Ilya . 2020 . 1.
  30. Abdin . Marah . Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone . 2024-04-23 . 2404.14219 . Jacobs . Sam Ade . Awan . Ammar Ahmad . Aneja . Jyoti . Awadallah . Ahmed . Awadalla . Hany . Bach . Nguyen . Bahree . Amit . Bakhtiari . Arash. cs.CL .
  31. 2203.02155 . cs.CL . Long . Ouyang . Jeff . Wu . Training language models to follow instructions with human feedback . 2022 . Jiang . Xu . Almeida . Diogo . Wainwright . Carroll L. . Mishkin . Pamela . Zhang . Chong . Agarwal . Sandhini . Slama . Katarina . Ray . Alex . Schulman . John . Hilton . Jacob . Kelton . Fraser . Miller . Luke . Simens . Maddie . Askell . Amanda . Welinder . Peter . Christiano . Paul . Leike . Jan . Lowe . Ryan.
  32. 2212.10560 . cs.CL . Yizhong . Wang . Yeganeh . Kordi . Self-Instruct: Aligning Language Model with Self Generated Instructions . 2022 . Mishra . Swaroop . Liu . Alisa . Smith . Noah A. . Khashabi . Daniel . Hajishirzi . Hannaneh.
  33. 1701.06538 . cs.LG . Noam . Shazeer . Azalia . Mirhoseini . Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer . 2017-01-01 . Maziarz . Krzysztof . Davis . Andy . Le . Quoc . Hinton . Geoffrey . Dean . Jeff.
  34. 2006.16668 . cs.CL . Dmitry . Lepikhin . HyoukJoong . Lee . GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding . 2021-01-12 . Xu . Yuanzhong . Chen . Dehao . Firat . Orhan . Huang . Yanping . Krikun . Maxim . Shazeer . Noam . Chen . Zhifeng.
  35. Web site: Allamar . Jay . The Illustrated GPT-2 (Visualizing Transformer Language Models) . 2023-08-01 .
  36. Web site: Our next-generation model: Gemini 1.5 . Google . 18 February 2024 . 15 February 2024 . 18 February 2024 . https://web.archive.org/web/20240218141522/https://blog.google/technology/ai/google-gemini-next-generation-model-february-2024/#context-window . live .
  37. Web site: Long context prompting for Claude 2.1 . December 6, 2023 . January 20, 2024 . August 27, 2024 . https://web.archive.org/web/20240827053830/https://www.anthropic.com/news/claude-2-1-prompting . live .
  38. Web site: Rate limits . . openai.com . January 20, 2024 . February 2, 2024 . https://web.archive.org/web/20240202003219/https://platform.openai.com/docs/guides/rate-limits . live .
  39. Book: Zaib . Munazza . Sheng . Quan Z. . Emma Zhang . Wei . Proceedings of the Australasian Computer Science Week Multiconference . A Short Survey of Pre-trained Language Models for Conversational AI-A New Age in NLP . 4 February 2020 . https://www.researchgate.net/publication/338931711 . 1–4 . 2104.10810 . 10.1145/3373017.3373028 . 9781450376976 . 211040895.
  40. Book: Jurafsky . Dan . Speech and Language Processing . Martin . James H. . 7 January 2023 . 3rd edition draft . 24 May 2022 . 23 March 2023 . https://web.archive.org/web/20230323210221/https://web.stanford.edu/~jurafsky/slp3/ed3book_jan72023.pdf . live .
  41. Web site: From bare metal to a 70B model: infrastructure set-up and scripts . 2024-07-24 . imbue.com . en-US . 2024-07-26 . https://web.archive.org/web/20240726203419/https://imbue.com/research/70b-infrastructure/ . live .
  42. Web site: metaseq/projects/OPT/chronicles at main · facebookresearch/metaseq . 2024-07-24 . GitHub . en . 2024-01-24 . https://web.archive.org/web/20240124035658/https://github.com/facebookresearch/metaseq/tree/main/projects/OPT/chronicles . live .
  43. Web site: Albrecht . Josh . 2024-07-23 . State of the Art: Training >70B LLMs on 10,000 H100 clusters . 2024-07-24 . www.latent.space . en.
  44. Web site: Wiggers . Kyle . 28 April 2022 . The emerging types of language models and why they matter . TechCrunch . 9 March 2023 . 16 March 2023 . https://web.archive.org/web/20230316072443/https://techcrunch.com/2022/04/28/the-emerging-types-of-language-models-and-why-they-matter/ . live .
  45. 2004.08900 . cs.CL . Or . Sharir . Barak . Peleg . The Cost of Training NLP Models: A Concise Overview . Shoham . Yoav . 2020.
  46. 2304.01373 . cs.CL . Stella . Biderman . Hailey . Schoelkopf . Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling . April 2023 . Anthony . Quentin . Bradley . Herbie . Khan . Mohammad Aflah . Purohit . Shivanshu . Prashanth . USVSN Sai.
  47. Section 2.1 and Table 1, 2001.08361 . cs.LG . Jared . Kaplan . Sam . McCandlish . Scaling Laws for Neural Language Models . Henighan . Tom . Brown . Tom B. . Chess . Benjamin . Child . Rewon . Gray . Scott . Radford . Alec . Wu . Jeffrey . Amodei . Dario . 2020.
  48. 2211.10435 . cs.CL . Luyu . Gao . Aman . Madaan . PAL: Program-aided Language Models . 2022-11-01 . Zhou . Shuyan . Alon . Uri . Liu . Pengfei . Yang . Yiming . Callan . Jamie . Neubig . Graham.
  49. Web site: PAL: Program-aided Language Models . 2023-06-12 . reasonwithpal.com . 2023-06-12 . https://web.archive.org/web/20230612162208/https://reasonwithpal.com/ . live .
  50. 2303.09014 . cs.CL . Bhargavi . Paranjape . Scott . Lundberg . ART: Automatic multi-step reasoning and tool-use for large language models . 2023-03-01 . Singh . Sameer . Hajishirzi . Hannaneh . Zettlemoyer . Luke . Tulio Ribeiro . Marco.
  51. 2303.16434 . cs.AI . Yaobo . Liang . Chenfei . Wu . TaskMatrix.AI: Completing Tasks by Connecting Foundation Models with Millions of APIs . 2023-03-01 . Song . Ting . Wu . Wenshan . Xia . Yan . Liu . Yu . Ou . Yang . Lu . Shuai . Ji . Lei . Mao . Shaoguang . Wang . Yun . Shou . Linjun . Gong . Ming . Duan . Nan.
  52. Patil . Shishir G. . Zhang . Tianjun . Wang . Xin . Gonzalez . Joseph E. . 2023-05-01 . Gorilla: Large Language Model Connected with Massive APIs . cs.CL . 2305.15334.
  53. Lewis . Patrick . Perez . Ethan . Piktus . Aleksandra . Petroni . Fabio . Karpukhin . Vladimir . Goyal . Naman . Küttler . Heinrich . Lewis . Mike . Yih . Wen-tau . Rocktäschel . Tim . Riedel . Sebastian . Kiela . Douwe . 2020 . Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks . Advances in Neural Information Processing Systems . Curran Associates, Inc. . 33 . 9459–9474 . 2005.11401 . 2023-06-12 . 2023-06-12 . https://web.archive.org/web/20230612171229/https://proceedings.neurips.cc/paper/2020/hash/6b493230205f780e1bc26945df7481e5-Abstract.html . live .
  54. Web site: October 23, 2023 . The Growth Behind LLM-based Autonomous Agents . KDnuggets.
  55. 2210.03629 . cs.CL . Shunyu . Yao . Jeffrey . Zhao . ReAct: Synergizing Reasoning and Acting in Language Models . 2022-10-01 . Yu . Dian . Du . Nan . Shafran . Izhak . Narasimhan . Karthik . Cao . Yuan.
  56. 2305.15486 . cs.AI . Yue . Wu . Shrimai . Prabhumoye . SPRING: GPT-4 Out-performs RL Algorithms by Studying Papers and Reasoning . 24 May 2023 . Min . So Yeon.
  57. 2302.01560 . cs.AI . Zihao . Wang . Shaofei . Cai . Describe, Explain, Plan and Select: Interactive Planning with Large Language Models Enables Open-World Multi-Task Agents . 2023-02-03 . Liu . Anji . Ma . Xiaojian . Liang . Yitao.
  58. Shinn . Noah . Cassano . Federico . Labash . Beck . Gopinath . Ashwin . Narasimhan . Karthik . Yao . Shunyu . 2023-03-01 . Reflexion: Language Agents with Verbal Reinforcement Learning . cs.AI . 2303.11366.
  59. 2305.14992 . cs.CL . Shibo . Hao . Yi . Gu . Reasoning with Language Model is Planning with World Model . 2023-05-01 . Ma . Haodi . Jiahua Hong . Joshua . Wang . Zhen . Zhe Wang . Daisy . Hu . Zhiting.
  60. 2306.01711 . cs.AI . Jenny . Zhang . Joel . Lehman . OMNI: Open-endedness via Models of human Notions of Interestingness . 2 June 2023 . Stanley . Kenneth . Clune . Jeff.
  61. Web site: Voyager An Open-Ended Embodied Agent with Large Language Models . 2023-06-09 . voyager.minedojo.org . 2023-06-08 . https://web.archive.org/web/20230608225054/https://voyager.minedojo.org/ . live .
  62. Park . Joon Sung . O'Brien . Joseph C. . Cai . Carrie J. . Ringel Morris . Meredith . Liang . Percy . Bernstein . Michael S. . 2023-04-01 . Generative Agents: Interactive Simulacra of Human Behavior . cs.HC . 2304.03442.
  63. Web site: Mann . Tobias . How to run an LLM locally on your PC in less than 10 minutes . 2024-05-17 . www.theregister.com .
  64. Nagel . Markus . Amjad . Rana Ali . Baalen . Mart Van . Louizos . Christos . Blankevoort . Tijmen . 2020-11-21 . Up or Down? Adaptive Rounding for Post-Training Quantization . Proceedings of the 37th International Conference on Machine Learning . PMLR . 7197–7206 . 2023-06-14 . 2023-06-14 . https://web.archive.org/web/20230614080854/https://proceedings.mlr.press/v119/nagel20a.html . live .
  65. 1802.05668 . cs.NE . Antonio . Polino . Razvan . Pascanu . Model compression via distillation and quantization . 2018-02-01 . Alistarh . Dan.
  66. 2210.17323 . cs.LG . Elias . Frantar . Saleh . Ashkboos . GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers . 2022-10-01 . Hoefler . Torsten . Alistarh . Dan.
  67. 2306.03078 . cs.CL . Tim . Dettmers . Ruslan . Svirschevski . SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression . 2023-06-01 . Egiazarian . Vage . Kuznedelev . Denis . Frantar . Elias . Ashkboos . Saleh . Borzunov . Alexander . Hoefler . Torsten . Alistarh . Dan.
  68. Web site: Grootendorst . Maarten . A Visual Guide to Quantization . https://web.archive.org/web/20240731003355/https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-quantization . 31 Jul 2024 . 2024-07-31 . newsletter.maartengrootendorst.com . en.
  69. 2305.14314 . cs.LG . Tim . Dettmers . Artidoro . Pagnoni . QLoRA: Efficient Finetuning of Quantized LLMs . 2023-05-01 . Holtzman . Ari . Ari Holtzman . Zettlemoyer . Luke.
  70. Kiros . Ryan . Salakhutdinov . Ruslan . Zemel . Rich . 2014-06-18 . Multimodal Neural Language Models . Proceedings of the 31st International Conference on Machine Learning . PMLR . 595–603 . 2023-07-02 . 2023-07-02 . https://web.archive.org/web/20230702195952/https://proceedings.mlr.press/v32/kiros14.html . live .
  71. Krizhevsky . Alex . Sutskever . Ilya . Hinton . Geoffrey E . 2012 . ImageNet Classification with Deep Convolutional Neural Networks . Advances in Neural Information Processing Systems . Curran Associates, Inc. . 25 . 2023-07-02 . 2023-07-02 . https://web.archive.org/web/20230702195952/https://proceedings.neurips.cc/paper/2012/hash/c399862d3b9d6b76c8436e924a68c45b-Abstract.html . live .
  72. Antol . Stanislaw . Agrawal . Aishwarya . Lu . Jiasen . Mitchell . Margaret . Batra . Dhruv . Zitnick . C. Lawrence . Parikh . Devi . 2015 . VQA: Visual Question Answering . ICCV . 2425–2433 . 2023-07-02 . 2023-07-02 . https://web.archive.org/web/20230702195952/https://openaccess.thecvf.com/content_iccv_2015/html/Antol_VQA_Visual_Question_ICCV_2015_paper.html . live .
  73. Li . Junnan . Li . Dongxu . Savarese . Silvio . Hoi . Steven . 2023-01-01 . BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models . cs.CV . 2301.12597 .
  74. Alayrac . Jean-Baptiste . Donahue . Jeff . Luc . Pauline . Miech . Antoine . Barr . Iain . Hasson . Yana . Lenc . Karel . Mensch . Arthur . Millican . Katherine . Reynolds . Malcolm . Ring . Roman . Rutherford . Eliza . Cabi . Serkan . Han . Tengda . Gong . Zhitao . 2022-12-06 . Flamingo: a Visual Language Model for Few-Shot Learning . Advances in Neural Information Processing Systems . 35 . 23716–23736 . 2204.14198 . 2023-07-02 . 2023-07-02 . https://web.archive.org/web/20230702195951/https://proceedings.neurips.cc/paper_files/paper/2022/hash/960a172bc7fbf0177ccccbb411a7d800-Abstract-Conference.html . live .
  75. Driess . Danny . Xia . Fei . Sajjadi . Mehdi S. M. . Lynch . Corey . Chowdhery . Aakanksha . Ichter . Brian . Wahid . Ayzaan . Tompson . Jonathan . Vuong . Quan . Yu . Tianhe . Huang . Wenlong . Chebotar . Yevgen . Sermanet . Pierre . Duckworth . Daniel . Levine . Sergey . 2023-03-01 . PaLM-E: An Embodied Multimodal Language Model . cs.LG . 2303.03378 .
  76. Liu . Haotian . Li . Chunyuan . Wu . Qingyang . Lee . Yong Jae . 2023-04-01 . Visual Instruction Tuning . cs.CV . 2304.08485 .
  77. Zhang . Hang . Li . Xin . Bing . Lidong . 2023-06-01 . Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding . cs.CL . 2306.02858 .
  78. 2303.08774 . cs.CL . OpenAI . GPT-4 Technical Report . 2023-03-27.
  79. Web site: OpenAI . September 25, 2023 . GPT-4V(ision) System Card .
  80. Web site: Wiggers . Kyle . Mistral releases Pixtral 12B, its first multimodal model . TechCrunch . 14 September 2024 . 11 September 2024.
  81. 2203.15556 . cs.CL . Jordan . Hoffmann . Sebastian . Borgeaud . Training Compute-Optimal Large Language Models . 2022-03-29 . Mensch . Arthur . Buchatskaya . Elena . Cai . Trevor . Rutherford . Eliza . Casas . Diego de Las . Hendricks . Lisa Anne . Welbl . Johannes . Clark . Aidan . Hennigan . Tom . Noland . Eric . Millican . Katie . Driessche . George van den . Damoc . Bogdan.
  82. 2210.14891 . cs.LG . Ethan . Caballero . Kshitij . Gupta . Broken Neural Scaling Laws . Rish . Irina . Krueger . David . 2022.
  83. Wei . Jason . Tay . Yi . Bommasani . Rishi . Raffel . Colin . Zoph . Barret . Borgeaud . Sebastian . Yogatama . Dani . Bosma . Maarten . Zhou . Denny . Metzler . Donald . Chi . Ed H. . Hashimoto . Tatsunori . Vinyals . Oriol . Liang . Percy . Dean . Jeff . 31 August 2022 . Emergent Abilities of Large Language Models . Transactions on Machine Learning Research . 2835-8856 . Fedus . William . 19 March 2023 . 22 March 2023 . https://web.archive.org/web/20230322210052/https://openreview.net/forum?id=yzkSU5zdwD . live .
  84. Web site: 137 emergent abilities of large language models . 2023-06-24 . Jason Wei .
  85. 2304.00612 . cs.CL . Samuel R. . Bowman . Eight Things to Know about Large Language Models . 2023.
  86. Mukherjee . Anirban . Chang . Hannah . Heuristic Reasoning in AI: Instrumental Use and Mimetic Absorption . 2024 . cs.AI . 2403.09404.
  87. 2303.07971 . cs.LG . Michael . Hahn . Navin . Goyal . A Theory of Emergent In-Context Learning as Implicit Structure Induction . 2023-03-14.
  88. Pilehvar . Mohammad Taher . Camacho-Collados . Jose . WiC: the Word-in-Context Dataset for Evaluating Context-Sensitive Meaning Representations . June 2019 . Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) . Minneapolis, Minnesota . Association for Computational Linguistics . 1267–1273 . 10.18653/v1/N19-1128 . 102353817 . 2023-06-27 . 2023-06-27 . https://web.archive.org/web/20230627202732/https://aclanthology.org/N19-1128/ . live .
  89. Web site: WiC: The Word-in-Context Dataset . 2023-06-27 . pilehvar.github.io . 2023-06-27 . https://web.archive.org/web/20230627202725/https://pilehvar.github.io/wic/ . live .
  90. Patel . Roma . Pavlick . Ellie . 2021-10-06 . Mapping Language Models to Grounded Conceptual Spaces . ICLR . 2023-06-27 . 2023-06-24 . https://web.archive.org/web/20230624191940/https://openreview.net/forum?id=gJcEM8sxHK . live .
  91. A Closer Look at Large Language Models Emergent Abilities (Yao Fu, Nov 20, 2022)
  92. Web site: Ornes . Stephen . March 16, 2023 . The Unpredictable Abilities Emerging From Large AI Models . Quanta Magazine . March 16, 2023 . March 16, 2023 . https://web.archive.org/web/20230316203438/https://www.quantamagazine.org/the-unpredictable-abilities-emerging-from-large-ai-models-20230316/ . live .
  93. 2304.15004 . cs.AI . Rylan . Schaeffer . Brando . Miranda . Are Emergent Abilities of Large Language Models a Mirage? . 2023-04-01 . Koyejo . Sanmi.
  94. 2210.13382 . cs.LG . Kenneth . Li . Aspen K. . Hopkins . Emergent World Representations: Exploring a Sequence Model Trained on a Synthetic Task . 2022-10-01 . Bau . David . Viégas . Fernanda . Pfister . Hanspeter . Wattenberg . Martin.
  95. Web site: 2023-01-21 . Large Language Model: world models or surface statistics? . 2023-06-12 . The Gradient .
  96. 2305.11169 . cs.LG . Charles . Jin . Martin . Rinard . Evidence of Meaning in Language Models Trained on Programs . 2023-05-01.
  97. 2301.05217 . cs.LG . Neel . Nanda . Lawrence . Chan . Progress measures for grokking via mechanistic interpretability . 2023-01-01 . Lieberum . Tom . Smith . Jess . Steinhardt . Jacob.
  98. Mitchell . Melanie . Krakauer . David C. . 28 March 2023 . The debate over understanding in AI's large language models . Proceedings of the National Academy of Sciences . 120 . 13 . e2215907120 . 2210.13966 . 2023PNAS..12015907M . 10.1073/pnas.2215907120 . 10068812 . 36943882 .
  99. News: Metz . Cade . 16 May 2023 . Microsoft Says New A.I. Shows Signs of Human Reasoning . The New York Times .
  100. 2303.12712 . cs.CL . Sébastien . Bubeck . Varun . Chandrasekaran . Sparks of Artificial General Intelligence: Early experiments with GPT-4 . 2023 . Eldan . Ronen . Gehrke . Johannes . Horvitz . Eric . Kamar . Ece . Lee . Peter . Lee . Yin Tat . Li . Yuanzhi . Lundberg . Scott . Nori . Harsha . Palangi . Hamid . Ribeiro . Marco Tulio . Zhang . Yi.
  101. News: October 17, 2024 . Anthropic CEO Dario Amodei pens a smart look at our AI future . Fast Company.
  102. News: 2023 . ChatGPT is more like an 'alien intelligence' than a human brain, says futurist . ZDNET . 12 June 2023 . 12 June 2023 . https://web.archive.org/web/20230612065937/https://www.zdnet.com/article/chatgpt-is-more-like-an-alien-intelligence-than-a-human-brain-says-futurist/ . live .
  103. Newport . Cal . 13 April 2023 . What Kind of Mind Does ChatGPT Have? . The New Yorker . 12 June 2023 . 12 June 2023 . https://web.archive.org/web/20230612071443/https://www.newyorker.com/science/annals-of-artificial-intelligence/what-kind-of-mind-does-chatgpt-have . live .
  104. News: Roose . Kevin . 30 May 2023 . Why an Octopus-like Creature Has Come to Symbolize the State of A.I. . The New York Times . 12 June 2023 . 30 May 2023 . https://web.archive.org/web/20230530193814/https://www.nytimes.com/2023/05/30/technology/shoggoth-meme-ai.html . live .
  105. News: 13 April 2023 . The A to Z of Artificial Intelligence . Time Magazine . 12 June 2023 . 16 June 2023 . https://web.archive.org/web/20230616123839/https://time.com/6271657/a-to-z-of-artificial-intelligence/ . live .
  106. Ji . Ziwei . Lee . Nayeon . Frieske . Rita . Yu . Tiezheng . Su . Dan . Xu . Yan . Ishii . Etsuko . Bang . Yejin . Dai . Wenliang . Madotto . Andrea . Fung . Pascale . November 2022 . Survey of Hallucination in Natural Language Generation . ACM Computing Surveys . 55 . 12 . 1–38 . 2202.03629 . 10.1145/3571730 . 246652372 . 15 January 2023 . 26 March 2023 . https://web.archive.org/web/20230326145635/https://dl.acm.org/doi/pdf/10.1145/3571730 . live .
  107. Varshney . Neeraj . Yao . Wenlin . Zhang . Hongming . Chen . Jianshu . Yu . Dong . A Stitch in Time Saves Nine: Detecting and Mitigating Hallucinations of LLMs by Validating Low-Confidence Generation . 2023 . cs.CL . 2307.03987 .
  108. Book: Lakoff, George . Philosophy in the Flesh: The Embodied Mind and Its Challenge to Western Philosophy; Appendix: The Neural Theory of Language Paradigm . New York Basic Books. 1999. 978-0-465-05674-3. 569–583.
  109. Book: Evans, Vyvyan. . The Language Myth . Cambridge University Press . 2014. 978-1-107-04396-1.
  110. Book: Friston, Karl J. . Active Inference: The Free Energy Principle in Mind, Brain, and Behavior; Chapter 4 The Generative Models of Active Inference . The MIT Press. 2022. 978-0-262-36997-8.
  111. Web site: Evaluation Metrics for Language Modeling . Huyen . Chip . October 18, 2019 . The Gradient . January 14, 2024.
  112. 1905.10044 . cs.CL . Christopher . Clark . Kenton . Lee . BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions . Chang . Ming-Wei . Kwiatkowski . Tom . Collins . Michael . Toutanova . Kristina . 2019.
  113. 2303.18223 . cs.CL . Wayne Xin Zhao . Kun . Zhou . A Survey of Large Language Models . Li . Junyi . Tang . Tianyi . Wang . Xiaolei . Hou . Yupeng . Min . Yingqian . Zhang . Beichen . Zhang . Junjie . Dong . Zican . Du . Yifan . Yang . Chen . Chen . Yushuo . Chen . Zhipeng . Jiang . Jinhao . Ren . Ruiyang . Li . Yifan . Tang . Xinyu . Liu . Zikang . Liu . Peiyu . Nie . Jian-Yun . Wen . Ji-Rong . 2023.
  114. Web site: Sanitized open-source datasets for natural language and code understanding: how we evaluated our 70B model . 2024-07-24 . imbue.com . en-US . 2024-07-26 . https://web.archive.org/web/20240726173012/https://imbue.com/research/70b-evals/ . live .
  115. 2206.04615 . cs.CL . Aarohi . Srivastava . Abhinav . Rastogi . Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models . Rao . Abhishek . Abu Awal Md Shoeb . Abid . Abubakar . Fisch . Adam . Brown . Adam R. . Santoro . Adam . Gupta . Aditya . Garriga-Alonso . Adrià . Kluska . Agnieszka . Lewkowycz . Aitor . Agarwal . Akshat . Power . Alethea . Ray . Alex . Warstadt . Alex . Kocurek . Alexander W. . Safaya . Ali . Tazarv . Ali . Xiang . Alice . Parrish . Alicia . Nie . Allen . Hussain . Aman . Askell . Amanda . Dsouza . Amanda . Slone . Ambrose . Rahane . Ameet . Iyer . Anantharaman S. . Andreassen . Anders . Madotto . Andrea . 2022 . 1.
  116. 2109.07958 . cs.CL . Stephanie . Lin . Jacob . Hilton . TruthfulQA: Measuring How Models Mimic Human Falsehoods . Evans . Owain . 2021.
  117. 1905.07830 . cs.CL . Rowan . Zellers . Ari . Holtzman . HellaSwag: Can a Machine Really Finish Your Sentence? . Bisk . Yonatan . Farhadi . Ali . Choi . Yejin . 2019.
  118. 7 March 2023 . Prepare for truly useful large language models . Nature Biomedical Engineering . 7 . 2 . 85–86 . 10.1038/s41551-023-01012-6 . 36882584 . 257403466.
  119. News: 7 May 2023 . Your job is (probably) safe from artificial intelligence . The Economist . 18 June 2023 . 17 June 2023 . https://web.archive.org/web/20230617225618/https://www.economist.com/finance-and-economics/2023/05/07/your-job-is-probably-safe-from-artificial-intelligence . live .
  120. Web site: Generative AI Could Raise Global GDP by 7% . 18 June 2023 . Goldman Sachs . 18 June 2023 . https://web.archive.org/web/20230618013836/https://www.goldmansachs.com/intelligence/pages/generative-ai-could-raise-global-gdp-by-7-percent.html . live .
  121. Peng . Zhencan . Wang . Zhizhi . Deng . Dong . Near-Duplicate Sequence Search at Scale for Large Language Model Memorization Evaluation . Proceedings of the ACM on Management of Data . 13 June 2023 . 1 . 2 . 1–18 . 10.1145/3589324 . 259213212 . 2024-01-20 . 2024-08-27 . https://web.archive.org/web/20240827053753/https://people.cs.rutgers.edu/~dd903/assets/papers/sigmod23.pdf . live . Citing Lee et al 2022.
  122. .
  123. News: Alba . Davey . 1 May 2023 . AI chatbots have been used to create dozens of news content farms . The Japan Times . 18 June 2023.
  124. 14 June 2023 . Could chatbots help devise the next pandemic virus? . Science . 10.1126/science.adj2463 . 18 June 2023 . 18 June 2023 . https://web.archive.org/web/20230618013834/https://www.science.org/content/article/could-chatbots-help-devise-next-pandemic-virus . live .
  125. Web site: Stephen Council . 1 Dec 2023 . How Googlers cracked an SF rival's tech model with a single word . SFGATE . 16 December 2023 . https://web.archive.org/web/20231216160941/https://www.sfgate.com/tech/article/google-openai-chatgpt-break-model-18525445.php . live .
  126. Hubinger . Evan . 10 January 2024 . Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training. cs.CR . 2401.05566.
  127. Kang . Daniel . 2023 . Exploiting programmatic behavior of LLMs: Dual-use through standard security attacks. cs.CR . 2302.05733.
  128. Web site: Wang . Yongge . 20 June 2024 . Encryption Based Covert Channel for Large Language Models . IACR ePrint 2024/586 . 24 June 2024 . 24 June 2024 . https://web.archive.org/web/20240624191233/https://eprint.iacr.org/2024/586.pdf . live .
  129. Web site: Stokel-Walker . Chris . November 22, 2023 . ChatGPT Replicates Gender Bias in Recommendation Letters . 2023-12-29 . Scientific American . 2023-12-29 . https://web.archive.org/web/20231229043124/https://www.scientificamerican.com/article/chatgpt-replicates-gender-bias-in-recommendation-letters/ . live .
  130. 2303.16281v2 . cs.CY . Queenie . Luo . Michael J. . Puett . A Perspectival Mirror of the Elephant: Investigating Language Bias on Google, ChatGPT, Wikipedia, and YouTube . 2023-03-28 . Smith . Michael D..
  131. Book: Kotek . Hadas . Proceedings of the ACM Collective Intelligence Conference . Dockum . Rikker . Sun . David . 2023-11-05 . Association for Computing Machinery . 979-8-4007-0113-9 . CI '23 . New York, NY, USA . 12–24 . Gender bias and stereotypes in Large Language Models . 10.1145/3582269.3615599 . https://dl.acm.org/doi/10.1145/3582269.3615599.
  132. Web site: Heikkilä . Melissa . August 7, 2023 . AI language models are rife with different political biases . 2023-12-29 . MIT Technology Review .
  133. Web site: June 11, 2018 . Improving language understanding with unsupervised learning . live . https://web.archive.org/web/20230318210736/https://openai.com/research/language-unsupervised . 2023-03-18 . 2023-03-18 . openai.com .
  134. Web site: GitHub. finetune-transformer-lm. 2 January 2024. 19 May 2023. https://web.archive.org/web/20230519062127/https://github.com/openai/finetune-transformer-lm. live.
  135. Devlin . Jacob . Chang . Ming-Wei . Lee . Kenton . Toutanova . Kristina . BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding . 11 October 2018 . 1810.04805v2. cs.CL .
  136. Web site: Prickett . Nicole Hemsoth . 2021-08-24 . Cerebras Shifts Architecture To Meet Massive AI/ML Models . 2023-06-20 . The Next Platform . 2023-06-20 . https://web.archive.org/web/20230620151619/https://www.nextplatform.com/2021/08/24/cerebras-shifts-architecture-to-meet-massive-ai-ml-models/ . live .
  137. Web site: BERT. March 13, 2023. GitHub. March 13, 2023. January 13, 2021. https://web.archive.org/web/20210113211317/https://github.com/google-research/bert. live.
  138. Patel . Ajay . Li . Bryan . Rasooli . Mohammad Sadegh . Constant . Noah . Raffel . Colin . Callison-Burch . Chris . Bidirectional Language Models Are Also Few-shot Learners . 2022 . cs.LG . 2209.14500.
  139. 1810.04805v2 . cs.CL . Jacob . Devlin . Ming-Wei . Chang . BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding . 11 October 2018 . Lee . Kenton . Toutanova . Kristina.
  140. Raffel . Colin . Shazeer . Noam . Roberts . Adam . Lee . Katherine . Narang . Sharan . Matena . Michael . Zhou . Yanqi . Li . Wei . Liu . Peter J. . 2020 . Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer . Journal of Machine Learning Research . 21 . 140 . 1–67 . 1910.10683 . 1533-7928.
  141. Web site: Imagen: Text-to-Image Diffusion Models . 2024-04-04 . imagen.research.google . 2024-03-27 . https://web.archive.org/web/20240327201713/https://imagen.research.google/ . live .
  142. Web site: Pretrained models — transformers 2.0.0 documentation . 2024-08-05 . huggingface.co . 2024-08-05 . https://web.archive.org/web/20240805032110/https://huggingface.co/transformers/v2.0.0/pretrained_models.html . live .
  143. Web site: GitHub. xlnet. 2 January 2024. 2 January 2024. https://web.archive.org/web/20240102191842/https://github.com/zihangdai/xlnet/. live.
  144. Yang . Zhilin . Dai . Zihang . Yang . Yiming . Carbonell . Jaime . Salakhutdinov . Ruslan . Le . Quoc V. . XLNet: Generalized Autoregressive Pretraining for Language Understanding . 2 January 2020 . cs.CL . 1906.08237.
  145. Web site: Better language models and their implications . openai.com . 2023-03-13 . 2023-03-16 . https://web.archive.org/web/20230316160730/https://openai.com/research/better-language-models . live .
  146. Web site: OpenAI's GPT-3 Language Model: A Technical Overview . lambdalabs.com . 3 June 2020 . 13 March 2023 . 27 March 2023 . https://web.archive.org/web/20230327213811/https://lambdalabs.com/blog/demystifying-gpt-3 . live .
  147. Web site: openai-community/gpt2-xl · Hugging Face . 2024-07-24 . huggingface.co . 2024-07-24 . https://web.archive.org/web/20240724041702/https://huggingface.co/openai-community/gpt2-xl . live .
  148. Web site: GitHub. gpt-2. 13 March 2023. 11 March 2023. https://web.archive.org/web/20230311154936/https://github.com/openai/gpt-2. live.
  149. Table D.1 in Brown . Tom B. . Mann . Benjamin . Ryder . Nick . Subbiah . Melanie . Kaplan . Jared . Dhariwal . Prafulla . Neelakantan . Arvind . Shyam . Pranav . Sastry . Girish . Askell . Amanda . Agarwal . Sandhini . Herbert-Voss . Ariel . Krueger . Gretchen . Henighan . Tom . Child . Rewon . May 28, 2020 . Language Models are Few-Shot Learners . 2005.14165v4 . Aditya . Ramesh . Daniel M. . Ziegler . Jeffrey . Wu . Clemens . Winter . Christopher . Hesse . Mark . Chen . Eric . Sigler . Mateusz . Litwin . Scott . Gray . Benjamin . Chess . Jack . Clark . Christopher . Berner . Sam . McCandlish . Alec . Radford . Ilya . Sutskever . Dario . Amodei. cs.CL.
  150. Web site: GPT Neo. March 15, 2023. GitHub. March 12, 2023. March 12, 2023. https://web.archive.org/web/20230312225202/https://github.com/EleutherAI/gpt-neo. live.
  151. Gao . Leo . Biderman . Stella . Black . Sid . Golding . Laurence . Hoppe . Travis . Foster . Charles . Phang . Jason . He . Horace . Thite . Anish . Nabeshima . Noa . Presser . Shawn . Leahy . Connor . The Pile: An 800GB Dataset of Diverse Text for Language Modeling . 2101.00027. 31 December 2020 . cs.CL.
  152. Web site: GPT-J-6B: An Introduction to the Largest Open Source GPT Model Forefront . 2023-02-28 . www.forefront.ai . 2023-03-09 . https://web.archive.org/web/20230309205439/https://www.forefront.ai/blog-posts/gpt-j-6b-an-introduction-to-the-largest-open-sourced-gpt-model . dead .
  153. Dey . Nolan . Gosal . Gurpreet . Zhiming . Chen . Khachane . Hemant . Marshall . William . Pathria . Ribhu . Tom . Marvin . Hestness . Joel . 2023-04-01 . Cerebras-GPT: Open Compute-Optimal Language Models Trained on the Cerebras Wafer-Scale Cluster . cs.LG . 2304.03208.
  154. Web site: Alvi . Ali . Kharya . Paresh . Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, the World's Largest and Most Powerful Generative Language Model . Microsoft Research . 11 October 2021 . 13 March 2023 . 13 March 2023 . https://web.archive.org/web/20230313180531/https://www.microsoft.com/en-us/research/blog/using-deepspeed-and-megatron-to-train-megatron-turing-nlg-530b-the-worlds-largest-and-most-powerful-generative-language-model/ . live .
  155. ERNIE 3.0 Titan: Exploring Larger-scale Knowledge Enhanced Pre-training for Language Understanding and Generation. Shuohuan. Wang. Yu. Sun. Yang. Xiang. Zhihua. Wu. Siyu. Ding. Weibao. Gong. Shikun. Feng. Junyuan. Shang. Yanbin. Zhao. Chao. Pang. Jiaxiang. Liu. Xuyi. Chen. Yuxiang. Lu. Weixin. Liu. Xi. Wang. Yangfan. Bai. Qiuliang. Chen. Li. Zhao. Shiyong. Li. Peng. Sun. Dianhai. Yu. Yanjun. Ma. Hao. Tian. Hua. Wu. Tian. Wu. Wei. Zeng. Ge. Li. Wen. Gao. Haifeng. Wang. December 23, 2021. cs.CL . 2112.12731.
  156. Web site: Product . Anthropic . 14 March 2023 . 16 March 2023 . https://web.archive.org/web/20230316145444/https://www.anthropic.com/product . live .
  157. Askell . Amanda . Bai . Yuntao . Chen . Anna . Drain . Dawn . Ganguli . Deep . Henighan . Tom . Jones . Andy . Joseph . Nicholas . Mann . Ben . DasSarma . Nova . Elhage . Nelson . Hatfield-Dodds . Zac . Hernandez . Danny . Kernion . Jackson . Ndousse . Kamal . Olsson . Catherine . Amodei . Dario . Brown . Tom . Clark . Jack . McCandlish . Sam . Olah . Chris . Kaplan . Jared . A General Language Assistant as a Laboratory for Alignment . 2112.00861 . 9 December 2021 . cs.CL.
  158. Bai . Yuntao . Kadavath . Saurav . Kundu . Sandipan . Askell . Amanda . Kernion . Jackson . Jones . Andy . Chen . Anna . Goldie . Anna . Mirhoseini . Azalia . McKinnon . Cameron . Chen . Carol . Olsson . Catherine . Olah . Christopher . Hernandez . Danny . Drain . Dawn . Ganguli . Deep . Li . Dustin . Tran-Johnson . Eli . Perez . Ethan . Kerr . Jamie . Mueller . Jared . Ladish . Jeffrey . Landau . Joshua . Ndousse . Kamal . Lukosuite . Kamile . Lovitt . Liane . Sellitto . Michael . Elhage . Nelson . Schiefer . Nicholas . Mercado . Noemi . DasSarma . Nova . Lasenby . Robert . Larson . Robin . Ringer . Sam . Johnston . Scott . Kravec . Shauna . Showk . Sheer El . Fort . Stanislav . Lanham . Tamera . Telleen-Lawton . Timothy . Conerly . Tom . Henighan . Tom . Hume . Tristan . Bowman . Samuel R. . Hatfield-Dodds . Zac . Mann . Ben . Amodei . Dario . Joseph . Nicholas . McCandlish . Sam . Brown . Tom . Kaplan . Jared . Constitutional AI: Harmlessness from AI Feedback . 2212.08073 . 15 December 2022 . cs.CL.
  159. Web site: Language modelling at scale: Gopher, ethical considerations, and retrieval . www.deepmind.com . 8 December 2021 . 20 March 2023 . 20 March 2023 . https://web.archive.org/web/20230320082323/https://www.deepmind.com/blog/language-modelling-at-scale-gopher-ethical-considerations-and-retrieval . live .
  160. Table 20 and page 66 of PaLM: Scaling Language Modeling with Pathways
  161. Thoppilan . Romal . De Freitas . Daniel . Hall . Jamie . Shazeer . Noam . Kulshreshtha . Apoorv . Cheng . Heng-Tze . Jin . Alicia . Bos . Taylor . Baker . Leslie . Du . Yu . Li . YaGuang . Lee . Hongrae . Zheng . Huaixiu Steven . Ghafouri . Amin . Menegali . Marcelo . 2022-01-01 . LaMDA: Language Models for Dialog Applications . cs.CL . 2201.08239.
  162. GPT-NeoX-20B: An Open-Source Autoregressive Language Model . Proceedings of BigScience Episode #5 – Workshop on Challenges & Perspectives in Creating Large Language Models . 2022-05-01 . Black . Sidney . Biderman . Stella . Hallahan . Eric . etal . 95–136 . 2022-12-19 . 2022-12-10 . https://web.archive.org/web/20221210082456/https://aclanthology.org/2022.bigscience-1.9/ . live .
  163. Hoffmann . Jordan . Borgeaud . Sebastian . Mensch . Arthur . Buchatskaya . Elena . Cai . Trevor . Rutherford . Eliza . Casas . Diego de Las . Hendricks . Lisa Anne . Welbl . Johannes . Clark . Aidan . Hennigan . Tom . Noland . Eric . Millican . Katie . Driessche . George van den . Damoc . Bogdan . Guy . Aurelia . Osindero . Simon . Simonyan . Karen . Elsen . Erich . Rae . Jack W. . Vinyals . Oriol . Sifre . Laurent . Training Compute-Optimal Large Language Models . 2203.15556 . 29 March 2022 . cs.CL.
  164. Web site: Democratizing access to large-scale language models with OPT-175B . Susan Zhang . Mona Diab . Luke Zettlemoyer . ai.facebook.com . 2023-03-12 . 2023-03-12 . https://web.archive.org/web/20230312231820/https://ai.facebook.com/blog/democratizing-access-to-large-scale-language-models-with-opt-175b/ . live .
  165. Zhang . Susan . Roller . Stephen . Goyal . Naman . Artetxe . Mikel . Chen . Moya . Chen . Shuohui . Dewan . Christopher . Diab . Mona . Li . Xian . Lin . Xi Victoria . Mihaylov . Todor . Ott . Myle . Shleifer . Sam . Shuster . Kurt . Simig . Daniel . Koura . Punit Singh . Sridhar . Anjali . Wang . Tianlu . Zettlemoyer . Luke . OPT: Open Pre-trained Transformer Language Models . 2205.01068 . 21 June 2022. cs.CL.
  166. Web site: metaseq/projects/OPT/chronicles at main · facebookresearch/metaseq . 2024-10-18 . GitHub . en.
  167. Lewkowycz . Aitor . Andreassen . Anders . Dohan . David . Dyer . Ethan . Michalewski . Henryk . Ramasesh . Vinay . Slone . Ambrose . Anil . Cem . Schlag . Imanol . Gutman-Solo . Theo . Wu . Yuhuai . Neyshabur . Behnam . Gur-Ari . Guy . Misra . Vedant . Solving Quantitative Reasoning Problems with Language Models . 30 June 2022 . cs.CL . 2206.14858.
  168. Web site: Minerva: Solving Quantitative Reasoning Problems with Language Models . ai.googleblog.com . 30 June 2022 . 20 March 2023 .
  169. Nature . Ananthaswamy . Anil . In AI, is bigger always better? . 8 March 2023 . 615 . 7951 . 202–205 . 10.1038/d41586-023-00641-w . 36890378 . 2023Natur.615..202A . 257380916 . 9 March 2023 . 16 March 2023 . https://web.archive.org/web/20230316181013/https://www.nature.com/articles/d41586-023-00641-w . live .
  170. Web site: bigscience/bloom · Hugging Face . huggingface.co . 2023-03-13 . 2023-04-12 . https://web.archive.org/web/20230412002547/https://huggingface.co/bigscience/bloom . live .
  171. Taylor . Ross . Kardas . Marcin . Cucurull . Guillem . Scialom . Thomas . Hartshorn . Anthony . Saravia . Elvis . Poulton . Andrew . Kerkez . Viktor . Stojnic . Robert . Galactica: A Large Language Model for Science . 16 November 2022 . cs.CL . 2211.09085.
  172. Web site: 20B-parameter Alexa model sets new marks in few-shot learning . Amazon Science . 2 August 2022 . 12 March 2023 . 15 March 2023 . https://web.archive.org/web/20230315190223/https://www.amazon.science/blog/20b-parameter-alexa-model-sets-new-marks-in-few-shot-learning . live .
  173. Soltan . Saleh . Ananthakrishnan . Shankar . FitzGerald . Jack . Gupta . Rahul . Hamza . Wael . Khan . Haidar . Peris . Charith . Rawls . Stephen . Rosenbaum . Andy . Rumshisky . Anna . Prakash . Chandana Satya . Sridhar . Mukund . Triefenbach . Fabian . Verma . Apurv . Tur . Gokhan . Natarajan . Prem . AlexaTM 20B: Few-Shot Learning Using a Large-Scale Multilingual Seq2Seq Model . 2208.01448 . 3 August 2022. cs.CL.
  174. Web site: AlexaTM 20B is now available in Amazon SageMaker JumpStart AWS Machine Learning Blog . aws.amazon.com . 13 March 2023 . 17 November 2022 . 13 March 2023 . https://web.archive.org/web/20230313163933/https://aws.amazon.com/blogs/machine-learning/alexatm-20b-is-now-available-in-amazon-sagemaker-jumpstart/ . live .
  175. Web site: The Falcon has landed in the Hugging Face ecosystem . 2023-06-20 . huggingface.co . 2023-06-20 . https://web.archive.org/web/20230620002832/https://huggingface.co/blog/falcon . live .
  176. Web site: Schreiner . Maximilian . 2023-07-11 . GPT-4 architecture, datasets, costs and more leaked . 2024-07-26 . THE DECODER . en-US . 2023-07-12 . https://web.archive.org/web/20230712123915/https://the-decoder.com/gpt-4-architecture-datasets-costs-and-more-leaked/ . live .
  177. News: Dickson . Ben . Meta introduces Chameleon, a state-of-the-art multimodal model . VentureBeat . 22 May 2024.
  178. Web site: Cerebras-GPT: A Family of Open, Compute-efficient, Large Language Models. Nolan. Dey. March 28, 2023. Cerebras. March 28, 2023. March 28, 2023. https://web.archive.org/web/20230328213339/https://www.cerebras.net/blog/cerebras-gpt-a-family-of-open-compute-efficient-large-language-models/. live.
  179. Web site: Abu Dhabi-based TII launches its own version of ChatGPT . tii.ae . 2023-04-03 . 2023-04-03 . https://web.archive.org/web/20230403021729/https://fastcompanyme.com/news/abu-dhabi-based-tii-launches-its-own-version-of-chatgpt/ . live .
  180. Penedo . Guilherme . Malartic . Quentin . Hesslow . Daniel . Cojocaru . Ruxandra . Cappelli . Alessandro . Alobeidli . Hamza . Pannier . Baptiste . Almazrouei . Ebtesam . Launay . Julien . 2023-06-01 . The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only . cs.CL . 2306.01116.
  181. Web site: 2023-06-09 . tiiuae/falcon-40b · Hugging Face . 2023-06-20 . huggingface.co.
  182. Web site: UAE's Falcon 40B, World's Top-Ranked AI Model from Technology Innovation Institute, is Now Royalty-Free . businesswire.com . https://www.businesswire.com/news/home/20230531005608/en/UAE's-Falcon-40B-World's-Top-Ranked-AI-Model-from-Technology-Innovation-Institute-is-Now-Royalty-Free .
  183. BloombergGPT: A Large Language Model for Finance. Shijie. Wu. Ozan. Irsoy. Steven. Lu. Vadim. Dabravolski. Mark. Dredze. Sebastian. Gehrmann. Prabhanjan. Kambadur. David. Rosenberg. Gideon. Mann. March 30, 2023. cs.LG . 2303.17564.
  184. PanGu-Σ: Towards Trillion Parameter Language Model with Sparse Heterogeneous Computing. Xiaozhe. Ren. Pingyi. Zhou. Xinfan. Meng. Xinjing. Huang. Yadao. Wang. Weichao. Wang. Pengfei. Li. Xiaoda. Zhang. Alexander. Podolskiy. Grigory. Arshinov. Andrey. Bout. Irina. Piontkovskaya. Jiansheng. Wei. Xin. Jiang. Teng. Su. Qun. Liu. Jun. Yao. March 19, 2023. cs.CL . 2303.10845.
  185. Köpf . Andreas . Kilcher . Yannic . von Rütte . Dimitri . Anagnostidis . Sotiris . Tam . Zhi-Rui . Stevens . Keith . Barhoum . Abdullah . Duc . Nguyen Minh . Stanley . Oliver . Nagyfi . Richárd . ES . Shahul . Suri . Sameer . Glushkov . David . Dantuluri . Arnav . Maguire . Andrew . 2023-04-14 . OpenAssistant Conversations – Democratizing Large Language Model Alignment . cs.CL . 2304.07327.
  186. Web site: Wrobel . Sharon . Tel Aviv startup rolls out new advanced AI language model to rival OpenAI . 2023-07-24 . www.timesofisrael.com . 2023-07-24 . https://web.archive.org/web/20230724191823/https://www.timesofisrael.com/ai21-labs-rolls-out-new-advanced-ai-language-model-to-rival-openai/ . live .
  187. Web site: Wiggers . Kyle . 2023-04-13 . With Bedrock, Amazon enters the generative AI race . 2023-07-24 . TechCrunch . 2023-07-24 . https://web.archive.org/web/20230724102458/https://techcrunch.com/2023/04/13/with-bedrock-amazon-enters-the-generative-ai-race/ . live .
  188. Web site: Elias . Jennifer . Google's newest A.I. model uses nearly five times more text data for training than its predecessor . CNBC . 16 May 2023 . 18 May 2023 . 16 May 2023 . https://web.archive.org/web/20230516225326/https://www.cnbc.com/2023/05/16/googles-palm-2-uses-nearly-five-times-more-text-data-than-predecessor.html . live .
  189. Web site: Introducing PaLM 2. May 10, 2023. Google. May 18, 2023. May 18, 2023. https://web.archive.org/web/20230518213209/https://blog.google/technology/ai/google-palm-2-ai-large-language-model/. live.
  190. Web site: Introducing Llama 2: The Next Generation of Our Open Source Large Language Model . 2023-07-19 . Meta AI . 2023 . 2024-01-05 . https://web.archive.org/web/20240105234629/https://ai.meta.com/llama/ . live .
  191. Web site: llama/MODEL_CARD.md at main · meta-llama/llama . 2024-05-28 . GitHub . 2024-05-28 . https://web.archive.org/web/20240528090541/https://github.com/meta-llama/llama/blob/main/MODEL_CARD.md . live .
  192. Web site: Claude 2 . anthropic.com . 12 December 2023 . 15 December 2023 . https://web.archive.org/web/20231215212208/https://www.anthropic.com/index/claude-2 . live .
  193. Web site: Nirmal . Dinesh . 2023-09-07 . Building AI for business: IBM's Granite foundation models . 2024-08-11 . IBM Blog . en-US . 2024-07-22 . https://web.archive.org/web/20240722083855/https://www.ibm.com/blog/building-ai-for-business-ibms-granite-foundation-models/ . live .
  194. Web site: Announcing Mistral 7B . 2023-10-06 . Mistral . 2023 . 2024-01-06 . https://web.archive.org/web/20240106051047/https://mistral.ai/news/announcing-mistral-7b/ . live .
  195. Web site: Introducing Claude 2.1 . anthropic.com . 12 December 2023 . 15 December 2023 . https://web.archive.org/web/20231215201726/https://www.anthropic.com/index/claude-2-1 . live .
  196. Web site: Grok-1 model card . x.ai . 12 December 2023.
  197. Web site: Gemini – Google DeepMind . deepmind.google . 12 December 2023 . 8 December 2023 . https://web.archive.org/web/20231208015607/https://deepmind.google/technologies/gemini/#capabilities . live .
  198. Web site: Franzen . Carl . Mistral shocks AI community as latest open source model eclipses GPT-3.5 performance . VentureBeat . 12 December 2023 . 11 December 2023 . 11 December 2023 . https://web.archive.org/web/20231211213640/https://venturebeat.com/ai/mistral-shocks-ai-community-as-latest-open-source-model-eclipses-gpt-3-5-performance/ . live .
  199. Web site: 11 December 2023 . Mixtral of experts . 12 December 2023 . mistral.ai . 13 February 2024 . https://web.archive.org/web/20240213104049/https://mistral.ai/news/mixtral-of-experts/ . live .
  200. Web site: AI . Mistral . 2024-04-17 . Cheaper, Better, Faster, Stronger . 2024-05-05 . mistral.ai . 2024-05-05 . https://web.archive.org/web/20240505023828/https://mistral.ai/news/mixtral-8x22b/ . live .
  201. Web site: Hughes . Alyssa . Phi-2: The surprising power of small language models . Microsoft Research . 13 December 2023 . 12 December 2023 . 12 December 2023 . https://web.archive.org/web/20231212232647/https://www.microsoft.com/en-us/research/blog/phi-2-the-surprising-power-of-small-language-models/ . live .
  202. Web site: Our next-generation model: Gemini 1.5 . Google . 16 February 2024 . 15 February 2024 . This means 1.5 Pro can process vast amounts of information in one go — including 1 hour of video, 11 hours of audio, codebases with over 30,000 lines of code or over 700,000 words. In our research, we’ve also successfully tested up to 10 million tokens. . 16 February 2024 . https://web.archive.org/web/20240216003052/https://blog.google/technology/ai/google-gemini-next-generation-model-february-2024/#context-window . live .
  203. Web site: Gemma. GitHub.
  204. Web site: Introducing the next generation of Claude . 2024-03-04 . www.anthropic.com . 2024-03-04 . https://web.archive.org/web/20240304143650/https://www.anthropic.com/news/claude-3-family . live .
  205. Web site: Fugaku-LLM/Fugaku-LLM-13B · Hugging Face . 2024-05-17 . huggingface.co . 2024-05-17 . https://web.archive.org/web/20240517135225/https://huggingface.co/Fugaku-LLM/Fugaku-LLM-13B . live .
  206. Web site: Phi-3. 2024-04-28. azure.microsoft.com. 23 April 2024. 2024-04-27. https://web.archive.org/web/20240427043835/https://azure.microsoft.com/en-us/blog/introducing-phi-3-redefining-whats-possible-with-slms/. live.
  207. Web site: Phi-3 Model Documentation. 2024-04-28. huggingface.co. 2024-05-13. https://web.archive.org/web/20240513141513/https://huggingface.co/docs/transformers/main/en/model_doc/phi3. live.
  208. Web site: Qwen2. GitHub. 2024-06-17. 2024-06-17. https://web.archive.org/web/20240617072401/https://github.com/QwenLM/Qwen2?spm=a3c0i.28768018.7084722650.1.5cd35c10NEqBXm&file=Qwen1.5. live.
  209. Web site: 2024-06-14 . nvidia/Nemotron-4-340B-Base · Hugging Face . 2024-06-15 . huggingface.co . 2024-06-15 . https://web.archive.org/web/20240615010323/https://huggingface.co/nvidia/Nemotron-4-340B-Base . live .
  210. Web site: Nemotron-4 340B Research . 2024-06-15 . research.nvidia.com . 2024-06-15 . https://web.archive.org/web/20240615010323/https://research.nvidia.com/publication/2024-06_nemotron-4-340b . live .
  211. Web site: Llama Team, AI @ Meta . The Llama 3 Herd of Models . 2024-07-23 . ai.meta.com . https://ai.meta.com/research/publications/the-llama-3-herd-of-models/ .
  212. Web site: llama-models/models/llama3_1/MODEL_CARD.md at main · meta-llama/llama-models . 2024-07-23 . GitHub . en . 2024-07-23 . https://web.archive.org/web/20240723151851/https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/MODEL_CARD.md . live .