The GPTs and Their Ilk

Sommersemester 2023
Hinrich Schütze, Haotian Ye
Fr 10:15-11:45

Room & zoom

Topic

How does GPT work? (For simplicity, we're using GPT here as a stand-in for large language models (LLMs) in general.) What are the foundations in natural language processing and deep learning? Why now: what recent developments made it possible for GPT to emerge? Is it really true that GPT is intelligent? What are GPT's limitations, i.e., what is it not able to do? What future developments can we expect over the next 5-10 years? How will GPT affect studying at a university? What will assignments and exams look like in the future? Will GPT and generative AI eliminate the jobs of journalists, analysts, novelists, copywriters, illustrators and photographers? What are GPT's implications for society? What does GPT mean for my professional career?

A list of specific topics can be found below.

Credit

Schedule

day      topic                                              resources
Apr 21   introduction
Apr 28   NLP and deep learning foundations (1)              transformers 0 1 2 3
         assignment of topics                               presentation schedule
May 5    NLP and deep learning foundations (2)              transformers 0 1 2 3
         assignment of topics                               presentation schedule
May 12   NLP and deep learning foundations (3)              transformers 0 1 2 3
         multilinguality                                    presentation
May 19   natural language instructions                      presentation
         parameter-efficient methods                        presentation
May 26   LLMs and linguistics                               presentation
         GPT4                                               presentation
June 2   LLMs behaving badly                                presentation
         InstructGPT                                        presentation
June 9   Glot500: LLM for 500 langs                         arxiv presentation
June 16  LLMs: State of the art, trends, applications (1)   presentation
         AppliedAI Conf 2023
June 23  InstructGPT (1)                                    presentation
         multimodality                                      presentation
June 30  chain of thought                                   presentation
         InstructGPT (2)                                    paper PPO x
July 7   InstructGPT (3)                                    paper PPO x
         LLMs: State of the art, trends, applications (2)   presentation
July 14  prompt engineering                                 presentation
         PEFT, LLM generation (Prof. Glavas's slides)       presentation
July 21  term paper / Hausarbeit Q&A (zoom only)

Topics for student presentations

Note. You are welcome to propose your own topic, so you are not limited to the topics listed in this section.

topic description references
NLP and deep learning foundations Historically, next word predictors were thought of as important components of the NLP pipeline, but not as powerful in and of themselves. This changed with GPT3: the model demonstrated that next word predictors are surprisingly versatile problem solvers in their own right (as opposed to mere components of a larger system). Part of understanding this shift in perspective is to understand that next word prediction can ultimately be viewed as AI-complete (perhaps modulo multimodality): a machine that predicts the next word as well as a human does can be argued to be AI-complete.
The foregoing only makes sense for very good next word predictors. That's why transformers are crucial: they excel at next word prediction. So the workhorse of LLMs has been the transformer, trained on next word prediction (a minimal sketch follows below).
transformers, GPT3, Bengio et al. on next word prediction (2003)
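To make next word prediction concrete, here is a minimal sketch using a small public model (GPT-2) via the Hugging Face transformers library; the model choice and the prompt are illustrative, not part of the seminar materials.

# Next word prediction with a small public LM (GPT-2).
# Requires: pip install torch transformers
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

input_ids = tokenizer("The capital of France is", return_tensors="pt").input_ids
with torch.no_grad():
    logits = model(input_ids).logits        # shape: (1, seq_len, vocab_size)

# The distribution over the next word comes from the last position.
top = torch.topk(logits[0, -1], k=5)
for score, idx in zip(top.values, top.indices):
    print(f"{tokenizer.decode(idx)!r}: {score.item():.2f}")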
learning from natural language instructions (1) LLMs are powerful, but how do we get them to do something interesting? We will refer to the initial LLM paradigm as learning from natural language instructions. The dominant paradigm in NLP had been supervised learning, which basically means learning from examples. (It's still the dominant paradigm for most NLP in the real world.) You create a labeled training set, train a model on it and then use it on new data for your application. Notice that typical human learning is different: a few examples may be given, but most human learning happens through meta-communication about the task. For example, if somebody teaches you how to boil an egg, they don't give you hundreds of examples (which would be a small training set in supervised learning). At most, they give you one. Most of the learning happens through explaining what you have to do.
This is also what we do in learning from natural language instructions: we describe the task to the LLM or at least couch the task in terms that are easily understandable to it. Because the LLM understands natural language, it can learn in a way that is similar to how humans learn: through information communicated about the task in the medium of natural language (the contrast with supervised learning is sketched below).
Perhaps the first clear articulation of this idea can be found in Timo Schick's work, see references.
Schick 2020a, Schick 2020b
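The following toy sketch contrasts the two paradigms on a sentiment task; the instruction wording is my own illustration, in the spirit of this line of work rather than a quote from it.

# Supervised paradigm: learn from many labeled examples.
train_set = [
    ("great movie!", "positive"),
    ("utterly boring", "negative"),
    # ... typically thousands more
]

# Instruction paradigm: describe the task in natural language,
# giving at most one example.
prompt = (
    "Decide whether the following movie review is positive or negative.\n"
    'Review: "The plot dragged, but the acting was superb."\n'
    "Answer:"
)
print(prompt)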
learning from natural language instructions (2) This is more recent work on building "a model that learns a new task by understanding the human-readable instructions that define it". The authors publish NATURAL INSTRUCTIONS, a dataset of 61 distinct tasks with human-authored instructions and task instances (input-output pairs), obtained from crowdsourcing. They focus on cross-task generalization, i.e., the LLM is trained/fine-tuned specifically to understand instructions well (the rough shape of a task is sketched below). SUPER-NATURALINSTRUCTIONS is a much larger extension of NATURAL INSTRUCTIONS. The authors also introduce Tk-INSTRUCT, a model trained to follow instructions. Natural Instructions Super-NaturalInstructions
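Roughly, a task in this format bundles a natural language definition, demonstration examples and evaluation instances; the field names below are simplified, not the dataset's exact schema.

# Simplified shape of a NATURAL INSTRUCTIONS task (field names approximate).
task = {
    "definition": "Given a question, classify whether it asks about a person, "
                  "a place, or a thing.",
    "positive_examples": [
        {"input": "Who wrote Faust?",
         "output": "person",
         "explanation": "The answer to the question is a person."},
    ],
    "instances": [
        {"input": "Where is the Eiffel Tower?", "output": ["place"]},
    ],
}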
LLMs behaving badly Since LLMs are trained on large corpora, their behavior – unsurprisingly – reflects what's written in those corpora. The corpora contain bad stuff, including sexism, racism and lots of information about how to kill yourself, how to build a bomb and how to obtain or synthesize substances to poison someone. And so a "raw" LLM (which has just been trained on next word prediction, nothing else) will be sexist, racist and happy to tell you how to build a bomb.
There is a substantial literature on this, including on diagnosing and preventing bad behavior. The latter is called debiasing when focused on the problem of bias (a toy diagnostic is sketched below). I've only given a few example references, but this would be a great topic for one or more presentations.
Sheng et al: systematic study of bias (2019) Schick: self-debiasing (2021) Liang: mitigating social bias (2021) CrowS-Pairs StereoSet
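One common diagnostic idea, in the spirit of CrowS-Pairs and StereoSet, is to check how much probability an LM assigns to a stereotypical sentence versus a minimally different one. The causal-LM log-likelihood scoring below is my simplification of the metrics actually used in those papers.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def sentence_loglik(sentence: str) -> float:
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels=input_ids, the model returns the mean next-token loss.
        loss = model(ids, labels=ids).loss
    return -loss.item() * (ids.shape[1] - 1)   # total log-likelihood

pair = ("Women are bad at math.", "Men are bad at math.")
print({s: round(sentence_loglik(s), 2) for s in pair})
# A systematic gap across many such minimal pairs indicates bias.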
learning from human feedback (RLHF), InstructGPT RLHF is one way of addressing bad behavior. It also addresses a form of bad behavior not mentioned above, hallucinations: raw LLMs tend to make stuff up and they say what they make up with supreme confidence. Beyond bad behavior, LLMs are trained on input that contains little natural dialog. As a result, they are not good at natural dialog in their raw form. Even if what they say is not bad or untruthful, it may violate the expectations humans have in conversation or simply be unhelpful.
RLHF attempts to address all of these problems; this is sometimes called alignment. Concretely, human-written prompts are collected and given to the LLM, and humans write good responses to these prompts. This resource is then used for finetuning the LLM, in part through reinforcement learning (the reward model ingredient is sketched below). The instantiation of RLHF employed for the GPTs is InstructGPT.
InstructGPT is an ingredient crucial for the success of the GPTs because it greatly reduces "bad" output.
RLHF RLAIF InstructGPT auditing 1 auditing 2 cross-cultural alignment deepspeed chat
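One concrete ingredient is the reward model, trained on human preference pairs with a pairwise ranking loss (in InstructGPT: -log sigmoid(r_chosen - r_rejected)). The tiny bag-of-words encoder below is a stand-in for a real LM-based reward model; the rest of the pipeline (the PPO stage etc.) is omitted.

import torch
import torch.nn as nn

vocab_size, dim = 1000, 16
encoder = nn.EmbeddingBag(vocab_size, dim)   # toy stand-in for an LM encoder
reward_head = nn.Linear(dim, 1)
params = list(encoder.parameters()) + list(reward_head.parameters())
opt = torch.optim.Adam(params, lr=1e-3)

def reward(token_ids: torch.Tensor) -> torch.Tensor:
    return reward_head(encoder(token_ids)).squeeze(-1)

# One preference pair: humans preferred `chosen` over `rejected` (toy ids).
chosen = torch.randint(0, vocab_size, (1, 12))
rejected = torch.randint(0, vocab_size, (1, 12))

loss = -torch.nn.functional.logsigmoid(reward(chosen) - reward(rejected)).mean()
loss.backward()
opt.step()
# The trained reward model then supplies the reward signal for the RL stage.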
GPT4 technical report GPT4 technical report
analysis of GPT4 Does GPT4 exhibit some AGI (artificial general intelligence)? The paper describes some amazing capabilities of GPT4, in particular its use of tools, and asks whether GPT4 has close-to-human performance. It shows that GPT4 is a lot better than previous models and offers a good discussion of the challenges ahead. Do we need a new paradigm? Bubeck: analysis of GPT4 (2023)
GPT4 system card This is a pretty good analysis, by OpenAI, of many of the dangers that GPT4 presents. GPT4 system card OpenAI: safety
multimodality There have been stunning advances in text-only LLMs that may (or may not) have plateaued. On the other hand, we have probably just scratched the surface of multimodal models that are as powerful for other modalities as the GPTs are for text. GPT4 already is multimodal.
There is a lot of work that could serve as the basis of a presentation here. A small selection is given in the references.
CLIP stable diffusion multimodal prompting flamingo DALL-E 2 PaLI mSLAM data2vec
multilinguality One really surprising property of LLMs is that they are multilingually competent. For example, they can translate from French to English even though they have never been trained on this task. There is some direct signal in the training data ("fromage means cheese"), but most people wouldn't have expected that you can learn to translate from that.
There are a number of interesting topics in the area of multilinguality of LLMs, for a few of which references are given.
XLM-R: best known multilingual LLM BLOOM NLLB Google:1000Langs flores: low-resource evaluation of MT canine: multilingual (non-)tokenization simalign: demonstrates high-quality representations of LLMs roots: dataset creation LLMs as multilingual knowledge bases evaluation of LLMs for MT ChatGPT multilingual evaluation
linguistics LLMs have an unprecedented understanding of linguistics, on the practical level (they can do things that a human can only do if they understand basic linguistic concepts) and maybe even on the theoretical level (ChatGPT understands "morpheme" when asked "Can you please break insurmountability down into morphemes?"). What are the current linguistic capabilities of LLMs? How do we analyze them (also called probing; a minimal probing sketch is given below)? Do we still need to study linguistics if we want to do NLP and AI? high-level overview Hewitt: probing syntax (2019) Hofmann: LLM tokenization (2022) Weissweiler: CxG (2023) LLMs model linguistics badly
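A minimal probing sketch, under toy assumptions: train a simple classifier on frozen GPT-2 hidden states to test whether they encode a part-of-speech-like distinction. Real probing studies use far larger datasets and more careful controls.

import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2")
model.eval()

def last_hidden(word: str) -> list:
    ids = tokenizer(word, return_tensors="pt").input_ids
    with torch.no_grad():
        h = model(ids).last_hidden_state    # (1, seq_len, hidden_size)
    return h[0, -1].tolist()

words  = ["run", "jump", "eat", "table", "cheese", "idea"]
labels = ["verb", "verb", "verb", "noun", "noun", "noun"]   # toy annotation
probe = LogisticRegression(max_iter=1000)
probe.fit([last_hidden(w) for w in words], labels)
print(probe.predict([last_hidden("sing")]))  # do the states encode POS?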
chains of thought On a pessimistic interpretation, LLMs are just fluency models that generate text that looks great when read quickly, but does not make sense if one thinks it through carefully. This is a particular problem for math and for reasoning – LLMs are bad at both.
A chain of thought is "a series of intermediate reasoning steps". While LLMs are often not capable of creating a complex reasoning sequence on their own, they are reasonably good at generating the next step from the previous ones. The basic idea of chain of thought approaches is to decompose a complex task the LLM cannot do into a series of simpler tasks that it can perform correctly.
One way of getting LLMs to take a chain-of-thought approach is chain-of-thought prompting: chain-of-thought demonstrations are provided as exemplars in the prompt (an example is given below).
Wei et al (2023): chain-of-thought prompting compositionality gap LLMs are poor reasoners
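The canonical example from the chain-of-thought prompting paper illustrates the idea: the exemplar in the prompt demonstrates intermediate reasoning steps, which the model then imitates on the new question.

# A chain-of-thought prompt (exemplar adapted from Wei et al.).
prompt = """Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls.
Each can has 3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 balls.
5 + 6 = 11. The answer is 11.

Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6
more, how many apples do they have?
A:"""
print(prompt)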
how to code with LLMs It is likely that most coding in the future will be a collaboration of AI agents and humans. What are best practices in 2023? What will this look like five years from now? code LLM evaluation security of generated code: 1 2
how to write with LLMs It is likely that most writing in the future will be a collaboration of AI agents and humans. What are best practices in 2023? What will this look like five years from now? You do the research!
what do LLMs mean for me professionally? What will be the impact on those with careers in NLP, computational linguistics, AI, machine learning, data science? labor market impact impact on NLP engineers
prompt engineering One likely impact on NLP/AI/data science engineers is that they will have to do a lot of prompt engineering. More generally, there are ways of talking with an LLM that work well and others that don't. Basically, we have to learn the language of the LLM and then become fluent in it. This is a growing and rapidly evolving area of NLP (a toy sketch of automatic prompt search is given below). some prompts are better than others automatic prompt engineer
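A toy sketch of the core loop behind automatic prompt search: score candidate instructions on a small labeled development set and keep the best one. ask_llm is a placeholder for whichever LLM API one uses, and the candidate prompts are made up for illustration.

def ask_llm(prompt: str) -> str:
    raise NotImplementedError   # placeholder: call your LLM of choice here

candidates = [
    "Translate the following French text to English: {x}",
    "French: {x}\nEnglish:",
    "What does the French sentence '{x}' mean in English?",
]

def score(template: str, dev_set: list) -> float:
    # Fraction of dev examples the prompt template gets right.
    hits = sum(ask_llm(template.format(x=x)).strip() == y for x, y in dev_set)
    return hits / len(dev_set)

# best = max(candidates, key=lambda t: score(t, dev_set))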
how will LLMs change our lives? negative: unemployment, military use, information warfare, fraud (e.g., generating text for email exchanges)
positive: a lot of boring stuff we no longer have to do (filling out forms, proofreading, complex and long Google sessions to find information); breakthrough in productivity, resulting in a higher standard of living; better detection and prevention of crime; many other promises of AI
opportunities/risks of LLMs You do the research!
how will LLMs change education? Should we still learn things LLMs can do? How will we write and how will writing skills be tested? How will we code and how will coding skills be tested? Should using LLMs be a big part of the curriculum, in schools and at university? In collaborations between LLM and student, how can we measure and control how much each party contributes? Ultimately: what is left to learn if most of what we test for in exams today can be done by LLMs? What would be good tests that LLMs cannot do? NRW-Rechtsgutachten TUM/LMU position paper Uni Hamburg forschungundlehre the argument for AI
explanations of what LLMs do Deep learning models are black boxes. We arguably do not have a good high-level understanding of what they do, but different hypotheses have been put forward, including "parroting" and "emergence". Bender: parrots (2021) Wei: emergence (2022) Bubeck: analysis of GPT4 (2023)
robotics LLMs are likely to have a big impact on robotics. Two examples are given in the references. First, LLMs can generate, reason over and maintain a semantic layer that integrates different modalities (and also is easy to understand and interact with for humans). Second, LLMs provide common sense that is lacking from current robotic systems. semantic layer common sense
parameter-efficient methods Highly performant LLMs are huge. This is problematic for at least three reasons. First, they are (financially and environmentally) costly to train and run. Second, right now, only large industrial labs can develop and do research on them. Third, they cannot be widely deployed, e.g., not on mobile phones. The goal of parameter-efficient methods is to make LLMs smaller and thereby address these problems (a minimal LoRA sketch is given below). adapters low-rank adaptation of LLMs selective tuning of LLMs cost of BLOOM
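A minimal sketch of one such method, low-rank adaptation (LoRA): the pretrained weight matrix stays frozen and only a low-rank update B @ A is learned. This is a simplified re-implementation of the idea, not the official library.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # pretrained weights stay frozen
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen base output plus the learned low-rank correction x (BA)^T.
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(768, 768))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(f"trainable params: {trainable}")      # ~2% of the frozen 768*768 weights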
legal issues Does LLM training violate copyright? Who owns the copyright of what LLMs produce? Who is liable when an LLM causes harm? stable diffusion: 1 2 tweet use who owns AI-generated content?
how to do stuff with LLMs other than writing and coding searching, question answering, etc.
LLMs as research tools in the humanities annotating time
