OpenAI has revealed ChatGPT's successor, GPT-4.
Prof Dimitrios Makris, Professor in Computer Vision and Machine Learning, Kingston University, said:
“One of the major improvements of GPT-4 is the ability to process images and annotate or describe them using words; it therefore goes beyond language and builds a bridge between language and vision, two of the major pillars of (artificial) intelligence. While models that bring together language and vision have been developed before, the question here is how effective the new GPT-4 model will be on this task and how it will capitalise on the success of Large Language Models (LLMs).”
Prof Sylvie Delacroix, Professor in Law and Ethics, University of Birmingham, and Fellow at The Alan Turing Institute, said:
“Without access to both public domain data and content made available under open access licenses (which often restrict commercial uses), GPT-4 would certainly not be the tool it is today. As such, making non-commercial access to this tool conditional upon fee payment is highly problematic.”
Dr Jeff Dalton, Senior Lecturer in AI at the University of Glasgow, said:
How significant is this announcement (compared to GPT-3)?
“There are multiple dimensions to this question. In terms of textual understanding and model capability, it seems to be an incremental improvement over GPT-3. What’s significantly different is that it can accept image input and describe the contents of what it “sees”.
“These improvements to the model are important because they make GPT-4 much more usable for diverse real-world applications, including virtual or augmented reality. Beyond its behavior, OpenAI isn’t revealing details about its size, architecture, or training data. We do know that it doesn’t have recent information because its training data still hasn’t been updated past 2021, similar to GPT-3.”
What major improvements does GPT-4 have over GPT-3?
“There are several major advancements. First, the biggest change in GPT-4 is its new ability to understand images. It can take image input as well as text, although it can still only output text. It can explain why plugging a VGA adapter into a phone might be funny. The internal benchmarks OpenAI gives also show that it is state-of-the-art or very competitive on several key open visual question answering benchmarks.
“The second improvement in GPT-4 is its ability to handle longer input sequences. Before, it could handle up to 2,048 tokens; now it is 8,192, a 4x increase. There’s an even larger model that can handle up to 32k tokens (about 50 pages of text) [an illustrative token-counting sketch follows this answer]. That’s going to change what we can do with GPT-4 for many commercial and research applications.
“Finally, GPT-4 advances some of the capabilities of the model. The biggest improvement is in its ability to do maths (from the 0th percentile to the 50th percentile in AP Calculus) and in commonsense reasoning (a 13% improvement across key benchmarks). One noteworthy change is that it can now pass a simulated bar exam with a score in the 90th percentile, where before it was in the 10th percentile.”
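As an editorial aside on the context-window figures above: the sketch below assumes OpenAI’s open-source tiktoken tokenizer, and the 8,192- and 32,768-token limits are simply the figures quoted in the interview. It shows how one might check whether a text fits within a given window; it is illustrative only, not OpenAI’s own tooling.

```python
# Illustrative sketch: counting tokens with OpenAI's open-source
# tiktoken library to check whether a text fits a context window.
# The limits below are the figures quoted above; "cl100k_base" is
# the encoding tiktoken ships for GPT-4-era models.
import tiktoken

CONTEXT_LIMITS = {"gpt-4": 8_192, "gpt-4-32k": 32_768}

def fits_in_context(text: str, model: str = "gpt-4") -> bool:
    enc = tiktoken.get_encoding("cl100k_base")
    n_tokens = len(enc.encode(text))
    print(f"{n_tokens} tokens against a limit of {CONTEXT_LIMITS[model]}")
    return n_tokens <= CONTEXT_LIMITS[model]

# Roughly 0.75 words per token is a common rule of thumb, which is
# where "about 50 pages" for 32k tokens comes from.
fits_in_context("An example sentence to measure. " * 400)
```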
Does GPT-4 provide a “fix” for some of the issues with GPT-3, e.g. hallucinations?
“The key advance in GPT-4 is that it’s been trained to follow human instructions better, so its output is more closely aligned with what we expect, even more so than GPT-3. Although it doesn’t solve the problem of hallucinations, OpenAI’s benchmarks show it is over 40% more factual than GPT-3. It also has improved safeguards and safety features for sensitive requests, like medical advice.”
Any other comments?
“Although we didn’t realize it, we’ve already been using GPT-4 in Microsoft’s Bing-GPT, as well as likely in some other product integrations.
“There are limitations to consider with these evaluations. Some worry that the comparisons aren’t completely fair, because some of the data in these benchmarks was seen by the models (both GPT-3 and GPT-4) during training. That means they may not perform nearly as well on unseen questions or problems.”
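The contamination worry raised above can be made concrete. One common heuristic, used in earlier GPT papers, is to flag a benchmark item if it shares a long n-gram with the training corpus. The toy sketch below assumes a 13-gram threshold and invented data; real decontamination pipelines operate at a vastly larger scale.

```python
# Toy sketch of benchmark-contamination checking via n-gram overlap:
# flag a test item if any 13 consecutive words from it also appear
# in the training corpus. Purely illustrative; the data is invented
# and the 13-gram threshold is just one common convention.
def ngrams(text: str, n: int = 13) -> set[str]:
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(test_item: str, training_ngrams: set[str]) -> bool:
    return not ngrams(test_item).isdisjoint(training_ngrams)

training_corpus = "stand-in text for the (unreleased) training data ..."
print(is_contaminated("some benchmark question text ...", ngrams(training_corpus)))
```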
Dr Stuart Armstrong, Co-Founder and Chief Researcher at Aligned AI, said:
“GPT-4 is an impressive algorithm with many new capabilities – and it is less likely to invent facts. ‘Less likely’ sounds better, but it might be worse. If your chatbot goes off the rails once in ten times, then you check its output carefully. If it goes off the rails once in a thousand times, then you may never see it go wrong, so you start trusting it to manage your emails, control your investments, or direct your car… Until the dramatic crash.
“Papering over hallucinations is not enough; to be safe and usable, language models need to be fundamentally designed to be aligned and truthful, not just ‘less likely’ to go off the rails.”
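Dr Armstrong’s point about rare failures compounds quickly with repeated use: the probability of at least one failure in N independent uses is 1 - (1 - p)^N, so even a one-in-a-thousand failure rate gives roughly a 63% chance of at least one failure over a thousand uses. A minimal check, with illustrative numbers only:

```python
# Probability of at least one failure in N independent uses, for
# the illustrative failure rates in the quote above.
for p, n in [(1 / 10, 10), (1 / 1000, 10), (1 / 1000, 1000)]:
    print(f"p={p:g}, N={n}: P(at least one failure) = {1 - (1 - p) ** n:.3f}")
```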
Prof Maria Liakata, Professor in Natural Language Processing (NLP) at Queen Mary, University of London, said:
“The technical report released yesterday on GPT-4 reveals nothing about the technical details (model size, parameters, training data or training method), which makes the model very hard to scrutinise or replicate. This is problematic because the data a model is trained on greatly influences its outputs. There could be copyright issues, there are certainly questions about data quality, and this sets a negative precedent of secrecy around these models. For earlier GPT models, relevant technical information had been released.
“Like earlier models, GPT-4 has been trained on content up to September 2021, using Reinforcement Learning from Human Feedback. The difference from earlier models seems to be that it has better performance on many benchmark tasks and has been extensively evaluated, including on exams designed for humans (multiple-choice and free-text). Some of the evaluation was performed on smaller versions of GPT-4, with GPT-4’s loss extrapolated via a power-law fit to the losses of the smaller models [an illustrative sketch of such a fit follows this quote]. Human experts have been involved in both training and evaluation, the latter also including evaluation of risks.
“The authors claim that GPT-4 substantially improves over previous models in the ability to follow user intent. They have also improved on safety metrics, reducing the model’s tendency to respond to disallowed requests by 82% compared to earlier models.
“GPT-4 accepts prompts as both images and text. Of note is their example showing the visual capabilities of GPT-4, where the input is a picture consisting of three panels: in the first, a phone is charging using what looks like a VGA adaptor; the second shows a Lightning phone charger in the shape of a VGA connector; and the third is a closer look at the VGA-style Lightning adaptor. GPT-4 is asked to explain what is funny by describing each panel, and what it generates is impressive.
“As mentioned in their report:
“Despite its capabilities, GPT-4 has similar limitations to earlier GPT models: it is not fully reliable (e.g. can suffer from “hallucinations”), has a limited context window, and does not learn from experience. Care should be taken when using the outputs of GPT-4, particularly in contexts where reliability is important.”
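On the power-law extrapolation Prof Liakata mentions: the sketch below fits the commonly assumed form L(C) = a*C^b + c (a power-law decay in training compute C plus an irreducible loss c) to results from smaller runs and extrapolates to a larger one. The functional form and all data points are assumptions for illustration, not figures from the GPT-4 report.

```python
# Illustrative sketch of predicting a large model's final loss from
# smaller training runs via a power-law fit, as described above.
# The form L(C) = a * C**b + c and all numbers here are invented.
import numpy as np
from scipy.optimize import curve_fit

def loss_law(compute, a, b, c):
    return a * compute ** b + c  # c is the irreducible-loss term

compute = np.array([1e18, 1e19, 1e20, 1e21])  # smaller runs (FLOPs)
losses = np.array([3.10, 2.60, 2.25, 2.00])   # their observed final losses

(a, b, c), _ = curve_fit(loss_law, compute, losses,
                         p0=(100.0, -0.1, 1.5), maxfev=10_000)
print(f"predicted final loss at 1e23 FLOPs: {loss_law(1e23, a, b, c):.2f}")
```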
Dr Mark Stevenson, Senior Lecturer from the University of Sheffield’s Department of Computer Science, said:
“GPT-4 is the latest of the large language models (LLM) to be announced over the last few years. These models, such as ChatGPT, GPT-3 and LaMDA, have produced significant interest and demonstrated impressive abilities to generate convincing text that is often difficult to distinguish from that written by humans. Previous versions have focused on interpreting text, but GPT-4 has added the ability to process images too, and is also able to interpret significantly longer input texts than GPT-3. These features will open up new and interesting potential applications, although it’s difficult to say how much of an improvement GPT-4 represents over previous versions as the full model is not available yet.
“A range of issues have already been identified with previous artificial intelligence chatbots, including outputting incorrect statements (known as “hallucination”), limited reasoning capabilities and the generation of upsetting or offensive content. The extent to which the developers of GPT-4 have been able to address these issues will become clearer following the opportunity to experiment with the model.”
Prof Mirella Lapata, professor of natural language processing, University of Edinburgh, said:
“With GPT-4 we are one step closer to life imitating art. In Charlie Brooker’s science fiction series Black Mirror, AI technology can reconstruct one’s voice, style of speaking, facial expressions, and personality traits from their online data profile (episode “Be Right Back”). GPT-4 is now multimodal, has been trained on more data, and is apparently more reasonable. Humans are not fooled by the AI in Black Mirror, but they tolerate it. Likewise, GPT-4 is not perfect, but it paves the way for AI being used as a commodity tool on a daily basis.”
Prof Anthony Cohn, Professor of Automated Reasoning, University of Leeds, said:
“GPT-4 is definitely a step change from the earlier GPTs, particularly in that it is now multimodal, accepting both text and visual input. It is a vastly larger model and is claimed to outperform previous models, though aggregate performance statistics do not reveal the kinds of problems it fails on, or whether it sometimes gets wrong answers to problems GPT-3 got right. OpenAI acknowledges it is subject to “similar limitations”, is not fully reliable (still suffering from “hallucinations”) and, crucially, “does not learn from experience”, surely a hallmark of any truly intelligent agent. OpenAI emphasise GPT-4’s value as a tool to aid a human rather than as an AGI (Artificial General Intelligence) in its own right, which is welcome, given these limitations.
“Although perhaps understandable from a commercial point of view, the secrecy surrounding the architecture, the training regime and the training dataset makes it hard for the scientific community to evaluate it except as a black box. ChatGPT was able to offer justifications and explanations of its answers, but in a recent evaluation of commonsense reasoning I have found these often to be inconsistent with the answer it gave, and even internally inconsistent; it will be interesting to see if GPT-4 improves on this performance, but meanwhile such “explanations” clearly do not reveal the underlying computations or provide a reliable justification for the answers given to prompts.
“Foundation models such as GPT have always struggled with reasoning tasks, particularly commonsense reasoning, and whilst I expect GPT-4 may prove to be better, I doubt that it will reliably provide the kinds of reasoning abilities we would expect of a truly intelligent agent. So by all means use foundation models such as GPT-4 as a tool, as an assistant, but “caveat emptor”.”
Prof Mike Wooldridge, Professor of Computer Science at the University of Oxford, and Director of Foundational AI Research at The Alan Turing Institute, said:
“GPT-4 is the latest in the game-changing GPT class of AI products from OpenAI. The release of GPT-3 in 2020 sent ripples of excitement through the AI community – it became clear that a very big bet by OpenAI on a new generative AI technology was beginning to pay off – this technology represented a clear step change over its predecessors. The biggest news about GPT-4 is that it is multi-modal: it can cope not just with text, but with images. The impressive language capabilities of ChatGPT opened our eyes to a whole swathe of new possibilities – multi-modal generative AI is going to multiply these.
“I’m intrigued to see the claims made about the reasoning capabilities of GPT-4. These were very limited in GPT-3 and ChatGPT – it’s going to be interesting to see how the claims hold up when we test them. You can bet that hundreds of PhD students across the world are going to be working through the night in the weeks ahead to try to figure this out.”
Prof Nello Cristianini, Professor of Artificial Intelligence at the University of Bath, said:
“In the past decade we have created a form of AI by exploiting statistical correlations discovered in vast masses of data, a method akin to a shortcut that avoided the need for explicit modelling. Language models are one result of this shortcut, and GPT-4 moves a step further towards more realistic AI by exploiting correlations between different “modalities” of data, such as images and text. Calling them “language models” is no longer appropriate, as they are vision models too, at the very least.
“The ability to combine different “types” of information is known as “multimodality”, and it takes us one step further towards what we might consider a weak form of “understanding”: a classic example would be predicting what noise should accompany a video of a bouncing ball. GPT-4 has been trained to handle multimodal data streams, and the demo released by OpenAI involves suggesting recipes based on an image of available ingredients. We should not confuse this ability with that of completing different tasks: the two can exist alongside each other, and both are improving rapidly. We do not know how much the performance in text mode is helped by information available in image mode, as we do not know the size of the model or the amount of training that went into it. Now that there is competition, and a business model too, we should not be surprised by increased discretion, but it is still disappointing from a company called OpenAI.
“We should remember that language models such as GPT-4 do not think in a human-like way, and we should not be misled by their fluency with language. They are still based on statistical correlations discovered in the data, a shortcut that avoids the need for explicit representations of the world. This does not mean they cannot be useful; on the contrary, they certainly will be.”
Prof Peter Bannister, Healthcare Executive Chair, Institution of Engineering and Technology, and Managing Director, Romilly Life Sciences, said:
“GPT-4 includes several new features, including the ability to generate human-like written responses to queries that come in the form of images, not just text, and it is already integrated into Microsoft’s web browser. However, its creator OpenAI warns that the technology can still “hallucinate”; in other words, produce responses that, while very convincing, contain factual errors.
“It is therefore even more important that approaches to managing the risk of misinformation, as well as education to ensure users are aware of the limitations of these artificial intelligence tools and how to employ them effectively, are developed at an equivalently impressive pace. We should insist on continued testing against existing, validated information sources to ensure that the accuracy of our collective knowledge base is not eroded.”
https://openai.com/product/gpt-4
Declared interests
Dr Stuart Armstrong: “I have no direct interests in OpenAI and ChatGPT. As a human, I am interested in AI safety in general, and as a co-founder of an AI safety startup, I am interested in tools that increase AI safety (though we have no commercial relationships with OpenAI or any of its rivals).”
Prof Anthony Cohn: “No COI to declare.”
Prof Nello Cristianini: “I am the author of “The Shortcut – why intelligent machines do not think like us”, published by CRC Press.”
Prof Peter Bannister: “I have no COI to declare.”
For all other experts, no reply to our request for declarations of interests was received.