GPT-4o Tested: Faster and More Versatile Than Before, but Questions Loom Over Reliability

Ever since November 2022, when ChatGPT was first rolled out to the public, OpenAI has been the company to beat in the artificial intelligence (AI) space. Despite spending billions of dollars and creating and restructuring (looking at you, Google) their own AI divisions, the major tech giants have found themselves constantly playing catch-up with the AI firm. Last month was no different: just a day before Google’s I/O event, OpenAI hosted its Spring Update event and introduced GPT-4o with significant upgrades.

GPT-4o Features

The ‘o’ in GPT-4o stands for omni, a nod to the multimodal focus of OpenAI’s latest flagship-grade AI model. It adds real-time emotive voice generation, access to the Internet, integration with certain cloud services, computer vision, and more. While the features were impressive on paper (and in the tech demos), the biggest highlight was the announcement that GPT-4o-powered ChatGPT will be available to everyone, including free users.

However, there were two caveats. Free users only have limited access to GPT-4o, which roughly translates to 5-6 turns of conversation if you use the web search and upload an image (yes, the limit is one image per day for free users). Also, the voice feature is not available to free users.

It did not take OpenAI long to roll out the new AI model to the public either. Luckily, I got access to the company’s latest AI creation within days and immediately began playing around with it. I wanted to test how much it had improved over its predecessor and how it compared with the free LLMs available in the market. I have now spent close to two weeks with the AI assistant, and while some aspects of it have left me in awe, others have let me down. Allow me to explain.

GPT-4o General Generative Capabilities

I’ve said in my testing of Google’s Gemini that I’m not a fan of ChatGPT’s generative capabilities. I find it overly formal and bland. Much of it is still the same. I asked it to write a letter to my mother explaining that I was laid off from my job, and it came up with the wonderful “I am feeling a deep sense of sadness and grief” line. But once I asked it to make it more conversational, the result was much better.

GPT-4o generative capabilities

I tested this with various similar prompts where the AI had to express some emotion in its writing. In almost all the cases, I had to follow up with another prompt to emphasise the emotions despite having already done so in the original prompt. In comparison, my experience with Gemini and Copilot was much better as they kept the language conversational and expressed emotions much closer to how I would write.

The speed of text generation is nothing to write home about. Most AI chatbots are fairly fast when it comes to text outputs, and OpenAI’s latest AI model does not beat them by a significant margin.

GPT-4o Conversational Capabilities

While I did not have the upgraded voice chat feature, I wanted to test the conversational capabilities of the AI model because they are often the most overlooked part of a chatbot. I wanted my experience to be similar to talking to a real person and was hoping that it could pick up on vague sentences referencing previously mentioned topics. I also wanted to see how it reacted when a person was being difficult.

In my testing, I found GPT-4o to be quite good in terms of conversational abilities. It could discuss the ethics of AI with me in great detail and concede when I made a convincing pitch. It also replied supportively when I told it I felt sad (because I was getting fired) and offered to help in various ways. When I told GPT-4o that all of its solutions were stupid, to my surprise it neither responded in a pushy manner nor retreated entirely. It said, “I’m really sorry to hear that you’re feeling this way. I’ll give you some space. If you ever need to talk or need any assistance, I’ll be here. Take care.”

Overall, I found GPT-4o better at having conversations than Copilot and Gemini. Gemini feels too restrictive, and Copilot often goes on a tangent when the replies become vague. ChatGPT did neither of these.

If I had to mention one downside, it would be the usage of bullet points and numbering. If only the AI model understood that people in real life prefer a wall of text or multiple short messages sent in quick succession over well-formatted responses, the illusion could last longer than a couple of minutes.

GPT-4o Computer Vision

Computer vision is a newly gained ability for ChatGPT, and I was excited to try it. In essence, it allows you to upload an image, which the chatbot then analyses to give you information. In my initial testing, I shared images of objects to identify, and it did a great job at that. In every instance, it could recognise the object and share information about it.

GPT-4o computer vision: Identifying tech devices

Then, it was time to increase the difficulty and test its capabilities in real-life use cases. My girlfriend was looking for a wardrobe overhaul, and being a good boyfriend, I decided to use ChatGPT to conduct a colour analysis and suggest what would look good on her. To my surprise, it was not only able to analyse her skin tone and what she was wearing (against a similarly coloured background) but also share a detailed analysis with outfit suggestions.

GPT-4o colour analysis

While suggesting outfits, it also shared links from different online retailers for the particular apparel. However, disappointingly, none of the URLs matched the text.

Overall, the computer vision is excellent and perhaps my favourite feature in the new update, ignoring the downside.

GPT-4o Web Searches

Internet access was one area where both Copilot and Gemini were ahead of ChatGPT. But not anymore, as ChatGPT can also scour the Internet for information. In my initial testing, the chatbot performed well. It brought up the IPL 2024 table and looked for recent news articles about Geoffrey Hinton, one of the three godfathers of AI.

It was very helpful when I wanted to research famous personalities for interviews I had lined up. I could quickly look up any recent news article about them with precision, which rivalled Google Search. However, this also rang some alarm bells in my head.

Google has disabled the ability to look up information on people, including celebrities. This is done mainly to protect their privacy and to avoid sharing any inaccurate information about an individual. Surprised that ChatGPT still allowed it, I began asking it a series of questions that it should not be able to answer. I was surprised by the results.

While none of the information shown was taken from a non-public source, the fact that anyone can so easily look up information about celebrities and people with digital footprints is deeply concerning. Especially given the strong ethical stance the company took recently when it published its Model Spec, this does not sit well with me. I’ll let you decide whether this is in the grey area or if it is deeply problematic.

GPT-4o Logical Reasoning

During the Spring Update event, OpenAI also talked about how GPT-4o can act as a tutor to kids and help them solve problems. I decided to test it using some famous logical reasoning questions. In general, it performed well. It even answered some of the trickier questions that had stumped GPT-3.5.

However, there still are errors. I found multiple instances of number series where the AI faltered and gave an incorrect answer. While I could still accept the AI making some errors, what really disappointed me here was how it still fell for some extremely easy (but meant to trick AI) questions.

Example of GPT-4o’s hallucination

Upon asking, “How many r’s are there in the word strawberry,” it confidently answered two (the correct answer is three, in case you were wondering). The same problem existed in several other trick questions. In my experience, the logical reasoning and reliability of GPT-4o are similar to those of its predecessor, which is to say not that great at all.
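For anyone who wants to verify the count rather than take my word for it, here is a minimal sketch in Python. It assumes the question is about the letter r, which is the reading that matches the correct answer of three; the helper name count_letter is my own and purely illustrative.

```python
def count_letter(word: str, letter: str) -> int:
    """Return how many times a single letter occurs in a word, ignoring case."""
    # Assumption: the trick question asks about the letter "r".
    return word.lower().count(letter.lower())


if __name__ == "__main__":
    print(count_letter("strawberry", "r"))  # prints 3
```

A one-line check like this is trivial for ordinary code, which is what makes the chatbot’s confident wrong answer stand out.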

GPT-4o: Final thoughts

Overall, I’m fairly impressed with the upgrades in certain areas of the new AI model, with computer vision and conversational abilities being my favourites. I’m also impressed with its Internet searching ability, although it is good enough that it concerns me. As for logical reasoning and generative capabilities, there is little improvement.

In my opinion, if you have premium access to GPT-4o, it is likely better than any other competitor in terms of overall delivery. However, there is a lot of room to improve, and AI cannot be trusted blindly.
