January 18, 2024

ChatGPT, what’s this? Using the new ChatGPT image input features

Discover the new ChatGPT image input feature, which lets you analyze images, identify objects, read text, and get feedback.
January 18, 2024

ChatGPT, what’s this? Using the new ChatGPT image input features

Discover the new ChatGPT image input feature, which lets you analyze images, identify objects, read text, and get feedback.
January 18, 2024
Briana Brownell
In this article
Start editing audio & video
This makes the editing process so much faster. I wish I knew about Descript a year ago.
Matt D., Copywriter
Sign up

What type of content do you primarily create?

Videos
Podcasts
Social media clips
Transcriptions
Start editing audio & video
This makes the editing process so much faster. I wish I knew about Descript a year ago.
Matt D., Copywriter
Sign up

What type of content do you primarily create?

Videos
Podcasts
Social media clips
Transcriptions

There’s a new ChatGPT update that multiplies what you can do with the chatbot: the AI can now analyze images, thanks to ChatGPT image input.

This isn’t as simple as it sounds. It can identify what’s in an image, sure, but it can also read text and math from an image, search or find out about the things in an image, and give feedback about the image. That’s a lot of possibility in a single feature.

Here's how to get it working right.

How to upload images to ChatGPT 4

The process for inputting an image for ChatGPT to analyze is incredibly simple. Just navigate to the chat box (on desktop or mobile) and click the paperclip icon.

Next, choose the file on your device, then add a prompt—anything from "Describe this image" to "What color shoes should I wear with this outfit?"‎

Learn more: Using ChatGPT data analysis to interpret charts & diagrams

What’s this? ChatGPT image recognition

ChatGPT image input certainly isn’t the first AI image recognition program. In fact, they have a fairly long history. In 2010 (basically the stone age in AI time scales), there was Google Goggles, an image recognition mobile app.  Despite being a relic, it had some decidedly impressive features: the ability to recognize and translate text, and find similar images using a reverse image search.

OpenAI's latest offering has features reminiscent of Goggles, but with a unique approach. The difference is how ChatGPT now interprets the actual contents of the image, rather than searching the web and comparing it to known images. Specifically, ChatGPT generates a description of the image and uses that description in its search.

And it’s pretty accurate. When I first asked it to identify a lunch, it easily figured out I was eating clam chowder in a bread bowl.

Screenshot of user asking ChatGPT to identify an image of clam chowder in a bread bowl

But in my next test, I asked it to identify the Tokyo Metropolitan Government Building from a photo I took. The tool's reliance on descriptive text led to mixed results.

It cycled through a number of different search terms where it described the building, including “twin towers with spherical structures on top." On my first try, it eventually found the correct building, but referenced an irrelevant Wikipedia page. When I tried it again, it gave me the wrong building (The Tokyo Towers). At least it got the city right.

Meanwhile, a reverse image search located it immediately.

Bing reverse image search of the Tokyo Metropolitan Government Building

As with any emerging technology, expect continuous enhancements. The current version may not always be spot-on with citations or identifications, but it's evolving. In the meantime, be sure to double check ChatGPT’s references. 

Tip: This is where multi-agent prompting—that is, using multiple AI tools for a larger task—comes in handy. Where ChatGPT image input falls short, you can take advantage of Lens in Google Photos and Bard. Bing also has a reverse image search feature.

ChatGPT, read this: Text and math recognition

When it comes to text recognition, ChatGPT shows impressive results, particularly with clear, neatly handwritten text or printed words. 

It's a mixed bag with translations, though. In my tests, ChatGPT's reading of handwritten French was passable, but it amusingly mistook a bottle of black rice vinegar for premium sake when interpreting Japanese—you don't want to make that mistake when you're bringing a gift for a dinner party! Meanwhile, when I used Google Lens, it accurately translated a Japanese sign that ChatGPT told me was "too blurry" to read. (Another perfect example of how using the multi-agent approach lets you play to each of the tools’ strengths.) 

Here's a cool thing though: ChatGPT can recognize written math formulas, which is way easier than typing them out. But solving them? Not its strong suit. It tries, but don't bet your homework on it—after all, it's a prediction engine that's just trying to figure out what word comes next. When I put it to the test on my old macroeconomics assignments it gave wrong but plausible answers 4 out of 4 times. 

Regardless, the ability to input formulas is one big advantage over Lens, even if you have to do most of the heavy lifting from there.

Tip: There are some ChatGPT plugins specifically for math, so it feels like a win-win to use them together.

Find this: ChatGPT image search

Now that ChatGPT uses Bing to search the web, you've got options to retrieve information: either using ChatGPT's internal "knowledge," or using external knowledge from the web. The default for ChatGPT 4 is to dynamically choose the best model, so it decides for you whether it should search or not. 

I found that if you ask about a specific element in an image, it tends to search, but if you ask an interpretive question about the contents of the image, it usually will attempt to answer based on its internal knowledge. 

But rather than relying on its decisions, a better habit to get into is asking it explicitly to use search—or not.


Image of a wine bottle with a ChatGPT prompt asking for what the wine tastes like

Image of a wine bottle with a ChatGPT prompt asking for tasting notes

When I asked it to give me tasting notes on a certain wine from a picture of the bottle's label, it was able to seek out the exact wine by reading the text and searching for it through Bing. Meanwhile, when it used its internal knowledge, it gave me a description of the typical flavor profile of Chablis instead.

The ability to search is great when Bing search finds a reputable site, but awful when it lands on a high-ranking site that’s less authoritative.  My wine search surfaced information from Wine.com from the winemaker themselves along with professional descriptions of the wine, so it was pretty solid. But in other tests, I've seen it end up on a less reliable site and retrieve that information instead, which is much less useful.

For now, you'll have to double check ChatGPT’s work by doing research on your own to make sure it isn't digging up false information or information from questionable sources.

Tip: Monitor as it searches to see what it is looking for and on what sites. You can also explicitly ask it to tell you what it searched for.

Go deeper: ChatGPT image analysis

For me, this is the real meat of what ChatGPT image input can do: You can analyze the image to see whether or not it fits with a theme, or whether it resonates with a certain persona. 

To test it, I gave ChatGPT six possible images for a fictional sci-fi/paranormal-themed podcast and asked which would fit with the overall theme. It rated all six, dropping one as a bad fit—an assessment I agreed with.

But how detailed would it get? Turns out, pretty detailed. I gave it a synopsis of an Outer Limits episode and asked which one was the best fit based on the episode description.

ChatGPT response about the best image for an Outer Limits episode

When I asked how I could improve the image to better fit the theme, it gave some pretty interesting ideas, specifically referencing various parts of the actual episode. A good illustrator could have taken these suggestions and altered the image based on those suggestions.

Conclusion

This is yet another way ChatGPT is becoming multimodal, with its newfound ability to see, hear, and speak. I believe that multimodal is going to be one of the most important strains of AI. Even though the tools are brand new, thinking in terms of multiple types of inputs is a skill that everyone should be starting to develop.

Not to mention, ChatGPT now has all the power to exceed my capabilities in obscure music video trivia. Dang it!

ChatGPT response identifying a building and what music video it appears in


Image of a Beastie Boys music video with the prompt "what building is this" and a correct response
Briana Brownell
Briana Brownell is a Canadian data scientist and multidisciplinary creator who writes about the intersection of technology and creativity.
Share this article
Start creating—for free
Sign up
Join millions of others creating with Descript

ChatGPT, what’s this? Using the new ChatGPT image input features

There’s a new ChatGPT update that multiplies what you can do with the chatbot: the AI can now analyze images, thanks to ChatGPT image input.

This isn’t as simple as it sounds. It can identify what’s in an image, sure, but it can also read text and math from an image, search or find out about the things in an image, and give feedback about the image. That’s a lot of possibility in a single feature.

Here's how to get it working right.

How to upload images to ChatGPT 4

The process for inputting an image for ChatGPT to analyze is incredibly simple. Just navigate to the chat box (on desktop or mobile) and click the paperclip icon.

Next, choose the file on your device, then add a prompt—anything from "Describe this image" to "What color shoes should I wear with this outfit?"‎

Learn more: Using ChatGPT data analysis to interpret charts & diagrams

What’s this? ChatGPT image recognition

ChatGPT image input certainly isn’t the first AI image recognition program. In fact, they have a fairly long history. In 2010 (basically the stone age in AI time scales), there was Google Goggles, an image recognition mobile app.  Despite being a relic, it had some decidedly impressive features: the ability to recognize and translate text, and find similar images using a reverse image search.

OpenAI's latest offering has features reminiscent of Goggles, but with a unique approach. The difference is how ChatGPT now interprets the actual contents of the image, rather than searching the web and comparing it to known images. Specifically, ChatGPT generates a description of the image and uses that description in its search.

And it’s pretty accurate. When I first asked it to identify a lunch, it easily figured out I was eating clam chowder in a bread bowl.

Screenshot of user asking ChatGPT to identify an image of clam chowder in a bread bowl

But in my next test, I asked it to identify the Tokyo Metropolitan Government Building from a photo I took. The tool's reliance on descriptive text led to mixed results.

It cycled through a number of different search terms where it described the building, including “twin towers with spherical structures on top." On my first try, it eventually found the correct building, but referenced an irrelevant Wikipedia page. When I tried it again, it gave me the wrong building (The Tokyo Towers). At least it got the city right.

Meanwhile, a reverse image search located it immediately.

Bing reverse image search of the Tokyo Metropolitan Government Building

As with any emerging technology, expect continuous enhancements. The current version may not always be spot-on with citations or identifications, but it's evolving. In the meantime, be sure to double check ChatGPT’s references. 

Tip: This is where multi-agent prompting—that is, using multiple AI tools for a larger task—comes in handy. Where ChatGPT image input falls short, you can take advantage of Lens in Google Photos and Bard. Bing also has a reverse image search feature.

ChatGPT, read this: Text and math recognition

When it comes to text recognition, ChatGPT shows impressive results, particularly with clear, neatly handwritten text or printed words. 

It's a mixed bag with translations, though. In my tests, ChatGPT's reading of handwritten French was passable, but it amusingly mistook a bottle of black rice vinegar for premium sake when interpreting Japanese—you don't want to make that mistake when you're bringing a gift for a dinner party! Meanwhile, when I used Google Lens, it accurately translated a Japanese sign that ChatGPT told me was "too blurry" to read. (Another perfect example of how using the multi-agent approach lets you play to each of the tools’ strengths.) 

Here's a cool thing though: ChatGPT can recognize written math formulas, which is way easier than typing them out. But solving them? Not its strong suit. It tries, but don't bet your homework on it—after all, it's a prediction engine that's just trying to figure out what word comes next. When I put it to the test on my old macroeconomics assignments it gave wrong but plausible answers 4 out of 4 times. 

Regardless, the ability to input formulas is one big advantage over Lens, even if you have to do most of the heavy lifting from there.

Tip: There are some ChatGPT plugins specifically for math, so it feels like a win-win to use them together.

Find this: ChatGPT image search

Now that ChatGPT uses Bing to search the web, you've got options to retrieve information: either using ChatGPT's internal "knowledge," or using external knowledge from the web. The default for ChatGPT 4 is to dynamically choose the best model, so it decides for you whether it should search or not. 

I found that if you ask about a specific element in an image, it tends to search, but if you ask an interpretive question about the contents of the image, it usually will attempt to answer based on its internal knowledge. 

But rather than relying on its decisions, a better habit to get into is asking it explicitly to use search—or not.


Image of a wine bottle with a ChatGPT prompt asking for what the wine tastes like

Image of a wine bottle with a ChatGPT prompt asking for tasting notes

When I asked it to give me tasting notes on a certain wine from a picture of the bottle's label, it was able to seek out the exact wine by reading the text and searching for it through Bing. Meanwhile, when it used its internal knowledge, it gave me a description of the typical flavor profile of Chablis instead.

The ability to search is great when Bing search finds a reputable site, but awful when it lands on a high-ranking site that’s less authoritative.  My wine search surfaced information from Wine.com from the winemaker themselves along with professional descriptions of the wine, so it was pretty solid. But in other tests, I've seen it end up on a less reliable site and retrieve that information instead, which is much less useful.

For now, you'll have to double check ChatGPT’s work by doing research on your own to make sure it isn't digging up false information or information from questionable sources.

Tip: Monitor as it searches to see what it is looking for and on what sites. You can also explicitly ask it to tell you what it searched for.

Go deeper: ChatGPT image analysis

For me, this is the real meat of what ChatGPT image input can do: You can analyze the image to see whether or not it fits with a theme, or whether it resonates with a certain persona. 

To test it, I gave ChatGPT six possible images for a fictional sci-fi/paranormal-themed podcast and asked which would fit with the overall theme. It rated all six, dropping one as a bad fit—an assessment I agreed with.

But how detailed would it get? Turns out, pretty detailed. I gave it a synopsis of an Outer Limits episode and asked which one was the best fit based on the episode description.

ChatGPT response about the best image for an Outer Limits episode

When I asked how I could improve the image to better fit the theme, it gave some pretty interesting ideas, specifically referencing various parts of the actual episode. A good illustrator could have taken these suggestions and altered the image based on those suggestions.

Conclusion

This is yet another way ChatGPT is becoming multimodal, with its newfound ability to see, hear, and speak. I believe that multimodal is going to be one of the most important strains of AI. Even though the tools are brand new, thinking in terms of multiple types of inputs is a skill that everyone should be starting to develop.

Not to mention, ChatGPT now has all the power to exceed my capabilities in obscure music video trivia. Dang it!

ChatGPT response identifying a building and what music video it appears in


Image of a Beastie Boys music video with the prompt "what building is this" and a correct response

Featured articles:

No items found.

Articles you might find interesting

Product Updates

High Fidelity Remote Recording Just Got a Whole Lot Easier

Earlier this week, Zoom released update 5.2.2, which includes a new feature they’re calling “High Fidelity Audio” mode. Peter Kirn over at Create Digital Music covered the update, writing that it would be excellent for musicians. Zoom has always been the easiest way to record remote interviews, but it came with a cost: the audio quality wasn’t as high as a local recording. We wondered how High Fidelity Audio mode would sound for podcasts and remote interviews, so we tested it ourselves — and the results are incredible. Listen for yourself.

Podcasting

How to keep your creator brand and your personal life in harmony

While it’s important for every show to have a clear, consistent brand, it’s tougher to be clear and consistent as a normal, imperfect, real-life person. We asked two podcasters to share their best tips for navigating the relationship between a public brand and a private life.

Video

Best webcam recorder software for Mac, Windows, and browser

Discover the best webcam recorder software for your needs with our guide, complete with each app's pros, cons, and best features.

Podcasting

Your podcast probably needs a sensitivity reader. Here's how to use one

If you’re writing about a lived experience outside your own — especially of a person from a marginalized group — it’s hard to know whether you’re representing that experience accurately and respectfully. That’s why many creators use sensitivity readers.

Video

16 TikTok video ideas for when your creative well runs dry

Whether you've just created your TikTok account and don't know where to start, or you've posted a few videos and need some inspiration, we've got TikTok video ideas for you.

Related articles:

Share this article

Get started for free →