Chihuahua or muffin? My search for the best computer vision API

This popular internet meme demonstrates the alarming resemblance between chihuahuas and muffins. The images are a staple of presentations in the Artificial Intelligence (AI) industry, and I have used them in my own.

But one question I haven’t seen anyone answer is this: just how good is modern AI at resolving the ambiguity of an image that could be either a chihuahua or a muffin? For your entertainment and education, I’ll be investigating this question today.

Binary classification has been possible since the perceptron algorithm was invented in 1957. If you think AI is hyped now, the New York Times reported in 1958 that the invention was the beginning of a computer that would “be able to walk, talk, see, write, reproduce itself and be conscious of its existence.” While perceptron machines, like the Mark 1, were designed for image recognition, in reality they can only discern patterns that are linearly separable. This prevents them from learning the complex patterns found in most visual media.
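
To make the linear separability limit concrete, here is a minimal sketch of a single-layer perceptron in plain Python. It is an illustration written for this article, not the Mark 1's actual procedure; the function names, learning rate, and epoch count are all assumptions. It learns the logical AND function, which a straight line can separate, but can never fully learn XOR, which no straight line can.

```python
def predict(weights, bias, point):
    """Return 1 if the point falls on the positive side of the line, else 0."""
    x1, x2 = point
    return 1 if weights[0] * x1 + weights[1] * x2 + bias > 0 else 0


def train_perceptron(samples, epochs=20, learning_rate=1.0):
    """Train with the classic perceptron update rule.

    samples: list of ((x1, x2), label) pairs with label 0 or 1.
    Returns the learned (weights, bias).
    """
    weights = [0.0, 0.0]
    bias = 0.0
    for _ in range(epochs):
        for point, label in samples:
            error = label - predict(weights, bias, point)
            # Nudge the decision line toward any misclassified point.
            weights[0] += learning_rate * error * point[0]
            weights[1] += learning_rate * error * point[1]
            bias += learning_rate * error
    return weights, bias


def accuracy(samples, weights, bias):
    """Fraction of samples the trained perceptron classifies correctly."""
    correct = sum(predict(weights, bias, p) == label for p, label in samples)
    return correct / len(samples)


# AND is linearly separable; XOR is not.
AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

and_accuracy = accuracy(AND, *train_perceptron(AND))
xor_accuracy = accuracy(XOR, *train_perceptron(XOR))
print("AND accuracy:", and_accuracy)
print("XOR accuracy:", xor_accuracy)
```

However long the XOR case trains, at least one of its four points stays misclassified, because a single line cannot put the two "1" corners on one side and the two "0" corners on the other.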

No wonder the world was disillusioned and an AI winter ensued. Since then, multi-layer perceptrons (popular in the 1980s) and convolutional neural networks (pioneered by Yann LeCun in 1998) have greatly outperformed single-layer perceptrons in image recognition tasks.

With large labelled data sets like ImageNet and powerful GPU computing, more advanced neural network architectures like AlexNet, VGG, Inception, and ResNet have achieved state-of-the-art performance in computer vision.

Computer vision and image recognition APIs

If you’re a machine learning engineer, it’s easy to experiment with and fine-tune these architectures by using pre-trained models and weights in either Keras/TensorFlow or PyTorch. If you’re not comfortable tweaking neural networks on your own, you’re in luck. Virtually all the leading technology giants and promising startups claim to “democratize AI” by offering easy-to-use computer vision APIs.

Which one is the best? To answer this question, you’d have to clearly define your business goals, product use cases, test data sets, and metrics of success before you can compare the solutions against each other.

In lieu of a serious inquiry, we can at least get a high-level sense of the different behaviors of each platform by testing them with our toy problem of differentiating a chihuahua from a muffin.


Conducting the test

To do this, I split the canonical meme into 16 test images. Then I used open source code written by engineer Gaurav Oberoi to consolidate results from the different APIs. Each image was pushed through the six APIs listed above, which return high-confidence labels as their predictions. The exceptions are Microsoft, which returns both labels and a caption, and Cloudsight, which uses human-AI hybrid technology to return only a single caption. This is why Cloudsight can return eerily accurate captions for complex images, but takes 10–20 times longer to process.
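
The consolidation step can be pictured with a short Python sketch. This is not Gaurav Oberoi's code and does not call any real vendor API; the function name `top_labels`, the threshold, and the sample responses under the placeholder names `api_a` and `api_b` are all invented for illustration. It assumes each API's response has already been normalized into a list of (label, confidence) pairs.

```python
from typing import Dict, List, Tuple

Prediction = Tuple[str, float]  # (label, confidence between 0 and 1)


def top_labels(
    results: Dict[str, List[Prediction]],
    min_confidence: float = 0.5,
    limit: int = 3,
) -> Dict[str, List[str]]:
    """Keep each API's most confident labels, ordered best first."""
    summary = {}
    for api_name, predictions in results.items():
        # Drop low-confidence guesses, then sort what remains by confidence.
        confident = [p for p in predictions if p[1] >= min_confidence]
        confident.sort(key=lambda p: p[1], reverse=True)
        summary[api_name] = [label for label, _ in confident[:limit]]
    return summary


# Made-up placeholder responses for one image, NOT real API output.
sample_results = {
    "api_a": [("food", 0.97), ("muffin", 0.91), ("dog", 0.12)],
    "api_b": [("bread", 0.88), ("snout", 0.40), ("cake", 0.61), ("dessert", 0.55)],
}

summary = top_labels(sample_results)
for api_name, labels in summary.items():
    print(api_name, labels)
```

Putting every service's answer into the same shape is what makes a side-by-side comparison possible, since each vendor returns its labels in a different response format.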

Below is an example of the output. To see the results of all 16 chihuahua versus muffin images, click here.

How well did the APIs do? Every API except Microsoft, which mistook this muffin for a stuffed animal, recognized that the image was food. But there was no agreement about whether the food was bread, cake, cookies, or muffins. Google was the only API to return “muffin” as its most probable label.

Let’s look at a chihuahua example.


Again, the APIs did rather well. All of them realized that the image is a dog, although a few of them missed the exact breed.

There were definite failures, though. Microsoft returned a blatantly wrong caption three separate times, describing the muffin as either a stuffed animal or a teddy bear.

Google was the ultimate muffin identifier, returning “muffin” as its highest confidence label for 6 out of the 7 muffin images in the test set. The other APIs did not return “muffin” as the first label for any muffin picture, but instead returned less relevant labels like “bread”, “cookie”, or “cupcake.”
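
The “first label” comparison above amounts to a top-1 hit rate per API. Here is a minimal Python sketch of that tally, assuming you have collected each API's highest-confidence label for every muffin image. The function name `top1_hit_rate`, the placeholder API names, and the label lists are invented for illustration and are not the results of this test.

```python
from typing import Dict, List


def top1_hit_rate(first_labels: Dict[str, List[str]], expected: str) -> Dict[str, float]:
    """Fraction of images whose highest-confidence label equals the expected one."""
    target = expected.strip().lower()
    rates = {}
    for api_name, labels in first_labels.items():
        hits = sum(1 for label in labels if label.strip().lower() == target)
        # Guard against an API that returned nothing at all.
        rates[api_name] = hits / len(labels) if labels else 0.0
    return rates


# Made-up first labels for four images, NOT the results reported in this article.
placeholder_first_labels = {
    "api_a": ["muffin", "Muffin", "bread", "muffin"],
    "api_b": ["bread", "cookie", "cupcake", "bread"],
}

rates = top1_hit_rate(placeholder_first_labels, "muffin")
for api_name, rate in rates.items():
    print(api_name, rate)
```

Top-1 is a strict measure: an API that ranks “muffin” second behind “bread” scores the same as one that never mentions muffins at all.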

However, despite its string of successes, Google did fail on this specific muffin image, returning “snout” and “dog breed group” as predictions.


Even the world’s most advanced machine learning platforms are tripped up by our facetious chihuahua versus muffin challenge. A human toddler beats deep learning when it comes to figuring out what’s food and what’s Fido.

So which computer vision API is the best?

