I was recently reminded of an image that’s oddly relevant to a particular area of AI: inference from sparse data. It’s an area for which the current state of the art is rather depressing compared to what the media hype would have us believe about AI more broadly. For the record, I have no idea who made this image; I have only encountered it as it continues to wander through social media in anonymized form.
After you’re done giggling at “uuuuu,” take a moment and appreciate all the hoops you actually had to jump through to understand it enough to laugh at it. It’s not as trivial as you might think. Let’s break down some of the steps.
- To appreciate the image, we need to figure out a mapping of words (or strings) to concepts.
- The first thing we need to observe is that there are three syllables and three main components to a hamburger: top bun, middle stuff, and bottom bun.
- We must then observe that the syllables are mapped to parts of the burger: “ham” means the top bun, “bur” means the middle stuff, and “ger” is the bottom bun.
- Finally, we must realize that the word spelling left-to-right describes the food structure top-to-bottom – order matters (see the quick sketch after this list).
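Spelled out mechanically, that syllable-level reading might look something like the little Python sketch below (the part names are just my own labels for what’s in the picture):

```python
# The whole-syllable mapping from the observations above, read top to bottom.
SYLLABLE_TO_PART = {"ham": "top bun", "bur": "middle stuff", "ger": "bottom bun"}

def read_burger_word(word):
    """Split a word into 3-letter syllables and map each one to a burger part."""
    return [SYLLABLE_TO_PART[word[i:i + 3]] for i in range(0, len(word), 3)]

print(read_burger_word("hamburger"))  # ['top bun', 'middle stuff', 'bottom bun']
print(read_burger_word("gerburger"))  # ['bottom bun', 'middle stuff', 'bottom bun']
```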
Great – now we know what a “gerburger” is and could probably get there just from observing the top row. But it doesn’t stop there.
- Next, we take the same reasoning and basically recurse to smaller units. Each syllable is 3 letters long and the middle layer has 3 ingredients: lettuce, tomato, and…ok actually that’s a cheese slice on a burger patty, but let’s ignore the cheese and just call it a burger patty (we’re talking about hamburgers after all, not cheeseburgers).
- Applied to the middle, we get a mapping of letters to ingredients. Now we understand what “uuuuuu,” “ur,” and “geru” are all about (see the sketch just below).
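Extending the earlier sketch to this second, letter-sized level is straightforward. One caveat: I’m assuming the letters of “bur” map to lettuce, tomato, and patty in that top-to-bottom order; the table is easy to swap around if the picture disagrees.

```python
# Two-level decoding: whole syllables where possible, single middle-layer letters
# otherwise. The letter-to-ingredient order (b = lettuce, u = tomato, r = patty)
# is my assumption about the picture's top-to-bottom layout.
SYLLABLES = {"ham": ["top bun"],
             "bur": ["lettuce", "tomato", "patty"],
             "ger": ["bottom bun"]}
LETTERS = {"b": "lettuce", "u": "tomato", "r": "patty"}

def decode(word):
    """Greedily read 3-letter syllables first, then single ingredient letters."""
    layers, i = [], 0
    while i < len(word):
        if word[i:i + 3] in SYLLABLES:
            layers.extend(SYLLABLES[word[i:i + 3]])
            i += 3
        elif word[i] in LETTERS:
            layers.append(LETTERS[word[i]])
            i += 1
        else:
            return None  # not a word in this burgerverse
    return layers

print(decode("geru"))  # ['bottom bun', 'tomato']
print(decode("ur"))    # ['tomato', 'patty']
```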
That’s actually quite a few steps and requires recognizing patterns at two levels (syllables vs. letters). From those observations, we can even go further – a “gerrrger” surely consists of two beef patties between two bottom buns. Similarly, while not technically part of the image, we can probably also say that “plink” is not part of this burgerverse and that “hhhh” is somewhat problematic. A lot of knowledge from few examples given how many possible combinations exist (just for the whole-syllable cases involving exactly 3 syllables, we see a mere 6 examples out of 3*3*3 = 27 possibilities).
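For what it’s worth, the little decoder above handles “gerrrger” the same way (two patties sandwiched by two bottom buns) and comes up empty on “plink” and “hhhh.” And counting the whole-syllable space is a one-liner:

```python
from itertools import product

SYLLABLES = ["ham", "bur", "ger"]

# All whole-syllable words built from exactly three syllables: 3*3*3 = 27 of them.
three_syllable_words = {"".join(w) for w in product(SYLLABLES, repeat=3)}
print(len(three_syllable_words))            # 27
print("gergerger" in three_syllable_words)  # True
print("plink" in three_syllable_words)      # False: not part of this burgerverse
```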
Let’s assume that we have machine-readable labels for each component of the burger and we’re mapping one symbol sequence to another symbol sequence – that gets rid of the whole image recognition/segmentation component of the problem and turns it into a task one would expect is easier for a machine. However, even that simplification doesn’t really take away the core parts of the inference outlined above. Of course, it would be easy if you encoded each syllable or letter mapping manually – but that’s basically cheating and avoiding inference altogether. If we want a system that can see “hamburger” and “hamhamham” and then correctly infer what “gergerger” is in the same way that we do…that’s quite a task, particularly when we throw “uuuu” into the mix.
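Concretely, the simplified task might look something like this (the component label names are made up; only the strings come from the joke itself):

```python
# The image-free version of the problem: a burger is already a machine-readable
# sequence of component labels, and the task is to produce the matching string.
train = {
    ("TOP_BUN", "MIDDLE", "BOTTOM_BUN"): "hamburger",
    ("TOP_BUN", "TOP_BUN", "TOP_BUN"):   "hamhamham",
}
query = ("BOTTOM_BUN", "BOTTOM_BUN", "BOTTOM_BUN")  # a person immediately says "gergerger"
```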
Now, one could be a complete contrarian and say something like “we can’t actually infer anything from this with total certainty, so there!” That would certainly make the machine’s task really easy – just punt on each query that isn’t already in the training data. We may know that three top buns is “hamhamham,” but the contrarian has a point: four top buns could be “zebra” and we have no way to prove otherwise unless we can actually see that part of the burgerverse. But that’s not what people do in practice. Those who aren’t trying to be insufferable smart-asses are going to say that four top buns is “hamhamhamham.” If we want to replicate human behavior at any level, we must infer general things from sparse examples.
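To make the contrast concrete, here are the two strategies side by side on that simplified symbolic task (same made-up labels as before):

```python
train = {
    ("TOP_BUN", "MIDDLE", "BOTTOM_BUN"): "hamburger",
    ("TOP_BUN", "TOP_BUN", "TOP_BUN"):   "hamhamham",
}

def contrarian(stack):
    """Punt on anything that isn't literally in the training data."""
    return train.get(stack)

def generalizer(stack):
    """Assume one 3-letter syllable per component, induced from the aligned examples."""
    syllable = {}
    for comps, word in train.items():
        for comp, i in zip(comps, range(0, len(word), 3)):
            syllable[comp] = word[i:i + 3]
    return "".join(syllable.get(c, "?") for c in stack)

four_top_buns = ("TOP_BUN",) * 4
print(contrarian(four_top_buns))   # None ("could be 'zebra' for all we can prove")
print(generalizer(four_top_buns))  # hamhamhamham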
Neural nets are a popular approach to throw at arbitrary learning problems these days. The task might be doable if you took a significant amount of time to train a neural net on the data over, and over, and over…and over again. Maybe. We’d probably have a better chance with a recurrent net that explicitly treats the data as a sequence – but I’d still have to see it to believe it (perhaps needless to say, my prior experiences with neural nets haven’t given me great confidence in them for these types of tasks). We might get some correct answers, but we might also get some weirdness. Either way, we’d almost certainly get garbage if we only showed the examples to the system once or twice. To have any chance at success, we’d have to embark on a grueling path of training that doesn’t really seem to reflect the way in which we were able to easily figure out some generalizable rules from just a few examples.
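If I were to actually try it, a first attempt might look something like the toy sketch below: a tiny recurrent tagger that predicts one syllable per component, trained on the same two examples over and over. Everything in it (the label names, the architecture, the epoch count) is my own illustration, it conveniently sidesteps the letter-level part of the problem, and there’s no guarantee it comes back with “gergerger.”

```python
import torch
import torch.nn as nn

COMPONENTS = ["TOP_BUN", "MIDDLE", "BOTTOM_BUN"]   # made-up labels, as before
SYLLABLES = ["ham", "bur", "ger"]
comp_idx = {c: i for i, c in enumerate(COMPONENTS)}
syl_idx = {s: i for i, s in enumerate(SYLLABLES)}

# The only two word-level examples we give it.
train_pairs = [
    (["TOP_BUN", "MIDDLE", "BOTTOM_BUN"], ["ham", "bur", "ger"]),
    (["TOP_BUN", "TOP_BUN", "TOP_BUN"],   ["ham", "ham", "ham"]),
]

class SyllableTagger(nn.Module):
    """Tiny recurrent model: predict one syllable for each burger component."""
    def __init__(self, hidden=16):
        super().__init__()
        self.embed = nn.Embedding(len(COMPONENTS), hidden)
        self.gru = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, len(SYLLABLES))

    def forward(self, comp_ids):              # comp_ids: (batch, seq_len)
        h, _ = self.gru(self.embed(comp_ids))
        return self.out(h)                    # (batch, seq_len, n_syllables)

model = SyllableTagger()
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

# Over, and over, and over... and over again.
for epoch in range(500):
    for comps, syls in train_pairs:
        x = torch.tensor([[comp_idx[c] for c in comps]])
        y = torch.tensor([[syl_idx[s] for s in syls]])
        logits = model(x)
        loss = loss_fn(logits.reshape(-1, len(SYLLABLES)), y.reshape(-1))
        opt.zero_grad()
        loss.backward()
        opt.step()

# The query: three bottom buns. No guarantee this comes out as "gergerger".
query = torch.tensor([[comp_idx["BOTTOM_BUN"]] * 3])
pred = model(query).argmax(dim=-1)[0].tolist()
print("".join(SYLLABLES[i] for i in pred))
```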
Of course, this is not to say that all inference problems are easy for humans. For example, suppose the syllable/letter mapping changed depending on what the top ingredient is – it immediately becomes a more challenging task if the word for three bottom buns is something like “gerbopger” instead of “gergerger.” Aside from probably not being funny, a more complex mapping would require additional examples. It would probably also require studying the examples carefully to figure out what’s going on, more like the multitude of passes that a neural net requires. But, for a simple left-to-right, top-down, and recursive pattern…we somehow just “get it,” and we do so almost immediately. It’s like we’re just built to make those kinds of inferences – it’s part of our intuition or common sense.
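Just to make the “gerbopger” hypothetical concrete, here is one arbitrary rule that would produce it; the rule itself is entirely made up, purely to show how a context-dependent mapping changes the flavor of the problem:

```python
BASE = {"TOP_BUN": "ham", "MIDDLE": "bur", "BOTTOM_BUN": "ger"}

def name_burger(stack):
    """Made-up context-dependent rule: if the topmost component is a bottom bun,
    interior bottom buns are written "bop" instead of "ger"."""
    word = []
    for i, comp in enumerate(stack):
        interior = 0 < i < len(stack) - 1
        if comp == "BOTTOM_BUN" and stack[0] == "BOTTOM_BUN" and interior:
            word.append("bop")
        else:
            word.append(BASE[comp])
    return "".join(word)

print(name_burger(("BOTTOM_BUN",) * 3))                  # gerbopger
print(name_burger(("TOP_BUN", "MIDDLE", "BOTTOM_BUN")))  # hamburger
```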
While there are people researching the aspects of human intuition and common sense that enable us to figure out “hamger” so quickly, it’s still an open problem to get those capabilities into a machine. I’m hopeful that one day this blog post will become obsolete and you’ll be able to ask your phone what a “gerger” is if you give it definitions for a “hamburger” and a “hamhamham,” but that day is not yet here. So, for now, take a moment to appreciate your brain and all of the interesting work it did just to laugh at “uuuuu.”