Teaching Machines to Retrieve Visual Data Like Our Brains
A few weeks back I was trading emails about TED presentations with my friend Charles. His favorites all focused on visualization technology and concepts, which prompted me to come back to an idea I’ve had for years. My first written draft of the concept is below, following links to the TED presentations it references.
TED Talks – Johnny Lee: Creating tech marvels out of a $40 Wii Remote
TED Talks – Blaise Aguera y Arcas: Jaw-dropping Photosynth demo
I have a theory that we can teach machines to deal with data the way our brains do, and that if we can accomplish this, user interfaces sitting in front of large data stores will become simple, efficient, and very fast without any loss of quality as we perceive it. Perception is the key concept we can leverage here.
The example I’ve always presented is the following:
- think of a beautiful waterfall that you’ve visited before, picture the waterfall in your mind right now,
- think of the exact point where the falling water hit the rocks or pond or stream below, think about the detail of the water and the rocks, think about where you were standing as you gazed at the waterfall.
Now, if you were able to do this, then my point is made. When your brain recalls visual memories it doesn’t recall every detail immediately; it brings in detail as needed. As I prompted you to think about specific elements of the waterfall scene, your brain loaded the necessary additional detail into memory. These details were irrelevant to your initial visual memory of the waterfall when I first prompted you to think about it, and only became necessary as you “navigated” it during the following seconds.
But in the digital world, if I ask my computer to show me the picture I once took at a waterfall, it has to load every pixel of the scene, which means the maximum data load up front before I can begin any processing or navigating within that image. To my brain and my recollection of that scene, most of that data is irrelevant up front. So the challenge is: how do we teach machines to retrieve only the minimum amount of information we need, at the time we need it, in order to achieve the efficiency our brains use every day? When you think about it, our brains are massively efficient because they retrieve only the minimum amount of information needed to process a thought. Machines are far less efficient in this respect and have much to gain from operating the way we think.
A good place to start thinking about a process for brain-like efficiency is a large visual user interface (reference: Johnny Lee’s presentation), where normally the greatest amount of information is pre-loaded by default. In our new model, you ask your computer for the waterfall scene and it loads a very small image that contains all of the colors, shapes, and other critical elements of the scene but very little detail. Then, as you click (the tactile parallel of what you did in my exercise above to direct your brain to focus on each element I prompted you to visualize in detail), the image enlarges and lets you drag, rotate, and spin it, the same things your brain lets you do in your visual memory. Thus the statement that the amount of data in the scene isn’t the problem (reference: Blaise Aguera y Arcas’s presentation) rings true, because you, the user, can only comprehend and use a finite number of pixels within your viewing area.
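Here is a minimal sketch of the "small image first, detail on click" model, written as a simple image pyramid. It is only an illustration under my own assumptions: the image is a grid of brightness values whose sides are powers of two, and the names build_pyramid and fetch_region are hypothetical, not taken from either presentation.

```python
def downsample(img):
    """Halve width and height by averaging each 2x2 block (assumes even dimensions)."""
    h, w = len(img), len(img[0])
    return [
        [
            (img[y][x] + img[y][x + 1] + img[y + 1][x] + img[y + 1][x + 1]) / 4
            for x in range(0, w, 2)
        ]
        for y in range(0, h, 2)
    ]


def build_pyramid(img, min_size=2):
    """Return [full-res, half-res, ...] until the short side reaches min_size."""
    levels = [img]
    while min(len(levels[-1]), len(levels[-1][0])) > min_size:
        levels.append(downsample(levels[-1]))
    return levels


def fetch_region(pyramid, level, x0, y0, x1, y1):
    """Return a rectangle (given in full-res coordinates) at the requested level."""
    scale = 2 ** level
    img = pyramid[level]
    cx0, cy0 = x0 // scale, y0 // scale
    cx1 = max(x1 // scale, cx0 + 1)  # always return at least one pixel
    cy1 = max(y1 // scale, cy0 + 1)
    return [row[cx0:cx1] for row in img[cy0:cy1]]


# Toy 16x16 "photo": brightness increases from left to right.
photo = [[x for x in range(16)] for _ in range(16)]
pyramid = build_pyramid(photo)

overview = pyramid[-1]                          # coarsest level, loaded first
detail = fetch_region(pyramid, 0, 4, 4, 8, 8)   # full-res patch fetched on "click"
```

In this toy case the overview is 4 pixels and the clicked patch is 16, so the viewer has loaded 20 values out of 256 and still has both the overall shape of the scene and full detail where the attention went.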
Your brain has a similar pixel resolution (figuring it out may be the answer to all of this), and anything that isn’t currently represented within that frame of resolution is of no concern to you. Just as roughly 30 frames per second is fast enough to trick the brain into seeing motion when we are really displaying quickly changing still images, there may be a simple pixel-resolution threshold of detail for the average human brain.
Here’s how I would test this. I would project a wall-sized image with a high definition projector on the wall in front of me. I would then install an eye-tracking device near the screen so that I could face the screen and have my eyes tracked instantly. You could use a four-point calibration tool for this, just like Johnny Lee does with the Wii Remote for his infrared whiteboard, but in this case you would point your eyes at each of the four points to calibrate your position relative to the display. You could even take this to the next level by using infrared head tracking so that, after calibration, you could move around the room with your changing position constantly recalibrating (but that’s for version 2.0).
Using my current example, the computer would contain all of the information about the waterfall scene but would only display high resolution detail for the number of pixels my brain can comprehend, directly around the area where I focus my eyes. If I look away, the screen goes back to a very simple low-data environment, displaying just the number of pixels my brain can comprehend spread across the entire screen. If my brain can only comprehend about 5,000 pixels, then an entire 100 inch screen would go back to a very simple representation of shapes, colors, and so on while I focused away from it. Theoretically, if we get the pixel count (resolution) right, I should still be able to recognize the scene on the screen in my peripheral vision. Then, the second I focus back on the screen, the image would recall the detailed pixel information it needs to complete the area around where my eyes are pointed. Think about it: that’s nearly the world we already live in. What’s in your peripheral vision right now is very low-data; what your eyes are focusing on right now is very high-data.
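The display side of this experiment can be sketched in a few lines. This is a toy under my own assumptions: the image is a grid of brightness values, the eye tracker has already been calibrated and hands us a gaze point in screen pixels, and the function names, radius, and block size are all hypothetical placeholders rather than measured properties of human vision.

```python
def block_average(img, block):
    """Replace each block x block tile with its mean, keeping the image size."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for by in range(0, h, block):
        for bx in range(0, w, block):
            ys = range(by, min(by + block, h))
            xs = range(bx, min(bx + block, w))
            mean = sum(img[y][x] for y in ys for x in xs) / (len(ys) * len(xs))
            for y in ys:
                for x in xs:
                    out[y][x] = mean
    return out


def foveate(img, gaze, radius, block=4):
    """Full detail within `radius` pixels of gaze (x, y); coarse blocks elsewhere.

    gaze=None means the viewer is looking away, so the whole frame stays coarse.
    """
    coarse = block_average(img, block)
    if gaze is None:
        return coarse
    gx, gy = gaze
    r2 = radius ** 2
    return [
        [
            img[y][x] if (x - gx) ** 2 + (y - gy) ** 2 <= r2 else coarse[y][x]
            for x in range(len(img[0]))
        ]
        for y in range(len(img))
    ]


# Toy 8x8 checkerboard: fine detail that disappears when averaged.
board = [[(x + y) % 2 for x in range(8)] for y in range(8)]
looking = foveate(board, gaze=(2, 2), radius=1)   # sharp only around (2, 2)
away = foveate(board, gaze=None, radius=1)        # uniform gray everywhere
```

A real version would re-run foveate every time the tracker reports a new gaze point; the open question from above is how small the radius and how coarse the blocks can get before the viewer notices.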
The ramifications of this are incredible. Think of the efficiencies. Here we are trying to drive high definition streaming video into every screen in our house and onto our mobile phones, and we’re eating up tons of bandwidth and processing power doing it, when our brain actually needs very little information about an image in order to “see” it. In the video game world alone the gain in processing power would be immense and would enable all types of improvement and optimization. Entire 3D worlds would only need to be computed in the smallest amounts, with the appropriate detail provided on demand when the brain requests it by pointing the eyes at a specific element. This creates a digital world with a perfectly optimized visual user interface: point your eyes at it and you see it, look away and it nearly disappears.
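To put a rough number on the savings, here is some back-of-the-envelope arithmetic. The screen size, focus radius, and peripheral block size are illustrative guesses of mine, not measurements, and pixel_budget is a hypothetical helper; it also slightly overcounts, because it keeps coarse blocks underneath the focus circle.

```python
import math


def pixel_budget(width, height, fovea_radius, periphery_block):
    """Compare a full frame with a foveated one (all inputs are assumptions)."""
    full = width * height
    # Full-detail pixels inside the circle of focus.
    fovea = min(full, round(math.pi * fovea_radius ** 2))
    # One averaged value per coarse block for the rest of the screen.
    periphery = math.ceil(width / periphery_block) * math.ceil(height / periphery_block)
    foveated = fovea + periphery
    return {"full": full, "foveated": foveated, "ratio": foveated / full}


# A 1920x1080 frame, a 150-pixel focus radius, and 16x16 peripheral blocks.
budget = pixel_budget(1920, 1080, fovea_radius=150, periphery_block=16)
print(budget)
```

With these made-up settings the foveated frame needs under 5% of the pixel values of the full frame. The real figure depends entirely on the threshold discussed above, which is exactly what the experiment would have to find.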
My hunch is that if you created the HD projection screen environment I described above, you would not actually notice what was going on, because every time you looked at something it would instantly appear in great detail. It might be frustrating to try to catch a section of the image in low resolution, since you would know that it existed just outside the focus of your eyes. If it were done fast enough (and with the data optimization inherent to this model it could be done extremely quickly), you shouldn’t be able to tell the difference; it is the same trick as the magic of the motion picture. An onlooker, someone the image was not reacting to, would see what was going on instantly, and it would probably annoy the crap out of them. But this technology could also support multiple people at once by providing detail in every area of the image that at least one person in the room was looking at at any point in time. With most images, especially moving ones, the areas that we focus on are quite limited and probably pretty consistent from person to person, so much of the optimization I predict here should remain even in a multi-person environment.
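The multi-person case can be sketched as a union of focus circles: a pixel gets full detail if at least one viewer is looking near it. As before, this is an illustration under my own assumptions (gaze points arrive in screen pixels, every viewer gets the same radius), and detail_mask and detail_fraction are hypothetical names.

```python
def detail_mask(width, height, gazes, radius):
    """True where at least one viewer's focus circle covers the pixel."""
    r2 = radius ** 2
    return [
        [
            any((x - gx) ** 2 + (y - gy) ** 2 <= r2 for gx, gy in gazes)
            for x in range(width)
        ]
        for y in range(height)
    ]


def detail_fraction(mask):
    """Share of the screen that has to be shown in full detail."""
    total = sum(len(row) for row in mask)
    return sum(cell for row in mask for cell in row) / total


# Two viewers looking at different spots on a 10x10 screen.
apart = detail_fraction(detail_mask(10, 10, [(2, 2), (7, 7)], radius=1))
# Two viewers looking at the same spot: the circles overlap completely.
together = detail_fraction(detail_mask(10, 10, [(2, 2), (2, 2)], radius=1))
```

Here apart comes out to 0.1 and together to 0.05, which is the point of the last sentence above: the more viewers agree on where to look, the less extra detail a second or third person costs.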