Gesture control offers an excellent input solution; however, when it is camera-based, it is limited by line of sight and field of view, requiring multiple cameras and significant processing power. These trade-offs work against achieving a lightweight device with a battery that lasts a full day. A neural interface wearable wristband can deliver pointing-device functionality and even enhance built-in gesture control cameras with interactions beyond field-of-view boundaries and line-of-sight limitations. Such an interface promises a superior experience, ushering in the mass adoption that could make smart glasses as ubiquitous as the smartphone.
Life was simple. Merely 12 years ago you could easily distinguish between face-worn device categories. Google Glass was the first consumer-level pair of Augmented Reality (AR) glasses, and the Oculus DK1 was the first Virtual Reality (VR) headset. There were several attempts at creating AR and VR consumer products by Vuzix and Sony, but none achieved the same level of market readiness or mainstream appeal that Glass or Oculus eventually did.
Mixed Reality (MR) became a thing in January 2015 when Microsoft announced the HoloLens, an untethered holographic computer, and Magic Leap leaped into the category in 2017. Extended Reality (XR) followed just a year later, when Qualcomm launched the Snapdragon XR1 platform, dedicated to extended reality experiences and applications.
Then in 2021 Mark Zuckerberg renamed Facebook to Meta to reflect the company's broader ambitions beyond social media and its focus on building the metaverse. And in June 2023, Apple unveiled the first "spatial computer" - the Apple Vision Pro.
In our first XR report, What Provides the Best Experience to Interact with Smart Glasses?, we asked "What's in a name?" and noted that any face-worn device allows the user to do two things: display various types of digital elements in the user's environment, and interact with these elements. The complexity of the digital overlays and the richness of the interaction will determine the device's utility.
Augmented Reality is the fusion of digital information elements with elements in the physical world. These heads-up display (HUD) type experiences overlay data without requiring users to look away from their usual viewpoints. The digital elements are basic, mostly in the form of icons or text, and present information that is not directly tied to the user's immediate physical surroundings. The display is designed to be visible without obstructing the user's normal field of vision, with resolutions typically in the range of 480-640 x 360-850 pixels, on a monocular device that displays information to only one eye.
Interaction and input can be achieved using a touchpad mounted inside the frame along the temple area, allowing users to control the display and navigate through the interface with simple gestures (swipe left and right, tap to select or go back). It may also support voice control, a phone touch screen, a designated controller, or a computer mouse. The North Focals' primary interaction method was the Loop, a small ring worn on the finger which featured a joystick-like nub that users could manipulate to navigate the interface, select items, and control various functions on the glasses.
Virtual Reality involves creating an entirely digital environment that immerses users, completely replacing their physical surroundings. These immersive experiences typically use head-mounted displays (HMDs) to project the virtual world directly in front of the user's eyes. VR provides a fully immersive experience, often encompassing a 360-degree field of view. Users are surrounded by the virtual environment, which is designed to be as engaging and realistic as possible. The displays typically offer high resolutions, starting at around 1080 x 1200 pixels per eye or even higher, on binocular displays - each eye has its own display, creating a stereoscopic 3D effect.
Interaction and Input occur through various input devices. These may include handheld controllers with buttons and joysticks, touchpad controls, voice commands, and even motion sensors which track hand and body movements. Head tracking is used to follow the user's head movements, adjusting the display accordingly to maintain immersion.
Mixed Reality is the integration of digital information elements into the physical world, allowing for interaction between real and virtual objects. MR seamlessly blends digital content with the physical environment, enabling users to interact with both virtual and real-world objects simultaneously. This integration creates a dynamic experience where digital elements are contextually overlaid onto the real world, allowing for meaningful interaction between the two realms. Users may work simultaneously with various screens, applications, and widgets, and can pin a certain element to a specific location; that element will remain there when the user returns to that location. The display is designed to be visible without obstructing the user's normal field of vision, typically with high resolution, in the range of 720 to 1080 pixels per eye, on binocular displays.
Interaction and Input may include hand and gesture recognition, allowing users to interact with digital elements using natural hand movements for direct manipulation of holograms. Another method is eye-tracking technology, which allows for more intuitive interactions by understanding where the user is looking. Additional examples include voice control, designated controllers with haptic feedback and six degrees of freedom (6DoF) tracking, and more traditional interfaces.
The introduction of the Passthrough feature in VR and AR allows users to see their real-world environment while wearing a device that physically blocks their view of the real world. Passthrough enables digital elements to be overlaid onto the user's physical environment, facilitating tasks like productivity, immersive gaming, and communication.
Passthrough has blurred the lines between the AR/VR/MR product categories. It was introduced in the original HTC Vive (2016), which included a front-facing camera that enabled basic passthrough functionality. This feature allowed users to see a grayscale view of their surroundings without removing the headset, primarily for safety and boundary setup.
Spatial Computing, a term popularized by Apple at the Apple Vision Pro launch, is now used as an alternative term for face-worn, glasses, and headset form-factor devices. Extended Reality (XR) is used as an umbrella term for AR/MR/VR experiences. And the Metaverse - that promise has yet to be fulfilled.
This introduction has briefly elaborated on the various form factors, experiences, displays, and input methods of face-worn devices. It is clear that the richness of the display, the core experiences with digital elements, and the scope of the interaction determine the user experience. To become ubiquitous, smart glasses must be sleek, elegant, and lightweight. Looks sell.
In our first XR report, What Provides the Best Experience to Interact with Smart Glasses?, we also explored the Apple Vision Pro input user experience, the Meta approach toward the neural interface wristband form factor, and Neuralink's long-term vision of transforming movement intent into digital commands at the speed of thought. We concluded that simple point, click, and drag pointing functionality is optimal for most user experiences.
In our second XR report, Unlocking Gesture Control: The Rise of a Neural Input Wristband as the Next Generation's Pointing Device, we contemplated how a neural input wristband can reduce the dead weight of face-worn devices by offering a familiar and comfortable gesture set such as point, click, and pinch and drag. We demonstrated that the fusion of IMU and SNC signals can provide the same accuracy and user experience as camera-based gesture control input.
In this XR report we will focus on providing experiences beyond the boundaries of gesture recognition cameras, and on the various integration modalities in which gesture recognition can be utilized, whether by cameras or by neural interfaces.
Part 1 of the report will introduce various concepts which illuminate how the pace of technology advancements has always been dictated by user interfaces. We'll cover pre-digital to modern-era interface advancements and the strong correlation between GUI and input. We'll cover user input and feedback, direct and indirect manipulation, input device types, and screen size and its relation to pointing device functionality, and advocate that certain input functionalities and gesture types are optimal for specific display types.
In Part 2 we shall explore the origins, evolution, technologies, and boundaries of gesture control. We'll cover how it started with a wearable approach, shifted to camera-based technologies, went back to wearables, and is nowadays firmly fitted as a built-in solution for smart glasses. We then analyze the line-of-sight and field-of-view limitations of camera-based solutions, and their equivalents in neural interface technology.
In Part 3, we introduce the work we've achieved in collaboration with Qualcomm on the Lenovo ThinkReality A3 smart glasses. We will demonstrate how we've been able to take gesture control and interaction beyond gesture recognition boundaries, by utilizing fingertip pressure gradation control, interactions beyond field-of-view boundaries, and laser-pointer functionality and features.
Let's dive in.
PART 1: PRE-DIGITAL TO MODERN-ERA INPUT INTERFACES
Human-Machine Interaction (HMI) is conveyed through a user's input and machine output, by means of a user interface and an interaction device [1]. A user forms an intent, expressed by selecting and executing an input action. The machine interprets the input command and presents the output result, which the user perceives in order to evaluate the outcome.
Human-Machine Interaction
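As a minimal illustration of this loop (our own sketch, not taken from the cited paper), the stages can be walked through with a toy command interpreter standing in for the machine:

```python
# Illustrative walk-through of the interaction stages described above
# (our own sketch; the "machine" here is a toy command interpreter).

COMMANDS = {"open_file": "file opened", "close_file": "file closed"}

def interaction_cycle(intent: str) -> bool:
    action = intent                            # user expresses the intent as an input action
    output = COMMANDS.get(action, "error")     # machine interprets and executes the command
    perceived = output                         # user perceives the presented output
    return perceived != "error"                # user evaluates the outcome against the intent

print(interaction_cycle("open_file"))          # True: the outcome matches the intent
print(interaction_cycle("fly"))                # False: the machine could not interpret it
```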
In the pre-digital era, the punch wheel was an early example of a mechanical device used to input data or control instructions into machines. It featured a rotating wheel with holes or notches in specific patterns. Such patterns were read sequentially, allowing the machine to execute corresponding actions, an early form of programmed control. An example of such devices is the music box, where punch wheels control the sequence of notes played by aligning holes with a reading mechanism, producing programmed melodies.
The punch card emerged as a successor to the punch wheel, offering a more flexible and scalable method for data input and control in automated systems. It consisted of a stiff paper card with holes punched in specific positions to represent data or instructions, allowing for efficient storage, sorting, and processing.
The Jacquard loom, invented by Joseph Marie Jacquard in 1801, used punch cards as an input method to control the pattern of the weave. The punch cards encoded specific weaving instructions, allowing for the automatic production of intricate designs. Herman Hollerith later adapted the punch card concept for data processing in the 1890 U.S. Census.
In this pre-digital era, expressing the intent was a physically laborious task with costly consequences in case of errors. The machine presented its interpretation in the form of a task, such as weaving fabric, and the user perceived and evaluated the outcome through the finished product or result.
Keyboards became a standard interface in the 1970s and 1980s with the rise of personal computers and screens, allowing users to input text, execute commands, and navigate software. This era introduced the Command Line Interface (CLI), a text-based input method that allows users to interact with computer systems by typing commands into a terminal. It provided precise control over computing tasks, enabling users to perform file manipulation, system configuration, and automation through scripting.
Punch cards and command line interfaces are forms of Indirect Manipulation. They require users to issue commands through an intermediary format rather than directly interacting with objects. With punch cards, users encode instructions via holes in predefined patterns that machines interpret and execute. Similarly, in a CLI, users input text commands like rm filename.txt, which the system processes to perform actions. Both methods necessitate an abstraction layer where users translate their intentions into specific command languages.
The introduction of the Graphical User Interface (GUI) revolutionized computer input by enabling Direct Manipulation of visual elements on the screen, with the computer mouse playing an essential role. GUIs replaced text-based command input with interactive components like windows, icons, and menus. The computer mouse allows users to point, click, and drag these elements, making actions such as opening files, moving objects, and navigating applications intuitive and accessible.
Traditionally, input devices are categorized as pointing devices or character devices. A pointing device, e.g. a computer mouse, is used to input position, motion, and pressure. A character device, e.g. a computer keyboard, is used to input text.
The basic computer mouse capabilities are Navigation and Pointing: the ability to navigate in a 2-dimensional space, and to manipulate - select or interact with - certain elements. Navigation and Pointing. Keep these two functions in mind, as they are the fundamental elements of HCI.
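To make the distinction concrete, here is a minimal sketch (our own illustration) of the two input categories as data structures; the field names are assumptions for the example, not a standard API:

```python
from dataclasses import dataclass

# Illustrative only: a pointing device reports position, motion, and pressure,
# while a character device reports text. Field names are hypothetical.

@dataclass
class PointerEvent:
    x: float          # 2D position
    y: float
    dx: float         # relative motion since the last event
    dy: float
    pressure: float   # 0.0 (no press) .. 1.0 (full press)
    clicked: bool     # selection / manipulation

@dataclass
class CharacterEvent:
    text: str         # e.g. a typed character or a transcribed word

ev = PointerEvent(x=10.0, y=20.0, dx=1.5, dy=-0.5, pressure=0.7, clicked=True)
print(ev)
```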
Along with the computer mouse, common pointing device products are the trackpad/directional pad and the gaming controller. Newer technologies include the touchscreen, gesture recognition cameras and radars, IMU-based wearables, and neural interfaces. As for character input, "speech-to-text" and voice assistants are now used to transcribe text or perform tasks based on verbal speech or commands.
Pointing devices. Left: a 1968 prototype of the first mouse (Getty Images); a 1983 Nintendo Famicom controller with D-Pad (Nintendo); a 1977 Atari CX40 joystick (Atari)
The pace of technology advancements has always been dictated by user interfaces.
The future of computing is tilting increasingly toward face-worn smart glasses, which require instant and relevant output interaction through a set of always-on sensors, based on the user's movement, aim, and perspective - translating the user's intentions and inputs into actions. Input interfaces for future devices will have to be re-invented as well.
As discussed in the introduction, GUI design for face-worn devices is quite versatile, and is mostly dependent on the display size and resolution, which determines the device form-factor and dimensions. The complexity of the digital overlays and the richness of the interaction will determine the devices’ utility.
Let's briefly explore input functionality for three types of devices: a screenless smart glasses device, a monocular AR heads-up display, and an MR device.
Meta Ray-Ban smart glasses are designed to provide a seamless experience for capturing moments, staying connected, and enjoying media. They feature dual 12MP ultra-wide video recording cameras, integrated open-ear speakers for audio playback, and multiple microphones for voice capture and calls. The input method for the device includes a touchpad located on the right temple, utilizing swiping and tapping gestures. The main navigation functions are listed below (a small illustrative mapping sketch follows the list):
Single Tap - play/pause audio or answer/end a call.
Double Tap - skip to the next audio track.
Triple Tap - go back to the previous audio track.
Swipe Forward - increase the volume.
Swipe Backward - decrease the volume.
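As a toy illustration of how such a touchpad gesture set maps onto a small set of actions, here is a hedged sketch; the event and action names are hypothetical and do not reflect Meta's actual firmware or API:

```python
# Illustrative touchpad gesture-to-action mapping (hypothetical names;
# not Meta's actual firmware or API).

GESTURE_ACTIONS = {
    "single_tap":     "play_pause_or_answer_end_call",
    "double_tap":     "next_track",
    "triple_tap":     "previous_track",
    "swipe_forward":  "volume_up",
    "swipe_backward": "volume_down",
}

def handle_touchpad_gesture(gesture: str) -> str:
    """Return the action bound to a recognized touchpad gesture."""
    return GESTURE_ACTIONS.get(gesture, "ignore")

print(handle_touchpad_gesture("double_tap"))  # -> next_track
```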
Everysight Raptor is a set of augmented reality (AR) smart glasses designed specifically for cyclists, featuring a monocular heads-up display (HUD) of 872x500 pixels. The input method for the device was a touchpad in the forward portion of the arm at the right temple, controlled using swiping and tapping. The main navigation functions are:
Swipe forward or backward - rotate the carousel, using one finger in a long continuous swipe motion.
Tap - select an item. The item selected is always the one that is centered.
Swipe down - go back one screen.
Tap and Hold - show the list of running apps and settings/adjustments.
Double Tap - activate the camera.
Microsoft HoloLens 2 is a set of mixed reality (MR) smart glasses designed specifically for professional use, featuring dual 2K holographic displays, one per eye. The input method for the device includes advanced hand tracking for intuitive interaction and eye-tracking control. Hand gestures are used to interact with holograms:
Air Tap - to select items
Bloom - to open the Start menu
Tap and Hold - to drag and manipulate objects
Hand rays - to target and interact with distant objects
Scroll - to scroll through lists and content
Pinch to Zoom - to zoom in/out content
The Meta Ray-Ban glasses have no screen, the Raptor has a minimal display screen, and the HoloLens 2 has a very rich display. There are many similar products in the market that use similar input methods for the same display type.
It is clear that as the display becomes richer, the input interactions also become more complex and sophisticated. That implies a strong correlation between GUI and HCI.
We can classify the display types of these three product categories in the following way:
Screenless glasses, e.g.: Meta Ray-Ban glasses, Snap Spectacles 3, Bose Frames, Amazon Echo Frames
Monocular devices, e.g. EverySight Raptor, Google Glass Edition 2, Vuzix M400, Epson Moverio BT-40S, RealWear HMT-1
Mixed Reality headsets, e.g. Microsoft HoloLens 2, Apple Vision Pro, Magic Leap 2, Lenovo ThinkReality A3
The green area exemplifies the richness of display per product category
Image sources: Wired
The Meta Ray-Ban glasses GUI uses audio and action-result cues for the user to receive feedback. The Raptor's GUI features an icon-based navigation system with a grid layout of icons to navigate through. The HoloLens 2 uses a full Graphical User Interface, i.e. 2D/3D spatial navigation.
What are the classic types of pointing devices suitable for each product category? We may consider the following input scheme:
Screenless glasses can be controlled using simple Controller functionality: an analog stick for swiping and buttons for selection and skipping.
Monocular devices can be controlled using simple Directional Pad functionality: arrows or a pad for navigation and a button or tap for selections.
Mixed Reality headsets can be controlled using simple Mouse Pointing functionality: moving a pointer for navigation, buttons for selection, and both for dragging.
As briefly discussed in the XR team's first report, input interaction with a GUI, in the context of Fitts's law, can be more easily understood using the metaphor of driving a car.
Inputting commands into a screenless GUI is similar to driving on a highway - only a slight nudge of the steering wheel is required to get back on course. In the same way, simple finger movements while the wrist isn’t moving can be used when activating a controller to interact with a screenless GUI.
Inputting commands on a monocular device requires a bit more user attention - browsing through icons, which is analogous to driving in urban areas. Navigation is through large icons when the selectable element is highlighted or centered on the display.
Driving inside a crowded garage requires a good degree of focus, which is relevant when interacting with small digital elements or inline text editing on Mixed Reality headsets.
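The "index of difficulty" behind this metaphor can be made concrete with the common Shannon formulation of Fitts's law, ID = log2(D/W + 1), where D is the distance to the target and W is the target width. The example values below are illustrative only, not measurements:

```python
import math

def index_of_difficulty(distance: float, width: float) -> float:
    """Shannon formulation of Fitts's law: ID = log2(D / W + 1), in bits."""
    return math.log2(distance / width + 1)

# Illustrative targets (arbitrary units): a wide "highway lane", a medium
# "urban" icon, and a small "parking spot" text caret.
print(index_of_difficulty(distance=200, width=200))  # ~1.0 bit  (screenless-like)
print(index_of_difficulty(distance=200, width=40))   # ~2.6 bits (monocular icons)
print(index_of_difficulty(distance=200, width=4))    # ~5.7 bits (MR inline text)
```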
Qualitative Index of Difficulty, Display Category and Input Type.
Source: Wearable Devices Ltd.
How should one use gestures to interact with each device category?
For screenless GUI devices, which can use a minimal input scheme, one can use discrete finger movements to input commands. A tap, a double-tap, and index or thumb finger movements, with the wrist orientation up or down, can provide 8 input gestures, which is more than enough for navigation and selection, and aligns with Miller's number of objects an average human can hold in short-term memory [2]. The benefit of such gestures is that they require neither high attention nor display feedback.
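One way to read this gesture budget (our interpretation of the set described above) is four discrete finger actions crossed with two wrist orientations, which yields exactly eight commands; the sketch below simply enumerates that space:

```python
from itertools import product

# Illustrative enumeration of a discrete, eyes-free gesture vocabulary:
# four finger actions x two wrist orientations = 8 commands (one possible
# reading of the gesture set described above).
finger_actions = ["tap", "double_tap", "index_move", "thumb_move"]
wrist_orientations = ["wrist_up", "wrist_down"]

gestures = [f"{w}+{f}" for w, f in product(wrist_orientations, finger_actions)]
print(len(gestures))   # 8, within Miller's 7±2 short-term memory span
print(gestures)
```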
For monocular displays, flicks - hand swipes and scrolls - can be used for navigation, and tap/double-tap gestures to input commands. The subtle visual feedback of the highlighted selectable icon supports a simple and tactile gesture input design, offering clear, distinct directional gestures that minimize the likelihood of accidental presses.
For Mixed Reality and other rich displays (e.g. VR/XR/spatial, smart TV), a pointer and gestures offer a comfortable and familiar interface and a great user experience. Controlling a cursor using small wrist movements, tapping on objects, and drag-and-move element manipulation using a combination of wrist movement and fingertip pressure is probably the most desirable control method, and it has been adopted by the most advanced MR and spatial devices in the market.
Discrete gestures input for minimal or screenless devices
Flicks and Taps for monocular or heads-up displays
Pointing gestures for Rich displays
To summarize Part 1, we've covered how in the pre-digital era user input was predetermined and costly to generate, and noted that user feedback is received through audio and action-result cues; in the digital era, user input has evolved from keyboard to mouse, controllers, pads/screens, and nowadays gesture recognition, voice, wearables, and neural interfaces. We've also elaborated on the scope of the GUI: while in the pre-digital era the "GUI" was real-life mechanics, in the digital era it has spanned 1-dimensional CLIs, 2-dimensional screens, and spatial visualizations.
Then we analyzed three product categories, each with a specific type of GUI and display - screenless, monocular, and mixed reality - and laid out the input methods used for each one. We noted that the pace of technology advancements has always been dictated by user interfaces. We then matched the optimal pointing device functionality with each product category, and suggested the optimal gesture types per device category.
PART 2: GESTURE CONTROL - ORIGINS, EVOLUTION, TECHNOLOGIES, AND BOUNDARIES
Gesture Control technology allows users to interact with digital devices through hand, eye and body movements. It offers a natural and intuitive form of interaction. While nowadays the term is mostly related to the use of cameras or optical sensors, the first significant commercial application of gesture control using finger and hand movements was using a wearable.
Following the groundwork at Myron W. Krueger's artificial reality lab, VIDEOPLACE (1985), where users experimented with interacting with computer-generated graphics using their body movements, the first significant application of gesture control technology was VPL Research's development of the DataGlove in 1987.
The DataGlove was a wearable device that captured hand movements and finger positions, allowing users to interact with virtual environments through natural gestures. It employed fiber optic sensors to detect finger movements by measuring light transmission changes, a magnetic tracking system to determine hand position and orientation, microprocessors to process sensor data, and flexible circuitry to integrate these components while maintaining the glove's flexibility and user comfort.
The technology was licensed to Mattel, which released the Power Glove in 1989 - a controller accessory for the Nintendo Entertainment System (NES). It had traditional NES joypad controller buttons on the forearm (a directional pad and buttons), buttons labeled 0-9, and a program button. To input commands, the user pressed the program button and a numbered button. Along with the controller, the player could perform various hand motions to control a character on-screen. It could detect roll, and used sensors to detect four positions (2 bits) per finger for four fingers. Super Glove Ball and Bad Street Brawler were released with specific features for use with the Power Glove, and included moves that could only be performed with the glove.
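To illustrate how coarse that finger data was, the sketch below unpacks a single byte holding four 2-bit flex values (0-3) for four fingers; the bit packing order is an assumption made for illustration, not the Power Glove's documented data format:

```python
# Illustrative only: unpack four 2-bit finger-flex values (0 = straight,
# 3 = fully bent) from one byte. The bit layout is assumed for the example,
# not taken from the Power Glove's actual protocol.

def unpack_finger_flex(raw_byte: int) -> dict:
    fingers = ["thumb", "index", "middle", "ring"]
    return {name: (raw_byte >> (2 * i)) & 0b11 for i, name in enumerate(fingers)}

print(unpack_finger_flex(0b11_01_00_10))
# -> {'thumb': 2, 'index': 0, 'middle': 1, 'ring': 3}
```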
While it sold nearly one million units and was a commercial success, the glove's controls were incredibly obtuse, making it impractical for gaming. However, it was adopted by the emerging Virtual Reality community in the 1990s to interact with 3D worlds, since it was cheaper than the DataGlove.
Power Glove, American model
In the following years, most of the significant research was conducted using cameras, in academic laboratories.
In late 2010 Microsoft launched the Kinect, a motion-sensing input device that enables users to control and interact with Xbox 360 games and applications through physical gestures and voice commands. The Kinect revolutionized gesture control by utilizing an RGB camera, depth sensors, and a multi-array microphone to track users' movements and gestures in three-dimensional space. This enabled full-body motion capture and voice recognition, allowing for interaction with games and applications through physical movements and voice commands without traditional controllers. The Kinect's key hand gestures were:
Wave: Raise one hand and wave it side to side - used to start interactions or select items.
Push: Extend your hand forward as if pressing a button - used to select or activate items.
Swipe: Move your hand horizontally or vertically across your body - used to navigate menus or move between screens.
Raise Hand: Lift one hand above your head and hold it - used to initiate interactions or bring up menus.
Grip/Release: Close your hand into a fist to "grip" and open it to "release" - used to drag and drop objects.
Steering Wheel: Hold your hands as if gripping a steering wheel and turn them - used to simulate steering in driving games.
The Kinect revolutionized human-computer interaction and gesture control by introducing motion-sensing capabilities to the mainstream. It enabled intuitive interactions using gestures and voice commands, significantly improving accessibility, and its versatile technology also found applications in virtual and augmented reality, expanding the scope of gesture control in various domains beyond entertainment.
In July 2013, Leap Motion launched its first product, the Leap Motion Controller, a groundbreaking device that allows users to control and interact with their computers using natural hand and finger movements. The Leap Motion Controller uses two infrared cameras and three LEDs to create an interactive 3D space, tracking the precise movements of the user's hands and fingers with incredible accuracy. It offered hand skeletal tracking data like the position of each bone of a finger or the orientation of the palm of the hand. It enabled touch-free interaction with a wide range of applications. Key gestures supported by the Leap Motion Controller included:
Point: Extend a finger to point and select items.
Pinch: Pinch fingers together to grab and manipulate objects.
Swipe: Move a hand or finger horizontally or vertically to navigate menus and screens.
Circle: Move a finger in a circular motion to perform specific commands.
Grab: Close a hand into a fist to "grab" objects and open it to release them.
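As a generic illustration of how such a tracker's skeletal output can be turned into a pinch gesture (our own sketch, not the Leap Motion SDK), one can threshold the distance between the thumb and index fingertips, with hysteresis to avoid flicker:

```python
import math

# Generic pinch detection from skeletal tracking data (illustrative sketch,
# not the Leap Motion API). Fingertip positions are 3D points in millimetres.

PINCH_ON_MM = 25.0    # distance below which a pinch starts
PINCH_OFF_MM = 40.0   # distance above which the pinch is released (hysteresis)

def update_pinch(thumb_tip, index_tip, pinching: bool) -> bool:
    d = math.dist(thumb_tip, index_tip)
    if not pinching and d < PINCH_ON_MM:
        return True       # pinch started: grab the object under the cursor
    if pinching and d > PINCH_OFF_MM:
        return False      # pinch released: drop the object
    return pinching       # otherwise keep the previous state

state = False
state = update_pinch((0, 0, 0), (20, 5, 3), state)   # tips close together -> True
print(state)
```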
The Leap Motion Controller revolutionized gesture control by providing high-precision tracking in a compact, affordable device, paving the way for new applications in various fields, including virtual reality, education, and digital art. It was compatible with HTC, Oculus and additional headsets and offered after-market gesture control functionality.
In 2014, the Myo Armband, a gesture control armband developed by Thalmic Labs, brought wearable gesture control technology back into the limelight. This device used electromyography (EMG) sensors to detect muscle activity and motion sensors to track arm movements, allowing users to control digital devices through gestures. The Myo Armband's reintroduction of gesture control into the mainstream highlighted its potential across various applications, from gaming and presentations to drone piloting and virtual reality interactions, marking a significant advancement in wearable technology since the Nintendo Power Glove.
The Myo Mapper was a software application developed to facilitate the mapping of gestures detected by the Myo Armband to various output commands. It enabled users to create custom mappings for different gestures, allowing for versatile control over a range of devices and applications. Gestures supported by the Myo Armband:
Double Tap: tap your index finger on your thumb twice - used to select items.
Wave Left: Move your palm left - used for navigation or switching between items.
Wave Right: Move your palm right - used for navigation or switching between items.
Spread: Spread your fingers wide - used to pause or resume actions.
Fist: Clench your fist - used to select items or perform actions like clicking.
Rotate, Pan: wrist movements - used to adjust volume or scroll through lists.
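To make the EMG signal path concrete, here is a hedged sketch of a typical first processing step: rectifying the raw multi-channel signal and computing a windowed root-mean-square (RMS) envelope per channel, which would then feed a gesture classifier. The channel count and window length are arbitrary illustrative choices, not Thalmic Labs' actual pipeline:

```python
import numpy as np

# Illustrative EMG preprocessing (not the Myo's actual firmware): compute a
# windowed RMS envelope per channel; such features typically feed a classifier.

def emg_rms_features(emg: np.ndarray, window: int = 50) -> np.ndarray:
    """emg: (n_samples, n_channels) raw signal -> (n_windows, n_channels) RMS."""
    n_samples, n_channels = emg.shape
    n_windows = n_samples // window
    trimmed = emg[: n_windows * window].reshape(n_windows, window, n_channels)
    return np.sqrt(np.mean(trimmed.astype(float) ** 2, axis=1))

# Example: 2 seconds of fake 8-channel EMG sampled at 200 Hz.
fake_emg = np.random.randn(400, 8)
features = emg_rms_features(fake_emg)
print(features.shape)  # (8, 8): 8 windows x 8 channels
```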
The Myo Armband gestures (WebArchive)
The Microsoft HoloLens 1, released in 2016, pioneered built-in gesture control in face-worn devices, allowing users to interact with holographic content through natural hand movements. This technology, using internal cameras and sensors, has since been adopted by devices like the Oculus Quest (2019) and HTC Vive Focus 3 (2021), and culminated in the Apple Vision Pro (2024), hailed by many for its accurate, natural, and intuitive gestures. These devices feature hand-tracking capabilities, enabling users to navigate interfaces, manipulate objects, and perform actions within virtual spaces using simple gestures for point, click, and drag-and-move functions.
Gesture control technologies involve capturing movements, processing the raw signal data in software to improve the signal-to-noise ratio, and applying machine learning algorithms to interpret the processed signals in order to recognize and classify gestures.
For vision-based gesture control, the hardware may include RGB cameras that capture standard color images, depth cameras that measure the distance of objects, and infrared cameras that track movements in various lighting conditions using infrared light. In addition, IMU (Inertial Measurement Unit) sensors may be integrated to complement camera data with motion tracking. The software contains data-processing firmware and signal-processing libraries. The algorithms may include Support Vector Machines (SVMs) for pattern recognition and classification tasks; image-processing algorithms for segmentation, dividing images into meaningful segments to isolate hands and other relevant parts; feature extraction for identifying key points and features in the images (e.g., fingertips, hand contours); tracking algorithms such as Kalman filters for predicting and smoothing the positions of moving objects for skeletal tracking reconstruction; and machine learning algorithms using neural networks to train models that recognize and classify gestures from visual data.
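As an example of the tracking stage mentioned above, here is a compact, hedged sketch of a constant-velocity Kalman filter smoothing a noisy 2D fingertip position; the noise parameters are arbitrary illustrative values, not any vendor's tuning:

```python
import numpy as np

# Minimal constant-velocity Kalman filter for smoothing a noisy 2D fingertip
# position (illustrative sketch; noise values are arbitrary).
# State: [x, y, vx, vy]; measurement: [x, y].

dt = 1.0 / 30.0                                   # 30 fps camera
F = np.array([[1, 0, dt, 0],
              [0, 1, 0, dt],
              [0, 0, 1, 0],
              [0, 0, 0, 1]], dtype=float)         # state transition
H = np.array([[1, 0, 0, 0],
              [0, 1, 0, 0]], dtype=float)         # we only observe position
Q = np.eye(4) * 1e-3                              # process noise
R = np.eye(2) * 5e-2                              # measurement noise

x = np.zeros(4)                                   # initial state
P = np.eye(4)                                     # initial covariance

def kalman_step(z):
    """Predict, then update with one noisy position measurement z = [x, y]."""
    global x, P
    x = F @ x                                     # predict state
    P = F @ P @ F.T + Q                           # predict covariance
    S = H @ P @ H.T + R                           # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)                # Kalman gain
    x = x + K @ (np.asarray(z) - H @ x)           # correct with the measurement
    P = (np.eye(4) - K @ H) @ P
    return x[:2]                                  # smoothed position

print(kalman_step([0.42, 0.11]))
```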
For wearable-based gesture control, the general structure and types of technologies are quite similar; the major difference is at the hardware level. The wearable should be snugly fitted to the wrist to accurately detect bio-potential signals and wrist movement. The snug fit ensures minimal relative motion between the device and the skin, enhancing the precision of signal acquisition and motion tracking. The interfacing medium between the skin and the wearable device is the electrodes, which are used to detect electrical activity in the body.
We now turn to the fundamentals of signal acquisition for gesture recognition. For vision-based technology these are known as “Line of Sight” (LOS) and Field of View (FOV), and for wearable-based the equivalents are Electrode-Skin Contact Quality and Sensor Coverage Area.
“Line of Sight” refers to the requirement that the sensor or camera must have a clear, unobstructed view of the user’s gesture to accurately detect and interpret it. If the view is blocked by an object the system may fail to recognize the gestures correctly.
"Field of View" refers to the observable area that a sensor or camera can capture at any given moment, and it determines the complexity of the gestures that can be recognized. A wider FOV allows the system to capture more information, thus enhancing the accuracy and flexibility of the gesture.
Electrode-Skin Contact Quality is crucial for accurate bio-potential signal detection, so as to ensure the electrodes can reliably measure bio-potential electrical activity without interference. Poor contact or obstruction will reduce the accuracy of gesture recognition.
Sensor Coverage Area refers to the area on the skin where the electrodes can effectively detect signals. Proper placement and sufficient coverage are necessary to capture the full range of signal patterns and accurately classify the correlating gestures.
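A hedged sketch of one possible contact-quality heuristic follows (an illustration of the idea, not Wearable Devices' actual method): flag a channel as poorly coupled when its signal is either nearly flat (open contact) or drifting and noisy far beyond a calibrated baseline:

```python
import numpy as np

# Illustrative electrode contact-quality check (a heuristic sketch, not the
# Mudra Band's actual algorithm). A channel whose short-window standard
# deviation is far below or far above a calibrated baseline is flagged.

def contact_quality_ok(channel: np.ndarray, baseline_std: float,
                       low_ratio=0.1, high_ratio=10.0) -> bool:
    std = float(np.std(channel))
    return low_ratio * baseline_std <= std <= high_ratio * baseline_std

baseline = 1.0                                  # from a calibration recording
good = np.random.randn(200)                     # plausible bio-potential window
flat = np.zeros(200)                            # electrode lifted off the skin
print(contact_quality_ok(good, baseline))       # True
print(contact_quality_ok(flat, baseline))       # False
```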
A good example of vision-based gesture recognition for MR is comparing the HoloLens 1 gesture control with that of the Apple Vision Pro, as we covered in the XR team's first report.
The HoloLens 1 required users to position their hand in front of their nose to perform gestures, obstructing their real-world view, because that was where the gesture camera's field of view was centered. Apple placed multiple outward-facing cameras, which allows comfortable body postures, with hands kept around waist level in relaxed spatial comfort. The advancements in cameras, sensors, and algorithms allow AVP users to input subtle gestures such as tap, pinch and hold, and glide, whereas the HoloLens gestures were the very specific "air tap" and "bloom", which are rougher and less familiar and intuitive.
This, of course, comes at a price. The HoloLens 1 was priced at $3,499 back in 2016, the same price as the 2024 Apple Vision Pro. The HL1 weighs 579 g (1.28 lb) and the AVP around 650 g (1.40 lb). The HL1 contained an internal rechargeable battery with average life rated at 2-3 hours of active use, or 2 weeks of standby time, whereas the AVP's external battery supports up to 2 hours of general use and up to 2.5 hours of video playback. Both devices have a "helmet" or ski-goggles form factor, with the weight and front-loaded design commonly cited by users as problematic.
On average, a pair of prescription eyeglasses typically weighs between 20 to 50 grams (0.7 to 1.8 ounces), while a pair of sunglasses generally weighs between 25 to 50 grams (0.9 to 1.8 ounces). Snap Spectacles weigh 56.5 grams, and Meta Ray-Ban glasses weigh around 49 grams. These smart glasses are priced at around the $299 mark.
For AR glasses to achieve mass market adoption, they must be lightweight and comfortable, with enough 'juice' for all-day wear and continuous use. We believe that a neural input wristband can offer the same accuracy and experience as that of the Apple Vision Pro, while massively reducing the glasses' weight, form factor, and price, and allowing a battery that lasts longer.
To summarize Part 2, we've observed that gesture control has alternated from the Power Glove wearable in the late 1980s to vision-based solutions such as the Kinect and the Leap Motion Controller in the early 2010s; it then bounced back to the Myo Armband wearable, and shifted again to built-in gesture control solutions in the HoloLens 1 and subsequent XR products.
We've presented the gestures used by each generation of technology, noticing that as time passed the gestures became smaller and more comfortable, moving from large body movements and twisting of the palm to subtle finger movements. We then surveyed the technologies at the basis of gesture recognition, and concluded that a wearable approach to gesture recognition may solve the inherent line-of-sight and field-of-view limitations of camera-based technologies.
Recapping the Wearable Devices XR team's reports so far: our 2023 white paper laid the foundation for the use of a neural input wristband gesture control product category in extended reality experiences. The first report established that the Mudra neural input wrist-wearable provides the same gesture control user experience as that of the Apple Vision Pro. The second report advocated that the pendulum is now shifting back to a wearable-based gesture recognition approach with pointing device functionality. And in Part 3 of this report we shall convey how wearable gesture control technology can solve the line-of-sight and field-of-view limitations of vision-based gesture control technologies, thus providing an enhanced user experience to accelerate the mass adoption of face-worn devices.
PART 3: BEYOND BOUNDARIES - ENHANCING USER EXPERIENCE THROUGH THE BLEND OF WEARABLE AND CAMERA GESTURE CONTROL
In June 2024 we successfully demonstrated the Mudra technology on Lenovo's ThinkReality A3 smart glasses at the Augmented World Expo (AWE) in Santa Clara.
The Lenovo ThinkReality A3 smart glasses are advanced augmented reality (AR) eyewear designed for professional use. They feature high-resolution stereoscopic 1080p displays, providing a virtual screen experience equivalent to viewing multiple monitors. The glasses are powered by the Qualcomm Snapdragon XR1 platform and offer integrated 8MP RGB cameras for video recording and streaming, along with dual fish-eye cameras for room-scale tracking. The ThinkReality A3 is lightweight and can be connected to a PC or select Motorola smartphones via USB-C, making them versatile for use in various business applications, including virtual monitors, remote assistance, and 3D visualization.
The A3 smart glasses offer versatile input methods to enhance user interaction. These methods include voice commands, head movement tracking, and camera-based gesture control. When connected to a PC or Motorola smartphone via USB-C, the smart glasses can be controlled through the device's interface.
We've demonstrated three input modalities which show how the combination of camera-based and wearable gesture control enhances the user experience:
Blended modality - Mudra input works alongside the A3 head tracking. The user navigates to areas of interest with A3 head movements, and pointing is achieved using Mudra tap or pinch-and-drag gestures. This enables comfortable body postures for input without raising the arms in mid-air.
Extended Input - we’ve used Mudra input to enhance interaction with the glasses beyond the field of view boundaries. While inside the FOV the A3 gesture control is used, and outside the FOV a laser pointer is controlled by Mudra. This benefits the user with a more streamlined interaction without the constant need to move the head.
Air stylus - Mudra technology has the unique ability to measure fingertip pressure gradations. We've created a digital art experience which lets users draw in mid-air simply by moving the hand, and control the width of the line by applying various gradations of fingertip pressure. This method offers enhanced freedom and flexibility, enabling more dynamic gestures by eliminating the need for a stylus, reducing hand strain, and providing precise sensitivity for accurate, natural-feeling drawing (a small sketch of the pressure-to-width mapping follows).
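A minimal sketch of the pressure-to-width mapping behind the air stylus experience (the value ranges are illustrative assumptions, not the actual Mudra implementation):

```python
# Illustrative air-stylus mapping (assumed ranges, not the actual Mudra
# implementation): normalized fingertip pressure controls stroke width while
# the tracked hand position supplies the stroke path.

MIN_WIDTH_PX = 1.0
MAX_WIDTH_PX = 24.0

def stroke_width(pressure: float) -> float:
    """Map fingertip pressure in [0, 1] to a brush width in pixels."""
    p = min(max(pressure, 0.0), 1.0)
    return MIN_WIDTH_PX + p * (MAX_WIDTH_PX - MIN_WIDTH_PX)

def add_stroke_point(path, hand_xy, pressure):
    """Append one drawing sample: where the hand is, and how thick the line is."""
    path.append((hand_xy, stroke_width(pressure)))

path = []
add_stroke_point(path, (120, 80), 0.2)   # light touch -> thin line (~5.6 px)
add_stroke_point(path, (124, 83), 0.9)   # firm press  -> thick line (~21.7 px)
print(path)
```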
In Part 3 we've shown how built-in gesture control cameras and a wrist-worn neural interface wearable together enhance the user experience across multiple modalities: navigation by headset and pointing by wearable; working outside the field of view using the wearable; and enriching input by adding sensitive fingertip pressure gradations.
CONCLUSION
We have witnessed how the pace of technology advancements has always been dictated by user interfaces. From the punch wheel to neural interface wearables, direct manipulation of digital objects is the best approach for face-worn devices, using simple hands-free and touchless gestures: point, click, drag. We've highlighted how the complexity of the digital overlays and the richness of the interaction will determine the devices' utility.
In the past 35 years gesture control technology has been alternating between wearables, external sensors, and built-in camera arrays. The line of sight and field of view limitations determine how comfortable the body posture and familiar the gesture can be. We’ve correlated these factors with electrode-skin contact quality and sensor coverage area for neural interfaces. We’ve demonstrated in our previous reports that point, click and drag functions can be as accurate, intuitive and natural using a neural interface beyond the line of sight and field of view limitations.
We’ve concluded with our newest demos which reveal how a wearable neural input wristband enhances face-worn devices user experience in navigation, boundless input, and sensitive fingertip gradations control.
THE MUDRA BAND
Mudra Band is the world’s first neural input wristband. It translates movement intent into digital commands to control digital devices using subtle finger and hand gestures. It connects to the Apple Watch just like any regular watch band, and lets you control Apple ecosystem devices using simple gestures. Your iPhone, iPad, Apple TV, Mac computer, Vision Pro, and additional Bluetooth controlled devices can be paired with the Mudra Band and be operated using Touchless Gestures.
The Mudra Band is equipped with three proprietary Surface Nerve Conductance (SNC) sensors. These sensors are located on the inside face of the band and keep constant contact with the skin surface. Each sensor is located approximately above the ulnar, median, and radial nerve bundles, which control hand and finger movement.
The Mudra Band also uses an IMU to track your wrist movement and speed. If you’ve moved your wrist up, down, left or right, inwards or outwards - the IMU captures the motion.
Using sensor fusion, our algorithms integrate fingertip pressure and wrist motion to determine the type of gesture you've performed. It can be a mere navigation function that uses only wrist movement, or it can also incorporate any type of fingertip pressure for pointing. Combining the two readings, motion and pressure, manifests in the magical experience of Air-Touch: performing simple gestures such as tap, pinch, and glide using a neural wristband.
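Conceptually, the fusion can be pictured as in the hedged sketch below (an illustration of the idea, not the Mudra Band's actual algorithm; the threshold and gain values are assumptions): the IMU's wrist deltas drive the cursor, while the SNC pressure reading decides whether the cursor is hovering, tapping, or dragging.

```python
# Conceptual sketch of IMU + SNC sensor fusion for Air-Touch (illustrative
# only; thresholds and scaling are assumptions, not the Mudra algorithm).

TAP_PRESSURE = 0.6        # normalized fingertip pressure that counts as a "press"
CURSOR_GAIN = 600.0       # pixels per radian of wrist rotation (assumed)

def fuse(wrist_delta, pressure, cursor, dragging):
    """One update step: wrist motion moves the cursor, pressure drives state."""
    dx, dy = wrist_delta                       # wrist rotation deltas from the IMU
    cursor = (cursor[0] + CURSOR_GAIN * dx,    # navigation: move the pointer
              cursor[1] + CURSOR_GAIN * dy)

    pressed = pressure >= TAP_PRESSURE         # pointing: fingertip pressure
    if pressed and not dragging:
        event = "tap_or_drag_start"
    elif not pressed and dragging:
        event = "release"
    else:
        event = "move"
    return cursor, pressed, event

cursor, dragging = (0.0, 0.0), False
cursor, dragging, event = fuse((0.01, -0.02), 0.8, cursor, dragging)
print(cursor, event)   # (6.0, -12.0) tap_or_drag_start
```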
If you’ve liked what you’ve read, we welcome you to Start a Movement and Join the Band at www.mudra-band.com
[1] Norman, D. A. (1984). Stages and levels in human-machine interaction. International Journal of Man-Machine Studies, 21, 365-375.
[2] Miller, G. A. (1956). The magical number seven, plus or minus two: Some limits on our capacity for processing information. Psychological Review, 63(2), 81-97.