Deep Learning Paves the Way for Occlusion-Resilient Face Recognition: A Comprehensive Review
Face recognition has emerged as one of the most dynamic and widely adopted technologies in the field of biometric identification, boasting advantages such as high reliability, intuitive operation, and contactless verification. Its applications have expanded across sectors ranging from public security and smart surveillance to daily life scenarios like mobile phone unlocking and access control. However, real-world deployment of face recognition systems is frequently hindered by unconstrained environmental factors, with facial occlusion—caused by masks, sunglasses, hats, or even natural facial expressions—being a primary technical bottleneck that degrades recognition accuracy significantly. The COVID-19 pandemic further amplified the demand for occlusion-tolerant face recognition technology, as mask-wearing became a universal norm, creating an urgent need for algorithms that can accurately identify individuals even with critical facial features obscured. Against this backdrop, deep learning has emerged as a pivotal solution to address occlusion challenges in face recognition, thanks to its exceptional ability in automatic feature extraction and complex pattern modeling. A new comprehensive review published in Computer Engineering and Applications delves into the latest research progress in occlusion-aware face recognition methods based on deep learning, analyzing cutting-edge models, algorithms, and datasets, while also outlining key challenges and future research directions for the field.
The core issue plaguing occluded face recognition lies in the loss, noise interference, and local aliasing of facial features caused by occlusion, which disrupt the inherent structural integrity of human faces and prevent algorithms from making accurate identity judgments. Traditional face recognition approaches, which rely on handcrafted features, struggle to adapt to the variability and complexity of occluded facial data, as they lack the capacity to learn robust, discriminative features from incomplete visual information. Deep learning, by contrast, leverages layered neural network architectures to automatically learn hierarchical features from raw image data, ranging from low-level edge and texture features to high-level semantic and identity features, enabling it to either exploit unoccluded facial regions effectively or repair missing features caused by occlusion. This review categorizes state-of-the-art deep learning-based occluded face recognition methods into five core research directions: utilizing unoccluded facial features, feature fusion strategies, occlusion region feature restoration and completion, generative adversarial network (GAN)-based approaches, and lightweight network models. Each category is analyzed in detail for its technical principles, performance advantages, limitations, and practical application scenarios, providing a systematic framework for understanding the current landscape of the field.
Utilizing Unoccluded Facial Features: Maximizing Valid Information Extraction
When facial regions are partially occluded, the most straightforward and effective strategy is to fully exploit the discriminative information from unoccluded areas, either by supplementing missing features through neighboring region information or by enhancing the representativeness of visible features to compensate for occluded parts. This research direction has spawned several innovative models and algorithmic improvements, with three key approaches standing out in recent studies.
First, the extraction of facial attribute features has proven to be a robust approach to occluded face detection. The Faceness-Net model designs a set of attribute-aware deep networks that extract local facial features and aggregate them from local to global to obtain face candidate regions, followed by identity recognition. A key innovation of this model is the sharing of local feature parameters and the attribute-wise extraction of facial features, which ensures that even if one facial part is occluded, the remaining parts can still be accurately localized. This design reduces the network parameters by 83% and improves overall performance by nearly 4 percentage points, delivering high stability, strong environmental adaptability, and superior operation speed, recall rate, and average precision for faces with large pose variations. However, the model's effectiveness is contingent on clear facial images, which keep training difficulty low and stability high; it struggles with facial scoring and recognition when occlusion areas are large or image quality is poor, limiting its applicability in low-quality or heavily occluded scenarios.
Second, enhancing the features of visible facial regions through attention mechanisms has become a mainstream approach, with the Face Attention Network (FAN) being a representative model. FAN integrates anchor strategies, data augmentation, and attention mechanisms, setting different attention mechanisms for feature maps at different positions of the feature pyramid based on face size—specifically, adding an Attention function to the anchors of RetinaNet. Through multi-scale feature extraction, multi-scale anchors, and a multi-scale attention mechanism based on semantic segmentation, the model implicitly learns occluded facial regions, significantly improving detection performance for occluded faces. Nevertheless, the model is trained on datasets where facial and occlusion region features are mixed, causing the attention mechanism to enhance both valid facial features and irrelevant occlusion features simultaneously. Additionally, the method of dividing attention maps based on face size cannot guarantee that faces are assigned to appropriate feature maps, which may lead to misalignment of feature extraction and compromise recognition accuracy in complex occlusion scenarios.
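To make the attention idea concrete, the sketch below shows a generic spatial-attention branch in PyTorch that reweights one pyramid level's feature map. It illustrates the general mechanism rather than FAN's exact anchor-level attention or its segmentation supervision; the channel sizes are placeholders.

```python
import torch
import torch.nn as nn

class SpatialAttentionHead(nn.Module):
    """Minimal spatial-attention sketch: a small branch predicts a per-pixel
    weight map for one pyramid level and reweights its features, so visible
    facial regions can contribute more than occluded ones."""
    def __init__(self, channels: int):
        super().__init__()
        self.attn = nn.Sequential(
            nn.Conv2d(channels, channels // 4, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // 4, 1, kernel_size=1),
            nn.Sigmoid(),                      # per-pixel weight in (0, 1)
        )

    def forward(self, feature_map: torch.Tensor) -> torch.Tensor:
        weights = self.attn(feature_map)       # (N, 1, H, W)
        return feature_map * weights           # broadcast over channels

# Usage: reweight one level of a feature pyramid.
level = torch.randn(2, 256, 40, 40)
out = SpatialAttentionHead(256)(level)
print(out.shape)  # torch.Size([2, 256, 40, 40])
```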
Third, improving loss functions to strengthen feature discriminability has become a critical research focus for optimizing deep learning models for occluded face recognition. Traditional loss functions often struggle to balance intra-class compactness and inter-class separability in occluded feature spaces, leading to degraded recognition performance. To address this, researchers have designed a series of specialized loss functions: Center Loss augments Softmax with an L2 penalty on the distance between each feature and its class center, reducing intra-class distance while preserving inter-class separation; Angular Softmax separates different-category features and aggregates same-category features on a hypersphere; ArcFace introduces an additive angular margin directly in the angular space, enlarging inter-class separation while compacting intra-class features to enhance the discriminative power of facial features. Notably, the Grid Loss function adopts a block-processing approach, dividing facial feature maps into several grids and summing the loss of each grid with the loss of the entire map to form the total loss, which reinforces the feature discriminability of each grid. Experimental results show that Grid Loss effectively improves face recognition accuracy in occluded environments, performs well in small-sample training, and incurs no significant additional time cost, making it suitable for real-time detection applications with high stability. However, the function still struggles to handle large pose variations, faces high training difficulty in complex scenarios, and its performance depends heavily on the loss design, leading to potential instability in unconstrained environments.
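The additive angular margin underlying ArcFace is straightforward to express in code. The following is a minimal PyTorch sketch of an ArcFace-style classification head; the feature dimension, scale s, and margin m are illustrative defaults, not values tied to any model in this review.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ArcFaceHead(nn.Module):
    """Minimal ArcFace-style head: features and class weights are L2-normalized,
    an additive angular margin m is applied to the target class, and the
    logits are rescaled by s before cross-entropy."""
    def __init__(self, feat_dim: int, num_classes: int, s: float = 64.0, m: float = 0.5):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_classes, feat_dim))
        self.s, self.m = s, m

    def forward(self, features: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        cosine = F.linear(F.normalize(features), F.normalize(self.weight))
        theta = torch.acos(cosine.clamp(-1 + 1e-7, 1 - 1e-7))
        target_logits = torch.cos(theta + self.m)   # margin only on the true class
        one_hot = F.one_hot(labels, cosine.size(1)).float()
        logits = self.s * (one_hot * target_logits + (1.0 - one_hot) * cosine)
        return F.cross_entropy(logits, labels)

# Usage: 512-d embeddings, 1000 identities.
head = ArcFaceHead(512, 1000)
loss = head(torch.randn(8, 512), torch.randint(0, 1000, (8,)))
```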
Feature Fusion Strategies: Integrating Multi-Source Information for Robust Recognition
Facial features do not exist in isolation; they are closely associated with contextual information (such as human body parts) and multi-modal biometric features (such as voiceprint, iris, and palmprint). Feature fusion strategies leverage this interconnection by integrating multi-source information to compensate for the loss of facial features caused by occlusion, thereby improving the robustness and accuracy of recognition systems. This direction is divided into two subcategories: contextual information fusion-based feature extraction and multi-modal biometric feature fusion.
Contextual information fusion-based feature extraction methods focus on integrating global and local contextual information of human faces, as facial appearance is inherently linked to other body parts and background environments. The Contextual Multi-Scale Region-based CNN (CMS-RCNN) is a typical model that fuses global and local contextual information, simultaneously focusing on facial region features and their contextual information, and fusing features on multi-layer feature maps to form a long feature vector for subsequent classification. This approach significantly improves recognition accuracy by leveraging supplementary contextual information, but it faces challenges in weight allocation and integration of different feature components, leading to slow inference speed and potential instability in the model. To address the limitations of CMS-RCNN, the PyramidBox face detection framework was proposed to make fuller use of contextual information: it adopts an anchor-based contextual information assistance method to learn contextual features of small, blurred, and occluded faces, designs a low-level feature pyramid network to better fuse contextual features, and introduces a Context-sensitive Prediction Module (CPM) to learn more accurate facial location and classification information from fused features. Combined with the Feature Enhance Module (FEM) based on the Receptive Field Block (RFB), PyramidBox achieves top-down inter-layer information fusion, learning more effective contextual and semantic information in both breadth and depth, and delivering improved performance on the Wider Face validation and test sets with high model stability. However, the model's feature extraction performance degrades with large occlusion areas, and the complex fusion structure increases training time and difficulty, requiring high computational resources for effective training.
The Occlusion-Adaptive Deep Network (ODN) further advances contextual information fusion by mining the geometric relationships between different facial components. ODN infers the occlusion probability of high-level features at each position through a distillation module that automatically learns the relationship between facial appearance and shape, using this probability as an adaptive weight for high-level features to reduce the impact of occlusion. Additionally, the model uses a low-rank learning module to learn a shared structural matrix, recovering lost features and removing redundant information to enhance feature purity. Experimental results verify ODN's superior performance and strong robustness to occlusion and extreme poses, but the need to infer occlusion probability leads to large adjustments in the adaptive weight of high-level features, increasing training difficulty and compromising model stability in practical applications.
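The core weighting step can be illustrated compactly. Below is a simplified PyTorch sketch of predicting a per-position occlusion probability and down-weighting likely-occluded features; it illustrates the principle only and is not the published ODN distillation or low-rank modules.

```python
import torch
import torch.nn as nn

class OcclusionWeighting(nn.Module):
    """Sketch of adaptive occlusion weighting: a small branch predicts a
    per-position occlusion probability for a high-level feature map, and
    features are suppressed where occlusion is likely."""
    def __init__(self, channels: int):
        super().__init__()
        self.occ_prob = nn.Sequential(nn.Conv2d(channels, 1, kernel_size=1), nn.Sigmoid())

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        p_occ = self.occ_prob(feats)      # (N, 1, H, W): probability of occlusion
        return feats * (1.0 - p_occ)      # down-weight likely-occluded positions

out = OcclusionWeighting(512)(torch.randn(2, 512, 7, 7))
print(out.shape)  # torch.Size([2, 512, 7, 7])
```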
Multi-modal biometric feature fusion methods address the limitations of single-modal face recognition by organically combining different biometric features (e.g., palmprint and face, fingerprint and voiceprint, iris and fingerprint) through fusion algorithms, making up for the security risks and performance degradation of single-modal technology in occluded scenarios. Feature fusion can be implemented at three levels: between different image regions, between multiple feature extraction methods, and between multiple classifiers, with a wealth of innovative models proposed in recent research. For example, some researchers extract voiceprint features using the Mel Frequency Cepstral Coefficient method and facial features using convolutional neural networks (CNNs), then fuse them through a weighted fusion algorithm; ConGAN is proposed to learn the joint distribution of multi-modal data for facial multi-attribute images and color depth images; other studies fuse features extracted by multiple CNN models (e.g., ResNet, InceptionV3, VGG19) to train a feature fusion network model, with a test accuracy exceeding 98.2% after 1,000 iterations of the training set using the Keras framework. The fusion of iris, face, and periocular features has also achieved excellent results: a weighted addition method for iris and face fusion, combined with adaptive weighted fusion of iris and periocular features at the feature level based on CNNs, delivers better performance than single-modal recognition on the CASIA-Iris-M1-S2 dataset, with lower storage space requirements and higher computational efficiency than direct feature layer concatenation and score layer weighted addition fusion methods. A face and human body-based multi-modal biometric method using VGG-16 and ResNet-50 network structures also effectively recognizes partially occluded faces and irregular facial regions, but its network structure requires increasing the number of layers to achieve effective feature representation, leading to complex training processes and high difficulty. Despite the promising results of multi-modal fusion methods, they still require further improvement in enhancing the discriminability of fused information, reducing information redundancy, realizing cross-level fusion, and dynamic fusion, with the added complexity of model design increasing overall training costs and difficulty.
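As a concrete illustration of feature-level weighted fusion, the minimal NumPy sketch below normalizes and concatenates two modality embeddings with fixed weights; the weights, dimensions, and upstream MFCC/CNN extractors are placeholder assumptions rather than settings reported by the cited studies.

```python
import numpy as np

def weighted_fusion(face_feat: np.ndarray, voice_feat: np.ndarray,
                    w_face: float = 0.7, w_voice: float = 0.3) -> np.ndarray:
    """Minimal feature-level weighted fusion: L2-normalize each modality so
    scales are comparable, then concatenate with modality weights.
    The weights here are illustrative, not values from the review."""
    face = face_feat / (np.linalg.norm(face_feat) + 1e-8)
    voice = voice_feat / (np.linalg.norm(voice_feat) + 1e-8)
    return np.concatenate([w_face * face, w_voice * voice])

# Usage with placeholder embeddings (e.g. a CNN face feature and MFCC statistics).
fused = weighted_fusion(np.random.randn(512), np.random.randn(40))
print(fused.shape)  # (552,)
```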
Occlusion Region Feature Restoration and Completion: Reconstructing Incomplete Facial Information
Deep learning’s powerful feature learning and image generation capabilities enable it to restore and complete occluded facial regions by learning the inherent structural and texture characteristics of human faces, alleviating the recognition challenges caused by missing features. This research direction focuses on reconstructing occluded parts from both local and global perspectives, enhancing the coherence of content texture processing, and addressing the difficulty of repairing large missing regions, with a variety of innovative models and algorithms proposed for feature restoration and completion.
The Locally Linear Embedding CNN (LLE-CNN) explores the use of non-facial region information to repair and complete occluded facial features, training a nearest neighbor model from a face dictionary and a non-face dictionary composed of a large number of images to refine descriptors, compensate for occluded facial information, and suppress noise in features. The model consists of three core modules: the Proposal Module, which cascades two CNN networks (including P-Net with three convolutional layers and one Softmax layer) to generate face candidate regions and extract features with a low threshold set to generate more candidates for complex occluded faces; the Embedding Module, which recovers occluded feature regions and their features by looking up the dictionary and suppresses feature noise; and the Verification Module, which verifies face regions using restored facial features and fine-tunes facial position and scale. LLE-CNN delivers outstanding performance on the MAFA occlusion dataset, but it has not yet provided results for the dataset’s annotated facial attributes (e.g., mask type, occlusion degree), and the model’s stability is still under continuous improvement.
PCANet, a feature extraction network combining CNN and Local Binary Patterns (LBP), provides local zero-mean preprocessing and PCA filter functions to extract principal component features and filter out occlusion in images. However, when the occlusion area is large, the overall features obtained are distributed near zero values, leading to severe performance degradation. To address this limitation, researchers proposed a Local Sphere Normalization method, embedding it after the first two convolutional layers of PCANet to make the feature values of local regions lie on the same sphere, enhancing the role of small feature values and suppressing the influence of large feature values to achieve feature equalization. This improved model exhibits strong robustness to illumination changes and occlusion, but the embedding of Local Sphere Normalization increases the running time of PCANet at high dimensions, with average running time increasing as the difficulty of the test set rises, limiting its real-time application potential.
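One plausible reading of the sphere-normalization step, sketched below, is a per-block L2 normalization that places each local group of feature values on a unit sphere, boosting small responses and damping dominant ones; this is an interpretation for illustration and may differ from the published formulation.

```python
import numpy as np

def local_sphere_normalize(features: np.ndarray, block: int = 8, eps: float = 1e-8) -> np.ndarray:
    """Sketch of a local normalization step: split a 1-D feature vector into
    blocks and L2-normalize each block so its values lie on the same sphere,
    enhancing small feature values and suppressing large ones.
    (Interpretation of 'Local Sphere Normalization'; the original method may differ.)"""
    out = features.astype(np.float64)
    for start in range(0, len(out), block):
        seg = out[start:start + block]
        out[start:start + block] = seg / (np.linalg.norm(seg) + eps)
    return out

print(local_sphere_normalize(np.random.randn(32)).round(3))
```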
The Inception-ResNet-v1M model, combining GoogleNet and ResNet networks and using the Triplet Loss function to learn facial features, strengthens the distinguishability between features, endowing the model with a certain degree of robustness to interference factors such as occlusion, expression changes, and pose variations. Experimental results show that the model achieves a recognition rate of 98.2% when the occlusion rate is 20% to 30%, but its performance is significantly affected when the occlusion rate exceeds 30%, making it unsuitable for heavily occluded scenarios.
Occlusion-aware GAN (OA-GAN), based on semi-supervised learning, transfers an artificially synthesized occlusion repair model learned under paired data conditions to natural facial occlusion repair tasks through adversarial transfer. The generator of OA-GAN consists of an occlusion-aware module and a face completion module: the occlusion-aware module predicts an occlusion mask for occluded images (a prerequisite for the model’s operation), which is then input to the generator together with the occluded face images to remove occlusion information; the discriminator uses adversarial loss to distinguish between real unoccluded images and restored de-occluded images, and attribute preservation loss to ensure that de-occluded images retain the original image’s attributes. The completion module adopts an encoder-decoder architecture with non-occluded feature mapping to generate textures for occluded regions, recovering synthetic face images for both occluded and unoccluded regions from input face images. OA-GAN adopts an alternating training method to achieve better network convergence and reduce training difficulty, delivering excellent recognition results on the CelebA training set and providing a new paradigm for natural facial occlusion repair.
Mask-based de-occlusion strategies have also emerged as an effective approach, with the Pairwise Differential Siamese Network (PDSN) being a representative model. Inspired by the human visual system’s ability to ignore occluded regions, PDSN designs a mask-based learning strategy to handle feature loss in face recognition, mining the correspondence between facial occluded regions and facial features, and prohibiting features from occluded regions from participating in similarity comparison. The network structure consists of a CNN backbone network (responsible for facial feature extraction) and a mask generator branch (outputting binary mask features), aiming to make features processed by the mask as similar as possible to ensure recognition accuracy. PDSN establishes a mask dictionary by collecting the differences in top convolutional features between occluded and unoccluded face pairs, recording and learning the relationship between occluded regions and damaged features; when processing occluded face images, it selects and merges relevant entries from the mask dictionary and multiplies them with extracted facial features to eliminate the impact of feature loss. The model assigns larger loss values to features with little contribution to face recognition, using the feature difference between occluded and unoccluded faces as a marker to evaluate whether feature elements are damaged, making the mask generator focus more on occluded regions. Testing on 6,000 non-training face pairs selected from the LFW dataset shows that PDSN reduces the error rate by 0.52% compared with traditional CNNs, and can maintain basic model performance by discarding damaged feature elements from occluded regions. However, PDSN faces two key limitations: the mask is unknown, so only the final convolutional layer features can be saved, leading to high feature storage space for large-scale image datasets; and the comparison speed is slow, as feature extraction is required in addition to similarity calculation, increasing training difficulty and time costs.
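The dictionary-lookup-and-mask step at the heart of this strategy can be sketched as follows; the dictionary keys, the merging rule, and the tensor shapes are illustrative assumptions rather than PDSN's exact implementation.

```python
import torch

def apply_mask_dictionary(features: torch.Tensor, mask_dict: dict,
                          occluded_blocks: list) -> torch.Tensor:
    """Sketch of the mask-dictionary idea: look up the learned feature masks
    for the occluded facial blocks, merge them (element-wise minimum keeps an
    element only if no selected mask discards it), and multiply the merged
    mask with the top convolutional features."""
    merged = torch.ones_like(features)
    for block_id in occluded_blocks:
        merged = torch.minimum(merged, mask_dict[block_id])
    return features * merged

# Usage with toy entries: a (C, H, W) feature map and two occluded blocks.
feats = torch.randn(256, 7, 7)
mask_dict = {k: (torch.rand(256, 7, 7) > 0.2).float() for k in range(25)}
clean_feats = apply_mask_dictionary(feats, mask_dict, occluded_blocks=[3, 8])
print(clean_feats.shape)  # torch.Size([256, 7, 7])
```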
A mask generation network-based occluded face detection method further optimizes mask-based strategies, improving detection accuracy by shielding the damage to facial feature elements caused by local occlusion. The model preprocesses the face training set by dividing training faces into 25 sub-regions and adding occlusion to each sub-region separately; then, a series of occluded face images and original face images are input as image pairs into the mask generation network for training to generate an occlusion mask dictionary corresponding to each occluded sub-region; finally, a combined feature mask corresponding to the occluded region of the detected face is generated by combining relevant dictionary entries, and multiplied with the deep feature map of the detected face to shield the damage to facial feature elements caused by local occlusion. Experimental results on the AR and MAFA datasets show that this method improves detection accuracy while maintaining low training time loss, and the researchers are currently exploring extending the algorithm to 3D occluded face detection, which is a key direction for future development of the technology.
A two-stage occlusion recognition model breaks the traditional single GAN-based de-occlusion approach by designing a network composed of two generators (G1 and G2) and two discriminators (D1 and D2): G1 is used to separate occlusion and synthesize occluded images, and G2 is used to synthesize de-occluded images by taking the output of G1 as input. Experimental results show that the synthesized occluded and de-occluded images are basically complementary, and the synthesized unoccluded images are highly correlated with the occlusion synthesized by G1, achieving higher scores in both Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index (SSIM). However, the two-stage processing approach incurs additional time overhead and increases training costs, requiring a balance between performance and efficiency in practical applications.
GAN-Based Approaches: Generative Modeling for Occlusion Repair and Data Augmentation
Generative Adversarial Networks (GANs) have achieved remarkable results in machine learning tasks due to their powerful generative capabilities, and their derivative models have become a research hotspot for solving occluded face image repair and dataset augmentation problems in face recognition. GAN-based approaches mainly focus on two core applications: generating a large number of occluded face samples to expand training datasets, and learning facial structural and texture features to repair occluded regions, with a variety of innovative models proposed to address different challenges in occluded face recognition.
The Adversarial Occlusion-aware Face Detector (AOFD) generates a large number of occluded face samples using GANs to expand training datasets, segments occluded regions using contextual information, and shields the impact of occluded regions on facial features through segmented masks. AOFD leverages a multi-stage target detection framework to achieve superior performance in occluded face detection, but the multi-stage structure restricts detection speed to a certain extent, and the need to generate a large number of occluded samples increases model training time and difficulty, requiring high computational resources for dataset augmentation.
The Contextual-based Generative Adversarial Network (C-GAN) makes full use of facial surrounding information to train the GAN network, with its generative network composed of an upsampling subnet and an optimization subnet: the upsampling subnet converts low-resolution images into high-resolution images for output, the discriminative network distinguishes between face/non-face and real/fake images, and the regression subnet refines facial bounding box detection. C-GAN is suitable for high-resolution image detection, but it requires sampling of low-resolution images, increasing model training time and limiting its applicability in low-resolution occluded face scenarios.
The Selective Refinement Network (SRN), built on the SSH model by modeling contextual information with filters, divides the convolutional layer output of the VGG network into three branches with similar detection and classification processes for each branch, completing multi-scale face detection by analyzing feature maps of different scales to optimize detection performance and improve accuracy. However, the output features of the middle layer lack sufficient discriminability, requiring adequate training of the added branches, which increases training difficulty and time costs. The Improved Selective Refinement Network (ISRN) optimizes the SRN algorithm, and PyramidBox++ is proposed based on the PyramidBox model by adopting a Balanced Data Anchor Sampling strategy, a Dense Context Module, and multi-task training. Experimental results show that ISRN detects 900 faces and PyramidBox++ detects 916 faces, with both algorithms improving detection accuracy for complex faces without sacrificing speed, but their enhancement effect is only significant for small-scale faces, with insufficient model stability; as facial scale increases, training difficulty and time increase significantly, limiting their performance in large-scale face detection scenarios.
The Context Encoder-Decoder, a pioneering model for deep learning-based semantic repair, adopts an encoder-decoder architecture combined with contextual information to learn image features and generate prediction maps for image regions to be repaired. Its loss function consists of two parts: image content constraint loss for the encoder-decoder part and adversarial loss for the GAN part, with the context encoder based on AlexNet and the GAN network comparing features learned by the encoder with original features, promoting each other between the generative and discriminative models to make the completed images more realistic. However, the model only judges the realism of the repaired region images and cannot guarantee the consistency between the repaired and known regions; when the shape of the missing region is variable, it causes discontinuity in the boundary pixel values of the repaired region, generating blurred or unrealistic information, indicating insufficient model stability. The researchers later mitigated this problem to a certain extent by increasing the weight of edge regions, providing a valuable direction for optimizing GAN-based repair models.
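The two-part objective can be written compactly. The sketch below combines a hole-restricted reconstruction term with an adversarial term, in the spirit of the Context Encoder; the loss weights shown are conventional illustrative values, not figures from the review.

```python
import torch
import torch.nn.functional as F

def context_encoder_loss(pred, target, hole_mask, disc_fake,
                         lambda_rec: float = 0.999, lambda_adv: float = 0.001):
    """Sketch of the two-part inpainting objective: a pixel reconstruction
    term restricted to the missing region plus an adversarial term that
    pushes the generator toward realistic completions."""
    rec = F.mse_loss(pred * hole_mask, target * hole_mask)   # content constraint on the hole
    adv = F.binary_cross_entropy_with_logits(disc_fake, torch.ones_like(disc_fake))
    return lambda_rec * rec + lambda_adv * adv

# Usage with toy tensors: a central square hole and discriminator logits.
mask = torch.zeros(2, 1, 128, 128)
mask[:, :, 32:96, 32:96] = 1.0
loss = context_encoder_loss(torch.randn(2, 3, 128, 128), torch.randn(2, 3, 128, 128),
                            mask, torch.randn(2, 1))
```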
The Geometry-aware face completion and editing (FCEN) model makes full use of the geometric prior information of the unique facial structure in the repair process, learning a geometry-aware face repair model using facial key point heatmaps and segmentation maps. The key point heatmap consists of several facial key points, and the segmentation map is composed of facial components such as eyes, nose, mouth, hair, and background, with different components represented by different pixel values. FCEN first infers the corresponding key point heatmap and segmentation map from the occluded face image; then, the concatenated occluded image, key point heatmap, and segmentation map are input into the repair model to generate content for the occluded region; finally, global and local discriminators are added to the discriminative part to promote the visual realism and overall coherence of the generated face, and a low-rank loss function is adopted to improve the model’s repair performance for irregular occluders. FCEN can generate relatively reasonable missing facial components, but it is not suitable for the repair of irregularly occluded faces, and its training requires prior acquisition of the geometric prior information of the unique facial structure, with the construction of heatmaps and segmentation maps increasing training difficulty.
Edgeconnect, a two-stage adversarial model composed of an edge generator and an image repair network, restores the edge contours of missing regions through the edge generator and fills the missing regions with the restored edge map as a prior through the repair network, thereby synthesizing more detailed textures and feature descriptions. This two-stage approach significantly improves the fineness of occluded region repair, but in actual test cases, Edgeconnect cannot fully restore real edge information: the edge generation model sometimes fails to accurately depict edges in highly textured regions, and cannot generate repair results with relevant edge information when most of the image is missing. Researchers are currently improving the edge generation system to extend the model to high-resolution repair applications, which is a key direction for the practical application of GAN-based occlusion repair models.
An improved Wasserstein GAN method addresses the limitations of occlusion parts and sizes, and the incoherence of repaired face images by using a CNN as the generator model and adding skip connections between corresponding layers to enhance the accuracy of generated images. The method introduces the Wasserstein distance in the discriminator for judgment and adds gradient penalty to perfect the discriminator, delivering excellent repair results on the CelebA and LFW face datasets. Unlike other methods that introduce additional building blocks leading to more network parameters and increased GPU memory usage, this improved Wasserstein GAN reduces training difficulty and improves performance by adding skip connections, providing a lightweight and efficient solution for GAN-based occluded face repair.
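The gradient-penalty term referred to here is the standard WGAN-GP formulation, sketched below in PyTorch; the toy critic in the usage example is only for demonstration.

```python
import torch

def gradient_penalty(discriminator, real: torch.Tensor, fake: torch.Tensor,
                     lambda_gp: float = 10.0) -> torch.Tensor:
    """Standard WGAN-GP gradient penalty: push the critic's gradient norm
    toward 1 on samples interpolated between real and repaired images."""
    alpha = torch.rand(real.size(0), 1, 1, 1, device=real.device)
    interp = (alpha * real + (1 - alpha) * fake).requires_grad_(True)
    scores = discriminator(interp)
    grads = torch.autograd.grad(outputs=scores.sum(), inputs=interp, create_graph=True)[0]
    grad_norm = grads.flatten(1).norm(2, dim=1)
    return lambda_gp * ((grad_norm - 1.0) ** 2).mean()

# Usage with a toy critic on 64x64 RGB images.
critic = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 64 * 64, 1))
gp = gradient_penalty(critic, torch.randn(4, 3, 64, 64), torch.randn(4, 3, 64, 64))
```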
Lightweight Network Models: Balancing Accuracy and Efficiency for Edge Deployment
Most of the state-of-the-art occluded face recognition methods rely on large-scale deep CNN models, which require substantial computational resources and high-performance processors, leading to efficiency issues—especially in mobile and embedded device deployments where low latency and low resource consumption are critical. The efficiency challenges mainly include model storage (a large number of weight parameters for multi-layer networks require high device memory) and prediction speed (practical applications often require millisecond-level response speeds). To address these issues, adjusting deep neural network structures and parameters to balance speed and accuracy, and compressing and accelerating deep networks without significantly degrading performance, have become new research hotspots, spawning a series of lightweight network models designed for occluded face recognition. The core design idea of lightweight models is to develop more efficient network computing methods (mainly for backbone network convolution), reducing network parameters while maintaining performance, with the key requirements of few parameters, fast speed, and high accuracy to lower training difficulty.
SqueezeNet is a representative lightweight model that achieves AlexNet-level accuracy on the ImageNet dataset with only 1/500 of the parameters (after model compression) by adopting three key design strategies: replacing 3×3 convolutions with 1×1 convolutions (reducing parameters to 1/9 of the original), reducing the number of input channels through squeeze layers, and delaying downsampling operations to provide larger activation maps for convolutional layers, retaining more feature information to improve classification accuracy. The core of SqueezeNet is the Fire module, which first compresses dimensions through a squeeze convolutional layer and then expands dimensions through an expand convolutional layer. However, the model’s good performance is contingent on a well-balanced dataset; unbalanced samples affect classification results and compromise model stability. Additionally, the ratio of the two convolution kernels (1×1 and 3×3) needs to be carefully balanced to strike a trade-off between model size and accuracy, which is a key challenge in practical deployment.
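The squeeze-then-expand structure of the Fire module is compact enough to show directly; the channel sizes in the sketch below follow a typical early SqueezeNet configuration and are illustrative.

```python
import torch
import torch.nn as nn

class Fire(nn.Module):
    """SqueezeNet-style Fire module: a 1x1 squeeze layer reduces channels,
    then parallel 1x1 and 3x3 expand layers restore them; their outputs are
    concatenated along the channel dimension."""
    def __init__(self, in_ch: int, squeeze_ch: int, expand1x1_ch: int, expand3x3_ch: int):
        super().__init__()
        self.squeeze = nn.Conv2d(in_ch, squeeze_ch, kernel_size=1)
        self.expand1x1 = nn.Conv2d(squeeze_ch, expand1x1_ch, kernel_size=1)
        self.expand3x3 = nn.Conv2d(squeeze_ch, expand3x3_ch, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.relu(self.squeeze(x))
        return torch.cat([self.relu(self.expand1x1(x)), self.relu(self.expand3x3(x))], dim=1)

out = Fire(96, 16, 64, 64)(torch.randn(1, 96, 55, 55))
print(out.shape)  # torch.Size([1, 128, 55, 55])
```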
SqueezeNext further optimizes parameter compression without using depthwise convolutions, instead directly using separable convolutions for parameter compression and decomposing large matrices into multiple small matrices through low-rank decomposition. It adopts SqueezeNet’s squeeze layers to compress input dimensions, using two consecutive squeeze layers at the beginning of each block to reduce dimensions by half each layer, and a bottleneck layer to reduce the input dimension of the fully connected layer, thereby significantly reducing network parameters. However, the use of Deep Compression incurs decompression costs, adding certain computational overhead and compromising real-time performance to a certain extent.
MobileNet, a mobile-oriented lightweight model proposed by Google, adopts depthwise separable convolution, which decomposes standard convolution into two smaller operations: depthwise convolution and pointwise convolution. Unlike standard convolution where convolution kernels act on all input channels, depthwise convolution uses different convolution kernels for each input channel (one convolution kernel for one input channel), and pointwise convolution is a standard convolution with 1×1 convolution kernels. Depthwise separable convolution first performs depthwise convolution on different input channels separately, then combines the outputs through pointwise convolution, drastically reducing computational load and model parameters with low training difficulty and high stability. MobileNet V1 achieves Inception V3-level performance in fine-grained recognition with reduced computational load and model size, but it sacrifices a certain degree of accuracy for efficiency, which is a common trade-off in lightweight model design.
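A minimal PyTorch sketch of the depthwise separable convolution block described above follows; the batch-norm and ReLU placement mirrors the common MobileNet V1 pattern, and the channel sizes are illustrative.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """MobileNet-style depthwise separable convolution: a per-channel 3x3
    depthwise convolution (groups = in_ch) followed by a 1x1 pointwise
    convolution that mixes channels."""
    def __init__(self, in_ch: int, out_ch: int, stride: int = 1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, stride=stride,
                                   padding=1, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
        self.bn1, self.bn2 = nn.BatchNorm2d(in_ch), nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.relu(self.bn1(self.depthwise(x)))
        return self.relu(self.bn2(self.pointwise(x)))

out = DepthwiseSeparableConv(32, 64)(torch.randn(1, 32, 112, 112))
print(out.shape)  # torch.Size([1, 64, 112, 112])
```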
MobileFaceNets, derived from MobileNet V2, is a lightweight face recognition network with industrial-grade accuracy and speed, with a model size of only 4 MB, designed specifically for face recognition tasks. It improves MobileNet V2 in three key aspects: replacing the average pooling layer with a separable convolution, training with the ArcFace loss function for face recognition tasks, and reducing the channel expansion factor while using the PReLU activation function instead of ReLU in the network structure to lower training difficulty. Test results on the LFW face verification benchmark show that MobileFaceNets achieves higher accuracy, faster speed, and smaller size than comparable lightweight networks, with low training difficulty and high stability, making it an ideal solution for mobile and embedded face recognition applications.
ShuffleNet reduces computational costs through group pointwise convolutions and channel shuffling, achieving higher efficiency than MobileNet V1, and ShuffleNet V2 further improves model performance by introducing the Channel-Split module. However, the model suffers from boundary effects, where a single output channel is only derived from a small portion of input channels, compromising feature integrity and model stability, which limits its performance in complex occluded face recognition scenarios.
Lightfacenet, a deep CNN model, aims to construct lightweight neural network units that alleviate the parameter redundancy and large computational load of deep neural networks, reducing training difficulty. It combines depthwise separable convolution, pointwise convolution, bottleneck structures, and squeeze-and-excitation structures, and further improves recognition accuracy through an improved nonlinear activation function, achieving 99.50% accuracy on the LFW dataset. The choice of nonlinear activation function is a key factor affecting the model's performance, requiring careful tuning for different occluded face recognition scenarios.
The Improved Multi-scale Inception Siamese Convolutional Neural Network (IMISC-NN) consists of a siamese CNN with two identical structures and shared weights, introducing the Inception model to extract richer facial features, and using a cyclic learning rate optimization strategy to accelerate training speed. The strategy finds the optimal learning rate with fewer global cycles, reducing the number of iterations required for the same recognition rate, and lowering training costs and difficulty with good convergence. IMISC-NN achieves high recognition accuracy on the CASIA-webface and Extended Yale B standard face databases, but it is currently only applicable to small-scale dataset face recognition under unconstrained conditions, indicating insufficient model stability and limiting its scalability to large-scale, complex occluded face scenarios.
An optimized MobileNet V2 model further streamlines the original network structure by removing residual blocks to reduce the number of convolutional layers and network parameters, lowering the expansion factor in the residual structure and modifying channel expansion to a parallel expansion method to reduce the actual memory access cost of the network and increase running speed. Additionally, the model fuses features from spatial separable convolution and depthwise separable convolution to complement each other and improve recognition accuracy, and changes the loss function from Softmax loss to Arcface to enhance the network’s constraint ability, making the extracted features more discriminative and robust, and increasing model stability. Under the same training conditions, the optimized model size is reduced to 2.3 MB, with a test accuracy of 99.53% on the LFW dataset and a speed five times that of the original MobileNet V2, with further reduced training difficulty, making it a high-performance lightweight solution for occluded face recognition on edge devices.
Datasets, Evaluation Metrics, and Key Challenges
A comprehensive, diverse, and well-annotated occluded face dataset is the foundation for testing and improving model performance. While there are many general face datasets available, datasets specifically designed for occluded face recognition remain insufficient. Open-source occluded face datasets include FDDB, AFW, AFLW, 300W, Wider Face, MAFA, COFW, and WFLW, among which FDDB, AFW, AFLW, and 300W are natural scene face datasets with rich scenarios suitable for occluded face detection research, while Wider Face, MAFA, and COFW are datasets with specially annotated facial occlusion attributes. MAFA is the most specialized occluded face dataset to date, consisting of 30,811 images containing 35,806 occluded faces at various occlusion scales, manually annotated with six attributes (face position, eye position, occlusion position, face orientation, occlusion degree, and occlusion type), making it ideal for constructing complex occluded face recognition datasets based on deep learning and for model training and optimization. Wider Face is a high-difficulty dataset with large variations in pose and occlusion degree, containing 32,203 images divided into 61 categories, providing a rigorous testbed for evaluating the robustness of occluded face recognition models. COFW is a small-scale occlusion detection dataset with an average occlusion rate of approximately 23%, including 329 heavily occluded images (occlusion rate exceeding 30%) and 178 slightly occluded images, suitable for research on heavily occluded face recognition algorithms.
The evaluation of occluded face recognition models relies on a set of specialized metrics to assess accuracy and robustness in unconstrained environments, with common metrics including recall, false acceptance rate (FAR), accuracy, precision, the ROC curve, AUC (area under the ROC curve), and F1-score. Accuracy measures the overall proportion of correctly classified samples, suitable for evaluating global model performance; precision measures the proportion of true positives among all samples predicted as positive, with a higher value indicating fewer false detections in face recognition; recall measures the proportion of correctly classified positive samples among all actual positive samples, with a higher value indicating better detection of occluded faces; FAR measures the proportion of negative samples misclassified as positive, a key metric for evaluating system security (a lower FAR indicates a lower probability of accepting fake faces). The F1-score combines precision and recall into a single balanced measure of model performance; the ROC curve plots the true positive rate (TPR) against the false positive rate (FPR), with a curve closer to the upper left corner indicating a better classifier; the AUC value, the area under the ROC curve (ranging from 0 to 1), provides a quantitative measure of classifier performance, with a higher value indicating better accuracy. The confusion matrix is also a commonly used tool for comprehensive model evaluation, intuitively displaying the number or proportion of correctly and incorrectly classified samples for each category, facilitating the analysis of model error patterns in occluded face recognition.
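For concreteness, the threshold-based metrics above can be computed directly from binary labels and predictions, as in the minimal NumPy sketch below (the ROC curve and AUC additionally require continuous scores and are omitted).

```python
import numpy as np

def binary_metrics(y_true: np.ndarray, y_pred: np.ndarray) -> dict:
    """Compute the basic metrics listed above from binary labels/predictions
    (1 = genuine/positive face, 0 = negative)."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    precision = tp / (tp + fp + 1e-12)
    recall = tp / (tp + fn + 1e-12)          # true positive rate
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "far": fp / (fp + tn + 1e-12),       # false acceptance rate
        "f1": 2 * precision * recall / (precision + recall + 1e-12),
    }

print(binary_metrics(np.array([1, 1, 0, 0, 1]), np.array([1, 0, 0, 1, 1])))
```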
Despite the significant progress made in deep learning-based occluded face recognition methods, the field still faces several key challenges that limit practical deployment and performance improvement. First, complex deep network structures and excessive parameters lead to large computational loads and high training difficulty, requiring high computing power and making it difficult to deploy on low-resource edge devices. Second, model training stability is poor, especially in complex occlusion scenarios with large pose variations and low image quality, where existing models often suffer from convergence issues and performance degradation. Third, occluded face datasets are insufficient in both quantity and diversity, with most datasets lacking comprehensive annotations of occlusion types, degrees, and scenarios, leading to poor model generalization ability in real-world applications. Fourth, the design of loss functions needs further optimization, as existing loss functions struggle to accurately guide the training process and generate diverse samples in occluded feature spaces, limiting the discriminative power of extracted features. Fifth, GAN-based models face challenges in convergence and stable training, with difficulties in achieving both effective feature extraction and high-quality occlusion repair without affecting network convergence.
Future Research Directions
Building on the comprehensive analysis of current research progress and key challenges, the review outlines five critical future research directions for occlusion-resilient face recognition based on deep learning, providing a roadmap for the development of the field.
First, the innovation and optimization of deep learning-based basic model frameworks need to be strengthened to support more mobile and embedded applications. This includes designing lightweight network architectures and developing efficient training algorithms to deploy models on low-cost, low-power, and low-computation processing platforms for mobile and embedded devices, reducing hardware requirements. Two main approaches can be adopted: compressing trained complex models to obtain small models (e.g., through model pruning, quantization, and knowledge distillation), and directly designing and training small models with optimization for occluded face recognition tasks, striking a better balance between accuracy and efficiency.
Second, loss functions need to be optimized to increase model stability. The design of high-performance loss functions should maximize the aggregation of intra-class features and the separation of inter-class features, enhancing the network’s ability to model specific feature vectors, reducing oscillations in the model convergence process, and making convergence more stable. Innovative loss functions tailored for occluded feature spaces—such as loss functions that focus on unoccluded feature regions or weight features based on occlusion probability—are key research directions to improve model stability and robustness.
Third, multi-modal biometric features should be fully utilized to promote the organic combination of multi-feature, multi-model, and multi-algorithm approaches. Continuous exploration of the fusion of human physiological features (e.g., fingerprint, finger vein, face, iris) and behavioral features (e.g., handwriting, voice, gait) is needed to leverage the advantages of different biometric recognition technologies in terms of accuracy, stability, recognition speed, and convenience. On the one hand, further research on adversarial neural networks, attention mechanisms, and multi-information fusion is required to reduce model training difficulty while ensuring detection accuracy, and to better improve algorithm robustness. On the other hand, more refined modal data feature representation should be explored to achieve better information exchange of multi-modal data in the semantic space, enhancing the discriminability and complementarity of fused features.
Fourth, specialized datasets and evaluation criteria for occluded face detection need to be constructed to optimize model training and improve model accuracy, robustness, and real-time performance. The construction of large-scale datasets with accurate annotations of pose, illumination, occlusion, size, and other complex variations, as well as detailed attribute descriptions, is a key task for future research. Since datasets containing complex occlusion scenarios cannot cover all real-world situations, semi-supervised, unsupervised, or transfer learning methods should be combined to explore model training with limited labeled data, improving the generalization ability of occluded face recognition models.
Fifth, 3D face recognition research should be advanced, and the construction of 3D face datasets should be strengthened. 3D face recognition leverages stable spatial geometric information to reduce recognition errors caused by illumination and view variations, which is particularly advantageous for occluded face recognition, as 3D geometric features can better preserve the inherent structural information of human faces even with partial occlusion. The development of 3D face scanning and reconstruction technologies, combined with deep learning models for 3D feature extraction and recognition, will become a key direction for solving the problem of heavy occlusion face recognition and improving the robustness of face recognition systems in unconstrained environments.
Conclusion
Deep learning has revolutionized occlusion-resilient face recognition, providing a set of powerful and flexible solutions to address the technical bottlenecks caused by facial occlusion in real-world applications. The current state-of-the-art methods, including utilizing unoccluded facial features, feature fusion, occlusion region restoration, GAN-based approaches, and lightweight networks, each have their unique advantages and limitations, and their organic combination will be the key to achieving high-performance occluded face recognition. With the continuous expansion of application scenarios, the demand for occlusion-tolerant face recognition technology with high accuracy, high speed, high stability, and low resource consumption will continue to grow, especially in mobile, embedded, and edge computing environments. Future research will focus on the innovation of lightweight network architectures, the optimization of loss functions, the fusion of multi-modal biometric features, the construction of high-quality occluded face datasets, and the development of 3D face recognition technology, which will drive the field toward more practical, robust, and intelligent applications. As deep learning technology continues to evolve and interdisciplinary research advances, occlusion-resilient face recognition will break through existing limitations and play an increasingly important role in public security, smart cities, daily life, and other fields, providing more secure and convenient identity verification solutions for society.
Author Information: XU Xialing, LIU Tao, TIAN Guohui, YU Wenjuan, XIAO Dajun, LIANG Shanpeng. Affiliations: 1. Central China Electric Power Dispatching Control Sub-center, Central China Branch of State Grid Corporation of China, Wuhan 430077, China; 2. NARI Group Corporation (State Grid Electric Power Research Institute), Nanjing 211106, China; 3. Research and Development Technology Center, Beijing Kedong Electric Power Control System Corporation Limited, Beijing 100192, China. Journal: Computer Engineering and Applications, 2021, 57(17). DOI: 10.3778/j.issn.1002-8331.2101-0389