How Facial Recognition Actually Works

A camera in the Helsinki metro captures a frame. Somewhere in that frame is a face, maybe two, maybe thirty. Within 50 milliseconds, a piece of software has drawn a bounding box around each face, rotated and scaled the crop so the eyes are level at fixed pixel coordinates, fed each normalised crop through a convolutional neural network, and produced a vector of 512 floating-point numbers for each face. Those vectors are compared against a database of enrolled identities using cosine similarity. If the similarity between the probe vector and a stored template exceeds a threshold, the system declares a match.

That is the entire pipeline: detect, align, embed, compare. No part of the system "looks at" the face in the way a human does. There is no moment where the software considers the shape of someone's nose or the colour of their eyes as discrete features. The CNN compresses the entire face into a point in high-dimensional space, and the match decision is a single floating-point comparison. Everything interesting about facial recognition, from its accuracy to its failure modes to its political implications, follows from the details of how that pipeline is built and where it breaks.

This article walks through the full technical stack, from the pixel-level mechanics of face detection through the geometry of embedding spaces to the legal framework that the EU has built around it.

Face Detection: Finding Faces In A Frame

Before you can recognise a face, you have to find it. Face detection is the problem of taking an arbitrary image and outputting bounding boxes around every face it contains, along with a confidence score for each.

The Historical Baseline: Viola-Jones

The first face detector that ran in real time on consumer hardware was the Viola-Jones detector, published by Paul Viola and Michael Jones in 2001. It used a cascade of simple classifiers, each built from Haar-like features (rectangular intensity differences computed over the image using an integral image representation). The integral image allows any rectangular sum to be computed in four array lookups regardless of the rectangle's size, which makes evaluation fast. The cascade was trained with AdaBoost: each stage was a boosted combination of weak classifiers, tuned to reject non-face image patches quickly while passing almost every real face. Early stages used very few features (sometimes just two or three Haar rectangles) and rejected 50% or more of candidate windows. Only patches that survived all stages were declared face detections.
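The four-lookup trick is easy to see in code. The following is a minimal pure-Python sketch of a summed-area table and a two-rectangle Haar-like feature; the function names are illustrative and are not taken from OpenCV or the original paper.

```python
def integral_image(img):
    """Return a (h+1) x (w+1) summed-area table with a zero top row and left column."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            # Sum of everything above plus the running sum of the current row.
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def rect_sum(ii, top, left, height, width):
    """Sum of the pixels in a rectangle using exactly four table lookups."""
    bottom, right = top + height, left + width
    return ii[bottom][right] - ii[top][right] - ii[bottom][left] + ii[top][left]


def two_rect_haar(ii, top, left, height, width):
    """Horizontal two-rectangle Haar-like feature: left half minus right half."""
    half = width // 2
    left_sum = rect_sum(ii, top, left, height, half)
    right_sum = rect_sum(ii, top, left + half, height, half)
    return left_sum - right_sum


# Tiny illustrative "image" of intensities.
img = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
]
ii = integral_image(img)
print(rect_sum(ii, 1, 1, 2, 2))       # sum of the 2x2 block starting at row 1, col 1
print(two_rect_haar(ii, 0, 0, 3, 4))  # left two columns minus right two columns
```

The cost of `rect_sum` does not depend on the rectangle's size, which is what lets a cascade evaluate thousands of windows per frame.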

Viola-Jones was revolutionary for its time. It ran at 15 frames per second on a 700 MHz Pentium III, which was extraordinary in 2001. OpenCV shipped it as the default face detector for over a decade. But it had significant limitations: it worked poorly on non-frontal faces, struggled with occlusion, and its accuracy on the harder benchmarks (WIDER FACE, for instance) was far below what modern detectors achieve. By 2015, deep-learning-based detectors had overtaken it on every metric.

MTCNN: The Transitional Architecture

Multi-task Cascaded Convolutional Networks (MTCNN), published by Kaipeng Zhang and colleagues in 2016, was the detector that brought deep learning to face detection in a practical, fast form. It uses three stages, each a small CNN.

The first stage (P-Net, for Proposal Network) is a fully convolutional network that slides across the image at multiple scales and proposes candidate face regions. It outputs a set of bounding boxes with associated confidence scores. Non-maximum suppression (NMS) prunes overlapping boxes.

The second stage (R-Net, for Refine Network) takes each candidate region, resizes it to 24x24 pixels, and runs it through a slightly deeper CNN that refines the bounding box coordinates and rejects false positives.

The third stage (O-Net, for Output Network) takes the refined candidates, resizes them to 48x48, and produces the final bounding box, a confidence score, and five facial landmark coordinates (left eye, right eye, nose tip, left mouth corner, right mouth corner). Those landmarks are critical: they are used in the next stage of the pipeline for face alignment.

MTCNN runs in real time on a GPU and was the default detector in many production systems from 2016 to 2019. Its multi-task design, where the same network predicts both bounding boxes and landmarks, was influential and set the template for later architectures.

RetinaFace: The Modern Standard

RetinaFace, published by Jiankang Deng and colleagues in 2019, is a single-stage detector that uses a Feature Pyramid Network (FPN) backbone to handle faces at multiple scales in a single forward pass. Where MTCNN needed three sequential networks and image pyramids, RetinaFace processes the entire image once through a ResNet backbone, builds a multi-scale feature pyramid, and applies detection heads at each pyramid level.

The key innovation is multi-task learning at a deeper level than MTCNN. RetinaFace simultaneously predicts face bounding boxes, five-point landmarks, and dense 3D face meshes. The 3D mesh prediction acts as a regulariser during training: by forcing the network to learn the 3D structure of faces, it produces more robust 2D detections, especially for faces at extreme angles or under partial occlusion.

RetinaFace achieves over 91% average precision on the WIDER FACE hard set, which includes tiny faces, heavy occlusion, and extreme poses. It detects faces as small as 10x10 pixels in a 1080p frame. On a modern GPU (an Nvidia A100, for example), it processes a 1080p frame in about 15 ms, which leaves plenty of budget for the rest of the pipeline.

How Bounding Boxes Are Proposed And Refined

All modern detectors use a common pattern for bounding box generation. The detection head produces, for each position in the feature map, a set of anchor boxes at predetermined aspect ratios and scales. For each anchor, the head predicts a classification score (face or not) and four regression offsets (dx, dy, dw, dh) that shift and scale the anchor to better fit the actual face. This is the same mechanism used in general object detectors like Faster R-CNN and SSD, adapted for the specific geometry of faces.
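As a rough illustration of the regression step, the sketch below decodes one anchor with its four offsets. The exact parameterisation varies between detectors; this assumes the Faster R-CNN-style convention (centre shifts in units of the anchor size, log-scale width and height), and the function name is hypothetical.

```python
import math


def decode_anchor(anchor, offsets):
    """Apply (dx, dy, dw, dh) to an anchor given as (cx, cy, w, h).

    Assumed convention: dx and dy shift the centre in units of the anchor's
    width and height; dw and dh are log-scale factors for width and height.
    Returns the box in corner form (x1, y1, x2, y2).
    """
    acx, acy, aw, ah = anchor
    dx, dy, dw, dh = offsets
    cx = acx + dx * aw
    cy = acy + dy * ah
    w = aw * math.exp(dw)
    h = ah * math.exp(dh)
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


# A 32x32 anchor centred at (100, 100), nudged right and up and made 1.5x taller.
box = decode_anchor((100.0, 100.0, 32.0, 32.0), (0.25, -0.5, 0.0, math.log(1.5)))
print(box)
```

Predicting offsets relative to an anchor, rather than absolute coordinates, keeps the regression targets small and roughly scale-independent.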

Non-maximum suppression then eliminates duplicate detections. If two boxes overlap by more than a threshold (typically 0.4 IoU), the one with the lower confidence is discarded. The surviving boxes are the final detection output.
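A minimal sketch of greedy non-maximum suppression, assuming boxes in (x1, y1, x2, y2) corner form and the 0.4 IoU threshold mentioned above. Production detectors use vectorised versions of the same idea; the names here are illustrative.

```python
def iou(a, b):
    """Intersection over union of two boxes in (x1, y1, x2, y2) form."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.4):
    """Return indices of the boxes to keep, highest confidence first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
print(kept)  # the second box overlaps the first heavily and is dropped
```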

Face Alignment: Normalising Geometry

A detected face might be tilted, partially turned, or at an unusual scale. If you feed these raw crops directly into an embedding network, the network has to learn to be invariant to all of these geometric variations, which wastes model capacity. Instead, the standard practice is to align the face to a canonical pose before embedding.

Landmark Detection

The alignment process starts with facial landmarks. The most common sets are the 5-point set (two eye centres, the nose tip, and the two mouth corners) and the 68-point set (the full face contour, eyebrows, eyes, nose, and mouth). MTCNN and RetinaFace both output 5-point landmarks as part of their detection head. For applications that need more detail, dedicated landmark detectors like dlib's 68-point model or Google's FaceMesh (478 points) are used as a separate stage.

The 5-point landmarks are sufficient for alignment. The two eye centres define the face's in-plane rotation. The nose and mouth corners define the vertical extent and help compensate for minor out-of-plane rotation.

Affine Transformation To Canonical Pose

Given the five landmark positions in the input image and a set of target landmark positions in a canonical template (where the eyes are at fixed pixel coordinates, say (38.3, 51.7) and (73.5, 51.7) in a 112x112 crop), the alignment computes a 2D affine transformation matrix. This matrix encodes rotation, scaling, and translation. It maps the detected landmarks to the template landmarks.

The transformation is computed using least-squares fitting. For five source-target landmark pairs, the system solves for the six parameters of the affine matrix (two rows of three values each) that minimise the sum of squared distances between the transformed source landmarks and the target landmarks. Each landmark pair contributes two equations (one for x, one for y), so five points give ten equations for six unknowns. The system is overdetermined, which is why it is solved in the least-squares sense rather than exactly, and the solution is stable because the landmarks are well-distributed across the face.
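The least-squares fit can be sketched in pure Python via the normal equations. This is an illustration under stated assumptions, not any particular library's implementation: the eye coordinates come from the text above, while the nose and mouth template positions and the simulated detection are illustrative values.

```python
import math


def solve3(m, rhs):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    a = [row[:] + [r] for row, r in zip(m, rhs)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(col + 1, 3):
            f = a[r][col] / a[col][col]
            for c in range(col, 4):
                a[r][c] -= f * a[col][c]
    x = [0.0] * 3
    for r in range(2, -1, -1):
        x[r] = (a[r][3] - sum(a[r][c] * x[c] for c in range(r + 1, 3))) / a[r][r]
    return x


def fit_affine(src, dst):
    """Least-squares 2x3 affine matrix mapping src points onto dst points."""
    # Normal equations (A^T A) p = A^T t, where each row of A is [x, y, 1].
    ata = [[0.0] * 3 for _ in range(3)]
    atu, atv = [0.0] * 3, [0.0] * 3
    for (x, y), (u, v) in zip(src, dst):
        row = (x, y, 1.0)
        for i in range(3):
            for j in range(3):
                ata[i][j] += row[i] * row[j]
            atu[i] += row[i] * u
            atv[i] += row[i] * v
    return [solve3(ata, atu), solve3(ata, atv)]


def apply_affine(m, p):
    x, y = p
    return (m[0][0] * x + m[0][1] * y + m[0][2],
            m[1][0] * x + m[1][1] * y + m[1][2])


# Canonical 112x112 template: eyes from the text, other points illustrative.
TEMPLATE = [(38.3, 51.7), (73.5, 51.7), (56.0, 71.7), (41.5, 92.4), (70.7, 92.2)]

# Simulated detection: the template rotated 20 degrees, scaled 2x, and shifted.
angle = math.radians(20)
c, s = 2.0 * math.cos(angle), 2.0 * math.sin(angle)
detected = [(c * x - s * y + 120.0, s * x + c * y + 80.0) for x, y in TEMPLATE]

matrix = fit_affine(detected, TEMPLATE)
aligned = [apply_affine(matrix, p) for p in detected]
```

Many practical pipelines restrict the fit to a similarity transform (rotation, uniform scale, translation) so that the face is never sheared; the full six-parameter fit shown here follows the description in the text.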

The affine matrix is then applied to the entire image region around the detected face, producing a tightly cropped, rotation-corrected, scale-normalised face image at a fixed resolution (typically 112x112 or 160x160 pixels). This aligned crop is what the embedding network sees.

The alignment step is not glamorous, but it matters enormously. Without it, recognition accuracy drops by 2 to 5 percentage points on standard benchmarks, because the embedding network wastes capacity on geometric variation instead of identity-discriminative features.

The Embedding Network: Compressing A Face Into A Vector

The heart of the system is the embedding network. It takes an aligned face crop and produces a fixed-length vector, typically 128 or 512 dimensions, that represents the face's identity. Two images of the same person should produce vectors that are close together. Two images of different people should produce vectors that are far apart. The entire recognition problem reduces to measuring distances in this vector space.

Architecture

The embedding network is a deep convolutional neural network. The most common architectures are:

ResNet variants. The original FaceNet (2015) used a GoogLeNet-style Inception architecture, but modern systems overwhelmingly use ResNet-50 or ResNet-100 (a 100-layer residual network). The residual connections allow the network to be very deep without suffering from vanishing gradients. The final convolutional layer's output is globally average-pooled and passed through a fully connected layer that produces the embedding vector.

MobileFaceNet. For mobile and edge deployment, lightweight architectures based on MobileNet use depthwise separable convolutions to reduce computation by 8 to 10 times compared to ResNet-100, at a modest accuracy cost (1 to 2 percentage points on LFW). Apple's Face ID neural network is believed to use a compact architecture in this family, constrained to run within the power and latency budget of the Neural Engine on the A-series and M-series chips.

Vision Transformers. More recent work uses ViT (Vision Transformer) architectures for face recognition. These split the aligned face crop into patches (typically 8x8 or 16x16 pixels), embed each patch as a token, and process the sequence through transformer layers with self-attention. The class token's final representation serves as the embedding. Early results on benchmarks like IJB-C show ViTs matching or slightly exceeding ResNet-100, particularly on the hardest cases (extreme pose, low resolution), but at higher computational cost.

Regardless of architecture, the final output is a vector of 128 or 512 floating-point numbers. This vector is L2-normalised so that it lies on the unit hypersphere, which makes cosine similarity equivalent to the dot product and simplifies the distance computation.

What Does The Network Learn?

Each dimension of the embedding vector does not correspond to an interpretable facial attribute like "nose width" or "skin tone". The dimensions are entangled, meaning that any single facial attribute is distributed across many dimensions, and each dimension participates in encoding many attributes. This is a consequence of end-to-end training: the network discovers whatever representation minimises the training loss, and that representation is not obligated to be human-interpretable.

That said, probing experiments have shown that certain directions in the embedding space correlate with identifiable attributes. You can find linear directions that correspond to gender, age, pose angle, and expression, but these are not aligned with the coordinate axes. They are oblique directions that the network uses as part of its identity representation. This matters for understanding bias, as we will see later.

Training: How The Network Learns To Separate Faces

The embedding network is trained on millions of face images from thousands of identities. The training objective is to produce embeddings where same-identity images cluster together and different-identity images are pushed apart.

Triplet Loss (FaceNet)

The original approach, introduced in Google's FaceNet paper (2015), used triplet loss. Each training step samples a triplet: an anchor image, a positive image (same identity as the anchor), and a negative image (different identity). The loss function is:

L = max(0, ||f(anchor) - f(positive)||² - ||f(anchor) - f(negative)||² + margin)

This pushes the positive pair closer together and the negative pair further apart, with a margin (typically 0.2 to 0.5) that defines a minimum gap. If the network already separates the triplet by more than the margin, the loss is zero and the triplet contributes no gradient.
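A direct transcription of the formula for a single triplet, as a minimal sketch. The 2D vectors are toy values chosen to show one zero-loss (easy) triplet and one that still produces a gradient signal.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two embeddings."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """max(0, d(anchor, positive) - d(anchor, negative) + margin)."""
    d_pos = squared_distance(anchor, positive)
    d_neg = squared_distance(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


anchor = (1.0, 0.0)
positive = (0.8, 0.6)
easy_negative = (-1.0, 0.0)   # already far from the anchor
hard_negative = (0.8, -0.6)   # as close to the anchor as the positive is

print(triplet_loss(anchor, positive, easy_negative))  # zero: no gradient
print(triplet_loss(anchor, positive, hard_negative))  # positive: drives learning
```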

The difficulty with triplet loss is mining. Most triplets are easy (the negative is already far away) and contribute nothing to training. Only "hard" or "semi-hard" triplets drive learning, and finding those efficiently in a dataset of millions of images requires careful batch construction. FaceNet used large batches (around 1,800 images) and mined hard negatives within each batch, which was computationally expensive and sensitive to the mining strategy.

Angular Margin Losses: ArcFace And CosFace

The modern standard for training face embeddings is ArcFace (Additive Angular Margin Loss), published by Jiankang Deng and colleagues in 2019. ArcFace reformulates face recognition as a classification problem during training. Each identity in the training set gets a class, and the network's embedding is passed through a classification head with a weight vector per class. The twist is in how the classification logit is computed.

In a standard softmax classifier, the logit for class j is the dot product of the embedding and the class weight: W_j^T * x. ArcFace L2-normalises both the embedding and the class weights, so this dot product equals cos(θ_j), where θ_j is the angle between the embedding and the class weight vector.

ArcFace adds an angular margin m to the angle for the correct class:

logit_j = s * cos(θ_j + m)    for the correct class
logit_j = s * cos(θ_j)         for all other classes

where s is a scaling factor (typically 64) and m is the margin (typically 0.5 radians, about 28.6 degrees). This forces the network to produce embeddings that are not just in the right direction, but at least m radians away from any other class's direction. The geometric effect is that each identity occupies a tight angular region on the hypersphere, and there is a guaranteed gap between regions.

CosFace (Additive Cosine Margin) is a variant that subtracts the margin from the cosine directly:

logit_j = s * (cos(θ_j) - m)   for the correct class

Both achieve very similar results. ArcFace is slightly more popular in practice and is the default loss function in the InsightFace open-source toolkit, which is the de facto standard for face recognition research.
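The two logit formulas can be compared side by side in a few lines. This is a simplified sketch: it takes the cosines as given, uses the same m for both losses for illustration, and omits the special handling real implementations apply when θ + m exceeds π. The function names are hypothetical and are not InsightFace APIs.

```python
import math


def margin_logits(cosines, target, s=64.0, m=0.5, kind="arcface"):
    """Scaled logits with a margin applied only to the target class.

    cosines[j] is cos(theta_j) between the normalised embedding and class j.
    """
    logits = []
    for j, cos_t in enumerate(cosines):
        if j != target:
            logits.append(s * cos_t)
        elif kind == "arcface":
            theta = math.acos(max(-1.0, min(1.0, cos_t)))
            logits.append(s * math.cos(theta + m))   # margin added to the angle
        elif kind == "cosface":
            logits.append(s * (cos_t - m))           # margin subtracted from the cosine
        else:
            raise ValueError(f"unknown margin type: {kind}")
    return logits


def softmax_cross_entropy(logits, target):
    """Numerically stable cross-entropy for a single example."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[target]


cosines = [0.8, 0.3, -0.1]   # toy values; class 0 is the correct identity
plain = margin_logits(cosines, target=0, m=0.0)
arc = margin_logits(cosines, target=0, kind="arcface")
cos = margin_logits(cosines, target=0, kind="cosface")

print(softmax_cross_entropy(plain, 0))
print(softmax_cross_entropy(arc, 0))
print(softmax_cross_entropy(cos, 0))
```

With the margin applied, an embedding that a plain softmax would already consider well classified still incurs a loss, which is what keeps pushing each identity's embeddings closer to its class centre.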

The advantage over triplet loss is enormous. ArcFace works with standard softmax-based training, uses standard batch sizes, converges faster, and produces better-separated embeddings. It also scales gracefully to very large numbers of identities (millions of classes in the training set).

Training Datasets

The quality and diversity of the training data determine the quality of the embedding network. The most commonly used datasets are:

MS-Celeb-1M (Microsoft, 2016): approximately 10 million images of 100,000 celebrities. This was the de facto standard training set for several years, but Microsoft retracted it in 2019 after researchers discovered that many images were scraped without consent, and that the dataset contained substantial label noise (wrong identity labels). Despite the retraction, cleaned versions circulate widely in the research community.

VGGFace2 (University of Oxford, 2018): 3.31 million images of over 9,000 identities, with significant variation in pose, age, and ethnicity. Better curated than MS-Celeb-1M but smaller.

CASIA-WebFace (CASIA, 2014): 500,000 images of 10,575 identities. Smaller and older, but widely used as a baseline training set.

WebFace260M (2021): 260 million images of 4 million identities, currently the largest public face recognition dataset. Models trained on this dataset achieve the best results on most benchmarks.

Glint360K (2021): 17 million images of 360,000 identities. A large-scale cleaned dataset used in recent NIST FRVT submissions.

All of these datasets were constructed by scraping images from the web, primarily from search engines, social media, and news sites. The ethical and legal implications of this are significant, and we will return to them.

The Embedding Space: Geometry Of Identity

Once training is complete, the embedding network maps every face to a point on a unit hypersphere in 512-dimensional space (or 128, depending on the model). Understanding the geometry of this space is essential for understanding how recognition works and where it fails.

What The Dimensions Represent

As noted earlier, individual dimensions are not interpretable. But the space as a whole has structure. Faces of the same person, across different lighting, expression, age, and angle, cluster into tight regions. Faces of different people occupy different regions. The clustering is not perfect: a person photographed in their twenties and again in their sixties will have more distant embeddings than two photos taken minutes apart. But the intra-identity variance is, for a well-trained model, much smaller than the inter-identity distance.

The radius of an identity cluster depends on the conditions. Under controlled conditions (frontal, good lighting, neutral expression), the cluster radius is small, with cosine similarity above 0.8 for same-identity pairs. Under unconstrained conditions (surveillance footage, low resolution, extreme angles), the radius grows, and cosine similarity for same-identity pairs can drop to 0.5 or below.

Why Cosine Similarity Works

Since all embeddings are L2-normalised to unit length, the Euclidean distance between two embeddings has a direct relationship to the cosine of the angle between them:

||a - b||² = 2 - 2 * cos(θ)

Minimising Euclidean distance is equivalent to maximising cosine similarity. In practice, most systems use cosine similarity directly because it is bounded between -1 and 1, which makes threshold selection more intuitive.
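The identity can be checked numerically. The sketch below L2-normalises two toy vectors, computes cosine similarity as a plain dot product, and compares the squared Euclidean distance with 2 - 2 * cos(θ).

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


a = l2_normalize([3.0, 4.0, 0.0])
b = l2_normalize([1.0, 2.0, 2.0])

cosine = dot(a, b)                                  # equals cos(theta) for unit vectors
sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))   # squared Euclidean distance

print(cosine, sq_dist, 2 - 2 * cosine)
```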

Cosine similarity works for face recognition because ArcFace explicitly optimises for angular separation. The training loss penalises embeddings based on their angles to class centres, so the resulting space is well-calibrated for angular (cosine) comparisons. If you trained with a Euclidean loss, you would get a space better suited to Euclidean comparison. The choice of distance metric and the choice of training loss are coupled.

Threshold Selection: FAR vs FRR

The match decision is: "Is the cosine similarity between the probe embedding and the enrolled template above threshold T?" The choice of T determines the system's operating point on the trade-off between two error types:

False Accept Rate (FAR): The fraction of non-matching pairs that the system incorrectly declares as a match. Also called the false positive rate.

False Reject Rate (FRR): The fraction of genuine matching pairs that the system incorrectly rejects. Also called the false negative rate. Its complement (1 - FRR) is the True Accept Rate (TAR), which is the rate at which genuine matches are correctly identified.

A high threshold (say, cosine similarity > 0.75) produces a low FAR but a high FRR: the system rarely accepts impostors but also rejects many legitimate matches. A low threshold (say, > 0.45) accepts more genuine matches but also lets more impostors through.
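A minimal sketch of how the two error rates move with the threshold, using small made-up score lists (real evaluations use millions of pairs). A pair counts as a match when its similarity is strictly above the threshold.

```python
def far_frr(genuine_scores, impostor_scores, threshold):
    """Return (FAR, FRR) at a given cosine-similarity threshold."""
    false_accepts = sum(s > threshold for s in impostor_scores)
    false_rejects = sum(s <= threshold for s in genuine_scores)
    return (false_accepts / len(impostor_scores),
            false_rejects / len(genuine_scores))


# Illustrative similarity scores, not measured data.
genuine = [0.82, 0.77, 0.69, 0.58, 0.49]
impostor = [0.51, 0.38, 0.22, 0.10, -0.05]

print(far_frr(genuine, impostor, 0.75))  # strict: no impostors accepted, many genuine rejected
print(far_frr(genuine, impostor, 0.45))  # lenient: all genuine accepted, one impostor gets through
```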

The right threshold depends on the application. For unlocking a phone, you want very low FAR (you do not want a stranger unlocking your device), and you can tolerate moderate FRR (you can try again). Apple's Face ID targets a FAR of 1 in 1,000,000. For a surveillance system searching for a suspect in a crowd, you might accept a higher FAR because every match will be reviewed by a human operator, and a false reject means missing the suspect entirely.

Verification vs Identification: Two Different Problems

Facial recognition is used for two distinct tasks, and the accuracy profiles differ dramatically.

One-to-One Verification

Verification answers the question: "Is this person who they claim to be?" The system has one probe image and one enrolled template. It computes the cosine similarity and compares it to the threshold. This is a binary decision with a well-defined prior: the person is probably who they claim to be (otherwise, why are they presenting themselves for verification?).

Examples: unlocking a phone, border control e-gates, accessing a banking app. The probe and the enrolled template are typically high-quality images (frontal, good resolution, controlled lighting), because the user cooperates with the capture process.

Verification accuracy is very high on modern systems. The best models achieve TAR above 99.9% at FAR = 0.001% on the IJB-C benchmark, which means they correctly verify 999 out of 1,000 genuine pairs while falsely accepting fewer than 1 in 100,000 impostor pairs.

One-to-Many Identification

Identification answers the question: "Who is this person?" The system has one probe image and a gallery of N enrolled templates. It computes the cosine similarity between the probe and every template in the gallery, and returns the top matches (or declares no match if all similarities fall below the threshold).
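A brute-force version of this search is only a few lines. The sketch assumes unit-length templates stored in a dictionary keyed by identity; the names and 2D vectors are illustrative, and large deployments typically replace the linear scan with an approximate nearest-neighbour index.

```python
def identify(probe, gallery, threshold, top_k=3):
    """Return up to top_k (identity, score) pairs above threshold, best first.

    gallery maps an identity label to a unit-length template, so the dot
    product is the cosine similarity.
    """
    scored = [
        (name, sum(p * t for p, t in zip(probe, template)))
        for name, template in gallery.items()
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return [(name, score) for name, score in scored[:top_k] if score > threshold]


gallery = {
    "alice": (1.0, 0.0),
    "bob": (0.0, 1.0),
    "carol": (0.6, 0.8),
}

matches = identify((0.8, 0.6), gallery, threshold=0.7)
print(matches)                                          # carol first, then alice
print(identify((-1.0, 0.0), gallery, threshold=0.7))    # empty list: no match declared
```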

This is a harder problem for a mathematical reason. If the FAR per comparison is p, and you make N comparisons, the expected number of false matches is N * p, and the probability of at least one false match is 1 - (1 - p)^N, which is approximately N * p only while N * p is much smaller than 1. A gallery of 1 million identities with a per-comparison FAR of 0.001% gives an expected 10 false matches per probe. To maintain the same effective FAR as a verification system, you need a much tighter per-comparison threshold, which increases the FRR.
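The gallery-size arithmetic is easy to check numerically. This sketch assumes independent comparisons, which is a simplification, and the function names are illustrative.

```python
def expected_false_matches(gallery_size, far):
    """Expected number of false matches for one probe against the gallery."""
    return gallery_size * far


def prob_any_false_match(gallery_size, far):
    """Probability of at least one false match, assuming independent comparisons."""
    return 1.0 - (1.0 - far) ** gallery_size


def far_needed(gallery_size, target_expected):
    """Per-comparison FAR needed to keep the expected false matches at a target."""
    return target_expected / gallery_size


far = 0.001 / 100  # 0.001% expressed as a fraction

for n in (1_000, 1_000_000):
    print(n, expected_false_matches(n, far), prob_any_false_match(n, far))

# Per-comparison FAR required for about 0.01 expected false matches per search.
print(far_needed(1_000_000, 0.01))
```

For the small gallery the probability of a false match is close to N * p; for the million-entry gallery a false match is close to certain on every search, and the useful figure is the expected count.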

This is why one-to-many identification in large databases is inherently less reliable than one-to-one verification, even with the same underlying model. A system that is 99.99% accurate for verification might produce dozens of false positives per search in a national-scale database with tens of millions of enrolled identities.

Accuracy Metrics And Benchmarks

The standard way to evaluate a face recognition system is through a Receiver Operating Characteristic (ROC) curve that plots TAR against FAR across all possible thresholds.

TAR@FAR

The headline metric is TAR at a specific FAR. For example, "TAR = 99.7% at FAR = 0.01%" means that when the system's threshold is set so that only 0.01% of impostor pairs are falsely accepted, it correctly accepts 99.7% of genuine pairs. This single number summarises the system's discriminative ability at a practically relevant operating point.
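Computing TAR at a fixed FAR amounts to choosing the threshold from the impostor scores and then measuring the genuine scores against it. A minimal sketch with made-up scores follows; real benchmarks do the same thing over millions of pairs, and the function name is illustrative.

```python
def tar_at_far(genuine_scores, impostor_scores, target_far):
    """Return (threshold, TAR) for the lowest threshold with FAR <= target_far.

    A pair is accepted when its score is strictly above the threshold.
    """
    ranked = sorted(impostor_scores, reverse=True)
    # Number of impostor pairs we are allowed to accept at this FAR.
    allowed = int(target_far * len(ranked))
    # Accepting only scores above the (allowed + 1)-th highest impostor score
    # lets through at most `allowed` impostors.
    threshold = ranked[allowed] if allowed < len(ranked) else float("-inf")
    tar = sum(s > threshold for s in genuine_scores) / len(genuine_scores)
    return threshold, tar


# Illustrative similarity scores, not measured data.
impostor = [0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.50, 0.55]
genuine = [0.52, 0.60, 0.70, 0.80, 0.90]

print(tar_at_far(genuine, impostor, 0.1))  # one impostor in ten may be accepted
print(tar_at_far(genuine, impostor, 0.0))  # no impostor may be accepted
```

Tightening the FAR target raises the threshold and lowers the TAR, which is the trade-off the ROC curve traces out.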

Different benchmarks use different FAR targets. The Labeled Faces in the Wild (LFW) benchmark, published in 2007, uses verification accuracy at a fixed threshold (not a specific FAR), and the best systems have saturated it at 99.8%+ since about 2017. It is no longer considered a useful benchmark for state-of-the-art systems.

The IJB-B and IJB-C benchmarks (IARPA Janus Benchmarks) are harder, with greater variation in pose, illumination, and resolution. They report TAR@FAR at FAR = 0.0001% (1 in a million), which is a stringent operating point relevant to real deployments. The best systems in 2025 achieve TAR above 98% at this point on IJB-C.

NIST FRVT

The most comprehensive ongoing evaluation is the NIST Face Recognition Vendor Test (FRVT). NIST tests algorithms submitted by vendors and research groups on datasets that NIST controls but does not distribute. The datasets include mugshot images, visa photos, and "wild" images captured in uncontrolled conditions.

NIST FRVT reports accuracy in terms of False Non-Match Rate (FNMR, which is FRR) at fixed False Match Rate (FMR, which is FAR). The 2024 FRVT Ongoing results show that the top 10 algorithms achieve FNMR below 0.2% at FMR = 0.00001% (1 in 10 million) on the visa photo dataset, which is extraordinary accuracy. On more difficult datasets (mugshots with aging, uncontrolled "wild" images), accuracy is lower but still strong.

Several patterns emerge from the FRVT results. First, accuracy has improved dramatically since 2014, with FNMR at FMR = 0.001% dropping from around 4% to below 0.3% over a decade. Second, the top-performing algorithms are all deep-learning-based (no traditional feature-based system has been competitive since 2017). Third, there are large accuracy gaps across demographics, which brings us to the next section.

Demographic Bias: When The System Fails Unevenly

One of the most important findings from NIST's FRVT evaluations is that facial recognition accuracy varies significantly across demographic groups. The 2019 NIST report on demographic effects tested 189 algorithms across three demographic variables (age, sex, and country of birth as a proxy for race/ethnicity) and the interactions between them.

The Findings

The results were stark. For one-to-one verification at FMR = 0.00001%:

False non-match rates were highest for Black women, often by a factor of 10 to 100 compared to white men, depending on the algorithm. Many algorithms had FNMR below 0.5% for white men but above 5% for Black women.

Age effects were strong: both the very young (under 20) and the very old (over 60) had higher error rates than adults aged 30 to 50.

Algorithms developed in China and East Asia showed smaller accuracy gaps between Asian and white faces but larger gaps for Black faces, and vice versa for algorithms developed in Western countries. This strongly suggests that the training data composition drives the demographic performance profile.

Some algorithms showed relatively small demographic differentials. The best-performing systems in the 2024 FRVT results have narrowed the gap substantially, with FNMR ratios between the best and worst demographic groups down to 3:1 or less for the top algorithms. But the gap has not been eliminated, and for lower-performing algorithms it remains large.

Why Training Data Composition Matters

The root cause is straightforward: the embedding network learns a representation that is optimised for the faces it sees most often during training. If the training set is 70% white faces (which many early datasets were), the network develops finer-grained representations for white faces and coarser representations for everyone else. The hypersphere gets carved up unevenly: white faces occupy a larger portion of the embedding space with better separation, while faces from underrepresented groups are compressed into a smaller region with more overlap between identities.

This is not a hypothetical mechanism. Researchers have demonstrated it directly by training the same architecture on datasets with different demographic compositions and observing the predicted accuracy gaps shift accordingly. Training on a balanced dataset produces more equitable accuracy, though it does not eliminate all gaps (likely because of other factors, such as how different skin tones interact with imaging hardware and lighting conditions).

The practical consequence is that a facial recognition system that works well on average can fail badly on specific populations, and those populations are often the ones already subject to disproportionate surveillance. This has driven both technical research into fairness-aware training and legislative action, particularly in the EU.

Real-Time Surveillance: The Engineering Constraints

Deploying facial recognition in a surveillance context introduces a set of engineering constraints that laboratory benchmarks do not capture.

Camera Placement And Resolution

A face recognition system needs a face image of at least 80 to 100 pixels between the eyes (the inter-pupillary distance, or IPD) to achieve reliable identification. Below 50 pixels IPD, accuracy degrades rapidly. Below 30 pixels, most systems become unreliable.

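This translates into hard constraints on camera placement. Under a simple pinhole model, a standard 1080p camera with a 90-degree horizontal field of view spreads its 1,920 pixel columns across a scene roughly 2d metres wide at distance d. With a typical adult IPD of about 63 mm, that gives a 100-pixel IPD only when the subject is within about 0.6 metres. At 5 metres, the IPD is about 12 pixels. At 10 metres, it is about 6 pixels, and at 20 metres about 3. The IPD in pixels halves every time the distance doubles.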

Real-world surveillance systems deal with this by using higher-resolution cameras (4K or 8K), narrower fields of view (telephoto lenses), or arrays of cameras covering different zones. Airport gates, for example, use a single 4K camera at a range of 1 to 2 metres, ensuring the face occupies a large portion of the frame. Street-level surveillance has to accept lower resolution and correspondingly lower accuracy.

Angle Limits

Face recognition performance degrades with yaw (left-right rotation) and pitch (up-down tilt). Most systems maintain good accuracy up to about 30 degrees of yaw. Beyond 45 degrees, accuracy drops sharply. A full profile (90-degree yaw) is unusable for most systems, though some recent models trained on 3D-aware loss functions can handle up to 60 degrees with reduced accuracy.

This means camera placement must account for the expected direction of foot traffic. A camera mounted directly above a doorway captures the tops of heads, which are useless for recognition. A camera at eye level across a narrow corridor captures near-frontal faces. Real installations carefully angle cameras to maximise the time each person's face is visible within the usable pose range.

Infrared Illumination

Many surveillance cameras operate in environments with poor or variable lighting. Near-infrared (NIR) illumination, using LEDs at 850 nm or 940 nm (at or beyond the edge of human vision), provides consistent lighting without alerting subjects. NIR images have different characteristics from visible-light images (melanin absorbs NIR differently, eye colour is not visible), so systems that rely on NIR need embedding networks trained on NIR face images or on cross-spectral matching.

Where covertness matters, the better approach is active illumination at 940 nm, where LED efficiency is lower but the illumination is completely invisible (850 nm LEDs produce a faint red glow visible in dark conditions). Systems like Apple's Face ID use structured NIR light for a different purpose (depth sensing), which is covered in the next section.

Apple Face ID: Depth Sensing On A Phone

Apple's Face ID, introduced with the iPhone X in 2017, is arguably the most widely deployed face verification system in the world. Its design is distinct from camera-based surveillance systems because it uses structured light for 3D depth sensing rather than a standard 2D camera image.

The Hardware

The TrueDepth camera module contains three key components: a flood illuminator (an 850 nm or 940 nm NIR LED that illuminates the face evenly), a dot projector (a vertical-cavity surface-emitting laser, or VCSEL, that projects a pattern of approximately 30,000 infrared dots onto the face), and an infrared camera that captures the reflected pattern.

The flood illuminator provides the 2D NIR face image used for detection and alignment. The dot projector produces the depth map: each dot lands on the face surface and appears shifted in the camera image by an amount that depends on the surface depth at that point. The IR camera captures these displacements, and a depth estimation algorithm (likely based on structured light triangulation, where the known geometry between the projector and the camera allows depth to be computed from dot displacement) reconstructs a 3D depth map of the face at roughly 30,000-point resolution.

The Pipeline

Face ID combines the 2D NIR image and the 3D depth map into a multi-channel input to a compact neural network running on the device's Neural Engine. The network produces an embedding vector, which is compared to the enrolled template stored in the Secure Enclave.

The Secure Enclave is a hardware-isolated coprocessor on the Apple SoC with its own encrypted memory and its own boot chain. The enrolled template never leaves the Secure Enclave, is not included in device backups, and is not accessible to the application processor or any running app. The comparison (cosine similarity against the threshold) happens inside the Secure Enclave, and only the binary result (match or no match) is communicated to the main processor.
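The comparison itself is small enough to show in full. The sketch below is a generic cosine-similarity check in NumPy, not Apple's code; the threshold of 0.6 and the random vectors standing in for embeddings are illustrative assumptions. The point is that only a boolean leaves the function.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def is_match(probe, template, threshold=0.6):
    """Return only the binary decision. The threshold is a hypothetical value."""
    return cosine_similarity(probe, template) >= threshold


rng = np.random.default_rng(0)
template = rng.normal(size=512)                       # stand-in for the enrolled template
same_person = template + 0.1 * rng.normal(size=512)   # slightly perturbed copy
stranger = rng.normal(size=512)                       # unrelated random vector

print(is_match(same_person, template), is_match(stranger, template))
```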

Apple claims a FAR of 1 in 1,000,000, compared to 1 in 50,000 for Touch ID (fingerprint). This is achievable because the 3D depth information provides far more discriminative signal than a 2D image alone: it captures the actual geometry of the face, which is harder to spoof than appearance.
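To get a feel for what those two figures mean, the arithmetic below compares them and estimates the chance that at least one of several attempts by random strangers is falsely accepted. It assumes the attempts are independent, which is a simplification (relatives and look-alikes are not random strangers), and the attempt counts are illustrative.

```python
FACE_ID_FAR = 1 / 1_000_000   # figure quoted above
TOUCH_ID_FAR = 1 / 50_000     # figure quoted above


def p_any_false_accept(far, attempts):
    """Probability that at least one of `attempts` independent impostor
    attempts is accepted, given a per-attempt false accept rate."""
    return 1.0 - (1.0 - far) ** attempts


ratio = TOUCH_ID_FAR / FACE_ID_FAR               # how many times lower the face FAR is
p_face_5 = p_any_false_accept(FACE_ID_FAR, 5)    # five impostor attempts
p_touch_5 = p_any_false_accept(TOUCH_ID_FAR, 5)

print(f"ratio: {ratio:.0f}x")
print(f"5 attempts, face: {p_face_5:.2e}, fingerprint: {p_touch_5:.2e}")
```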

Anti-Spoofing

The depth map is also Face ID's primary defence against spoofing. A printed photograph has no depth variation. A screen displaying a face has a flat depth profile at the screen's distance. A 3D-printed mask would need to match the enrolled face's depth profile very closely, at a precision that is hard to reach with consumer 3D printers and that, in practice, requires detailed measurements of the target's face.

Apple also uses an attention-awareness feature: Face ID checks that the user's eyes are open and directed at the device, which prevents unlock while the user is asleep or unconscious (this feature can be disabled in accessibility settings). The gaze detection uses the 2D NIR image and runs a separate small neural network.

Face ID updates its enrolled template over time. Each successful unlock provides a new embedding, and if it differs slightly from the stored template (due to gradual aging, facial hair changes, or makeup), the Secure Enclave updates the template to incorporate the new data. This is why Face ID continues to work as your appearance changes, without requiring re-enrolment.
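Apple does not publish the update rule, so the following is only one plausible way such adaptation could work: an exponential moving average that blends each accepted embedding into the template and renormalises. The function name, the threshold and the blending rate are all hypothetical.

```python
import numpy as np


def l2_normalise(v):
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)


def update_template(template, embedding, threshold=0.6, rate=0.05):
    """Blend an accepted embedding into the stored template.

    Hypothetical rule: embeddings that do not match are ignored, so an
    impostor attempt cannot drag the template towards a different face.
    """
    template = l2_normalise(template)
    embedding = l2_normalise(embedding)
    if float(template @ embedding) < threshold:
        return template
    return l2_normalise((1.0 - rate) * template + rate * embedding)


enrolled = np.array([1.0, 0.0, 0.0, 0.0])
today = l2_normalise([0.9, 0.436, 0.0, 0.0])     # similar face, slightly changed
unrelated = np.array([0.0, 0.0, 1.0, 0.0])       # orthogonal: not a match

adapted = update_template(enrolled, today)
unchanged = update_template(enrolled, unrelated)
print(float(enrolled @ today), float(adapted @ today))
```

A small rate keeps the template stable against a single odd capture while still tracking slow changes over many unlocks.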

Adversarial Attacks: Fooling The CNN

If a facial recognition system is a neural network that reduces a face to a vector, then the standard toolkit of adversarial machine learning applies. Adversarial attacks modify the input (the face or its surroundings) in ways that cause the network to produce an incorrect embedding.

Physical Adversarial Examples

The most practically relevant attacks are physical: objects that can be worn or applied in the real world.

Adversarial glasses. Sharif and colleagues (2016) demonstrated that specially designed eyeglass frames with printed patterns could cause a face recognition system to either fail to recognise the wearer (dodging) or misidentify them as a specific target identity (impersonation). The patterns are computed by optimising the pixel values of the glasses region to shift the embedding toward or away from a target. In their experiments, the attack transferred across different face recognition models with limited success, but against the specific model the glasses were optimised for, the success rate was above 90%.

IR LED attacks. Infrared LEDs embedded in a hat brim or glasses frame can project patterns onto the face that are invisible to the human eye but visible to the camera sensor (which typically has some NIR sensitivity unless it is filtered out). These patterns alter the face's appearance in the captured image without being visible to nearby people. Research groups at Zhejiang University and Tsinghua University have demonstrated IR LED arrays that defeat face recognition systems in controlled settings.

Adversarial makeup. Geometric patterns applied to the face (especially around the eyes and cheekbones) can alter the CNN's feature extraction enough to prevent recognition. The CV Dazzle project by Adam Harvey explored this with high-contrast asymmetric makeup patterns that break the face's bilateral symmetry, which many detectors rely on. More recent work uses optimisation to design specific makeup patterns for specific target models.
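The optimisation behind the adversarial glasses can be illustrated with a toy model. The sketch below replaces the CNN with a random linear map followed by L2 normalisation, so that the gradient can be written by hand; it is not the procedure from the paper, which also has to account for printability and smoothness. The mask, image size and step size are arbitrary assumptions. Only the masked pixels (the "glasses" region) are allowed to change, and they are pushed by gradient ascent towards a target embedding.

```python
import numpy as np

rng = np.random.default_rng(0)
N_PIXELS, EMB_DIM = 256, 32
W = rng.normal(size=(EMB_DIM, N_PIXELS))   # toy linear stand-in for the network


def embed(x):
    z = W @ (x - 0.5)                      # centre pixels, then project
    return z / np.linalg.norm(z)


def similarity_gradient(x, target):
    """Gradient of cos(embed(x), target) with respect to the pixels x."""
    z = W @ (x - 0.5)
    norm = np.linalg.norm(z)
    u = z / norm
    return W.T @ ((target - (u @ target) * u) / norm)


def optimise_patch(x, target, mask, steps=200, step_size=0.05):
    """Projected gradient ascent restricted to the masked pixels."""
    delta = np.zeros_like(x)
    best_delta, best_sim = delta.copy(), float(embed(x) @ target)
    for _ in range(steps):
        grad = mask * similarity_gradient(x + delta, target)
        delta = delta + step_size * grad / (np.linalg.norm(grad) + 1e-12)
        delta = np.clip(x + delta, 0.0, 1.0) - x      # keep pixels in [0, 1]
        sim = float(embed(x + delta) @ target)
        if sim > best_sim:
            best_delta, best_sim = delta.copy(), sim
    return best_delta, best_sim


wearer = rng.uniform(0.2, 0.8, N_PIXELS)              # flattened toy "face"
target = embed(rng.uniform(0.2, 0.8, N_PIXELS))       # identity to impersonate
mask = np.zeros(N_PIXELS)
mask[:64] = 1.0                                       # the "glasses" pixels

before = float(embed(wearer) @ target)
patch, after = optimise_patch(wearer, target, mask)
print(f"similarity to target: {before:.3f} -> {after:.3f}")
```

For dodging rather than impersonation, the same loop would descend on the similarity to the wearer's own enrolled embedding instead of ascending towards a target.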

Practical Limitations

These attacks have significant practical limitations. Most are model-specific: glasses optimised for ArcFace-ResNet100 may not fool a different architecture or a differently trained model. Transferability (attacking one model and hoping it fools another) is an active research area, but success rates drop substantially. Physical manufacturing introduces noise (printing imprecision, lighting variation, viewing angle changes) that degrades attack effectiveness.

Deployed systems often use ensembles (multiple models) or update their models periodically, both of which reduce the viability of fixed adversarial patches. And any system with liveness detection (discussed next) adds another layer of defence that purely visual adversarial attacks do not address.

Liveness Detection: Is This A Real Face?

A face recognition system that does not check whether the presented face is alive is vulnerable to trivial attacks: holding up a printed photo, replaying a video on a screen, or wearing a 3D mask. Liveness detection (also called Presentation Attack Detection, or PAD) adds a check for physical presence.

Passive Liveness

Passive liveness analyses a single image or short video without requiring any specific action from the user. It looks for signs that the input is a replay or printout:

Moiré patterns (the interference pattern visible when a screen is photographed).

Screen boundaries or bezels visible in the frame.

Abnormal specular reflections (paper reflects light differently from skin).

Lack of micro-expressions or micro-movements that a live face produces involuntarily (subtle muscle movements, blood flow-related skin colour changes).

Modern passive liveness systems use a CNN trained on datasets of real faces, printed photos, and screen replays. The best systems (tested under the ISO/IEC 30107-3 standard) achieve attack rejection rates above 99% for basic print and replay attacks.

Active Liveness

Active liveness requires the user to perform a specific action: blink, turn their head, smile, or follow a moving target with their eyes. The system verifies that the action was performed correctly and in real time. This defeats static photos and pre-recorded videos, unless the attacker has a video of the target performing the exact requested action. Choosing the requested action at random for each attempt turns the check into a challenge-response protocol, which makes a matching pre-recorded video much harder to prepare.
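A minimal sketch of the challenge-response idea follows. It assumes a separate, hypothetical action classifier that turns the video into a list of observed actions; the action names, the number of actions and the time limit are illustrative.

```python
import secrets
import time

ACTIONS = ("blink", "turn_left", "turn_right", "smile")


def issue_challenge(n_actions=2):
    """Pick a random sequence of actions; unpredictability is the point."""
    return {
        "actions": [secrets.choice(ACTIONS) for _ in range(n_actions)],
        "issued_at": time.monotonic(),
    }


def verify_response(challenge, observed_actions, max_seconds=5.0, now=None):
    """Accept only the requested sequence, performed within the time limit.

    `observed_actions` is assumed to come from an action classifier that
    is not shown here.
    """
    now = time.monotonic() if now is None else now
    on_time = (now - challenge["issued_at"]) <= max_seconds
    return on_time and list(observed_actions) == challenge["actions"]


challenge = issue_challenge(3)
t0 = challenge["issued_at"]
print(challenge["actions"])
print(verify_response(challenge, challenge["actions"], now=t0 + 1.0))
```

The time limit matters as much as the randomness: without it, an attacker could record the challenge and synthesise a response offline.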

The weakness of active liveness is user experience. Requiring a user to perform a head-turn adds seconds to the verification process and can be confusing, especially for older users or those with motor impairments.

Depth-Based Liveness

Systems with 3D sensing, like Face ID, get liveness detection almost for free. A printed photo, a screen, and a mask all have distinctive depth profiles that differ from a live face. A photo is perfectly flat. A screen is flat at screen distance. A mask may approximate facial geometry but lacks the depth of eye sockets, the reflective properties of eyes, and the micro-movements of skin.
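The flat-surface case is easy to check numerically. The sketch below fits a plane to a depth map by least squares and measures how much relief is left over; a photo or screen leaves almost none, even when held at an angle. The 3 mm threshold and the synthetic depth maps are assumptions, and a test like this says nothing about masks, which need the richer cues described above.

```python
import numpy as np


def plane_residual_rms(depth):
    """RMS deviation (same units as depth) from the best-fit plane."""
    h, w = depth.shape
    ys, xs = np.mgrid[0:h, 0:w]
    design = np.column_stack([xs.ravel(), ys.ravel(), np.ones(h * w)])
    coeffs, *_ = np.linalg.lstsq(design, depth.ravel(), rcond=None)
    residual = depth.ravel() - design @ coeffs
    return float(np.sqrt(np.mean(residual ** 2)))


def looks_flat(depth, min_relief_m=0.003):
    """True if the surface has less relief than a (hypothetical) threshold."""
    return plane_residual_rms(depth) < min_relief_m


ys, xs = np.mgrid[0:64, 0:64]

# A tilted photo or screen: depth varies linearly across the image.
tilted_photo = 0.40 + 0.001 * xs + 0.0005 * ys

# A crude face-like bump, 3 cm nearer to the camera at its centre.
r2 = (xs - 31.5) ** 2 + (ys - 31.5) ** 2
bump = 0.40 - 0.03 * np.exp(-r2 / (2 * 12.0 ** 2))

print(looks_flat(tilted_photo), looks_flat(bump))
```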

Depth-based liveness is the most robust approach currently deployed. It is the primary reason that Face ID has not been spoofed in practice by casual attackers (academic demonstrations have used custom-made silicone masks costing thousands of euros and requiring physical measurements of the target's face).

Police Use In Europe: Databases And Controversies

Facial recognition in law enforcement across Europe involves a patchwork of national databases, Europol's centralised systems, and commercially available search tools.

National Police Databases

Most EU member states maintain databases of mugshot photos linked to criminal records. France's TAJ database (Traitement d'Antecedents Judiciaires) contains approximately 19 million entries with photographs. Germany's INPOL database holds biometric data including facial images for millions of subjects. The UK, no longer an EU member, maintains the Police National Database with facial images and continues to develop live facial recognition systems, which the Metropolitan Police deploys in London despite significant legal challenges and accuracy concerns.

These databases are used primarily for retrospective identification: a detective has a still from CCTV footage and searches it against the mugshot database. The search returns a ranked list of candidates, and an officer makes the final identification decision. This workflow is standard across European police forces and is generally less controversial than real-time surveillance, because it involves targeted searches with human review rather than mass automated screening.

Europol And The Pruem Framework

Europol's facial recognition capabilities fall under the Pruem II regulation (adopted in 2024), which expands the existing Pruem framework for cross-border biometric data exchange. Under Pruem II, member states can query each other's facial image databases for law enforcement purposes. Europol itself operates a facial recognition system that searches against its own databases of images related to serious crime and terrorism investigations.

The technical infrastructure uses the ELVI (Europol Verification and Identification) system, which implements standard embedding-based search with human-in-the-loop verification. Queries are logged, and member states retain sovereignty over their data: a query from Greek police against the French database returns a ranked list to the French authorities, who decide what information to share.

PimEyes And Commercial Search

PimEyes is a commercially available facial recognition search engine based in Poland (though its corporate structure has shifted over time). It indexes publicly available images from the web and allows anyone to upload a face photo and find matching images across its database. As of 2025, PimEyes claims an index of over 900 million faces.

PimEyes has been used by journalists, stalkers, private investigators, and individuals seeking to find out where their photos appear online. Its legality under GDPR has been challenged. Regulators have already acted against a comparable service: in 2021, the Hamburg data protection authority (Germany) ordered Clearview AI to delete a complainant's biometric data, and in 2022 the Italian Garante (data protection authority) issued a €20 million fine against Clearview AI. PimEyes faces similar scrutiny, and the Garante has investigated it.

The technical challenge PimEyes represents is that the same embedding-and-search pipeline used by law enforcement is available to anyone with a web browser. The technology is not inherently a policing tool; it is a search tool that happens to use faces as queries.

Clearview AI: Scraping The Web

Clearview AI, founded in 2017 by Hoan Ton-That and Richard Schwartz, built a facial recognition database by scraping billions of publicly available photos from social media platforms, news sites, and other web sources. By 2020, the company claimed a database of over 3 billion images (later growing to over 40 billion), linked to their source URLs.

How The Scraping Worked

Clearview's system crawled the public web, downloaded images, ran face detection on them, computed embeddings, and stored the embeddings alongside the source URLs. When a client (typically a law enforcement agency) uploaded a probe face, the system searched the embedding database and returned matching images with their source URLs, which could include social media profiles, news articles, and other web pages.
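The storage-and-lookup half of that description can be sketched as a small in-memory index that keeps each embedding next to its source URL. This is an illustration of the data structure only: the class name is hypothetical, the URLs are placeholders, random vectors stand in for real embeddings, and the crawling, detection and embedding stages are left out.

```python
import numpy as np


class FaceIndex:
    """Toy index: L2-normalised embeddings stored alongside source URLs."""

    def __init__(self, dim=512):
        self.dim = dim
        self.vectors = np.empty((0, dim), dtype=np.float32)
        self.urls = []

    def add(self, embedding, url):
        v = np.asarray(embedding, dtype=np.float32)
        v = v / np.linalg.norm(v)
        self.vectors = np.vstack([self.vectors, v[None, :]])
        self.urls.append(url)

    def search(self, probe, k=5):
        """Return the k most similar entries as (url, cosine similarity)."""
        q = np.asarray(probe, dtype=np.float32)
        q = q / np.linalg.norm(q)
        scores = self.vectors @ q                # cosine, since all are unit length
        top = np.argsort(-scores)[:k]
        return [(self.urls[i], float(scores[i])) for i in top]


rng = np.random.default_rng(0)
faces = rng.normal(size=(3, 8))                  # stand-ins for real embeddings
urls = ["https://example.com/a", "https://example.com/b", "https://example.com/c"]

index = FaceIndex(dim=8)
for face, url in zip(faces, urls):
    index.add(face, url)

probe = faces[1] + 0.01 * rng.normal(size=8)     # new photo of the second face
results = index.search(probe, k=2)
print(results)
```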

The technical pipeline is standard: detect, align, embed, search. The innovation, if it can be called that, was purely in scale and in the willingness to scrape without permission. Facebook, LinkedIn, Twitter, and YouTube all sent cease-and-desist letters. Clearview argued that publicly available images were fair game under the First Amendment (in the US context).

Legal Battles In Europe

Europe's response was unambiguous. Under GDPR, processing biometric data (which facial embeddings are, per Article 9) requires explicit consent or a specific legal basis. Scraping billions of photos and computing facial embeddings without consent is a clear violation.

The enforcement actions came quickly:

The French CNIL fined Clearview AI €20 million in October 2022 and ordered the company to delete all data on French residents.

The Italian Garante fined Clearview AI €20 million in March 2022, on similar grounds.

The Greek DPA (Hellenic Data Protection Authority) fined Clearview AI €20 million in July 2022.

The UK ICO fined Clearview AI £7.5 million (later overturned on jurisdictional grounds, but the underlying GDPR analysis stood).

The Swedish DPA found that the Swedish Police Authority's use of Clearview AI violated Swedish law and GDPR, and fined the police approximately SEK 2.5 million (about €250,000).

Austria's DSB found Clearview AI in violation of GDPR in 2022.

Clearview AI does not operate in the EU market as of 2025, but the precedent is significant: European data protection authorities have unanimously held that mass scraping of facial images for biometric processing violates GDPR, regardless of whether the source images were publicly available.

The EU AI Act: Regulating Facial Recognition

The EU AI Act, which entered into force in August 2024 with phased implementation through 2027, is the world's first comprehensive legislation specifically addressing artificial intelligence, including facial recognition.

Classification As High-Risk

Under the AI Act, facial recognition systems used for identification of natural persons fall under the high-risk category (Annex III, point 1). High-risk AI systems must comply with a set of requirements including:

A risk management system that identifies and mitigates risks throughout the system's lifecycle.

Data governance requirements: training data must be relevant, representative, and as free from bias as reasonably achievable.

Technical documentation describing the system's design, capabilities, and limitations.

Logging and traceability: the system must record enough information to audit its decisions after the fact.

Human oversight: a qualified human must be able to intervene in or override the system's decisions.

Accuracy, robustness, and cybersecurity requirements appropriate to the system's intended use.

The Real-Time Biometric Surveillance Ban

Article 5 of the AI Act prohibits real-time remote biometric identification systems in publicly accessible spaces for law enforcement purposes, with three narrowly defined exceptions:

Targeted search for specific victims of abduction, trafficking, or sexual exploitation.

Prevention of a specific, substantial, and imminent threat to life or a foreseeable terrorist attack.

Identification of a suspect of a serious criminal offence (one carrying a maximum sentence of at least four years in the relevant member state).

Even where an exception applies, the use must be authorised by a judicial authority or an independent administrative authority, and is subject to temporal and geographic limitations. Member states that choose to allow any of these exceptions must notify the European Commission and the relevant national market surveillance authority.

Post-event facial recognition (retrospective searching of recorded footage) is permitted but classified as high-risk and subject to the full set of high-risk requirements, including judicial authorisation for law enforcement use.

Practical Implications

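The AI Act creates a tiered regulatory framework. A company deploying facial recognition to identify people remotely (for example, matching visitors against a gallery of enrolled identities) faces the high-risk compliance requirements. Pure one-to-one verification, where the sole purpose is to confirm that a person is who they claim to be (as in building access control for enrolled users), is expressly excluded from that Annex III category. A law enforcement agency deploying real-time identification in a public space faces an outright ban unless it can invoke one of the three exceptions and obtain judicial authorisation.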

For the technology industry, the most significant requirement is the bias testing obligation. The AI Act requires high-risk systems to be tested across demographic groups before deployment, and the results must be documented. This gives legal force to the kind of demographic evaluation that NIST FRVT has been doing voluntarily for years.

For citizens, the AI Act creates a right to know when a facial recognition system is being used in their vicinity (through public signage requirements) and a right to lodge complaints with national authorities if they believe the system is being used unlawfully.

How Cosine Similarity Search Scales

A practical consideration that benchmarks rarely discuss is the computational cost of searching a large gallery. If you have N enrolled templates and one probe, a brute-force search requires N cosine similarity computations. Each is a 512-dimensional dot product, which is 512 multiply-add operations. For N = 1 million, that is 512 million multiply-add operations per probe.

On a modern CPU with AVX-512 vector instructions, a 512-dimensional dot product takes about 50 nanoseconds. A million of those takes about 50 milliseconds, which is acceptable for many applications. On a GPU, the same operation (batched as a matrix-vector product) takes under 1 millisecond for a million templates.
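The arithmetic, and the matrix-vector form of the search, look like this in NumPy. The 50 ns figure is the estimate from the text rather than a measurement, and the gallery here is 10,000 random unit vectors so that the example runs quickly.

```python
import numpy as np

DIM = 512
N = 1_000_000
NS_PER_DOT = 50                       # per-dot-product estimate from the text

macs = N * DIM                        # multiply-adds per probe
cpu_ms = N * NS_PER_DOT / 1e6         # nanoseconds -> milliseconds
print(f"{macs:,} multiply-adds, about {cpu_ms:.0f} ms on CPU")


def brute_force_best(gallery, probe):
    """Exhaustive search: one matrix-vector product, then an argmax.

    Assumes every row of `gallery` and `probe` is L2-normalised, so the
    dot product equals the cosine similarity.
    """
    scores = gallery @ probe
    best = int(np.argmax(scores))
    return best, float(scores[best])


rng = np.random.default_rng(0)
gallery = rng.normal(size=(10_000, DIM)).astype(np.float32)
gallery /= np.linalg.norm(gallery, axis=1, keepdims=True)

probe = gallery[1234] + 0.01 * rng.normal(size=DIM).astype(np.float32)
probe /= np.linalg.norm(probe)

best_index, best_score = brute_force_best(gallery, probe)
print(best_index, round(best_score, 3))
```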

For databases in the tens of millions, brute-force search becomes expensive. Approximate nearest-neighbour (ANN) algorithms are used instead. The most common are:

HNSW (Hierarchical Navigable Small World): builds a multi-layer graph over the embeddings. Each layer is a navigable small-world graph, and search proceeds by navigating from the top (sparse) layer down to the bottom (dense) layer. HNSW achieves 99%+ recall with 10 to 100 times fewer distance computations than brute force.

IVF (Inverted File Index): partitions the embedding space into Voronoi cells using k-means clustering. At search time, only the templates in the closest few cells are compared. Combined with product quantisation (IVF-PQ), this reduces both computation and memory: instead of storing full 512-dimensional vectors (2,048 bytes each as 32-bit floats), templates are compressed to 64 or 128 bytes each.

Meta's FAISS library (originally released by Facebook AI Research) is the most widely used implementation of both HNSW and IVF-PQ. It runs on both CPU and GPU and is used in production systems that index billions of face embeddings.

The accuracy cost of approximate search is small: less than 0.5% recall loss at typical operating points. The speed gain is massive: searching a 100-million-template database in under 10 milliseconds on a single GPU.
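The inverted-file idea is simple enough to sketch in plain NumPy. The toy below is not FAISS and leaves out product quantisation entirely; in FAISS the corresponding index types are `IndexIVFPQ` and `IndexHNSWFlat`. The synthetic gallery (200 identities with 50 templates each, 64 dimensions), the cell count and the number of probed cells are arbitrary assumptions. The loop at the end compares the approximate result with exhaustive search and records how much of the gallery was actually scanned.

```python
import numpy as np


def normalise(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def train_centroids(data, n_cells, iters=10, seed=0):
    """Spherical k-means: cluster unit vectors by cosine similarity."""
    rng = np.random.default_rng(seed)
    centroids = data[rng.choice(len(data), n_cells, replace=False)].copy()
    for _ in range(iters):
        assign = np.argmax(data @ centroids.T, axis=1)
        for c in range(n_cells):
            members = data[assign == c]
            if len(members):
                centroids[c] = normalise(members.mean(axis=0))
    return centroids


class ToyIVF:
    def __init__(self, data, n_cells=32):
        self.data = data
        self.centroids = train_centroids(data, n_cells)
        assign = np.argmax(data @ self.centroids.T, axis=1)
        # Inverted lists: for each cell, the indices of its templates.
        self.cells = [np.flatnonzero(assign == c) for c in range(n_cells)]

    def search(self, probe, nprobe=4):
        """Return (best index, score, number of templates scanned)."""
        order = np.argsort(-(self.centroids @ probe))[:nprobe]
        candidates = np.concatenate([self.cells[c] for c in order])
        if candidates.size == 0:
            return -1, float("-inf"), 0
        scores = self.data[candidates] @ probe
        best = int(np.argmax(scores))
        return int(candidates[best]), float(scores[best]), int(candidates.size)


# Synthetic gallery: 200 identities x 50 templates each, 64-d for speed.
rng = np.random.default_rng(1)
centres = normalise(rng.normal(size=(200, 64)))
gallery = normalise(np.repeat(centres, 50, axis=0)
                    + 0.05 * rng.normal(size=(10_000, 64)))
index = ToyIVF(gallery)

hits, scanned = 0, 0
for identity in range(100):
    probe = normalise(centres[identity] + 0.05 * rng.normal(size=64))
    exact = int(np.argmax(gallery @ probe))       # exhaustive search
    approx, _, n = index.search(probe)
    hits += int(approx == exact)
    scanned += n

recall = hits / 100
scanned_fraction = scanned / (100 * len(gallery))
print(f"recall@1: {recall:.2f}, fraction of gallery scanned: {scanned_fraction:.2f}")
```

Raising `nprobe` trades speed for recall, which is the same dial that production IVF indexes expose.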

Putting It All Together: The End-To-End Pipeline

To make the full pipeline concrete, here is what happens when a surveillance camera in a European airport captures a frame containing a passenger's face.

  1. Frame capture. The camera produces a 3840x2160 frame at 30 fps. The frame is sent over a network link to a processing server.

  2. Face detection. RetinaFace (or a comparable detector) processes the frame on a GPU in about 20 ms. It outputs bounding boxes and five-point landmarks for every detected face. In a busy terminal, this might be 10 to 50 faces per frame.

  3. Face alignment. For each detected face, an affine transformation warps the image region around the bounding box into a 112x112 aligned crop using the five landmark positions. This is a simple matrix operation that takes under 1 ms per face.

  4. Embedding. Each aligned crop is passed through an ArcFace-trained ResNet-100 network. On a modern GPU, a batch of 50 crops can be embedded in about 15 ms. Each crop produces a 512-dimensional L2-normalised vector.

  5. Search. Each embedding is compared against the gallery of enrolled templates (perhaps 50 million entries for a national watch list). Using FAISS with IVF-PQ, this takes about 5 ms per probe.

  6. Decision. If the highest cosine similarity exceeds the threshold (set based on the desired FAR/FRR trade-off), the system flags the match and presents it to a human operator with the matched identity, the similarity score, and the source image.

  7. Human review. The operator examines the probe image and the enrolled photo, considers the similarity score, and decides whether to act. In most European deployments, the system's output is advisory; the final identification decision is made by a person.

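Total latency from frame capture to human alert: roughly 50 to 100 ms, depending on network and hardware, and assuming the gallery searches for all faces in a frame run as a batch or in parallel rather than one after another. Frames are processed independently, so at 30 fps and up to 50 faces per frame the system handles approximately 1,500 face detections per second per GPU. That is close to the ceiling: detection and embedding together take about 35 ms per frame, slightly more than the 33 ms between frames, so sustaining that load requires pipelining the stages or a small reduction in frame rate.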
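The budget can be checked by adding up the per-stage figures from the steps above. The sketch assumes the worst-case 1 ms per face for alignment and, as a labelled assumption, that a batched search over all probes costs about the same as a single probe; network transfer is ignored.

```python
DETECT_MS = 20.0            # step 2
ALIGN_MS_PER_FACE = 1.0     # step 3, upper bound
EMBED_MS_PER_BATCH = 15.0   # step 4, batch of up to 50 crops
SEARCH_MS_PER_PROBE = 5.0   # step 5


def frame_latency_ms(n_faces, batched_search=True):
    """Processing time for one frame, excluding network transfer.

    Assumption: a batched search over all probes costs about the same as
    one probe. Without batching, search time grows with the face count.
    """
    search = SEARCH_MS_PER_PROBE * (1 if batched_search else n_faces)
    return DETECT_MS + ALIGN_MS_PER_FACE * n_faces + EMBED_MS_PER_BATCH + search


quiet = frame_latency_ms(10)                              # 10 faces in frame
busy = frame_latency_ms(50)                               # 50 faces in frame
busy_sequential = frame_latency_ms(50, batched_search=False)

faces_per_second = 30 * 50                                # 30 fps x 50 faces
gpu_ms_per_frame = DETECT_MS + EMBED_MS_PER_BATCH         # GPU-bound stages
max_fps_one_gpu = 1000.0 / gpu_ms_per_frame

print(quiet, busy, busy_sequential)
print(faces_per_second, round(max_fps_one_gpu, 1))
```

With these figures, the quoted latency range only holds if search is batched, and a single GPU tops out slightly below 30 frames per second at 50 faces per frame.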

This is the pipeline. Every component is a known technology. The face detector is a published architecture. The embedding network is trained with a published loss function on a public (or quasi-public) dataset. The search index is an open-source library. The only proprietary element in most deployments is the specific training data and the system integration.
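As an example of how little is hidden in these components, the alignment step can be written in about twenty lines. The sketch estimates a similarity transform (rotation, uniform scale and translation, a restricted form of the affine warp mentioned in step 3) with the Umeyama least-squares method. The reference coordinates are the five-point layout commonly used for 112x112 crops in open-source ArcFace pipelines; treat them as an assumption about any particular deployment. The "detected" landmarks are synthetic.

```python
import numpy as np

# Target positions (x, y) of left eye, right eye, nose tip and mouth corners
# in a 112x112 crop. Commonly used template; an assumption here.
REFERENCE_112 = np.array([
    [38.2946, 51.6963],
    [73.5318, 51.5014],
    [56.0252, 71.7366],
    [41.5493, 92.3655],
    [70.7299, 92.2041],
])


def estimate_similarity(src, dst):
    """Least-squares similarity transform (Umeyama) mapping src onto dst.

    Returns a 2x3 matrix [s*R | t] such that dst ~= s * R @ src + t.
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    src_mean, dst_mean = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - src_mean, dst - dst_mean
    cov = dst_c.T @ src_c / len(src)
    U, S, Vt = np.linalg.svd(cov)
    d = np.ones(2)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        d[-1] = -1.0                               # rule out reflections
    R = U @ np.diag(d) @ Vt
    scale = (S * d).sum() / src_c.var(axis=0).sum()
    t = dst_mean - scale * R @ src_mean
    return np.hstack([scale * R, t[:, None]])


def apply_transform(M, points):
    return np.asarray(points, dtype=float) @ M[:, :2].T + M[:, 2]


# Synthetic detection: the reference layout, rotated 20 degrees, enlarged
# 2.5x and shifted, as if the face were tilted in a larger frame.
theta = np.deg2rad(20.0)
R0 = np.array([[np.cos(theta), -np.sin(theta)],
               [np.sin(theta), np.cos(theta)]])
detected = 2.5 * REFERENCE_112 @ R0.T + np.array([300.0, 120.0])

M = estimate_similarity(detected, REFERENCE_112)
aligned = apply_transform(M, detected)
print(np.round(aligned, 2))
```

In a real pipeline the same 2x3 matrix would be handed to an image-warping routine to resample the pixels into the 112x112 crop; only the landmark geometry is shown here.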

The debate about facial recognition is not about whether the technology works. It works remarkably well under good conditions, and it is improving every year. The debate is about who gets to use it, on whom, under what constraints, and with what accountability when it fails. The EU has chosen to answer those questions through legislation. Other jurisdictions are watching.