Roboflow Glossary

A glossary of common computer vision terms.

Written by Mohamed Traore

Last published at: June 14th, 2022

This is an auto-generated article of all your definitions within the glossary.


  • Accuracy

    The proportion of a model's predictions that are correct. Common in classification models that have a single correct answer (vs object detection, where there is a gradient from "perfect" to "pretty close" to "completely wrong"). Terms such as "top-5 accuracy" are often used, meaning "how much of the time was the correct answer among the model's top 5 most confident predictions?" Top-1 accuracy and top-3 accuracy are also common.
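
    As a minimal sketch, top-k accuracy could be computed as follows. The function name top_k_accuracy and the dictionary-of-scores input format are assumptions made for this illustration, not the API of any particular library.

```python
def top_k_accuracy(scores, true_labels, k=5):
    """Fraction of examples whose true label is among the k most confident classes.

    scores: list of dicts mapping class name -> confidence (hypothetical format).
    true_labels: list with the correct class name for each example.
    """
    hits = 0
    for class_scores, truth in zip(scores, true_labels):
        # Rank class names from most to least confident and keep the first k.
        top_k = sorted(class_scores, key=class_scores.get, reverse=True)[:k]
        if truth in top_k:
            hits += 1
    return hits / len(true_labels)


predictions = [
    {"cat": 0.7, "dog": 0.2, "fox": 0.1},
    {"cat": 0.5, "dog": 0.1, "fox": 0.4},
    {"cat": 0.1, "dog": 0.3, "fox": 0.6},
    {"cat": 0.6, "dog": 0.3, "fox": 0.1},
]
truth = ["cat", "fox", "dog", "fox"]

print(top_k_accuracy(predictions, truth, k=1))  # 0.25
print(top_k_accuracy(predictions, truth, k=2))  # 0.75
```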

  • Annotation

    The "answer key" for each image. Annotations are markup placed on an image (bounding boxes for object detection, polygons, or a segmentation map for segmentation) to teach the model the ground truth.

  • Annotation format

    The particular way of encoding an annotation. There are many ways to describe a bounding box's size and position (JSON, XML, TXT, etc) and to delineate which annotation goes with which image.

  • Annotation group

    Describes what types of object you are identifying. For example, "Chess Pieces" or "Vehicles". Classes (eg "rook", "pawn") are members of an annotation group.

  • Architecture

    A specific neural network layout (layers, neurons, blocks, etc). These often come in multiple sizes whose design is similar except for the number of parameters. For example, EfficientDet ranges from D0 (smallest) to D7 (largest).

  • AUC

    Area Under the Curve. An evaluation metric for the efficacy of a prediction system that trades off precision against recall. The precision-recall curve typically slopes downward: as a prediction algorithm's confidence threshold is lowered, it makes more, but less precise, predictions. The larger the area under the curve, the better the system performs across thresholds.
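
    One simple way to approximate the area is the trapezoidal rule over a few sampled (recall, precision) points. This is a sketch under that assumption; the function name area_under_curve and the sample points are made up for illustration, and evaluation tools may interpolate the curve differently.

```python
def area_under_curve(xs, ys):
    """Trapezoidal area under a curve whose points are sorted by increasing x."""
    area = 0.0
    for i in range(1, len(xs)):
        width = xs[i] - xs[i - 1]
        mean_height = (ys[i] + ys[i - 1]) / 2
        area += width * mean_height
    return area


# Hypothetical precision-recall points, ordered by increasing recall.
recall = [0.0, 0.5, 1.0]
precision = [1.0, 1.0, 0.0]

print(area_under_curve(recall, precision))  # 0.75
```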

  • Augmentation

    Creating more training examples by distorting your input images so your model doesn't overfit on specific training examples. For example, you may flip, rotate, blur, or add noise. On Roboflow, augmentations are only applied to images within your training set.

  • Auto-ML

    One-click training of models that optimize themselves (usually hosted in the cloud). Auto-ML tools can be a good starting point, a good baseline, and in some cases an "it just works" solution vs tuning your own models.

  • Backprop

    Back propagation is the way that neural networks improve themselves. For each batch of training data, they do a “forward pass” through the network and then find the direction of the “gradient” of each neuron in each layer from the end working backwards and adjust it a little bit in the direction that most reduces the loss function. Over millions of iterations, they get better bit by bit and this is how they “learn” to fit the training data.

  • Batch inference

    Making predictions on many frames at once to take advantage of the GPU’s ability to perform parallel operations. This can help improve performance if you are doing offline (as opposed to real-time) prediction. It increases throughput (but not FPS).

  • Batch size

    The number of images your model is training on in each step. This is a hyperparameter you can adjust. There are pros (faster training) and cons (increased memory usage) to increasing the batch size. It can also affect the model’s overall accuracy (and there is a bit of an art to choosing a good batch size as it is dependent on a number of factors). You may want to experiment with larger or smaller batch sizes.

  • Black Box

    A system that makes it hard to peek behind the curtain to understand what is going on. Neural networks are often described as black boxes because it can be hard to explain “why” they are making a particular prediction. Model explainability is currently a hot topic and field of study.

  • Bounding box

    A rectangular region of an image containing an object. Commonly described by its min/max x/y positions or a center point (x/y) and its width and height (w/h) along with its class label.
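
    The two descriptions carry the same information, so converting between them is simple arithmetic. A minimal sketch, with function names chosen for this illustration only:

```python
def corners_to_center(x_min, y_min, x_max, y_max):
    """Convert min/max corners to (x_center, y_center, width, height)."""
    width = x_max - x_min
    height = y_max - y_min
    return (x_min + width / 2, y_min + height / 2, width, height)


def center_to_corners(x_center, y_center, width, height):
    """Convert a center point plus width/height back to min/max corners."""
    return (
        x_center - width / 2,
        y_center - height / 2,
        x_center + width / 2,
        y_center + height / 2,
    )


print(corners_to_center(10, 20, 50, 100))  # (30.0, 60.0, 40, 80)
print(center_to_corners(30, 60, 40, 80))   # (10.0, 20.0, 50.0, 100.0)
```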

  • Checkpoint

    A point-in-time snapshot of your model’s weights. Oftentimes you will capture a checkpoint at the end of each epoch so you can go back to it later if your model’s performance degrades because it starts to overfit.

  • Class

    A type of thing to be identified. For example, a model identifying pieces on a Chess board might have the following classes: white-pawn, black-pawn, white-rook, black-rook, white-knight, black-knight, white-bishop, black-bishop, white-queen, black-queen, white-king, black-king. The Annotation Group in this instance would be “Chess Pieces”.

  • Class balance

    The relative distribution between the number of examples of each class. Models generally perform better if there is a relatively even number of examples for each class. If there are too few of a particular class, that class is “under-represented”. If there are many more instances of a particular class, that class is “over-represented”.

  • Classification

    A type of computer vision task that aims to determine only whether a certain class is present in an image (but not its location).

  • COCO

    The Microsoft Common Objects in Context dataset contains over 200,000 labeled images with 80 object classes (ranging from “person” to “handbag” to “sink”). MS COCO is a standard dataset used to benchmark different models to compare their performance. Its JSON annotation format has also become commonly used for other datasets.

  • Colab

    Google Colaboratory is a free platform that provides hosted Jupyter Notebooks connected to free GPUs.

  • Computer Vision

    The field pertaining to making sense of imagery. Images are just a collection of pixel values; with computer vision we can take those pixels and gain understanding of what they represent.

  • Confidence

    A model is inherently statistical. Along with its prediction, it also outputs a confidence value that quantifies how “sure” it is that its prediction is correct.

  • Confidence threshold

    We often discard predictions that fall below a certain bar. This bar is the confidence threshold.
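
    A minimal sketch of applying a confidence threshold. The prediction format (a list of dictionaries with a "confidence" key) and the default of 0.5 are assumptions for this example.

```python
def filter_by_confidence(predictions, threshold=0.5):
    """Keep only the predictions whose confidence meets the threshold."""
    return [p for p in predictions if p["confidence"] >= threshold]


# Hypothetical model output.
predictions = [
    {"class": "white-pawn", "confidence": 0.92},
    {"class": "black-rook", "confidence": 0.41},
    {"class": "white-king", "confidence": 0.67},
]

kept = filter_by_confidence(predictions, threshold=0.5)
print([p["class"] for p in kept])  # ['white-pawn', 'white-king']
```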

  • Convolutional Neural Network (CNN, ConvNet)

    The most common type of network used in computer vision. By combining many convolutional layers, it can learn about more and more complex concepts. The early layers learn about things like horizontal, vertical, and diagonal lines and blocks of similar colors, the middle layers learn about combinations of those features like textures and corners, and the final layers learn to combine those features into identifying higher level concepts like “ears” and “clocks”.

  • CoreML

    A proprietary format used to encode weights for Apple devices that takes advantage of the hardware-accelerated neural engine present on iPhone and iPad devices.

  • CreateML

    A no-code training tool created by Apple that will train machine learning models and export to CoreML. It supports classification and object detection along with several types of non computer-vision models (such as sound, activity, and text classification).

  • CUDA

    NVIDIA’s platform for writing general-purpose GPU-optimized code. This is how we are able to use GPU devices originally designed for 3D games to accelerate neural networks.

  • CuDNN

    NVIDIA’s CUDA Deep Neural Network library is a set of tools built on top of CUDA pertaining specifically to efficiently running neural networks on the GPU.

  • curl

    A command-line program commonly used to upload and download files on UNIX-like operating systems (which is now included with Windows 10 as well).

  • Custom dataset

    A set of images and annotations pertaining to a domain-specific problem. In contrast to a research benchmark dataset like COCO or Pascal VOC.

  • Darknet

    A C-based neural network framework created and popularized by Joseph Redmon (known online as pjreddie), the inventor of the YOLO family of object detection models.

  • Dataset

    A collection of data and ground truth of outputs that you use to train a machine learning model by example. For object detection, this would be your set of images (data) and annotations (ground truth) that you would like your model to learn to predict.

  • Deploy

    Taking the results of a trained model and using them to do inference on real-world data. This could mean hosting a model on a server or installing it to an edge device.

  • Domain-specific

    Problems or techniques that are not generally applicable. For example, if you’re trying to detect tumors in X-rays, anything that has to do with cancer biology is domain-specific because it wouldn’t apply to someone working on measuring traffic flows via satellite imagery.

  • Early Stopping

    Detecting when your model has reached peak performance and terminating the training job prior to “completion.” There are a number of heuristics you can use to determine your model has reached a local maximum; stopping early can prevent overfitting and save you from wasting time and compute resources.
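
    One common heuristic is "patience": stop once the validation loss has failed to improve for a set number of epochs. The sketch below assumes that heuristic; the function name, the patience value, and the loss values are illustrative only.

```python
def early_stopping_epoch(validation_losses, patience=2):
    """Return (stop_epoch, best_epoch) for a patience-based stopping rule.

    Training stops at the first epoch that is `patience` epochs past the
    best validation loss seen so far; otherwise it runs to the last epoch.
    """
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(validation_losses):
        if loss < best_loss:
            best_loss = loss
            best_epoch = epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    return len(validation_losses) - 1, best_epoch


# Validation loss improves until epoch 2, then starts to rise (overfitting).
losses = [0.90, 0.70, 0.60, 0.62, 0.65, 0.70]
print(early_stopping_epoch(losses, patience=2))  # (4, 2)
```

    In practice you would restore the checkpoint saved at the best epoch (epoch 2 here) rather than keep the weights from the epoch where training stopped.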

  • Edge deployment

    Deploying to a device that will make predictions without uploading the data to a central server over the Internet. This could be an iPhone or Android device, a Raspberry Pi, an NVIDIA Jetson, a robot, or even a full computer with a GPU located on-site.

  • Epochs

    The number of times to run through your training data.
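
    Epochs and batch size together determine how many training steps a job runs. A small sketch of that arithmetic, assuming the final partial batch of each epoch is kept rather than dropped (the function name is hypothetical):

```python
import math


def total_training_steps(num_images, batch_size, epochs):
    """Number of weight updates for a full training run."""
    # The last batch of an epoch may be smaller than batch_size, hence ceil.
    steps_per_epoch = math.ceil(num_images / batch_size)
    return steps_per_epoch * epochs


print(total_training_steps(num_images=1000, batch_size=16, epochs=10))  # 630
```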

  • EXIF

    Metadata attached to images (for example, orientation, GPS data, information about the capture device, shutter speed, f-stop, etc).

  • Export

    In Roboflow, an export is a serialized version of a dataset that can be downloaded.

  • F1

    A measure of the efficacy of a prediction system. F1 is the harmonic mean of precision (guessing correctly when the system does guess) and recall (guessing enough times to find what is actually there). A high F1 requires both: the system finds most of what it should and is usually right when it guesses.
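
    As a sketch, precision, recall, and F1 can be computed from counts of true positives, false positives, and false negatives. The function name and the example counts are made up for illustration.

```python
def precision_recall_f1(true_positives, false_positives, false_negatives):
    """Return (precision, recall, f1) from raw prediction counts."""
    if true_positives == 0:
        # No correct guesses: all three metrics are zero by convention.
        return 0.0, 0.0, 0.0
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# 8 correct detections, 2 spurious detections, 8 missed objects.
precision, recall, f1 = precision_recall_f1(8, 2, 8)
print(precision, recall, round(f1, 3))  # 0.8 0.5 0.615
```

    Because F1 is a harmonic mean, it sits closer to the lower of the two values (0.5 here) than a plain average would.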

  • FP16

    16-bit floating point. (Also known as half-precision.)

  • FP8

    8-bit floating-point. (Also known as quarter-precision.) Reducing the precision of your model can improve its speed and reduce its memory usage, usually at some cost in accuracy, and can also take advantage of features of newer GPUs like tensor cores.

  • FPS

    Frames per second. In real-time inference, this is the measure of how many sequential inference operations a model can perform each second. A higher number means a faster model.

  • Framework

    Deep learning frameworks implement neural network concepts. Some are designed for training and inference (TensorFlow, PyTorch, FastAI, etc.), and others are designed particularly for speedy inference (OpenVINO, TensorRT, etc.).

  • Generate

    In Roboflow, generating images means processing them into their final form (including preprocessing and augmenting them).

  • GPU

    Graphics processing unit. Originally developed for use with 3D games, GPUs are very good at performing matrix operations, which happen to be the foundation of neural networks. Training on a GPU lets your model’s calculations run in parallel, which is vastly faster than the serial operations a CPU performs (for the subset of operations GPUs are capable of).

  • GPU Memory

    The amount of information your GPU can fit on it. A bigger GPU will be able to process more information in parallel which means it can support bigger models (or bigger batch sizes) without running out of memory. If you run out of GPU memory it will crash your program.

  • Ground Truth

    The “answer key” for your dataset. This is how you judge how well your model is doing and calculate the loss function we use for gradient descent. It’s also what we use to calculate our metrics. Having a good ground truth is extremely important. Your model will learn to predict based on the ground truth you give it to replicate.

  • Health Check

    A set of tools in Roboflow that help you understand the composition of your dataset (eg size, dimensions, class balance, etc).

  • Inference

    Making predictions using the weights you save after training your model.

  • IoU

    Intersection over Union (also abbreviated I/U). A metric by which you can measure how well an object detection model localizes objects. Calculated by taking the area where the predicted bounding box overlaps the ground truth bounding box and dividing it by the area of their union (the total area covered by either box, counting the overlap only once).
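
    A minimal sketch of the calculation, assuming boxes are given as (x_min, y_min, x_max, y_max) tuples. The function name iou and the sample boxes are chosen for illustration.

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x_min, y_min, x_max, y_max) boxes."""
    # Width and height of the overlapping region (zero if the boxes are disjoint).
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    # Subtract the overlap so it is not counted twice.
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


ground_truth = (0, 0, 10, 10)
prediction = (5, 0, 15, 10)
print(iou(ground_truth, prediction))  # overlap 50 / union 150, about 0.33
```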

  • JSON

    A freeform data serialization format originally created as part of JavaScript but now used much more broadly. Many annotation formats use JSON to encode their bounding boxes.

  • Keypoint Detection

    A type of computer vision model that predicts points (as opposed to boxes in object detection). Oftentimes keypoint detection is used for human pose estimation or finger tracking where only the position of an object, not its size, matters.

  • Label

    The class of a specific object in your dataset. In classification, this is the entirety of the prediction. In object detection, it is the non-spatial component of the bounding box.

  • Learning Rate

    A hyperparameter that defines the size of the steps along the gradient that you take after each batch during training. Often the learning rate will change over the course of training (this is called a “learning rate schedule”; a “cyclical learning rate” is one example). If your learning rate is too small, your model will converge very slowly. If it’s too large, it might lead your model’s weights to explode and your model to diverge.
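
    The effect of the step size can be seen on a toy problem. The sketch below runs plain gradient descent on the one-parameter loss w², whose gradient is 2w and whose minimum is at w = 0. The toy loss, the function name, and the three learning rates are assumptions chosen only to illustrate the behavior.

```python
def gradient_descent(learning_rate, steps=20, start=1.0):
    """Minimize loss(w) = w**2 and return the final value of w."""
    w = start
    for _ in range(steps):
        gradient = 2 * w               # derivative of w**2
        w -= learning_rate * gradient  # step against the gradient
    return w


print(gradient_descent(0.001))  # too small: w has barely moved from 1.0
print(gradient_descent(0.1))    # reasonable: w is close to the minimum at 0
print(gradient_descent(1.5))    # too large: w overshoots and grows every step
```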

  • LiDAR

    Light detection and ranging (also expanded as laser imaging, detection, and ranging). This is a device that uses lasers to detect depth. Built into many self-driving cars and now included in the iPad Pro for constructing a 3D world map for use in augmented reality.

  • Localization

    Identifying where in an image an object resides. This is the part of object detection and keypoint detection that gives the x/y coordinates (as opposed to the class label).

  • Loss function

    A differentiable calculation of “how far off” a prediction is. This is used to calculate the gradient and in turn, steer which direction your model steps at each iteration of your training loop. The output of the loss function is called the “loss” and is usually calculated on the training set and validation set separately (called “training loss” and “validation loss” respectively). The lower this value, the more accurate the model’s predictions were.

  • Machine Learning

    A field of teaching computers by example. Instead of traditional programming where you write the “rules” by which your program converts inputs to outputs, you instead give it many examples of inputs and the desired output and let it write the rules by (smart) trial and error.

  • Metadata

    Ancillary information stored about your data. For example, the date and time it was collected. Often stored as EXIF.

  • Mixed Precision

    Using both full precision and half-precision floating-point numbers during training. This has been shown to increase speed without meaningfully degrading the model’s accuracy.

  • Mobile deployment

    Deploying to an edge device like a mobile phone. Considerations like battery usage and heat dissipation come into play.

  • Model Zoo

    A collection of model architectures (and sometimes pre-trained model weights) available for download.

  • Neuron

    A neuron (or perceptron) is a mathematical function that takes several inputs, multiplies each one by its weight (the weights change over time as the network learns), combines the results, and outputs a single value which is then fed into other neurons as one of their inputs.
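
    A minimal sketch of a single neuron. The bias term and the sigmoid activation are common choices assumed here for illustration; they are not specified by the definition above, and the function name is hypothetical.

```python
import math


def neuron(inputs, weights, bias=0.0):
    """Weighted sum of the inputs passed through a sigmoid activation."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights)) + bias
    # The sigmoid squashes any real number into the range (0, 1).
    return 1 / (1 + math.exp(-weighted_sum))


# 1.0 * 0.5 + 2.0 * -0.25 = 0.0, and sigmoid(0.0) = 0.5
print(neuron([1.0, 2.0], [0.5, -0.25]))  # 0.5
```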

  • Non Max Suppression (nms)

    Non max suppression is a technique used mainly in object detection that aims at selecting the best bounding box out of a set of overlapping bounding boxes. Implementation:

    1. Define a value for the confidence threshold and the IoU threshold.
    2. Sort the bounding boxes in descending order of confidence.
    3. Remove boxes that have a confidence below the confidence threshold.
    4. Loop over all the remaining boxes, starting with the box that has the highest confidence.
    5. Calculate the IoU of the current box with every remaining box that belongs to the same class.
    6. If the IoU of the two boxes is greater than the IoU threshold, remove the box with the lower confidence from the list of boxes.
    7. Repeat this operation until we have gone through all the boxes in the list.
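
    The steps above can be sketched in plain Python as follows. The detection format (dictionaries with "box", "confidence", and "class" keys), the function names, and the default thresholds of 0.5 are assumptions for this example; production libraries ship their own optimized implementations.

```python
def iou(a, b):
    """Intersection over Union of two (x_min, y_min, x_max, y_max) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    intersection = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - intersection
    return intersection / union if union > 0 else 0.0


def non_max_suppression(detections, confidence_threshold=0.5, iou_threshold=0.5):
    """Return the detections that survive non max suppression."""
    # Steps 2-3: drop low-confidence boxes, then sort by descending confidence.
    candidates = sorted(
        (d for d in detections if d["confidence"] >= confidence_threshold),
        key=lambda d: d["confidence"],
        reverse=True,
    )
    kept = []
    # Steps 4-7: keep the most confident box, then discard same-class boxes
    # that overlap it by more than the IoU threshold, and repeat.
    while candidates:
        best = candidates.pop(0)
        kept.append(best)
        candidates = [
            d for d in candidates
            if d["class"] != best["class"] or iou(best["box"], d["box"]) <= iou_threshold
        ]
    return kept


detections = [
    {"box": (0, 0, 10, 10), "confidence": 0.9, "class": "pawn"},
    {"box": (1, 1, 11, 11), "confidence": 0.8, "class": "pawn"},    # overlaps the first pawn
    {"box": (1, 1, 11, 11), "confidence": 0.7, "class": "rook"},    # different class
    {"box": (50, 50, 60, 60), "confidence": 0.6, "class": "pawn"},  # no overlap
    {"box": (0, 0, 10, 10), "confidence": 0.3, "class": "pawn"},    # below threshold
]

print([d["confidence"] for d in non_max_suppression(detections)])  # [0.9, 0.7, 0.6]
```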

  • Object Detection

    A category of computer vision models that both classify and localize objects with a rectangular bounding box.

  • Occlusion

    When an object is partially obscured behind another object it is “occluded” by that object. It is important to simulate occlusion so your model isn’t overly dependent on one distinctive feature to identify things (eg by occluding a cat’s ears you force it also to learn about its paws and tail rather than relying solely on the ears to identify it which helps if the cat’s head ends up being hidden by a chair in a real world situation).

  • ONNX

    A cross-platform, cross-framework serialization format for model weights. By converting your weights to ONNX you can simplify the dependencies required to deploy it into production (oftentimes converting to ONNX is a necessary step in deploying to the edge and may be an intermediary step in converting weights to another format).

  • Ontology

    The categorization and hierarchy of your classes. As your project grows it becomes more and more important to standardize on common nomenclature conventions for your team.

  • OpenCV

    A “traditional” computer vision framework popularized before deep learning became ubiquitous. It excels at doing things like detecting edges, image stitching, and object tracking. In recent years it has also started to expand into newer machine learning-powered computer vision techniques as well. OpenCV also operates as a non-profit organization and was originally founded in partnership with Intel.

  • OpenVINO

    Intel’s inference framework. Designed for speedy inference on CPU and VPU devices.

  • Preprocessing

    Deterministic steps performed on all images (training, validation, testing, and production) prior to feeding them into the model. Preprocessing steps are applied to all images (train, valid, and test set) when generating dataset versions on Roboflow.

  • PyTorch

    A popular open-source deep learning framework developed by Facebook. It has a focus on accelerating the path from research prototyping to production deployment.

  • SSD

    Single-shot detector. A model that only does a single pass to both localize and classify objects. The YOLO family famously plays off of this concept in its name: You Only Look Once.

  • Tensor

    A (possibly multi-dimensional) array of numbers of a given type with a defined size. Because they have a defined size and shape it makes it possible to optimize and parallelize operations on them with hardware accelerators.

  • Tensorboard

    A tool used to track and visualize training metrics, including graphs of common statistics like loss and mean average precision. It was originally developed for TensorFlow but is now compatible with other frameworks like PyTorch.

  • Tensor Core

    NVIDIA’s brand name for the part of their GPUs that is specifically optimized for deep learning (and especially mixed-precision neural networks).

  • TensorFlow

    Google’s popular open-source deep learning framework.

  • TensorFlow Lite

    Model serialization for TensorFlow models to optimize them to run on mobile and edge devices.

  • TensorRT

    NVIDIA’s framework agnostic inference optimization tooling. Helps to optimize models for deployment on NVIDIA-powered edge-devices.

  • TFjs

    Tooling that enables (some) TensorFlow-trained models to perform inference in the web browser with JavaScript, WebAssembly, and WebGPU.

  • TFRecord

    A binary data format compatible with TensorFlow. In the object detection API, all of the images and annotations are stored in a single file.

  • TPU

    Tensor Processing Unit. Google’s hardware accelerator for performing operations on tensors. It is much faster than a GPU for some workloads. Most often they are run on Google Cloud or Google Colab but there is also an edge-TPU that can be deployed in the field.

  • Train

    The process of iteratively adjusting your model’s parameters to converge on the weights that optimally mimic the training data.

  • Transfer Learning

    Using pre-trained weights to bootstrap your model’s learning. You are “transferring” the knowledge learned on another dataset and then “fine-tuning” it to learn about your new domain.

  • Two Stage Detector

    A category of (typically older generation) object detection models that first localize, then classify. As opposed to single shot detectors which do both tasks in one pass.

  • Validate

    During the training process of a neural network, the validation set is used to assess how well the model is generalizing. These examples are not used to calculate the gradient; they are the ones used to calculate your metrics and see how well they are improving over time.

  • Version

    A point in time snapshot of your dataset. By keeping track of exactly which images, preprocessing, and augmentation steps were used in each iteration of your model you maintain the ability to reproduce the results and scientifically test across various models and frameworks while remaining confident that the results are attributable to the model changes and not due to a bug in the data pipeline.

  • Weights

    The parameters of your model that neurons use to determine whether or not to fire. The optimal values are learned via backpropagation during training and can then be serialized and deployed for inference.

  • Workflow

    The process you follow. This will be some combination of manual steps, custom code, and third-party tools in any number of environments. A computer vision platform can help you set up an optimal workflow.

  • XML

    A hierarchical data format (HTML, the markup language defining the layout and content of the page you are currently reading, is a closely related markup language). In computer vision, XML is most commonly used with the Pascal VOC XML annotation format.

  • YAML

    A human-readable data serialization language (the name stands for “YAML Ain't Markup Language”) that is now commonly used as a format for configuration files (notably in YOLOv5's YAML configuration).

  • YOLO

    You Only Look Once, a family of single-shot object detection models that provided state-of-the-art results for object detection as of fall 2020.