Creating an Image Processing Pipeline¶
Welcome to the user guide for building an image processing pipeline using starfish! This tutorial covers all the steps necessary to go from raw images to a single cell gene expression matrix. If you are wondering what starfish is, check out The Introduction. If you only have a few minutes to try out starfish, check out a pre-built pipeline by following the Guide to Getting Started. If you are ready to learn how to build your own image processing pipeline using starfish, then read on!
This part of the tutorial goes into more detail about why each of the stages in the example are needed, and provides some alternative approaches that can be used to build similar pipelines.
The core functionalities of starfish pipelines are the detection (and decoding) of spots and the segmentation of cells. Each of the other stages is designed to address characteristics of the imaging system or the optical characteristics of the tissue sample being measured, which might otherwise bias spot calling, decoding, or cell segmentation. Not all image processing stages are always needed; some depend on the specific characteristics of the tissue. In addition, the stages are not always applied in the same order. Starfish is flexible enough to omit or reorder pipeline stages, but a typical pipeline follows the order below. The links show how and when to use each component of starfish, and the final section demonstrates putting together a “pipeline recipe” and running it on an experiment.
Sometimes it can be useful to subset the images by, for example, excluding out-of-focus images or cropping out edge effects. For sparse data, it can also be useful to project the z-volume into a single image, which makes downstream processing much faster.
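As an illustration, a maximum-intensity projection collapses a z-stack into a single plane by keeping the brightest value at each (y, x) position. The sketch below uses plain Python lists rather than starfish's ImageStack, purely to show the operation:

```python
def max_project_z(volume):
    """Collapse a z-stack (a list of 2D planes) into a single 2D image
    by keeping the per-pixel maximum across all z-planes."""
    height, width = len(volume[0]), len(volume[0][0])
    return [
        [max(plane[y][x] for plane in volume) for x in range(width)]
        for y in range(height)
    ]

# a tiny 2-plane, 2x2 volume: the pixel at (0, 0) is in focus in plane 1
volume = [
    [[0.1, 0.0], [0.0, 0.2]],
    [[0.9, 0.0], [0.0, 0.1]],
]
print(max_project_z(volume))  # [[0.9, 0.0], [0.0, 0.2]]
```

For sparse data this is usually safe because distinct spots rarely overlap in (y, x) across z-planes.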
These stages are typically specific to the microscope, camera, filters, chemistry, and any tissue handling or microfluidics involved in capturing the images, and they are largely independent of the assay. Starfish enables the user to design a pipeline that matches their imaging system.
Enhancing Signal & Removing Background Noise¶
These stages are usually specific to the sample being analyzed. For example, tissues often have some level of autofluorescence, which causes cellular compartments to have more background noise than surrounding regions. This can confound spot finders, which look for local intensity differences. The approaches in this section ameliorate these problems.
Tutorial: Removing Autofluorescence
Most assays are designed such that intensities need to be compared between rounds and/or channels in order to decode spots. As a basic example, smFISH spots are labeled by the channel with the highest intensity value. But because different channels use different fluorophores, excitation sources, etc., the images have different ranges of intensity values. The background intensity values in one channel might be as high as the signal intensity values of another channel. Normalizing image intensities corrects for these differences and allows comparisons to be made.
Whether to normalize¶
The decision of whether to normalize depends on your data and the decoding method used in the next step of the pipeline. If every ImageStack has approximately the same range of intensities across rounds and channels, then normalizing may have a trivial effect on pixel values. Starfish provides the utility functions imshow_plane and intensity_histogram to visualize images and their intensity distributions.
Accurate normalization is important if you plan to decode features with a distance-based method such as PixelSpotDecoder. These algorithms use the feature trace to construct a vector whose distance from codeword vectors is used to decode the feature, so poorly normalized images with systematic or random variation in intensity will bias the results of decoding.
However, if you decode with PerRoundMaxChannel, which only compares intensities between channels of the same round, precise normalization is not necessary. As long as the intensity values of signal are greater than those of background in every channel, the features will be decoded correctly.
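To see why precise normalization matters less here, consider a minimal sketch of the per-round max channel idea; the codebook, gene names, and intensity values below are hypothetical:

```python
def per_round_max_channel(trace):
    """Decode a feature trace (rounds x channels intensity matrix) by
    recording the index of the brightest channel in each round."""
    return tuple(row.index(max(row)) for row in trace)

# hypothetical 3-round codebook mapping channel sequences to genes
codebook = {(0, 2, 1): "ACTB", (1, 0, 2): "GAPDH"}

# channels are only compared within a round, so each round can have a
# completely different intensity scale without changing the result
trace = [
    [0.8, 0.1, 0.2],   # round 0: channel 0 wins
    [2.0, 1.0, 9.0],   # round 1: channel 2 wins, despite a brighter scale
    [0.1, 0.7, 0.3],   # round 2: channel 1 wins
]
print(codebook[per_round_max_channel(trace)])  # ACTB
```

Because only within-round comparisons are made, round 1 being ten times brighter than round 2 has no effect on the decoded target.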
How to normalize¶
How to normalize depends on your data and a key assumption. There are two approaches for normalizing images in starfish:
Normalizing Intensity Distributions¶
If you know a priori that the image volumes acquired for every channel and/or every round should have the same distribution of intensities, then the intensity distributions of image volumes can be matched using MatchHistograms. Typically this means the number of spots and the amount of background autofluorescence in every image volume is approximately uniform across channels and/or rounds.
Tutorial: Normalizing Intensity Distributions
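The core idea of histogram matching can be sketched in plain Python: each intensity in one image is replaced by the value of equal rank in a reference, so both end up with identical distributions. MatchHistograms operates on whole ImageStacks, so this one-dimensional version is only a conceptual sketch:

```python
def match_histogram(source, reference):
    """Replace each value in `source` with the value of equal rank in
    `reference`, forcing both to share one intensity distribution."""
    order = sorted(range(len(source)), key=lambda i: source[i])
    ref_sorted = sorted(reference)
    matched = [0.0] * len(source)
    for rank, i in enumerate(order):
        matched[i] = ref_sorted[rank]
    return matched

channel_a = [0.1, 0.5, 0.3, 0.9]   # dim fluorophore
channel_b = [0.2, 1.0, 0.6, 1.8]   # same spots, brighter fluorophore
print(match_histogram(channel_a, channel_b))  # [0.2, 1.0, 0.6, 1.8]
```

Note that the relative ordering of pixels within the source image is preserved; only the distribution of values is replaced, which is why the key assumption above matters.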
Normalizing Intensity Values¶
In most data sets, differences in gene expression lead to too much variation in the number of spots between channels and rounds, and normalizing intensity distributions would incorrectly skew the intensities. Instead you can use ClipValueToZero to normalize intensity values by clipping extreme values and shifting the remaining intensities so background falls to zero.
Tutorial: Normalizing Intensity Values
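The effect of value-based clipping can be sketched as follows; the exact parameter semantics of starfish's ClipValueToZero may differ slightly, so treat this as a conceptual illustration in which intensities below v_min become zero and values above v_max are capped:

```python
def clip_value_to_zero(pixels, v_min, v_max):
    """Shift intensities so v_min maps to zero, clip negatives to zero,
    and cap anything that was above v_max."""
    return [min(max(p - v_min, 0.0), v_max - v_min) for p in pixels]

# background, background, true signal, hot pixel
pixels = [0.1, 0.25, 0.5, 2.0]
print(clip_value_to_zero(pixels, v_min=0.25, v_max=1.25))
# [0.0, 0.0, 0.25, 1.0]
```

Unlike histogram matching, this leaves the relative intensities of genuine signal untouched, so images with very different spot counts are not skewed.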
Finding and Decoding Spots¶
Finding and decoding bright spots is the unique core functionality of starfish and is necessary in every image-based transcriptomics processing pipeline. The inputs are all the images from a FOV along with a codebook that describes the experimental design. The output after decoding is a DecodedIntensityTable that contains the location, intensity values, and mapped target of every detected feature.
Every assay uses a set of rules that the codewords in the codebook must follow (e.g. each target has one hot channel in each round). These rules determine which decoding methods in starfish should be used. See What Decoding Pipeline Should I Use? to learn about different codebook designs and how to decode them.
There are two divergent decoding approaches used in the image-based transcriptomics community for analyzing spots in images: spot-based and pixel-based.
The spot-based approach finds spots in each image volume based on the brightness of regions relative to their surroundings and then builds a spot trace using the appropriate TraceBuildingStrategies. The spot traces can then be mapped, or decoded, to codewords in the codebook using a DecodeSpots method.
When to use the spot-based approach:

- Images are amenable to spot detection methods
- Data is from sequential methods like smFISH
- Spots are sparse and may not be aligned across all rounds
The pixel-based approach first treats every pixel as a feature and constructs a corresponding pixel trace that is mapped to codewords. Connected component analysis is then used to label connected pixels with the same codeword as an RNA spot.
Tutorial: Pixel-Based Decoding with DetectPixels
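A minimal sketch of the pixel-based idea, with a hypothetical two-gene codebook: every pixel trace is matched to its nearest codeword, and 4-connected pixels that share a codeword are grouped into candidate spots. Starfish's DetectPixels implements the full version of this; the sketch below is only conceptual:

```python
from collections import deque

def decode_pixels(traces, codebook, max_dist=0.5):
    """Map each pixel's intensity trace to the nearest codeword, or None
    when no codeword lies within max_dist (Euclidean distance)."""
    def dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b)) ** 0.5
    return [
        [
            min(codebook, key=lambda t: dist(codebook[t], trace))
            if min(dist(c, trace) for c in codebook.values()) <= max_dist
            else None
            for trace in row
        ]
        for row in traces
    ]

def label_spots(decoded):
    """Connected component analysis: 4-connected pixels that share a
    decoded target become one candidate RNA spot (target, pixel count)."""
    h, w = len(decoded), len(decoded[0])
    seen, spots = set(), []
    for y in range(h):
        for x in range(w):
            if decoded[y][x] is None or (y, x) in seen:
                continue
            target, queue, size = decoded[y][x], deque([(y, x)]), 0
            seen.add((y, x))
            while queue:
                cy, cx = queue.popleft()
                size += 1
                for ny, nx in ((cy + 1, cx), (cy - 1, cx), (cy, cx + 1), (cy, cx - 1)):
                    if 0 <= ny < h and 0 <= nx < w and (ny, nx) not in seen \
                            and decoded[ny][nx] == target:
                        seen.add((ny, nx))
                        queue.append((ny, nx))
            spots.append((target, size))
    return spots

# hypothetical 1-round, 2-channel codebook; traces are (ch0, ch1) vectors
codebook = {"ACTB": (1.0, 0.0), "GAPDH": (0.0, 1.0)}
traces = [
    [(0.9, 0.1), (0.8, 0.2), (0.1, 0.1)],
    [(0.1, 0.9), (0.2, 0.2), (0.1, 0.8)],
]
print(label_spots(decode_pixels(traces, codebook)))
# [('ACTB', 2), ('GAPDH', 1), ('GAPDH', 1)]
```

Pixels whose traces are too far from every codeword (here, the dim (0.1, 0.1) and ambiguous (0.2, 0.2) pixels) are left undecoded rather than forced into a spot.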
What Decoding Pipeline Should I Use?¶
If you are unsure which spot finding and decoding methods are compatible with your data, the table below summarizes the three major codebook designs and the methods that can be used to decode each of them. If your codebook doesn’t fall into any of these categories, make a feature request on GitHub; we would love to hear about unique codebook designs!
[Table: the three major codebook designs (including one hot exponentially multiplexed), with example 7-round codebook diagrams, whether a reference image is needed, and the starfish pipeline options for each. The designs are distinguished by their codewords: codewords with only one round and channel with signal; codewords that are one hot in each round; and codewords that combine signals over multiple rounds.]
Unlike single-cell RNA sequencing, image-based transcriptomics methods do not physically separate cells before acquiring RNA information. Therefore, in order to characterize cells, the RNA must be assigned to single cells by partitioning the image volume. Accurate unsupervised cell segmentation is an open problem for all biomedical imaging disciplines, ranging from digital pathology to neuroscience.
The challenge of segmenting cells depends on the structural complexity of the sample and quality of images available. For example, a sparse cell mono-layer with a strong cytosol stain would be trivial to segment but a dense heterogeneous population of cells in 3D tissue with only a DAPI stain can be impossible to segment perfectly. On the experimental side, selecting good cell stains and acquiring images with low background will make segmenting a more tractable task.
There are many approaches for segmenting cells from image-based transcriptomics assays. Below are a few methods that are implemented in or integrated with starfish to output a BinaryMaskCollection, which represents a collection of labeled objects. If you do not know which segmentation method to use, a safe bet is to start with thresholding and watershed. On the other hand, if you can afford to manually define ROI masks, there is no better way to guarantee accurate segmentation.
While there is no “ground truth” for cell segmentation, the closest approximation is manual segmentation by an expert in the tissue of interest.
Thresholding and Watershed¶
The traditional method for segmenting cells in fluorescence microscopy images is to threshold the image into foreground and background pixels and then label connected foreground pixels as individual cells. Common issues that affect thresholding, such as background noise, can be corrected by preprocessing images before thresholding and filtering connected components afterward. There are many automated image thresholding algorithms, but currently starfish requires manually selecting a global threshold value.
When overlapping cells are labeled as one connected component, they are typically segmented by using a distance transformation followed by the watershed algorithm. Watershed is a classic image processing algorithm for separating objects in images and can be applied to all types of images. Pairing it with a distance transform is particularly useful for segmenting convex shapes like cells.
A segmentation pipeline that consists of thresholding, connected component analysis, and watershed is simple and fast to implement, but its accuracy is highly dependent on image quality. The signal-to-noise ratio of the cell stain must be high enough to minimize errors after thresholding and binary operations, and the nuclei or cell shapes must be convex to meet the assumptions of the distance transform, or else it will over-segment. Starfish includes the basic functions to build a watershed segmentation pipeline and a predefined segmentation class that uses the primary images as the cell stain.
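The thresholding and connected component steps can be sketched in plain Python; the watershed and distance-transform step that would split touching components is omitted, and the image values below are hypothetical:

```python
def threshold_and_label(image, threshold):
    """Binarize an image with a global threshold, then label 4-connected
    foreground components; each label approximates one nucleus or cell."""
    h, w = len(image), len(image[0])
    labels = [[0] * w for _ in range(h)]
    current = 0
    for y in range(h):
        for x in range(w):
            if image[y][x] <= threshold or labels[y][x]:
                continue
            current += 1            # start a new component
            stack = [(y, x)]
            labels[y][x] = current
            while stack:            # flood-fill the component
                cy, cx = stack.pop()
                for ny, nx in ((cy + 1, cx), (cy - 1, cx), (cy, cx + 1), (cy, cx - 1)):
                    if 0 <= ny < h and 0 <= nx < w and not labels[ny][nx] \
                            and image[ny][nx] > threshold:
                        labels[ny][nx] = current
                        stack.append((ny, nx))
    return labels

# a toy nuclei stain with two bright objects
nuclei = [
    [0.9, 0.8, 0.0, 0.0],
    [0.7, 0.0, 0.0, 0.6],
    [0.0, 0.0, 0.7, 0.8],
]
print(threshold_and_label(nuclei, threshold=0.5))
# [[1, 1, 0, 0], [1, 0, 0, 2], [0, 0, 2, 2]]
```

When two cells touch, this labeling merges them into one component; that is exactly the failure mode the distance transform plus watershed is meant to repair.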
Manually Defining Cells¶
The most accurate but time-consuming approach is to manually segment images using a tool such as the ROI manager in FIJI (ImageJ). It is a straightforward process that starfish supports: ROI sets stored in ZIP archives can be imported as a BinaryMaskCollection. These masks can then be used in the pipeline for visualization and for assigning spots to cells.
Tutorial: Loading ImageJ ROI set
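Conceptually, importing an ROI set amounts to rasterizing each ROI's polygon into a binary mask. A minimal sketch of that conversion, using an even-odd ray-casting point-in-polygon test (the polygon coordinates are hypothetical):

```python
def rasterize_roi(polygon, height, width):
    """Convert a polygon ROI (a list of (x, y) vertices) into a binary
    mask using an even-odd ray-casting point-in-polygon test."""
    def inside(px, py):
        crossings = 0
        for i in range(len(polygon)):
            (x1, y1), (x2, y2) = polygon[i], polygon[(i + 1) % len(polygon)]
            if (y1 > py) != (y2 > py):  # edge straddles the scan line
                x_at = x1 + (py - y1) * (x2 - x1) / (y2 - y1)
                if px < x_at:           # count crossings to the right
                    crossings += 1
        return crossings % 2 == 1
    return [[inside(x, y) for x in range(width)] for y in range(height)]

# a square ROI covering pixel centers (1..3, 1..3) in a 5x5 image
roi = [(0.5, 0.5), (3.5, 0.5), (3.5, 3.5), (0.5, 3.5)]
mask = rasterize_roi(roi, height=5, width=5)
print(sum(sum(row) for row in mask))  # 9 pixels inside the ROI
```

In practice starfish performs this conversion for you when importing an ROI set; the sketch is only to show what a mask in a BinaryMaskCollection represents.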
Besides the two classic cell segmentation approaches mentioned above, there are machine-learning methods that aim to replicate the accuracy of manual cell segmentation while reducing the labor required. Machine-learning algorithms for segmentation are continually improving but there is no perfect solution for all image types yet. These methods require training data (e.g. stained images with manually defined labels) to train a model to predict cell or nuclei locations in test data. There are exceptions that don’t require training on your specific data but generally training the model is something to consider when evaluating how much time each segmentation approach will require.
Starfish currently has built-in functionality to support ilastik, a segmentation toolkit that leverages machine-learning. Ilastik has a Pixel Classification workflow that performs semantic segmentation of the image, returning probability maps for each label such as cells and background. To transform the images of pixel probabilities to binary masks, you can use the same thresholding and watershed methods in starfish that are used for segmenting images of stained cells.
Tutorial: Using ilastik in starfish
Assessing Performance Metrics¶
Feature Identification and Assignment¶
Once images have been corrected for tissue and optical aberrations, spot finding can be run to turn detected spots into features that can be counted. Separately, the dots and nuclei images can be segmented to identify the locations of cells in the images. Finally, the two sets of features can be combined to assign each spot to its cell of origin, at which point it is trivial to create a cell x gene matrix.
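The final assignment step can be sketched as a simple lookup of each spot's coordinates in the segmentation label image, followed by counting; the labels, spot positions, and gene names below are hypothetical:

```python
from collections import Counter

def build_cell_gene_matrix(spots, label_image):
    """Assign each decoded spot (y, x, gene) to the segmented cell whose
    label covers its position, then count spots per (cell, gene) pair."""
    counts = Counter()
    for y, x, gene in spots:
        cell = label_image[y][x]
        if cell:  # label 0 is background: the spot falls outside every cell
            counts[(cell, gene)] += 1
    return counts

# a 2x3 segmentation label image with two cells (1 and 2)
labels = [
    [1, 1, 0],
    [0, 2, 2],
]
spots = [(0, 0, "ACTB"), (0, 1, "ACTB"), (1, 2, "GAPDH"), (1, 0, "ACTB")]
print(build_cell_gene_matrix(spots, labels))
# Counter({(1, 'ACTB'): 2, (2, 'GAPDH'): 1})
```

Spots that land on background (the third ACTB spot above) are dropped; the surviving (cell, gene) counts are exactly the entries of the cell x gene matrix.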