Quick Start

The purpose of this hands-on tutorial is to acquaint first-time users with all the basic components of a starfish pipeline. Anyone with a computer and an internet connection should be able to follow this guide to create and run the example starfish pipeline on the provided experiment data.

Approximate time to complete tutorial: 10 minutes

Prerequisites

This tutorial assumes starfish is already installed, along with Jupyter or IPython for running the code interactively.

Starfish Pipeline

Open a Jupyter notebook or a console with IPython. Make sure you’re using the kernel or virtual environment where you installed starfish, and then run the imports shown at the start of each step below.
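Before continuing, a quick sanity check (not part of the original tutorial) confirms the kernel can see starfish:

import starfish

# If this import succeeds, the printed path shows which environment
# the package was loaded from.
print(starfish.__file__)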

Load example dataset

For this tutorial, we use one of the datasets included in the starfish library. It is one FOV from an in situ sequencing (ISS) experiment on co-cultured mouse and human fibroblasts. The dataset contains primary images and auxiliary images. The primary images consist of 4 rounds that sequence 4 bases of the beta-actin (ACTB) gene. The mouse and human ACTB genes differ by one base, allowing mouse and human cells to be distinguished. The auxiliary images consist of a “dots” image, in which all rolonies are stained in one channel, and a DAPI image of the nuclei. This data was published in Ke et al.

from starfish import data

# Download experiment
e = data.MOUSE_V_HUMAN()
# Load ImageStacks from experiment
imgs = e.fov().get_image('primary')  # primary images consisting of 4 rounds and 4 channels
dots = e.fov().get_image('dots')  # auxiliary image with all rolonies stained
nuclei = e.fov().get_image('nuclei')  # auxiliary image with dapi stain

Confirm data is loaded

Evaluating an ImageStack expression displays its shape (it is a 5-dimensional tensor). In this example, the primary images consist of 4 rounds, 4 channels, 1 z-plane, 980 y-pixels, and 1330 x-pixels.

imgs

Out:

<starfish.ImageStack (r: 4, c: 4, z: 1, y: 980, x: 1330)>
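If you need the shape programmatically, the pixel data is also exposed as an xarray.DataArray via the ImageStack.xarray property; a minimal sketch:

# The underlying xarray mirrors the (r, c, z, y, x) dimensions shown above.
imgs.xarray.shape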

Interactively visualize primary images

Using display() you can visualize ImageStacks in napari. This is a very handy tool for examining your data and can help you make pipeline-building decisions. If you use the dimension sliders to compare channels 1 and 3 of round 1 (zero-based indexing), you can already see which cells express mouse ACTB and which express human ACTB.

from starfish import display

%gui qt
viewer = display(imgs)
viewer.layers[0].name = "raw stack" # rename the layer
(Screenshot: napari viewer displaying the raw primary image stack.)

View codebook

This codebook is an example of a one-hot exponentially multiplexed codebook, where every round is “one hot”, meaning every round has exactly one channel with signal. This codebook has two codewords, which are represented by the 2-dimensional arrays and labeled by gene name in the “target” coordinate. The “r” (rounds) and “c” (channels) dimensions correspond to the rows and columns of each array, respectively. The size of each codeword array should match the “r” and “c” dimensions of the primary ImageStack.

e.codebook
<xarray.Codebook (target: 2, r: 4, c: 4)>
array([[[0, 0, 0, 1],
        [0, 0, 0, 1],
        [0, 1, 0, 0],
        [0, 0, 1, 0]],

       [[0, 0, 0, 1],
        [0, 1, 0, 0],
        [0, 1, 0, 0],
        [0, 0, 1, 0]]], dtype=uint8)
Coordinates:
  * target   (target) object 'ACTB_human' 'ACTB_mouse'
  * r        (r) int64 0 1 2 3
  * c        (c) int64 0 1 2 3
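Since decoding assumes the codebook and primary images agree on rounds and channels, it can be worth checking this up front; a small sketch (not part of the original tutorial):

# Codebook (r, c) dimensions should match the primary ImageStack.
assert e.codebook.sizes['r'] == imgs.num_rounds
assert e.codebook.sizes['c'] == imgs.num_chs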


Define functions that make up the image processing pipeline

To keep the quick start tutorial quick, only brief descriptions of each function are included here. In-depth discussion of each component can be found in the User Guide and API documentation.

Image Registration:

Translational shifts between rounds of imaging can be corrected by image registration. For this data, the transforms are learned from the “dots” auxiliary image and then applied to the primary images.

from starfish.image import ApplyTransform, LearnTransform
from starfish.types import Axes


def register(imgs, dots):
    # Max-project over channels and z so there is one 2D image per round.
    mip_imgs = imgs.reduce(dims=[Axes.CH, Axes.ZPLANE], func="max")
    mip_dots = dots.reduce(dims=[Axes.CH, Axes.ZPLANE], func="max")
    # Learn one subpixel translation per round against the "dots" reference.
    learn_translation = LearnTransform.Translation(reference_stack=mip_dots, axes=Axes.ROUND, upsampling=1000)
    transforms_list = learn_translation.run(mip_imgs)
    # Apply the learned transforms to the full primary images.
    warp = ApplyTransform.Warp()
    registered_imgs = warp.run(imgs, transforms_list=transforms_list, in_place=False, verbose=True)
    return registered_imgs

Image Filtering:

Depending on the sample, there can be significant background autofluorescence in the image. Background clusters with a radius larger than the expected rolony radius can be removed with a white tophat filter.

from starfish.image import Filter


def filter_white_tophat(imgs, dots, masking_radius):
    # Remove background structures larger than masking_radius pixels.
    wth = Filter.WhiteTophat(masking_radius=masking_radius)
    return wth.run(imgs), wth.run(dots)

Finding Spots:

There are many approaches and algorithms for finding spots in an image. For this pipeline, spots are found in the “dots” image using a Laplacian-of-Gaussian blob detection algorithm. The detected spot locations are then used to measure pixel intensities at those positions in every round and channel of the primary images.

from starfish.spots import FindSpots


def find_spots(imgs, dots):
    # Detect spots in the "dots" reference image across a range of blob sizes.
    p = FindSpots.BlobDetector(
        min_sigma=1,
        max_sigma=10,
        num_sigma=30,
        threshold=0.01,
        measurement_type='mean',
    )
    # Measure intensities at each detected spot in every round and channel.
    spots = p.run(image_stack=imgs, reference_image=dots)
    return spots

Decoding Spots:

The method for decoding spots depends on the experimental design reflected in the codebook. Since this example uses a one-hot exponentially multiplexed codebook, PerRoundMaxChannel is a logical choice for decoding. It decodes each spot by taking the channel with maximum intensity in each round to construct a barcode, which is then matched to a codeword in the codebook. In the case of in situ sequencing, the barcode is the actual sequence of the gene.

from starfish.spots import DecodeSpots


def decode_spots(codebook, spots):
    decoder = DecodeSpots.PerRoundMaxChannel(codebook=codebook)
    return decoder.run(spots=spots)
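To make the per-round argmax concrete, here is a tiny numpy illustration of the idea (a sketch only, not starfish’s implementation):

import numpy as np

# One spot's measured intensities, shape (rounds, channels).
spot = np.array([[0.1, 0.0, 0.1, 0.9],
                 [0.0, 0.8, 0.1, 0.1],
                 [0.1, 0.7, 0.0, 0.2],
                 [0.0, 0.1, 0.9, 0.0]])

# The brightest channel in each round forms the barcode: (3, 1, 1, 2),
# which matches the ACTB_mouse codeword shown earlier.
barcode = spot.argmax(axis=1)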

Segmenting Cells:

Segmenting the cells in this FOV is done by thresholding and watershed segmentation. The DAPI-stained nuclei are first thresholded and watershed-segmented to act as seeds for watershed segmentation of the cell stain image. For this dataset, the cell stain image is the maximum-intensity projection of the primary images, which have autofluorescence inside the cells and very little background outside them. Other cell segmentation methods are also available in starfish.

from starfish.image import Segment
from starfish.types import Axes


def segment(registered_imgs, nuclei):
    dapi_thresh = .22  # threshold for the binary mask of nuclear locations
    stain_thresh = 0.011  # threshold for the binary mask of the cell stain
    min_dist = 56  # minimum distance between watershed seeds

    # Flatten the nuclei image to a single 2D plane to seed the watershed.
    nuclei = nuclei.reduce(dims=[Axes.ROUND, Axes.CH, Axes.ZPLANE], func="max")

    seg = Segment.Watershed(
        nuclei_threshold=dapi_thresh,
        input_threshold=stain_thresh,
        min_distance=min_dist
    )
    # Watershed builds the cell stain from the primary images internally.
    masks = seg.run(registered_imgs, nuclei)

    return seg, masks

Single Cell Gene Expression Matrix:

A single-cell gene expression matrix is made by first labeling each spot with the ID of the cell that contains it. The DecodedIntensityTable of spots is then transformed into an expression matrix by counting the number of spots for each gene in each cell.

from starfish.spots import AssignTargets


def make_expression_matrix(masks, decoded):
    # Assign each decoded spot the cell id of the mask that contains it.
    al = AssignTargets.Label()
    labeled = al.run(masks, decoded[decoded.target != 'nan'])
    # Drop spots outside every cell, then count spots per gene per cell.
    cg = labeled[labeled.cell_id != 'nan'].to_expression_matrix()
    return cg

Run pipeline

Now that every function of the pipeline has been defined, we can run the pipeline on this FOV with just a few lines of code. For production-scale image processing, this pipeline can be run on multiple FOVs in parallel (a sketch follows the output below).

# filter
imgs_wth, dots_wth = filter_white_tophat(imgs, dots, 15)

# register
registered_imgs = register(imgs_wth, dots_wth)

# find spots
spots = find_spots(registered_imgs, dots_wth)

# decode
decoded = decode_spots(e.codebook, spots)

# segment
seg, masks = segment(registered_imgs, nuclei)

# make expression matrix
mat = make_expression_matrix(masks, decoded)

Out:

(tqdm progress bars for each pipeline step omitted; every step completes in under a second)
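As noted above, a production pipeline typically processes many FOVs. Here is a minimal sketch of one way to parallelize; process_fov is a hypothetical helper (not a starfish API) that chains the functions defined earlier, and the FOV list is built by hand since this example experiment has a single FOV:

from concurrent.futures import ProcessPoolExecutor


def process_fov(fov):
    # Chain the pipeline steps defined earlier on a single FOV.
    imgs = fov.get_image('primary')
    dots = fov.get_image('dots')
    nuclei = fov.get_image('nuclei')
    imgs_wth, dots_wth = filter_white_tophat(imgs, dots, 15)
    registered = register(imgs_wth, dots_wth)
    spots = find_spots(registered, dots_wth)
    decoded = decode_spots(e.codebook, spots)
    _, masks = segment(registered, nuclei)
    return make_expression_matrix(masks, decoded)


fovs = [e.fov()]  # a real experiment would list every FOV here

with ProcessPoolExecutor() as pool:
    expression_matrices = list(pool.map(process_fov, fovs))

In practice you may prefer to parallelize over FOV names or file paths rather than live objects, since FieldOfView instances are not guaranteed to pickle cleanly across processes.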

Number of spots found

spots.count_total_spots()

Out:

5936

Visualize spots and masks on primary images

You can now visualize the primary images (ImageStack), decoded spots (DecodedIntensityTable), and segmented cells (BinaryMaskCollection) as layers in napari to verify the results.

While the accuracy of the segmentation on the left of the FOV may seem suspicious (is that one cell or two?), exploring the primary images and decoded spots demonstrates that there are two cells: one expressing the human ACTB gene and one expressing the mouse.

display(stack=registered_imgs, spots=decoded, masks=masks, viewer=viewer)
(Screenshot: napari viewer with registered images, decoded spots, and segmentation masks overlaid.)

View decoded spots as a table

print(decoded.to_features_dataframe().head(10))

Out:

   intensity  z    y     x  ...       zc      target  distance  passes_thresholds
0   0.000856  0  979   826  ...  0.00005         nan  0.424362              False
1   0.015709  0  970   749  ...  0.00005  ACTB_mouse  0.235442               True
2   0.012817  0  967   778  ...  0.00005  ACTB_mouse  0.240857               True
3   0.003527  0  959    94  ...  0.00005         nan  0.505668              False
4   0.016274  0  955   761  ...  0.00005  ACTB_mouse  0.230975               True
5   0.018487  0  948   819  ...  0.00005  ACTB_mouse  0.265261               True
6   0.003872  0  944  1209  ...  0.00005         nan  0.621855              False
7   0.003816  0  944   695  ...  0.00005         nan  0.641994              False
8   0.010551  0  943   915  ...  0.00005  ACTB_mouse  0.234104               True
9   0.005345  0  938    89  ...  0.00005  ACTB_human  0.364498               True

[10 rows x 19 columns]

Number of mouse and human ACTB spots that pass threshold

An advantage of sequencing or barcoding transcripts in situ is that the multiple rounds of image acquisition act as an error check. For example, false positives from the spot-finding step are removed when a spot’s pixel intensities across all rounds do not match any codeword in the codebook.

import numpy as np
import pandas as pd
from starfish.types import Features

genes, counts = np.unique(decoded.loc[decoded[Features.PASSES_THRESHOLDS]][Features.TARGET], return_counts=True)
table = pd.Series(counts, index=genes).sort_values(ascending=False)
print(table)

Out:

ACTB_human    190
ACTB_mouse    111
dtype: int64
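To gauge how much the codebook filtered out, you can compare passing spots against the total decoded; a small sketch using the same passes_thresholds field:

# Boolean flag per decoded spot: did it match a codeword?
passed = decoded[Features.PASSES_THRESHOLDS].values
print(passed.sum(), 'of', passed.size, 'decoded spots matched a codeword')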

View single cell gene expression matrix as a table

print(mat.to_pandas())

Out:

genes  ACTB_human  ACTB_mouse
cells
0            17.0         2.0
1             6.0        33.0
2            59.0         0.0
3            93.0         0.0
4             3.0        74.0

View single cell gene expression matrix as a heatmap

The heatmap makes it very clear that, based on gene expression, three cells (cells 1, 3, and 4) are human cells and two (cells 2 and 5) are mouse cells.

import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(30, 10))
sns.set(font_scale=4)
# Rows of mat.data.T follow the gene order of the matrix
# (ACTB_human, then ACTB_mouse), so label them from the matrix itself.
sns.heatmap(mat.data.T,
            yticklabels=mat.genes.values,
            xticklabels=['cell {}'.format(n + 1) for n in range(5)],
            cmap='magma')
(Figure: heatmap of ACTB_human and ACTB_mouse spot counts per cell.)

This is the end of the quick start tutorial! To use starfish on your own data, start with Data Formatting and then follow the User Guide to create a pipeline tailored to your data. If you want to try creating pipelines but don’t have data yet, starfish.data has a number of example datasets you can experiment with.

Total running time of the script: (1 minute 48.928 seconds)
