Converting Data to SpaceTx Format

We provide three types of tools to convert data into SpaceTx-Format. The first is a Bio-Formats writer, which writes SpaceTx-Format experiments using the Bio-Formats converter. Bio-Formats can read a variety of input formats, so this may be a relatively simple approach for users familiar with those tools.

Second, we provide a mechanism in which the user organizes the data as 2D tiles that follow a clearly defined filename schema and then runs a conversion tool. Documentation and an example are available for that mechanism.

If neither of these models fits, we provide a generalized mechanism in which conversion is managed through a set of interfaces: the user provides Python code responsible for obtaining the data corresponding to each 2D tile. Example formatters for a variety of datasets are also available. This same interface can also be used to load data directly, although doing so may have performance implications.

Advanced data formatting

Example Data Conversion

We provide an example for formatting an in-situ sequencing (ISS) experiment. These data were generated by the Nilsson lab, and the analysis of the results can be found in their publication. In brief, this experiment has 16 fields of view, each imaged in 4 channels over 4 separate imaging rounds.

We have all the data from this experiment, including the organization of the fields of view in physical space, but we lack the exact physical coordinates. For the purpose of this tutorial, we fabricate coordinates that are consistent with the ordering of the tiles and store those coordinates in an auxiliary JSON file.

The fields of view for this dataset are organized in a grid and ordered as follows, with 1 being the first position of the microscope on the tissue slide:

[ 1   2   3   4  ]
[ 8   7   6   5  ]
[ 9   10  11  12 ]
[ 16  15  14  13 ]

To be consistent with these data, we’ve created a JSON file that contains fabricated coordinates for each of the first two fields of view, which we will use for this tutorial. We set the origin (0, 0, 0) at the top left of the first field of view, so the coordinates for these fields of view are:

{
    "fov_000": {
        "xc": [0, 10],
        "yc": [0, 10],
        "zc": [0, 1]
    },
    "fov_001": {
        "xc": [10, 20],
        "yc": [0, 10],
        "zc": [0, 1]
    }
}
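For illustration, the serpentine ordering and the fabricated coordinates above can be reproduced with a few lines of Python. This is only a sketch: the helper names `serpentine_position` and `fabricated_coordinates`, the 4-column grid width, and the 10-unit tile size are assumptions read off the grid and the JSON shown above. They are not part of starfish.

```python
from typing import Dict, List, Tuple

N_COLS = 4       # grid width, taken from the layout shown above
TILE_SIZE = 10   # assumed tile extent, matching the fabricated JSON


def serpentine_position(fov_index: int, n_cols: int = N_COLS) -> Tuple[int, int]:
    """Map a zero-based field-of-view index to (row, col) in a serpentine grid."""
    row, offset = divmod(fov_index, n_cols)
    # Even rows run left to right; odd rows run right to left.
    col = offset if row % 2 == 0 else n_cols - 1 - offset
    return row, col


def fabricated_coordinates(fov_index: int) -> Dict[str, List[int]]:
    """Build hypothetical physical ranges for one field of view."""
    row, col = serpentine_position(fov_index)
    return {
        "xc": [col * TILE_SIZE, (col + 1) * TILE_SIZE],
        "yc": [row * TILE_SIZE, (row + 1) * TILE_SIZE],
        "zc": [0, 1],
    }


# Only the first two fields of view are used in this tutorial.
coords = {f"fov_{i:03d}": fabricated_coordinates(i) for i in range(2)}
print(coords)
```

For the first two fields of view this reproduces the entries listed above. Extending the range to all 16 would place field of view 5 (index 4) at the right-hand end of the second row, as in the grid.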

Downloading Data

Like all starfish example data, this experiment is hosted on Amazon Web Services. Once formatted, experiments can be downloaded on-demand into starfish.

For the purposes of this vignette, we will format only two of the 16 fields of view. To download the data, you can run the following commands:

mkdir -p iss/raw
aws s3 cp s3://spacetx.starfish.data.public/browse/raw/20180820/iss_breast/ iss/raw/ \
    --recursive \
    --exclude "*" \
    --include "slideA_1_*" \
    --include "slideA_2_*" \
    --include "fabricated_test_coordinates.json" \
    --no-sign-request
ls iss/raw

This command should download 44 images:

  • 2 fields of view, each containing:

      • 2 overview images: “dots”, used to register, and DAPI, used to localize nuclei

      • 4 rounds, each containing:

          • 4 channels (Cy3 5, Cy3, Cy5, and FITC)

          • DAPI nuclear stain
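The count of 44 can be checked by enumerating the file names the download is expected to contain. The sketch below assumes the `slideA_<fov>_<round>_<channel>.TIF` naming used throughout this tutorial. The `DO_DAPI` and `DO_Cy3` overview names come from the auxiliary tile fetcher shown later, while the per-round `DAPI` file name is a guess made for illustration. `expected_images` is a hypothetical helper, not a starfish function.

```python
from typing import List

ROUNDS = ["1st", "2nd", "3rd", "4th"]
CHANNELS = ["FITC", "Cy3", "Cy3 5", "Cy5"]


def expected_images(fov_numbers: List[int]) -> List[str]:
    """List the image file names expected for the given 1-based field-of-view numbers."""
    names = []
    for fov in fov_numbers:
        # Two overview images per field of view.
        names.append(f"slideA_{fov}_DO_DAPI.TIF")
        names.append(f"slideA_{fov}_DO_Cy3.TIF")
        for rnd in ROUNDS:
            # Four channels plus a DAPI stain per round (DAPI name assumed).
            for ch in CHANNELS + ["DAPI"]:
                names.append(f"slideA_{fov}_{rnd}_{ch}.TIF")
    return names


names = expected_images([1, 2])
print(len(names))  # 2 * (2 + 4 * 5) = 44
```

Comparing `set(names)` against the contents of `iss/raw` is one way to spot a partial download. The coordinates JSON file is downloaded as well but is not an image, so it is not counted here.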

Formatting single-plane TIFF files in SpaceTx Format

We provide tools that take a directory of files like the one just downloaded and translate it into starfish-formatted files: TileFetcher and FetchedTile. In brief, TileFetcher provides an interface for locating the appropriate tile in a directory given a set of SpaceTx-Format metadata, such as a specific z-plane, imaging round, and channel, by decoding the file naming conventions. FetchedTile exposes methods that supply the remaining tile-specific information, such as the tile’s shape and image data. This particular example is quite simple because the data are already stored as 2D TIFF files; several other examples convert more complex data into a SpaceTx-Format Experiment.

These are the abstract classes that must be subclassed for each set of naming conventions:

class FetchedTile:
    """
    This is the contract for providing the data for constructing a :class:`slicedimage.Tile`.
    """
    def __init__(self, *args, **kwargs):
        pass

    @property
    def shape(self) -> Mapping[Axes, int]:
        """Return Tile shape.

        Returns
        -------
        Mapping[Axes, int]
            The shape of the tile, mapping from Axes to its size.
        """
        raise NotImplementedError()

    @property
    def coordinates(self) -> Mapping[Union[str, Coordinates], CoordinateValue]:
        """Return the tile's coordinates in the global coordinate space.

        Returns
        -------
        Mapping[Union[str, Coordinates], CoordinateValue]
            Maps from a coordinate type (e.g. 'x', 'y', or 'z') to its value or range.
        """
        raise NotImplementedError()

    @property
    def extras(self) -> dict:
        """Return the extras data associated with the tile.

        Returns
        -------
        Mapping[str, Any]
            Maps from a key to its value.
        """
        return {}

    def tile_data(self) -> np.ndarray:
        """Return the image data representing the tile.  The tile must be row-major.

        Returns
        -------
        ndarray :
            The image data
        """
        raise NotImplementedError()


class TileFetcher:
    """
    This is the contract for providing the image data for constructing a
    :class:`slicedimage.Collection`.
    """
    def get_tile(
            self, fov_id: int, round_label: int, ch_label: int, zplane_label: int) -> FetchedTile:
        """
        Given fov_id, round_label, ch_label, and zplane_label, return an instance of a
        :class:`.FetchedTile` that can be queried for the image data.
        """
        raise NotImplementedError()

To create a formatter object for in-situ sequencing, we subclass the TileFetcher and FetchedTile by extending them with information about the experiment. When formatting single-plane TIFF files, we expect that all metadata needed to construct the FieldOfView is embedded in the file names.

For the ISS experiment, the file names are structured as follows:

slideA_1_1st_Cy3 5.TIF

This corresponds to:

(experiment_name)_(field_of_view_number)_(imaging_round)_(channel_name).TIF
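As a quick check of this naming scheme, a file name can be split into its four parts with a regular expression. This is a minimal sketch using only the standard library. `FILENAME_PATTERN` and `parse_filename` are hypothetical names introduced here, and the pattern assumes that only the channel name may contain spaces.

```python
import re
from typing import Dict

# experiment, field of view, and round contain no underscores; the channel is
# everything between the last separator and the .TIF extension.
FILENAME_PATTERN = re.compile(
    r"^(?P<experiment>[^_]+)_(?P<fov>\d+)_(?P<round>[^_]+)_(?P<channel>.+)\.TIF$"
)


def parse_filename(filename: str) -> Dict[str, str]:
    """Split an ISS file name into its named components."""
    match = FILENAME_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"unexpected file name: {filename}")
    return match.groupdict()


fields = parse_filename("slideA_1_1st_Cy3 5.TIF")
print(fields)
```

Note that the field-of-view number in the file name counts from 1, and that all values come back as strings.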

So, to construct a sptx-format FieldOfView we must adjust the basic TileFetcher object so that it knows about the file name syntax.

That means implementing methods that return the shape, coordinates, and image data for a tile. Here, we implement those methods and add a cropping method as well, to mimic the way the ISS data were processed for publication.

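# Imports used by the classes and build function in this tutorial
# (module paths may differ between starfish versions).
import json
import os
from typing import Mapping, Union

import numpy as np
from skimage.io import imread
from slicedimage import ImageFormat

from starfish.experiment.builder import FetchedTile, TileFetcher, write_experiment_json
from starfish.types import Axes, Coordinates, CoordinateValue


class IssCroppedBreastTile(FetchedTile):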

    def __init__(
            self,
            file_path: str,
            coordinates
    ) -> None:
        self.file_path = file_path
        self._coordinates = coordinates

    @property
    def shape(self) -> Mapping[Axes, int]:
        return {Axes.Y: 1044, Axes.X: 1390}

    @property
    def coordinates(self) -> Mapping[Union[str, Coordinates], CoordinateValue]:
        return self._coordinates

    @staticmethod
    def crop(img):
        crp = img[40:1084, 20:1410]
        return crp

    def tile_data(self) -> np.ndarray:
        return self.crop(imread(self.file_path))
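The hard-coded shape must agree with the crop window, which can be verified without any image files. The sketch below uses a zero-filled array as a stand-in for a raw frame. Its 1100 × 1450 size is an assumption, since the true raw image size is not stated in this tutorial.

```python
import numpy as np

# Hypothetical raw frame, large enough to contain the crop window.
raw = np.zeros((1100, 1450), dtype=np.uint16)

# Same slice as IssCroppedBreastTile.crop: rows 40-1083, columns 20-1409.
cropped = raw[40:1084, 20:1410]
print(cropped.shape)  # (1044, 1390)
```

NumPy slicing truncates silently, so a raw image smaller than 1084 × 1410 would produce a tile that disagrees with the declared shape rather than raising an error.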

This object, combined with a TileFetcher, contains all the information that starfish needs to parse a directory of files and create SpaceTx-Format-compliant objects. Here, two tile fetchers are needed: one parses the primary images, and the other parses the auxiliary images, namely the nuclei images that will be used to seed segmentation and the dots images used for registration.

class ISSCroppedBreastPrimaryTileFetcher(TileFetcher):
    def __init__(self, input_dir):
        self.input_dir = input_dir
        coordinates = os.path.join(input_dir, "fabricated_test_coordinates.json")
        with open(coordinates) as f:
            self.coordinates_dict = json.load(f)

    @property
    def ch_dict(self):
        ch_dict = {0: 'FITC', 1: 'Cy3', 2: 'Cy3 5', 3: 'Cy5'}
        return ch_dict

    @property
    def round_dict(self):
        round_str = ['1st', '2nd', '3rd', '4th']
        round_dict = dict(enumerate(round_str))
        return round_dict

    def get_tile(
            self, fov_id: int, round_label: int, ch_label: int, zplane_label: int) -> FetchedTile:

        # get filepath
        fov_ = str(fov_id + 1)
        round_ = self.round_dict[round_label]
        ch_ = self.ch_dict[ch_label]
        filename = f"slideA_{fov_}_{round_}_{ch_}.TIF"
        file_path = os.path.join(self.input_dir, filename)

        # get coordinates
        fov_c_id = f"fov_{fov_id:03d}"
        coordinates = {
            Coordinates.X: self.coordinates_dict[fov_c_id]["xc"],
            Coordinates.Y: self.coordinates_dict[fov_c_id]["yc"],
        }

        return IssCroppedBreastTile(file_path, coordinates)


class ISSCroppedBreastAuxTileFetcher(TileFetcher):
    def __init__(self, input_dir, aux_type):
        self.input_dir = input_dir
        self.aux_type = aux_type
        coordinates = os.path.join(input_dir, "fabricated_test_coordinates.json")
        with open(coordinates) as f:
            self.coordinates_dict = json.load(f)

    def get_tile(
            self, fov_id: int, round_label: int, ch_label: int, zplane_label: int) -> FetchedTile:
        if self.aux_type == 'nuclei':
            filename = 'slideA_{}_DO_DAPI.TIF'.format(str(fov_id + 1))
        elif self.aux_type == 'dots':
            filename = 'slideA_{}_DO_Cy3.TIF'.format(str(fov_id + 1))
        else:
            msg = 'invalid aux type: {}'.format(self.aux_type)
            msg += ' expected either nuclei or dots'
            raise ValueError(msg)

        file_path = os.path.join(self.input_dir, filename)

        # get coordinates
        fov_c_id = f"fov_{fov_id:03d}"
        coordinates = {
            Coordinates.X: self.coordinates_dict[fov_c_id]["xc"],
            Coordinates.Y: self.coordinates_dict[fov_c_id]["yc"],
        }

        return IssCroppedBreastTile(file_path, coordinates=coordinates)
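The mapping from labels to file names is the part of these fetchers that is easiest to get wrong, because `get_tile` receives a zero-based `fov_id` while the file names count from 1. The following starfish-free sketch mirrors that logic so it can be checked in isolation. `primary_filename` and `coordinate_key` are hypothetical helpers written for this illustration.

```python
CH_NAMES = {0: "FITC", 1: "Cy3", 2: "Cy3 5", 3: "Cy5"}
ROUND_NAMES = dict(enumerate(["1st", "2nd", "3rd", "4th"]))


def primary_filename(fov_id: int, round_label: int, ch_label: int) -> str:
    """Mirror the file-name logic of the primary fetcher's get_tile."""
    # fov_id is zero-based in get_tile, but the file names count from 1.
    return f"slideA_{fov_id + 1}_{ROUND_NAMES[round_label]}_{CH_NAMES[ch_label]}.TIF"


def coordinate_key(fov_id: int) -> str:
    """Key used to look up a field of view in the coordinates JSON."""
    return f"fov_{fov_id:03d}"


print(primary_filename(0, 0, 2))  # slideA_1_1st_Cy3 5.TIF
print(coordinate_key(1))          # fov_001
```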

Creating a Build Script

Next, we combine these objects with information we already have about the experiment. At the outset we stated that an ISS experiment has 4 imaging rounds and 4 channels, but only 1 z-plane. These data fill out the primary_image_dimensions of the TileSet. We also noted that each field of view has a single dots image and a single nuclei image. In starfish, auxiliary images are also stored as TileSet objects even though, as here, they often have only 1 channel, round, and z-plane.

We create a dictionary to hold each piece of information, and pass that to write_experiment_json, a generic tool that accepts the objects we’ve aggregated above and constructs TileSet objects:

def format_data(input_dir, output_dir, num_fov):

    primary_image_dimensions = {
        Axes.ROUND: 4,
        Axes.CH: 4,
        Axes.ZPLANE: 1,
    }

    aux_name_to_dimensions = {
        'nuclei': {
            Axes.ROUND: 1,
            Axes.CH: 1,
            Axes.ZPLANE: 1,
        },
        'dots': {
            Axes.ROUND: 1,
            Axes.CH: 1,
            Axes.ZPLANE: 1,
        }
    }

    write_experiment_json(
        path=output_dir,
        fov_count=num_fov,
        tile_format=ImageFormat.TIFF,
        primary_image_dimensions=primary_image_dimensions,
        aux_name_to_dimensions=aux_name_to_dimensions,
        primary_tile_fetcher=ISSCroppedBreastPrimaryTileFetcher(input_dir),
        aux_tile_fetcher={
            'nuclei': ISSCroppedBreastAuxTileFetcher(input_dir, 'nuclei'),
            'dots': ISSCroppedBreastAuxTileFetcher(input_dir, 'dots'),
        },
    )
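It can be useful to estimate how many tiles this configuration asks for before running it. The sketch below assumes that one tile is requested per (round, channel, z-plane) combination for each image set and each field of view. That is an assumption about how the writer iterates, not documented behaviour. The plain string keys stand in for the `Axes` members, and `tiles_per_fov` is a hypothetical helper.

```python
from math import prod
from typing import Dict

primary_dims = {"round": 4, "ch": 4, "zplane": 1}
aux_dims = {
    "nuclei": {"round": 1, "ch": 1, "zplane": 1},
    "dots": {"round": 1, "ch": 1, "zplane": 1},
}


def tiles_per_fov(primary: Dict[str, int], aux: Dict[str, Dict[str, int]]) -> int:
    """Count primary plus auxiliary tiles for a single field of view."""
    n_primary = prod(primary.values())
    n_aux = sum(prod(dims.values()) for dims in aux.values())
    return n_primary + n_aux


num_fov = 2
total_tiles = num_fov * tiles_per_fov(primary_dims, aux_dims)
print(total_tiles)  # 2 * (16 + 1 + 1) = 36
```

Under that assumption, the total is lower than the 44 downloaded images because the fetchers above never request the per-round DAPI images.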

Finally, we can run the script. We’ve packaged it up as an example in starfish. It takes as arguments the input directory (containing the raw images), the output directory (which will contain the formatted data), and the number of fields of view to extract from the raw directory.

mkdir iss/formatted
python3 examples/format_iss_breast_data.py \
    iss/raw/ \
    iss/formatted \
    2
ls iss/formatted/*.json