Spatial Transcriptomics Tools Workshop

Installing the ST pipeline

We recommend installing the ST pipeline in a virtual environment. This allows several versions of the ST pipeline to be installed side by side without conflicting with each other. We use Anaconda to manage the virtual environments; it can be downloaded from the following link:

https://www.continuum.io/downloads#linux

Anaconda can be installed with the following command (the version number in the file name may differ):

bash Anaconda2-4.1.1-Linux-x86_64.sh

Creating anaconda virtual environments

A virtual environment can be created with a single command:

conda create -n pipeline_v1.0 python=2.7 anaconda

Here we create a new environment named pipeline_v1.0 that uses Python version 2.7 and includes the default set of packages that come with Anaconda.

To activate the environment:

source activate pipeline_v1.0

To deactivate the environment:

source deactivate

Installing the ST pipeline

To install the pipeline, change directory to the location you want to use for the install and then clone it from GitHub:

git clone https://github.com/SpatialTranscriptomicsResearch/st_pipeline.git

Change into the st_pipeline directory and activate the virtual environment (if it is not already active). Then run the setup script:

cd st_pipeline
python setup.py build
python setup.py install

This will install the ST pipeline, including all its dependencies.

 

Running the ST pipeline

We recommend running the ST pipeline on a computer with at least 32 GB of RAM and 8 CPU cores. The pipeline requires that the STAR aligner (>= v2.5.0) is installed on the system. In the ST research group, we use a Linux cluster managed by SLURM to process our data. For a detailed description of the input arguments, run:

st_pipeline_run.py -h
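
The STAR version requirement mentioned above can be checked before a run is started. The following is a minimal sketch, not part of the ST pipeline: the function names are hypothetical, and it assumes that STAR is on the PATH and that "STAR --version" prints a version string such as "2.7.10a" or "STAR_2.5.0a".

```python
import re
import subprocess


def parse_star_version(text):
    """Extract (major, minor, patch) from strings such as '2.7.10a' or 'STAR_2.5.0a'."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError("no version number found in %r" % text)
    return tuple(int(part) for part in match.groups())


def star_is_recent_enough(minimum=(2, 5, 0)):
    """Return True if the STAR executable on the PATH is at least `minimum`."""
    # Assumption: `STAR --version` writes the version string to stdout.
    output = subprocess.check_output(["STAR", "--version"])
    return parse_star_version(output.decode()) >= minimum
```

Versions are compared as tuples of integers, so that 2.10.0 is correctly treated as newer than 2.5.0, which a plain string comparison would get wrong.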

Input required by the ST pipeline:

  • gzipped fastq files with paired end data
  • StarIndex directory for the reference genome
  • annotation file in gtf format
  • a file with barcode ids (comes with the st_pipeline package)

Optional:

  • StarIndex directory for contaminant dataset. If contaminant filtering is activated, the reads are first mapped to the contaminant genome, and the reads that map to it are excluded from the subsequent mapping to the reference genome.

An example SLURM batch script for running the pipeline:

#!/bin/bash
#SBATCH -n 8
#SBATCH --mem=32000
#SBATCH -t 03:00:00
#SBATCH -J CN48C1
#SBATCH --mail-user alexander.stuckey@scilifelab.se
#SBATCH --mail-type=ALL
#SBATCH -e job-%J.err
#SBATCH -o job-%J.out

source activate pipeline_v1.0

#Folder name

# FASTQ reads
FW=/fastdisk/INBOX/YOUR_RUN_FOLDER/YOUR_R1.fastq.gz
RV=/fastdisk/INBOX/YOUR_RUN_FOLDER/YOUR_R2.fastq.gz

# References for mapping, annotation and ribo-filtering
# NOTE: these paths are for the mouse genome/annotation
MAP=/fastdisk/mouse/GRCm38_86v2/StarIndex
ANN=/fastdisk/mouse/GRCm38_86v2/annotation/gencode.vM11.annotation_noM.gtf
# Optional contaminant filtering
CONT=/fastdisk/mouse/GRCm38_86v2/ncRNA/StarIndex 

# Barcodes settings
# NOTE: Make sure that you use the right IDs file
ID=/fastdisk/ids/YOUR_IDS_FILE.txt

# Output folder and experiment name
OUTPUT=/home/your.user/your_folder
# Do not add / or \ to the experiment name
EXP=EXPERIMENT_NAME

# Running the pipeline
# Add --no-clean-up if you want to keep the intermediate files
st_pipeline_run.py \
  --output-folder $OUTPUT \
  --ids $ID \
  --ref-map $MAP \
  --ref-annotation $ANN \
  --expName $EXP \
  --remove-polyA 10 \
  --remove-polyT 10 \
  --remove-polyG 10 \
  --remove-polyC 10 \
  --htseq-no-ambiguous \
  --verbose \
  --mapping-threads 8 \
  --log-file $OUTPUT/${EXP}_log.txt \
  --two-pass-mode \
  --umi-filter \
  --filter-AT-content 80 \
  --filter-GC-content 80 \
  --contaminant-index $CONT \
  --min-length-qual-trimming 40 \
  --disable-clipping \
  $FW $RV

rm unzipped*.fastq
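
Before submitting a job like the one above, it can save queue time to confirm that the required inputs exist. The following is a minimal sketch under stated assumptions: the function name is hypothetical and not part of the ST pipeline, and checking the ".gz" extension is only a heuristic for "gzipped".

```python
import os


def missing_inputs(fastq_files, star_index, annotation, ids_file):
    """Return a list of problems found; an empty list means the inputs look fine."""
    problems = []
    for path in fastq_files:
        if not os.path.isfile(path):
            problems.append("FASTQ file not found: " + path)
        elif not path.endswith(".gz"):
            # Heuristic only: the extension says nothing about the content.
            problems.append("FASTQ file does not look gzipped: " + path)
    if not os.path.isdir(star_index):
        problems.append("STAR index directory not found: " + star_index)
    for label, path in (("annotation", annotation), ("barcode ids", ids_file)):
        if not os.path.isfile(path):
            problems.append("%s file not found: %s" % (label, path))
    return problems
```

Called with the same paths that are assigned to FW, RV, MAP, ANN and ID in the batch script, the function returns one message per missing or suspicious input.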

QA on raw output from ST pipeline

Before doing any further work, we want to check that the output from the ST pipeline is of good quality and that we can carry on with the analysis. This is done with the st_qa.py script.

 

Run QA

python st_qa.py --input-data workshop_stdata.tsv

The output of this script is four PDF files and one text file. The PDF files show the number of genes and transcripts per spot in two different ways: as a bar chart and as a heatmap. The stats file records various statistics about the data set, such as the number of features with data and the number of unique transcripts and genes. Each generated file is prefixed with the name of the ST data file.
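
To make the per-spot numbers concrete, the following minimal sketch computes the number of detected genes and the transcript count for each spot. It is not the st_qa.py implementation. It assumes that the data file is a tab-separated matrix with one row per spot (named "XxY") and one column per gene; the function name and the example values are hypothetical.

```python
import io


def per_spot_totals(handle):
    """Return {spot: (genes_detected, transcript_count)} for a tab-separated matrix."""
    handle.readline()  # skip the header row that holds the gene names
    totals = {}
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # blank or malformed line
        counts = [float(value) for value in fields[1:]]
        genes_detected = sum(1 for count in counts if count > 0)
        totals[fields[0]] = (genes_detected, sum(counts))
    return totals


# Made-up example with two spots and three genes.
example = io.StringIO(
    u"\tGeneA\tGeneB\tGeneC\n"
    u"10x12\t3\t0\t1\n"
    u"11x12\t0\t0\t5\n"
)
print(per_spot_totals(example))
```

For a real file, pass an open file handle, for example open("workshop_stdata.tsv"), instead of the in-memory example.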

 

Rotate and resize images

One immediately obvious observation from the QA script is that the orientation of the HE and Cy3 images is not the same as that of the QA images. This is because images extracted from our microscope are rotated 180 degrees relative to their actual orientation. One way to identify this is in the Cy3 image, which contains a square of probes that marks the correct orientation. It should be located in the top left, but it appears in the bottom right in our example image. We therefore have to rotate the Cy3 image (and, by extension, the HE image) by 180 degrees. This can be done in your favourite image program (Photoshop, GIMP, ImageMagick, etc.).

It is also a good idea to make a smaller copy of the images if you are going to work with the spot detector on your local computer, because spot detection requires a lot of memory. Scaling the images to 40% or 50% of their initial size is usually enough to alleviate any memory issues.
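
The rotation and resizing can also be scripted. The following is a minimal sketch that assumes the Pillow imaging library is installed; the function names are hypothetical, and any image program gives an equivalent result.

```python
def scaled_size(width, height, scale):
    """Return the (width, height) of an image shrunk by `scale`, e.g. 0.5 for 50%."""
    if not 0 < scale <= 1:
        raise ValueError("scale must be in the interval (0, 1]")
    return max(1, int(round(width * scale))), max(1, int(round(height * scale)))


def rotate_and_shrink(path_in, path_out, scale=0.5):
    """Rotate an image by 180 degrees and save a smaller copy."""
    # Pillow is imported here so that scaled_size() can be used without it.
    from PIL import Image

    # Microscope images can exceed Pillow's default pixel-count safety limit;
    # lifting the limit is only sensible for files you trust.
    Image.MAX_IMAGE_PIXELS = None
    with Image.open(path_in) as image:
        rotated = image.rotate(180)
        target = scaled_size(rotated.width, rotated.height, scale)
        rotated.resize(target, Image.LANCZOS).save(path_out)
```

Apply the same rotation and the same scale to both the HE and the Cy3 image so that they stay aligned with each other.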

 

Selecting only the spots under the tissue

It is often useful to select only the array spots that are under the tissue, as this is where the data should be located (note that one can also select only the spots outside the tissue, in order to compare the two sets). This is done using the ST spot detector tool developed by the ST research group. It is available on GitHub at the following link:

https://github.com/SpatialTranscriptomicsResearch/st_spot_detector

The spot detector can be installed either on a server or on your local computer. The local option uses Singularity as the containerisation software, so Singularity needs to be installed first. It is available for Windows, Linux and Mac OS X from the following link:

http://singularity.lbl.gov/

Building the container on your local computer takes roughly 30 minutes, due to the need to install and compile required software. More detailed instructions on installing the spot detector can be found at the following locations:

 

Practical session with ST spot detector

The spot detector outputs a text file with six columns (if you export only selected spots) or seven columns (if you export all spots). The seventh column is a flag column (1 or 0) indicating whether the spot is under the tissue or not. I would recommend exporting a selection of spots instead of all spots, as there might be artifacts that you would not wish to have in your dataset downstream (spots that are only partially covered by tissue, spots covered by fragmented tissue, folded tissue, etc.).
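
If all spots were exported, the flag column can be used afterwards to recover the spots under the tissue. The following is a minimal sketch. It assumes that the file is tab-separated and that the flag is the seventh column, as described above; the function name and the column names in the example header are hypothetical.

```python
import io


def selected_spots(handle, flag_column=6):
    """Yield the leading columns of every row whose flag column is '1'."""
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= flag_column:
            continue  # blank line or row without a flag column
        # A header row is skipped as well, because its flag field is not "1".
        if fields[flag_column] == "1":
            yield fields[:flag_column]


# Made-up example with a header row and two spots.
example = io.StringIO(
    u"x\ty\tnew_x\tnew_y\tpixel_x\tpixel_y\tselected\n"
    u"1\t1\t1.02\t0.98\t1074.3\t310.7\t1\n"
    u"2\t1\t2.01\t1.00\t1218.9\t310.3\t0\n"
)
for spot in selected_spots(example):
    print(spot)
```

The function is only meant for the seven-column export; a six-column export already contains selected spots only and needs no filtering.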

The spot detector also provides you with an alignment matrix, which is used by the ST viewer software to correctly align the spots on the tissue image (Fernandez Navarro, submitted).

Example alignment matrix

144.60178679 -0.40425072 0 -0.13930526999999998 145.03759502 0 929.8014744022836 166.0837697805873 1 
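
The nine values can be read as a 3x3 affine transformation that maps array coordinates to pixel coordinates in the image. The following is a minimal sketch, not the ST viewer implementation. It assumes that the values are listed row by row and that the translation sits in the last row, so that a spot is transformed as the row vector [x, y, 1] multiplied by the matrix; the function names are hypothetical.

```python
def parse_alignment_matrix(text):
    """Parse nine whitespace-separated numbers into a 3x3 matrix (list of rows)."""
    values = [float(value) for value in text.split()]
    if len(values) != 9:
        raise ValueError("expected 9 values, got %d" % len(values))
    return [values[0:3], values[3:6], values[6:9]]


def spot_to_pixel(matrix, x, y):
    """Map array coordinates (x, y) to pixel coordinates via [x, y, 1] * matrix."""
    pixel_x = x * matrix[0][0] + y * matrix[1][0] + matrix[2][0]
    pixel_y = x * matrix[0][1] + y * matrix[1][1] + matrix[2][1]
    return pixel_x, pixel_y


matrix = parse_alignment_matrix(
    "144.60178679 -0.40425072 0 -0.13930526999999998 145.03759502 0 "
    "929.8014744022836 166.0837697805873 1"
)
print(spot_to_pixel(matrix, 1, 1))
```

Under this reading, the values of about 145 on the diagonal are the distance in pixels between neighbouring spots, the small off-diagonal values describe a slight rotation, and the last row is the pixel offset of the array origin.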

Creating new ST dataset with the selected spots only

The selected spots can be extracted from the dataset (and a new dataset created) using the adjust_matrix_coordinates.py script.

usage: adjust_matrix_coordinates.py [-h] --counts-matrix COUNTS_MATRIX
                                    [--outfile OUTFILE] [--update-coordinates]
                                    --coordinates-file COORDINATES_FILE
                                    [--outformat {array,pixel}]
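
Conceptually, creating the new dataset means keeping only the rows of the counts matrix that belong to selected spots. The following minimal sketch illustrates that idea. It is not the implementation of adjust_matrix_coordinates.py, and it assumes a tab-separated matrix with a header row and spot names of the form "XxY" in the first column; the function name is hypothetical.

```python
import io


def keep_selected_rows(matrix_handle, selected, out_handle):
    """Copy the header and the rows whose spot name is in `selected`; return rows kept."""
    out_handle.write(matrix_handle.readline())  # header row with gene names
    kept = 0
    for line in matrix_handle:
        spot = line.split("\t", 1)[0]
        if spot in selected:
            out_handle.write(line)
            kept += 1
    return kept


# Made-up example: three spots in the matrix, two of them selected.
matrix = io.StringIO(
    u"\tGeneA\tGeneB\n"
    u"10x12\t3\t0\n"
    u"11x12\t0\t5\n"
    u"12x12\t1\t1\n"
)
subset = io.StringIO()
print(keep_selected_rows(matrix, {u"10x12", u"12x12"}, subset))
```

Looking the spot names up in a set keeps the filtering fast even for matrices with many spots.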

Redo QA with new selection

Now that we have a selection of spots that is only under the tissue, we can rerun the QA script and see how it compares to the whole dataset.