Running SpaRC on Lawrencium
SpaRC is a scalable, Apache Spark-based genomic sequence clustering application. It has run successfully on AWS EMR as well as on the Bridges supercomputer at PSC. In this tutorial, I describe how to run SpaRC on Lawrencium.
Spark On Demand on Lawrencium
Users can run Spark jobs on Lawrencium in Spark On Demand (SOD) fashion: a standalone Spark cluster is created on demand inside a Slurm job. Because the cluster runs in standalone mode, there is no YARN cluster manager and no HDFS; in lieu of HDFS, we'll use Lustre scratch for storage.
As of this writing, only Spark 2.1.0 is available on Lawrencium. We may install a more up-to-date version in the near future.
Building SpaRC
You'll build SpaRC against Spark 2.1.0. The source code of SpaRC is hosted on Bitbucket at https://bitbucket.org/LizhenShi/sparc. Since I don't have write access to that repo, I imported it to GitHub at https://github.com/shawfdong/sparc, and added a new file, build.sbt.spark2.1.0, which, as the name suggests, is used to build SpaRC against Spark 2.1.0.
Note that you can't build SpaRC on the login nodes of Lawrencium, because rsync (which is required by sbt) is disabled there. You'll have to use the data transfer node lrc-xfer.lbl.gov instead:
$ ssh lrc-xfer.lbl.gov
Download and unpack the Scala build tool sbt:
$ wget https://piccolo.link/sbt-1.3.10.tgz
$ tar xvzf sbt-1.3.10.tgz
Load the module for JDK 1.8.0:
$ export MODULEPATH=$MODULEPATH:/global/software/sl-7.x86_64/modfiles/langs
$ module load java
$ java -version
java version "1.8.0_121"
Java(TM) SE Runtime Environment (build 1.8.0_121-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.121-b13, mixed mode)
Clone and build SpaRC:
$ git clone https://github.com/shawfdong/sparc.git
$ cd sparc
$ cp build.sbt.spark2.1.0 build.sbt
$ ~/sbt/bin/sbt assembly
This will create a fat jar file, ~/sparc/target/scala-2.11/LocalCluster-assembly-0.2.jar. Copy it to your Lustre scratch space:
$ cp target/scala-2.11/LocalCluster-assembly-0.2.jar /global/scratch/$USER/
While you are at it, also download a sample sequence data file:
$ cd /global/scratch/$USER
$ curl http://s3.amazonaws.com/share.jgi-ga.org/sparc_example/illumina_02G_merged_with_id.seq.gz -o sample.seq.gz
$ gunzip sample.seq.gz
$ wc -l sample.seq
6343345 sample.seq
$ head -2 sample.seq
6 HISEQ13:204:C8T6VANXX:1:1101:3726:1992 NATATTCCCGTTCTGATATTGCGTTAAGTCGTTCCCCTAAGCCGGCCCTCCTTATCGAGCGCGCCGGCTTTTTTTGCCATGTTCAGCGAATCACAGGACAAGATACTTCACCTAACGTAGTAGATGGTTCTATGCTTAAGGGCAAGGTGTNTTAATCTCGATATCCGCCTGTTTTAATAAATCAGCGACGAAGCGATGGGAGGATAAGCGCTCGTCAAAAACCACGCGCTTTTTTTCTAAGGTGGGTAAGTTCAAGGTAACACCCCCACTATGCCTATGAGTGAATTGGTAACACCTTGCC
60 HISEQ13:204:C8T6VANXX:1:1101:4370:1919 NCGTGCGCCCATCTCCGTGGCTAAACAGCTTGAGGTGGAAATTCGCCAGTGGATACAGCAGCATGCAGCGACAGGCGGGCGTCGCCTCCCTTCGATACGCCATTTAGCAGCAACACATAACGTCAGCCGCAATGCAGTCATTGAAGCTTANGTAAGGTCTTCTCCTTCGCGCCAATCGTTAGGTAACCAGCCGCAGCCCAGTTTCAATGACTGTTCATCGGTGTTAAACACGCCCCATAAGCCATTCGTCACTTCTTCCAATGGCGTTGATGACGCGGGTTGAACCAGTTTCAGCGCGTTA
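As a quick sanity check of the seq format (an assumption here: each line holds whitespace-separated fields, namely a numeric id, the read name, and the sequence), you can confirm that every line has the same number of fields:
$ awk '{print NF}' sample.seq | sort | uniq -c
A single count in the output means the file is uniformly formatted.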
Now you can exit from lrc-xfer.lbl.gov.
Alternatively, you could build SpaRC on your local computer, then upload the assembled fat jar file to your Lustre scratch space on Lawrencium.
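For example, assuming you built the jar locally in the same way, you could copy it through the data transfer node (substitute your Lawrencium user name):
$ scp target/scala-2.11/LocalCluster-assembly-0.2.jar <your_username>@lrc-xfer.lbl.gov:/global/scratch/<your_username>/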
Running SpaRC interactively
SSH to a Lawrencium login node:
$ ssh lrc-login.lbl.gov
For demonstration purposes, request 2 nodes from the lr6 partition for an interactive job:
$ cd /global/scratch/$USER
$ srun -p lr6 --qos=lr_normal -N 2 -t 1:00:0 --account=<NAME_OF_YOUR_PROJECT_ACCOUNT> --pty bash
When the job starts, you'll have an exclusive allocation of 2 compute nodes in the lr6 partition, each with two 16-core Intel Xeon Gold 6130 CPUs and 96 GB of memory (so 64 cores and 192 GB of memory in total for your Spark cluster), and you'll be dropped into a bash shell on one of the compute nodes. Start Spark On Demand (SOD):
$ source /global/home/groups/allhands/bin/spark_helper.sh
$ spark-start
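The helper script sets the SPARK_URL environment variable to the address of the standalone master, which is used in the spark-submit commands below. A quick way to confirm the cluster came up is to print it; it should be a spark:// URL (the exact host and port will vary):
$ echo $SPARK_URL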
Run the first Spark job on SOD (you might want to tune the values of --executor-cores, --num-executors and --executor-memory):
$ SCRATCH=/global/scratch/$USER
$ JAR=$SCRATCH/LocalCluster-assembly-0.2.jar
$ OPT1="--master $SPARK_URL --executor-cores 4 --num-executors 16 --executor-memory 12g"
$ OPT2="--conf spark.executor.extraClassPath=$JAR \
--conf spark.driver.maxResultSize=8g \
--conf spark.network.timeout=360000 \
--conf spark.speculation=true \
--conf spark.default.parallelism=100 \
--conf spark.eventLog.enabled=false"
$ spark-submit $OPT1 $OPT2 \
$JAR KmerCounting --wait 1 \
-i $SCRATCH/sample.seq \
-o $SCRATCH/test_kc_seq_31 --format seq -k 31 -C
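As a rough guide to the tuning knobs mentioned above, the executor count follows from the size of the allocation; a back-of-the-envelope sketch (assuming the 2 lr6 nodes with 32 cores each from this example):
$ NODES=2; CORES_PER_NODE=32; EXECUTOR_CORES=4
$ echo $(( NODES * CORES_PER_NODE / EXECUTOR_CORES ))
16
If you change the node count or partition, scale --num-executors and --executor-memory accordingly.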
Run the second Spark job:
$ spark-submit $OPT1 $OPT2 \
$JAR KmerMapReads2 --wait 1 \
--reads $SCRATCH/sample.seq \
--format seq -o $SCRATCH/test_kmerreads.txt_31 -k 31 \
--kmer $SCRATCH/test_kc_seq_31 \
--contamination 0 --min_kmer_count 2 \
--max_kmer_count 100000 -C --n_iteration 1
Run the third Spark job:
$ spark-submit $OPT1 $OPT2 \
$JAR GraphGen2 --wait 1 \
-i $SCRATCH/test_kmerreads.txt_31 \
-o $SCRATCH/test_edges.txt_31 \
--min_shared_kmers 2 --max_degree 50 -n 1000
Run the fourth Spark job:
$ spark-submit $OPT1 $OPT2 \
$JAR GraphLPA2 --wait 1 \
-i $SCRATCH/test_edges.txt_31 \
-o $SCRATCH/test_lpa.txt_31 \
--min_shared_kmers 2 --max_shared_kmers 20000 \
--min_reads_per_cluster 2 --max_iteration 10 -n 1000
Run the fifth Spark job:
$ spark-submit $OPT1 $OPT2 \
$JAR CCAddSeq --wait 1 \
-i $SCRATCH/test_lpa.txt_31 \
--reads $SCRATCH/sample.seq \
-o $SCRATCH/sample_lpaseq.txt_31
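To sanity-check the end result, you can inspect the final output on Lustre (an assumption here: like typical Spark jobs, CCAddSeq writes its output as a directory of part files):
$ ls $SCRATCH/sample_lpaseq.txt_31
$ du -sh $SCRATCH/sample_lpaseq.txt_31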
Once you are done, don't forget to stop Spark On Demand and exit from the interactive job:
$ spark-stop
$ exit
Running SpaRC in batch mode
To run SpaRC in batch mode, write a Slurm job script and call it sparc.slurm:
#!/bin/bash
#SBATCH --job-name=sparc
#SBATCH --partition=lr6
#SBATCH --qos=lr_normal
#SBATCH --account=<NAME_OF_YOUR_PROJECT_ACCOUNT>
#SBATCH --nodes=2
#SBATCH --time=01:00:00
source /global/home/groups/allhands/bin/spark_helper.sh
# Start Spark On Demand
spark-start
SCRATCH=/global/scratch/$USER
JAR=$SCRATCH/LocalCluster-assembly-0.2.jar
OPT1="--master $SPARK_URL --executor-cores 4 --num-executors 16 --executor-memory 12g"
OPT2="--conf spark.executor.extraClassPath=$JAR \
--conf spark.driver.maxResultSize=8g \
--conf spark.network.timeout=360000 \
--conf spark.speculation=true \
--conf spark.default.parallelism=100 \
--conf spark.eventLog.enabled=false"
# 1st Spark job
spark-submit $OPT1 $OPT2 \
$JAR KmerCounting --wait 1 \
-i $SCRATCH/sample.seq \
-o $SCRATCH/test_kc_seq_31 --format seq -k 31 -C
# 2nd Spark job
spark-submit $OPT1 $OPT2 \
$JAR KmerMapReads2 --wait 1 \
--reads $SCRATCH/sample.seq \
--format seq -o $SCRATCH/test_kmerreads.txt_31 -k 31 \
--kmer $SCRATCH/test_kc_seq_31 \
--contamination 0 --min_kmer_count 2 \
--max_kmer_count 100000 -C --n_iteration 1
# 3rd Spark job
spark-submit $OPT1 $OPT2 \
$JAR GraphGen2 --wait 1 \
-i $SCRATCH/test_kmerreads.txt_31 \
-o $SCRATCH/test_edges.txt_31 \
--min_shared_kmers 2 --max_degree 50 -n 1000
# 4th Spark job
spark-submit $OPT1 $OPT2 \
$JAR GraphLPA2 --wait 1 \
-i $SCRATCH/test_edges.txt_31 \
-o $SCRATCH/test_lpa.txt_31 \
--min_shared_kmers 2 --max_shared_kmers 20000 \
--min_reads_per_cluster 2 --max_iteration 10 -n 1000
# 5th Spark job
spark-submit $OPT1 $OPT2 \
$JAR CCAddSeq --wait 1 \
-i $SCRATCH/test_lpa.txt_31 \
--reads $SCRATCH/sample.seq \
-o $SCRATCH/sample_lpaseq.txt_31
# Stop Spark On Demand
spark-stop
Then submit the job with:
$ sbatch sparc.slurm
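You can then monitor the job with the usual Slurm tools; by default, the output of all five spark-submit steps lands in slurm-<jobid>.out in the submission directory:
$ squeue -u $USER
$ tail -f slurm-<jobid>.out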
Known issues and future improvements
- Presumably, there is a switch -i to spark-start that would enable communication over the IPoIB network, but it doesn't work!
- Spark 2.1.0 is a bit old.