Running SpaRC on Google Cloud Dataproc

Building SpaRC

Open a new Cloud Shell and run:

git clone https://github.com/shawfdong/sparc.git
cd sparc
wget https://piccolo.link/sbt-1.3.10.tgz
tar xvzf sbt-1.3.10.tgz
./sbt/bin/sbt assembly

A jar file should be created at: target/scala-2.11/LocalCluster-assembly-0.2.jar
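To confirm the build succeeded, you can list the jar in Cloud Shell before uploading it (a quick check; the path comes from the build output above):

# List the assembly jar produced by sbt
ls -lh target/scala-2.11/LocalCluster-assembly-0.2.jar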

Upload Data to Google Cloud Storage

  1. In the Google Cloud Console, navigate to Storage > Browser.

  2. Click Create Bucket.

  3. Specify your project name as the bucket name.

  4. Click Create. (Alternatively, you can create the bucket from Cloud Shell, as sketched after this list.)

  5. Copy the compiled SpaRC jar (LocalCluster-assembly-0.2.jar) and a sample input file (sample_small.seq) to the bucket you just created by running the following in Cloud Shell:

gsutil cp target/scala-2.11/LocalCluster-assembly-0.2.jar gs://$DEVSHELL_PROJECT_ID
cd data/small
cp sample.seq sample_small.seq
gsutil cp sample_small.seq gs://$DEVSHELL_PROJECT_ID
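As an alternative to steps 1–4 above, the bucket can also be created from Cloud Shell. This is a minimal sketch; the location us-central1 is an assumption and should match the region where you plan to launch the Dataproc cluster:

# Create a bucket named after the current project (location is an assumption)
gsutil mb -l us-central1 gs://$DEVSHELL_PROJECT_ID

# Confirm the jar and the sample input are in the bucket
gsutil ls gs://$DEVSHELL_PROJECT_ID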

Launch Dataproc
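A Dataproc cluster must be running before the job is submitted. One way to create a small cluster from Cloud Shell is sketched below; the cluster name, region, machine types, and worker count are illustrative assumptions, not values prescribed by this walkthrough:

# Create a small Dataproc cluster (name, region, and sizing are assumptions; adjust to your quota)
# Note: the SpaRC jar above is built for Scala 2.11, so pick a Dataproc image whose Spark/Scala build matches.
gcloud dataproc clusters create sparc-cluster \
    --region=us-central1 \
    --master-machine-type=n1-standard-4 \
    --worker-machine-type=n1-standard-4 \
    --num-workers=2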

Run a SpaRC job on Dataproc

  1. In the Dataproc console, click Jobs.

  2. Click Submit job.

  3. For Job type, select Spark. For Main class or jar and for Jar files, specify the location of the SpaRC jar you uploaded to your bucket (the bucket name is your project name): gs://<my-project-name>/LocalCluster-assembly-0.2.jar.

  4. For Arguments, enter each of these arguments separately:

   "args": [
            "KmerCounting",
            "--input",
            "gs://<my-project-name>/sample_small.seq",
            "--output",
            "test.log",
            "--kmer_length",
            "31"
   ]

  5. For Properties, enter these key-value pairs separately:

    "properties": {
      "spark.executor.extraClassPath": "gs://<my-project-name>/LocalCluster-assembly-0.2.jar",
      "spark.driver.maxResultSize": "8g",
      "spark.network.timeout": "360000",
      "spark.default.parallelism": "4",
      "spark.eventLog.enabled": "false"
    }
  6. Click Submit.
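The same job can also be submitted from Cloud Shell with the gcloud CLI instead of the console form. The sketch below reuses the cluster name and region assumed in the cluster-creation example; the jar path, arguments, and properties are the ones listed above:

# Submit the SpaRC KmerCounting job (cluster name and region are assumptions)
gcloud dataproc jobs submit spark \
    --cluster=sparc-cluster \
    --region=us-central1 \
    --jar=gs://<my-project-name>/LocalCluster-assembly-0.2.jar \
    --properties=spark.executor.extraClassPath=gs://<my-project-name>/LocalCluster-assembly-0.2.jar,spark.driver.maxResultSize=8g,spark.network.timeout=360000,spark.default.parallelism=4,spark.eventLog.enabled=false \
    -- KmerCounting --input gs://<my-project-name>/sample_small.seq --output test.log --kmer_length 31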