Running SpaRC on Google Cloud Dataproc
Building SpaRC:
Open a new Cloud Shell and run:
git clone https://github.com/shawfdong/sparc.git
cd sparc
wget https://piccolo.link/sbt-1.3.10.tgz
tar xvzf sbt-1.3.10.tgz
./sbt/bin/sbt assembly
A jar file should be created at: target/scala-2.11/LocalCluster-assembly-0.2.jar
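To confirm the build succeeded before moving on, you can list the output path reported above (a minimal check; only the path from the build output is assumed):
ls -lh target/scala-2.11/LocalCluster-assembly-0.2.jar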
Upload Data to Google Cloud Storage
- Navigate to Storage and select Storage > Browser.
- Click Create Bucket.
- Specify your project name as the bucket name.
- Click Create.
- Copy the compiled SpaRC jar LocalCluster-assembly-0.2.jar and a sample input file sample_small.seq to the project bucket you just created, by running the following in Cloud Shell:
gsutil cp target/scala-2.11/LocalCluster-assembly-0.2.jar gs://$DEVSHELL_PROJECT_ID
cd data/small
cp sample.seq sample_small.seq
gsutil cp sample_small.seq gs://$DEVSHELL_PROJECT_ID
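If you prefer to stay in Cloud Shell, the bucket itself can also be created from the command line instead of the console; a minimal sketch, assuming your project ID is still available as a globally unique bucket name:
# Create the bucket (run this before the copy commands above)
gsutil mb gs://$DEVSHELL_PROJECT_ID
# Verify that both the jar and the sample input are in the bucket
gsutil ls gs://$DEVSHELL_PROJECT_ID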
Launch Dataproc
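The job needs a running Dataproc cluster. You can create one in the console (Dataproc > Clusters > Create cluster) or from Cloud Shell; a minimal sketch, where the cluster name sparc-cluster, the region us-central1, and the worker count are placeholders, not values taken from this guide:
# Create a small Dataproc cluster for the SpaRC job
gcloud dataproc clusters create sparc-cluster \
    --region=us-central1 \
    --num-workers=2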
Run SpaRC job on Dataproc
- In the Dataproc console, click Jobs.
- Click Submit job.
- For Job type, select Spark. For Main class or jar and Jar files, specify the location of the SpaRC jar file you uploaded to your bucket (your bucket name is your project name):
gs://<my-project-name>/LocalCluster-assembly-0.2.jar
- For Arguments, enter each of these arguments separately:
"args": [
"KmerCounting",
"--input",
"gs://<my-project-name>/sample_small.seq",
"--output",
"test.log",
"--kmer_length",
"31"
]
- For Properties, enter these key-value pairs separately:
"properties": {
"spark.executor.extraClassPath": "gs://<my-project-name>/LocalCluster-assembly-0.2.jar",
"spark.driver.maxResultSize": "8g",
"spark.network.timeout": "360000",
"spark.default.parallelism": "4",
"spark.eventLog.enabled": "false"
}
- Click Submit.
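The same job can also be submitted from Cloud Shell instead of the console; a minimal sketch, assuming the placeholder cluster name and region from the cluster-creation step above, with the arguments and properties listed in this section:
# Submit the SpaRC KmerCounting job to the Dataproc cluster
gcloud dataproc jobs submit spark \
    --cluster=sparc-cluster \
    --region=us-central1 \
    --jar=gs://<my-project-name>/LocalCluster-assembly-0.2.jar \
    --properties=spark.executor.extraClassPath=gs://<my-project-name>/LocalCluster-assembly-0.2.jar,spark.driver.maxResultSize=8g,spark.network.timeout=360000,spark.default.parallelism=4,spark.eventLog.enabled=false \
    -- KmerCounting --input gs://<my-project-name>/sample_small.seq --output test.log --kmer_length 31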