The minimal BDS cluster I could deploy was 5 nodes: one master, one utility and 3 workers. It took around 50 minutes to provision the infrastructure, not counting the prep work to set up groups, networking, etc.
My next challenge was to figure out how to use Big Data Studio, which is the native Notebook interface that comes preinstalled on BDS.
The first step was to map the utility node's private IP to a public IP address. The steps to do that are documented here: https://docs.oracle.com/en/cloud/paas/big-data-service/user/establish-connections-nodes-private-ip-addresses.html
Guess what... didn't work >.<

Reading a bit further I found that my NAT setup was interfering with the route for the public IP. The solution to this problem is documented on this page, under "Additional Steps for Clusters with an (...) NAT Gateway": https://docs.oracle.com/en/cloud/paas/big-data-service/user/access-big-data-studio.html
After manually editing the routes I finally got access to Big Data Studio. Success! (Or so I thought....)
It turns out this is not a JupyterHub instance as I was expecting. I don't recognise this product, but it seems to be based on Apache Zeppelin.

Nevertheless I was expecting it to support the "oci://" scheme out of the box... but it didn't:
Tbh at this point I can't be bothered to set this up. I already tried to set up the Oracle HDFS connector once today... I'm not doing this twice. Just in case you got to this thread right now and you need more info, these are the docs for Big Data Studio: https://docs.oracle.com/en/cloud/paas/big-data-service/user/access-hdfs-spark-and-pyspark.html
And these are the docs for setting up the HDFS connector for Spark: https://docs.cloud.oracle.com/en-us/iaas/Content/API/SDKDocs/hdfsconnectorspark.htm
Which leads me to the conclusion: for the scope of my talk, Oracle Data Science service is the big winner. I'm going to make a compromise and run my examples using pyspark installed on it.
It's not a true cluster experience, but maybe if paired with Oracle Data Flow it would be much more powerful than Big Data Service, which at this point seems to be too rooted in the old Hadoop ways.
Of course, if you need streaming, your only option on OCI is deploying BDS.
Anyway, for my use case I want to teach Spark, not OCI, so I'm going to load a text file from OCI's object store and do a classic wordcount type of program using pyspark on Oracle Data Science platform.
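For anyone curious what that looks like, here is a minimal sketch of a wordcount in pyspark. It is not the exact code from my talk: `tokenize`, `word_count` and `main` are names made up for this example, and the path passed to `main` is assumed to be a plain text file that Spark can already read (a local copy, for instance). Reading straight from an "oci://" URI would additionally need the HDFS connector configured, which this sketch does not cover.

```python
import re
from operator import add


def tokenize(line):
    """Split a line into lowercase words, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", line.lower())


def word_count(spark, path):
    """Return an RDD of (word, count) pairs, most frequent first."""
    lines = spark.sparkContext.textFile(path)
    return (
        lines.flatMap(tokenize)            # one record per word
        .map(lambda word: (word, 1))       # pair each word with a count of 1
        .reduceByKey(add)                  # sum the counts per word
        .sortBy(lambda pair: pair[1], ascending=False)
    )


def main(path):
    """Print the ten most frequent words in the file at `path` (hypothetical input)."""
    # Imported here so the helpers above can be used without a Spark install.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master("local[*]")                # single machine, not a real cluster
        .appName("wordcount")
        .getOrCreate()
    )
    try:
        for word, count in word_count(spark, path).take(10):
            print(word, count)
    finally:
        spark.stop()
```

Calling `main("sample.txt")` from a notebook cell runs the whole thing in local mode, which matches the "not a true cluster experience" compromise above.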
You can follow @danicat83.