New BigData Initiative
We (@tyler and I) have started cutting new versions of the ASF charms so we can run more recent releases of the software than are available in the Bigtop releases/upstream charms. We quickly set our sights on a few additional goals for these charms, to accommodate our environment's needs for multi-homed networking and managed storage within some of the ASF components.
The primary goal of these charms is to let a user deploy Spark workloads in a multi-homed network environment, leverage Juju storage for each software component's individual storage needs, and deploy whatever version of the upstream software they please by supplying the software tarball as a resource.
Some key features:
- New big data development workflow component: Conda charm.
- Usability enhancement: Spark configuration for use with radosgw or AWS S3 out of the box (a rough sketch of the equivalent s3a settings follows this list).
- Network space support: Zookeeper, Jupyter-Notebook, Spark.
- Juju storage support: Zookeeper.
- S3 support (rough): Jupyter-Notebook, Spark.
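To make the object-storage options more concrete, here is a minimal PySpark sketch of the kind of standard hadoop-aws s3a properties that the `object-storage-gateway`, `aws-access-key`, and `aws-secret-key` options boil down to. This is an illustration only; the exact keys and the mechanism the charms use to wire this up may differ.

```python
# Rough sketch of the s3a settings the charm options map to (standard
# hadoop-aws properties); the exact keys the charms write may differ.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("s3a-config-sketch")
    # object-storage-gateway -> the s3a endpoint (radosgw or AWS S3)
    .config("spark.hadoop.fs.s3a.endpoint", "<object-storage-endpoint-url>")
    # aws-access-key / aws-secret-key -> s3a credentials
    .config("spark.hadoop.fs.s3a.access.key", "<s3-access-key>")
    .config("spark.hadoop.fs.s3a.secret.key", "<s3-secret-key>")
    # path-style access is typically needed when talking to radosgw
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)
```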
Note: We had to cut these charms in a time box, so they have quite a bit of room for improvement across the board. I basically wanted to get them out there and working, then start iterating on them to make them more generally useful. They no doubt have some rough edges that we intend to smooth out quickly.
Below are some of the initial pieces. More to come in the near future.
* layer-zookeeper - https://github.com/omnivector-solutions/layer-zookeeper
* interface-zookeeper - https://github.com/omnivector-solutions/interface-zookeeper
* layer-spark - https://github.com/omnivector-solutions/layer-spark
* layer-spark-base - https://github.com/omnivector-solutions/layer-spark-base
* layer-hadoop-base - https://github.com/omnivector-solutions/layer-hadoop-base
* interface-spark - https://github.com/omnivector-solutions/interface-spark
Jupyter-Notebook + Spark
* layer-jupyter-notebook - https://github.com/omnivector-solutions/layer-jupyter-notebook
* layer-conda - https://github.com/omnivector-solutions/layer-conda
* layer-conda-api - https://github.com/omnivector-solutions/layer-conda-api
Jupyter-Notebook + Conda + Spark
Here is an example of how these come together. The object storage gateway could be an AWS S3 endpoint or a Ceph object storage gateway (radosgw) endpoint. This stack is primarily aimed at the Spark standalone cluster use case, but because the jupyter-notebook charm is built with layer-spark, it is also a great way on its own to deploy Spark 2.4.x workloads to k8s from a Jupyter notebook. Here is the bundle I’ve been beating on.
```yaml
series: bionic
applications:
  spark:
    charm: cs:~omnivector/spark
    constraints: "tags=bdx-test spaces=mgmt,access"
    num_units: 3
    options:
      object-storage-gateway: "<object-storage-endpoint-url>"
      aws-access-key: "<s3-access-key>"
      aws-secret-key: "<s3-secret-key>"
    bindings:
      "": mgmt
      spark: access
  jupyter-notebook:
    charm: cs:~omnivector/jupyter-notebook
    constraints: "tags=bdx-test spaces=mgmt,access"
    num_units: 1
    options:
      object-storage-gateway: "<object-storage-endpoint-url>"
      aws-access-key: "<s3-access-key>"
      aws-secret-key: "<s3-secret-key>"
    bindings:
      "": mgmt
      http: access
  conda:
    charm: cs:~omnivector/conda
    num_units: 0
    options:
      conda-extra-packages: "pyspark=2.4.0 numpy ipykernel pandas pip"
      conda-extra-pip-packages: "psycopg2 Cython git+https://<oauthkey>:firstname.lastname@example.org/<my-private-org>/<my-private-repo>@master"
relations:
- - spark:juju-info
  - conda:juju-info
- - jupyter-notebook:juju-info
  - conda:juju-info
```
```
Model    Controller  Cloud/Region  Version  SLA          Timestamp
spark01  pdl-maas    pdl-maas      2.5.4    unsupported  03:07:31Z

App               Version  Status  Scale  Charm             Store       Rev  OS      Notes
conda-pdlda                active      6  conda             jujucharms   13  ubuntu
jupyter-notebook           active      1  jupyter-notebook  jujucharms   19  ubuntu
pdl-bdx-conda00            active      6  conda             jujucharms   13  ubuntu
spark             2.4.1    active      5  spark             jujucharms   14  ubuntu

Unit                  Workload  Agent  Machine  Public address  Ports                                          Message
jupyter-notebook/57*  active    idle   127      10.10.11.29     8888/tcp                                       http://10.100.211.10:8888
  conda-pdlda/11      active    idle            10.10.11.29                                                    Conda Env Installed: conda-pdlda
  pdl-bdx-conda00/5   active    idle            10.10.11.29                                                    Conda Env Installed: pdl-bdx-conda00
spark/123             active    idle   128      10.10.11.35     7078/tcp,8081/tcp                              Services: worker
  conda-pdlda/6*      active    idle            10.10.11.35                                                    Conda Env Installed: conda-pdlda
  pdl-bdx-conda00/3*  active    idle            10.10.11.35                                                    Conda Env Installed: pdl-bdx-conda00
spark/124*            active    idle   129      10.10.11.31     7077/tcp,7078/tcp,8080/tcp,8081/tcp,18080/tcp  Running: master,worker,history
  conda-pdlda/10      active    idle            10.10.11.31                                                    Conda Env Installed: conda-pdlda
  pdl-bdx-conda00/2   active    idle            10.10.11.31                                                    Conda Env Installed: pdl-bdx-conda00
spark/125             active    idle   130      10.10.11.37     7078/tcp,8081/tcp                              Services: worker
  conda-pdlda/9       active    idle            10.10.11.37                                                    Conda Env Installed: conda-pdlda
  pdl-bdx-conda00/1   active    idle            10.10.11.37                                                    Conda Env Installed: pdl-bdx-conda00
spark/126             active    idle   131      10.10.11.17     7078/tcp,8081/tcp                              Services: worker
  conda-pdlda/7       active    idle            10.10.11.17                                                    Conda Env Installed: conda-pdlda
  pdl-bdx-conda00/4   active    idle            10.10.11.17                                                    Conda Env Installed: pdl-bdx-conda00
spark/127             active    idle   132      10.10.11.40     7078/tcp,8081/tcp                              Services: worker
  conda-pdlda/8       active    idle            10.10.11.40                                                    Conda Env Installed: conda-pdlda
  pdl-bdx-conda00/0   active    idle            10.10.11.40                                                    Conda Env Installed: pdl-bdx-conda00

Machine  State    DNS          Inst id     Series  AZ  Message
127      started  10.10.11.29  d3-util-03  bionic  d3  Deployed
128      started  10.10.11.35  d3-util-04  bionic  d3  Deployed
129      started  10.10.11.31  d4-util-05  bionic  d4  Deployed
130      started  10.10.11.37  d3-util-01  bionic  d3  Deployed
131      started  10.10.11.17  d4-util-06  bionic  d4  Deployed
132      started  10.10.11.40  d4-util-03  bionic  d4  Deployed
```
Following deployment of the bundle above, you should be able to log in to the Jupyter notebook and start running jobs that access your object storage via s3a. In this way you can run distributed Spark/PySpark workloads in Spark standalone mode using Ceph object storage as a backend, eliminating the need for YARN, Hadoop, and/or HDFS.
A simple example:
```python
import os
os.environ['PYSPARK_PYTHON'] = '/opt/anaconda/envs/conda/bin/python'

from pyspark.sql import SparkSession
from pyspark import SparkConf

conf = SparkConf()\
    .setAppName('spark_playground')\
    .setMaster('spark://<master-ip-address>:7077')

spark = SparkSession.builder.config(conf=conf).getOrCreate()
sc = spark.sparkContext

sc.textFile("s3a://path/to/your/datafile.txt").take(1)
```
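Building on that snippet, here is a hypothetical follow-up that writes results back to the object store over s3a. The paths are placeholders, and it assumes the configured credentials can write to the target bucket.

```python
# Hypothetical continuation of the example above: compute word counts from the
# input file and write them back to object storage via s3a.
counts = (
    sc.textFile("s3a://path/to/your/datafile.txt")
      .flatMap(lambda line: line.split())
      .map(lambda word: (word, 1))
      .reduceByKey(lambda a, b: a + b)
)

# Output prefix is illustrative; any writable s3a:// bucket/prefix will do.
counts.saveAsTextFile("s3a://path/to/your/output/wordcounts")

spark.stop()
```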
Now that we have a working Zookeeper charm, our next step is to circle back and put more cycles into the Spark charm: decouple the node types and add a relation to Zookeeper to get Spark HA master functionality and the shuffle service + shuffle storage working.
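As a rough sketch of what that buys on the client side (assuming Spark's standard Zookeeper-backed standalone HA, with placeholder master addresses), an application can point at all masters and fail over automatically:

```python
# Sketch only: with Zookeeper-backed standalone HA, a client can list every
# master and Spark will fail over between them. Addresses are placeholders.
from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = (
    SparkConf()
    .setAppName("ha_master_sketch")
    # Comma-separated list of standalone masters registered with Zookeeper.
    .setMaster("spark://<master-1>:7077,<master-2>:7077")
)
spark = SparkSession.builder.config(conf=conf).getOrCreate()
```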
Insights, comments, pull requests welcome!