Whitepaper

RAPIDS Scaling on Dell EMC PowerEdge Servers

Revision: 1.1
Issue Date: 2/4/2020

Abstract

In this project we tested Dell EMC PowerEdge servers on an end-to-end (E2E) workflow, using the New York City Taxi notebook to accelerate large-scale workloads in scale-up and scale-out solutions with the RAPIDS library from NVIDIA™ and Dask-CUDA for parallel computing with XGBoost.
Revisions

Date        Description
2/4/2020    Initial release

Authors

This paper was produced by the following:

Name                Role
Vilmara Sanchez     Dell EMC, Advanced Engineering
Bhavesh Patel       Dell EMC, Advanced Engineering

Acknowledgements

This paper was supported by the following:

Name                        Organization
Josh Anderson               Dell EMC
Robert Crovella             NVIDIA
NVIDIA Account Team         NVIDIA
NVIDIA Developer Forum      NVIDIA
NVIDIA RAPIDS Dev Team      NVIDIA
Dask Development Team       Library for dynamic task scheduling, https://dask.
Table of Contents

1 RAPIDS Overview
  1.1 XGBoost
  1.2 Dask for Parallel Computing
  1.3 Why RAPIDS, Dask and XGBoost?
Executive Summary

Traditional machine learning workflows often involve iterative and lengthy steps: preparing data, training models, validating results, and tuning models before the final solution can be deployed to production. This cycle can consume a lot of resources, hurting the productivity of development teams working toward business transformation. To accelerate this cycle, NVIDIA released the accelerated data science pipeline with RAPIDS.
1 RAPIDS Overview

RAPIDS is a GPU-accelerated data science pipeline: a collection of open-source, Python-based software libraries that accelerate the complete workflow, from data ingestion and manipulation to machine learning training. It does this by:

1. Adopting a columnar data structure, the GPU DataFrame, as the common data format across all GPU-accelerated libraries.
2.
Data Processing Evolution: In a benchmark consisting of aggregating data, the CPU becomes the bottleneck when there is too much data movement between the CPU and the GPU. RAPIDS therefore focuses on the full data science workflow and on keeping data on the GPU (using the same memory format as Apache Arrow). Reducing data movement between CPU and GPU leads to faster data processing, as shown in Figure 2.

Figure 2. Data Processing Evolution.
Figure 3. Average ranking of the ML algorithms. Source: Nvidia/ https://arxiv.org/pdf/1708.05070.pdf 1.2 Dask for Parallel Computing Dask is a distributed computation scheduler built to scale Python workloads from laptops to supercomputer clusters.
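Dask's core idea can be illustrated without any GPUs: work is expressed as a graph of deferred tasks that a scheduler then executes, on a laptop or across a cluster. The sketch below uses only the `dask` package with illustrative stand-in functions (the paper's workloads replace these with cuDF/XGBoost operations on dask-cuda workers, but the scheduling model is the same).

```python
# Minimal sketch of Dask's deferred task-graph model (CPU-only).
# load/transform/reduce_sum are hypothetical stand-ins for ETL steps.
import dask


@dask.delayed
def load(part):
    # Stand-in for reading one partition of a dataset
    return list(range(part * 3, part * 3 + 3))


@dask.delayed
def transform(rows):
    # Stand-in for a per-partition transformation
    return [r * 2 for r in rows]


@dask.delayed
def reduce_sum(parts):
    # Stand-in for a final aggregation across partitions
    return sum(sum(p) for p in parts)


# Nothing executes until .compute(); until then Dask only builds the graph
graph = reduce_sum([transform(load(i)) for i in range(4)])
result = graph.compute()
```

Calling `.compute()` hands the graph to a scheduler, which runs independent tasks in parallel; swapping the local scheduler for a distributed one (as in the multi-node tests later in this paper) requires no change to the graph-building code.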
1.3 Why RAPIDS, Dask and XGBoost?

There are several reasons to bring these tools together:

• Freedom to execute end-to-end data science and analytics pipelines entirely on GPU
• User-friendly Python interfaces
• Built on CUDA primitives
• Faster results make tuning parameters more interactive, leading to more accurate predictions and therefore more business value
• Dask provides advanced parallelism for data science pipelines at scale.
1.5 E2E NYC-Taxi Notebook

This is an end-to-end (E2E) notebook example taken from the NVIDIA rapidsai/notebooks-contrib GitHub repo. The workflow consists of three core phases performed on the NYC-Taxi dataset: Extract-Transform-Load (ETL), machine learning training, and inference. The notebook focuses on showing how to use cuDF with Dask and XGBoost to scale GPU DataFrame ETL-style operations and model training out to multiple GPUs on multiple nodes. See Figure 6 below.
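One representative feature-engineering step in the ETL phase of such taxi workloads is deriving trip distance from pickup/dropoff coordinates with the haversine formula. The pure-Python sketch below shows the logic on a single pair of points; in the actual notebook this kind of computation is applied column-wise to GPU DataFrames with cuDF. The coordinates are illustrative examples, not values from the paper.

```python
# Haversine great-circle distance: a typical derived feature in
# taxi-trip ETL. Pure Python here; cuDF applies the same math on GPU.
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


# Example: JFK airport to Times Square, roughly 21 km as the crow flies
trip_km = haversine_km(40.6413, -73.7781, 40.7580, -73.9855)
```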
2 System Configuration

Test system hardware:

• Servers:
  o C4140-M with 4x V100-SXM2-16GB
  o R940xa with 4x V100-PCIe-32GB
  o R940xa with 4x V100-PCIe-16GB
• Network connection over InfiniBand
• R740xd server hosting the NFS with the dataset for remote data

Test system software:

• Ubuntu 18.04
• Docker CE v19.03+ for Linux distribution
• RAPIDS version: 0.10.0 (Docker install)
• NVIDIA driver: 418.67+
• CUDA: 10.
3 Results on Single Node

The following section shows the results in single-node mode for each server tested.

2014-Year Dataset, SATA Data vs NVMe Data: We started with the 2014-year dataset (26.5 GB) using PowerEdge C4140-M servers with NVIDIA V100-SXM2-16GB and R940xa servers with NVIDIA V100-PCIe-16GB and V100-PCIe-32GB. The results shown below were obtained with the RMM feature disabled. See Figure 7.

Figure 7.
Remote data on NFS versus local data on SATA: Another aspect to explore was the effect of using remote data on NFS versus local data on a SATA device. To do so, we tested the C4140-M 4xV100-SXM2-16GB server with the local 2014-year data on a SATA drive, which was just 3% faster than remote data on NFS. See Figure 9.

Figure 9.
4 Results on Multi Node on C4140-M

To run RAPIDS in multi-node mode we used Dask-CUDA to extend Dask distributed with GPU support. There are different methods to set up multi-node mode depending on the target cluster; for more options, see the Dask documentation, references [5-8].
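One common pattern for this kind of multi-node setup is a standalone scheduler on the primary node with one dask-cuda worker process per GPU node, and a client in the notebook connecting to the scheduler's TCP address. The sketch below is a hypothetical illustration of that wiring; the host addresses are placeholders, not the paper's actual cluster addresses, and the cluster-side commands are shown only as comments.

```python
# Hypothetical sketch of the scheduler/worker/client wiring used in
# multi-node Dask deployments. Only the address helper runs here.


def scheduler_address(host, port=8786):
    """Build the TCP address that workers and clients connect to.

    8786 is Dask's default scheduler port (8787 serves the dashboard).
    """
    return f"tcp://{host}:{port}"


addr = scheduler_address("10.0.0.1")  # placeholder primary-node IP

# On the primary node:   dask-scheduler
# On each GPU node:      dask-cuda-worker tcp://10.0.0.1:8786
# In the notebook:
#   from dask.distributed import Client
#   client = Client(addr)   # Dask/cuDF/XGBoost work is then distributed
```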
Figure 12. Performance on Server C4140-M in Multi Node vs R940xa in Single Node

Scale Out RAPIDS on C4140-M | Faster Performance on Largest Dataset: In this section, we tested the C4140-M 4xV100-SXM2-16GB system in multi-node mode, increasing the data size gradually by year and month. The system was able to handle up to 51.7 GB of data, processing the largest dataset (2014-2015-2016 Jan-Feb, 51.7 GB) in the shortest total E2E time (126 seconds).
5 Results with System Profile in “Performance” Mode

To boost performance, the System Profile was changed in the BIOS settings from “Performance per Watt (DAPC)”, the default configuration used to run the previous tests, to “Performance” mode. As a result, performance improved by 7%-9% in terms of total E2E seconds. For instance, the E2E time on the PowerEdge C4140-M 4xV100-SXM2-16GB in multi-node mode went from 59 seconds to 55 seconds (7% faster). See Figure 14 below:

Figure 14.
Figure 16.
6 Conclusion and Future Work

We have shown how Dell EMC PowerEdge servers with NVIDIA GPUs can be used to accelerate a data science pipeline with RAPIDS. We compared performance using both NVIDIA NVLink and PCIe GPUs in scale-up and scale-out server solutions with different storage configurations.
A Dell EMC PowerEdge Server Specifications

The table below shows the technical specifications of the servers used in this paper.

            C4140-M (Primary Node)                      C4140-M (Secondary Node)                    R940xa
Server      Dell EMC PowerEdge C4140 Conf. M            Dell EMC PowerEdge C4140 Conf. M            R940xa
CPU Model   Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz    Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz    Intel(R) Xeon(R) Platinum 8180M CPU @ 2.
B Terminology

RAPIDS: Suite of software libraries, built on CUDA-X AI, that gives the freedom to execute end-to-end data science and analytics pipelines entirely on GPUs.

End-to-end workflow: Data science pipeline that includes the three phases of ETL (Extract, Transform, Load), data conversion, and training.

Dask: Open-source, freely available library that provides advanced parallelism for analytics.
C Example: GPU Activity with C4140-M in Multi Node Mode

GPU Activity on C4140-M in Multi Node Mode with RMM enabled
GPU Activity on C4140-M in Multi Node Mode with RMM disabled
D Dask Diagnostic Dashboard

The Dask diagnostic dashboard helps to understand the performance of the code running on the cluster across the workers; see the video “Dask Dashboard walkthrough” for a detailed explanation of each dashboard page [12].

Task and thread activities among the workers over time.
The Profile page allows inspecting code performance at the finest level of granularity; each horizontal bar corresponds to a function.

Dashboard – Profile page

The System page provides plots with information about resource utilization while the scheduler runs processes.

Dashboard – System page
The Workers page provides information about all the workers running on the cluster.

Dashboard – Workers page

The Info page provides more information about each worker running on the cluster.
E NVDashboard – NVIDIA GPU Dashboard

As an alternative monitoring tool, NVDashboard is an open-source package for the real-time visualization of NVIDIA GPU metrics in interactive Jupyter environments. The dashboards use pynvml to access information from the GPUs attached to the machine and display the plots in the JupyterLab environment.
GPU Dashboards: GPU Utilization, GPU Resources
PCIe Throughput, Machine Resources
F Environment Set Up

In this section we explain how to install RAPIDS via the Docker image from NVIDIA GPU Cloud (NGC), download the NYC-Taxi dataset, and pull the notebooks repo [13].

1. Review the prerequisites below before running the tests:
   a. NVIDIA Pascal™ GPU architecture or better
   b. CUDA 9.2 or 10.0+ compatible NVIDIA driver
   c. Ubuntu 16.04/18.04 or CentOS 7
   d. Docker CE v19.03+ for Linux distribution
2.
G Notebook NYC-Taxi Set Up

See below the steps to start the notebook server and the notebook example.

1. Once inside the container, start the notebook server on the host machine (this will run JupyterLab on port 8888 on the host machine):

   (rapids) root@container:/rapids/notebooks# bash utils/start-jupyter.sh

Note: To run JupyterLab on a different port, edit and modify the start-jupyter.
H RAPIDS Multi Node Set Up

1. Run as a Docker container on each node. On each node, enter the RAPIDS Docker image and start the multi-node configuration as described in the next steps. Below is an example command to enter the Docker container:

   docker run --runtime=nvidia --rm -it --net=host -p 8888:8888 -p 8787:8787 -p 8786:8786 -v /home/rapids/notebooks-contrib/:/rapids/notebooks/contrib/ -v /home/rapids/data/:/home/dell/rapids/data/ nvcr.io/nvidia/rapidsai/rapidsai:0.10-cuda10.
I BIOS Settings to Boost Performance
J Common Errors

During the tests we experienced GPU device memory issues; for more details on memory performance and the issues we encountered, please see section A, “Controlling memory usage”.
K Technical Resources

https://rapids.ai/
https://www.dellemc.com/en-us/index.htm
https://www.dell.com/support/article/us/en/19/sln311501/high-performance-computing?lang=en
https://www.dellemc.com/en-us/servers/server-accelerators.htm

K.1 Related Resources

• [1] RAPIDS Datasets Homepage. https://console.cloud.google.com/storage/browser/anaconda-public-data/nyc-taxi/csv
• [2] RMM: RAPIDS Memory Manager. https://github.com/rapidsai/rmm
• [3] Single Node Multi-GPU. https://xgboost.