Administrator Guide

8 RAPIDS Scaling on Dell EMC PowerEdge Servers
1.3 Why RAPIDS, Dask and XGBoost?
There are several reasons to bring together these tools:
- Freedom to execute end-to-end data science and analytics pipelines entirely on GPUs
- User-friendly Python interfaces
- Relies on CUDA primitives
- Faster results make parameter tuning more interactive, leading to more accurate predictions and
  therefore more business value
- Dask provides advanced parallelism for data science pipelines at scale. It works with the existing
  Python ecosystem, scaling it to multi-core machines and distributed clusters while reusing its
  familiar syntax
- cuML also supports multi-GPU and multi-node, multi-GPU operation using Dask
- XGBoost takes advantage of fast parallel processing with GPUs in both single-node and multi-node
  configurations to reduce training times
1.4 New York City (NYC) Taxi Dataset
Description:
The yellow taxi trip records include fields capturing pick-up and drop-off dates/times, pick-up and drop-off
locations, trip distances, itemized fares, rate types, payment types, and driver-reported passenger counts.
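As a rough illustration of the record layout described above, the fields can be modeled as a small Python data class. This is only a sketch: the class name `TripRecord`, every field name, and the `duration_minutes` helper are hypothetical and are not taken from the TLC schema or from this guide.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class TripRecord:
    """One yellow taxi trip. All field names here are illustrative assumptions."""
    pickup_datetime: datetime
    dropoff_datetime: datetime
    pickup_location: tuple    # assumed form: (longitude, latitude)
    dropoff_location: tuple   # assumed form: (longitude, latitude)
    trip_distance: float
    fare_amount: float        # one component of the itemized fare
    rate_type: int
    payment_type: int
    passenger_count: int      # reported by the driver

    def duration_minutes(self) -> float:
        """Trip duration derived from the pick-up and drop-off timestamps."""
        elapsed = self.dropoff_datetime - self.pickup_datetime
        return elapsed.total_seconds() / 60


# Made-up sample values, used only to exercise the class.
sample = TripRecord(
    pickup_datetime=datetime(2015, 1, 1, 8, 0),
    dropoff_datetime=datetime(2015, 1, 1, 8, 15),
    pickup_location=(-73.99, 40.75),
    dropoff_location=(-73.97, 40.76),
    trip_distance=2.1,
    fare_amount=11.5,
    rate_type=1,
    payment_type=1,
    passenger_count=2,
)
print(sample.duration_minutes())
```

Derived values such as trip duration are not stored in the records themselves; they would be computed from the timestamp fields, as the helper suggests.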
Source:
The data used in the datasets were collected and provided to the NYC Taxi and Limousine Commission
(TLC) by technology providers authorized under the Taxicab & Livery Passenger Enhancement Programs
(TPEP/LPEP). The trip data was not created by the TLC, and TLC makes no representations as to the
accuracy of these data.
Size:
The dataset used in this project contains historical records accumulated and saved in individual monthly
files from 2014 to 2016 (total: ~64 GB), with the following sizes per year:
Figure 5. NYC Taxi Dataset Size per Year (GB)
2014 dataset: 26.5
2015 dataset: 21.9
2016 dataset: 15.7
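The per-year sizes in Figure 5 can be cross-checked against the quoted total of roughly 64 GB with a few lines of Python. This is a minimal sketch: the sizes are the values from the figure, while the names `SIZES_GB`, `total_size_gb`, and `share_by_year` are introduced here purely for illustration.

```python
# Per-year dataset sizes in GB, taken from Figure 5.
SIZES_GB = {2014: 26.5, 2015: 21.9, 2016: 15.7}


def total_size_gb(sizes):
    """Return the combined size of all yearly datasets, rounded to 0.1 GB."""
    return round(sum(sizes.values()), 1)


def share_by_year(sizes):
    """Return each year's share of the combined size, as a percentage."""
    total = sum(sizes.values())
    return {year: round(100 * gb / total, 1) for year, gb in sizes.items()}


if __name__ == "__main__":
    print(total_size_gb(SIZES_GB))   # combined size in GB
    print(share_by_year(SIZES_GB))   # percentage of the total per year
```

The combined size matters mainly for capacity planning, since it indicates how much aggregate memory a cluster needs before the full dataset can be held at once.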