Administrator Guide

8 RAPIDS Scaling on Dell EMC PowerEdge Servers

1.3 Why RAPIDS, Dask and XGBoost?

There are several reasons to bring together these tools:

• Freedom to execute end-to-end data science & analytics pipelines entirely on GPU

• User-friendly Python interfaces

• Relies on CUDA primitives

• Faster results make parameter tuning more interactive, which can lead to more accurate predictions and

therefore more business value

• Dask provides advanced parallelism for data science pipelines at scale. It works with the existing

Python ecosystem, scaling it to multi-core machines and distributed clusters while keeping the same syntax

• cuML also supports multi-GPU and multi-node, multi-GPU operation through Dask

• XGBoost takes advantage of fast parallel processing with GPUs in both single-node and multi-node

configurations to reduce training times

1.4 New York City (NYC) – Taxi Dataset

Description:

The yellow taxi trip records include fields capturing pick-up and drop-off dates/times, pick-up and drop-off

locations, trip distances, itemized fares, rate types, payment types, and driver-reported passenger counts.
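As a rough illustration of the record layout described above, one trip could be modeled as shown below. This is only a sketch: the class name `TaxiTrip`, every field name, the location encoding, and the distance unit are hypothetical and are not the column names used in the TLC files. The itemized fare is reduced to two components for brevity, and the sample values are made up.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class TaxiTrip:
    """Illustrative trip record; all field names are hypothetical."""
    pickup_time: datetime
    dropoff_time: datetime
    pickup_location: tuple    # assumed encoding: (longitude, latitude)
    dropoff_location: tuple   # assumed encoding: (longitude, latitude)
    trip_distance: float      # unit assumed to be miles
    fare_amount: float        # one component of the itemized fare
    tip_amount: float         # another component of the itemized fare
    rate_type: int
    payment_type: int
    passenger_count: int      # reported by the driver

    def duration_minutes(self) -> float:
        """Trip duration derived from the pick-up and drop-off times."""
        return (self.dropoff_time - self.pickup_time).total_seconds() / 60.0


# Made-up sample values, used only to exercise the structure.
trip = TaxiTrip(
    pickup_time=datetime(2015, 1, 1, 8, 0),
    dropoff_time=datetime(2015, 1, 1, 8, 15),
    pickup_location=(-73.99, 40.75),
    dropoff_location=(-73.97, 40.76),
    trip_distance=2.1,
    fare_amount=10.0,
    tip_amount=2.0,
    rate_type=1,
    payment_type=1,
    passenger_count=2,
)

print(trip.duration_minutes())
```

Derived values such as trip duration are not stored in the records; they would be computed from the pick-up and drop-off timestamps, as `duration_minutes` does here.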

Source:

The data used in the datasets were collected and provided to the NYC Taxi and Limousine Commission

(TLC) by technology providers authorized under the Taxicab & Livery Passenger Enhancement Programs

(TPEP/LPEP). The trip data was not created by the TLC, and TLC makes no representations as to the

accuracy of these data.

Size:

The dataset used in this project contains historical records stored in individual monthly

files from 2014 to 2016 (total: ~64 GB), with the following sizes per year:

Figure 5. NYC Taxi Dataset Size per Year (GB)

• 2014: 26.5

• 2015: 21.9

• 2016: 15.7
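The per-year sizes in Figure 5 can be cross-checked against the stated total of roughly 64 GB with a few lines of arithmetic. The sketch below uses only the three values from the figure; the variable names are illustrative.

```python
# Per-year dataset sizes in GB, as reported in Figure 5.
size_gb_by_year = {2014: 26.5, 2015: 21.9, 2016: 15.7}

# Sum the yearly sizes; round to one decimal to avoid floating-point noise.
total_gb = round(sum(size_gb_by_year.values()), 1)

# Share of the total contributed by each year, as a percentage.
share_pct_by_year = {
    year: round(100.0 * size / total_gb, 1)
    for year, size in size_gb_by_year.items()
}

print(f"Total: {total_gb} GB")
for year, share in share_pct_by_year.items():
    print(f"{year}: {size_gb_by_year[year]} GB ({share}% of total)")
```

The sum comes to 64.1 GB, which is consistent with the approximate total quoted in the text.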