2nd Edition. — James Reinders, Ben Ashbaugh, James Brodman, Michael Kinsner, John Pennycook, Xinmin Tian. — Apress Media LLC., 2023. — 648 p. — ISBN-13: 978-1-4842-9691-2.
Learn how to accelerate C++ programs using data parallelism and SYCL.
This book enables C++ programmers to be at the forefront of this exciting and important development that is helping to push computing to new levels. This updated second edition is full of practical advice, detailed explanations, and code examples to illustrate key topics.
SYCL enables access to parallel resources in modern accelerated heterogeneous systems. Now, a single C++ application can use any combination of devices (including GPUs, CPUs, FPGAs, and ASICs) that are suitable to the problems at hand.
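To make that concrete, here is a minimal sketch (ours, not an excerpt from the book) of binding queues to different device types; it assumes a SYCL 2020 compiler and that both a CPU and a GPU device are present at run time:

#include <sycl/sycl.hpp>
#include <iostream>

int main() {
  // Each queue is bound to exactly one device; these selectors are standard SYCL 2020.
  sycl::queue cpu_q{sycl::cpu_selector_v};  // throws if no CPU device is found
  sycl::queue gpu_q{sycl::gpu_selector_v};  // throws if no GPU device is found

  std::cout << "CPU: "
            << cpu_q.get_device().get_info<sycl::info::device::name>() << '\n'
            << "GPU: "
            << gpu_q.get_device().get_info<sycl::info::device::name>() << '\n';
}

Work submitted to each queue executes on that queue's device, which is how a single application mixes device types.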
This book teaches data-parallel programming using C++ with SYCL and walks through everything needed to program accelerated systems. The book begins by introducing data parallelism and foundational topics for effective use of SYCL. Later chapters cover advanced topics, including error handling, hardware-specific programming, communication and synchronization, and memory model considerations.
Computer hardware development is driven by our need to solve larger and more complex problems, but those hardware advances are largely useless unless programmers like you and me have languages that allow us to implement our ideas and exploit the available power with reasonable effort. There are numerous examples of amazing hardware, and the first solutions to use them have often been proprietary, since that saves the time of waiting for committees to agree on standards. In the history of computing, however, such solutions have always ended in vendor lock-in, unable to compete with open standards that let developers target any hardware and share code, because ultimately the resources of the worldwide community and ecosystem are far greater than those of any individual vendor, not to mention that open software standards drive hardware competition.
If you are new to parallel programming, that is okay. If you have never heard of SYCL or the DPC++ compiler, that is also okay. Compared with programming in CUDA, C++ with SYCL offers portability beyond NVIDIA and portability beyond GPUs, plus tight alignment with modern C++ as it evolves. C++ with SYCL offers these advantages without sacrificing performance, and it allows us to accelerate our applications by harnessing the combined capabilities of CPUs, GPUs, FPGAs, and processing devices of the future without being tied to any one vendor.
SYCL is an industry-driven Khronos Group standard adding advanced support for data parallelism with C++ to exploit accelerated (heterogeneous) systems. SYCL provides mechanisms for C++ compilers that are highly synergistic with C++ and C++ build systems. DPC++ is an open source compiler project, based on LLVM, that adds SYCL support. All examples in this book should work with any C++ compiler supporting SYCL 2020, including the DPC++ compiler. If you are a C programmer who is not well versed in C++, you are in good company: several of the authors of this book happily share that they picked up much of their C++ by reading books, like this one, that use it. With a little patience, this book should be approachable by any C programmer with a desire to write modern C++ programs.
All source code for the examples used in this book is freely available on GitHub. The examples are written in modern SYCL and are regularly updated to ensure compatibility with multiple compilers.
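As a taste of the style (a minimal sketch of our own in SYCL 2020, not one of the book's samples), the following program doubles N integers in parallel using unified shared memory:

#include <sycl/sycl.hpp>
#include <iostream>

int main() {
  sycl::queue q;  // bound to a default device chosen by the runtime

  constexpr size_t N = 16;
  // Unified shared memory: one allocation visible to both host and device.
  int *data = sycl::malloc_shared<int>(N, q);
  for (size_t i = 0; i < N; ++i) data[i] = static_cast<int>(i);

  // A basic data-parallel kernel: one work-item per element.
  q.parallel_for(sycl::range<1>{N}, [=](sycl::id<1> i) { data[i] *= 2; }).wait();

  for (size_t i = 0; i < N; ++i) std::cout << data[i] << ' ';
  std::cout << '\n';

  sycl::free(data, q);
}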
What You Will Learn:
Accelerate C++ programs using data-parallel programming
Use SYCL and C++ compilers that support SYCL
Write portable code for accelerators that is vendor and device agnostic
Optimize code to improve performance for specific accelerators
Be poised to benefit as new accelerators appear from many vendors
Who This Book Is For:
Those new to data-parallel programming, and computer programmers interested in data-parallel programming using C++.
"This book, now in is second edition, is the premier resource to learn SYCL 2020 and is the ONLY book you need to become part of this community." Erik Lindahl, GROMACS and Stockholm University
About the Authors
Preface
Foreword
Acknowledgments
Introduction
Read the Book, Not the Spec
SYCL 2020 and DPC++
Why Not CUDA?
Why Standard C++ with SYCL?
Getting a C++ Compiler with SYCL Support
Hello, World! and a SYCL Program Dissection
Queues and Actions
It Is All About Parallelism
Throughput
Latency
Think Parallel
Amdahl and Gustafson
Scaling
Heterogeneous Systems
Data-Parallel Programming
Key Attributes of C++ with SYCL
Single-Source
Host
Devices
Kernel Code
Asynchronous Execution
Race Conditions When We Make a Mistake
Deadlock
C++ Lambda Expressions
Functional Portability and Performance Portability
Concurrency vs. Parallelism
Summary
Where Code Executes
Single-Source
Host Code
Device Code
Choosing Devices
Method #1: Run on a Device of Any Type
Queues
Binding a Queue to a Device When Any Device Will Do
Method #2: Using a CPU Device for Development, Debugging, and Deployment
Method #3: Using a GPU (or Other Accelerators)
Accelerator Devices
Device Selectors
Method #4: Using Multiple Devices
Method #5: Custom (Very Specific) Device Selection
Selection Based on Device Aspects
Selection Through a Custom Selector
Creating Work on a Device
Introducing the Task Graph
Where Is the Device Code?
Actions
Host Tasks
Summary
Data Management
Introduction
The Data Management Problem
Device Local vs. Device Remote
Managing Multiple Memories
Explicit Data Movement
Implicit Data Movement
Selecting the Right Strategy
USM, Buffers, and Images
Unified Shared Memory
Accessing Memory Through Pointers
USM and Data Movement
Buffers
Creating Buffers
Accessing Buffers
Access Modes
Ordering the Uses of Data
In-order Queues
Out-of-Order Queues
Choosing a Data Management Strategy
Handler Class: Key Members
Summary
Expressing Parallelism
Parallelism Within Kernels
Loops vs. Kernels
Multidimensional Kernels
Overview of Language Features
Separating Kernels from Host Code
Different Forms of Parallel Kernels
Basic Data-Parallel Kernels
Understanding Basic Data-Parallel Kernels
Writing Basic Data-Parallel Kernels
Details of Basic Data-Parallel Kernels
Explicit ND-Range Kernels
Understanding Explicit ND-Range Parallel Kernels
Writing Explicit ND-Range Data-Parallel Kernels
Details of Explicit ND-Range Data-Parallel Kernels
Mapping Computation to Work-Items
One-to-One Mapping
Many-to-One Mapping
Choosing a Kernel Form
Summary
Error Handling
Safety First
Types of Errors
Let’s Create Some Errors
Synchronous Error
Asynchronous Error
Application Error Handling Strategy
Ignoring Error Handling
Synchronous Error Handling
Asynchronous Error Handling
The Asynchronous Handler
Invocation of the Handler
Errors on a Device
Summary
Unified Shared Memory
Why Should We Use USM?
Allocation Types
Device Allocations
Host Allocations
Shared Allocations
Allocating Memory
What Do We Need to Know?
Multiple Styles
Deallocating Memory
Allocation Example
Data Management
Initialization
Data Movement
Queries
One More Thing
Summary
Buffers
Buffers
Buffer Creation
What Can We Do with a Buffer?
Accessors
Accessor Creation
What Can We Do with an Accessor?
Summary
Scheduling Kernels and Data Movement
What Is Graph Scheduling?
How Graphs Work in SYCL
Command Group Actions
How Command Groups Declare Dependences
Examples
When Are the Parts of a Command Group Executed?
Data Movement
Explicit Data Movement
Implicit Data Movement
Synchronizing with the Host
Summary
Communication and Synchronization
Work-Groups and Work-Items
Building Blocks for Efficient Communication
Synchronization via Barriers
Work-Group Local Memory
Using Work-Group Barriers and Local Memory
Work-Group Barriers and Local Memory in ND-Range Kernels
Sub-Groups
Synchronization via Sub-Group Barriers
Exchanging Data Within a Sub-Group
A Full Sub-Group ND-Range Kernel Example
Group Functions and Group Algorithms
Broadcast
Votes
Shuffles
Summary
Defining Kernels
Why Three Ways to Represent a Kernel?
Kernels as Lambda Expressions
Elements of a Kernel Lambda Expression
Identifying Kernel Lambda Expressions
Kernels as Named Function Objects
Elements of a Kernel Named Function Object
Kernels in Kernel Bundles
Interoperability with Other APIs
Summary
Vectors and Math Arrays
The Ambiguity of Vector Types
Our Mental Model for SYCL Vector Types
Math Array (marray)
Vector (vec)
Loads and Stores
Interoperability with Backend-Native Vector Types
Swizzle Operations
How Vector Types Execute
Vectors as Convenience Types
Vectors as SIMD Types
Summary
Device Information and Kernel Specialization
Is There a GPU Present?
Refining Kernel Code to Be More Prescriptive
How to Enumerate Devices and Capabilities
Aspects
Custom Device Selector
Being Curious: get_info
Being More Curious: Detailed Enumeration Code
Very Curious: get_info plus has()
Device Information Descriptors
Device-Specific Kernel Information Descriptors
The Specifics: Those of “Correctness”
Device Queries
Kernel Queries
The Specifics: Those of “Tuning/Optimization”
Device Queries
Kernel Queries
Runtime vs. Compile-Time Properties
Kernel Specialization
Summary
Practical Tips
Getting the Code Samples and a Compiler
Online Resources
Platform Model
Multiarchitecture Binaries
Compilation Model
Contexts: Important Things to Know
Adding SYCL to Existing C++ Programs
Considerations When Using Multiple Compilers
Debugging
Debugging Deadlock and Other Synchronization Issues
Debugging Kernel Code
Debugging Runtime Failures
Queue Profiling and Resulting Timing Capabilities
Tracing and Profiling Tools Interfaces
Initializing Data and Accessing Kernel Outputs
Multiple Translation Units
Performance Implication of Multiple Translation Units
When Anonymous Lambdas Need Names
Summary
Common Parallel Patterns
Understanding the Patterns
Map
Stencil
Reduction
Scan
Pack and Unpack
Using Built-In Functions and Libraries
The SYCL Reduction Library
Group Algorithms
Direct Programming
Map
Stencil
Reduction
Scan
Pack and Unpack
Summary
For More Information
Programming for GPUs
Performance Caveats
How GPUs Work
GPU Building Blocks
Simpler Processors (but More of Them)
Simplified Control Logic (SIMD Instructions)
Switching Work to Hide Latency
Offloading Kernels to GPUs
SYCL Runtime Library
GPU Software Drivers
GPU Hardware
Beware the Cost of Offloading
GPU Kernel Best Practices
Accessing Global Memory
Accessing Work-Group Local Memory
Avoiding Local Memory Entirely with Sub-Groups
Optimizing Computation Using Small Data Types
Optimizing Math Functions
Specialized Functions and Extensions
Summary
For More Information
Programming for CPUs
Performance Caveats
The Basics of Multicore CPUs
The Basics of SIMD Hardware
Exploiting Thread-Level Parallelism
Thread Affinity Insight
Be Mindful of First Touch to Memory
SIMD Vectorization on CPU
Ensure SIMD Execution Legality
SIMD Masking and Cost
Avoid Array of Struct for SIMD Efficiency
Data Type Impact on SIMD Efficiency
SIMD Execution Using single_task
Summary
Programming for FPGAs
Performance Caveats
How to Think About FPGAs
Pipeline Parallelism
Kernels Consume Chip “Area”
When to Use an FPGA
Lots and Lots of Work
Custom Operations or Operation Widths
Scalar Data Flow
Low Latency and Rich Connectivity
Customized Memory Systems
Running on an FPGA
Compile Times
The FPGA Emulator
FPGA Hardware Compilation Occurs “Ahead-of-Time”
Writing Kernels for FPGAs
Exposing Parallelism
Keeping the Pipeline Busy Using ND-Ranges
Pipelines Do Not Mind Data Dependences
Spatial Pipeline Implementation of a Loop
Loop Initiation Interval
Pipes
Custom Memory Systems
Some Closing Topics
FPGA Building Blocks
Clock Frequency
Summary
Libraries
Built-In Functions
Use the sycl:: Prefix with Built-In Functions
The C++ Standard Library
oneAPI DPC++ Library (oneDPL)
SYCL Execution Policy
Using oneDPL with Buffers
Using oneDPL with USM
Error Handling with SYCL Execution Policies
Summary
Memory Model and Atomics
What’s in a Memory Model?
Data Races and Synchronization
Barriers and Fences
Atomic Operations
Memory Ordering
The Memory Model
The memory_order Enumeration Class
The memory_scope Enumeration Class
Querying Device Capabilities
Barriers and Fences
Atomic Operations in SYCL
Using Atomics with Buffers
Using Atomics with Unified Shared Memory
Using Atomics in Real Life
Computing a Histogram
Implementing Device-Wide Synchronization
Summary
For More Information
Backend Interoperability
What Is Backend Interoperability?
When Is Backend Interoperability Useful?
Adding SYCL to an Existing Codebase
Using Existing Libraries with SYCL
Using Backend Interoperability for Kernels
Interoperability with API-Defined Kernel Objects
Interoperability with Non-SYCL Source Languages
Backend Interoperability Hints and Tips
Choosing a Device for a Specific Backend
Be Careful About Contexts
Access Low-Level API-Specific Features
Support for Other Backends
Summary
Migrating CUDA Code
Design Differences Between CUDA and SYCL
Multiple Targets vs. Single Device Targets
Aligning to C++ vs. Extending C
Terminology Differences Between CUDA and SYCL
Similarities and Differences
Execution Model
Memory Model
Other Differences
Features in CUDA That Aren’t in SYCL… Yet
Global Variables
Cooperative Groups
Matrix Multiplication Hardware
Porting Tools and Techniques
Migrating Code with dpct and SYCLomatic
Summary
For More Information
Epilogue: Future Direction of SYCL
Index