Module also offered within study programmes:
General information:
Name:
Advanced topics of multi- and many-core software design
Course of study:
2018/2019
Code:
JIS-1-038-s
Faculty of:
Physics and Applied Computer Science
Study level:
First-cycle studies
Specialty:
-
Field of study:
Applied Computer Science
Semester:
0
Profile of education:
Academic (A)
Lecture language:
English
Responsible teacher:
dr hab. inż., prof. AGH Szumlak Tomasz (szumlak@agh.edu.pl)
Academic teachers:
dr hab. inż., prof. AGH Szumlak Tomasz (szumlak@agh.edu.pl)
Module summary

This lecture aims to provide advanced knowledge and skills in parallel programming on GPUs. We concentrate mainly on CUDA.

Description of learning outcomes for module
MLO code | Student after module completion has the knowledge/knows how to/is able to | Connections with FLO | Method of learning outcomes verification (form of completion)
Social competence
M_K001 | A student can present his/her results and discuss them. | IS1A_K01 | Activity during classes
Skills
M_U001 | A student can work as part of a team and interact properly with his/her team-mates. | IS1A_U05, IS1A_U01, IS1A_U04 | Project
M_U002 | A student can write complete programs using extended C/C++. Can use the professional framework provided by NVIDIA. Can proficiently use an IDE, the NVIDIA compiler and the debugger. | IS1A_U06, IS1A_U05 | Test, Project
Knowledge
M_W001 | A student gains knowledge of advanced topics related to massively parallel programming using GPUs. | IS1A_W01, IS1A_W03, IS1A_W02 | Activity during classes
FLO matrix in relation to forms of classes
MLO code | Student after module completion has the knowledge/knows how to/is able to | Form of classes
Social competence
M_K001 | A student can present his/her results and discuss them. | Project classes
Skills
M_U001 | A student can work as part of a team and interact properly with his/her team-mates. | Lecture, Lab. classes, Project classes
M_U002 | A student can write complete programs using extended C/C++. Can use the professional framework provided by NVIDIA. Can proficiently use an IDE, the NVIDIA compiler and the debugger. | Lab. classes, Project classes
Knowledge
M_W001 | A student gains knowledge of advanced topics related to massively parallel programming using GPUs. | Lecture, Lab. classes, Project classes
The remaining forms of classes (auditorium classes, conversation seminar, seminar classes, practical classes, field classes, workshop classes, others, e-learning) are not used in this module.
Module content
Lectures:
  1. Introduction (2 h)

    Background and explanation of the main topics of the advanced CUDA design lectures. Why we should care about learning and understanding parallel design patterns.

  2. Convolution (3 h)

    One of the most popular patterns used in high-performance computing. We try to understand what convolution is and how to apply and adapt it to your needs. This technique, also dubbed stencil computation, is not only tremendously useful for image and video processing; it is also very useful for solving differential equations and for simulation. We need to discuss the non-trivial aspect of input data being shared among the output elements. Advanced tiling algorithms must be used to stage and process the data (see the first sketch after this list).

  3. Atomic operations and privatisation (4 h)

    At the beginning of each CUDA course, we first learn algorithms that isolate the results of computation – in other words, the tasks are designed so that each output element is always assigned to one particular thread. We call this pattern the owner-computes rule: each thread writes into its own bit of memory and there is no concern about potentially destructive interference from other threads. Now, we can easily imagine a task where each output data element may be updated by any thread performing calculations. This generates the need to coordinate thread execution and make the update process safe – we do not want the output data to be corrupted! As an example, we discuss the histogram pattern (see the second sketch after this list). We also learn how important it is to understand your hardware…

  4. Introduction to sorting on GPUs (4 h)

    There is no need to convince you that sorting is a very popular and much-needed task. Implementing efficient sorting on a GPU is a pain, and very often in practice the data are shipped back to the host to run a single-threaded or concurrent sorting algorithm there. Depending on the particular problem, we can try to design a sorting algorithm around a parallel pattern called the ordered merge operation. Using this pattern we can then build a fully-fledged sort that runs on the GPU.

  5. CUDA dynamic parallelism revisited (2 h)

    This feature (dynamic parallelism) is an extension of the CUDA model. It became available with the advent of the Kepler architecture and allows a kernel to launch new kernels, each with its own grid of threads. This is a precious capability when implementing more complicated GPU algorithms, which would otherwise require many data transfers between the device and the host and would overload the CPU. Learning how to exploit this feature is a lot of fun!
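
To make the tiling idea from lecture 2 concrete, here is a minimal CUDA sketch of a tiled 1D convolution kernel. It is an illustration only, not the course reference code; TILE_SIZE, MAX_MASK_WIDTH and all identifiers are made-up names, and the mask is assumed to sit in constant memory (copied there beforehand with cudaMemcpyToSymbol). It assumes an odd mask_width no larger than MAX_MASK_WIDTH and a launch with blockDim.x == TILE_SIZE.

    #include <cuda_runtime.h>

    #define TILE_SIZE      256   // launch with blockDim.x == TILE_SIZE
    #define MAX_MASK_WIDTH 7     // odd-sized mask, illustrative choice

    // Read-only convolution mask in constant memory: cached and broadcast
    // when all threads of a warp read the same element.
    __constant__ float d_mask[MAX_MASK_WIDTH];

    // Each block stages TILE_SIZE input elements plus halo cells on both
    // sides into shared memory, then every thread computes one output
    // element from the staged tile.
    __global__ void conv1d_tiled(const float *in, float *out,
                                 int n, int mask_width)
    {
        __shared__ float tile[TILE_SIZE + MAX_MASK_WIDTH - 1];

        int halo = mask_width / 2;
        int gid  = blockIdx.x * blockDim.x + threadIdx.x;

        // Stage the centre of the tile (ghost zeros past the array end).
        tile[threadIdx.x + halo] = (gid < n) ? in[gid] : 0.0f;

        // The first 'halo' threads also load the left and right halo cells.
        if (threadIdx.x < halo) {
            int left  = gid - halo;
            int right = gid + blockDim.x;
            tile[threadIdx.x] = (left >= 0) ? in[left] : 0.0f;
            tile[threadIdx.x + halo + blockDim.x] =
                (right < n) ? in[right] : 0.0f;
        }
        __syncthreads();   // the whole tile must be staged before reading

        if (gid < n) {
            float acc = 0.0f;
            for (int j = 0; j < mask_width; ++j)
                acc += tile[threadIdx.x + j] * d_mask[j];
            out[gid] = acc;
        }
    }

The point of the tile is input reuse: each input element is fetched from global memory once per block and then served from shared memory to up to mask_width output elements.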

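The histogram example from lecture 3 can be sketched in a few lines as well. Again, this is a hypothetical, simplified kernel (NUM_BINS and all names are illustrative), assuming 8-bit input values so that the bin index is simply the byte value:

    #include <cuda_runtime.h>

    #define NUM_BINS 256   // one bin per byte value, illustrative choice

    // Privatisation: each block accumulates into its own shared-memory
    // copy of the histogram (fast atomics, no inter-block contention)
    // and merges it into the global histogram only at the end.
    __global__ void histogram_private(const unsigned char *data, int n,
                                      unsigned int *global_hist)
    {
        __shared__ unsigned int local_hist[NUM_BINS];

        // Cooperatively zero the block-private copy.
        for (int bin = threadIdx.x; bin < NUM_BINS; bin += blockDim.x)
            local_hist[bin] = 0;
        __syncthreads();

        // Grid-stride loop: an interleaved partitioning of the input,
        // so consecutive threads read consecutive elements (coalesced).
        int stride = blockDim.x * gridDim.x;
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
            atomicAdd(&local_hist[data[i]], 1u);
        __syncthreads();

        // Aggregation: NUM_BINS global atomics per block instead of one
        // global atomic per input element.
        for (int bin = threadIdx.x; bin < NUM_BINS; bin += blockDim.x)
            atomicAdd(&global_hist[bin], local_hist[bin]);
    }
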
Laboratory classes:
  1. Refresher and convolution (6 h)

    First, we do a short crash course to remind you how to go about using CUDA. Then we start working with the convolution pattern:
    we introduce a basic convolution algorithm and then add tiling.

  2. Histogram pattern (6 h)

    We start with atomic operations, and then we play a bit with the difference between block and interleaved partitioning of the input among threads. We also look in detail at privatisation and proper aggregation.

  3. Can I make sorting work on a GPU? (6 h)

    Sequential and parallel merge algorithms: simple merge kernels, then upgraded ones using the tiling technique. We also look at a special and somewhat more sophisticated approach with circular-buffer kernels (see the first sketch after this list).

  4. Dynamic parallelism (4 h)

    This part is about delving into your hardware's features and using them to the limit. We look in detail at advanced memory management, synchronisation, streams and events (see the second sketch after this list).
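
As a taste of the merge-based sorting classes, below is a minimal, unoptimised sketch of the ordered-merge building block in the spirit of the Kirk & Hwu textbook: a co-rank binary search plus a per-thread sequential merge. The tiled and circular-buffer refinements discussed in class are deliberately left out, and all identifiers are illustrative.

    #include <cuda_runtime.h>

    // Sequential merge of two sorted runs; each thread runs this on its slice.
    __device__ void merge_sequential(const int *A, int m,
                                     const int *B, int n, int *C)
    {
        int i = 0, j = 0, k = 0;
        while (i < m && j < n)
            C[k++] = (A[i] <= B[j]) ? A[i++] : B[j++];
        while (i < m) C[k++] = A[i++];
        while (j < n) C[k++] = B[j++];
    }

    // Co-rank: for output position k, how many elements come from A?
    // Binary search over the valid split points (ties are taken from A).
    __device__ int co_rank(int k, const int *A, int m, const int *B, int n)
    {
        int i = min(k, m);
        int j = k - i;
        int i_low = max(0, k - n);
        int j_low = max(0, k - m);
        while (true) {
            if (i > 0 && j < n && A[i - 1] > B[j]) {
                int delta = (i - i_low + 1) >> 1;   // taking too many from A
                j_low = j; j += delta; i -= delta;
            } else if (j > 0 && i < m && B[j - 1] >= A[i]) {
                int delta = (j - j_low + 1) >> 1;   // taking too many from B
                i_low = i; i += delta; j -= delta;
            } else {
                return i;
            }
        }
    }

    // Basic parallel merge: every thread computes its output range, finds
    // the matching input ranges via co-rank, and merges them sequentially.
    __global__ void merge_basic(const int *A, int m,
                                const int *B, int n, int *C)
    {
        int tid   = blockIdx.x * blockDim.x + threadIdx.x;
        int total = gridDim.x * blockDim.x;
        int per_thread = (m + n + total - 1) / total;      // ceiling division
        int k_curr = min(tid * per_thread, m + n);
        int k_next = min(k_curr + per_thread, m + n);
        int i_curr = co_rank(k_curr, A, m, B, n);
        int i_next = co_rank(k_next, A, m, B, n);
        int j_curr = k_curr - i_curr;
        int j_next = k_next - i_next;
        merge_sequential(&A[i_curr], i_next - i_curr,
                         &B[j_curr], j_next - j_curr, &C[k_curr]);
    }

A full GPU merge sort then applies this merge level by level, doubling the length of the sorted runs on each pass.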

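Finally, a minimal sketch of device-side kernel launching (dynamic parallelism). The parent/child split and the offsets array are invented for illustration; the pattern is simply "a kernel decides on the device how much work to spawn":

    #include <cuda_runtime.h>

    // Child kernel: processes one chunk of variable length.
    __global__ void child(const float *chunk, int len)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < len) {
            // ... per-element work on chunk[i] ...
        }
    }

    // Parent kernel: one thread per chunk sizes and launches a child grid
    // on the device, with no round trip to the host.
    __global__ void parent(const float *data, const int *offsets,
                           int num_chunks)
    {
        int c = blockIdx.x * blockDim.x + threadIdx.x;
        if (c < num_chunks) {
            int begin = offsets[c];
            int len   = offsets[c + 1] - begin;
            if (len > 0) {
                int threads = 128;
                int blocks  = (len + threads - 1) / threads;
                // Device-side launch: needs compute capability >= 3.5
                // (Kepler) and compilation with nvcc -rdc=true -lcudadevrt.
                child<<<blocks, threads>>>(&data[begin], len);
            }
        }
    }
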
Project classes:
To be defined!

Concrete topics very much depend on the students! If you are interested in GPUs, you can get a more ambitious project. The class will be split into groups (2 to 4 students). Each group will be asked to select a team leader and work together to finish the project. The topics will be handed out after we finish the lectures – you are going to have plenty of time to complete them comfortably.

Student workload (ECTS credits balance)
Student activity form Student workload
Summary student workload 129 h
Module ECTS credits 5 ECTS
Participation in laboratory classes 22 h
Participation in project classes 8 h
Completion of a project 35 h
Preparation for classes 20 h
Realization of independently performed tasks 25 h
Examination or Final test 2 h
Contact hours 2 h
Participation in lectures 15 h
Additional information
Method of calculating the final grade:

The final grade depends on your performance during the computer laboratories and on the final project:
final_grade = 0.5 * lab_grade + 0.5 * project_grade. NOTE! You need to pass both the labs and the project to get an overall passing grade!

Prerequisites and additional requirements:

The requirements are minimal: basic skills in using the Linux OS and programming in the C language.

Recommended literature and teaching resources:

The Departmental Library (Physics and Applied Computer Science) is quite well stocked with books pertaining to parallel programming. The following can be used to study the subject:
1. "CUDA for Engineers: An Introduction to High-Performance Parallel Computing"
Duane Storti, Mete Yurtoglu
Addison Wesley, ISBN-13: 978-0134177410

2. "CUDA Programming: A Developer's Guide to Parallel Computing with GPUs (Applications of GPU Computing Series)"
Shane Cook
Morgan Kaufmann, ISBN-13: 978-0124159334

3. "Programming Massively Parallel Processors: A Hands-on Approach"
David B. Kirk, Wen-mei W. Hwu
Morgan Kaufmann; 2nd edition (14 Dec. 2012), ISBN-13: 978-0124159921

Scientific publications of module course instructors related to the topic of the module:

1. The LHCb Collaboration, “Measurement of the track reconstruction efficiency at LHCb”, JINST 10 (2015) P02007
2. The LHCb VELO Group, “Performance of the LHCb Vertex Locator”, JINST 9 (2014) P09007
3. The LHCb VELO Group, “Radiation damage in the LHCb Vertex Locator”, JINST 8 (2013) P08002

Additional information:

The labs will be quite aggressively paced – please attend them! These classes are compulsory and you are allowed to skip only one lab during the whole semester. Also, it will be very hard to catch up if you miss a class.