Course Topic Outline
1. Introduction
-Moore's Law; why computers are parallel
-Supercomputers today
-Parallel programming challenges
-Performance measures - strong scaling vs. weak scaling
2. Computer Architectures
-Single Processor Machines
-Idealized and actual costs in modern processors
-the memory hierarchy, caches
-parallelism within a processor
-pipelining
-SIMD
-Case study: matrix multiply
-Costs/computational intensity for naive version, blocked version
-Communication lower bounds for matrix multiply
-BLAS2 vs. BLAS3
-Parallel machines and programming models
-shared memory
-message passing/distributed memory
-data parallel
-Flynn's taxonomy
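As a companion to the matrix multiply case study above, here is a minimal pure-Python sketch of the blocked algorithm. The point is the loop structure, not speed: the blocked order reuses each b-by-b block many times, raising computational intensity from O(1) to O(b) flops per word moved between memory and cache (the block size b is assumed to be chosen so three b-by-b blocks fit in cache).

```python
def matmul_blocked(A, B, b):
    """C = A @ B for square n x n lists-of-lists, with block size b.

    The three outer loops walk over b-by-b blocks; the three inner loops
    multiply one pair of blocks. With b = n this degenerates to the naive
    triple loop (computational intensity ~2 flops per word); smaller b
    keeps the working set in cache and reuses each block ~b times.
    """
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, b):
        for j0 in range(0, n, b):
            for k0 in range(0, n, b):
                # multiply block A[i0:, k0:] into C[i0:, j0:]
                for i in range(i0, min(i0 + b, n)):
                    for k in range(k0, min(k0 + b, n)):
                        a = A[i][k]
                        for j in range(j0, min(j0 + b, n)):
                            C[i][j] += a * B[k][j]
    return C
```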
3. Shared memory programming
-OpenMP basics
-fork-join parallelism
-typical bugs and performance issues
-Memory consistency and cache-coherence
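OpenMP itself is a C/C++/Fortran API, so the following is only a language-neutral Python analogue of the fork-join pattern: the main thread forks a team of workers over chunks of the iteration space, and execution joins when the pool finishes. The per-worker partial sums mirror an OpenMP `reduction(+:total)` clause and avoid the classic shared-counter data race, one of the "typical bugs" above.

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_sum(data, num_threads=4):
    """Fork-join reduction: private partial sums, combined after the join."""
    chunk = (len(data) + num_threads - 1) // num_threads

    def work(t):
        # each worker computes a private partial sum over its chunk
        return sum(data[t * chunk:(t + 1) * chunk])

    with ThreadPoolExecutor(max_workers=num_threads) as pool:  # fork
        partials = list(pool.map(work, range(num_threads)))    # join
    # reduce the private partials instead of racing on one shared total
    return sum(partials)
```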
4. Sources of parallelism and locality in simulations
-Discrete event systems
-synchronous vs. asynchronous
-Particle systems
-external vs. nearby vs. far-field forces and computational/parallelization approaches for each
-Lumped variables (ODEs)
-PDEs
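To make the PDE entry above concrete, here is a hedged sketch of one explicit (Jacobi-style) time step of the 1D heat equation. Each interior point depends only on its two neighbors, which is exactly the locality that lets the domain be split into contiguous chunks per processor, with one "ghost" value exchanged across each boundary.

```python
def heat_step(u, alpha=0.25):
    """One explicit step of u_t = u_xx on a 1D grid.

    Endpoints are held fixed (Dirichlet boundary); each interior point
    reads only its immediate neighbors, so a parallel version needs just
    one ghost value per partition boundary per step.
    """
    return [u[0]] + [
        u[i] + alpha * (u[i - 1] - 2 * u[i] + u[i + 1])
        for i in range(1, len(u) - 1)
    ] + [u[-1]]
```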
5. Performance Modeling and MPI Communication
-Distributed memory architectures
-properties of communication networks (diameter, bisection BW)
-network topologies
-linear, ring, mesh, torus, hypercube, tree, butterfly
-Performance models
-PRAM
-Bulk synchronous parallel (BSP)
-LogP
-alpha-beta-gamma model
-MPI
-Basic communication in MPI
-Collective communication in MPI
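A small sketch of the alpha-beta performance model applied to a collective: a message of n words costs alpha + beta*n (alpha = per-message latency, beta = per-word transfer time). Under that model, a naive root-only broadcast among p processes takes p-1 sequential sends, while a binomial-tree broadcast takes only ceil(log2 p) rounds.

```python
import math

def naive_bcast_time(p, n, alpha, beta):
    """Root sends the n-word message to each of the other p-1 processes."""
    return (p - 1) * (alpha + beta * n)

def tree_bcast_time(p, n, alpha, beta):
    """Binomial tree: the set of holders doubles each round."""
    return math.ceil(math.log2(p)) * (alpha + beta * n)
```

Plugging in p = 8, n = 100 words, alpha = 1.0, beta = 0.01 gives 14.0 vs. 6.0 time units, showing why MPI collectives use tree (or pipelined) schedules internally.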
6. Dense Linear Algebra
-Communication lower bounds and derivation
-Parallel data layouts for matrices
-Parallel dense matrix-vector products
-Parallel matrix multiply
-SUMMA algorithm
-Parallel LU factorization
-best data layout
-Parallel QR factorization
-tournament pivoting idea
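The "best data layout" entry above refers to the 2D block-cyclic layout; here is a sketch of its ownership map. Global entry (i, j) lands on a pr-by-pc process grid with nb-by-nb blocks dealt out cyclically in each dimension, so as LU elimination shrinks the trailing submatrix, every process still owns part of it and stays busy.

```python
def owner(i, j, nb, pr, pc):
    """Process-grid coordinates (pi, pj) owning global matrix entry (i, j)
    in a 2D block-cyclic layout with block size nb on a pr x pc grid."""
    return ((i // nb) % pr, (j // nb) % pc)
```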
7. Sparse Linear Algebra
-sparse matrix formats and basic SpMV
-dependency graph, lower bounds
-sequential optimizations
-types of blocking (register, cache)
-reordering
-distributed memory optimizations
-partitioning problem
-higher level kernels
-sparse matrix multiply
-matrix powers computations
-iterative solvers
-pipelined Krylov subspace methods
-communication-avoiding Krylov subspace methods
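As a reference point for this section, here is a minimal SpMV in the CSR (compressed sparse row) format: three arrays store only the nonzeros, and y = A*x streams through them row by row. The indirect access through the column-index array is what makes SpMV memory-bound and motivates the register/cache blocking and reordering optimizations listed above.

```python
def csr_spmv(rowptr, colind, vals, x):
    """y = A @ x with A in CSR form.

    rowptr[i]:rowptr[i+1] gives the slice of (colind, vals) holding the
    nonzeros of row i; the x[colind[k]] access is the irregular,
    bandwidth-limited part.
    """
    y = [0.0] * (len(rowptr) - 1)
    for i in range(len(y)):
        for k in range(rowptr[i], rowptr[i + 1]):
            y[i] += vals[k] * x[colind[k]]
    return y
```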
8. Particle Methods
-Approximation of far-field forces idea
-Data structures - quad/oct trees
-Barnes-Hut algorithm
-Fast multipole methods (FMM)
-Differences between B-H and FMM (e.g., how approximation can be improved)
-Parallelization of Barnes-Hut and FMM
-load balancing schemes
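The core far-field idea above can be sketched in a few lines: a distant cluster of bodies is replaced by a single pseudo-particle at its center of mass. In full Barnes-Hut the quad/oct tree supplies these clusters and an opening criterion (cluster size over distance below some theta) decides when the approximation is acceptable; this fragment shows only the center-of-mass reduction itself, in 1D for brevity.

```python
def center_of_mass(masses, positions):
    """Collapse a cluster of bodies into one (total mass, position) pair,
    the pseudo-particle used for far-field force evaluation."""
    total = sum(masses)
    com = sum(m * x for m, x in zip(masses, positions)) / total
    return total, com
```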
9. Graph partitioning
-Definition and connection with communication
-Partitioning with nodal coordinates
-Planar separator theorem
-Inertial partitioning
-Random spheres
-Partitioning without nodal coordinates
-Breadth-first search method
-Spectral bisection
-Iterative Refinement
-Kernighan-Lin
-Fiduccia-Mattheyses
-Multilevel acceleration
-Hypergraph models, benefit of hypergraph model
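Here is a sketch of the simplest coordinate-free method listed above: run breadth-first search from a seed vertex and cut the visit order in half. Vertices discovered at nearby BFS levels tend to stay together, giving a crude bisection that Kernighan-Lin / Fiduccia-Mattheyses refinement then improves.

```python
from collections import deque

def bfs_bisect(adj, seed=0):
    """Bisect a graph by halving its BFS visit order.

    adj: dict mapping each vertex to a list of neighbors (assumed
    connected). Returns two vertex sets of (near-)equal size.
    """
    order, seen, q = [], {seed}, deque([seed])
    while q:
        v = q.popleft()
        order.append(v)
        for w in adj[v]:
            if w not in seen:
                seen.add(w)
                q.append(w)
    half = len(order) // 2
    return set(order[:half]), set(order[half:])
```

On a path graph this cuts exactly one edge, the optimal bisection; on less regular graphs the cut can be far from optimal, which is why it is mainly used as a cheap starting point.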
10. Machine Learning and Data Analysis
-implicit versus explicit parallelization
-Supervised learning problems
-neural network basics
-data versus model parallelization approaches
-support vector machines
-parallelization challenges
-Unsupervised learning problems
-nonnegative matrix factorization
-parallelization approach
-spectral and Markov clustering
-basic idea behind spectral clustering
-basic idea behind Markov clustering (MCL)
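The data-versus-model parallelization contrast above can be sketched as follows: in data parallelism each worker holds a shard of the training data, computes the gradient on its shard, and the per-shard gradients are averaged (an all-reduce in the distributed setting); model parallelism would instead split the parameters themselves across workers. The 1-parameter least-squares model here is hypothetical, chosen only to keep the example tiny.

```python
def shard_gradient(w, shard):
    """d/dw of the mean squared error 0.5*(w*x - y)^2 over one data shard."""
    return sum((w * x - y) * x for x, y in shard) / len(shard)

def data_parallel_gradient(w, shards):
    """Each shard's gradient is independent (embarrassingly parallel);
    the final average is the all-reduce step."""
    grads = [shard_gradient(w, s) for s in shards]
    return sum(grads) / len(grads)
```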
11. Fast Fourier Transforms
-Main divide and conquer idea behind FFT
-Sequential algorithm (Cooley and Tukey)
-Lower bounds on communication
-1D FFT Parallelization approaches
-Block versus Cyclic layout
-Transpose approach
-Cost of each in terms of the alpha-beta-gamma model
-Higher dimension FFTs (3D case)
-Parallelization approaches
-Packed slabs versus slabs versus pencils
-Tradeoffs in bandwidth and latency costs
-Opportunities for communication/computation overlapping
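The divide-and-conquer idea behind the sequential Cooley-Tukey algorithm can be sketched in a few lines: an n-point DFT splits into DFTs of the even- and odd-indexed samples, combined with "twiddle factor" multiplications, giving O(n log n) work overall. The length n is assumed to be a power of two (the radix-2 case).

```python
import cmath

def fft(x):
    """Radix-2 Cooley-Tukey FFT of a power-of-two-length sequence."""
    n = len(x)
    if n == 1:
        return x[:]
    even, odd = fft(x[0::2]), fft(x[1::2])   # divide: half-size DFTs
    out = [0j] * n
    for k in range(n // 2):
        # conquer: combine with twiddle factor exp(-2*pi*i*k/n)
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out
```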