* Explain the different possibilities for distributing a dense matrix among processors. In the context of the distributed-memory implementation of LU factorization, explain which distribution is best and why the others are suboptimal.
* What is bisection bandwidth? What is the bisection bandwidth of the following networks: linear array, 2D mesh, 2D torus, tree, and fat tree? Is bisection bandwidth a useful metric for describing network topologies?
* Give examples of deadlock and race conditions as they might occur in a distributed-memory parallel program based on message passing (see the MPI deadlock sketch after this list).
* Describe the following types of parallelism and give examples of each: task, data, divide and conquer, pipeline.
* Explain the basics of OpenMP. Give an example of how a data race can occur in shared-memory parallel programming (see the OpenMP sketch after this list).
* Give an overview of how algorithms like Barnes-Hut and FMM speed up n-body calculations. What is the main difference between Barnes-Hut and FMM in terms of the accuracy/computation tradeoff? In what other ways are Barnes-Hut and FMM similar and different?
* Given the DFT matrix, we wish to multiply it by a vector to transform the vector from the time domain into the frequency domain. Usually a matrix-vector multiply takes O(n^2) steps. Describe how the FFT algorithm performs this matrix-vector multiply in only O(n log n) steps (see the recursive FFT sketch after this list).
* State the lower bounds on the number of words moved between levels of the memory hierarchy and the number of messages for sequential dense matrix multiply. Now explain how to derive lower bounds for the parallel case. Give a count of the number of messages and the number of words moved in the SUMMA algorithm (the standard bounds are summarized after this list).
* Explain how graph partitioning can be used to distribute a sparse matrix-vector computation. Give an example showing that minimizing the graph cut is not equivalent to minimizing the total number of words communicated between processors. Explain how this can be remedied by a hypergraph model. Is using a hypergraph model always worth it in practice?
* Explain the difference between implicit and explicit parallelization. Give an example of an algorithm where implicit parallelization is a good approach, and one where it is not.
* Explain the difference between strong scaling and weak scaling. Why are both measurements important?
* Define the graph partitioning problem. How would your approach differ when the nodes have associated coordinate information versus when they don't? Explain how "multilevel" approaches work in the context of graph partitioning.
* Explain what is meant by "collective communication" in MPI. Explain the following collectives: Broadcast, Scatter, Gather, Alltoall, Reduce, Allreduce. Give a brief explanation or pseudocode describing how you could implement a broadcast collective using just MPI_Send and MPI_Recv (see the binomial-tree sketch after this list).
* Which parallel benchmarks are used to rank supercomputers today? Explain why the HPCG benchmark achieves such a small percentage of peak, even on the top computers, compared to the LINPACK benchmark. Can the computational intensity of iterative solvers like CG be improved? At what cost (e.g., potentially decreased accuracy)?
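
For the message-passing deadlock question, here is a minimal illustrative sketch (not taken from any particular course code): two ranks try to exchange large buffers, but both call the blocking MPI_Send first. Whether it actually hangs depends on the MPI implementation's internal buffering, which is exactly why this pattern is unsafe.

```c
#include <mpi.h>
#include <stdlib.h>

/* Sketch of a classic deadlock pattern (assumes exactly 2 ranks).
 * Both ranks send first, then receive.  If the messages are too large
 * for the MPI library to buffer internally, each MPI_Send blocks until
 * the matching receive is posted -- which never happens, so both ranks
 * wait on each other forever. */
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int N = 1 << 22;                      /* large enough to defeat eager buffering */
    double *sendbuf = calloc(N, sizeof *sendbuf);
    double *recvbuf = calloc(N, sizeof *recvbuf);
    int other = 1 - rank;                       /* partner rank, assuming 2 ranks total */

    /* Unsafe ordering: both ranks send before either receives.           */
    /* Fix: have one rank receive first, or use MPI_Sendrecv / MPI_Isend. */
    MPI_Send(sendbuf, N, MPI_DOUBLE, other, 0, MPI_COMM_WORLD);
    MPI_Recv(recvbuf, N, MPI_DOUBLE, other, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    free(sendbuf); free(recvbuf);
    MPI_Finalize();
    return 0;
}
```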
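
For the OpenMP data-race question, a minimal sketch of the most common case: several threads perform an unsynchronized read-modify-write on a shared accumulator, so updates are lost and the result changes from run to run.

```c
#include <omp.h>
#include <stdio.h>

/* Sketch of a data race: `sum` is shared by default, and each thread's
 * `sum += ...` is a non-atomic read-modify-write, so concurrent updates
 * can overwrite each other.  The printed value is nondeterministic. */
int main(void) {
    const int n = 1000000;
    double sum = 0.0;

    #pragma omp parallel for            /* racy: no reduction, atomic, or critical */
    for (int i = 0; i < n; i++)
        sum += 1.0 / (i + 1);           /* unsynchronized update of shared `sum` */

    printf("sum = %f (varies between runs)\n", sum);

    /* Correct version: #pragma omp parallel for reduction(+:sum) */
    return 0;
}
```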
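
For the FFT question, a minimal recursive radix-2 Cooley-Tukey sketch (assuming n is a power of two; real libraries use iterative, in-place, cache-aware variants). The point is the recurrence T(n) = 2T(n/2) + O(n), which gives the O(n log n) cost of applying the DFT matrix to a vector.

```c
#include <complex.h>
#include <math.h>    /* M_PI (POSIX); may need _USE_MATH_DEFINES elsewhere */
#include <stdlib.h>

/* Recursive radix-2 Cooley-Tukey FFT sketch: split x into its even- and
 * odd-indexed halves, transform each recursively, then combine the two
 * half-size DFTs with the twiddle factors w = exp(-2*pi*i*k/n). */
void fft(double complex *x, int n) {
    if (n <= 1) return;

    double complex *even = malloc(n / 2 * sizeof *even);
    double complex *odd  = malloc(n / 2 * sizeof *odd);
    for (int i = 0; i < n / 2; i++) {
        even[i] = x[2 * i];
        odd[i]  = x[2 * i + 1];
    }

    fft(even, n / 2);
    fft(odd,  n / 2);

    for (int k = 0; k < n / 2; k++) {
        double complex w = cexp(-2.0 * I * M_PI * k / n);  /* twiddle factor */
        x[k]         = even[k] + w * odd[k];               /* butterfly */
        x[k + n / 2] = even[k] - w * odd[k];
    }

    free(even);
    free(odd);
}
```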
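
For the matrix-multiply lower-bound question, a LaTeX summary of the standard Hong-Kung / Irony-Toledo-Tiskin bounds (M is the fast or per-processor memory size, P the number of processors); the SUMMA counts quoted in the comment depend on the blocking factor and on how the row/column broadcasts are implemented.

```latex
\begin{align*}
  \text{Sequential:} \quad
    \#\text{words} &= \Omega\!\left(\frac{n^3}{\sqrt{M}}\right), &
    \#\text{messages} &= \Omega\!\left(\frac{n^3}{M^{3/2}}\right),\\[4pt]
  \text{Parallel } \left(M = O\!\left(\tfrac{n^2}{P}\right)\right): \quad
    \#\text{words} &= \Omega\!\left(\frac{n^2}{\sqrt{P}}\right), &
    \#\text{messages} &= \Omega\!\left(\sqrt{P}\right).
\end{align*}
% The parallel bounds follow by applying the sequential bound to one
% processor that performs n^3/P flops with only M = n^2/P local memory.
% SUMMA matches the word bound up to a log P factor from its broadcasts:
% roughly O((n^2/sqrt(P)) log P) words and O(sqrt(P) log P) messages
% per processor.
```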
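
For the broadcast-from-point-to-point question, a sketch of a binomial-tree scheme built only from MPI_Send and MPI_Recv (the function name `my_bcast` and the tree layout are illustrative, not MPI_Bcast's actual implementation). A naive version would have the root send to every other rank in turn, costing p-1 sequential sends; the tree version doubles the number of informed ranks each round, finishing in O(log p) rounds.

```c
#include <mpi.h>

/* Binomial-tree broadcast sketch using only point-to-point calls.
 * Ranks are renumbered relative to the root; in each round, every rank
 * that already holds the data forwards it `mask` positions away, so the
 * set of informed ranks doubles per round. */
void my_bcast(void *buf, int count, MPI_Datatype type, int root, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    int rel = (rank - root + size) % size;   /* rank relative to the root */

    for (int mask = 1; mask < size; mask <<= 1) {
        if (rel < mask) {
            /* Already have the data: forward it one "mask" step away. */
            int dst = rel + mask;
            if (dst < size)
                MPI_Send(buf, count, type, (dst + root) % size, 0, comm);
        } else if (rel < 2 * mask) {
            /* Receive the data this round from the rank mask steps back. */
            MPI_Recv(buf, count, type, (rel - mask + root) % size, 0, comm,
                     MPI_STATUS_IGNORE);
        }
        /* Ranks with rel >= 2*mask stay idle and receive in a later round. */
    }
}
```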