** CS 598: Communication Cost Analysis of Algorithms**

Mon/Wed 9:30-10:45, 1214 Siebel

Instructor: Edgar Solomonik

solomon2@illinois.edu

4229 Siebel Center

Lec1 | Introduction, motivation, course administration; collective communication protocols src |

Lec2 | Optimal packet-based broadcast, communication cost models (LogP, LogGP, BSP) src |

Lec3 | Communication avoiding algorithms for matrix multiplication src |

Lec4 |
Communication avoiding algorithms for LU factorization src notes on rectangular matrix multiplication |

Lec5 | Memory- and communication-efficient LU factorization src video 1 2 |

Lec6 | Communication avoiding algorithms for LU with pivoting and QR intro src video 2 (QR) |

Lec7 | Parallel algorithms for QR factorization src video 1 2 |

Lec8 | Communication avoiding algorithms for rectangular QR src video 1 2 3 |

Lec9 | Ideal cache model and the discrete Fourier transform src video 1 2 3 |

Lec10 | Parallel FFT and intro to communication lower bounds src video 1 2 3 |

Lec11 | Cache-efficient and parallel sorting src video 1 2 3 |

Lec12 | Bitonic sort and single-source shortest path graph algorithms src video 1 2 3 |

Lec13 | All-pairs shortest-paths src video 1 2 3 |

Lec14 | Betweenness centrality and sample sort revisited src video 1 2 3 |

Lec15 | Pipelined parallel sorting, PRAM, tree contraction src video 1 2 3 |

Lec16 | Tree contraction, Euler tour, list ranking, connectivity and MST src video 1 2 3 |

Lec17 | Sparse direct methods, intro to iterative methods src video 1 2 3 |

Lec18 | Avoiding communication in iterative solvers for sparse linear systems src video 1 2 |

Lec19 | Preconditioning, incomplete LU src video 1 2 3 |

Lec20 | Parallel preconditioning, domain decomposition, graph partitioning src video 1 2 |

Lec21 | Approximate low-rank dense matrix and tensor factorizations src video 1 2 |

Lec22 | Randomized algorithms for low-rank matrix factorization and least-squares src video 2 |

Lec23 | Matrix and tensor completion (ALS, SGD, CCD) src video 1 2 3 |

Lec24 | Molecular dynamics src video 1 2 3 |

Lec25 | Fast integral equation methods, hierarchically structured matrices src video 1 |

Lec26 | Hierarchically semi-separable (HSS) matrices src video 1 2 3 |

Lec27 | HSS matrix construction, electronic structure calculations src video 1 2 |

Lec28 | Convolutional neural networks src video 1 2 |

Lec29 | Network interconnect topologies and topology-aware algorithms src video 1 2 3 |

Efficiency and parallel scalability of data-intensive applications are most often constrained by data movement in the memory hierarchy and the network. This course will focus on analysis of algorithms through the lens of communication and synchronization models. Communication lower bounds and algorithms that attain them will be surveyed for fundamental combinatorial and numerical problems. The course will emphasize general analytical techniques, but will also connect to full-scale applications.

algorithms, linear algebra, and basic parallel programming (e.g. CS 473, 420, 450)

Here is a short list of closely relevant courses of which I am aware, with a theoretical focus and available online material. Most of them cover some subset of the topics/material presented in this course plus topics we will not cover. Predominantly they also have a different focus.

Michael Heath: Parallel Numerical Algorithms, 2015

Guy Blelloch: Parallel Algorithms, 2009

James Demmel and Oded Schwarz: Communication-Avoiding Algorithms, 2011

James Demmel: Applications of Parallel Computers, 2015 (other years available)

Satish Rao: Foundations of Parallel and Distributed Systems, 2012

Pavel Tvrdik: Topics in Parallel Computing, 1999

Yousef Saad's *Iterative Methods for Sparse Linear Systems* formed the basis of the preconditioning lectures and has much more information on stability and convergence of the methods.

James Demmel's *Applied Numerical Linear Algebra* is a great reference for information on matrix factorizations and other topics. The presentation of FFT given in the course was partially based on this book.

Joseph JaJa's *Introduction to Parallel Algorithms* covers PRAM complexity including list ranking and tree-based algorithms presented in the course.

- Jia-Wei, Hong, and Hsiang-Tsung Kung. "I/O complexity: The red-blue pebble game." Proceedings of the thirteenth annual ACM symposium on Theory of computing. ACM, 1981.
- Valiant, Leslie G. "A bridging model for parallel computation." Communications of the ACM 33.8 (1990): 103-111.
- Culler, David, et al. "LogP: Towards a realistic model of parallel computation." ACM Sigplan Notices. Vol. 28. No. 7. ACM, 1993.
- Alexandrov, Albert, et al. "LogGP: Incorporating long messages into the LogP model for parallel computation." Journal of parallel and distributed computing 44.1 (1997): 71-79.
- Aggarwal, Alok, Ashok K. Chandra, and Marc Snir. "Communication complexity of PRAMs." Theoretical Computer Science 71.1 (1990): 3-28.
- Aggarwal, Alok, Ashok K. Chandra, and Marc Snir. "On communication latency in PRAM computations." Proceedings of the first annual ACM symposium on Parallel algorithms and architectures. ACM, 1989.
- Aggarwal, Alok, and Jeffrey Vitter. "The input/output complexity of sorting and related problems." Communications of the ACM 31.9 (1988): 1116-1127.
- Frigo, Matteo, et al. "Cache-oblivious algorithms." Foundations of Computer Science, 1999. 40th Annual Symposium on. IEEE, 1999.
- Bilardi, Gianfranco, et al. "BSP vs LogP." Proceedings of the eighth annual ACM symposium on Parallel algorithms and architectures. ACM, 1996.

- Thakur, Rajeev, Rolf Rabenseifner, and William Gropp. "Optimization of collective communication operations in MPICH." International Journal of High Performance Computing Applications 19.1 (2005): 49-66.
- Sanders, Peter, Jochen Speck, and Jesper Larsson Träff. "Two-tree algorithms for full bandwidth broadcast, reduction and scan." Parallel Computing 35.12 (2009): 581-594.
- Träff, Jesper Larsson, and Andreas Ripke. "Optimal broadcast for fully connected processor-node networks." Journal of Parallel and Distributed Computing 68.7 (2008): 887-901.
- Johnsson, S. Lennart, and C-T. Ho. "Optimum broadcasting and personalized communication in hypercubes." IEEE Transactions on computers 38.9 (1989): 1249-1268.
- Sanders, Peter, and Jop F. Sibeyn. "A bandwidth latency tradeoff for broadcast and reduction." Information Processing Letters 86.1 (2003): 33-38.

- Cannon, Lynn Elliot. "A cellular computer to implement the Kalman filter algorithm." (1969): 1-229.
- Van De Geijn, Robert A., and Jerrell Watts. "SUMMA: Scalable universal matrix multiplication algorithm." Concurrency-Practice and Experience 9.4 (1997): 255-274.
- Choi, Jaeyoung, David W. Walker, and Jack J. Dongarra. "PUMMA: Parallel universal matrix multiplication algorithms on distributed memory concurrent computers." Concurrency: Practice and Experience 6.7 (1994): 543-570.
- Johnsson, S. Lennart. "Minimizing the communication time for matrix multiplication on multiprocessors." Parallel Computing 19.11 (1993): 1235-1257.
- Agarwal, Ramesh C., et al. "A three-dimensional approach to parallel matrix multiplication." IBM Journal of Research and Development 39.5 (1995): 575-582.
- Irony, Dror, Sivan Toledo, and Alexander Tiskin. "Communication lower bounds for distributed-memory matrix multiplication." Journal of Parallel and Distributed Computing 64.9 (2004): 1017-1026.
- Chou, C-C., et al. "Parallelizing Strassen's method for matrix multiplication on distributed-memory MIMD architectures." Computers & Mathematics with Applications 30.2 (1995): 49-69.
- Ballard, Grey, et al. "Communication-optimal parallel algorithm for strassen's matrix multiplication." Proceedings of the twenty-fourth annual ACM symposium on Parallelism in algorithms and architectures. ACM, 2012.
- McColl, William F., and Alexander Tiskin. "Memory-efficient matrix multiplication in the BSP model." Algorithmica 24.3-4 (1999): 287-297.

- Demmel, James, et al. "Communication-optimal parallel and sequential QR and LU factorizations." SIAM Journal on Scientific Computing 34.1 (2012): A206-A239.
- Tiskin, Alexander. "Bulk-synchronous parallel Gaussian elimination." Journal of Mathematical Sciences 108.6 (2002): 977-991.
- S.E., and Demmel, James. "Communication-optimal parallel 2.5D matrix multiplication and LU factorization algorithms." European Conference on Parallel Processing. Springer Berlin Heidelberg, 2011.
- Ballard, Grey, et al. "Reconstructing Householder vectors from tall-skinny QR." Parallel and Distributed Processing Symposium, 2014 IEEE 28th International. IEEE, 2014.
- Tiskin, Alexander. "Communication-efficient parallel generic pairwise elimination." Future Generation Computer Systems 23.2 (2007): 179-188.
- Marek, Andreas, et al. "The ELPA library: scalable parallel eigenvalue solutions for electronic structure theory and computational science." Journal of Physics: Condensed Matter 26.21 (2014): 213201.
- S.E., Grey Ballard, and James Demmel. "A communication-avoiding parallel algorithm for the symmetric eigenvalue problem." arXiv preprint arXiv:1604.03703 (2016).
- Ballard, Grey, James Demmel, and Ioana Dumitriu. "Minimizing communication for eigenproblems and the singular value decomposition." arXiv preprint arXiv:1011.3077 (2010).

- Batcher, Kenneth E. "Sorting networks and their applications." Proceedings of the April 30--May 2, 1968, spring joint computer conference. ACM, 1968.
- Aggarwal, Alok, and Jeffrey Vitter. "The input/output complexity of sorting and related problems." Communications of the ACM 31.9 (1988): 1116-1127.
- Shi, Hanmao, and Jonathan Schaeffer. "Parallel sorting by regular sampling." Journal of Parallel and Distributed Computing 14.4 (1992): 361-372.
- Kale, Laxmikant V., and Sanjeev Krishnan. "A comparison based parallel sorting algorithm." Parallel Processing, 1993. ICPP 1993. International Conference on. Vol. 3. IEEE, 1993.
- Cole, Richard. "Parallel merge sort." SIAM Journal on Computing 17.4 (1988): 770-785.
- Goodrich, Michael T. "Communication-efficient parallel sorting." SIAM Journal on Computing 29.2 (1999): 416-432.

- Buluç, Aydın, and John R. Gilbert. "The Combinatorial BLAS: Design, implementation, and applications." International Journal of High Performance Computing Applications (2011): 1094342011403516.
- Azad, Ariful, et al. "Exploiting multiple levels of parallelism in sparse matrix-matrix multiplication." arXiv preprint arXiv:1510.00844 (2015).
- Tiskin, Alexander. "All-pairs shortest paths computation in the BSP model." International Colloquium on Automata, Languages, and Programming. Springer Berlin Heidelberg, 2001.
- Brandes, Ulrik. "A faster algorithm for betweenness centrality." Journal of mathematical sociology 25.2 (2001): 163-177.
- S.E., Maciej Besta, Flavio Vella, and Torsten Hoefler. "Betweenness centrality is more parallelizable than dense matrix multiplication." arXiv preprint, arXiv:1609.07008 (2016).

- Leiserson, Charles E., and Bruce M. Maggs. "Communication-efficient parallel algorithms for distributed random-access machines." Algorithmica 3.1-4 (1988): 53-77.
- Blelloch, Guy E. "Scans as primitive parallel operations." IEEE Transactions on computers 38.11 (1989): 1526-1538.
- Miller, Gary L., and John H. Reif. Parallel Tree Contraction and Its Application. No. TR-18-85. 1985.
- Gazit, Hillel, Gary L. Miller, and Shang-Hua Teng. "Optimal tree contraction in the EREW model." Concurrent Computations. Springer US, 1988. 139-156.
- Shiloach, Yossi, and Uzi Vishkin. "An O (logn) parallel connectivity algorithm." Journal of Algorithms 3.1 (1982): 57-67.
- Awerbuch, Baruch, and Yossi Shiloach. "New connectivity and MSF algorithms for shuffle-exchange network and PRAM." IEEE Transactions on Computers 100.10 (1987): 1258-1263.

- Demmel, James, et al. "Avoiding communication in sparse matrix computations." Parallel and Distributed Processing, 2008. IPDPS 2008. IEEE International Symposium on. IEEE, 2008.
- Leiserson, Charles E., Satish Rao, and Sivan Toledo. "Efficient out-of-core algorithms for linear relaxation using blocking covers." Foundations of Computer Science, 1993. Proceedings., 34th Annual Symposium on. IEEE, 1993.
- S.E., Erin Carson, Nicholas Knight, and James Demmel. "Tradeoffs between synchronization, communication, and computation in parallel linear algebra computations." Proceedings of the 26th ACM symposium on Parallelism in algorithms and architectures. ACM, 2014.
- Baudet, Gérard M. "Asynchronous iterative methods for multiprocessors." Journal of the ACM (JACM) 25.2 (1978): 226-244.
- Chow, Edmond, and Aftab Patel. "Fine-grained parallel incomplete LU factorization." SIAM Journal on Scientific Computing 37.2 (2015): C169-C193.
- Ballard, G., Siefert, C., & Hu, J. (2016). Reducing communication costs for sparse matrix multiplication within algebraic multigrid. SIAM Journal on Scientific Computing, 38(3), C203-C231.

- Quintana-Ortí, G., Sun, X. and Bischof, C.H., 1998. A BLAS-3 version of the QR factorization with column pivoting. SIAM Journal on Scientific Computing, 19(5), pp.1486-1494.
- Demmel, J.W., Grigori, L., Gu, M. and Xiang, H., 2015. Communication avoiding rank revealing QR factorization with column pivoting. SIAM Journal on Matrix Analysis and Applications, 36(1), pp.55-89.
- Bischof, C.H., 1991. A parallel QR factorization algorithm with controlled local pivoting. SIAM Journal on Scientific and Statistical Computing, 12(1), pp.36-57.
- Yu, H.F., Hsieh, C.J., Si, S. and Dhillon, I., 2012, December. Scalable coordinate descent approaches to parallel matrix factorization for recommender systems. In 2012 IEEE 12th International Conference on Data Mining (pp. 765-774). IEEE.
- Halko, N., Martinsson, P.G. and Tropp, J.A., 2011. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM review, 53(2), pp.217-288.
- Drineas, P., Magdon-Ismail, M., Mahoney, M. W., & Woodruff, D. P. (2012). Fast approximation of matrix coherence and statistical leverage. Journal of Machine Learning Research, 13(Dec), 3475-3506.
- Mahoney, M.W. and Drineas, P., 2009. CUR matrix decompositions for improved data analysis. Proceedings of the National Academy of Sciences, 106(3), pp.697-702.
- Rokhlin, V., Szlam, A. and Tygert, M., 2009. A randomized algorithm for principal component analysis. SIAM Journal on Matrix Analysis and Applications, 31(3), pp.1100-1124.
- Rokhlin, V. and Tygert, M., 2008. A fast randomized algorithm for overdetermined linear least-squares regression. Proceedings of the National Academy of Sciences, 105(36), pp.13212-13217.

- Plimpton, S., 1995. Fast parallel algorithms for short-range molecular dynamics. Journal of computational physics, 117(1), pp.1-19.
- Snir, Marc. "A note on n-body computations with cutoffs." Theory of Computing Systems 37.2 (2004): 295-318.
- Shaw, David E. "A fast, scalable method for the parallel evaluation of distance‐limited pairwise particle interactions." Journal of computational chemistry 26.13 (2005): 1318-1328.
- Driscoll, Michael, et al. "A communication-optimal n-body algorithm for direct interactions." Parallel & Distributed Processing (IPDPS), 2013 IEEE 27th International Symposium on. IEEE, 2013.
- Bowers, K.J., Chow, D.E., Xu, H., Dror, R.O., Eastwood, M.P., Gregersen, B.A., Klepeis, J.L., Kolossvary, I., Moraes, M.A., Sacerdoti, F.D. and Salmon, J.K., 2006, November. Scalable algorithms for molecular dynamics simulations on commodity clusters. In SC 2006 Conference, Proceedings of the ACM/IEEE (pp. 43-43). IEEE.
- Phillips, J.C., Braun, R., Wang, W., Gumbart, J., Tajkhorshid, E., Villa, E., Chipot, C., Skeel, R.D., Kale, L. and Schulten, K., 2005. Scalable molecular dynamics with NAMD. Journal of computational chemistry, 26(16), pp.1781-1802.

- Greengard, L. and Rokhlin, V., 1987. A fast algorithm for particle simulations. Journal of computational physics, 73(2), pp.325-348.
- Cheng, H., Greengard, L. and Rokhlin, V., 1999. A fast adaptive multipole algorithm in three dimensions. Journal of computational physics, 155(2), pp.468-498.
- White, C.A. and Head‐Gordon, M., 1994. Derivation and efficient implementation of the fast multipole method. The Journal of Chemical Physics, 101(8), pp.6593-6605.
- Greengard, L. and Gropp, W.D., 1990. A parallel version of the fast multipole method. Computers & Mathematics with Applications, 20(7), pp.63-71.
- Lashuk, I., Chandramowlishwaran, A., Langston, H., Nguyen, T.A., Sampath, R., Shringarpure, A., Vuduc, R., Ying, L., Zorin, D. and Biros, G., 2012. A massively parallel adaptive fast multipole method on heterogeneous architectures. Communications of the ACM, 55(5), pp.101-109.
- Xia, J., Chandrasekaran, S., Gu, M. and Li, X.S., 2010. Fast algorithms for hierarchically semiseparable matrices. Numerical Linear Algebra with Applications, 17(6), pp.953-976.
- Hackbusch, W., 2015. Hierarchical matrices: Algorithms and analysis (Vol. 49). Berlin: Springer.
- Martinsson, P.G., 2011. A fast randomized algorithm for computing a hierarchically semiseparable representation of a matrix. SIAM Journal on Matrix Analysis and Applications, 32(4), pp.1251-1274.
- Gunnar Martinsson's slides from CBMS/NSF conference on Fast Direct Solvers: http://amath.colorado.edu/faculty/martinss/2014_CBMS/
- Andreas Kloeckner's CS 598 course: https://relate.cs.illinois.edu/course/cs598apk-f15/
- Jack Poulson's presentation slides: http://libelemental.org/pub/slides/MIT-HMat-Sep22-2014.pdf

- Sholl, D. and Steckel, J.A., 2011. Density functional theory: a practical introduction. John Wiley and Sons.
- Crawford, T.D. and Schaefer, H.F., 2000. An introduction to coupled cluster theory for computational chemists. Reviews in computational chemistry, 14, pp.33-136.
- Head-Gordon, M., Pople, J.A. and Frisch, M.J., 1988. MP2 energy evaluation by direct methods. Chemical Physics Letters, 153(6), pp.503-506.
- Bartlett, R. J. (1981). Many-body perturbation theory and coupled cluster theory for electron correlation in molecules. Annual Review of Physical Chemistry, 32(1), 359-401.
- Chow, E., Liu, X., Smelyanskiy, M. and Hammond, J.R., 2015. Parallel scalability of Hartree–Fock calculations. The Journal of chemical physics, 142(10), p.104103.
- S.E., Devin Matthews, Jeff Hammond, John F. Stanton, and James Demmel. "Cyclops Tensor Framework: reducing communication and eliminating load imbalance in massively parallel contractions." Parallel & Distributed Processing (IPDPS), 2013 IEEE 27th International Symposium on. IEEE, 2013.

- Kolda, T.G. and Bader, B.W., 2009. Tensor decompositions and applications. SIAM review, 51(3), pp.455-500.
- Austin, W., Ballard, G. and Kolda, T.G., 2015. Parallel Tensor Compression for Large-Scale Scientific Data. arXiv preprint arXiv:1510.06689.
- Rajbhandari, Samyam, et al. "A communication-optimal framework for contracting distributed tensors." SC14: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 2014.
- S.E., James Demmel, and Torsten Hoefler. "Communication lower bounds for tensor contraction algorithms." (2015).
- Karlsson, Lars, Daniel Kressner, and André Uschmajew. "Parallel algorithms for tensor completion in the CP format." Parallel Computing (2015).
- Chetlur, S., Woolley, C., Vandermersch, P., Cohen, J., Tran, J., Catanzaro, B. and Shelhamer, E., 2014. cudnn: Efficient primitives for deep learning. arXiv preprint arXiv:1410.0759.

- Leiserson, Charles E. "Fat-trees: universal networks for hardware-efficient supercomputing." IEEE transactions on Computers 100.10 (1985): 892-901.
- Watts, Jerrell, and Robert Van De Geijn. "A pipelined broadcast for multidimensional meshes." Parallel Processing Letters 5.02 (1995): 281-292.
- Kim, J., Dally, W.J., Scott, S. and Abts, D., 2008, June. Technology-driven, highly-scalable dragonfly topology. In ACM SIGARCH Computer Architecture News (Vol. 36, No. 3, pp. 77-88). IEEE Computer Society.
- Dally, W.J., 1992. Virtual-channel flow control. IEEE Transactions on Parallel and Distributed systems, 3(2), pp.194-205.
- Dally, W.J. and Aoki, H., 1993. Deadlock-free adaptive routing in multicomputer networks using virtual channels. IEEE transactions on Parallel and Distributed Systems, 4(4), pp.466-475.
- Faraj, A., Kumar, S., Smith, B., Mamidala, A. and Gunnels, J., 2009, August. MPI collective communications on the Blue Gene/P supercomputer: Algorithms and optimizations. In 2009 17th IEEE Symposium on High Performance Interconnects (pp. 63-72). IEEE.
- Watts, J. and Van De Geijn, R., 1995. A pipelined broadcast for multidimensional meshes. Parallel Processing Letters, 5(02), pp.281-292.
- S.E., Bhatele, A. and Demmel, J., 2011. Improving communication performance in dense linear algebra via topology aware collectives. In Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (p. 77). ACM.
- Besta, Maciej, and Torsten Hoefler. "Slim fly: a cost effective low-diameter network topology." Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE Press, 2014.

Disclaimers: The above list of references is not a complete assortment of important publications on these topics, but rather the ones that will be touched upon in the course. Some papers are relevant to multiple sections, but all appear only once. Citations obtained via google scholar and may contain errors.