hpctoolkit



HPCToolkit and Related Publications

Selected Overview Paper

[1]
Laksono Adhianto, Sinchan Banerjee, Mike Fagan, Mark Krentel, Gabriel Marin, John Mellor-Crummey, and Nathan R. Tallent. HPCToolkit: Tools for performance analysis of optimized parallel programs. Concurrency and Computation: Practice and Experience, 22(6):685–701, 2010. (doi:10.1002/cpe.1553)

HPCToolkit Papers

[1]
Xiaozhu Meng, Jonathon M. Anderson, John Mellor-Crummey, Mark W. Krentel, Barton P. Miller, and Srdan Milakovic. Parallel binary code analysis. In Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP '21, pages 76–89, New York, NY, USA, 2021. Association for Computing Machinery. (doi:10.1145/3437801.3441604)
[2]
Keren Zhou, Xiaozhu Meng, Ryuichi Sai, and John Mellor-Crummey. GPA: A GPU performance advisor based on instruction sampling. In 2021 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), pages 115–125, 2021. (doi:10.1109/CGO51591.2021.9370339)
[3]
K. Zhou, Y. Hao, J. Mellor-Crummey, X. Meng, and X. Liu. GVPROF: A value profiler for GPU-based clusters. In 2020 SC20: International Conference for High Performance Computing, Networking, Storage and Analysis (SC), pages 1–16, Los Alamitos, CA, USA, November 2020. IEEE Computer Society. (doi:10.1109/SC41405.2020.00093)
[4]
Ryuichi Sai, John Mellor-Crummey, Xiaozhu Meng, Mauricio Araya-Polo, and Jie Meng. Accelerating high-order stencils on GPUs. In 2020 IEEE/ACM Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS), pages 86–108, 2020. (doi:10.1109/PMBS51919.2020.00014)
[5]
Keren Zhou, Mark W. Krentel, and John Mellor-Crummey. Tools for top-down performance analysis of GPU-accelerated applications. In Proceedings of the 34th ACM International Conference on Supercomputing, ICS '20, New York, NY, USA, 2020. Association for Computing Machinery. (doi:10.1145/3392717.3392752)
[6]
Keren Zhou, Mark Krentel, and John Mellor-Crummey. A tool for top-down performance analysis of GPU-accelerated applications. In Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP '20, pages 415–416, New York, NY, USA, 2020. Association for Computing Machinery. (doi:10.1145/3332466.3374534)
[7]
Lai Wei and John Mellor-Crummey. Using sample-based time series data for automated diagnosis of scalability losses in parallel programs. In Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP '20, pages 144–159, New York, NY, USA, 2020. Association for Computing Machinery. (doi:10.1145/3332466.3374538)
[8]
P. Taffet and J. Mellor-Crummey. Lightweight, packet-centric monitoring of network traffic and congestion implemented in P4. In 2019 IEEE Symposium on High-Performance Interconnects (HOTI), pages 54–58, 2019. (doi:10.1109/HOTI.2019.00026)
[9]
Philip Taffet and John Mellor-Crummey. Understanding congestion in high performance interconnection networks using sampling. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC '19, New York, NY, USA, 2019. Association for Computing Machinery. (doi:10.1145/3295500.3356168)
[10]
John Mellor-Crummey. PIPER: Performance insight for programmers and exascale runtimes: Guiding the development of the exascale software stack. October 2017. (doi:10.2172/1400393)
[11]
Milind Chabbi and John Mellor-Crummey. Contention-conscious, locality-preserving locks. In Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP '16, New York, NY, USA, 2016. Association for Computing Machinery. (doi:10.1145/2851141.2851166)
[12]
Chaoran Yang and John Mellor-Crummey. A practical solution to the cactus stack problem. In Proceedings of the 28th ACM Symposium on Parallelism in Algorithms and Architectures, SPAA '16, pages 61–70, New York, NY, USA, 2016. Association for Computing Machinery. (doi:10.1145/2935764.2935787)
[13]
Philip Taffet and Laksono Adhianto. Addressing challenges in visualizing huge call-path traces. In 2016 45th International Conference on Parallel Processing Workshops (ICPPW), pages 319–328, August 2016. (doi:10.1109/ICPPW.2016.53)
[14]
Xu Liu and John Mellor-Crummey. A tool to analyze the performance of multithreaded programs on NUMA architectures. In Proceedings of the 19th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP '14, pages 259–272, New York, NY, USA, 2014. ACM. (doi:10.1145/2555243.2555271)
[15]
Xu Liu and John Mellor-Crummey. A tool to analyze the performance of multithreaded programs on NUMA architectures. SIGPLAN Not., 49(8):259–272, February 2014. (doi:10.1145/2692916.2555271)
[16]
Xu Liu and John Mellor-Crummey. A data-centric profiler for parallel programs. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, SC '13, pages 28:1–28:12, New York, NY, USA, 2013. ACM. (doi:10.1145/2503210.2503297)
[17]
Milind Chabbi, Karthik Murthy, Mike Fagan, and John Mellor-Crummey. Critically missing pieces on accelerators: A performance tools perspective. SC '13: Birds of a Feather Session: Critically Missing Pieces in Heterogeneous Accelerator Computing, Pavan Balaji (Organizer), November 2013.
[18]
Nathan R. Tallent, John M. Mellor-Crummey, Michael Franco, Reed Landrum, and Laksono Adhianto. Scalable fine-grained call path tracing. In ICS '11: Proc. of the 25th International Conference on Supercomputing, pages 63–74, New York, NY, USA, 2011. ACM. (doi:10.1145/1995896.1995908)
[19]
Xu Liu and John Mellor-Crummey. Pinpointing data locality problems using data-centric analysis. In CGO '11: Proc. of the 2011 IEEE/ACM International Symposium on Code Generation and Optimization, pages 171–180, 2011. (doi:10.1109/CGO.2011.5764685)
[20]
Nathan R. Tallent, Laksono Adhianto, and John M. Mellor-Crummey. Scalable identification of load imbalance in parallel executions using call path profiles. In SC '10: Proc. of the 2010 ACM/IEEE Conference on Supercomputing, pages 1–11, Washington, DC, USA, 2010. IEEE Computer Society. (doi:10.1109/SC.2010.47)
[21]
Laksono Adhianto, John Mellor-Crummey, and Nathan R. Tallent. Effectively presenting call path profiles of application performance. In PSTI 2010: Workshop on Parallel Software Tools and Tool Infrastructures, in conjunction with the 2010 International Conference on Parallel Processing, 2010.
[22]
Laksono Adhianto, Sinchan Banerjee, Mike Fagan, Mark Krentel, Gabriel Marin, John Mellor-Crummey, and Nathan R. Tallent. HPCToolkit: Tools for performance analysis of optimized parallel programs. Concurrency and Computation: Practice and Experience, 22(6):685–701, 2010. (doi:10.1002/cpe.1553)
[23]
Nathan R. Tallent, John M. Mellor-Crummey, and Allan Porterfield. Analyzing lock contention in multithreaded applications. In PPoPP '10: Proc. of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 269–280, New York, NY, USA, 2010. ACM. (doi:10.1145/1693453.1693489)
[24]
Nathan R. Tallent and John M. Mellor-Crummey. Identifying performance bottlenecks in work-stealing computations. Computer, 42(12):44–50, 2009. (doi:10.1109/MC.2009.396)
[25]
Nathan R. Tallent, John M. Mellor-Crummey, Laksono Adhianto, Michael W. Fagan, and Mark Krentel. Diagnosing performance bottlenecks in emerging petascale applications. In SC '09: Proc. of the 2009 ACM/IEEE Conference on Supercomputing, pages 1–11, New York, NY, USA, 2009. ACM. (doi:10.1145/1654059.1654111)
[26]
Nathan R. Tallent, John Mellor-Crummey, and Michael W. Fagan. Binary analysis for measurement and attribution of program performance. In PLDI '09: Proc. of the 2009 ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 441–452, New York, NY, USA, 2009. ACM. Distinguished Paper. (doi:10.1145/1542476.1542526)
[27]
Robert Fowler, Laksono Adhianto, Bronis de Supinski, Michael Fagan, Todd Gamblin, Mark Krentel, John Mellor-Crummey, Martin Schulz, and Nathan Tallent. Frontiers of performance analysis on leadership-class systems. Journal of Physics: Conference Series, 180:012041 (6pp), 2009.
[28]
Nathan R. Tallent and John Mellor-Crummey. Effective performance measurement and analysis of multithreaded applications. In PPoPP '09: Proc. of the 14th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 229–240, New York, NY, USA, 2009. ACM. (doi:10.1145/1504176.1504210)
[29]
L. Adhianto, M. Fagan, M. Krentel, G. Marin, J. Mellor-Crummey, and N. R. Tallent. HPCToolkit: Performance measurement and analysis for supercomputers with node-level parallelism. In Workshop on Node Level Parallelism for Large Scale Supercomputers, in conjunction with Supercomputing 2008, November 2008.
[30]
Nathan Tallent, John Mellor-Crummey, Laksono Adhianto, Mike Fagan, and Mark Krentel. HPCToolkit: Performance tools for scientific computing. Journal of Physics: Conference Series, 125:012088 (5pp), 2008.
[31]
John Mellor-Crummey and Nathan R. Tallent. A methodology for accurate, effective and scalable performance analysis of application programs. In Workshop on Tools, Infrastructures and Methodologies for the Evaluation of Research Systems, in conjunction with the 2008 IEEE International Symposium on Performance Analysis of Systems and Software, pages 4–11, February 2008.
[32]
John Mellor-Crummey, Nathan R. Tallent, Mike Fagan, and Jan Odegard. Application performance profiling on the Cray XD1 using HPCToolkit. In Proc. of the Cray User's Group, May 2007.
[33]
Cristian Coarfa, John Mellor-Crummey, Nathan Froyd, and Yuri Dotsenko. Scalability analysis of SPMD codes using expectations. In ICS '07: Proc. of the 21st International Conference on Supercomputing, pages 13–22, New York, NY, USA, 2007. ACM. (doi:10.1145/1274971.1274976)
[34]
Nathan Froyd, Nathan Tallent, John Mellor-Crummey, and Robert Fowler. Call path profiling for unmodified, optimized binaries. In GCC Summit '06: Proc. of the GCC Developers' Summit, 2006, pages 21–36, 2006.
[35]
Nathan Froyd, John Mellor-Crummey, and Rob Fowler. Low-overhead call path profiling of unmodified, optimized code. In Proc. of the 19th International Conference on Supercomputing, pages 81–90, New York, NY, USA, 2005. ACM. (PDF) (doi:10.1145/1088149.1088161)
[36]
John Mellor-Crummey, Robert Fowler, Gabriel Marin, and Nathan Tallent. HPCView: A tool for top-down analysis of node performance. The Journal of Supercomputing, 23(1):81–104, 2002. (PDF) (doi:10.1023/A:1015789220266)
[37]
John Mellor-Crummey, Robert Fowler, and David Whalley. Tools for application-oriented performance tuning. In ICS '01: Proc. of the 15th International Conference on Supercomputing, pages 154–165, New York, NY, USA, 2001. ACM. (PDF) (doi:10.1145/377792.377826)

HPCToolkit Talks and Posters

[1]
Milind Chabbi, Karthik Murthy, Mike Fagan, and John Mellor-Crummey. Critically missing pieces on accelerators: A performance tools perspective. SC '13: Birds of a Feather Session: Critically Missing Pieces in Heterogeneous Accelerator Computing, Pavan Balaji (Organizer), November 2013.
[2]
John Mellor-Crummey. HPCToolkit: Sampling-based performance tools for leadership computing. Productivity Tools for Leadership Science Workshop, Argonne Leadership Computing Facility Winter Workshop Series, January 2011.
[3]
Nathan R. Tallent. Performance analysis for parallel programs: From multicore to petascale. Supercomputing 2010 George Michael HPC Fellow Presentation, November 2010.
[4]
John Mellor-Crummey. Gaining insight into parallel program performance using sampling. IBM T. J. Watson Research Center, October 2010.
[5]
John Mellor-Crummey. A slice of CScADS: Performance tools for petascale platforms. SciDAC 2010, July 2010.
[6]
Nathan R. Tallent. Identifying scalability bottlenecks in large-scale parallel programs using HPCToolkit. In Jesus Labarta, Barton P. Miller, Bernd Mohr, and Martin Schulz, editors, Program Development for Extreme-Scale Computing, number 10181 in Dagstuhl Seminar Proceedings, Dagstuhl, Germany, 2010. Schloss Dagstuhl - Leibniz-Zentrum fuer Informatik, Germany.
[7]
John Mellor-Crummey. HPCToolkit: Sampling-based performance tools for leadership computing. INCITE Getting Started Workshop, Argonne Leadership Computing Facility, January 2010.
[8]
Nathan R. Tallent. Performance analysis of parallel programs: From multicore to petascale. Supercomputing 2009 Doctoral Research Showcase, November 2009.
[9]
L. Adhianto, S. Banerjee, M. Fagan, M. Krentel, G. Marin, J. Mellor-Crummey, and N. R. Tallent. HPCToolkit: Performance tools for scientific computing. In SC '08: Proc. of the 2008 ACM/IEEE Conference on Supercomputing, New York, NY, USA, 2008. ACM.
[10]
John Mellor-Crummey, Robert Fowler, and Nathan R. Tallent. Practical application performance analysis on Linux systems. Supercomputing 2004 Tutorial, November 2004.
[11]
John Mellor-Crummey. HPCToolkit: Multi-platform tools for profile-based performance analysis. 5th International Workshop on Automatic Performance Analysis (APART), November 2003. (PDF)
[12]
Nathan Froyd, John Mellor-Crummey, and Nathan R. Tallent. A sample-driven call stack profiler. 4th Symposium of the Los Alamos Computer Science Institute (LACSI 2003), October 2003.
[13]
Nathan R. Tallent. HPCToolkit: Top-down analysis of node performance. 2003 MCS Divisional Seminars and Colloquia, Argonne National Laboratory, August 2003.
[14]
John Mellor-Crummey, Robert Fowler, and David Whalley. On providing useful information for analyzing and tuning applications. In SIGMETRICS '01: Proc. of the 2001 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, pages 332–333, New York, NY, USA, 2001. ACM. (PDF) (doi:10.1145/378420.378828)

Performance Modeling and Prediction Papers

[1]
G. Marin and J. Mellor-Crummey. Application insight through performance modeling. In IPCCC 2007: Proc. of the 26th IEEE International Performance, Computing, and Communications Conference, pages 65–74, April 2007. (doi:10.1109/PCCC.2007.358880)
[2]
Apan Qasem, Ken Kennedy, and John Mellor-Crummey. Automatic tuning of whole applications using direct search and a performance-based transformation system. J. Supercomput., 36(2):183–196, 2006. (PDF) (doi:10.1007/s11227-006-7957-2)
[3]
Gabriel Marin and John Mellor-Crummey. Scalable cross-architecture predictions of memory hierarchy response for scientific applications. In Proc. of the Sixth Annual Los Alamos Computer Science Institute Symposium, 2005.
[4]
Gabriel Marin and John Mellor-Crummey. Cross-architecture performance predictions for scientific applications using parameterized models. In SIGMETRICS '04: Proc. of the Joint International Conference on Measurement and Modeling of Computer Systems, pages 2–13, New York, NY, USA, 2004. ACM. (PDF) (doi:10.1145/1005686.1005691)


[Page last updated: 2024/05/09]


Copyright © HPCToolkit Project a Series of LF Projects, LLC
For web site terms of use, trademark policy and other project policies please see https://lfprojects.org.