CASPER: Embedding Power Estimation and Hardware-Controlled Power Management in a Cycle-Accurate Micro-Architecture Simulation Platform for Many-Core Multi-Threading Heterogeneous Processors
Despite the promising performance improvements observed in emerging many-core architectures for high-performance processors, high power consumption limits their use and marketability in low-energy sectors such as embedded processors, network processors, and application-specific instruction processors (ASIPs). While most chip architects design power-efficient processors by finding an optimal power-performance balance in their design, some use sophisticated on-chip autonomous power management units, which dynamically reduce the voltage or frequency of idle cores and hence extend battery life and reduce operating costs. For large-scale designs of many-core processors, a holistic approach integrating both these techniques at different levels of abstraction can potentially achieve maximal power savings. In this paper we present CASPER, a robust instruction-trace-driven, cycle-accurate, many-core multi-threading micro-architecture simulation platform in which we have incorporated power estimation models for a wide variety of tunable many-core micro-architectural design parameters, thus enabling processor architects to explore a sufficiently large design space and achieve power-efficient designs. Additionally, CASPER is designed to accommodate cycle-accurate models of hardware-controlled power management units, enabling architects to experiment with and evaluate different autonomous power-saving mechanisms and to study the run-time power-performance trade-offs in embedded many-core processors. We have implemented two such techniques in CASPER: Chipwide Dynamic Voltage and Frequency Scaling and Performance Aware Core-Specific Frequency Scaling, which show average power savings of 35.9% and 26.2%, respectively, on a baseline 4-core SPARC-based architecture. These power savings account for the power consumption of the power management units themselves.
The CASPER simulation platform also provides users with complete support for the SPARC V9 instruction set, enabling them to run a full operating system software stack and hence a wide variety of benchmarking applications.
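As a rough illustration of why scaling voltage together with frequency can save more power than scaling frequency alone, the sketch below uses the textbook CMOS switching-power approximation P = a·C·V²·f. It is not CASPER's power model, and it does not reproduce the savings reported above: the function names, the effective capacitance, and the operating points are all hypothetical values chosen for the example, and leakage power is ignored.

```python
def dynamic_power(c_eff, voltage, freq_hz, activity=1.0):
    """Textbook CMOS switching-power approximation: P = a * C * V^2 * f."""
    return activity * c_eff * voltage ** 2 * freq_hz


def relative_saving(v_scale, f_scale):
    """Fraction of dynamic power saved when V and f are scaled by these factors."""
    return 1.0 - (v_scale ** 2) * f_scale


if __name__ == "__main__":
    # Hypothetical operating points, not taken from the paper.
    base = dynamic_power(c_eff=1e-9, voltage=1.0, freq_hz=1.0e9)
    scaled = dynamic_power(c_eff=1e-9, voltage=0.8, freq_hz=0.5e9)
    print(f"baseline: {base:.3f} W, scaled: {scaled:.3f} W")
    # Voltage and frequency lowered together (DVFS-style scaling).
    print(f"voltage+frequency saving: {relative_saving(0.8, 0.5):.1%}")
    # Frequency lowered alone, supply voltage unchanged.
    print(f"frequency-only saving: {relative_saving(1.0, 0.5):.1%}")
```

Because voltage enters quadratically, lowering it along with frequency reduces dynamic power faster than frequency scaling at a fixed supply voltage, under the simplifying assumptions stated here.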