For half a century, supercomputers have performed the tasks that spark people’s imaginations – analyzing enough data to simulate nuclear tests, mapping the human genome, and pinpointing precise locations to drill new oil wells. And they take on even bigger roles today. These high-performance computing (HPC) systems are fueling a new wave of data-intensive applications that rely on artificial intelligence (AI), machine learning (ML), 3D imaging, and the Internet of Things (IoT).
HPC systems are big and powerful, but sometimes things can go wrong. They perform billions of calculations per second, and HPC clusters typically consist of thousands of networked compute servers. If they go down or incorrectly map connections to major data stores, they can delay important projects. In addition, they require intensive maintenance, updates and system checks.
Given the great importance of HPC – as a foundation for technical and societal progress – it is essential that these systems operate at the highest level. Many organizations are devoting the kind of attention their HPC infrastructure needs. However, some choose to invest more in hardware and software than in optimizing and supporting the system.
This happens for a variety of reasons. HPC users in some large organizations appreciate the cachet that comes with running one of the most powerful computers in the world. When they want to go faster, they invest in “horsepower” – additional nodes and accelerators. Some IT managers running HPC systems believe that having multiple high-powered systems with built-in redundancy makes investing in support less necessary. Others are reluctant to regularly update long-running HPC systems because they believe the changes could introduce unnecessary complexity and risk.
Savvy organizations realize that HPC needs to be continuously nurtured, improved, updated, and optimized to get the most out of their investments. This can be done in-house, if they have the expertise, or by contracting with an external vendor specializing in third-party support if they don’t have the resources or prefer to devote them to more strategic activities for their business. Here are some ways that organizations can optimize their HPC systems and increase the return on investment of their HPC environments.
Perform regular health checks
Basic monitoring often falls low on organizations’ priority lists. Schedule health checks at least twice a year – every quarter, if possible. Verify software and firmware versions, and check interdependencies to avoid introducing incompatibilities that could creep into the environment and affect availability and performance. Proactive maintenance of the liquid cooling system can also prevent many problems: if the cooling fails, the system could overheat and damage processors and other components.
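The interdependency check described above can be illustrated with a small sketch that compares installed component versions against a qualified compatibility matrix. All component names, version numbers, and matrix entries below are invented for illustration; real clusters would pull this data from vendor support matrices and inventory tools:

```python
# Hypothetical example: flag firmware/software versions that are not on a
# vendor-qualified compatibility matrix. All names and versions are
# illustrative, not real product data.

# Versions currently installed on the cluster (assumed inventory data).
installed = {
    "bmc_firmware": "2.4",
    "fabric_driver": "5.1",
    "cluster_manager": "9.0",
}

# Vendor-qualified versions for each component (assumed).
qualified = {
    "bmc_firmware": {"2.4", "2.5"},
    "fabric_driver": {"5.2", "5.3"},
    "cluster_manager": {"9.0"},
}

def health_check(installed, qualified):
    """Return (component, version) pairs whose installed version is not qualified."""
    return [
        (component, version)
        for component, version in installed.items()
        if version not in qualified.get(component, set())
    ]

for component, version in health_check(installed, qualified):
    print(f"WARNING: {component} {version} is not on the qualified matrix")
```

Running such a comparison every quarter surfaces drift before it causes an availability incident, rather than after.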
Regular health checks make sure the system is working as it is supposed to be today. But what about the future? Does the organization consider different types of downstream projects that might require new applications, new workloads, and new configurations? Rather than waiting for a project to come through, consider new versions of hardware, software, storage, and networking that might create more flexibility when the time comes and keep compute performance at peak levels.
Follow good practices
In terms of performance, organizations should review industry best practices based on how they use the system. They need to look at the use cases they are working on, how they configure their systems, the applications they run, and the system architecture those applications run on. If they change a set of workloads, the old configuration may no longer be as efficient. If the data they started with is on-premises and they want to incorporate more data from the edge, they may need to adjust their system architecture.
Keep critical spare parts on hand
Organizations looking for short response times on critical issues can arrange with a support provider to keep important spares – compute blades, for example – on site. If a node malfunctions for a few hours, its tasks can be distributed to other nodes. But the management servers that host the cluster management software are critical: if one of them fails, it impacts the entire system.
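The node-failure scenario above can be made concrete with a toy sketch of redistributing work from a failed compute node to the remaining healthy nodes. The node names, job IDs, and simple round-robin policy are all assumptions for illustration, not how any particular cluster manager actually reschedules work:

```python
# Toy illustration of rescheduling jobs off a failed node. Node names,
# job IDs, and the round-robin policy are invented for this example.

def redistribute(assignments, failed_node, healthy_nodes):
    """Move jobs from failed_node onto healthy_nodes, round-robin.

    assignments: dict mapping node name -> list of job IDs.
    Returns a new assignment dict without the failed node.
    """
    new_assignments = {
        node: list(jobs)
        for node, jobs in assignments.items()
        if node != failed_node
    }
    orphaned = assignments.get(failed_node, [])
    for i, job in enumerate(orphaned):
        target = healthy_nodes[i % len(healthy_nodes)]
        new_assignments[target].append(job)
    return new_assignments

cluster = {
    "node01": ["job-a", "job-b"],
    "node02": ["job-c"],
    "node03": ["job-d"],
}
# node01 fails; its jobs move to node02 and node03.
after = redistribute(cluster, "node01", ["node02", "node03"])
print(after)
```

The point of the sketch is the asymmetry the section describes: compute nodes can absorb each other's work this way, but a failed management server has no peer to absorb its role, which is why it deserves an on-site spare.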
Don’t forget about network switching. If systems don’t have the right connection between their compute blades and data storage, they lose performance. Businesses need to maintain communication and data flow between where their data is stored and where it needs to go. Any problem with network components and switches will impact system performance, which could delay the introduction of a new product.
Let a third-party expert troubleshoot
A reliable HPC services expert can help organizations spend fewer resources on system maintenance and optimization and focus on higher value activities. HPC systems can integrate IP from many different vendors. From a sourcing perspective, an organization can handle all interactions with external vendors – or it can contract with a support vendor to manage the process for them. The third party can fix technological issues or, if that is the preferred option, simply identify the problem and let the customer deal directly with the supplier. Additionally, on-site customer engineers can serve as an extension to customer staff and help meet performance SLAs through proactive maintenance.
Think of HPC as a service
The purchase and maintenance of HPC environments requires huge initial capital expenditures (capex) and ongoing operational costs (opex) to cover energy, staffing, maintenance, and repairs. Shifting to an operating model, where the organization moves to the cloud and pays month to month, removes the upfront hit that many small organizations cannot afford. Large organizations can also maximize their flexibility through this model. For example, an automaker that has a budget for a data center but needs to double its computing power to accelerate the development of autonomous driving can take advantage of cloud models – and cloud-like models that provide on-premises services – to keep its research on track.
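The capex-versus-opex trade-off described above reduces to simple break-even arithmetic. Every dollar figure below is an invented assumption for illustration, not real HPC or cloud pricing:

```python
# Back-of-the-envelope capex vs. opex comparison.
# All dollar figures are invented assumptions, not real pricing.

capex_purchase = 3_000_000      # assumed upfront cost of an on-prem system
annual_on_prem_opex = 400_000   # assumed yearly power, staff, and maintenance
monthly_cloud_cost = 90_000     # assumed pay-as-you-go HPC cloud spend

def total_cost_on_prem(years):
    """Upfront purchase plus ongoing operating costs."""
    return capex_purchase + annual_on_prem_opex * years

def total_cost_cloud(years):
    """No upfront cost; monthly fees accumulate instead."""
    return monthly_cloud_cost * 12 * years

for years in (1, 3, 5):
    print(f"year {years}: on-prem ${total_cost_on_prem(years):,} "
          f"vs cloud ${total_cost_cloud(years):,}")
```

Under these assumed numbers, the cloud is cheaper in the early years because there is no upfront purchase, while the on-prem system can become cheaper over a long enough horizon; the real decision also hinges on utilization and the flexibility the article highlights.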
“[In] the last five or six years there has been a constant transition from HPC to the cloud,” said Srini Chari, managing partner of Cabot Partners Group, a Connecticut IT analyst firm. “This is the fastest growing segment of the HPC market. What is happening here is that many companies find that it is a real headache for them to manage the infrastructure, due to the rate and pace of technological change and the skills required to operate HPC on premises. So instead of buying technology, they seek to use it as a cloud service.”
HPC systems are among the most impactful drivers of technology today. They continue to run complex simulations, and they will be used to run the AI applications that companies will use to grow their businesses in the future. To get the most out of AI, organizations will need to optimize their HPC environments. Making sure they’re running smoothly and avoiding risk will help organizations drive business value.
About Cyrille Schulz
Cyrille Schulz is product manager for the HPE Pointnext service portfolio focused on high-end HPC products such as HPE Cray supercomputers and liquid-cooled Apollos. His responsibilities include defining and executing the overall service management strategy, developing new service capabilities and building the end-to-end value chain, from portfolio to delivery. Cyrille has service experience across the entire technology spectrum and a passion for developing services, across the entire portfolio, that deliver successful business results while remaining profitable.
Copyright © 2021 IDG Communications, Inc.