One Giant Leap
How NASA, SGI and Intel managed to build and deploy history's most powerful supercomputer in 15 blistering weeks.
"The best word is adrenaline," says Harkness, vice president of SGI's manufacturing facilities in Chippewa Falls, Wis. "The thought of building a 10,240-processor system in a little over three months was an exhilarating prospect, especially since we still had to maintain our normal manufacturing pace. We honestly wondered what people were thinking." What people were thinking, it turns out, was to spectacularly revitalize NASA's computing resources with a single system - one that would put more supercomputing power into the Agency's hands than anyone, anywhere had ever seen before. But getting there quickly meant overcoming colossal challenges, from congressional approvals to the breakneck delivery and deployment of new products. And yet it worked. NASA's "Columbia" supercomputer, so named to honor the crew lost in the 2003 shuttle accident, may have been born of necessity. But it was brought to life by NASA, SGI and Intel in a dramatic sprint to a finish line that at first seemed all but unreachable. Here's how it happened. A Modest Proposal
"What we proposed," he says, "was a relatively modest investment to stay vital in high-end computing." The idea: NASA would purchase 15 Teraflops (trillion operations per second) of computing power over three years. Brooks and his counterparts continued to sell their concept through the spring. Meanwhile, the Agency considered more ambitious approaches. Particularly intriguing was the notion of building a world-class supercomputer by November. One idea was to link thousands of dual-processor commodity servers into a sprawling cluster, but NASA quickly dismissed that approach. "We're trying to solve some of the toughest scientific problems in the world," says Jim Taft, task lead for the NAS Division's Terascale Applications Group. "We needed a system designed to efficiently execute the algorithms used in NASA's premier science codes, rather than one that would merely do well on artificial benchmarks." Brooks and his team instead pointed to Kalpana, an Intel® Itanium® 2-based, 512-processor SGI® Altix® 3000 system in use at NASA Ames since November 2003 and named to honor Kalpana Chawla, a NASA scientist lost in the Columbia accident.. In less than six months, Taft says, the Kalpana system - the first 512-processor Linux® system ever to operate under a single Linux kernel - had revolutionized the rate of scientific discovery at NASA for a number of disciplines. On NASA's previous supercomputers, simulations showing five years worth of changes in ocean temperatures and sea levels were taking 12 months to model. But on the SGI® Altix® system, scientists could simulate decades of ocean circulation in just days, while producing simulations in greater detail than ever before. And the time required to assess flight characteristics of an aircraft design, which involves thousands of complex calculations, dropped from years to a single day. "That kind of leap is incredible," says Taft. 
"What took a year on the best computing technology previously available, we could now accomplish in days on the Altix system." NASA scientists began to imagine what an SGI® supercomputer built from 20 nodes, each with the power of Kalpana, could offer. "We could easily do all the benchmarking anyone could want," Taft says, "but we're more interested in a system capable of doing useful science." The entire NASA team envisioned the science that would be possible on such a system: detailed hurricane predictions, global warming studies, electronic wind tunnel simulations, galaxy formation and supernova analysis, and experiments leading to safer space exploration. Thirty Days to Yes Another crucial challenge was to prove NASA could pull it off without spending another dime over its approved budget. "Once Congress saw how we could acquire four times the computing resources for the same money," says Brooks, "it was hard to refuse." In just 30 days, NASA received the green light on Project Columbia. "People both inside and outside the agency were inspired," says Brooks. In light of Return to Flight initiatives following the loss of the Columbia shuttle and crew, the need was increasingly urgent. Lawmakers also took note of the impact the new supercomputer would have on other national science projects "The Columbia Accident Investigation Board spent three months conducting analysis to seek the root cause of the accident. If they had this new system then, it would have been possible to do this in a matter of days."