View the World Through the Eyes of Wikipedia
What can be done in a day on the New SGI® UV™ 2000 - the world's largest in-memory data-mining system?
Hamburg, Germany — June 18, 2012 — SGI (NASDAQ: SGI), the trusted leader in technical computing has partnered with Kalev H. Leetaru of the University of Illinois to create the first-ever historical mapping and exploration of the full text contents of the English-language edition of Wikipedia, in time and space. The results include visualizations of modern history captured in under a day utilizing in-memory data-mining techniques. Loading the entire English language edition of Wikipedia into SGI® UV™ 2000, Mr. Leetaru was able to show how Wikipedia's view of the world unfolded over the past two centuries. Location, year and the positive or negative sentiment have been tied to those references.
While several previous projects have mapped Wikipedia entries with manually assigned location metadata by an editor, these previous attempts only accounted for a tiny fraction of Wikipedia's location information. This project unlocked the contents of the articles themselves, identifying every location and date in all four million pages and the connections among them to create a massive network.
"This analysis allows the world to take a step back from the individual articles and text to gain a forest view of the tremendous knowledge captured in Wikipedia, not just a page by page tree view. We can watch how one of the largest collections of human knowledge has evolved and see what we could never see before, such as global sentiment at a certain time and place, or where there might be blind spots in the knowledge coverage, " said Franz Aman, chief marketing officer and head of strategy, SGI. "We love to use Google Earth because we can zoom out and get the big picture view. With SGI UV 2, we can apply the same concept to Big Data to get the big picture on our Big Data."
From this analysis, Wikipedia is seen to have four periods of growth in its historical coverage: 1001-1500 (Middle Ages), 1501-1729 (Early Modern Period), 1730-2003 (Age of Enlightenment), 2004-2011 (Wikipedia Era) and its continued growth appears to be focused on enhancing its coverage of historical events, rather than increased documenting of the present. The average tone of Wikipedia's coverage of each year closely matches major global events, with the most negative period in the last 1,000 years being the American Civil War, followed by World War II. The analysis also shows that the "copyright gap" that blanks out most of the twentieth century in digitized print collections is not a problem with Wikipedia where there is steady exponential growth in its coverage from 1924 to today.
Enabling researchers to data-mine Big Data at the speed of Big Data
"The one-way nature of connections in Wikipedia, the lack of links, and the uneven distribution of Infoboxes, all point to the limitations of metadata-based data mining of collections like Wikipedia," said Mr. Leetaru. "With SGI UV 2, the large shared memory available allowed me to ask questions of the entire dataset in near-real time. With a huge amount of cache-coherent shared memory at my fingertips, I could simply write a few lines of code and run it across the entire dataset, asking whatever questions came to mind. This isn't possible with a scale-out computing approach. It's very similar to using a word processor instead of using a typewriter - I can conduct my research in a completely different way, focusing on the outcomes, not the algorithms."
The analytical approach
Loaded into SGI® UV™ 2000, the Big Brain computer, this massive dataset underwent full text geocoding and complete date-coding, using algorithms that identified every mention of every location and every date across the text of every entry on Wikipedia. More than 80 million locations and 42 million dates between 1000 AD and 2012 were extracted, averaging 19 locations and 11 dates per article (every 44 words and every 75 words, respectively). The connections between every date and every location were captured into a massive network representing Wikipedia's view of history. With this instrumentation, Mr. Leetaru was able to perform near-real time analysis over the entire dataset on the SGI UV 2 to create visual maps throughout space and time to see not only how history unfolded but also the overall tone of the world throughout the last thousand years, and interactively testing a wide array of theories and research questions, all in less than a day's work.
The New SGI UV: The Big Brain computer
SGI UV 2 product family enables users to find answers to the world's most difficult problems on a system as easy to administer as a workstation. Built with Intel® Xeon® processor E5 family, running standard Linux, and supporting a wide range of storage options, SGI UV 2 offers a complete, industry-standard solution for no-limit computing.
With as little as 16 cores and 32 gigabytes of memory, SGI UV 2 can start small and seamlessly expand. This next generation platform doubles the number of cores (up to 4096 cores) and quadruples the amount of coherent main memory (up to 64 terabytes) from the previous generation, available for in-memory computing in a single-image system. SGI UV 2 can scale to eight petabytes of shared memory and at a peak I/O rate of four terabytes per second (14 PB/hour) it could ingest the entire contents of the U.S. Library of Congress print collection in less than three seconds.
SGI UV 2000 is available immediately. SGI UV 20 can be ordered today and will start shipping in August 2012. Pricing starts at $30,000 USD.
Ogilvy Public Relations
© 2012 Silicon Graphics International Corporation. SGI and the SGI logo are trademarks or registered trademarks of Silicon Graphics International Corp. or its subsidiaries in the United States and/or other countries. Intel and Xeon are registered trademarks of Intel Corporation. All other trade names and marks are the property of their respective owners.
Images provided courtesy of Kalev Leetaru