SGI Wikipedia Project


What can be done in a day on the new SGI® UV™ 2000, the world's largest in-memory data-mining system? That's what SGI asked Kalev Leetaru, the researcher who recently published Culturomics 2.0. In that study he used an archive of 100 million global news articles spanning a quarter-century, a 2.4-petabyte network of 10 billion people, places, and things, and 100 trillion relationships to forecast the Arab Spring, pinpoint Bin Laden's location, and visualize human society's evolution.

He turned to Wikipedia and, together with SGI, created the first-ever historical mapping and exploration of the full-text contents of the English-language edition of Wikipedia, in time and space, visualizing the modern history captured in its pages, all in under a day. By loading the entire English-language edition of Wikipedia into SGI UV 2000, Mr. Leetaru was able to show how Wikipedia's view of the world unfolded over the past two centuries. Each reference in the text has been tied to a location, a year, and a positive or negative sentiment.

While several previous projects have mapped Wikipedia entries using location metadata manually assigned by editors, those attempts accounted for only a tiny fraction of Wikipedia's location information. This project unlocked the contents of the articles themselves, identifying every location and date in all four million pages, and the connections among them, to create a massive network.



Videos about the Project

A View of World History Through Wikipedia

See history unfold through Wikipedia across space and time. Every mention of a location or date in any article across all of Wikipedia was extracted, and each location was connected to the closest date to place it on a map. All locations mentioned in an article together with the same year are connected. Thus, what you see is what Wikipedia had to say about each year from 1800 to the present: which locations were mentioned the most, which locations were mentioned alongside each other, and so on.
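
The linking step described above can be sketched roughly as follows. This is only an illustration, not the project's actual pipeline: it assumes that "closest" means the smallest character-offset distance within an article, and the function names and sample mentions are hypothetical.

```python
from collections import Counter
from itertools import combinations

def link_locations_to_years(locations, dates):
    """Pair each location mention with the year of the nearest date mention.

    locations: list of (char_offset, place_name) found in one article
    dates:     list of (char_offset, year) found in the same article
    Assumption: "closest" is measured by character offset.
    """
    pairs = []
    if not dates:
        return pairs  # no date to attach, so nothing can be placed in time
    for loc_pos, place in locations:
        _, year = min(dates, key=lambda d: abs(d[0] - loc_pos))
        pairs.append((place, year))
    return pairs

def cooccurrence_edges(pairs):
    """Connect distinct places in one article that were tied to the same year."""
    by_year = {}
    for place, year in pairs:
        by_year.setdefault(year, set()).add(place)
    edges = Counter()
    for year, places in by_year.items():
        for a, b in combinations(sorted(places), 2):
            edges[(year, a, b)] += 1
    return edges

# Hypothetical mentions from a single article (offsets are made up).
locations = [(10, "Paris"), (40, "Vienna"), (200, "Boston")]
dates = [(25, 1815), (190, 1861)]

pairs = link_locations_to_years(locations, dates)
edges = cooccurrence_edges(pairs)
print(pairs)        # [('Paris', 1815), ('Vienna', 1815), ('Boston', 1861)]
print(dict(edges))  # {(1815, 'Paris', 'Vienna'): 1}
```

Summing the edge counters over every article would give the kind of per-year network of co-mentioned locations that the video animates.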

The Sentiment of the World Throughout History Through Wikipedia

See positive and negative sentiment unfold through Wikipedia across space and time. Each location is plotted against the date referenced and linked to the other locations mentioned alongside it. The sentiment of each reference is colored on a scale from red (negative) to green (positive).

View the World Through the Eyes of Wikipedia

What can you do in a day on SGI UV 2, the world's largest in-memory data-mining system? Watch Kalev H. Leetaru of the University of Illinois discuss the creation, in collaboration with SGI, of the first-ever historical mapping and exploration of the full-text contents of the English-language edition of Wikipedia, in time and space, with visualization of the modern history captured in its four million pages, in under a day. All done on SGI UV 2000, the Big Brain computer.


Connectivity Structures of the Data

The following visualizations show the very different connectivity structures of different types of data in Wikipedia. This approach allows the macro-level structure to be viewed at a glance. An analyst could then zoom into each node interactively to see the detail on each data point. Labels were removed in these images for readability.
Categories




All categories plotted with cross references within the Wikipedia universe.
Download full resolution image (ZIP 7M)
Persons




All persons mentioned in Wikipedia, plotted and cross-referenced when mentioned in the same article.
Download full resolution image (ZIP 8M)
Organizations




All organizations mentioned in Wikipedia, plotted and cross-referenced when mentioned in the same article.
Download full resolution image (ZIP 6M)
Dates




Years 1000 AD to 2012 referenced in Wikipedia, plotted and cross-referenced when mentioned in the same article.
Download full resolution image (ZIP 7M)


Graphs and Charts


Total Number of Mentions Across Wikipedia of Dates in Each Year 1001 to 2011 -- Timeline of World History

It is immediately clear that the copyright gap that blanks out most of the twentieth century in digitized print collections is not a problem for Wikipedia, where there is steady exponential growth in coverage from 1924 to today. This matches intuition about the degree of surviving information about each decade. For the purposes of this project, references to decades and centuries were coded as a reference to the year starting that time period ("1500's" is coded as the year 1500), which accounts for the majority of the spikes in the data. Major events like the American Civil War and World Wars I and II are readily visible.
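
To see why this coding rule produces spikes, consider the following minimal sketch. It is an assumption-laden illustration rather than the project's parser: the function name `code_year`, the accepted spellings ("1500's", "1500s"), and the sample mentions are all hypothetical.

```python
import re
from collections import Counter

def code_year(reference):
    """Map a date reference to a single year.

    A plain year maps to itself; a decade or century written as
    "1500's" or "1500s" is coded as the year that starts the period.
    Returns None for anything this simple pattern does not recognize.
    """
    match = re.fullmatch(r"(\d{3,4})(?:'?s)?", reference.strip())
    if match is None:
        return None
    return int(match.group(1))

# Hypothetical date references pulled from article text.
mentions = ["1500", "1500's", "1501", "1500s"]
counts = Counter(code_year(m) for m in mentions)
print(counts)  # Counter({1500: 3, 1501: 1})
```

Because every reference to "the 1500s" lands on 1500 itself, years that begin a decade or century collect extra mentions, which is consistent with the spikes described above.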

Log Scale View of the Wikipedia Timeline, 1001 - 2011

Instead of displaying the raw number of mentions each year, a log scale plots their logarithm, so that exponential growth appears as a straight line and the large-scale patterns in how a dataset has grown over time are easier to spot. In this case, the data shows that Wikipedia's historical knowledge from 1001 AD to 2011 largely falls into four time periods: 1001-1500 (Middle Ages), 1501-1729 (Early Modern Period), 1730-2003 (Age of Enlightenment), and 2004-2011 (Wikipedia Era), the last showing a sudden, massive growth rate far in excess of the previous periods.
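
The effect of a log scale can be illustrated with a few lines of code. The counts below are made up for the example and are not taken from the Wikipedia data; the helper names are hypothetical.

```python
import math

def log_scale(counts):
    """Return log10 of each count (counts must be positive)."""
    return [math.log10(c) for c in counts]

def growth_steps(counts):
    """Differences between consecutive log10 values.

    A constant step means constant exponential growth; a change in
    the step size marks a change in growth rate.
    """
    logs = log_scale(counts)
    return [later - earlier for earlier, later in zip(logs, logs[1:])]

# Made-up series that doubles every period.
counts = [100, 200, 400, 800]
steps = growth_steps(counts)
print([round(s, 3) for s in steps])  # [0.301, 0.301, 0.301]
```

On a log-scale chart these equal steps form a straight line, so a change in slope, like the ones separating the four periods above, stands out at a glance.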

Wikipedia Timeline, 1950 - 2011

This graph covers just the period 1950-2011 and shows that the initial spike of coverage leading to the Wikipedia Era begins in 2001, the year Wikipedia was first released, followed by three years of fairly level coverage, with the real acceleration beginning in 2004. Equally interesting is the leveling-off that begins in 2008, and the fact that the last three years (2009, 2010, and 2011) have nearly equal numbers of mentions. Does this mean that Wikipedia is stagnating, or has it finally reached a threshold at which all human knowledge generated each year is now recorded on its pages and there is simply no more to record? One possible theory is that most edits to Wikipedia today focus on contemporary knowledge, adding in events as they happen, turning Wikipedia into a daybook of modern history. However, the next graph provides another possible answer.

Total Number of Articles 2001-2011 and Total Number of Mentions of Dates of the Same Year

To better understand the nature of the expansion of Wikipedia in recent years, this graph shows the total number of articles in the English-language Wikipedia by year 2001-2011, along with the total number of mentions of dates of that same year. It turns out that in 2007, there were nearly as many mentions of the year 2007 as there were pages in Wikipedia (note: this does not mean that every page mentioned that year, since a single page mentioning a year multiple times will account for multiple entries in this graph). The size of Wikipedia continued to grow after 2007, while the number of mentions of the corresponding years leveled off. This suggests that Wikipedia's continued growth is not necessarily focused on current history, but rather is distributed elsewhere across Wikipedia, enhancing its coverage of the past.
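
The distinction between mentions and pages in the note above can be made concrete with a toy example. The article titles and year lists here are invented, and `mention_stats` is a hypothetical helper, not part of the original analysis.

```python
def mention_stats(articles, year):
    """Count total mentions of a year and the number of pages mentioning it.

    articles: dict mapping article title -> list of years mentioned,
              with one entry per mention (so repeats are kept).
    """
    total_mentions = sum(years.count(year) for years in articles.values())
    pages_with_mention = sum(1 for years in articles.values() if year in years)
    return total_mentions, pages_with_mention

# Invented data: three pages, one of which mentions 2007 three times.
articles = {
    "Page A": [2007, 2007, 2007],
    "Page B": [1999],
    "Page C": [2007],
}
print(mention_stats(articles, 2007))  # (4, 2)
```

Here there are more mentions of 2007 (four) than pages (three), yet only two pages mention the year at all.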

Emotional Context of Wikipedia from 1001 - 2011

This chart visualizes how "positive" or "negative" each year was according to Wikipedia. To normalize the raw tonal scores, the Y-axis shows the number of standard deviations from the mean, known as the Z-score. Annual tone is calculated through a very simplistic measure: computing the average tone of every article on Wikipedia, and then computing the average tone of all articles mentioning a given year (if a year is mentioned multiple times in an article, the article's tone is counted multiple times towards this average). This is the macro-level context of a year: at the scale of Wikipedia, if a year is mentioned primarily in negative articles, that suggests something important about that year.
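
A minimal sketch of this calculation follows. Several details are assumptions made for illustration: the Z-score is taken across the set of annual tone values, the population standard deviation is used, and the tone scores and function names are invented.

```python
from statistics import mean, pstdev

def annual_tone(articles):
    """Mention-weighted average article tone for each year.

    articles: list of (tone, years_mentioned); a year that appears
    several times in years_mentioned counts the article's tone that
    many times, as described in the text.
    """
    totals, counts = {}, {}
    for tone, years in articles:
        for year in years:
            totals[year] = totals.get(year, 0.0) + tone
            counts[year] = counts.get(year, 0) + 1
    return {year: totals[year] / counts[year] for year in totals}

def z_scores(tone_by_year):
    """Standard deviations from the mean annual tone (assumed normalization)."""
    values = list(tone_by_year.values())
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        raise ValueError("all years have identical tone; Z-score undefined")
    return {year: (tone - mu) / sigma for year, tone in tone_by_year.items()}

# Invented article tones and the years each article mentions.
articles = [
    (2.0, [1900, 1900, 1863]),
    (-4.0, [1863]),
    (1.0, [1900, 1950]),
]
tones = annual_tone(articles)
z = z_scores(tones)
print(tones[1863])  # -1.0
```

In this toy data, 1863 is mentioned mostly in a negative article, so its Z-score falls below zero, while 1900 sits above the mean.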

Looking at the figure, one of the most striking features is the dramatic shift towards greater negativity between 1499 and 1500. Tone had been moving steadily more negative from 1001 AD to 1499, shifting an entire standard deviation over this period, but there is a sudden sharp shift of one full standard deviation between those two years, with tone remaining more negative until the most recent half-century. The suddenness of this shift suggests it is likely an artifact of Wikipedia or the analysis process rather than a genuine historical trend, such as a reflection of increasing scholarly questioning of worldly norms during that period. Elsewhere in the chart, another striking plunge towards negativity occurs from 1861 to 1865, during the American Civil War, with similar plunges around World Wars I and II. Of interest, World War II shows nearly double the negativity of World War I, and nearly 75% that of the Civil War.




Comparison of Wikipedia Tone vs. News Media Tone from 1979 to 2010

This chart shows detail of the period 1979-2010, corresponding to Figure 11 in Culturomics 2.0, which traced the average monthly tone of global news coverage over this period. News media tone has become 3 times more negative over this period, while Wikipedia tone has become 1.5 times more positive. There is also a two-year shift towards negativity in Wikipedia from 2004 to 2005, which deserves further exploration.


Images courtesy of Kalev Leetaru