This website does not display correctly in Internet Explorer 8 and older browsers. The commenting system does not work either in these browsers. Sorry.

The Softvis Collection

A collection of beautiful and useful software visualizations

A Thousand Words

This essay was originally published in The ThoughtWorks Anthology, Volume 2.

Data is no longer scarce. Insight is. Twitter clocks over 36,000 tweets a minute. Tesco generates over 2 million records of transaction data every day. By the time you’re done reading this page, YouTubers will have uploaded over 20 hours of video online. And as more companies integrate their systems or embrace the semantic web, it’s becoming increasingly difficult to make sense of all the information out there.

A lot of what we do at ThoughtWorks is help our customers integrate, simplify, and capitalize on these systems and the massive amounts of data they hold, and we rely on information visualization, or infovis. Information visualization will play an increasingly important role in dealing with the dataglut. The potent combination of images, words, and numbers—done right—can provide the most meaningful insight into our data. The question is, how do we determine that magic combination? The common misconception is that this task is best left to the designers or aesthetically inclined on the team. But while creative thinking is required, there needs to be method to the designers’ madness: a structured way of approaching the visualization problem, in which form follows function. A team that employs visualizations and a visualization design process is well on its way to building more innovative and valuable software.

The goal of this essay is to help demystify information visualization and share some of the structured thinking that goes into designing visualizations. I hope it motivates you to understand infovis and develop a shared vocabulary with others to discuss and create better visualizations.

Smelling the Coffee

Much infovis work is driven by scientific and academic research. That’s fortunate, because infovis is directly linked to human visual perception, which science has learned a lot about in the last century.

What’s unfortunate is that the same can’t be said wholeheartedly about the IT industry, as its progress in this space is driven less by the objective research conducted in the field and more by the deals between software vendors and CIOs on golf courses. The IT industry has made real advances in data-centric activities like collecting, cleaning, transforming, integrating, and storing data, but it still lags behind in the human-centric analysis of that data. The brutal truth is that many of the BI tools out there won’t withstand the slightest scrutiny when evaluated from the lens of current research in visual perception, design, and effective visual communication.

To fork Daniel Pink’s famous quote, there is a big gap between what science knows and what businesses are doing. As a result, many of us have given in to the status quo of tabular applications, pie chart presentations, and paginated data. But not all of us. Stephen Few, a contemporary expert in business-infovis, says, “Few situations frustrate me more than good infovis research that could solve real problems but remains unknown and unused because it has never been presented to those who need it or has been presented only in ways that people can’t understand.”

But there are encouraging signs. Infovis is not a new field. Visually communicating information has been around for as long as humans have been telling stories. And there is a resurgence taking place right now toward leveraging infovis in new and deeper ways to address the plethora of data we’re overloaded with in this Information Age. Many industries and organizations already understand the value that infovis brings to the table and are using it to solve challenging problems.

When ThoughtWorks is brought in for assessments of large-scale IT projects facing difficulties, we first use the latest research in internal software quality measurement and infovis to diagnose the current “health” of the system. As you can imagine, information that comes in such a complex form—a spaghetti of architectures, hundreds of thousands of lines of code, and years of human decisions and implementation history—is not easy to analyze. This is why we rely on good infovis practices to excavate the insight that helps top-level decision-makers make the right choices.

Similarly, the New York Times has earned a reputation for telling insightful stories about the world by revealing patterns and relationships in the light of current events. Today’s cutting-edge retail analytics companies, such as those working with loyalty programs, use infovis extensively to help retailers “perfectly” price, promote, and place products in their stores based on consumer-buying habits and loyalty. And now with the widespread adoption of touch-screen media and small-screen ubiquitous interfaces, we’re being pushed to find even more innovative ways to visualize information.

So what are some of the winning design principles in information visualization?

Visualization Design Principles

“Evidence is evidence, whether words, numbers, images, diagrams, still or moving,” says Edward Tufte, the design and infovis guru. “The information doesn’t care what it is, the content doesn’t care what it is. It is all information.” The goal of information visualization is to help us think more productively and analyze information more efficiently. In discussing the ways to reach these goals, it’s worth keeping the following principles in mind:

The Visualization Design Process

For the most part, the process of designing visualizations touches on all parts of the software development value stream. That’s because at the end of the day, information visualization is simply the process of transforming data into an interactive visual representation, either through code or a tool. Several people have spent the time to define structured processes for creating visualizations—Pipeline Model (Agrawala), Cyclical Model (Wijk), Nested Model (Munzner). Because some of these processes were initially described in research papers, they can be unnecessarily cryptic (at least for most of us) and give the impression that their authors were trying really hard to impress the reader (one of them even uses calculus!)

Cutting through the cruft of unnecessary complexity, the essence of the visualization design processes is shown in figure fig.ali.process. Visualization design process Figure fig.ali.process

Define Domain Tasks

A good visualization always starts by articulating a need in the business’s natural domain language. Borrowing from the Agile methodology, one way to articulate those needs can be in the form of user stories, such as: “As a school teacher, I want to know how well my students are doing in my class, so that I can plan the mid-term review session accordingly.”

Task Abstraction

Obviously, knowing how well students are doing in class can be measured in many ways. I may want to determine the average performance of the class. I may want to know the extent to which those grades vary across students, topics or over time. I may want to see who has been absent when I taught things that aren’t covered in their textbook. I may want to know the topics that students are having most difficulty with. Do you see a similarity between the tasks the teacher is trying to perform and the tasks a project manager, financial analyst or anybody else performs on a daily basis? That’s because analytical tasks are usually just instances or a mix of a few well-known abstract tasks performed on one or more metrics. Most of those are:

The goal of task abstraction is to break the domain task down into a set of low-level abstract tasks/operations, preferably ranked by priority if possible. As we’ll see later, visually encoding your data effectively relies heavily on the analytical task(s) being addressed.

Data Abstraction

How many different ways can you give me the temperature of water? Taps give you two: hot and cold. You could also say it’s boiling, hot, luke warm, cold, or frozen. Or you could just call it out in degrees Fahrenheit/Celsius. How would you arrange the words “hot” and “cold” in order? How about boiling, cold, luke warm, hot, and frozen—which one comes first? Try -1°C, 10°C, and 4°C. The way in which we represent our data has a profound implication on how we cognitively process it, especially when visualized. Before we start to visually encode data, we need to understand the nature of each metric (aka data type) in it. There are three essential data types we need to know about:

Sometimes we get the choice of transforming data types from our dataset based on our task abstractions. You might be wondering why or when you would want to do that. Two common scenarios are:

  1. A task requires making some assumptions about the data set in order to aggregate information, like calculating averages and sums. For example, an Agile team might assign quantitative values to t-shirt-sized effort estimates of user stories based on geometric progression (e.g. S=1, M=2, L=4, XL=8) in order to quantify the project scope.

  2. A task does not require precision in order to be performed effectively. For example, the task of determining which employees submitted their timesheets late might not need to know how late the submissions were. This is where design perfection comes in and why it’s so important to clearly articulate your tasks. Companies like Apple get this because they share the following sentiment:

A designer knows he has achieved perfection not when there is nothing left to add, but when there is nothing left to take away. -Antoine de Saint-Exupery

Understanding your data types and choosing the right level of abstraction (i.e. data types) from your dataset based on tasks is one of the key ingredients to effectively visualizing data. For example, the following visualizations are two different ways of showing your company’s brand power (weak, medium, strong) and revenue by city. Without giving you any extra information about the bar chart on the left (A), can you guess in which city you’re generating the most revenue with the weakest brand power:

Chances are you guessed Greenville because it has the longest bar among the lightest ones. There are a couple of things at play here. First, humans intuitively associate variations in color intensity (how light or dark something is) with ordinal data (brand power). Plus, our vision system is very good at differentiating color intensity (up to a limit). Similarly, our vision system is tuned to discerning the smallest of differences in lengths of objects, which makes length a great choice for visually representing (aka visually encoding) quantitative data types—revenue in this case. If I were to switch the visual encodings around—shown in (B)—and use length for brand power and color intensity for revenue, it would be a lot more difficult, if not impossible, to perform the same task.

In fact, you can try to ask any question for this data and the first visualization will invariably do a better job at answering it, even if I gave you a detailed legend for the second one. Leveraging our visual powers to rapidly perceive information based on the task and data abstractions allows us to pack more information in less space as well as detect patterns that would otherwise have to be processed by the left side of our brain (the analytical and sequential side). This is precisely what the next step of visual encoding tries to achieve.

Visual Encoding

Simply put, visual encoding is the process of mapping data onto our visual field, usually on 2-D surfaces. Effective visual encoding requires some understanding of how vision works in the first place. As with many of the topics in this essay, I can’t do justice to the depth of this topic in a few paragraphs. But for the purpose of this essay, here’s all we need to understand: visual perception is essentially a three-stage process:

  1. Feature extraction

  2. Pattern perception

  3. Goal-directed processing

In the first stage of feature extraction, millions of neurons work in parallel to detect primitive visual features like color, form, and motion. This is best explained by a typical example: how many 3s do you see in the following set of numbers?

How long did that take you? There should be five. Now try it again:

Obviously that was a lot easier, thanks to the first stage of feature extraction during which your vision system did some of the thinking for you, also known as preattentive processing. Making something lighter or darker (i.e. changing its color intensity) is one way of visually encoding information for preattentive processing, which is why color intensity is also referred to as a “preattentive attribute.” The reason it was more difficult the first time is that the shapes of numbers are complex objects that your brain can’t preattentively process. In his book Information Visualization: Visual Thinking for Design, Colin Ware suggests 17 preattentive attributes. The most relevant ones, narrowed down by Stephen Few in Information Dashboard Design, are shown in figure fig.ali.encodings. Preattentive attributes (Stephen Few) Figure fig.ali.encodings

In the second stage of pattern perception, our brain segments the visual world into distinct regions and discovers the structure of objects and the connections between them. It’s only in the third stage that the information is processed by our brain’s attentive processes to perform the analytical task at hand.

Our goal is to make the most of the first two stages; to let the preattentive processing do some thinking on our behalf and to convey information in big visual gulps so that we can understand our data more efficiently.

Encoding for Feature Extraction

It turns out that the effectiveness of these attributes varies based on the data type you want to encode. So how do we know which ones to pick for what? I find Mackinlay’s rankings (figure fig.ali.mackinlay) easiest to elucidate the concept because they also show how the effectiveness varies by data type. Mackinlay's Rankings - (Mackinlay 1986) Figure fig.ali.mackinlay

For novices like myself, this is the holy grail of visualization design. Take a minute to explore the rankings and relate them to your own experience. As you can see, 2-D position is at the top of the list for all types. This is why traditional X-Y graphs are so effective at conveying so many kinds of information. Also notice how length and density (previously called color intensity) vary for quantitative and ordinal data types, a fact used in the example of brand power and cities given earlier.

Let’s use these rankings to evaluate a common myth about the effectiveness of pie charts for quantitative comparisons. Pie charts use area and angle as means to convey quantitative data. However, based on the rankings, both length and position trump area and angle. Let’s see for ourselves: using the pie chart, can you tell which software feature is costing you the most to build?

Probably not. It’s clearly easier to answer that question using the bar chart instead. Arguably, pie charts are effective for visualizing part-to-whole relationships, such as the fact that three of the above features cost approximately 25% each, but that’s about it. If the task at hand requires more information that’s just as important for your tasks, such as comparing or ranking the parts, you should encode your data differently.

Picking the most effective encodings relies heavily on the task at hand, so even the encoding rankings should be seen in that light. Also keep in mind that you can encode a data type with more than one visual channel to help perform the task more efficiently. For example you could encode a quantitative temperature with length, hue (blue or red), and intensity (diverging from light to dark in both directions—hot and cold.)

Encoding for Pattern Recognition

The Gestalt principles of visual perception are very useful for grouping, linking, and distinguishing information in visualizations using our vision system’s second stage of pattern recognition. For example, you might want to guide the user toward scanning information horizontally versus vertically. You could do this simply by putting slightly more vertical than horizontal spacing to help the viewer group the information preattentively. This phenomenon is explained by the Gestalt principle of proximity, since the objects placed closer to each other are perceived as belonging to the same group. Or you could just group the information with lines or borders, using the principle of enclosure. The six Gestalt principles are best explained using visual examples (figure fig.ali.gestalt): Gestalt principles of pattern perception Figure fig.ali.gestalt

The reason scatter plots are so effective in helping us spot correlation in our information is that the laws of proximity, continuity and similarity allow us to group and fill in the blanks in our data. figure fig.ali.scatterplot shows a visualization inspired by Hans Rosling’s famous TED talk on “New insights into poverty.”link Gestalt principles used to spot correlations in scatterplots Figure fig.ali.scatterplot

Notice how the arrangement of the dots appears to us naturally as a line. Despite the “noise” from variance, we’re still able to visually establish some correlation between income per person and life expectancy at birth. The correlation is even more evident when the visualization is played as a motion chart, showing change over time. If you haven’t already, I highly recommend watching the talk and asking yourself if you could tell the same story as effectively with just numbers and words.

Evaluate and Refine

As with any good software development process, feedback loops in the process are critical. It’s beyond the scope of this essay to delve into the feedback process for software in general. But in infovis, some testing can be conducted fairly objectively with users. Some things to keep in mind:

Visualization Design Patterns

Now that we have a basic understanding of the low-level visualization design process, it’s worth taking a look at a small sample of common visualizations for common tasks and get a feel for what’s out there. I’m using the word “design patterns” loosely here, from the perspective that they provide a reusable visualization skeleton for common tasks and data sets by providing some predefined encodings, and leave the rest up to the designer.

As with any design pattern, I want to caution you that there is no silver bullet. Before using any of these, it is imperative to articulate the tasks and let those tasks guide you. It can be very tempting to slap a beautiful-looking horizon graph for a small data set on your page when all you need is a simple time series chart. Similarly, always ask yourself if a table can help perform some task (e.g. lookups) more effectively.

Exploring Change over Time

Asking our data if something is increasing, decreasing, or remaining stable over time is a very common task. In general, lines do a good job of visually representing time because they adhere to our general perception of time being a continuous entity. Changes through time can be detected by the (preattentively processed) angle of upward and downward slopes that give the data shape.

Horizon Graph image ©2012 Panopticon Software AB Figure fig.ali.horizongraph

Sparklines used in Google Analytics dashboard Figure fig.ali.ganalytics

Exploring Correlations

We’re not always trying to measure change over time. Sometimes we need to explore the correlation between nominal and ordinal data types. Often times the analysis involves many more conditions and variables too (aka multivariate data). Two commonly used patterns for this are:

Heat matrix showing educational performance of Australian students by state/territory (tessera.com.au) Figure fig.ali.heatmatrix

Exploring Hierarchical and Part-to-whole Relationships

Thinking in terms of hierarchies and part-to-whole relationships is something we’re all good at because the physical world is filled with analogies for our mind to use. Your cell phone’s battery doesn’t drain like liquid in a container. Your computer’s folders of folders and files aren’t actually organized neatly that way on your hard drive. But the metaphors make it easy for us to work with the information. Likewise, visualizing information this way is just as helpful, especially for the tasks of clustering, finding anomalies and set operations. Aside from pie charts, two useful patterns for exploring part-to-whole relationships and hierarchical data are:

Bullet graph Figure fig.ali.bulletgraph

Exploring Connections and Networks

If you think about it, hierarchies and part-to-whole relationships express a specific type of connection between two or more things. For visualizing arbitrary interconnected relationships (including hierarchies and part-to-whole relationships), network graphs are a great option.

Edge bundling (Danny Holten 2006) Figure fig.ali.edgebundling

Heatmap showing changes stocks on a given day Figure fig.ali.heatmap

There are many other useful patterns in the wild. For example, you’ll find Heatmaps commonly used in the financial services industry to view the activity of the stock market in real time, as shown in figure fig.ali.heatmap. Keep an eye out for these patterns. Knowing their objectives and deconstructing the encodings and how they work will help you tailor/modify them for your own purpose. For the patterns I mentioned, you’ll find many tools and frameworks out there to will help you to quickly implement them too.

Tools and Frameworks

It’s becoming increasingly important, and easy, for us to implement our own custom visualizations and the more common patterns described above. I believe that the ease with which people can implement their own visualizations will play a key role in the adoption of good visualization design practices.

Visualization Libraries

Browsers are getting better at accommodating visually heavy interfaces, and the tools for making such interfaces are evolving in step. The things you can do with HTML5 canvas and JavaScript were once a perk that only Flash and Java Applets enjoyed, but now the landscape is changing. Some of those libraries are listed below. The best way to learn about your options is by visiting their websites, all of which have demo “galleries” and sample code:

Graphical Tools

When it comes to graphical tools, my observation is that there aren’t as many generic infovis tools as there are more pattern-driven tools that let you create very specific visualizations. So you obviously don’t get as much flexibility here as you do with the infovis libraries discussed above. My experience with and knowledge of graphical tools is quite limited, so I can only recommend the following:

The landscape of tools and frameworks is evolving fairly quickly and will easily date this essay. Whatever tool or framework you use, try your best to stick to the basic concepts and avoid getting distracted by seemingly cool features that add unnecessary bells and whistles. I often find myself stripping away many default properties just so I can build the simplest visualization that works.

Wrapping Up

Infovis is a broad and re-emerging field and we’ve only scratched the surface in this essay. Nonetheless, I hope you got a feel for the depth of this topic and the significant value it can bring to today’s projects and organizations.

If you’re the kind who tends to avoid visual design, remember that there is an objective side to design as well, as outlined in this essay. You don’t need to be a Monet to be effective with and opinionated about visual communication. What you do need is to educate yourself on the objective facets of design and develop an appreciation for the subjective facets. For starters, I highly recommend reading books and articles by Stephen Few, Edware Tufte, and Colin Ware. Find a recurring analytical task that involves sifting through boring or multivariate data and give the visualization design process a shot.

The 21st century will continue to see unprecedented growth in information, especially as we find better ways of connecting disparate information. We’re seeing it already with more blogs than we can follow, more articles than we can digest, more trends than we can keep up with, and more emerging markets than we can tap.

So whether you’re trying to understand your customers, make better-informed decisions for your organization, or convey a message to society, remember that you’re really just capturing and presenting the information around you. You’re trying to tell a story. In light of this fundamental premise of communication, try not to forget how much a picture is really worth.

Starting commenting system. This can take a moment.