A method to extract exact values from printed charts and graphs

Chart from “Corrupt Research” showing the growth of NHST usage in management and marketing journals between 1945 and 2007.

It is often the case that data needs to be extracted from a printed chart or graph. One possibility is to contact the author of the work and ask whether they can share the underlying data. However, this is not always feasible, and authors are not always willing to provide the data.

A recent example from our own work is the chart at right, which comes from the book “Corrupt Research: The Case for Reconceptualizing Empirical Management and Social Science” by Raymond Hubbard. The chart shows the increase in the use of null hypothesis significance testing (NHST) in the management and marketing sciences (the data was generated by picking several top management and marketing journals and counting the number of articles referencing NHST in a single issue for each year between 1945 and 2007).

We wanted to add this chart to our site (built in Squarespace), but we did not have access to the underlying data. Instead, we used a method to manually extract the data, which we describe below.

We first learned of this method from “A Longitudinal Examination of Net Promoter and Firm Revenue Growth” by Timothy L. Keiningham, Bruce Cooil, Tor Wallin Andreassen, and Lerzan Aksoy. The authors describe a tracing process to extract NPS data from charts created by Frederick Reichheld. The relevant passage is reproduced below with added emphasis.

In addition to the NCSB examination we reported previously, we attempted to analyze the Net Promoter and firm growth data used by Reichheld (2003) and Satmetrix (2004) with other variables used in their investigation. Without access to their raw data, it is impossible to compare Net Promoter with the other satisfaction/loyalty metrics collected in their surveys. However, there is an opportunity to compare Net Promoter with the ACSI, a metric that was investigated and found not to correlate with growth (Reichheld 2004)…

Given that a Bain team examined the relationship between the ACSI and growth, it appears reasonable to presume that a comparison of the ACSI and Net Promoter was conducted. Fortunately, actual data reported in The Ultimate Question (Reichheld 2006c) offer such an opportunity. The Appendix to the book presents charts for six industries to demonstrate the relationship between Net Promoter and firm growth (four U.S. industries, one U.K. industry, and one Korean industry). Three of the U.S. industries are also tracked by the ACSI: airlines, life insurance, and computers. To present a fair comparison between the ACSI and Net Promoter, we replicated the data. On the basis of their scatterplots, we reconstructed data that showed the relationship between Net Promoter scores and growth (Reichheld 2006c, pp. 192–94). To ensure accuracy, we enlarged and scanned the tables and imported the corresponding graphics into a charting software package in which they were used as background images. Over the images, we created a new scatterplot using the exact dimensions of the scanned image. We input data until each point in the chart corresponded to the scanned images. As a final check of the data, we compared the coefficient of determination (R-squared) of the recreated data with the reported R-square. All R-square values were the same, indicating a successful replication of the data.

A similar method can be used to extract data from line and bar charts. One suitable option is to simply trace the graph in Adobe Illustrator or a similar program. The traced graph can then be positioned in an existing Adobe InDesign spread for a print or web report. Tracing is sufficient if the graph only needs to be reproduced in a report and referenced, without needing the underlying data points. However, it is sometimes necessary to obtain the underlying raw data. For example, a researcher may want to use the raw data as a weighted component of a larger index, perform statistical analysis on the original data, or reproduce the chart in a program that requires raw data. Our use case for the graph above fell into the latter category: this website is built using Squarespace, and we wanted to use its native charting functionality, which requires raw data points to be entered before the graph is produced.

The first step of the data extraction process is to lay down a grid. A screenshot of the grid overlay from “Corrupt Research” is shown below. A grid can be created by placing a vertical line on the first and last x-axis data points and having Adobe Illustrator distribute equally spaced lines between them, one for each intermediate data point. In our experience, however, each gridline needs manual adjustment so that it passes more precisely through the center of the charted data points.
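
For those who prefer to sanity-check the spacing numerically rather than rely on Illustrator’s distribute tool, the short sketch below computes equally spaced gridline positions; the pixel coordinates and point count are hypothetical and stand in for values read off your own screenshot.

```python
# Minimal sketch of the gridline spacing, assuming we know the pixel
# x-positions of the first and last charted data points and how many
# data points the series contains (all numbers here are hypothetical).
x_first, x_last = 52.0, 618.0   # pixel x of first and last data points
n_points = 63                   # e.g. one point per year, 1945-2007

step = (x_last - x_first) / (n_points - 1)
grid_x = [x_first + i * step for i in range(n_points)]

print(grid_x[:3])  # first few positions; each usually still needs a manual nudge
```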

In the past our process involved laying down a very high-density grid, as in the image below from Google’s “It’s Lit!” report (using horizontal lines rather than vertical). The purpose of the fine grid was to extract very precise data, down to two or even three decimal places. However, a grid of that density is cumbersome to work with in practice. We have since moved to a less dense base grid, with additional density added on an ad hoc basis when more precision is needed.

Fine grid overlay used to extract data from Google’s “It’s Lit!” Gen Z report.


Grid overlay chart from “Corrupt Research.”


A single horizontal line is then used, as seen in the grid overlay in the “Corrupt Research” example. The horizontal line is moved from data point to data point and positioned manually until it passes as closely as possible through the center of the data point. The x-axis component is easy to record: simply note which vertical gridline the current data point sits on.

The y-axis component is recorded by noting where the single horizontal line intersects the y-axis. We aid this process by making small horizontal tick marks at each major axis point. In the “Corrupt Research” example that means 100 tick marks, one for each percentage point.
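
The same reading can be framed as a simple linear interpolation between two known reference points on the axis. The sketch below illustrates that arithmetic; it is not part of our actual workflow, and all pixel coordinates are hypothetical.

```python
# Sketch of the pixel-to-value conversion implied by the manual reading,
# assuming we know the pixel positions of two reference values on the axis
# (all pixel coordinates below are hypothetical).

def pixel_to_value(pixel, pixel_lo, pixel_hi, value_lo, value_hi):
    """Linearly interpolate a pixel coordinate onto the axis scale."""
    frac = (pixel - pixel_lo) / (pixel_hi - pixel_lo)
    return value_lo + frac * (value_hi - value_lo)

# y-axis: 0% sits at pixel 540 and 100% at pixel 40 (screen y grows downward)
y_value = pixel_to_value(pixel=456, pixel_lo=540, pixel_hi=40, value_lo=0, value_hi=100)
print(round(y_value, 1))  # -> 16.8, the percentage read for this data point
```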

As we progress from left to right (or vice versa), each vertical line is changed from red to green to make it easy to keep track of the data point currently being examined.

To add an additional decimal place of precision, another set of 10 lines can be added between each major axis point, as seen below. The horizontal line below would then be read as 16.75%. The additional 0.05% comes from the fact that the horizontal line passes roughly midway between two of the minor gridlines.

Detail of the minor gridlines used to read a value of 16.75%.
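
To make the arithmetic of that reading explicit, the sketch below reassembles the 16.75% value from the major gridline, the minor gridline spacing, and the half-step adjustment; the specific index values are assumptions chosen to match the example.

```python
# Sketch of how the 16.75% reading is assembled, assuming the nearest major
# gridline below the point is 16%, minor gridlines are 0.1% apart, and the
# horizontal line falls roughly midway between the 7th and 8th minor lines.
major = 16.0        # nearest major gridline below the data point (%)
minor_step = 0.1    # spacing of the 10 minor gridlines (%)
minor_index = 7     # last minor gridline the horizontal line clearly clears

reading = major + minor_index * minor_step + minor_step / 2  # half-step bump
print(round(reading, 2))  # -> 16.75
```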

As noted in the NPS article, after the raw data has been extracted, the chart can be independently reproduced from that data and overlaid on top of the original to check for consistency. If additional statistical summaries of the original data are reported, such as R-squared, they can also be reproduced and compared to the originally reported values. The graph below shows our reproductions, built from the traced raw data, overlaid on the original chart from “Corrupt Research.” The two reproduced lines match the original almost exactly.
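
A minimal sketch of that check is shown below: it refits the traced points, recomputes R-squared, and compares it with the reported value. The data and the reported R-squared here are hypothetical placeholders, not values from either book.

```python
# Sketch of the consistency check from the NPS paper: recompute the
# coefficient of determination from the traced points and compare it
# with the value printed on the original chart (numbers are hypothetical).
import numpy as np

traced_x = np.array([10.0, 25.0, 40.0, 55.0, 70.0])  # traced x values
traced_y = np.array([1.0, 4.2, 5.8, 9.4, 11.0])      # traced y values
reported_r2 = 0.99                                    # R² printed on the chart

slope, intercept = np.polyfit(traced_x, traced_y, 1)  # simple linear fit
predicted = slope * traced_x + intercept
ss_res = np.sum((traced_y - predicted) ** 2)
ss_tot = np.sum((traced_y - traced_y.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(f"recreated R² = {r2:.2f} vs. reported {reported_r2:.2f}")
```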

In the end we created a single series that was a weighted average of the management and marketing trends in NHST usage (last chart below). Squarespace currently has a rather small limit on the number of data points, so we had to plot the data using every third year.
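
The sketch below illustrates how such a combined series can be built and thinned to every third year; the yearly values and the weights are hypothetical stand-ins for our traced data.

```python
# Sketch of building the combined series and thinning it for Squarespace,
# assuming yearly traced percentages for management and marketing and a
# hypothetical weighting (all numbers here are illustrative only).
management = {1945: 2.0, 1946: 2.5, 1947: 3.1, 1948: 3.6}
marketing  = {1945: 1.0, 1946: 1.4, 1947: 2.0, 1948: 2.3}
w_mgmt, w_mktg = 0.5, 0.5  # hypothetical weights

combined = {year: w_mgmt * management[year] + w_mktg * marketing[year]
            for year in management}

# Squarespace caps the number of data points, so keep only every third year.
every_third_year = {year: value for year, value in combined.items()
                    if (year - 1945) % 3 == 0}
print(every_third_year)  # -> {1945: 1.5, 1948: 2.95}
```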


Growth of Significance Test Usage in Management and Marketing Articles