Data presentation fascinates me because it's both art and science. There's no right way to do it; it depends on hard data, good intentions, and interpretive ability. Data can be manipulated and misinterpreted, both honestly and dishonestly. And any chart is potentially yet another step removed from whatever "truth" the hard data has.
Where I'm going isn't exactly technical, but there's no point here other than data presentation and honest graph making (and also crime being f*cking up in Baltimore after the riots, but that's not my main point). If that doesn't interest you, stop here. [Update: Or jump to the next post.]
I took reported robberies (all), aggravated assaults, homicides, and shootings from open data from 2012 to last month. I then took a simple count of how many happen per day (which is strangely not simple to analyze, at least with my knowledge of SPSS and Excel). You get this.
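For anyone who wants to try the per-day count outside SPSS or Excel, here is a minimal sketch in Python. It is not how the charts in this post were made. The function name `daily_counts` and the sample dates are made up for illustration, and it assumes each incident has already been reduced to a calendar date. The one wrinkle it handles is that a day with no incidents has no rows at all, so it has to be filled in as zero explicitly.

```python
from collections import Counter
from datetime import date, timedelta

def daily_counts(incident_dates, start, end):
    """Return (day, count) for every day from start to end, inclusive.

    Days with no incidents are included with a count of 0.
    """
    tally = Counter(incident_dates)  # missing days look up as 0
    n_days = (end - start).days + 1
    days = [start + timedelta(days=d) for d in range(n_days)]
    return [(day, tally[day]) for day in days]

# Made-up example: three incidents on Jan 1, none on Jan 2, one on Jan 3.
incidents = [date(2015, 1, 1)] * 3 + [date(2015, 1, 3)]
counts = daily_counts(incidents, date(2015, 1, 1), date(2015, 1, 3))
print([c for _, c in counts])
```

Without the zero fill, a quiet day simply vanishes from the series and every average taken afterward is slightly too high.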
It takes a somewhat skilled eye to see what is going on. Also, since the day of the riot is so high (120), the y-axis is too large. With some rejiggering and simply letting that one day go off the scale unnoticed, you get this.
It's still messy, but it's the kind of thing you might see on some horrible PowerPoint. Things bounce up and down too much day-to-day. And there are too many individual data points. Nobody really cares that there were more than 60 one day in July 2016 and fewer than 5 in early 2016 (I'm guessing blizzard). It's true and accurate, but it's a bad chart because it does a poor job of what it's supposed to do: present data. Again, a skilled eye might see there's a big rise in crime in 2015, but the chart certainly doesn't make it easy.
Here's crimes per day, with a two-week moving average. A moving average means that for, say, September 7, you add up the daily counts for Sep 1 through Sep 14 and divide by 14. Why take an average at all? Because it smooths out the chart in a good way. It's a little less accurate literally but much more accurate in terms of what you, the reader, can understand. One downside is that the number of crimes listed for September 7th isn't actually the number of major crimes that happened on that day. You can see why that might be a big deal in another context. But here it isn't.
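The September 7 example can be sketched in a few lines of Python. This is an illustration under assumptions, not the actual spreadsheet work: the function name `moving_average` and the daily counts are invented, and the window is taken to be 6 days before plus 7 days after, which is what "Sep 1 through Sep 14" works out to for Sep 7.

```python
def moving_average(counts, before=6, after=7):
    """Average each day with `before` days prior and `after` days following.

    Days too close to either end to have a full window get None.
    """
    window = before + 1 + after
    out = []
    for i in range(len(counts)):
        if i < before or i + after >= len(counts):
            out.append(None)  # not enough data on one side
        else:
            out.append(sum(counts[i - before : i + after + 1]) / window)
    return out

# Made-up counts for Sep 1 through Sep 14; index 6 is Sep 7.
september = [30, 42, 25, 38, 33, 29, 41, 36, 27, 35, 31, 40, 28, 34]
smoothed = moving_average(september)
print(september[6], smoothed[6])  # 41 33.5
```

The invented Sep 7 had 41 crimes but gets plotted at 33.5, which is the "less accurate literally" trade-off. With only 14 days of data, Sep 7 is also the only day that gets a value at all.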
For a general audience it's not clear what exactly the point is. You still have lots of little ups and downs, and the seasonal changes are an issue. (Crimes always go up in summer and down in winter. And it's not because of anything police do. And it's nothing to do with the non-fiction story I'm trying to tell.) On the plus side, you do see a big spike in late April 2015, after the riots and the absurd criminal prosecution of innocent Baltimore cops. But it needs explaining.
Also, you need some buffer for the data. The bigger the average, the more of a buffer you need. But for this I think this is one perfectly fine way to present these data, at least for an academic crowd used to charts and tables.
Another tactic is to take the average for the past year. Jeff Asher on Twitter over at 538.com does good work with NOLA crime and is a fan of this. It totally eliminates seasonal issues (that's huge) and gives you a smooth line of information (and that's nice).
You can see a drop in crime pre-riot (true) and a rise in crime post-riot (also true). That's important. Baltimore saw a drop in crime pre-2015 that wasn't seasonal. It was real. And the rise afterward is very real. But there are two problems with this approach: 1) you need a year of data before you get going and 2) everything is muted. What looks like a steady rise (the slope since 2015) is actually a huge rise. It looks less severe than it is because each point is an average of the previous year. And a steady rise isn't exactly what happened. Crime went up on April 27, 2015. And basically stayed up, with a slight increase over time.
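Both problems show up in a small Python sketch with invented numbers. The function name `trailing_average` and the 30-then-45 series are assumptions for illustration, not Baltimore data: crime sits at 30 a day for two years, then jumps overnight to 45 and stays there.

```python
def trailing_average(counts, window=365):
    """Average of each day and the window-1 days before it.

    Returns None until a full window of data exists.
    """
    out = []
    running = 0
    for i, c in enumerate(counts):
        running += c
        if i >= window:
            running -= counts[i - window]  # drop the day that fell out
        out.append(running / window if i >= window - 1 else None)
    return out

# Made-up series: 30 a day for two years, then 45 a day for one year.
daily = [30] * 730 + [45] * 365
yearly = trailing_average(daily)

print(yearly[363])  # None: still inside the first year of data
print(yearly[729])  # 30.0: last day before the jump
print(yearly[759])  # 30 days after the jump, still close to 30
print(yearly[-1])   # 45.0: a full year later the line finally catches up
```

The jump in the made-up data is instant, but the yearly line takes 365 days to climb from 30 to 45, which is why a sudden change reads as a gentle slope.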
Here's my problem. I want to show the rise in crime post-riot. But I want to do so honestly and without deception. But yes, for the purpose of this data presentation, I have a goal. (My previous attempts were pretty shitty.)
Also, you need at least a year of data before you can graph anything. That's a downside.
Here's my latest idea. If one is looking at a specific date at which something happened -- in this case April 27, 2015 -- and trying to eliminate seasonal fluctuations, why not use the yearly average for the previous year for dates before that date, and the yearly average after for dates after it? That way no single point averages across the date in question. I think it's kosher, but I'm not certain.
Here's how that works out:
This shows the increase that was real and immediate. And as a minor point I like the white line on the day of the riot, which I got from removing April 27 from the data (because it was an outlier).
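Here is one way to read that idea as Python, strictly as a sketch. The function name `split_average` and the numbers are invented, and the choice of a backward-looking year before the date and a forward-looking year after it is an assumption about what "yearly average" means on each side. The event day itself is dropped, which is what produces the gap in the line.

```python
def split_average(counts, event, window=365):
    """Trailing average before `event`, forward average after it.

    counts: one number per day; event: index of the event day.
    The event day is left out (None), as is any day without a
    full window on the side it draws from.
    """
    out = [None] * len(counts)
    for i in range(len(counts)):
        if i < event and i - window + 1 >= 0:
            # look back one year; never reaches past the event
            out[i] = sum(counts[i - window + 1 : i + 1]) / window
        elif i > event and i + window <= len(counts):
            # look forward one year; never reaches back before the event
            out[i] = sum(counts[i : i + window]) / window
    return out

# Made-up series: 30 a day, one outlier day at 120, then 45 a day.
daily = [30] * 730 + [120] + [45] * 730
line = split_average(daily, event=730)

print(line[729], line[730], line[731])  # 30.0 None 45.0
```

In this made-up case the line steps from 30 to 45 across the gap instead of ramping up over a year. Under this reading the last year of the chart would have no values, since there isn't a full year ahead of those days yet; that edge would need its own decision.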
Now if I wanted to show the increase in more stark form, I would move the y axis to start at 20. But being the guy I am, I always like to have the y-axis cross the x-axis at 0. That said, if the numbers were higher and it helped the presentation of data, I have no problem with a y-axis starting at some arbitrary point.
Take into account that graphs are like maps. While very much based on truth, they exist to simplify and present selected data. I mean, you can have my data file, if you want it. But I do the grunt work so you don't have to. But of course my reputation as an academic depends on presenting the data honestly, even though there's always interpretation (e.g., in the case of a map, the world, say scientists, isn't flat). The point, rather, is whether the interpretation is honest and/or whether the distortion serves a useful purpose. (In the case of the Mercator Projection it was sea navigation; captains didn't give a shit about the comparative size of the landmass of Greenland and Africa.)
So taking an average smooths out the line of a chart, which is a small step removed from the "truth," but a good step toward a better chart. It's not a bad approach. But it tends to mask quick changes in a slow slope, since each data point is the average for a lot of days. A change in slope in the graph actually indicates a rather large change in day-to-day crime. There are always pluses and minuses.
If you're still with me, here's what you get when just looking at murder. Keep in mind everything up to this point has been the same data on the same time frame. This is different. But homicides matter because, well, along with people being killed, they've gone up much more than reported crime.
[My data set for daily homicides (which is a file I keep up myself rather than one from Baltimore Open Data) only goes back to January 2015. So I don't have the daily homicide count pre-2015. 2014 is averaged the same for every day (0.5781). This makes the first part of the line (pre April 27, 2015) straighter than it should be. This matters, and I would do better for publication, but it doesn't change anything fundamentally, I would argue. At least not in the context of the greater change in homicide. Even this quick and imperfect method gets the major point across honestly.]
Update and spoiler alert: Here's a better version of that chart, from my next post.