Tech Corner
The HRDAG Tech Corner is where we collect the deeper and geekier content that we create for the website. The entries below introduce each of the Tech Corner posts.
Sifting Massive Datasets with Machine Learning
- Indexing selectors from a collection of chat messages
Important evidence about serious human rights abuses may be hidden among a much larger volume of mundane material within the communications and records of the perpetrators. Analysts can’t always review all available records because there are simply too many of them. It turns out that we can use machine learning to identify the most relevant materials for the analysts to review as they investigate and build a case.
In this post, we describe how we built a model to search hundreds of thousands of text messages from the phones of the perpetrators of a human rights crime. These are short messages, often containing slang, variant spellings, and other unconventional language, most of which are irrelevant to the crime we are investigating. Our solution required modeling for different categories of interest to analysts, and dealing with high levels of sparsity in both labels and features.
Read the full post here: Indexing selectors from a collection of chat messages.
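To give a flavor of the approach, here is a minimal sketch, not the code from the post: character n-gram TF-IDF features tolerate slang and variant spellings, and a per-category classifier ranks unreviewed messages so analysts see the likeliest hits first. The messages and labels below are invented.

```python
# Minimal sketch (not HRDAG's code): score short, messy chat messages as
# relevant/irrelevant for one category of interest, using character n-grams
# to tolerate slang, abbreviations, and variant spellings.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Invented labeled examples: 1 = relevant to the category, 0 = not.
messages = ["meet at the usual spot 9pm", "lol k", "bring the stuff tmrw", "happy bday!!"]
labels   = [1, 0, 1, 0]

model = Pipeline([
    # char_wb n-grams are robust to misspellings and unconventional language
    ("tfidf", TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))),
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
])
model.fit(messages, labels)

# Rank unreviewed messages so analysts see the most likely hits first.
new_msgs = ["u got the stuff?", "gm have a nice day"]
scores = model.predict_proba(new_msgs)[:, 1]
for msg, p in sorted(zip(new_msgs, scores), key=lambda t: -t[1]):
    print(f"{p:.2f}  {msg}")
```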
Database Deduplication
- Database Deduplication to Identify Victims of Human Rights Violations
part 1 | part 2 | part 3 | part 4 | part 5
In our work, we merge many databases to figure out how many people have been killed in violent conflict. Merging is a lot harder than you might think.
Many of the database records refer to the same people; that is, the records are duplicated. We want to identify and link all the records that refer to the same victims so that each victim is counted only once, and so that we can use the structure of overlapping records to do multiple systems estimation.
Merging records that refer to the same person is called entity resolution, database deduplication, or record linkage. For definitive overviews of the field, see Herzog, Scheuren, and Winkler, Data Quality and Record Linkage Techniques (2007), and Peter Christen, Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection (2012).
Database deduplication has been an active research area in statistics and computer science for decades. If records lack a unique identifier, like a social security number, finding the same person among many records may be hard because records that refer to the same person may have slightly different information. Sometimes names are spelled a little bit differently, sometimes birth or death dates are slightly different, sometimes different people have the same names, and so forth. In our work studying lists of people killed in violent conflicts, the records usually have imprecise information.
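To make that concrete, here is a toy comparison, my illustration with invented names rather than anything from our data: two records for the same person that an exact-match join would miss, so record linkage scores how similar the fields are instead.

```python
# Tiny illustration (not HRDAG's code) of why matching is hard: two records
# for the same person rarely agree exactly, so we score similarity instead
# of testing equality.
from difflib import SequenceMatcher

record_1 = {"name": "Mohammed al-Hassan", "death_date": "2013-05-01"}
record_2 = {"name": "Muhammad Al Hasan",  "death_date": "2013-05-02"}

name_similarity = SequenceMatcher(None, record_1["name"].lower(),
                                  record_2["name"].lower()).ratio()
print(round(name_similarity, 2))   # high, but not 1.0 -- an exact-match join would miss this pair
```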
Database deduplication is also hard because there can be a lot of records involved. In our work on homicides in Colombia 2003-2011, we processed over 460,000 records. In our current work on Syria, we’re managing about 360,000 records. It’s a lot of data.
We’ve written about our database deduplication framework before. In 2013, Jule Krüger wrote this post, Amelia Hoover Green wrote this post, and Anita Gohdes wrote this post.
Database deduplication is so hard that in my five-part post, I’m only going to talk about the first step in record linkage, called blocking or indexing. In blocking, the goal is to reduce the number of records we consider as possible pairs so that we can calculate the likely matches without running for weeks. I show how we reduce the comparisons among the Syria datasets from 65 billion possible pairs of records to about 43 million pairs, a more than thousand-fold reduction. It’s a very geeky post in which I dig into the technical details, link to some of the key academic and applied work in this area, and show actual code (in Python and pandas) that does the work. It’s written in a Jupyter notebook, which is a computing environment I’m in love with right now. Let’s dig in!
part 1 | part 2 | part 3 | part 4 | part 5
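As a rough sketch of the blocking idea, not the actual pipeline from the posts and with hypothetical column names, the pandas snippet below groups records by a cheap key, the first letter of the name plus the year of death, and generates candidate pairs only within each block.

```python
# Minimal sketch of blocking (not the code from the posts): generate
# candidate pairs only within blocks, instead of comparing every record
# against every other record. Column names and values are hypothetical.
from itertools import combinations
import pandas as pd

records = pd.DataFrame({
    "record_id": [1, 2, 3, 4, 5],
    "name":      ["Ahmad Khalil", "Ahmed Khalil", "Sara Haddad", "Samer Haddad", "Ahmad K."],
    "date":      pd.to_datetime(["2013-05-01", "2013-05-02", "2013-05-01", "2013-06-10", "2013-05-01"]),
})

# A cheap blocking key: first letter of the name plus the year of death.
records["block"] = records["name"].str[0].str.upper() + "-" + records["date"].dt.year.astype(str)

# Candidate pairs are formed only inside each block.
pairs = [
    (a, b)
    for _, grp in records.groupby("block")
    for a, b in combinations(sorted(grp["record_id"]), 2)
]
print(pairs)   # [(1, 2), (1, 5), (2, 5), (3, 4)] -- 4 candidate pairs instead of all 10 possible pairs
```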
- Clustering and Solving the Right Problem
In our database deduplication work, we’re trying to figure out which records refer to the same person, and which other records refer to different people.
We write software that looks at tens of millions of pairs of records. We fit a model that assigns each pair a probability that the two records refer to the same person. This step is called pairwise classification.
However, there may be more than just one pair of records that refer to the same person. Sometimes three, four, or more reports of the same death are recorded.
So once we have all the pairs classified, we need to decide which groups of records refer to the same person; together, the records that refer to a single person are called a cluster.
There may be 1, 2, or lots of records in a cluster. But here’s a complication: if record A matches record B, and record B matches record C, do all three (A, B, C) match? When you look at the cluster, you’ll find that maybe they do, and maybe they don’t.
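To see why, here is a small sketch, my illustration rather than HRDAG’s code, of the naive clustering rule: treat every matched pair as an edge and take connected components, i.e. the transitive closure. It puts A, B, and C in one cluster even though A and C were never directly matched, which is exactly the judgment call the post examines.

```python
# Sketch (not HRDAG's code): cluster records by transitive closure of the
# pairwise matches, i.e. connected components of the "match" graph.
def cluster(pairs):
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path compression
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    for a, b in pairs:
        union(a, b)

    clusters = {}
    for x in list(parent):
        clusters.setdefault(find(x), set()).add(x)
    return list(clusters.values())

# A matches B and B matches C, but A was never directly matched to C:
matched_pairs = [("A", "B"), ("B", "C"), ("D", "E")]
print(cluster(matched_pairs))   # A, B, and C end up in one cluster; D and E in another
```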
Principled Data Processing
- The task is a quantum of workflow
This post describes how we organize our work over ten years, twenty analysts, dozens of countries, and hundreds of projects: we start with a task. A task is a single chunk of work, a quantum of workflow. Each task is self-contained and self-documenting; I’ll talk about these ideas at length below. We try to keep each task as small as possible, which makes it easy to understand what the task is doing, and how to test whether the results are correct.
In the example here, I’ll describe work from our Syria database matching project, which includes about 100 tasks. I’ll start with the first thing we do with files we receive from our partners at the Syrian Network for Human Rights (SNHR).
- .Rproj Considered Harmful
An interactive programming environment, such as RStudio or Jupyter Notebooks, is an indispensable tool for a data analyst. But code prototyped in such environments, left unedited, can be brittle, difficult to maintain, or come to depend on some hidden aspect of the context in which it was developed. Our projects tend to go on for significant amounts of time and involve multiple programmers working in different languages. Furthermore, given the subject matter, we know our calculations and assumptions will be closely scrutinized.
So we aim to produce code that is clear, replicable across machines and operating systems, and that leaves an easy-to-follow audit trail allowing us to review every step in the data processing pipeline. In this post, we look at several illustrations of how code prototyped in an interactive environment falls short of those standards, and how to make such code production-ready.
Read the full post here: .Rproj Considered Harmful
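The post’s examples are in R and RStudio; purely as a language-agnostic illustration of the same principle, the Python sketch below replaces hidden interactive state (a working directory set by hand, objects left over in the session) with explicitly declared inputs and outputs. The file names and cleaning step are hypothetical.

```python
# Illustration only (the post's examples are in R): one habit that makes
# prototype code reproducible is passing inputs and outputs explicitly,
# instead of relying on an interactive session's working directory.
import argparse
import pandas as pd

def main():
    parser = argparse.ArgumentParser(description="toy cleaning step")
    parser.add_argument("--input", required=True, help="path to raw CSV")
    parser.add_argument("--output", required=True, help="path to cleaned CSV")
    args = parser.parse_args()

    raw = pd.read_csv(args.input)
    cleaned = raw.dropna(how="all").drop_duplicates()
    cleaned.to_csv(args.output, index=False)

if __name__ == "__main__":
    main()
```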
- From Scripts to Projects: Learning A Modular, Auditable And Reproducible Workflow
The file structure of each task is a microcosm of the entire project. Inputs enter the input folder, are acted upon by the code in the src folder, and are written to the output folder. In the same way, projects start with inputs, which are acted upon by the tasks in the workflow and emerge as the project’s overall results. Breaking work into tasks facilitates the team’s collaboration across time zones and programming languages: one task could be performed in Python and another in R, and neither analyst would even have to know. The task structure also makes projects auditable and reproducible. Anyone looking at the code can determine exactly how the output of a given task was produced, and anyone picking up a task at any time can recreate the same output.
Read the full post here: From Scripts to Projects: Learning A Modular, Auditable And Reproducible Workflow
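As a rough sketch of what a single task’s code can look like, with file and column names that are hypothetical rather than taken from any project: everything the script reads comes from the task’s own input folder and everything it writes goes to its own output folder, so the task can be rerun and audited in isolation.

```python
# Hypothetical src/clean.py inside a task folder. The task reads only from
# its own input/ directory and writes only to its own output/ directory,
# so it can be rerun and checked in isolation.
from pathlib import Path
import pandas as pd

TASK = Path(__file__).resolve().parent.parent   # the task directory: input/, src/, output/
INPUT = TASK / "input" / "records.csv"          # hypothetical file name
OUTPUT = TASK / "output" / "records-clean.csv"

def main():
    records = pd.read_csv(INPUT)
    clean = records.drop_duplicates()
    OUTPUT.parent.mkdir(parents=True, exist_ok=True)
    clean.to_csv(OUTPUT, index=False)

if __name__ == "__main__":
    main()
```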
Multiple Systems Estimation
- Using MSE to Estimate Unobserved Events
At HRDAG, we worry about what we don’t know. Specifically, we worry about how we can use statistical techniques to estimate homicides that are not observed by human rights groups. Based on what we’ve seen studying many conflicts over the last 25 years, what we don’t know is often quite different from what we do know.
The technique we use most often to estimate what we don’t know is called “multiple systems estimation.” In this medium-technical post, I explain how to organize data and use three R packages to estimate unobserved events.
Read the full post here: Computing Multiple Systems Estimation in R.
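For intuition only, since the post itself uses R packages and more than two lists, the simplest two-list case estimates the total from the overlap between two independent lists via the Lincoln-Petersen estimator. The numbers below are made up.

```python
# Toy illustration of the two-list case (the post uses R packages and more
# than two lists). All counts here are invented.
n_A  = 300    # deaths documented by list A
n_B  = 200    # deaths documented by list B
n_AB = 60     # deaths appearing on both lists (found via record linkage)

# Lincoln-Petersen estimator: if the lists are independent, the fraction of
# A's records that B also captured estimates B's overall coverage.
N_hat = n_A * n_B / n_AB
print(N_hat)                         # 1000.0 estimated total deaths
print(N_hat - (n_A + n_B - n_AB))    # 560.0 deaths estimated but never documented by either list
```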