How to build a data project worthy of an Oscar

The Panama Papers revolutionised the way collaborative investigative journalism is done – and it could not have been delivered without technology, explains Neo Technology’s Emil Eifrem.

The Pulitzer Prize is the highest accolade in journalism and writing, the reporting equivalent of the Oscars or Grammys. It’s hugely competitive, and only the best of the best are in the running. As a result, winning means what you’ve achieved is recognised as an example of real excellence.

A Neo Technology customer has just won the Pulitzer, for ‘Explanatory Reporting’. It’s another example of the growing power and relevance of a new way of working with complex data.

The project is The Panama Papers – a probe that has, in one year, led to legal investigations of tax avoidance by the 1% in no fewer than 79 countries.

With more than 4,700 individual stories run by news sites globally, and $110 million and rising in unpaid taxes or asset seizures being recouped, the Papers have been a success for citizens, and at massive scale. Getting there involved the largest collaborative team in journalism ever: 376 journalists from over 100 media organisations in 80 countries.

No wonder the Pulitzer Jury was convinced this was a real achievement.

Even three years back, no real technology in-house at all

The team behind the breakthrough was The International Consortium of Investigative Journalists (ICIJ), a global network of investigative journalists in more than 65 countries who have set up a partnership at major newspapers for reporters willing to try and break the hardest investigative scoops.

The ICIJ provides a range of central data and technology services, and develops technology to support international, cross-organisational collaborations such as the Panama Papers. Those tools are delivered by a small team, yet even three years ago there was no in-house data team at all.

The fact that this has completely changed speaks to the growing importance of data and technology in public discourse and the journalism that represents it. One of the key challenges of the Panama Papers was the sheer volume of data that had to be crunched.

At the time of publication in April 2016, it was the largest data leak in journalism history, at 2.6 terabytes and 11.5 million individual files. To give some idea of scale, in 2013 the ICIJ published a set of stories from a 260 gigabyte source, one ten times smaller.

The power of graph and open source in lock-step

That step change demanded the ICIJ develop tools from scratch to enable collaboration to happen. It decided to invest in whatever methods it could find to process large amounts of information – ideally open source, as the organisation is a non-profit with no technology budget to speak of at all.

Extensive research led the ICIJ to conclude that graph database technology, in the shape of Neo4j, was its best option, given its ability to spot relationships and connections that are hard or even impossible to see with conventional relational technology.

The team brought in Apache Solr for indexing needs and Apache Tika for document processing. To scan and digitally capture document images, it set up 30–40 temporary servers in Amazon Web Services that allowed it to process hundreds of documents in parallel.

For a user interface, it adapted Project Blacklight, open source software normally used by librarians. On the ETL (Extract, Transform, and Load) front, Talend was brought in to transform the source data from SQL to graph database format. Once transformed, the data was plugged into Linkurious visualisation software, which presented it visually as a network that could be accessed by logging in from anywhere in the world.
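The SQL-to-graph transformation described above can be illustrated with a minimal sketch. It uses plain Python rather than Talend, and the table names, column names and sample rows (`officers`, `companies`, `officer_company`) are hypothetical, invented for illustration and not taken from the leak or from the ICIJ’s actual schema. It shows the general idea only: each relational row becomes a node, and each row of a join table becomes a labelled relationship between two nodes.

```python
# Hypothetical relational rows, standing in for the result of SQL queries.
officers = [
    {"id": 1, "name": "Alice Example"},
    {"id": 2, "name": "Bob Example"},
]
companies = [
    {"id": 10, "name": "Shellco Ltd", "jurisdiction": "PA"},
]
# Join table linking officers to companies (also hypothetical).
officer_company = [
    {"officer_id": 1, "company_id": 10, "role": "director"},
    {"officer_id": 2, "company_id": 10, "role": "shareholder"},
]


def rows_to_graph(officer_rows, company_rows, link_rows):
    """Turn relational rows into graph nodes and labelled edges."""
    nodes = {}
    for row in officer_rows:
        # Key each node by (label, id) so ids from different tables cannot clash.
        nodes[("Officer", row["id"])] = {"name": row["name"]}
    for row in company_rows:
        nodes[("Company", row["id"])] = {
            "name": row["name"],
            "jurisdiction": row["jurisdiction"],
        }

    edges = []
    for row in link_rows:
        source = ("Officer", row["officer_id"])
        target = ("Company", row["company_id"])
        # The join-table row becomes a relationship, e.g. DIRECTOR_OF.
        edges.append((source, row["role"].upper() + "_OF", target))
    return nodes, edges


nodes, edges = rows_to_graph(officers, companies, officer_company)
for source, relation, target in edges:
    print(nodes[source]["name"], relation, nodes[target]["name"])
```

The point of the reshaping is that a connection which would need a join in SQL becomes a first-class relationship that can be followed directly.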

The ICIJ found the combination of the Linkurious data visualisation front end and Neo4j graph database to be very rapid in terms of modelling data, with visualisations produced quickly that were easy to understand for the non-data scientist reporters working on the initiative.

That mattered, as it meant less tech-savvy reporters could expand the networks visually, while more technically expert reporters and programmers could use the Neo4j query language, Cypher, to run more complex queries (e.g. ‘Show me everybody within two degrees of separation of this person, or show me all the connected dots’). Finally, for communication needs, the ICIJ used Global I-Hub, a secure open source platform it developed based on Oxwall.
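To make the ‘two degrees of separation’ question concrete, here is a minimal sketch of the underlying idea in plain Python rather than Cypher. It is not the ICIJ’s query: the function name `within_degrees` and the toy graph of people and companies are assumptions made up for illustration. It performs a breadth-first search that stops expanding once it is two hops from the starting entity.

```python
from collections import deque


def within_degrees(adjacency, start, max_hops=2):
    """Return {entity: hop_count} for everything within max_hops of start."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if distance[current] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbour in adjacency.get(current, ()):
            if neighbour not in distance:
                distance[neighbour] = distance[current] + 1
                queue.append(neighbour)
    del distance[start]  # the starting entity is not its own connection
    return distance


# Hypothetical toy network: people connected through companies.
graph = {
    "Person A": ["Company X"],
    "Company X": ["Person A", "Person B"],
    "Person B": ["Company X", "Company Y"],
    "Company Y": ["Person B", "Person C"],
    "Person C": ["Company Y"],
}

result = within_degrees(graph, "Person A")
print(result)
```

In this toy network, Company X is one hop from Person A and Person B is two hops away, while Company Y and Person C fall outside the two-hop limit. A graph database answers this kind of question by following stored relationships directly, which is why it suits this style of investigation.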

As a result, more than 100 media organisations worked together and met the agreed, secret publishing deadline, going live worldwide at the same time on April 3, 2016.

That means the ICIJ’s work has had one of the highest impacts of any story in journalism history, and the impact continues. The work now extends beyond the team of specialist journalists and is encouraging citizen journalism by giving access to the contents of three million separate Panama Papers documents. That makes it one of the largest tax haven data sources ever opened to public scrutiny, with over 6 million concerned citizens interrogating 50 million pages since its May 2016 re-launch.

A new future for journalism, and anyone working with complexity

For the ICIJ, the Panama Papers has not only revolutionised the way collaborative investigative journalism is carried out, showing the power of partnership between big media organisations globally; it has also enabled the assembly of big teams, including members of the public, to do powerful data work at low cost, thanks to graph databases and open source.

For Bastian Obermayer, investigative journalist at Süddeutsche Zeitung, “Data journalism is more and more what today’s journalism is all about. Being able to work on data is one of the most important [skills needed], and increasingly so.”

The ICIJ has clearly managed ground-breaking work with technology – and should be congratulated for winning the most important prize in global journalism, the Pulitzer.

 

Sourced by Emil Eifrem, CEO of Neo Technology, the company behind the world’s leading graph database Neo4j

 

Nick Ismail
