Harnessing the Power of Open Source #PanamaLeaks
Here is a 5-minute read and walkthrough of how the CIP - ICIJ journalists handled 2.6 TB of data in the Panama Papers leak.
The first thing they did was look for open source tools to solve their problems:
1. The problem they faced: communication between journalists who do not belong to the same newsroom. They used Oxwall and tweaked it to build their own private social network for their reporters.
2. Security was key for them, so they looked into it. Governments were not a threat to them, but they enabled measures such as 2-step authentication and served all data over HTTPS.
3. TOOLS USED:
- They used Amazon cloud instances running Ubuntu, with Tesseract (for OCR), Extract (their in-house parallel processing software) and Solr, to harness cloud computing power.
- They used Extract (available on their GitHub account) for parallel processing: take a file, check whether its content can be recognized and, if it cannot, run OCR on it. They used Solr for indexing. Solr didn’t have a user interface, so they used Project Blacklight.
- They used Apache Tika for document processing. Tika interacts with Tesseract.
- They used Nuix to search unstructured metadata, from a big data and analytics perspective.
- They used Lucene query syntax for searches with proximity matching (searching through the API, searching with regular expressions, batch searching with lists of names, etc.).
- They used Neo4j to do graph searches.
- Different databases were processed in SQL Server, and they needed a platform to unify them all.
- They used Linkurious to visualise graphs easily: with the Neo4j database plugged in, they had the graph visualisation set up.
- They used Talend* to transform their databases into Neo4j, plugged in Linkurious and got reporters searching (for example, see a dot and expand it into further dots).
- They can use Cypher, the Neo4j query language, to run more complex queries and look at the data in depth.
- They possibly used, or will be using, https://tabula.technology/ to extract tabular information from PDF documents.
- They have fuzzy searching and public widgets, which allowed them to publish the data on the web.
- These widgets can be embedded in their stories, harnessing the power of collaborative, viral offshore leaks stories.
- They generated automated reports to make it easier for people to share.
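The processing flow described in the tools list (take a file, try to read its content, fall back to OCR, and do this in parallel) can be sketched roughly as below. This is only an illustration in Python, not the code of the actual Extract tool: the function names, the `FAKE_FILES` data and the `fake_extract` / `fake_ocr` stand-ins are all hypothetical, and in a real setup they would be replaced by calls to something like Tika and Tesseract, with the results sent to an index such as Solr.

```python
from concurrent.futures import ThreadPoolExecutor


def process_document(path, extract_text, run_ocr):
    """Try plain text extraction first; fall back to OCR if nothing is found."""
    text = extract_text(path)
    method = "parsed"
    if not text or not text.strip():
        # No usable text was recognized, so treat the file as a scanned image.
        text = run_ocr(path)
        method = "ocr"
    # A dict like this is what would be handed to an indexer.
    return {"id": path, "text": text, "method": method}


def process_batch(paths, extract_text, run_ocr, workers=4):
    """Process many files in parallel, keeping the input order."""
    def worker(path):
        return process_document(path, extract_text, run_ocr)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(worker, paths))


# Hypothetical stand-ins so the sketch runs without any external tools.
FAKE_FILES = {"memo.txt": "shareholder register", "scan.tiff": ""}


def fake_extract(path):
    return FAKE_FILES[path]


def fake_ocr(path):
    return "text recovered by OCR from " + path


if __name__ == "__main__":
    for doc in process_batch(sorted(FAKE_FILES), fake_extract, fake_ocr):
        print(doc["id"], doc["method"])
```

Passing the extractor and the OCR step in as functions keeps the fallback logic separate from whichever tools actually do the work.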
4. Full video of Mar Cabra on the unravelling of the Panama Papers below:
https://youtu.be/S20XMQyvANY
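As a small illustration of the batch searching with proximity matching mentioned in the tools list, here is a sketch that turns a list of names into Lucene-style proximity queries of the form `"some name"~N`. The field name `text`, the default distance of 2 and the function names are assumptions made for this example, not details from the talk.

```python
def proximity_query(name, slop=2, field="text"):
    """Build a proximity query: the words of `name` within `slop` positions."""
    # Escape backslashes and double quotes so the phrase stays well-formed.
    escaped = name.replace("\\", "\\\\").replace('"', '\\"')
    return f'{field}:"{escaped}"~{slop}'


def batch_queries(names, slop=2, field="text"):
    """Build one proximity query per non-empty name in the list."""
    return [proximity_query(n.strip(), slop, field) for n in names if n.strip()]


if __name__ == "__main__":
    for query in batch_queries(["John Smith", "  ", "Acme Holdings Ltd"]):
        print(query)
```

Each resulting string could then be sent to a search index that accepts Lucene query syntax, one query per name on the list.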