Harnessing the Power of Open Source #PanamaLeaks

Here is a 5-minute read and walkthrough of how the CIP - ICIJ journalists handled the 2.6 TB of data in the Panama Papers leak.
The first thing they did was look for open source tools to solve their problems:

1. The first problem they faced was communication between journalists who do not belong to the same newsroom. They took Oxwall and tweaked it to build their own private social network for their reporters.
2. Security was key for them, so they looked into it. Governments were not a threat to them, but they still enabled things like 2-step authentication, serving all data over HTTPS, etc.


3. TOOLS USED:

- They used Amazon cloud instances running Ubuntu, with Tesseract (for OCR), Extract (their in-house parallel processing software) and Solr, to harness the power of the cloud.
- They used Extract (available on their GitHub account) for parallel processing: it takes a file, checks whether the content is recognized and, if it is not, runs OCR. Solr was used for indexing. Solr didn't have a user interface, so they used Project Blacklight.
- Apache Tika for document processing. Tika interacts with Tesseract.
- They used Nuix to search unstructured metadata from a Big Data and Analytics perspective.
- They used Lucene query syntax for searches with proximity matching (searching via the API, searching with regular expressions, batch searching with lists of names, etc.).
- They used Neo4j to do graph searches.
- Different databases were processed in SQL Server, and they needed a platform to unify all of these.
- They used Linkurious to visualise graphs very easily: once the Neo4j database was plugged in, they had the graph visualisation set up.
- They used Talend to transform their databases into Neo4j, plug in Linkurious and get reporters searching (for example, see a dot and expand it into more dots).
- They can use Cypher queries to look at the data in depth. Cypher, the Neo4j query language, can do more complex queries.
- Possibly used / will be using https://tabula.technology/ for extracting tabular information from PDF documents.
- They have fuzzy searching and public widgets, which allowed them to publish it on the web.
- They embedded these in their stories, harnessing the power of collaborative, viral offshore leaks stories.
- They generated automated reports to make it easier for people to share.
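
The "parse first, OCR only if nothing was recognized, then index" flow described in the list can be sketched roughly as follows. This is a minimal illustration and not ICIJ's Extract code: `extract_text`, `ocr_text`, `process`, `run_pipeline` and `build_index` are hypothetical names, the parser and OCR steps are stand-ins for Tika and Tesseract, and the small in-memory index only stands in for Solr.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Set


def extract_text(raw: bytes) -> str:
    """Stand-in for a Tika-style parser: returns text, or "" if none is recognized."""
    try:
        return raw.decode("utf-8").strip()
    except UnicodeDecodeError:
        return ""


def ocr_text(raw: bytes) -> str:
    """Stand-in for a Tesseract-style OCR call (hypothetical placeholder output)."""
    return f"<ocr output for {len(raw)} bytes>"


def process(name: str, raw: bytes) -> Dict[str, str]:
    """Try normal text extraction first; fall back to OCR only when it yields nothing."""
    text = extract_text(raw)
    method = "parsed"
    if not text:
        text = ocr_text(raw)
        method = "ocr"
    return {"id": name, "text": text, "method": method}


def run_pipeline(files: Dict[str, bytes], workers: int = 4) -> List[Dict[str, str]]:
    """Process files in parallel; results come back in the same order as `files`."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(process, name, raw) for name, raw in files.items()]
        return [future.result() for future in futures]


def build_index(records: List[Dict[str, str]]) -> Dict[str, Set[str]]:
    """Tiny inverted index (token -> document ids), standing in for a search server."""
    index: Dict[str, Set[str]] = {}
    for record in records:
        for token in record["text"].lower().split():
            index.setdefault(token, set()).add(record["id"])
    return index


if __name__ == "__main__":
    sample = {
        "memo.txt": b"Offshore company memo",   # readable text, no OCR needed
        "scan.tif": b"\xff\xfe\x00\x89",        # not decodable, so OCR is used
    }
    records = run_pipeline(sample)
    for record in records:
        print(record["id"], record["method"])
    print(sorted(build_index(records)))
```

Running OCR only as a fallback matters at this scale, because OCR is typically far slower than reading text that is already embedded in a file.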
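
To make the proximity, fuzzy and batch searching mentioned in the list more concrete, here is a small sketch that builds query strings in the classic Lucene query syntax, where `"a b"~N` is a proximity search and `term~N` is a fuzzy search. The helper names and the sample names are made up for illustration, and how ICIJ actually built their queries is not stated in the talk summary above.

```python
import re
from typing import Iterable

# Characters that the classic Lucene query parser treats as special in a bare term.
_SPECIAL = re.compile(r'([\\+\-!():^\[\]"{}~*?|&/])')


def escape_term(term: str) -> str:
    """Backslash-escape Lucene special characters in a single term."""
    return _SPECIAL.sub(r"\\\1", term)


def proximity_query(name: str, slop: int = 2) -> str:
    """Match the words of `name` within `slop` positions of each other."""
    phrase = name.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{phrase}"~{slop}'


def fuzzy_query(term: str, max_edits: int = 1) -> str:
    """Match terms within `max_edits` edits of `term` (to catch misspellings)."""
    return f"{escape_term(term)}~{max_edits}"


def batch_query(names: Iterable[str], slop: int = 2) -> str:
    """Combine a list of names into one OR query for batch searching."""
    return " OR ".join(proximity_query(name, slop) for name in names)


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    print(proximity_query("John Smith", 3))
    print(fuzzy_query("Smith"))
    print(batch_query(["John Smith", "Jane Doe"]))
```

A proximity query lets a reporter find "Smith, John" or "John A. Smith" with one search, which is useful when names in scanned documents are not written consistently.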
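
The "see a dot and expand it" idea behind the graph searches can be illustrated with a tiny in-memory graph. This is only a sketch of the concept: the node names are invented, the `Graph` class is hypothetical, and the Cypher string is a rough equivalent of a one-hop expansion rather than a query taken from ICIJ's setup.

```python
from collections import defaultdict, deque
from typing import Dict, List

# Rough Cypher equivalent of a one-hop expansion (assumes nodes have a `name` property).
EXPAND_CYPHER = "MATCH (n {name: $name})-[r]-(m) RETURN n, r, m"


class Graph:
    """Minimal undirected graph used to mimic expanding a node in a graph viewer."""

    def __init__(self) -> None:
        self._adj = defaultdict(set)

    def add_edge(self, a: str, b: str) -> None:
        self._adj[a].add(b)
        self._adj[b].add(a)

    def expand(self, node: str) -> List[str]:
        """Return the direct neighbours of `node`, like clicking on a dot."""
        return sorted(self._adj.get(node, ()))

    def within(self, start: str, hops: int) -> Dict[str, int]:
        """Breadth-first search: every node reachable in at most `hops` steps."""
        seen = {start: 0}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            if seen[node] == hops:
                continue  # do not expand beyond the hop limit
            for neighbour in self._adj.get(node, ()):
                if neighbour not in seen:
                    seen[neighbour] = seen[node] + 1
                    queue.append(neighbour)
        del seen[start]
        return seen


if __name__ == "__main__":
    graph = Graph()
    # Hypothetical entities, for illustration only.
    graph.add_edge("Officer A", "Shell Co 1")
    graph.add_edge("Shell Co 1", "Intermediary X")
    graph.add_edge("Intermediary X", "Shell Co 2")
    print(graph.expand("Shell Co 1"))
    print(graph.within("Officer A", 2))
```

In a real deployment the expansion would be a query against the graph database, with the visualisation tool drawing the returned nodes and relationships.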

4. Full video of Mar Cabra on the unravelling of the Panama Papers below:

https://youtu.be/S20XMQyvANY
