

After looking at the specific PDF linked to by the OP, I have to say that it does not display a typical table format. It contains many images inside the "cells", and the cells are not all strictly aligned vertically or horizontally.

So this isn't even a 'nice' table, but an extremely ugly and awkward one to work with.

Having said that, I have to add: extracting even 'nice' tables from PDFs is, in general, extremely difficult.

Standard PDFs do not provide any hints about the semantics of what they draw on a page. The only distinction the syntax provides is between vector elements (lines, fills, ...), images and text. Whether a character is part of a table, part of a line of text, or just a lonely, single character within an otherwise empty area is not easy to recognize programmatically by parsing the PDF source code.

For background on why the PDF file format should never, ever be thought of as suitable for hosting extractable, structured data, see this article: Why Updating Dollars for Docs Was So Difficult (ProPublica website).

... but doing so with TabulaPDF works very well!

Having said the above, let me now add this: for an amazing open source family of tools for extracting tabular data from PDFs (unless they are scanned pages), one that gets better and better from week to week and contradicts what I said in my introductory paragraphs, check out TabulaPDF. See:

Introducing Tabula: Upload a PDF, get back tabular CSV data.
Tabula-Extractor: A Command Line Interface to Tabula.

In the background, Tabula-Extractor makes use of PDFBox (which is written in Java) and a few other third-party libs. To run, it requires JRuby-1.7 to be installed.
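To illustrate the point that PDF syntax only distinguishes text, vector graphics and images, here is a minimal Python sketch. It assumes a hand-written, uncompressed content stream (`SAMPLE_STREAM` is made up for illustration, and `classify_operators` is a hypothetical helper, not part of any library). The operator sets are a simplified subset of the PDF operators, and `Do` is counted as an image even though it can also paint other XObjects. Real content streams are usually compressed and need a proper PDF library to read.

```python
import re
from collections import Counter

# Hypothetical, hand-written content stream: two text fragments that look like
# table headers, one horizontal rule, and one image placement.
SAMPLE_STREAM = """
BT /F1 10 Tf 72 700 Td (Name) Tj ET
BT /F1 10 Tf 200 700 Td (Price) Tj ET
72 690 m 300 690 l S
q 50 0 0 50 72 600 cm /Im1 Do Q
"""

# Simplified subsets of PDF operators (assumption: enough for this sample only).
TEXT_OPS = {"BT", "ET", "Tf", "Td", "TD", "Tm", "Tj", "TJ", "T*", "'", '"'}
VECTOR_OPS = {"m", "l", "c", "re", "h", "S", "s", "f", "F", "f*", "B", "b", "n"}
IMAGE_OPS = {"Do", "BI", "ID", "EI"}

# Tokens: literal strings, names, numbers, then operators (in that order).
TOKEN = re.compile(
    r"\((?:\\.|[^\\()])*\)"      # (literal string), no nested parentheses
    r"|/[^\s/()\[\]<>]+"         # /Name
    r"|[-+]?\d*\.?\d+"           # number
    r"|[A-Za-z'\"*]+"            # operator
)

def classify_operators(stream: str) -> Counter:
    """Count operators per category; operands (strings, names, numbers) are skipped."""
    counts = Counter()
    for match in TOKEN.finditer(stream):
        token = match.group(0)
        if token[0] in "(/" or token[0].isdigit() or token[0] in "+-.":
            continue  # operand, not an operator
        if token in TEXT_OPS:
            counts["text"] += 1
        elif token in VECTOR_OPS:
            counts["vector"] += 1
        elif token in IMAGE_OPS:
            counts["image"] += 1
        else:
            counts["other"] += 1  # graphics state etc. (q, Q, cm, ...)
    return counts

counts = classify_operators(SAMPLE_STREAM)
print(dict(counts))
```

Nothing in the stream says "table", "row" or "cell": the two header words are just independent text fragments at given coordinates, and the rule is just a stroked line. Any table structure has to be inferred from positions, which is the job tools like Tabula take on.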
