Extract tables from PDF
I have a few PDF files, each consisting of a bunch of tables that I want to extract for further processing.
I started my journey with Azure Document Intelligence, which is a pretty cool service.
After creating the service, here is a snippet of code that may be used:
import { createReadStream } from "fs"
import { AzureKeyCredential, DocumentAnalysisClient } from "@azure/ai-form-recognizer"

const key = "xxxxxxxxxxxxxxxxxx"
const endpoint = "https://helloworld.cognitiveservices.azure.com/"
const formUrl = "https://example.com/docs/2023/3/report.pdf"

const client = new DocumentAnalysisClient(endpoint, new AzureKeyCredential(key))

// analyze a remote document with the prebuilt layout model
const poller = await client.beginAnalyzeDocument("prebuilt-layout", formUrl)
// or analyze a local file instead:
// const poller = await client.beginAnalyzeDocument("prebuilt-layout", createReadStream("report.pdf"))

const { tables, pages } = await poller.pollUntilDone()

console.log({ pages: pages.length, tables: tables.length })
console.dir(tables, { depth: null, colors: true })
The only note I have here is that on the free tier it does not go beyond the second page, but on the paid tier everything seems to work.
I do like how easy it is to get up and running, as well as the fact that it is a managed solution.
My next step was to check what other tools are out there, and after a quick round of googling I found the following example:
from tabula import read_pdf
from tabulate import tabulate

# read_pdf with pages="all" returns a list of DataFrames, one per detected table
dfs = read_pdf("report.pdf", pages="all")
for df in dfs:
    print(tabulate(df))
But in my case it did not work out and failed with the following error: subprocess.CalledProcessError: Command '['java', '-Djava.awt.headless=true', '-Dfile.encoding=UTF8', '-jar', '/Users/mac/Downloads/npf_report/venv/lib/python3.11/site-packages/tabula/tabula-1.0.5-jar-with-dependencies.jar', '--pages', 'all', '--guess', '--format', 'JSON', 'report.pdf']' returned non-zero exit status 1.
What the heck, let's do it in Java directly then.
But first, here is an example of using the CLI utility:
# wget https://github.com/tabulapdf/tabula-java/releases/download/v1.0.5/tabula-1.0.5-jar-with-dependencies.jar
# java -jar tabula-1.0.5-jar-with-dependencies.jar --help
java -jar tabula-1.0.5-jar-with-dependencies.jar --pages all --guess --format JSON --outfile report.json report.pdf
After that, I created the simplest Spring Boot service ever, with the following controller:
package ua.org.macblog.tables;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;
import org.springframework.web.multipart.MultipartFile;
import technology.tabula.ObjectExtractor;
import technology.tabula.Page;
import technology.tabula.PageIterator;
import technology.tabula.Table;
import technology.tabula.extractors.SpreadsheetExtractionAlgorithm;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

@RestController
public class DefaultController {

    @PostMapping("/tables")
    public List<List<List<String>>> extract(@RequestParam("file") MultipartFile file) throws IOException {
        List<List<List<String>>> result = new ArrayList<>();
        // load the uploaded PDF and walk it page by page
        try (PDDocument document = PDDocument.load(file.getInputStream())) {
            SpreadsheetExtractionAlgorithm sea = new SpreadsheetExtractionAlgorithm();
            PageIterator pi = new ObjectExtractor(document).extract();
            while (pi.hasNext()) {
                Page page = pi.next();
                // every detected table becomes a list of rows, each row a list of cell texts
                for (Table table : sea.extract(page)) {
                    List<List<String>> tableData = new ArrayList<>();
                    for (var row : table.getRows()) {
                        List<String> cells = new ArrayList<>();
                        for (var cell : row) {
                            cells.add(cell.getText());
                        }
                        tableData.add(cells);
                    }
                    result.add(tableData);
                }
            }
        }
        return result;
    }
}
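For completeness, the service also needs the usual Spring Boot entry point; here is a minimal sketch (the class name TablesApplication is my own, any generated application class will do):
package ua.org.macblog.tables;

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;

// plain Spring Boot bootstrap, nothing project specific
@SpringBootApplication
public class TablesApplication {
    public static void main(String[] args) {
        SpringApplication.run(TablesApplication.class, args);
    }
}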
For the dependency, I added the following to pom.xml:
<dependency>
    <groupId>technology.tabula</groupId>
    <artifactId>tabula</artifactId>
    <version>1.0.5</version>
</dependency>
And to avoid bothering with Java on the server, here is a Dockerfile:
FROM openjdk:21
WORKDIR /app
COPY target/tables-0.0.1-SNAPSHOT.jar tables.jar
CMD ["java", "-jar", "tables.jar"]
# colima start
# docker buildx build --platform linux/amd64 -t mac2000/tables .
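# mvn package   (build target/tables-0.0.1-SNAPSHOT.jar before building the image)
# docker run --rm -p 8080:8080 mac2000/tables   (Spring Boot listens on 8080 by default)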
With that in place, and a simple index.html like the one below (it can live in src/main/resources/static so Spring Boot serves it):
<!doctype html>
<html lang="en">
<head>
    <meta charset="UTF-8" />
    <meta name="viewport" content="width=device-width, user-scalable=no, initial-scale=1.0, maximum-scale=1.0, minimum-scale=1.0">
    <title>Tables</title>
</head>
<body>
    <fieldset>
        <legend>Extract Tables from PDF</legend>
        <form>
            <input type="file" name="file" id="file" accept="application/pdf" required />
            <input type="submit" value="submit">
        </form>
    </fieldset>
    <div id="output"></div>
    <script>
        // intercept the form submit, POST the selected PDF to /tables and render the returned tables
        document.querySelector("form").addEventListener("submit", async event => {
            event.preventDefault();
            const formData = new FormData();
            for (const file of document.getElementById("file").files) {
                formData.append("file", file);
            }
            const tables = await fetch("/tables", {method: "POST", body: formData}).then(res => res.json());
            let html = "";
            for (const table of tables) {
                html += "<table cellpadding='5' cellspacing='0' border='1' style='margin: 1em 0'>";
                html += table.map(row => "<tr>" + row.map(cell => "<td>" + cell + "</td>").join("") + "</tr>").join("");
                html += "</table>";
            }
            document.getElementById("output").innerHTML = html;
        });
    </script>
</body>
</html>
We now have a service that accepts PDF files and returns JSON with the parsed tables - profit!
Here is an example:
curl -X POST http://localhost:8080/tables -F file=@report.pdf
And the response will be something like:
[
    [
        ["Id", "Name"],
        ["1", "One"],
        ["2", "Two"]
    ],
    [
        ["Movie", "Year", "Rating"],
        ["Die Hard 4", "2007", "7.1"],
        ["Prey", "2022", "7.1"]
    ]
]