vkuzel.com

Finding a text in a PDF file with iText

2018-11-16

Every document in a collection of similar PDF files contains an area where a certain text may appear. The goal is to detect whether the area in the bottom right corner of the page contains the specified text.

PDF document example

The iText library provides a few classes suitable for text extraction, for example SimpleTextExtractionStrategy, LocationTextExtractionStrategy and PdfTextExtractor. Unfortunately, none of these on its own lets you specify the area in which the text should be searched for. They usually return all the text on a page mixed together into one big string.

This can be overcome by implementing the RenderListener interface to collect text from a specific area of a PDF page. The listener is passed to the content processor of iText's document parser (PdfReaderContentParser), which goes through all text objects on a page and calls the listener for each of them.

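import com.itextpdf.text.Rectangle;
import com.itextpdf.text.pdf.parser.ImageRenderInfo;
import com.itextpdf.text.pdf.parser.LineSegment;
import com.itextpdf.text.pdf.parser.RenderListener;
import com.itextpdf.text.pdf.parser.TextRenderInfo;
import com.itextpdf.text.pdf.parser.Vector;
import java.util.HashSet;
import java.util.Set;

// Nested inside the class that performs the search, hence "static".
public static class TextCollector implements RenderListener {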

    private static final float BOX_WIDTH = 200;
    private static final float BOX_HEIGHT = 100;

    private final Rectangle pageSize;
    private final Set<String> collectedTexts = new HashSet<>();

    public TextCollector(Rectangle pageSize) {
        this.pageSize = pageSize;
    }

    @Override
    public void beginTextBlock() {
    }

    @Override
    public void renderText(TextRenderInfo renderInfo) {
        LineSegment baseLine = renderInfo.getBaseline();
        Vector startPoint = baseLine.getStartPoint();
        // Collect all texts in a box at the bottom right corner of the page.
        // The bottom left corner of the page is page's origin (0, 0).
        if (startPoint.get(0) > pageSize.getWidth() - BOX_WIDTH && startPoint.get(1) < BOX_HEIGHT) {
            collectedTexts.add(renderInfo.getText());
        }
    }

    @Override
    public void endTextBlock() {
    }

    @Override
    public void renderImage(ImageRenderInfo renderInfo) {
    }

    public Set<String> getCollectedTexts() {
        return collectedTexts;
    }
}
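
The coordinate test inside renderText can be separated from iText and checked on its own. The following is only a minimal sketch: BottomRightBox and its contains method are hypothetical names, not part of iText, and the sketch assumes, like the listener above, that the page origin (0, 0) is the bottom left corner and that all values are in PDF user space units.

```java
// Hypothetical helper, not part of iText: mirrors the condition used in renderText.
final class BottomRightBox {

    private final float pageWidth;
    private final float boxWidth;
    private final float boxHeight;

    BottomRightBox(float pageWidth, float boxWidth, float boxHeight) {
        this.pageWidth = pageWidth;
        this.boxWidth = boxWidth;
        this.boxHeight = boxHeight;
    }

    // True when the point (x, y) lies inside the box anchored at the
    // bottom right corner of the page. Assumes the origin is bottom left.
    boolean contains(float x, float y) {
        return x > pageWidth - boxWidth && y < boxHeight;
    }
}
```

For a page 595 units wide with a 200 by 100 box, a baseline starting at (450, 50) falls inside the box, while (100, 50) and (450, 300) do not.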

The collector should then be called for every page of the PDF document.

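// Requires com.itextpdf.text.pdf.PdfReader,
// com.itextpdf.text.pdf.parser.PdfReaderContentParser and java.io.IOException.
public boolean findTextInPdf(PdfReader reader, String text) throws IOException {
    PdfReaderContentParser parser = new PdfReaderContentParser(reader);
    // iText page numbers start at 1.
    for (int pageNo = 1; pageNo <= reader.getNumberOfPages(); pageNo++) {
        TextCollector textCollector = parser.processContent(pageNo, new TextCollector(reader.getPageSize(pageNo)));
        // Stop at the first page whose bottom right box contains the text.
        if (textCollector.getCollectedTexts().contains(text)) {
            return true;
        }
    }
    return false;
}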
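
One caveat: iText calls renderText once per text-showing operation, so a phrase may arrive split into several chunks, and an exact match against a Set of chunks can miss it. Below is a minimal sketch of a more tolerant check. It assumes the chunks were collected into a List in rendering order instead of a Set. ChunkMatcher and containsText are hypothetical names, not iText API.

```java
import java.util.List;

// Hypothetical helper, not part of iText.
final class ChunkMatcher {

    private ChunkMatcher() {
    }

    // Joins the chunks in the order they were rendered and looks for the
    // searched text as a substring of the result. Words that are separated
    // only by positioning (no space character in the content stream) would
    // need extra handling that this sketch does not attempt.
    static boolean containsText(List<String> chunks, String needle) {
        String joined = String.join("", chunks);
        return joined.contains(needle);
    }
}
```

With this approach the collector would store renderInfo.getText() values in an ArrayList, and the per-page check would call ChunkMatcher.containsText instead of Set.contains.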