Every document in a collection of similar PDF files contains an area whose text needs to be searched. The goal is to detect whether the area in the bottom right corner of the page contains a specified text.
The iText library provides a few classes suitable for text extraction, for example SimpleTextExtractionStrategy, LocationTextExtractionStrategy, or PdfTextExtractor. Unfortunately, none of these lets you specify the area in which the text should be searched. These classes usually return all the text on a page merged into one big string.
This can be overcome by implementing the RenderListener interface to collect text from a specific area of a PDF page. The listener can be passed to the content processor of iText's document parser, which goes through all text objects on a page.
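import java.util.HashSet;
import java.util.Set;

import com.itextpdf.text.Rectangle;
import com.itextpdf.text.pdf.parser.ImageRenderInfo;
import com.itextpdf.text.pdf.parser.LineSegment;
import com.itextpdf.text.pdf.parser.RenderListener;
import com.itextpdf.text.pdf.parser.TextRenderInfo;
import com.itextpdf.text.pdf.parser.Vector;

public static class TextCollector implements RenderListener {

    // Size of the searched box, in PDF user-space units (points).
    private static final float BOX_WIDTH = 200;
    private static final float BOX_HEIGHT = 100;

    private final Rectangle pageSize;
    private final Set<String> collectedTexts = new HashSet<>();

    public TextCollector(Rectangle pageSize) {
        this.pageSize = pageSize;
    }

    @Override
    public void beginTextBlock() {
        // Not needed for collecting text.
    }

    @Override
    public void renderText(TextRenderInfo renderInfo) {
        LineSegment baseLine = renderInfo.getBaseline();
        Vector startPoint = baseLine.getStartPoint();
        // Collect all texts in a box at the bottom right corner of the page.
        // The bottom left corner of the page is page's origin (0, 0).
        if (startPoint.get(0) > pageSize.getWidth() - BOX_WIDTH && startPoint.get(1) < BOX_HEIGHT) {
            collectedTexts.add(renderInfo.getText());
        }
    }

    @Override
    public void endTextBlock() {
        // Not needed for collecting text.
    }

    @Override
    public void renderImage(ImageRenderInfo renderInfo) {
        // Images are ignored.
    }

    public Set<String> getCollectedTexts() {
        return collectedTexts;
    }
}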
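To make the box condition in renderText easier to check by hand, here is a minimal sketch of the same comparison as a plain function, without any iText types. The class name BoxCheck, the method name inBottomRightBox and the 595 pt page width (the width of an A4 page in points) are assumptions made for this illustration only. The box dimensions are the ones used by the collector.

```java
// Hypothetical helper for illustration; it is not part of iText.
class BoxCheck {
    // Same box dimensions as in TextCollector, in points.
    static final float BOX_WIDTH = 200;
    static final float BOX_HEIGHT = 100;

    // Returns true when the point (x, y) lies inside the box at the bottom
    // right corner of a page whose origin (0, 0) is its bottom left corner.
    static boolean inBottomRightBox(float x, float y, float pageWidth) {
        return x > pageWidth - BOX_WIDTH && y < BOX_HEIGHT;
    }

    public static void main(String[] args) {
        float a4Width = 595f; // assumed page width: A4 in points
        System.out.println(inBottomRightBox(500f, 50f, a4Width));  // true: right of x = 395 and below y = 100
        System.out.println(inBottomRightBox(100f, 50f, a4Width));  // false: too far to the left
        System.out.println(inBottomRightBox(500f, 150f, a4Width)); // false: too high on the page
    }
}
```

Note that only the start point of the baseline is tested, so a text that starts to the left of the box and runs into it would not be collected.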
The collector should then be called for every page of the PDF document.
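import java.io.IOException;

import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.PdfReaderContentParser;

public boolean findTextInPdf(PdfReader reader, String text) throws IOException {
    PdfReaderContentParser parser = new PdfReaderContentParser(reader);
    for (int pageNo = 1; pageNo <= reader.getNumberOfPages(); pageNo++) {
        TextCollector textCollector = parser.processContent(pageNo, new TextCollector(reader.getPageSize(pageNo)));
        // Stop at the first page whose corner box contains the text.
        if (textCollector.getCollectedTexts().contains(text)) {
            return true;
        }
    }
    return false;
}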
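One caveat with looking up the text in a set of collected strings: the lookup only succeeds when the searched text arrives as exactly one chunk. A PDF may draw a single visible phrase with several text-showing operations, in which case renderText receives it in pieces. The following is a minimal sketch of one possible workaround, assuming the chunks arrive in reading order and that no separators need to be inserted between them. The class ChunkJoiner and its methods are hypothetical and not part of iText.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; it is not part of iText.
// It keeps the chunks in arrival order so they can be joined before searching.
class ChunkJoiner {
    private final List<String> chunks = new ArrayList<>();

    // Would be called from renderText instead of adding to a set.
    void add(String chunk) {
        chunks.add(chunk);
    }

    // Joins all chunks without a separator and searches the result.
    boolean containsText(String text) {
        return String.join("", chunks).contains(text);
    }

    public static void main(String[] args) {
        ChunkJoiner joiner = new ChunkJoiner();
        joiner.add("Con");
        joiner.add("fiden");
        joiner.add("tial");
        System.out.println(joiner.containsText("Confidential")); // true
    }
}
```

The trade-off is that arrival order follows the page's content stream, which is not guaranteed to match the visual reading order, so this is only a starting point for documents with a simple layout.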