June 24, 2022

Java PDFBox Example - Read Text And Extract Image From PDF

In this post we'll see two Java programs that use the PDFBox library: one to read text from a PDF document and one to extract images from a PDF document.

To learn more about the PDFBox library and see more PDF examples in Java using PDFBox, check this post: Generating PDF in Java Using PDFBox Tutorial

Reading PDFs using PDFBox

For reading text from a PDF using PDFBox you need to perform the following steps.

  1. Load the PDF that has to be read using the PDDocument.load method. This method belongs to PDFBox 2.x; PDFBox 3.x replaces it with Loader.loadPDF.
  2. Use the PDFTextStripper class to read the text. This class takes a PDF document and strips out all of the text.
  3. Call the getText() method of the PDFTextStripper class to read the text of the PDF document.

import java.io.File;
import java.io.IOException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class ReadPDF {
  // Path of the PDF to read
  public static final String CONTENT_PDF = "F://knpcode//result//PDFBox//Content.pdf";

  public static void main(String[] args) {
    // try-with-resources closes the document even if reading fails
    try (PDDocument document = PDDocument.load(new File(CONTENT_PDF))) {
      PDFTextStripper textStripper = new PDFTextStripper();
      // Get total page count of the PDF document
      int numberOfPages = document.getNumberOfPages();
      // Set the first page to be extracted (page numbers are 1-based)
      textStripper.setStartPage(1);
      // Set the last page to be extracted
      textStripper.setEndPage(numberOfPages);
      String text = textStripper.getText(document);
      System.out.println(text);
    } catch (IOException e) {
      e.printStackTrace();
    }
  }
}

Extracting image from PDF using PDFBox

To extract images from a PDF document, use the PDResources class in the PDFBox library. This class gives you access to all the resources available at the page level.

From those resources you can check whether a resource is an image by verifying that the resource object is of type PDImageXObject. The example here reads the resources of the first page only.

import java.io.File;
import java.io.IOException;
import javax.imageio.ImageIO;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDResources;
import org.apache.pdfbox.pdmodel.graphics.PDXObject;
import org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject;

public class ReadPDF {
  public static final String CONTENT_PDF = "F://knpcode//result//PDFBox//Image.pdf";

  public static void main(String[] args) {
    // try-with-resources closes the document even if extraction fails
    try (PDDocument document = PDDocument.load(new File(CONTENT_PDF))) {
      // Get resources for the first page (page index is 0-based)
      PDResources pdResources = document.getPage(0).getResources();
      int i = 0;
      for (COSName csName : pdResources.getXObjectNames()) {
        System.out.println(csName);
        PDXObject pdxObject = pdResources.getXObject(csName);
        if (pdxObject instanceof PDImageXObject) {
          // The XObject is already an image, so a cast is sufficient
          PDImageXObject image = (PDImageXObject) pdxObject;
          i++;
          // Image storage location and image name
          File imgFile = new File("F://knpcode//result//PDFBox//img" + i + ".png");
          ImageIO.write(image.getImage(), "png", imgFile);
        }
      }
    } catch (IOException e) {
      e.printStackTrace();
    }
  }
}
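
One detail worth noting about the saving step: ImageIO.write returns false, rather than throwing an exception, when no writer is registered for the requested format, so a failed save can go unnoticed. The following is a minimal sketch of checking that return value, using only the JDK. The class name ImageSaver, the helper saveAsPng and the blank test image are hypothetical; the blank image merely stands in for the BufferedImage returned by image.getImage() in the program above.

```java
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import javax.imageio.ImageIO;

public class ImageSaver {
  // Hypothetical helper: writes the image as img<index>.png in the given directory
  // and fails loudly if no writer handled the image
  static File saveAsPng(BufferedImage image, File dir, int index) throws IOException {
    File imgFile = new File(dir, "img" + index + ".png");
    // ImageIO.write returns false when no suitable writer is found
    if (!ImageIO.write(image, "png", imgFile)) {
      throw new IOException("No PNG writer available for " + imgFile);
    }
    return imgFile;
  }

  public static void main(String[] args) throws IOException {
    // Stand-in for the BufferedImage returned by PDImageXObject.getImage()
    BufferedImage image = new BufferedImage(4, 4, BufferedImage.TYPE_INT_RGB);
    // Assumption: the system temp directory is used instead of the post's output folder
    File dir = new File(System.getProperty("java.io.tmpdir"));
    File saved = saveAsPng(image, dir, 1);
    System.out.println(saved.getName() + " written: " + saved.exists());
  }
}
```

In the extraction loop, a helper like this could replace the direct ImageIO.write call, with image.getImage() passed as the first argument.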

That's all for the topic Java PDFBox Example - Read Text And Extract Image From PDF. If something is missing or you have something to share about the topic please write a comment.

