Showing posts with label Java. Show all posts
Showing posts with label Java. Show all posts

Thursday, November 14, 2024

Securing PDFs Generated with Java: A Guide to Enhanced Protection

As businesses increasingly rely on digital documentation, protecting sensitive information within PDFs has become a top priority. PDFs are widely used because they preserve document formatting across devices, but without adequate security measures, they can be vulnerable to unauthorized access, copying, or editing. In Java applications that generate PDFs, adding protection layers can ensure that confidential information stays secure.

In this guide, we’ll explore how to enhance the security of PDFs generated using Java. From implementing password protection to setting user permissions and encryption, we’ll cover various approaches to safeguard your documents.

Why Protect PDFs?

Sensitive information—like financial records, proprietary research, and personal data—often ends up in PDF format. Securing these files not only helps prevent data breaches but also supports compliance with data privacy regulations like GDPR and HIPAA. Implementing PDF protection in your Java application can ensure that only authorized users have access and prevent alterations or copying of sensitive content.

Techniques for Securing PDFs in Java

Password Protection

Password-protecting PDFs provides an effective security layer to limit access. Java libraries like iText and Apache PDFBox offer methods to apply two types of passwords:

  • User Password: Required to open the document, ensuring only authorized users can view its contents.
  • Owner Password: Grants full control over document permissions (such as printing or editing) and allows modification of the document’s security settings.

Encryption:

Encryption secures PDF content by converting it into unreadable code, making it accessible only to users with the correct credentials. PDF encryption can use different approaches depending on security requirements:

Password-Based Encryption (Symmetric)


This is the most common form of PDF encryption, where a single password or key is used to both encrypt and decrypt the document. Password-based encryption ensures that only users who know the user password can open and view the document. An owner password can also be set to control permissions, such as printing, copying, or editing.

  • Supported Encryption Algorithms: Libraries like PDFBox and iText support AES-128 and AES-256 encryption, providing strong security for PDF content.
  • Use Case: Ideal for general document security when you need a straightforward way to limit access.

Setting Permissions:

Permissions offer granular control over document actions, such as printing, copying, editing, and content extraction. The owner password allows you to set these permissions in the PDF. This is especially useful for documents with sensitive content where you want to prevent unauthorized actions. Libraries like iText and PDFBox enable developers to configure these permissions during PDF creation.


Digital Signatures:

Digital signatures authenticate a document’s origin and confirm that it hasn’t been altered since signing, making them essential for legal, financial, and compliance-focused documents. Created using a private key embedded in a certificate, digital signatures provide document authenticity and tamper-evidence.

  • Self-Signed Certificates: Useful for internal testing and validation, as they verify document integrity without requiring external validation. You can generate a self-signed certificate for demonstration purposes with the following keytool command:

    keytool -genkeypair -alias testkey -keyalg RSA -keysize 2048 -validity 365 -keystore keystore.p12 -storetype PKCS12 -storepass changeit -keypass changeit -dname "CN=Test User, OU=Development, O=YourCompany, L=City, S=State, C=US"
  • CA-Signed Certificates: Issued by trusted Certificate Authorities (CAs), CA-signed certificates are recommended for production environments as they provide credibility for official, external-facing documents. These certificates ensure maximum authenticity and compatibility, particularly in regulated industries.

When obtaining a document signing certificate from a CA, there are typically two private key storage options:

  1. Hardware Tokens: Many CAs, including DigiCert, GlobalSign, and Entrust, require document signing certificates to be stored on a hardware token (usually a USB device). This token-based storage provides high security by storing the private key in a tamper-resistant environment. However, it has limitations for cloud applications and may not support fully automated workflows.

  2. Cloud HSM (Hardware Security Module): Some CAs support storing document signing keys in a Cloud HSM. Cloud HSMs are FIPS-compliant, secure environments that allow the private key to remain secure while still being accessible to cloud-hosted applications. Examples of cloud HSM services include:

    • AWS CloudHSM
    • Azure Dedicated HSM
    • Google Cloud HSM

    Cloud HSMs offer several advantages for cloud-based and automated workflows:

    • Secure Remote Access: Allows cloud applications to access the signing key securely.
    • Programmatic Access: Enables automated, high-volume signing directly from applications.
    • Compliance and Security: Meets FIPS standards for handling sensitive data.

For production use, a CA-signed certificate is preferred to ensure that digital signatures are universally trusted and recognized. Be aware that some CAs may impose restrictions on the number of documents that can be signed per year or month, so choose an option that meets your specific needs.

Watermarking: 

Watermarking places visible text or images (e.g., "Confidential" or "Draft") across each page to visually indicate the document’s purpose or confidentiality level. Watermarks do not restrict access but serve as a deterrent to unauthorized distribution, making them particularly useful for internal or sensitive documents.

iText:

Sample code using the iText library to enable digital signatures. This example uses a self-signed certificate, but in a real scenario, a CA-signed certificate should be used as discussed earlier. Additionally, the code demonstrates how to control permissions, enable a user password, and add a watermark.


import com.itextpdf.kernel.pdf.*;
import com.itextpdf.kernel.geom.Rectangle;
import com.itextpdf.kernel.pdf.canvas.PdfCanvas;
import com.itextpdf.kernel.pdf.xobject.PdfImageXObject;
import com.itextpdf.io.image.ImageDataFactory;
import com.itextpdf.layout.Document;
import com.itextpdf.layout.element.Paragraph;

import com.itextpdf.signatures.*;

import org.bouncycastle.jce.provider.BouncyCastleProvider;

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.security.KeyStore;
import java.security.PrivateKey;
import java.security.Security;
import java.security.cert.Certificate;

public class SignedPDFExample {

    private static final String KEYSTORE_PATH = "keystore.p12";
    // Path to your keystore
    private static final char[] KEYSTORE_PASSWORD = "changeit".toCharArray();
    private static final String KEY_ALIAS = "testkey";
    // Alias of the key
    private static final String WATERMARK_IMAGE_PATH = "watermark.png";
    // Path to the watermark image

    public static void main(String[] args) {
        try {
            Security.addProvider(new BouncyCastleProvider());

            // Generate, watermark, and sign the PDF
            ByteArrayOutputStream signedOutputStream =
                    generateWatermarkedAndSignedPdf();
            try (FileOutputStream fos =
                new FileOutputStream("signed_watermarked_test.pdf")) {
                fos.write(signedOutputStream.toByteArray());
            }
            System.out.println("Signed and watermarked PDF created:
                signed_watermarked_test.pdf");
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    public static ByteArrayOutputStream generateWatermarkedAndSignedPdf()
        throws Exception {
        // Load the keystore
        KeyStore ks = KeyStore.getInstance("PKCS12");
        ks.load(new FileInputStream(KEYSTORE_PATH), KEYSTORE_PASSWORD);
        PrivateKey privateKey = (PrivateKey) ks.getKey(KEY_ALIAS, KEYSTORE_PASSWORD);
        Certificate[] chain = ks.getCertificateChain(KEY_ALIAS);

        // Define permissions and set owner password (no user password required)
        String ownerPassword = "ownerpassword";
String userPassword = "userpassword";
        WriterProperties writerProperties = new WriterProperties()
                .setStandardEncryption(
                        userPassword.getBytes(), // user password is set
                        ownerPassword.getBytes(), // owner password
                        EncryptionConstants.ALLOW_PRINTING |
                        EncryptionConstants.ALLOW_COPY,
                        // Adjust permissions as needed
                        EncryptionConstants.ENCRYPTION_AES_256);

        // Generate PDF in memory with encryption properties
        ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
        PdfWriter writer = new PdfWriter(outputStream, writerProperties);
        PdfDocument pdfDoc = new PdfDocument(writer);
        Document document = new Document(pdfDoc);
        document.add(
                new Paragraph("This is a test PDF with a digital signature,
                        watermark, and restricted permissions."));

        // Apply watermark to each page
        PdfImageXObject watermarkImage =
            new PdfImageXObject(ImageDataFactory.create(WATERMARK_IMAGE_PATH));
        // Loop through all pages to add watermark
        int numberOfPages = pdfDoc.getNumberOfPages();
        for (int i = 1; i <= numberOfPages; i++) {
            PdfCanvas canvas = new PdfCanvas(pdfDoc.getPage(i));
            Rectangle pageSize = pdfDoc.getPage(i).getPageSize();
            // Updated to Rectangle type
            float x = (pageSize.getWidth() - watermarkImage.getWidth()) / 2;
            float y = (pageSize.getHeight() - watermarkImage.getHeight()) / 2;

            canvas.saveState();
            canvas.addXObject(watermarkImage, x, y);
            canvas.restoreState();
        }

        document.close();

        // Sign the PDF in memory
        ByteArrayOutputStream signedOutputStream = new ByteArrayOutputStream();

        // Initialize PdfReader with the owner password to allow modifications for
        // signing
        PdfReader reader = new PdfReader(new ByteArrayInputStream
                                        (outputStream.toByteArray()),
                new ReaderProperties().setPassword(ownerPassword.getBytes()));
        PdfSigner signer = new PdfSigner(reader, signedOutputStream,
                            new StampingProperties());

        // Configure signature appearance
        PdfSignatureAppearance appearance = signer.getSignatureAppearance();
        appearance.setReason("Testing Signature").setLocation("USA");

        // Cryptographic parameters
        IExternalDigest digest = new BouncyCastleDigest();
        IExternalSignature signature = new PrivateKeySignature(privateKey, "SHA-256",
                BouncyCastleProvider.PROVIDER_NAME);

        // Sign and return signed PDF stream with editing restrictions
        signer.signDetached(digest, signature, chain, null, null, null, 0,
                            PdfSigner.CryptoStandard.CMS);
        return signedOutputStream;
    }
}



Add iText Maven Repo to the Pom.xml

    <repositories>
      <repository>
        <id>itext-releases</id>
        <name>iText Repository - releases</name>
        <url>https://repo.itextsupport.com/releases</url>
      </repository>
    </repositories>

Add below Dependencies

    <dependency>        
         <groupId>com.itextpdf</groupId>        
         <artifactId>kernel</artifactId>        
         <version>7.1.5</version>    
    </dependency>

    <dependency>        
         <groupId>com.itextpdf</groupId>        
         <artifactId>sign</artifactId>        
         <version>7.1.5</version>    
    </dependency>      
   
    <dependency>
        <groupId>com.itextpdf</groupId>
        <artifactId>sign</artifactId>
        <version>7.1.5</version>
    </dependency>      
    <dependency>
        <groupId>org.bouncycastle</groupId>
        <artifactId>bcpkix-jdk15on</artifactId>
        <version>1.68</version>
    </dependency>

Please note that iText requires a license for commercial use. Alternatively, PDFBox is licensed under the Apache License and is free to use in all cases.

PDFBox:

Sample code using the PDFBox library to enable digital signatures. This example uses a self-signed certificate, but in a real scenario, a CA-signed certificate should be used as discussed earlier. Additionally, the code demonstrates how to control permissions, enable a user password, and add a watermark.

import org.apache.pdfbox.Loader;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageContentStream;
import org.apache.pdfbox.pdmodel.PDPageTree;
import org.apache.pdfbox.pdmodel.encryption.AccessPermission;
import org.apache.pdfbox.pdmodel.encryption.StandardProtectionPolicy;
import org.apache.pdfbox.pdmodel.font.PDType1Font;
import org.apache.pdfbox.pdmodel.font.Standard14Fonts;
import org.apache.pdfbox.pdmodel.common.PDRectangle;
import org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject;
import org.apache.pdfbox.pdmodel.interactive.digitalsignature.PDSignature;
import org.apache.pdfbox.pdmodel.interactive.digitalsignature.SignatureInterface;
import org.bouncycastle.cert.jcajce.JcaCertStore;
import org.bouncycastle.cert.jcajce.JcaX509CertificateHolder;
import org.bouncycastle.cms.CMSSignedDataStreamGenerator;
import org.bouncycastle.operator.ContentSigner;
import org.bouncycastle.operator.jcajce.JcaContentSignerBuilder;
import org.bouncycastle.operator.jcajce.JcaDigestCalculatorProviderBuilder;

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.security.KeyStore;
import java.security.PrivateKey;
import java.security.Security;
import java.security.cert.Certificate;
import java.security.cert.X509Certificate;
import java.util.ArrayList;
import java.util.Calendar;
import java.util.List;

public class PDFBoxSignedWatermarkedExample {

    private static final String KEYSTORE_PATH = "keystore.p12";
    private static final char[] KEYSTORE_PASSWORD = "changeit".toCharArray();
    private static final String KEY_ALIAS = "testkey";
    private static final String WATERMARK_IMAGE_PATH = "watermark.png";

    public static void main(String[] args) {
        try {
            Security.addProvider(
                    new org.bouncycastle.jce.provider.BouncyCastleProvider());

            // Generate, watermark, and sign the PDF
            ByteArrayOutputStream signedOutputStream =
                    generateWatermarkedAndSignedPdf();
            try (FileOutputStream fos =
                new FileOutputStream("signed_watermarked_test.pdf")) {
                fos.write(signedOutputStream.toByteArray());
            }
            System.out.println("Signed and watermarked PDF created:
                 signed_watermarked_test.pdf");
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    public static ByteArrayOutputStream generateWatermarkedAndSignedPdf()
            throws Exception {
        KeyStore ks = KeyStore.getInstance("PKCS12");
        ks.load(new FileInputStream(KEYSTORE_PATH), KEYSTORE_PASSWORD);
        PrivateKey privateKey = (PrivateKey) ks.getKey(KEY_ALIAS, KEYSTORE_PASSWORD);
        Certificate[] certificateChain = ks.getCertificateChain(KEY_ALIAS);

        ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
        PDDocument document = new PDDocument();

        // Add a page with sample content
        PDPage page = new PDPage(PDRectangle.A4);
        document.addPage(page);

        try (PDPageContentStream contentStream =
                new PDPageContentStream(document, page)) {
            contentStream.beginText();
            contentStream.setFont(
                        new PDType1Font(Standard14Fonts.FontName.HELVETICA_BOLD), 12);
            contentStream.newLineAtOffset(100, 700);
            contentStream
                    .showText("This is a test PDF with a digital signature,
                                    watermark, and restricted permissions.");
            contentStream.endText();
        }

        // Apply watermark to each page
        PDImageXObject watermarkImage =
                PDImageXObject.createFromFile(WATERMARK_IMAGE_PATH, document);
        PDPageTree pages = document.getPages();
        for (PDPage pdfPage : pages) {
            PDRectangle pageSize = pdfPage.getMediaBox();
            try (PDPageContentStream contentStream =
                    new PDPageContentStream(document, pdfPage,
                    PDPageContentStream.AppendMode.APPEND, true, true)) {
                float x = (pageSize.getWidth() - watermarkImage.getWidth()) / 2;
                float y = (pageSize.getHeight() - watermarkImage.getHeight()) / 2;
                contentStream.drawImage(watermarkImage, x, y);
            }
        }

        // Set permissions with owner password and user password
        AccessPermission accessPermission = new AccessPermission();
        accessPermission.setCanPrint(true);
        accessPermission.setCanModify(false);
        accessPermission.setCanExtractContent(true);

        StandardProtectionPolicy protectionPolicy =
                    new StandardProtectionPolicy("ownerpassword", "userpassword",
                accessPermission);
        protectionPolicy.setEncryptionKeyLength(256);
        document.protect(protectionPolicy);

        // Save the document with encryption applied to an output stream
        document.save(outputStream);
        document.close();

        // Sign the document
        return signPdf(new ByteArrayInputStream(outputStream.toByteArray()),
                        privateKey, certificateChain);
    }

    public static ByteArrayOutputStream signPdf(InputStream pdfInputStream,
                        PrivateKey privateKey, Certificate[] chain)
            throws Exception {
        // Read all bytes from InputStream for PDFBox compatibility
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        int bytesRead;
        byte[] data = new byte[1024];
        while ((bytesRead = pdfInputStream.read(data, 0, data.length)) != -1) {
            buffer.write(data, 0, bytesRead);
        }
        buffer.flush();
        byte[] pdfBytes = buffer.toByteArray();

        // Load the PDF document from byte array with the user password
        PDDocument document = Loader.loadPDF(pdfBytes, "userpassword");

        PDSignature signature = new PDSignature();
        signature.setFilter(PDSignature.FILTER_ADOBE_PPKLITE);
        signature.setSubFilter(PDSignature.SUBFILTER_ADBE_PKCS7_DETACHED);
        signature.setName("Test User");
        signature.setLocation("USA");
        signature.setReason("Testing Digital Signature");
        signature.setSignDate(Calendar.getInstance());

        document.addSignature(signature, new SignatureInterface() {
            @Override
            public byte[] sign(InputStream content) {
                try {
                    return signContent(content, privateKey, chain);
                } catch (Exception e) {
                    e.printStackTrace();
                }
                return null;
            }
        });

        ByteArrayOutputStream signedOutput = new ByteArrayOutputStream();
        document.saveIncremental(signedOutput);
        document.close();
        return signedOutput;
    }

    public static byte[] signContent(InputStream content, PrivateKey privateKey,
             Certificate[] chain) throws Exception {
        List<X509Certificate> certList = new ArrayList<>();
        for (Certificate cert : chain) {
            certList.add((X509Certificate) cert);
        }

        CMSSignedDataStreamGenerator cmsStreamGenerator =
                    new CMSSignedDataStreamGenerator();
        cmsStreamGenerator.addCertificates(new JcaCertStore(certList));

        ContentSigner contentSigner = new JcaContentSignerBuilder("SHA256withRSA")
                                        .setProvider("BC").build(privateKey);
        JcaX509CertificateHolder certificateHolder =
                new JcaX509CertificateHolder((X509Certificate) chain[0]);

        cmsStreamGenerator.addSignerInfoGenerator(
                new org.bouncycastle.cms.SignerInfoGeneratorBuilder(
                        new JcaDigestCalculatorProviderBuilder()
                        .setProvider("BC").build())
                        .build(contentSigner, certificateHolder));

        ByteArrayOutputStream signatureOutput = new ByteArrayOutputStream();
        try (OutputStream out = cmsStreamGenerator.open(signatureOutput, true)) {
            byte[] buffer = new byte[8192];
            int bytesRead;
            while ((bytesRead = content.read(buffer)) != -1) {
                out.write(buffer, 0, bytesRead);
            }
        }

        return signatureOutput.toByteArray();
    }
}



Dependencies:

<dependency>
     <groupId>org.apache.pdfbox</groupId>
     <artifactId>pdfbox</artifactId>
      <version>3.0.3</version>
 </dependency>
     
<dependency>
      <groupId>org.bouncycastle</groupId>
      <artifactId>bcpkix-jdk15on</artifactId>
      <version>1.68</version>
</dependency>

Conclusion:

Protecting PDFs generated in Java applications can provide an essential layer of security for sensitive information. By leveraging tools like iText, PDFBox you can ensure your documents are password-protected, encrypted, and permission-controlled. Implementing these security practices strengthens your application’s data privacy and compliance posture, providing peace of mind for both developers and end users.

Thursday, October 31, 2024

How to Resolve org.apache.poi.util.RecordFormatException: "Tried to Allocate an Array of Length, but the Maximum Length is 100,000,000

When uploading and processing a large .xls file through Java in an AEM backend, I encountered the following exception:

Exception in thread "main" org.apache.poi.util.RecordFormatException:
Tried to allocate an
array of length 159,135,035, but the maximum length for this record type is 100,000,000.
If the file is not corrupt and not large, please open an issue on bugzilla to
request increasing the maximum allowable size for this record type. You can set a higher override value with IOUtils.setByteArrayMaxOverride()

Stack Trace:

at org.apache.poi.util.IOUtils.throwRFE(IOUtils.java:596)
at org.apache.poi.util.IOUtils.checkLength(IOUtils.java:281) ... at ExcelReader.readSpreadsheet(ExcelReader.java:17) at ExcelReader.main(ExcelReader.java:52)



Explanation of the Issue

This exception occurs because Apache POI tries to allocate a large byte array to process the Excel file, exceeding the default size limit of 100,000,000 bytes. The issue is often due to the size or structure of the .xls file, which requires a larger array than Apache POI allows by default.

Solutions

There are two primary ways to resolve this issue:


Option 1: Increase the Byte Array Limit

You can override the maximum allowable byte array size using IOUtils.setByteArrayMaxOverride(). Setting this to a higher value, such as 200,000,000, may help process larger Excel files. However, this approach can impact memory usage and may lead to performance issues with very large files. Before increasing the size, make sure the issue isn’t due to a corrupted or incorrect file format.

Updated Code Snippet:


import org.apache.poi.ss.usermodel.*; import org.apache.poi.ss.usermodel.WorkbookFactory; import org.apache.poi.util.IOUtils; import java.io.File; import java.io.FileInputStream; import java.io.IOException; import java.util.Iterator; public class ExcelReader { public void readSpreadsheet(String filePath) { Workbook workBook = null; try (FileInputStream fis = new FileInputStream(new File(filePath))) { // Increase the byte array limit for larger files IOUtils.setByteArrayMaxOverride(200_000_000); // Create Workbook instance workBook = WorkbookFactory.create(fis); Sheet sheet = workBook.getSheetAt(0); Iterator<Row> rowIter = sheet.rowIterator(); boolean firstRow = true; while (rowIter.hasNext()) { Row row = rowIter.next(); if (!firstRow) { Iterator<Cell> cellIter = row.cellIterator(); while (cellIter.hasNext()) { Cell cell = cellIter.next(); System.out.print(cell + "\t"); // Print cell content } System.out.println(); // Newline for each row } else { firstRow = false; } } } catch (IOException e) { System.err.println("IO Exception: " + e.getMessage()); } finally { if (workBook != null) { try { workBook.close(); } catch (IOException e) { System.err.println("Unable to close Workbook object: "
                     + e.getMessage()); } } } } public static void main(String[] args) { String filePath = "Test.xlsx"; ExcelReader reader = new ExcelReader(); reader.readSpreadsheet(filePath); } }

Command to Compile and Run:

javac -cp ".;poi-5.3.0.jar;poi-ooxml-5.3.0.jar;poi-ooxml-full-5.3.0.jar;poi-ooxml-schemas-5.3.0.jar;xmlbeans-5.1.1.jar;commons-collections4-4.5.0-M2.jar;commons-io-2.17.0.jar;log4j-api-2.24.1.jar;commons-compress-1.27.1.jar;log4j-core-2.24.1.jar" ExcelReader.java

java -cp ".;poi-5.3.0.jar;poi-ooxml-5.3.0.jar;poi-ooxml-full-5.3.0.jar;poi-ooxml-schemas-5.3.0.jar;xmlbeans-5.1.1.jar;commons-collections4-4.5.0-M2.jar;commons-io-2.17.0.jar;log4j-api-2.24.1.jar;commons-compress-1.27.1.jar;log4j-core-2.24.1.jar" ExcelReader

Option 2: Use the Streaming API (SXSSFWorkbook)

Using SXSSFWorkbook, a streaming API in Apache POI, allows you to process large .xlsx files in a memory-efficient way. This approach loads only a small part of the file into memory at any time, making it suitable for processing large Excel files.

Code Using SXSSFWorkbook:


import org.apache.poi.ss.usermodel.*; import org.apache.poi.xssf.usermodel.XSSFWorkbook; import org.apache.poi.xssf.streaming.SXSSFWorkbook; import java.io.File; import java.io.FileInputStream; import java.io.IOException; import java.util.Iterator; public class ExcelReaderStream { public void readSpreadsheet(String filePath) { SXSSFWorkbook streamingWorkbook = null; try (FileInputStream fis = new FileInputStream(new File(filePath)); XSSFWorkbook workbook = new XSSFWorkbook(fis)) { // Wrap XSSFWorkbook in SXSSFWorkbook to enable streaming streamingWorkbook = new SXSSFWorkbook(workbook); Sheet sheet = streamingWorkbook.getSheetAt(0); // Iterate through rows in a memory-efficient way for (Row row : sheet) { Iterator<Cell> cellIter = row.cellIterator(); while (cellIter.hasNext()) { Cell cell = cellIter.next(); System.out.print(getCellValue(cell) + "\t"); } System.out.println(); // Newline for each row } } catch (IOException e) { System.err.println("IO Exception: " + e.getMessage()); } finally { if (streamingWorkbook != null) { try { streamingWorkbook.dispose(); // Dispose of temporary files streamingWorkbook.close(); } catch (IOException e) { System.err.println("Unable to close SXSSFWorkbook object: "
                    + e.getMessage()); } } } } // Utility method to get the value of a cell as a String private String getCellValue(Cell cell) { switch (cell.getCellType()) { case STRING: return cell.getStringCellValue(); case NUMERIC: if (DateUtil.isCellDateFormatted(cell)) { return cell.getDateCellValue().toString(); } else { return String.valueOf(cell.getNumericCellValue()); } case BOOLEAN: return String.valueOf(cell.getBooleanCellValue()); case FORMULA: return cell.getCellFormula(); default: return ""; } } public static void main(String[] args) { String filePath = "Test.xlsx"; ExcelReaderStream reader = new ExcelReaderStream(); reader.readSpreadsheet(filePath); } }

Command to Compile and Run:

javac -cp ".;poi-5.3.0.jar;poi-ooxml-5.3.0.jar;poi-ooxml-full-5.3.0.jar;poi-ooxml-schemas-5.3.0.jar;xmlbeans-5.1.1.jar;commons-collections4-4.5.0-M2.jar;commons-io-2.17.0.jar;log4j-api-2.24.1.jar;commons-compress-1.27.1.jar;log4j-core-2.24.1.jar" ExcelReaderStream.java

java -cp ".;poi-5.3.0.jar;poi-ooxml-5.3.0.jar;poi-ooxml-full-5.3.0.jar;poi-ooxml-schemas-5.3.0.jar;xmlbeans-5.1.1.jar;commons-collections4-4.5.0-M2.jar;commons-io-2.17.0.jar;log4j-api-2.24.1.jar;commons-compress-1.27.1.jar;log4j-core-2.24.1.jar" ExcelReaderStream

Conclusion

  • Option 1: Increasing the byte array limit using IOUtils.setByteArrayMaxOverride() can solve the immediate issue but may impact memory usage.
  • Option 2: Using SXSSFWorkbook with the streaming API is a more scalable solution for large .xlsx files, allowing memory-efficient processing.

Choose the approach that best suits your file size and memory requirements. Let me know if you have any questions!