Automated Data Extraction Pipeline
Batch processing tool with compression, chunking, and multi-encoding support for high-volume financial spreadsheets.
Overview
Processing thousands of financial spreadsheets daily requires robust automation that handles edge cases gracefully — files with different encodings, workbooks with VBA macros, and datasets that exceed memory limits. This pipeline automates the end-to-end extraction workflow with adaptive chunking, compression, and comprehensive error handling.
Technical Approach
graph LR
A[File<br/>Queue] --> B[Encoding<br/>Detection]
B --> C[VBA/Macro<br/>Detection]
C --> D[Adaptive<br/>Chunk Sizing]
D --> E[Parallel<br/>Processing]
E --> F[gzip<br/>Compression]
F --> G[Output<br/>Dataset]
G --> H[Validation<br/>Report]
Multi-Encoding Detection: Files are probed for encoding using a cascade of UTF-8, Latin-1, and CP1252 detection. The pipeline handles mixed encodings within a single workbook gracefully.
VBA/Macro Detection: Workbooks containing VBA macros are flagged and processed in a sandboxed mode that extracts data without executing macros. Macro presence is logged for security auditing.
Adaptive Chunk Sizing: Large files are split into chunks based on available memory and file characteristics. Chunk boundaries respect logical row groups (e.g., deal boundaries) to avoid splitting related records.
Parallel Processing: Independent chunks are processed concurrently, with results assembled in order after completion. Failed chunks are retried with smaller sizes before reporting errors.
Compression & Output: Processed data is compressed using gzip and base64-encoded for efficient storage and transmission. Output includes row counts and checksums for verification.
Key Results
- Processes thousands of spreadsheets daily with <1% error rate
- Adaptive chunking handles files from kilobytes to gigabytes
- Multi-encoding support eliminates manual encoding fixes
- VBA detection provides security audit trail
- gzip compression reduces storage requirements by 60-80%
Technologies Used
Python openpyxl pandas gzip base64