# Entity Resolution at Scale
An NLP + LLM system that deduplicates entity names across 5M+ financial records using fuzzy matching and distributed computing.
## Overview
Financial datasets often contain the same entity represented in dozens of different ways — abbreviations, misspellings, legal suffixes, and regional naming conventions all create duplicates that compromise downstream analytics. This project built an end-to-end entity resolution pipeline that deduplicates entity names across millions of records using a combination of traditional NLP techniques and LLM-powered disambiguation.
## Technical Approach
The system operates in a multi-stage pipeline, progressively filtering and scoring candidate matches:
```mermaid
graph LR
    A[Raw Entity<br/>Records] --> B[Preprocessing<br/>& Normalization]
    B --> C[PySpark<br/>Cross-Join]
    C --> D[FuzzyWuzzy<br/>Multi-Method<br/>Scoring]
    D --> E[Confidence<br/>Thresholding]
    E -->|Ambiguous| F[GPT-4<br/>Disambiguation]
    E -->|High Confidence| G[Resolved<br/>Entity Map]
    F --> G
    G --> H[Incremental<br/>Processing]
```
Stage 1 — Preprocessing & Normalization: Strip legal suffixes (LLC, Inc., Ltd.), normalize whitespace and casing, expand common abbreviations, and generate blocking keys to reduce the comparison space.
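A minimal sketch of this normalization step, assuming illustrative suffix and abbreviation lists (the production dictionaries are larger) and a simple prefix-based blocking key:

```python
import re

# Illustrative lists only; the real pipeline uses larger dictionaries.
LEGAL_SUFFIXES = {"llc", "inc", "ltd", "corp", "co"}
ABBREVIATIONS = {"intl": "international", "mgmt": "management", "grp": "group"}

def normalize(name: str) -> str:
    """Lowercase, strip punctuation and legal suffixes, expand abbreviations."""
    tokens = re.sub(r"[^\w\s]", " ", name.lower()).split()
    tokens = [ABBREVIATIONS.get(t, t) for t in tokens if t not in LEGAL_SUFFIXES]
    return " ".join(tokens)

def blocking_key(name: str) -> str:
    """Coarse key (first 4 chars of the normalized name) so only records
    sharing a key are ever compared pairwise."""
    return normalize(name)[:4]

print(normalize("Acme Intl. Holdings, LLC"))  # acme international holdings
print(blocking_key("Acme Intl. Holdings, LLC"))  # acme
```

Any normalization applied here must be deterministic, since the same function is reused on incoming records in Stage 5.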
Stage 2 — Distributed Cross-Join: PySpark distributes the pairwise comparison workload across the cluster. Blocking keys reduce the O(n^2) comparison space to manageable partitions.
Stage 3 — Multi-Method Fuzzy Scoring: Each candidate pair is scored using multiple FuzzyWuzzy methods (ratio, partial_ratio, token_sort_ratio, token_set_ratio). The ensemble score provides robustness against different types of name variations.
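The ensemble idea can be sketched with the standard library's `difflib` standing in for FuzzyWuzzy (the real scorers differ in detail, but the order-insensitive `token_sort` trick and the averaging are the same shape):

```python
from difflib import SequenceMatcher

def ratio(a: str, b: str) -> float:
    """Character-level similarity on a 0-100 scale, like fuzz.ratio."""
    return 100 * SequenceMatcher(None, a, b).ratio()

def token_sort_ratio(a: str, b: str) -> float:
    """Order-insensitive variant: compare alphabetically sorted tokens."""
    return ratio(" ".join(sorted(a.split())), " ".join(sorted(b.split())))

def ensemble_score(a: str, b: str) -> float:
    """Average several scorers so no single failure mode dominates."""
    scorers = (ratio, token_sort_ratio)
    return sum(f(a, b) for f in scorers) / len(scorers)

a, b = "holdings acme", "acme holdings"
print(token_sort_ratio(a, b))  # 100.0 — word order no longer matters
print(ratio(a, b) < 100)       # True  — plain ratio is fooled by reordering
```

Reordered tokens defeat plain edit-distance scoring but not the token-sorted variant, which is exactly why the pipeline combines methods.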
Stage 4 — LLM Disambiguation: Pairs that fall in the ambiguous confidence range are sent to GPT-4 with structured prompts that include entity context (sector, geography, relationships) for final resolution.
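A hypothetical sketch of the structured-prompt construction (field names and wording are illustrative; the GPT-4 call itself is omitted, but it would send this string as the user message and parse the JSON verdict from the reply):

```python
import json

def build_disambiguation_prompt(pair: tuple, context: dict) -> str:
    """Assemble a structured prompt embedding entity context so the model
    judges the pair with sector/geography signals, not just string shape."""
    payload = {
        "name_a": pair[0],
        "name_b": pair[1],
        "sector": context.get("sector"),
        "geography": context.get("geography"),
        "relationships": context.get("relationships", []),
    }
    return (
        "Do these two names refer to the same financial entity? "
        'Respond with JSON: {"same_entity": true|false, "reason": "..."}.\n'
        + json.dumps(payload, indent=2)
    )

prompt = build_disambiguation_prompt(
    ("Acme Intl", "ACME International Ltd"),
    {"sector": "manufacturing", "geography": "US"},
)
print(prompt)
```

Constraining the model to a fixed JSON schema keeps the responses machine-parseable, so ambiguous pairs can be merged into the resolved map without manual handling.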
Stage 5 — Incremental Processing: New records are matched against the existing resolved entity map without reprocessing the entire dataset.
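A minimal sketch of the incremental step, assuming names are already normalized by Stage 1 and using `difflib` as the scorer (the resolved map is a plain list here for brevity; the real store is the pipeline's persisted entity map):

```python
from difflib import SequenceMatcher

def match_incremental(new_name: str, resolved: list, threshold: float = 0.9) -> str:
    """Match one new record against existing canonical names; only if
    nothing clears the threshold does it become a new canonical entity."""
    best, best_score = None, 0.0
    for canonical in resolved:
        score = SequenceMatcher(None, new_name, canonical).ratio()
        if score > best_score:
            best, best_score = canonical, score
    if best_score >= threshold:
        return best  # existing entity: no full-dataset reprocessing needed
    resolved.append(new_name)  # genuinely new entity
    return new_name

resolved = ["acme international holdings", "globex corporation"]
print(match_incremental("acme international holding", resolved))
# acme international holdings
```

Each daily record touches only the canonical map (further restricted by blocking keys in practice), which is why updates take minutes instead of re-running the full cross-join.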
## Key Results
- Deduplicated 5M+ financial entity records with >95% precision
- Reduced manual review queue by 80% through confidence-based routing
- Incremental processing handles daily updates in minutes rather than hours
- Multi-method scoring catches variations that single-method approaches miss
## Technologies Used
Python · PySpark · FuzzyWuzzy · GPT-4 API · NLP · Distributed Computing