Multi-modal data → insight
Joblogic Customer Data Analysis Platform
Unstructured, multi-modal comms turned into clustered insight at scale.
- Role: AI Engineer (design & build)
- Year: 2025
- Access: Private; walkthrough on request
- Databricks
- Transcription
- UMAP
- HDBSCAN
- Embeddings
- Claude (Sonnet 4.6)
Problem
Tenants sit on years of customer communication — emails and call recordings — locked in formats nobody mines: ZIP, PST, MSG. That archive is exactly where the real signal lives about how customers actually work and where they struggle. But it's unstructured, multi-modal, and at a scale no human can read. Without structured insight from it, any tenant-specific agent or automation is built on guesswork.
My role
AI Engineer — I built the pipeline end to end: ingestion, audio normalisation and transcription, email extraction, cleaning, embedding, dimensionality reduction, clustering, representative sampling, and the final LLM analysis stage.
Approach & architecture
The pipeline turns a messy archive upload into named, clustered insight. Crucially, it never tries to send everything to an LLM — it clusters first, then sends only representative conversations for interpretation.
1. Ingest. Tenants upload ZIP/PST/MSG via SharePoint; files are copied into Databricks volumes and archives extracted.
2. Normalise audio. Recordings are standardised to 16 kHz mono MP3 and split into ≤20-minute chunks for reliable transcription.
3. Transcribe & extract. Calls are transcribed, email content is extracted, and text is cleaned.
4. Embed. Conversations are turned into vector embeddings.
5. Reduce & cluster. UMAP reduces dimensionality; HDBSCAN finds natural clusters without pre-specifying how many.
6. Sample & interpret. Representative conversations per cluster are sent to Claude (Sonnet 4.6) to surface use-cases, workflows and interaction patterns.
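The chunking rule in step 2 can be sketched as a small planning function. This is an illustration only: the names `plan_chunks`, `Chunk` and `ffmpeg_args` are hypothetical, the equal-length split is one reasonable policy rather than the one the pipeline necessarily uses, and the use of ffmpeg is an assumption (the write-up does not name the audio tool).

```python
import math
from dataclasses import dataclass

MAX_CHUNK_SECONDS = 20 * 60  # the 20-minute ceiling from step 2


@dataclass(frozen=True)
class Chunk:
    index: int
    start_s: float
    end_s: float


def plan_chunks(duration_s: float, max_chunk_s: float = MAX_CHUNK_SECONDS) -> list[Chunk]:
    """Split a recording into equal-length chunks, each at most max_chunk_s long."""
    if duration_s <= 0:
        return []
    count = math.ceil(duration_s / max_chunk_s)
    size = duration_s / count  # equal sizes avoid a tiny trailing chunk
    return [
        Chunk(i, i * size, duration_s if i == count - 1 else (i + 1) * size)
        for i in range(count)
    ]


def ffmpeg_args(src: str, dst: str, chunk: Chunk) -> list[str]:
    """Build (but do not run) an ffmpeg command for one 16 kHz mono chunk.

    Assumption: ffmpeg is the normalisation tool; the source does not say.
    """
    return [
        "ffmpeg",
        "-ss", f"{chunk.start_s:.3f}",                 # seek to chunk start
        "-t", f"{chunk.end_s - chunk.start_s:.3f}",    # chunk duration
        "-i", src,
        "-ac", "1",                                    # mono
        "-ar", "16000",                                # 16 kHz sample rate
        dst,                                           # e.g. a .mp3 path
    ]
```

A 50-minute call (`plan_chunks(3000)`) yields three chunks under this policy, each below the 20-minute limit.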
Hard parts
- Messy, multi-modal ingestion. PST/MSG/ZIP are awkward archive formats and audio quality varies wildly. Normalisation and ≤20-minute chunking were what made transcription reliable rather than flaky.
- Scale and cost. Transcribing and embedding a full archive is feasible, but feeding all of it to an LLM is not. Clustering first, then sampling representatives per cluster, is what makes the economics work.
- Why UMAP + HDBSCAN. Density-based clustering finds natural groupings without committing to a fixed number of clusters up front; UMAP first projects the high-dimensional embeddings into a lower-dimensional space where that clustering is tractable. Tuning both so the clusters were genuinely meaningful was the core data-science work.
- Representativeness. Sample selection had to actually represent each cluster, so the model's summary generalises instead of fixating on an outlier.
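The cluster-then-sample idea in the last two points can be sketched as below. This is not the production code. It assumes the third-party `umap-learn` and `hdbscan` packages, every parameter value is a placeholder rather than a tuned setting, and picking the members nearest each cluster centroid is one common heuristic, not necessarily the selection rule used here. The function names are hypothetical.

```python
import math
from collections import defaultdict


def reduce_and_cluster(embeddings, n_components=5, min_cluster_size=15, seed=42):
    """Run UMAP, then HDBSCAN. Returns (reduced vectors, labels); -1 marks noise."""
    # Imported lazily: umap-learn and hdbscan are third-party packages.
    import umap
    import hdbscan

    reduced = umap.UMAP(
        n_neighbors=15,          # placeholder values, not tuned settings
        n_components=n_components,
        min_dist=0.0,
        metric="cosine",
        random_state=seed,
    ).fit_transform(embeddings)
    labels = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size).fit_predict(reduced)
    return reduced, labels


def pick_representatives(vectors, labels, per_cluster=3):
    """Return {cluster label: [row indices]} of the members nearest each centroid."""
    members = defaultdict(list)
    for idx, label in enumerate(labels):
        if label != -1:  # skip points HDBSCAN marked as noise
            members[int(label)].append(idx)

    chosen = {}
    for label, idxs in members.items():
        dim = len(vectors[idxs[0]])
        centroid = [sum(vectors[i][d] for i in idxs) / len(idxs) for d in range(dim)]
        # Closest-to-centroid members first, so an outlier is never picked early.
        ranked = sorted(idxs, key=lambda i: math.dist(vectors[i], centroid))
        chosen[label] = ranked[:per_cluster]
    return chosen
```

Only the conversations at the returned indices would go to the LLM, so the number of model calls scales with the number of clusters rather than the size of the archive.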
Impact
- Turned unstructured, multi-modal archives into clustered, named insight at scale.
- The output feeds tenant-specific agents and automations — closing the loop with the agent-building work above.
- ‹defensible metric: volume of comms processed / distinct use-cases surfaced per tenant›.