Email classification, labeling and entity classification/extraction

I am currently building an application that needs to go through a user’s emails, classify them, and then extract some information. I implemented a pipeline with two different models: a zero-shot text classification model that classifies each email, and then, once classification is done, a zero-shot entity recognition model that extracts the information. This worked with varying degrees of success: classification worked about 80% of the time, while entity recognition only worked 20–30% of the time.
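For reference, the two-stage flow can be sketched like this. The model calls are stubbed out with hypothetical keyword rules here — in practice `classify_email` would wrap a real zero-shot classification model and `extract_entities` a zero-shot NER model (both function names are my own placeholders, not from any library):

```python
import re

LABELS = ["subscription", "one-off", "payment", "others"]

def classify_email(text: str) -> str:
    """Placeholder for a zero-shot text classification model."""
    lowered = text.lower()
    if "subscription" in lowered or "renew" in lowered:
        return "subscription"
    if "invoice" in lowered or "paid" in lowered:
        return "payment"
    return "others"

def extract_entities(text: str) -> dict:
    """Placeholder for a zero-shot entity recognition model."""
    amounts = re.findall(r"\$\d+(?:\.\d{2})?", text)
    return {"amount": amounts}

def process(email: str) -> dict:
    """Stage 1: classify. Stage 2: extract, but only for
    relevant classes, so the slower extraction model is
    skipped for the bulk of the mailbox."""
    label = classify_email(email)
    result = {"label": label, "entities": {}}
    if label != "others":
        result["entities"] = extract_entities(email)
    return result

out = process("Your subscription renewed for $9.99 this month.")
```

Gating extraction on the classification result is also one easy way to cut total processing time, since most emails will land in "others" and never hit the second model.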

This made it clear that the models require fine-tuning on a set of real-world emails, and the Enron email dataset isn’t going to cut it. That brings the first challenge: manually labeling thousands of emails, which seems very time-consuming.

The other issue is processing time. I could have used a more advanced LLM, but I want to keep the per-email processing time as low as possible, since there could be thousands to tens of thousands of emails per user to process.

So, TL;DR: First, how can I fine-tune a zero-shot text classification model and a zero-shot entity recognition model? Second, how can I acquire labelled data, or label the data myself while speeding up the process? Third, how can I speed up processing times?

Any tips, tools, and guidance are greatly appreciated.


Which classes do you want to classify the emails into? Spam vs. not-spam (i.e. a binary classification problem), or are there more than two classes (a multiclass problem)? If so, can only one class be assigned to an email, or several classes/labels (a multilabel classification problem)?

The emails I am concerned with contain some sort of payment information regarding the services the user uses. So the classes would look something like [“subscription”, “one-off”, “payment”, “others”], and yes, it can be multi-label.
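Since it is multi-label, each label should be scored independently (a sigmoid per label, with a threshold) rather than forced to compete through a softmax — this is the difference the `multi_label` flag controls in zero-shot classification pipelines. A toy illustration with made-up logits for an email that is both a subscription notice and a payment receipt:

```python
import math

LABELS = ["subscription", "one-off", "payment", "others"]

def softmax(logits):
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Made-up per-label scores; real scores would come from a model.
logits = [2.0, -3.0, 1.5, -2.0]

# Single-label view: labels compete, exactly one "wins".
single = LABELS[max(range(len(logits)), key=lambda i: logits[i])]

# Multi-label view: each label judged on its own, then thresholded,
# so several labels can apply to the same email.
multi = [lbl for lbl, z in zip(LABELS, logits) if sigmoid(z) > 0.5]
```

The same distinction matters when fine-tuning: a multi-label classification head is trained with a per-label binary loss rather than a single cross-entropy over all classes.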

Hey, what model did you use for this task? I have a similar use case.


Like most data science problems, it depends greatly on the characteristics of your data. Are these malicious payment requests (using adversarial wording to circumvent simple keyword-based filters), or standard expenses and receipts?

Are you looking to train a model that can run inference in real time before emails land in the user’s inbox, or run as an offline batch job overnight? (The throughput your model needs may be much lower than you expect if the load can be averaged out across a full day, for example.)
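To make that concrete, a back-of-the-envelope throughput calculation (all numbers assumed purely for illustration):

```python
# Hypothetical workload: 100 users with 10,000 emails each,
# processed in an overnight batch window.
emails_per_user = 10_000
users = 100
window_hours = 8

total_emails = emails_per_user * users
required_per_second = total_emails / (window_hours * 3600)
# 1,000,000 emails over 8 hours works out to roughly 35 emails/s,
# which is far less demanding than real-time per-email latency.
```

Framed that way, a small fine-tuned model on modest hardware may be enough, whereas the same volume handled at delivery time would need much tighter per-email latency.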

Given how cheap LLM calls are these days (relative to human labelling), I would lean towards scaling out your data annotation over the real examples you have, using some API.

There’s a fair bit that goes into leveraging LLMs to annotate training data at scale (something I’ve personally been trying to streamline at everyrow.io, which I’m working on). But the TL;DR is that you want to be able to iterate ergonomically on the classification boundaries as you see failure cases, which means supporting API requests at high concurrency (with process pools etc.) and having a simple way of inspecting the results.
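A minimal sketch of that annotation loop, with the LLM call stubbed out — `llm_label`, the thread count, and the keyword rule are all my own assumptions, and any real API client would replace the stub:

```python
import concurrent.futures

LABELS = ["subscription", "one-off", "payment", "others"]

def llm_label(email: str) -> str:
    """Stub for an LLM API call that returns one of LABELS.
    In practice this would send the email text plus the label
    definitions to a hosted model and parse the response."""
    return "payment" if "invoice" in email.lower() else "others"

def annotate(emails, max_workers=16):
    # Threads suffice here because real API calls are I/O-bound;
    # swap in ProcessPoolExecutor for CPU-heavy post-processing.
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as ex:
        labels = list(ex.map(llm_label, emails))
    # Keep (email, label) pairs together so failure cases are
    # easy to inspect when refining the label definitions.
    return list(zip(emails, labels))

rows = annotate(["Invoice #42 attached", "Lunch on Friday?"])
```

The point of keeping annotation this cheap and repeatable is that you can re-run the whole batch every time you adjust a label definition, then spot-check the pairs where the new and old labels disagree.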

As for how to run inference at scale with fine-tuned language models, I’ve had decent success with https://github.com/vllm-project/vllm, which is recommended by Hugging Face themselves, but again it depends on your use case. A managed service on one of the cloud providers may be a better fit if you have bursty traffic spikes, with lots of emails to process during certain hours of the day.

Best of luck!
