MM015 – Hands-On Machine Learning: Email Classification
Issue
My inbox folder has too many emails. Various scheduled processes tell me everything about the jobs: which are about to start, which are running, which have ended. I know, why should I still get process emails in the era of application monitoring, log analytics and IoT? Because old is gold. Being proactive, I want to track them; otherwise, I would miss malfunctions and anomalies. It is probably a common problem, or maybe not… but how can I improve my workflow? How can I spend less time looking at the emails while being sure I won't lose any information? Fortunately, we are in 2019 (nearly 2020) and machine learning is at its peak, so let's find a way to make it better!
Approach
Let's think about the solution in general terms. I have four groups of emails, one for each combination of environment and step status: "production ok", "production ko", "test ok", "test ko". How can I categorize my emails automatically? This is a clustering problem with four clusters. The same logic fits CRM, IoT and so on: how can I automatically split my clients into n groups? How can I recognize the state of my working machines and group them?
Let’s try the trick with Python:
- I have to read my emails and save them
- I have to make them understandable to the KMeans algorithm
- Classification!
1 – READ THE EMAILS
Thanks to the Exchangelib library, I can access my Exchange account and download the emails. This way I fetch only the most recent ones.
# file: schemas.py (imported later by clustering.py)
from exchangelib import DELEGATE, Account, Credentials, Configuration
from pony.orm import *
from datetime import *
import pydash
from bs4 import BeautifulSoup

db = Database()

class Body(db.Entity):
    id = PrimaryKey(int, auto=True)
    message = Required(str)
    body_id = Set('Mail')

class Mail(db.Entity):
    id = PrimaryKey(int, auto=True)
    creation_date = Optional(datetime)
    send_date = Optional(datetime)
    received_date = Optional(datetime)
    subject = Optional(str)
    body_id = Required(Body)

def email_connection():
    # connect to the Exchange account via EWS
    creds = Credentials(
        username='myemail@company.it',
        password='my password')
    config = Configuration(server='outlook.office365.com', credentials=creds)
    account = Account(
        primary_smtp_address='smtp.company.com',
        config=config,
        autodiscover=False,
        access_type=DELEGATE)
    return account

def write_thn(folder, n):
    # walk the n most recent emails and persist each one
    print('Total: ' + str(n))
    for item in folder.all().order_by('-datetime_received')[0:n]:
        write_mail_on_db(item)

def getMails():
    set_sql_debug(True)
    db.bind(provider='sqlite', filename='mails.sqlite', create_db=True)
    db.generate_mapping(create_tables=True)
    account = email_connection()
    ex_folder = account.root / 'My Email Folder'
    write_thn(ex_folder, 9999)

def write_mail_on_db(item):
    with db_session:
        print(item.datetime_created, item.datetime_sent,
              item.datetime_received, item.subject)
        cdate = item.datetime_created
        sdate = item.datetime_sent
        rdate = item.datetime_received
        if not pydash.predicates.is_date(cdate):
            cdate = None
        if not pydash.predicates.is_date(sdate):
            sdate = None
        if not pydash.predicates.is_date(rdate):
            rdate = None
        message = item.body
        subject = item.subject
        if pydash.predicates.is_empty(subject):
            subject = ''
        elif len(subject) > 1000:
            subject = subject[0:1000]
        if pydash.predicates.is_empty(message):
            m = Mail(creation_date=cdate, send_date=sdate,
                     received_date=rdate, subject=subject)
        else:
            # strip HTML markup from the body before storing it
            cleantext = BeautifulSoup(message, 'lxml').text
            b = Body(message=cleantext[0:1000])
            m = Mail(creation_date=cdate, send_date=sdate,
                     received_date=rdate, subject=subject, body_id=b)

if __name__ == '__main__':
    getMails()
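Before moving on to step 2, it is worth checking that the download actually worked. Below is a minimal sanity-check sketch of mine (not part of the original script), assuming the code above is saved as schemas.py:

from pony.orm import db_session, desc
from schemas import db, Mail

# bind to the same SQLite file the downloader wrote to
db.bind(provider='sqlite', filename='mails.sqlite', create_db=False)
db.generate_mapping(create_tables=False)

with db_session:
    print('emails stored:', Mail.select().count())
    # peek at the five most recent subjects
    for m in Mail.select().order_by(desc(Mail.received_date))[:5]:
        print(m.received_date, m.subject)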
2 – MAKE THEM UNDERSTANDABLE TO THE KMEANS ALGORITHM
Numbers are worth a thousand words in the computing world, so we need to transform text into numbers. In this scenario, we don't need to understand the semantics of each sentence; we just need to "tokenize" the words and keep the ones that matter to us. For example, I always encode the environment and step result in the mail subject with keywords like 'Production OK', 'Test OK', 'Production KO' or 'Test KO'. This means we just need to assign a different number to each of those tokens, for example (see the sketch after this list):
- Production = 0
- Test = 1
- OK = 2
- KO = 3
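In practice, the script below doesn't hand-assign these numbers; it uses Keras' hashing_trick, which hashes each word into a small integer range. Here is a quick sketch of what those utilities return (the sample subject is mine; exact hash values may differ):

from keras.preprocessing.text import text_to_word_sequence, hashing_trick

subject = 'PROD OK: wkf_SA0'  # a sample subject, mine for illustration

# split into lowercase word tokens; punctuation and underscores
# are filtered out by default
tokens = text_to_word_sequence(subject)
print(tokens)  # something like ['prod', 'ok', 'wkf', 'sa0']

# hash each token into an integer in [1, 5); collisions are possible,
# but the four keywords only need to land on distinct-enough codes
print(hashing_trick(subject, 5, hash_function='md5'))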
# file: clustering.py
from sklearn.cluster import KMeans
import numpy
from schemas import *
from keras.preprocessing.text import hashing_trick
from keras.preprocessing.text import text_to_word_sequence
from pydash import *

db.bind(provider='sqlite', filename='mails.sqlite', create_db=True)
db.generate_mapping(create_tables=True)

with db_session:
    mail_list = Mail.select()[:]

# keep only the keyword part of each subject (e.g. 'PROD OK')
clean_subjects = []
for x in mail_list:
    clean_subjects.append(x.subject.split(':')[0].strip())

def text_to_matrix(text):
    # tokenize the text and hash each word into a small integer space
    words = set(text_to_word_sequence(text))
    vocab_size = len(words)
    l = 5  # hash space size; ideally it should be derived from vocab_size
    result = hashing_trick(text, l, hash_function='md5')
    return result

def dense_matrix2d(result):
    # build a zero-padded 2D matrix: one row per subject,
    # one column per token position
    max_len = 0
    txt = []
    for x in result:
        tmp = text_to_matrix(x)
        txt.append(tmp)
        if max_len < len(tmp):
            max_len = len(tmp)
    zer = numpy.zeros((len(result), max_len))
    for x in range(0, len(result)):
        for y in range(0, len(txt[x])):
            zer[x][y] = txt[x][y]
    return zer
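As a quick smoke test, appending the lines below to clustering.py and feeding a few cleaned subjects through dense_matrix2d should produce one zero-padded row of hash codes per subject. The sample inputs are mine:

sample = ['PROD OK', 'PROD KO', 'TEST OK', 'TEST KO']
m = dense_matrix2d(sample)
print(m.shape)  # (4, 2): one row per subject, one column per token
print(m)        # subjects sharing a keyword share a hash code in that column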
3 – CLASSIFICATION!
In the last step, I will use scikit-learn's KMeans to cluster and classify my email tokens. I want it to recognize "Production OK", "Production KO", "Test OK", "Test KO".
if __name__ == '__main__':
    seed = 7
    numpy.random.seed(seed)
    zer = dense_matrix2d(clean_subjects)
    print(zer)
    L = 4  # number of clusters
    kmeans = KMeans(n_clusters=L, random_state=0).fit_predict(zer)
    cnt_labels = count_by(clean_subjects)
    cnt_kmeans = count_by(kmeans)
    print([cnt_labels, cnt_kmeans])
    # Labels1, Klabels1 and Classifications1 are additional
    # Pony entities defined alongside the schemas above
    for k in cnt_labels.keys():
        print(k, cnt_labels[k])
        with db_session:
            # store the original labels and their counts
            a = Labels1(label=k, cnt=cnt_labels[k])
    for k in cnt_kmeans.keys():
        print(k, cnt_kmeans[k])
        with db_session:
            # store the KMeans cluster ids and their counts
            a = Klabels1(cls=k, cnt=cnt_kmeans[k])
    for x in range(0, len(mail_list)):
        with db_session:
            # store one classification row per email
            n = Classifications1(
                cls=kmeans[x].item(),
                orig_id=mail_list[x].id,
                send_date=str(mail_list[x].send_date).replace('+00:00', ''))
    print(kmeans)
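A note on the "KMeans Class Label" column in the results below: KMeans cluster ids are arbitrary integers, so to attach a human-readable label I map each cluster to the most frequent original keyword among its members. A minimal sketch (the helper below is mine, not part of the original script):

from collections import Counter

def majority_labels(cluster_ids, keywords):
    # map each KMeans cluster id to the most common original
    # subject keyword among its members (hypothetical helper)
    buckets = {}
    for cid, kw in zip(cluster_ids, keywords):
        buckets.setdefault(cid, []).append(kw)
    return {cid: Counter(kws).most_common(1)[0][0]
            for cid, kws in buckets.items()}

# usage, inside clustering.py after fit_predict:
#   majority_labels(kmeans, clean_subjects)
#   -> something like {0: 'TEST KO', 1: 'TEST OK', 2: 'PROD OK', 3: 'PROD KO'}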
Results
Below, for each mail subject, we have the automatic cluster assignment (0-3) and its corresponding label. The result looks accurate to me, but you could say: "wait, couldn't you just take a substring of the mail subject?" Yes, of course, in this case, but it wouldn't be as fun, and it doesn't scale! This is a proof of concept and can be easily reproduced.
Mail Subject | KMeans Cluster | KMeans Class Label |
PROD OK: wkf_SA0 | 2 | PROD OK |
PROD OK: wkf_ON_ORDER_DWH | 2 | PROD OK |
PROD OK: wkf_PUB | 2 | PROD OK |
PROD OK: wkf_DIM | 2 | PROD OK |
TEST OK: wkf_F | 1 | TEST OK |
TEST KO: wkf_SA0_PRE_FF_ALL | 0 | TEST KO |
TEST OK: wkl_DQ_REPORTS | 1 | TEST OK |
TEST OK: wkf_TDS | 1 | TEST OK |
PROD OK: wkf_SA0_PRE_FF_LOG | 2 | PROD OK |
TEST OK: wkf_WKR | 1 | TEST OK |
TEST OK: wkf_WKRI | 1 | TEST OK |
PROD OK: wkf_SA0_PRE_FF_ALL | 2 | PROD OK |
TEST OK: wkf_PUB | 1 | TEST OK |
TEST KO: wkf_SA0_PRE_FF_DAILY | 0 | TEST KO |
TEST OK: wkf_DIME | 1 | TEST OK |
PROD OK: wkf_WKR | 2 | PROD OK |
TEST OK: wkf_PF | 1 | TEST OK |
PROD KO: wkf_PUB | 3 | PROD KO |
TEST KO: wkf_PF | 0 | TEST KO |
TEST OK: wkf_SA0_PRE_FF_DAILY | 1 | TEST OK |
PROD KO: wkf_ON_ORDER_DWH | 3 | PROD KO |
TEST KO: wkf_WKR | 0 | TEST KO |
TEST KO: wkf_F | 0 | TEST KO |
PROD OK: wkf_TDS | 2 | PROD OK |
TEST KO: wkf_PRIC | 0 | TEST KO |
TEST KO: wkf_PUB | 0 | TEST KO |
TEST OK: wkf_PRIC | 1 | TEST OK |
TEST KO: wkf_SA0_PRE_FF | 0 | TEST KO |
PROD OK: wkf_PF | 2 | PROD OK |
PROD OK: wkf_IN_TRANSIT_JP | 2 | PROD OK |
PROD OK: wkf_I | 2 | PROD OK |
PROD KO: wkf_SA0_PRE_FF | 3 | PROD KO |
TEST KO: wkf_SA0 | 0 | TEST KO |
TEST KO: wkf_SA0_FF_ALL_DAILY | 0 | TEST KO |
TEST OK: wkf_SA0_FF_ALL_DAILY | 1 | TEST OK |
TEST OK: wkf_ON_ORDER_DWH | 1 | TEST OK |
PROD KO: wkf_SA0_PRE_FF_ALL | 3 | PROD KO |
TEST OK: wkf_INIT | 1 | TEST OK |
TEST OK: wkf_SA0_PRE_FF_ALL | 1 | TEST OK |
TEST OK: wkf_SA0 | 1 | TEST OK |
TEST OK: wkf_IN_TRANSIT_JP | 1 | TEST OK |
TEST OK: wkf_SA0_PRE_FF_LOG | 1 | TEST OK |
PROD OK: wkf_F | 2 | PROD OK |
TEST KO: wkf_ON_ORDER_DWH | 0 | TEST KO |
TEST KO: wkf_IN_TRANSIT_JP | 0 | TEST KO |
PROD OK: wkf_INIT | 2 | PROD OK |
PROD OK: wkf_WKRI | 2 | PROD OK |
TEST OK: wkf_I | 1 | TEST OK |
Next
What can we do from this point on?
- Create a deep learning algorithm to understand whether steps finish on time or deviate from the norm (anomaly detection; see the sketch after this list)
- Change the dataset and reuse the same model
- Try another algorithm, such as logistic regression
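To give a taste of the first idea, here is a deliberately simple sketch of mine: instead of a deep learning model, plain z-scores flag runs whose duration deviates strongly from a job's history. The durations, function name and threshold are all assumptions for illustration:

import numpy

def anomalous_runs(durations_hours, threshold=2.0):
    # flag runs whose duration is more than `threshold` standard
    # deviations away from this job's historical mean
    arr = numpy.array(durations_hours, dtype=float)
    z = (arr - arr.mean()) / (arr.std() or 1.0)
    return [i for i, score in enumerate(z) if abs(score) > threshold]

# made-up durations for one job: the last run took three times as long
print(anomalous_runs([2.0, 2.1, 1.9, 2.0, 2.2, 1.8, 2.1, 6.5]))  # -> [7]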
Thank you and stay tuned!
Bye!
Graziano