MM015 – Hands-On Machine Learning: Email Classification

Published by Graziano Fracasso on

Issue

My inbox folder has too many emails. Various scheduled processes tell me everything about their jobs: which are about to start, which are running, which have ended. I know, why should I get process emails in the era of application monitoring, log analytics and IoT? Because old is gold. My proactive mind wants to track them; otherwise I would miss malfunctions and anomalies. It is probably a common problem, or maybe not… but what can I do to improve my workflow? How can I spend less time looking at emails while being sure I won't lose any information? Fortunately, we are in 2019 (nearly 2020) and machine learning is at its peak, so let's find a way to make this better!

Approach

Let’s think about the solution in general terms. I have 4 groups of emails, one for each combination of environment and step status: “production ok”, “production ko”, “test ok” and “test ko”. How can I categorize my emails automatically? This is a clustering problem with 4 clusters. The same logic fits CRM, IoT and so on: how can I automatically split my clients into n groups? How can I recognize the state of my machines and group them?

Let’s try the trick with Python:

  1. I have to read my emails and save them
  2. I have to make them understandable to the KMeans algorithm
  3. Classification!

1 – READ THE EMAILS

Thanks to the exchangelib library I can access my Exchange account and download every email. The script below fetches only the most recent messages from the folder I care about.

from exchangelib import DELEGATE, Account, Credentials, Configuration
from pony.orm import *
from datetime import *
import pydash
from bs4 import BeautifulSoup
db = Database()
class Body(db.Entity):
    id = PrimaryKey(int, auto=True)
    message = Required(str)
    body_id = Set('Mail')
class Mail(db.Entity):
    id = PrimaryKey(int, auto=True)
    creation_date = Optional(datetime)
    send_date = Optional(datetime)
    received_date = Optional(datetime)
    subject = Optional(str)
    body_id = Required(Body)
def email_connection():
    creds = Credentials(
        username='myemail@company.it',
        password='my password')
    config = Configuration(server='outlook.office365.com', credentials=creds)
    account = Account(
        primary_smtp_address='myemail@company.it',  # the mailbox address, not a server name
        config=config,
        autodiscover=False,
        access_type=DELEGATE)
    return account
def write_thn(folder, n):
    # Store the latest n emails from the given folder.
    print('Total: ' + str(n))
    for item in folder.all().order_by('-datetime_received')[0:n]:
        write_mail_on_db(item)
def getMails():
    set_sql_debug(True)
    db.bind(provider='sqlite', filename='mails.sqlite', create_db=True)
    db.generate_mapping(create_tables=True)
    account = email_connection()
    ex_folder = account.root / 'My Email Folder'
    write_thn(ex_folder, 9999)
def write_mail_on_db(item):
    with db_session:
        print(item.datetime_created, item.datetime_sent, item.datetime_received, item.subject)
        cdate = item.datetime_created
        sdate = item.datetime_sent
        rdate = item.datetime_received
        if (not pydash.predicates.is_date(cdate)):
            cdate = None
        if (not pydash.predicates.is_date(sdate)):
            sdate = None
        if (not pydash.predicates.is_date(rdate)):
            rdate = None
        message = item.body
        subject = item.subject
        if pydash.predicates.is_empty(subject):
            subject = ''
        elif len(subject) > 1000:
            subject = subject[0:1000]
        if (pydash.predicates.is_empty(message)):
            m = Mail(creation_date=cdate,
                     send_date=sdate,
                     received_date=rdate,
                     subject=subject)
        else:
            cleantext = BeautifulSoup(message, 'lxml').text
            b = Body(message=cleantext[0:1000])
            m = Mail(creation_date=cdate,
                     send_date=sdate,
                     received_date=rdate,
                     subject=subject,
                     body_id=b)
if __name__ == '__main__':
    getMails()
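
Note that the clustering script below imports these Pony entities with “from schemas import *”, so the Database and entity definitions above are assumed to live in a schemas.py module, together with the Labels1, Klabels1 and Classifications1 entities used in step 3 to store the results.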

2 – MAKE THEM UNDERSTANDABLE TO THE KMEANS ALGORITHM

Numbers are worth a thousand words in the computing world, so we need to transform text into numbers. In this scenario we don’t need to understand the semantics of each sentence; we just need to “tokenize” the words and keep the ones that matter to us. For example, I always encode the environment and step result in the mail subject with keywords like ‘Production OK’, ‘Test OK’, ‘Production KO’ or ‘Test KO’. This means we just need to assign a different number to each of those tokens. For example:

  • Production = 0
  • Test = 1
  • OK = 2
  • KO = 3
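
Before looking at the full script, here is a minimal sketch of that tokenization with Keras’ hashing_trick (the printed values are illustrative, since the actual numbers depend on the hash):

from keras.preprocessing.text import hashing_trick, text_to_word_sequence

subject = 'PROD OK: wkf_SA0'
# text_to_word_sequence lowercases the text and strips punctuation:
print(text_to_word_sequence(subject))                  # ['prod', 'ok', 'wkf_sa0']
# hashing_trick maps every token to an integer in [1, n], so the same
# token always gets the same number:
print(hashing_trick(subject, 5, hash_function='md5'))  # e.g. [3, 1, 4]
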
#file clustering.py 
from sklearn.cluster import KMeans
import numpy
from schemas import *  # Pony entities (Mail, Labels1, ...) and the db object
from keras.preprocessing.text import hashing_trick
from pydash import *
db.bind(provider='sqlite', filename='mails.sqlite', create_db=True)
db.generate_mapping(create_tables=True)
# Load every stored email and keep only the subject prefix before ':'
# (e.g. 'PROD OK', 'TEST KO').
with db_session:
    mail_list = Mail.select()[:]
clean_subjects = []
for x in mail_list:
    clean_subjects.append(x.subject.split(':')[0].strip())
def text_to_matrix(text):
    # Hash each token to an integer in [1, l]. Ideally l would be derived
    # from the vocabulary size; a small fixed value is enough for these subjects.
    l = 5
    return hashing_trick(text, l, hash_function='md5')
def dense_matrix2d(result):
    # Tokenize every subject, then zero-pad the rows to the same length
    # so KMeans receives a dense 2D matrix.
    max_len = 0
    txt = []
    for x in result:
        tmp = text_to_matrix(x)
        txt.append(tmp)
        if max_len < len(tmp):
            max_len = len(tmp)
    zer = numpy.zeros((len(result), max_len))
    for x in range(0, len(result)):
        for y in range(0, len(txt[x])):
            zer[x][y] = txt[x][y]
    return zer
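
As a quick sanity check, here is a hypothetical call (the subjects are invented and the exact integers depend on the hash):

# Each row is a tokenized subject, zero-padded to the longest one.
sample = ['PROD OK: wkf_A', 'TEST KO: wkf_B']
print(dense_matrix2d(sample))
# e.g. [[3. 1. 4.]
#       [2. 5. 4.]]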

3 – CLASSIFICATION!

In the last step, I will use scikit-learn’s KMeans to cluster and classify my email tokens. I want it to recognize “Production OK”, “Production KO”, “Test OK” and “Test KO”.

if __name__ == '__main__':
    seed = 7
    numpy.random.seed(seed)
    zer = dense_matrix2d(clean_subjects)
    print(zer)
    L = 4  # number of clusters
    kmeans = KMeans(n_clusters=L, random_state=0).fit_predict(zer)
    cnt_labels = count_by(clean_subjects)
    cnt_kmeans = count_by(kmeans)
    print([cnt_labels,cnt_kmeans])
    for k in cnt_labels.keys():
        print(k,cnt_labels[k])
        with db_session:
            #storing labels and tokens
            a = Labels1(label=k,cnt=cnt_labels[k])
    for k in cnt_kmeans.keys():
        print(k,cnt_kmeans[k])
        #storing kmean labels
        with db_session:
            a = Klabels1(cls=k,cnt=cnt_kmeans[k])
    for x in range(0, len(mail_list)):
        # Store one classification row per email; the send date is
        # stringified and stripped of its UTC offset.
        with db_session:
            n = Classifications1(cls=kmeans[x].item(),
                                 orig_id=mail_list[x].id,
                                 send_date=str(mail_list[x].send_date).replace('+00:00', ''))
    print(kmeans)
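
One caveat: the KMeans cluster ids are arbitrary integers, so they will not line up with the human-readable labels by themselves. A minimal sketch of how to map each cluster to its majority label (label_clusters is a hypothetical helper, not part of the script above):

from collections import Counter, defaultdict

def label_clusters(subjects, cluster_ids):
    # Count how often each cleaned subject ('PROD OK', 'TEST KO', ...)
    # falls into each cluster, then pick the majority as the cluster label.
    buckets = defaultdict(Counter)
    for subject, cid in zip(subjects, cluster_ids):
        buckets[cid][subject] += 1
    return {cid: counts.most_common(1)[0][0] for cid, counts in buckets.items()}

# e.g. label_clusters(clean_subjects, kmeans)
# -> {2: 'PROD OK', 1: 'TEST OK', 0: 'TEST KO', 3: 'PROD KO'}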

Results

Below, for each mail subject, we have the automatic cluster assignment (0-3) and the corresponding label. The result looks accurate to me, but you could object: “wait, couldn’t you just take a substring of the mail subject?” Yes, of course, in this case, but it wouldn’t be as fun and it wouldn’t scale! This is a proof of concept and it can be easily reproduced.

Mail Subject                     KMeans Cluster   KMeans Class Label
PROD OK: wkf_SA0                 2                PROD OK
PROD OK: wkf_ON_ORDER_DWH        2                PROD OK
PROD OK: wkf_PUB                 2                PROD OK
PROD OK: wkf_DIM                 2                PROD OK
TEST OK: wkf_F                   1                TEST OK
TEST KO: wkf_SA0_PRE_FF_ALL      0                TEST KO
TEST OK: wkl_DQ_REPORTS          1                TEST OK
TEST OK: wkf_TDS                 1                TEST OK
PROD OK: wkf_SA0_PRE_FF_LOG      2                PROD OK
TEST OK: wkf_WKR                 1                TEST OK
TEST OK: wkf_WKRI                1                TEST OK
PROD OK: wkf_SA0_PRE_FF_ALL      2                PROD OK
TEST OK: wkf_PUB                 1                TEST OK
TEST KO: wkf_SA0_PRE_FF_DAILY    0                TEST KO
TEST OK: wkf_DIME                1                TEST OK
PROD OK: wkf_WKR                 2                PROD OK
TEST OK: wkf_PF                  1                TEST OK
PROD KO: wkf_PUB                 3                PROD KO
TEST KO: wkf_PF                  0                TEST KO
TEST OK: wkf_SA0_PRE_FF_DAILY    1                TEST OK
PROD KO: wkf_ON_ORDER_DWH        3                PROD KO
TEST KO: wkf_WKR                 0                TEST KO
TEST KO: wkf_F                   0                TEST KO
PROD OK: wkf_TDS                 2                PROD OK
TEST KO: wkf_PRIC                0                TEST KO
TEST KO: wkf_PUB                 0                TEST KO
TEST OK: wkf_PRIC                1                TEST OK
TEST KO: wkf_SA0_PRE_FF          0                TEST KO
PROD OK: wkf_PF                  2                PROD OK
PROD OK: wkf_IN_TRANSIT_JP       2                PROD OK
PROD OK: wkf_I                   2                PROD OK
PROD KO: wkf_SA0_PRE_FF          3                PROD KO
TEST KO: wkf_SA0                 0                TEST KO
TEST KO: wkf_SA0_FF_ALL_DAILY    0                TEST KO
TEST OK: wkf_SA0_FF_ALL_DAILY    1                TEST OK
TEST OK: wkf_ON_ORDER_DWH        1                TEST OK
PROD KO: wkf_SA0_PRE_FF_ALL      3                PROD KO
TEST OK: wkf_INIT                1                TEST OK
TEST OK: wkf_SA0_PRE_FF_ALL      1                TEST OK
TEST OK: wkf_SA0                 1                TEST OK
TEST OK: wkf_IN_TRANSIT_JP       1                TEST OK
TEST OK: wkf_SA0_PRE_FF_LOG      1                TEST OK
PROD OK: wkf_F                   2                PROD OK
TEST KO: wkf_ON_ORDER_DWH        0                TEST KO
TEST KO: wkf_IN_TRANSIT_JP       0                TEST KO
PROD OK: wkf_INIT                2                PROD OK
PROD OK: wkf_WKRI                2                PROD OK
TEST OK: wkf_I                   1                TEST OK

Next

Here is what we can do from this point on:

  1. Create a deep learning model to understand whether steps finish on time or deviate from the standard (anomaly detection)
  2. Change the dataset and reuse the same model
  3. Try another algorithm, such as logistic regression, and so on

Thank you and stay tuned!
Bye!
Graziano

[Want to know more about us & Advanced Analytics? Contact us]