Is this a toxic comment?

False news and toxic comments on the web are no longer merely a nuisance: they can topple governments or act as a catalyst for communal disharmony. This gives the recently concluded toxic comment identification competition on Kaggle added value.

Introduction

The Kaggle competition was organized by the Conversation AI team as part of its effort to improve online conversations. Their current public models are available through the Perspective API, but the team was looking to explore better solutions through the Kaggle community.

EDA

In terms of size, the dataset is relatively small, with the training set containing 134,384 records and the test set 117,888.

Training and test sets both contain the following fields.

  • ID – a random unique string
  • Comment – the text of the comment

In addition, the training set contains the following six binary label fields. These labels are not mutually exclusive: a comment can be both “Toxic” and “Severe_toxic”.

  • Toxic
  • Severe_toxic
  • Obscene
  • Threat
  • Insult
  • Identity_hate

Here is the distribution of classes in the training set.

Distribution of classes

Number of tags applied per comment.

Number of tags applied per comment

Most frequent toxic words per class, taken from jagangupta’s kernel.

Most frequent toxic words

If interested, Jagangupta’s brilliant kernel explores the dataset in depth and presents a detailed report.

Text pre-processing

A little text pre-processing improved the results by roughly 0.0005 mean ROC AUC (more on evaluation later). Replacing IP addresses with a token, and replacing abbreviations, emojis and expressions such as “gooood” with appropriate normalized words helped. Removing stop-words seemed to hinder sequence models rather than help them, which suggests the models were able to use at least some of those words. I also kept a few symbols such as “!” and “?” since both fastText and GloVe embeddings have representations for them.

import re

def clean(comment):
    """
    Receive a raw comment and return a cleaned, normalized version of it.
    """
    comment = comment.lower()
    # normalize IP addresses to a single token
    comment = re.sub(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}", " ip ", comment)
    # collapse characters repeated three or more times to two (e.g. "gooood" -> "good")
    comment = re.sub(r"(\w)\1{2,}", r"\1\1", comment)
    # surround ! and ? with spaces so they can be kept as separate tokens by Keras
    comment = re.sub(r"(!|\?)", r" \1 ", comment)

    # split the comment into words
    words = comment.split(' ')

    # normalize common abbreviations
    # replacements is a dictionary loaded from https://drive.google.com/file/d/0B1yuv8YaUVlZZ1RzMFJmc1ZsQmM/view
    words = [replacements.get(word, word) for word in words]

    return " ".join(words)

Another interesting point is the maximum number of features used: it stops having any noticeable effect after a certain threshold. So to save computation cost and time, it’s worth setting a limit. In the Keras tokenizer, this can be achieved by setting the num_words parameter, which limits the vocabulary to the n most frequent words in the dataset. In this case, I settled for 100,000 as the maximum number of words used for models.

The same holds for the sequence length used to represent a sentence, which can be selected by looking at the following graph.

Number of words in a sentence. Source: https://www.kaggle.com/sbongo/for-beginners-tackling-toxic-using-keras

Sentences longer than a given threshold need to be truncated, while shorter sentences need to be padded to fit the length. This is required before feeding a dataset to a sequence model because the model needs a defined number of units. In all experiments detailed here, I used 200 as the sequence length since longer sequences made no noticeable difference.

import numpy as np
import pandas as pd
from keras.preprocessing import text, sequence

max_features = 100000 # maximum number of words kept by the tokenizer
maxlen = 200 # length of each padded/truncated sequence
embed_size = 300 # dimension of the pre-trained word vectors
EMBEDDING_FILE = './data/fasttext/crawl-300d-2M.vec'

train = pd.read_csv('../data/train.csv')
test = pd.read_csv('../data/test.csv')

x_train = train["comment_text"].fillna("fillna").values
x_test = test["comment_text"].fillna("fillna").values

# default filters parameter removes symbols ! and ? which we want to keep
tokenizer = text.Tokenizer(num_words=max_features, filters='"#$%&()*+,-./:;<=>@[\\]^_`{|}~\t\n')
tokenizer.fit_on_texts(list(x_train) + list(x_test))
x_train = tokenizer.texts_to_sequences(x_train)
x_test = tokenizer.texts_to_sequences(x_test)

# pad or truncate sentences to the fixed sequence length
x_train = sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = sequence.pad_sequences(x_test, maxlen=maxlen)

# build a mapping of word to its embedding vector
def get_coefs(word, *arr):
    return word, np.asarray(arr, dtype='float32')

with open(EMBEDDING_FILE, encoding='utf-8') as f:
    embeddings_index = dict(get_coefs(*line.rstrip().rsplit(' ')) for line in f)

# build the embedding matrix; row i holds the vector of the word with index i
word_index = tokenizer.word_index
nb_words = min(max_features, len(word_index) + 1) # +1 because word indices start at 1
embedding_matrix = np.zeros((nb_words, embed_size))
for word, i in word_index.items():
    if i >= nb_words: continue
    embedding_vector = embeddings_index.get(word)
    # the length check skips malformed entries such as the header line of the .vec file
    if embedding_vector is not None and len(embedding_vector) == embed_size:
        embedding_matrix[i] = embedding_vector

Models

The evaluation of models was based on the mean column-wise ROC AUC. That is, averaging the ROC AUC score of each column.
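To make the metric concrete, here is a minimal pure-Python sketch of mean column-wise ROC AUC. The function names `roc_auc` and `mean_columnwise_auc` and the toy values are illustrative assumptions, not the competition's scoring code; in practice scikit-learn's `roc_auc_score` would normally be applied to each label column instead.

```python
def roc_auc(y_true, y_score):
    """AUC as the probability that a random positive outranks a random negative."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC is undefined when only one class is present")
    # count pairwise wins; ties count as half a win
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mean_columnwise_auc(y_true_rows, y_score_rows):
    """Average the per-label AUC over all label columns."""
    n_labels = len(y_true_rows[0])
    aucs = []
    for j in range(n_labels):
        col_true = [row[j] for row in y_true_rows]
        col_score = [row[j] for row in y_score_rows]
        aucs.append(roc_auc(col_true, col_score))
    return sum(aucs) / n_labels


# toy example with two labels (hypothetical values)
y_true = [[1, 0], [0, 1], [1, 1], [0, 0]]
y_pred = [[0.9, 0.2], [0.1, 0.8], [0.4, 0.3], [0.35, 0.6]]
score = mean_columnwise_auc(y_true, y_pred)  # (1.0 + 0.75) / 2 = 0.875
```

Since the score depends only on how predictions are ordered within each column, rescaling a column's probabilities does not change it.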

Though I tried a number of models and variations, they can be roughly summarized into five models.

  1. Logistic regression model combining unigram, bigram and character level features with tf-idf weighting
  2. LSTM/GRU based model with GloVe and fastText embeddings
  3. LSTM/GRU + CNN model with GloVe and fastText embeddings
  4. A deep CNN model
  5. Sequence layer with attention

Here is each model in detail.

    1. Logistic regression

The final LR model combined three sets of tf-idf weighted features: the 10,000 most frequent unigrams, the 10,000 most frequent bigrams, and the 50,000 most frequent character n-grams of lengths 2 to 6. With a single cross validation split of 5%, this model achieved a highest LB score of 0.9805.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from scipy.sparse import hstack

class_names = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']

# unigram feature extractor
unigram_vectorizer = TfidfVectorizer(
    sublinear_tf=True, strip_accents='unicode', analyzer='word', ngram_range=(1, 1),
    use_idf=True, smooth_idf=True, stop_words='english', max_features=10000
)
unigram_vectorizer.fit(all_text) # all_text is concat of training and test text
train_unigram_features = unigram_vectorizer.transform(train_text)
test_unigram_features = unigram_vectorizer.transform(test_text)

# bigram feature extractor
bigram_vectorizer = TfidfVectorizer(
    sublinear_tf=False, strip_accents='unicode', analyzer='word', ngram_range=(2, 2),
    use_idf=True, smooth_idf=True, stop_words='english', max_features=10000
)
bigram_vectorizer.fit(all_text)
train_bigram_features = bigram_vectorizer.transform(train_text)
test_bigram_features = bigram_vectorizer.transform(test_text)

# character n-gram feature extractor
# (stop_words only applies to analyzer='word', so it is omitted here)
char_vectorizer = TfidfVectorizer(
    sublinear_tf=True, strip_accents='unicode', analyzer='char',
    ngram_range=(2, 6), max_features=50000
)
char_vectorizer.fit(all_text)
train_char_features = char_vectorizer.transform(train_text)
test_char_features = char_vectorizer.transform(test_text)

# stack the three sparse feature matrices side by side
train_features = hstack([train_unigram_features, train_bigram_features, train_char_features]).tocsr()
test_features = hstack([test_unigram_features, test_bigram_features, test_char_features]).tocsr()

# train one binary classifier per label; submission is the DataFrame of test predictions
for class_name in class_names:
    train_target = train[class_name]
    classifier = LogisticRegression(solver='sag')

    classifier.fit(train_features, train_target)
    submission[class_name] = classifier.predict_proba(test_features)[:, 1]

    2. A sequence model with bi-directional LSTM/GRU with embeddings

Essentially these are bi-directional LSTM or GRU models taking embeddings as input. Initially I tried a many-to-one model (returning only the last state) for the sequence layer, but this performed poorly. Following the community’s lead, I changed the sequence layer to return the states of all units; these time-distributed signals were captured by two pooling layers (average and max), then concatenated and used as input for the final layer: six densely connected sigmoid units representing the six output labels. This performed much better, easily achieving around 0.9830 on LB.

I tried a few variations of the same model. Among them, a bi-directional LSTM/GRU layer connected to a time-distributed dense layer (figure a) and a multi-layer bi-directional LSTM/GRU model (figure b) are noteworthy. The multi-layer model seemed to overfit, and at best performed comparably to a single layer when regularized heavily with a high dropout rate. The single bidirectional LSTM layer connected to a time-distributed dense layer with a moderate dropout rate, however, performed a little better than a single bidirectional LSTM layer alone. So this last model was used for further model averaging.

One observation from this experiment was that it made little difference in accuracy whether LSTM or GRU units were used for the sequence layer. Different embeddings, however, made a noticeable difference. I tried fastText (crawl, 300d, 2M word vectors) and GloVe (Crawl, 300d, 2.2M vocab vectors), and fastText embeddings worked slightly better in this case (~0.0002 to 0.0005 in mean AUC). I didn’t bother with training embeddings since there didn’t seem to be enough data to train them. This lecture explains what might happen when trying to train pre-trained embeddings on a small dataset.

For building models, Keras was used with the default TensorFlow backend. As for the hardware, AWS P2 instances (Tesla K80 GPUs) were used.

a) LSTM-Dense-Pooling

b) Multi-layer Bi-GRU

from keras.layers import (Input, Embedding, SpatialDropout1D, Bidirectional, LSTM,
                          TimeDistributed, Dense, Dropout, GlobalAveragePooling1D,
                          GlobalMaxPooling1D, concatenate)
from keras.models import Model
from keras.optimizers import Nadam

def build_model(): # figure (a) as a Keras model
    inp = Input(shape=(maxlen, ))
    x = Embedding(max_features, embed_size, weights=[embedding_matrix])(inp)
    x = SpatialDropout1D(0.4)(x)
    x = Bidirectional(LSTM(64, return_sequences=True, recurrent_dropout=0.0, dropout=0.2))(x)
    x = TimeDistributed(Dense(100, activation="relu"))(x) # time-distributed dense layer (ReLU)
    x = Dropout(0.1)(x)

    # global pooling layers over all time steps
    avg_pool = GlobalAveragePooling1D()(x)
    max_pool = GlobalMaxPooling1D()(x)
    conc = concatenate([avg_pool, max_pool])
    # one sigmoid unit per output label
    outp = Dense(6, activation="sigmoid")(conc)

    model = Model(inputs=inp, outputs=outp)

    return model

model = build_model()
opt = Nadam(lr=0.001)
model.compile(loss='binary_crossentropy', optimizer=opt, metrics=['accuracy'])
model.summary()

    3. LSTM/GRU + CNN model with GloVe and fastText embeddings

This model is an attempt at combining sequence models with convolutional neural networks (CNNs). At the beginning I tried a convolution layer passing signals to the sequence layer. But it seems that swapping these layers (embeddings feeding into the LSTM first, then a CNN applied to each LSTM unit’s state) and pooling to an output layer brings better results. This study and kernel better explain this model.

CNN-LSTM-pooling model

from keras.layers import (Input, Embedding, SpatialDropout1D, Bidirectional, LSTM, Conv1D,
                          Dropout, Dense, GlobalAveragePooling1D, GlobalMaxPooling1D,
                          concatenate)
from keras.models import Model

def build_model(): # the CNN-LSTM-pooling model as a Keras model
    inp = Input(shape=(maxlen, ))
    x = Embedding(max_features, embed_size, weights=[embedding_matrix])(inp)
    x = SpatialDropout1D(0.4)(x)
    x = Bidirectional(LSTM(80, return_sequences=True, recurrent_dropout=0.2, dropout=0.2))(x)
    # convolution over the LSTM states of neighbouring time steps
    x = Conv1D(filters=64, kernel_size=2, padding='valid', kernel_initializer="he_uniform")(x)
    x = Dropout(0.2)(x)

    avg_pool = GlobalAveragePooling1D()(x)
    max_pool = GlobalMaxPooling1D()(x)
    conc = concatenate([avg_pool, max_pool])

    outp = Dense(64, activation="relu")(conc)
    outp = Dropout(0.1)(outp)
    outp = Dense(6, activation="sigmoid")(outp)

    model = Model(inputs=inp, outputs=outp)
    return model

As a variation, I tried to emulate bigrams and trigrams using kernel sizes of 2 and 3 in the CNN layer and concatenating their outputs through pooling layers. But again, this seemed to overfit.

    4. A deep CNN model

This model is based on this paper and this kernel. In summary, it consists of multiple convolution and pooling layers with skip connections. Unfortunately, it didn’t perform up to the mark of the above two models: fastText and GloVe embeddings each scored 0.9834, and averaging them gave 0.9843 on LB. This could be due either to not spending enough time on tuning it or to overfitting again. Still, this could work with more data, and indeed it has according to the stats reported in the paper.

    5. Sequence layer with attention

This was a half-hearted attempt, but I still wanted to try out how well attention works. I used an attention layer and a secondary LSTM layer. Strangely, it couldn’t surpass the first two models without attention. It probably could have with more time spent on tuning, but the training time needed was higher than for the other models.

Attention with LSTM (Source: https://www.coursera.org/learn/nlp-sequence-models/)

Model averaging and ensemble

Due to the relatively small dataset size, overfitting seemed to be an issue. So it’s not surprising that model averaging and regularization showed a strong positive effect on the prediction accuracy. Averaging was done in various forms: one was stratified 10-fold training, which improved the performance noticeably, though obviously it took more time. When averaging folds, a weighted average or ranked average performed slightly better than taking the mean of predictions.
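As a rough illustration of ranked averaging, the sketch below converts each fold's predictions for one label into normalized ranks and then averages them, optionally with weights. The helper names `to_ranks` and `rank_average` and the toy numbers are assumptions for illustration, not the exact procedure used in the competition.

```python
def to_ranks(preds):
    """Map predictions to ranks scaled into (0, 1]; ties get their average rank."""
    order = sorted(range(len(preds)), key=lambda i: preds[i])
    ranks = [0.0] * len(preds)
    i = 0
    while i < len(order):
        j = i
        # extend j over the run of tied values
        while j + 1 < len(order) and preds[order[j + 1]] == preds[order[i]]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0  # 1-based average rank of the tied run
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank / len(preds)
        i = j + 1
    return ranks


def rank_average(pred_lists, weights=None):
    """Weighted average of rank-transformed predictions from several models."""
    if weights is None:
        weights = [1.0] * len(pred_lists)
    total = sum(weights)
    ranked = [to_ranks(p) for p in pred_lists]
    n = len(pred_lists[0])
    return [sum(w * r[i] for w, r in zip(weights, ranked)) / total for i in range(n)]


# predictions for one label from two hypothetical folds
fold_a = [0.10, 0.80, 0.30, 0.95]
fold_b = [0.20, 0.60, 0.70, 0.90]
blended = rank_average([fold_a, fold_b])  # [0.25, 0.625, 0.625, 1.0]
```

Because ROC AUC depends only on the ordering of predictions, averaging ranks avoids one fold dominating simply because its probabilities are on a different scale.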

Another interesting tidbit is how to use stratified k-fold with multi-label classification, since the popular scikit-learn StratifiedKFold class only supports single-label splits. Here, numpy.packbits can come in handy.

import numpy as np
from sklearn.model_selection import StratifiedKFold

n_folds = 10
pred = np.zeros((x_test.shape[0], 6))

# y_train holds the six binary label columns, shape (n_samples, 6);
# packing the bits of each row into one integer turns every label combination into a single class
y_packed = np.packbits(y_train.astype(np.uint8), axis=1).ravel()

kfold = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=32)

for i, (train_idx, valid_idx) in enumerate(kfold.split(x_train, y_packed)):
    print("Running fold {} / {}".format(i + 1, n_folds))
    print("Training / Valid set counts {} / {}".format(train_idx.shape, valid_idx.shape))

    # train the model

Model averaging was done again using different embeddings: here, the predictions produced by the same model using GloVe and fastText embeddings were averaged. This step improved the final score noticeably: the best single LSTM-CNN model achieved 0.9856 on LB after averaging predictions from the two embeddings, whereas GloVe and fastText alone only got 0.9850 and 0.9854 respectively using the same model.

Towards the end, the competition turned into ensemble madness. I would have liked to try some stacking, which in my opinion is a better way of combining models than using randomly conjured up coefficients.

Improvements and notes from the community

In my opinion, the best feature of Kaggle competitions is the collaborative learning experience. Here are some of the effective techniques that have been used by other teams.

    1. Augmenting the train/test dataset with translation

Using translations is an interesting method of augmenting the dataset, and it has worked wonderfully without leaking information. The technique is quite simple: translate sentences into a few nearby languages, and then translate them back to English. When you think about it, this makes sense since it can help reduce overfitting by adding more data with variations in sentence structure. More details can be found in this kernel.

    2. Adding more embeddings

It seems much of the complexity in this case comes from the embedding layer; and using more embeddings helps more than using different structures. So using all variations of GloVe, Word2vec, LexVec and fastText (e.g., Crawl, Twitter, Wikipedia) can help by averaging resulting predictions. More on various pre-trained embeddings can be found from here and here.

    3. Byte-pair encoding (BPE)

Usually in any text processing work there is a large number of out-of-vocabulary words, that is, words that are not found in the embeddings. The standard way of handling such words is to replace them with a token such as <unk> and move on. Byte-pair encoding has shown better results in handling these kinds of words by breaking each word into subwords (similar to phonemes in speech recognition) and using the embeddings of these subwords to arrive at the full word. More on this can be found here.
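To illustrate the idea, below is a simplified sketch of how BPE merge rules can be learned from word frequencies and then replayed to split an unseen word into subwords. The function names and the tiny word-count dictionary are hypothetical, and this is not the implementation used by any team in the competition; real BPE tools also handle word-boundary markers and much larger vocabularies.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` by the merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out


def learn_bpe(word_counts, num_merges):
    """Learn BPE merge rules from a {word: frequency} dictionary."""
    # every word starts as a tuple of single characters
    vocab = {tuple(word): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        # count how often each adjacent symbol pair occurs, weighted by word frequency
        pairs = Counter()
        for symbols, count in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair
        merges.append(best)
        vocab = {tuple(merge_pair(list(s), best)): c for s, c in vocab.items()}
    return merges


def segment(word, merges):
    """Split a (possibly unseen) word into subwords by replaying the merges in order."""
    symbols = list(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols


# hypothetical word frequencies
word_counts = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges = learn_bpe(word_counts, num_merges=4)
subwords = segment("lowest", merges)  # "lowest" is unseen, yet splits into ['low', 'est']
```

The embedding of an out-of-vocabulary word could then be composed from the embeddings of its subwords, for example by averaging them.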

    4. Capsule networks

A few teams tried Geoffrey Hinton’s Capsule Networks and reported that they overfit. Still, it’s something worth trying out.

Useful resources:


Notes from Quora duplicate question pairs finding Kaggle competition

The Quora duplicate question pairs Kaggle competition ended a few months ago, and it was a great opportunity for all NLP enthusiasts to try out all sorts of nerdy tools in their arsenals. This is just jotting down notes from that experience.

Introduction

Quora has over 100 million users visiting every month, and needs to identify duplicate questions submitted — an incident that should be very common with such a large user base. One interesting characteristic that differentiates it from other NLP tasks is the limited amount of context available in the title; in most cases this would amount to a few words.

Exploratory data analysis

The dataset is as simple as it can get: both training and test sets consist of the two questions under consideration. Additionally, the training set has a few extra columns: one denoting whether the pair is a duplicate, and two more for the unique IDs of each question.

qid1, qid2 – unique ids of each question (only available in the training set)
question1, question2 – the full text of each question
is_duplicate – the target variable; set to 1 if question1 and question2 essentially have the same meaning; 0 otherwise.

Some quick stats:

  • Training set size – 404,290
  • Test set size – 2,345,796
  • Total training vocabulary – 8,944,593
  • Avg. word count per question – 11.06

A quick EDA reveals some interesting insights into the dataset.

  • Classes are not balanced.

 

Training class balance

In the training/validation set, the duplicate percentage (label 1) is ~36.9%. Since the class balance can influence some classifiers, this fact becomes useful when training models later.

 

  • Normalized unigram word shared counts can be a good feature

Shared unigram counts

Violin plot of shared word counts

 

When the shared word ratio (Jaccard similarity) is considered, this becomes even more prominent:

\begin{equation*} \frac{|\textit{question1 words} \cap \textit{question2 words}|}{|\textit{question1 words} \cup \textit{question2 words}|} \end{equation*}
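As a small sketch of this ratio, the function below computes the Jaccard similarity over the sets of lowercased, whitespace-separated words of two questions. The tokenization and the function name are simplifying assumptions; the actual feature may have used a different tokenizer or stop-word handling.

```python
def jaccard_similarity(question1, question2):
    """Shared word ratio: size of the word intersection over size of the word union."""
    words1 = set(question1.lower().split())
    words2 = set(question2.lower().split())
    union = words1 | words2
    if not union:
        return 0.0  # both questions are empty
    return len(words1 & words2) / len(union)


q1 = "How do I learn Python quickly"
q2 = "How can I learn Python fast"
ratio = jaccard_similarity(q1, q2)  # 4 shared words out of 8 distinct words = 0.5
```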

Violin plot of shared word ratio

 

The correlation of shared unigram counts with the class further indicates that other n-grams could perhaps also serve as features in our model.

Arguably the best perk of being part of a Kaggle competition is the incredible community. Here are some in-depth EDAs carried out by some of its members:

 

Statistical modelling

XGBoost is a gradient boosting framework that has become massively popular, especially in the Kaggle community. The popularity is not undeserved: it has won many competitions in the past and is known for its versatility. So XGBoost was used as the primary model with the following parameters, selected based on performance on the validation set.

  objective = 'binary:logistic'
  eval_metric = 'logloss'
  eta = 0.11
  max_depth = 5

Before discussing the features used, there’s one neat trick that I believe everyone who did well in the competition used. After the first few submissions, it became apparent that something was wrong when comparing the results obtained on the validation set with the Kaggle leaderboard (LB). No matter how many folds were used for validation, the validation results were not reflected on the LB. This was because the class balance of the training set differed considerably from that of the test set, and the cost function (logloss) is sensitive to this imbalance. Specifically, around 37% of the labels in the training set were positive, while in the test set the share was estimated to be around 16.5%. So some oversampling of the negatives in the training set was required to get a comparable result on the LB. More on oversampling can be found here and here.
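The arithmetic behind this rebalancing can be sketched as follows. Using the 37% and 16.5% positive rates quoted above, the code works out how many times each negative example has to appear so that the positive share drops to the target rate, and then builds an oversampled index list. The function names and the toy label list are assumptions for illustration; this is not the exact script used in the competition.

```python
def negative_oversampling_factor(train_pos_rate, target_pos_rate):
    """How many times each negative must appear so positives drop to the target rate."""
    pos = train_pos_rate
    neg = 1.0 - train_pos_rate
    # solve pos / (pos + factor * neg) = target_pos_rate for factor
    return pos * (1.0 - target_pos_rate) / (target_pos_rate * neg)


def oversample_negatives(labels, factor):
    """Return row indices in which negatives are repeated about `factor` times."""
    whole = int(factor)
    frac = factor - whole
    neg_idx = [i for i, y in enumerate(labels) if y == 0]
    pos_idx = [i for i, y in enumerate(labels) if y == 1]
    extra = neg_idx[: int(round(frac * len(neg_idx)))]  # partial extra copy
    return pos_idx + neg_idx * whole + extra


factor = negative_oversampling_factor(0.37, 0.165)  # a little under 3

labels = [1] * 37 + [0] * 63  # toy training set with 37% positives
idx = oversample_negatives(labels, factor)
new_rate = sum(labels[i] for i in idx) / len(idx)  # close to 0.165
```

The resulting indices can be used to select rows of the training set before fitting, so that the validation logloss is computed on a class balance closer to that of the test set.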

Features

From a bird’s eye view, features used can be categorised into three groups.

  1. Classical text mining features
  2. Embedded features
  3. Structural features

The following features can be categorised under classical text mining features.

  • Unigram word match count
  • Ratio of the shared count (against the total words in 2 questions)
  • Shared 2gram count
  • Ratio of sum of shared tf-idf score against the total weighted word score
  • Cosine distance
  • Jaccard similarity coefficient
  • Hamming distance
  • Word counts of q1, q2 and the difference (len(q1), len(q2), len(q1) – len(q2))
  • Caps count of q1, q2 and the difference
  • Character count of q1, q2 and difference
  • Average length of a word in q1 and q2
  • Q1 stopword ratio, Q2 stopword ratio, and the difference of ratios
  • Exactly same question?

Since a large portion of the sentence pairs are questions, many duplicate questions start with the same question word (which, what, how, etc.). So a few more features were used to indicate whether this applies.

  • Q1 starts with ‘how’, Q2 starts with ‘how’, and both questions contain ‘how’ (3 separate features)
  • The same for the words ‘what’, ‘which’, ‘who’, ‘where’, ‘when’ and ‘why’
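A minimal sketch of these question-word features is shown below. The feature names and the simple whitespace tokenization are assumptions made for illustration, not the original feature code.

```python
QUESTION_WORDS = ["how", "what", "which", "who", "where", "when", "why"]


def question_word_features(question1, question2):
    """Three binary features per question word: q1 starts, q2 starts, both contain."""
    q1_tokens = question1.lower().split()
    q2_tokens = question2.lower().split()
    features = {}
    for word in QUESTION_WORDS:
        features["q1_" + word] = int(bool(q1_tokens) and q1_tokens[0] == word)
        features["q2_" + word] = int(bool(q2_tokens) and q2_tokens[0] == word)
        features[word + "_both"] = int(word in q1_tokens and word in q2_tokens)
    return features


feats = question_word_features("How do I learn Python?", "How can I learn Python fast?")
```

With seven question words this yields 21 binary columns per question pair, which can be appended to the other features as they are.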

Some fuzzy features were generated with the script here, which in turn uses the fuzzywuzzy package:

  • Fuzzy WRatio
  • Fuzzy partial ratio
  • Fuzzy partial token set ratio
  • Fuzzy partial token sort ratio
  • Fuzzy qratio
  • Fuzzy token set ratio
  • Fuzzy token sort ratio

As for the embedded features, Abhishek Thakur’s script did everything needed: it generates a word2vec representation of each word using a word2vec model pre-trained on the Google News corpus, loaded with the gensim package. It then generates a sentence representation by summing the word vectors and normalizing the result to unit length.

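import numpy as np
from nltk import word_tokenize
from nltk.corpus import stopwords

stop_words = stopwords.words('english')
# model is the pre-trained Google News word2vec model loaded with gensim

def sent2vec(s):
  words = str(s).lower()
  words = word_tokenize(words)
  words = [w for w in words if w not in stop_words]
  words = [w for w in words if w.isalpha()]
  M = []
  for w in words:
    try:
      M.append(model[w])
    except KeyError:
      # skip words missing from the word2vec vocabulary
      continue

  M = np.array(M)
  v = M.sum(axis=0)

  # normalize the summed vector to unit length
  return v / np.sqrt((v ** 2).sum())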

Based on the vector representations of the sentences, following distance features were generated by the same script.

Combined with these calculated features, the full 300-dimension word2vec representations of each sentence were used for the final model. Adding the raw vectors required a large expansion of the AWS server I was using, but in hindsight brought little improvement.

Structural features have caused much argument within the community. These features aren’t meaningful NLP features, but the way the dataset was formed gave rise to some patterns within it. It’s doubtful whether these features would be of much use in a real-world scenario, but within the context of the competition they gave a clear boost. So I guess everyone used them, disregarding whatever moral compunctions one might have had.

These features include,

  • Counting the number of questions shared between two sets formed by the two questions

      from collections import defaultdict

      # maps each question to the set of questions it has been paired with
      q_dict = defaultdict(set)

      def build_intersects(row):
        q_dict[row['question1']].add(row['question2'])
        q_dict[row['question2']].add(row['question1'])

      def count_intersect(row):
        return len(q_dict[row['question1']].intersection(q_dict[row['question2']]))

      # raw=True is dropped so that each row is a Series that can be indexed by column name
      df_train.apply(build_intersects, axis=1)
      intersect_counts = df_train.apply(count_intersect, axis=1)

  • Shared word counts, tf-idf and cosine scores within these sentence clusters
  • The page rank of each question (within the graph induced by questions as nodes and shared questions as edges)
  • Max k-cores of the above graph

Results

The effect of features on the final result can be summarized by the following figures.

  • With only classic NLP features (at XGBoost iteration 800):
    • Train-logloss:0.223248, eval-logloss:0.237988 (0.21861 on LB)

  • With both classic NLP + structural features (at XGBoost iteration 800):
    • Training accuracy: 0.929, Validation accuracy: 0.923
    • Train-logloss:0.17021,  eval-logloss:0.185971 (LB 0.16562)

 

  • With classic NLP + structural + embedded features (at XGBoost iteration 700):
    • Training accuracy: 0.938, Validation accuracy: 0.931
    • Train-logloss:0.149663, eval-logloss:0.1654 (LB 0.14754)

Rank-wise, this feature set and model reached the top 3% at one point, though it slipped to the top 7% by the end due to my lethargic finish. But considering it was an individual effort, mostly against teams using several ensemble models, I guess it wasn’t bad. More than anything, it was great fun and a good opportunity to play with some of the best ML competitors in the Kaggle community/world and to learn collaboratively from that community.

I’ve shared the repository of Jupyter notebooks used; it can be found here.

PS: Some of the stuff that I wanted to try out, but didn’t get to:


The necessity of lifelong learning

Live as if you were to die tomorrow. Learn as if you were to live forever.
― Mahatma Gandhi

The term “lifelong learning” sounds nonsensical when you consider that learning from experience is an intrinsic function built into all humans and animals. But today, in the context of rapid advances in AI and automation, the term carries a different meaning. This post discusses why lifelong learning is increasingly needed today, and encourages everyone to take up active learning and expand their horizons if they haven’t started already.

The pace of technological advancement

The consensus is that what you learn today will be out of date within 5-10 years. By that argument alone, it’s a no-brainer that we should keep learning. The pace of advance is almost tangible in technical fields, and not taking time to update yourself would be a critical career mistake. Since my experience is with computer science, this post will focus more on CS, but I believe it holds true for most other areas as well.

I doubt there’s any other field that’s advancing as fast as CS at the moment (definitely subjective :)). Most of us working in the field acknowledge this fact and accept the challenge, and even call it an endearing quality. At any rate, a change of tools is expected every 5-10 years in CS, so this shouldn’t be anything new. However, just changing tools will not be enough if you want to get into emerging CS domains such as the Internet of Things (IoT), Software Defined Networking (SDN), deep learning, etc., which generally have strong theoretical foundations. Here, online courses can help in two ways.

1. You will probably need more maths and/or computer science fundamentals such as operating systems, networks, algorithms, etc. This is where MOOCs and especially Khan Academy can be of great help. They can help us revise old maths lectures and fundamentals.

2. Once in a while there are wonderful offerings on emerging topics by pioneering researchers, and these courses can bring you much closer to the “edge” than what you would normally find in a regular class.

Automation and consequences

Marc Andreessen famously wrote some time ago that software is eating the world; now it’s probably time to say specifically that artificial intelligence is eating the world, or at least that it’s going to. With ever increasing computational power and lifelong efforts by some great scientists, today we are seeing very exciting advances on a weekly basis. Even though it took self-driving cars and Watson to bring AI to the mainstream, AI has been here for almost as long as the computer itself. Since the coining of the term in 1956, it has gone through various stages of evolution. From the golden era of logic based reasoning to perceptrons and the subsequent AI winter, through to the advent of neural networks and the current deep learning frenzy: AI has indeed come a long way.

There’s no question that this wave of AI and automation is going to affect the way we work. The question is how much it’s going to change, and whether we really need to worry. After all, during the last century the world saw some major revolutions in the way humans work, so why should this be any different? With every major disruptive innovation, there has been both an expiration of traditional jobs and a creation of new jobs.

One main difference I see with AI based automation is that it’s not trying to emulate a single function, as has traditionally happened. For example, the moves from horse-drawn carriages to automobiles, or from paper to digital media, revolutionized human civilization as we know it. But each of these was limited to one specific area. What’s happening today with AI is that it’s trying to emulate skills that have been intrinsically marked as human territory, and to do so with human-level precision: cognition and decision making key among them. With such faculties being outsourced to machines, there’s no telling how widespread the effect will be.

While machine learning researchers caution the world to brace for waves of mass unemployment, some are of the opinion that the effect will be similar to disruptions that happened in the past. While I agree with the former school of thought, I doubt anyone has a good estimate. This is probably why the White House policy paper on AI discusses both overestimated and underestimated influences. Indeed some effects are quite unexpected. But looking at how things are going, we can already see that some industries like transportation are due for a rude disruption. Here is another estimate of what types of jobs are more prone to being taken over. It can be expected that single-skill jobs will continue to decay while jobs that require social or maths skills will remain largely unaffected or see more demand.

In summary, I think we can all agree that this wave of AI is going to affect how we work, and as the wise say: it’s better to be safe than sorry. If you still think this is far in the future, it’s time to think again.

Technology domain is interconnected

Again this is mostly with regard to computer science, but it may hold true in other fields as well. Today, to get meaningful work done, you usually need to tread across at least a few disciplines. If you are a software engineer, it’s not enough to know the fundamentals and a few languages; depending on your flavour, you may also need systems, embedded systems, distributed systems, web security, big data and the like. If you are into data science (a cross discipline to begin with), there’s no escaping from learning, from statistics to CS and everything in between! Each of these fields is vast on its own and advances rapidly, just like most areas in CS. In that sense, the words “Try to learn something about everything and everything about something” are more apt today than at any other time.

With such a large scope to draw from and a rapidly advancing industry, I doubt any traditional college can satisfy the need, no matter how good the degree program is. Fortunately, today we don’t have to look beyond our browser to learn whatever topic we need, and the only question is whether we are ready to expand our horizons.

A modicum of balance to a knowledge driven world

With the ever-persistent brain drain from developing countries and today’s demand for knowledge-driven industries, most countries are at a severe disadvantage. With the imminent wave of automation, this kind of overwhelmingly biased world doesn’t look promising to begin with. Luckily, some very wise people, who also happen to be leading machine learning researchers, kicked off the drive for today’s online learning initiatives in parallel with the rise of AI (this is not in any way discounting the wonderful service rendered through MIT OpenCourseWare prior to the arrival of MOOCs). So it’s not an exaggeration to call such learning initiatives great equalizers in education and a step towards improving the world’s future living standards. As with everything today, some of them are becoming increasingly money-driven, but they have still started something that could change the world for the better.

What should we learn

A little humble bragging: I was an early adopter of MOOCs (as they were later named) in 2011 and finished both Prof. Andrew Ng’s first online machine learning course, which went on to become Coursera, and the first Intro to Artificial Intelligence course by Prof. Sebastian Thrun and Peter Norvig, which was the start of Udacity. Since then I have taken part in many courses but, as is the norm with MOOCs, in truth finished only a dozen or so. Anyway, I’d say I have about as good a rapport with MOOCs as you can get, and would like to share a few tips based solely on my subjective experience.

When it comes to learning, you can spend time on lots of very similar things but gain very little in return. In that sense, the classic “Teach Yourself Programming in Ten Years” by Peter Norvig is something everyone should read when deciding what to learn.

Another lesson I learnt is that even though courses are free and limitless, your time is not. So even when a course is really interesting, I now take the time to decide carefully whether it will help me expand my knowledge in something I really need. Also, rather than trying to keep up with a bunch of courses at once and not getting any of them fully done, restricting yourself to a few, depending on your schedule, and fully concentrating on them is far better. Again, this is a no-brainer, but our impulse is to grab everything that is free.

Another recent development is that all the online services are introducing specializations and mini-degree programs. I have doubts whether this is the best way to go from a learner’s point of view. One of the advantages of online learning is that you are not restricted by any institutional rules on what to learn or where to learn it. But with this type of mini-degree program, we are again bringing traditional restrictions into learning. Instead, I’d prefer to select my own meal, and if the courses are really good, pay for them or audit them until I’m convinced. But again, this is very much subjective.

In conclusion, learning is an intrinsic function built into everyone. But with this new order of the world, learning has turned into a fast lane, and if we don’t keep up with the speed, the world may move on and leave us stranded.

Machine Learning in SaaS paradigm

Our ultimate objective is to make programs that learn from their experience as effectively as humans do. We shall…say that a program has common sense if it automatically deduces for itself a sufficiently wide class of immediate consequences of anything it is told and what it already knows.

John McCarthy, “Programs with Common Sense”, 1958

For over half a century, machine learning has been a strong research topic within academic circles. But in the last decade or so it has made heavy inroads into the practical world of the tech industry, and today it’s no secret that most of the large players are using numerous machine learning techniques to enhance various aspects of their workflows. In this post I’m hoping to look at a few ways a SaaS application (presumably run by a startup without an infinite amount of resources) can use machine learning to enhance its overall experience.

Personalise user experience

In your experience of using SaaS products, how many times have you found that your favourite item is at the bottom of the list and you have to scroll half a mile or navigate through several layers of menus? Usability usually favours the majority, and you may discover the bitterness of being stuck in the minority.

For instance, if you visit the same coffee shop every morning and buy the same drink, and if the shop owners are good at their business, then they should know your preferences after a few days and you won’t have to go through the ordering routine every day. It may be just that the shop owner confirms “Same as usual?” and that’s it! So if your application is a bit more intelligent (i.e., good at its business), it could take a leaf out of this scenario and save users’ time, and in turn improve customer satisfaction, by learning more about their preferences. However, to be on the safe side, just as with the coffee shop owner’s confirmation, you may need to give the user an extra setting to confirm whether they want the app to learn their behaviour and adapt.
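The “Same as usual?” idea can be sketched as a simple frequency count per user. This is only an illustration under my own assumptions: the class name UsualChoice, the thresholds min_events and min_share, and the opt-in set are all hypothetical, and a real application would persist this data rather than keep it in memory.

```python
from collections import Counter, defaultdict


class UsualChoice:
    """Tracks what each opted-in user picks and proposes a default."""

    def __init__(self, min_events=3, min_share=0.6):
        # Hypothetical thresholds: require a few observations and a clear
        # favourite before suggesting anything.
        self.min_events = min_events
        self.min_share = min_share
        self.history = defaultdict(Counter)
        self.opted_in = set()

    def opt_in(self, user):
        """The user agreed to let the app learn from their behaviour."""
        self.opted_in.add(user)

    def record(self, user, item):
        """Remember a pick, but only for users who opted in."""
        if user in self.opted_in:
            self.history[user][item] += 1

    def suggest(self, user):
        """Return the user's usual item, or None if there is no clear one."""
        counts = self.history.get(user)
        if not counts:
            return None
        total = sum(counts.values())
        item, hits = counts.most_common(1)[0]
        if total >= self.min_events and hits / total >= self.min_share:
            return item
        return None


tracker = UsualChoice()
tracker.opt_in("ann")
for pick in ["latte", "latte", "mocha", "latte"]:
    tracker.record("ann", pick)
tracker.record("bob", "espresso")  # bob never opted in, so nothing is stored
```

The suggestion is meant to be confirmed by the user, not applied silently, which keeps the behaviour close to the shop owner asking before pouring.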

Reduce your support requests

Say you have a great product on your hands and it’s getting more and more traction. If you have experienced this situation, one thing you won’t have missed is the number of support requests sky-rocketing in parallel with the hotness of your product. Given that startups have limited manpower, human work hours should be spent on more rewarding tasks such as adding new features, targeting new sales channels and fixing bugs, instead of answering repetitive support issues that can easily be avoided. One remedy would be to evaluate the usability of your application, which is certainly prudent, but it won’t hurt to make your application a bit more intelligent so that it can identify user troubles.

If you noticed a stranger wandering around your neighbourhood looking lost, wouldn’t you offer your help? Similarly, your app can be kind enough to identify a new user wandering around and offer help. Not only will you be saving your team’s hours, but you will also be saving the user’s precious time and, as an added bonus, impressing the user even more with your product.

Make the search intelligent

Search is a window to your application data, and improving the quality of search will directly influence the user experience. Rather than making users guess which keywords their target content is indexed under, what if your application were good at identifying the intent behind the search? That would certainly be the icing on top of your search functionality. To make it even better, mix in some fuzziness to auto-correct a search term when there’s an obvious error. Of course, all this is easier said than done, and not every company is a Google. But you can take the initiative by analysing search terms to identify weak spots, addressing those first and moving forward as a minor experimental optimisation process.

A good application-wide search will greatly help answer most questions your users may have. From my experience, most support questions are recurring in nature, so if users can easily find answers in your community forum or support articles, it will help lighten your support load.
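As a rough sketch of the auto-correct idea, Python’s standard difflib module can match a mistyped term against the keywords you already index. The vocabulary and the 0.8 cutoff below are made-up assumptions for illustration; a production search would more likely lean on its search engine’s own fuzzy matching.

```python
from difflib import get_close_matches

# Hypothetical set of keywords the application has indexed.
INDEXED_TERMS = ["invoice", "billing", "export", "password", "report"]


def correct_term(term, vocabulary=INDEXED_TERMS, cutoff=0.8):
    """Return the closest indexed keyword, or the term itself if none is close."""
    term = term.lower()
    if term in vocabulary:
        return term
    matches = get_close_matches(term, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else term


def correct_query(query):
    """Correct each word of a search query independently."""
    return " ".join(correct_term(word) for word in query.split())
```

Terms that find no close match are returned unchanged; logging those is one cheap way to spot the weak areas of your search.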

Finding more details about users (enrichment)

Well, this is more of a grey area. Gossiping about others’ juicy secrets is usually frowned upon, but a little awareness of what’s going on around you can be useful and even healthy. Most large companies are already digging into your day-to-day buying patterns to target you better, but how deep you dig into user information (or whether you abstain from it) is certainly up to you. One way to look at this is how you treat advertisements: as long as they are relevant and useful in achieving your goal, you won’t mind them. But the second they fall below your requirements and become nagging, they are a nuisance and get called spam. Likewise, if you can give users a coupon they can’t ignore, it’s likely they won’t mind, and you can comfort your conscience by thinking you are doing a service rather than snooping around.

Gauging user reaction to new features

It’s a well-accepted practice for applications to use a simple voting system to collect the new features their users want. Usually, application admins put up a set of features they feel are important and users vote on them. By taking this approach one step further and factoring each new feature into a set of attributes, you can learn what users really need and, at the same time, learn more about your users. Of course, collaborative filtering (CF) is not a new technology: all social networking services and e-commerce giants use it to rate new items and learn the preferences of new users. But even without being a massive social networking service, you can still use CF to get to know the attributes of your user base, such as technical savviness, appetite for automation, etc.
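To make the CF idea a little more concrete, here is a minimal user-based sketch in plain Python. It is a simplification of what is described above: it works directly on up/down votes rather than on feature attributes, and the users, feature names and vote values are invented for the example.

```python
from math import sqrt

# Hypothetical votes: +1 means the user wants the feature, -1 means they don't.
VOTES = {
    "ann": {"api": 1, "webhooks": 1, "dark_mode": -1},
    "bob": {"api": 1, "webhooks": 1, "dark_mode": -1, "cli": 1},
    "cat": {"api": -1, "dark_mode": 1, "cli": -1},
}


def cosine(a, b):
    """Cosine similarity between two users' vote dictionaries."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[f] * b[f] for f in shared)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)


def predict(votes, user, feature):
    """Estimate a user's vote on a feature from similarity-weighted votes."""
    weighted_sum = 0.0
    total_weight = 0.0
    for other, their_votes in votes.items():
        if other == user or feature not in their_votes:
            continue
        sim = cosine(votes[user], their_votes)
        weighted_sum += sim * their_votes[feature]
        total_weight += abs(sim)
    return weighted_sum / total_weight if total_weight else 0.0


score = predict(VOTES, "ann", "cli")
```

Here ann votes like bob and opposite to cat, so her predicted score for the “cli” feature comes out positive even though she never voted on it. With real data you would want far more votes per user before trusting such a score.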

Conclusion

This only sums up some of the more straightforward scenarios where machine learning techniques can play a part in improving SaaS applications. It certainly is an exciting field that I’m trying to get a grasp on as a side interest, and I’m hoping to carry out experiments to learn the applicability of various ideas. It would be exciting to hear more ideas and how well they have worked out for you, so please feel free to share them here.

Some pointers to get started/keep an eye on:

Avoiding AWS potholes

We are in the era of clouds, and at the moment AWS is the Zeus among public clouds. With its scalable and flexible architecture, cheap rates, secure PCI-compliant environment, wide array of loosely coupled services and a boast of 99.95% availability, it may deserve the crown. However, it is not without holes, and a few days ago I got the chance to taste that firsthand. This post is about a few measures that you should (and I mean this with a capital SHOULD) take before moving your production servers to AWS.

To start with, I had been using Slicehost and Linode as VPS providers for a couple of years while tinkering with AWS. After a trial run of a few months I was satisfied that everything was working as it should and moved to AWS for real. But the mistake I made, and something AWS didn’t bother to mention anywhere easily findable, was not coupling Elastic Block Store (EBS) volumes with the instance stores. This is easy to overlook when you are coming from a regular VPS provider, because the ephemeral instance store is the closest counterpart to a slice and you may expect the same behaviour throughout.

So, back to the story: everything was running fine until AWS scheduled a maintenance reboot of the instance two weeks ago. Nothing much to worry about, right? But it turned out that the instance didn’t reboot, and there was very little I could do from the AWS web console. Unlike regular VPS slices, AWS doesn’t come with a back-door SSH console, and it turns out even the staff can do very little about an instance store. The only solution they could give me was to reboot the instance a few times, and if that doesn’t work out… well, they are sorry and it’s a lost cause.

I mentioned earlier the mistake I made. But what I got right was to have several layers of backups, including database replication slaves. So backups were running pretty much as expected and there wasn’t any lasting damage done. It’s only when you are in trouble that you are glad of the time spent on emergency procedures.

So the rest of the story is short. I removed the crashed instance, started a new one from the custom AMI we had and copied the data over from the DB slaves. But this scenario could have gone vastly wrong if there hadn’t been a redundancy setup, and for some unfortunate bootstrapping startup it could have burnt all their hard work to a crisp.

I know servers should be up and running, and having them down is not heroic. But there are a few things you should have in place before moving your production servers to AWS.

  1. Have a proper backup procedure in place. It’s better if the replication slaves are with another server vendor or in another AWS region, with a monitor set up to make sure the replication process is working properly. It’s also better to have several layers of backups running, so that you have a point-in-time recoverable database copy as well as day-old, week-old, month-old, etc. data copies for the worst-case scenarios.
  2. Use Elastic Block Store (EBS) volumes. They are the external USB drives of AWS. Couple one or more EBS volumes with your instance store and use them to store any data you think is valuable. If your instance dies, you can just detach the volume, reattach it to a fresh instance and run without a hitch.
  3. Have a custom bare-bones AMI with just the OS and maybe a couple of basic services. Also have an AMI with a fully ready-to-launch setup. This way you can bring up another production-ready instance in minimal time, and you still have an option in the worst case where the fully ready-made AMI doesn’t work. Finally, test all your AMIs to make sure that they are working properly.
  4. Take snapshots of your EBS volumes at scheduled intervals.
  5. Use these not-so-easy-to-find AWS architectural guidelines in designing your platform.
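The layered backups from point 1 can be checked with a small script. The sketch below is only an illustration: the layer names and ages (1, 7 and 30 days) are my reading of “day-old, week-old, month-old”, and the list of backup dates would in practice come from wherever your backups or snapshots are catalogued.

```python
from datetime import date, timedelta

# Hypothetical layers: name -> minimum age in days of the copy we want on hand.
LAYERS = {"daily": 1, "weekly": 7, "monthly": 30}


def pick_layers(backup_dates, today):
    """For each layer, pick the newest backup that is at least that old."""
    chosen = {}
    for name, min_age in LAYERS.items():
        cutoff = today - timedelta(days=min_age)
        old_enough = [d for d in backup_dates if d <= cutoff]
        chosen[name] = max(old_enough) if old_enough else None
    return chosen


def missing_layers(backup_dates, today):
    """List the layers for which no suitable backup exists yet."""
    chosen = pick_layers(backup_dates, today)
    return [name for name, picked in chosen.items() if picked is None]


# Example: nightly backups exist only for the last ten days.
today = date(2024, 3, 31)
backups = [today - timedelta(days=i) for i in range(10)]
layers = pick_layers(backups, today)
gaps = missing_layers(backups, today)
```

With only ten days of nightly backups, the month-old layer is reported as missing, which is the kind of gap a monitor should flag before you actually need that copy.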

So, as I mentioned, it’s not about heroics but about making sure your service doesn’t get reduced to ashes because of some stupid server glitch. As someone wise noted, it’s better to be ready than sorry!

Update:

There is another set of sound suggestions made in comment #4 by kordless for any cloud deployment. If you are into heavy scaling they may be particularly useful.