Kaggle Quora duplicate question analysis – a very naive approach with a Decision Tree

The aim of this Kaggle competition is to predict whether the question pairs in the data set, obtained from Quora, have the same meaning. There are currently many approaches in the Kaggle Kernels section, each with its own merits and drawbacks. In this analysis I hope to experiment with the most popular methods, as described in this post by shubh24. It would be interesting to gain a better understanding of how natural language can be processed in machine learning tasks. Let’s look at the training data frame. It’s pretty simple.


The important features here are the question pair (question1, question2) and their label. There are 2 null values in the question2 column, and both of their labels are 0. We can therefore assume that if a null value exists, the pair is automatically classified as different. With that noted, the instances containing null values were removed, as they would not affect data exploration or model creation. A simple value_counts reveals how many pairs are (and are not) duplicates.
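A minimal sketch of this cleaning step on a toy frame (the real data is read from the competition's train.csv; the column names below match it, but the rows are made up):

```python
import pandas as pd

# Toy stand-in for the Kaggle train.csv (column names match the real file)
train = pd.DataFrame({
    'question1': ['How do I learn Python?', 'What is AI?', 'Why is the sky blue?'],
    'question2': ['How can I learn Python?', None, 'Why does the sky look blue?'],
    'is_duplicate': [1, 0, 1],
})

# Drop pairs with a missing question; in the real data both such rows
# are labelled 0 anyway
train = train.dropna(subset=['question1', 'question2'])
counts = train['is_duplicate'].value_counts()
```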


It turns out there are 149263 duplicate pairs and 255025 non-duplicate pairs. This gives us a baseline accuracy of 63.1% (by always predicting “not duplicate”).
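That baseline is just the majority-class share, which can be checked directly from the two counts above:

```python
# Always predicting "not duplicate" matches the majority class
non_dup, dup = 255025, 149263
baseline = non_dup / (non_dup + dup)
print(round(baseline, 3))  # 0.631
```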

Intuitively, the longer the questions are, the more variation they might contain, so there could be a higher chance of the questions being different. It is also assumed that when people ask questions, they use similar structures and wordings, so it would be uncommon for the same question to be asked with a big difference in length. I have therefore counted the length (number of words separated by blank spaces) of each question and obtained the pair’s average and difference.

train['q1_len'] = train['question1'].apply(lambda x:len(x.strip().split()))
train['q2_len'] = train['question2'].apply(lambda x:len(x.strip().split()))
train['len_diff'] = abs(train['q1_len'] - train['q2_len'])
train['len_avg'] = (train['q1_len'] + train['q2_len'])/2



Personally I think stripping is a good practice (sorry for the bad pun), as it helps minimise nasty surprises by getting rid of the blank spaces at the beginning/end of a string. The results were plotted.
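As a quick illustration of the counting (note that str.split() with no argument already ignores leading and trailing whitespace, so the strip() here is belt and braces):

```python
s = '  What is machine learning?  '
words = s.strip().split()
print(len(words))  # 4
```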


I must admit that these are not the full plots, as there are outliers in the data that distort the whole picture (the invisible long tail in a histogram!). Hence the plots here show the distributions without the outliers. We can see that the shapes of the different classes in the plots are similar. To check whether the distributions nevertheless differ, t-tests were run.

from scipy.stats import ttest_ind

print(ttest_ind(same['len_diff'], different['len_diff'], equal_var=False))
print(ttest_ind(same['len_avg'], different['len_avg'], equal_var=False))
print(ttest_ind(same[same['len_diff']<60]['len_diff'], different[different['len_diff']<60]['len_diff'], equal_var=False))
print(ttest_ind(same[same['len_avg']<60]['len_avg'], different[different['len_avg']<60]['len_avg'], equal_var=False))
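For readers unfamiliar with ttest_ind, here is a self-contained Welch's t-test on synthetic data; the means, spreads and sample sizes are illustrative, not the competition's:

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
a = rng.normal(loc=1.0, scale=2.0, size=1000)  # e.g. len_diff for duplicates
b = rng.normal(loc=1.5, scale=2.5, size=1000)  # e.g. len_diff for non-duplicates

# equal_var=False gives Welch's t-test, which does not assume equal variances
t, p = ttest_ind(a, b, equal_var=False)
```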


The results from the t-tests suggest that there is a statistically significant difference (in terms of ‘len_diff’ and ‘len_avg’) between identical and different question pairs. We can now naively use those features to generate a model. It is debatable whether t-tests are appropriate here, because the distributions are not normal. Still, we can fit a model with only ‘len_diff’ and ‘len_avg’ and see how it performs!

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, StratifiedKFold, train_test_split
from sklearn.metrics import accuracy_score, log_loss

X = train[['len_diff', 'len_avg']]
y = train['is_duplicate']
sk = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
clf = LogisticRegression()
model = clf.fit(X_train, y_train)
pred = clf.predict(X_test)
ls = log_loss(y_test, pred)
a_s = accuracy_score(y_test, pred)
score = cross_val_score(clf, X, y, cv=sk)

The above code shows how a Logistic Regression model was trained. A Decision Tree was trained in a similar way and the results are shown below.
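The Decision Tree variant differs only in the classifier; a minimal sketch on synthetic two-column features standing in for ‘len_diff’ and ‘len_avg’ (the toy labelling rule is just so the tree has something to learn):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Synthetic stand-in for the two length features
rng = np.random.default_rng(42)
X = rng.integers(0, 30, size=(1000, 2)).astype(float)
y = (X[:, 0] < 5).astype(int)  # toy rule the tree can recover

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)
acc = accuracy_score(y_test, clf.predict(X_test))
```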


It is obvious that these models are no better than random guessing. We have to do better! One of the simplest approaches mentioned in shubh24’s post was suggested by Amandeep Rathee. In his kernel, Amandeep suggested that instead of question length, the similarity of the questions could be determined by counting the common words present in both questions. He further pointed out that rather than the raw count, the important indicator is the ratio between the number of identical words and the average number of words within the pair. Amandeep’s work was carried out in R, and in the following section I will try to reproduce it in Python. First we have to lowercase all the questions to eliminate differences caused by case.

# str.lower() replaces the Python 2-only string.lower()
lowered_q1 = train['question1'].apply(lambda x: x.lower())
lowered_q2 = train['question2'].apply(lambda x: x.lower())

Then we have to remove all the punctuation and formatting inside the sentences/questions. This is done by replacing them with an empty string (''). The sentence is then broken up into a list of words; this process is called tokenising. In Python it can be done using many different methods. In this case I have replaced the apostrophes with empty strings before using a Regular Expression Tokeniser (found in the very helpful nltk module). I have done the replacement before tokenising because I do not want some words to be broken up (e.g. ‘can’t’ -> ‘can’, ‘t’).

from nltk import RegexpTokenizer

lowered_q1 = lowered_q1.apply(lambda x:x.replace("'",''))
lowered_q2 = lowered_q2.apply(lambda x:x.replace("'",''))

tokenizer = RegexpTokenizer(r'\w+')
tokenised_q1 = lowered_q1.apply(lambda x: tokenizer.tokenize(x))
tokenised_q2 = lowered_q2.apply(lambda x: tokenizer.tokenize(x))
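To see what a \w+ tokeniser produces, note that RegexpTokenizer(r'\w+') behaves like the standard library's re.findall(r'\w+', ...) on input like this:

```python
import re

sentence = "What can't be tokenised?".lower().replace("'", '')
tokens = re.findall(r'\w+', sentence)
print(tokens)  # ['what', 'cant', 'be', 'tokenised']
```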



This is what the result looks like. Next we count the common words between the tokenised question pairs. To slightly speed up the comparison, I turn the lists into sets, which also eliminates duplicate words. It is very common for longer questions to share more words (especially since we have not removed stop words here!), so the count is then divided by the average length of the question pair. This gives us the ratio of identical words in a question pair.

def count_common_words(lst_1, lst_2):
    same_word_lst = []
    for q1, q2 in zip(lst_1, lst_2):
        s_1 = set(q1)
        s_2 = set(q2)
        # number of distinct words appearing in both questions
        same_word_lst.append(len(s_1 & s_2))
    return same_word_lst

same_word_count_lst = count_common_words(tokenised_q1,tokenised_q2)
train['same_word_count'] = same_word_count_lst
train['same_word_ratio'] = train['same_word_count']/train['len_avg']
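A worked example of the ratio on a single, hypothetical pair:

```python
q1 = ['how', 'do', 'i', 'learn', 'python']
q2 = ['how', 'can', 'i', 'learn', 'python', 'fast']

common = len(set(q1) & set(q2))             # {'how', 'i', 'learn', 'python'} -> 4
ratio = common / ((len(q1) + len(q2)) / 2)  # 4 / 5.5 ≈ 0.727
```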

Sorry about the indentation, as I cannot get it to work with WordPress! Having created more features, surely more plots would follow.


The distributions of the newly calculated features are shown above. Duplicate word ratio seems to be a fairly indicative feature, since both the means and medians of the two classes are quite different. T-tests were not performed due to the concern mentioned earlier. Combining these with the ‘length’ features created earlier, more models were fitted.

Logistic Regression log loss: 12.38
Logistic Regression accuracy: 0.642
Logistic Regression mean CV accuracy: 0.643
Logistic Regression CV standard deviation: 0.002
Decision Tree log loss: 10.57
Decision Tree accuracy: 0.694
Decision Tree mean CV accuracy: 0.694
Decision Tree CV standard deviation: 0.003

In order to maintain a fair comparison, the same models were used to evaluate these features. It can be seen that the performance of both Logistic Regression and Decision Tree has improved significantly. Log loss (the target metric in this competition) has dropped from 12.38 to 10.57, roughly a 15% improvement. The low cross-validation standard deviations also suggest that the models are relatively stable.

Apart from Logistic Regression and Decision Tree, a Random Forest model was also fitted. It is interesting to observe that the Random Forest performs similarly to the Decision Tree. This is very likely caused by the small number of features (4) available. Part of the ‘randomness’ of a Random Forest comes from the random selection of feature subsets at each split. With such a limited choice the forest cannot grow properly: by default each split can only pick from sqrt(4) = 2 features, which is not helpful.
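One remedy worth noting: max_features=None lets every split consider all available features, removing the sqrt(4) = 2 handicap described above. A sketch on synthetic data (the data and parameters here are illustrative):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# max_features=None: each split sees all 4 features instead of sqrt(4) = 2
rf = RandomForestClassifier(n_estimators=50, max_features=None, random_state=0)
rf.fit(X, y)
acc = rf.score(X, y)  # training accuracy, just to show the model fits
```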

I think this is a good point to call this a baseline model. Despite the relatively poor performance (a log loss smaller than 1 is commonly achieved, as the leaderboard demonstrates), it is a good indicator of what this naive approach can achieve, and hence something to improve upon.

Since the Decision Tree seems to perform fairly well in this instance, a grid search was used as an attempt to further optimise the model. The parameter combinations are shown in the code. It did not yield a better performance; nevertheless, these parameters were still used in the final prediction on the test set.

from sklearn.model_selection import GridSearchCV

# The full grid was truncated in the original post; values other than
# the reported best parameters below are illustrative
params = {'criterion': ['gini', 'entropy'],
          'min_samples_split': [2, 4, 6],
          'min_samples_leaf': [1, 3, 5],
          'class_weight': [None, 'balanced']}
gscv = GridSearchCV(clf, params, cv=sk, n_jobs=-1)
model = gscv.fit(X, y)
print(model.best_score_)
print(model.best_params_)
best score: 0.694
best parameters:
- min_samples_split: 6 
- min_samples_leaf: 3
- criterion: 'entropy'
- class_weight: None

Out of my own interest (because this is not quite related to the competition), a ROC curve and a classification report were generated using the training set. Looking at the True Positive Rate (a.k.a. recall) achieved, the corresponding False Positive Rate is between 0.2 and 0.3. If this competition were evaluated by accuracy, I could probably shift the decision threshold to reduce the False Positive Rate and increase performance, because the data is slightly imbalanced with more class 0 (non-duplicate) pairs. However, the chosen metric is log loss, so oh well… it is still something nice to know.
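The curve itself comes from sklearn's roc_curve applied to predicted probabilities; a sketch on synthetic data (the features and labels below are made up):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 2))
y = (X[:, 0] + rng.normal(scale=0.5, size=300) > 0).astype(int)

clf = LogisticRegression().fit(X, y)
# roc_curve needs scores/probabilities, not hard labels
fpr, tpr, thresholds = roc_curve(y, clf.predict_proba(X)[:, 1])
```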


At the end, predictions for the test set were made using the 4 created features with a Decision Tree model. To my (huge) surprise it does an OK job on the public leaderboard. The resulting log loss is 0.53435, which is very different from our test (validation) set result; we are talking about a difference of two orders of magnitude, so the reason must be understood! However, I will leave this to the next post. This marks the end of my first attempt at cracking the Quora challenge; I hope you liked it. Happy coding!

2/4/2015 update: The big difference in log loss was caused by my mistake of using “predict” instead of “predict_proba” as the input to log loss. With the correct input, the model evaluation changes completely. It turns out that Gradient Boosting has a lower log loss compared to the models experimented with earlier in this post. By applying Gradient Boosting to our data, a log loss of 0.47997 on the public leaderboard is achieved!
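The mechanics of the mistake are easy to reproduce with a hand-rolled log loss: hard 0/1 labels act like near-certain probabilities after clipping, so every misclassified row contributes an enormous penalty (the numbers below are illustrative, not the competition's):

```python
import numpy as np

def logloss(y, p, eps=1e-15):
    # clip so log(0) never occurs
    p = np.clip(np.asarray(p, dtype=float), eps, 1 - eps)
    y = np.asarray(y, dtype=float)
    return float(-(y * np.log(p) + (1 - y) * np.log(1 - p)).mean())

y = [0, 1, 1, 0]
hard = logloss(y, [0, 1, 0, 0])          # predict(): one confident miss blows up the loss
soft = logloss(y, [0.1, 0.8, 0.4, 0.2])  # predict_proba(): graded penalties
```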

