<h1>Julia Project - Song Lyric Text Classification by Artist</h1>
<p><em>Getting to Know Julia, by Nigel Adams, 2019-09-12</em></p>
<p>I had an idea for a work-related project I’d like to do some day, given the opportunity. The objective would be to build a machine learning model that can classify notes or documents for compliance purposes. To get such a project off the ground, sufficient labelled training data would be needed. We won’t have the luxury of a labelled set like the famous IMDb dataset, which contains 50,000 labelled movie reviews; the data we could obtain would likely be fewer than 1,000 rows (at least to start with).</p>
<p>So I went searching for labelled text datasets and somehow ended up with this <a href="https://www.kaggle.com/mousehead/songlyrics">Song Lyric dataset from Kaggle</a>. I thought it would be a fun challenge to pick 5 popular artists with large back catalogues and try to build a model that could predict which artist sang a song, using test data unseen by the training step. The filtered dataset used for training is fewer than 800 rows, making it roughly comparable to the work-related project I had in mind.</p>
<p>The task of predicting the artist isn’t quite as straightforward as you might first think; each artist will likely have songs in different genres (e.g. upbeat, downbeat and ballads). The songs may have been written by different band members and will also vary in length. Simply guessing the artist gives a 1 in 5 chance of being correct (a probability of 0.2), so that is our baseline.</p>
<p>This project brings together all my recent Julia blog post learnings on NLP, Flux, Neural Networks and Convolutional Neural Networks (i.e. CNNs or ConvNets). An added challenge was the lack of similar examples on the web for Word Embeddings with Flux or Word Embeddings with Flux CNNs. I’m quite proud I got these working without having to copy anyone else’s work. The code may not be super-pretty, but it works!</p>
<p>Let’s get started loading the libraries we need.</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">using</span> <span class="n">CSV</span><span class="x">,</span> <span class="n">DataFrames</span><span class="x">,</span> <span class="n">Random</span><span class="x">,</span> <span class="n">TextAnalysis</span><span class="x">,</span> <span class="n">Languages</span><span class="x">,</span> <span class="n">Statistics</span><span class="x">,</span> <span class="n">PyPlot</span><span class="x">,</span> <span class="n">Flux</span><span class="x">,</span> <span class="n">BSON</span>
<span class="c">#Display Flux Version</span>
<span class="k">import</span> <span class="n">Pkg</span> <span class="x">;</span> <span class="n">Pkg</span><span class="o">.</span><span class="n">installed</span><span class="x">()[</span><span class="s">"Flux"</span><span class="x">]</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>loaded
v"0.7.2"
</code></pre></div></div>
<h2 id="loading-and-initial-data-preparation">Loading and Initial Data Preparation</h2>
<p>Load the data from the CSV file we downloaded from Kaggle and show a count of all songs.</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df_all</span><span class="o">=</span><span class="n">CSV</span><span class="o">.</span><span class="n">read</span><span class="x">(</span><span class="s">"/mnt/juliabox/NLP/songdata.csv"</span><span class="x">)</span>
<span class="n">categorical!</span><span class="x">(</span><span class="n">df_all</span><span class="x">,</span> <span class="o">:</span><span class="n">artist</span><span class="x">)</span>
<span class="n">show</span><span class="x">(</span><span class="n">by</span><span class="x">(</span><span class="n">df_all</span><span class="x">,</span> <span class="o">:</span><span class="n">artist</span><span class="x">,</span> <span class="n">nrow</span><span class="x">))</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>643×2 DataFrame
│ Row │ artist │ x1 │
├─────┼───────────────┼───────┤
│ 1 │ 'n Sync │ 93 │
│ 2 │ ABBA │ 113 │
│ 3 │ Ace Of Base │ 74 │
│ 4 │ Adam Sandler │ 70 │
│ 5 │ Adele │ 54 │
│ 6 │ Aerosmith │ 171 │
│ 7 │ Air Supply │ 174 │
⋮
│ 636 │ Zeromancer │ 30 │
│ 637 │ Ziggy Marley │ 64 │
│ 638 │ Zoe │ 1 │
│ 639 │ Zoegirl │ 38 │
│ 640 │ Zornik │ 12 │
│ 641 │ Zox │ 21 │
│ 642 │ Zucchero │ 30 │
│ 643 │ Zwan │ 14 │
</code></pre></div></div>
<p>This is a great dataset, but we need to make a new dataframe containing just the song lyrics of the selected artists, labelled by artist. The data is randomly shuffled using a known ‘seed’ so we can replicate the same order each time the notebook is run. The first row of data is output.</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">artists</span><span class="o">=</span><span class="x">[</span><span class="s">"Queen"</span><span class="x">,</span> <span class="s">"The Beatles"</span><span class="x">,</span> <span class="s">"Michael Jackson"</span><span class="x">,</span> <span class="s">"Eminem"</span><span class="x">,</span> <span class="s">"INXS"</span><span class="x">]</span>
<span class="n">df</span><span class="o">=</span><span class="n">df_all</span><span class="x">[[</span><span class="n">x</span> <span class="k">in</span> <span class="n">artists</span> <span class="k">for</span> <span class="n">x</span> <span class="k">in</span> <span class="n">df_all</span><span class="x">[</span><span class="o">:</span><span class="n">artist</span><span class="x">]],</span><span class="o">:</span><span class="x">]</span>
<span class="n">df_all</span><span class="o">=</span><span class="nb">nothing</span>
<span class="n">Random</span><span class="o">.</span><span class="n">seed!</span><span class="x">(</span><span class="mi">1000</span><span class="x">);</span>
<span class="n">df</span><span class="o">=</span><span class="n">df</span><span class="x">[</span><span class="n">shuffle</span><span class="x">(</span><span class="mi">1</span><span class="o">:</span><span class="n">size</span><span class="x">(</span><span class="n">df</span><span class="x">,</span> <span class="mi">1</span><span class="x">)),</span><span class="o">:</span><span class="x">]</span>
<span class="n">df</span><span class="x">[</span><span class="mi">1</span><span class="x">,</span><span class="o">:</span><span class="x">]</span>
</code></pre></div></div>
<p><img src="https://spcman.github.io/getting-to-know-julia/images/proj004/dataframe_songs.png" alt="song lyrics dataframe" /></p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">size</span><span class="x">(</span><span class="n">df</span><span class="x">,</span><span class="mi">1</span><span class="x">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>727
</code></pre></div></div>
<p>The dataset is only 727 rows; such a shortage of examples means this is likely to be a hard task!</p>
<h2 id="preprocessing---clean-up">Preprocessing - clean-up</h2>
<p>The next block of code uses the TextAnalysis library to create a corpus of our song lyrics and cleans it for the next step.</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">docs</span><span class="o">=</span><span class="kt">Any</span><span class="x">[]</span>
<span class="k">for</span> <span class="n">i</span> <span class="k">in</span> <span class="mi">1</span><span class="o">:</span><span class="n">size</span><span class="x">(</span><span class="n">df</span><span class="x">,</span><span class="mi">1</span><span class="x">)</span>
<span class="n">txt</span><span class="o">=</span><span class="n">replace</span><span class="x">(</span><span class="n">df</span><span class="x">[</span><span class="n">i</span><span class="x">,</span><span class="o">:</span><span class="x">]</span><span class="o">.</span><span class="n">text</span><span class="x">,</span> <span class="s">"</span><span class="se">\n</span><span class="s">"</span> <span class="o">=></span> <span class="s">" "</span><span class="x">)</span> <span class="c"># flatten line breaks</span>
<span class="n">txt</span><span class="o">=</span><span class="n">replace</span><span class="x">(</span><span class="n">txt</span><span class="x">,</span> <span class="s">"'"</span> <span class="o">=></span> <span class="s">""</span><span class="x">)</span> <span class="c"># drop apostrophes, so "it's" becomes "its"</span>
<span class="n">dm</span><span class="o">=</span><span class="n">TextAnalysis</span><span class="o">.</span><span class="n">DocumentMetadata</span><span class="x">(</span><span class="n">Languages</span><span class="o">.</span><span class="n">English</span><span class="x">(),</span> <span class="n">df</span><span class="x">[</span><span class="n">i</span><span class="x">,</span><span class="o">:</span><span class="x">]</span><span class="o">.</span><span class="n">song</span><span class="x">,</span><span class="s">""</span><span class="x">,</span><span class="s">""</span><span class="x">)</span>
<span class="n">doc</span><span class="o">=</span><span class="n">StringDocument</span><span class="x">(</span><span class="n">txt</span><span class="x">,</span> <span class="n">dm</span><span class="x">)</span>
<span class="n">push!</span><span class="x">(</span><span class="n">docs</span><span class="x">,</span> <span class="n">doc</span><span class="x">)</span>
<span class="k">end</span>
<span class="n">crps</span><span class="o">=</span><span class="n">Corpus</span><span class="x">(</span><span class="n">docs</span><span class="x">)</span>
<span class="n">orig_corpus</span><span class="o">=</span><span class="n">deepcopy</span><span class="x">(</span><span class="n">crps</span><span class="x">);</span>
<span class="n">prepare!</span><span class="x">(</span><span class="n">crps</span><span class="x">,</span> <span class="n">strip_non_letters</span> <span class="o">|</span> <span class="n">strip_punctuation</span> <span class="o">|</span> <span class="n">strip_case</span> <span class="o">|</span> <span class="n">strip_stopwords</span> <span class="o">|</span> <span class="n">strip_whitespace</span><span class="x">)</span>
</code></pre></div></div>
<p>Let’s take a look at the first song to see what just took place. In the original corpus the first song by Queen looked like this.</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">orig_corpus</span><span class="x">[</span><span class="mi">1</span><span class="x">]</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>StringDocument{String}("Oh my love weve had \nOur share of tears \nOh my friends weve had \nOur hopes and fears \nOh my friend its been \nA long hard year \nBut now its Christmas \nYes its Christmas \nThank God its Christmas \n \nThe moon and stars \nSeem awful cold and bright \nLets hope the snow will \nMake this Christmas right \n \nMy friend the world will share \nThis special night \nBecause its Christmas \nYes its Christmas \nThank God its Christmas \nFor one night \nThank God its Christmas \nYeah thank God its Christmas \nThank God its Christmas \nCan it be Christmas \nLet it be Christmas every day \n \nOh my love we live \nIn troubled days \nOh my friend we have \nThe strangest ways \nOh my friends on this \nOne day of days \nThank God its Christmas \nYes its Christmas \nThank God its Christmas \nFor one day \n \nThank God its Christmas \nYes its Christmas \nThank God its Christmas \nWooh yeah \nThank God its Christmas \nYeah yeah yeah yes its Christmas \nThank God its Christmas \nFor one day yeah - Christmas \n \nA very merry Christmas to you all \n\n", TextAnalysis.DocumentMetadata(Languages.English(), "Thank God It's Christmas", "", ""))
</code></pre></div></div>
<p>After the pre-processing step it looked like this.</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">crps</span><span class="x">[</span><span class="mi">1</span><span class="x">]</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>StringDocument{String}("oh love weve share tears oh friends weve hopes fears oh friend hard christmas christmas thank god christmas moon stars awful cold bright hope snow christmas friend world share special night christmas christmas thank god christmas night thank god christmas yeah thank god christmas thank god christmas christmas christmas day oh love live troubled days oh friend strangest oh friends day days thank god christmas christmas thank god christmas day thank god christmas christmas thank god christmas wooh yeah thank god christmas yeah yeah yeah christmas thank god christmas day yeah christmas merry christmas ", TextAnalysis.DocumentMetadata(Languages.English(), "Thank God It's Christmas", "", ""))
</code></pre></div></div>
<h2 id="preprocesing---prep-for-training">Preprocessing - prep for training</h2>
<p>The update-lexicon commands will quickly count our words and consequently let us look up words to see in which songs they occur.</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">update_lexicon!</span><span class="x">(</span><span class="n">crps</span><span class="x">)</span>
<span class="n">update_inverse_index!</span><span class="x">(</span><span class="n">crps</span><span class="x">)</span>
</code></pre></div></div>
<p>The word “christmas” is located in the song corpus at these indexes:</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">crps</span><span class="x">[</span><span class="s">"christmas"</span><span class="x">]</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>8-element Array{Int64,1}:
1
162
239
328
332
490
606
638
</code></pre></div></div>
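<p>As a complement to the inverse index, the lexicon stores total counts. A quick check (this assumes <code class="language-plaintext highlighter-rouge">lexicon</code> is exported by the TextAnalysis version used here):</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Total occurrences of "christmas" across the whole corpus,
# as opposed to the per-song locations from the inverse index
lexicon(crps)["christmas"]
</code></pre></div></div>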
<p>The following code builds our word dictionary (<code class="language-plaintext highlighter-rouge">word_dict</code>).</p>
<p>Each word in our song corpus can now be represented by a unique integer.</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">m_dtm</span><span class="o">=</span><span class="n">DocumentTermMatrix</span><span class="x">(</span><span class="n">crps</span><span class="x">)</span>
<span class="n">word_dict</span><span class="o">=</span><span class="n">m_dtm</span><span class="o">.</span><span class="n">column_indices</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Dict{String,Int64} with 8449 entries:
"ont" => 5080
"youd" => 8421
"bsta" => 897
"enjoy" => 2388
"chocolate" => 1226
"fight" => 2675
"null" => 5007
"princess" => 5603
"snuggle" => 6777
"carousels" => 1068
"needin" => 4914
"helping" => 3378
"manufacture" => 4437
"sheezy" => 6462
"sleepless" => 6682
"favor" => 2612
"henry" => 3391
"eddie" => 2303
"aaaah" => 5
"borders" => 779
"tenor" => 7459
"star" => 7001
"prick" => 5594
"worship" => 8340
"itll" => 3775
⋮ => ⋮
</code></pre></div></div>
<p>This function returns the <code class="language-plaintext highlighter-rouge">word_dict</code> index value of the word passed in <code class="language-plaintext highlighter-rouge">s</code>. It returns 0 if the word is not found.</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">tk_idx</span><span class="x">(</span><span class="n">s</span><span class="x">)</span> <span class="o">=</span> <span class="n">haskey</span><span class="x">(</span><span class="n">word_dict</span><span class="x">,</span> <span class="n">s</span><span class="x">)</span> <span class="o">?</span> <span class="n">word_dict</span><span class="x">[</span><span class="n">s</span><span class="x">]</span> <span class="o">:</span> <span class="mi">0</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>tk_idx (generic function with 1 method)
</code></pre></div></div>
<p>Let’s try it out.</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">tk_idx</span><span class="x">(</span><span class="s">"christmas"</span><span class="x">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>1249
</code></pre></div></div>
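<p>And for a word that never appears in the lyrics the fallback kicks in (the token below is just an assumed out-of-vocabulary example):</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code>tk_idx("zzzznotaword")   # not a key in word_dict, so this returns 0
</code></pre></div></div>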
<p>For the training step all the songs need to be the same length of words and the words need converting to numbers. The following function performs this task by padding shorter songs with zeros and truncating longer songs to the size specified.</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">function</span><span class="nf"> pad_corpus</span><span class="x">(</span><span class="n">c</span><span class="x">,</span> <span class="n">size</span><span class="x">)</span>
<span class="n">M</span><span class="o">=</span><span class="x">[]</span>
<span class="k">for</span> <span class="n">doc</span> <span class="k">in</span> <span class="mi">1</span><span class="o">:</span><span class="n">length</span><span class="x">(</span><span class="n">c</span><span class="x">)</span>
<span class="n">tks</span> <span class="o">=</span> <span class="n">tokens</span><span class="x">(</span><span class="n">c</span><span class="x">[</span><span class="n">doc</span><span class="x">])</span>
<span class="k">if</span> <span class="n">length</span><span class="x">(</span><span class="n">tks</span><span class="x">)</span><span class="o">>=</span><span class="n">size</span>
<span class="n">tk_indexes</span><span class="o">=</span><span class="x">[</span><span class="n">tk_idx</span><span class="x">(</span><span class="n">w</span><span class="x">)</span> <span class="k">for</span> <span class="n">w</span> <span class="k">in</span> <span class="n">tks</span><span class="x">[</span><span class="mi">1</span><span class="o">:</span><span class="n">size</span><span class="x">]]</span>
<span class="k">end</span>
<span class="k">if</span> <span class="n">length</span><span class="x">(</span><span class="n">tks</span><span class="x">)</span><span class="o"><</span><span class="n">size</span>
<span class="n">tk_indexes</span><span class="o">=</span><span class="n">zeros</span><span class="x">(</span><span class="kt">Int64</span><span class="x">,</span><span class="n">size</span><span class="o">-</span><span class="n">length</span><span class="x">(</span><span class="n">tks</span><span class="x">))</span>
<span class="n">tk_indexes</span><span class="o">=</span><span class="n">vcat</span><span class="x">(</span><span class="n">tk_indexes</span><span class="x">,</span> <span class="x">[</span><span class="n">tk_idx</span><span class="x">(</span><span class="n">w</span><span class="x">)</span> <span class="k">for</span> <span class="n">w</span> <span class="k">in</span> <span class="n">tks</span><span class="x">])</span>
<span class="k">end</span>
<span class="n">doc</span><span class="o">==</span><span class="mi">1</span> <span class="o">?</span> <span class="n">M</span><span class="o">=</span><span class="n">tk_indexes</span><span class="err">'</span> <span class="o">:</span> <span class="n">M</span><span class="o">=</span><span class="n">vcat</span><span class="x">(</span><span class="n">M</span><span class="x">,</span> <span class="n">tk_indexes</span><span class="err">'</span><span class="x">)</span>
<span class="k">end</span>
<span class="k">return</span> <span class="n">M</span>
<span class="k">end</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pad_corpus (generic function with 1 method)
</code></pre></div></div>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">num_terms_in_songs</span><span class="o">=</span><span class="x">[</span><span class="n">length</span><span class="x">(</span><span class="n">tokens</span><span class="x">(</span><span class="n">crps</span><span class="x">[</span><span class="n">i</span><span class="x">]))</span> <span class="k">for</span> <span class="n">i</span> <span class="k">in</span> <span class="mi">1</span><span class="o">:</span><span class="n">length</span><span class="x">(</span><span class="n">crps</span><span class="x">)]</span>
<span class="n">println</span><span class="x">(</span><span class="s">"min </span><span class="si">$</span><span class="s">(minimum(num_terms_in_songs)) max </span><span class="si">$</span><span class="s">(maximum(num_terms_in_songs)) mean </span><span class="si">$</span><span class="s">(mean(num_terms_in_songs))"</span><span class="x">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>min 19 max 400 mean 99.43053645116919
</code></pre></div></div>
<p>We can see that the mean is around 100 words, however, I found (when hyperparameter tuning) that a higher number improved accuracy. We will set <code class="language-plaintext highlighter-rouge">doc_pad_size</code> to 200.</p>
<p><code class="language-plaintext highlighter-rouge">X</code> becomes our training data which is now in a format suitable for input into a neural network model.</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">doc_pad_size</span><span class="o">=</span><span class="mi">200</span>
<span class="n">padded_docs</span> <span class="o">=</span> <span class="n">pad_corpus</span><span class="x">(</span><span class="n">crps</span><span class="x">,</span> <span class="n">doc_pad_size</span><span class="x">)</span>
<span class="n">X</span> <span class="o">=</span> <span class="n">padded_docs</span><span class="err">'</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>200×727 LinearAlgebra.Adjoint{Int64,Array{Int64,2}}:
0 0 0 0 0 0 … 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 … 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 … 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
⋮ ⋮ ⋱ ⋮
8398 7460 1684 5002 7490 4321 3472 4321 7863 3667 3456 3269
8398 3456 5061 1144 409 2632 1180 409 6799 6839 3244 4677
8398 7580 8423 833 3623 4408 … 7733 3021 3168 3623 3456 4326
1249 6817 3028 915 1093 4321 2524 4321 1803 3071 3244 1564
7482 6372 4220 6472 7562 2632 4448 409 5576 3667 3456 4186
3064 2122 3623 4968 7368 4408 6630 3021 6921 3667 3456 7618
1249 1684 8309 4968 4189 4321 3676 5083 7589 3623 3456 1684
1801 8398 3575 8092 3614 2632 … 5448 4321 3177 3071 3456 3699
8398 8398 8423 4859 2823 4408 3377 409 7631 3667 3456 2182
1249 8398 7589 7057 582 4321 7652 409 3991 1092 3244 3211
4579 8398 7589 3956 3338 8398 4562 5083 7636 7490 3456 3754
1249 8398 7589 1243 2823 2632 7458 3021 622 411 3244 4368
</code></pre></div></div>
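<p>As a sanity check that the indexing and padding worked, we can invert <code class="language-plaintext highlighter-rouge">word_dict</code> and decode a column of <code class="language-plaintext highlighter-rouge">X</code> back into words. This is a quick sketch; <code class="language-plaintext highlighter-rouge">rev_dict</code> and <code class="language-plaintext highlighter-rouge">decode</code> are helper names introduced here:</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Reverse lookup: integer index back to its word (0 is our padding value)
rev_dict = Dict(v => k for (k, v) in word_dict)
decode(col) = [rev_dict[i] for i in col if i != 0]
decode(X[:, 1])   # should return the cleaned tokens of the first song
</code></pre></div></div>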
<p>Our data labels <code class="language-plaintext highlighter-rouge">y</code> (i.e. artists) also need processing into a one-hot matrix for classification. First let’s define a dictionary of artists called <code class="language-plaintext highlighter-rouge">artist_dict</code>.</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">artist_dict</span> <span class="o">=</span> <span class="kt">Dict</span><span class="x">()</span>
<span class="k">for</span> <span class="x">(</span><span class="n">n</span><span class="x">,</span> <span class="n">a</span><span class="x">)</span> <span class="k">in</span> <span class="n">enumerate</span><span class="x">(</span><span class="n">unique</span><span class="x">(</span><span class="n">df</span><span class="o">.</span><span class="n">artist</span><span class="x">))</span>
<span class="n">artist_dict</span><span class="x">[</span><span class="s">"</span><span class="si">$</span><span class="s">a"</span><span class="x">]</span> <span class="o">=</span> <span class="n">n</span>
<span class="k">end</span>
<span class="n">artist_dict</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Dict{Any,Any} with 5 entries:
"Queen" => 1
"Eminem" => 5
"The Beatles" => 3
"Michael Jackson" => 4
"INXS" => 2
</code></pre></div></div>
<p>We’ll now use Flux’s <code class="language-plaintext highlighter-rouge">onehotbatch</code> to make the required transformation for this classification problem.</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">artist_indexes</span><span class="o">=</span><span class="x">[</span><span class="n">artist_dict</span><span class="x">[</span><span class="n">df</span><span class="x">[</span><span class="o">:</span><span class="n">artist</span><span class="x">][</span><span class="n">i</span><span class="x">]]</span> <span class="k">for</span> <span class="n">i</span> <span class="k">in</span> <span class="mi">1</span><span class="o">:</span><span class="n">size</span><span class="x">(</span><span class="n">df</span><span class="x">,</span><span class="mi">1</span><span class="x">)]</span>
<span class="n">y</span> <span class="o">=</span> <span class="n">Flux</span><span class="o">.</span><span class="n">onehotbatch</span><span class="x">(</span><span class="n">artist_indexes</span><span class="x">,</span> <span class="mi">1</span><span class="o">:</span><span class="mi">5</span><span class="x">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>5×727 Flux.OneHotMatrix{Array{Flux.OneHotVector,1}}:
true false false true false … false false false false false
false true false false false false false false false true
false false true false false false true false false false
false false false false true true false true true false
false false false false false false false false false false
</code></pre></div></div>
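<p>Going the other way, Flux’s <code class="language-plaintext highlighter-rouge">onecold</code> reverses the one-hot encoding, which will be handy later for reading off predictions (assuming <code class="language-plaintext highlighter-rouge">onecold</code> is available in this Flux version):</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Recover the integer artist label of the first song from its one-hot column
Flux.onecold(y[:, 1], 1:5)
</code></pre></div></div>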
<p>Let’s now split our <code class="language-plaintext highlighter-rouge">X</code> data into training and test data sets.</p>
<ul>
<li>
<p><strong>Training data</strong> will be used to ‘train’ the model.</p>
</li>
<li>
<p><strong>Test data</strong> will be new ‘unseen’ data used to make new predictions. As we have knowledge of the artists we will be able to score the accuracy of the model.</p>
</li>
</ul>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">X_train</span> <span class="o">=</span> <span class="n">X</span><span class="x">[</span><span class="o">:</span><span class="x">,</span> <span class="mi">1</span><span class="o">:</span><span class="mi">649</span><span class="x">]</span>
<span class="n">y_train</span> <span class="o">=</span> <span class="n">y</span><span class="x">[</span><span class="o">:</span><span class="x">,</span><span class="mi">1</span><span class="o">:</span><span class="mi">649</span><span class="x">]</span>
<span class="n">X_test</span> <span class="o">=</span> <span class="n">X</span><span class="x">[</span><span class="o">:</span><span class="x">,</span> <span class="mi">650</span><span class="o">:</span><span class="k">end</span><span class="x">]</span>
<span class="n">y_test</span> <span class="o">=</span> <span class="n">y</span><span class="x">[</span><span class="o">:</span><span class="x">,</span> <span class="mi">650</span><span class="o">:</span><span class="k">end</span><span class="x">]</span>
<span class="n">println</span><span class="x">(</span><span class="s">"X_train </span><span class="si">$</span><span class="s">(size(X_train)) y_train </span><span class="si">$</span><span class="s">(size(y_train)) X_test </span><span class="si">$</span><span class="s">(size(X_test)) y_test </span><span class="si">$</span><span class="s">(size(y_test))"</span><span class="x">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>X_train (200, 649) y_train (5, 649) X_test (200, 78) y_test (5, 78)
</code></pre></div></div>
<p>The final preprocessing step neatly combines our training data and labels into a <code class="language-plaintext highlighter-rouge">train_set</code> tuple for Flux.</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">train_set</span> <span class="o">=</span> <span class="x">[(</span><span class="n">X_train</span><span class="x">,</span> <span class="n">y_train</span><span class="x">)]</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>1-element Array{Tuple{Array{Int64,2},Flux.OneHotMatrix{Array{Flux.OneHotVector,1}}},1}:
([0 0 … 0 0; 0 0 … 0 0; … ; 4579 8398 … 7635 3929; 1249 8398 … 7454 3038], [true false … false false; false true … true false; … ; false false … false false; false false … false true])
</code></pre></div></div>
<h2 id="embedding-prep">Embedding Prep</h2>
<p>Our <code class="language-plaintext highlighter-rouge">X</code> data is now numbers, and these numbers point to words in the <code class="language-plaintext highlighter-rouge">word_dict</code>. In its current state the numbers don’t have much value for training. The next step is to load the GloVe word embeddings and prepare them as the first layer in our Flux neural network. Word embeddings give our words ‘meaning’ and have been covered in detail in my previous blog posts; please refer to these if you need more background on word vectors and embedding them in neural networks.</p>
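<p>Conceptually, the embedding layer we are about to build is just a column lookup into a pre-trained matrix. Here is a toy sketch with random values (not the real GloVe vectors):</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code>E = rand(Float32, 3, 5)    # toy embedding matrix: 3 dimensions x 5 vocabulary words
song = [2, 4, 1]           # a "song" of three word indexes
E[:, song]                 # 3x3 matrix: one embedding column per word
</code></pre></div></div>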
<p>Let’s load in the embeddings.</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">function</span><span class="nf"> load_embeddings</span><span class="x">(</span><span class="n">embedding_file</span><span class="x">)</span>
<span class="kd">local</span> <span class="n">LL</span><span class="x">,</span> <span class="n">indexed_words</span><span class="x">,</span> <span class="n">index</span>
<span class="n">indexed_words</span> <span class="o">=</span> <span class="kt">Vector</span><span class="x">{</span><span class="kt">String</span><span class="x">}()</span>
<span class="n">LL</span> <span class="o">=</span> <span class="kt">Vector</span><span class="x">{</span><span class="kt">Vector</span><span class="x">{</span><span class="kt">Float32</span><span class="x">}}()</span>
<span class="n">open</span><span class="x">(</span><span class="n">embedding_file</span><span class="x">)</span> <span class="k">do</span> <span class="n">f</span>
<span class="n">index</span> <span class="o">=</span> <span class="mi">1</span>
<span class="k">for</span> <span class="n">line</span> <span class="k">in</span> <span class="n">eachline</span><span class="x">(</span><span class="n">f</span><span class="x">)</span>
<span class="n">xs</span> <span class="o">=</span> <span class="n">split</span><span class="x">(</span><span class="n">line</span><span class="x">)</span>
<span class="n">word</span> <span class="o">=</span> <span class="n">xs</span><span class="x">[</span><span class="mi">1</span><span class="x">]</span>
<span class="n">push!</span><span class="x">(</span><span class="n">indexed_words</span><span class="x">,</span> <span class="n">word</span><span class="x">)</span>
<span class="n">push!</span><span class="x">(</span><span class="n">LL</span><span class="x">,</span> <span class="n">parse</span><span class="o">.</span><span class="x">(</span><span class="kt">Float32</span><span class="x">,</span> <span class="n">xs</span><span class="x">[</span><span class="mi">2</span><span class="o">:</span><span class="k">end</span><span class="x">]))</span>
<span class="n">index</span> <span class="o">+=</span> <span class="mi">1</span>
<span class="k">end</span>
<span class="k">end</span>
<span class="k">return</span> <span class="n">reduce</span><span class="x">(</span><span class="n">hcat</span><span class="x">,</span> <span class="n">LL</span><span class="x">),</span> <span class="n">indexed_words</span>
<span class="k">end</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>load_embeddings (generic function with 1 method)
</code></pre></div></div>
<p>Note we have gone for the 300-dimension file this time for better results.</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">embeddings</span><span class="x">,</span> <span class="n">vocab</span> <span class="o">=</span> <span class="n">load_embeddings</span><span class="x">(</span><span class="s">"glove.6B.300d.txt"</span><span class="x">)</span>
<span class="n">embed_size</span><span class="x">,</span> <span class="n">max_features</span> <span class="o">=</span> <span class="n">size</span><span class="x">(</span><span class="n">embeddings</span><span class="x">)</span>
<span class="n">println</span><span class="x">(</span><span class="s">"Loaded embeddings, each word is represented by a vector with </span><span class="si">$</span><span class="s">embed_size features. The vocab size is </span><span class="si">$</span><span class="s">max_features"</span><span class="x">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Loaded embeddings, each word is represented by a vector with 300 features. The vocab size is 400000
</code></pre></div></div>
<p>Now we define our usual functions for returning word vectors.</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">#Function to return the index of the word 's' in the embedding (returns 0 if the word is not found)</span>
<span class="k">function</span><span class="nf"> vec_idx</span><span class="x">(</span><span class="n">s</span><span class="x">)</span>
<span class="n">i</span><span class="o">=</span><span class="n">findfirst</span><span class="x">(</span><span class="n">x</span> <span class="o">-></span> <span class="n">x</span><span class="o">==</span><span class="n">s</span><span class="x">,</span> <span class="n">vocab</span><span class="x">)</span>
<span class="n">i</span><span class="o">==</span><span class="nb">nothing</span> <span class="o">?</span> <span class="n">i</span><span class="o">=</span><span class="mi">0</span> <span class="o">:</span> <span class="n">i</span>
<span class="k">end</span>
<span class="c">#Function to return the word vector for string 's'</span>
<span class="n">wvec</span><span class="x">(</span><span class="n">s</span><span class="x">)</span> <span class="o">=</span> <span class="n">embeddings</span><span class="x">[</span><span class="o">:</span><span class="x">,</span> <span class="n">vec_idx</span><span class="x">(</span><span class="n">s</span><span class="x">)]</span>
<span class="c">#return the wordvec for "christmas" as a test</span>
<span class="n">wvec</span><span class="x">(</span><span class="s">"christmas"</span><span class="x">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>300-element Array{Float32,1}:
-0.12172
-0.50138
-0.094431
0.1533
-0.53234
0.77088
-0.18902
0.45391
-0.55459
-0.60449
-0.070504
0.020576
0.49627
⋮
-0.013262
-0.28618
-0.0091329
0.057448
-0.073389
0.45916
-0.30745
-0.40096
-0.039834
0.11326
0.092584
-0.37479
</code></pre></div></div>
<p>As you may have noticed a few steps above, the vocab in the GloVe file we loaded earlier contains 400,000 words. We don’t need all of them, and keeping them all would make training very slow or cause memory issues. We also need to handle ‘missing words’. In the next step we make an embedding matrix of word vectors based on our own word dictionary.</p>
<p>Let’s see how big the embedding matrix should be at the minimum.</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">length</span><span class="x">(</span><span class="n">word_dict</span><span class="x">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>8449
</code></pre></div></div>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">max_features</span> <span class="o">=</span> <span class="mi">300</span>
<span class="n">vocab_size</span> <span class="o">=</span> <span class="mi">8450</span>
<span class="n">println</span><span class="x">(</span><span class="s">"max_features=</span><span class="si">$</span><span class="s">max_features x vocab_size=</span><span class="si">$</span><span class="s">vocab_size"</span><span class="x">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>max_features=300 x vocab_size=8450
</code></pre></div></div>
<p>We’ll make the vocab_size at least one bigger than the dictionary, reserving a slot for the zero (unknown) word.</p>
<p>It’s likely that there will be a few words from the lyrics that aren’t in GloVe. We need to make sure that any missing words don’t spoil the training by being zero, ‘too big’ or ‘too small’. We therefore pre-fill the matrix with comparable random numbers as a first step using <code class="language-plaintext highlighter-rouge">glorot_normal</code>.</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">embedding_matrix</span><span class="o">=</span><span class="n">Flux</span><span class="o">.</span><span class="n">glorot_normal</span><span class="x">(</span><span class="n">max_features</span><span class="x">,</span> <span class="n">vocab_size</span><span class="x">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>300×8450 Array{Float32,2}:
-0.0295832 -0.00962955 -0.00975701 … -0.0178692 -0.00624065
-0.00807149 -0.0167376 -0.0101676 0.021685 -0.0144029
0.0135334 0.00604884 -0.00648684 0.00542843 0.00395646
-0.00981705 -0.0340076 -0.014655 -0.00343636 0.0193315
-0.0175665 -0.00567335 0.0157851 -0.00380507 -0.00563199
0.00679509 -0.0167397 -0.0349645 … 0.0162774 0.00153693
0.00236159 0.0258442 0.0297015 -0.0117106 0.00243774
0.0119477 0.0113597 -0.0330014 -0.022494 -0.000611503
-0.0117824 0.00965574 0.0291393 -0.00894787 -0.00370767
-0.0251287 0.0157542 -0.00152643 0.00256018 -0.0117952
-0.0102175 0.00565934 -0.00816817 … -0.0257206 0.0139027
-0.00337642 -0.00810942 -0.026816 0.00700659 0.0145595
-0.0189478 0.0183039 -0.0253489 0.00468408 0.00352472
⋮ ⋱
0.0341929 0.0230084 -0.00523734 -0.00861224 0.00337825
-0.0102566 0.0121515 0.00860467 -0.00747732 -0.00846712
-0.00629439 -0.0118928 0.00331296 … -0.022168 -0.0182947
0.0127277 -0.0146548 -0.0358121 -0.00254599 -0.00691585
-0.00704753 -0.0109151 -0.0131335 0.00149089 -0.00471239
-0.00688779 -0.0127001 0.00146849 0.00887815 -0.0080609
-0.00544714 -0.0144375 0.0112734 0.0162863 0.0125952
0.0122895 0.018809 0.0105552 … 0.0117019 -0.0186995
0.0277002 -0.0295917 0.00182625 0.0267027 0.010207
-0.0244318 0.0156611 0.0113718 -0.00889063 0.0157727
0.00685371 0.0027254 -0.000454166 0.0062418 -0.0218112
-0.00230126 0.00790164 0.0146713 0.0186511 0.00484746
</code></pre></div></div>
<p>The for loop below inserts the known word vectors from GloVe by overwriting the pre-filled random numbers. It is important to note that they are inserted at the index determined from <code class="language-plaintext highlighter-rouge">word_dict</code> plus 1; the extra 1 offsets every word so that index 1 stays reserved for the zero (unknown) word.</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="n">term</span> <span class="k">in</span> <span class="n">m_dtm</span><span class="o">.</span><span class="n">terms</span>
<span class="k">if</span> <span class="n">vec_idx</span><span class="x">(</span><span class="n">term</span><span class="x">)</span><span class="o">!=</span><span class="mi">0</span>
<span class="n">embedding_matrix</span><span class="x">[</span><span class="o">:</span><span class="x">,</span><span class="n">word_dict</span><span class="x">[</span><span class="n">term</span><span class="x">]</span><span class="o">+</span><span class="mi">1</span><span class="x">]</span><span class="o">=</span><span class="n">wvec</span><span class="x">(</span><span class="n">term</span><span class="x">)</span>
<span class="k">end</span>
<span class="k">end</span>
<span class="n">embedding_matrix</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>300×8450 Array{Float32,2}:
-0.0295832 -0.47974 0.090805 -0.00158495 … -0.0178692 0.0060653
-0.00807149 0.093277 0.25026 0.000283305 0.021685 -0.56901
0.0135334 -0.44665 -0.14494 0.00417346 0.00542843 -0.4516
-0.00981705 0.33504 0.81738 0.0119664 -0.00343636 0.13047
-0.0175665 -0.83164 -0.76269 0.0211371 -0.00380507 0.063553
0.00679509 0.36115 0.58164 0.0119629 … 0.0162774 -0.44511
0.00236159 0.07612 -0.081049 0.00540261 -0.0117106 0.17436
0.0119477 0.6984 0.28666 0.00103992 -0.022494 -0.19654
-0.0117824 -0.21912 -0.24209 -0.010946 -0.00894787 0.54479
-0.0251287 -0.1397 -0.083947 -0.00893104 0.00256018 0.037594
-0.0102175 0.28931 -0.15224 -0.0118294 … -0.0257206 0.26817
-0.00337642 0.28525 0.22769 0.0355204 0.00700659 -0.11157
-0.0189478 -0.61277 -0.27592 0.00716774 0.00468408 -1.16
⋮ ⋱
0.0341929 0.40865 0.30203 0.00646084 -0.00861224 -0.0442
-0.0102566 -0.66024 -0.47214 0.00124003 -0.00747732 0.42311
-0.00629439 -0.3993 -0.38838 -0.0138936 … -0.022168 0.14924
0.0127277 0.1155 -0.35227 0.00467165 -0.00254599 0.53348
-0.00704753 -0.4311 -0.65561 -0.0033085 0.00149089 0.21203
-0.00688779 -0.70635 -0.4813 0.00513726 0.00887815 -0.7755
-0.00544714 -0.16662 0.16227 -0.0096694 0.0162863 0.21987
0.0122895 0.054079 -0.095315 0.000943435 … 0.0117019 -0.6204
0.0277002 0.73493 1.1127 0.00626607 0.0267027 0.39769
-0.0244318 -0.40104 -0.12874 -0.0130443 -0.00889063 0.062195
0.00685371 0.0041243 0.023493 -0.0203715 0.0062418 0.34639
-0.00230126 0.047944 -0.36228 0.0113611 0.0186511 0.60853
</code></pre></div></div>
<h2 id="first-model">First Model</h2>
<p>For our first model architecture we use the pre-trained embeddings and a normal dense layer.</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">m</span> <span class="o">=</span> <span class="n">Chain</span><span class="x">(</span><span class="n">x</span> <span class="o">-></span> <span class="n">embedding_matrix</span> <span class="o">*</span> <span class="n">Flux</span><span class="o">.</span><span class="n">onehotbatch</span><span class="x">(</span><span class="n">reshape</span><span class="x">(</span><span class="n">x</span><span class="x">,</span> <span class="n">doc_pad_size</span><span class="o">*</span><span class="n">size</span><span class="x">(</span><span class="n">x</span><span class="x">,</span><span class="mi">2</span><span class="x">)),</span> <span class="mi">0</span><span class="o">:</span><span class="n">vocab_size</span><span class="o">-</span><span class="mi">1</span><span class="x">),</span>
<span class="n">x</span> <span class="o">-></span> <span class="n">reshape</span><span class="x">(</span><span class="n">x</span><span class="x">,</span> <span class="n">max_features</span><span class="x">,</span> <span class="n">doc_pad_size</span><span class="x">,</span> <span class="n">trunc</span><span class="x">(</span><span class="kt">Int64</span><span class="x">(</span><span class="n">size</span><span class="x">(</span><span class="n">x</span><span class="x">,</span><span class="mi">2</span><span class="x">)</span><span class="o">/</span><span class="n">doc_pad_size</span><span class="x">))),</span>
<span class="n">x</span> <span class="o">-></span> <span class="n">mean</span><span class="x">(</span><span class="n">x</span><span class="x">,</span> <span class="n">dims</span><span class="o">=</span><span class="mi">2</span><span class="x">),</span>
<span class="n">x</span> <span class="o">-></span> <span class="n">reshape</span><span class="x">(</span><span class="n">x</span><span class="x">,</span> <span class="n">max_features</span><span class="x">,</span> <span class="o">:</span><span class="x">),</span>
<span class="n">Dense</span><span class="x">(</span><span class="n">max_features</span><span class="x">,</span> <span class="mi">5</span><span class="x">),</span>
<span class="n">softmax</span>
<span class="x">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Chain(getfield(Main, Symbol("##13#17"))(), getfield(Main, Symbol("##14#18"))(), getfield(Main, Symbol("##15#19"))(), getfield(Main, Symbol("##16#20"))(), Dense(300, 5), NNlib.softmax)
</code></pre></div></div>
<p><strong>Layer 1:</strong> The embedding layer. The onehotbatch multiplication ensures that the correct word vectors are used for every song in <code class="language-plaintext highlighter-rouge">x</code>. The output shape is 300x129800; i.e. all the documents are rolled out into one long vector.</p>
<p><strong>Layer 2:</strong> Reshapes the output from layer 1 into the dimensions 300x200x649.</p>
<p><strong>Layer 3:</strong> Finds the mean vector for the song. The output shape is 300x1x649.</p>
<p><strong>Layer 4:</strong> Reshapes the output from layer 3 into a shape suitable for training 300x649.</p>
<p><strong>Layer 5:</strong> The dense training layer. The output is 5x649.</p>
<p><strong>Layer 6:</strong> Softmax to give us nice probabilities.</p>
<p>More information on this model architecture can be found in a previous post <a href="https://spcman.github.io/getting-to-know-julia/nlp/flux-embeddings-tutorial-2/">Julia Word Embedding Layer in Flux - Pre-trained GloVe</a></p>
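To see how layers 2–4 transform the shapes, here is a minimal sketch on dummy data (the 300-feature and 200-word sizes match the model; the 2-song batch is an assumption for brevity):

```julia
using Statistics  # for mean

# Dummy stand-in for the layer-1 output: 300 features, 2 songs of 200 padded words
embed_size, doc_pad_size, n_songs = 300, 200, 2
x = rand(Float32, embed_size, doc_pad_size * n_songs)  # 300×400, rolled out

x = reshape(x, embed_size, doc_pad_size, n_songs)      # layer 2: 300×200×2
x = mean(x, dims=2)                                    # layer 3: 300×1×2
x = reshape(x, embed_size, :)                          # layer 4: 300×2
size(x)                                                # one mean word vector per song
```

The same arithmetic scales to the full data: with 649 songs the final shape is 300x649, ready for the dense layer.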
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">loss_h</span><span class="o">=</span><span class="x">[]</span>
<span class="n">accuracy_train</span><span class="o">=</span><span class="x">[]</span>
<span class="n">accuracy_test</span><span class="o">=</span><span class="x">[]</span>
<span class="n">accuracy</span><span class="x">(</span><span class="n">x</span><span class="x">,</span> <span class="n">y</span><span class="x">)</span> <span class="o">=</span> <span class="n">mean</span><span class="x">(</span><span class="n">Flux</span><span class="o">.</span><span class="n">onecold</span><span class="x">(</span><span class="n">x</span><span class="x">)</span> <span class="o">.==</span> <span class="n">Flux</span><span class="o">.</span><span class="n">onecold</span><span class="x">(</span><span class="n">y</span><span class="x">))</span>
<span class="n">loss</span><span class="x">(</span><span class="n">x</span><span class="x">,</span> <span class="n">y</span><span class="x">)</span> <span class="o">=</span> <span class="n">sum</span><span class="x">(</span><span class="n">Flux</span><span class="o">.</span><span class="n">crossentropy</span><span class="x">(</span><span class="n">m</span><span class="x">(</span><span class="n">x</span><span class="x">),</span> <span class="n">y</span><span class="x">))</span>
<span class="n">optimizer</span> <span class="o">=</span> <span class="n">Flux</span><span class="o">.</span><span class="n">Momentum</span><span class="x">(</span><span class="mf">0.2</span><span class="x">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Momentum(0.2, 0.9, IdDict{Any,Any}())
</code></pre></div></div>
<p>Now that our loss and accuracy functions are set up, let’s begin training the first model.</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="n">epoch</span> <span class="k">in</span> <span class="mi">1</span><span class="o">:</span><span class="mi">400</span>
<span class="n">Flux</span><span class="o">.</span><span class="n">train!</span><span class="x">(</span><span class="n">loss</span><span class="x">,</span> <span class="n">Flux</span><span class="o">.</span><span class="n">params</span><span class="x">(</span><span class="n">m</span><span class="x">),</span> <span class="n">train_set</span><span class="x">,</span> <span class="n">optimizer</span><span class="x">)</span>
<span class="n">l</span> <span class="o">=</span> <span class="n">loss</span><span class="x">(</span><span class="n">X_train</span><span class="x">,</span> <span class="n">y_train</span><span class="x">)</span><span class="o">.</span><span class="n">data</span>
<span class="n">push!</span><span class="x">(</span><span class="n">loss_h</span><span class="x">,</span> <span class="n">l</span><span class="x">)</span>
<span class="n">accuracy_trn</span><span class="o">=</span><span class="n">accuracy</span><span class="x">(</span><span class="n">m</span><span class="x">(</span><span class="n">X_train</span><span class="x">)</span><span class="o">.</span><span class="n">data</span><span class="x">,</span> <span class="n">y_train</span><span class="x">)</span>
<span class="n">push!</span><span class="x">(</span><span class="n">accuracy_train</span><span class="x">,</span> <span class="n">accuracy_trn</span><span class="x">)</span>
<span class="n">accuracy_tst</span><span class="o">=</span><span class="n">accuracy</span><span class="x">(</span><span class="n">m</span><span class="x">(</span><span class="n">X_test</span><span class="x">)</span><span class="o">.</span><span class="n">data</span><span class="x">,</span> <span class="n">y_test</span><span class="x">)</span>
<span class="n">push!</span><span class="x">(</span><span class="n">accuracy_test</span><span class="x">,</span> <span class="n">accuracy_tst</span><span class="x">)</span>
<span class="n">println</span><span class="x">(</span><span class="s">"</span><span class="si">$</span><span class="s">epoch -> loss= </span><span class="si">$</span><span class="s">l accuracy train=</span><span class="si">$</span><span class="s">accuracy_trn accuracy test=</span><span class="si">$</span><span class="s">accuracy_tst"</span><span class="x">)</span>
<span class="k">end</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>1 -> loss= 1.5928491 accuracy train=0.2218798151001541 accuracy test=0.1794871794871795
2 -> loss= 1.5850185 accuracy train=0.22033898305084745 accuracy test=0.19230769230769232
3 -> loss= 1.5755149 accuracy train=0.22496147919876733 accuracy test=0.15384615384615385
4 -> loss= 1.5658044 accuracy train=0.24345146379044685 accuracy test=0.15384615384615385
5 -> loss= 1.5568578 accuracy train=0.23728813559322035 accuracy test=0.2051282051282051
⋮
396 -> loss= 0.9313582 accuracy train=0.6764252696456087 accuracy test=0.6282051282051282
397 -> loss= 0.9309272 accuracy train=0.6764252696456087 accuracy test=0.6282051282051282
398 -> loss= 0.9304973 accuracy train=0.6764252696456087 accuracy test=0.6282051282051282
399 -> loss= 0.93006843 accuracy train=0.6764252696456087 accuracy test=0.6282051282051282
400 -> loss= 0.9296404 accuracy train=0.6764252696456087 accuracy test=0.6282051282051282
</code></pre></div></div>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">figure</span><span class="x">(</span><span class="n">figsize</span><span class="o">=</span><span class="x">(</span><span class="mi">12</span><span class="x">,</span><span class="mi">5</span><span class="x">))</span>
<span class="n">subplot</span><span class="x">(</span><span class="mi">121</span><span class="x">)</span>
<span class="n">PyPlot</span><span class="o">.</span><span class="n">xlabel</span><span class="x">(</span><span class="s">"Epoch"</span><span class="x">)</span>
<span class="n">ylabel</span><span class="x">(</span><span class="s">"Loss"</span><span class="x">)</span>
<span class="n">plot</span><span class="x">(</span><span class="n">loss_h</span><span class="x">)</span>
<span class="n">subplot</span><span class="x">(</span><span class="mi">122</span><span class="x">)</span>
<span class="n">PyPlot</span><span class="o">.</span><span class="n">xlabel</span><span class="x">(</span><span class="s">"Epoch"</span><span class="x">)</span>
<span class="n">ylabel</span><span class="x">(</span><span class="s">"Accuracy"</span><span class="x">)</span>
<span class="n">plot</span><span class="x">(</span><span class="n">accuracy_train</span><span class="x">,</span> <span class="n">label</span><span class="o">=</span><span class="s">"train"</span><span class="x">)</span>
<span class="n">plot</span><span class="x">(</span><span class="n">accuracy_test</span><span class="x">,</span> <span class="n">label</span><span class="o">=</span><span class="s">"test"</span><span class="x">)</span>
<span class="n">legend</span><span class="x">()</span>
</code></pre></div></div>
<p><img src="https://spcman.github.io/getting-to-know-julia/images/proj004/output_59_0.png" alt="loss accuracy" /></p>
<p>We observe that accuracy on the test data peaks at just over 60%. Not bad, but let’s try a new model.</p>
<h2 id="second-model---1d-cnn">Second Model - 1d CNN</h2>
<p>The second model uses a 1-Dimensional Convolutional Neural Network.</p>
<p>Here are two great videos to help explain the general approach and why the architecture works.</p>
<!-- Courtesy of embedresponsively.com //-->
<div class="responsive-video-container">
<iframe src="https://www.youtube-nocookie.com/embed/wNBaNhvL4pg" frameborder="0" allowfullscreen=""></iframe>
</div>
<p>Also check this out too.</p>
<!-- Courtesy of embedresponsively.com //-->
<div class="responsive-video-container">
<iframe src="https://www.youtube-nocookie.com/embed/8YsZXTpFRO0" frameborder="0" allowfullscreen=""></iframe>
</div>
<p>The training data for Flux CNNs must be in WHCN order; i.e. Width, Height, Channels and Number of items in the mini-batch.</p>
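As a quick sketch with dummy data (the batch of 32 is an assumption here), the reshape into WHCN order looks like this:

```julia
# WHCN sketch: Width = 300 embedding features, Height = 200 padded words,
# Channels = 1, N = 32 songs in the mini-batch (dummy data).
x = rand(Float32, 300, 200 * 32)   # rolled-out embedding-layer output
x = reshape(x, 300, 200, 1, 32)
size(x)                            # (300, 200, 1, 32)
```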
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">size</span><span class="x">(</span><span class="n">X_train</span><span class="x">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(200, 649)
</code></pre></div></div>
<p>Presently the size of <code class="language-plaintext highlighter-rouge">X_train</code> is 200x649. We now pick a batch size and split the data into mini-batches.</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">using</span> <span class="n">Base</span><span class="o">.</span><span class="n">Iterators</span><span class="o">:</span> <span class="n">repeated</span><span class="x">,</span> <span class="n">partition</span>
</code></pre></div></div>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">batch_size</span> <span class="o">=</span> <span class="mi">32</span>
<span class="n">mb_idxs</span> <span class="o">=</span> <span class="n">partition</span><span class="x">(</span><span class="mi">1</span><span class="o">:</span><span class="n">size</span><span class="x">(</span><span class="n">X_train</span><span class="x">,</span><span class="mi">2</span><span class="x">),</span> <span class="n">batch_size</span><span class="x">)</span>
<span class="n">train_set</span><span class="o">=</span><span class="x">[]</span>
<span class="k">for</span> <span class="n">i</span> <span class="k">in</span> <span class="n">mb_idxs</span>
<span class="n">push!</span><span class="x">(</span><span class="n">train_set</span><span class="x">,</span> <span class="x">(</span><span class="n">X_train</span><span class="x">[</span><span class="o">:</span><span class="x">,</span><span class="n">i</span><span class="x">],</span> <span class="n">y_train</span><span class="x">[</span><span class="o">:</span><span class="x">,</span><span class="n">i</span><span class="x">]))</span>
<span class="k">end</span>
</code></pre></div></div>
<p>The training set <code class="language-plaintext highlighter-rouge">train_set</code> now consists of 21 mini-batches. Each mini-batch is an (x, y) tuple holding 32 training examples, except the last, which holds 9.</p>
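The batch arithmetic can be checked directly (a small sketch using the sizes above):

```julia
using Base.Iterators: partition

idxs = collect(partition(1:649, 32))  # 649 training songs, batch size 32
length(idxs)        # number of mini-batches: 649 = 20*32 + 9, so 21
length(last(idxs))  # the final, smaller batch holds the remaining 9
```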
<p>Now we build the 1d convolution model in Flux.</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">m</span> <span class="o">=</span> <span class="n">Chain</span><span class="x">(</span><span class="n">x</span> <span class="o">-></span> <span class="n">embedding_matrix</span> <span class="o">*</span> <span class="n">Flux</span><span class="o">.</span><span class="n">onehotbatch</span><span class="x">(</span><span class="n">reshape</span><span class="x">(</span><span class="n">x</span><span class="x">,</span> <span class="n">doc_pad_size</span><span class="o">*</span><span class="n">size</span><span class="x">(</span><span class="n">x</span><span class="x">,</span><span class="mi">2</span><span class="x">)),</span> <span class="mi">0</span><span class="o">:</span><span class="n">vocab_size</span><span class="o">-</span><span class="mi">1</span><span class="x">),</span>
<span class="n">x</span> <span class="o">-></span> <span class="n">reshape</span><span class="x">(</span><span class="n">x</span><span class="x">,</span> <span class="n">max_features</span><span class="x">,</span> <span class="n">doc_pad_size</span><span class="x">,</span> <span class="mi">1</span><span class="x">,</span> <span class="n">trunc</span><span class="x">(</span><span class="kt">Int64</span><span class="x">(</span><span class="n">size</span><span class="x">(</span><span class="n">x</span><span class="x">,</span><span class="mi">2</span><span class="x">)</span><span class="o">/</span><span class="n">doc_pad_size</span><span class="x">))),</span>
<span class="n">Conv</span><span class="x">((</span><span class="mi">300</span><span class="x">,</span><span class="mi">1</span><span class="x">),</span> <span class="mi">1</span><span class="o">=></span><span class="mi">400</span><span class="x">,</span> <span class="n">relu</span><span class="x">),</span>
<span class="n">x</span> <span class="o">-></span> <span class="n">maxpool</span><span class="x">(</span><span class="n">x</span><span class="x">,</span> <span class="x">(</span><span class="mi">1</span><span class="x">,</span><span class="mi">300</span><span class="x">)),</span>
<span class="n">x</span> <span class="o">-></span> <span class="n">reshape</span><span class="x">(</span><span class="n">x</span><span class="x">,</span> <span class="o">:</span><span class="x">,</span> <span class="n">size</span><span class="x">(</span><span class="n">x</span><span class="x">,</span><span class="mi">4</span><span class="x">)),</span>
<span class="n">Dense</span><span class="x">(</span><span class="mi">400</span><span class="x">,</span> <span class="mi">600</span><span class="x">,</span> <span class="n">relu</span><span class="x">),</span>
<span class="n">Dense</span><span class="x">(</span><span class="mi">600</span><span class="x">,</span> <span class="mi">5</span><span class="x">),</span>
<span class="n">softmax</span>
<span class="x">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Chain(getfield(Main, Symbol("##13#17"))(), getfield(Main, Symbol("##14#18"))(), Conv((300, 1), 1=>400, NNlib.relu), getfield(Main, Symbol("##15#19"))(), getfield(Main, Symbol("##16#20"))(), Dense(400, 600, NNlib.relu), Dense(600, 5), NNlib.softmax)
</code></pre></div></div>
<p><strong>Layers 1 and 2</strong> handle the word embeddings as per model 1. The output shape from layer 2 (for the first batch) is 300x200x1x32.</p>
<p><strong>Layer 3</strong> Applies the 1d convolution filters. We use 400 channels to find new feature relationships. Activation is relu. The output size is 1x200x400x32.</p>
<p><strong>Layer 4</strong> Applies max pooling using a window size of 1x300, spanning the full song-length dimension. The output size is 1x1x400x32.</p>
<p><strong>Layer 5</strong> Flattens the shape to 400x32. This is now suitable for training in the next layer.</p>
<p><strong>Layers 6 & 7</strong> Dense layers; layer 6 uses relu activation. The output after layer 7 will be 5x32.</p>
<p><strong>Layer 8</strong> Softmax to output probabilities per artist between 0 and 1</p>
<p>Whilst tuning the model I found it really useful to test the model layers on the first batch: <code class="language-plaintext highlighter-rouge">m[1](train_set[1][1])</code> runs layer 1 only, <code class="language-plaintext highlighter-rouge">m[1:2](train_set[1][1])</code> runs layers 1 to 2, and so on. To check that the entire model runs end-to-end, use the syntax below.</p>
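The same slicing trick can be shown on a toy chain (hypothetical layers, not the lyrics model), since Flux allows indexing a `Chain` to run only a prefix of its layers:

```julia
using Flux

# A small throwaway chain just to demonstrate layer slicing
m_toy = Chain(Dense(4, 8, relu), Dense(8, 2), softmax)
x = rand(Float32, 4, 3)   # 3 dummy samples with 4 features each

size(m_toy[1](x))         # layer 1 only -> (8, 3)
size(m_toy[1:2](x))       # layers 1–2   -> (2, 3)
```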
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">m</span><span class="x">(</span><span class="n">train_set</span><span class="x">[</span><span class="mi">1</span><span class="x">][</span><span class="mi">1</span><span class="x">])</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Tracked 5×32 Array{Float32,2}:
0.64655 0.0773002 0.00207123 0.643971 … 0.000502719 0.0176868
0.121 0.891177 0.0061145 0.0336226 0.000556032 0.196564
0.0158486 0.0108271 0.96432 0.00717857 0.000501922 0.764975
0.216433 0.0196466 0.0274192 0.279381 0.998012 0.0190336
0.000169021 0.00104961 7.55584e-5 0.0358473 0.000427233 0.00174087
</code></pre></div></div>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">loss_h</span><span class="o">=</span><span class="x">[]</span>
<span class="n">accuracy_train</span><span class="o">=</span><span class="x">[]</span>
<span class="n">accuracy_test</span><span class="o">=</span><span class="x">[]</span>
<span class="n">best_acc</span><span class="o">=</span><span class="mf">0.0</span>
<span class="n">last_improvement</span><span class="o">=</span><span class="mi">0</span>
<span class="n">stat</span><span class="o">=</span><span class="s">""</span>
<span class="n">accuracy</span><span class="x">(</span><span class="n">x</span><span class="x">,</span> <span class="n">y</span><span class="x">)</span> <span class="o">=</span> <span class="n">mean</span><span class="x">(</span><span class="n">Flux</span><span class="o">.</span><span class="n">onecold</span><span class="x">(</span><span class="n">x</span><span class="x">)</span> <span class="o">.==</span> <span class="n">Flux</span><span class="o">.</span><span class="n">onecold</span><span class="x">(</span><span class="n">y</span><span class="x">))</span>
<span class="n">loss</span><span class="x">(</span><span class="n">x</span><span class="x">,</span> <span class="n">y</span><span class="x">)</span> <span class="o">=</span> <span class="n">sum</span><span class="x">(</span><span class="n">Flux</span><span class="o">.</span><span class="n">crossentropy</span><span class="x">(</span><span class="n">m</span><span class="x">(</span><span class="n">x</span><span class="x">),</span> <span class="n">y</span><span class="x">))</span>
<span class="n">optimizer</span> <span class="o">=</span> <span class="n">Flux</span><span class="o">.</span><span class="n">Momentum</span><span class="x">(</span><span class="mf">0.004</span><span class="x">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Momentum(0.004, 0.9, IdDict{Any,Any}())
</code></pre></div></div>
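<p>As a toy illustration of the helpers defined above (my own example, not part of the notebook): <code class="language-plaintext highlighter-rouge">Flux.onecold</code> maps each column of a score matrix to the index of its largest entry, so <code class="language-plaintext highlighter-rouge">accuracy</code> is simply the fraction of columns where prediction and one-hot label agree.</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code>using Flux, Statistics
scores = [0.7 0.2; 0.2 0.1; 0.1 0.7]   # model scores: 3 classes, 2 samples (one per column)
labels = [1 0; 0 0; 0 1]               # one-hot labels for the same 2 samples
Flux.onecold(scores)                    # column-wise argmax: [1, 3]
mean(Flux.onecold(scores) .== Flux.onecold(labels))   # both correct: 1.0
</code></pre></div></div>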
<p>Let's begin training the second model. Note that this training loop has been modified to drop the learning rate automatically if the test accuracy does not improve.</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="n">epoch</span> <span class="k">in</span> <span class="mi">1</span><span class="o">:</span><span class="mi">40</span>
<span class="n">Flux</span><span class="o">.</span><span class="n">train!</span><span class="x">(</span><span class="n">loss</span><span class="x">,</span> <span class="n">Flux</span><span class="o">.</span><span class="n">params</span><span class="x">(</span><span class="n">m</span><span class="x">),</span> <span class="n">train_set</span><span class="x">,</span> <span class="n">optimizer</span><span class="x">)</span>
<span class="n">l</span> <span class="o">=</span> <span class="n">loss</span><span class="x">(</span><span class="n">X_train</span><span class="x">,</span> <span class="n">y_train</span><span class="x">)</span><span class="o">.</span><span class="n">data</span>
<span class="n">push!</span><span class="x">(</span><span class="n">loss_h</span><span class="x">,</span> <span class="n">l</span><span class="x">)</span>
<span class="n">accuracy_trn</span> <span class="o">=</span> <span class="n">accuracy</span><span class="x">(</span><span class="n">m</span><span class="x">(</span><span class="n">X_train</span><span class="x">)</span><span class="o">.</span><span class="n">data</span><span class="x">,</span> <span class="n">y_train</span><span class="x">)</span>
<span class="n">accuracy_tst</span> <span class="o">=</span> <span class="n">accuracy</span><span class="x">(</span><span class="n">m</span><span class="x">(</span><span class="n">X_test</span><span class="x">)</span><span class="o">.</span><span class="n">data</span><span class="x">,</span> <span class="n">y_test</span><span class="x">)</span>
<span class="k">if</span> <span class="n">accuracy_tst</span> <span class="o">>=</span> <span class="n">best_acc</span>
<span class="n">stat</span><span class="o">=</span><span class="s">" - improvement, saving model"</span>
<span class="n">BSON</span><span class="o">.</span><span class="nd">@save</span> <span class="s">"artist_conv.bson"</span> <span class="n">m</span> <span class="n">epoch</span> <span class="n">accuracy_tst</span>
<span class="n">best_acc</span> <span class="o">=</span> <span class="n">accuracy_tst</span>
<span class="n">last_improvement</span><span class="o">=</span><span class="n">epoch</span>
<span class="k">else</span>
<span class="n">stat</span><span class="o">=</span><span class="s">" - decline"</span>
<span class="k">end</span>
<span class="k">if</span> <span class="n">epoch</span> <span class="o">-</span> <span class="n">last_improvement</span> <span class="o">>=</span> <span class="mi">5</span>
<span class="n">optimizer</span><span class="o">.</span><span class="n">eta</span> <span class="o">/=</span> <span class="mf">10.0</span>
<span class="n">stat</span><span class="o">=</span><span class="s">" - no improvements for a while, dropping learning rate by factor of 10"</span>
<span class="n">last_improvement</span> <span class="o">=</span> <span class="n">epoch</span>
<span class="k">end</span>
<span class="k">if</span> <span class="n">epoch</span> <span class="o">-</span> <span class="n">last_improvement</span> <span class="o">>=</span> <span class="mi">15</span>
<span class="n">stat</span><span class="o">=</span><span class="s">" - No improvement for 15 epochs STOPPING"</span>
<span class="n">break</span>
<span class="k">end</span>
<span class="n">push!</span><span class="x">(</span><span class="n">accuracy_train</span><span class="x">,</span> <span class="n">accuracy_trn</span><span class="x">)</span>
<span class="n">push!</span><span class="x">(</span><span class="n">accuracy_test</span><span class="x">,</span> <span class="n">accuracy_tst</span><span class="x">)</span>
<span class="n">println</span><span class="x">(</span><span class="s">"</span><span class="si">$</span><span class="s">epoch -> loss= </span><span class="si">$</span><span class="s">l accuracy train=</span><span class="si">$</span><span class="s">accuracy_trn accuracy test=</span><span class="si">$</span><span class="s">accuracy_tst </span><span class="si">$</span><span class="s">stat"</span><span class="x">)</span>
<span class="k">end</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>1 -> loss= 1.5475174 accuracy train=0.4391371340523883 accuracy test=0.3333333333333333 - improvement, saving model
2 -> loss= 1.3981959 accuracy train=0.5177195685670262 accuracy test=0.4230769230769231 - improvement, saving model
3 -> loss= 1.2068622 accuracy train=0.576271186440678 accuracy test=0.46153846153846156 - improvement, saving model
4 -> loss= 1.014195 accuracy train=0.674884437596302 accuracy test=0.5256410256410257 - improvement, saving model
5 -> loss= 0.8740304 accuracy train=0.6964560862865947 accuracy test=0.5769230769230769 - improvement, saving model
6 -> loss= 0.7762191 accuracy train=0.7134052388289677 accuracy test=0.5641025641025641 - decline
7 -> loss= 0.70594215 accuracy train=0.7195685670261941 accuracy test=0.6153846153846154 - improvement, saving model
8 -> loss= 0.63908446 accuracy train=0.7411402157164869 accuracy test=0.6153846153846154 - improvement, saving model
9 -> loss= 0.5899226 accuracy train=0.7704160246533128 accuracy test=0.6025641025641025 - decline
10 -> loss= 0.5465148 accuracy train=0.7935285053929122 accuracy test=0.6025641025641025 - decline
11 -> loss= 0.58453256 accuracy train=0.7796610169491526 accuracy test=0.5769230769230769 - decline
12 -> loss= 1.08682 accuracy train=0.6332819722650231 accuracy test=0.5641025641025641 - decline
13 -> loss= 0.84157795 accuracy train=0.687211093990755 accuracy test=0.5897435897435898 - no improvements for a while, dropping learning rate by factor of 10
14 -> loss= 1.0953864 accuracy train=0.6409861325115562 accuracy test=0.5256410256410257 - decline
15 -> loss= 0.38960773 accuracy train=0.9029275808936826 accuracy test=0.6153846153846154 - improvement, saving model
16 -> loss= 0.3409845 accuracy train=0.9229583975346687 accuracy test=0.6282051282051282 - improvement, saving model
17 -> loss= 0.3086353 accuracy train=0.9291217257318952 accuracy test=0.6666666666666666 - improvement, saving model
18 -> loss= 0.2871488 accuracy train=0.938366718027735 accuracy test=0.6410256410256411 - decline
19 -> loss= 0.27186963 accuracy train=0.9414483821263482 accuracy test=0.6410256410256411 - decline
20 -> loss= 0.25938293 accuracy train=0.9460708782742681 accuracy test=0.6410256410256411 - decline
21 -> loss= 0.24849279 accuracy train=0.9460708782742681 accuracy test=0.6410256410256411 - decline
22 -> loss= 0.23899835 accuracy train=0.9491525423728814 accuracy test=0.6538461538461539 - no improvements for a while, dropping learning rate by factor of 10
23 -> loss= 0.23771816 accuracy train=0.9506933744221879 accuracy test=0.6666666666666666 - improvement, saving model
24 -> loss= 0.23479633 accuracy train=0.9522342064714946 accuracy test=0.6538461538461539 - decline
25 -> loss= 0.2335223 accuracy train=0.9522342064714946 accuracy test=0.6538461538461539 - decline
26 -> loss= 0.2326564 accuracy train=0.9506933744221879 accuracy test=0.6538461538461539 - decline
27 -> loss= 0.23182735 accuracy train=0.9491525423728814 accuracy test=0.6538461538461539 - decline
28 -> loss= 0.23101975 accuracy train=0.9491525423728814 accuracy test=0.6538461538461539 - no improvements for a while, dropping learning rate by factor of 10
29 -> loss= 0.23071226 accuracy train=0.9522342064714946 accuracy test=0.6538461538461539 - decline
30 -> loss= 0.23060969 accuracy train=0.9537750385208013 accuracy test=0.6538461538461539 - decline
31 -> loss= 0.23052704 accuracy train=0.9522342064714946 accuracy test=0.6538461538461539 - decline
32 -> loss= 0.23044708 accuracy train=0.9537750385208013 accuracy test=0.6538461538461539 - decline
33 -> loss= 0.2303679 accuracy train=0.9537750385208013 accuracy test=0.6538461538461539 - no improvements for a while, dropping learning rate by factor of 10
34 -> loss= 0.2303387 accuracy train=0.9537750385208013 accuracy test=0.6538461538461539 - decline
35 -> loss= 0.23032849 accuracy train=0.9537750385208013 accuracy test=0.6538461538461539 - decline
36 -> loss= 0.23032042 accuracy train=0.9537750385208013 accuracy test=0.6538461538461539 - decline
37 -> loss= 0.23031256 accuracy train=0.9537750385208013 accuracy test=0.6538461538461539 - decline
38 -> loss= 0.23030475 accuracy train=0.9537750385208013 accuracy test=0.6538461538461539 - no improvements for a while, dropping learning rate by factor of 10
39 -> loss= 0.23030235 accuracy train=0.9537750385208013 accuracy test=0.6538461538461539 - decline
40 -> loss= 0.23030187 accuracy train=0.9537750385208013 accuracy test=0.6538461538461539 - decline
</code></pre></div></div>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">figure</span><span class="x">(</span><span class="n">figsize</span><span class="o">=</span><span class="x">(</span><span class="mi">12</span><span class="x">,</span><span class="mi">5</span><span class="x">))</span>
<span class="n">subplot</span><span class="x">(</span><span class="mi">121</span><span class="x">)</span>
<span class="n">PyPlot</span><span class="o">.</span><span class="n">xlabel</span><span class="x">(</span><span class="s">"Epoch"</span><span class="x">)</span>
<span class="n">ylabel</span><span class="x">(</span><span class="s">"Loss"</span><span class="x">)</span>
<span class="n">plot</span><span class="x">(</span><span class="n">loss_h</span><span class="x">)</span>
<span class="n">subplot</span><span class="x">(</span><span class="mi">122</span><span class="x">)</span>
<span class="n">PyPlot</span><span class="o">.</span><span class="n">xlabel</span><span class="x">(</span><span class="s">"Epoch"</span><span class="x">)</span>
<span class="n">ylabel</span><span class="x">(</span><span class="s">"Accuracy"</span><span class="x">)</span>
<span class="n">plot</span><span class="x">(</span><span class="n">accuracy_train</span><span class="x">,</span> <span class="n">label</span><span class="o">=</span><span class="s">"train"</span><span class="x">)</span>
<span class="n">plot</span><span class="x">(</span><span class="n">accuracy_test</span><span class="x">,</span> <span class="n">label</span><span class="o">=</span><span class="s">"test"</span><span class="x">)</span>
<span class="n">legend</span><span class="x">()</span>
</code></pre></div></div>
<p><img src="https://spcman.github.io/getting-to-know-julia/images/proj004/output_76_0.png" alt="loss accuracy song lyrics" /></p>
<p>An improvement of about 4% compared to model 1.</p>
<h2 id="load-the-best-model">Load the best model</h2>
<p>You may have noticed that the training loop saved the model whenever the test accuracy improved. The next line of code loads that best model, which means the training loop does not need to be re-run every time the notebook runs; training can take a few minutes on a CPU.</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">BSON</span><span class="o">.</span><span class="nd">@load</span> <span class="s">"artist_conv.bson"</span> <span class="n">m</span>
</code></pre></div></div>
<h2 id="conclusion">Conclusion</h2>
<p>The model reached nearly 70% test accuracy. With a little more perseverance I think it could have got there. The steps I had in mind to improve accuracy were:</p>
<ul>
<li>
<p>Study and make updates to the out of vocabulary words.</p>
</li>
<li>
<p>Data augmentation and balance of training examples</p>
</li>
</ul>
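<p>For the second bullet, one simple approach is random oversampling of under-represented artists before batching. A sketch (my own, not code from this project; it assumes a vector of integer artist labels and returns a balanced, shuffled list of row indices):</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code>using Random, StatsBase

# Oversample each class's row indices up to the size of the largest class.
function oversample(labels)
    counts = countmap(labels)               # class => number of examples
    target = maximum(values(counts))
    idx = Int[]
    for (cls, c) in counts
        rows = findall(==(cls), labels)
        append!(idx, rows)                  # keep every original row
        append!(idx, rand(rows, target - c))  # plus random duplicates to reach `target`
    end
    shuffle(idx)
end
</code></pre></div></div>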
<p>I might come back to this another day.</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">#Function to return the artist name based on the index 'a' </span>
<span class="k">function</span><span class="nf"> artist_name</span><span class="x">(</span><span class="n">a</span><span class="x">)</span>
<span class="n">i</span><span class="o">=</span><span class="n">findfirst</span><span class="x">(</span><span class="n">x</span> <span class="o">-></span> <span class="n">x</span><span class="o">==</span><span class="n">a</span><span class="x">,</span> <span class="n">artist_dict</span><span class="x">)</span>
<span class="k">end</span>
<span class="n">artist_name</span><span class="x">(</span><span class="mi">1</span><span class="x">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>"Queen"
</code></pre></div></div>
<h3 id="put-yourself-to-the-test">Put yourself to the test</h3>
<p>Set <code class="language-plaintext highlighter-rouge">i</code> to a value between 1 and 78 and put yourself to the test with the next three cells.</p>
<p><img src="https://spcman.github.io/getting-to-know-julia/images/proj004/contenders.png" alt="ai song artist contenders" /></p>
<h4 id="who-wrote-this-song">Who wrote this song?</h4>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">i</span><span class="o">=</span><span class="mi">5</span>
<span class="n">replace</span><span class="x">(</span><span class="n">df</span><span class="x">[</span><span class="mi">649</span><span class="o">+</span><span class="n">i</span><span class="x">,</span><span class="o">:</span><span class="n">text</span><span class="x">],</span> <span class="s">"</span><span class="se">\n</span><span class="s">"</span> <span class="o">=></span> <span class="s">" "</span><span class="x">)</span> <span class="c"># 649 is the test/train split</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[Puff Intro] Yeah The old school To the new school Bad Boy, remix, let's go [Black Rob] Like that Black gon'
slide with Mike Jack Puff done remixed one hell of a track Put me on it I wanna know How many want it? Damn, it feels
good to see people love on it For those who love slow down 'Member Motown had a brotha' happy as shit I mean the whole
sound Bangin' and catch six-four since we was shorties Fee owes now rebooked from California Carry 40's but I 'member
them times in '79 When I first started to rhyme Sometimes I gots to look back at what it was The good old days The
triple o'shays when there was love I want you back But I can't grab that far It's how it is When you're living like a
star, bad boy Come on, let's go [Mj] When I had you to myself I didn't want you around Those pretty faces Always
made you Stand out in a crowd But someone picked you from the bunch When love was all it took Now it's much too late for
me To take a second look Oh baby, give me one more chance (To show you that I love you) Won't you please let me (Back
in your heart) Oh, darlin' I was blind to let you go (Let you go baby) But now since I see you in his arms (I want you back)
Oh, I do now (I want you back) Oh, oh, baby (I want you back) Yeah, yeah, yeah, yeah (I want you back) Nah,
nah, nah, nah Trying to live without your love Is one long sleepless night Let me show you girl That I know wrong
from right Every street you walk on I lay tear stains on the ground Following the girl I didn't even want you
around Let me tell ya now Oh baby all I need is one more chance (To show you that I love you) Won't you please let
me (Back in your heart) Oh darlin' I was blind to let you go (Let you go baby) But now since I see you in his arms (I
want you back) [Black Rob] It's just like Jermain Jackson Tito, Mike and Marlon Only think on my mind now is
stardom Blowin' the F-up My game's stepped up 'Member when Mike and them First came to record Singin' hits like
Skywriter My Girl, People Make The World Go 'Round Mama's Pearl, Can't Loose it Joyful jukebox music Never Can Say
Goodbye That's why we use it It's money honey So I gots to be there And I'm be yo Sugar Daddy Say it's real
Versachi chair, pd, life of the party Bad Boy, make joys for everbody Jackson 5 Chorus in background while: [Puff Daddy]
From the old to the new Come on Motown Rock on Yeah, yeah, yeah, yeah [Jackson 5 Chorus until fade]
</code></pre></div></div>
<h4 id="pause-and-think--here-is-the-answer">Pause and think! Here is the answer.</h4>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df</span><span class="x">[</span><span class="mi">649</span><span class="o">+</span><span class="n">i</span><span class="x">,[</span><span class="o">:</span><span class="n">artist</span><span class="x">,</span> <span class="o">:</span><span class="n">song</span><span class="x">]]</span>
</code></pre></div></div>
<p><img src="https://spcman.github.io/getting-to-know-julia/images/proj004/dataframe2.png" alt="dataframe" /></p>
<h4 id="this-is-the-prediction-that-model-gave">This is the prediction that model gave.</h4>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">artist_name</span><span class="x">(</span><span class="n">test_predictions</span><span class="x">[</span><span class="n">i</span><span class="x">])</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>"Eminem"
</code></pre></div></div>
<p>OK, so our model got it wrong, but maybe you did too?</p>
<p>Although this one is labelled ‘Michael Jackson’, it appears in the dataset as a rap remix of his song with lyrics from P. Diddy and Black Rob, so I still think Eminem was a reasonable prediction.</p>
<h3 id="confusion-matrix">Confusion Matrix</h3>
<p>The confusion matrix shows where the model predictions were correct (the diagonal) and where they failed (the other cells). Note that <code class="language-plaintext highlighter-rouge">test_predictions</code> and <code class="language-plaintext highlighter-rouge">test_actual</code> are computed in the ‘All Predictions’ cell further down.</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">using</span> <span class="n">MLBase</span>
</code></pre></div></div>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">cm</span><span class="o">=</span><span class="n">confusmat</span><span class="x">(</span><span class="mi">5</span><span class="x">,</span><span class="n">test_predictions</span><span class="x">,</span> <span class="n">test_actual</span><span class="x">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>5×5 Array{Int64,2}:
8 3 2 2 2
1 9 1 2 1
1 2 10 1 0
2 3 2 14 0
0 0 0 1 11
</code></pre></div></div>
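<p>From these raw counts, overall accuracy and per-class precision and recall fall out directly. A sketch (my own, not from the original notebook; it relies on the orientation produced above, where rows are predictions and columns are the actual artists):</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Overall accuracy is the diagonal divided by the total number of test songs.
accuracy_cm = sum(cm[i,i] for i in 1:5) / sum(cm)       # 52/78 ≈ 0.667, the test accuracy
precision   = [cm[i,i] / sum(cm[i,:]) for i in 1:5]     # of songs predicted as class i, fraction correct
recall      = [cm[i,i] / sum(cm[:,i]) for i in 1:5]     # of songs actually class i, fraction found
</code></pre></div></div>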
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">labels</span><span class="o">=</span><span class="x">[</span><span class="n">artist_name</span><span class="x">(</span><span class="n">x</span><span class="x">)</span> <span class="k">for</span> <span class="n">x</span> <span class="k">in</span> <span class="mi">1</span><span class="o">:</span><span class="n">length</span><span class="x">(</span><span class="n">artists</span><span class="x">)]</span>
<span class="n">cmap</span><span class="o">=</span><span class="n">get_cmap</span><span class="x">(</span><span class="s">"Blues"</span><span class="x">)</span>
<span class="n">cax</span><span class="o">=</span><span class="n">matshow</span><span class="x">(</span><span class="n">cm</span><span class="x">)</span>
<span class="n">imshow</span><span class="x">(</span><span class="n">cm</span><span class="x">,</span> <span class="n">interpolation</span><span class="o">=</span><span class="s">"nearest"</span><span class="x">,</span> <span class="n">cmap</span><span class="o">=</span><span class="n">cmap</span><span class="x">)</span>
<span class="n">colorbar</span><span class="x">()</span>
<span class="n">xticks</span><span class="x">(</span><span class="n">collect</span><span class="x">(</span><span class="mi">0</span><span class="o">:</span><span class="mi">4</span><span class="x">),</span> <span class="n">labels</span><span class="x">,</span> <span class="n">rotation</span><span class="o">=</span><span class="mi">45</span><span class="x">)</span>
<span class="n">yticks</span><span class="x">(</span><span class="n">collect</span><span class="x">(</span><span class="mi">0</span><span class="o">:</span><span class="mi">4</span><span class="x">),</span> <span class="n">labels</span><span class="x">)</span>
<span class="n">xlabel</span><span class="x">(</span><span class="s">"Actual"</span><span class="x">)</span>
<span class="n">ylabel</span><span class="x">(</span><span class="s">"Prediction"</span><span class="x">)</span>
<span class="n">show</span><span class="x">()</span>
</code></pre></div></div>
<p>A deeper blue means a higher count.</p>
<p><img src="https://spcman.github.io/getting-to-know-julia/images/proj004/output_92_0.png" alt="song artist confusion chart" /></p>
<h3 id="all-predictions">All Predictions</h3>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">test_predictions</span><span class="o">=</span><span class="n">Flux</span><span class="o">.</span><span class="n">onecold</span><span class="x">(</span><span class="n">m</span><span class="x">(</span><span class="n">X_test</span><span class="x">))</span>
<span class="n">test_actual</span><span class="o">=</span><span class="n">Flux</span><span class="o">.</span><span class="n">onecold</span><span class="x">(</span><span class="n">y_test</span><span class="x">)</span>
<span class="n">showall</span><span class="x">(</span><span class="n">DataFrame</span><span class="x">(</span><span class="n">Actual</span> <span class="o">=</span> <span class="n">artist_name</span><span class="o">.</span><span class="x">(</span><span class="n">test_actual</span><span class="x">);</span> <span class="n">Prediction</span> <span class="o">=</span> <span class="n">artist_name</span><span class="o">.</span><span class="x">(</span><span class="n">test_predictions</span><span class="x">)))</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>78×2 DataFrame
│ Row │ Actual │ Prediction │
│ │ [90mString[39m │ [90mString[39m │
├─────┼─────────────────┼─────────────────┤
│ 1 │ Queen │ Queen │
│ 2 │ Eminem │ INXS │
│ 3 │ Michael Jackson │ Michael Jackson │
│ 4 │ The Beatles │ The Beatles │
│ 5 │ Michael Jackson │ Eminem │
│ 6 │ INXS │ INXS │
│ 7 │ Eminem │ Eminem │
│ 8 │ Eminem │ Eminem │
│ 9 │ Eminem │ Queen │
│ 10 │ The Beatles │ The Beatles │
│ 11 │ The Beatles │ The Beatles │
│ 12 │ The Beatles │ Michael Jackson │
│ 13 │ INXS │ Queen │
│ 14 │ The Beatles │ The Beatles │
│ 15 │ Michael Jackson │ Michael Jackson │
│ 16 │ Queen │ Queen │
│ 17 │ Michael Jackson │ Michael Jackson │
│ 18 │ INXS │ INXS │
│ 19 │ Michael Jackson │ Michael Jackson │
│ 20 │ Michael Jackson │ Michael Jackson │
│ 21 │ Michael Jackson │ Michael Jackson │
│ 22 │ Eminem │ Eminem │
│ 23 │ INXS │ INXS │
│ 24 │ INXS │ INXS │
│ 25 │ Queen │ Queen │
│ 26 │ Michael Jackson │ Michael Jackson │
│ 27 │ The Beatles │ Queen │
│ 28 │ Eminem │ Eminem │
│ 29 │ INXS │ Michael Jackson │
│ 30 │ The Beatles │ Michael Jackson │
│ 31 │ Michael Jackson │ The Beatles │
│ 32 │ Queen │ Queen │
│ 33 │ Michael Jackson │ Michael Jackson │
│ 34 │ Michael Jackson │ INXS │
│ 35 │ INXS │ Queen │
│ 36 │ Michael Jackson │ Michael Jackson │
│ 37 │ Queen │ Queen │
│ 38 │ INXS │ Michael Jackson │
│ 39 │ INXS │ INXS │
│ 40 │ Eminem │ Queen │
│ 41 │ The Beatles │ The Beatles │
│ 42 │ INXS │ INXS │
│ 43 │ The Beatles │ The Beatles │
│ 44 │ Michael Jackson │ Michael Jackson │
│ 45 │ Michael Jackson │ Michael Jackson │
│ 46 │ INXS │ Michael Jackson │
│ 47 │ The Beatles │ The Beatles │
│ 48 │ INXS │ The Beatles │
│ 49 │ Eminem │ Eminem │
│ 50 │ Eminem │ Eminem │
│ 51 │ Michael Jackson │ Michael Jackson │
│ 52 │ INXS │ The Beatles │
│ 53 │ The Beatles │ The Beatles │
│ 54 │ Eminem │ Eminem │
│ 55 │ Queen │ Michael Jackson │
│ 56 │ Michael Jackson │ INXS │
│ 57 │ Queen │ Queen │
│ 58 │ Eminem │ Eminem │
│ 59 │ Eminem │ Eminem │
│ 60 │ Queen │ Michael Jackson │
│ 61 │ INXS │ INXS │
│ 62 │ INXS │ Queen │
│ 63 │ INXS │ INXS │
│ 64 │ Queen │ Queen │
│ 65 │ Michael Jackson │ Michael Jackson │
│ 66 │ Queen │ INXS │
│ 67 │ Eminem │ Eminem │
│ 68 │ Eminem │ Eminem │
│ 69 │ Queen │ The Beatles │
│ 70 │ Queen │ Queen │
│ 71 │ The Beatles │ The Beatles │
│ 72 │ The Beatles │ Queen │
│ 73 │ The Beatles │ The Beatles │
│ 74 │ Michael Jackson │ Michael Jackson │
│ 75 │ The Beatles │ INXS │
│ 76 │ Michael Jackson │ Queen │
│ 77 │ Michael Jackson │ Queen │
│ 78 │ INXS │ INXS │
</code></pre></div></div>
<p>Let me know if anything could be improved.</p>
Nigel Adams
Can we predict the artist of the song given the lyrics?
Julia Project - Monte Carlo Simulation for Investment Portfolio Earnings
2019-09-12T00:00:00+00:00
https://spcman.github.io/getting-to-know-julia/monte%20carlo/monte-carlo-investment-earnings
<h2 id="introduction">Introduction</h2>
<p>In this notebook we use Julia to look at typical investment risk profiles and employ the Monte Carlo method with Geometric Brownian motion (GBM) to simulate the growth of an investment portfolio.</p>
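<p>As background, a single GBM path can be stepped forward with the standard discretisation. This is a generic sketch of my own (illustrative drift, volatility and step values, not the notebook's implementation, which appears later):</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code>using Random

# One simulated GBM path: S_{t+1} = S_t * exp((μ - σ²/2)Δt + σ√Δt * Z), Z ~ N(0,1)
function gbm_path(S0, μ, σ, Δt, nsteps)
    S = [S0]
    for _ in 1:nsteps
        push!(S, S[end] * exp((μ - σ^2/2)*Δt + σ*sqrt(Δt)*randn()))
    end
    S
end

gbm_path(100_000.0, 0.05, 0.08, 1.0, 30)   # 30 annual steps from a \$100k start
</code></pre></div></div>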
<p>The simulations do not take into account any ongoing payments or tax. Nor do they encompass other factors such as inflation.</p>
<p>Note, do not rely on any part of this article for your own personal circumstances. This is not financial advice!</p>
<p>With the disclaimer out of the way let’s begin as usual by loading the Julia libraries we’ll need.</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">using</span> <span class="n">DataFrames</span><span class="x">,</span> <span class="n">CSV</span><span class="x">,</span> <span class="n">Distributions</span><span class="x">,</span> <span class="n">PyPlot</span><span class="x">,</span> <span class="n">Dates</span><span class="x">,</span> <span class="n">Statistics</span><span class="x">,</span> <span class="n">StatsFuns</span>
</code></pre></div></div>
<p>Now let’s load in our risk profile data. Users of the financial software XPLAN will recognise headings used in this dataframe. We only actually need the data from the ‘Total’ column being the overall expected growth (Growth + Income) and ‘StdDev’ which is the risk profile’s standard deviation.</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df</span><span class="o">=</span><span class="n">CSV</span><span class="o">.</span><span class="n">read</span><span class="x">(</span><span class="s">"/mnt/juliabox/Monte Carlo/assumptions.csv"</span><span class="x">)</span>
</code></pre></div></div>
<p><img src="https://spcman.github.io/getting-to-know-julia/images/proj005/RiskProfileDataframe.PNG" alt="risk profile dataframe" /></p>
<h2 id="what-is-a-risk-profile">What is a Risk Profile</h2>
<p>A risk profile is an evaluation of an individual’s willingness and ability to take risks. Financial Advisers often fit clients into one of several risk profiles after asking them discovery questions. The risk profile names and values above have been made up but they are indicative of real values. The first risk profile ‘Defensive’ is made up of 15% growth assets and 85% defensive assets; this risk profile would suit a cautious investor who wants to make steady progress without taking too much risk. At the other end of the table a ‘Very Aggressive’ risk profile is made up of 100% growth assets and would suit an individual who is more willing to take a risk to gain higher returns.</p>
<p>The function below plots a normal distribution curve of a given risk profile.</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">function</span><span class="nf"> plot_rp</span><span class="x">(</span><span class="n">rp</span><span class="x">)</span>
<span class="n">μ</span> <span class="o">=</span> <span class="n">df</span><span class="x">[</span><span class="n">rp</span><span class="x">,</span><span class="o">:</span><span class="x">]</span><span class="o">.</span><span class="n">Total</span>
<span class="n">σ</span> <span class="o">=</span> <span class="n">df</span><span class="x">[</span><span class="n">rp</span><span class="x">,</span><span class="o">:</span><span class="x">]</span><span class="o">.</span><span class="n">StdDev</span>
<span class="n">dist</span> <span class="o">=</span> <span class="n">Normal</span><span class="x">(</span><span class="n">μ</span><span class="x">,</span> <span class="n">σ</span><span class="x">)</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">μ</span> <span class="o">-</span> <span class="mi">3</span><span class="n">σ</span> <span class="o">:</span> <span class="mf">0.01</span> <span class="o">:</span> <span class="n">μ</span> <span class="o">+</span> <span class="mi">3</span><span class="n">σ</span>
<span class="n">plot</span><span class="x">(</span><span class="n">x</span><span class="x">,</span> <span class="n">pdf</span><span class="x">(</span><span class="n">dist</span><span class="x">,</span><span class="n">x</span><span class="x">),</span> <span class="n">label</span><span class="o">=</span><span class="n">df</span><span class="x">[</span><span class="n">rp</span><span class="x">,</span><span class="o">:</span><span class="x">]</span><span class="o">.</span><span class="n">ProfileName</span><span class="x">)</span>
<span class="n">legend</span><span class="x">(</span><span class="n">loc</span><span class="o">=</span><span class="s">"upper right"</span><span class="x">,</span> <span class="n">fontsize</span> <span class="o">=</span> <span class="s">"small"</span><span class="x">)</span>
<span class="n">axis</span><span class="x">([</span><span class="o">-</span><span class="mi">30</span><span class="x">,</span><span class="mi">50</span><span class="x">,</span><span class="mi">0</span><span class="x">,</span><span class="mf">0.12</span><span class="x">])</span>
<span class="n">title</span><span class="x">(</span><span class="s">"Normal Distribution"</span><span class="x">)</span>
<span class="n">axvline</span><span class="x">(</span><span class="n">x</span><span class="o">=</span><span class="mi">0</span><span class="x">,</span> <span class="n">color</span><span class="o">=</span><span class="s">"k"</span><span class="x">,</span> <span class="n">linestyle</span><span class="o">=</span><span class="s">"--"</span><span class="x">)</span>
<span class="n">xlabel</span><span class="x">(</span><span class="s">"Annual Growth (%)"</span><span class="x">)</span>
<span class="n">ylabel</span><span class="x">(</span><span class="s">"PDF"</span><span class="x">)</span>
<span class="k">end</span>
</code></pre></div></div>
<p>Let’s plot all the curves and then interpret the output.</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="n">rp</span> <span class="k">in</span> <span class="mi">1</span><span class="o">:</span><span class="n">length</span><span class="x">(</span><span class="n">df</span><span class="x">)</span>
<span class="n">plot_rp</span><span class="x">(</span><span class="n">rp</span><span class="x">)</span>
<span class="k">end</span>
</code></pre></div></div>
<p><img src="https://spcman.github.io/getting-to-know-julia/images/proj005/output_7_0.png" alt="risk profile normal distribution" /></p>
<h2 id="interpretation">Interpretation</h2>
<ul>
<li>
<p>The vertical dotted line shows the boundary of positive growth (i.e. making money) vs negative growth (i.e. losing money).</p>
</li>
<li>
<p>The first and least risky investment profile is ‘Defensive’. You can observe that the probability of achieving the mean total growth of 4.2% is the highest, and most of the bell curve’s area lies in the positive-growth region.</p>
</li>
<li>
<p>The last and most risky investment profile is ‘Very Aggressive’. You can observe that the probability of achieving the mean total growth of 7.27% is the lowest. The elongated bell curve shape means there is scope to earn much higher returns, at the expense of possible negative returns.</p>
</li>
</ul>
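<p>To make the first point concrete, the probability of a losing year can be read straight off the normal CDF: it is the bell-curve area to the left of the dotted line. Here is a minimal sketch using the Distributions package, with the ‘Defensive’ mean of 4.2% and an assumed standard deviation of 3% standing in for the values held in the df table:</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code>using Distributions

# 'Defensive' mean growth (%); the standard deviation here is an assumed placeholder
μ, σ = 4.2, 3.0
dist = Normal(μ, σ)

# Area of the bell curve to the left of the dotted line at zero growth
p_loss = cdf(dist, 0.0)
</code></pre></div></div>
<p>For these assumed numbers the loss probability works out to roughly 8%; riskier profiles trade a lower chance of hitting their (higher) mean for a fatter left tail.</p>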
<p>For more information on probability distributions, <a href="https://en.wikipedia.org/wiki/Probability_distribution">click here</a>.</p>
<h2 id="deterministic-prediction-function">Deterministic Prediction Function</h2>
<p>The following function makes a deterministic prediction of the future portfolio value using the periodic compounding formula A = P(1 + r/n)^(n·t), where:</p>
<ul>
<li>
<p>P is the original principal sum</p>
</li>
<li>
<p>r is the nominal annual interest rate</p>
</li>
<li>
<p>n is the compounding frequency</p>
</li>
<li>
<p>t is the overall length of time the interest is applied (expressed using the same time units as r, usually years)</p>
</li>
</ul>
<p>This prediction assumes no additional contributions and perfect market conditions. For more information see this article on <a href="https://en.wikipedia.org/wiki/Compound_interest">Periodic Compounding</a>.</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">deterministic_predict</span><span class="x">(</span><span class="n">P</span><span class="x">,</span> <span class="n">r</span> <span class="x">,</span> <span class="n">n</span><span class="x">,</span> <span class="n">t</span><span class="x">)</span> <span class="o">=</span> <span class="n">P</span><span class="o">*</span><span class="x">(</span><span class="mi">1</span> <span class="o">+</span> <span class="n">r</span><span class="o">/</span><span class="n">n</span><span class="x">)</span><span class="o">^</span><span class="x">(</span><span class="n">n</span><span class="o">*</span><span class="n">t</span><span class="x">)</span>
</code></pre></div></div>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">#Example 1 from wikipedia as a first sanity check</span>
<span class="c">#Suppose a principal amount of $1,500 is deposited in a bank paying an annual interest rate of 4.3%, compounded quarterly.</span>
<span class="c">#Then the balance after 6 years is found by using the formula above, with P = 1500, r = 0.043 (4.3%), n = 4, and t = 6:</span>
<span class="n">deterministic_predict</span><span class="x">(</span><span class="mf">1500.0</span> <span class="x">,</span> <span class="mf">0.043</span><span class="x">,</span> <span class="mf">4.0</span><span class="x">,</span> <span class="mf">6.0</span><span class="x">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>1938.8368221341054
</code></pre></div></div>
<p>Now let’s apply this function to a retirement saving scenario. Our client is age 40 and wants to retire in 20 years’ time. They currently have $100,000 in their retirement portfolio. What will their balance be like at age 60?</p>
<p>First let’s set a couple of variables and functions that will come in useful.</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">original_principle_sum</span><span class="o">=</span><span class="mi">100000</span> <span class="c"># Initial portfolio value</span>
<span class="n">interest_rate</span><span class="x">(</span><span class="n">rp</span><span class="x">)</span> <span class="o">=</span> <span class="n">df</span><span class="x">[</span><span class="n">rp</span><span class="x">,</span><span class="o">:</span><span class="x">]</span><span class="o">.</span><span class="n">Total</span><span class="o">/</span><span class="mi">100</span> <span class="c"># Simple function to get a risk profile's growth interest rate</span>
</code></pre></div></div>
<p>Here’s another test output of the function for risk profile 1 (Defensive).
For this test let’s assume the interest compounds monthly (12 times a year for each of the 20 years).</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">deterministic_predict</span><span class="x">(</span><span class="n">original_principle_sum</span><span class="x">,</span> <span class="n">interest_rate</span><span class="x">(</span><span class="mi">1</span><span class="x">),</span> <span class="mi">12</span><span class="x">,</span> <span class="mi">20</span><span class="x">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>231297.23315323537
</code></pre></div></div>
<p>By using the Moneysmart <a href="https://www.moneysmart.gov.au/tools-and-resources/calculators-and-apps/compound-interest-calculator">Compound Interest Calculator</a> as a second sanity check we can see our deterministic function is working.</p>
<p><img src="https://spcman.github.io/getting-to-know-julia/images/proj005/moneysmart.PNG" alt="money smart" /></p>
<h2 id="stochastic-prediction-function">Stochastic Prediction Function</h2>
<p>The reality with real share portfolios is that unit prices fluctuate up and down on a daily basis. Price fluctuations are generally more volatile for stocks that have the potential to earn more income for the investor. The function below uses <a href="https://en.wikipedia.org/wiki/Geometric_Brownian_motion">Geometric Brownian motion</a> (GBM) to simulate randomised returns based on the given risk profiles. Additional parameters are built into the function to repeat the GBM simulations over-and-over to generate what is known as a Monte Carlo experiment.</p>
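<p>Stripped of the plotting and DataFrame bookkeeping, each GBM path is driven by a single daily update: multiply yesterday’s price by exp(drift + σ·√Δt·Z), where Z is a standard normal draw. The sketch below is a minimal stand-alone version of that update, assuming an annual growth of 6% and volatility of 10% rather than the risk-profile values used later:</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code>using StatsFuns   # provides norminvccdf

function gbm_path(μ, σ; frequency=252)
    Δt = 1/frequency
    drift = μ*Δt - σ^2*Δt/2     # daily drift of the log-price
    price = 1.0
    for _ in 1:frequency        # one simulated year of daily steps
        price *= exp(drift + σ*sqrt(Δt)*norminvccdf(rand()))
    end
    return price
end

gbm_path(0.06, 0.10)            # one random year-end price for a unit investment
</code></pre></div></div>
<p>Note that this sketch uses the continuous-compounding approximation μΔt for the daily log return; the full function further below derives it from the deterministic prediction instead.</p>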
<p>The animated gifs below show 20 simulations per risk profile. We can see that as we take more risk the simulations become more volatile.</p>
<p><img src="https://spcman.github.io/getting-to-know-julia/images/proj005/rp1.gif" alt="risk profile animated gif" /></p>
<p><img src="https://spcman.github.io/getting-to-know-julia/images/proj005/rp2.gif" alt="risk profile animated gif" /></p>
<p><img src="https://spcman.github.io/getting-to-know-julia/images/proj005/rp3.gif" alt="risk profile animated gif" /></p>
<p><img src="https://spcman.github.io/getting-to-know-julia/images/proj005/rp4.gif" alt="risk profile animated gif" /></p>
<p><img src="https://spcman.github.io/getting-to-know-julia/images/proj005/rp5.gif" alt="risk profile animated gif" /></p>
<p><img src="https://spcman.github.io/getting-to-know-julia/images/proj005/rp6.gif" alt="risk profile animated gif" /></p>
<p>These gifs were generated with the functions below. Let’s take a closer look at the code. We start by setting up the known variables and add a few useful functions at the same time.</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">age</span><span class="o">=</span><span class="mi">40</span> <span class="c"># Age at start of projections</span>
<span class="n">frequency</span> <span class="o">=</span> <span class="mi">252</span> <span class="c"># Assume 252 trading days per year</span>
<span class="n">days</span> <span class="o">=</span> <span class="mi">1</span><span class="o">/</span><span class="n">frequency</span> <span class="c"># Convenient way to express days</span>
<span class="n">yrs_to_days</span><span class="x">(</span><span class="n">x</span><span class="x">)</span><span class="o">=</span><span class="n">x</span><span class="o">*</span><span class="n">frequency</span> <span class="c"># Simple function to convert years to days</span>
<span class="n">sigma</span><span class="x">(</span><span class="n">rp</span><span class="x">)</span> <span class="o">=</span> <span class="n">df</span><span class="x">[</span><span class="n">rp</span><span class="x">,</span><span class="o">:</span><span class="x">]</span><span class="o">.</span><span class="n">StdDev</span><span class="o">/</span><span class="mi">100</span> <span class="c"># Simple function to get a risk profile's standard deviation</span>
</code></pre></div></div>
<p>Now we build the Monte Carlo function. Calling the function produces a matplotlib (PyPlot) chart based on the input parameters. I’ve included some comments in the code, but if you need more in-depth insight I recommend <a href="https://www.youtube.com/watch?v=3gcLRU24-w0&t=208s">this great video</a>, which gave me the math needed.</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">function</span><span class="nf"> montecarlo</span><span class="x">(</span><span class="n">rp</span><span class="x">,</span> <span class="n">N</span><span class="x">,</span> <span class="n">iterations</span><span class="x">,</span> <span class="n">show_Q</span><span class="x">)</span>
<span class="c"># rp = the index of the risk profile to use</span>
<span class="c"># N = Number of years forward to project</span>
<span class="c"># iterations - no of times to iterate and produce a simulation</span>
<span class="c"># show_Q - if True, so quantile lines</span>
<span class="n">growth</span> <span class="o">=</span> <span class="n">interest_rate</span><span class="x">(</span><span class="n">rp</span><span class="x">)</span>
<span class="c"># Periodic Daily Return (PDR)</span>
<span class="n">pdr</span> <span class="o">=</span> <span class="n">log</span><span class="x">(</span><span class="n">deterministic_predict</span><span class="x">(</span><span class="mi">1</span><span class="x">,</span> <span class="n">growth</span><span class="x">,</span> <span class="n">frequency</span><span class="x">,</span> <span class="mi">2</span><span class="o">*</span><span class="n">days</span><span class="x">)</span> <span class="o">/</span> <span class="n">deterministic_predict</span><span class="x">(</span><span class="mi">1</span><span class="x">,</span> <span class="n">growth</span><span class="x">,</span> <span class="n">frequency</span><span class="x">,</span> <span class="mi">1</span><span class="o">*</span><span class="n">days</span><span class="x">))</span>
<span class="n">pdr_std</span> <span class="o">=</span> <span class="n">sigma</span><span class="x">(</span><span class="n">rp</span><span class="x">)</span> <span class="o">*</span> <span class="n">sqrt</span><span class="x">(</span><span class="n">days</span><span class="x">)</span>
<span class="n">pdr_var</span> <span class="o">=</span> <span class="n">pdr_std</span><span class="o">^</span><span class="mi">2</span>
<span class="n">drift</span> <span class="o">=</span> <span class="n">pdr</span> <span class="o">-</span> <span class="x">(</span><span class="n">pdr_var</span><span class="o">/</span><span class="mi">2</span><span class="x">)</span>
<span class="n">predictions_all</span><span class="o">=</span><span class="x">[]</span>
<span class="n">axis</span><span class="x">([</span><span class="mi">40</span><span class="x">,</span><span class="mi">60</span><span class="x">,</span><span class="mi">0</span><span class="x">,</span><span class="mi">800000</span><span class="x">])</span>
<span class="n">title</span><span class="x">(</span><span class="n">df</span><span class="x">[</span><span class="n">rp</span><span class="x">,</span><span class="o">:</span><span class="x">]</span><span class="o">.</span><span class="n">ProfileName</span><span class="x">)</span>
<span class="n">xlabel</span><span class="x">(</span><span class="s">"Age"</span><span class="x">)</span>
<span class="n">ylabel</span><span class="x">(</span><span class="s">"Portfolio Value"</span><span class="x">)</span>
<span class="k">for</span> <span class="n">s</span> <span class="k">in</span> <span class="mi">1</span><span class="o">:</span><span class="n">iterations</span>
<span class="n">predictions</span><span class="o">=</span><span class="x">[]</span>
<span class="kd">global</span> <span class="n">df_pred</span><span class="o">=</span><span class="n">DataFrame</span><span class="x">(</span><span class="n">Age</span> <span class="o">=</span> <span class="kt">Float64</span><span class="x">[],</span> <span class="n">MC_Price</span> <span class="o">=</span> <span class="kt">Float64</span><span class="x">[],</span><span class="n">MC_Balance</span> <span class="o">=</span> <span class="kt">Float64</span><span class="x">[],</span> <span class="n">Deterministic_Balance</span> <span class="o">=</span> <span class="kt">Float64</span><span class="x">[])</span>
<span class="n">last_price</span> <span class="o">=</span> <span class="mi">1</span>
<span class="k">for</span> <span class="n">i</span> <span class="k">in</span> <span class="mi">0</span><span class="o">:</span><span class="n">yrs_to_days</span><span class="x">(</span><span class="n">N</span><span class="x">)</span>
<span class="n">i</span><span class="o">==</span><span class="mi">0</span> <span class="o">?</span> <span class="n">mc_price</span><span class="o">=</span><span class="mi">1</span> <span class="o">:</span> <span class="n">mc_price</span><span class="o">=</span><span class="n">last_price</span><span class="o">*</span><span class="n">exp</span><span class="x">(</span><span class="n">drift</span><span class="o">+</span><span class="n">pdr_std</span><span class="o">*</span><span class="n">norminvccdf</span><span class="x">(</span><span class="n">rand</span><span class="x">()))</span>
<span class="n">push!</span><span class="x">(</span><span class="n">df_pred</span><span class="x">,</span> <span class="x">[</span><span class="n">age</span><span class="o">+</span><span class="n">i</span><span class="o">*</span><span class="n">days</span><span class="x">,</span>
<span class="n">mc_price</span><span class="x">,</span>
<span class="n">original_principle_sum</span><span class="o">*</span><span class="n">mc_price</span><span class="x">,</span>
<span class="n">deterministic_predict</span><span class="x">(</span><span class="n">original_principle_sum</span><span class="x">,</span> <span class="n">growth</span><span class="x">,</span> <span class="n">frequency</span><span class="x">,</span> <span class="n">i</span><span class="o">*</span><span class="n">days</span><span class="x">)])</span>
<span class="n">push!</span><span class="x">(</span><span class="n">predictions</span><span class="x">,</span> <span class="n">original_principle_sum</span><span class="o">*</span><span class="n">mc_price</span><span class="x">)</span>
<span class="n">last_price</span> <span class="o">=</span> <span class="n">mc_price</span>
<span class="k">end</span>
<span class="n">s</span> <span class="o">==</span> <span class="mi">1</span> <span class="o">?</span> <span class="n">predictions_all</span> <span class="o">=</span> <span class="n">predictions</span> <span class="o">:</span> <span class="n">predictions_all</span> <span class="o">=</span> <span class="n">hcat</span><span class="x">(</span><span class="n">predictions_all</span><span class="x">,</span> <span class="n">predictions</span><span class="x">)</span>
<span class="n">plot</span><span class="x">(</span><span class="n">df_pred</span><span class="x">[</span><span class="o">:</span><span class="n">Age</span><span class="x">],</span> <span class="n">df_pred</span><span class="x">[</span><span class="o">:</span><span class="n">MC_Balance</span><span class="x">],</span> <span class="n">color</span><span class="o">=</span><span class="s">"#B8BFC5"</span><span class="x">,</span> <span class="n">label</span><span class="o">=</span><span class="s">"Monte Carlo Iteration"</span><span class="x">)</span>
<span class="k">end</span>
<span class="k">if</span> <span class="n">show_Q</span>
<span class="c">#Show quantile predictions</span>
<span class="n">df_Q</span> <span class="o">=</span> <span class="n">DataFrame</span><span class="x">(</span><span class="n">Age</span> <span class="o">=</span> <span class="kt">Float64</span><span class="x">[],</span> <span class="n">Q1</span> <span class="o">=</span> <span class="kt">Float64</span><span class="x">[],</span> <span class="n">Q5</span> <span class="o">=</span> <span class="kt">Float64</span><span class="x">[],</span> <span class="n">Q9</span> <span class="o">=</span> <span class="kt">Float64</span><span class="x">[])</span>
<span class="k">for</span> <span class="n">i</span> <span class="k">in</span> <span class="mi">1</span><span class="o">:</span><span class="n">yrs_to_days</span><span class="x">(</span><span class="n">N</span><span class="x">)</span>
<span class="n">push!</span><span class="x">(</span><span class="n">df_Q</span><span class="x">,</span> <span class="x">[</span><span class="n">age</span><span class="o">+</span><span class="n">i</span><span class="o">*</span><span class="n">days</span><span class="x">,</span>
<span class="n">quantile</span><span class="x">(</span><span class="n">predictions_all</span><span class="x">[</span><span class="n">i</span><span class="x">,</span><span class="o">:</span><span class="x">],</span><span class="mf">0.1</span><span class="x">),</span>
<span class="n">quantile</span><span class="x">(</span><span class="n">predictions_all</span><span class="x">[</span><span class="n">i</span><span class="x">,</span><span class="o">:</span><span class="x">],</span><span class="mf">0.5</span><span class="x">),</span>
<span class="n">quantile</span><span class="x">(</span><span class="n">predictions_all</span><span class="x">[</span><span class="n">i</span><span class="x">,</span><span class="o">:</span><span class="x">],</span><span class="mf">0.9</span><span class="x">)])</span>
<span class="k">end</span>
<span class="n">plot</span><span class="x">(</span><span class="n">df_Q</span><span class="x">[</span><span class="o">:</span><span class="n">Age</span><span class="x">],</span> <span class="n">df_Q</span><span class="x">[</span><span class="o">:</span><span class="n">Q1</span><span class="x">],</span> <span class="n">color</span><span class="o">=</span><span class="s">"r"</span><span class="x">,</span> <span class="n">label</span><span class="o">=</span><span class="s">"10th Percentile"</span><span class="x">)</span>
<span class="n">plot</span><span class="x">(</span><span class="n">df_Q</span><span class="x">[</span><span class="o">:</span><span class="n">Age</span><span class="x">],</span> <span class="n">df_Q</span><span class="x">[</span><span class="o">:</span><span class="n">Q5</span><span class="x">],</span> <span class="n">color</span><span class="o">=</span><span class="s">"b"</span><span class="x">,</span> <span class="n">label</span><span class="o">=</span><span class="s">"50th Percentile"</span><span class="x">)</span>
<span class="n">plot</span><span class="x">(</span><span class="n">df_Q</span><span class="x">[</span><span class="o">:</span><span class="n">Age</span><span class="x">],</span> <span class="n">df_Q</span><span class="x">[</span><span class="o">:</span><span class="n">Q9</span><span class="x">],</span> <span class="n">color</span><span class="o">=</span><span class="s">"g"</span><span class="x">,</span> <span class="n">label</span><span class="o">=</span><span class="s">"90th Percentile"</span><span class="x">)</span>
<span class="k">else</span>
<span class="n">plot</span><span class="x">(</span><span class="n">df_pred</span><span class="x">[</span><span class="o">:</span><span class="n">Age</span><span class="x">],</span> <span class="n">df_pred</span><span class="x">[</span><span class="o">:</span><span class="n">Deterministic_Balance</span><span class="x">],</span> <span class="n">color</span><span class="o">=</span><span class="s">"b"</span><span class="x">,</span> <span class="n">label</span><span class="o">=</span><span class="s">"Deterministic Prediction"</span><span class="x">)</span>
<span class="k">end</span>
<span class="k">end</span>
</code></pre></div></div>
<p>The following code was used to produce a sequence of PNG image files that I later used to create the animated gifs above. I used a free app for the Mac called <a href="https://apps.apple.com/au/app/picgif-lite/id844918735?mt=12">PicGIF lite</a> to generate the final animated gifs.</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">using</span> <span class="n">PyCall</span>
<span class="nd">@pyimport</span> <span class="n">matplotlib</span><span class="o">.</span><span class="n">animation</span> <span class="n">as</span> <span class="n">anim</span>
</code></pre></div></div>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">fig</span> <span class="o">=</span> <span class="n">figure</span><span class="x">(</span><span class="n">figsize</span><span class="o">=</span><span class="x">(</span><span class="mi">5</span><span class="x">,</span><span class="mi">4</span><span class="x">))</span>
<span class="k">for</span> <span class="n">rp</span> <span class="k">in</span> <span class="mi">1</span><span class="o">:</span><span class="n">length</span><span class="x">(</span><span class="n">df</span><span class="x">)</span>
<span class="n">withfig</span><span class="x">(</span><span class="n">fig</span><span class="x">)</span> <span class="k">do</span>
<span class="k">for</span> <span class="n">k</span> <span class="k">in</span> <span class="mi">1</span><span class="o">:</span><span class="mi">20</span>
<span class="n">clf</span><span class="x">()</span>
<span class="n">montecarlo</span><span class="x">(</span><span class="n">rp</span><span class="x">,</span> <span class="mi">20</span><span class="x">,</span> <span class="mi">1</span><span class="x">,</span> <span class="nb">false</span><span class="x">)</span>
<span class="n">savefig</span><span class="x">(</span><span class="s">"rp_"</span> <span class="o">*</span> <span class="n">string</span><span class="x">(</span><span class="n">rp</span><span class="x">)</span> <span class="o">*</span> <span class="s">"_"</span> <span class="o">*</span> <span class="n">string</span><span class="x">(</span><span class="n">k</span><span class="x">),</span> <span class="n">bbox_inches</span><span class="o">=</span><span class="s">"tight"</span><span class="x">)</span>
<span class="k">end</span>
<span class="k">end</span>
<span class="k">end</span>
</code></pre></div></div>
<p>By running many simulations (see grey lines below) we can take the mean and quantiles of each day’s simulated balances, and after a while smooth, deterministic-looking prediction curves start to emerge. The area between the green and blue lines can be interpreted as ‘good’ market conditions; the area between the blue and red lines as ‘bad’ market conditions.</p>
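<p>The percentile bands themselves are just row-wise quantiles over the matrix of simulated balances. Here is a minimal sketch of that reduction, assuming predictions_all holds one row per day and one column per simulation, as built inside the montecarlo function:</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code>using Statistics   # provides quantile

# predictions_all: days × simulations matrix
q10 = [quantile(predictions_all[i, :], 0.1) for i in 1:size(predictions_all, 1)]
q50 = [quantile(predictions_all[i, :], 0.5) for i in 1:size(predictions_all, 1)]
q90 = [quantile(predictions_all[i, :], 0.9) for i in 1:size(predictions_all, 1)]
</code></pre></div></div>
<p>The more iterations we run, the smoother these three curves become.</p>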
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Terminal command line to zip up the PNG files.</span>
<span class="c"># zip rp.zip rp*</span>
</code></pre></div></div>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">montecarlo</span><span class="x">(</span><span class="mi">1</span><span class="x">,</span> <span class="mi">20</span><span class="x">,</span> <span class="mi">100</span><span class="x">,</span> <span class="nb">true</span><span class="x">)</span>
</code></pre></div></div>
<p><img src="https://spcman.github.io/getting-to-know-julia/images/proj005/output_26_0.png" alt="risk profile monte carlo" /></p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">montecarlo</span><span class="x">(</span><span class="mi">2</span><span class="x">,</span> <span class="mi">20</span><span class="x">,</span> <span class="mi">100</span><span class="x">,</span> <span class="nb">true</span><span class="x">)</span>
</code></pre></div></div>
<p><img src="https://spcman.github.io/getting-to-know-julia/images/proj005/output_27_0.png" alt="risk profile monte carlo" /></p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">montecarlo</span><span class="x">(</span><span class="mi">3</span><span class="x">,</span> <span class="mi">20</span><span class="x">,</span> <span class="mi">100</span><span class="x">,</span> <span class="nb">true</span><span class="x">)</span>
</code></pre></div></div>
<p><img src="https://spcman.github.io/getting-to-know-julia/images/proj005/output_28_0.png" alt="risk profile monte carlo" /></p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">montecarlo</span><span class="x">(</span><span class="mi">4</span><span class="x">,</span> <span class="mi">20</span><span class="x">,</span> <span class="mi">100</span><span class="x">,</span> <span class="nb">true</span><span class="x">)</span>
</code></pre></div></div>
<p><img src="https://spcman.github.io/getting-to-know-julia/images/proj005/output_29_0.png" alt="risk profile monte carlo" /></p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">montecarlo</span><span class="x">(</span><span class="mi">5</span><span class="x">,</span> <span class="mi">20</span><span class="x">,</span> <span class="mi">100</span><span class="x">,</span> <span class="nb">true</span><span class="x">)</span>
</code></pre></div></div>
<p><img src="https://spcman.github.io/getting-to-know-julia/images/proj005/output_30_0.png" alt="risk profile monte carlo" /></p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">montecarlo</span><span class="x">(</span><span class="mi">6</span><span class="x">,</span> <span class="mi">20</span><span class="x">,</span> <span class="mi">100</span><span class="x">,</span> <span class="nb">true</span><span class="x">)</span>
</code></pre></div></div>
<p><img src="https://spcman.github.io/getting-to-know-julia/images/proj005/output_31_0.png" alt="risk profile monte carlo" /></p>Nigel AdamsA look at Monte Carlo with Geometric Brownian motion (GBM)Julia Flux Convolutional Neural Network Explained2019-09-01T00:00:00+00:002019-09-01T00:00:00+00:00https://spcman.github.io/getting-to-know-julia/deep-learning/vision/flux-cnn-zoo<p>In this blog post we’ll breakdown the convolutional neural network (CNN) demo given in the <a href="https://github.com/FluxML/model-zoo/blob/master/vision/mnist/conv.jl">Flux Model Zoo</a>. We’ll pay most attention to the CNN model build-up and will skip over some of the data preparation and training code.</p>
<p>The objective is to train a CNN to recognize hand-written digits using the famous MNIST dataset.</p>
<p>If you are new to CNNs I recommend watching all the videos below to obtain the concepts needed to understand this post. Note that some of the videos dive into Keras code, but it’s actually very comparable to Flux.</p>
<p><a href="https://www.youtube.com/watch?v=YRhxdVk_sIs">Convolutional Neural Networks (CNNs) explained</a></p>
<p><a href="https://www.youtube.com/watch?v=qSTv_m-KFk0&t=611s">Zero Padding in Convolutional Neural Networks explained</a></p>
<p><a href="https://www.youtube.com/watch?v=ZjM_XQa5s6s&t=407s">Max Pooling in Convolutional Neural Networks explained</a></p>
<p><a href="https://www.youtube.com/watch?v=U4WB9p6ODjM">Batch Size in a Neural Network explained</a></p>
<p>OK, we’ve got the concepts let’s dive into the Flux example. The first block of code prepares the data for training.</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Classifies MNIST digits with a convolutional network.</span>
<span class="c"># Writes out saved model to the file "mnist_conv.bson".</span>
<span class="c"># Demonstrates basic model construction, training, saving,</span>
<span class="c"># conditional early-exit, and learning rate scheduling.</span>
<span class="c">#</span>
<span class="c"># This model, while simple, should hit around 99% test</span>
<span class="c"># accuracy after training for approximately 20 epochs.</span>
<span class="k">using</span> <span class="n">Flux</span><span class="x">,</span> <span class="n">Flux</span><span class="o">.</span><span class="n">Data</span><span class="o">.</span><span class="n">MNIST</span><span class="x">,</span> <span class="n">Statistics</span>
<span class="k">using</span> <span class="n">Flux</span><span class="o">:</span> <span class="n">onehotbatch</span><span class="x">,</span> <span class="n">onecold</span><span class="x">,</span> <span class="n">crossentropy</span><span class="x">,</span> <span class="n">throttle</span>
<span class="k">using</span> <span class="n">Base</span><span class="o">.</span><span class="n">Iterators</span><span class="o">:</span> <span class="n">repeated</span><span class="x">,</span> <span class="n">partition</span>
<span class="k">using</span> <span class="n">Printf</span><span class="x">,</span> <span class="n">BSON</span>
<span class="c"># Load labels and images from Flux.Data.MNIST</span>
<span class="nd">@info</span><span class="x">(</span><span class="s">"Loading data set"</span><span class="x">)</span>
<span class="n">train_labels</span> <span class="o">=</span> <span class="n">MNIST</span><span class="o">.</span><span class="n">labels</span><span class="x">()</span>
<span class="n">train_imgs</span> <span class="o">=</span> <span class="n">MNIST</span><span class="o">.</span><span class="n">images</span><span class="x">()</span>
<span class="c"># Bundle images together with labels and group into minibatchess</span>
<span class="k">function</span><span class="nf"> make_minibatch</span><span class="x">(</span><span class="n">X</span><span class="x">,</span> <span class="n">Y</span><span class="x">,</span> <span class="n">idxs</span><span class="x">)</span>
<span class="n">X_batch</span> <span class="o">=</span> <span class="kt">Array</span><span class="x">{</span><span class="kt">Float32</span><span class="x">}(</span><span class="nb">undef</span><span class="x">,</span> <span class="n">size</span><span class="x">(</span><span class="n">X</span><span class="x">[</span><span class="mi">1</span><span class="x">])</span><span class="o">...</span><span class="x">,</span> <span class="mi">1</span><span class="x">,</span> <span class="n">length</span><span class="x">(</span><span class="n">idxs</span><span class="x">))</span>
<span class="k">for</span> <span class="n">i</span> <span class="k">in</span> <span class="mi">1</span><span class="o">:</span><span class="n">length</span><span class="x">(</span><span class="n">idxs</span><span class="x">)</span>
<span class="n">X_batch</span><span class="x">[</span><span class="o">:</span><span class="x">,</span> <span class="o">:</span><span class="x">,</span> <span class="o">:</span><span class="x">,</span> <span class="n">i</span><span class="x">]</span> <span class="o">=</span> <span class="kt">Float32</span><span class="o">.</span><span class="x">(</span><span class="n">X</span><span class="x">[</span><span class="n">idxs</span><span class="x">[</span><span class="n">i</span><span class="x">]])</span>
<span class="k">end</span>
<span class="n">Y_batch</span> <span class="o">=</span> <span class="n">onehotbatch</span><span class="x">(</span><span class="n">Y</span><span class="x">[</span><span class="n">idxs</span><span class="x">],</span> <span class="mi">0</span><span class="o">:</span><span class="mi">9</span><span class="x">)</span>
<span class="k">return</span> <span class="x">(</span><span class="n">X_batch</span><span class="x">,</span> <span class="n">Y_batch</span><span class="x">)</span>
<span class="k">end</span>
<span class="n">batch_size</span> <span class="o">=</span> <span class="mi">128</span>
<span class="n">mb_idxs</span> <span class="o">=</span> <span class="n">partition</span><span class="x">(</span><span class="mi">1</span><span class="o">:</span><span class="n">length</span><span class="x">(</span><span class="n">train_imgs</span><span class="x">),</span> <span class="n">batch_size</span><span class="x">)</span>
<span class="n">train_set</span> <span class="o">=</span> <span class="x">[</span><span class="n">make_minibatch</span><span class="x">(</span><span class="n">train_imgs</span><span class="x">,</span> <span class="n">train_labels</span><span class="x">,</span> <span class="n">i</span><span class="x">)</span> <span class="k">for</span> <span class="n">i</span> <span class="k">in</span> <span class="n">mb_idxs</span><span class="x">]</span>
<span class="c"># Prepare test set as one giant minibatch:</span>
<span class="n">test_imgs</span> <span class="o">=</span> <span class="n">MNIST</span><span class="o">.</span><span class="n">images</span><span class="x">(</span><span class="o">:</span><span class="n">test</span><span class="x">)</span>
<span class="n">test_labels</span> <span class="o">=</span> <span class="n">MNIST</span><span class="o">.</span><span class="n">labels</span><span class="x">(</span><span class="o">:</span><span class="n">test</span><span class="x">)</span>
<span class="n">test_set</span> <span class="o">=</span> <span class="n">make_minibatch</span><span class="x">(</span><span class="n">test_imgs</span><span class="x">,</span> <span class="n">test_labels</span><span class="x">,</span> <span class="mi">1</span><span class="o">:</span><span class="n">length</span><span class="x">(</span><span class="n">test_imgs</span><span class="x">))</span>
</code></pre></div></div>
<p>Let’s pause here to look at how the training and test data has been arranged. As usual in Flux the training data is arranged as a tuple of x training data and y labels. Let’s verify.</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">typeof</span><span class="x">(</span><span class="n">train_set</span><span class="x">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Array{Tuple{Array{Float32,4},Flux.OneHotMatrix{Array{Flux.OneHotVector,1}}},1}
</code></pre></div></div>
<p>We see that the x part of the tuple is a 4 dimensional Float32 array and the y part is a Flux.OneHotMatrix (a batch of Flux.OneHotVectors).</p>
<p>Let’s take a look at the size of first training batch.</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">size</span><span class="x">(</span><span class="n">train_set</span><span class="x">[</span><span class="mi">1</span><span class="x">][</span><span class="mi">1</span><span class="x">])</span> <span class="c"># training data Float32</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(28, 28, 1, 128)
</code></pre></div></div>
<p>It is important to note these dimensions are arranged in WHCN order, standing for Width, Height, Channels and Number, where Number is the number of observations in the minibatch.</p>
<p>So as expected for MNIST, each image is W=28 pixels x H=28 pixels.</p>
<p>C = 1 as there is only one channel for the grey scale intensity.</p>
<p>N = 128 as that is the batch size.</p>
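<p>To get a feel for the WHCN layout without touching Flux, here’s a quick sketch using a plain Julia array as a stand-in for one minibatch (the names are mine, not from the model zoo code):</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code># A toy WHCN batch: 28x28 "images", 1 grey-scale channel, minibatch of 128
batch = rand(Float32, 28, 28, 1, 128)
w, h, c, n = size(batch)
println((w, h, c, n))      # (28, 28, 1, 128)
# Slice out the 5th image of the batch as an ordinary 28x28 matrix
img5 = batch[:, :, 1, 5]
println(size(img5))        # (28, 28)
</code></pre></div></div>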
<p>Now let’s have a look at the size of the first batch of y labels.</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">size</span><span class="x">(</span><span class="n">train_set</span><span class="x">[</span><span class="mi">1</span><span class="x">][</span><span class="mi">2</span><span class="x">])</span> <span class="c"># OneHotVector labels</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(10, 128)
</code></pre></div></div>
<p>Each OneHotVector in the batch encodes the labelled digit; i.e. which of the digits 0 through 9 it is. You can see the first OneHotVector in the first batch with the following code.</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">train_set</span><span class="x">[</span><span class="mi">1</span><span class="x">][</span><span class="mi">2</span><span class="x">][</span><span class="o">:</span><span class="x">,</span><span class="mi">1</span><span class="x">]</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>10-element Flux.OneHotVector:
false
false
false
false
false
true
false
false
false
false
</code></pre></div></div>
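<p>One-hot encoding itself is nothing magic. Here’s a rough Base-Julia stand-in for what <code class="language-plaintext highlighter-rouge">onehotbatch</code> and <code class="language-plaintext highlighter-rouge">onecold</code> do (the Flux versions are more general and GPU-friendly, so treat this as a sketch):</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Encode a label as a Bool vector with a single true; decode with argmax
onehot(label, labels) = labels .== label
onecold(v, labels) = labels[argmax(v)]

labels = 0:9
v = onehot(5, labels)          # true only at position 6, matching the output above
println(onecold(v, labels))    # 5
</code></pre></div></div>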
<h2 id="flux-cnn-model-explained">Flux CNN Model Explained</h2>
<p>Here’s the next block of code from the model zoo that we’re mostly interested in: -</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Define our model. We will use a simple convolutional architecture with</span>
<span class="c"># three iterations of Conv -> ReLU -> MaxPool, followed by a final Dense</span>
<span class="c"># layer that feeds into a softmax probability output.</span>
<span class="nd">@info</span><span class="x">(</span><span class="s">"Constructing model..."</span><span class="x">)</span>
<span class="n">model</span> <span class="o">=</span> <span class="n">Chain</span><span class="x">(</span>
<span class="c"># First convolution, operating upon a 28x28 image</span>
<span class="n">Conv</span><span class="x">((</span><span class="mi">3</span><span class="x">,</span> <span class="mi">3</span><span class="x">),</span> <span class="mi">1</span><span class="o">=></span><span class="mi">16</span><span class="x">,</span> <span class="n">pad</span><span class="o">=</span><span class="x">(</span><span class="mi">1</span><span class="x">,</span><span class="mi">1</span><span class="x">),</span> <span class="n">relu</span><span class="x">),</span>
<span class="n">x</span> <span class="o">-></span> <span class="n">maxpool</span><span class="x">(</span><span class="n">x</span><span class="x">,</span> <span class="x">(</span><span class="mi">2</span><span class="x">,</span><span class="mi">2</span><span class="x">)),</span>
<span class="c"># Second convolution, operating upon a 14x14 image</span>
<span class="n">Conv</span><span class="x">((</span><span class="mi">3</span><span class="x">,</span> <span class="mi">3</span><span class="x">),</span> <span class="mi">16</span><span class="o">=></span><span class="mi">32</span><span class="x">,</span> <span class="n">pad</span><span class="o">=</span><span class="x">(</span><span class="mi">1</span><span class="x">,</span><span class="mi">1</span><span class="x">),</span> <span class="n">relu</span><span class="x">),</span>
<span class="n">x</span> <span class="o">-></span> <span class="n">maxpool</span><span class="x">(</span><span class="n">x</span><span class="x">,</span> <span class="x">(</span><span class="mi">2</span><span class="x">,</span><span class="mi">2</span><span class="x">)),</span>
<span class="c"># Third convolution, operating upon a 7x7 image</span>
<span class="n">Conv</span><span class="x">((</span><span class="mi">3</span><span class="x">,</span> <span class="mi">3</span><span class="x">),</span> <span class="mi">32</span><span class="o">=></span><span class="mi">32</span><span class="x">,</span> <span class="n">pad</span><span class="o">=</span><span class="x">(</span><span class="mi">1</span><span class="x">,</span><span class="mi">1</span><span class="x">),</span> <span class="n">relu</span><span class="x">),</span>
<span class="n">x</span> <span class="o">-></span> <span class="n">maxpool</span><span class="x">(</span><span class="n">x</span><span class="x">,</span> <span class="x">(</span><span class="mi">2</span><span class="x">,</span><span class="mi">2</span><span class="x">)),</span>
<span class="c"># Reshape 3d tensor into a 2d one, at this point it should be (3, 3, 32, N)</span>
<span class="c"># which is where we get the 288 in the `Dense` layer below:</span>
<span class="n">x</span> <span class="o">-></span> <span class="n">reshape</span><span class="x">(</span><span class="n">x</span><span class="x">,</span> <span class="o">:</span><span class="x">,</span> <span class="n">size</span><span class="x">(</span><span class="n">x</span><span class="x">,</span> <span class="mi">4</span><span class="x">)),</span>
<span class="n">Dense</span><span class="x">(</span><span class="mi">288</span><span class="x">,</span> <span class="mi">10</span><span class="x">),</span>
<span class="c"># Finally, softmax to get nice probabilities</span>
<span class="n">softmax</span><span class="x">,</span>
<span class="x">)</span>
</code></pre></div></div>
<h3 id="layer-1">Layer 1</h3>
<p><code class="language-plaintext highlighter-rouge">Conv((3, 3), 1=>16, pad=(1,1), relu),</code></p>
<p>The first layer can be broken down as follows: -</p>
<p><code class="language-plaintext highlighter-rouge">(3,3)</code> is the convolution filter size (3x3) that will slide over the image detecting new features.</p>
<p><code class="language-plaintext highlighter-rouge">1=>16</code> maps the number of input channels to the number of output channels. The input has 1 channel, recalling that one batch is of size 28x28x1x128. The output is 16, meaning the layer produces 16 feature-map channels for every training digit in the batch.</p>
<p><code class="language-plaintext highlighter-rouge">pad=(1,1)</code> pads a single border of zeros around each image so that the dimensions of the convolution output can remain at 28x28.</p>
<p><code class="language-plaintext highlighter-rouge">relu</code> is our activation function.</p>
<p>The output from this layer only can be viewed with <code class="language-plaintext highlighter-rouge">model[1](train_set[1][1])</code> and has the dimensions 28×28×16×128.</p>
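<p>If you want to check the spatial arithmetic yourself, the usual convolution output formula is out = (in - k + 2·pad) ÷ stride + 1. A throwaway helper (my own, not part of the model zoo code) confirms why the padding keeps the image at 28x28:</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Convolution output size: out = (in - k + 2p) ÷ s + 1
conv_out(in_sz, k; pad=0, stride=1) = (in_sz - k + 2pad) ÷ stride + 1
println(conv_out(28, 3, pad=1))   # 28 -- the (1,1) padding preserves the size
println(conv_out(28, 3))          # 26 -- without padding we would lose a border
</code></pre></div></div>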
<h3 id="layer-2">Layer 2</h3>
<p><code class="language-plaintext highlighter-rouge">x -> maxpool(x, (2,2)),</code></p>
<p>Convolutional layers are generally followed by a maxpool layer. In our case the parameter <code class="language-plaintext highlighter-rouge">(2,2)</code> is the window size that slides over x, halving each spatial dimension whilst retaining the strongest activations for learning.</p>
<p>The output up to this layer can be viewed with <code class="language-plaintext highlighter-rouge">model[1:2](train_set[1][1])</code> and has the output dimensions 14×14×16×128.</p>
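<p>To see exactly what the pooling window does, here is a hand-rolled 2x2 maxpool for a single matrix (purely illustrative; Flux’s maxpool operates on the whole 4-d tensor):</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Take the maximum of each non-overlapping 2x2 block
function maxpool2x2(x)
    h, w = size(x) .÷ 2
    [maximum(x[2i-1:2i, 2j-1:2j]) for i in 1:h, j in 1:w]
end

A = [1 2 5 6;
     3 4 7 8;
     9 1 2 3;
     1 1 4 4]
println(maxpool2x2(A))   # [4 8; 9 4]
</code></pre></div></div>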
<h3 id="layer-3">Layer 3</h3>
<p><code class="language-plaintext highlighter-rouge">Conv((3, 3), 16=>32, pad=(1,1), relu),</code></p>
<p>This is the second convolution operating on the output from layer 2.</p>
<p><code class="language-plaintext highlighter-rouge">Conv((3, 3),</code> is the same filter size as before.</p>
<p><code class="language-plaintext highlighter-rouge">16=>32</code> This time the input is 16 (from layer 2). The output size of the layer will be 32.</p>
<p>The padding, filter size and activation remains the same as before.</p>
<p>The output up to this layer can be viewed with <code class="language-plaintext highlighter-rouge">model[1:3](train_set[1][1])</code> and has the output dimensions 14×14×32×128.</p>
<h3 id="layer-4">Layer 4</h3>
<p><code class="language-plaintext highlighter-rouge">x -> maxpool(x, (2,2)),</code></p>
<p>Maxpool reduces the dimensionality in half again whilst retaining the most important feature information for learning.</p>
<p>The output up to this layer can be viewed with <code class="language-plaintext highlighter-rouge">model[1:4](train_set[1][1])</code> and has the output dimensions 7×7×32×128.</p>
<h3 id="layers-5--6">Layers 5 & 6</h3>
<p><code class="language-plaintext highlighter-rouge">Conv((3, 3), 32=>32, pad=(1,1), relu),</code></p>
<p><code class="language-plaintext highlighter-rouge">x -> maxpool(x, (2,2)),</code></p>
<p>Perform a final convolution and maxpool. The output from layer 6 is 3×3×32×128.</p>
<h3 id="layer-7">Layer 7</h3>
<p><code class="language-plaintext highlighter-rouge">x -> reshape(x, :, size(x, 4)),</code></p>
<p>The reshape layer effectively flattens the data from 4 dimensions down to 2, making it suitable for the dense layer and training.</p>
<p>The output up to this layer can be viewed with <code class="language-plaintext highlighter-rouge">model[1:7](train_set[1][1])</code> and has the output dimensions 288×128. If you’re wondering where 288 comes from, it is the product of the layer 6 output dimensions; i.e. 3x3x32 = 288.</p>
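<p>You can verify the flattening on a plain array of the same shape, no trained model needed:</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Flatten a (3, 3, 32, 128) tensor to (288, 128), one column per image
x = rand(Float32, 3, 3, 32, 128)
flat = reshape(x, :, size(x, 4))
println(size(flat))   # (288, 128), since 3*3*32 = 288
</code></pre></div></div>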
<h3 id="layer-8">Layer 8</h3>
<p><code class="language-plaintext highlighter-rouge">Dense(288, 10),</code></p>
<p>Our final training layer takes the 288 inputs for each image and outputs a size of 10x128 (10 for the 10 digits 0-9).</p>
<h3 id="layer-9">Layer 9</h3>
<p><code class="language-plaintext highlighter-rouge">softmax,</code></p>
<p>Converts the 10 raw outputs into probabilities between 0 and 1; the highest probability is the digit the model predicts.</p>
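<p>For intuition, softmax on a single column is one line of Base Julia; treat this as a sketch, since Flux’s implementation is numerically stabilised:</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Exponentiate, then normalise so the column sums to 1
softmax_col(v) = exp.(v) ./ sum(exp.(v))

p = softmax_col([1.0, 2.0, 3.0])
println(argmax(p))   # 3 -- the largest logit gets the largest probability
# sum(p) is 1 up to floating-point rounding
</code></pre></div></div>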
<p>The remainder of the code is pasted below for completeness.</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Load model and datasets onto GPU, if enabled</span>
<span class="n">train_set</span> <span class="o">=</span> <span class="n">gpu</span><span class="o">.</span><span class="x">(</span><span class="n">train_set</span><span class="x">)</span>
<span class="n">test_set</span> <span class="o">=</span> <span class="n">gpu</span><span class="o">.</span><span class="x">(</span><span class="n">test_set</span><span class="x">)</span>
<span class="n">model</span> <span class="o">=</span> <span class="n">gpu</span><span class="x">(</span><span class="n">model</span><span class="x">)</span>
<span class="c"># Make sure our model is nicely precompiled before starting our training loop</span>
<span class="n">model</span><span class="x">(</span><span class="n">train_set</span><span class="x">[</span><span class="mi">1</span><span class="x">][</span><span class="mi">1</span><span class="x">])</span>
<span class="c"># `loss()` calculates the crossentropy loss between our prediction `y_hat`</span>
<span class="c"># (calculated from `model(x)`) and the ground truth `y`. We augment the data</span>
<span class="c"># a bit, adding gaussian random noise to our image to make it more robust.</span>
<span class="k">function</span><span class="nf"> loss</span><span class="x">(</span><span class="n">x</span><span class="x">,</span> <span class="n">y</span><span class="x">)</span>
<span class="c"># We augment `x` a little bit here, adding in random noise</span>
<span class="n">x_aug</span> <span class="o">=</span> <span class="n">x</span> <span class="o">.+</span> <span class="mf">0.1f0</span><span class="o">*</span><span class="n">gpu</span><span class="x">(</span><span class="n">randn</span><span class="x">(</span><span class="n">eltype</span><span class="x">(</span><span class="n">x</span><span class="x">),</span> <span class="n">size</span><span class="x">(</span><span class="n">x</span><span class="x">)))</span>
<span class="n">y_hat</span> <span class="o">=</span> <span class="n">model</span><span class="x">(</span><span class="n">x_aug</span><span class="x">)</span>
<span class="k">return</span> <span class="n">crossentropy</span><span class="x">(</span><span class="n">y_hat</span><span class="x">,</span> <span class="n">y</span><span class="x">)</span>
<span class="k">end</span>
<span class="n">accuracy</span><span class="x">(</span><span class="n">x</span><span class="x">,</span> <span class="n">y</span><span class="x">)</span> <span class="o">=</span> <span class="n">mean</span><span class="x">(</span><span class="n">onecold</span><span class="x">(</span><span class="n">model</span><span class="x">(</span><span class="n">x</span><span class="x">))</span> <span class="o">.==</span> <span class="n">onecold</span><span class="x">(</span><span class="n">y</span><span class="x">))</span>
<span class="c"># Train our model with the given training set using the ADAM optimizer and</span>
<span class="c"># printing out performance against the test set as we go.</span>
<span class="n">opt</span> <span class="o">=</span> <span class="n">ADAM</span><span class="x">(</span><span class="mf">0.001</span><span class="x">)</span>
<span class="nd">@info</span><span class="x">(</span><span class="s">"Beginning training loop..."</span><span class="x">)</span>
<span class="n">best_acc</span> <span class="o">=</span> <span class="mf">0.0</span>
<span class="n">last_improvement</span> <span class="o">=</span> <span class="mi">0</span>
<span class="k">for</span> <span class="n">epoch_idx</span> <span class="k">in</span> <span class="mi">1</span><span class="o">:</span><span class="mi">100</span>
<span class="kd">global</span> <span class="n">best_acc</span><span class="x">,</span> <span class="n">last_improvement</span>
<span class="c"># Train for a single epoch</span>
<span class="n">Flux</span><span class="o">.</span><span class="n">train!</span><span class="x">(</span><span class="n">loss</span><span class="x">,</span> <span class="n">params</span><span class="x">(</span><span class="n">model</span><span class="x">),</span> <span class="n">train_set</span><span class="x">,</span> <span class="n">opt</span><span class="x">)</span>
<span class="c"># Calculate accuracy:</span>
<span class="n">acc</span> <span class="o">=</span> <span class="n">accuracy</span><span class="x">(</span><span class="n">test_set</span><span class="o">...</span><span class="x">)</span>
<span class="nd">@info</span><span class="x">(</span><span class="nd">@sprintf</span><span class="x">(</span><span class="s">"[%d]: Test accuracy: %.4f"</span><span class="x">,</span> <span class="n">epoch_idx</span><span class="x">,</span> <span class="n">acc</span><span class="x">))</span>
<span class="c"># If our accuracy is good enough, quit out.</span>
<span class="k">if</span> <span class="n">acc</span> <span class="o">>=</span> <span class="mf">0.999</span>
<span class="nd">@info</span><span class="x">(</span><span class="s">" -> Early-exiting: We reached our target accuracy of 99.9%"</span><span class="x">)</span>
<span class="n">break</span>
<span class="k">end</span>
<span class="c"># If this is the best accuracy we've seen so far, save the model out</span>
<span class="k">if</span> <span class="n">acc</span> <span class="o">>=</span> <span class="n">best_acc</span>
<span class="nd">@info</span><span class="x">(</span><span class="s">" -> New best accuracy! Saving model out to mnist_conv.bson"</span><span class="x">)</span>
<span class="n">BSON</span><span class="o">.</span><span class="nd">@save</span> <span class="s">"mnist_conv.bson"</span> <span class="n">model</span> <span class="n">epoch_idx</span> <span class="n">acc</span>
<span class="n">best_acc</span> <span class="o">=</span> <span class="n">acc</span>
<span class="n">last_improvement</span> <span class="o">=</span> <span class="n">epoch_idx</span>
<span class="k">end</span>
<span class="c"># If we haven't seen improvement in 5 epochs, drop our learning rate:</span>
<span class="k">if</span> <span class="n">epoch_idx</span> <span class="o">-</span> <span class="n">last_improvement</span> <span class="o">>=</span> <span class="mi">5</span> <span class="o">&&</span> <span class="n">opt</span><span class="o">.</span><span class="n">eta</span> <span class="o">></span> <span class="mf">1e-6</span>
<span class="n">opt</span><span class="o">.</span><span class="n">eta</span> <span class="o">/=</span> <span class="mf">10.0</span>
<span class="nd">@warn</span><span class="x">(</span><span class="s">" -> Haven't improved in a while, dropping learning rate to </span><span class="si">$</span><span class="s">(opt.eta)!"</span><span class="x">)</span>
<span class="c"># After dropping learning rate, give it a few epochs to improve</span>
<span class="n">last_improvement</span> <span class="o">=</span> <span class="n">epoch_idx</span>
<span class="k">end</span>
<span class="k">if</span> <span class="n">epoch_idx</span> <span class="o">-</span> <span class="n">last_improvement</span> <span class="o">>=</span> <span class="mi">10</span>
<span class="nd">@warn</span><span class="x">(</span><span class="s">" -> We're calling this converged."</span><span class="x">)</span>
<span class="n">break</span>
<span class="k">end</span>
<span class="k">end</span>
</code></pre></div></div>
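<p>Stripped of the Flux calls, the learning-rate schedule inside that loop can be sketched on its own; the function name and accuracy numbers here are made up for illustration:</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Divide eta by 10 once accuracy has not improved for 5 epochs
function schedule(accs; eta=0.001)
    best, last_improvement = 0.0, 0
    for (epoch, acc) in enumerate(accs)
        if acc > best
            best, last_improvement = acc, epoch
        elseif epoch - last_improvement >= 5
            eta /= 10
            last_improvement = epoch
        end
    end
    return eta
end

# Accuracy stalls after epoch 2, so eta drops once, from 0.001 to about 1.0e-4
println(schedule([0.90, 0.95, 0.95, 0.94, 0.93, 0.94, 0.93, 0.94]))
</code></pre></div></div>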
<p><a href="https://www.linkedin.com/pulse/creating-deep-neural-network-model-learn-handwritten-digits-mike-gold/">Need more help?, try this article by Mike Gold</a></p>Nigel AdamsTaming the CNN vision example in the Flux Model ZooJulia Word Embedding Layer in Flux - Self Trained2019-08-25T00:00:00+00:002019-08-25T00:00:00+00:00https://spcman.github.io/getting-to-know-julia/deep-learning/nlp/flux-embeddings-tutorial-1<p>In this example we take a look at how to use an embedding layer in Julia with Flux. If you need help on what embeddings are check out <a href="https://spcman.github.io/getting-to-know-julia/nlp/word-embeddings/">this page</a> and then return here to see how we can use them as the first layer in a neural network.</p>
<p>The objective for this exercise is to machine learn the sentiment of 10 string arrays. The idea came from this <a href="https://machinelearningmastery.com/use-word-embedding-layers-deep-learning-keras/">tutorial written by Jason Brownlee</a> who used Keras on a similar dataset.</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">using</span> <span class="n">Languages</span><span class="x">,</span> <span class="n">TextAnalysis</span><span class="x">,</span> <span class="n">Flux</span><span class="x">,</span> <span class="n">PyPlot</span><span class="x">,</span> <span class="n">Statistics</span>
<span class="c">#Display Flux Version</span>
<span class="k">import</span> <span class="n">Pkg</span> <span class="x">;</span> <span class="n">Pkg</span><span class="o">.</span><span class="n">installed</span><span class="x">()[</span><span class="s">"Flux"</span><span class="x">]</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>v"0.7.2"
</code></pre></div></div>
<h2 id="data-preparation">Data Preparation</h2>
<p>The first block of code defines our training ‘documents’ and labels (y).</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">Arr</span> <span class="o">=</span> <span class="x">[</span><span class="s">"well done"</span><span class="x">,</span>
<span class="s">"good work"</span><span class="x">,</span>
<span class="s">"great effort"</span><span class="x">,</span>
<span class="s">"nice work"</span><span class="x">,</span>
<span class="s">"excellent"</span><span class="x">,</span>
<span class="s">"weak"</span><span class="x">,</span>
<span class="s">"poor effort"</span><span class="x">,</span>
<span class="s">"not good"</span><span class="x">,</span>
<span class="s">"poor work"</span><span class="x">,</span>
<span class="s">"could have done better"</span><span class="x">]</span>
<span class="c"># positve or negative sentiment to each 'document' string</span>
<span class="n">y</span> <span class="o">=</span> <span class="x">[</span><span class="nb">true</span> <span class="nb">true</span> <span class="nb">true</span> <span class="nb">true</span> <span class="nb">true</span> <span class="nb">false</span> <span class="nb">false</span> <span class="nb">false</span> <span class="nb">false</span> <span class="nb">false</span><span class="x">]</span>
</code></pre></div></div>
<p>Next we set up a dictionary of words used. Each word points to an integer index. To do this the TextAnalysis package was used. If you’re interested in how this works, watch <a href="https://www.youtube.com/watch?v=f7RNuOLDyM8&t=4838s">this video</a>.</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">docs</span><span class="o">=</span><span class="x">[]</span>
<span class="k">for</span> <span class="n">i</span> <span class="k">in</span> <span class="mi">1</span><span class="o">:</span><span class="n">length</span><span class="x">(</span><span class="n">Arr</span><span class="x">)</span>
<span class="n">push!</span><span class="x">(</span><span class="n">docs</span><span class="x">,</span> <span class="n">StringDocument</span><span class="x">(</span><span class="n">Arr</span><span class="x">[</span><span class="n">i</span><span class="x">]))</span>
<span class="k">end</span>
<span class="n">crps</span><span class="o">=</span><span class="n">Corpus</span><span class="x">(</span><span class="n">docs</span><span class="x">)</span>
<span class="n">update_lexicon!</span><span class="x">(</span><span class="n">crps</span><span class="x">)</span>
<span class="n">doc_term_matrix</span><span class="o">=</span><span class="n">DocumentTermMatrix</span><span class="x">(</span><span class="n">crps</span><span class="x">)</span>
<span class="n">word_dict</span><span class="o">=</span><span class="n">doc_term_matrix</span><span class="o">.</span><span class="n">column_indices</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Dict{String,Int64} with 14 entries:
"done" => 3
"not" => 10
"excellent" => 5
"have" => 8
"well" => 13
"work" => 14
"nice" => 9
"effort" => 4
"great" => 7
"poor" => 11
"could" => 2
"better" => 1
"good" => 6
"weak" => 12
</code></pre></div></div>
<p>The following function returns the index of the word in the word dictionary, or 0 if the word is not found.</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">tk_idx</span><span class="x">(</span><span class="n">s</span><span class="x">)</span> <span class="o">=</span> <span class="n">haskey</span><span class="x">(</span><span class="n">word_dict</span><span class="x">,</span> <span class="n">s</span><span class="x">)</span> <span class="o">?</span> <span class="n">word_dict</span><span class="x">[</span><span class="n">s</span><span class="x">]</span> <span class="o">:</span> <span class="mi">0</span>
</code></pre></div></div>
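<p>The same lookup pattern works with any dictionary; a toy example follows (the dictionary and function names are mine, and Base’s <code class="language-plaintext highlighter-rouge">get</code> does the identical job in one call):</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Toy stand-in for the real word_dict built from the corpus
toy_dict = Dict("good" => 6, "work" => 14)
lookup(s) = haskey(toy_dict, s) ? toy_dict[s] : 0
println(lookup("work"))    # 14
println(lookup("julia"))   # 0 -- unknown words map to the padding index
# Equivalent: get(toy_dict, s, 0)
</code></pre></div></div>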
<p>The following function pads each document in the corpus to the same length.</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">function</span><span class="nf"> pad_corpus</span><span class="x">(</span><span class="n">c</span><span class="x">,</span> <span class="n">pad_size</span><span class="x">)</span>
<span class="n">M</span><span class="o">=</span><span class="x">[]</span>
<span class="k">for</span> <span class="n">doc</span> <span class="k">in</span> <span class="mi">1</span><span class="o">:</span><span class="n">length</span><span class="x">(</span><span class="n">c</span><span class="x">)</span>
<span class="n">tks</span> <span class="o">=</span> <span class="n">tokens</span><span class="x">(</span><span class="n">c</span><span class="x">[</span><span class="n">doc</span><span class="x">])</span>
<span class="k">if</span> <span class="n">length</span><span class="x">(</span><span class="n">tks</span><span class="x">)</span><span class="o">>=</span><span class="n">pad_size</span>
<span class="n">tk_indexes</span><span class="o">=</span><span class="x">[</span><span class="n">tk_idx</span><span class="x">(</span><span class="n">w</span><span class="x">)</span> <span class="k">for</span> <span class="n">w</span> <span class="k">in</span> <span class="n">tks</span><span class="x">[</span><span class="mi">1</span><span class="o">:</span><span class="n">pad_size</span><span class="x">]]</span>
<span class="k">end</span>
<span class="k">if</span> <span class="n">length</span><span class="x">(</span><span class="n">tks</span><span class="x">)</span><span class="o"><</span><span class="n">pad_size</span>
<span class="n">tk_indexes</span><span class="o">=</span><span class="n">zeros</span><span class="x">(</span><span class="kt">Int64</span><span class="x">,</span><span class="n">pad_size</span><span class="o">-</span><span class="n">length</span><span class="x">(</span><span class="n">tks</span><span class="x">))</span>
<span class="n">tk_indexes</span><span class="o">=</span><span class="n">vcat</span><span class="x">(</span><span class="n">tk_indexes</span><span class="x">,</span> <span class="x">[</span><span class="n">tk_idx</span><span class="x">(</span><span class="n">w</span><span class="x">)</span> <span class="k">for</span> <span class="n">w</span> <span class="k">in</span> <span class="n">tks</span><span class="x">])</span>
<span class="k">end</span>
<span class="n">doc</span><span class="o">==</span><span class="mi">1</span> <span class="o">?</span> <span class="n">M</span><span class="o">=</span><span class="n">tk_indexes</span><span class="err">'</span> <span class="o">:</span> <span class="n">M</span><span class="o">=</span><span class="n">vcat</span><span class="x">(</span><span class="n">M</span><span class="x">,</span> <span class="n">tk_indexes</span><span class="err">'</span><span class="x">)</span>
<span class="k">end</span>
<span class="k">return</span> <span class="n">M</span>
<span class="k">end</span>
</code></pre></div></div>
<p>The final step in our data preparation creates a dense matrix in which each number greater than zero is the index of a word. As the maximum document length is 4 (i.e. “could have done better”) we will use a pad size of 4. The matrix is transposed ready for training, so that each column represents one document.</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">pad_size</span><span class="o">=</span><span class="mi">4</span>
<span class="n">padded_docs</span> <span class="o">=</span> <span class="n">pad_corpus</span><span class="x">(</span><span class="n">crps</span><span class="x">,</span> <span class="n">pad_size</span><span class="x">)</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">padded_docs</span><span class="err">'</span>
<span class="n">data</span> <span class="o">=</span> <span class="x">[(</span><span class="n">x</span><span class="x">,</span> <span class="n">y</span><span class="x">)]</span>
</code></pre></div></div>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">x</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> 0 0 0 0 0 0 0 0 0 2
0 0 0 0 0 0 0 0 0 8
13 6 7 9 0 0 11 10 11 3
3 14 4 14 5 12 4 6 14 1
</code></pre></div></div>
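<p>The padding logic boils down to left-padding each index vector with zeros (or truncating it) to a fixed length. A compact stand-in, with <code class="language-plaintext highlighter-rouge">pad_left</code> being my own name rather than anything from this post’s code, reproduces the first column shown above:</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Left-pad with zeros to length n, truncating if the document is too long
pad_left(idxs, n) = vcat(zeros(Int, max(n - length(idxs), 0)), idxs[1:min(end, n)])

println(pad_left([13, 3], 4))        # [0, 0, 13, 3]  -- "well done"
println(pad_left([2, 8, 3, 1], 4))   # [2, 8, 3, 1]   -- already full length
</code></pre></div></div>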
<h2 id="flux-embedding-preparation">Flux Embedding Preparation</h2>
<p>Next let’s get ready for the embedding layer. In this example we’ll learn 8 features per word, but for a larger corpus you’ll probably need a higher dimension, perhaps even 300. The vocab size is set to 20, which is higher than the maximum index (14) in our dictionary.</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">N</span> <span class="o">=</span> <span class="n">size</span><span class="x">(</span><span class="n">padded_docs</span><span class="x">,</span><span class="mi">1</span><span class="x">)</span> <span class="c">#Number of documents (10)</span>
<span class="n">max_features</span> <span class="o">=</span> <span class="mi">8</span>
<span class="n">vocab_size</span> <span class="o">=</span> <span class="mi">20</span>
</code></pre></div></div>
<p>The next block of code defines a custom struct called EmbeddingLayer with an inner constructor. The layer is initialized with a special random initializer called glorot_normal. Also note the struct must be registered with @Flux.treelike, otherwise Flux will not track and update (learn) the embedding weights during training.</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span><span class="nc"> EmbeddingLayer</span>
<span class="n">W</span>
<span class="n">EmbeddingLayer</span><span class="x">(</span><span class="n">mf</span><span class="x">,</span> <span class="n">vs</span><span class="x">)</span> <span class="o">=</span> <span class="n">new</span><span class="x">(</span><span class="n">param</span><span class="x">(</span><span class="n">Flux</span><span class="o">.</span><span class="n">glorot_normal</span><span class="x">(</span><span class="n">mf</span><span class="x">,</span> <span class="n">vs</span><span class="x">)))</span>
<span class="k">end</span>
<span class="nd">@Flux.treelike</span> <span class="n">EmbeddingLayer</span>
<span class="x">(</span><span class="n">m</span><span class="o">::</span><span class="n">EmbeddingLayer</span><span class="x">)(</span><span class="n">x</span><span class="x">)</span> <span class="o">=</span> <span class="n">m</span><span class="o">.</span><span class="n">W</span> <span class="o">*</span> <span class="n">Flux</span><span class="o">.</span><span class="n">onehotbatch</span><span class="x">(</span><span class="n">reshape</span><span class="x">(</span><span class="n">x</span><span class="x">,</span> <span class="n">pad_size</span><span class="o">*</span><span class="n">N</span><span class="x">),</span> <span class="mi">0</span><span class="o">:</span><span class="n">vocab_size</span><span class="o">-</span><span class="mi">1</span><span class="x">)</span>
</code></pre></div></div>
<h2 id="buliding-the-model-and-training">Building the Model and Training</h2>
<p>The model needs some explanation.</p>
<p><strong>Layer 1.</strong> As x is fed into the model, the first layer’s embedding function matches the words in each document to corresponding word vectors. This is done by rolling all the word vectors one after the other and using onehotbatch to filter out the unwanted words. The output is a 8x40 array.</p>
<p><strong>Layer 2</strong>. Unrolls the vectors into the shape 8x4x10; i.e. 8 features and 10 documents of padded size 4.</p>
<p><strong>Layer 3.</strong> Now that our data is in the shape provided by layer 2 we can sum the word vectors to get an overall ‘meaning’ vector for each document. The output now has the shape 8x1x10.</p>
<p><strong>Layer 4:</strong> Drops an axis so that the shape of x is a size suitable for training. After this step the shape is 8x10.</p>
<p><strong>Layer 5:</strong> is a normal Dense layer with the sigmoid activation function to give us nice probabilities.</p>
<p>If you’d like to see each layer in action I recommend using <code class="language-plaintext highlighter-rouge">m[1](x)</code> to see sample output from the first layer. <code class="language-plaintext highlighter-rouge">m[1:2](x)</code> to see output from the second layer and so on.</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">m</span> <span class="o">=</span> <span class="n">Chain</span><span class="x">(</span><span class="n">EmbeddingLayer</span><span class="x">(</span><span class="n">max_features</span><span class="x">,</span> <span class="n">vocab_size</span><span class="x">),</span>
<span class="n">x</span> <span class="o">-></span> <span class="n">reshape</span><span class="x">(</span><span class="n">x</span><span class="x">,</span> <span class="n">max_features</span><span class="x">,</span> <span class="n">pad_size</span><span class="x">,</span> <span class="n">N</span><span class="x">),</span>
<span class="n">x</span> <span class="o">-></span> <span class="n">sum</span><span class="x">(</span><span class="n">x</span><span class="x">,</span> <span class="n">dims</span><span class="o">=</span><span class="mi">2</span><span class="x">),</span>
<span class="n">x</span> <span class="o">-></span> <span class="n">reshape</span><span class="x">(</span><span class="n">x</span><span class="x">,</span> <span class="n">max_features</span><span class="x">,</span> <span class="n">N</span><span class="x">),</span>
<span class="n">Dense</span><span class="x">(</span><span class="n">max_features</span><span class="x">,</span> <span class="mi">1</span><span class="x">,</span> <span class="n">σ</span><span class="x">)</span>
<span class="x">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Chain(EmbeddingLayer(Float32[0.278128 0.111989 … -0.244614 -0.377189; 0.0647178 0.0683725 … -0.112626 -0.434706; … ; 0.397401 0.407925 … 0.438091 0.0588613; -0.361919 -0.114776 … -0.356307 -0.10119] (tracked)), getfield(Main, Symbol("##3#6"))(), getfield(Main, Symbol("##4#7"))(), getfield(Main, Symbol("##5#8"))(), Dense(8, 1, NNlib.σ))
</code></pre></div></div>
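<p>For example, you can sanity-check the shapes described above layer by layer (a small sketch I’ve added here, not part of the original notebook; it assumes the model <code class="language-plaintext highlighter-rouge">m</code> and input <code class="language-plaintext highlighter-rouge">x</code> defined above):</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Inspect the output shape after each layer of the Chain
size(m[1](x))    # embedding lookup: (8, 40)
size(m[1:2](x))  # word vectors unrolled per document: (8, 4, 10)
size(m[1:3](x))  # summed 'meaning' vectors: (8, 1, 10)
size(m[1:4](x))  # singleton axis dropped: (8, 10)
size(m(x))       # final probabilities from the Dense layer: (1, 10)
</code></pre></div></div>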
<p>Now let’s initialize some arrays and create a function to calculate accuracy.</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">loss_h</span><span class="o">=</span><span class="x">[]</span>
<span class="n">accuracy_train</span><span class="o">=</span><span class="x">[]</span>
<span class="n">accuracy</span><span class="x">(</span><span class="n">x</span><span class="x">,</span> <span class="n">y</span><span class="x">)</span> <span class="o">=</span> <span class="n">mean</span><span class="x">(</span><span class="n">x</span> <span class="o">.==</span> <span class="n">y</span><span class="x">)</span>
</code></pre></div></div>
<p>As this is a binary (1 or 0) classification problem we need to use binarycrossentropy to calculate the loss. The optimizer is gradient descent.</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">loss</span><span class="x">(</span><span class="n">x</span><span class="x">,</span> <span class="n">y</span><span class="x">)</span> <span class="o">=</span> <span class="n">sum</span><span class="x">(</span><span class="n">Flux</span><span class="o">.</span><span class="n">binarycrossentropy</span><span class="o">.</span><span class="x">(</span><span class="n">m</span><span class="x">(</span><span class="n">x</span><span class="x">),</span> <span class="n">y</span><span class="x">))</span>
<span class="n">optimizer</span> <span class="o">=</span> <span class="n">Flux</span><span class="o">.</span><span class="n">Descent</span><span class="x">(</span><span class="mf">0.01</span><span class="x">)</span>
</code></pre></div></div>
<p>Now train the model for 400 epochs, recording the loss and training accuracy after each epoch.</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="n">epoch</span> <span class="k">in</span> <span class="mi">1</span><span class="o">:</span><span class="mi">400</span>
<span class="n">Flux</span><span class="o">.</span><span class="n">train!</span><span class="x">(</span><span class="n">loss</span><span class="x">,</span> <span class="n">Flux</span><span class="o">.</span><span class="n">params</span><span class="x">(</span><span class="n">m</span><span class="x">),</span> <span class="n">data</span><span class="x">,</span> <span class="n">optimizer</span><span class="x">)</span>
<span class="c">#println(loss(x, y).data, " ", accuracy(m(x).>0.5,y))</span>
<span class="n">push!</span><span class="x">(</span><span class="n">loss_h</span><span class="x">,</span> <span class="n">loss</span><span class="x">(</span><span class="n">x</span><span class="x">,</span> <span class="n">y</span><span class="x">)</span><span class="o">.</span><span class="n">data</span><span class="x">)</span>
<span class="n">push!</span><span class="x">(</span><span class="n">accuracy_train</span><span class="x">,</span> <span class="n">accuracy</span><span class="x">(</span><span class="n">m</span><span class="x">(</span><span class="n">x</span><span class="x">)</span><span class="o">.></span><span class="mf">0.5</span><span class="x">,</span><span class="n">y</span><span class="x">))</span>
<span class="k">end</span>
<span class="n">println</span><span class="x">(</span><span class="n">m</span><span class="x">(</span><span class="n">x</span><span class="x">)</span><span class="o">.></span><span class="mf">0.5</span><span class="x">)</span>
<span class="n">accuracy</span><span class="x">(</span><span class="n">m</span><span class="x">(</span><span class="n">x</span><span class="x">)</span><span class="o">.></span><span class="mf">0.5</span><span class="x">,</span><span class="n">y</span><span class="x">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Bool[true false true true true false false false false false]
0.9
</code></pre></div></div>
<p>Outputs over 0.5 are considered positive (true) and our final accuracy is 90%.</p>
<p>The second example is incorrectly scored as false. I think this is because the words “good” and “work” also appear in the negative examples. Next we’ll see what happens using the pre-trained word embeddings.</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">figure</span><span class="x">(</span><span class="n">figsize</span><span class="o">=</span><span class="x">(</span><span class="mi">12</span><span class="x">,</span><span class="mi">5</span><span class="x">))</span>
<span class="n">subplot</span><span class="x">(</span><span class="mi">121</span><span class="x">)</span>
<span class="n">PyPlot</span><span class="o">.</span><span class="n">xlabel</span><span class="x">(</span><span class="s">"Epoch"</span><span class="x">)</span>
<span class="n">ylabel</span><span class="x">(</span><span class="s">"Loss"</span><span class="x">)</span>
<span class="n">plot</span><span class="x">(</span><span class="n">loss_h</span><span class="x">)</span>
<span class="n">subplot</span><span class="x">(</span><span class="mi">122</span><span class="x">)</span>
<span class="n">PyPlot</span><span class="o">.</span><span class="n">xlabel</span><span class="x">(</span><span class="s">"Epoch"</span><span class="x">)</span>
<span class="n">ylabel</span><span class="x">(</span><span class="s">"Accuracy"</span><span class="x">)</span>
<span class="n">plot</span><span class="x">(</span><span class="n">accuracy_train</span><span class="x">,</span> <span class="n">label</span><span class="o">=</span><span class="s">"train"</span><span class="x">)</span>
</code></pre></div></div>
<p><img src="https://spcman.github.io/getting-to-know-julia/images/proj003/output_26_0.png" alt="loss accuracy" /></p>
<p>Note: some parts of this could be done more elegantly; let me know.</p>Nigel AdamsA simple example of a word embedding layer with Flux (not pre-trained)Julia Word Embedding Layer in Flux - Pre-trained GloVe2019-08-25T00:00:00+00:002019-08-25T00:00:00+00:00https://spcman.github.io/getting-to-know-julia/deep-learning/nlp/flux-embeddings-tutorial-2<p>This example follows on from <a href="https://spcman.github.io/getting-to-know-julia/nlp/flux-embeddings-tutorial-1/">tutorial #1</a> in which we trained our own embedding layer. This time we use pre-trained word vectors (GloVe) instead of learning them. We’ll skip over some of the explanations as this is covered in tutorial #1.</p>
<p>As before, the objective for this exercise is to machine learn the sentiment of 10 string arrays. The idea came from this <a href="https://machinelearningmastery.com/use-word-embedding-layers-deep-learning-keras/">tutorial written by Jason Brownlee</a> who used Keras on a similar dataset.</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">using</span> <span class="n">Languages</span><span class="x">,</span> <span class="n">TextAnalysis</span><span class="x">,</span> <span class="n">Flux</span><span class="x">,</span> <span class="n">PyPlot</span><span class="x">,</span> <span class="n">Statistics</span>
<span class="c">#Display Flux Version</span>
<span class="k">import</span> <span class="n">Pkg</span> <span class="x">;</span> <span class="n">Pkg</span><span class="o">.</span><span class="n">installed</span><span class="x">()[</span><span class="s">"Flux"</span><span class="x">]</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>v"0.7.2"
</code></pre></div></div>
<h2 id="data-preparation">Data Preparation</h2>
<p>This code block is the same as <a href="https://spcman.github.io/getting-to-know-julia/nlp/flux-embeddings-tutorial-1/">tutorial #1</a>. See this for more explanation.</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">Arr</span> <span class="o">=</span> <span class="x">[</span><span class="s">"well done"</span><span class="x">,</span>
<span class="s">"good work"</span><span class="x">,</span>
<span class="s">"great effort"</span><span class="x">,</span>
<span class="s">"nice work"</span><span class="x">,</span>
<span class="s">"excellent"</span><span class="x">,</span>
<span class="s">"weak"</span><span class="x">,</span>
<span class="s">"poor effort"</span><span class="x">,</span>
<span class="s">"not good"</span><span class="x">,</span>
<span class="s">"poor work"</span><span class="x">,</span>
<span class="s">"could have done better"</span><span class="x">]</span>
<span class="c"># positve or negative sentiment to each 'document' string</span>
<span class="n">y</span> <span class="o">=</span> <span class="x">[</span><span class="nb">true</span> <span class="nb">true</span> <span class="nb">true</span> <span class="nb">true</span> <span class="nb">true</span> <span class="nb">false</span> <span class="nb">false</span> <span class="nb">false</span> <span class="nb">false</span> <span class="nb">false</span><span class="x">]</span>
<span class="n">docs</span><span class="o">=</span><span class="x">[]</span>
<span class="k">for</span> <span class="n">i</span> <span class="k">in</span> <span class="mi">1</span><span class="o">:</span><span class="n">length</span><span class="x">(</span><span class="n">Arr</span><span class="x">)</span>
<span class="n">push!</span><span class="x">(</span><span class="n">docs</span><span class="x">,</span> <span class="n">StringDocument</span><span class="x">(</span><span class="n">Arr</span><span class="x">[</span><span class="n">i</span><span class="x">]))</span>
<span class="k">end</span>
<span class="n">crps</span><span class="o">=</span><span class="n">Corpus</span><span class="x">(</span><span class="n">docs</span><span class="x">)</span>
<span class="n">update_lexicon!</span><span class="x">(</span><span class="n">crps</span><span class="x">)</span>
<span class="n">doc_term_matrix</span><span class="o">=</span><span class="n">DocumentTermMatrix</span><span class="x">(</span><span class="n">crps</span><span class="x">)</span>
<span class="n">word_dict</span><span class="o">=</span><span class="n">doc_term_matrix</span><span class="o">.</span><span class="n">column_indices</span>
<span class="n">tk_idx</span><span class="x">(</span><span class="n">s</span><span class="x">)</span> <span class="o">=</span> <span class="n">haskey</span><span class="x">(</span><span class="n">word_dict</span><span class="x">,</span> <span class="n">s</span><span class="x">)</span> <span class="o">?</span> <span class="n">i</span><span class="o">=</span><span class="n">word_dict</span><span class="x">[</span><span class="n">s</span><span class="x">]</span> <span class="o">:</span> <span class="n">i</span><span class="o">=</span><span class="mi">0</span>
<span class="k">function</span><span class="nf"> pad_corpus</span><span class="x">(</span><span class="n">c</span><span class="x">,</span> <span class="n">pad_size</span><span class="x">)</span>
<span class="n">M</span><span class="o">=</span><span class="x">[]</span>
<span class="k">for</span> <span class="n">doc</span> <span class="k">in</span> <span class="mi">1</span><span class="o">:</span><span class="n">length</span><span class="x">(</span><span class="n">c</span><span class="x">)</span>
<span class="n">tks</span> <span class="o">=</span> <span class="n">tokens</span><span class="x">(</span><span class="n">c</span><span class="x">[</span><span class="n">doc</span><span class="x">])</span>
<span class="k">if</span> <span class="n">length</span><span class="x">(</span><span class="n">tks</span><span class="x">)</span><span class="o">>=</span><span class="n">pad_size</span>
<span class="n">tk_indexes</span><span class="o">=</span><span class="x">[</span><span class="n">tk_idx</span><span class="x">(</span><span class="n">w</span><span class="x">)</span> <span class="k">for</span> <span class="n">w</span> <span class="k">in</span> <span class="n">tks</span><span class="x">[</span><span class="mi">1</span><span class="o">:</span><span class="n">pad_size</span><span class="x">]]</span>
<span class="k">end</span>
<span class="k">if</span> <span class="n">length</span><span class="x">(</span><span class="n">tks</span><span class="x">)</span><span class="o"><</span><span class="n">pad_size</span>
<span class="n">tk_indexes</span><span class="o">=</span><span class="n">zeros</span><span class="x">(</span><span class="kt">Int64</span><span class="x">,</span><span class="n">pad_size</span><span class="o">-</span><span class="n">length</span><span class="x">(</span><span class="n">tks</span><span class="x">))</span>
<span class="n">tk_indexes</span><span class="o">=</span><span class="n">vcat</span><span class="x">(</span><span class="n">tk_indexes</span><span class="x">,</span> <span class="x">[</span><span class="n">tk_idx</span><span class="x">(</span><span class="n">w</span><span class="x">)</span> <span class="k">for</span> <span class="n">w</span> <span class="k">in</span> <span class="n">tks</span><span class="x">])</span>
<span class="k">end</span>
<span class="n">doc</span><span class="o">==</span><span class="mi">1</span> <span class="o">?</span> <span class="n">M</span><span class="o">=</span><span class="n">tk_indexes</span><span class="err">'</span> <span class="o">:</span> <span class="n">M</span><span class="o">=</span><span class="n">vcat</span><span class="x">(</span><span class="n">M</span><span class="x">,</span> <span class="n">tk_indexes</span><span class="err">'</span><span class="x">)</span>
<span class="k">end</span>
<span class="k">return</span> <span class="n">M</span>
<span class="k">end</span>
<span class="n">pad_size</span><span class="o">=</span><span class="mi">4</span>
<span class="n">padded_docs</span> <span class="o">=</span> <span class="n">pad_corpus</span><span class="x">(</span><span class="n">crps</span><span class="x">,</span> <span class="n">pad_size</span><span class="x">)</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">padded_docs</span><span class="err">'</span>
<span class="n">data</span> <span class="o">=</span> <span class="x">[(</span><span class="n">x</span><span class="x">,</span> <span class="n">y</span><span class="x">)]</span>
</code></pre></div></div>
<h2 id="flux-embedding-preparation">Flux Embedding Preparation</h2>
<h3 id="load-the-pre-trained-embeddings">Load the pre-trained embeddings</h3>
<p>This function loads the pre-trained GloVe embeddings. Try Embeddings.jl for a better way to do this if you can get it to work.</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">function</span><span class="nf"> load_embeddings</span><span class="x">(</span><span class="n">embedding_file</span><span class="x">)</span>
<span class="kd">local</span> <span class="n">LL</span><span class="x">,</span> <span class="n">indexed_words</span><span class="x">,</span> <span class="n">index</span>
<span class="n">indexed_words</span> <span class="o">=</span> <span class="kt">Vector</span><span class="x">{</span><span class="kt">String</span><span class="x">}()</span>
<span class="n">LL</span> <span class="o">=</span> <span class="kt">Vector</span><span class="x">{</span><span class="kt">Vector</span><span class="x">{</span><span class="kt">Float32</span><span class="x">}}()</span>
<span class="n">open</span><span class="x">(</span><span class="n">embedding_file</span><span class="x">)</span> <span class="k">do</span> <span class="n">f</span>
<span class="n">index</span> <span class="o">=</span> <span class="mi">1</span>
<span class="k">for</span> <span class="n">line</span> <span class="k">in</span> <span class="n">eachline</span><span class="x">(</span><span class="n">f</span><span class="x">)</span>
<span class="n">xs</span> <span class="o">=</span> <span class="n">split</span><span class="x">(</span><span class="n">line</span><span class="x">)</span>
<span class="n">word</span> <span class="o">=</span> <span class="n">xs</span><span class="x">[</span><span class="mi">1</span><span class="x">]</span>
<span class="n">push!</span><span class="x">(</span><span class="n">indexed_words</span><span class="x">,</span> <span class="n">word</span><span class="x">)</span>
<span class="n">push!</span><span class="x">(</span><span class="n">LL</span><span class="x">,</span> <span class="n">parse</span><span class="o">.</span><span class="x">(</span><span class="kt">Float32</span><span class="x">,</span> <span class="n">xs</span><span class="x">[</span><span class="mi">2</span><span class="o">:</span><span class="k">end</span><span class="x">]))</span>
<span class="n">index</span> <span class="o">+=</span> <span class="mi">1</span>
<span class="k">end</span>
<span class="k">end</span>
<span class="k">return</span> <span class="n">reduce</span><span class="x">(</span><span class="n">hcat</span><span class="x">,</span> <span class="n">LL</span><span class="x">),</span> <span class="n">indexed_words</span>
<span class="k">end</span>
</code></pre></div></div>
<p>We’ll use one of the smaller embedding files (glove.6B.50d.txt) as this problem is trivial. This file can be downloaded <a href="https://nlp.stanford.edu/projects/glove/">from here</a> and must reside in the current working folder.</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">embeddings</span><span class="x">,</span> <span class="n">vocab</span> <span class="o">=</span> <span class="n">load_embeddings</span><span class="x">(</span><span class="s">"glove.6B.50d.txt"</span><span class="x">)</span>
<span class="n">embed_size</span><span class="x">,</span> <span class="n">max_features</span> <span class="o">=</span> <span class="n">size</span><span class="x">(</span><span class="n">embeddings</span><span class="x">)</span>
<span class="n">println</span><span class="x">(</span><span class="s">"Loaded embeddings, each word is represented by a vector with </span><span class="si">$</span><span class="s">embed_size features. The vocab size is </span><span class="si">$</span><span class="s">max_features"</span><span class="x">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Loaded embeddings, each word is represented by a vector with 50 features. The vocab size is 400000
</code></pre></div></div>
<p>This function provides the index of a word in the GloVe embedding.</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">#Function to return the index of the word in the embedding (returns 0 if the word is not found)</span>
<span class="k">function</span><span class="nf"> vec_idx</span><span class="x">(</span><span class="n">s</span><span class="x">)</span>
<span class="n">i</span><span class="o">=</span><span class="n">findfirst</span><span class="x">(</span><span class="n">x</span> <span class="o">-></span> <span class="n">x</span><span class="o">==</span><span class="n">s</span><span class="x">,</span> <span class="n">vocab</span><span class="x">)</span>
<span class="n">i</span><span class="o">==</span><span class="nb">nothing</span> <span class="o">?</span> <span class="n">i</span><span class="o">=</span><span class="mi">0</span> <span class="o">:</span> <span class="n">i</span>
<span class="k">end</span>
</code></pre></div></div>
<p>This function provides the GloVe word vector of the given word.</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">wvec</span><span class="x">(</span><span class="n">s</span><span class="x">)</span> <span class="o">=</span> <span class="n">embeddings</span><span class="x">[</span><span class="o">:</span><span class="x">,</span> <span class="n">vec_idx</span><span class="x">(</span><span class="n">s</span><span class="x">)]</span>
<span class="n">wvec</span><span class="x">(</span><span class="s">"done"</span><span class="x">)</span>
</code></pre></div></div>
<p>Here you can see the GloVe vector representation of one of our words “done”.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>50-element Array{Float32,1}:
0.33076
-0.4387
-0.32163
-0.4931
0.10254
-0.0027421
-0.5172
0.024336
-0.12816
0.14349
-0.16691
0.56121
-0.56241
⋮
0.060552
-0.16143
-0.26668
-0.1766
0.01582
0.25528
-0.096739
-0.097282
-0.084483
0.33312
-0.22252
0.74457
</code></pre></div></div>
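<p>As a quick aside (not part of the original notebook), one useful property of pre-trained vectors is that semantically similar words point in similar directions. A hedged sketch, assuming the <code class="language-plaintext highlighter-rouge">wvec</code> helper defined above and words present in the GloVe vocabulary:</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code>using LinearAlgebra

# Cosine similarity between two word vectors
cosine(a, b) = dot(a, b) / (norm(a) * norm(b))

# Words with related sentiment should generally score higher
# than unrelated ones
cosine(wvec("good"), wvec("nice"))  # expect a relatively high value
cosine(wvec("good"), wvec("weak"))  # expect a lower value
</code></pre></div></div>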
<h3 id="embedding-preparation">Embedding Preparation</h3>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">N</span> <span class="o">=</span> <span class="n">size</span><span class="x">(</span><span class="n">padded_docs</span><span class="x">,</span><span class="mi">1</span><span class="x">)</span> <span class="c">#Number of documents (10)</span>
<span class="n">max_features</span> <span class="o">=</span> <span class="mi">50</span>
<span class="n">vocab_size</span> <span class="o">=</span> <span class="mi">20</span>
</code></pre></div></div>
<p>The next block of code initializes a random embedding matrix as per the size of our vocab.</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">embedding_matrix</span><span class="o">=</span><span class="n">Flux</span><span class="o">.</span><span class="n">glorot_normal</span><span class="x">(</span><span class="n">max_features</span><span class="x">,</span> <span class="n">vocab_size</span><span class="x">)</span>
</code></pre></div></div>
<p>Now we overwrite the random embedding matrix with our word vectors from GloVe. Each word vector is inserted as a column at the index from word_dict plus 1; the offset of 1 leaves index 0 free to represent the zero-word used for padding.</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="n">term</span> <span class="k">in</span> <span class="n">doc_term_matrix</span><span class="o">.</span><span class="n">terms</span>
<span class="k">if</span> <span class="n">vec_idx</span><span class="x">(</span><span class="n">term</span><span class="x">)</span><span class="o">!=</span><span class="mi">0</span>
<span class="n">embedding_matrix</span><span class="x">[</span><span class="o">:</span><span class="x">,</span><span class="n">word_dict</span><span class="x">[</span><span class="n">term</span><span class="x">]</span><span class="o">+</span><span class="mi">1</span><span class="x">]</span><span class="o">=</span><span class="n">wvec</span><span class="x">(</span><span class="n">term</span><span class="x">)</span>
<span class="k">end</span>
<span class="k">end</span>
</code></pre></div></div>
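<p>To confirm the overwrite worked, you can compare a column of the matrix against the GloVe lookup directly (a small check added here, not in the original):</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code># The column at word_dict["done"] + 1 should now hold
# the GloVe vector for "done"
embedding_matrix[:, word_dict["done"] + 1] == wvec("done")  # true
</code></pre></div></div>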
<h2 id="buliding-the-model-and-training">Building the Model and Training</h2>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">m</span> <span class="o">=</span> <span class="n">Chain</span><span class="x">(</span><span class="n">x</span> <span class="o">-></span> <span class="n">embedding_matrix</span> <span class="o">*</span> <span class="n">Flux</span><span class="o">.</span><span class="n">onehotbatch</span><span class="x">(</span><span class="n">reshape</span><span class="x">(</span><span class="n">x</span><span class="x">,</span> <span class="n">pad_size</span><span class="o">*</span><span class="n">N</span><span class="x">),</span> <span class="mi">0</span><span class="o">:</span><span class="n">vocab_size</span><span class="o">-</span><span class="mi">1</span><span class="x">),</span>
<span class="n">x</span> <span class="o">-></span> <span class="n">reshape</span><span class="x">(</span><span class="n">x</span><span class="x">,</span> <span class="n">max_features</span><span class="x">,</span> <span class="n">pad_size</span><span class="x">,</span> <span class="n">N</span><span class="x">),</span>
<span class="n">x</span> <span class="o">-></span> <span class="n">sum</span><span class="x">(</span><span class="n">x</span><span class="x">,</span> <span class="n">dims</span><span class="o">=</span><span class="mi">2</span><span class="x">),</span>
<span class="n">x</span> <span class="o">-></span> <span class="n">reshape</span><span class="x">(</span><span class="n">x</span><span class="x">,</span> <span class="n">max_features</span><span class="x">,</span> <span class="n">N</span><span class="x">),</span>
<span class="n">Dense</span><span class="x">(</span><span class="n">max_features</span><span class="x">,</span> <span class="mi">1</span><span class="x">,</span> <span class="n">σ</span><span class="x">)</span>
<span class="x">)</span>
</code></pre></div></div>
<p>The model (m) needs some explanation.</p>
<p><strong>Layer 1.</strong> The first layer’s embedding function matches the words in each document to corresponding word vectors. This is done by rolling all the word vectors one after the other and using onehotbatch to filter out the unwanted words. The output is a 50x40 array.</p>
<p><strong>Layer 2</strong>. Unrolls the vectors into the shape 50x4x10; i.e. 50 features and 10 documents of padded size 4.</p>
<p><strong>Layer 3.</strong> Now that our data is in the shape provided by layer 2 we can sum the word vectors to get an overall ‘meaning’ vector for each document. The output now has the shape 50x1x10.</p>
<p><strong>Layer 4:</strong> Drops the axis (1) so that the shape of x is a size suitable for training. After this step the shape is 50x10.</p>
<p><strong>Layer 5:</strong> is a normal Dense layer with the sigmoid activation function to give us nice probabilities.</p>
<p>If you’d like to see each layer in action I recommend using <code class="language-plaintext highlighter-rouge">m[1](x)</code> to see sample output from the first layer. <code class="language-plaintext highlighter-rouge">m[1:2](x)</code> to see output from the second layer and so on.</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">loss_h</span><span class="o">=</span><span class="x">[]</span>
<span class="n">accuracy_train</span><span class="o">=</span><span class="x">[]</span>
<span class="n">accuracy</span><span class="x">(</span><span class="n">x</span><span class="x">,</span> <span class="n">y</span><span class="x">)</span> <span class="o">=</span> <span class="n">mean</span><span class="x">(</span><span class="n">x</span> <span class="o">.==</span> <span class="n">y</span><span class="x">)</span>
</code></pre></div></div>
<p>As this is a binary (1 or 0) classification problem, we use binarycrossentropy to calculate the loss. The optimizer is plain gradient descent with a learning rate of 0.001.</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">loss</span><span class="x">(</span><span class="n">x</span><span class="x">,</span> <span class="n">y</span><span class="x">)</span> <span class="o">=</span> <span class="n">sum</span><span class="x">(</span><span class="n">Flux</span><span class="o">.</span><span class="n">binarycrossentropy</span><span class="o">.</span><span class="x">(</span><span class="n">m</span><span class="x">(</span><span class="n">x</span><span class="x">),</span> <span class="n">y</span><span class="x">))</span>
<span class="n">optimizer</span> <span class="o">=</span> <span class="n">Flux</span><span class="o">.</span><span class="n">Descent</span><span class="x">(</span><span class="mf">0.001</span><span class="x">)</span>
</code></pre></div></div>
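<p>For intuition, binary cross-entropy for a single prediction can be written out by hand. This is a sketch of the per-element quantity that <code class="language-plaintext highlighter-rouge">Flux.binarycrossentropy</code> computes; the loss above sums it over every output:</p>

```julia
# Binary cross-entropy for one prediction ŷ against label y (0 or 1).
# A confident correct prediction gives a small loss; a confident wrong
# prediction is penalised heavily.
bce(ŷ, y) = -(y * log(ŷ) + (1 - y) * log(1 - ŷ))
bce(0.9, 1)   # small loss: -log(0.9)
bce(0.1, 1)   # large loss: -log(0.1)
```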
<p>Train the model</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="n">epoch</span> <span class="k">in</span> <span class="mi">1</span><span class="o">:</span><span class="mi">400</span>
<span class="n">Flux</span><span class="o">.</span><span class="n">train!</span><span class="x">(</span><span class="n">loss</span><span class="x">,</span> <span class="n">Flux</span><span class="o">.</span><span class="n">params</span><span class="x">(</span><span class="n">m</span><span class="x">),</span> <span class="n">data</span><span class="x">,</span> <span class="n">optimizer</span><span class="x">)</span>
<span class="c">#println(loss(x, y).data, " ", accuracy(m(x).>0.5,y))</span>
<span class="n">push!</span><span class="x">(</span><span class="n">loss_h</span><span class="x">,</span> <span class="n">loss</span><span class="x">(</span><span class="n">x</span><span class="x">,</span> <span class="n">y</span><span class="x">)</span><span class="o">.</span><span class="n">data</span><span class="x">)</span>
<span class="n">push!</span><span class="x">(</span><span class="n">accuracy_train</span><span class="x">,</span> <span class="n">accuracy</span><span class="x">(</span><span class="n">m</span><span class="x">(</span><span class="n">x</span><span class="x">)</span><span class="o">.></span><span class="mf">0.5</span><span class="x">,</span><span class="n">y</span><span class="x">))</span>
<span class="k">end</span>
<span class="n">println</span><span class="x">(</span><span class="n">m</span><span class="x">(</span><span class="n">x</span><span class="x">)</span><span class="o">.></span><span class="mf">0.5</span><span class="x">)</span>
<span class="n">accuracy</span><span class="x">(</span><span class="n">m</span><span class="x">(</span><span class="n">x</span><span class="x">)</span><span class="o">.></span><span class="mf">0.5</span><span class="x">,</span><span class="n">y</span><span class="x">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Bool[true true true true true false false false false false]
1.0
</code></pre></div></div>
<p>Outputs over 0.5 are considered positive (true) and our final accuracy this time is 100%.</p>
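<p>The thresholding step can be seen on its own with toy probabilities (made-up values, not the model’s actual outputs):</p>

```julia
using Statistics

# Probabilities above 0.5 count as positive; accuracy is the fraction
# of thresholded predictions that match the labels.
accuracy(x, y) = mean(x .== y)
probs  = [0.93, 0.41, 0.78, 0.12]     # toy model outputs
labels = [true, false, true, false]   # toy ground truth
acc = accuracy(probs .> 0.5, labels)
println(acc)
```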
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">figure</span><span class="x">(</span><span class="n">figsize</span><span class="o">=</span><span class="x">(</span><span class="mi">12</span><span class="x">,</span><span class="mi">5</span><span class="x">))</span>
<span class="n">subplot</span><span class="x">(</span><span class="mi">121</span><span class="x">)</span>
<span class="n">PyPlot</span><span class="o">.</span><span class="n">xlabel</span><span class="x">(</span><span class="s">"Epoch"</span><span class="x">)</span>
<span class="n">ylabel</span><span class="x">(</span><span class="s">"Loss"</span><span class="x">)</span>
<span class="n">plot</span><span class="x">(</span><span class="n">loss_h</span><span class="x">)</span>
<span class="n">subplot</span><span class="x">(</span><span class="mi">122</span><span class="x">)</span>
<span class="n">PyPlot</span><span class="o">.</span><span class="n">xlabel</span><span class="x">(</span><span class="s">"Epoch"</span><span class="x">)</span>
<span class="n">ylabel</span><span class="x">(</span><span class="s">"Accuracy"</span><span class="x">)</span>
<span class="n">plot</span><span class="x">(</span><span class="n">accuracy_train</span><span class="x">,</span> <span class="n">label</span><span class="o">=</span><span class="s">"train"</span><span class="x">)</span>
</code></pre></div></div>
<p><img src="https://spcman.github.io/getting-to-know-julia/images/proj003/output_40_0.png" alt="loss accuracy" /></p>
<p>Note, I think some parts of this could be done more elegantly, let me know if anything could be improved (I’m still learning too).</p>Nigel AdamsA simple example of a pre-trained word embedding layer (GloVe) with Julia and FluxJulia Word Embedding with Dracula2019-08-05T00:00:00+00:002019-08-05T00:00:00+00:00https://spcman.github.io/getting-to-know-julia/nlp/word-embeddings<p>According to Wikipedia “Word embedding is the collective name for a set of language modeling and feature learning techniques in natural language processing (NLP) where words or phrases from the vocabulary are mapped to vectors of real numbers”.</p>
<p>It is word vectors that make technologies such as speech recognition and machine translation possible. The algorithms to create them come from the likes of Google’s (Word2Vec), Facebook’s (FastText) and Stanford University’s (GloVe). For this notebook we will use a pre-trained embedding file built using GloVe.</p>
<p>The embedding file I used below is <code class="language-plaintext highlighter-rouge">glove.6B.50d.txt</code>. This file can be downloaded from <a href="https://nlp.stanford.edu/projects/glove/">GloVe</a> and needs to be in the current working folder for this example.</p>
<p>The ideas explored below come from a brilliant GitHub Post <a href="https://gist.github.com/aparrish/2f562e3737544cf29aaf1af30362f469">Understanding word vectors
… for, like, actual poets. By Allison Parrish</a>. This was a Python notebook and I have basically re-written part of it in Julia. Very little credit goes to me!</p>
<p>Let’s begin by loading the packages we will need.</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">using</span> <span class="n">Distances</span><span class="x">,</span> <span class="n">Statistics</span>
<span class="k">using</span> <span class="n">MultivariateStats</span>
<span class="k">using</span> <span class="n">PyPlot</span>
<span class="k">using</span> <span class="n">WordTokenizers</span>
<span class="k">using</span> <span class="n">TextAnalysis</span>
<span class="k">using</span> <span class="n">DelimitedFiles</span>
</code></pre></div></div>
<h2 id="load-the-embeddings">Load the Embeddings</h2>
<p>There is a Julia package, <a href="https://github.com/JuliaText/Embeddings.jl">Embeddings.jl</a>, that can load the embeddings in one or two lines of code, but I couldn’t get the package to install. I figured out the code to load the embeddings by delving into its repository.</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">function</span><span class="nf"> load_embeddings</span><span class="x">(</span><span class="n">embedding_file</span><span class="x">)</span>
<span class="kd">local</span> <span class="n">LL</span><span class="x">,</span> <span class="n">indexed_words</span><span class="x">,</span> <span class="n">index</span>
<span class="n">indexed_words</span> <span class="o">=</span> <span class="kt">Vector</span><span class="x">{</span><span class="kt">String</span><span class="x">}()</span>
<span class="n">LL</span> <span class="o">=</span> <span class="kt">Vector</span><span class="x">{</span><span class="kt">Vector</span><span class="x">{</span><span class="kt">Float32</span><span class="x">}}()</span>
<span class="n">open</span><span class="x">(</span><span class="n">embedding_file</span><span class="x">)</span> <span class="k">do</span> <span class="n">f</span>
<span class="n">index</span> <span class="o">=</span> <span class="mi">1</span>
<span class="k">for</span> <span class="n">line</span> <span class="k">in</span> <span class="n">eachline</span><span class="x">(</span><span class="n">f</span><span class="x">)</span>
<span class="n">xs</span> <span class="o">=</span> <span class="n">split</span><span class="x">(</span><span class="n">line</span><span class="x">)</span>
<span class="n">word</span> <span class="o">=</span> <span class="n">xs</span><span class="x">[</span><span class="mi">1</span><span class="x">]</span>
<span class="n">push!</span><span class="x">(</span><span class="n">indexed_words</span><span class="x">,</span> <span class="n">word</span><span class="x">)</span>
<span class="n">push!</span><span class="x">(</span><span class="n">LL</span><span class="x">,</span> <span class="n">parse</span><span class="o">.</span><span class="x">(</span><span class="kt">Float32</span><span class="x">,</span> <span class="n">xs</span><span class="x">[</span><span class="mi">2</span><span class="o">:</span><span class="k">end</span><span class="x">]))</span>
<span class="n">index</span> <span class="o">+=</span> <span class="mi">1</span>
<span class="k">end</span>
<span class="k">end</span>
<span class="k">return</span> <span class="n">reduce</span><span class="x">(</span><span class="n">hcat</span><span class="x">,</span> <span class="n">LL</span><span class="x">),</span> <span class="n">indexed_words</span>
<span class="k">end</span>
</code></pre></div></div>
<p>The function above takes the embeddings filename as input and returns two arrays:</p>
<ul>
<li>
<p><strong>embeddings</strong> – A Float32 array in which each column represents one word as a d-dimensional vector</p>
</li>
<li>
<p><strong>vocab</strong> – A string array of all the words</p>
</li>
</ul>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">embeddings</span><span class="x">,</span> <span class="n">vocab</span> <span class="o">=</span> <span class="n">load_embeddings</span><span class="x">(</span><span class="s">"glove.6B.50d.txt"</span><span class="x">)</span>
<span class="n">vec_size</span><span class="x">,</span> <span class="n">vocab_size</span> <span class="o">=</span> <span class="n">size</span><span class="x">(</span><span class="n">embeddings</span><span class="x">)</span>
<span class="n">println</span><span class="x">(</span><span class="s">"Loaded embeddings, each word is represented by a vector with </span><span class="si">$</span><span class="s">vec_size features. The vocab size is </span><span class="si">$</span><span class="s">vocab_size"</span><span class="x">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Loaded embeddings, each word is represented by a vector with 50 features. The vocab size is 400000
</code></pre></div></div>
<p>Lost? Don’t worry, hang in there! Let’s see what’s in these arrays by way of some simple functions and examples.</p>
<h2 id="functions-well-need">Functions we’ll need</h2>
<p>The function <code class="language-plaintext highlighter-rouge">vec_idx</code> returns the index position of a given word in the vocab. We can see that “cheese” is the 5796th word.</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">vec_idx</span><span class="x">(</span><span class="n">s</span><span class="x">)</span> <span class="o">=</span> <span class="n">findfirst</span><span class="x">(</span><span class="n">x</span> <span class="o">-></span> <span class="n">x</span><span class="o">==</span><span class="n">s</span><span class="x">,</span> <span class="n">vocab</span><span class="x">)</span>
<span class="n">vec_idx</span><span class="x">(</span><span class="s">"cheese"</span><span class="x">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>5796
</code></pre></div></div>
<p>The function <code class="language-plaintext highlighter-rouge">vec</code> returns the word vector of the given word. Below is the vector for the word “cheese”.</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">function</span><span class="nf"> vec</span><span class="x">(</span><span class="n">s</span><span class="x">)</span>
<span class="k">if</span> <span class="n">vec_idx</span><span class="x">(</span><span class="n">s</span><span class="x">)</span><span class="o">!=</span><span class="nb">nothing</span>
<span class="n">embeddings</span><span class="x">[</span><span class="o">:</span><span class="x">,</span> <span class="n">vec_idx</span><span class="x">(</span><span class="n">s</span><span class="x">)]</span>
<span class="k">end</span>
<span class="k">end</span>
<span class="n">vec</span><span class="x">(</span><span class="s">"cheese"</span><span class="x">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>50-element Array{Float32,1}:
-0.053903
-0.30871
-1.3285
-0.43342
0.31779
1.5224
-0.6965
-0.037086
-0.83784
0.074107
-0.30532
-0.1783
1.2337
⋮
1.9502
-0.53274
1.1359
0.20027
0.02245
-0.39379
1.0609
1.585
0.17889
0.43556
0.68161
0.066202
</code></pre></div></div>
<p>It’s pretty difficult to imagine words in a 50-dimensional space, so let’s have a think about how some word vectors might look in 2 dimensions.</p>
<p><img src="https://spcman.github.io/getting-to-know-julia/images/proj002/word-vectors.png" alt="word vectors" /></p>
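<p>One way to actually get such a 2-D picture from 50-dimensional vectors is to project them onto their first two principal components. Below is a minimal sketch of that projection using only the standard library (an SVD-based PCA; the loaded <code class="language-plaintext highlighter-rouge">MultivariateStats</code> package offers the same via its <code class="language-plaintext highlighter-rouge">PCA</code> model). Random vectors stand in for the real embeddings here:</p>

```julia
using LinearAlgebra, Statistics

# Project d-dimensional word vectors to 2-D via PCA computed with an SVD.
# X has one "word" per column, like the embeddings array in the post,
# but filled with random numbers for this self-contained sketch.
X  = rand(Float32, 50, 100)        # 100 toy words, 50 features each
Xc = X .- mean(X, dims=2)          # centre each feature
U, S, V = svd(Xc)
proj = U[:, 1:2]' * Xc             # 2x100: coordinates ready to scatter-plot
println(size(proj))
```

With the real embeddings matrix in place of <code class="language-plaintext highlighter-rouge">X</code>, each column of <code class="language-plaintext highlighter-rouge">proj</code> gives the (x, y) position of one word in a plot like the one above.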
<p>The words that are closer together have a similar meaning or context, and things get interesting when we measure the distances between word vectors. Let’s define a similarity function using the cosine distance between two word vectors and then test it out (back to 50 dimensions!)</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">cosine</span><span class="x">(</span><span class="n">x</span><span class="x">,</span><span class="n">y</span><span class="x">)</span><span class="o">=</span><span class="mi">1</span><span class="o">-</span><span class="n">cosine_dist</span><span class="x">(</span><span class="n">x</span><span class="x">,</span> <span class="n">y</span><span class="x">)</span>
</code></pre></div></div>
<p>The following cell shows that the cosine similarity between dog and puppy is larger than the similarity between trousers and octopus, thereby demonstrating that the vectors behave how we expect them to.</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">cosine</span><span class="x">(</span><span class="n">vec</span><span class="x">(</span><span class="s">"dog"</span><span class="x">),</span> <span class="n">vec</span><span class="x">(</span><span class="s">"puppy"</span><span class="x">))</span> <span class="o">></span> <span class="n">cosine</span><span class="x">(</span><span class="n">vec</span><span class="x">(</span><span class="s">"trousers"</span><span class="x">),</span><span class="n">vec</span><span class="x">(</span><span class="s">"octopus"</span><span class="x">))</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>true
</code></pre></div></div>
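<p>The same similarity can be computed without the Distances package; cosine similarity is just the dot product of the two vectors divided by the product of their lengths. A from-scratch sketch:</p>

```julia
using LinearAlgebra

# Cosine similarity from first principles: 1 for vectors pointing the
# same way, 0 for orthogonal ones. Equivalent to 1 - cosine_dist(x, y).
cos_sim(x, y) = dot(x, y) / (norm(x) * norm(y))
cos_sim([1.0, 0.0], [1.0, 0.0])   # identical direction
cos_sim([1.0, 0.0], [0.0, 1.0])   # orthogonal
```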
<p>Now let’s define a function to give us a list of nearest neighbouring words.</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">function</span><span class="nf"> closest</span><span class="x">(</span><span class="n">v</span><span class="x">,</span> <span class="n">n</span><span class="o">=</span><span class="mi">20</span><span class="x">)</span>
<span class="n">list</span><span class="o">=</span><span class="x">[(</span><span class="n">x</span><span class="x">,</span><span class="n">cosine</span><span class="x">(</span><span class="n">embeddings</span><span class="err">'</span><span class="x">[</span><span class="n">x</span><span class="x">,</span><span class="o">:</span><span class="x">],</span> <span class="n">v</span><span class="x">))</span> <span class="k">for</span> <span class="n">x</span> <span class="k">in</span> <span class="mi">1</span><span class="o">:</span><span class="n">size</span><span class="x">(</span><span class="n">embeddings</span><span class="x">)[</span><span class="mi">2</span><span class="x">]]</span>
<span class="n">topn_idx</span><span class="o">=</span><span class="n">sort</span><span class="x">(</span><span class="n">list</span><span class="x">,</span> <span class="n">by</span> <span class="o">=</span> <span class="n">x</span> <span class="o">-></span> <span class="n">x</span><span class="x">[</span><span class="mi">2</span><span class="x">],</span> <span class="n">rev</span><span class="o">=</span><span class="nb">true</span><span class="x">)[</span><span class="mi">1</span><span class="o">:</span><span class="n">n</span><span class="x">]</span>
<span class="k">return</span> <span class="x">[</span><span class="n">vocab</span><span class="x">[</span><span class="n">a</span><span class="x">]</span> <span class="k">for</span> <span class="x">(</span><span class="n">a</span><span class="x">,</span><span class="n">_</span><span class="x">)</span> <span class="k">in</span> <span class="n">topn_idx</span><span class="x">]</span>
<span class="k">end</span>
</code></pre></div></div>
<p>Testing this out on the word “wine”, we can see that the words returned are all related. It’s pretty remarkable given the word relationships were ‘learned’ and not specified by a human thesaurus word boffin.</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">closest</span><span class="x">(</span><span class="n">vec</span><span class="x">(</span><span class="s">"wine"</span><span class="x">))</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>20-element Array{String,1}:
"wine"
"wines"
"tasting"
"coffee"
"beer"
"champagne"
"drink"
"taste"
"grape"
"drinks"
"beers"
"bottled"
"gourmet"
"blend"
"chocolate"
"tastes"
"dessert"
"flavor"
"fruit"
"cooking"
</code></pre></div></div>
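<p>The ranking inside <code class="language-plaintext highlighter-rouge">closest</code> can be seen on a toy vocabulary. The 2-D vectors below are made up for illustration, not real GloVe entries:</p>

```julia
using LinearAlgebra

# Rank a toy vocabulary by cosine similarity to a query vector,
# mirroring what `closest` does over the full 400,000-word matrix.
cos_sim(x, y) = dot(x, y) / (norm(x) * norm(y))
toy_vocab = ["wine", "beer", "carpet"]
toy_vecs  = [[0.9, 0.1], [0.8, 0.3], [-0.5, 0.9]]   # made-up vectors
query = [1.0, 0.0]
order = sortperm([cos_sim(v, query) for v in toy_vecs], rev=true)
toy_vocab[order]   # most similar word first
```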
<h2 id="math-on-words">Math on words</h2>
<h3 id="water--frozen">Water + Frozen</h3>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">closest</span><span class="x">(</span><span class="n">vec</span><span class="x">(</span><span class="s">"water"</span><span class="x">)</span> <span class="o">+</span> <span class="n">vec</span><span class="x">(</span><span class="s">"frozen"</span><span class="x">))</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>20-element Array{String,1}:
"water"
"frozen"
"dry"
"dried"
"salt"
"milk"
"oil"
"waste"
"liquid"
"ice"
"freezing"
"covered"
"hot"
"drain"
"food"
"sand"
"sugar"
"soil"
"contaminated"
"cold"
</code></pre></div></div>
<p>Amazingly the list contains ice!</p>
<h3 id="halfway-between-day-and-night">Halfway between Day and Night</h3>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">closest</span><span class="x">(</span><span class="n">mean</span><span class="x">([</span><span class="n">vec</span><span class="x">(</span><span class="s">"day"</span><span class="x">),</span> <span class="n">vec</span><span class="x">(</span><span class="s">"night"</span><span class="x">)]))</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>20-element Array{String,1}:
"night"
"day"
"days"
"weekend"
"morning"
"sunday"
"afternoon"
"saturday"
"came"
"week"
"evening"
"coming"
"next"
"on"
"before"
"hours"
"weeks"
"went"
"hour"
"time"
</code></pre></div></div>
<p>The list contains morning and afternoon!</p>
<h3 id="blue-is-to-sky-as-x-is-to-grass">Blue is to Sky as X is to Grass</h3>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">blue_to_sky</span> <span class="o">=</span> <span class="n">vec</span><span class="x">(</span><span class="s">"blue"</span><span class="x">)</span> <span class="o">-</span> <span class="n">vec</span><span class="x">(</span><span class="s">"sky"</span><span class="x">)</span>
<span class="n">closest</span><span class="x">(</span><span class="n">blue_to_sky</span> <span class="o">+</span> <span class="n">vec</span><span class="x">(</span><span class="s">"grass"</span><span class="x">))</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>20-element Array{String,1}:
"grass"
"green"
"leaf"
"cane"
"bamboo"
"trees"
"grasses"
"tree"
"yellow"
"lawn"
"cotton"
"lawns"
"red"
"pink"
"farm"
"turf"
"vine"
"rubber"
"soft"
"chestnut"
</code></pre></div></div>
<p>Green is there at the top!</p>
<h3 id="man---woman--queen">Man - Woman + Queen</h3>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">closest</span><span class="x">(</span><span class="n">vec</span><span class="x">(</span><span class="s">"man"</span><span class="x">)</span> <span class="o">-</span> <span class="n">vec</span><span class="x">(</span><span class="s">"woman"</span><span class="x">)</span> <span class="o">+</span> <span class="n">vec</span><span class="x">(</span><span class="s">"queen"</span><span class="x">))</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>20-element Array{String,1}:
"queen"
"king"
"prince"
"crown"
"coronation"
"royal"
"knight"
"lord"
"lady"
"ii"
"great"
"majesty"
"honour"
"name"
"palace"
"crowned"
"famous"
"throne"
"dragon"
"named"
</code></pre></div></div>
<p>King = Magic!</p>
<p><img src="https://spcman.github.io/getting-to-know-julia/images/proj002/man-woman-queen-king.png" alt="man woman queen king" /></p>
<h2 id="sentence-similarity-with-dracula">Sentence Similarity with Dracula</h2>
<p>Load the book Dracula by Bram Stoker from this website as plain text - <a href="https://www.gutenberg.org/ebooks/345">https://www.gutenberg.org/ebooks/345</a></p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">txt</span> <span class="o">=</span> <span class="n">open</span><span class="x">(</span><span class="s">"pg345.txt"</span><span class="x">)</span> <span class="k">do</span> <span class="n">file</span>
<span class="n">read</span><span class="x">(</span><span class="n">file</span><span class="x">,</span> <span class="kt">String</span><span class="x">)</span>
<span class="k">end</span>
<span class="n">println</span><span class="x">(</span><span class="s">"Loaded Dracula, length=</span><span class="si">$</span><span class="s">(length(txt)) characters"</span><span class="x">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Loaded Dracula, length=883114 characters
</code></pre></div></div>
<p>The next cell tidies up the book’s text by removing punctuation and line breaks, then splits it into an array of lowercased sentences.</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">txt</span> <span class="o">=</span> <span class="n">replace</span><span class="x">(</span><span class="n">txt</span><span class="x">,</span> <span class="n">r</span><span class="s">"</span><span class="se">\n</span><span class="s">|</span><span class="se">\r</span><span class="s">|_|,"</span> <span class="o">=></span> <span class="s">" "</span><span class="x">)</span>
<span class="n">txt</span> <span class="o">=</span> <span class="n">replace</span><span class="x">(</span><span class="n">txt</span><span class="x">,</span> <span class="n">r</span><span class="s">"[</span><span class="se">\"</span><span class="s">*();!]"</span> <span class="o">=></span> <span class="s">""</span><span class="x">)</span>
<span class="n">sd</span><span class="o">=</span><span class="n">StringDocument</span><span class="x">(</span><span class="n">txt</span><span class="x">)</span>
<span class="n">prepare!</span><span class="x">(</span><span class="n">sd</span><span class="x">,</span> <span class="n">strip_whitespace</span><span class="x">)</span>
<span class="n">sentences</span> <span class="o">=</span> <span class="n">split_sentences</span><span class="x">(</span><span class="n">sd</span><span class="o">.</span><span class="n">text</span><span class="x">)</span>
<span class="n">i</span><span class="o">=</span><span class="mi">1</span>
<span class="k">for</span> <span class="n">s</span> <span class="k">in</span> <span class="mi">1</span><span class="o">:</span><span class="n">length</span><span class="x">(</span><span class="n">sentences</span><span class="x">)</span>
<span class="k">if</span> <span class="n">length</span><span class="x">(</span><span class="n">split</span><span class="x">(</span><span class="n">sentences</span><span class="x">[</span><span class="n">s</span><span class="x">]))</span><span class="o">></span><span class="mi">3</span>
<span class="n">sentences</span><span class="x">[</span><span class="n">i</span><span class="x">]</span><span class="o">=</span><span class="n">lowercase</span><span class="x">(</span><span class="n">replace</span><span class="x">(</span><span class="n">sentences</span><span class="x">[</span><span class="n">s</span><span class="x">],</span> <span class="s">"."</span><span class="o">=></span><span class="s">""</span><span class="x">))</span>
<span class="n">i</span><span class="o">+=</span><span class="mi">1</span>
<span class="k">end</span>
<span class="k">end</span>
<span class="n">sentences</span><span class="x">[</span><span class="mi">1000</span><span class="o">:</span><span class="mi">1010</span><span class="x">]</span>
</code></pre></div></div>
<p>Output of sentences 1000 to 1010:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>11-element Array{SubString{String},1}:
"he seems absolutely imperturbable"
"i can fancy what a wonderful power he must have over his patients"
"he has a curious habit of looking one straight in the face as if trying to read one's thoughts"
"he tries this on very much with me but i flatter myself he has got a tough nut to crack"
"i know that from my glass"
"do you ever try to read your own face?"
"i do and i can tell you it is not a bad study and gives you more trouble than you can well fancy if you have never tried it"
"he says that i afford him a curious psychological study and i humbly think i do"
"i do not as you know take sufficient interest in dress to be able to describe the new fashions"
"dress is a bore"
"that is slang again but never mind arthur says that every day"
</code></pre></div></div>
<p>This next function <code class="language-plaintext highlighter-rouge">sentvec</code> takes the vector of each word in a sentence and returns the mean vector for the whole sentence (a sentinel vector is returned if none of the sentence’s words are in the vocab).</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">function</span><span class="nf"> sentvec</span><span class="x">(</span><span class="n">s</span><span class="x">)</span>
<span class="kd">local</span> <span class="n">arr</span><span class="o">=</span><span class="x">[]</span>
<span class="k">for</span> <span class="n">w</span> <span class="k">in</span> <span class="n">split</span><span class="x">(</span><span class="n">sentences</span><span class="x">[</span><span class="n">s</span><span class="x">])</span>
<span class="k">if</span> <span class="n">vec</span><span class="x">(</span><span class="n">w</span><span class="x">)</span><span class="o">!=</span><span class="nb">nothing</span>
<span class="n">push!</span><span class="x">(</span><span class="n">arr</span><span class="x">,</span> <span class="n">vec</span><span class="x">(</span><span class="n">w</span><span class="x">))</span>
<span class="k">end</span>
<span class="k">end</span>
<span class="k">if</span> <span class="n">length</span><span class="x">(</span><span class="n">arr</span><span class="x">)</span><span class="o">==</span><span class="mi">0</span>
<span class="n">ones</span><span class="x">(</span><span class="kt">Float32</span><span class="x">,</span> <span class="x">(</span><span class="mi">50</span><span class="x">,</span><span class="mi">1</span><span class="x">))</span><span class="o">*</span><span class="mi">999</span>
<span class="k">else</span>
<span class="n">mean</span><span class="x">(</span><span class="n">arr</span><span class="x">)</span>
<span class="k">end</span>
<span class="k">end</span>
</code></pre></div></div>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">sentences</span><span class="x">[</span><span class="mi">101</span><span class="x">]</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>"there was everywhere a bewildering mass of fruit blossom--apple plum pear cherry and as we drove by i could see the green grass under the trees spangled with the fallen petals"
</code></pre></div></div>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">sentvec</span><span class="x">(</span><span class="mi">100</span><span class="x">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>50-element Array{Float32,1}:
0.3447293
0.39965677
-0.054723457
-0.07291292
0.21394199
0.15642972
-0.49596983
-0.24674776
-0.23787305
-0.4288543
-0.314565
-0.18126178
-0.15339927
⋮
0.08461739
-0.20704514
-0.22955278
-0.011368492
0.03529108
0.057512715
-0.0074529666
0.02252327
0.037329756
-0.52179056
-0.076994695
-0.49725753
</code></pre></div></div>
<p>This function returns the n sentences nearest to an input string; no training is involved, just vector arithmetic on the pre-trained embeddings.</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">function</span><span class="nf"> closest_sent</span><span class="x">(</span><span class="n">input_str</span><span class="x">,</span> <span class="n">n</span><span class="o">=</span><span class="mi">20</span><span class="x">)</span>
<span class="n">mean_vec_input</span><span class="o">=</span><span class="n">mean</span><span class="x">([</span><span class="n">vec</span><span class="x">(</span><span class="n">w</span><span class="x">)</span> <span class="k">for</span> <span class="n">w</span> <span class="k">in</span> <span class="n">split</span><span class="x">(</span><span class="n">input_str</span><span class="x">)])</span>
<span class="n">list</span><span class="o">=</span><span class="x">[(</span><span class="n">x</span><span class="x">,</span><span class="n">cosine</span><span class="x">(</span><span class="n">mean_vec_input</span><span class="x">,</span> <span class="n">sentvec</span><span class="x">(</span><span class="n">x</span><span class="x">)))</span> <span class="k">for</span> <span class="n">x</span> <span class="k">in</span> <span class="mi">1</span><span class="o">:</span><span class="n">length</span><span class="x">(</span><span class="n">sentences</span><span class="x">)]</span>
<span class="n">topn_idx</span><span class="o">=</span><span class="n">sort</span><span class="x">(</span><span class="n">list</span><span class="x">,</span> <span class="n">by</span> <span class="o">=</span> <span class="n">x</span> <span class="o">-></span> <span class="n">x</span><span class="x">[</span><span class="mi">2</span><span class="x">],</span> <span class="n">rev</span><span class="o">=</span><span class="nb">true</span><span class="x">)[</span><span class="mi">1</span><span class="o">:</span><span class="n">n</span><span class="x">]</span>
<span class="k">return</span> <span class="x">[</span><span class="n">sentences</span><span class="x">[</span><span class="n">a</span><span class="x">]</span> <span class="k">for</span> <span class="x">(</span><span class="n">a</span><span class="x">,</span><span class="n">_</span><span class="x">)</span> <span class="k">in</span> <span class="n">topn_idx</span><span class="x">]</span>
<span class="k">end</span>
</code></pre></div></div>
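<p>The <code class="language-plaintext highlighter-rouge">cosine</code> function used here measures the similarity of two vectors by the angle between them. If it is not already provided by an earlier package, a minimal hand-rolled version (an assumption, not necessarily the exact implementation used in this post) is:</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code>using LinearAlgebra   # for dot and norm

# Cosine similarity: 1.0 for parallel vectors, 0.0 for orthogonal ones.
cosine(a, b) = dot(a, b) / (norm(a) * norm(b))
</code></pre></div></div>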
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">closest_sent</span><span class="x">(</span><span class="s">"my favorite food is strawberry ice cream"</span><span class="x">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>20-element Array{String,1}:
"we get hot soup or coffee or tea and off we go"
"there is not even a toilet glass on my table and i had to get the little shaving glass from my bag before i could either shave or brush my hair"
"i had for dinner or rather supper a chicken done up some way with red pepper which was very good but thirsty"
"drink it off like a good child"
"no you don't you couldn't with eyebrows like yours"
"oh yes they like the lotus flower make your trouble forgotten"
"this with some cheese and a salad and a bottle of old tokay of which i had two glasses was my supper"
"but lor' love yer 'art now that the old 'ooman has stuck a chunk of her tea-cake in me an' rinsed me out with her bloomin' old teapot and i've lit hup you may scratch my ears for all you're worth and won't git even a growl out of me"
"i know that from my glass"
"i found my dear one oh so thin and pale and weak-looking"
"And I like it not."
"she has more colour in her cheeks than usual and looks oh so sweet"
"i can go with you now if you like"
"make them get heat and fire and a warm bath"
"i felt in my heart a wicked burning desire that they would kiss me with those red lips"
"give me some water my lips are dry and i shall try to tell you"
"oh what a strange meeting and how it all makes my head whirl round i feel like one in a dream"
"so i said:-- you like life and you want life?"
"i had for breakfast more paprika and a sort of porridge of maize flour which they said was mamaliga and egg-plant stuffed with forcemeat a very excellent dish which they call impletata"
"for a little bit her breast heaved softly and her breath came and went like a tired child's"
</code></pre></div></div>
<p>It’s interesting to see the sentences returned - they are indeed mostly similar.</p>
<p>As the sentence similarity function is slow to run, I precomputed an array holding the sentence vector for every sentence.</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">drac_sent_vecs</span><span class="o">=</span><span class="x">[]</span>
<span class="k">for</span> <span class="n">s</span> <span class="k">in</span> <span class="mi">1</span><span class="o">:</span><span class="n">length</span><span class="x">(</span><span class="n">sentences</span><span class="x">)</span>
<span class="n">push!</span><span class="x">(</span><span class="n">drac_sent_vecs</span><span class="x">,</span><span class="n">sentvec</span><span class="x">(</span><span class="n">s</span><span class="x">))</span>
<span class="k">end</span>
</code></pre></div></div>
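<p>The same loop can also be written as a single comprehension:</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code># One sentence vector per sentence, in a single expression.
drac_sent_vecs = [sentvec(s) for s in 1:length(sentences)]
</code></pre></div></div>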
<p>Save the data to files (to avoid recomputing the sentence vectors next time).</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">writedlm</span><span class="x">(</span> <span class="s">"drac_sent_vec.csv"</span><span class="x">,</span> <span class="n">drac_sent_vecs</span><span class="x">,</span> <span class="sc">','</span><span class="x">)</span>
</code></pre></div></div>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">writedlm</span><span class="x">(</span> <span class="s">"drac_sentences.csv"</span><span class="x">,</span> <span class="n">sentences</span><span class="x">,</span> <span class="sc">','</span><span class="x">)</span>
</code></pre></div></div>
<p>Read the files back in. The sentences are read with <code class="language-plaintext highlighter-rouge">'!'</code> as the delimiter, a character not expected as a separator in this text, so each line comes back as a single string even though it may contain commas.</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">sentences</span><span class="o">=</span><span class="n">readdlm</span><span class="x">(</span><span class="s">"drac_sentences.csv"</span><span class="x">,</span> <span class="sc">'!'</span><span class="x">,</span> <span class="kt">String</span><span class="x">,</span> <span class="n">header</span><span class="o">=</span><span class="nb">false</span><span class="x">)</span>
<span class="n">drac_sent_vecs</span><span class="o">=</span><span class="n">readdlm</span><span class="x">(</span><span class="s">"drac_sent_vec.csv"</span><span class="x">,</span> <span class="sc">','</span><span class="x">,</span> <span class="kt">Float32</span><span class="x">,</span> <span class="n">header</span><span class="o">=</span><span class="nb">false</span><span class="x">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>8093×50 Array{Float32,2}:
0.395886 0.136462 0.0393325 … -0.00172208 -0.094155
0.105341 0.298508 -0.108769 -0.11237 0.108809
0.306499 0.372668 0.0499599 0.011585 -0.0269931
0.439134 0.237768 -0.157471 -0.047655 -0.206138
0.479465 0.0339237 0.0574679 -0.0110334 -0.0810052
0.305005 0.236101 -0.167058 … -0.161612 -0.481633
0.274253 -0.103281 -0.0939105 -0.0443089 -0.0691436
0.454941 0.308015 -0.376682 0.118407 -0.017146
0.280243 0.0355603 -0.371213 -0.054871 0.0895917
0.303624 0.24452 -0.259576 -0.0073874 0.372042
0.292713 0.0700706 -0.128396 … -0.0598984 0.0768687
0.427364 0.0626689 -0.00844564 -0.0528361 0.20124
0.42247 0.139159 -0.134028 -0.109309 -0.322777
⋮ ⋱
0.527544 0.0679754 -0.0678955 -0.0834867 -0.141069
0.274218 -0.120684 -0.176243 0.156214 -0.2699
0.364304 0.277423 0.163191 0.00988463 -0.119377
0.386379 0.203583 0.148782 -6.83427e-5 -0.125681
0.0938667 0.214723 0.586457 … -0.0834033 0.454743
-0.66594 -0.6551 0.92148 -0.42447 -0.058735
0.447467 0.25429 -0.151193 -0.0932182 -0.244452
0.215579 0.135113 0.0431876 -0.307311 -0.121217
0.374962 0.121228 -0.172914 -0.106937 -0.301211
0.194821 -0.0167174 -0.0303678 … 0.0276704 0.168872
0.605342 0.221943 0.21447 -0.143455 0.00104976
999.0 999.0 999.0 999.0 999.0
</code></pre></div></div>
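<p>Both <code class="language-plaintext highlighter-rouge">writedlm</code> and <code class="language-plaintext highlighter-rouge">readdlm</code> come from the standard library's <code class="language-plaintext highlighter-rouge">DelimitedFiles</code> module, so the save/load round trip needs an import (assuming it was not already loaded earlier in the notebook):</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code>using DelimitedFiles

# Round trip: write the vectors out, then read them back as a Float32 matrix.
writedlm("drac_sent_vec.csv", drac_sent_vecs, ',')
reloaded = readdlm("drac_sent_vec.csv", ',', Float32, header=false)
</code></pre></div></div>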
<p>Redefine the sentence similarity function to look up rows of the precomputed array instead of recomputing each sentence vector.</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">function</span><span class="nf"> closest_sent_pretrained</span><span class="x">(</span><span class="n">pretrained_arr</span><span class="x">,</span> <span class="n">input_str</span><span class="x">,</span> <span class="n">n</span><span class="o">=</span><span class="mi">20</span><span class="x">)</span>
<span class="n">mean_vec_input</span><span class="o">=</span><span class="n">mean</span><span class="x">([</span><span class="n">vec</span><span class="x">(</span><span class="n">w</span><span class="x">)</span> <span class="k">for</span> <span class="n">w</span> <span class="k">in</span> <span class="n">split</span><span class="x">(</span><span class="n">input_str</span><span class="x">)])</span>
<span class="n">list</span><span class="o">=</span><span class="x">[(</span><span class="n">x</span><span class="x">,</span><span class="n">cosine</span><span class="x">(</span><span class="n">mean_vec_input</span><span class="x">,</span> <span class="n">pretrained_arr</span><span class="x">[</span><span class="n">x</span><span class="x">,</span><span class="o">:</span><span class="x">]))</span> <span class="k">for</span> <span class="n">x</span> <span class="k">in</span> <span class="mi">1</span><span class="o">:</span><span class="n">length</span><span class="x">(</span><span class="n">sentences</span><span class="x">)]</span>
<span class="n">topn_idx</span><span class="o">=</span><span class="n">sort</span><span class="x">(</span><span class="n">list</span><span class="x">,</span> <span class="n">by</span> <span class="o">=</span> <span class="n">x</span> <span class="o">-></span> <span class="n">x</span><span class="x">[</span><span class="mi">2</span><span class="x">],</span> <span class="n">rev</span><span class="o">=</span><span class="nb">true</span><span class="x">)[</span><span class="mi">1</span><span class="o">:</span><span class="n">n</span><span class="x">]</span>
<span class="k">return</span> <span class="x">[</span><span class="n">sentences</span><span class="x">[</span><span class="n">a</span><span class="x">]</span> <span class="k">for</span> <span class="x">(</span><span class="n">a</span><span class="x">,</span><span class="n">_</span><span class="x">)</span> <span class="k">in</span> <span class="n">topn_idx</span><span class="x">]</span>
<span class="k">end</span>
</code></pre></div></div>
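<p>The speed-up comes from swapping a fresh <code class="language-plaintext highlighter-rouge">sentvec</code> call per sentence (which re-splits and re-averages the words every time) for a simple row lookup in the precomputed matrix. A quick way to see the difference (exact timings will vary by machine):</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Compare one query with and without the precomputed matrix.
@time closest_sent("i walked into a door")                             # recomputes every sentence vector
@time closest_sent_pretrained(drac_sent_vecs, "i walked into a door")  # row lookups only
</code></pre></div></div>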
<p>Test it out and the results are instant this time.</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">closest_sent_pretrained</span><span class="x">(</span><span class="n">drac_sent_vecs</span><span class="x">,</span> <span class="s">"i walked into a door"</span><span class="x">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>20-element Array{String,1}:
"with a glad heart i opened my door and ran down to the hall"
"i held my door open as he went away and watched him go into his room and close the door"
"again a shock: my door was fastened on the outside"
"suddenly he called out:-- look madam mina look look i sprang up and stood beside him on the rock he handed me his glasses and pointed"
"then lucy took me upstairs and showed me a room next her own where a cozy fire was burning"
"i keep the key of our door always fastened to my wrist at night but she gets up and walks about the room and sits at the open window"
"just before twelve o'clock i just took a look round afore turnin' in an' bust me but when i kem opposite to old bersicker's cage i see the rails broken and twisted about and the cage empty"
"if he go through a doorway he must open the door like a mortal"
"i went to the door"
"when i came back i found him walking hurriedly up and down the room his face all ablaze with excitement"
"i came back to my room and threw myself on my knees"
"after a few minutes' staring at nothing jonathan's eyes closed and he went quietly into a sleep with his head on my shoulder"
"every window and door was fastened and locked and i returned baffled to the porch"
"i sat down beside him and took his hand"
"bah with a contemptuous sneer he passed quickly through the door and we heard the rusty bolt creak as he fastened it behind him"
"passing through this he opened another door and motioned me to enter"
"Suddenly he called out:-- Look Madam Mina look look I sprang up and stood beside him on the rock he handed me his glasses and pointed."
"just outside stretched on a mattress lay mr morris wide awake"
"i could see easily for we did not leave the room in darkness she had placed a warning hand over my mouth and now she whispered in my ear:-- hush there is someone in the corridor i got up softly and crossing the room gently opened the door"
"i have to be away till the afternoon so sleep well and dream well with a courteous bow he opened for me himself the door to the octagonal room and i entered my bedroom"
</code></pre></div></div>Nigel AdamsMaths on words, word similarity, sentence similarity ... and Dracula?Up and Running! How the website was created2019-08-04T00:00:00+00:002019-08-04T00:00:00+00:00https://spcman.github.io/getting-to-know-julia/firstpost<p>Hello World!</p>
<p>This static website is hosted for free on <a href="https://help.github.com/en/articles/about-github-pages-and-jekyll">GitHub Pages with Jekyll</a></p>
<p>The posts and pages are markdown files (.md) which makes them quick to compose. The super cool thing is that you can save your Notebooks as markdown files from Jupyter and the output file only requires a small amount of editing to make it look good on the Jekyll website.</p>
<p>Big shout out to Michael Rose for the theme <a href="https://mmistakes.github.io/minimal-mistakes/">Minimal Mistakes</a></p>
<p>To get the math equations looking good I use <a href="https://www.mathjax.org/">MathJax</a> in the markdown.</p>Nigel AdamsHello World! My first post and how I created this siteJulia Flux Simple Regression Model2019-08-04T00:00:00+00:002019-08-04T00:00:00+00:00https://spcman.github.io/getting-to-know-julia/deep-learning/fluxsimple<p>Flux is a Neural Network Machine Learning library for the Julia programming language. Flux may be likened to TensorFlow but it shows potential to be easier as there is no additional ‘graphing’ language layer to learn – it’s just plain Julia.</p>
<p>Let’s get started with a simple example.</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">using</span> <span class="n">Distributions</span><span class="x">,</span> <span class="n">PyPlot</span><span class="x">,</span> <span class="n">Random</span><span class="x">,</span> <span class="n">Flux</span>
</code></pre></div></div>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">#Display Flux Version</span>
<span class="k">import</span> <span class="n">Pkg</span> <span class="x">;</span> <span class="n">Pkg</span><span class="o">.</span><span class="n">installed</span><span class="x">()[</span><span class="s">"Flux"</span><span class="x">]</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>v"0.7.2"
</code></pre></div></div>
<p>Generate some data randomly distributed about the polynomial function \(-0.1x^2 + 2x\)</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">f</span><span class="x">(</span><span class="n">x</span><span class="x">)</span> <span class="o">=</span> <span class="o">-</span><span class="mf">0.1</span><span class="o">*</span><span class="n">x</span><span class="o">^</span><span class="mi">2</span> <span class="o">+</span> <span class="mi">2</span><span class="o">*</span><span class="n">x</span>
<span class="n">Random</span><span class="o">.</span><span class="n">seed!</span><span class="x">(</span><span class="mi">1000</span><span class="x">)</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">collect</span><span class="x">(</span><span class="mi">1</span><span class="o">:</span><span class="mi">10</span><span class="x">)</span>
<span class="n">y</span> <span class="o">=</span> <span class="x">[</span><span class="n">f</span><span class="x">(</span><span class="n">i</span><span class="x">)</span> <span class="k">for</span> <span class="n">i</span> <span class="k">in</span> <span class="n">x</span><span class="x">]</span> <span class="o">.+</span> <span class="n">rand</span><span class="x">(</span><span class="n">Normal</span><span class="x">(</span><span class="mi">0</span><span class="x">,</span><span class="mf">0.75</span><span class="x">),</span><span class="mi">10</span><span class="x">)</span>
<span class="c">#Plot f(x) and models using n data points</span>
<span class="n">n</span><span class="o">=</span><span class="mi">100</span>
<span class="n">x_rng</span><span class="o">=</span><span class="kt">LinRange</span><span class="x">(</span><span class="mi">1</span><span class="x">,</span> <span class="mi">10</span><span class="x">,</span> <span class="n">n</span><span class="x">)</span>
<span class="n">figure</span><span class="x">(</span><span class="n">figsize</span><span class="o">=</span><span class="x">(</span><span class="mi">3</span><span class="x">,</span><span class="mi">3</span><span class="x">))</span>
<span class="n">scatter</span><span class="x">(</span><span class="n">x</span><span class="x">,</span><span class="n">y</span><span class="x">)</span>
<span class="n">plot</span><span class="x">(</span><span class="n">x_rng</span><span class="x">,</span><span class="n">f</span><span class="o">.</span><span class="x">(</span><span class="n">x_rng</span><span class="x">),</span> <span class="n">color</span><span class="o">=</span><span class="s">"gray"</span><span class="x">)</span>
<span class="n">show</span><span class="x">()</span>
</code></pre></div></div>
<p><img src="https://spcman.github.io/getting-to-know-julia/images/proj001/output_4_0.png" alt="output" /></p>
<p>The Julia function below takes our ‘random’ data \(x, y\) as inputs and returns one of two trained Flux models. The goal is to predict a fit close to the known polynomial f(x).</p>
<p><strong>Model 1</strong> is the most trivial, with a single dense layer; i.e. \(y = σ.(W * x .+ b)\)</p>
<p><strong>Model 2</strong> has 1 hidden layer with a definable amount of neurons for experimentation</p>
<p>Training is done with the gradient descent optimiser.</p>
<p>NOTE: σ = identity, i.e. the identity activation function (no non-linearity), which is what we want for regression.</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">function</span><span class="nf"> train_model</span><span class="x">(</span><span class="n">x</span><span class="x">,</span> <span class="n">y</span><span class="x">,</span> <span class="n">hl_neurons</span><span class="o">=</span><span class="mi">0</span><span class="x">)</span>
<span class="c"># x must be an `in` × N matrix</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">x</span><span class="err">'</span>
<span class="c"># Create data iterator for 1000 epochs</span>
<span class="n">data_iterator</span> <span class="o">=</span> <span class="n">Iterators</span><span class="o">.</span><span class="n">repeated</span><span class="x">((</span><span class="n">x</span><span class="x">,</span> <span class="n">y</span><span class="x">),</span> <span class="mi">1000</span><span class="x">)</span>
<span class="c"># Set-up model layout</span>
<span class="k">if</span> <span class="n">hl_neurons</span><span class="o">==</span><span class="mi">0</span>
<span class="n">m</span> <span class="o">=</span> <span class="n">Chain</span><span class="x">(</span><span class="n">Dense</span><span class="x">(</span><span class="mi">1</span><span class="x">,</span><span class="mi">1</span><span class="x">),</span> <span class="n">identity</span><span class="x">)</span>
<span class="k">else</span>
<span class="n">m</span> <span class="o">=</span> <span class="n">Chain</span><span class="x">(</span><span class="n">Dense</span><span class="x">(</span><span class="mi">1</span><span class="x">,</span> <span class="n">hl_neurons</span><span class="x">,</span> <span class="n">tanh</span><span class="x">),</span>
<span class="n">Dense</span><span class="x">(</span><span class="n">hl_neurons</span><span class="x">,</span> <span class="mi">1</span><span class="x">,</span> <span class="n">identity</span><span class="x">))</span>
<span class="k">end</span>
<span class="c">#Our loss function to minimize</span>
<span class="n">loss</span><span class="x">(</span><span class="n">x</span><span class="x">,</span> <span class="n">y</span><span class="x">)</span> <span class="o">=</span> <span class="n">sum</span><span class="x">((</span><span class="n">m</span><span class="x">(</span><span class="n">x</span><span class="x">)</span> <span class="o">.-</span> <span class="n">y</span><span class="err">'</span><span class="x">)</span><span class="o">.^</span><span class="mi">2</span><span class="x">)</span>
<span class="n">optimizer</span> <span class="o">=</span> <span class="n">Flux</span><span class="o">.</span><span class="n">Descent</span><span class="x">(</span><span class="mf">0.0001</span><span class="x">)</span>
<span class="n">Flux</span><span class="o">.</span><span class="n">train!</span><span class="x">(</span><span class="n">loss</span><span class="x">,</span> <span class="n">Flux</span><span class="o">.</span><span class="n">params</span><span class="x">(</span><span class="n">m</span><span class="x">),</span> <span class="n">data_iterator</span><span class="x">,</span> <span class="n">optimizer</span><span class="x">)</span>
<span class="k">return</span> <span class="n">m</span>
<span class="k">end</span>
</code></pre></div></div>
<p>Make predictions and plot against our source data. Note, in the example I included 10 neurons.</p>
<p><img src="https://spcman.github.io/getting-to-know-julia/images/proj001/nn_1_10_1.png" alt="Neural Network 1-10-1" /></p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">model</span><span class="o">=</span><span class="n">train_model</span><span class="x">(</span><span class="n">x</span><span class="x">,</span> <span class="n">y</span><span class="x">)</span>
<span class="n">y_linear</span><span class="o">=</span><span class="n">reshape</span><span class="x">(</span><span class="n">model</span><span class="x">(</span><span class="n">x</span><span class="err">'</span><span class="x">)</span><span class="o">.</span><span class="n">data</span><span class="x">,</span> <span class="n">length</span><span class="x">(</span><span class="n">x</span><span class="x">),)</span>
<span class="n">model</span><span class="o">=</span><span class="n">train_model</span><span class="x">(</span><span class="n">x</span><span class="x">,</span> <span class="n">y</span><span class="x">,</span> <span class="mi">10</span><span class="x">)</span>
<span class="n">y_hid</span><span class="o">=</span><span class="n">reshape</span><span class="x">(</span><span class="n">model</span><span class="x">(</span><span class="n">x_rng</span><span class="err">'</span><span class="x">)</span><span class="o">.</span><span class="n">data</span><span class="x">,</span> <span class="n">n</span><span class="x">,)</span>
<span class="n">figure</span><span class="x">(</span><span class="n">figsize</span><span class="o">=</span><span class="x">(</span><span class="mi">12</span><span class="x">,</span><span class="mi">5</span><span class="x">))</span>
<span class="n">subplot</span><span class="x">(</span><span class="mi">121</span><span class="x">)</span>
<span class="n">scatter</span><span class="x">(</span><span class="n">x</span><span class="x">,</span><span class="n">y</span><span class="x">)</span>
<span class="n">plot</span><span class="x">(</span><span class="n">x_rng</span><span class="x">,</span><span class="n">f</span><span class="o">.</span><span class="x">(</span><span class="n">x_rng</span><span class="x">),</span> <span class="n">color</span><span class="o">=</span><span class="s">"gray"</span><span class="x">,</span> <span class="n">label</span><span class="o">=</span><span class="s">"Source Polynomial f(x)"</span><span class="x">)</span>
<span class="n">plot</span><span class="x">(</span><span class="n">x</span><span class="x">,</span><span class="n">y_linear</span><span class="x">,</span> <span class="n">label</span><span class="o">=</span><span class="s">"Predictions using Dense Layer Model"</span><span class="x">)</span>
<span class="n">legend</span><span class="x">()</span>
<span class="n">subplot</span><span class="x">(</span><span class="mi">122</span><span class="x">)</span>
<span class="n">scatter</span><span class="x">(</span><span class="n">x</span><span class="x">,</span><span class="n">y</span><span class="x">)</span>
<span class="n">plot</span><span class="x">(</span><span class="n">x_rng</span><span class="x">,</span><span class="n">f</span><span class="o">.</span><span class="x">(</span><span class="n">x_rng</span><span class="x">),</span> <span class="n">color</span><span class="o">=</span><span class="s">"gray"</span><span class="x">,</span> <span class="n">label</span><span class="o">=</span><span class="s">"Source Polynomial f(x)"</span><span class="x">)</span>
<span class="n">plot</span><span class="x">(</span><span class="n">x_rng</span><span class="x">,</span><span class="n">y_hid</span><span class="x">,</span> <span class="n">label</span><span class="o">=</span><span class="s">"Predictions using Hidden Layer Model"</span><span class="x">)</span>
<span class="n">legend</span><span class="x">()</span>
<span class="n">show</span><span class="x">()</span>
</code></pre></div></div>
<p><img src="https://spcman.github.io/getting-to-know-julia/images/proj001/output_8_0.png" alt="output" /></p>
<p>The introduction of the hidden layer approximates our function well! By the universal approximation theorem, a neural network with a single hidden layer can approximate any continuous function, given enough neurons. I might put this to the test another day.</p>
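<p>As a sanity check on the hidden-layer model's size, the parameter count of the 1-10-1 network can be worked out by hand (a sketch, assuming the 10-neuron configuration used above):</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Chain(Dense(1,10,tanh), Dense(10,1,identity)):
# the hidden layer has a 10x1 weight matrix and 10 biases,
# the output layer a 1x10 weight matrix and 1 bias.
n_params = (1*10 + 10) + (10*1 + 1)   # 31 trainable parameters
</code></pre></div></div>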
<p>The trained parameters of the model can be obtained with <code class="language-plaintext highlighter-rouge">Flux.params(model)</code>. For the 10-neuron model this gives four parameter arrays: a 10×1 weight matrix and 10 biases for the hidden layer, then a 1×10 weight matrix and a single bias for the output layer. These trained weights cannot be mapped back onto the original polynomial coefficients of f(x).</p>Nigel AdamsOverkill - but a simple introduction to Flux