Wednesday, October 16, 2013

First (stumbling) steps with Julia

Julia is a new programming language built for speed and simplicity. The general consensus seems to be that Julia is not quite ready for prime time, but it has great features and looks to be approaching a sweet spot: nearly as fast as really low-level languages, with the functionality and simplicity of higher-level languages (R, for one).

As a first step, I recreated some simple simulations for an epidemiologic bread-and-butter method: logistic regression. I'm living dangerously and built the nightly (Julia 0.2) build from source via GitHub (my 2nd or 3rd experience with GitHub - it's handy and simple if you don't care to learn the intricacies).

The goodies:
The first step is to install and load the necessary packages. This is very easy to do using the "Pkg.add" function (similar to R's "install.packages"). Julia has an integrated JIT (just-in-time) compiler - this is where the speed comes from. This means: you type your code, you run it, and the first time you run it, Julia spends a little time compiling bits of your code. The next time you run it, the code is already in a language the computer understands (not Julia), so *magic* - it runs much faster. As a consequence, Julia code can be sped up by breaking operations into lots of small functions. Simulations have discrete tasks (simulate data, run regression, store results, repeat lots of times, summarize results) that naturally fit this framework.
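For reference, the setup for the code below is just a couple of lines ("DataFrames" and "GLM" are the package names used later in the post):

```julia
# install once; Pkg.add fetches each package and its dependencies
Pkg.add("DataFrames")
Pkg.add("GLM")

# load them in each session
using DataFrames
using GLM
```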

There is no built-in way to generate a Bernoulli variable, so I created one using a loop (this is (relatively) slow! More on that later).
# new generic function rand_bernoulli to generate values from a Bernoulli
# distribution with mean mu
# loopized
function rand_bernoulli(N, mu::Float64)
    x::Array{Int64,1} = Array(Int64, N)
    for i in 1:N
        x[i] = rand() < mu
    end
    x
end
For example:
In [15]: rand_bernoulli(5, .1)
Out[15]:
5-element Array{Int64,1}:
 0
 0
 0
 1
 0
A couple of things to note:

  • Julia does well if you declare argument types (e.g. Int64), but it's not essential (the "N" argument, in this example, could be N::Int64, but do you really care that much? No, you don't.)
  • Julia becomes grumpy if you declare the wrong type (e.g. rand_bernoulli(5, 1) won't work, because Julia treats the literal 1 as an Int64 while this method expects a Float64 - use 1.0 instead).


These functions (a) generate a data set of size N (using the expit function for the mean of the Bernoulli variables, to meet the model-form assumptions) and (b) perform logistic regression (from the GLM package).
# using loops
TheTruth = log(2)
function mk_data(N)
    # be nice to Julia, declare your arrays!
    z = Array(Int64, N)
    x = Array(Int64, N)
    y = Array(Int64, N)
    for i in 1:N
        z[i] = rand_bernoulli(1, 0.5)[1]
        x[i] = rand_bernoulli(1, 1/(1+exp(-(log(1) - log(2)*.5 + log(2)*z[i]))))[1]
        y[i] = rand_bernoulli(1, 1/(1+exp(-(log(1) + log(2)*.5 - TheTruth*.5 - log(2)*z[i] + TheTruth*x[i]))))[1]
    end
    dat = DataFrames.DataFrame(["z"=>z, "x"=>x, "y"=>y]) # found this using "index(data)"
end

function do_logistic(dat)
    res = GLM.glm(:(y ~ x + z), dat, Binomial(), LogitLink())
    GLM.coef(res)
end


The wrapper function does all the work - by nesting functions in this way, the Julia compiler gets to do a lot of work (since it compiles functions, rather than top-level code) and remains happy. The sum_results function just organizes the simulation results into some (not very beautiful, admittedly) readable output.
function rep_logistic(N, iter)
    res = Array(Float64, iter, 3)
    for i in 1:iter
        dat::DataFrame = mk_data(N)
        res[i,1:3] = do_logistic(dat)
    end
    # columns in GLM's coefficient order: intercept, x, z
    DataFrames.DataFrame([res], ["b0", "bx", "bz"])
end

function sum_results(df::DataFrame, parm::Symbol, truth::Float64=TheTruth)
    mn = mean(df[parm])
    bias = mean(df[parm])-truth
    sd = std(df[parm])
    mse = bias*bias + sd*sd
    DataFrames.DataFrame(["TheTruth", "Mean", "Std", "Bias", "MSE"],[truth, mn, sd, bias, mse])
end

Now here are better methods for a couple of the functions: Julia (like R) likes vectors, so operating elementwise on whole vectors (rather than looping over values) will usually be faster.
#vectorized
function rand_bernoulli(N, mu::Array{Float64, 1})
    x::Array{Int64,1} = (rand(N) .< mu)
    x
end

function mk_data(N, dummy::Int64)
    z::Array{Int64,1} = rand_bernoulli(N, 0.5)
    x::Array{Int64,1} = rand_bernoulli(N, 1 ./ (1 + exp(-(log(1) - log(2)*.5 + log(2) .* z))))
    y::Array{Int64,1} = rand_bernoulli(N, 1 ./ (1 + exp(-(log(1) + log(2)*.5 - TheTruth*.5 - log(2) .* z + TheTruth .* x))))
    dat = DataFrames.DataFrame(["z"=>z, "x"=>x, "y"=>y])
end
In the rand_bernoulli case, the only difference in the function call is that the second argument must be an array. "rand_bernoulli" can now be called in two different ways (i.e. this function definition didn't overwrite the other one). This means that
rand_bernoulli(2, .2)
and
rand_bernoulli(2, [.2,.2])
will give the same kind of result (a length-2 vector of 0/1 values), but actually call two different methods (two different versions of the generic function).
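This is Julia's multiple dispatch at work: one generic function, several methods, and the argument types pick which method runs. A self-contained toy version (the function name doubled is made up for illustration):

```julia
# two methods of one generic function, distinguished by the argument type
doubled(x::Float64) = 2.0 * x             # scalar method
doubled(x::Array{Float64,1}) = 2.0 .* x   # vectorized method

doubled(0.5)     # dispatches to the scalar method
doubled([0.5])   # dispatches to the vector method
```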

This behavior can be mimicked with a dummy argument, as in the vectorized mk_data function above, where the second argument just lets Julia know that I want the data-generating method that doesn't get all loopy.

There was a large difference between these two before I edited the functions to use specific variable types (i.e. I changed the generated data to Int64 variables instead of the default Float64, which shaved the difference down by quite a bit). But there is still some difference (the vectorized functions are faster):
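A note on the listing below: rep_logistic2 was never shown above. Presumably it is just rep_logistic with the call switched to the vectorized mk_data via the dummy argument - a sketch (it reuses the functions defined earlier, so it isn't standalone):

```julia
# same as rep_logistic, but the dummy argument dispatches to the vectorized mk_data
function rep_logistic2(N, iter)
    res = Array(Float64, iter, 3)
    for i in 1:iter
        dat::DataFrame = mk_data(N, 1)  # second argument selects the vectorized method
        res[i,1:3] = do_logistic(dat)
    end
    DataFrames.DataFrame([res], ["b0", "bx", "bz"])
end
```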
@time results = rep_logistic(1000, 2000)
@time results2 = rep_logistic2(1000, 2000)
print(sum_results(results, :bx))
print(sum_results(results2, :bx))
elapsed time: 2.950533386 seconds (1073743064 bytes allocated)
elapsed time: 2.45951474 seconds (966245824 bytes allocated)
5x2 DataFrame:
                x1         x2
[1,]    "TheTruth"   0.693147
[2,]        "Mean"   0.698213
[3,]         "Std"   0.130677
[4,]        "Bias" 0.00506555
[5,]         "MSE"  0.0171022
5x2 DataFrame:
                x1         x2
[1,]    "TheTruth"   0.693147
[2,]        "Mean"   0.695558
[3,]         "Std"   0.134239
[4,]        "Bias" 0.00241109
[5,]         "MSE"  0.0180258


This difference will be magnified if you are simulating more complex data from more interesting distributions than the Bernoulli. Hey - it's easier than flipping a coin 2 million times!

The take home for me:
  • Julia + the IPython notebook makes interactive Julia really, really pleasant
  • Julia has some really neat features that are probably a bit beyond my usual needs
  • I want to translate more of my basic proof-of-concept programs into Julia
  • Julia is not going to save me any time until there are more stat packages
  • I don't care - learning is fun!
Final verdict: hot/not
