As a first step, I recreated some simple simulations for epidemiologic bread and butter: logistic regression. I'm living dangerously and built the nightly (Julia 0.2) build from source via GitHub (my 2nd or 3rd experience with GitHub - it's handy and simple if you don't care to learn the intricacies).
The goodies:
The first step is to install and load the necessary packages. This is very easy to do using the "Pkg.add" function (similar to R's "install.packages"). Julia has an integrated JIT (just-in-time) compiler (this is where the speed comes from). This means: you type your code, you run it, and Julia compiles it on the fly.
There is no built-in way to generate a Bernoulli variable, so I created one using a loop (this is (relatively) slow! More on that later).
#new generic function rand_bernoulli to generate values from a Bernoulli
# distribution with mean mu
#loopized
function rand_bernoulli(N, mu::Float64)
    x::Array{Int64,1} = Array(Int64, N)
    for i in 1:N
        x[i] = rand() < mu
    end
    x
end
For example:
In [15]: rand_bernoulli(5, .1)
Out[15]:
5-element Array{Int64,1}:
0
0
0
1
0
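For readers coming from other languages, the same uniform-comparison trick looks like this in Python (a quick sketch, not part of the original code; the function name just mirrors the Julia one):

```python
import random

def rand_bernoulli(n, mu):
    # each draw is 1 with probability mu: compare a Uniform(0,1) draw to mu
    return [int(random.random() < mu) for _ in range(n)]

draws = rand_bernoulli(5, 0.1)  # e.g. [0, 0, 0, 1, 0]
```

The comparison-to-uniform approach is language-agnostic, which is why it translates so directly.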
A couple of things to note:
- Julia does well if you declare variable types (e.g. Int64), but it's not essential (the "N" argument, in this example, could be N::Int64, but do you really care that much? No, you don't.)
- Julia becomes grumpy if you declare the wrong type (e.g. rand_bernoulli(5, 1) wouldn't work because it thinks the 1 is an Int64 variable; use 1.0 instead)
These functions a) generate a data set of size N (using the expit function for the means of the Bernoulli variables, so the data meet the model-form assumptions) and b) perform logistic regression (from the GLM package).
#using loops
TheTruth = log(2)

function mk_data(N)
    #be nice to Julia, declare your arrays!
    z = Array(Int64, N)
    x = Array(Int64, N)
    y = Array(Int64, N)
    for i in 1:N
        z[i] = rand_bernoulli(1, 0.5)[1]
        x[i] = rand_bernoulli(1, 1/(1+exp(-(log(1) - log(2)*.5 + log(2)*z[i]))))[1]
        y[i] = rand_bernoulli(1, 1/(1+exp(-(log(1) + log(2)*.5 - TheTruth*.5 - log(2)*z[i] + TheTruth*x[i]))))[1]
    end
    dat = DataFrames.DataFrame(["z"=>z, "x"=>x, "y"=>y]) #found this using "index(data)"
end

function do_logistic(dat)
    res = GLM.glm(:(y ~ x + z), dat, Binomial(), LogitLink())
    GLM.coef(res)
end
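The expit used for those Bernoulli means is just the inverse of the logit link; spelled out as a minimal Python sketch (Python here only as neutral pseudocode - the Julia code above inlines the same formula):

```python
import math

def expit(x):
    # inverse logit: maps any linear predictor onto a probability in (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

# expit(0) = 0.5, and expit(log(2)) = 2/3: an odds of 2 is a probability of 2/3
p = expit(math.log(2))
```

Because expit always lands in (0, 1), you can plug any linear combination of covariates into it and get a valid Bernoulli mean, which is exactly what mk_data does.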
The wrapper function ties it all together - by nesting functions in this way, the Julia compiler gets to do a lot of work (since it compiles functions, rather than open code) and remains happy. The sum_results function just organizes the simulation results into some (not very beautiful, admittedly) readable output.
function rep_logistic(N, iter)
    res = Array(Float64, iter, 3)
    for i in 1:iter
        dat::DataFrame = mk_data(N)
        res[i,1:3] = do_logistic(dat)
    end
    #coefficient order from glm is: intercept, x, z
    DataFrames.DataFrame([res], ["b0", "bx", "bz"])
end
function sum_results(df::DataFrame, parm::Symbol, truth::Float64=TheTruth)
    mn = mean(df[parm])
    bias = mn - truth
    sd = std(df[parm])
    mse = bias*bias + sd*sd
    DataFrames.DataFrame(["TheTruth", "Mean", "Std", "Bias", "MSE"], [truth, mn, sd, bias, mse])
end
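The arithmetic in sum_results is the usual bias-variance bookkeeping: the reported MSE is squared bias plus variance. The same summary in Python (the name summarize is mine, not from the post's Julia code):

```python
import statistics

def summarize(estimates, truth):
    # the empirical MSE of an estimator decomposes as squared bias plus variance,
    # which is exactly what sum_results computes from the simulation replicates
    mn = statistics.mean(estimates)
    bias = mn - truth
    sd = statistics.stdev(estimates)
    return {"truth": truth, "mean": mn, "sd": sd, "bias": bias,
            "mse": bias**2 + sd**2}
```

An unbiased but noisy estimator and a slightly biased but precise one can end up with the same MSE - that's why the simulation reports both pieces separately.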
Now here are better methods for a couple of functions: Julia (like R) likes vectors, so operating elementwise on vectors (rather than looping over values) will usually be faster.
#vectorized
function rand_bernoulli(N, mu::Array{Float64,1})
    x::Array{Int64,1} = (.<(rand(N), mu))
    x
end

function mk_data(N, dummy::Int64)
    z::Array{Int64,1} = rand_bernoulli(N, 0.5)
    x::Array{Int64,1} = rand_bernoulli(N, 1/(1+exp(-(log(1) - log(2)*.5 + .*(log(2),z)))))
    y::Array{Int64,1} = rand_bernoulli(N, 1/(1+exp(-(log(1) + log(2)*.5 - TheTruth*.5 - .*(log(2),z) + .*(TheTruth,x)))))
    dat = DataFrames.DataFrame(["z"=>z, "x"=>x, "y"=>y])
end
In the rand_bernoulli case, the only difference in the function call is that the second argument must be an array. rand_bernoulli can now be called in two different ways (i.e. this function definition didn't overwrite the other one). This means that

rand_bernoulli(2, .2) and
rand_bernoulli(2, [.2, .2])

will give the same kind of result (a length-2 vector of 1/0 variables), but actually call two different functions (or two different versions of the generic function).
This behavior can be mimicked with a dummy argument, as shown in the vectorized mk_data function above, where the second argument just lets Julia know that I want the data-generating function that doesn't get all loopy.
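Julia picks the method from the argument types automatically. Python can only roughly approximate this with functools.singledispatch, which dispatches on the type of the first argument only - so in this sketch (all names hypothetical, and the mean comes first, unlike the Julia version) the flavor is similar but not the real thing:

```python
from functools import singledispatch
import random

@singledispatch
def bernoulli(mu, n=1):
    # fallback when no registered type matches
    raise TypeError("mu must be a float or a list of floats")

@bernoulli.register(float)
def _(mu, n=1):
    # scalar mean: n draws, all with probability mu (the "loopized" version)
    return [int(random.random() < mu) for _ in range(n)]

@bernoulli.register(list)
def _(mu, n=None):
    # vector of means: one draw per element (the "vectorized" version)
    return [int(random.random() < m) for m in mu]
```

Calling bernoulli(0.2, 2) and bernoulli([0.2, 0.2]) runs two different function bodies under one name, which is the same trick Julia's generic functions pull off natively.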
There was a large difference between these two before I edited the functions to include specific variable types (i.e. I changed the generated data to Int64 variables instead of the default Float64, which shaved the difference down quite a bit). But there is still some difference (the vectorized functions are faster):
@time results = rep_logistic(1000, 2000)
@time results2 = rep_logistic2(1000, 2000)
print(sum_results(results, :(bx)))
print(sum_results(results2, :(bx)))
elapsed time: 2.950533386 seconds (1073743064 bytes allocated)
elapsed time: 2.45951474 seconds (966245824 bytes allocated)
5x2 DataFrame:
x1 x2
[1,] "TheTruth" 0.693147
[2,] "Mean" 0.698213
[3,] "Std" 0.130677
[4,] "Bias" 0.00506555
[5,] "MSE" 0.0171022
5x2 DataFrame:
x1 x2
[1,] "TheTruth" 0.693147
[2,] "Mean" 0.695558
[3,] "Std" 0.134239
[4,] "Bias" 0.00241109
[5,] "MSE" 0.0180258
This difference will be magnified if you are simulating more complex data from more interesting distributions than the Bernoulli. Hey - it's easier than flipping a coin 2 million times!
The take home for me:
- Julia + the IPython notebook makes interactive Julia really, really pleasant
- Julia has some really neat features that are probably a bit beyond my usual needs
- I want to translate more of my basic proof-of-concept programs into Julia
- Julia is not going to save me any time until there are more stat packages
- I don't care - learning is fun!


