Thursday, October 24, 2013

Don't be suboptimal! Optimize with R and C (a tutorial)

The alternate title for this post is "How to skin a cat by starting a company that builds cat-skinning machines." For those of you who are not familiar with that cultural reference, just know that this post includes a way of doing linear regression that is complete overkill.

That said, I recently had to perform an optimization of a non-linear function (if you don't understand what that means, don't worry), and NLopt is a really fast way to do that. I nearly pulled out my hair trying to get NLopt to play nicely with R and julia on my computer, so I thought I'd share the end product to hopefully prevent a few meltdowns among other people who didn't grow up programming or have never taken a class in C.

Anywho, here's the story:

R has lots of ways to farm out work to existing C libraries, which are super fast. One way to perform optimization (for example, maximum likelihood) is to use C libraries that can minimize an arbitrary function (for example, the negative of the likelihood). Smart people have written these algorithms, and, if you have some non-standard estimation to do, the C libraries are a good option in R.

The nloptr package in R interfaces with NLopt, a C library (a collection of different functions written in C) that has some great optimization routines. For example, the Nelder-Mead simplex algorithm was the first non-linear optimization routine on which I cut my teeth. "Non-linear" here just means (*lots of hand waving*) that the objective doesn't have to be linear in its parameters - but the routines also work on simpler problems, such as linear regression. Everyone (well, everyone who still cares at this point) understands linear regression, so it's a good starting point for learning the NLopt routines.

Here are the steps I had to carry out (not necessarily the ones I DID carry out, since those involved heavy drinking and therapy):

  1. Buy a Mac and make yourself an admin user (search Google - I did this a while ago)
  2. Install Xcode command line tools (see above)
  3. Install the NLOPT library on your mac
  4. Install the "nloptr" package in R
  5. Grossly over-code a linear regression in order to understand how to use nloptr

Let's skip to step 3:

Download the nlopt source code file (nlopt-2.3.tar.gz) into /Users/foo/Downloads.

In terminal, run this:

cd /Users/foo/Downloads
tar -xvf nlopt-2.3.tar.gz
cd nlopt-2.3
./configure --enable-shared --prefix="/usr"
make 
sudo make install

The "./configure" statement is key: it tells your computer to set up NLopt as a shared library in the "/usr" folder of your computer. I have used some programs *cough* julia *cough* that could only handle shared libraries with this particular prefix (I'm not sure if R is the same, but try this way of configuring if it doesn't work right away). The sudo command just runs the install step with administrator privileges (you'll be prompted for your password).

Now to the R. This is basic:

library("nloptr")
library("ggplot2")
#help(package="nloptr") #if you didn't know this one
Make the data and define the objective function. In this case, since we are doing linear regression, it's just the normal log-likelihood.
N=100
set.seed(8675309)
x = rnorm(N)
y = rnorm(N, 0 + .15*x)

normal_likelihood = function(beta, x, y){
 mu = beta[1] + beta[2]*x
 sigma=beta[3]
 #contribution of each individual to the log likelihood
 lli <- -log(2*pi)/2 - log(sigma^2)/2 - 1/(2*sigma^2)*(y-mu)^2
 logLik <- sum(lli)
 logLik
 }

mystartvals = c(0, .4, 1)
normal_likelihood(beta=mystartvals, x, y)

This is the log-likelihood of the data, given the model, evaluated at the starting values:
[1] -141.9231
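If you want to sanity-check the likelihood formula outside of R, here is the same per-observation contribution in plain Python (purely illustrative; the names are mine):

```python
import math

def normal_loglik(beta, x, y):
    """Log-likelihood of a linear model with normal errors.

    beta = (intercept, slope, sigma); x and y are plain lists.
    Each observation contributes:
      -log(2*pi)/2 - log(sigma^2)/2 - (y - mu)^2 / (2*sigma^2)
    """
    b0, b1, sigma = beta
    total = 0.0
    for xi, yi in zip(x, y):
        mu = b0 + b1 * xi
        total += (-math.log(2 * math.pi) / 2
                  - math.log(sigma ** 2) / 2
                  - (yi - mu) ** 2 / (2 * sigma ** 2))
    return total

# an observation sitting exactly at its mean, with sigma = 1,
# contributes -log(2*pi)/2 to the log-likelihood
print(normal_loglik((0.0, 0.5, 1.0), [2.0], [1.0]))  # -0.9189385...
```

An observation sitting exactly at its mean with sigma = 1 contributes -log(2*pi)/2 ≈ -0.919, which is why normal log-likelihoods for many observations are big negative numbers.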


Now let's have fun! One way to visualize optimization is by plotting the objective function (the log-likelihood, in this case) over a set of parameter values. I'll use the ggplot2 package and the stat_contour statement to make a contour plot. I usually find contour plots extremely hard to interpret, but this one is easy. This involves a "grid search" for the maximum likelihood estimate, i.e. I evaluate the objective function over a grid of points and then just pick out the grid point that gives me the highest likelihood (or, equivalently, the lowest value of the deviance = -2*logLik, here).

#plotting the log likelihood (assume sigma=1)
alongb1 = seq(-.5, .5, 0.01) 
alongb2 = seq(-.5, .5, 0.01) 

store = matrix(nrow = length(alongb1)*length(alongb2), ncol=3)
idx = 1

for(i in alongb1){
 for(j in alongb2){
 store[idx,] = c(i,j,-2*normal_likelihood(beta=c(i,j, 1), x, y))
  idx=idx+1
 }
}

dat = data.frame(store)
names(dat) = c("beta1", "beta2", "Dev")
str(dat)

#this is the maximum likelihood estimate along our grid of points
mindev = dat[dat$Dev==min(dat$Dev),]

z = ggplot(dat, aes(beta1, beta2, z=Dev)) + stat_contour(binwidth=3.84) + theme_bw() +
 geom_point(data=mindev, aes(beta1, beta2)) + geom_text(data=mindev, aes(beta1, beta2), label=paste0("MLE = (", round(mindev$beta1, 2), ", ", round(mindev$beta2, 2), ")"), vjust=2)
#MLE for beta2, given that beta1 = 0
ggsave("/Users/foo/Desktop/contour.png", z)
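The grid-search recipe itself is tiny; here is its skeleton in pure Python on a toy, exactly-linear dataset (illustrative only; the data and names are mine). With sigma fixed at 1, the grid minimum lands on the true coefficients:

```python
import math

# toy data that is exactly linear: y = 2 + 0.3*x, so the grid minimum
# should land on (2.0, 0.3) whenever the grid contains that point
x = [-1.0, 0.0, 1.0, 2.0]
y = [2.0 + 0.3 * xi for xi in x]

def deviance(b1, b2):
    # -2 * log-likelihood with sigma fixed at 1 (constants kept in)
    ll = sum(-math.log(2 * math.pi) / 2 - (yi - (b1 + b2 * xi)) ** 2 / 2
             for xi, yi in zip(x, y))
    return -2 * ll

grid1 = [round(1.8 + 0.1 * i, 1) for i in range(5)]  # 1.8 .. 2.2
grid2 = [round(0.1 + 0.1 * i, 1) for i in range(5)]  # 0.1 .. 0.5

# evaluate over the grid and keep the point with the lowest deviance
best = min(((b1, b2) for b1 in grid1 for b2 in grid2),
           key=lambda b: deviance(*b))
print(best)  # (2.0, 0.3) - the residuals are all zero there
```

The R version does exactly this, just with a finer grid and a contour plot of the stored deviances on top.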



Note here that I am setting "sigma" to be 1. Since I simulated the data, I know this to be true, but it is also an assumption necessary to make this function plottable (how would you plot in four dimensions?). Here's the result, very nice!
Now let's optimize this function! The point of optimization is to do basically what we did with the plot, so let's break down the steps:
  1. Set up constraints (in our grid search, we decided to look only over the range of beta1 and beta2 on the axes of the plot)
  2. Evaluate the objective function over a range of values (in the grid search, these values were pre-defined, but during optimization they are chosen by the optimization algorithm according to the "shape" of the objective function)
  3. Pick the values of the parameters where the objective function is optimal (minimized or maximized, depending on how you set up the objective function)
Here that is with the "nloptr" package. First, set up starting values and constraints (upper and lower bounds on beta1, beta2, and sigma). Note that most of my bounds are infinite (or negative infinity), except for sigma, which has a lower bound of 0 (you can't have a negative standard deviation!).
#using NLopt
#starting values (repeated from above)
mystartvals = c(0, .4, 1)

#lower bounds 
mylb = c(-Inf, -Inf, 0)
#upper bounds 
myub = c(Inf, Inf, Inf)



Next, define the objective function to minimize (the default is minimization in NLopt). The function's only argument should be the parameters you are trying to minimize over. In this case, those are beta1, beta2, and sigma. Thus, I create a "wrapper" function around my original log-likelihood whose only argument is a vector of length three that corresponds to beta1, beta2, and sigma.

#wrapper function (to avoid the need to pass x and y, and to have a function to be minimized)
normal_likelihood_foropt = function(beta){
 -2*normal_likelihood(beta, x=x, y=y)
}
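This "bake the data in" trick is just a closure. Here's the same idea sketched in Python (illustrative only; the names and toy data are mine):

```python
import math

def make_objective(x, y):
    """Return a deviance function of beta alone, with x and y baked in."""
    def objective(beta):
        b0, b1, sigma = beta
        ll = sum(-math.log(2 * math.pi) / 2 - math.log(sigma ** 2) / 2
                 - (yi - (b0 + b1 * xi)) ** 2 / (2 * sigma ** 2)
                 for xi, yi in zip(x, y))
        return -2 * ll  # deviance: the thing the optimizer minimizes
    return objective

f = make_objective([0.0, 1.0], [0.1, 0.9])
print(f((0.0, 1.0, 1.0)) < f((5.0, 5.0, 1.0)))  # True: better fit, lower deviance
```

The optimizer only ever sees a function of the parameter vector; the data ride along inside the closure, exactly like x and y do in the R wrapper.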


And without further ado, let's minimize!

nloptr(x0 = mystartvals, 
 eval_f = normal_likelihood_foropt, 
 eval_grad_f = NULL, #optional gradient function (needed for some algorithms)
 lb = mylb,
 ub = myub,
 opts = list(algorithm= "NLOPT_LN_NELDERMEAD", maxeval=1000, xtol_rel=10e-8) #options list available by submitting nloptr.print.options()
)


The only thing to note from the above is the "opts" part of the "nloptr" function call. I use the Nelder-Mead simplex algorithm (since it doesn't require me to find any derivatives), and set the xtol_rel value to 10e-8 (careful: in R that is 10 x 10^-8, i.e. 1e-7, which is what the output below reports). This makes sure that the optimization doesn't stop until the value of each parameter changes by less than that tolerance, relative to its absolute value (so it accepts larger error for very large parameter values). "maxeval" caps the algorithm at 1000 function evaluations, so it gives up eventually even if it hasn't successfully minimized the function - easy enough! Because the Nelder-Mead algorithm does not require the objective function's gradient (derivative), it is a true lazy person's algorithm. Hence, my choice. It might also be your choice if there is no good way to calculate a derivative. One tradeoff is that it can take a while to converge (196 iterations, here - good thing NLopt is fast!).
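For intuition about the two stopping rules, here is a tiny derivative-free minimizer in pure Python - a golden-section search, far simpler than Nelder-Mead and not what NLopt runs here, but it stops the same two ways (a relative-x tolerance or an evaluation budget):

```python
import math

def golden_section(f, lo, hi, xtol_rel=1e-7, maxeval=1000):
    """Derivative-free 1-D minimization of f on [lo, hi].

    Stops when the bracket is small relative to |x| (mimicking
    NLopt's xtol_rel, with a floor of 1.0 for tiny x), or after
    maxeval evaluations (mimicking maxeval).
    """
    invphi = (math.sqrt(5) - 1) / 2  # 1/golden-ratio, ~0.618
    a, b = lo, hi
    c, d = b - invphi * (b - a), a + invphi * (b - a)
    fc, fd = f(c), f(d)
    evals = 2
    while evals < maxeval:
        mid = (a + b) / 2
        if (b - a) <= xtol_rel * max(abs(mid), 1.0):
            break  # xtol-style convergence
        if fc < fd:   # minimum lies in [a, d]: shrink from the right
            b, d, fd = d, c, fc
            c = b - invphi * (b - a)
            fc = f(c)
        else:         # minimum lies in [c, b]: shrink from the left
            a, c, fc = c, d, fd
            d = a + invphi * (b - a)
            fd = f(d)
        evals += 1
    return (a + b) / 2

# minimize a one-parameter "deviance": (b - 0.218)^2
est = golden_section(lambda b: (b - 0.218) ** 2, -1.0, 1.0)
print(round(est, 5))  # 0.218
```

Each step evaluates the objective at new points chosen from the shape of the function so far, which is the same spirit (if not the same geometry) as the simplex moves Nelder-Mead makes in three dimensions.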

Here's the output from nloptr.
Call:
nloptr(x0 = mystartvals, eval_f = normal_likelihood_foropt, eval_grad_f = NULL, lb = mylb, ub = myub, opts = list(algorithm = "NLOPT_LN_NELDERMEAD", maxeval = 1000, xtol_rel = 0.0000001))


Minimization using NLopt version 2.3.0 

NLopt solver status: 4 ( NLOPT_XTOL_REACHED: Optimization stopped because xtol_rel or xtol_abs (above) was reached. )

Number of Iterations....: 196 
Termination conditions:  maxeval: 1000 xtol_rel: 0.0000001 
Number of inequality constraints:  0 
Number of equality constraints:    0 
Optimal value of objective function:  280.557409538706 
Optimal value of controls: -0.05436936 0.2180665 0.9839782

 
Now compare with a linear fit (using GLM so we can compare the sigma values)
> summary(glm(y ~ x, family=gaussian))
...
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.05437    0.09956  -0.546   0.5862
x            0.21807    0.10752   2.028   0.0453

(Dispersion parameter for gaussian family taken to be 0.9879727)
...


As you can see, the results are very close for beta1 = (intercept), beta2 = x, and sigma = dispersion parameter (sort of: the dispersion is a variance, and glm divides by n - 2 rather than n, so 0.9839782^2 x 100/98 ≈ 0.98797, matching the output above). Both of these are also very close to the values we found on the grid search, plotted above (assuming sigma=1).

Optimization is worth learning if for no other reason than to help understand some of the black-boxiness that typically goes into models. It's still very black-boxy, but at least, by the time you're done with this process, you've had to think about some of the assumptions your results are based on. Happy optimizing!

Wednesday, October 23, 2013

Post-parallelism: the debacle that was yesterday's post

When you are someone who learns primarily through doing things the wrong way and then troubleshooting, you tend to miss the mark quite a bit. Take, for example, this excerpt from yesterday's post on parallel processing in julia:

#vectorized, parallel
function mk_data_parallel(N, dummy::Int64)
#better way, using vectors, parallel processing
        Z::RemoteRef = @spawn rand_bernoulli(N, 0.5)
        z::Array{Int64,1} = fetch(Z)
        X::RemoteRef = @spawn rand_bernoulli(N, 1/(1+exp(-(log(1) - log(2)*.5 + .*(log(2),z)))))
        x::Array{Int64,1} = fetch(X)                
        Y::RemoteRef = @spawn rand_bernoulli(N, 1/(1+exp(-(log(1) + log(2)*.5 - TheTruth*.5 - .*(log(2),fetch(z)) + .*(TheTruth,fetch(x))))))
        y::Array{Int64,1} = fetch(Y)                
        dat = DataFrames.DataFrame(["z"=>z,"x"=>x,"y"=>y])#found this using "index(data)"
end


Something's wrong here.




(If only life were this easy and the kernel could always just restart - image source: funnyjunk.com)

What I learned from this is how to think about parallel computing - i.e. when it will be beneficial, when it will be useless, and when it will be detrimental or even destructive. This last bit, the destructiveness, shouldn't result from programs that are designed to be idiot-proof, but may result when language designers fail to understand the vast pool of ineptness that is the beginning programmer. 

To understand when parallel computing might be beneficial, it helps to think of real-life situations in which parallel processing would be useful. For this, imagine a set of analogies in which your computer is an office and

  • The manager is the programmer
  • The processors/ processor cores are underlings
  • The memory and hard drive space are the annual and total project budget

Now, consider ways to maximize the potential output under these constraints: you can hire, at most, 3 more underlings; each underling can perform at the same fast speed, but they all work from home and can't communicate with each other except for at weekly meetings; your total budget is nearly infinite, but the annual budget is tight. 

Under these constraints, things that would be poor candidates for parallel processing include: large matrix operations (these exist mainly in memory and thus suck up much of the annual budget), loops in which you are farming out multiple tasks per loop to each processor (the workers only communicate at weekly meetings, so they won't really be working faster as a group - they'll just spend more time twiddling their thumbs between meetings), and matrix operations in which one matrix depends on the matrix defined in the last step (same reasons).

Things that might be good candidates: multiple, independent loops that don't require huge matrices.

Here's an example of julia code that probably won't benefit from parallelization:

function mk_data_parallel(N, dummy::Int64)
#better way, using vectors, parallel processing
        z1 = @spawn rand_bernoulli(N, 0.5)
        z2 = @spawn rand_bernoulli(N, 0.5)
        z3 = @spawn rand_bernoulli(N, 0.5)
        z::Array{Int64,1} = fetch(z1) + fetch(z2) + fetch(z3)
        x::Array{Int64,1} =  rand_bernoulli(N, 1/(1+exp(-(log(1) - log(2)*1.5 + .*(log(2),z)))))
        y::Array{Int64,1} =  rand_bernoulli(N, 1/(1+exp(-(log(1) + log(2)*.5 - TheTruth*.5 - .*(log(2),z) + .*(TheTruth,x)))))
    dat = DataFrames.DataFrame(["z"=>z,"x"=>x,"y"=>y])#found this using "index(data)"
end


This is an almost identical example to yesterday's post, but I have added three "Z" variables. These are independent processes, so it might make sense to send them to different processors. However, they are also (potentially) large matrix processes, so, even if there are three workers operating independently, they are still subject to the annual budget (memory). To put this in clear terms, here is this function paired up against a function that is identical except for the "@spawn" macros (which is what allows this function to have parallel operations).


samples = 100000000
@time for(i in 1:50) mk_data(samples, 1);end
@time for(i in 1:50) mk_data_parallel(samples,1);end



With the result that the parallel process is actually much, much slower with 100,000,000 row matrix processing (yes, I waited a long time for these to finish).

#elapsed time: 479.088011559 seconds (92312591048 bytes allocated)
#elapsed time: 1380.626116475 seconds (96133008500 bytes allocated)



Now look at this profile of recent processor and memory activity to see why:


The left panel shows that the (4-core) processor is running at well below capacity, even though I've farmed out tasks to three of the cores. The right panel shows that the memory is pretty much maxed out. This shows that, while the workers in my office are all operating at below their capacity, they are still eating up the annual budget at incredible rates (I've never actually seen my RAM working this hard). Thus, the speed constraint here is memory, and not processor time, and parallel processing won't see benefits (or, it may be worse, as in my example).

I can't point to why the parallel version is so much worse, but I have a theory: 3 processes all trying to take all of the memory at once forces a sort of triage in memory in which the computer spends a good deal of time just managing all of the simultaneous requests. It's sort of like a localized denial of service attack whereby the sheer magnitude of requests has the effect of slowing down everything (to the point of choking). In real life, the way this plays out is that: a) I am working on a dissertation b) I have results due for a research assistantship c) I am working on a few side-project papers d) I have a marathon coming up and e) something goes wrong with my dissertation that requires focused attention. For me, this results in neglecting my preparation for a-e and just taking a nap. The real challenge of my life is figuring out how to manage the requests so that the overwhelmed sensation never happens. I feel your pain, computer.


It should have been clear to me that this code was not a prime candidate for parallelization, but only now do I see why.

As a side note, something that does seem ripe for parallelization is a series of logistic models run on independent datasets, such as this function (a loop with processor hungry operations performed on independent pieces):
#parallel version
function rep_logistic2parallel(N, iter) #iter should be even
    res = Array(Float64,int(iter/2),3)
    res2 = Array(Float64,int(iter/2),3)
    for i in int(linspace(1, int(iter/2), int(iter/2)))
        #use vectorized method of data generation by including the data-type 
        # which denotes a different type
        dat1::DataFrame = mk_data(N,1)
        dat2::DataFrame = mk_data(N,1)
        res[i,1:3] = remotecall_fetch(1, do_logistic, dat1)
        res2[i,1:3] = remotecall_fetch(2, do_logistic, dat2) #depends on number of processes
    end
    DataFrames.DataFrame([res;res2], ["bz", "bx", "by"])
end


Unfortunately, this was forcing a kernel restart - I thought it was due to parallelization. It turns out, though, that there seems to be a bug in the GLM package, since I get the same error when running just a single model.

Lessons learned

  • To know the benefits of a particular programming choice, come up with some apt analogies that relate to real life situations
  • Don't blame parallelization when there is an error, check the basic components first
  • PEBKAC (problem exists between keyboard and chair) is a well-known acronym for a reason - keep that in mind
  • Take more naps

Monday, October 21, 2013

Julia: sometimes it's just better not to multitask

As a follow up to my last post regarding julia, here is the faster, vectorized version of one of the functions I created last time to make a dataset for use in simulations:

#vectorized
function mk_data(N, dummy::Int64)
        z::Array{Int64,1} = rand_bernoulli(N, 0.5)
        x::Array{Int64,1} = rand_bernoulli(N, 1/(1+exp(-(log(1) - log(2)*.5 + .*(log(2),z)))))
        y::Array{Int64,1} = rand_bernoulli(N, 1/(1+exp(-(log(1) + log(2)*.5 - TheTruth*.5 - .*(log(2),z) + .*(TheTruth,x)))))
    dat = DataFrames.DataFrame(["z"=>z,"x"=>x,"y"=>y])
end

Well - if you have a computer that can handle parallel processing (which, let's face it, if you are reading a write-up about a not-quite-yet-totally-useful programming language, you do - sorry for the syntax), then you can eke just a bit more speed out of some of the functions by utilizing more of your processors.

Here is the even more sped up version of the mk_data function (start up julia with "julia -p n" where n is the number of cores or "workers" you wish (and have) to make available to julia):

#vectorized, parallel
function mk_data_parallel(N, dummy::Int64)
#better way, using vectors, parallel processing
        Z::RemoteRef = @spawn rand_bernoulli(N, 0.5)
        z::Array{Int64,1} = fetch(Z)
        X::RemoteRef = @spawn rand_bernoulli(N, 1/(1+exp(-(log(1) - log(2)*.5 + .*(log(2),z)))))
        x::Array{Int64,1} = fetch(X)                
        Y::RemoteRef = @spawn rand_bernoulli(N, 1/(1+exp(-(log(1) + log(2)*.5 - TheTruth*.5 - .*(log(2),fetch(z)) + .*(TheTruth,fetch(x))))))
        y::Array{Int64,1} = fetch(Y)                
        dat = DataFrames.DataFrame(["z"=>z,"x"=>x,"y"=>y])#found this using "index(data)"
end

How much faster, you ask? Well... a little:
samples = 5000000
@time for(i in 1:20) mk_data(samples, 1);end
@time for(i in 1:20) mk_data_parallel(samples, 1);end
elapsed time: 16.037076406 seconds (15263312584 bytes allocated)
elapsed time: 14.392936369 seconds (15263391460 bytes allocated)


This function makes a nice candidate for the "@spawn" macro in julia. @spawn works by sending away the calculation to one of the "workers" and then immediately starting in on the next process. This involves the extra step of fetching the results from the worker (the 'fetch' command, literally). Think of this as being the equivalent of sending one of your minions to do a job for you - you won't know if it's done (or how much faster they are than you) until you fetch the results from them.
So while the Z vector is being created, julia has already started multitasking and putting together X. This works nicely when these are large vectors. Under a smaller sample (but with more iterations run, to really show the difference):

samples = 50000
@time for(i in 1:2000) mk_data(samples, 1);end
@time for(i in 1:2000) mk_data_parallel(samples, 1);end
elapsed time: 26.610033651 seconds (15298193488 bytes allocated)
elapsed time: 27.616185355 seconds (15313162140 bytes allocated)

The parallel version is slower! The @spawn macro involves a bit of overhead, since it must undertake the task of moving things back and forth from the various workers - when this task is long relative to the time saved by having multiple processes operating simultaneously, then parallel operations may not help and may even cost you efficiency. The takeaway here is that all of the whizbang gadgetry in the world can't save us time if we spend too much time on the very human process of trudging from task to task. Efficiency is multi-dimensional. I'm looking at you: laptop!
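julia's @spawn/fetch pair is what other languages call a "future": start the work, go do something else, collect the answer when you need it. Here's the same pattern in Python's concurrent.futures (just an illustration, not julia; slow_sum is a made-up stand-in):

```python
from concurrent.futures import ThreadPoolExecutor

def slow_sum(n):
    # stand-in for an expensive computation
    return sum(range(n))

with ThreadPoolExecutor(max_workers=2) as pool:
    # "@spawn": hand the work to a worker and keep going
    future = pool.submit(slow_sum, 1_000_000)
    # ...the main thread is free to do other work here...
    # "fetch": block until the worker's answer is ready
    result = future.result()

print(result)  # 499999500000
```

The overhead lesson is the same as above: if slow_sum is cheap, the cost of submitting and fetching outweighs just calling the function directly.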



One implementation note: in order to run code on multiple processes, each process needs to know the functions you are calling. For user-defined functions, this means placing your functions into a file (sample.jl) and then calling it with something like
require("sample.jl")

This automatically makes the code in that file available to all processes. It need only contain the code that you will be calling with "@spawn" (or the remotecall_fetch function shown below). My sample.jl file looks like this:

function rand_bernoulli(N, mu::Float64)
    x::Array{Int64,1}= Array(Int64,N) 
    for i in 1:N
        x[i] = rand() < mu
    end
    x
end
#vectorized
function rand_bernoulli(N, mu::Array{Float64, 1})
    x::Array{Int64,1} = (.<(rand(N),mu))
    x
end
function do_logistic(dat)
    res = GLM.glm(:(y ~ x + z), dat, Binomial(), GLM.LogitLink())
    GLM.coef(res)
end


A second way to do parallel processing with julia is through the remotecall or remotecall_fetch functions. The remotecall function takes a process number (worker number) as an argument, so it is slightly more tunable than @spawn. Here are two versions of the function I used last time: the first is familiar and the second is parallelized. The remotecall_fetch function sends the "do_logistic" call on a data frame to the 1st and 2nd processes. Because the parallelized version uses a smaller loop and more vectorization, it's a slightly more efficient process anyways.

function rep_logistic2(N, iter)
    res = Array(Float64,iter,3)
    for i in 1:iter
        #use vectorized method of data generation by including the data-type 
        # which denotes a different type
        dat::DataFrame = mk_data(N,1)
        res[i,1:3] = do_logistic(dat)
    end
    DataFrames.DataFrame([res], ["bz", "bx", "by"])
end
#parallel version
function rep_logistic2parallel(N, iter) #iter should be even
    res = Array(Float64,int(iter/2),3)
    res2 = Array(Float64,int(iter/2),3)
    for i in int(linspace(1, int(iter/2), int(iter/2)))
        #use vectorized method of data generation by including the data-type 
        # which denotes a different type
        dat1::DataFrame = mk_data(N,1)
        dat2::DataFrame = mk_data(N,1)
        res[i,1:3] = remotecall_fetch(1, do_logistic, dat1)
        res2[i,1:3] = remotecall_fetch(2, do_logistic, dat2) #depends on number of processes
    end
    DataFrames.DataFrame([res;res2], ["bz", "bx", "by"])
end

Oops!


Turns out, I can't show any run-times comparing this last set of functions. I'm on the "nightly build" train with julia, which occasionally means that some features don't work. Before I tried out the previous function, I built an updated julia that apparently introduced a little bug into processes that were working for me in older versions. Such is the life on the bleeding edge!

[Update: I've since re-built julia and re-attained parallel functioning. However, julia is generally allowing me to run each parallelized function 1 time before suffering from a kernel re-start. There are still things to be worked out with parallel computing in julia. This is also one of those concrete signs that julia is not *quite* ready to become a primary language for many projects. Still... progress!]

Verdict for parallelism: sort of hot/not

Wednesday, October 16, 2013

First (stumbling) steps with Julia

Julia is a new programming language. It's built for speed and simplicity. The general consensus seems to be that julia is not quite ready for prime-time, but it has great features and looks to be approaching a sweet spot: nearly as fast as really low level languages with the functionality and simplicity of higher level languages (R, for one).

As a first step, I recreated some simple simulations for epidemiologic bread and butter: logistic regression. I'm living dangerously and built the nightly (julia 0.2) version from source, from github (my 2nd or 3rd experience with github - it's handy and simple if you don't care to learn the intricacies).

The goodies:


The first step is to install and load the necessary packages. This is very easy to do using the "Pkg.add" function (similar to R's "install.packages"). Julia has an integrated JIT (just in time) compiler (this is where the speed comes from). This means: you type your code, you run it, and the first time you run it, Julia spends a little bit of time compiling bits of your code. The next time you run it, it is already in a language that the computer understands (not Julia), so *magic* it runs much faster. As a consequence, Julia code can be sped up by creating lots of functions to handle operations. Simulations have discrete tasks (simulate data, run regression, store results, repeat lots of times, summarize results) that naturally follow this framework.

There is no built-in way to generate a Bernoulli variable, so I created one using a loop (this is (relatively) slow! More on that later).
#new generic function rand_bernoulli to generate values from a Bernoulli
# distribution with mean mu
#loopized
function rand_bernoulli(N, mu::Float64)
    x::Array{Int64,1}= Array(Int64,N) 
    for i in 1:N
        x[i] = rand()<mu
    end
    x
end
For example:
In [15]: rand_bernoulli(5, .1)
Out[15]:
5-element Array{Int64,1}:
 0
 0
 0
 1
 0
A couple of things to note:

  • Julia does well if you declare variable types (e.g. Int64), but it's not essential (the "N" argument, in this example, could be N::Int64, but do you really care that much? No, you don't.)
  • Julia becomes grumpy if you declare the wrong type (e.g. rand_bernoulli(5, 1) wouldn't work because it thinks the 1 is an Int64 variable - use 1.0 instead)


These functions a) generate a data set of size N (using the expit function for the mean of the Bernoulli variables to meet the model form assumptions) and b) perform logistic regression (from the GLM package):
#using loops
TheTruth=log(2)
function mk_data(N)
    #be nice to Julia, declare your arrays!
    z=Array(Int64,N)  
    x=Array(Int64,N) 
    y=Array(Int64,N) 
    for i in 1:N
        z[i] = rand_bernoulli(1, 0.5)[1]
        x[i] = rand_bernoulli(1, 1/(1+exp(-(log(1) - log(2)*.5 + log(2)*z[i]))))[1]
        y[i] = rand_bernoulli(1, 1/(1+exp(-(log(1) + log(2)*.5 - TheTruth*.5 - log(2)*z[i] + TheTruth*x[i]))))[1]
    end
    dat = DataFrames.DataFrame(["z"=>z,"x"=>x,"y"=>y])#found this using "index(data)"
end

function do_logistic(dat)
    res = GLM.glm(:(y ~ x + z), dat, Binomial(), LogitLink())
    GLM.coef(res)
end


The wrapper function does all the work - by nesting functions in this way, the Julia compiler gets to do a lot of work (since it compiles functions, rather than open code) and remains happy. The sum_results function just organizes the simulation results into some (not very beautiful, admittedly) readable output.
function rep_logistic(N, iter)
    res = Array(Float64,iter,3)
    for i in 1:iter
        dat::DataFrame = mk_data(N)
        res[i,1:3] = do_logistic(dat)
    end
    DataFrames.DataFrame([res], ["bz", "bx", "by"])
end

function sum_results(df::DataFrame, parm::Symbol, truth::Float64=TheTruth)
    mn = mean(df[parm])
    bias = mean(df[parm])-truth
    sd = std(df[parm])
    mse = bias*bias + sd*sd
    DataFrames.DataFrame(["TheTruth", "Mean", "Std", "Bias", "MSE"],[truth, mn, sd, bias, mse])
end
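The mse = bias*bias + sd*sd line is the usual decomposition of mean squared error into bias squared plus variance. A quick Python check (illustrative; the identity is exact when the variance divides by n, whereas std above divides by n-1, so sum_results is off by a hair):

```python
# MSE decomposition: mean((est - truth)^2) == bias^2 + variance,
# exactly, when "variance" is the population variance (divide by n)
ests = [0.70, 0.66, 0.72, 0.69]   # made-up estimates
truth = 0.6931

n = len(ests)
mse = sum((e - truth) ** 2 for e in ests) / n
mean = sum(ests) / n
bias = mean - truth
var = sum((e - mean) ** 2 for e in ests) / n

print(abs(mse - (bias ** 2 + var)) < 1e-12)  # True
```

With 2000 simulated estimates, the n vs n-1 distinction is invisible, which is why the sd-based version in sum_results is fine in practice.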

Now here are better methods for a couple of functions: julia (like R) likes vectors, so operating elementwise on vectors (rather than looping over values) will usually be faster.
#vectorized
function rand_bernoulli(N, mu::Array{Float64, 1})
    x::Array{Int64,1}= (.<(rand(N),mu))
    x
end

function mk_data(N, dummy::Int64)
        z::Array{Int64,1} = rand_bernoulli(N, 0.5)
        x::Array{Int64,1} = rand_bernoulli(N, 1/(1+exp(-(log(1) - log(2)*.5 + .*(log(2),z)))))
        y::Array{Int64,1} = rand_bernoulli(N, 1/(1+exp(-(log(1) + log(2)*.5 - TheTruth*.5 - .*(log(2),z) + .*(TheTruth,x)))))
    dat = DataFrames.DataFrame(["z"=>z,"x"=>x,"y"=>y])
end
In the rand_bernoulli case, the only difference in the function call is that the second argument must be an array. "rand_bernoulli" can now be called two different ways (i.e. this function definition didn't overwrite the other one). This means that
rand_bernoulli(2, .2)
and
rand_bernoulli(2, [.2,.2])
will give the same result (a 2 length vector of 1/0 variables), but actually call two different functions (or two different versions of the generic function).
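What julia is doing here is multiple dispatch: one generic function with several methods, chosen by argument types. Python's functools.singledispatch gives a one-argument flavor of the same idea (a sketch only; I've moved mu to the first position so the dispatcher can see it):

```python
import random
from functools import singledispatch

@singledispatch
def rand_bernoulli(mu, n):
    # fallback when no method matches the type of mu
    raise TypeError(f"no method for mu of type {type(mu).__name__}")

@rand_bernoulli.register(float)
def _(mu, n):
    # scalar mu: the same probability for every draw
    return [int(random.random() < mu) for _ in range(n)]

@rand_bernoulli.register(list)
def _(mu, n):
    # vector mu: one probability per draw (here the length comes from mu)
    return [int(random.random() < m) for m in mu]

print(len(rand_bernoulli(0.2, 2)), len(rand_bernoulli([0.2, 0.2], 2)))  # 2 2
```

As a bonus, passing an unregistered type (say, an int) raises a TypeError, which is roughly the Python version of julia's grumpiness about rand_bernoulli(5, 1).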

This dispatch behavior can be mimicked with a dummy argument, as shown in the vectorized mk_data function, where the second argument just lets Julia know that I want the data-generating function that doesn't get all loopy.
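The loop-vs-vectorized contrast is the same one NumPy users see in Python; here are both rand_bernoulli styles sketched with NumPy (purely illustrative - numpy assumed, and the function names are mine):

```python
import numpy as np

def rand_bernoulli_loop(n, mu, rng):
    # loop version: one draw-and-compare at a time (slow in Python, too)
    x = np.empty(n, dtype=np.int64)
    for i in range(n):
        x[i] = rng.random() < mu
    return x

def rand_bernoulli_vec(n, mu, rng):
    # vectorized version: compare a whole array of draws against mu at once
    return (rng.random(n) < mu).astype(np.int64)

rng = np.random.default_rng(8675309)
draws = rand_bernoulli_vec(100_000, 0.5, rng)
print(sorted(set(draws.tolist())), round(float(draws.mean()), 3))
```

Timing the two (with time.perf_counter, say) shows the same pattern as the julia versions: the vectorized call wins by a wide margin as n grows.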

There was a large difference between these two before I edited the functions to include specific variable types (i.e. changed the data generated to Int64 variables instead of the default Float64 - this shaved the difference by quite a bit). But there still is some difference (vectorized functions are faster):
@time results = rep_logistic(1000, 2000)
@time results2 = rep_logistic2(1000, 2000)
print(sum_results(results, :(bx)))
print(sum_results(results2, :(bx)))
elapsed time: 2.950533386 seconds (1073743064 bytes allocated)
elapsed time: 2.45951474 seconds (966245824 bytes allocated)
5x2 DataFrame:
                x1         x2
[1,]    "TheTruth"   0.693147
[2,]        "Mean"   0.698213
[3,]         "Std"   0.130677
[4,]        "Bias" 0.00506555
[5,]         "MSE"  0.0171022
5x2 DataFrame:
                x1         x2
[1,]    "TheTruth"   0.693147
[2,]        "Mean"   0.695558
[3,]         "Std"   0.134239
[4,]        "Bias" 0.00241109
[5,]         "MSE"  0.0180258


This difference will be magnified if you are simulating more complex data from more interesting distributions than the Bernoulli. Hey - it's easier than flipping a coin 2 million times!

The take home for me:
  • Julia + iPython notebook makes interactive julia really, really pleasant
  • Julia has some really neat features that are probably a bit beyond my usual needs
  • I want to translate more of my basic proof-of-concept programs into Julia
  • Julia is not going to save me any time until there are more stat packages
  • I don't care - learning is fun!
Final verdict: hot/not