Optimizing Barnes-Hut simulations for many-core supercomputers using the scalable parallel runtime