Sunday, March 25, 2018

Container-based Computing Platforms Anywhere!

Want a container-based computing platform anywhere?

Imagine that all you need is a bare-bones OS (Linux, Mac, or Windows) with only a tiny installation of Docker. Within a few minutes, you can have an array of your favorite tools: IDEs (Eclipse, ScalaIDE, IntelliJ, PyCharm, etc.), programming-language environments (Java 8/9, Python 2/3, Maven, etc.), Big Data / Machine Learning / Analytics tools (R, Weka, KNIME, RapidMiner, OpenRefine, etc.), Machine / Deep Learning notebook environments (Jupyter, Zeppelin, SparkNotebook, etc.) with Spark and/or Hadoop clusters, NLP tools, Logic Programming (Berkeley BLOG), RDF/OWL tools (Stanford's Protege, OntoText, Blazegraph), HPC (High-Performance Computing) using Singularity containers, or any other commonly used tools - all as portable, agile computing environments for software development, prototyping, or testing.

And your laptop, desktop, or server requires no local installation of any library or dependency, so nothing messes up your host machine's OS files - no conflicting versions of tools and libraries. Most importantly, these Docker-based tools, IDEs, and clusters are agile and lightweight, and you can even deploy your favorite containers to enterprise container platforms like Kubernetes, DC/OS, or OpenShift for very large-scale production environments.

My interest and goal is to enable users (developers or anyone) to do the above by rapidly standing up a full-fledged computing platform on a simple laptop, desktop, server, cluster, or cloud infrastructure with the needed containers - either built from source (e.g., from GitHub) or pulled as ready-to-run Docker images (e.g., from Docker Hub).
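
To make this concrete, here is a minimal sketch of that workflow using a publicly available image from Docker Hub (jupyter/base-notebook is used purely for illustration; any ready-to-run image works the same way):

    # Pull a ready-to-run image from Docker Hub - only Docker itself is installed locally
    docker pull jupyter/base-notebook

    # Run it, mapping the notebook's port 8888 to the host
    docker run --rm -p 8888:8888 jupyter/base-notebook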

Interested?

You can try them out - they are all open source!

Overview of the above Open Source Docker Projects

Among the GitHub projects, about 30% are unique creations and 70% are forks of other Git projects:
  • Simple Docker GitHub project templates
    • The template files - docker.env (for variables), Dockerfile, build.sh, and run.sh - give you a working Docker project out of the box. The Bash scripts are written so that you don't need to change anything (unless you want to customize the defaults); you can leave build.sh and run.sh as they are.
    • To build, just run "./build.sh" in a shell.
    • To run, just run "./run.sh" in a shell.
    • You can try it out by git-cloning the Docker template repo (git@github.com:DrSnowbird/docker-project-template.git); a typical session is sketched after this list.
  • Basic Dockers
    • Java 8/9 (JDK) + Python (2 or 3) + Maven (3.5) containers
      • These serve as base container images, letting users overlay extensions or domain-specific add-on processing; see the overlay sketch after this list.
      • In the GitHub home, just search for "java", "jre", or "jdk" and you will see multiple choices.
  • X11 base container
    • The base for X11 desktop applications, e.g., Eclipse, IntelliJ, etc., so their GUIs display on your host computer's screen; a docker run sketch follows this list.
    • In the GitHub home, just search for "x11".
  • IDE containers (Eclipse, ScalaIDE, IntelliJ, PyCharm, etc.)
    • In the github home, just search for "eclipse", "IntelliJ", "pycharm", "scala".
  • Spark / Hadoop Cluster / NoSQL etc.
  • RDF/OWL/RDFS/OWLS Database and Tools
  • Big Data Platforms
  • HPC (High-Performance Computing / supercomputers) Docker for Singularity
    • Note that the HPC Docker for Singularity is still undergoing rapid revision.
    • In the GitHub home, just search for "hpc" or "singularity"; a Singularity usage sketch follows this list.
  • Or, you can browse all 170+ container-based Docker projects.
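
As promised above, here is what a typical session with the Docker project template looks like (a minimal sketch; the image and container names come from the template's defaults, so nothing needs to be edited):

    # Clone the Docker project template (SSH URL from above; an HTTPS clone works too)
    git clone git@github.com:DrSnowbird/docker-project-template.git
    cd docker-project-template

    # Build the Docker image (the script reads docker.env and the Dockerfile)
    ./build.sh

    # Run a container from the freshly built image
    ./run.sh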
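
And here is the overlay idea mentioned under "Basic Dockers": you extend a base image with your own domain-specific add-ons. The base image name below (drsnowbird/jdk-mvn-py3) is an assumption for illustration only - search the GitHub home or Docker Hub as described above for the actual image names:

    # Create a minimal Dockerfile that overlays an add-on onto a Java+Python+Maven base
    # (the base image name "drsnowbird/jdk-mvn-py3" is hypothetical - substitute the real one)
    printf '%s\n' \
      'FROM drsnowbird/jdk-mvn-py3' \
      'RUN pip3 install --no-cache-dir requests' > Dockerfile

    # Build and run the overlaid image
    docker build -t my-overlay .
    docker run --rm -it my-overlay bash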
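
For the X11-based containers, the usual pattern on a Linux host is to share the host's X socket and DISPLAY variable with the container (a sketch; "my-x11-app" is a placeholder for any of the X11/IDE images):

    # Allow local connections to the host's X server
    xhost +local:

    # Run the GUI container with its display forwarded to the host screen
    docker run --rm -it \
      -e DISPLAY=$DISPLAY \
      -v /tmp/.X11-unix:/tmp/.X11-unix \
      my-x11-app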
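
Finally, the Singularity sketch referenced under HPC: Singularity can pull a Docker image and convert it so it runs on HPC clusters where the Docker daemon is not allowed. The image URL below is just an illustration, and the resulting file name and extension vary by Singularity version:

    # Convert a Docker image into a Singularity image
    singularity pull docker://ubuntu:18.04

    # Open an interactive shell inside it (file name depends on your Singularity version)
    singularity shell ubuntu_18.04.sif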

Limitations

Currently, all the above Docker-based tools / IDEs / projects mainly target Linux and macOS. For Windows, the automated scripts "build.sh" and "run.sh" do not yet have equivalent PowerShell versions. You are welcome to fork the above Git projects and add PowerShell scripts that perform the same automation.

Thursday, April 13, 2017

Big Data Analytics using TDA as First Step

I recently got involved with learning Topological Data Analysis (TDA) to understand the "shape" of Big Data sets. In the classical Machine Learning approach, we often start by exploring the given data set before we form a hypothesis and select the entire pipeline of the Machine Learning workflow. After the initial poking at the data set's format, and maybe after collecting some initial domain-specific knowledge about it, we typically apply "dimension reduction" techniques such as PCA or SVD to identify the dominating dimensions. However, when the dimensionality of the given data set is very large - such as the tens of thousands of dimensions in DNA-related analysis - those dimension-reduction algorithms become ineffective at providing a "visual" shape of the data set, for example when using clustering algorithms over the Twitter challenge data set (? reference) or the Netflix challenge data set (? reference). Prof. Gunnar Carlsson and his students have published papers in this area. Also, you can check out his talk at the University of Chicago about TDA - Data Shape.

Based on my initial understanding after studying many TDA technical journal papers published in recent years and many TDA-related YouTube videos, I am forming the opinion that TDA can be a very effective technique for understanding or viewing the "shape" of data in big data analytics, and I also believe that TDA and classical Machine Learning are complementary to each other.
