Saturday, December 12, 2020

Sanity of mindset to explore AI/ Machine / Deep Learning things

Between my many projects at work and my hobby explorations (let's say, attempts to learn), I try to keep up with the many new AI/ML/DL (Artificial Intelligence, Machine Learning, Deep Learning) papers, algorithms, libraries, and tools popping up like greens sprouting out of melting glaciers as temperatures rise in the polar zones. Mostly I use either Google Colab (which has session time limits) or my own open-source DrSnowbird Jupyter ML container (running on your own machine, with no timeout), creating many Jupyter notebooks to try out those new ML/DL algorithms, frameworks, papers, and so on. At some point I will share the hundreds of Jupyter notebooks I have created or transcribed from other papers and algorithms. But let me stay focused on the topic: for now, I am just like many of my fellow engineers and scientists, trying to keep up with or grow my "state-of-the-art" knowledge and skills in AI/ML/DL, partly to "try not to become obsolete" and partly out of curiosity about "what is this new tool or algorithm actually about?"

Well, I often lose myself in the exploding galaxy of new AI/ML/DL things popping up daily, and after a while of chasing the endless stream of new algorithms, papers, and libraries I always have to bring my mind back to sanity. I eventually built up a set of "guidelines" to frame my "base mindset" for exploring any new thing, not just the ML/DL ones, as a dumb-and-simple way to avoid getting lost along the "exploration." That exploration ranges from NLP (natural language processing, e.g., language understanding, semantics, question answering) to CV (computer vision: object tracking/recognition, large-scale processing) to graph-based deep learning, sometimes called deep graph learning, which, much as Transformers learn from sub-words in NLP and CNNs learn feature abstractions from images the way human visual perception does, tries to learn and predict patterns in networks such as social networks or messaging flows. Sorry, let me get back to the topic again.

Like many other fellow scientists and engineers, I first maintain a base camp of knowledge: algebra (equations, matrices), numerical analysis (your old pals like finite differences, Newton's method, and first- and second-order derivatives), and basic probability concepts. When in doubt, you can use online mathematics tools to break down the fancy, complicated equations in a paper, plot them, or try them out dumb-and-simple with just a few actual numbers plugged into the variables. The most important thing for me is not to memorize the equations in an algorithm or paper; I always try to understand the fancy equations in terms of simple "common sense." For example, since deep learning commonly uses cross-entropy loss over discrete items, I will take pen and paper and three numbers and work it out by hand. Then I will build a simple Python, Java, or Excel spreadsheet example with just a few numbers (the human mind is good at comprehending about 3 to 5 numbers, unless your brain is more specialized). There are many good blogs explaining AI/ML/DL terms such as entropy loss, some with visual explanations. Still, once you can comprehend the "common sense" of it, you can drop the specific equations, because you now carry a common-sense interpretation and a conceptual flow in your mind of how and why it works the way it should.
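
As a concrete illustration of that pen-and-paper exercise, here is a minimal sketch in Python (my own toy numbers, not taken from any particular paper) computing the cross-entropy loss for a single example with three classes:

    import math

    # True class distribution: the example belongs to class 1 (one-hot).
    p = [0.0, 1.0, 0.0]

    # The model's predicted probabilities for the three classes.
    q = [0.2, 0.7, 0.1]

    # Cross-entropy: H(p, q) = -sum_i p_i * log(q_i).
    # Only the true-class term survives, so this is just -log(0.7).
    loss = -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)
    print(f"cross-entropy loss = {loss:.4f}")  # about 0.3567

    # Common-sense check: a more confident correct prediction gives a smaller loss.
    q_confident = [0.05, 0.9, 0.05]
    print(f"more confident prediction -> loss = {-math.log(q_confident[1]):.4f}")  # about 0.1054

Three numbers on paper are enough to see why the loss punishes an unconfident (or wrong) prediction; after that, the general equation is just the same idea summed over classes and averaged over examples.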

This principle is the most important one for learning any algorithm, not just AI/ML/DL algorithms. If you understand the common-sense concept of why and how it works, you will be able to go into the algorithm later and change or customize it when it is not working the way you want. For example, you may apply a pre-trained deep learning model to your own data and find it does not work as it should; that is the time to recall your "common-sense understanding" as your weapon for figuring out why, and how to fix the problem. Knowing how it works, why it works, and why it should not work in a given situation is the most important principle, far more than knowing how to use ready-made code or even ML/DL code you wrote yourself. Yet another example: when your ML/DL algorithm does not converge, or does not reach higher precision after all your trials of changing hyperparameters, it is time to apply your basic understanding to each part of the algorithm and its computations and let common sense help you figure out why it is not working.
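
As one toy example of that kind of reasoning (my own made-up setup, not anyone's real training run), plain gradient descent on a simple quadratic already shows how a single hyperparameter can be the whole story behind "it does not converge":

    def gradient_descent(lr, steps=20):
        """Minimize f(x) = x^2 (gradient 2x) starting from x = 5."""
        x = 5.0
        for _ in range(steps):
            x = x - lr * 2.0 * x
        return x

    # Common sense: for f(x) = x^2 the update is x <- (1 - 2*lr) * x,
    # so whenever |1 - 2*lr| > 1 (e.g., lr = 1.1) the iterates blow up.
    print(gradient_descent(lr=0.1))   # shrinks toward 0: converging
    print(gradient_descent(lr=1.1))   # grows without bound: diverging

The same back-to-basics reasoning, just at a much larger scale, is usually what rescues a real model whose loss curve refuses to go down.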

Not meaning to overwhelm some of you with terms: in short, I always use simple-and-dumb small examples together with the basic maths I already master (or at least know). Humans are very good at "induction thinking." Once you comprehend the common sense of how something really works, your amazing brain (like those of your fellow engineers and scientists) will naturally generalize it back up to equations with whatever fancy Greek symbols appear in the technical papers. With this simple mindset as a first principle, I, at least, have been able to work through new algorithms one by one even though I cannot remember all those fancy, complex Greek symbols and equations. Once you build the habit of using this "dumb-and-simple" first principle, as I did, you will not be intimidated by any new paper's fancy Greek equations.

With that mindset and your base camp of foundational mathematics from college or even high school (depending on what you are trying to comprehend), my next principle is to "identify what this thing (new algorithm, technical paper) is really about." For example, there are many pre-trained DL models; before I invest my time in learning the details and the coding of how to use one, I control my impulse to just load it and learn how to use it. Instead, I apply a simple first level of catalog "type" buckets to what this new thing is about: NLP, image, general data (binary; unstructured, semi-structured, or fully structured text; graph, i.e., network-style data where one thing has some link or relation to another), voice, signal streams (musical), video streams, and so on. The next mental bucket is what this new thing is about from the ML/DL aspect: feature processing, time-series-related classification, a new kind of ML/DL algorithm, a precision improvement for some type of ML/DL algorithm, tuning of pre-trained models, an application to some specific domain, and so on. When in doubt, I use something like Google's online ML/DL glossary to guide me in building my own multi-level index tree of "buckets that label what the new thing is about."
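
To make the bucket idea concrete, here is a minimal sketch of such a two-level index in Python (the bucket names and entries are just my own illustration, not any standard taxonomy):

    # A tiny two-level catalog: data-type bucket -> ML/DL-aspect bucket -> entries.
    catalog = {
        "NLP": {"pre-trained models": ["BERT"], "question answering": []},
        "Image": {"object recognition": ["that CNN paper I skimmed last week"]},
        "Graph": {"deep graph learning": []},
    }

    def file_new_thing(data_type, aspect, name):
        """Drop a new paper/library/model into its bucket, creating buckets as needed."""
        catalog.setdefault(data_type, {}).setdefault(aspect, []).append(name)

    file_new_thing("NLP", "question answering", "some new QA benchmark paper")
    for data_type, aspects in catalog.items():
        for aspect, entries in aspects.items():
            print(f"{data_type} / {aspect}: {entries}")

The exact buckets matter much less than the habit of filing every new thing somewhere before deciding whether it deserves deeper study.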

Thirdly, I ask myself, where possible, to identify when I should use the new thing and when I should not, what its limitations are, and what assumptions it makes. This gives me some knowledge for explaining which ML/DL algorithms are suitable for solving my ML/DL problems, and why.

In all, I have found that the simple-and-dumb mindset above lets me navigate the exploration of the ever-growing, exponentially exploding sky of new AI/ML/DL algorithms, libraries, papers, and so on. In fact, I apply these simple-and-dumb principles before starting a journey into any other knowledge area as well, to avoid pushing myself into insanity. Some of you may already have a similar mindset for handling the exploration journey we all face daily. Hopefully this will make your journey of exploring new AI/ML/DL knowledge more enjoyable.

Cheers! Enjoy your journey navigating through the new galaxy of exploding AI/ML/DL things.

    DrSnowbird.

Sunday, March 25, 2018

Container-based Computing Platforms Anywhere!


Wanna Run Container-based (Big Data/ML) Computing GUIs & Platforms Anywhere, Accessible from your Tablets or Smartphones?

(updated 2020-12-12)

Over the past years, I have learned from many actual deployments and realized my initial concepts in open-source projects on GitHub and Docker Hub, including 320+ diversified container-based tools and projects of my own: Java and Python AI/ML analytics applications, interactive Machine Learning / Deep Learning notebooks, Ubuntu/CentOS and even HPC computing containers, and containers that actually run as back-end servers, desktop applications (e.g., KNIME, Protege, Eclipse, PyCharm, etc.), and VNC/no-VNC HTML5-based container applications (e.g., the many vnc/no-vnc based containers in my GitHub and Docker Hub). I am personally seeing more adoption by engineers, researchers, and onlookers. As an A.I./ML/DL researcher and practitioner, I have converted many "doubters" who asked, "Does Container technology really work as it promises?" We can't predict the future of new technologies, but we can confirm that available technologies will certainly keep changing, and it is those who adapt who will thrive and rise above! I will continue to create, publish, and evolve my open-source projects, adopting new technologies and adapting to new needs, to make them practically useful for my fellow humans: engineers, scientists, and anyone else.

One new trend in using containers is that the VNC/no-VNC HTML5-based containers have seen growing downloads recently. The vnc/no-vnc based containers among my 320+ projects are becoming more preferred, mostly, perhaps, because they are ubiquitously accessible from anywhere and from any device with an HTML5 web browser. In my Docker Hub site's download statistics, I have seen the trend picking up recently. For example, openkbs/knime-vnc-docker (the web-browser version using vnc/no-vnc HTML5) has grown rapidly from hundreds to 2.5K image downloads, while openkbs/knime-docker (the desktop version using X11) has 50K+ downloads. You might also want to consider exploring those.

   DrSnowbird #QED


Imagine that all you need is some bare-bones OS (Linux, Mac, or Windows) with only a tiny installation of Docker, and, within a few minutes, you can have an array of your favorite tools: IDEs (Eclipse, ScalaIDE, IntelliJ, PyCharm, etc.), programming-language environments (Java 8/9, Python 2/3, Maven, etc.), Big Data / Machine Learning / Analytics tools (R, Weka, KNIME, RapidMiner, OpenRefine, etc.), Machine / Deep Learning environments (Jupyter, Zeppelin, Spark Notebook, etc.) with Spark and/or Hadoop clusters, NLP tools, logic programming (Berkeley BLOG), RDF/OWL (Stanford's Protege, OntoText, Blazegraph), HPC (High-Performance Computing) using Singularity containers, or any other commonly used tools, as a portable, agile software development, prototyping, or testing computing environment.

And your laptop, desktop, or server requires no local installation of any library or dependency that could mess up your host machine's OS files: no conflicting versions of tools and libraries. Most importantly, you get the agility and light weight of Docker-based tools, IDEs, and clusters, and you can even deploy your favorite containers to enterprise container platforms such as Kubernetes, DC/OS, or OpenShift for very large-scale production environments.

My interest and goal is to enable users (developers or anyone else) to do the above by rapidly standing up a full-fledged computing platform on a simple laptop, desktop, server, cluster, or cloud infrastructure with the needed containers, either building them from source (e.g., from GitHub) or using ready-to-run Docker images (e.g., from Docker Hub).

Accessing Big Data Analytics Platform GUI Tools (KNIME, ...) or IDEs (IntelliJ, Eclipse, NetBeans) from your Tablets or Smartphones?

  • VNC / noVNC-based docker containers (Newly launched! 2019)
    • Recently launched a few VNC/noVNC-based containers, including KNIME, Eclipse, and more to come. You can now use those desktop-based tools or IDEs from all kinds of internet-enabled devices or PCs, including iPad/iPad Pro, Pi, or even your large-screen smartphones, to access the KNIME big data platform tools.
    • openkbs/knime-vnc-docker 
    • openkbs/eclipse-photon-vnc-docker
    • (more VNC-based data analytics / ML / AI containers to come).
  • With the newly deployed VNC-based containers in the openkbs Docker repository, you can expand your horizon of using IDE tools or GUI tools for Big Data Analytics or Machine Learning to any device, including iPads, any web-enabled tablet, and smartphones. However, because most of those big data studio tools need a bigger screen, it is recommended that you use larger-screen devices such as an iPad Pro, a Microsoft Surface Pro, or any other similar larger-screen device.

Interested?

You can try them out, and they are all open source!

Overview of the above Open Source Docker Projects

Among the GitHub projects, about 30% are my own unique creations and 70% are forks of other Git projects:
  • Simple Docker Github project templates
    • With the template files docker.env (for variables), Dockerfile, build.sh, and run.sh, you have a working Docker project out of the box. The Bash scripts are written so that you don't need to change anything (unless you want to customize the defaults); you can leave build.sh and run.sh as they are.
    • To build, just run "./build.sh" in a shell.
    • To run, just run "./run.sh" in a shell.
    • You can try it out by git-cloning the "Docker Template" project (git@github.com:DrSnowbird/docker-project-template.git).
  • Basic Dockers
    • Java 8/9 (JDK) + Python (2 or 3) + Maven (3.5) containers
      • As the base container images to enable users to overlay extensions or domain-specific add-on processing.
      • In the github home, just search for "java", "jre", "jdk" and your will see multiple choices.
  • X11 based docker container
    • As the base for X11 desktop applications, e.g., Eclipse, IntelliJ, etc., so the GUI can display on your host computer's screen.
    • In the github home, just search for "x11".
  • IDE docker containers (Eclipse, ScalaIDE, IntelliJ, PyCharm, etc.)
    • In the github home, just search for "eclipse", "IntelliJ", "pycharm", "scala".
  • Spark / Hadoop Cluster / NoSQL etc.
  • RDF/OWL/RDFS/OWLS Database and Tools
  • Big Data Platforms
  • HPC (High-Performance Computing - Super Computers) Docker for Singularity
    • Note that the HPC docker for Singularity is still undergoing frequent revisions.
    • In the github home, just search for "hpc", "singularity"
  • Or, you can browse all the 170+ container-based Docker projects

Limitations

Currently, all the above Docker-based tools / IDEs / projects mainly target Linux-based or Mac OS hosts. For Windows, the automation scripts "build.sh" and "run.sh" do not yet have equivalent PowerShell versions. On Windows, you can still use Docker to launch any of the above containers. And you are welcome to fork the above GIT projects and add PowerShell scripts that do the same automation as run.sh and build.sh.

Thursday, April 13, 2017

Big Data Analytics using TDA as First Step

I recently got involved in learning Topological Data Analysis (TDA) to understand the "shape" of a big data set. In the classical machine learning approach, we often start by exploring the given data set before we even try to form a hypothesis and select the entire pipeline of the machine learning processing workflow. After some initial poking at the data set's format, and maybe collecting some initial domain-specific knowledge about it, we typically try to apply "dimension reduction" techniques such as PCA or SVD to identify the main dominating dimensions. However, when the dimensionality of the given data set is very large, such as the tens of thousands of dimensions in DNA-related analysis, those dimension-reduction algorithms become less effective at providing a visual "shape" of the data set. There are similar studies using various approaches to the reduction of high-dimensional data (see Google Scholar for dimension-reduction research). TDA, using topology theory, provides computational simplicity while preserving the important intra-dimensional and inter-dimensional relations among features, for example when using clustering algorithms over the Twitter challenge data set (? reference) or the Netflix challenge data set (? reference). Prof. Gunnar Carlsson and his students have published papers in this area. You can also check out his talk at the University of Chicago about TDA and the shape of data.
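
To ground the classical side of that comparison, here is a minimal sketch in Python of the kind of dimension reduction the paragraph refers to (synthetic data and scikit-learn's PCA, purely for illustration, nothing taken from the TDA papers themselves):

    import numpy as np
    from sklearn.decomposition import PCA

    # Synthetic stand-in for a wide data set: 500 samples, 1000 features,
    # with extra variance injected along one direction so it dominates.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 1000))
    X[:, 0] += 5 * rng.normal(size=500)

    # Project onto the two dominating principal components.
    pca = PCA(n_components=2)
    X_2d = pca.fit_transform(X)

    print("explained variance ratio:", pca.explained_variance_ratio_)
    print("projected shape:", X_2d.shape)  # (500, 2), ready for a scatter plot

With only a handful of truly dominating directions this works nicely, but as noted above, at DNA-scale dimensionality such linear projections start to lose the finer structure that TDA aims to preserve.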

Based on my initial understanding after studying many TDA technical journal papers published in recent years and many TDA-related YouTube videos, I am forming the opinion that TDA can be a very effective technique for understanding, or viewing, the "shape" of big data, and I also believe that TDA and classical machine learning are complementary to each other.

  - Updated 2020/07/26 #QED
