A Guide for Optimizing your Data Science Workflow

It is a good idea to use pipx to install flake8 and mypy on your system. This way, you install them once and reuse them across projects. You can point to the pipx install location with the following entries in the user settings.json file:

    {
        "python.linting.flake8Enabled": true,
        "python.linting.flake8Path": "C:\\Users\\username\\.local\\pipx\\venvs\\flake8\\Scripts\\flake8.exe",
        "python.linting.mypyEnabled": true,
        "python.linting.mypyPath": "C:\\Users\\username\\.local\\pipx\\venvs\\mypy\\Scripts\\mypy.exe"
    }

With linting enabled through mypy, flake8, and pylance, you can write code safely and catch bugs even during prototyping.

Formatting

Black

Consistent formatting helps maintain code standards when working in a team. Have you had teammates debate code formatting over a PR? Black is a Python tool that automates code formatting using a set of predefined rules. It is an opinionated auto-formatting library. Black is particularly helpful during development because it breaks complex statements down into a format that is easy to read. The reformatting is deterministic, so users with the same settings get exactly the same formatting, no matter which OS, IDE, or platform the formatter runs on. You can use pipx to set up black on your machine.

Before black formatting:

    def very_important_function(template: str, *variables, file: os.PathLike, engine: str, header: bool = True, debug: bool = False):
        """Applies `variables` to the `template` and writes to `file`."""
        with open(file, 'w') as f:
            ...

After black formatting:

    def very_important_function(
        template: str,
        *variables,
        file: os.PathLike,
        engine: str,
        header: bool = True,
        debug: bool = False,
    ):
        """Applies `variables` to the `template` and writes to `file`."""
        with open(file, "w") as f:
            ...

isort

Black doesn't sort your imports, and users tend to add imports to a project in no particular order. isort (import sort) brings order to this chaos by imposing a hierarchy on imports.
It arranges the imports so that Python standard library imports come first, followed by third-party library imports and then user-defined library imports. Within each category, the imports are sorted in ascending order. This makes it quick to find an import when there are many of them.

Before isort:

    from my_lib import Object

    import os

    from my_lib import Object3

    from my_lib import Object2

    import sys

    from third_party import lib15, lib1, lib2, lib3, lib4, lib5, lib6, lib7, lib8, lib9, lib10, lib11, lib12, lib13, lib14

    import sys

    from __future__ import absolute_import

    from third_party import lib3

    print("Hey")
    print("yo")

After isort:

    from __future__ import absolute_import

    # Python Standard library
    import os
    import sys

    # Third party library
    from third_party import (lib1, lib2, lib3, lib4, lib5, lib6, lib7, lib8,
                             lib9, lib10, lib11, lib12, lib13, lib14, lib15)

    # User defined library/modules
    from my_lib import Object, Object2, Object3

    print("Hey")
    print("yo")

Use this setting in your setup.cfg file to configure isort to work with black. By default, black uses an 88-character line length. Make sure that flake8 and isort are configured with exactly the same value.

    [flake8]
    max-line-length = 88

    [isort]
    line_length = 88

Table of Contents:

Dotenv
Pre-commit
Touch typing
VIM
VS Code extensions

This is a very crucial step, since it is where abstractions such as functions, classes, and modules are designed. Data scientists can learn a lot from a Python developer's workflow during this step.

Dotenv

Let's say you have the following file structure:

    dream_ds_project
    --dev                # Jupyter notebook folder
      --notebook1.py
      --notebook2.py
    --src                # Source code folder
      --module1.py
      --module2.py
    --.env               # Environment variable file
    --setup.cfg          # Configuration file for python tools

You start by prototyping a function test_prototype in dev/notebook1.py. You can then move that function to src/module1.py.
Now, to import this function, set PYTHONPATH in the .env file in the root folder, like this:

    # Set PYTHONPATH to a relative path. In this case, the folder that
    # contains the .env file becomes the root path.
    PYTHONPATH=.

Now you can import test_prototype in dev/notebook1.py as:

    from src.module1 import test_prototype

.env is a special file. You can use it to store sensitive information such as passwords and keys. It should not be part of git commits and should be kept private. Keep two .env files, one for production and one for development.

The production .env file could look like this:

    MONGODB_USER=prod_user
    MONGODB_PWD=prod_pwd
    MONGODB_SERVER=prod.server.com

The development .env file could look like this:

    MONGODB_USER=dev_user
    MONGODB_PWD=dev_pwd
    MONGODB_SERVER=dev.server.com

You can load these variables into your environment using the python-dotenv library. Inside your Python code, you can access them like this:

    from dotenv import load_dotenv
    import os

    # Call the function to read and load the .env file into the local env
    load_dotenv()
    print(os.getenv("MONGODB_SERVER"))
    # prod.server.com  (with the prod .env file)
    # dev.server.com   (with the dev .env file)

This keeps the code identical for prod and dev; you only swap the .env file based on the environment in which the code is running.

Pre-commit

Pre-commit helps verify your git commits. It helps maintain a clean git commit history and provides a mechanism for running user-defined validations before each commit. It has a strong ecosystem, with a plugin for most of the common commit validations you can think of.

Built-in hooks: some of my favorite built-in pre-commit hooks are:

detect-aws-credentials and detect-private-key, which make sure no sensitive information is accidentally included in a commit.

check-added-large-files, which makes sure commits do not include files larger than 1 MB; the limit is controlled with the maxkb argument.
I found this very useful because code files are rarely larger than 1 MB, so the check prevents accidental commits of large data files in a data science workflow.

check-ast, which makes sure that the code is syntactically valid Python.

Install pre-commit:

    poetry add pre-commit --dev

Create a .pre-commit-config.yaml file and add this:

    repos:
    - repo: https://github.com/pre-commit/pre-commit-hooks
      rev: v3.2.0
      hooks:
      - id: detect-aws-credentials
      - id: detect-private-key
      - id: check-added-large-files
        args: ['--maxkb=1000']
      - id: check-ast

Plugins: on top of the built-in hooks, pre-commit supports plugins as well. Some of my favorite plugins are:

Black, which makes sure that the formatting of all committed files follows black conventions.

Mypy, which validates that the static type check has no errors.

Flake8, which ensures the coding standards are observed.

pytest, which makes sure all the tests pass before committing. This is particularly useful for small projects where you do not have a CI/CD setup and testing can be done locally.

    - repo: https://github.com/psf/black
      rev: 20.8b1
      hooks:
      - id: black
        args: ['--check']
    - repo: https://github.com/pycqa/isort
      rev: '5.6.3'
      hooks:
      - id: isort
        args: ['--profile', 'black', '--check-only']
    - repo: https://github.com/pre-commit/mirrors-mypy
      rev: v0.800
      hooks:
      - id: mypy
    - repo: https://gitlab.com/pycqa/flake8
      rev: '3.8.3'
      hooks:
      - id: flake8
        args: ['--config=setup.cfg']
    - repo: local
      hooks:
      - id: pytest-check
        name: pytest-check
        entry: pytest
        language: system
        pass_filenames: false
        always_run: true

As configured here (black with --check and isort with --check-only), pre-commit only reads the files and validates the commit; it does not reformat them or perform any other write operation.
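To make the check-added-large-files hook more concrete, here is a minimal sketch of the kind of size check it performs. It is not the hook's actual implementation: the function name find_large_files and its parameters are hypothetical, it takes an explicit list of paths instead of looking at the files being committed, and the 1000 KB default simply mirrors the --maxkb=1000 setting in the configuration.

```python
import os
import tempfile
from typing import Iterable, List, Tuple


def find_large_files(paths: Iterable[str], max_kb: int = 1000) -> List[Tuple[str, int]]:
    """Return (path, size_in_kb) for every file larger than max_kb.

    Hypothetical helper for illustration only; it checks whatever paths
    it is given rather than the files staged in git.
    """
    too_large = []
    for path in paths:
        # Ceiling division, so a partially filled kilobyte still counts.
        size_kb = -(-os.path.getsize(path) // 1024)
        if size_kb > max_kb:
            too_large.append((path, size_kb))
    return too_large


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        small = os.path.join(tmp, "module1.py")
        big = os.path.join(tmp, "all_data.json")
        with open(small, "wb") as f:
            f.write(b"x" * 2048)            # 2 KB source file
        with open(big, "wb") as f:
            f.write(b"x" * (1500 * 1024))   # 1500 KB data file
        for path, size_kb in find_large_files([small, big]):
            print(f"{os.path.basename(path)} ({size_kb} KB) exceeds 1000 KB.")
```

A validation step would cancel the commit whenever the returned list is non-empty.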
In case of a validation error, pre-commit cancels the commit, and you can go back and fix the error before committing again.

A sample pre-commit failure caused by committing a large data file:

    dream_ds_project > git commit -m "precommit example - failure"
    Detect AWS Credentials...................................................Passed
    Detect Private Key.......................................................Passed
    Check for added large files..............................................Failed
    - hook id: check-added-large-files
    - exit code: 1

    all_data.json (18317 KB) exceeds 1000 KB.

    Check python ast.....................................(no files to check)Skipped
    black................................................(no files to check)Skipped
    mypy.................................................(no files to check)Skipped
    flake8...............................................(no files to check)Skipped

A sample pre-commit success:

    dream_ds_project > git commit -m "precommit example - success"
    Detect AWS Credentials...................................................Passed
    Detect Private Key.......................................................Passed
    Check for added large files..............................................Passed
    Check python ast.........................................................Passed
    black....................................................................Passed
    mypy.....................................................................Passed
    flake8...................................................................Passed
    [master] precommit example - success
    7 files changed, 54 insertions(+), 33 deletions(-)

Touch typing

Touch typing is an essential productivity skill that carries over to any computer task. It is vital if you spend a considerable amount of time in front of a computer. By practicing for just a few minutes every day, you can reach a significant typing speed.

Touch typing is especially important for programmers, since programming involves typing many special characters, and you don't want to concentrate on your keyboard while you are thinking about logic. Keybr is a fantastic website for training touch typing.
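Returning to the Dotenv section, the sketch below illustrates the idea behind loading a .env file into the environment. It is a simplified stand-in written with the standard library only, not python-dotenv's implementation: the function name load_env_file and its override parameter are hypothetical, and it handles only plain KEY=VALUE lines and # comments.

```python
import os
import tempfile


def load_env_file(path: str, override: bool = False) -> dict:
    """Parse KEY=VALUE lines from path and copy them into os.environ.

    Hypothetical helper for illustration; quoting, multi-line values,
    and variable expansion are deliberately not handled.
    """
    loaded = {}
    with open(path, encoding="utf-8") as f:
        for raw_line in f:
            line = raw_line.strip()
            # Skip blank lines, comments, and lines without an assignment.
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            loaded[key.strip()] = value.strip()
    for key, value in loaded.items():
        # Existing environment variables win unless override is requested.
        if override or key not in os.environ:
            os.environ[key] = value
    return loaded


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        env_path = os.path.join(tmp, ".env")
        with open(env_path, "w", encoding="utf-8") as f:
            f.write("# Development settings\n")
            f.write("MONGODB_USER=dev_user\n")
            f.write("MONGODB_SERVER=dev.server.com\n")
        load_env_file(env_path, override=True)
        print(os.getenv("MONGODB_SERVER"))
```

Swapping the file passed to the loader is all it takes to switch between the development and production settings, which is the point of keeping the code itself identical.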

[Paper Summary] Washington University Researchers propose a Deep Learning Model that automates Brain Tumor Classification

Biopsies are usually the first step in diagnosing a case of brain cancer. Surgeons start by removing a thin layer of tissue from the tumor and examining it closely under a microscope for signs of disease. Although biopsies are invasive, the samples collected represent only a small part of the whole tumor. MRI is less invasive but time-consuming, as radiologists have to manually map out the tumor area from the scan before classification.

Introduction to Deep Learning for Self Driving Cars (Part - 2)

Let’s take our neural networks one level deeper and learn about concepts that every expert knows: activation functions, normalization, regularization, and even dropout, which make training more robust so that we can become much more proficient in training neural networks.

Introduction to Deep Learning for Self Driving Cars (Part - 1)

One of the coolest things that happened in the last decade is that Google released a framework for deep learning called TensorFlow. TensorFlow makes all the hard work that we’ve done superfluous, because now you have a software framework with which you can very easily configure and train deep networks, and TensorFlow can run on many machines at the same time. So, in this Medium article, we’ll focus on TensorFlow, because if you become a machine learning expert, these are the tools that people in the trade use every day.

Artificial Intelligence in Air Quality Control

Air quality

The earth's atmosphere is composed of a mixture of gases. Air is one of the most essential constituents that serve to preserve all life forms on earth. The air we breathe contains about 21% oxygen, which is utilized by the human body. Since we continuously breathe air to survive, it is critical to maintain the balance and quality of the air around us. However, with the pollution surrounding us, it becomes difficult to breathe the naturally available air.

[Paper Summary] A new Google AI Research Study discovers Anomalous Data using Self Supervised Learning

New Google AI research introduces a 2-stage framework that uses recent progress on self-supervised representation learning and classic one-class algorithms. This framework is simple to train and shows SOTA performance on various benchmarks, including CIFAR, f-MNIST, Cat vs. Dog, and CelebA. Following that, they offer a novel representation learning approach for a practical industrial defect detection problem using the same architecture. On the MVTec benchmark, the framework achieves a new state-of-the-art.

Approaches for building real-time ML Systems

As an applied data scientist at Zynga, I’ve started getting hands-on with building and deploying data products. As I’ve explored more and more use cases for machine learning, there’s been an increasing need for real-time machine learning (ML) systems, where the system performs feature engineering and model inference to respond to prediction requests within milliseconds. While I’ve previously used tools such as AWS SageMaker to do model inference in near real-time, I only recently explored options for also doing feature engineering on the fly for ML systems.

Artificial Intelligence: It is not MAGIC. It is MATHEMATICS.

Surely you have at some point wanted to teach a skill or some knowledge to a co-worker, a friend, or your children.

If it is explicit knowledge (that which is structured), you can transmit it orally or in writing through specific instructions, as if it were a kitchen recipe, describing the steps in order and providing all the necessary information for each one of them. The instructions can also be cause-and-effect guidelines: “if this happens, do this; if instead the situation is this other one, then do that.” The instructions can be numerous and complicated, but you are able to give a solution for each of the possible situations.

[Paper Summary] An AI system trained by Loughborough University researchers recognizes the pre-movement patterns from an EEG

A group of researchers from the Intelligent Automation Center at Loughborough University has published a research paper focused on training robots to detect the intention of an arm movement before the human carries out the movement.

[Paper Summary] Stanford Researchers use Deep Learning to predict Biological Structures, like RNAs, more accurately than ever before

Determining the 3D structures of biological molecules, like RNAs, is difficult and often requires extensive efforts costing millions of dollars. Stanford University researchers have devised a new deep learning algorithm called ARES (Atomic Rotationally Equivariant Scorer) to overcome this challenge by computationally predicting accurate structures.

[Paper Summary] Google and Mayo Clinic Researchers propose a new AI Algorithm to improve Brain Stimulation devices to treat disease

Electrical stimulation has the potential to widen treatment possibilities for millions of people with movement disorders, such as Parkinson’s disease, and epilepsy. In the future, this technology may help further treat psychiatric illness or even assist in recovery from brain injuries like stroke.

[Paper Summary] AI Researchers from ShanghaiTech and UC San Diego introduce SofGAN: A Portrait Image Generator with Dynamic Styling

Researchers in Shanghai and the United States have created a GAN-based portrait creation system that lets users build new faces with previously unattainable levels of control over specific features, including hair, eyes, spectacles, textures, and color.

Why does Markov Decision Process matter in Reinforcement Learning?

For most learners, the Markov Decision Process (MDP) framework is the first thing to learn when diving into Reinforcement Learning (RL). However, can you explain why it is so important? Why not another framework? In this post, I will explain the advantages of MDP compared to the k-armed bandit problem, another popular RL framework. The post is inspired by an RL specialization offered by the University of Alberta and the Alberta Machine Intelligence Institute on Coursera. I wrote this post to summarize some of the videos and to get a deeper understanding of the specialization.
