A Guide for Optimizing your Data Science Workflow


This post was originally published by Adiamaan Keerthi at Towards Data Science on Medium.

It is a good idea to use pipx to install flake8 and mypy on your system. This way, you install them only once and can reuse them across projects. You can point to the pipx install location using the following settings in the user settings.json file,

{
    "python.linting.flake8Enabled": true,
    "python.linting.flake8Path": "C:\\Users\\username\\.local\\pipx\\venvs\\flake8\\Scripts\\flake8.exe",
    "python.linting.mypyEnabled": true,
    "python.linting.mypyPath": "C:\\Users\\username\\.local\\pipx\\venvs\\mypy\\Scripts\\mypy.exe"
}
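If you don't have the tools yet, installing each one through pipx is a one-liner (this assumes pipx itself is already installed):

pipx install flake8
pipx install mypy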

With linting enabled through mypy, flake8, and Pylance, you can write code confidently and catch bugs even while prototyping.
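As a quick illustration of what this catches, consider a hypothetical snippet like the one below; mypy (and Pylance) flag the bad call as you type it, instead of letting it fail at runtime:

def add_tax(price: float, rate: float) -> float:
    return price * (1 + rate)

# Flagged by mypy: Argument 2 to "add_tax" has incompatible type "str"; expected "float"
total = add_tax(100.0, "0.08")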

Formatting

  • Black
    Formatting helps maintain code standards when working in a team. Have you had teammates debate code formatting on a PR? Black is a Python tool that automates code formatting using a set of predefined rules. It is an opinionated auto-formatting library. Black is particularly helpful during development, as it can break down complex statements into a format that is easy to read. The reformatting is deterministic, so users with the same settings get exactly the same formatting, no matter the OS, IDE, or platform the formatting is run on. You can use pipx to set up black on your machine.

Before black formatting,

def very_important_function(template: str, *variables, file: os.PathLike, engine: str, header: bool = True, debug: bool = False):
"""Applies `variables` to the `template` and writes to `file`."""
with open(file, 'w') as f:
...

After black formatting,

def very_important_function(
    template: str,
    *variables,
    file: os.PathLike,
    engine: str,
    header: bool = True,
    debug: bool = False,
):
    """Applies `variables` to the `template` and writes to `file`."""
    with open(file, "w") as f:
        ...
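To reproduce this yourself, point black at the file (a sketch; example.py is a made-up name):

pipx install black
black example.py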
  • isort
    Black doesn't format your imports. Without a tool, imports get added to a project in whatever order people happen to write them. isort (import sort) brings order to this chaos by enforcing a hierarchy: Python standard library imports first, followed by third-party library imports, and then user-defined modules. Within each category, imports are further sorted alphabetically. This makes it quick to identify an import when there are many of them.

Before isort,

from my_lib import Object
import os
from my_lib import Object3
from my_lib import Object2
import sys
from third_party import lib15, lib1, lib2, lib3, lib4, lib5, lib6, lib7, lib8, lib9, lib10, lib11, lib12, lib13, lib14

import sys

from __future__ import absolute_import
from third_party import lib3

print("Hey")
print("yo")

After isort,

from __future__ import absolute_import

# Python standard library
import os
import sys

# Third-party library
from third_party import (lib1, lib2, lib3, lib4, lib5, lib6, lib7, lib8,
                         lib9, lib10, lib11, lib12, lib13, lib14, lib15)

# User-defined library/modules
from my_lib import Object, Object2, Object3

print("Hey")
print("yo")
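Running it is just as simple (a hypothetical invocation, using the black-compatible profile discussed below):

pipx install isort
isort --profile black .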

Use this setting in your setup.cfg file to configure isort to work with black. Black is configured for an 88-character line by default, so make sure flake8 and isort are configured to the same exact limit.

[flake8]
max-line-length = 88

[isort]
line_length = 88
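Alternatively, isort 5+ ships a built-in black compatibility profile that sets the line length and wrapping style for you, so an equivalent setup.cfg section would be:

[isort]
profile = black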

Table of Contents:

Dotenv
Pre-commit
Touch typing
VIM
VS Code extensions

This is a crucial step, since this is where abstractions such as functions, classes, and modules are designed. Data scientists can learn a lot from a Python developer's workflow during this step.

Dotenv

Let’s say you have the following file structure,

dream_ds_project
--dev              # Jupyter notebook folder
  --notebook1.py
  --notebook2.py
--src              # Source code folder
  --module1.py
  --module2.py
--.env             # Environment variable file
--setup.cfg        # Configuration file for python tools

And you start by prototyping a function test_prototype in dev/notebook1.py. You can then move that function to src/module1.py. To import this function, set up PYTHONPATH in the .env file present in the root folder, like this.

# Set PYTHONPATH to a relative path. In this case, the directory where the
# .env file is present becomes the root path
PYTHONPATH=.

Now you can import test_prototype in dev/notebook1.py as

from src.module1 import test_prototype
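For completeness, a minimal sketch of what src/module1.py might contain (the body is purely illustrative):

# src/module1.py
def test_prototype(values):
    """Placeholder for whatever logic was prototyped in the notebook."""
    return sum(values) / len(values)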

.env is a special file. You can use it to store sensitive information such as passwords, keys, etc. It should not be part of git commits (e.g., list it in .gitignore) and should be kept private. Keep two .env files: one for production and one for development.

Production .env file could be like,

MONGODB_USER=prod_user
MONGODB_PWD=prod_pwd
MONGODB_SERVER=prod.server.com

Whereas Development .env file could be like,

MONGODB_USER=dev_user
MONGODB_PWD=dev_pwd
MONGODB_SERVER=dev.server.com

You can load these variables into your environment using the python-dotenv library. Inside your python code, you can access these variables like this,

import os

from dotenv import load_dotenv

# Call the function to read and load the .env file into the local environment
load_dotenv()

print(os.getenv("MONGODB_SERVER"))
>> prod.server.com  # For the prod .env file
>> dev.server.com   # For the dev .env file

This keeps the code identical between prod and dev; you simply swap the .env file based on the environment the code is running in.
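As a concrete sketch of the payoff, a connection helper written against these variables never hardcodes credentials and runs unchanged in both environments (this assumes pymongo; the variable names match the .env files above):

import os

from dotenv import load_dotenv
from pymongo import MongoClient  # assumption: pymongo is installed

load_dotenv()

# The same code connects to dev or prod, depending on which .env file is present
client = MongoClient(
    host=os.getenv("MONGODB_SERVER"),
    username=os.getenv("MONGODB_USER"),
    password=os.getenv("MONGODB_PWD"),
)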

Pre-commit

Pre-commit verifies your git commits. It helps maintain a clean commit history and provides a mechanism for running user-defined validations before each commit. It has a strong ecosystem, with a plugin for most of the common commit validations you can think of.

Builtin-hooks:
Some of my favorite builtin pre-commit hooks are,

  • detect-aws-credentials & detect-private-key, which make sure no sensitive information is accidentally included in the commits.
  • check-added-large-files, to make sure commits do not include files that exceed a size limit (1 MB here), controlled with the maxkb argument. I found this very useful because code files are rarely larger than 1 MB, so it prevents accidental commits of large data files in a data science workflow.
  • check-ast, which makes sure that the code is syntactically valid Python.

Install pre-commit with poetry add pre-commit --dev.
Create a .pre-commit-config.yaml file and add this,

repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v3.2.0
    hooks:
      - id: detect-aws-credentials
      - id: detect-private-key
      - id: check-added-large-files
        args: ['--maxkb=1000']
      - id: check-ast
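One step that is easy to miss: the hooks only run after you register them with git, so install them once per clone,

poetry run pre-commit install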

Plugins:
On top of built-in hooks, pre-commit offers support for plugins as well. Some of my favorite plugins are,

  • Black makes sure that the formatting of all the commit files follows black conventions.
  • Mypy validates that the static type check has no errors.
  • Flake8 ensures the coding standards are observed.
  • pytest makes sure all the tests are passing before committing. This is particularly useful for small projects, where you do not have a CI/CD setup and testing can be done locally.

  - repo: https://github.com/psf/black
    rev: 20.8b1
    hooks:
      - id: black
        args: ['--check']
  - repo: https://github.com/pycqa/isort
    rev: '5.6.3'
    hooks:
      - id: isort
        args: ['--profile', 'black', '--check-only']
  - repo: https://github.com/pre-commit/mirrors-mypy
    rev: v0.800
    hooks:
      - id: mypy
  - repo: https://gitlab.com/pycqa/flake8
    rev: '3.8.3'
    hooks:
      - id: flake8
        args: ['--config=setup.cfg']
  - repo: local
    hooks:
      - id: pytest-check
        name: pytest-check
        entry: pytest
        language: system
        pass_filenames: false
        always_run: true

Configured this way (note the --check flags), pre-commit only reads the files and validates the commit; it never reformats them or performs any write operation. On a validation error, it cancels the commit so you can go back, fix the error, and commit again.

A sample pre-commit failure because of committing a large data file,

dream_ds_project> git commit -m "precommit example - failure"
Detect AWS Credentials...................................................Passed
Detect Private Key.......................................................Passed
Check for added large files..............................................Failed
- hook id: check-added-large-files
- exit code: 1

all_data.json (18317 KB) exceeds 1000 KB.

Check python ast.........................(no files to check)Skipped
black....................................(no files to check)Skipped
mypy.....................................(no files to check)Skipped
flake8...................................(no files to check)Skipped

A sample pre-commit success,

dream_ds_project> git commit -m "precommit example - success"
Detect AWS Credentials...................................................Passed
Detect Private Key.......................................................Passed
Check for added large files..............................................Passed
Check python ast.........................................................Passed
black....................................................................Passed
mypy.....................................................................Passed
flake8...................................................................Passed
[master] precommit example - success
7 files changed, 54 insertions(+), 33 deletions(-)

Touch typing

Touch typing is an essential productivity tip that generalizes to any computer task. It is vital if you spend a considerable amount of time in front of a computer. Just by practicing for a few minutes every day, you can reach a significant typing speed.

Touch typing is especially vital for programmers, since so much of the emphasis is on typing special characters, and you don't want to be staring at your keyboard while thinking about logic. Keybr is a fantastic website for training touch typing.

