The most Pythonic tools to solve ML problems
For any Machine Learning project there is a plethora of Python libraries waiting to be exploited, and there are plenty of articles on the internet that talk about the usual suspects such as NumPy, Pandas, scikit-learn, Seaborn, Matplotlib, etc. However, the basic Python functionality is often skipped over while reading about those libraries.
In this article we will unleash the full power of some built-in functions and libraries of Python which are heavily underrated, for anyone starting an ML project. Without wasting any further time, let's get down to business.
When a list of items needs to be manipulated and stored in a different list, or needs to be transformed as an intermediate stage before some other operation, list comprehension is a handy tool.
Let's say there is a list and we want to square all the numbers in the list. The usual loop method would be:
list1 = [1, 2, 3, 4, 5]
list2 = []
for number in list1:
    list2.append(number ** 2)
print(list2)
That's a lot of lines, and it's time-consuming. Using a list comprehension, we can write this in a single line:
list2 = [number**2 for number in list1]
Cool, now what if we have a 2D list (a list of lists) and we want to flatten it into a single list that contains the squares of those numbers?
The normal solution would be:
list1 = [[1, 2, 3], [4, 5, 6]]
list2 = []
for data in list1:
    for number in data:
        list2.append(number ** 2)
print(list2)
A more pythonic solution would be:
list2 = [number**2 for data in list1 for number in data]
You can play around with list comprehensions, and you'll never want to use the normal way again!
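As one more thing to play with (a hypothetical example, not from the snippets above), a comprehension can also filter while it transforms, using an if clause:

```python
list1 = [1, 2, 3, 4, 5]

# keep only the even numbers, squaring them as we go
even_squares = [number ** 2 for number in list1 if number % 2 == 0]
print(even_squares)  # [4, 16]
```

The loop equivalent would need both a for statement and an if statement wrapped around the append.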
pdb is the inbuilt debugger for Python. Let's say there is a scenario where we don't know what is going on with a particular code snippet and it is producing an unintended result. One way to debug is to put in print statements, printing the variables along with some message like:
print('Looks like trouble_1...')
Maybe we need to run the whole code multiple times before we realize where exactly the issue is. This will surely delay the project by a substantial amount of time.
The second, more Pythonic way is to debug snippets in the blink of an eye with pdb. It works on the usual principles of how debuggers actually work: setting breakpoints and printing call stacks, but that's the geeky stuff. The functionality it provides is that, after setting a breakpoint, one can see all variables in scope at that point alongside their values, and also create new variables and run code just as in a standalone environment.
The way to add a pdb breakpoint just before the suspected error code is given below:
import pdb
# Correct code segment
pdb.set_trace()
# Code here might be a bit sus
While running ML code, there will be an urge to store intermediate files and artifacts in certain directories, to check whether a directory even exists, to delete files, or to run custom shell commands; for all of this you're going to rely heavily on Python's os library. It contains almost all the methods one is ever going to need to call the operating system's operations.
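As a small sketch (the directory and file names here are made up for illustration), a typical artifact-saving workflow might look like:

```python
import os

artifact_dir = "artifacts"  # hypothetical output directory name

# create the directory only if it does not already exist
if not os.path.exists(artifact_dir):
    os.makedirs(artifact_dir)

# write an intermediate file into it
with open(os.path.join(artifact_dir, "metrics.txt"), "w") as f:
    f.write("accuracy: 0.91\n")

# list what the directory now contains
print(os.listdir(artifact_dir))
```

os.path.join keeps the path separator correct across operating systems, which matters when the same script runs on both Linux and Windows.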
One of the built-in data structures in Python is the set. It is very similar to sets in mathematical set theory, and Python sets support various set operations like intersections, differences, unions, etc.
Sets come in handy when comparing data, finding unique entries in a file, or extracting common entries across datasets. Let's take an example; suppose there are 2 sets:
fruits = {"tomato", "apple", "banana", "orange"}
veggies = {"tomato", "cabbage", "potato", "onion"}
Now, to find out which food item is a fruit as well as a veggie, we can simply do a set intersection: fruits.intersection(veggies)
If we wanted to do this the usual way, the most naive approach would be to run 2 nested loops, compare the elements, and keep appending the common food items to another list.
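Putting the example above into runnable form:

```python
fruits = {"tomato", "apple", "banana", "orange"}
veggies = {"tomato", "cabbage", "potato", "onion"}

# items that are both a fruit and a veggie
both = fruits.intersection(veggies)  # equivalently: fruits & veggies
print(both)  # {'tomato'}

# items that are fruits but not veggies
only_fruits = fruits - veggies
print(only_fruits)
```

The set operations run in roughly linear time, whereas the naive two-loop comparison is quadratic in the number of items.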
An ML engineer's most important resource is time, and there will be times when a script takes way too long to run. There can be performance issues in the code for various reasons, and without figuring out which part of the code takes the longest, it is tough to pinpoint the issue. For pinpointing the longest-running snippets, the time library plays an important role.
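A minimal sketch of timing a suspect snippet (the function here is a made-up stand-in; time.perf_counter is well suited for measuring elapsed wall-clock time):

```python
import time

def suspect_snippet():
    # stand-in for a slow block of ML code
    total = 0
    for i in range(1_000_000):
        total += i
    return total

start = time.perf_counter()
result = suspect_snippet()
elapsed = time.perf_counter() - start
print(f"suspect_snippet took {elapsed:.4f} seconds")
```

Wrapping each suspect block this way quickly narrows down which part of the script deserves optimization.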
This one is the most important tool which every ML engineer uses: the virtual environment, created for example with Python's built-in venv module. It provides an isolated environment in which one can run scripts, which eliminates many dependency issues in a project.
Let's try to understand this with a scenario. Suppose there are 2 projects, and both of them require different versions of a Python library in order to run, with the constraint that at any point in time only 1 version of the library can be installed. It looks impossible to run both scripts on a single machine, and with a single environment it is. The simplest way to solve this problem is to create 2 different environments, install the required dependencies in each, and then run the scripts in their respective environments.
As an ML engineer, there will be multiple projects to work on simultaneously, and it's always recommended to use a different environment for each project so as to not run into any dependency issues.
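A sketch of the two-project workflow on Linux/macOS, with made-up environment and script names:

```shell
# create one isolated environment per project
python3 -m venv env_project_a
python3 -m venv env_project_b

# activate project A's environment ("." is the POSIX spelling of "source")
. env_project_a/bin/activate
# pip install -r requirements_a.txt   # install project A's dependencies
# python train_a.py                   # run project A's script
deactivate
```

Each environment has its own site-packages directory, so project A and project B can pin conflicting versions of the same library without interfering with each other.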
These are a few tools I realized were basic and powerful, yet underrated, for anyone starting an ML project. They will not only boost your productivity but also make you realize why Python is the de facto language for ML projects!