Data Engineering and Secure Coding with a Vulnerability Database

We live in a world of data and AI. Enterprises worldwide gather as much data as they can to build a data abstraction layer that business users and data analysts can easily query for value and insights. Collecting data and drawing insights from it sounds straightforward, but organizations must also implement data governance pillars to ensure that data is collected efficiently and remains both secure and easily accessible to end users.

Several tools on the market address code security, including the WhiteSource Vulnerability Database, SonarQube, GitHub, and GitLab. Let’s walk through building a data engineering application end to end using Python and the WhiteSource Vulnerability Database.

Secure Data Engineering with Python

Most enterprise organizations use Python to build data solutions. It is therefore important to check Python code against free, open source vulnerability databases, and to automate those checks so that potential vulnerabilities are flagged.

In enterprises and small companies alike, data engineering teams make data available and easily accessible to downstream users for business analytics through activities such as the following:

Build automated and generic data pipelines to gather raw data in a data lake or data lake house platform.
Build ETL/ELT pipelines to pre-process, clean, transform, and load data to the destination store.
Machine learning engineers can use this data to build machine learning models.
Business analysts and data analysts can use this data to create interactive visual reports.
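
The pipeline activities listed above can be sketched in a few lines of Python. This is a minimal illustration that uses only the standard library; the sample data, the `sales` table, and the function names are assumptions made for the example and do not refer to any specific platform.

```python
import csv
import io
import sqlite3

# Hypothetical raw input; in practice this would be a file in a data lake.
RAW_CSV = """region,amount
north, 100.50
south,
 East ,200
"""

def extract(text):
    """Read raw CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Trim whitespace, lowercase region names, and drop rows with no amount."""
    cleaned = []
    for row in rows:
        amount = (row["amount"] or "").strip()
        if not amount:
            continue  # skip incomplete records
        cleaned.append((row["region"].strip().lower(), float(amount)))
    return cleaned

def load(rows, conn):
    """Write the cleaned rows to the destination table."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (region TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    conn.commit()

def run_pipeline(text, conn):
    """Run extract, transform, and load, then return the stored rows."""
    load(transform(extract(text)), conn)
    query = "SELECT region, amount FROM sales ORDER BY region"
    return conn.execute(query).fetchall()

# An in-memory SQLite database stands in for the destination store.
conn = sqlite3.connect(":memory:")
result = run_pipeline(RAW_CSV, conn)
```

Keeping extract, transform, and load as separate functions makes each step easy to test on its own and to swap out when the source or destination changes.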

A data engineer’s day-to-day work involves processing data from different sources, in both batch and streaming modes. Data engineers focus on building ETL or ELT pipelines that are robust, reliable, and efficient, so that business users can readily access the data and draw insights from it.

Most public cloud providers, such as AWS and Azure, support Python-based serverless computing and microservices. Business data comes from a variety of sources: relational and non-relational databases, flat files, spreadsheets, and external systems exposed through APIs. With pandas, we can read many of these into a DataFrame, since it supports a wide range of file formats, including column-oriented ones such as Parquet.
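
As a small illustration of reading different sources into one structure, the sketch below loads CSV and JSON text with pandas and combines the results. It assumes pandas is installed; the inline sample data and the `load_orders` helper are made up for the example, and in-memory strings stand in for real files.

```python
import io

import pandas as pd

# Hypothetical sample data; real pipelines would pass file paths or URLs.
CSV_TEXT = "order_id,amount\n1,10.5\n2,20.0\n"
JSON_TEXT = '[{"order_id": 3, "amount": 7.25}]'

def load_orders() -> pd.DataFrame:
    """Read two differently formatted sources into a single DataFrame."""
    csv_df = pd.read_csv(io.StringIO(CSV_TEXT))
    json_df = pd.read_json(io.StringIO(JSON_TEXT))
    # Both frames share the same columns, so they can be stacked row-wise.
    return pd.concat([csv_df, json_df], ignore_index=True)

orders = load_orders()
```

Other readers such as `pd.read_excel` and `pd.read_parquet` follow the same pattern, although they need optional dependencies to be installed.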

Once your Python/Spark application code is ready, you can build or integrate a Python utility that scans it against the WhiteSource Vulnerability Database for potential security issues. The open source database can be queried through web scraping, which we can integrate into our Python code.
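
One possible shape for such a utility is sketched below. It checks pinned dependencies against a list of known-vulnerable versions. The advisory data here is a hypothetical in-memory stand-in for information obtained from the database: the package names, the advisory IDs, and the `scan_requirements` helper are all invented for illustration, and no WhiteSource API is called.

```python
# Hypothetical advisory data: package name -> [(advisory id, affected versions)].
# Real data would come from a vulnerability database; these entries are made up.
ADVISORIES = {
    "examplelib": [("EXAMPLE-0001", {"1.0.0", "1.0.1"})],
    "demo-parser": [("EXAMPLE-0002", {"2.3.0"})],
}

def parse_requirements(text):
    """Return (name, version) pairs for lines pinned as name==version."""
    pins = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if "==" not in line:
            continue  # blank or unpinned lines are ignored in this sketch
        name, version = line.split("==", 1)
        pins.append((name.strip().lower(), version.strip()))
    return pins

def scan_requirements(text, advisories=ADVISORIES):
    """List (package, version, advisory id) for every affected pin."""
    findings = []
    for name, version in parse_requirements(text):
        for advisory_id, affected in advisories.get(name, []):
            if version in affected:
                findings.append((name, version, advisory_id))
    return findings

REQUIREMENTS = """
examplelib==1.0.1   # pinned, affected
demo-parser==2.4.0  # pinned, not affected
unpinned-package
"""

findings = scan_requirements(REQUIREMENTS)
```

Exact version matching keeps the sketch short. A production utility would also need to handle version ranges and unpinned dependencies.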

Using the Vulnerability Database

The WhiteSource Vulnerability Database is a free, open source vulnerability database, and also the largest. It can be used to secure your application through code scanning. The database stores information in the following formats. You can find the full list of vulnerabilities available in the database online.

It covers more than 200 programming languages. The WhiteSource database also provides the following information to make developers’ lives easier:

CWE Type
Recommended fix
Support from the community
Exposure level
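
When working with this information in Python, it can help to model each entry as a small record. The sketch below is one possible representation. The class name, the field names, the three exposure levels, and the sample entries are assumptions for illustration and do not reflect the database's actual schema.

```python
from dataclasses import dataclass
from enum import IntEnum

class Exposure(IntEnum):
    """Hypothetical exposure scale; the real levels may differ."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3

@dataclass(frozen=True)
class VulnerabilityRecord:
    identifier: str            # made-up ID for the example
    cwe_type: str              # e.g. "CWE-79" (cross-site scripting)
    recommended_fix: str
    exposure: Exposure
    community_notes: str = ""  # placeholder for community support details

def most_exposed_first(records):
    """Order records so the highest exposure level is handled first."""
    return sorted(records, key=lambda r: r.exposure, reverse=True)

records = [
    VulnerabilityRecord("EXAMPLE-0001", "CWE-79", "Upgrade to 1.0.2", Exposure.MEDIUM),
    VulnerabilityRecord("EXAMPLE-0002", "CWE-89", "Upgrade to 2.4.0", Exposure.HIGH),
    VulnerabilityRecord("EXAMPLE-0003", "CWE-79", "Upgrade to 3.1.0", Exposure.LOW),
]

triage_order = [r.identifier for r in most_exposed_first(records)]
```

Using an `IntEnum` for the exposure level lets records be sorted and compared directly, which is convenient when deciding what to fix first.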

Let’s look at the trend of open source vulnerabilities per year from 2009 to 2020.

Given the sharp increase in open source vulnerabilities, it is important to adopt tools that improve code security before the code is deployed and promoted to higher environments.

To explore the WhiteSource Vulnerability Database, go to the website and use its search features. To add a new vulnerability, follow the link for submitting one; once you provide the information, it will be added to the database. By combining Python code with the database, we can proactively scan any Python source code and detect vulnerabilities.

Although this approach works well, running scans by hand becomes cumbersome in day-to-day Agile work. It is better to automate the process with a CI pipeline that continuously scans the source code against the database.

We can create a CI pipeline in Azure DevOps that runs on every code commit to Azure Repos. This way, our Python source code is scanned against the WhiteSource Vulnerability Database automatically.
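
A CI step usually passes or fails based on the exit code of the command it runs. The sketch below shows a small gate function that a pipeline step could call after a scan. The findings format, the severity names, and the `gate` function are hypothetical, and the sample findings are invented rather than produced by a real scan.

```python
# Hypothetical severity scale used to compare findings with a threshold.
SEVERITY_ORDER = {"low": 1, "medium": 2, "high": 3}

def gate(findings, fail_on="high"):
    """Return an exit code: 1 if any finding meets the threshold, else 0."""
    threshold = SEVERITY_ORDER[fail_on]
    blocking = [
        f for f in findings if SEVERITY_ORDER[f["severity"]] >= threshold
    ]
    for f in blocking:
        print(f"BLOCKING: {f['package']} ({f['severity']})")
    return 1 if blocking else 0

# Invented sample findings; a real pipeline would load scanner output instead.
sample_findings = [
    {"package": "examplelib", "severity": "medium"},
    {"package": "demo-parser", "severity": "low"},
]

exit_code_strict = gate(sample_findings, fail_on="medium")
exit_code_default = gate(sample_findings)
```

In a pipeline script, the result would be passed to `sys.exit` so that a non-zero code fails the step and blocks the commit from moving forward.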

Conclusion

In summary, we looked at the WhiteSource Vulnerability Database and how it stores vulnerability information for more than 200 programming languages. We also discussed using the WhiteSource Vulnerability Database with Python in data engineering applications to build secure Python applications. Finally, we explored using Azure DevOps CI pipelines to automate scanning.