LOADING DATA FOR PROCESSING IN PYTHON...
According to Wikipedia, data mining is the process of extracting and discovering patterns in large data sets, using methods at the intersection of machine learning, statistics, and database systems.
Any machine learning workflow therefore begins with getting the acquired data into our system. The data might be a log file, a dataset file, or a database, and it may come from multiple different sources.
In this article, we’ll look at methods of loading data in different formats from a variety of sources, including CSV files and SQL databases.
We’ll focus mainly on the pandas library’s extensive set of methods for loading external data, as well as on the built-in datasets that ship with scikit-learn and seaborn.
This involves loading data from preexisting datasets in either the scikit-learn package or the seaborn package. Both packages embed some small toy datasets. These are referred to as “toy” datasets because they are smaller and cleaner than the datasets we come across in the real world.
The sklearn.datasets module comes with a few small standard datasets that do not require downloading any file from an external website.
Sample datasets from sklearn include:
load_boston() - Load and return the Boston house-prices dataset. It contains 506 observations on Boston housing prices and is a good dataset for exploring regression algorithms. (Note: this loader was deprecated in scikit-learn 1.0 and removed in 1.2 over ethical concerns with the dataset.)
load_iris() - Load and return the iris dataset. It contains 150 observations on the measurements of Iris flowers. It is a good dataset for exploring classification algorithms.
load_diabetes() - Load and return the diabetes dataset (regression).
load_digits([n_class]) - Load and return the digits dataset (classification). It contains 1,797 observations from images of handwritten digits and is a good dataset for teaching image classification.
load_linnerud() - Load and return the linnerud dataset (multivariate regression).
# Load scikit-learn's datasets
from sklearn import datasets
# Load digits dataset
digits = datasets.load_digits()
To load the Iris dataset, use:
from sklearn.datasets import load_iris
data = load_iris()
data.target[[10, 25, 50]]
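A minimal sketch (assuming scikit-learn is installed) of what load_iris() actually returns and how the indexing above behaves:

```python
from sklearn.datasets import load_iris

# load_iris() returns a Bunch object: a dict-like container
# holding the feature matrix, the labels, and metadata
data = load_iris()

print(data.data.shape)            # feature matrix: 150 samples x 4 measurements
print(data.target.shape)          # one class label (0, 1, or 2) per sample
print(data.feature_names)         # names of the four measurements
print(data.target_names)          # the three Iris species
print(data.target[[10, 25, 50]])  # labels of samples 10, 25 and 50 -> [0 0 1]
```

The samples are stored in class order (the first 50 are class 0, the next 50 class 1, the last 50 class 2), which is why indices 10 and 25 return 0 while index 50 returns 1.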
Seaborn, on the other hand, comes with a few important datasets in the library. When Seaborn is installed, the datasets are downloaded automatically the first time you load them. Some of the datasets have a small amount of preprocessing applied to define a proper ordering for categorical variables.
To see a list of available datasets use get_dataset_names().
# Import the seaborn module
import seaborn as sns
sns.get_dataset_names()
The above line of code will generate output similar to the following (the exact list depends on your seaborn version):
['anagrams', 'anscombe', 'attention', 'brain_networks', 'car_crashes',
'diamonds', 'dots', 'exercise',
'flights', 'fmri', 'gammas', 'geyser', 'iris',
'mpg', 'penguins', 'planets', 'tips', 'titanic']
Seaborn contains relational (table-format) datasets. With the help of the load_dataset() function, you can load the required dataset.
import pandas as pd
import seaborn as sns
df = sns.load_dataset('tips')
print(df.head())
The above line of code will generate the following output:
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4
Pandas is a versatile data analysis library that offers various functions for reading files of different formats into our script for data manipulation.
The various reader functions include:
- read_csv
- read_excel
- read_hdf
- read_sql
- read_json
- read_html
- read_stata
- read_clipboard
- read_pickle
- read_msgpack (removed in pandas 1.0)
- read_gbq
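Each reader accepts a file path, a URL, or any file-like object. A short sketch of two of the readers above, using in-memory strings via io.StringIO so that no actual files are needed:

```python
import io
import pandas as pd

# read_csv and read_json accept any file-like object,
# so io.StringIO lets us try them without touching disk
csv_text = "name,score\nAda,90\nGrace,95\n"
json_text = '[{"name": "Ada", "score": 90}, {"name": "Grace", "score": 95}]'

df_csv = pd.read_csv(io.StringIO(csv_text))
df_json = pd.read_json(io.StringIO(json_text))

print(df_csv)
print(df_json)  # both produce the same two-column DataFrame
```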
CSV and Textual Files
CSV (comma-separated values) data is generally laid out in tabular form, with the values in each row separated by commas.
Text files can also contain tabular data, separated by spaces or tabs and stored with a .txt extension.
For this, pandas provides a pair of functions for loading a local or hosted CSV:
- read_csv
- read_table
CSV files may use other characters as separators, such as tabs. pandas' sep
parameter allows us to define the delimiter used in the file.
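For example, a tab-separated file can be read by passing sep='\t'. A sketch using an in-memory string in place of a real file:

```python
import io
import pandas as pd

# Tab-separated data; the same approach works for any delimiter
tsv_text = "item\twhite\tblue\nruler\t1\t3\ncup\t2\t5\n"

dataframe = pd.read_csv(io.StringIO(tsv_text), sep="\t")
print(dataframe)  # parsed into three columns: item, white, blue
```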
# Load library
import pandas as pd
# Create URL
url = 'https://tinyurl.com/titanic-csv'
# Load dataset
dataframe = pd.read_csv(url)
# View first two rows
dataframe.head(2)
The above line of code will generate the following output:
   PassengerId  Pclass                              Name     Sex   Age  SibSp  Parch  Ticket    Fare Cabin Embarked
0          892       3                  Kelly, Mr. James    male  34.5      0      0  330911  7.8292   NaN        Q
1          893       3  Wilkes, Mrs. James (Ellen Needs)  female  47.0      1      0  363272  7.0000   NaN        S
N.B.: For a locally stored CSV file, the url is the referenced location of the file in your working directory. For example, if the CSV file (named sample_1.csv) is located in 'c:\users\rainer\desktop',
then the url to be used will be:
# Create url (backslashes in a Windows path must be doubled, or use a raw string r'...')
url = 'c:\\users\\rainer\\desktop\\sample_1.csv'
# Load dataset
dataframe = pd.read_csv(url)
CSV files are tabulated data in which the values in each row are separated by commas. But since CSV files are considered text files,
you can also use the read_table() function while specifying the delimiter.
# Create url
url = 'c:\\users\\rainer\\desktop\\sample_1.csv'
# Load dataset
dataframe = pd.read_table(url, sep = ',')
Create a small csv file in your working directory and save it as sample_1.csv
sample_1.csv
item,white,blue,red,yellow
ruler,1,3,5,2
cup,2,5,9,2
pen,6,1,4,0
book,4,1,2,1
# Load dataset
dataframe = pd.read_csv('sample_1.csv')
print(dataframe)
This will generate:
item white blue red yellow
0 ruler 1 3 5 2
1 cup 2 5 9 2
2 pen 6 1 4 0
3 book 4 1 2 1
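If a CSV file has no header row, read_csv will wrongly treat the first data row as column names. Passing header=None, optionally with names=, handles this. A sketch using an in-memory string holding the same rows as sample_1.csv but without its header line:

```python
import io
import pandas as pd

# The rows of sample_1.csv, minus the header line
headerless = "ruler,1,3,5,2\ncup,2,5,9,2\npen,6,1,4,0\nbook,4,1,2,1\n"

# header=None stops pandas from consuming the first row as a header;
# names= supplies the column labels explicitly
dataframe = pd.read_csv(io.StringIO(headerless), header=None,
                        names=['item', 'white', 'blue', 'red', 'yellow'])
print(dataframe)
```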
Loading Data from Microsoft Excel Files
An Excel spreadsheet contains data in tabular form. Pandas provides the read_excel() function to load data in this format (.xls and .xlsx).
This can be achieved through pd.read_excel('data.xls'),
which returns a DataFrame composed of the data tabulated in the first sheet. However, if you need to load the data in a specific sheet, then pass the name of that sheet or its index (number) as the second argument, as shown:
pd.read_excel('data.xls','Sheet2')
or pd.read_excel('data.xls',1)
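A round-trip sketch of both call styles, assuming the openpyxl engine (which pandas uses for .xlsx files) is installed; the file name demo.xlsx is arbitrary:

```python
import pandas as pd

sheet1 = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
sheet2 = pd.DataFrame({'x': [5, 6], 'y': [7, 8]})

# ExcelWriter lets us place several DataFrames in one workbook
with pd.ExcelWriter('demo.xlsx') as writer:
    sheet1.to_excel(writer, sheet_name='Sheet1', index=False)
    sheet2.to_excel(writer, sheet_name='Sheet2', index=False)

by_name = pd.read_excel('demo.xlsx', sheet_name='Sheet2')
by_index = pd.read_excel('demo.xlsx', sheet_name=1)  # 0-based sheet index
print(by_name.equals(by_index))  # selecting by name or by index gives the same sheet
```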
To do it practically, open an Excel file and enter the data shown in the figure below:
After entering the data into a spreadsheet, save the file and load it in a Jupyter notebook or your preferred Python environment using:
import pandas as pd
# Load file into Data variable
Data = pd.read_excel("Book1.xlsx","Sheet1")
# Print the first rows
print(Data.head())
The output generated is
NAME AGE BEST COLOR BOOK NO. PAGE
0 Edwin 5yrs Green 7 2404
1 Muuo Ian 6yrs Purple 8 2405
2 Joel 5yrs Orange 9 2406
3 Mush 5yrs Red 10 2407
4 Nyandeng 10yrs Green 11 2408
To view other sheets, just specify the sheet name or index.
import pandas as pd
# Load file into Data variable
Data = pd.read_excel("Book1.xlsx","Sheet2")
# Print the first rows
print(Data.head())
The output generated is
NAME AGE BEST COLOR BOOK NO. PAGE
0 Fred 10yrs Yellow 25 2410
1 Bianca 13yrs Blue 26 2412
2 Martin 6yrs Green 27 2413
3 Genevieve 3yrs Violet 28 2414
4 Karen 12yrs Yellow 29 2415
CSV and Excel files are the most common data file formats, though many other formats exist. Stay tuned for the next article, which will cover loading data in the JSON format and from a database using the pandas library.
Until then, goodbye 👋👋👋. Happy coding week ahead!!!