Python Cheat-Sheet for Data-Science
In this document I have collected a variety of Python code snippets that should help any aspiring Data Scientist perform common tasks in the field.
I hope you’ll find what you are looking for 🔥
Table of Contents
- Basic Python for Data-Science
- F-Strings
- List Comprehensions
- Dictionaries
- String Manipulation
- Package Installation
- Help
- Path handling
- Save Data
- Read Data
- Create a DataFrame
- Transformations
- Selection & Filtering of Rows & Columns
- Select 1 Column in a Dataframe
- Select multiple Columns in a Dataframe
- Select AND Filter for Rows with 1 Condition within Dataframe
- Select AND Filter for Rows with multiple Conditions within a Dataframe
- Select only Rows with Missing-Values
- Selection with “Accessor Operators” loc & iloc
- Replace Values within Rows
- The differences in Slicing between loc VS. iloc
- Sorting & Filtering
- Vanilla Data Exploration
- Data Cleaning
- Find the Data Type of a Variable AND the number of missings per Column
- Drop 1 OR multiple Columns
- Re-Name 1 OR multiple Columns
- Control the Row-Index with .reset_index() and .set_index()
- Assign a unique ID
- Apply a function on a Column of a DataFrame
- Dealing with Missings
- Duplicate Data
- Outliers
- Merging & Splitting
- Scaling Variables
- Advanced Data Exploration
- Time Series Exploration
- If-Conditions on a DataFrame
- Data Visualization
- Useful “tricks” I stumbled upon
- Math-Tricks for “Hacks”
- Python Magic Commands
Basic Python for Data-Science
F-Strings
F-strings can be used in Python to dynamically embed the values of variables directly into a string when displaying results.
- An example of an F-String would be:
import math
def day_converter(hour_lag):
"""
What does this function do?:
This Function will convert the index of the cross-correlation to the 'real lag' and also
converst the hours blagged behind into a more human readable lagged value, which will
be in 'days' and 'hours' instead of only 'hours'.
Why is it important?:
Suppose we know that the **index's 167th lag** has the highest correlation. From this fact,
we can conclude that the "true" 168th lag - which corresponds to the 167th index of the lag
in the list - is the most important lag that we are interested in. However, the problem that
we have is: what day & hour corresponds to the 168th lag (here: 168 hours, because our data
is the HOURLY price, e.g. the data has the unit 'hours') in the past? This is where this
converter comes into play.
Example:
--> Input: day_converter(168)
> 'Converted, the 168-th lag is 7 days and 0 hours in the past.'
"""
true_lag = hour_lag + 1
days = math.floor(true_lag/24)
diff = true_lag - (days *24)
print("Converted, the {0}-th lag is {1} days and {2} hours in the past.".format(true_lag, days, diff))
day_converter(169) # call the function, to check if it worked? --> you can play around and change the input
# --> it is always correct and shows the power of f-strings!
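If you only need the bare f-string syntax, here is a minimal sketch (the variable names are made up for illustration):
name = "Anna" # hypothetical variable
n_rows = 1500 # hypothetical variable
print(f"Hello {name}, your dataset has {n_rows} rows.") # everything inside the {} is evaluated at runtime
## Hello Anna, your dataset has 1500 rows.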
List Comprehensions
List Comprehensions are one of Python’s most beloved and unique features. A list comprehension is basically a loop in which you apply a function to each of the elements; its output is a list.
Example: a list comprehension where we apply the function f(x) = x².
List comprehensions are really handy if you want to iterate over a particular column of your dataset and apply a function to each of its elements.
squares = [n**2 for n in range(10)] # apply f(x) = x^2 on a list = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
squares
## [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
print(type(squares)) # Check output: should be a 'list'
## <class 'list'>
List comprehension with an
if
condition:
planets = ['Mercury', 'Venus', 'Earth', 'Mars', 'Jupiter', 'Saturn', 'Uranus', 'Neptune'] # this is the list we will loop through
short_planets = [planet for planet in planets if len(planet) < 6]
short_planets
## ['Venus', 'Earth', 'Mars']
Dictionaries
Adding a new Key-Value Pair
dictionary_name[key] = value
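A minimal sketch of this pattern (the dictionary and its keys are made up for illustration):
planet_radius = {"Mercury": 2440, "Venus": 6052} # an existing dictionary (radius in km)
planet_radius["Earth"] = 6371 # add a new key-value pair
print(planet_radius) # check if it worked? --> {'Mercury': 2440, 'Venus': 6052, 'Earth': 6371}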
Comparing 2 different DataFrames by their Rows & Columns
Context:
Here, we compare two DataFrames by their level of similarity. The key question I tried to answer with the code below is: was I able to replicate the same data frame - relative to some “true” data frame that I tried to mimic - after some data cleaning?
### Step 1): create an empty(!!) dictionary where we will store the results of the error-difference
### between the two ddsets.
result = dict()
### Step 2) let's select one error-metric of our choice. Here: "Root Mean Squared Error" (RMSE)
from sktime.performance_metrics.forecasting import MeanSquaredError
rmse = MeanSquaredError(square_root=True) # note: the metric is a class, so instantiate it first; the resulting
# object can then be called like a function, e.g. rmse(df_JM[col][:end], df_PW[col][:end])
end = "2021-03-12 22:00:00+00:00" # let's slice both dataframes to the same length as PW's benchmark-dataframe
for col in df_joffs_replikation: # we iterate over each column(!!)
    result[col] = df_JM[col][:end] - df_PW[col][:end] # for EACH column: the dictionary-KEYS are the columns' names,
    # while the dictionary-VALUES are the element-wise deviations between the two dataframes.
    # > Important to note: we slice our ddset up to the end of PW's benchmark-ddset (via the `end`-variable, which
    # denotes 12.03.2021, i.e. the end of PW's notebook), when performing the "vector"-subtraction!!
### Step 3) Check what the results are
result # check, if it worked?
print(result.keys()) # further check: did it REALLY work for all columns? --> yes, since it printed out all the 21 keys!
### Step 4) Compute the average error (= my DF-values - PW's DF-values)
"""
> Goal: average error per variable (= column):
--> We want to check how large the value-deviations between Philipp's notebook VS. Joffrey's notebook
are (for each value, in all 18 columns). In other words, **we are interested in the _distribution_ of the
deviations** (i.e. how large the errors are on average and how large their spread is).
For the average error / deviation of my replicated table VS. Philipp's table, I simply take the mean of the
deviations that were computed in 'Step 2' above.
"""
for value in result.values(): # iterate over the dictionary created above and compute the average
    # deviation for each column
    print(value.mean()) # should print 18 results, since we check 18 columns --> we expect roughly
    # the number "0" everywhere, which corresponds to an average error of '0' PER column!
### Step 5): Compute the standard deviation of the deviations for each column
for value in result.values():
    print(value.std())
- Note: The above code can only be executed on DataFrames that fulfill the following conditions:
- All Columns of both DataFrames should have the same names!
- Both DFs should have the same number of rows (= observations). I achieved this by using a variable called end that was used via df_JM[col][:end].
String Manipulation
Transform a list that contains strings into a WHOLE string, only separated by commas
# step 0: create a list with many strings in it
planets = ['Mercury', 'Venus', 'Earth', 'Mars', 'Jupiter', 'Saturn', 'Uranus', 'Neptune']
# Step 1: Transform the whole list into a single "string"
str(planets) # should output: "['Mercury', 'Venus', 'Earth', 'Mars', 'Jupiter', 'Saturn', 'Uranus', 'Neptune']"
# Step 2: Now, we replace all characters that are unnecessary - such as ' [ and ] -such that we return a whole string,
# only separated by commas, with no whitespace in between them:
## "['Mercury', 'Venus', 'Earth', 'Mars', 'Jupiter', 'Saturn', 'Uranus', 'Neptune']"
n = str(planets).replace("'", "").replace('[', '').replace(']', '').replace(' ', '') # replace everything that is unnecessary with empty strings
print(n) # Final output
## Mercury,Venus,Earth,Mars,Jupiter,Saturn,Uranus,Neptune
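Note: the same result can be achieved more directly with the built-in str.join()-method, which is the idiomatic way to concatenate a list of strings with a separator:
n = ",".join(planets) # joins all list elements into one string, separated by commas (no whitespace)
print(n)
## Mercury,Venus,Earth,Mars,Jupiter,Saturn,Uranus,Neptune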
Split a String at a specific point
timestamps = time_series_str_column.str.split("+", expand=True)
### Explanation of the inputs:
# > "+" is the character at which we split the string.
# > "expand = True" means that both parts of the split string are kept and returned as separate columns.
Cut a String in half
Here, we define 2 functions, which will cut the string in half:
def splitstring_p1(value):
    string1, string2 = value[:len(value) // 2], value[len(value) // 2:]
    return string1
def splitstring_p2(value):
    string1, string2 = value[:len(value) // 2], value[len(value) // 2:]
    return string2
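A quick usage example of the two helper functions above (the input string is made up):
print(splitstring_p1("HelloWorld")) # first half --> 'Hello'
print(splitstring_p2("HelloWorld")) # second half --> 'World'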
Convert a String into a Date-Time-Format
df['Date-Time'] = pd.to_datetime(df['Date-Time'], format='%Y-%m-%d %H:%M:%S')
df['Date-Time'] # check, if it worked?
Installation of Packages
Installing with package-manager conda
Conda is a package manager which - before installing a package that you want - checks which dependencies are needed for that package to work.
- Example: If you use sktime, then conda can detect that there may be a conflict with the current version of - for example - the package Tornado. Hence, conda will not only download sktime, but will also bring the package Tornado onto a newer version, such that it becomes compatible with sktime.
- Note: The installation via conda can take a while, since many dependencies will be checked!
- Important: Use the terminal for the following code.
To install a specific package - for example sktime - into an existing virtual environment called "myenv", type the following into the terminal:
conda install --name myenv sktime
To install the package sktime into the current (global) environment (i.e. "normally"), type the following into the terminal:
conda install sktime
Installing packages with pip
Disadvantage of using pip instead of conda? pip will not check in the background whether the versions of the different packages you use are compatible with each other. Hence, using pip may break your project when updating packages OR installing new ones…
To install a package with pip, simply type the following into the terminal:
pip install myPackage
Import some Packages // Modules, or Sub-Modules
import numpy as np # this is a package // module
import matplotlib.pyplot as plt
from matplotlib.dates import date2num # this is an example of a sub-module
import pandas as pd
from scipy import stats
from datetime import datetime
Uninstall Packages with Pip
pip uninstall fbprophet prophet cmdstanpy
Need Help & Documentation?
help()
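For example, you can pass any function or object to help() to print its documentation (here with the built-in len and pandas' read_csv):
help(len) # documentation of a built-in function
import pandas as pd
help(pd.read_csv) # documentation of a function from a package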
Path Handling
Why is path handling crucial to know? Because you will need to load OR save models, DFs or files. In order to be able to do path-handling tasks, you will need to know about the os package, which is already built into “base Python”.
The main formats to save & load files in Python will be:
- The .csv-Format.
- The .pkl-Format.
- etc…
There are a lot of other formats, but the two mentioned above are the most common ones you will encounter.
Set the Path, where your Computer should Save Data / Models
# Set the folder where data // models will be saved
import os # this is the library that can handle paths
save_folder = os.path.expanduser(os.path.join("~", # our user's "home"-directory --> for my Mac it is: "/Users/jomaye"
                                              "Dokumente", # next, we go 1 directory further down, called "Dokumente"
                                              "Programming")) # in general: each string after a comma is a further sub-directory. This can go on indefinitely ;)
save_folder # check if it worked? --> yes! Now you see your path =)
## '/Users/jomaye/Dokumente/Programming'
Save data
Save as CSV-File
dynamic_name = "name_of_a_specific_characteristic_of_the_variable_that_you_want_to_save" # this is needed for "dynamic saving" via an f-string
YOUR_VAR_NAME.to_csv(f"Name-of-the-CSV-{dynamic_name}.csv") # I use f-strings since they allow me to adapt the CSV-filename to a specific characteristic that I used for a model / method etc...
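If you want to save the CSV-file into the save_folder from the path-handling chapter above (instead of the current working directory), you can combine os.path.join with the same f-string pattern - a small sketch, assuming save_folder, dynamic_name and YOUR_VAR_NAME exist:
import os
output_path = os.path.join(save_folder, f"Name-of-the-CSV-{dynamic_name}.csv") # full path = folder + filename
YOUR_VAR_NAME.to_csv(output_path) # assumes YOUR_VAR_NAME is a DataFrame and that save_folder exists on disk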
Read data // Load data
In order to load datasets, you will need the library pandas. For some cases, additional libraries may be needed.
Load data from Excel-Sheets
epex_df = pd.read_excel("./Data_V2/Preis_aktuell_Spot_EEX_CH-19-ver-mac.xlsx", # plug in the correct path
                        header=[1], # the dataset's column names start on the 2nd row (--> the 1st row is the title of the Excel-file)
                        sheet_name='Prices', # if you have more than 1 Excel-sheet within the Excel-file, you need to specify
                        # which sheet you want to load
                        engine='openpyxl') # this input will (sometimes) be needed if you load data from an Excel-file via
                        # a Windows-computer, otherwise it can throw an error!
epex_df # output the ddset as a check
Load Dataset via a CSV-File
test = pd.read_csv(
"C:/Users/u235051/Downloads/ETS_Database_v38/ETS_Database_v38.csv",
sep='\t' # in the CSV-file the columns were separated by the "Tab"-key, which is why this option is mandatory here (otherwise: error!)
)
test # check, if it worked?
Load Dataset via an URL from a Website
In order to be able to access a dataset stored as a csv-file on a website, we will need to use - besides pandas - an additional library called requests.
Step 1: Make the file ready to be downloaded by sending a query to the remote server // website that hosts the csv-file.
import requests # load the library needed
download_url = "https://raw.githubusercontent.com/fivethirtyeight/data/master/nba-elo/nbaallelo.csv" # absolute URL
target_csv_path = "nba_all_elo.csv" # name of the .csv-file that contains the data
response = requests.get(download_url) # using an API that "gets" (= http-protocol language) the data from the server
response.raise_for_status() # Check that the request was successful
with open(target_csv_path, "wb") as f:
    f.write(response.content)
print("Download ready.")
Step 2: Load the actual data.
import pandas as pd # load the library needed
nba = pd.read_csv("nba_all_elo.csv") # load the data --> ddset is called 'nba'
type(nba) # check if it worked? --> should output: <class 'pandas.core.frame.DataFrame'>
Let’s also check the dataframe:
nba.head()
Load Pickle-Files
import pickle # needed to load pickle-files
filename = "Put_Your_Pickle-File_Name_Here" # alternatively: os.path.join(data_folder_variable_where_pkl_is_saved, filename)
test_loaded_pkl = pickle.load(open(filename, 'rb'))
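The counterpart for saving an object as a pickle-file is pickle.dump - a minimal sketch (my_object and the filename are placeholders):
import pickle
with open("my_object.pkl", "wb") as f: # "wb" = write in binary mode
    pickle.dump(my_object, f) # 'my_object' can be a fitted model, a DataFrame, a dict, ...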
Create a DataFrame
df = pd.DataFrame({'LoadCH': ts_loadFc[time_index_hourly],
'LoadD': ts_DE_loadFc.resample('H').mean()[time_index_hourly],
'LoadF': ts_FR_loadFc[time_index_hourly],
'LoadIT': ts_IT_loadFc[time_index_hourly],
'GenCH': ts_genFc[time_index_hourly],
'GenD': ts_DE_genFc[time_index_hourly],
'GenF': ts_FR_genFc[time_index_hourly],
'GenIT': ts_IT_genFc[time_index_hourly],
'RenGenCH': ts_RenGenAllCH[time_index_hourly], # we take the aggregated renewable generation
'RenGenD': ts_RenGenAllDE[time_index_hourly],
'RenGenF': ts_RenGenAllFR[time_index_hourly],
'RenGenIT': ts_RenGenAllIT[time_index_hourly],
'TransFromDach': ts_TransFromDACH[time_index_hourly],
'TransToDach': ts_TransToDACH[time_index_hourly],
'TransToIT': ts_TransToIT[time_index_hourly],
'SeasonAndProduct': ts_season_prod[time_index_hourly], # TO-DO: re-name this column, since its name is confusing!
'tInfoDaySin' : ts_timeInfoDayI_4forecast[time_index_hourly],
'tInfoDayCos' : ts_timeInfoDayII_4forecast[time_index_hourly],
'tInfoYearSin' : ts_timeInfoYearI_4forecast[time_index_hourly],
'tInfoYearCos' : ts_timeInfoYearII_4forecast[time_index_hourly],
'PricesCH': ts_DA_Price[time_index_hourly]})
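The example above uses project-specific time series that are not defined in this document. A minimal, self-contained sketch of the same pattern (all names and values are made up) would be:
import pandas as pd
df_small = pd.DataFrame({'LoadCH': [100, 110, 120], # each key becomes a column name
                         'PricesCH': [45.2, 47.9, 50.1]}, # each list becomes the values of that column
                        index=pd.date_range('2021-01-01', periods=3, freq='h', tz='UTC')) # optional row-index
df_small # check if it worked?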
Transformations
Convert a Series into an array
Series.to_numpy() # this also works on a column of a dataframe =)
Transform a Pandas Series into a DataFrame
import pandas as pd
s = pd.Series(["a", "b", "c"],
name="vals")
s.to_frame()
Selection & Filtering Rows and Columns
When you work with dataframes in pandas, one of the most important things you will need to master is how to select some columns, as well as how to print out subsets // particular columns of the dataframe.
To get started and become acquainted with common techniques, I recommend you to watch this beginner tutorial on filtering & selecting columns.
Load the next cell to be able to run the examples that follow:
import pandas as pd # load package needed
file_path = './data/filter-and-selection/sample_orders.csv' # type in the path-location of your data
dd = pd.read_csv(file_path) # load the data
dd.head() # check if it worked? --> yes! --> should print the first 5 rows of your ddset
dd.info() # check also all data-types
Select only 1 single column of your Dataframe
In this example, I will select the column order_id from my ddset dd:
print(
dd['order_id'] # take the dataframe 'dd' and print me only the column called 'order_id' --> this is a subset
)
2nd Possibility: Selection of columns with dot-notation
One disadvantage of this method is that it won’t work IF the column names contain whitespace!
test = dd.order_id # CAUTION: does not work if there are whitespaces in the column name!!
print(test)
Select multiple Columns from your Dataframe
Similarly to R, we need to pass a list of column names (note the double brackets) into the column selector.
print(
dd[['order_id', 'order_total']] # take the dataframe 'dd' and print me only the columns called 'order_id' && 'order_total'
)
Select AND filter for a particular row, with one condition
Let’s say we want to select the row // observation with "order_id" == 1004.
print(
dd[
dd['order_id'] == 1004 # note that this INNER bracket runs a comparison that returns
# a Boolean list of "True" or "False" for all rows --> example-video: from 2:36-3:34 --> https://www.youtube.com/watch?v=htyWDxKVttE
] # This outer selector will tell Python: "Select" only the row where the column 'order_id' == 1004
)
# short-version:
print(dd[dd['order_id'] == 1004])
Select AND filter for rows with more than one condition
Now, we want to select the rows // observations where "order_total" >= 50.00 AND "order_date" == 12/1/17 are fulfilled. Basically, it is the same as the selection & filtering with one condition; you just need to wrap each condition inside the INNER brackets in an additional pair of (...) brackets.
print(
dd[
(dd['order_total'] >= 50) & (dd['order_date'] == '12/1/17') # in contrast to **one condition**, we just wrap up a (...) for each condition
] # This outer selector tells Python: "Select" only the rows where BOTH conditions are fulfilled
)
Alternative with “OR”: Instead of using AND, you can also - for example - use the OR-condition.
print(dd[(dd['order_total'] >= 50) | (dd['order_date'] == '12/1/17')])
Select only rows with Missing-Values
null_data = df[df.isnull().any(axis = 1)] # this will only select the rows that contain at least one missing-value
null_data # check, if it worked?
Selection with “accessor operators” iloc & loc
The package Pandas uses so-called accessor operators to filter DFs. There are 2 types of accessor operators:
- iloc (= based on the position (= number) of the row's and column's respective index). Note: the order of the inputs within iloc is: 1) 'rows', and then 2) 'columns'.
- loc (= based on the index-names of the DF's rows and columns). Note: the order of the inputs within loc is: 1) 'rows', and then 2) 'columns'.
Selection of the entry situated in the 1st row and 5th column, only with iloc
To be able to use iloc, the key is to know the position of the row- & column-labels (= the index of columns & rows).
- Note: iloc & loc are often used to print a particular entry WITHIN a dataframe. However, loc and iloc are also able to print entire rows // observations and not just one specific value within a row, as we will see.
test = dd.iloc[0,4] # order of the inputs == 1) 'rows' (--> "0" == 1st row), then 2) 'columns'
# (--> "4" == 5th column)
print(test) # check if it worked? --> should print the value '65'
Selection of the entry situated in the 1st row and 5th column, only with loc
The key to using loc is that you know the names of the columns & rows.
- Important note:
- The default names of the rows within a DF are simply the numbers 0,1,2...,10,11,12,....
- Similarly, the index (position) of each row is ALSO the numbers 0,1,2...,10,11,12,...!
- Hence, it is usually the case that loc and iloc take THE SAME first input! 🤓
test = dd.loc[0,'order_total'] # order of the inputs == 1) 'rows' (--> "0" == NAME of the 1st row),
# then 2) 'columns' (--> "order_total" == NAME of the desired 5th column)
print(test) # check if it worked? --> should print the value '65'
Apply a Threshold-Filter using loc on each row
This can be useful, when you want to replace some values in certain columns → see ‘Outliers’-chapter
# Only display all rows, where the 'pressure'-column is > than the threshold of 1051 bar
df_weather.loc[df_weather.pressure > 1051, 'pressure']
Selection of an ENTIRE row // record // observation from a ddset with iloc
#####
## Option 1: selection of only 1 row
test_row1 = dd.iloc[0,] # important: 1) rows, then 2) columns --> we want the entire 1st row, which includes ALL columns
# note: the comma AFTER the "0" indicates that we want to select ALL columns!
#####
## Option 2: selection of > 1 row --> notice the additional wrap with [...] WITHIN iloc[]!
test_multiRow = dd.iloc[[0,1,2,3,5,8],] # '[0,1,2,3,5,8]' will select the '1st, 2nd, 3rd, 4th, 6th and 9th' row,
# while also selecting ALL columns simultaneously
#####
## check if it worked? --> yes!
print(test_row1) # should print only 1 row BUT with ALL the columns --> the output looks odd, because the columns are displayed as rows
print(test_multiRow)
Selection of an ENTIRE row // record // observation from a ddset with loc
Tip: Out of all possibilities, I highly recommend you to use "Option 3" in the example below!!
#####
## Option 1: selection of only 1 row
test_row1 = dd.loc[0,] # the comma AFTER the "0" indicates that we want to select ALL columns!
#####
## Option 2: selection of > 1 row --> notice the additional wrap [...] WITHIN loc[]!
test_multiRow = dd.loc[[0,1,2,3,5,8],] # because - by default - the 'row-labels' (= the names of the row-index)
# are the same as the positions, the code for 'loc' is the same as for 'iloc' here...
#####
## Option 3: best & nicest solution (in my opinion!)
rows = list(range(0,16)) # will create a list that goes from 0 to 15 --> this will be for the row-labels
columns = ['order_id', 'order_date', 'order_total'] # this will be for the column-labels
# Pro-tip: columns = list(data.columns)
df = dd.loc[rows, columns]
#####
## check if it worked? --> yes!
print(test_row1) # should print only 1 row BUT with ALL the columns --> the output looks odd, because the columns are displayed as rows
print(test_multiRow)
print(df)
Replace values within a Column with some new Values
df.loc[df.ungewichtet > 1, 'ungewichtet'] = 1 # here, I replace all values of the column "ungewichtet" that are > 1 with the value "1"!
Different Slicing when using iloc VS. loc
Remember: When using iloc, the range 0:5 will select the entries 0,...,4, that is, it indexes EXCLUSIVELY. loc, on the other hand, indexes INCLUSIVELY, so the same range 0:5 will select the entries 0,...,5!!!
Hence, if you want the SAME output with loc and iloc, you simply need to slightly adjust the slice-range.
Example:
## Option 1: with 'iloc'
iloc_test = dd.iloc[0:5,0] # row-position == 0:5 --> first 5 rows; EXCLUDES '5' from the range "0,1,2,3,4,5"
# --> hence the slice 0:5 results in --> "0,1,2,3,4"
# column-position == 0 --> 1st column --> remember: indexing in
# Python starts at '0'!
# IMPORTANT: 'iloc' uses the 'Python stdlib' indexing scheme, where the first element of the range is
# included and the last one excluded. So 0:5 will select entries 0,...,4 (= these are the first *5*
# entries!!).
## Option 2: to get the SAME output with 'loc', we need a slightly DIFFERENT range!
loc_test = dd.loc[0:4,'order_id'] # row-position == 0:4 --> first 5 rows; INCLUDES '4'
# --> hence the slice 0:4 results in --> "0,1,2,3,4"
## check if the outputs are the same, even though the slices have slightly different ranges? --> yes!
print(iloc_test)
print(loc_test)
Sorting & Filtering
How to sort a Data-Frame Column (here: a Pearson correlation matrix) from ‘most important’ to ‘least important’?
### Find the correlations' ranking for the day-ahead electricity price and the rest of the features:
# Step 1: Create a Pearson correlation matrix out of your dataframe:
correlations = df.corr(method='pearson') # the variable 'correlations' is a dataframe!
# Step 2: use 'sort_values' to sort the column from "most important" (highest value) to "least important":
print(correlations['pricesCH'].sort_values(ascending=False).to_string())
Filter the WHOLE Dataframe on a condition (here: correlations > 0.75), select ONLY the observations that fulfill the condition (= ‘stack’ them) and list them from your Data Frame
highly_correlated = correlations[correlations > 0.75]
print(highly_correlated[highly_correlated < 1.0].stack().to_string())
Vanilla Exploration of your Dataset
After having told Python to read in your dataset, the first thing you will want to do is to get (very) familiar with your dataset. This is key, otherwise you will not be able to perform a good data analysis!
Find out the number of observations // rows in your dataset
len(nba) # to get the number of observations // rows
# note: 'nba' is the name of the ddset
Find out the number of rows & columns within your dataset
nba.shape # to get number of rows AND columns
# note: 'nba' is the name of the ddset
View the whole Dataset
nba # just type in the name of the variable in which your dataframe is stored in --> 'nba' is the name of your dataframe here
View the first 5 rows of your dataset
nba.head() # head() is often used to check whether your dataset really contains data you care about
# here we check: does the ddset really contains data about the NBA?
Note: If you have a lot of columns, Python will not display them all when using head(). However, you can change this setting via:
pd.set_option("display.max.columns", None) # this will tell Python: "show me ALL the columns!"
nba.head() # execute 'head()' again to check if the setting changed correctly? --> yes!
You can also view the last 5 rows of your dataset by using tail()
nba.tail() # View last 5 rows
# Viewing very specific rows is also possible:
nba.tail(3) # Here, we view the last 3 rows
Make a Summary out of all the variables in your dataset
Note that it is common to also load the library numpy when doing summary statistics.
import numpy as np # load numpy for summary-statistics
nba.describe().round(2) # results will be rounded onto 2 digits
Note: If you have columns of the type object, you will need a slightly different version of the describe()-function to display some summary-statistics for such columns.
nba.describe(include=object) # if you have some weird columns being of the type 'object'
Find the unique values from a column
This can be useful when you want to find all unique categories within a column. You can then put all those categories into a new variable in order to loop through them and apply some function to each category, as shown in the sketch below the code.
gapminder['continent'].unique()
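As mentioned above, you can store the unique categories and loop through them to apply some function per category - a small sketch, assuming the gapminder dataframe also has a numeric 'lifeExp' column:
continents = gapminder['continent'].unique() # store all unique categories in a variable
for continent in continents: # loop through each category
    subset = gapminder[gapminder['continent'] == continent] # filter the rows belonging to this category
    print(continent, subset['lifeExp'].mean()) # apply some function per category, e.g. the mean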
Calculating the mean or median of a particular column
For the mean, you will need the following code.
mean_points = reviews.points.mean() # calculate the mean of the column 'points' within the ddset 'reviews'
For the median:
median_points = reviews.points.median() # calculate the median of the column 'points' within the ddset 'reviews'
Transform a column based on conditions with numpy
import numpy as np # In order to be able to perform a transformation, we will need the `numpy`-package
# set-up:
x = np.arange(10) # this is our column 'x' --> Output of this: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
condlist = [x<3, x>5] # set of conditions that need to be fulfilled --> these are the value-ranges on which the
# functions from 'choicelist' (defined next) are applied --> all numbers below 3 AND all numbers above 5
choicelist = [x, x**2] # the functions to apply: keep x where x < 3, apply x^2 where x > 5
# output:
np.select(condlist, choicelist, default=np.nan) # keep x where x < 3, apply x^2 where x > 5, NaN otherwise
Data-Cleaning
Find out the Data Type of each column within your dataset & the number of non-missing values
With the following simple code, we can find out whether the columns are of the type integer, string, boolean etc…
nba.info() # this will output all the types of each column in your dataset & how many NON-missings you have per column
# note: Pay attention to any columns that are of the type 'object'! --> see the note below...
Note: It can happen that columns are of the type object. In practice, it often means that all of the values in an object-column are strings. If you encounter any object-columns, it is strongly recommended that you convert them into a more appropriate data type, otherwise some functions won’t work on these object-columns…
Find the type of any object in Python:
x = 5
print(type(x))
Drop specific Columns
Note that axis = 1 denotes that columns will be dropped, while axis = 0 (the default) denotes that rows will be dropped.
df_energy = df_energy.drop(['tInfoDaySin', 'tInfoDayCos', 'tInfoYearSin', 'tInfoYearCos'], axis=1)
Re-Name the Columns of a DataFrame
df2 = df.rename({'oldName1': 'newName1', 'oldName2': 'newName2'}, axis='columns') # not all columns have to be renamed, only those with a new name
Set & Reset the Index / Row-Label
df_energy = df_energy.set_index('Date-Time') # to set the index
df_energy = df_energy.reset_index() # to reset the index --> df.reset_index(drop= True) will drop the index, which would
# otherwise become a new column instead of just dropping it!
Assign a unique ID
- For a single Column:
df['id'] = df.groupby(['date']).ngroup() # of course, you could simply use 1 column for the assignment of a unique ID.
- For multiple Columns:
df['id'] = df.groupby(['LastName','FirstName']).ngroup() # here, we use the columns 'LastName' & 'FirstName' together to create
# a unique ID.
# Source: https://stackoverflow.com/questions/45685254/q-pandas-how-to-efficiently-assign-unique-id-to-individuals-with-multiple-ent
Apply a function on a Column
This can be useful, when you need to do some transformation to a column of a df:
# Step 1: Define any function, that you will need to be applied, in order to transform
def my_function():
    ... # the various statements go here
    ... # the various statements go here
    return some_variable
# Step 2: Apply this newly specified function on your column of your data-frame, that you wish to transform
df['new_column'] = df['column_1'].apply(my_function)
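A concrete little example of this pattern (the column names and the conversion are made up for illustration): convert a temperature column from Celsius to Fahrenheit.
# Step 1: define the transformation
def celsius_to_fahrenheit(temp_c):
    return temp_c * 9 / 5 + 32 # the transformation applied to every single value of the column
# Step 2: apply it --> assumes that df has a numeric column called 'temp_C'
df['temp_F'] = df['temp_C'].apply(celsius_to_fahrenheit)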
Dealing with Missings
Number of Missing Values across the WHOLE Data-Frame
print('There are {} missing values or NaNs in df_energy.'
.format(df_energy.isnull().values.sum()))
Count the TOTAL number of Missings in each column
df_energy.isnull().sum(axis=0) # outputs the number of NaNs for each column
Count the non-missings (and non-zero values) in each column
Note: the “Code-Trick” below only works if your columns don’t contain values that are ‘0’! This is because the number ‘0’ - as a boolean - evaluates to False, and hence we would get the “wrong” number of non-missing values.
Thus, you should only use the code below IF you have verified that NONE of your columns contains “0”-values!
# Display the number of non-missing values in each column
print('Non-zero values in each column:\n', df_energy.astype(bool).sum(axis=0), sep='\n')
Since the above cell only gives the “correct” number of non-missing values if you have no 0 in your columns, here is code to count how many 0s you have in each column:
(df_energy == 0).astype(int).sum(axis=0) # count the numbers of '0s' in each column [axis = 0, for columns...]
Display each row with a Missing-Value
# Display the rows with null // missing values:
df_energy[df_energy.isnull().any(axis=1)].tail()
Replace all Missings in a Column with 0s
df['ColumnWithMissings'] = df_tot['ColumnWithMissings'].fillna(0) # replaces all missing-values within the column with 0s.
Replace all Values in a Column with Missings
Assume that we have a time-series that has a row-index with time-stamps!
### Step 1: define the range in which you want to replace values
start = '2020-01-01' # 01 January 2020 --> from here on we want to replace the values of the time series with missings
stop = '2020-08-01' # 01 August 2020 --> up to this date the missings should be inserted
### Step 2: replace the values with missings via the ".loc[row-indexer, column-indexer]"
df.loc[start:stop, 'y_hat'] = None # this will replace all the values within the 'y_hat'-column - in the range
# 01.01.2020-01.08.2020 - with missing values (instead of the "normal" values)
Missing Imputation
This list will grow over time, as I stumble upon more useful code snippets:
Missing Interpolation
For a Time Series
# Fill null values using interpolation:
df_energy.interpolate(method='linear', limit_direction='forward', inplace=True, axis=0) # since we have a time series, a linear interpolation between neighbouring values is a reasonable way to fill the gaps
Data-Cleaning: Duplicates
Number of Duplicates across the WHOLE Data-Frame
temp_energy = df_energy.duplicated(keep='first').sum()
print('There are {} duplicate rows in df_energy based on all columns.'
.format(temp_energy))
Drop Duplicate values
# Option 1: with reset_index & a new set_index
df_weather_2 = df_weather.reset_index().drop_duplicates(
    subset=['time', 'city_name'], # drop a row if it is a duplicate with respect to the 'time' &
    # 'city_name'-columns (caution: only these two columns are considered for this duplicate-check!).
    keep='last').set_index('time') # if there are duplicates, keep only the last of the duplicated rows.
# Option 2: drop the "old" index (from BEFORE the merge) and set a new index
df_unique_dates = df_ferien.drop_duplicates(
    subset='datum', # only consider the column "datum" [= the column that has duplicates] when dropping the duplicates
    keep='first').reset_index(drop=True) # reset the index, otherwise you get weird indices (above 10'000 for some rows)
    # 'drop = True' means that we do not keep the "old" index as a separate
    # 'column'
df_unique_dates
Data-Cleaning: Outliers
Draw a Boxplot for a specific column
import seaborn as sns
sns.boxplot(x=df_weather['pressure'])
plt.show()
- Key-Question: Are there Outliers? → Yes / No
- "Trick" to answer the question: If you deal with temperature, for example, google "highest temperature on earth" and look it up on Wikipedia to determine whether your value is an outlier.
- Example: A pressure of approximately 100,000 hPa (= 10 MPa), which is clearly visible in the above figure, corresponds to a quantity greater than the atmospheric pressure of Venus. To be sure, we will set as NaN every value in the pressure-column which is higher than 1051 hPa, which is just above the highest air pressure ever recorded in the Iberian peninsula. While outliers on the low side are not visible in the boxplot above, it is a good idea to also replace the values which are lower than 931 hPa, i.e. the lowest air pressure ever recorded in the Iberian peninsula.
Step 2: If the answer to the above question is 'yes', then set the values which are above (or below) a certain implausible threshold to NaNs.
# Replace outliers in the `Set_Name_of_Column_with_the_Outlier_hier`-column with `NaN`s
df_weather.loc[df_weather.pressure > 1051, 'pressure'] = np.nan
df_weather.loc[df_weather.pressure < 931, 'pressure'] = np.nan
Merging & Splitting
- Important note before you start: If you want to merge 2 dataframes, where one is smaller than the other, then you CANNOT make the bigger dataset smaller with merge(). I spent a lot of time on this, without success. However, there is another solution: you just need to drop the duplicates of the bigger dataframe (see the chapter Duplicates) in order to make the dataset smaller! =)
Splitting a Dataset into smaller parts, by using categories to split it up
# Split the df_weather into 5 dataframes (one for each of the 5 cities):
df_1, df_2, df_3, df_4, df_5 = [x for _, x in df_weather.groupby('city_name')]
dfs = [df_1, df_2, df_3, df_4, df_5]
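Note: unpacking into df_1, ..., df_5 only works if you know in advance that there are exactly 5 cities. A more flexible variant (same df_weather assumption) stores the groups in a dictionary, keyed by city name:
dfs_by_city = {city: group for city, group in df_weather.groupby('city_name')} # one DataFrame per city
dfs_by_city.keys() # check: should list the names of all cities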
Merge 2 separate Data Sets together via their row-index (= row-label)
Note: You can only use this type of merging IF both DFs have the same row-index!
- Goal: With this type of merging, you will be able to add new columns from a different DF (which is very common in practice).
# Let's merge all the y- & X-Variables from the training-set together:
test = pd.merge(
    y_train, # the training set of the y-variable
    x_train, # the training set of all X-variables
    how="outer",
    left_on=y_train.index, # merging via the index // row-label of the DataFrame of the y-variable
    right_on=x_train.index, # merging via the index // row-label of the DataFrame of the X-variables
).set_index('key_0') # optional: since we have a time series here, whose row-index gets the column name 'key_0'
# during the merge, the column 'key_0' is set as the new row-index of the merged dataframe.
test # check if it worked
Merge different datasets simultaneously together
Starting Situation: Assume that - initially - we have 2 datasets: one for the weather and one for the energy prices. Furthermore, those two datasets are time series. Hence, they have the same time-index (= row-label), formatted in UTC.
Step 1: Split up the weather-dataset, sorted by cities
# Split the df_weather into 5 dataframes (one for each of the 5 cities):
df_1, df_2, df_3, df_4, df_5 = [x for _, x in df_weather.groupby('city_name')]
dfs = [df_1, df_2, df_3, df_4, df_5]
Step 2:
Merge the 5 sub-datasets with the Energy-Price Dataset
# Step 1: save a copy, in case you do a wrong merging!
df_final = df_energy
# Step 2: make a for-loop, to merge all the 6 datasets simultaneously
for df in dfs: # here, we loop through every city-group of our list of data frames (see step 1)
    city = df['city_name'].unique() # we store the city name of the current group (returned as an array)
    city_str = str(city).replace("'", "").replace('[', '').replace(']', '').replace(' ', '') # we perform some
    # string-manipulation to eliminate all the characters that are not necessary
    df = df.add_suffix('_{}'.format(city_str)) # we re-name the columns, by adding the name of the city to each column.
    df_final = df_final.merge(df, # this is the merging-part!
                              on=['time'], # we want to merge via the index // row-label of both datasets --> since
                              # they are both in UTC-time, this will work!
                              how='outer') # 'outer' means: we want the 'union' --> see this youtube-video for a
                              # good explanation: https://www.youtube.com/watch?v=h4hOPGo4UVU
    df_final = df_final.drop('city_name_{}'.format(city_str), axis=1) # let's drop the columns that we don't need anymore
# Step 3: "final results"-check
df_final.columns # show the merging-results, by displaying all the column-names --> DONE! =)
Step 3: Make some final-checks
- Key-Question: Did the merging really work? → Yes / No
- "Trick" to answer the question: Look at Missings AND Duplicates.
# Display the number of NaNs and duplicates in the final dataframe
print('There are {} missing values or NaNs in df_final.'
.format(df_final.isnull().values.sum()))
temp_final = df_final.duplicated(keep='first').sum()
print('\nThere are {} duplicate rows in df_energy based on all columns.'
.format(temp_final))
If you have no missings AND no duplicates, then merging should have worked and the answer to the above question should be 'yes'.
If the answer to the above question is 'no', then you need to go back to step 2 and try to figure out what you did wrong when merging the datasets together.
Scaling Variables
When applying machine learning, one precondition for using those fancy models is to scale your variables first.
The main methods for scaling can be seen in this Stack-Exchange Post.
Normalization
If you normalize variables, it means that the RANGE of possible values (= the domain) is mapped to the interval between 0 and 1.
Scale all X-Variables and your Y-variable
from sklearn.preprocessing import MinMaxScaler # the class for normalization is called 'MinMaxScaler'
# step 1: initialize the class
scaler_X = MinMaxScaler(feature_range=(0, 1))
scaler_y = MinMaxScaler(feature_range=(0, 1))
# step 2: fit the scaler on the X- & y-variables (using only the training set!) (note: fit expects a
# 2-dimensional input, e.g. a DataFrame or a reshaped numpy-array, but not a 1-dimensional Series!)
scaler_X.fit(X_compl_df[:train_end_idx]) # alternatively: scaler_X.fit(X_train.to_numpy().reshape(-1, 1))
scaler_y.fit(y_compl_df[:train_end_idx]) # alternatively: scaler_y.fit(y_train.to_numpy().reshape(-1, 1))
# step 3: 'fit' only learns the min & max of each column - the actual scaling happens with 'transform'
X_norm = scaler_X.transform(X_compl_df)
y_norm = scaler_y.transform(y_compl_df)
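After modelling, you will usually want to bring the (scaled) predictions back to the original scale. A minimal sketch with the scalers from above (y_pred_norm is a placeholder for your model's scaled predictions):
y_pred = scaler_y.inverse_transform(y_pred_norm) # maps the values from the [0, 1]-range back to the original scale
# note: inverse_transform expects a 2-dimensional input, e.g. y_pred_norm.reshape(-1, 1) if it is a 1-d array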
Advanced Exploration of your data
How to create & plot a Pearson Correlation Matrix out of a Dataframe?
# Step 1: construct a 'Pearson Correlation Matrix' as a Dataframe:
correlations = df.corr(method='pearson')
# Step 2: Load Libraries & plot Pearson correlation matrix:
import matplotlib.pyplot as plt
import seaborn as sns
fig = plt.figure(figsize=(24, 24)) # set the size of the figure
sns.heatmap(correlations, annot=True, fmt='.2f') # plot the correlation matrix as an annotated heatmap
plt.title('Pearson Correlation Matrix')
plt.show()
Time-Series Data
Indexing for Time-Series
If you are dealing with any type of “time”-variables, you will often need to use so-called Date-Time Objects
in Python when doing the data cleaning.
Creating an index for Time-Series
start = '2020-01-01'
stop = '2021-01-01' # CAUTION: the index will end on 31.12.2020, 23:00, because closed='left' excludes the end date
ts_index = pd.date_range(start, stop, freq = 'h', closed = 'left', tz = 'UTC') # note: in newer pandas versions, use inclusive='left' instead of closed='left'
ts_index
(Optional) extension: Index-Slicing
The disadvantage of the above index is that it always starts at 0:00. The problem is that you won’t always need an index which starts exactly at 0:00, but - one day, maybe - you will need one that starts at 23:00.
Solution: We can use the slicing technique to select the specific part of the time-index that we want, for example: DF2 = DF1.loc[:'2019-05-26 13:00:00+00:00']. In case of a (pandas) Series, we can also use the slicing technique, for instance like this: ts_index[4:].
Make a copy of the index as a Column of a DataFrame
p_train = p_train.reset_index()
p_train['date'] = p_train['Date-Time'] # copy the (former) index - now a regular column - into a new column called 'date'
p_train = p_train.set_index('Date-Time') # note: re-assign the result, otherwise set_index() does not modify p_train
Converting a Time-Column into a Datetime-Column and set it as an Index (= row label)
If you work with time series, it will oftentimes be the case that you need to convert your date-column - which oftentimes is a string-column - into an actual date-type column!
df_energy['Date-Time'] = pd.to_datetime(df_energy['Date-Time'], utc=True, infer_datetime_format=True)
df_energy = df_energy.set_index('Date-Time') # set the 'Date-Time'-Column as the index // row-label of our data-frame
If you want to set the format by yourself:
#convert Delivery day to a date time column
epex_df['Delivery day'] = pd.to_datetime(epex_df['Delivery day'], format = '%Y%m%d')
Convert a Date-Time-Column into a String-Format
df['Date-Time'] = df['Date-Time'].apply(lambda x: x.strftime('%Y-%m-%d %H:%M:%S%z'))
Why would we do this? Sometimes there will be the problem that we want to merge different time series together, but they have different Date-Time-formats to begin with!
Hence, BEFORE we apply the universal UTC Date-Time-format, we will need to bring every separate data-frame into the same Date-Time-format.
Since Python is really good at handling transformations on the str-format, we can use this to our advantage and:
1) Bring the `Date-Time`-**String** into the form we want it to have in all our _separate_ `Date-Time`-Columns across those different datasets.
2) And then, we will be able to convert them back into the `UTC`-Format! =)
In order to see, how to manipulate strings, I strongly recommend you to look at the sub-chapter String-Manipulation.
Creating a range of dates in Python
Specify start and end, with the default daily frequency.
import pandas as pd
pd.date_range(start='1/1/2018', end='1/08/2018')
- Alternatively, we could also do it like this:
# Step 1: We need a starting & end point for the range of dates
from datetime import datetime, timedelta # needed for "today" and for adding days
the_start_date = '2019-01-01'
the_end_date = str(datetime.today().date() + timedelta(days = 2)) # this is simply "today" + 2 days into the future
# Step 2: Let's create the range of dates based on the start- & end-point
date_range_of_interest = pd.date_range(the_start_date, # starting date-point (= is fixed)
                                       the_end_date, # end date-point (= can change dynamically // updates with new data from entsoe)
                                       freq = 'h', # we want to have 'hours'
                                       closed = 'left', # the input 'closed' lets you choose whether you want to include the start and end dates.
                                       # --> 'left' will exclude the end date, which is good, since this date is in the "future" (= today + 2)!
                                       tz = 'Europe/Berlin') # this is the time-zone
Date range with 15-minute intervals
date_range_15Min = pd.date_range(the_start_date,
the_end_date,
freq = '15Min', # the only thing that changes --> will be used for Germany, since DE has quarter-hourly observations
closed = 'left',
tz = 'Europe/Berlin')
Plotting a Time Series
Step 1: Transform the date-column into a date-time-column.
from datetime import datetime as dt # this is the package we need to convert a column into a 'date-time'
# Step 1: Define a function that is reading the date-column correctly
def parse_date(date):
    date = str(date) # make sure the input is a string
    if date == "": # this condition will never be true here, since the column 'Date-Time' NEVER has an empty string;
        ### raise exception
        return None
    else:
        return pd.to_datetime(date, format='%Y-%m-%d %H:%M:%S', yearfirst = True, utc = True)
# Step 2: apply the above function on the Column 'Date-Time' to transform the column into a 'date-time'-type
raw_dd["Date-Time"] = raw_dd["Date-Time"].apply(parse_date)
Step 2: Set the 'time'-column as the index (= row-label), otherwise the plotting won’t work.
# set the date-column as our row-label:
raw_dd = raw_dd.set_index("Date-Time")
Step 3: Visualize the Time-Series
from sktime.utils.plotting import plot_series # use sktime and the 'plot_series'-module for this task
# Define our y-variable (= DA-Prices in CH):
y = raw_dd["pricesCH"]
plot_series(y) # visual check if it worked? --> yes!
Bonus: Alternatively, you can also use the following function to plot a time-series:
# Define a function to plot different types of time-series:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
def plot_series(df=None, column=None, series=pd.Series(dtype=float), # the 'series'-input is useful if we choose to plot the series in a monthly or weekly frequency, instead of hourly...
                label=None, ylabel=None, title=None, start=0, end=None):
    """
    Plots a certain time-series which has either been loaded in a dataframe
    and which constitutes one of its columns, or a custom pandas series
    created by the user. The user can define either the 'df' and the 'column'
    or the 'series', and additionally can also define the 'label', the
    'ylabel', the 'title', the 'start' and the 'end' of the plot.
    """
    sns.set()
    fig, ax = plt.subplots(figsize=(30, 12))
    ax.set_xlabel('Time', fontsize=16)
    if column:
        ax.plot(df[column][start:end], label=label)
        ax.set_ylabel(ylabel, fontsize=16)
    if series.any():
        ax.plot(series, label=label)
        ax.set_ylabel(ylabel, fontsize=16)
    if label:
        ax.legend(fontsize=16)
    if title:
        ax.set_title(title, fontsize=24)
    ax.grid(True)
    return ax
Plot Auto-Correlation & Partial-Auto-Correlation Functions
The partial autocorrelation plot of the electricity price time series shows that the direct relationship between an observation at a given hour (t) is strongest with the observations at the t-1, t-2, t-24 and t-25 time-steps and diminishes afterwards. Thus, we are going to use the 25 previous values of each time series, which will constitute a feature for our models.
# Step 1: load library
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf # to plot
# Step 2: Plot autocorrelation and partial autocorrelation plots
fig, (ax1, ax2) = plt.subplots(nrows=2, figsize=(10, 6))
plot_acf(df_energy['pricesCH'], lags=50, ax=ax1)
plot_pacf(df_energy['pricesCH'], lags=50, ax=ax2)
plt.tight_layout()
plt.show()
Plot the Cross-Correlation (Yt, Xt - k)
It would quite likely be more beneficial if we only chose to use specific past values (observations at certain time-lags) of a given feature, based on the cross-correlation between the electricity price and each one of the features in the dataset. For example, below we can see the cross-correlation between the electricity price and the renewable generation in Switzerland. We see that there are many time-lags with a correlation which is close to zero and could be omitted.
# Step 1: load the function that computes the cross-correlation
from statsmodels.tsa.stattools import ccf
# Step 2: plot the cross-correlation between the day-ahead price "today" and each of the first 50 lags of the column 'RenGenCH'
cross_corr = ccf(df_energy['RenGenCH'], df_energy['pricesCH'])
plt.plot(cross_corr[0:50])
plt.show()
Feature Engineering for Time-Series
Math with Date-Time Objects: Addition & Subtraction of Time
Why should you care? Oftentimes, you will have to define points in time in the future or in the past. Therefore, it is essential that you learn how to do math with 'date-time'-objects.
Find the point in Time “now” // “today”
import datetime # since we work with "time", we need the package "datetime"
jetzt = datetime.datetime.now() # current time
print(jetzt) # check: should output the current time
Math with Dates: Addition while using “now” as our Reference-Point // Starting-Point
import datetime # since we work with "time", we need the package "datetime"
jetzt = datetime.datetime.now() # current time
added_time = datetime.timedelta(hours= 1, minutes= 1) # number of hours & minutes that you want to add
ende = jetzt + added_time
print(ende) # check: should output "current time + 1h 1min"
Math with Dates: Subtraction while using “now” as our Reference-Point // Starting-Point
import datetime # since we work with "time", we need the package "datetime"
fixe_zeit = datetime.datetime(2021, 7, 1, 8, 12) # datetime(year, month, day, hour, minute, second, microsecond),
# where the first 3 arguments (year, month, day) are MANDATORY! --> here we are interested in the last 2 inputs,
# namely the "8" and the "12" --> these reflect my fixed working time of 8h 12min at SBB
added_time = datetime.timedelta(hours= 1, minutes= 1) # number of hours & minutes that you have already worked
wie_lange_noch = fixe_zeit - added_time
print(wie_lange_noch.strftime("%H%M")) # check: should output 7h 11min --> yes!
Day of the Week given a Date
If you want to know - for example - which day of the week the 21.03.2020 was, simply use this code:
import datetime
fixe_zeit = datetime.datetime(2020, 3, 21) # datetime(year, month, day, hour, minute, second, microsecond)
fixe_zeit.weekday()
# --- Gemäss Documentation
# 0 = Monday
# 1 = Tuesday
# 2 = Wednesday
# 3 = Thursday
# 4 = Friday
# 5 = Saturday
# 6 = Sunday
addierte_anzahl_tage = datetime.timedelta(days = 28) # 28 days = exactly 4 weeks
next_day = fixe_zeit + addierte_anzahl_tage # adding a multiple of 7 days always lands on the same weekday again
next_next_day = next_day + addierte_anzahl_tage
next_next_next_day = next_next_day + addierte_anzahl_tage
next_next_next_next_day = next_next_next_day + addierte_anzahl_tage
next_next_next_next_day # check: 4 * 28 days later, it is still the same weekday (a Saturday)
We see that the 21.03.2020
was a Saturday. I checked it with the calendar: it’s correct =)
Trends
Weekly Rolling Mean // Gleitender Durchschnitt
### Step 1): From your hourly(!!) time series, compute a 1-week rolling mean:
rolling = p_train['pricesCH'].rolling(24*7, center=True).mean() # this is the rolling (weekly) mean
### Step 2): Re-use the plot_series() helper function that was defined above in the 'Plotting a Time Series' section (Bonus part).
### Step 3): Plot the rolling mean
ax = plot_series(p_train, 'pricesCH', label='Hourly', ylabel='Actual DA-Price (€/MWh)',
title='Day-Ahead Hourly Electricity Price and Weekly Rolling Mean')
ax.plot(rolling, linestyle='-', linewidth=2, label='Weekly rolling mean') # add the rolling mean to the hourly time-series
plt.show()
Seasonality
Weekly Seasonality
### Step 1: Re-use the plot_series() helper function that was defined above in the 'Plotting a Time Series' section (Bonus part).
### Step 2: Plot the (hourly) Time Series over 2 weeks:
ax = plot_series(p_train, 'pricesCH', label='Hourly', ylabel='Actual Price (€/MWh)',
start= 1, end=1 + 24 * 14, # zoom in on 2 weeks
title='Hourly Electricity Price (Zoomed - 2 Weeks)')
plt.show()
Volatility
### Step 1): calculate the volatility (= 'change'-variable)
change = (p_train['pricesCH'].div(p_train['pricesCH'].shift(1)) - 1)*100 # divide the 'pricesCH'-column by the
# lagged version of itself, subtract 1, then multiply by 100 to get the growth rate in %. Store it in a
# new variable called 'change'.
### Step 2): Plot the percentage of the hourly change in the actual electricity price
ax = plot_series(series=change, ylabel='Hourly Change (%)',
title='Percentage of the hourly change in the day-ahead electricity price')
plt.show()
Daily Volatility
Here, we “Re-Sample” our hourly Volatility into daily volatility:
# Plot the daily volatility of the day-ahead electricity price
# Step 1: Re-sample the data to a daily frequency
daily_volatility = p_train['change'].asfreq('d') # note: asfreq('d') takes one value per day (it does NOT average);
# use p_train['change'].resample('d').mean() instead if you want a true daily aggregation --> save it as a series
# Step 2: Re-use the plot_series() helper function that was defined above in the 'Plotting a Time Series' section (Bonus part).
# Step 3: Add the aggregated data to the plot
ax = plot_series(series=daily_volatility, ylabel='Daily Volatility (%)',
                 title='DAILY Volatility of the Day-Ahead Electricity Price')
# Step 4: final plot is ready
plt.show()
Dummy-Variables
Create a “normal” Dummy:
In Python, you can use the np.where-function to create a dummy based on a single condition.
This is equivalent to the ifelse()-function in R:
import numpy as np
np.where(maupayment['log_month'] == maupayment['install_month'], 'install', 'existing')
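Since maupayment is a project-specific dataset, here is a minimal self-contained sketch of the same np.where-pattern (the data is made up):
import numpy as np
import pandas as pd
df_demo = pd.DataFrame({'order_total': [20, 80, 55]})
df_demo['big_order'] = np.where(df_demo['order_total'] >= 50, 1, 0) # dummy = 1 if the condition holds, 0 otherwise
df_demo # check if it worked? --> [0, 1, 1]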
Create a Dummy by merging & flagging
This can be useful when you need to track which of the rows was part of the "left" and which was part of the "right" dataset. You literally create a dummy-variable when you do this, and it is called the _merge-column.
Use Case 1:
Create a dummy for holidays.
df2 = df.merge(df_year_2019,
               on='date',      # in this case, we merge on the "date"-column; note that 'df' has 365 rows, while 'df_year_2019' has only 270 rows
               how='outer',    # we need an 'outer'-join, so that NO rows get dropped
               indicator=True) # the "indicator"-flag adds the '_merge'-column to the result
df2 # on the outside, the merged 'df2' does NOT seem to be different from 'df',
# however, the purpose of the merge was to create a new dummy-variable that shows from which dataset each row came.
# step 2: we want some more meaningful values within the '_merge'-column --> apply an if-statement onto the (string-) column
df2.loc[df2['_merge'] == 'both', 'ferien'] = 1 # Here, we create a column 'ferien' (= "holidays"), which will be equal to 1 if
# the value in the '_merge'-column is equal to the string "both"
df2.loc[df2['_merge'] != 'both', 'ferien'] = 0 # the holiday-dummy equals 0 otherwise
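Because `df` and `df_year_2019` are not defined here either, a tiny self-contained sketch of the same merge-and-flag pattern (both DataFrames below are invented for illustration):
import pandas as pd

all_days = pd.DataFrame({'date': pd.date_range('2019-01-01', periods=5, freq='D')})
holidays = pd.DataFrame({'date': [pd.Timestamp('2019-01-01'), pd.Timestamp('2019-01-02')]})

flagged = all_days.merge(holidays, on='date', how='outer', indicator=True)
flagged['ferien'] = (flagged['_merge'] == 'both').astype(int) # 1 = holiday, 0 = no holiday
flagged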
Dummy-Variables: Create Columns for specific hours, seasons of the year etc…
In the following, I will use a column called `Date-Time` to create new columns that extract the ’time-information’ (hours, quarters etc.) we need.
Extract `hours` and put them into a separate column
raw_dd['Hour'] = raw_dd['Date-Time'].apply(lambda x: x.hour) # values from 0 [= hour 0:00-1:00] to 23 [= hour 23:00-24:00]
Extract `Seasons` (= “quarters”) and put them into a separate column
raw_dd['Quarter'] = raw_dd['Date-Time'].apply(lambda x: x.quarter) # values from 1 [= January to March] to 4 [= October to December]
Extract `Days` and put them into a separate column
raw_dd['day_of_week'] = raw_dd['Date-Time'].apply(lambda x: x.weekday()) # values from 0 [= Monday] to 6 [= Sunday]
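A slightly faster, vectorized alternative to `.apply(lambda ...)` is pandas' `.dt`-accessor. A small sketch, assuming the same `raw_dd`-DataFrame and that its 'Date-Time'-column can be parsed as datetimes:
import pandas as pd

raw_dd['Date-Time'] = pd.to_datetime(raw_dd['Date-Time']) # make sure the column has a datetime-dtype
raw_dd['Hour'] = raw_dd['Date-Time'].dt.hour              # 0 - 23
raw_dd['Quarter'] = raw_dd['Date-Time'].dt.quarter        # 1 - 4
raw_dd['day_of_week'] = raw_dd['Date-Time'].dt.dayofweek  # 0 [= Monday] - 6 [= Sunday]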
Apply If-Conditions on a DataFrame
- Here is a really good reference: https://datatofish.com/if-condition-in-pandas-dataframe/
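For a quick taste of the patterns explained in that reference, here is a minimal, self-contained sketch (the DataFrame and the thresholds are invented):
import numpy as np
import pandas as pd

scores = pd.DataFrame({'points': [35, 60, 82, 47]})

# Variant 1: np.where for a single condition
scores['passed'] = np.where(scores['points'] >= 50, 'yes', 'no')

# Variant 2: df.loc for condition-based assignments
scores.loc[scores['points'] >= 80, 'grade'] = 'A'
scores.loc[scores['points'] < 80, 'grade'] = 'B or lower'
scores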
Visualization of the Data
Plot & Save an Image
# Step 1: import the library matplotlib for visualization
import matplotlib.pyplot as plt # for the settings of the graph, such as title, window-size etc...
# Step 2: make the plot
plt.figure(figsize=(14,6))
fig, ax = plot_series(y_train[10290:10320], # only take the last 30 observations from the training set
                      y_test,
                      y_pred,
                      labels=["y_train", "y_test", "y_pred"]) # NOTE: this 'plot_series' is presumably sktime's plotting utility
                                                              # (from sktime.utils.plotting import plot_series), NOT the custom function defined earlier
plt.xticks(rotation=90) # rotate the x-axis labels by 90 degrees, so that the x-axis stays readable
# Step 3: save the plot
fig.savefig("monochrome-2-test.png") # save the graph
For-Loops to iterate over lots of variables and save them
# step 1: create a list of all the columns, you want to iterate through
cov_list = X_train.columns # I select all columns in my dataframe 'X_train'
# step 2:
for i in cov_list:
print(i) # just as a check, to see how the loop progresses
    covariate_name = i # every time I iterate over a new variable, I store its name here --> I will use it later to save each plot under a unique file name
covariate = X_train[i] # I need each variable to be of type "Series" when creating a plot
x = list(range(0, 720, 24)) # this will put a tick-mark after each day
x.append(719) # I also add '719', since the range()-function does not include this number
fig, ax = plot_series(covariate[-720:]) # this plots the last month
plt.xticks(x); # to only display some dates
plt.xticks(rotation=90); # we need to rotate the x-axis, otherwise you cannot read it
fig.savefig("visualisation_30_days-{}.png".format(covariate_name)) # save the graph for each column. The "trick" here is, that we can use
# f-strings to create a unique name for each file =)
Plot a Histogram // Distribution-Plot
# Plot the histogram of the day-ahead electricity price
ax = p_train['pricesCH'].plot.hist(bins=18, alpha=0.65)
Python Magic Commands
Show ALL Columns of a DataFrame
Why should you care?
When we try to print out large data sets to the screen in Python, not all columns are shown in the output, because there is simply not enough space to display them all. However, we can change some default settings in pandas
in order to make ALL columns visible:
import pandas as pd
pd.set_option('display.max_rows', 500)
pd.set_option("display.max.columns", None) # this will tell Python: "show me ALL the columns!"
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
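If you only need the setting for a single output instead of globally, pandas also offers a context manager. A minimal sketch, assuming some DataFrame `df` you want to inspect:
with pd.option_context('display.max_columns', None, 'display.max_rows', 100):
    print(df) # inside the 'with'-block ALL columns are shown; outside of it, the defaults apply again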
Working with HTML within Jupyter-Notebooks
How to use anchor-tags to make references within your Jupyter-Notebook?
Click here to move to the very top of this Jupyter-Notebook. This is achieved by using the `<a>`-HTML-tag, a unique ID-name & some CSS-styling on the `<a>`-tag.
<a id="gebe-hier-passenden-id-namen" style="color:black; text-decoration: none;">Hier kommt der Text, auf welchem du zeigen willst...</a>
[And this is the text which will re-direct the user that clicks on this link](#top)
Share Variables between Notebooks
This Magic Command allows you to share any variable between different Jupyter Notebooks. You need to pass the original variable to the magic command. To retrieve the variable later, you run the same command with the `-r` parameter.
myData = "The World Makes Sense"
%store myData
Now that you ‘saved’ the variable `myData` with the magic command, go and open another of your Jupyter Notebooks and type in the following:
%store -r myData # step one: run this line in the first cell
myData # step 2: run this line in the second cell
Useful “Tricks” I stumbled upon
Reshape(-1, 1)
What does `reshape(-1, 1)` do?
If you have a (NumPy-) array like this: `[1, 2, 3, 4, 5]` …
… then calling `reshape(-1, 1)` on it will create a 2-dimensional array with exactly 1 element per row, i.e. the output will be `[[1], [2], [3], [4], [5]]`. The `-1` simply tells NumPy to infer the number of rows by itself.
See also this Stack-Overflow answer.
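A tiny runnable sketch of this (this shape is, for example, what scikit-learn expects when you pass a single feature):
import numpy as np

flat = np.array([1, 2, 3, 4, 5])
column_vector = flat.reshape(-1, 1) # '-1' = "infer the number of rows yourself", '1' = exactly one column
print(column_vector.shape)          # (5, 1)
print(column_vector)                # [[1] [2] [3] [4] [5]]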
Math Tricks for Hacks
How to always round downwards?
By using the `round`-function, the result gets rounded up OR down to the nearest value, as we are normally used to:
round(47/24, ndigits = 2) # round to 2 decimal places
However, by using the `floor`-function, the result will ALWAYS be rounded downwards, as illustrated by the following example:
import math # needed for 'math.floor'
math.floor(47/24)
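For completeness, the counterpart that ALWAYS rounds upwards is `math.ceil`. A quick sketch comparing the three:
import math

print(round(47/24, ndigits=2)) # 1.96 -> normal rounding
print(math.floor(47/24))       # 1    -> always rounds DOWN
print(math.ceil(47/24))        # 2    -> always rounds UP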
Calculate the Median?
Python has no built-in `median()`-function (you would have to import `statistics` or `numpy` for that), but you can also write one yourself:
# Step 1: make a function for the case "median calculation for an ODD (= ungerade) list of numbers"
def _median_odd(xs) -> float:
return sorted(xs)[len(xs) // 2]
# Step 2: make a function for the case "median calculation for an EVEN (= gerade) list of numbers"
def _median_even(xs) -> float:
sorted_xs = sorted(xs)
hi_midpoint = len(xs) // 2
return (sorted_xs[hi_midpoint - 1] + sorted_xs[hi_midpoint]) / 2
# Step 3: Use the 2 above functions to finally be able to build the median-function
def median(v) -> float:
return _median_even(v) if len(v) % 2 == 0 else _median_odd(v)
assert median([1, 10, 2, 9, 5]) == 5 # check if the above function was written correctly (you can change the number '5' and see
# what happens, if this expression becomes incorrect)
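If you just need the value and not the exercise, the standard library already provides a median. A one-line sketch:
import statistics
statistics.median([1, 10, 2, 9, 5]) # 5 --> same result as the hand-written function above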
Fun
In this section, I will just put some important concepts & some of my “inventions” that I found useful.
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(10, 2),
                  columns=['Col1', 'Col2'])
df['X'] = pd.Series(['A', 'A', 'A', 'A', 'A',
                     'B', 'B', 'B', 'B', 'B'])
df
for i, el in enumerate(list(df.columns.values)[:-1]): # loop over 'Col1' and 'Col2' (the last column 'X' is the grouping variable)
    a = df.boxplot(el, by='X') # one boxplot per numeric column, grouped by the categories in 'X'
Working-time calculation (how much is left to work…)
In order to check how many working hours & minutes I have left during a day, I use a function called `hours_to_go()`, which calculates how much time I still have to work (for the current day):
import datetime # since we are working with "time", we need the "datetime" package
def hours_to_go(Stunden, Minuten): # input: how many hours & minutes have you already worked today?
    """Tells me how long I still have to "work" today --> simply enter the number of hours & minutes that you have
    already worked "today".
    > hours_to_go(3,5) --> 0507
    Interpretation: my fixed working time is 8h 12min. From that, I subtract the 3h 5min already worked --> 0507 means
    I still have 5h 7min left to work, which is correct!"""
    fixe_arbeitszeit = datetime.datetime(2021, 7, 1, 8, 12) # datetime(year, month, day, hour, minute, second, microsecond),
    # where the first 3 arguments (year, month, day) are MANDATORY! --> here, we only care about the last 2 inputs,
    # namely the "8" and the "12" --> they reflect my fixed working time of 8h 12min at the SBB
    gearbeitet = datetime.timedelta(hours= Stunden, minutes= Minuten) # number of hours & minutes already worked today
    wie_lange_noch = fixe_arbeitszeit - gearbeitet
    print(wie_lange_noch.strftime("%H%M"))
hours_to_go(0,5) # call the function to output how many hours are left to work
To see at what time I will finish work today, I have created the `arbeit_ende()`-function (German for "end of work"):
import datetime # since we are working with "time", we need the "datetime" package
def arbeit_ende(Stunden, Minuten):
    """Tells me when I will be done "working" today --> simply enter the number of hours & minutes that you still
    have to work "today". For example, if I still have to work '4h 44min' today:
    > arbeit_ende(4, 44) --> 2021-07-01 17:53:02.907698 """
    jetzt = datetime.datetime.now() # current time
    added_time = datetime.timedelta(hours= Stunden, minutes= Minuten) # number of hours & minutes you still have to work
    ende = jetzt + added_time
    print(ende)
arbeit_ende(8,7) # call the function to output at what time I will be done working today