Python Cheat-Sheet for Data-Science
In this document I have collected a variety of Python code snippets that should help any aspiring Data Scientist perform common tasks in the field.
I hope you’ll find what you are looking for 🔥
Table of Contents
- Basic Python for Data-Science
- F-Strings
- List Comprehensions
- Dictionaries
- String Manipulation
- Package Installation
- Help
- Path handling
- Save Data
- Read Data
- Create a DataFrame
- Transformations
- Selection & Filtering of Rows & Columns
- Select 1 Column in a Dataframe
- Select multiple Columns in a Dataframe
- Select AND Filter for Rows with 1 Condition within Dataframe
- Select AND Filter for Rows with multiple Conditions within a Dataframe
- Select only Rows with Missing-Values
- Selection with “Accessor Operators” loc & iloc
- Replace Values within Rows
- The differences in Slicing between loc VS. iloc
- Sorting & Filtering
- Vanilla Data Exploration
- Data Cleaning
- Find the Data Type of a Variable AND the number of missings per Column
- Drop 1 OR multiple Columns
- Re-Name 1 OR multiple Columns
- Control the Row-Index with .reset_index() and .set_index()
- Assign a unique ID
- Apply a function on a Column of a DataFrame
- Dealing with Missings
- Duplicate Data
- Outliers
- Merging & Splitting
- Scaling Variables
- Advanced Data Exploration
- Time Series Exploration
- If-Conditions on a DataFrame
- Data Visualization
- Useful “tricks” I stumbled upon
- Math-Tricks for “Hacks”
- Python Magic Commands
Basic Python for Data-Science
F-Strings
F-strings can be used in Python to dynamically embed the values of variables directly into a string when displaying results.
- An example of an F-String would be:
import math
def day_converter(hour_lag):
"""
What does this function do?:
This Function will convert the index of the cross-correlation to the 'real lag' and also
converst the hours blagged behind into a more human readable lagged value, which will
be in 'days' and 'hours' instead of only 'hours'.
Why is it important?:
Suppose we know that the **index's 167th lag** has the highest correlation. From this fact,
we can conclude that the "true" 168th lag - which corresponds to the 167th index of the lag
in the list - is the most important lag that we are interested in. However, the problem that
we have is: what day & hour corresponds to the 168th lag (here: 168 hours, because our data
is the HOURLY price, e.g. the data has the unit 'hours') in the past? This is where this
converter comes into play.
Example:
--> Input: day_converter(168)
> 'Converted, the 168-th lag is 7 days and 0 hours in the past.'
"""
true_lag = hour_lag + 1
days = math.floor(true_lag/24)
diff = true_lag - (days *24)
print("Converted, the {0}-th lag is {1} days and {2} hours in the past.".format(true_lag, days, diff))
day_converter(169) # call the function, to check if it worked? --> you can play around and change the input
# --> it is always correct and shows the power of f-strings!
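If you only need the bare f-string syntax, here is a minimal sketch (the variable names are made up for illustration):
name = "Anna" # hypothetical variable
n_rows = 1500 # hypothetical variable
print(f"Hello {name}, your dataset has {n_rows} rows.") # everything inside the {} is evaluated at runtime
## Hello Anna, your dataset has 1500 rows.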
List Comprehensions
List Comprehensions are one of Python’s most beloved and unique features. A list comprehension is basically a loop in which you apply a function to each of the elements; its output is a list.
Example: a list comprehension where we apply the function f(x) = x².
List comprehensions are really handy if you want to iterate over a particular column of your dataset and apply a function to each of its elements.
squares = [n**2 for n in range(10)] # apply f(x) = x^2 on a list = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
squares
## [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
print(type(squares)) # Check output: should be a 'list'
## <class 'list'>
List comprehension with an
if
condition:
planets = ['Mercury', 'Venus', 'Earth', 'Mars', 'Jupiter', 'Saturn', 'Uranus', 'Neptune'] # this is the list we will loop through
short_planets = [planet for planet in planets if len(planet) < 6]
short_planets
## ['Venus', 'Earth', 'Mars']
Dictionaries
Adding a new Key-Value Pair
dictionary_name[key] = value
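A minimal sketch of this pattern (the dictionary and its keys are made up for illustration):
planet_radius = {"Mercury": 2440, "Venus": 6052} # an existing dictionary (radius in km)
planet_radius["Earth"] = 6371 # add a new key-value pair
print(planet_radius) # check if it worked? --> {'Mercury': 2440, 'Venus': 6052, 'Earth': 6371}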
Comparing 2 different DataFrames by their Rows & Columns
Context:
Here, we compare two DataFrames by their level of similarity. The key question I tried to answer with the code below is: was I able to replicate the same data frame - relative to some “true” data frame that I tried to mimic - after some data cleaning?
### Step 1): create an empty(!!) dictionary where we will store the results of the error-difference
### between the two ddsets.
result = dict()
### Step 2) let's select one error-metric of our choice. Here: "Root Mean Squared Error" (RMSE)
from sktime.performance_metrics.forecasting import MeanSquaredError
rmse = MeanSquaredError(square_root=True) # note: the metric is a class, so instantiate it first; the resulting
# object can then be called like a function, e.g. rmse(df_JM[col][:end], df_PW[col][:end])
end = "2021-03-12 22:00:00+00:00" # let's slice both dataframes to the same length as PW's benchmark-dataframe
for col in df_joffs_replikation: # we iterate over each column(!!)
    result[col] = df_JM[col][:end] - df_PW[col][:end] # for EACH column: the dictionary-KEYS are the columns' names,
    # while the dictionary-VALUES are the element-wise deviations between the two dataframes.
    # > Important to note: we slice our ddset up to the end of PW's benchmark-ddset (via the `end`-variable, which
    # denotes 12.03.2021, i.e. the end of PW's notebook), when performing the "vector"-subtraction!!
### Step 3) Check what the results are
result # check, if it worked?
print(result.keys()) # further check: did it REALLY work for all columns? --> yes, since it printed out all the 21 keys!
### Step 4) Compute the average error (= my DF-values - PW's DF-values)
"""
> Goal: average error per variable (= column):
--> We want to check how large the value-deviations between Philipp's notebook VS. Joffrey's notebook
are (for each value, in all 18 columns). In other words, **we are interested in the _distribution_ of the
deviations** (i.e. how large the errors are on average and how large their spread is).
For the average error / deviation of my replicated table VS. Philipp's table, I simply take the mean of the
deviations that were computed in 'Step 2' above.
"""
for value in result.values(): # iterate over the dictionary created above and compute the average
    # deviation for each column
    print(value.mean()) # should print 18 results, since we check 18 columns --> we expect roughly
    # the number "0" everywhere, which corresponds to an average error of '0' PER column!
### Step 5): Compute the standard deviation of the deviations for each column
for value in result.values():
    print(value.std())
- Note: The above code can only be executed on DataFrames that fulfill the following conditions:
- All Columns of both DataFrames should have the same names!
- Both DFs should have the same number of rows (= observations). I achieved this by using a variable called end that was used via df_JM[col][:end].
String Manipulation
Transform a list that contains strings into a WHOLE string, only separated by commas
# step 0: create a list with many strings in it
planets = ['Mercury', 'Venus', 'Earth', 'Mars', 'Jupiter', 'Saturn', 'Uranus', 'Neptune']
# Step 1: Transform the whole list into a single "string"
str(planets) # should output: "['Mercury', 'Venus', 'Earth', 'Mars', 'Jupiter', 'Saturn', 'Uranus', 'Neptune']"
# Step 2: Now, we replace all characters that are unnecessary - such as ' [ and ] -such that we return a whole string,
# only separated by commas, with no whitespace in between them:
## "['Mercury', 'Venus', 'Earth', 'Mars', 'Jupiter', 'Saturn', 'Uranus', 'Neptune']"
n = str(planets).replace("'", "").replace('[', '').replace(']', '').replace(' ', '') # replace everything that is unnecessary with empty strings
print(n) # Final output
## Mercury,Venus,Earth,Mars,Jupiter,Saturn,Uranus,Neptune
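Note: the same result can be achieved more directly with the built-in str.join()-method, which is the idiomatic way to concatenate a list of strings with a separator:
n = ",".join(planets) # joins all list elements into one string, separated by commas (no whitespace)
print(n)
## Mercury,Venus,Earth,Mars,Jupiter,Saturn,Uranus,Neptune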
Split a String at a specific point
timestamps = time_series_str_column.str.split("+", expand=True)
### Explanation of the inputs:
# > "+" is the character at which we split the string.
# > "expand = True" means that both parts of the split string are kept and returned as separate columns.
Cut a String in half
Here, we define 2 functions, which will cut the string in half:
def splitstring_p1(value):
    string1, string2 = value[:len(value) // 2], value[len(value) // 2:]
    return string1
def splitstring_p2(value):
    string1, string2 = value[:len(value) // 2], value[len(value) // 2:]
    return string2
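A quick usage example of the two helper functions above (the input string is made up):
print(splitstring_p1("HelloWorld")) # first half --> 'Hello'
print(splitstring_p2("HelloWorld")) # second half --> 'World'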
Convert a String into a Date-Time-Format
df['Date-Time'] = pd.to_datetime(df['Date-Time'], format='%Y-%m-%d %H:%M:%S')
df['Date-Time'] # check, if it worked?
Installation of Packages
Installing with package-manager conda
Conda is a package manager which - before installing a package that you want - checks which dependencies are needed for that package to work.
- Example: If you use sktime, then conda can detect that there may be a conflict with the current version of - for example - the package Tornado. Hence, conda will not only download sktime, but will also bring the package Tornado onto a newer version, such that it becomes compatible with sktime.
- Note: The installation via conda can take a while, since many dependencies will be checked!
- Important: Use the terminal for the following code.
To install a specific package - for example sktime - into an existing virtual environment called "myenv", type the following into the terminal:
conda install --name myenv sktime
To install the package sktime into the current (global) environment (i.e. "normally"), type the following into the terminal:
conda install sktime
Installing packages with pip
Disadvantage of using pip instead of conda? pip will not check in the background whether the versions of the different packages you use are compatible with each other. Hence, using pip may break your project when updating packages OR installing new ones…
To install a package with pip, simply type the following into the terminal:
pip install myPackage
Import some Packages // Modules, or Sub-Modules
import numpy as np # this is a package // module
import matplotlib.pyplot as plt
from matplotlib.dates import date2num # this is an example of a sub-module
import pandas as pd
from scipy import stats
from datetime import datetime
Uninstall Packages with Pip
pip uninstall fbprophet prophet cmdstanpy
Need Help & Documentation?
help()
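For example, you can pass any function or object to help() to print its documentation (here with the built-in len and pandas' read_csv):
help(len) # documentation of a built-in function
import pandas as pd
help(pd.read_csv) # documentation of a function from a package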
Path Handling
Why is path handling crucial to know? Because you will need to load OR save models, DFs or files. In order to be able to do path-handling tasks, you will need to know about the os package, which is already built into “base Python”.
The main formats to save & load files in Python will be:
- The .csv-Format.
- The .pkl-Format.
- etc…
There are a lot of other formats, but the two mentioned above are the most common ones you will encounter.
Set the Path, where your Computer should Save Data / Models
# Set the folder where data // models will be saved
import os # this is the library that can handle paths
save_folder = os.path.expanduser(os.path.join("~", # our user's "home"-directory --> for my Mac it is: "/Users/jomaye"
                                              "Dokumente", # next, we go 1 directory further down, called "Dokumente"
                                              "Programming")) # in general: each string after a comma is a further sub-directory. This can go on indefinitely ;)
save_folder # check if it worked? --> yes! Now you see your path =)
## '/Users/jomaye/Dokumente/Programming'
Save data
Save as CSV-File
dynamic_name = "name_of_a_specific_characteristic_of_the_variable_that_you_want_to_save" # this is needed for "dynamic saving" via an f-string
YOUR_VAR_NAME.to_csv(f"Name-of-the-CSV-{dynamic_name}.csv") # I use f-strings since they allow me to adapt the CSV-filename to a specific characteristic that I used for a model / method etc...
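If you want to save the CSV-file into the save_folder from the path-handling chapter above (instead of the current working directory), you can combine os.path.join with the same f-string pattern - a small sketch, assuming save_folder, dynamic_name and YOUR_VAR_NAME exist:
import os
output_path = os.path.join(save_folder, f"Name-of-the-CSV-{dynamic_name}.csv") # full path = folder + filename
YOUR_VAR_NAME.to_csv(output_path) # assumes YOUR_VAR_NAME is a DataFrame and that save_folder exists on disk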
Read data // Load data
In order to load datasets, you will need the library pandas. For some cases, additional libraries may be needed.
Load data from Excel-Sheets
epex_df = pd.read_excel("./Data_V2/Preis_aktuell_Spot_EEX_CH-19-ver-mac.xlsx", # plug in the correct path
                        header=[1], # the dataset's column names start on the 2nd row (--> the 1st row is the title of the Excel-file)
                        sheet_name='Prices', # if you have more than 1 Excel-sheet within the Excel-file, you need to specify
                        # which sheet you want to load
                        engine='openpyxl') # this input will (sometimes) be needed if you load data from an Excel-file via
                        # a Windows-computer, otherwise it can throw an error!
epex_df # output the ddset as a check
Load Dataset via a CSV-File
test = pd.read_csv(
"C:/Users/u235051/Downloads/ETS_Database_v38/ETS_Database_v38.csv",
sep='\t' # in the CSV-file the columns were separated by the "Tab"-key, which is why this option is mandatory here (otherwise: error!)
)
test # check, if it worked?
Load Dataset via an URL from a Website
In order to be able to access a dataset stored as a csv-file on a website, we will need to use - besides pandas - an additional library called requests.
Step 1: Make the file ready to be downloaded by sending a query to the remote server // website that hosts the csv-file.
import requests # load the library needed
download_url = "https://raw.githubusercontent.com/fivethirtyeight/data/master/nba-elo/nbaallelo.csv" # absolute URL
target_csv_path = "nba_all_elo.csv" # name of the .csv-file that contains the data
response = requests.get(download_url) # using an API that "gets" (= http-protocol language) the data from the server
response.raise_for_status() # Check that the request was successful
with open(target_csv_path, "wb") as f:
    f.write(response.content)
print("Download ready.")
Step 2: Load the actual data.
import pandas as pd # load the library needed
nba = pd.read_csv("nba_all_elo.csv") # load the data --> ddset is called 'nba'
type(nba) # check if it worked? --> should output: <class 'pandas.core.frame.DataFrame'>
Let’s also check the dataframe:
nba.head()
Load Pickle-Files
import pickle # needed to load pickle-files
filename = "Put_Your_Pickle-File_Name_Here" # alternatively: os.path.join(data_folder_variable_where_pkl_is_saved, filename)
test_loaded_pkl = pickle.load(open(filename, 'rb'))
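The counterpart for saving an object as a pickle-file is pickle.dump - a minimal sketch (my_object and the filename are placeholders):
import pickle
with open("my_object.pkl", "wb") as f: # "wb" = write in binary mode
    pickle.dump(my_object, f) # 'my_object' can be a fitted model, a DataFrame, a dict, ...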
Create a DataFrame
df = pd.DataFrame({'LoadCH': ts_loadFc[time_index_hourly],
'LoadD': ts_DE_loadFc.resample('H').mean()[time_index_hourly],
'LoadF': ts_FR_loadFc[time_index_hourly],
'LoadIT': ts_IT_loadFc[time_index_hourly],
'GenCH': ts_genFc[time_index_hourly],
'GenD': ts_DE_genFc[time_index_hourly],
'GenF': ts_FR_genFc[time_index_hourly],
'GenIT': ts_IT_genFc[time_index_hourly],
'RenGenCH': ts_RenGenAllCH[time_index_hourly], # we take the aggregated renewable generation
'RenGenD': ts_RenGenAllDE[time_index_hourly],
'RenGenF': ts_RenGenAllFR[time_index_hourly],
'RenGenIT': ts_RenGenAllIT[time_index_hourly],
'TransFromDach': ts_TransFromDACH[time_index_hourly],
'TransToDach': ts_TransToDACH[time_index_hourly],
'TransToIT': ts_TransToIT[time_index_hourly],
'SeasonAndProduct': ts_season_prod[time_index_hourly], # TO-DO: re-name this column, since its name is confusing!
'tInfoDaySin' : ts_timeInfoDayI_4forecast[time_index_hourly],
'tInfoDayCos' : ts_timeInfoDayII_4forecast[time_index_hourly],
'tInfoYearSin' : ts_timeInfoYearI_4forecast[time_index_hourly],
'tInfoYearCos' : ts_timeInfoYearII_4forecast[time_index_hourly],
'PricesCH': ts_DA_Price[time_index_hourly]})
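The example above uses project-specific time series that are not defined in this document. A minimal, self-contained sketch of the same pattern (all names and values are made up) would be:
import pandas as pd
df_small = pd.DataFrame({'LoadCH': [100, 110, 120], # each key becomes a column name
                         'PricesCH': [45.2, 47.9, 50.1]}, # each list becomes the values of that column
                        index=pd.date_range('2021-01-01', periods=3, freq='h', tz='UTC')) # optional row-index
df_small # check if it worked?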
Transformations
Convert a Series into an array
Series.to_numpy() # this also works on a column of a dataframe =)
Transform a Pandas Series into a DataFrame
import pandas as pd
s = pd.Series(["a", "b", "c"],
name="vals")
s.to_frame()
Selection & Filtering Rows and Columns
When you work with dataframes in pandas, one of the most important things you will need to master is how to select some columns, as well as how to print out subsets // particular columns of the dataframe.
To get started and become acquainted with common techniques, I recommend you to watch this beginner tutorial on filtering & selecting columns.
Load the next cell to be able to run the examples that follow:
import pandas as pd # load package needed
file_path = './data/filter-and-selection/sample_orders.csv' # type in the path-location of your data
dd = pd.read_csv(file_path) # load the data
dd.head() # check if it worked? --> yes! --> should print the first 5 rows of your ddset
dd.info() # check also all data-types
Select only 1 single column of your Dataframe
In this example, I will select the column order_id from my ddset dd:
print(
dd['order_id'] # take the dataframe 'dd' and print me only the column called 'order_id' --> this is a subset
)
2nd Possibility: Selection of columns with dot-notation
One disadvantage of this method is that it won’t work IF the column names contain whitespace!
test = dd.order_id # CAUTION: does not work if there are whitespaces in the column name!!
print(test)
Select multiple Columns from your Dataframe
Similarly to R, we need to pass a list of column names (note the double brackets) into the column selector.
print(
dd[['order_id', 'order_total']] # take the dataframe 'dd' and print me only the columns called 'order_id' && 'order_total'
)
Select AND filter for a particular row, with one condition
Let’s say we want to select the row // observation with "order_id" == 1004.
print(
dd[
dd['order_id'] == 1004 # note that this INNER bracket runs a comparison that returns
# a Boolean list of "True" or "False" for all rows --> example-video: from 2:36-3:34 --> https://www.youtube.com/watch?v=htyWDxKVttE
] # This outer selector will tell Python: "Select" only the row where the column 'order_id' == 1004
)
# short-version:
print(dd[dd['order_id'] == 1004])
Select AND filter for rows with more than one condition
Now, we want to select the rows // observations where "order_total" >= 50.00 AND "order_date" == 12/1/17 are fulfilled. Basically, it is the same as the selection & filtering with one condition; you just need to wrap each condition inside the INNER brackets in an additional pair of (...) brackets.
print(
dd[
(dd['order_total'] >= 50) & (dd['order_date'] == '12/1/17') # in contrast to **one condition**, we just wrap up a (...) for each condition
] # This outer selector tells Python: "Select" only the rows where BOTH conditions are fulfilled
)
Alternative with “OR”: Instead of using AND, you can also - for example - use the OR-condition.
print(dd[(dd['order_total'] >= 50) | (dd['order_date'] == '12/1/17')])
Select only rows with Missing-Values
null_data = df[df.isnull().any(axis = 1)] # this will only select the rows that contain at least one missing-value
null_data # check, if it worked?
Selection with “accessor operators” iloc & loc
The package Pandas uses so-called accessor operators to filter DFs. There are 2 types of accessor operators:
- iloc (= based on the position (= number) of the row's and column's respective index). Note: the order of the inputs within iloc is: 1) 'rows', and then 2) 'columns'.
- loc (= based on the index-names of the DF's rows and columns). Note: the order of the inputs within loc is: 1) 'rows', and then 2) 'columns'.
Selection of the entry situated in the 1st row and 5th column, only with iloc
To be able to use iloc, the key is to know the position of the row- & column-labels (= the index of columns & rows).
- Note: iloc & loc are often used to print a particular entry WITHIN a dataframe. However, loc and iloc are also able to print entire rows // observations and not just one specific value within a row, as we will see.
test = dd.iloc[0,4] # order of the inputs == 1) 'rows' (--> "0" == 1st row), then 2) 'columns'
# (--> "4" == 5th column)
print(test) # check if it worked? --> should print the value '65'
Selection of the entry situated in the 1st row and 5th column, only with loc
The key to using loc is that you know the names of the columns & rows.
- Important note:
- The default names of the rows within a DF are simply the numbers 0,1,2...,10,11,12,....
- Similarly, the index (position) of each row is ALSO the numbers 0,1,2...,10,11,12,...!
- Hence, it is usually the case that loc and iloc take THE SAME first input! 🤓
test = dd.loc[0,'order_total'] # order of the inputs == 1) 'rows' (--> "0" == NAME of the 1st row),
# then 2) 'columns' (--> "order_total" == NAME of the desired 5th column)
print(test) # check if it worked? --> should print the value '65'
Apply a Threshold-Filter using loc on each row
This can be useful, when you want to replace some values in certain columns → see ‘Outliers’-chapter
# Only display all rows, where the 'pressure'-column is > than the threshold of 1051 bar
df_weather.loc[df_weather.pressure > 1051, 'pressure']
Selection of an ENTIRE row // record // observation from a ddset with iloc
#####
## Option 1: selection of only 1 row
test_row1 = dd.iloc[0,] # important: 1) rows, then 2) columns --> we want the entire 1st row, which includes ALL columns
# note: the comma AFTER the "0" indicates that we want to select ALL columns!
#####
## Option 2: selection of > 1 row --> notice the additional wrap with [...] WITHIN iloc[]!
test_multiRow = dd.iloc[[0,1,2,3,5,8],] # '[0,1,2,3,5,8]' will select the '1st, 2nd, 3rd, 4th, 6th and 9th' row,
# while also selecting ALL columns simultaneously
#####
## check if it worked? --> yes!
print(test_row1) # should print only 1 row BUT with ALL the columns --> the output looks odd, because the columns are displayed as rows
print(test_multiRow)
Selection of an ENTIRE row // record // observation from a ddset with loc
Tip: Out of all possibilities, I highly recommend you to use "Option 3" in the example below!!
#####
## Option 1: selection of only 1 row
test_row1 = dd.loc[0,] # the comma AFTER the "0" indicates that we want to select ALL columns!
#####
## Option 2: selection of > 1 row --> notice the additional wrap [...] WITHIN loc[]!
test_multiRow = dd.loc[[0,1,2,3,5,8],] # because - by default - the 'row-labels' (= the names of the row-index)
# are the same as the positions, the code for 'loc' is the same as for 'iloc' here...
#####
## Option 3: best & nicest solution (in my opinion!)
rows = list(range(0,16)) # will create a list that goes from 0 to 15 --> this will be for the row-labels
columns = ['order_id', 'order_date', 'order_total'] # this will be for the column-labels
# Pro-tip: columns = list(data.columns)
df = dd.loc[rows, columns]
#####
## check if it worked? --> yes!
print(test_row1) # should print only 1 row BUT with ALL the columns --> the output looks odd, because the columns are displayed as rows
print(test_multiRow)
print(df)
Replace values within a Column with some new Values
df.loc[df.ungewichtet > 1, 'ungewichtet'] = 1 # here, I replace all values of the column "ungewichtet" that are > 1 with the value "1"!
Different Slicing when using iloc VS. loc
Remember: When using iloc, the range 0:5 will select the entries 0,...,4, that is, it indexes EXCLUSIVELY. loc, on the other hand, indexes INCLUSIVELY, so the same range 0:5 will select the entries 0,...,5!!!
Hence, if you want the SAME output with loc and iloc, you simply need to slightly adjust the slice-range.
Example:
## Option 1: with 'iloc'
iloc_test = dd.iloc[0:5,0] # row-position == 0:5 --> first 5 rows; EXCLUDES '5' from the range "0,1,2,3,4,5"
# --> hence the slice 0:5 results in --> "0,1,2,3,4"
# column-position == 0 --> 1st column --> remember: indexing in
# Python starts at '0'!
# IMPORTANT: 'iloc' uses the 'Python stdlib' indexing scheme, where the first element of the range is
# included and the last one excluded. So 0:5 will select entries 0,...,4 (= these are the first *5*
# entries!!).
## Option 2: to get the SAME output with 'loc', we need a slightly DIFFERENT range!
loc_test = dd.loc[0:4,'order_id'] # row-position == 0:4 --> first 5 rows; INCLUDES '4'
# --> hence the slice 0:4 results in --> "0,1,2,3,4"
## check if the outputs are the same, even though the slices have slightly different ranges? --> yes!
print(iloc_test)
print(loc_test)
Sorting & Filtering
How to sort a Data-Frame Column (here: a Pearson correlation matrix) from ‘most important’ to ‘least important’?
### Find the correlations' ranking for the day-ahead electricity price and the rest of the features:
# Step 1: Create a Pearson correlation matrix out of your dataframe:
correlations = df.corr(method='pearson') # the variable 'correlations' is a dataframe!
# Step 2: use 'sort_values' to sort the column from "most important" (highest value) to "least important":
print(correlations['pricesCH'].sort_values(ascending=False).to_string())
Filter the WHOLE Dataframe on a condition (here: correlations > 0.75), select ONLY the observations that fulfill the condition (= ‘stack’ them) and list them from your Data Frame
highly_correlated = correlations[correlations > 0.75]
print(highly_correlated[highly_correlated < 1.0].stack().to_string())
Vanilla Exploration of your Dataset
After having told Python to read in your dataset, the first thing you will want to do is to get (very) familiar with your dataset. This is key, otherwise you will not be able to perform a good data analysis!
Find out the number of observations // rows in your dataset
len(nba) # to get the number of observations // rows
# note: 'nba' is the name of the ddset
Find out the number of rows & columns within your dataset
nba.shape # to get number of rows AND columns
# note: 'nba' is the name of the ddset
View the whole Dataset
nba # just type in the name of the variable in which your dataframe is stored in --> 'nba' is the name of your dataframe here
View the first 5 rows of your dataset
nba.head() # head() is often used to check whether your dataset really contains data you care about
# here we check: does the ddset really contains data about the NBA?
Note: If you have a lot of columns, Python will not display them all when using head(). However, you can change this setting via:
pd.set_option("display.max.columns", None) # this will tell Python: "show me ALL the columns!"
nba.head() # execute 'head()' again to check if the setting changed correctly? --> yes!
You can also view the last 5 rows of your dataset by using tail()
nba.tail() # View last 5 rows
# Viewing very specific rows is also possible:
nba.tail(3) # Here, we view the last 3 rows
Make a Summary out of all the variables in your dataset
Note that it is common to also load the library numpy when doing summary statistics.
import numpy as np # load numpy for summary-statistics
nba.describe().round(2) # results will be rounded onto 2 digits
Note: If you have columns of the type object, you will need a slightly different version of the describe()-function to display some summary-statistics for such columns.
nba.describe(include=object) # if you have some weird columns being of the type 'object'
Find the unique values from a column
This can be useful when you want to find all unique categories within a column. You can then put all those categories into a new variable in order to loop through them and apply some function to each category, as shown in the sketch below the code.
gapminder['continent'].unique()
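As mentioned above, you can store the unique categories and loop through them to apply some function per category - a small sketch, assuming the gapminder dataframe also has a numeric 'lifeExp' column:
continents = gapminder['continent'].unique() # store all unique categories in a variable
for continent in continents: # loop through each category
    subset = gapminder[gapminder['continent'] == continent] # filter the rows belonging to this category
    print(continent, subset['lifeExp'].mean()) # apply some function per category, e.g. the mean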
Calculating the mean or median of a particular column
For the mean, you will need the following code.
mean_points = reviews.points.mean() # calculate the mean of the column 'points' within the ddset 'reviews'
For the median:
median_points = reviews.points.median() # calculate the median of the column 'points' within the ddset 'reviews'
Transform a column based on conditions with numpy
import numpy as np # In order to be able to perform a transformation, we will need the `numpy`-package
# set-up:
x = np.arange(10) # this is our column 'x' --> Output of this: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
condlist = [x<3, x>5] # set of conditions that need to be fulfilled --> these are the value-ranges on which the
# functions from 'choicelist' (defined next) are applied --> all numbers below 3 AND all numbers above 5
choicelist = [x, x**2] # the functions to apply: keep x where x < 3, apply x^2 where x > 5
# output:
np.select(condlist, choicelist, default=np.nan) # keep x where x < 3, apply x^2 where x > 5, NaN otherwise
Data-Cleaning
Find out the Data Type of each column within your dataset & the number of non-missing values
With the following simple code, we can find out whether the columns are of the type integer, string, boolean etc…
nba.info() # this will output all the types of each column in your dataset & how many NON-missings you have per column
# note: Pay attention to any columns that are of the type 'object'! --> see the note below...
Note: It can happen that columns are of the type object. In practice, it often means that all of the values in an object-column are strings. If you encounter any object-columns, it is strongly recommended that you convert them into a more appropriate data type, otherwise some functions won’t work on these object-columns…
Find the type of any object in Python:
x = 5
print(type(x))
Drop specific Columns
Note that axis = 1 denotes that columns will be dropped, while axis = 0 (the default) denotes that rows will be dropped.
df_energy = df_energy.drop(['tInfoDaySin', 'tInfoDayCos', 'tInfoYearSin', 'tInfoYearCos'], axis=1)
Re-Name the Columns of a DataFrame
df2 = df.rename({'oldName1': 'newName1', 'oldName2': 'newName2'}, axis='columns') # not all columns have to be renamed, only those with a new name
Set & Reset the Index / Row-Label
df_energy = df_energy.set_index('Date-Time') # to set the index
df_energy = df_energy.reset_index() # to reset the index --> df.reset_index(drop= True) will drop the index, which would
# otherwise become a new column instead of just dropping it!
Assign a unique ID
- For a single Column:
df['id'] = df.groupby(['date']).ngroup() # of course, you could simply use 1 column for the assignment of a unique ID.
- For multiple Columns:
df['id'] = df.groupby(['LastName','FirstName']).ngroup() # here, we use the columns 'LastName' & 'FirstName' together to create
# a unique ID.
# Source: https://stackoverflow.com/questions/45685254/q-pandas-how-to-efficiently-assign-unique-id-to-individuals-with-multiple-ent
Apply a function on a Column
This can be useful, when you need to do some transformation to a column of a df:
# Step 1: Define any function, that you will need to be applied, in order to transform
def my_function():
    ... # the various statements go here
    ... # the various statements go here
    return some_variable
# Step 2: Apply this newly specified function on your column of your data-frame, that you wish to transform
df['new_column'] = df['column_1'].apply(my_function)
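A concrete little example of this pattern (the column names and the conversion are made up for illustration): convert a temperature column from Celsius to Fahrenheit.
# Step 1: define the transformation
def celsius_to_fahrenheit(temp_c):
    return temp_c * 9 / 5 + 32 # the transformation applied to every single value of the column
# Step 2: apply it --> assumes that df has a numeric column called 'temp_C'
df['temp_F'] = df['temp_C'].apply(celsius_to_fahrenheit)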
Dealing with Missings
Number of Missing Values across the WHOLE Data-Frame
print('There are {} missing values or NaNs in df_energy.'
.format(df_energy.isnull().values.sum()))
Count the TOTAL number of Missings in each column
df_energy.isnull().sum(axis=0) # outputs the number of NaNs for each column
Count the non-missings (and non-zero values) in each column
Note: the “Code-Trick” below only works if your columns don’t contain values that are ‘0’! This is because the number ‘0’ - as a boolean - evaluates to False, and hence we would get the “wrong” number of non-missing values.
Thus, you should only use the code below IF you have verified that NONE of your columns contains “0”-values!
# Display the number of non-missing values in each column
print('Non-zero values in each column:\n', df_energy.astype(bool).sum(axis=0), sep='\n')
Since the above cell only gives the “correct” number of non-missing values if you have no 0 in your columns, here is code to count how many 0s you have in each column:
(df_energy == 0).astype(int).sum(axis=0) # count the numbers of '0s' in each column [axis = 0, for columns...]
Display each row with a Missing-Value
# Display the rows with null // missing values:
df_energy[df_energy.isnull().any(axis=1)].tail()
Replace all Missings in a Column with 0s
df['ColumnWithMissings'] = df_tot['ColumnWithMissings'].fillna(0) # replaces all missing-values within the column with 0s.
Replace all Values in a Column with Missings
Assume that we have a time-series that has a row-index with time-stamps!
### Step 1: define the range in which you want to replace values
start = '2020-01-01' # 01 January 2020 --> from here on we want to replace the values of the time series with missings
stop = '2020-08-01' # 01 August 2020 --> up to this date the missings should be inserted
### Step 2: replace the values with missings via the ".loc[row-indexer, column-indexer]"
df.loc[start:stop, 'y_hat'] = None # this will replace all the values within the 'y_hat'-column - in the range
# 01.01.2020-01.08.2020 - with missing values (instead of the "normal" values)
Missing Imputation
This list will grow over time, as I stumble upon more useful code snippets:
Missing Interpolation
For a Time Series
# Fill null values using interpolation:
df_energy.interpolate(method='linear', limit_direction='forward', inplace=True, axis=0) # since we have a time series, a linear interpolation between neighbouring values is a reasonable way to fill the gaps
Data-Cleaning: Duplicates
Number of Duplicates across the WHOLE Data-Frame
temp_energy = df_energy.duplicated(keep='first').sum()
print('There are {} duplicate rows in df_energy based on all columns.'
.format(temp_energy))
Drop Duplicate values
# Option 1: with reset_index & a new set_index
df_weather_2 = df_weather.reset_index().drop_duplicates(
    subset=['time', 'city_name'], # drop a row if it is a duplicate with respect to the 'time' &
    # 'city_name'-columns (caution: only these two columns are considered for this duplicate-check!).
    keep='last').set_index('time') # if there are duplicates, keep only the last of the duplicated rows.
# Option 2: drop the "old" index (from BEFORE the merge) and set a new index
df_unique_dates = df_ferien.drop_duplicates(
    subset='datum', # only consider the column "datum" [= the column that has duplicates] when dropping the duplicates
    keep='first').reset_index(drop=True) # reset the index, otherwise you get weird indices (above 10'000 for some rows)
    # 'drop = True' means that we do not keep the "old" index as a separate
    # 'column'
df_unique_dates
Data-Cleaning: Outliers
Draw a Boxplot for a specific column
import seaborn as sns
sns.boxplot(x=df_weather['pressure'])
plt.show()
- Key-Question: Are there Outliers? → Yes / No
- "Trick" to answer the question: If you deal with temperature, for example, google "highest temperature on earth" and look it up on Wikipedia to determine whether your value is an outlier.
- Example: A pressure of approximately 100,000 hPa (= 10 MPa), which is clearly visible in the above figure, corresponds to a quantity greater than the atmospheric pressure of Venus. To be sure, we will set as NaN every value in the pressure-column which is higher than 1051 hPa, which is just above the highest air pressure ever recorded in the Iberian peninsula. While outliers on the low side are not visible in the boxplot above, it is a good idea to also replace the values which are lower than 931 hPa, i.e. the lowest air pressure ever recorded in the Iberian peninsula.
Step 2: If the answer to the above question is 'yes', then set the values which are above (or below) a certain implausible threshold to NaNs.
# Replace outliers in the `Set_Name_of_Column_with_the_Outlier_hier`-column with `NaN`s
df_weather.loc[df_weather.pressure > 1051, 'pressure'] = np.nan
df_weather.loc[df_weather.pressure < 931, 'pressure'] = np.nan
Merging & Splitting
- Important note before you start: If you want to merge 2 dataframes, where one is smaller than the other, then you CANNOT make the bigger dataset smaller with merge(). I spent a lot of time on this, without success. However, there is another solution: you just need to drop the duplicates of the bigger dataframe (see the chapter Duplicates) in order to make the dataset smaller! =)
Splitting a Dataset into smaller parts, by using categories to split it up
# Split the df_weather into 5 dataframes (one for each of the 5 cities):
df_1, df_2, df_3, df_4, df_5 = [x for _, x in df_weather.groupby('city_name')]
dfs = [df_1, df_2, df_3, df_4, df_5]
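Note: unpacking into df_1, ..., df_5 only works if you know in advance that there are exactly 5 cities. A more flexible variant (same df_weather assumption) stores the groups in a dictionary, keyed by city name:
dfs_by_city = {city: group for city, group in df_weather.groupby('city_name')} # one DataFrame per city
dfs_by_city.keys() # check: should list the names of all cities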
Merge 2 separate Data Sets together via their row-index (= row-label)
Note: You can only use this type of merging IF both DFs have the same row-index!
- Goal: With this type of merging, you will be able to add new columns from a different DF (which is very common in practice).
# Let's merge all the y- & X-Variables from the training-set together:
test = pd.merge(
    y_train, # the training set of the y-variable
    x_train, # the training set of all X-variables
    how="outer",
    left_on=y_train.index, # merging via the index // row-label of the DataFrame of the y-variable
    right_on=x_train.index, # merging via the index // row-label of the DataFrame of the X-variables
).set_index('key_0') # optional: since we have a time series here, whose row-index gets the column name 'key_0'
# during the merge, the column 'key_0' is set as the new row-index of the merged dataframe.
test # check if it worked
Merge different datasets simultaneously together
Starting Situation: Assume that - initially - we have 2 datasets: one for the weather and one for the energy prices. Furthermore, those two datasets are time series. Hence, they have the same time-index (= row-label), formatted in UTC.
Step 1: Split up the weather-dataset, sorted by cities
# Split the df_weather into 5 dataframes (one for each of the 5 cities):
df_1, df_2, df_3, df_4, df_5 = [x for _, x in df_weather.groupby('city_name')]
dfs = [df_1, df_2, df_3, df_4, df_5]
Step 2:
Merge the 5 sub-datasets with the Energy-Price Dataset
# Step 1: save a copy, in case you do a wrong merging!
df_final = df_energy
# Step 2: make a for-loop, to merge all the 6 datasets simultaneously
for df in dfs: # here, we loop through every city-group of our list of data frames (see step 1)
    city = df['city_name'].unique() # we store the city name of the current group (returned as an array)
    city_str = str(city).replace("'", "").replace('[', '').replace(']', '').replace(' ', '') # we perform some
    # string-manipulation to eliminate all the characters that are not necessary
    df = df.add_suffix('_{}'.format(city_str)) # we re-name the columns, by adding the name of the city to each column.
    df_final = df_final.merge(df, # this is the merging-part!
                              on=['time'], # we want to merge via the index // row-label of both datasets --> since
                              # they are both in UTC-time, this will work!
                              how='outer') # 'outer' means: we want the 'union' --> see this youtube-video for a
                              # good explanation: https://www.youtube.com/watch?v=h4hOPGo4UVU
    df_final = df_final.drop('city_name_{}'.format(city_str), axis=1) # let's drop the columns that we don't need anymore
# Step 3: "final results"-check
df_final.columns # show the merging-results, by displaying all the column-names --> DONE! =)
Step 3: Make some final-checks
- Key-Question: Did the merging really work? → Yes / No
- "Trick" to answer the question: Look at Missings AND Duplicates.
# Display the number of NaNs and duplicates in the final dataframe
print('There are {} missing values or NaNs in df_final.'
.format(df_final.isnull().values.sum()))
temp_final = df_final.duplicated(keep='first').sum()
print('\nThere are {} duplicate rows in df_energy based on all columns.'
.format(temp_final))
If you have no missings AND no duplicates, then merging should have worked and the answer to the above question should be 'yes'.
If the answer to the above question is 'no', then you need to go back to step 2 and try to figure out what you did wrong when merging the datasets together.
Scaling Variables
When applying machine learning, one precondition for using those fancy models is to scale your variables first.
The main methods for scaling can be seen in this Stack-Exchange Post.
Normalization
If you normalize variables, it means that the RANGE of possible values (= the domain) is mapped to the interval between 0 and 1.
Scale all X-Variables and your Y-variable
from sklearn.preprocessing import MinMaxScaler # the class for normalization is called 'MinMaxScaler'
# step 1: initialize the class
scaler_X = MinMaxScaler(feature_range=(0, 1))
scaler_y = MinMaxScaler(feature_range=(0, 1))
# step 2: fit the scaler on the X- & y-variables (using only the training set!) (note: fit expects a
# 2-dimensional input, e.g. a DataFrame or a reshaped numpy-array, but not a 1-dimensional Series!)
scaler_X.fit(X_compl_df[:train_end_idx]) # alternatively: scaler_X.fit(X_train.to_numpy().reshape(-1, 1))
scaler_y.fit(y_compl_df[:train_end_idx]) # alternatively: scaler_y.fit(y_train.to_numpy().reshape(-1, 1))
# step 3: 'fit' only learns the min & max of each column - the actual scaling happens with 'transform'
X_norm = scaler_X.transform(X_compl_df)
y_norm = scaler_y.transform(y_compl_df)
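After modelling, you will usually want to bring the (scaled) predictions back to the original scale. A minimal sketch with the scalers from above (y_pred_norm is a placeholder for your model's scaled predictions):
y_pred = scaler_y.inverse_transform(y_pred_norm) # maps the values from the [0, 1]-range back to the original scale
# note: inverse_transform expects a 2-dimensional input, e.g. y_pred_norm.reshape(-1, 1) if it is a 1-d array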
Advanced Exploration of your data
How to create & plot a Pearson Correlation Matrix out of a Dataframe?
# Step 1: construct a 'Pearson Correlation Matrix' as a Dataframe:
correlations = df.corr(method='pearson')
# Step 2: Load Libraries & plot Pearson correlation matrix:
import matplotlib.pyplot as plt
import seaborn as sns
fig = plt.figure(figsize=(24, 24)) # set the size of the figure
sns.heatmap(correlations, annot=True, fmt='.2f') # plot the correlation matrix as an annotated heatmap
plt.title('Pearson Correlation Matrix')
plt.show()
Time-Series Data
Indexing for Time-Series
If you are dealing with any type of “time”-variables, you will often need to use so-called Date-Time Objects
in Python when doing the data cleaning.
Creating an index for Time-Series
start = '2020-01-01'
stop = '2021-01-01' # CAUTION: the index will end on 31.12.2020, 23:00, because closed='left' excludes the end date
ts_index = pd.date_range(start, stop, freq = 'h', closed = 'left', tz = 'UTC') # note: in newer pandas versions, use inclusive='left' instead of closed='left'
ts_index
(Optional) extension: Index-Slicing
The disadvantage of the above index is that it always starts at 0:00. The problem is that you won’t always need an index which starts exactly at 0:00, but - one day, maybe - you will need one that starts at 23:00.
Solution: We can use the slicing technique to select the specific part of the time-index that we want, for example: DF2 = DF1.loc[:'2019-05-26 13:00:00+00:00']. In case of a (pandas) Series, we can also use the slicing technique, for instance like this: ts_index[4:].
Make a copy of the index as a Column of a DataFrame
p_train = p_train.reset_index()
p_train['date'] = p_train['Date-Time'] # copy the (former) index - now a regular column - into a new column called 'date'
p_train = p_train.set_index('Date-Time') # note: re-assign the result, otherwise set_index() does not modify p_train
Converting a Time-Column into a Datetime-Column and set it as an Index (= row label)
If you work with time series, it will oftentimes be the case that you need to convert your date-column - which oftentimes is a string-column - into an actual date-type column!
df_energy['Date-Time'] = pd.to_datetime(df_energy['Date-Time'], utc=True, infer_datetime_format=True)
df_energy = df_energy.set_index('Date-Time') # set the 'Date-Time'-Column as the index // row-label of our data-frame
If you want to set the format by yourself:
#convert Delivery day to a date time column
epex_df['Delivery day'] = pd.to_datetime(epex_df['Delivery day'], format = '%Y%m%d')
Convert a Date-Time-Column into a String-Format
df['Date-Time'] = df['Date-Time'].apply(lambda x: x.strftime('%Y-%m-%d %H:%M:%S%z'))
Why would we do this? Sometimes there will be the problem that we want to merge different time series together, but they have different Date-Time-formats to begin with!
Hence, BEFORE we apply the universal UTC Date-Time-format, we will need to bring every separate data-frame into the same Date-Time-format.
Since Python is really good at handling transformations on the str-format, we can use this to our advantage and:
1) Bring the `Date-Time`-**String** into the form we want it to have in all our _separate_ `Date-Time`-Columns across those different datasets.
2) And then, we will be able to convert them back into the `UTC`-Format! =)
In order to see, how to manipulate strings, I strongly recommend you to look at the sub-chapter String-Manipulation.
Creating a range of dates in Python
Specify start and end, with the default daily frequency.
import pandas as pd
pd.date_range(start='1/1/2018', end='1/08/2018')
- Alternatively, we could also do it like this:
# Step 1: We need a starting & end point for the range of dates
from datetime import datetime, timedelta # needed for "today" and for adding days
the_start_date = '2019-01-01'
the_end_date = str(datetime.today().date() + timedelta(days = 2)) # this is simply "today" + 2 days into the future
# Step 2: Let's create the range of dates based on the start- & end-point
date_range_of_interest = pd.date_range(the_start_date, # starting date-point (= is fixed)
                                       the_end_date, # end date-point (= can change dynamically // updates with new data from entsoe)
                                       freq = 'h', # we want to have 'hours'
                                       closed = 'left', # the input 'closed' lets you choose whether you want to include the start and end dates.
                                       # --> 'left' will exclude the end date, which is good, since this date is in the "future" (= today + 2)!
                                       tz = 'Europe/Berlin') # this is the time-zone
Date range with 15-minute intervals
date_range_15Min = pd.date_range(the_start_date,
the_end_date,
freq = '15Min', # the only thing that changes --> will be used for Germany, since DE has quarter-hourly observations
closed = 'left',
tz = 'Europe/Berlin')
Plotting a Time Series
Step 1: Transform the date-column into a date-time-column.
from datetime import datetime as dt # this is the package we need to convert a column into a 'date-time'
# Step 1: Define a function that is reading the date-column correctly
def parse_date(date):
    date = str(date) # make sure the input is a string
    if date == "": # this condition will never be true here, since the column 'Date-Time' NEVER has an empty string;
        ### raise exception
        return None
    else:
        return pd.to_datetime(date, format='%Y-%m-%d %H:%M:%S', yearfirst = True, utc = True)
# Step 2: apply the above function on the Column 'Date-Time' to transform the column into a 'date-time'-type
raw_dd["Date-Time"] = raw_dd["Date-Time"].apply(parse_date)
Step 2: Set the 'time'-column as the index (= row-label), otherwise the plotting won’t work.
# set the date-column as our row-label:
raw_dd = raw_dd.set_index("Date-Time")
Step 3: Visualize the Time-Series
from sktime.utils.plotting import plot_series # use sktime and the 'plot_series'-module for this task
# Define our y-variable (= DA-Prices in CH):
y = raw_dd["pricesCH"]
plot_series(y) # visual check if it worked? --> yes!
Bonus: Alternatively, you can also use the following function to plot a time-series:
# Define a function to plot different types of time-series:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
def plot_series(df=None, column=None, series=pd.Series(dtype=float), # the 'series'-input is useful if we choose to plot the series in a monthly or weekly frequency, instead of hourly...
                label=None, ylabel=None, title=None, start=0, end=None):
    """
    Plots a certain time-series which has either been loaded in a dataframe
    and which constitutes one of its columns, or a custom pandas series
    created by the user. The user can define either the 'df' and the 'column'
    or the 'series', and additionally can also define the 'label', the
    'ylabel', the 'title', the 'start' and the 'end' of the plot.
    """
    sns.set()
    fig, ax = plt.subplots(figsize=(30, 12))
    ax.set_xlabel('Time', fontsize=16)
    if column:
        ax.plot(df[column][start:end], label=label)
        ax.set_ylabel(ylabel, fontsize=16)
    if series.any():
        ax.plot(series, label=label)
        ax.set_ylabel(ylabel, fontsize=16)
    if label:
        ax.legend(fontsize=16)
    if title:
        ax.set_title(title, fontsize=24)
    ax.grid(True)
    return ax
Plot Auto-Correlation & Partial-Auto-Correlation Functions
The partial autocorrelation plot of the electricity price time series shows that the direct relationship between an observation at a given hour (t) is strongest with the observations at the t-1, t-2, t-24 and t-25 time-steps and diminishes afterwards. Thus, we are going to use the 25 previous values of each time series, which will constitute a feature for our models.
# Step 1: load library
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf # to plot
# Step 2: Plot autocorrelation and partial autocorrelation plots
fig, (ax1, ax2) = plt.subplots(nrows=2, figsize=(10, 6))
plot_acf(df_energy['pricesCH'], lags=50, ax=ax1)
plot_pacf(df_energy['pricesCH'], lags=50, ax=ax2)
plt.tight_layout()
plt.show()
Plot the Cross-Correlation (Yt, Xt - k)
It would quite likely be more beneficial if we only chose to use specific past values (observations at certain time-lags) of a given feature, based on the cross-correlation between the electricity price and each one of the features in the dataset. For example, below we can see the cross-correlation between the electricity price and the renewable generation in Switzerland. We see that there are many time-lags with a correlation which is close to zero and could be omitted.
# Step 1: load the function that computes the cross-correlation
from statsmodels.tsa.stattools import ccf
# Step 2: plot the cross-correlation between the day-ahead price "today" and each of the first 50 lags of the column 'RenGenCH'
cross_corr = ccf(df_energy['RenGenCH'], df_energy['pricesCH'])
plt.plot(cross_corr[0:50])
plt.show()
Feature Engineering for Time-Series
Math with Date-Time Objects: Addition & Subtraction of Time
Why should you care? Oftentimes, you will have to define points in time in the future or in the past. Therefore, it is essential that you learn how to do math with 'date-time'-objects.
Find the point in Time “now” // “today”
import datetime # since we work with "time", we need the package "datetime"
jetzt = datetime.datetime.now() # current time
print(jetzt) # check: should output the current time
Math with Dates: Addition while using “now” as our Reference-Point // Starting-Point
import datetime # since we work with "time", we need the package "datetime"
jetzt = datetime.datetime.now() # current time
added_time = datetime.timedelta(hours= 1, minutes= 1) # number of hours & minutes that you want to add
ende = jetzt + added_time
print(ende) # check: should output "current time + 1h 1min"
Math with Dates: Subtraction while using “now” as our Reference-Point // Starting-Point
import datetime # since we work with "time", we need the package "datetime"
fixe_zeit = datetime.datetime(2021, 7, 1, 8, 12) # datetime(year, month, day, hour, minute, second, microsecond),
# where the first 3 arguments (year, month, day) are MANDATORY! --> here we are interested in the last 2 inputs,
# namely the "8" and the "12" --> these reflect my fixed working time of 8h 12min at SBB
added_time = datetime.timedelta(hours= 1, minutes= 1) # number of hours & minutes that you have already worked
wie_lange_noch = fixe_zeit - added_time
print(wie_lange_noch.strftime("%H%M")) # check: should output 7h 11min --> yes!
Day of the Week given a Date
If you want to know - for example - which day of the week the 21.03.2020 was, simply use this code:
import datetime
fixe_zeit = datetime.datetime(2020, 3, 21) # datetime(year, month, day, hour, minute, second, microsecond)
fixe_zeit.weekday()
# --- Gemäss Documentation
# 0 = Monday
# 1 = Tuesday
# 2 = Wednesday
# 3 = Thursday
# 4 = Friday
# 5 = Saturday
# 6 = Sunday
addierte_anzahl_tage = datetime.timedelta(days = 28) # 28 days = exactly 4 weeks
next_day = fixe_zeit + addierte_anzahl_tage # adding a multiple of 7 days always lands on the same weekday again
next_next_day = next_day + addierte_anzahl_tage
next_next_next_day = next_next_day + addierte_anzahl_tage
next_next_next_next_day = next_next_next_day + addierte_anzahl_tage
next_next_next_next_day # check: 4 * 28 days later, it is still the same weekday (a Saturday)
We see that the 21.03.2020
was a Saturday. I checked it with the calendar: it’s correct =)
Trends
Weekly Rolling Mean // Gleitender Durchschnitt
### Step 1): From your hourly(!!) time series, compute a 1-week rolling mean:
rolling = p_train['pricesCH'].rolling(24*7, center=True).mean() # this is the rolling (weekly) mean
### Step 2): Re-use the plot_series() helper function that was defined above in the 'Plotting a Time Series' section (Bonus part).
### Step 3): Plot the rolling mean
ax = plot_series(p_train, 'pricesCH', label='Hourly', ylabel='Actual DA-Price (€/MWh)',
title='Day-Ahead Hourly Electricity Price and Weekly Rolling Mean')
ax.plot(rolling, linestyle='-', linewidth=2, label='Weekly rolling mean') # add the rolling mean to the hourly time-series
plt.show()
Seasonality
Weekly Seasonality
### Step 1: Re-use the plot_series() helper function that was defined above in the 'Plotting a Time Series' section (Bonus part).
### Step 2: Plot the (hourly) Time Series over 2 weeks:
ax = plot_series(p_train, 'pricesCH', label='Hourly', ylabel='Actual Price (€/MWh)',
start= 1, end=1 + 24 * 14, # zoom in on 2 weeks
title='Hourly Electricity Price (Zoomed - 2 Weeks)')
plt.show()
Volatility
### Step 1): calculate the volatility (= 'change'-variable)
change = (p_train['pricesCH'].div(p_train['pricesCH'].shift(1)) - 1)*100 # divide the 'pricesCH'-column by the
# lagged version of itself, subtract 1, then multiply by 100 to get the growth rate in %. Store it in a
# new variable called 'change'.
### Step 2): Plot the percentage of the hourly change in the actual electricity price
ax = plot_series(series=change, ylabel='Hourly Change (%)',
title='Percentage of the hourly change in the day-ahead electricity price')
plt.show()
Daily Volatility
Here, we “Re-Sample” our hourly Volatility into daily volatility:
# Plot the daily volatility of the day-ahead electricity price
# Step 1: Re-sample the data to a daily frequency
daily_volatility = p_train['change'].asfreq('d') # note: asfreq('d') takes one value per day (it does NOT average);
# use p_train['change'].resample('d').mean() instead if you want a true daily aggregation --> save it as a series
# Step 2: Re-use the plot_series() helper function that was defined above in the 'Plotting a Time Series' section (Bonus part).
# Step 3: Add the aggregated data to the plot
ax = plot_series(series=daily_volatility, ylabel='Daily Volatility (%)',
                 title='DAILY Volatility of the Day-Ahead Electricity Price')
# Step 4: final plot is ready
plt.show()
Dummy-Variables
Create a “normal” Dummy:
In Python, you can use the np.where-function to create a dummy based on a single condition.
This is equivalent to the ifelse()-function in R:
import numpy as np
np.where(maupayment['log_month'] == maupayment['install_month'], 'install', 'existing')
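Since maupayment is a project-specific dataset, here is a minimal self-contained sketch of the same np.where-pattern (the data is made up):
import numpy as np
import pandas as pd
df_demo = pd.DataFrame({'order_total': [20, 80, 55]})
df_demo['big_order'] = np.where(df_demo['order_total'] >= 50, 1, 0) # dummy = 1 if the condition holds, 0 otherwise
df_demo # check if it worked? --> [0, 1, 1]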
Create a Dummy by merging & flagging
This can be useful when you need to track which of the rows was part of the "left" and which was part of the "right" dataset. You literally create a dummy-variable when you do this, and it is called the _merge-column.
Use Case 1:
Create a dummy for holidays.
df2 = df.merge(df_year_2019,
               on='date',      # in this case, we merge on the "date"-column; note that 'df' has 365 rows, while 'df_year_2019' has only 270 rows
               how='outer',    # we need an 'outer'-join, so that NO rows get dropped
               indicator=True) # the "indicator"-flag adds the '_merge'-column to the result
df2 # on the outside, the merged 'df2' does NOT seem to be different from 'df',
# however, the purpose of the merge was to create a new dummy-variable that shows from which dataset each row came.
# step 2: we want some more meaningful values within the '_merge'-column --> apply an if-statement onto the (string-) column
df2.loc[df2['_merge'] == 'both', 'ferien'] = 1 # Here, we create a column 'ferien' (= "holidays"), which will be equal to 1 if
# the value in the '_merge'-column is equal to the string "both"
df2.loc[df2['_merge'] != 'both', 'ferien'] = 0 # the holiday-dummy equals 0 otherwise
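Because `df` and `df_year_2019` are not defined here either, a tiny self-contained sketch of the same merge-and-flag pattern (both DataFrames below are invented for illustration):
import pandas as pd

all_days = pd.DataFrame({'date': pd.date_range('2019-01-01', periods=5, freq='D')})
holidays = pd.DataFrame({'date': [pd.Timestamp('2019-01-01'), pd.Timestamp('2019-01-02')]})

flagged = all_days.merge(holidays, on='date', how='outer', indicator=True)
flagged['ferien'] = (flagged['_merge'] == 'both').astype(int) # 1 = holiday, 0 = no holiday
flagged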
Dummy-Variables: Create Columns for specific hours, seasons of the year etc…
In the following, I will use a column called `Date-Time` to create new columns that extract the ’time-information’ (hours, quarters etc.) we need.
Extract `hours` and put them into a separate column
raw_dd['Hour'] = raw_dd['Date-Time'].apply(lambda x: x.hour) # values from 0 [= hour 0:00-1:00] to 23 [= hour 23:00-24:00]
Extract `Seasons` (= “quarters”) and put them into a separate column
raw_dd['Quarter'] = raw_dd['Date-Time'].apply(lambda x: x.quarter) # values from 1 [= January to March] to 4 [= October to December]
Extract `Days` and put them into a separate column
raw_dd['day_of_week'] = raw_dd['Date-Time'].apply(lambda x: x.weekday()) # values from 0 [= Monday] to 6 [= Sunday]
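A slightly faster, vectorized alternative to `.apply(lambda ...)` is pandas' `.dt`-accessor. A small sketch, assuming the same `raw_dd`-DataFrame and that its 'Date-Time'-column can be parsed as datetimes:
import pandas as pd

raw_dd['Date-Time'] = pd.to_datetime(raw_dd['Date-Time']) # make sure the column has a datetime-dtype
raw_dd['Hour'] = raw_dd['Date-Time'].dt.hour              # 0 - 23
raw_dd['Quarter'] = raw_dd['Date-Time'].dt.quarter        # 1 - 4
raw_dd['day_of_week'] = raw_dd['Date-Time'].dt.dayofweek  # 0 [= Monday] - 6 [= Sunday]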
Apply If-Conditions on a DataFrame
- Here is a really good reference: https://datatofish.com/if-condition-in-pandas-dataframe/
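For a quick taste of the patterns explained in that reference, here is a minimal, self-contained sketch (the DataFrame and the thresholds are invented):
import numpy as np
import pandas as pd

scores = pd.DataFrame({'points': [35, 60, 82, 47]})

# Variant 1: np.where for a single condition
scores['passed'] = np.where(scores['points'] >= 50, 'yes', 'no')

# Variant 2: df.loc for condition-based assignments
scores.loc[scores['points'] >= 80, 'grade'] = 'A'
scores.loc[scores['points'] < 80, 'grade'] = 'B or lower'
scores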
Visualization of the Data
Plot & Save an Image
# Step 1: import the library matplotlib for visualization
import matplotlib.pyplot as plt # for the settings of the graph, such as title, window-size etc...
# Step 2: make the plot
plt.figure(figsize=(14,6))
fig, ax = plot_series(y_train[10290:10320], # only take the last 30 observations from the training set
                      y_test,
                      y_pred,
                      labels=["y_train", "y_test", "y_pred"]) # NOTE: this 'plot_series' is presumably sktime's plotting utility
                                                              # (from sktime.utils.plotting import plot_series), NOT the custom function defined earlier
plt.xticks(rotation=90) # rotate the x-axis labels by 90 degrees, so that the x-axis stays readable
# Step 3: save the plot
fig.savefig("monochrome-2-test.png") # save the graph
For-Loops to iterate over lots of variables and save them
# step 1: create a list of all the columns, you want to iterate through
cov_list = X_train.columns # I select all columns in my dataframe 'X_train'
# step 2:
for i in cov_list:
print(i) # just as a check, to see how the loop progresses
    covariate_name = i # every time I iterate over a new variable, I store its name here --> I will use it later to save each plot under a unique file name
covariate = X_train[i] # I need each variable to be of type "Series" when creating a plot
x = list(range(0, 720, 24)) # this will put a tick-mark after each day
x.append(719) # I also add '719', since the range()-function does not include this number
fig, ax = plot_series(covariate[-720:]) # this plots the last month
plt.xticks(x); # to only display some dates
plt.xticks(rotation=90); # we need to rotate the x-axis, otherwise you cannot read it
fig.savefig("visualisation_30_days-{}.png".format(covariate_name)) # save the graph for each column. The "trick" here is, that we can use
# f-strings to create a unique name for each file =)
Plot a Histogram // Distribution-Plot
# Plot the histogram of the day-ahead electricity price
ax = p_train['pricesCH'].plot.hist(bins=18, alpha=0.65)
Python Magic Commands
Show ALL Columns of a DataFrame
Why should you care?
When we try to print out large data sets to the screen in Python, not all columns are shown in the output, because there is simply not enough space to display them all. However, we can change some default settings in pandas
in order to make ALL columns visible:
import pandas as pd
pd.set_option('display.max_rows', 500)
pd.set_option("display.max.columns", None) # this will tell Python: "show me ALL the columns!"
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
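If you only need the setting for a single output instead of globally, pandas also offers a context manager. A minimal sketch, assuming some DataFrame `df` you want to inspect:
with pd.option_context('display.max_columns', None, 'display.max_rows', 100):
    print(df) # inside the 'with'-block ALL columns are shown; outside of it, the defaults apply again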
Working with HTML within Jupyter-Notebooks
How to use anchor-tags to make references within your Jupyter-Notebook?
Click here to move to the very top of this Jupyter-Notebook. This is achieved by using the `<a>`-HTML-tag, a unique ID-name & some CSS-styling on the `<a>`-tag.
<a id="gebe-hier-passenden-id-namen" style="color:black; text-decoration: none;">Hier kommt der Text, auf welchem du zeigen willst...</a>
[And this is the text which will re-direct the user that clicks on this link](#top)
Share Variables between Notebooks
This Magic Command allows you to share any variable between different Jupyter Notebooks. You need to pass the original variable to the magic command. To retrieve the variable later, you run the same command with the `-r` parameter.
myData = "The World Makes Sense"
%store myData
Now that you ‘saved’ the variable `myData` with the magic command, go and open another of your Jupyter Notebooks and type in the following:
%store -r myData # step one: run this line in the first cell
myData # step 2: run this line in the second cell
Useful “Tricks” I stumbled upon
Reshape(-1, 1)
What does `reshape(-1, 1)` do?
If you have a (NumPy-) array like this: `[1, 2, 3, 4, 5]` …
… then calling `reshape(-1, 1)` on it will create a 2-dimensional array with exactly 1 element per row, i.e. the output will be `[[1], [2], [3], [4], [5]]`. The `-1` simply tells NumPy to infer the number of rows by itself.
See also this Stack-Overflow answer.
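A tiny runnable sketch of this (this shape is, for example, what scikit-learn expects when you pass a single feature):
import numpy as np

flat = np.array([1, 2, 3, 4, 5])
column_vector = flat.reshape(-1, 1) # '-1' = "infer the number of rows yourself", '1' = exactly one column
print(column_vector.shape)          # (5, 1)
print(column_vector)                # [[1] [2] [3] [4] [5]]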
Math Tricks for Hacks
How to always round downwards?
By using the `round`-function, the result gets rounded up OR down to the nearest value, as we are normally used to:
round(47/24, ndigits = 2) # round to 2 decimal places
However, by using the `floor`-function, the result will ALWAYS be rounded downwards, as illustrated by the following example:
import math # needed for 'math.floor'
math.floor(47/24)
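For completeness, the counterpart that ALWAYS rounds upwards is `math.ceil`. A quick sketch comparing the three:
import math

print(round(47/24, ndigits=2)) # 1.96 -> normal rounding
print(math.floor(47/24))       # 1    -> always rounds DOWN
print(math.ceil(47/24))        # 2    -> always rounds UP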
Calculate the Median?
Python has no built-in `median()`-function (you would have to import `statistics` or `numpy` for that), but you can also write one yourself:
# Step 1: make a function for the case "median calculation for an ODD (= ungerade) list of numbers"
def _median_odd(xs) -> float:
return sorted(xs)[len(xs) // 2]
# Step 2: make a function for the case "median calculation for an EVEN (= gerade) list of numbers"
def _median_even(xs) -> float:
sorted_xs = sorted(xs)
hi_midpoint = len(xs) // 2
return (sorted_xs[hi_midpoint - 1] + sorted_xs[hi_midpoint]) / 2
# Step 3: Use the 2 above functions to finally be able to build the median-function
def median(v) -> float:
return _median_even(v) if len(v) % 2 == 0 else _median_odd(v)
assert median([1, 10, 2, 9, 5]) == 5 # check if the above function was written correctly (you can change the number '5' and see
# what happens, if this expression becomes incorrect)
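If you just need the value and not the exercise, the standard library already provides a median. A one-line sketch:
import statistics
statistics.median([1, 10, 2, 9, 5]) # 5 --> same result as the hand-written function above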
Fun
In this section, I will just put some important concepts & some of my “inventions” that I found useful.
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(10, 2),
                  columns=['Col1', 'Col2'])
df['X'] = pd.Series(['A', 'A', 'A', 'A', 'A',
                     'B', 'B', 'B', 'B', 'B'])
df
for i, el in enumerate(list(df.columns.values)[:-1]): # loop over 'Col1' and 'Col2' (the last column 'X' is the grouping variable)
    a = df.boxplot(el, by='X') # one boxplot per numeric column, grouped by the categories in 'X'
Working-time calculation (how much is left to work…)
In order to check how many working hours & minutes I have left during a day, I use a function called `hours_to_go()`, which calculates how much time I still have to work (for the current day):
import datetime # since we are working with "time", we need the "datetime" package
def hours_to_go(Stunden, Minuten): # input: how many hours & minutes have you already worked today?
    """Tells me how long I still have to "work" today --> simply enter the number of hours & minutes that you have
    already worked "today".
    > hours_to_go(3,5) --> 0507
    Interpretation: my fixed working time is 8h 12min. From that, I subtract the 3h 5min already worked --> 0507 means
    I still have 5h 7min left to work, which is correct!"""
    fixe_arbeitszeit = datetime.datetime(2021, 7, 1, 8, 12) # datetime(year, month, day, hour, minute, second, microsecond),
    # where the first 3 arguments (year, month, day) are MANDATORY! --> here, we only care about the last 2 inputs,
    # namely the "8" and the "12" --> they reflect my fixed working time of 8h 12min at the SBB
    gearbeitet = datetime.timedelta(hours= Stunden, minutes= Minuten) # number of hours & minutes already worked today
    wie_lange_noch = fixe_arbeitszeit - gearbeitet
    print(wie_lange_noch.strftime("%H%M"))
hours_to_go(0,5) # call the function to output how many hours are left to work
To see at what time I will finish work today, I have created the `arbeit_ende()`-function (German for "end of work"):
import datetime # since we are working with "time", we need the "datetime" package
def arbeit_ende(Stunden, Minuten):
    """Tells me when I will be done "working" today --> simply enter the number of hours & minutes that you still
    have to work "today". For example, if I still have to work '4h 44min' today:
    > arbeit_ende(4, 44) --> 2021-07-01 17:53:02.907698 """
    jetzt = datetime.datetime.now() # current time
    added_time = datetime.timedelta(hours= Stunden, minutes= Minuten) # number of hours & minutes you still have to work
    ende = jetzt + added_time
    print(ende)
arbeit_ende(8,7) # call the function to output at what time I will be done working today