R Cheat-Sheet for Data-Science

Dec 16, 2022 21 min read

This document collects a range of R code snippets that should help aspiring data scientists perform common tasks in the field.

Some important context to keep in mind before you start reading:
I have an economics background, so the recommended code - especially the models - is (strongly) influenced by how economists think & work.

I hope you’ll find what you are looking for 🔥

Basic R for Data-Science

Clear Workspace

Include the following code at the top of EVERY R script you use. It removes all objects that are saved in R’s global environment:

rm(list = ls())

Clear some Variables or Datasets

If you don’t need some variables or datasets, use this:

rm(list= c("dd_1", "dd_2", "var_1", "var_2")) # not needed anymore

Install a Package

install.packages("installr")

You can replace installr by any package name that you want to install.

Install multiple Packages

install.packages(c("ggplot2", "gcookbook", "MASS", "dplyr"))

Update Packages

Option 1:

update.packages()

The disadvantage of “Option 1” is that R will ask you - for EVERY single package - if you want to update it. This can take very long, especially when you have lots of packages installed. That’s why there exists “Option 2” (see below).

Option 2:

update.packages(ask = FALSE) # ask = FALSE updates all packages without
# asking you for each one

Update R

Simply run the following code (note: the installr-Package is built for Windows):

if(!require(installr)) {
  install.packages("installr"); require(installr)}
updateR() # this will start the updating process of your R installation.

Read Data

Load Data from `.csv`-File

data <- read.csv("datafile.csv")

Alternatively, you can use the read_csv() function (note the underscore instead of period) from the readr-Package. This function is significantly faster than read.csv()!

No Header in first Row

You can also load data if the .csv-file does not have a header in the first row:

data <- read.csv("datafile.csv", header = FALSE)

Load Data with special Delimiters

When loading a .csv-file, you need to tell R which character marks the beginning of a new column (= the delimiter).

Examples of delimiters are:

;
Tabs
etc…

data <- read.csv("datafile.csv", sep = "\t") # if it is tab-delimited, use
# 'sep = "\t"' --> this is called a delimiter...
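As a minimal sketch of the delimiter options above, the following base-R snippet writes a small semicolon-delimited file to a temporary location and reads it back. The file contents and the column names `id` and `wage` are made up for illustration.

```r
# Hypothetical example data: a semicolon-delimited file in a temporary location
tmp <- tempfile(fileext = ".csv")
writeLines(c("id;wage", "1;12.5", "2;15.0"), tmp)

# Tell read.csv() that columns are separated by ";" instead of ","
data_semicolon <- read.csv(tmp, sep = ";")
str(data_semicolon)
```

Base R also ships read.csv2(), a shortcut for files that use ; as the separator and , as the decimal mark.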

Read data that are Strings

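Step 1: The Problem when you have Columns that contain Strings:
In R versions before 4.0.0, read.csv() converts strings into factor-variables by default, which is usually NOT what you want (since R 4.0.0, the default is stringsAsFactors = FALSE). To be explicit, you can set the following input within the read.csv()-function: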

data <- read.csv("datafile.csv", stringsAsFactors = FALSE)

Step 2: Convert some Columns back to factors:
If any of those columns should NOT be strings, then you need to convert them back. For example, let’s say one of the columns is “sex”, then we need to convert it (back) into a factor-variable:

data$Sex <- factor(data$Sex)
str(data)  # check if 'sex' is now a factor-variable

Load Excel-Data

install.packages("readxl") # Only need to install once
library(readxl)
data <- read_excel("datafile.xlsx", 1)

Note: there are other packages for reading Excel files. The gdata-Package has a function read.xls() for reading in .xls-Files, and the xlsx-Package has a function read.xlsx() for reading in .xlsx-Files.

Load Second Sheet

data <- read_excel("datafile.xls", sheet = 2) # To access the 2nd sheet

Sheet with specific Name

data <- read_excel("datafile.xls", sheet = "Revenues") # To access a sheet with a specific name.

If Excel-Sheet has no Column-Names

data <- read_excel("datafile.xls", sheet = 2, col_names = FALSE) # By default,
# read_excel() uses the FIRST row of the spreadsheet for the column names.
# If your columns DON'T have column-names, then you need to use the input
# 'col_names = FALSE'.

Specific Column Types

If you want to specify the type of each column, you can do so. It is not necessary, though, since the function will try to infer the types by itself:

data <- read_excel("datafile.xls", col_types = c("skip", "text", "date", "numeric"))

The above code will drop the first column [see “skip”; older versions of readxl called this “blank”], and specify the types of the next three columns.

Load Stata-Files

# install.packages("readstata13") # Only need to install once
library(readstata13) # provides read.dta13() to read data from Stata files
# --> you need to load this package for the code below to work!

data <- read.dta13("~/Path/to/your/data/Stata-File.dta")

Load SPSS Data

library(haven) # Need the package 'haven' for this.
data <- read_sav("datafile.sav")

Note that the haven-package also includes functions to read from other formats:

read_sas(): SAS
read_dta(): Stata

Help

If you need information about a function, simply put a ? in front of its name and run the line:

?write.table

Another possibility would be to use help().

Data Cleaning

Quick Data Exploration

Find out number of observations in a Dataset

dim(sampleUScens2015) # returns the number of rows (= observations) and the number of columns

For a quick & general Overview

summary(data) # prints summary statistics for all the variables // columns

Print all the unique values from a particular column & sort all

sort(unique(data$educ))
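To see what these exploration commands return, here is a small sketch on a toy data frame. The data frame `toy` and its values are invented; they only stand in for a real dataset.

```r
# Toy data frame standing in for a real dataset (values are made up)
toy <- data.frame(educ = c(12, 16, 12, 18, 16),
                  wage = c(20, 35, 22, 50, 33))

dim(toy)                               # rows and columns at once
n_obs  <- nrow(toy)                    # number of observations (rows)
n_vars <- ncol(toy)                    # number of variables (columns)
educ_levels <- sort(unique(toy$educ))  # distinct values, sorted ascending
educ_levels
```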

Variable Transformations

Why is this important?
Because oftentimes - to use, for example, a function or a basic for loop - the data needs to be in a very specific data type. Otherwise, the processing you want to apply to your data may not work. That’s why it is very important to always know about the data types within your project!

Convert String into Factor

For example, let’s say one of the columns is “sex” and contains strings (= “male”, “female”), then we need to convert it into a factor-variable:

data$Sex <- factor(data$Sex)
str(data)  # check if 'sex' is now a factor-variable

Create a Dummy Variable

university <- ifelse(sampleUScens2015$educ>=16,1,0) # creates dummy with
# ifelse() function

Create a Dummy Variable from multiple Columns via nested `ifelse()`-Function

# combine the separate dummy columns into one single "choice" column:
data_test$choice <- ifelse(data_test$sport_dummy==1 & data_test$sport_dummy_young==1, 1,
                           ifelse(data_test$sport_dummy==1 & data_test$sport_dummy_young==0, 2,
                                  ifelse(data_test$sport_dummy==0 & data_test$sport_dummy_young==1, 3, 4)
                           )
)# this is a nested ifelse-function, which builds the "choice" variable with one code (1 to 4) per combination
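A quick way to check a nested ifelse() is to run it on a toy dataset that contains every combination exactly once. The sketch below does this; the data frame `toy` is made up, while the column names follow the code above.

```r
# Toy data covering all four combinations of the two dummies (values are made up)
toy <- data.frame(sport_dummy       = c(1, 1, 0, 0),
                  sport_dummy_young = c(1, 0, 1, 0))

toy$choice <- ifelse(toy$sport_dummy == 1 & toy$sport_dummy_young == 1, 1,
              ifelse(toy$sport_dummy == 1 & toy$sport_dummy_young == 0, 2,
              ifelse(toy$sport_dummy == 0 & toy$sport_dummy_young == 1, 3, 4)))

# Each row should have received its own code
toy
```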

Count data which are bigger / smaller than some Threshold

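library(dplyr) # count()-function requires dplyr-package
count(data, age >= 39) # count() needs the dataset as first input; shows how many rows are TRUE / FALSE
sum(data$age >= 39, na.rm = TRUE) # base-R alternative: number of observations with age >= 39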

Create unique IDs

Why is this data-transformation important?
In settings where you need to merge 2 separate datasets into 1 big dataset, it is important that both have the same unique ID column, otherwise you cannot merge those two datasets into one.

data_full <- transform(data_full,id=as.numeric(factor(country))) #creates a
# unique ID for each country --> it is appended to the dataset as a new variable

Possibility 2

There is an alternative way to create unique IDs (which I used during the data cleaning of my master thesis):

test_data$ID <- seq.int(nrow(test_data))
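The two possibilities produce different kinds of IDs: the first one numbers groups, the second one numbers rows. A minimal sketch on invented data (the data frame `toy` and its values are assumptions for illustration):

```r
# Made-up panel with a repeated country
toy <- data.frame(country = c("CH", "DE", "CH", "FR"),
                  year    = c(2000, 2000, 2001, 2000))

# Group ID: the same number for every row of the same country
toy$country_id <- as.numeric(factor(toy$country))

# Row ID: a running number that is unique for every row
toy$row_id <- seq_len(nrow(toy))
toy
```

Which one you need depends on the merge: a group ID is useful when several rows belong to the same unit, a row ID when every observation must be identifiable on its own.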

Standardize a Variable

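Be careful: scale() computes z-scores (mean 0, standard deviation 1), but it does NOT make the variable normally distributed!

data2["standard_score2"] <- as.numeric(scale(data2$totalscore)) # we create the new 
# column "standard_score2"; as.numeric() turns the matrix returned by scale() into a plain vector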
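To make explicit what scale() computes, the sketch below compares it with the z-score formula written out by hand. The vector `x` is made up.

```r
x <- c(2, 4, 4, 6, 9)                 # made-up scores

z_scale  <- as.numeric(scale(x))      # scale() returns a one-column matrix
z_manual <- (x - mean(x)) / sd(x)     # z-score computed by hand

all.equal(z_scale, z_manual)          # both approaches agree
c(mean = mean(z_scale), sd = sd(z_scale))
```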

Create an Interaction Term

data["lrain_2"] <- data$lrain*data$lrain #creates the squared term of "lrain" (= interaction of the variable with itself)

Managing Columns of your Dataset

Rename a Column

names(dd5)[14]<- "teens" # changes the name of the 14th column to "teens"

Change all Column-Names simultaneously

To manually assign the header names // new column names, simply use the following code:

names(data) <- c("Column1", "Column2", "Column3")

Selecting a Subset of Columns in your Dataset by their position within the dataset

You can use the following code to make a “cleaner” dataset, e.g. with a better overview of some columns:

data_test <- data[,c(39,43)] # creates a dataset: it selects all rows, but
# only the columns 39 and 43 (use 39:43 to select the columns 39 to 43)

Selecting a Subset of Columns in your Dataset by their Names

Let’s say, we want to extract only some of the columns like “sport” or “sport_dummy” etc. This can be achieved by using the following code:

test_data <- data[,c("type","sport","sport_dummy", "m_sport", "f_sport", "sport_parent", "sport_mother", "sport_father")]#198 missings due to NAs

Drop Columns

Dropping some existing columns can be achieved with the following code:

data$male<-NULL # this will eliminate the column "male"

Drop unused Categories (= levels) in Categorical Variables

levels(data$type) # check levels: not every category defined is used, 
# so let's drop them
data$type <- droplevels(data$type, exclude = if(anyNA(levels(data$type))) NULL else NA)
levels(data$type) # check if it worked (should have only 69 categories left)

Create new Columns in a Dataset

Note that there are multiple ways to create new columns in a dataset.

Create a new Column based on an existing Column | Possibility 1

data_NJ$low <- ifelse(data_NJ$wage_st < 5,1,0) # creates a new variable //
# column "low" for dataset data_NJ

### Alternatively, in order to create a squared term:
data["lrain_2"] <- data$lrain*data$lrain #creates the squared term of "lrain"

Append new Columns to an existing Dataset | Possibility 2

university <- ifelse(sampleUScens2015$educ>=16,1,0) #creates dummy with ifelse() function
data <- data.frame(sampleUScens2015,wage,lw,university) # Using the data.frame
# function we create a new data frame and append the vectors "wage", "lw" and "university" (which must already exist,
# with one value per row) as columns to the already cleaned dataset "sampleUScens2015".

Replace Values within Columns

Here, I want to fill the NA-values of one column with the values that another column has at the same positions:

data_did3$col_na[is.na(data_did3$col_na)] <- data_did3$type[is.na(data_did3$col_na)] # fills the NAs in
# "col_na" with the corresponding values from "type"

Missing-Values (`NA`s)

Count Missings

length(data$twinno[is.na(data$twinno)])

Possibility 2 to count Missings

# count number of missings for a variable (here: 'frequency'):
missings <- data[is.na(data$frequency),] # keeps only the rows where 'frequency' is NA
nrow(missings) # number of missings (352 in my dataset)

Select all the `NA`-values from a Column

test_data <- data[is.na(data$type),]

Filter all NAs into a Subset

testo <- subset(test, is.na(corrupt_icrg)==TRUE)

Build a dataset with only Missing Values

Why would you do this?
This may be useful to understand why there are missings in a tabular dataset, because you see all the rows that have an NA-value in the chosen column. Note that this method preserves the other columns, i.e. it selects the entire row, including columns that do not contain a missing themselves (which is nice, because they may show us why the chosen column has a missing in it!).

missings <- data[is.na(data$m_physact),] # selects only the ROWs with missings & also selects every column

Compute the Correlation with Missings

cor(data.long$avehigh, data.long$aveweigh, use='pairwise.complete.obs') # only uses the rows where both variables are observed

Delete Rows which contain Missings

data_subset <- data.wide[ , c("down_exp")] #select the columns from which you want to remove the NA's
df <- data.wide[complete.cases(data_subset), ] # Omit NAs by columns
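The difference between dropping missings for selected columns and dropping them for all columns can be seen in a small sketch. The data frame `toy` is invented; only the column name `down_exp` is taken from the code above.

```r
# Made-up data with missings in two different columns
toy <- data.frame(id       = 1:4,
                  down_exp = c(10, NA, 7, NA),
                  other    = c(NA, 1, 2, 3))

# Keep rows that are complete in 'down_exp' only; NAs in 'other' are tolerated
keep      <- complete.cases(toy[, "down_exp", drop = FALSE])
toy_clean <- toy[keep, ]

# For comparison: na.omit() drops every row with an NA in ANY column
toy_strict <- na.omit(toy)
```

Here `toy_clean` keeps the rows with id 1 and 3, while `toy_strict` keeps only the row with id 3.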

Delete missings from a Column

testo <- subset(testo, is.na(corrupt_icrg)==FALSE)

Manual Missing Imputation: Replace a specific Value with another one

Let’s say that we want to replace a specific value within a dataset, because - for example - it is an NA.

data[122,] <- replace(data[122,], "islam1100", 0) #Set "Islam1100" for country Israel equal to 0 instead of 1
View(data) #check if it worked

Note that the value does not necessarily have to be a missing value. We may also want to replace it because a row of the dataset contained a wrong value.

Creating Datasets

Create a new Dataset with selected Columns

data_sport <- data.frame("sport" = data[,c(22)], "ysport" = data[,c(32)]) #we need to specify the column-names when creating a new data-frame, that's why I wrote "sport" for column 22 [= sport-column in data] & "ysport" for column 32

Create a new Dataset by Filtering

did_wage <- subset(data, gap>0) #creates a new dataset that only contains the rows where gap>0

Duplicate Data

Show only unique Values, i.e. no Duplicates

Let’s say our goal is to get an overview of all the values that the variable in a specific column can take on. This can be easily achieved with the following code:

unique(sort(data$empstat)) # Because we are using "unique", no duplicates will be shown.

Filtering the Data, i.e. creating Subsets

data_new <- subset(data, sample==1) # Keep only the observations where sample == 1 (removes those where sample == 0)
data_NJ <- subset(data, sample ==1 & state==1) # 2 criteria

Generating Summary Statistics

For a quick & general Summary of your Data

summary(data) # prints summary statistics for all the variables // columns

Calculate the Mean

plot <- aggregate(data_int, by=list(data_int$sport), mean) # computes the mean of every column, separately for each value of "sport"

Calculate the mean, the variance & standard deviation but ignore the Missings in a Column

mean_score <- mean(data2$totalscore, na.rm=TRUE) #ignore the NA's and
# calculate mean of this variable
variance_score <- var(data2$totalscore, na.rm=TRUE) # variance
sd_score <- sd(data2$totalscore, na.rm=TRUE) # standard deviation (= square root of the variance)

Note the input-parameter na.rm = TRUE inside mean(), var() and sd() when we perform the calculation of the mean, the variance & the standard deviation.

Transform a Dataset

Transform your dataset into “Wide”-Format

data.wide <- reshape(data , idvar = "family", timevar = "twinno", direction = "wide") # idvar uniquely identifies each unit (here: family); timevar distinguishes the individual twins. A separate column is created for each of them.
View(data.wide)
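The effect of the wide-transformation is easiest to see on a tiny dataset. In the sketch below, the data frame `toy_long` and the variable `height` are made up; `family` and `twinno` follow the code above.

```r
# Made-up long-format data: two families with two twins each
toy_long <- data.frame(family = c(1, 1, 2, 2),
                       twinno = c(1, 2, 1, 2),
                       height = c(170, 172, 160, 165))

toy_wide <- reshape(toy_long, idvar = "family", timevar = "twinno",
                    direction = "wide")
toy_wide   # one row per family, with the columns height.1 and height.2
```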

Merging

Merge two different Datasets together

Here, I merge 2 datasets via the country-code variable, which is called “countrycode” in dataset 1 (“data”) and “ccodealp” in dataset 2 (“data_1”). The merged dataset is called fulldata. By default, merge() keeps only the data-points that were successfully matched in both datasets!

data_1 <- read.csv("~/Uni/Masterstudium/FS_2019/Policy Analysts/Problem Set/PS3/qog_bas_cs_jan19.csv") #Downloaded data csv file

fulldata <-merge(data, data_1, by.x="countrycode", by.y="ccodealp")
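The following sketch shows, on invented data, what merge() keeps by default and how the `all = TRUE` input changes that. The data frames `left` and `right` and their values are assumptions; the key names follow the code above.

```r
# Two made-up datasets whose key columns have different names
left  <- data.frame(countrycode = c("CHE", "DEU", "FRA"),
                    gdp         = c(1, 2, 3))
right <- data.frame(ccodealp   = c("CHE", "DEU", "ITA"),
                    corruption = c(0.1, 0.2, 0.3))

# Default: only the rows that match in both datasets
inner <- merge(left, right, by.x = "countrycode", by.y = "ccodealp")

# all = TRUE: keep unmatched rows from both sides and fill the gaps with NA
outer <- merge(left, right, by.x = "countrycode", by.y = "ccodealp", all = TRUE)

nrow(inner)  # 2 (CHE and DEU)
nrow(outer)  # 4 (CHE, DEU, FRA and ITA)
```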

Regression Models

Simple Linear Regression with 1 Covariate // Regressor

model <- lm(cigs ~ educ, data = data) # Estimates a regression model. Note: lm(y ~ x, data = dataset)

Exclude the Constant of a Regression

model7 <- lm(data$normpolity ~ data$arabconquest+data$muslimmajority+data$lrain+data$lrain_2+data$lrain_3 + data$fuel + data$oceania + data$europe + data$asia + data$americas + data$africa - 1)

As you can see, we simply need to add -1 to the basic syntax of a regression model to exclude the constant in a regression.

Summarizing Model-Results

summary(model) # summary statistics of a regression model

Calculation of the Coefficient Estimates via Formula

b_1 <- cov(data$cigs,data$educ)/var(data$educ) #Compute the estimated slope-coefficient
# ("beta 1") with the help of the formula that I found on the econometrics-slides
b_0 <- mean(data$cigs)-b_1*mean(data$educ) #Compute the intercept,
# also known as "beta 0"
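As a sanity check, the formula-based coefficients can be compared with the output of lm(). The sketch below does this on simulated data; the variable names mirror the example above, but the numbers are generated and the seed is arbitrary.

```r
set.seed(1)                               # arbitrary seed, only for reproducibility
educ <- rnorm(100, mean = 12, sd = 2)     # simulated regressor
cigs <- 20 - 0.5 * educ + rnorm(100)      # simulated outcome

b_1 <- cov(cigs, educ) / var(educ)        # slope
b_0 <- mean(cigs) - b_1 * mean(educ)      # intercept

fit <- lm(cigs ~ educ)
all.equal(unname(coef(fit)), c(b_0, b_1)) # lm() reports intercept first, then slope
```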

Selecting a Coefficient out of an estimated Regression

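b1 <- reg5$coefficients[1] # note: if the model has a constant, position 1 is the intercept & the first regressor is at position 2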

Extract the Residuals of a Regression

regres <- lm(data$female~data$educ + data$age)
res <- data$female - predict(regres) # In "words", this would be: `y - y(hat)`

Stargazer

Stargazer is an R-Package that is used in academia to produce beautiful tables, in the layout of published papers that you can see in academic Journals.

Create a beautiful `stargazer`-Table with multiple labeling of Regressors & with only 1 model

In my workflow, I used to output beautiful stargazer-tables as .html-tables, formatting them with CSS, and then copy-pasting them into my word-document.

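The following code covers the first part of this workflow, i.e. it outputs a table in .html-format. Note that stargazer picks the format of the saved file from the file extension of the out-path, while type = "text" only controls what is printed to the console.

library(stargazer)
stargazer(model5, title="OLS Regression",align=TRUE,covariate.labels = c("Muslim Majority", "Average Fertility"), type="text",out="~/Documents/Uni/Masterstudium/FS_2019/Policy Analysts/Problem Set/PS3/Table1.html")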

Create a `stargazer`-Table with > 1 regression model

stargazer(model1, model2, model3, model2_2, title="Different OLS Regressions",align=TRUE, type="text",out="~/Documents/Uni/Masterstudium/FS_2019/Policy Analysts/Problem Set/PS3/Table1.html")

Create a stargazer table but omit some covariates

Why is this useful?
If you are using fixed effects in a regression, I would not recommend printing out all effect-sizes, because you usually end up with too many coefficients (which will only distract the reader from what’s important)!

stargazer(model7, model8, title="Standard OLS Regressions",align=TRUE, type="html", omit=c("fuelendowed","europe", "americas", "africa", "asia", "oceania"), out="~/Documents/Uni/Masterstudium/HS 2018/Empirical Methods/PS4/Table1.html")

Methods

Fixed Effects

To include fixed effects into a regression, we have 2 possibilities.

Possibility 1: in “native” R (no package required)

test <- lm(lifexp ~ log_gdppc + factor(country), data = data) # here I include
# a "country FE" into the regression --> no package needed
summary(test) #check if it worked
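To illustrate why the dummy-variable approach counts as a fixed-effects regression, the sketch below compares it with the "within" approach, where the country means are subtracted from y and x by hand. The panel is simulated, the seed is arbitrary and the helper columns `y_dm` and `x_dm` are made up; only the variable names `lifexp`, `log_gdppc` and `country` follow the code above.

```r
set.seed(42)                                   # arbitrary seed
toy <- data.frame(country = rep(c("A", "B", "C"), each = 5))
toy$log_gdppc <- rnorm(15)
toy$lifexp    <- 70 + 2 * toy$log_gdppc +
                 as.numeric(factor(toy$country)) + rnorm(15)

# Dummy-variable approach (as in Possibility 1)
fe_dummy <- lm(lifexp ~ log_gdppc + factor(country), data = toy)

# Within approach: subtract the country means from y and x
toy$y_dm <- toy$lifexp    - ave(toy$lifexp,    toy$country)
toy$x_dm <- toy$log_gdppc - ave(toy$log_gdppc, toy$country)
fe_within <- lm(y_dm ~ x_dm - 1, data = toy)

# The slope estimates coincide
all.equal(unname(coef(fe_dummy)["log_gdppc"]), unname(coef(fe_within)["x_dm"]))
```

Only the coefficient is identical here. The standard errors of the hand-made within regression differ, because lm() does not know that the country means were estimated.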

Possibility 2: using the R-Package plm

library(Formula) #you have to load these two packages first for the FE to be included
library(plm)

# Prepare the dataset to include only country FEs:
data_FE <- pdata.frame(data, index = c("country")) #save the data into a 
# special "country FE"-dataset

# Now, we are ready to estimate the country FE regression:
model1 <- plm(lifexp ~ log_gdppc, data=data_FE, model = "within", effect = "individual") 
# Note: replace "individual" by "twoways" when you include not only 
# country FE but also - for example - a cohort FE
summary(model1) #check if it worked

# With country FEs, as well as time FEs, we (again) prepare the dataset first.
# Note: the index must be given in the order c(individual, time):
data_country_time <- pdata.frame(data, index = c("country", "year"))

model3 <- plm(lifexp ~ log_gdppc, data=data_country_time, model = "within", effect = "twoways") 
# The difference here with two FEs is the "twoways" option!

# For a Stargazer-Table, run this code:
stargazer(model3, title="OLS Regression",align=TRUE,covariate.labels = c("GDP per capita (in log)"), type="text",out="~/Documents/Uni/Masterstudium/FS_2019/Policy Analysts/Problem Set/PS5/Table1.html") # note: output is a table in `.html`-format

IV-Regression

#load package first:
library(ivpack)

## IV-Regression with the short command:
lmiv<- ivreg(lnearn~highqua+age+agesq | age+agesq+twihigh , data = data)

## IV-Regression but with robust standard errors
lmiv6<- ivreg(normpolity~arabconquest+fuelendowed+oceania+europe+asia+americas+africa-1 | fuelendowed+oceania+europe+asia+americas+africa+mecca+lrain-1 , data = data_muslim) # it does not matter where you place the instruments among the terms on the right-hand side of the "|"
summary(lmiv6) #check if it worked
lmiv6<-robust.se(lmiv6)#for robust SE

Clustering your Standard Errors

model4 <- coeftest(model4, vcov=vcovHC(model4,type="HC0",cluster="time")) # clustering SEs by time (assumes that model4 is a plm-model; coeftest() comes from the lmtest-package)

Synthetic Control for Difference-in-Differences (DiD)

The code below will generate a synthetic control-group in my DiD-model.

# Load package:
library(Synth) #you need this package, otherwise you cannot do a 
# synthetic control

# Use the following code to prepare your data:
data.out <- dataprep(
  foo = data_full, #plug in your dataset
  predictors = c("reform_cap", "GINIp", "GINIc"), # regressors of the dataset you want to use
  predictors.op = "mean", # in DiD, you compare means
  time.predictors.prior = 1960:1976, # periods BEFORE the treatment
  dependent="IGEincome",
  unit.variable = "id",
  unit.names.variable = "country",
  time.variable = "year.x",
  treatment.identifier = 97,# look at the id of Switzerland // Treated group
  controls.identifier = c(2:96, 98:109), # control groups
  time.optimize.ssr = 1960:1976,
  time.plot = 1955:1997)
  
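# Estimate the weights of the synthetic control:
synth.out <- synth(data.prep.obj = data.out, method = "BFGS")
# Finally, plot the treated unit against its synthetic control:
path.plot(synth.res = synth.out, dataprep.res = data.out)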

Discrete Choice Modeling

Preparing the `y`-Variable for modeling multiple Choices

# combine the separate dummy columns into one single "choice" column:
data_test$choice <- ifelse(data_test$sport_dummy==1 & data_test$sport_dummy_young==1, 1,
                           ifelse(data_test$sport_dummy==1 & data_test$sport_dummy_young==0, 2,
                                  ifelse(data_test$sport_dummy==0 & data_test$sport_dummy_young==1, 3, 4)
                           )
)# this is a nested ifelse-function, which builds 
# the "choice" (y)-variable

str(data)#check whether "choice" is already a categorical variable --> if not, you need to convert it for the later steps...

Make the “choice”-variable categorical

data$choiceF <- factor(data$choice)# create a new variable that will 
# be categorical-variable for "choice"

# for multinomial logit model to work, we need to create a 
# "reference category" within the newly created variable "choiceF":
data$choices <- relevel(data$choiceF, ref = "1")# our reference category
# will be the number "1", which corresponds to the people doing sports 
# in childhood, as well as in adulthood

Create a Multinomial Logit (MNL) Model

library(nnet)# we need this package for multinomial logit regression

model1 <- multinom(data$choices ~ data$m_sport + data$f_educ)# this is 
# a test-model: I want to see, if I get the same estimated coefficients, 
# as the mock-up MNL-regression that my professor sent me? 
summary(model1) # --> Yes, every coefficient is (almost) the same! 
# Hence, this package is reliable! :)

Make Predictions with the MNL-Model

predict(model1, data)#predictions in terms of the outcome variable (= most likely category)
predict(model1, data, type = "prob")#predictions in terms of the predicted probability of each category

Check the Performance of my Model

The code below will generate a “confusion matrix”, which shows how many of our classifications were correct / false.

cm <- table(predict(model1), data$choice)#note: only works when you don't have NA's
print(cm)
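From a confusion matrix, the share of correct classifications (accuracy) follows directly: the correct cases sit on the diagonal. A minimal sketch with invented predictions and outcomes (both vectors are made up and not taken from the model above):

```r
# Made-up predicted and observed categories
predicted <- factor(c(1, 2, 2, 3, 1, 4), levels = 1:4)
observed  <- factor(c(1, 2, 3, 3, 1, 1), levels = 1:4)

cm <- table(predicted, observed)
accuracy <- sum(diag(cm)) / sum(cm)   # share of correct classifications
accuracy
```

Fixing the factor levels on both vectors keeps the table square even if one category is never predicted.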

Data Visualisations

Make a simple Scatter Plot

With the following code, you can visualize the correlation between 2 variables:

plot(data$educ, data$cigs, main = "Years of Education vs. Cigarettes smoked per day", xlab = "Years of education", ylab="Cigarettes smoked per day") # Note: plot(x-variable, y-variable)
abline(model, lwd = 2, lty = 3) #lwd = line width; lty = line type

Plotting a simple Histogram

hist(data$wdi_mort, main ="Distribution of infant mortality", xlab = "Mortality Rates",ylab = "Frequency of Mortality Rates",col = "red3", ylim=c(0,110)) #make a histogram of child mortality rates

Plotting Confidence Intervals

Konf <- predict(reg1, interval="confidence", level=.95) # note: you need to
# estimate a linear regression in order to plot the confidence intervals

with(data, plot(data$corruptionun, data$mortalityun, main = "Corruption vs. Mortality", xlab = "Corruption", ylab= "Mortality"))
abline(reg1, lwd = 2, lty = 3)
lines(x = data$corruptionun[order(data$corruptionun)],y= Konf[order(data$corruptionun),2],lwd=2,col= 2)
lines(x = data$corruptionun[order(data$corruptionun)],y= Konf[order(data$corruptionun),3],lwd=2, col= 2)

Correlation between the Residuals and their Lag

res_lag <- c(NA, res[-length(res)]) # Shifts the position of the residuals in the vector forward by 1 (base R's lag() does not shift a plain vector)
cor(res[-1], res_lag[-1]) # Forms the correlation. Be careful: skip position 1, because the lagged residual is NA there!!!

Make a Density Plot (of Residuals)

plot(density(res))

Plotting Autocorrelation

The following code will plot the correlation between the residuals and their own lags (= autocorrelation):

acf(res, main ="Autocorrelation of the residuals")

Statistical Computations

Calculate the t-statistic “by Hand”

In the example below, we estimated a model and we now want to compute the t-statistic of the coefficient of our 3rd regressor. This is achieved via the following code:

se <- sqrt(diag(vcov(reg5)))
t.stat <- reg5$coefficients[4]/se[4] #note: in our model, there is a
# "constant" [= intercept] at position 1, that's why we need to select position 4 and not 3
t.stat
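The hand-computed t-statistics can be checked against the ones that summary() reports. The sketch below does this for all coefficients at once; the data are simulated and the seed is arbitrary.

```r
set.seed(123)                                   # arbitrary seed
x1 <- rnorm(50)
x2 <- rnorm(50)
y  <- 1 + 0.5 * x1 - 0.3 * x2 + rnorm(50)
fit <- lm(y ~ x1 + x2)

se     <- sqrt(diag(vcov(fit)))                 # standard errors
t_hand <- coef(fit) / se                        # t-statistics for all coefficients
t_lm   <- summary(fit)$coefficients[, "t value"]

all.equal(t_hand, t_lm)                         # both agree
```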

A more complicated t-test calculated “by Hand”: `beta4 - beta7 = 0`

reg19 <- lm(sam$lw ~ sam$educ +sam$age+sam$childrenly+sam$Bus+sam$hea+sam$tech+sam$scie) # 7 regressors 
# and 1 constant = 8 estimated coefficients
summary(reg19)
cov <- (vcov(reg19))
se19 <- sqrt(cov[5,5]+cov[8,8]-2*cov[5,8]) #note: we have a constant, 
# that's why we have "beta4 = 5" and "beta7 = 8"
t.stat19 <- (reg19$coefficients[5]-reg19$coefficients[8])/se19 #don't 
# forget the correct intercept when selecting the coefficients! --> make it "+1"...
p.val <-2*pt(-abs(t.stat19),df=reg19$df.residual)
p.val

Calculate the F-statistics “by hand”

reg7 <- lm(s$lw ~ s$educ +s$age+s$female+ s$educ*s$female+s$female*s$age) #5
# regressors & 1 constant = 6 coefficients are getting estimated
summary(reg7) #this is the unrestricted model with all regressors
R <- rbind (c(0 ,0 ,0,1 ,0,0) , c(0 ,0,0,0 ,1 ,0),c(0,0,0,0,0,1)) #one row per restriction: put a 1 for
# the coefficient you want to test --> note: the first 0 is for the constant!
r <- c(0 ,0,0) #right-hand side of the restrictions (one value per equation)
ftest <- linearHypothesis ( reg7 , hypothesis.matrix =R, rhs=r, vcov = vcovHC ( reg7 ,"HC1")) #robust F-test; needs the packages car & sandwich
regrest <- lm(s$lw ~ s$educ +s$age) #this is the restricted model without 
# the tested regressors
summary(regrest)
f.test <- ((sum((regrest$residuals)^2)-sum((reg7$residuals)^2))/3)/(sum((reg7$residuals)^2)/(reg7$df.residual))
f.test
crit_value <- qf(0.95, df1=3, df2=reg7$df.residual) #5% critical value; df2 = residual degrees of freedom (561073 in my dataset)
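The manual F-statistic can be cross-checked with anova(), which compares a restricted and an unrestricted model under the same (non-robust) assumptions. The sketch below uses simulated data with an arbitrary seed; the variable names mirror the example above, the object names are made up.

```r
set.seed(7)                                     # arbitrary seed
n      <- 200
educ   <- rnorm(n)
age    <- rnorm(n)
female <- rbinom(n, 1, 0.5)
lw     <- 1 + 0.1 * educ + 0.05 * age + rnorm(n)

unrestricted <- lm(lw ~ educ + age + female + educ:female + age:female)
restricted   <- lm(lw ~ educ + age)

q      <- 3                                     # number of restrictions
ssr_u  <- sum(residuals(unrestricted)^2)
ssr_r  <- sum(residuals(restricted)^2)
f_hand <- ((ssr_r - ssr_u) / q) / (ssr_u / unrestricted$df.residual)

f_anova <- anova(restricted, unrestricted)$F[2] # F-statistic reported by anova()
crit    <- qf(0.95, df1 = q, df2 = unrestricted$df.residual)
all.equal(f_hand, f_anova)
```

Reading the degrees of freedom from the model object avoids hard-coding a number that changes with the sample.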

Monte Carlo Simulation “by hand”

N = 10  # sample size you draw --> change this variable if you want
R = 200 # this is (the end of) your counter --> number of times you repeat 
# the random draws --> change this variable if you want
x_r <- mat.or.vec(R,1) # Creates a 0-Vector of length "R" (= 200) and 
# with 1 dimension, e.g. we have a vector here (not a matrix)...

# make a for-loop:
for(i in 1:R){
  x <- rexp(N) # Random draws with sample-size 'N', e.g. 10 random draws in 
  # this case...
  meanx_r <- mean(x) # Computation of our'Sample-Average' (= random variable)
  x_r[i]<-meanx_r # Save the 'Sample-Average' in the i-th position of the 
  # vector we created above.
}

hist(x_r, main ="Distribution of x^r with N=10 and R=200", xlab = "x^r",ylab = "Frequency of x^r",col = "red3")
meanx <- mean(x_r) # Takes the mean of all randomly generated
# 'Sample-averages'
varx <-var(x_r) # Takes the variance of all randomly generated
# 'Sample-averages'
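The same simulation can be written more compactly with replicate(), and the results can be compared with theory: rexp() draws from an exponential distribution with rate 1 by default, so the sample averages should have a mean close to 1 and a variance close to 1/N. A sketch with an arbitrary seed:

```r
set.seed(2022)           # arbitrary seed so that the draws are reproducible
N <- 10                  # sample size per draw
R <- 200                 # number of repetitions

# replicate() repeats the expression R times and collects the results
x_r <- replicate(R, mean(rexp(N)))

mean(x_r)   # should be close to 1, the mean of an Exp(1) distribution
var(x_r)    # should be close to 1/N = 0.1
```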

Simulating a Normal Distribution over a Sequence

Step one
We create a sequence of values.

sequence <- seq(-50, 50, by = 0.1) # we create this sequence to define a set 
# of X variables. It is important that we make sure to incorporate the 
# whole scope of values the residuals can take on as our sequence, 
# which is why we enter a minimum as the starting-point and the maximum 
# as the endpoint.

Step two
Next, we tell R to construct a normal distribution over the given sequence. This way, we can see which values will appear with high probability within the normal distribution.

normal <- dnorm(sequence, mean = 0, sd = sqrt(var(res))) # evaluates the normal density (mean 0, standard deviation of the residuals) at every value of the sequence
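Put together, the two steps can be used to compare the residuals with a normal distribution. In the sketch below, `res` is simulated as a stand-in for real regression residuals (an assumption, as is the seed); the plot overlays the normal curve on the histogram.

```r
set.seed(5)                          # arbitrary seed
res <- rnorm(500, sd = 3)            # stand-in for regression residuals

sequence <- seq(-50, 50, by = 0.1)
normal   <- dnorm(sequence, mean = 0, sd = sd(res))

# Riemann-sum check: the density should integrate to roughly 1
area <- sum(normal) * 0.1
area

# Overlay the normal curve on the histogram of the residuals
hist(res, freq = FALSE, main = "Residuals vs. normal density")
lines(sequence, normal, lwd = 2)
```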

General Project Management

Packages to make a Full Econ-Project

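library(sandwich) #robust variance-covariance estimators, e.g. vcovHC()
library(lmtest) #to test regression coefficients, e.g. with coeftest()
library(car) #for linearHypothesis()
library(foreign) #to read data files from other statistical software
library(stargazer) #to make tables that you can use in Word or display as HTML tables
library(readstata13) #you need this package, otherwise you cannot read the Stata data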