R Cheat-Sheet for Data-Science
In this document, I list a variety of R code snippets that should help aspiring Data Scientists perform common tasks demanded in the field.
Some important context to keep in mind before you start reading:
I have an economics background, so the recommended code - especially the models - can be (strongly) influenced by how economists think & work.
I hope you’ll find what you are looking for 🔥
Table of Contents
- Basic R for Data-Science
- Data Cleaning
- Quick Data Exploration
- Variable Transformations
- Columns
- Missings (= NAs)
- Creating Datasets
- Duplicate Data
- Filtering the Data
- Summary Statistics
- Dataset Transformation
- Merging
- Regression Models
- Methods
- Data Visualisations
- Statistical Computations
- General Project Management
Basic R for Data-Science
Clear Workspace
Include the following code in EVERY R-Script you are using. It will remove all objects that are saved in R’s global environment:
rm(list = ls())
Clear some Variables or Datasets
If you don’t need some variables or datasets, use this:
rm(list= c("dd_1", "dd_2", "var_1", "var_2")) # not needed anymore
Install a Package
install.packages("installr")
You can replace installr by any package name that you want to install.
Install multiple Packages
install.packages(c("ggplot2", "gcookbook", "MASS", "dplyr"))
Update Packages
Option 1:
update.packages()
The disadvantage of “Option 1” is that R will ask you - for EVERY single package - if you want to update it. This can take very long, especially when you have lots of packages installed. That’s why there exists “Option 2” (see below).
Option 2:
update.packages(ask = FALSE) # If you want it to upgrade all packages without
# asking, use ask = FALSE:
Update R
Simply run the following code:
if(!require(installr)) {
install.packages("installr"); require(installr)}
updateR() # this will start the updating process of your R installation.
Read Data
Load Data from .csv-File
data <- read.csv("datafile.csv")
Alternatively, you can use the read_csv() function (note the underscore instead of the period) from the readr-Package. This function is significantly faster than read.csv()!
No Header in first Row
You can also load data if the .csv-file does not have a header in the first row:
data <- read.csv("datafile.csv", header = FALSE)
Load Data with special Delimiters
When loading a .csv-file, there are different ways to tell R where a new column begins.
Examples of Delimiters are:
- ;
- Tabs
- etc…
data <- read.csv("datafile.csv", sep = "\t") # if it is tab-delimited, use
# 'sep = "\t"' --> this is called a delimiter...
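To see the delimiter logic in action without needing a real file, here is a minimal sketch. The file content and the names `tmp`, `people` and `people2` are made up for illustration; only base R is used.

```r
# Hypothetical example file, written to a temporary location
tmp <- tempfile(fileext = ".csv")
writeLines(c("name;age", "Anna;31", "Ben;45"), tmp)

# sep = ";" tells read.csv() that the columns are separated by semicolons
people <- read.csv(tmp, sep = ";")
str(people)

# read.csv2() is a shortcut that assumes ";" as delimiter
# (and "," as decimal mark)
people2 <- read.csv2(tmp)
```

If you read the same file with the default `sep = ","`, you would end up with one single column, which is usually the first hint that the delimiter is wrong.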
Read data that are Strings
Step 1: The Problem when you have Columns that contain strings:
Before version 4.0.0, R treated strings as factor-variables by default, which is usually NOT what you want (since R 4.0.0, strings stay strings by default). In older versions, you can use the following input within the read.csv()-function:
data <- read.csv("datafile.csv", stringsAsFactors = FALSE)
Step 2: Convert some Columns back to factors:
If any of those columns should NOT be strings, then you need to convert them back. For example, let’s say one of the columns is “sex”, then we need to convert it (back) into a factor-variable:
data$Sex <- factor(data$Sex)
str(data) # check if 'sex' is now a factor-variable
Load Excel-Data
install.packages("readxl") # Only need to install once
library(readxl)
data <- read_excel("datafile.xlsx", 1)
- Note: there are other packages for reading Excel files. The gdata-Package has a function read.xls() for reading in .xls-Files, and the xlsx-Package has a function read.xlsx() for reading in .xlsx-Files.
Load Second Sheet
data <- read_excel("datafile.xls", sheet = 2) # To access the 2nd sheet
Sheet with specific Name
data <- read_excel("datafile.xls", sheet = "Revenues") # To access a sheet with a specific name.
If Excel-Sheet has no Column-Names
data <- read_excel("datafile.xls", sheet = 2, col_names = FALSE) # By default,
# read_excel() uses the FIRST row of the sheet as column names. If your sheet
# has NO header row, set 'col_names = FALSE' so that the first row is read
# as data.
Specific Column Types
If you want to specify the type of each column, you can do so. It is not necessary, though, since the function will otherwise try to infer the types by itself:
data <- read_excel("datafile.xls", col_types = c("skip", "text", "date", "numeric"))
The above code will drop the first column [see “skip”; older readxl versions called this “blank”], and specify the types of the next three columns.
Load Stata-Files
# install.packages("readstata13")
library(readstata13) # read data from
# stata files --> you need to load library(readstata13) package
# for this to work!
data <- read.dta13("~/Path/to/your/data/Stata-File.dta")
Load SPSS Data
library(haven) # Need the package 'haven' for this.
data <- read_sav("datafile.sav")
Note that the haven-package also includes functions to read from other formats:
- read_sas(): SAS
- read_dta(): Stata
Help
If you need information about a command, simply put a ? in front of its name and run it:
?write.table
Another possibility would be to use help().
Data Cleaning
Quick Data Exploration
Find out number of observations in a Dataset
dim(sampleUScens2015)
For a quick & general Overview
summary(data) # makes a summary statistics of all the variables // columns
Print all the unique values from a particular column & sort them
sort(unique(data$educ))
Variable Transformations
Why is this important?
Because oftentimes - to use, for example, a function or a basic for loop - the data needs to be in a very specific data type. Otherwise, the processing you want to apply to your data may not work. That’s why it is very important to always know about the data types within your project!
Convert String into Factor
For example, let’s say one of the columns is “sex” and contains strings (= “male”, “female”), then we need to convert it into a factor-variable:
data$Sex <- factor(data$Sex)
str(data) # check if 'sex' is now a factor-variable
Create a Dummy Variable
university <- ifelse(sampleUScens2015$educ>=16,1,0) # creates dummy with
# ifelse() function
Create a Dummy Variable from multiple Columns via nested ifelse()-Function
# summarize "choices" into one column by putting all separate "column-decision" into one column:
data_test$choice <- ifelse(data_test$sport_dummy==1 & data_test$sport_dummy_young==1, 1,
ifelse(data_test$sport_dummy==1 & data_test$sport_dummy_young==0, 2,
ifelse(data_test$sport_dummy==0 & data_test$sport_dummy_young==1, 3, 4)
)
)#this is a nested ifelse-function, which should allow me to build my "choice" variable
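To check that the nested ifelse()-logic maps every combination to the intended category, you can run it on a tiny made-up dataset first. The data frame `toy` below is purely hypothetical and contains each of the four possible combinations exactly once.

```r
# Hypothetical toy data: one row per possible combination of the two dummies
toy <- data.frame(sport_dummy       = c(1, 1, 0, 0),
                  sport_dummy_young = c(1, 0, 1, 0))

# Same nested ifelse() structure as above
toy$choice <- ifelse(toy$sport_dummy == 1 & toy$sport_dummy_young == 1, 1,
              ifelse(toy$sport_dummy == 1 & toy$sport_dummy_young == 0, 2,
              ifelse(toy$sport_dummy == 0 & toy$sport_dummy_young == 1, 3, 4)))

# Cross-check: every combination should land in exactly one category
table(toy$choice)
```

Be aware that rows with an NA in one of the dummies will get an NA as "choice", so it is worth counting the missings before you continue.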
Count data which are bigger / smaller than some Threshold
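# Note: dplyr's count() expects a data frame (not a vector), so base R is used here
sum(data$age >= 39, na.rm = TRUE) # number of observations with age >= 39
table(data$age >= 39) # counts on both sides of the threshold (FALSE / TRUE)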
Create unique IDs
Why is this data-transformation important?
When you need to merge 2 separate datasets into 1 big dataset, both need a common ID column (a key) that identifies the observations; otherwise you cannot merge the two datasets into one.
data_full <- transform(data_full,id=as.numeric(factor(country))) #creates a
# unique ID for each country --> it is appended to the dataset as a new variable
Possibility 2
There is an alternative way to create unique IDs (which I used during the data cleaning of my master thesis):
test_data$ID <- seq.int(nrow(test_data))
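The two possibilities do different things, which a small made-up example shows quickly: the factor()-approach gives the same ID to all rows of the same country, while seq.int() numbers the rows. The data frame `toy` and its columns are hypothetical.

```r
# Hypothetical panel with repeated countries
toy <- data.frame(country = c("CH", "DE", "CH", "FR", "DE"),
                  year    = c(2000, 2000, 2001, 2000, 2001))

# Possibility 1: one ID per country (levels are sorted alphabetically)
toy$country_id <- as.numeric(factor(toy$country)) # 1 2 1 3 2

# Possibility 2: one ID per row
toy$row_id <- seq.int(nrow(toy)) # 1 2 3 4 5
```

So use the first variant if the ID should identify a group (e.g. a country), and the second one if every observation needs its own ID.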
Standardize a Variable
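- Be careful: The following code computes z-scores (mean 0, standard deviation 1). It does NOT make the variable normally distributed!
data2["standard_score2"] <- as.numeric(scale(data2$totalscore)) # we create the new
# column "standard_score2"; as.numeric() drops the matrix attributes that scale() returns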
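If you wonder what scale() actually computes, the following sketch compares it with the manual formula (value minus mean, divided by the standard deviation). The vector `x` is made up for illustration.

```r
# Hypothetical variable
x <- c(2, 4, 4, 6, 9)

z_scale  <- as.numeric(scale(x))     # scale() returns a matrix, so we flatten it
z_manual <- (x - mean(x)) / sd(x)    # the same computation "by hand"

all.equal(z_scale, z_manual)         # both approaches should agree
```

The standardized variable has mean 0 and standard deviation 1, but the shape of its distribution stays the same as before.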
Create an Interaction Term
data["lrain_2"] <- data$lrain*data$lrain # interaction of 'lrain' with itself, e.g. the squared term
Managing Columns of your Dataset
Rename a Column
names(dd5)[14]<- "teens" # changes the name of the 14th column to "teens"
Change all Column-Names simultaneously
To manually assign the header names // new column names, simply use the following code:
names(data) <- c("Column1", "Column2", "Column3")
Selecting a Subset of Columns in your Dataset by their position within the dataset
You can use the following code to make a “cleaner” dataset, e.g. with a better overview of some columns:
data_test <- data[,39:43] # creates a dataset: it selects all rows, but
# only the columns 39 to 43 (note: c(39,43) would select only columns 39 AND 43)
Selecting a Subset of Columns in your Dataset by their Names
Let’s say, we want to extract only some of the columns like “sport” or “sport_dummy” etc. This can be achieved by using the following code:
test_data <- data[,c("type","sport","sport_dummy", "m_sport", "f_sport", "sport_parent", "sport_mother", "sport_father")]#198 missings due to NAs
Drop Columns
Dropping some existing columns can be achieved with the following code:
data$male<-NULL # this will eliminate the column "male"
Drop unused Categories (= levels) in Categorical Variables
levels(data$type) # check levels: not every category defined is used,
# so let's drop them
data$type <- droplevels(data$type, exclude = if(anyNA(levels(data$type))) NULL else NA)
levels(data$type) # check if it worked (should have only 69 categories left)
Create new Columns in a Dataset
Note that there are multiple ways to create new columns in a dataset.
Create a new Column based on an existing Column | Possibility 1
data_NJ$low <- ifelse(data_NJ$wage_st < 5,1,0) # creates a new variable //
# column "low" for dataset data_NJ
### Alternatively, in order to create a squared term (= interaction of a variable with itself):
data["lrain_2"] <- data$lrain*data$lrain # creates the squared term
Append new Columns to an existing Dataset | Possibility 2
university <- ifelse(sampleUScens2015$educ>=16,1,0) #creates dummy with ifelse() function
data <- data.frame(sampleUScens2015,wage,lw,university) # Using the data.frame()
# function, we create a new data frame that appends the vectors "wage", "lw" and "university"
# (which must already exist) as columns to the already cleaned dataset "sampleUScens2015".
Replace Values within Columns
Here, I want to fill the NA-values of one column with the values that another column has in the same rows:
data_did3$col_na[is.na(data_did3$col_na)] <- data_did3$type[is.na(data_did3$col_na)] # wherever
# "col_na" is missing, take the value from the "type" column
Missing-Values (NAs)
Count Missings
length(data$twinno[is.na(data$twinno)])
Possibility 2 to count Missings
# count number of missings for a variable (here: 'frequency'):
missings <- data[is.na(data$frequency),] # all rows where 'frequency' is missing
nrow(missings) # number of missings (352 in my dataset)
Select all the NA-values from a Column
test_data <- data[is.na(data$type),]
Filter all NAs into a Subset
testo <- subset(test, is.na(corrupt_icrg)==TRUE)
Build a dataset with only Missing Values
Why would you do this?
This may be useful to understand why there are missings in a tabular dataset, because you see all the rows that have an NA-value in the chosen column. Note that this method preserves the other columns, e.g. it selects the entire row, whose other columns do not necessarily contain a missing (which is nice, because this may show us why the chosen column has a missing in it!).
missings <- data[is.na(data$m_physact),] # selects only the ROWs with missings & also selects every column
Compute the Correlation with Missings
cor(data.long$avehigh, data.long$aveweigh, use='pairwise')
Delete Rows which contain Missings
data_subset <- data.wide[ , c("down_exp")] #select the columns from which you want to remove the NA's
df <- data.wide[complete.cases(data_subset), ] # Omit NAs by columns
Delete missings from a Column
testo <- subset(testo, is.na(corrupt_icrg)==FALSE)
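The different NA-operations from this section can be tried out on a small made-up dataset. The data frame `toy` and its columns are hypothetical; the pattern is the same as in the snippets above.

```r
# Hypothetical dataset with two missings in 'score'
toy <- data.frame(id = 1:5, score = c(10, NA, 7, NA, 3))

n_missing     <- sum(is.na(toy$score))      # count the missings
rows_missing  <- toy[is.na(toy$score), ]    # keep only rows WITH a missing
rows_complete <- toy[!is.na(toy$score), ]   # keep only rows WITHOUT a missing
```

Note the comma inside the square brackets: the condition goes before the comma because we are filtering rows, not columns.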
Manual Missing Imputation: Replace a specific Value with another one
Let’s say that we want to replace a specific value within a dataset, because - for example - it is an NA.
data[122,] <- replace(data[122,], "islam1100", 0) #Set "islam1100" in row 122 (= the country Israel) equal to 0 instead of 1
View(data) #check if it worked
Note that it need not be a missing value. It could also be that we had a wrong value somewhere in a row of the dataset.
Creating Datasets
Create a new Dataset with selected Columns
data_sport <- data.frame("sport" = data[,22], "ysport" = data[,32]) #we need to specify the column-names when creating a new data-frame, that's why I wrote "sport" for column 22 [= sport-column in data] & "ysport" for column 32
Create a new Dataset by Filtering
did_wage <- subset(data, gap>0) #creates a new dataset that only contains the rows with gap > 0
Duplicate Data
Show only unique Values, e.g. no Duplicates
Let’s say, our goal is to have an overview of the scope of all values, that the random variable of a specific column can take on. This can be easily achieved with the following code:
unique(sort(data$empstat)) # Because we are using "unique", no duplicates will be shown.
Filtering the Data, e.g. creating Subsets
data_new <- subset(data, sample==1) # Remove observations where sample == 0 from the dataset
data_NJ <- subset(data, sample ==1 & state==1) # 2 criteria
Generating Summary Statistics
For a quick & general Summary of your Data
summary(data) # makes a summary statistics of all the variables // columns
Calculate the Mean
group_means <- aggregate(data_int, by=list(data_int$sport), mean) # mean of every column, computed separately for each value of 'sport'
Calculate the mean, the variance & standard deviation but ignore the Missings in a Column
mean_score <- mean(data2$totalscore, na.rm=TRUE) #ignore the NA's and
# calculate mean of this variable
variance_score <- var(data2$totalscore, na.rm=TRUE) # variance
sd_score <- sd(data2$totalscore, na.rm=TRUE) # standard deviation (= square root of the variance)
Note the input-parameter na.rm = TRUE inside mean(), var() and sd() when we compute the mean, the variance & the standard deviation.
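A short made-up example shows why na.rm = TRUE matters: without it, a single NA turns the whole result into NA. The vector `score` is hypothetical.

```r
# Hypothetical variable with one missing value
score <- c(4, 8, NA, 6)

mean(score)                                # NA, because of the missing value
mean_score <- mean(score, na.rm = TRUE)    # 6
var_score  <- var(score, na.rm = TRUE)     # 4
sd_score   <- sd(score, na.rm = TRUE)      # 2 (= square root of the variance)
```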
Transform a Dataset
Transform your dataset into “Wide”-Format
data.wide <- reshape(data , idvar = "family", timevar = "twinno", direction = "wide") # idvar uniquely identifies each unit (here: family); timevar distinguishes the individual twins. For each twin, a separate set of columns is created.
View(data.wide)
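To get a feeling for what reshape() does, here is a minimal sketch with a made-up twin dataset. The data frame `long` and the column `height` are hypothetical; `family` and `twinno` play the same roles as above.

```r
# Hypothetical "long" data: one row per twin
long <- data.frame(family = c(1, 1, 2, 2),
                   twinno = c(1, 2, 1, 2),
                   height = c(170, 172, 165, 160))

# "Wide" data: one row per family, one height-column per twin
wide <- reshape(long, idvar = "family", timevar = "twinno", direction = "wide")
names(wide) # "family" "height.1" "height.2"
```

The new column names are built from the original variable name plus the value of the timevar, which is why they end in .1 and .2 here.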
Merging
Merge two different Datasets together
Here, I merge 2 datasets via the country code, which is stored in the column “countrycode” in dataset 1) “data” and in the column “ccodealp” in dataset 2) “data_1”. The “new” dataset is called fulldata. It contains only the data-points that were successfully matched in both datasets!
data_1 <- read.csv("~/Uni/Masterstudium/FS_2019/Policy Analysts/Problem Set/PS3/qog_bas_cs_jan19.csv") #Downloaded data csv file
fulldata <-merge(data, data_1, by.x="countrycode", by.y="ccodealp")
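The following sketch shows, on two made-up datasets, which rows survive a merge. The data frames `left` and `right` and their values are hypothetical; `by.x` and `by.y` are used exactly as above. With `all = TRUE`, merge() keeps the unmatched rows as well and fills the gaps with NA.

```r
# Two hypothetical datasets whose key columns have different names
left  <- data.frame(countrycode = c("CHE", "DEU", "FRA"), gdp = c(1, 2, 3))
right <- data.frame(ccodealp = c("CHE", "DEU", "ITA"), corruption = c(0.1, 0.2, 0.3))

# Default: only countries that appear in BOTH datasets are kept
inner <- merge(left, right, by.x = "countrycode", by.y = "ccodealp")

# all = TRUE: keep every country, unmatched values become NA
outer <- merge(left, right, by.x = "countrycode", by.y = "ccodealp", all = TRUE)
```

Comparing nrow(inner) with the number of rows of the original datasets is a quick way to see how many observations you lost through the merge.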
Regression Models
Simple Linear Regression with 1 Covariate // Regressor
model <- lm(data$cigs ~ data$educ) # Estimates a regression model. Note: lm(y ~ x)
Exclude the Constant of a Regression
model7 <- lm(data$normpolity ~ data$arabconquest+data$muslimmajority+data$lrain+data$lrain_2+data$lrain_3 + data$fuel + data$oceania + data$europe + data$asia + data$americas + data$africa - 1)
As you can see, we simply need to add -1 to the basic syntax of a regression model to exclude the constant in a regression.
Summarizing Model-Results
summary(model) # summary statistics of a regression model
Calculation of the Coefficient Estimates via Formula
b_1 <- cov(data$cigs,data$educ)/var(data$educ) #Compute the estimated slope-coefficient
# ("beta 1") with the formula from the econometrics-slides: cov(x, y) / var(x)
b_0 <- mean(data$cigs)-b_1*mean(data$educ) #Compute the intercept,
# also known as "beta 0"
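You can verify that the formulas give the same numbers as lm(). The sketch below uses simulated data, so the variables `educ` and `cigs` and the "true" coefficients are made up for illustration.

```r
set.seed(1)
# Simulated (hypothetical) data
educ <- rnorm(100, mean = 12, sd = 2)
cigs <- 20 - 0.8 * educ + rnorm(100)

# Coefficients "by hand"
b_1 <- cov(cigs, educ) / var(educ)      # slope
b_0 <- mean(cigs) - b_1 * mean(educ)    # intercept

# Coefficients from lm(): (Intercept) first, then the slope
fit <- lm(cigs ~ educ)
all.equal(unname(coef(fit)), c(b_0, b_1))
```

Note that this shortcut only works for a regression with 1 regressor; with several regressors you need the matrix formula (or simply lm()).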
Selecting a Coefficient out of an estimated Regression
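b1 <- reg5$coefficients[1] # note: if the model has a constant, position 1 is the intercept and the first slope-coefficient is at position 2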
Extract the Residuals of a Regression
regres <- lm(data$female~data$educ + data$age)
res <- data$female - predict(regres) # In "words", this would be: `y - y(hat)`
Stargazer
Stargazer is an R-Package that is used in academia to produce beautiful tables in the layout of papers published in academic journals.
Create a beautiful stargazer-Table with multiple labeling of Regressors & with only 1 model
In my workflow, I used to output beautiful stargazer-tables as .html-tables, formatting them with CSS, and then copy-pasting them into my Word-document.
The following code covers the first part of this workflow, e.g. it outputs a table in .html-format.
stargazer(model5, title="OLS Regression",align=TRUE,covariate.labels = c("Muslim Majority", "Average Fertility"), type="text",out="~/Documents/Uni/Masterstudium/FS_2019/Policy Analysts/Problem Set/PS3/Table1.html")
Create a stargazer-Table with > 1 regression model
stargazer(model1, model2, model3, model2_2, title="Different OLS Regressions",align=TRUE, type="text",out="~/Documents/Uni/Masterstudium/FS_2019/Policy Analysts/Problem Set/PS3/Table1.html")
Create a stargazer table but omit some covariates
Why is this useful?
If you are using fixed effects in a regression, I would not recommend printing out all effect-sizes, because you usually end up with too many coefficients (which will only distract the reader from what’s important)!
stargazer(model7, model8, title="Standard OLS Regressions",align=TRUE, type="html", omit=c("fuelendowed","europe", "americas", "africa", "asia", "oceania"), out="~/Documents/Uni/Masterstudium/HS 2018/Empirical Methods/PS4/Table1.html")
Methods
Fixed Effects
To include fixed effects into a regression, we have 2 possibilities.
Possibility 1: in “native” R (no package required)
test <- lm(lifexp ~ log_gdppc + factor(country), data = data) # here I include
# a "country FE" into the regression --> no package needed
summary(test) #check if it worked
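What the factor(country)-dummies do is equivalent to demeaning the variables within each country (the "within"-transformation). The following sketch checks this on simulated data; the data frame `panel` and all its values are made up, and only base R is used.

```r
set.seed(42)
# Simulated (hypothetical) panel: 3 countries with 10 observations each
panel <- data.frame(country   = rep(c("A", "B", "C"), each = 10),
                    log_gdppc = rnorm(30))
panel$lifexp <- 70 + 2 * panel$log_gdppc + rep(c(-5, 0, 5), each = 10) + rnorm(30)

# Version 1: country dummies via factor()
fe_dummies <- lm(lifexp ~ log_gdppc + factor(country), data = panel)

# Version 2: subtract the country-means by hand, then regress without constant
panel$y_dm <- panel$lifexp    - ave(panel$lifexp,    panel$country)
panel$x_dm <- panel$log_gdppc - ave(panel$log_gdppc, panel$country)
fe_within  <- lm(y_dm ~ x_dm - 1, data = panel)

coef(fe_dummies)["log_gdppc"]
coef(fe_within)["x_dm"]
```

Both versions give the same slope-coefficient. The standard errors of the second version are not directly usable, though, because lm() does not know about the degrees of freedom that the demeaning has used up.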
Possibility 2: using the R-Package plm
library(Formula) #you have to load these two packages first for the FE to be included
library(plm)
# Prepare dd to include only country FEs:
data_FE <- pdata.frame(data, index = c("country")) #save the data into a
# special "country FE"-dataset
# Now, we are ready to estimate the country FE regression:
model1 <- plm(lifexp ~ log_gdppc, data=data_FE, model = "within", effect = "individual")
# Note: replace "individual" by "twoways" when you include not only
# country FE but also - for example - a cohort FE
summary(model1) #check if it worked
# With country FEs, as well as time FEs, we (again) prepare the dd first:
data_country_time <- pdata.frame(data, index = c("country", "year")) # plm expects the individual index first, then the time index
model3 <- plm(lifexp ~ log_gdppc, data=data_country_time, model = "within", effect = "twoways")
# The difference here with two FEs is the "twoways" option!
# For a Stargazer-Table, run this code:
stargazer(model3, title="OLS Regression",align=TRUE,covariate.labels = c("GDP per capita (in log)"), type="text",out="~/Documents/Uni/Masterstudium/FS_2019/Policy Analysts/Problem Set/PS5/Table1.html") # note: output is a table in `.html`-format
IV-Regression
#load package first:
library(ivpack)
## IV-Regression with the short command:
lmiv<- ivreg(lnearn~highqua+age+agesq | age+agesq+twihigh , data = data)
## IV-Regression but with robust standard errors
lmiv6<- ivreg(normpolity~arabconquest+fuelendowed+oceania+europe+asia+americas+africa-1 | fuelendowed+oceania+europe+asia+americas+africa+mecca+lrain-1 , data = data_muslim) # it does not matter in which order the instruments are listed on the right-hand side of the |
summary(lmiv6) #check if it worked
lmiv6<-robust.se(lmiv6)#for robust SE
Clustering your Standard Errors
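model4 <- coeftest(model4, vcov=vcovHC(model4,type="HC0",cluster="time")) # clustering SEs; coeftest() comes from the lmtest-package, and the 'cluster' input belongs to the vcovHC()-method of plm, so model4 needs to be a plm-model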
Synthetic Control for Difference-in-Differences (DiD)
The code below will generate a synthetic control-group in my DiD-model.
# Load the package:
library(Synth) #you need this package, otherwise you cannot do a
# synthetic control
# Use the following code to prepare your data:
data.out <- dataprep(
foo = data_full, #plug in your dataset
predictors = c("reform_cap", "GINIp", "GINIc"), # regressors of the dataset you want to use
predictors.op = "mean", # in DiD, you compare means
time.predictors.prior = 1960:1976, # periods BEFORE the treatment
dependent="IGEincome",
unit.variable = "id",
unit.names.variable = "country",
time.variable = "year.x",
treatment.identifier = 97,# look at the id of Switzerland // Treated group
controls.identifier = c(2:96, 98:109), # control groups
time.optimize.ssr = 1960:1976,
time.plot = 1955:1997)
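# Finally, estimate the weights of the synthetic control & plot the result:
synth.out <- synth(data.prep.obj = data.out, method = "BFGS")
path.plot(synth.res = synth.out, dataprep.res = data.out) # treated unit vs. synthetic control over time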
Discrete Choice Modeling
Preparing the y-Variable for modeling multiple Choices
# summarize "choices" into one column by putting all separate "column-decision" into one column:
data_test$choice <- ifelse(data_test$sport_dummy==1 & data_test$sport_dummy_young==1, 1,
ifelse(data_test$sport_dummy==1 & data_test$sport_dummy_young==0, 2,
ifelse(data_test$sport_dummy==0 & data_test$sport_dummy_young==1, 3, 4)
)
)# this is a nested ifelse-function, which should allow me to
# build my "choice" (y)-variable
str(data)#check whether "choice" need to be a categorical-variable --> if not, you need to change it for later manipulations...
Make the “choice”-variable categorical
data$choiceF <- factor(data$choice)# create a new variable that will
# be categorical-variable for "choice"
# for multinomial logit model to work, we need to create a
# "reference category" within the newly created variable "choiceF":
data$choices <- relevel(data$choiceF, ref = "1")# our reference category
# will be the number "1", which corresponds to the people making sport
# in childhood, as well as in adulthood
Create a Multinomial Logit (MNL) Model
library(nnet)# we need this package for multinomial logit regression
model1 <- multinom(choices ~ m_sport + f_educ, data = data)# this is
# a test-model: I want to see if I get the same estimated coefficients
# as the mock-up MNL-regression that my professor sent me.
# Passing 'data = data' (instead of writing data$...) lets predict() work with new data later on.
summary(model1) # --> Yes, every coefficient is (almost) the same!
# Hence, this package is reliable! :)
Make Predictions with the MNL-Model
predict(model1, data)#predictions in terms of the outcome variable (= the most likely category)
predict(model1, data, type = "probs")#predicted probabilities for each category
Check the Performance of my Model
The code below will generate a “confusion matrix”, which shows how many of our classifications were correct / false.
cm <- table(predict(model1), data$choice)#note: only works when you don't have NA's
print(cm)
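From the confusion matrix, you can compute the share of correct classifications (the "accuracy"): the correct ones sit on the diagonal. The sketch below uses made-up vectors `actual` and `predicted` instead of a fitted model, so it runs without any package.

```r
# Hypothetical true categories and predicted categories (8 observations)
actual    <- factor(c(1, 1, 2, 2, 3, 3, 3, 4), levels = 1:4)
predicted <- factor(c(1, 2, 2, 2, 3, 3, 1, 4), levels = 1:4)

cm <- table(predicted, actual)           # rows = predictions, columns = truth
accuracy <- sum(diag(cm)) / sum(cm)      # share of correct classifications
accuracy                                 # 0.75, since 6 out of 8 are correct
```

Fixing the same levels for both factors makes sure that the table is square, even if the model never predicts one of the categories.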
Data Visualisations
Make a simple Scatter Plot
With the following code, you can visualize the correlation between 2 variables:
plot(data$educ, data$cigs, main = "Years of Education vs. Cigarettes smoked per day", xlab = "Years of education", ylab="Cigarettes smoked per day") # Note: plot(x-variable, y-variable)
abline(model, lwd = 2, lty = 3) #lwd = line width; lty = line type
Plotting a simple Histogram
hist(data$wdi_mort, main ="Distribution of infant mortality", xlab = "Mortality Rates",ylab = "Frequency of Mortality Rates",col = "red3", ylim=c(0,110)) #make a histogram of child mortality rates
Plotting Confidence Intervals
Konf <- predict(reg1, interval="confidence", level=.95) # note: you need to
# estimate a linear regression in order to plot the confidence intervals
with(data, plot(data$corruptionun, data$mortalityun, main = "Corruption vs. Mortality", xlab = "Corruption", ylab= "Mortality"))
abline(reg1, lwd = 2, lty = 3)
lines(x = data$corruptionun[order(data$corruptionun)],y= Konf[order(data$corruptionun),2],lwd=2,col= 2)
lines(x = data$corruptionun[order(data$corruptionun)],y= Konf[order(data$corruptionun),3],lwd=2, col= 2)
Make a Residual Plot
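n <- length(res)
reslag <- c(NA, res[-n]) # Shifts the position of the residuals in the vector forward by 1 (base R's lag() does not shift the values of a plain vector)
cor(res[2:n], reslag[2:n]) #Forms correlation. Be careful: if you start from 1, it does not work, because the first lagged value is NA!!!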
Make a Density Plot (of Residuals)
plot(density(res))
Plotting Autocorrelation
The following code will plot the autocorrelation of the residuals, e.g. their correlation with their own lagged values:
acf(res, main ="Autocorrelation of the residuals")
Statistical Computations
Calculate the t-statistic “by Hand”
In the example below, we estimated a model and we now want to compute the t-statistic of our 3rd regressor. This is achieved via the following code:
se <- sqrt(diag(vcov(reg5)))
t.stat <- reg5$coefficients[4]/se[4] #note: in our model, there is a
# "constant" [= intercept], that's why we need to select position 4 and not 3
t.stat
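To convince yourself that the "by hand" calculation is correct, you can compare it with the t-values that summary() reports. The sketch uses simulated data; the data frame `d` and its coefficients are made up.

```r
set.seed(7)
# Simulated (hypothetical) data with 2 regressors
d <- data.frame(x1 = rnorm(50), x2 = rnorm(50))
d$y <- 1 + 0.5 * d$x1 - 0.3 * d$x2 + rnorm(50)

fit <- lm(y ~ x1 + x2, data = d)

se        <- sqrt(diag(vcov(fit)))                    # standard errors
t_manual  <- coef(fit) / se                           # t-statistics "by hand"
t_summary <- summary(fit)$coefficients[, "t value"]   # t-statistics from summary()

all.equal(unname(t_manual), unname(t_summary))
```

Here, all t-statistics are computed at once; selecting a single one works with square brackets, as shown above.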
A more complicated t-test calculated “by Hand”: beta4 - beta7 = 0
reg19 <- lm(sam$lw ~ sam$educ +sam$age+sam$childrenly+sam$Bus+sam$hea+sam$tech+sam$scie) # 7 regressors
# and 1 constant = 8 estimated coefficients
summary(reg19)
V <- vcov(reg19) # variance-covariance matrix of the coefficients (not named "cov", to avoid masking the cov()-function)
se19 <- sqrt(V[5,5]+V[8,8]-2*V[5,8]) #note: we have a constant,
# that's why "beta4" is at position 5 and "beta7" at position 8
t.stat19 <- (reg19$coefficients[5]-reg19$coefficients[8])/se19 #don't
# forget the intercept when selecting the coefficients! --> make it "+1"...
p.val <-2*pt(-abs(t.stat19),df=reg19$df.residual)
p.val
Calculate the F-statistics “by hand”
reg7 <- lm(s$lw ~ s$educ +s$age+s$female+ s$educ*s$female+s$female*s$age) #5
# regressors & 1 constant = 6 coefficients are getting estimated
summary(reg7) #this is the unrestricted model with all regressors
R <- rbind (c(0 ,0 ,0,1 ,0,0) , c(0 ,0,0,0 ,1 ,0),c(0,0,0,0,0,1)) #put a 1 for
# the coefficients you want to test --> note: the first 0 is for the constant!
r <- c(0 ,0,0) # right-hand side of the 3 tested equations (one 0 per restriction)
ftest <- linearHypothesis ( reg7 , hypothesis.matrix =R, rhs=r, vcov = vcovHC ( reg7 ,"HC1"))
regrest <- lm(s$lw ~ s$educ +s$age) #this is the restricted model without
# the tested regressors
summary(regrest)
f.test <- ((sum((regrest$residuals)^2)-sum((reg7$residuals)^2))/3)/(sum((reg7$residuals)^2)/(reg7$df.residual))
f.test
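crit_value <- qf(0.95, df1=3, df2=reg7$df.residual) # 5% critical value; df2 = residual degrees of freedom of the unrestricted model (561073 in my dataset)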
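The F-statistic "by hand" (with the usual, non-robust formula) should match what anova() reports when you compare the restricted with the unrestricted model. Here is a sketch on simulated data; the data frame `d` and the variable names are made up.

```r
set.seed(3)
# Simulated (hypothetical) data
d <- data.frame(x1 = rnorm(80), x2 = rnorm(80), x3 = rnorm(80))
d$y <- 1 + d$x1 + 0.2 * d$x2 + rnorm(80)

unrestricted <- lm(y ~ x1 + x2 + x3, data = d)
restricted   <- lm(y ~ x1, data = d)      # H0: coefficients of x2 and x3 are 0
q <- 2                                    # number of restrictions

ssr_u <- sum(residuals(unrestricted)^2)
ssr_r <- sum(residuals(restricted)^2)
f_manual <- ((ssr_r - ssr_u) / q) / (ssr_u / unrestricted$df.residual)

f_anova    <- anova(restricted, unrestricted)$F[2]   # same test via anova()
crit_value <- qf(0.95, df1 = q, df2 = unrestricted$df.residual)
```

You reject the null hypothesis at the 5%-level if the F-statistic is bigger than crit_value. Keep in mind that this formula assumes homoskedastic errors, so it will generally differ from a test based on robust standard errors.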
Monte Carlo Simulation “by hand”
N = 10 # sample size you draw --> change this variable if you want
R = 200 # this is (the end of) your counter --> number of times you repeat
# the random draws --> change this variable if you want
x_r <- mat.or.vec(R,1) # Creates a 0-Vector of length "R" (= 200) and
# with 1 dimension, e.g. we have a vector here (not a matrix)...
# make a for-loop:
for(i in 1:R){
x <- rexp(N) # Random draws with sample-size 'N', e.g. 10 random draws in
# this case...
meanx_r <- mean(x) # Computation of our'Sample-Average' (= random variable)
x_r[i]<-meanx_r # Save the 'Sample-Average' in the i-th position of the
# vector we created above.
}
hist(x_r, main ="Distribution of x^r with N=10 and R=200", xlab = "x^r",ylab = "Frequency of x^r",col = "red3")
meanx <- mean(x_r) # Takes the mean of all randomly generated
# 'Sample-averages'
varx <-var(x_r) # Takes the variance of all randomly generated
# 'Sample-averages'
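The same simulation can be written more compactly with replicate(), which repeats an expression R times and collects the results. This is only an alternative sketch of the loop above; set.seed() is added so that the random draws are reproducible.

```r
set.seed(123) # makes the random draws reproducible
N <- 10       # sample size of each draw
R <- 200      # number of repetitions

# Draw N values from an exponential distribution, take their mean, repeat R times
x_r <- replicate(R, mean(rexp(N)))

meanx <- mean(x_r) # should be close to 1 (= mean of rexp() with the default rate)
varx  <- var(x_r)  # should be close to 1/N = 0.1
```

If you increase N, the variance of the sample-averages shrinks and the histogram of x_r looks more and more like a normal distribution.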
Simulating a Normal Distribution over a Sequence
Step one
We create a sequence of values.
sequence <- seq(-50, 50, by = 0.1) # we create this sequence to define a set
# of X variables. It is important that we make sure to incorporate the
# whole scope of values the residuals can take on as our sequence,
# which is why we enter a minimum as the starting-point and the maximum
# as the endpoint.
Step two
Next, we tell R to construct a normal distribution over the given sequence. This way, we can see which values will appear with high probability within the normal distribution.
normal <- dnorm(sequence, mean = 0, sd = sd(res)) # evaluates the density of a normal distribution (mean 0, standard deviation of the residuals) at every value of the sequence
General Project Management
Packages to make a Full Econ-Project
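library(sandwich) # robust variance-covariance matrices, e.g. vcovHC()
library(lmtest) # to test regression coefficients, e.g. coeftest()
library(car) # linearHypothesis() for F-tests
library(foreign) # to read data from other statistical software
library(stargazer) #to make tables that you can use in Word or picture them as HTML tables
library(readstata13) #you need this package, otherwise you cannot read the Stata data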