
DataFrame manipulation in R from basics to dplyr

 

In my surroundings at work I see quite a few people managing their data in spreadsheet software like Excel or Calc. These programs will do the job, but I tend to do as little data manipulation in them as possible and to turn my spreadsheets into csv files as soon as I can, bringing the data into R where every manipulation I do is recorded by default in the history (if you use RStudio) or in scripts if you document your work (which should always be the way to go). The aim of this post is to show how to do some common data manipulations (ie subsetting, summarising, ordering …) in R. As always there are a thousand ways to do any operation; I will go through the basic, vector-based approach of R and then, at the end, show how newer libraries let you do these manipulations on data frames with code that is easy to understand for those not (yet) grasping the magic of vector-based operations. (As always a nicer RPubs version of this article is available at: http://rpubs.com/Lionel/33216; if anyone around knows how to transfer .Rmd files to a WordPress blog I'll be glad to hear about it.)

#################
#Data management#
#################


#the data frame I will use
data<-data.frame(Factor1=rep(LETTERS[1:5],each=20),Factor2=sample(letters[1:10],100,replace=TRUE),Var1=rnorm(100,2,4),Var2=rpois(100,2))
#some simple summary
summary(data)
table(data$Factor1)
table(data$Factor2)

#####basic way using vectors######
#subsetting
#only keep observations with Factor1 equal to A
sub1<-subset(data,Factor1=="A")
#only keep observations with Factor1 equal to A and Var2 lower than 4
data$Factor1=="A" #this returns a logical vector that can then be used to index the rows
sub2<-data[data$Factor1=="A" & data$Var2<4,]
summary(sub2)
#only keep every third row
head(data[seq(1,nrow(data),3),])
#only keep rows 2,6,13,22 from columns 1 and 4
data[c(2,6,13,22),c(1,4)] #when the numbers follow each other you can use :, ie 1:10
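#a quick illustration of the ':' shorthand mentioned above (just an example, any consecutive range works)
data[1:4,c(1,4)] #rows 1 to 4 from columns 1 and 4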

#summarising
library(plyr)
#get the mean value and standard error of Var1 for each level of Factor1
rbind.fill(by(data,data$Factor1,function(x) return(data.frame(Factor1=unique(x$Factor1),Mean=mean(x$Var1),SE=sd(x$Var1)/sqrt(length(x$Var1))))))
#get the 25% and 75% quantile for Var2 for each level of Factor2
rbind.fill(by(data,data$Factor2,function(x) return(data.frame(Factor2=unique(x$Factor2),Q_25=quantile(x$Var2,prob=0.25),Q_75=quantile(x$Var2,prob=0.75)))))

Wow, these two last calls can seem rather intimidating at first, but as always you need to start at the centre and then work your way out to understand what is happening. Let's look at the first one. We call an unnamed function on the data frame data and apply it to each level of data$Factor1 separately; these chunks of data are passed to the function and called x. The function returns a data frame made of three columns: the first one, named Factor1, takes the unique value present in the Factor1 column of the chunk x; the second one takes the mean of the Var1 values; the third one divides the standard deviation of the Var1 values by the square root of the number of observations (giving the standard error around the mean). As the by function returns a series of data frames, we can combine them into one data frame using rbind.fill. These are rather long lines of code; keep them in mind, as at the end of the post you will see how to do this in a different way.
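If the nesting is hard to read, here is the same computation broken into steps; a small sketch reusing the data object and the plyr library loaded above:

#by() returns a list of one-row data frames, one per level of Factor1
chunks<-by(data,data$Factor1,function(x) data.frame(Factor1=unique(x$Factor1),Mean=mean(x$Var1),SE=sd(x$Var1)/sqrt(length(x$Var1))))
chunks[["A"]] #the summary computed on the chunk where Factor1 is A
rbind.fill(chunks) #stack the per-level data frames back into a single data frame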

#changing column order
data<-data[,c(1,4,3,2)]
head(data)
#also works with column names
data<-data[,c("Factor1","Var1","Factor2","Var2")]
head(data)
#sorting the rows first by Factor1 then by Factor2
data<-data[do.call(order,list(data$Factor1,data$Factor2)),]
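#the same sorting written with order() directly, an equivalent and perhaps more readable form (shown here just for comparison)
head(data[order(data$Factor1,data$Factor2),])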

######increasing complexity, switching from long to wide format########
library(reshape2)
#the long format keeps the info on a grouping variable (eg Sex) in one column instead of making a separate column for each level
#the object data is for example in a long format, we may want to make a separate column for each level of Factor1 and store Var1 in the rows
data$Observation<-rep(1:20,times=5)
data_wide<-dcast(Observation~Factor1,data = data,value.var = "Var1") #the left-hand side of the formula is the variable that will make up the rows, the right-hand side the columns
#if certain combinations are missing one can use the fill argument
data_wide<-dcast(Factor2~Factor1,data=data,fun.aggregate = length,fill=0) #here we count how many observations there are for each level of Factor2 and Factor1
#other functions can be provided if more than one value is present in each cell
data_wide<-dcast(Factor2~Factor1,data=data,fun.aggregate = sum,value.var="Var2",fill=0)
#turning the data back to a long format
data_long<-melt(data_wide,value.name = "Sum_Var2",id.vars="Factor2",variable.name = "Factor1") #melt the data frame, id.vars corresponds to the column(s) containing the factor info
#long format are then pretty handy to use for plotting
library(ggplot2)
ggplot(data_long,aes(x=Factor2,y=Sum_Var2,colour=Factor1))+geom_point()
#but it is also the way the data should be structured for data analysis:
lm(Sum_Var2~Factor2+Factor1,data_long)

(Figure: Sum_Var2 plotted against Factor2, coloured by Factor1)

For more about long and wide format you can also look at the great article in the R cookbook on this: http://www.cookbook-r.com/Manipulating_data/Converting_data_between_wide_and_long_format/.
Now let's turn to a new library that came to my attention recently and that is extremely elegant and easy to use. More info on this library: http://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html

#####using dplyr to make all data manipulation easy######
library(dplyr)
#the five main verbs of dplyr; dplyr works on data frames instead of vectors, which makes data frame manipulation much more straightforward
filter(data,Factor1%in%c("A","D"),Var1>=0) #similar to subset
head(select(data,contains("factor",ignore.case=TRUE))) #only returns specific columns, see ?select for more possibilities
head(arrange(data,Factor1,Var1))
head(mutate(data,Var3=Var1+Var2,M_1=(Observation+Var2)/length(Var2)))
summarise(data,sum(Var2))
#summarise becomes extremely handy when used with group_by
data_d<-summarise(group_by(data,Factor1),Mean=mean(Var1),SE=sd(Var1)/sqrt(n())) #remember the huge by function needed to get the same results
#the n() function is built into dplyr and counts how many elements there are in each group
#going from the full dataset to a graph summarising mean differences between factor levels is swift and painless using these functions
ggplot(data_d,aes(x=Factor1,y=Mean))+geom_point(colour="red",size=3,show_guide=FALSE)+geom_errorbar(aes(ymin=Mean-2*SE,ymax=Mean+2*SE),width=.1)

(Figure: mean of Var1 for each level of Factor1 with ±2 SE error bars)
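The same summary can also be written as a chain that reads from left to right. A minimal sketch, assuming your version of dplyr provides the %>% pipe operator (re-exported by recent versions of the package):

data %>% group_by(Factor1) %>% summarise(Mean=mean(Var1),SE=sd(Var1)/sqrt(n())) #same result as data_d above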

As always, as you dig deeper into these topics you see that the options are extremely numerous, which makes R extremely enjoyable for data manipulation once the basics are understood. Since R is used nowadays for most of the data analysis (in my field of work at least), it seems natural to me to bring the data into R as soon as possible, to really play with it and grasp its structure, instead of only fitting linear models in R and then using other software to make plots or look at basic patterns in the data. Enjoy your data manips'!

First scientific conference; Inspiration, Meetings and Walking

gfoe

This week I attended my first scientific conference and I will give here some thoughts and impressions I kept from it. The conference was located in Hildesheim and gathered scientists from the German (and also Austrian and Swiss) ecological society. For those who do not know how these conferences work, they consist of a combination of keynote talks (presentations given by influential scientists, lasting an hour and covering research done around the theme of the conference) and standard talks given by Masters or PhD students, post-docs or professors and lasting 12 minutes. After each of these talks there is usually some time to ask (pertinent) questions and discuss a bit. Time is THE limiting factor in these conferences; with sometimes 8 talks in a row in some sessions (talks on a similar topic), the chairs (the guy/gal presenting the speakers and responsible for the timetable) are forced to limit discussion to only one or two short questions and answers. In between the keynote talks and the sessions are the coffee breaks, where most of the social interaction takes place around coffee. We also sometimes like to have fun, so there was a nice club night where, if you were lucky, you could see your supervisor dancing, a very rare sight.

Now more on the content: we got some very engaging keynote talks, especially Robert Beschta discussing the impacts of re-introducing large carnivores on the ecosystems of national parks in North America, but also Frans Vera presenting his work on conserving habitats in the Netherlands using herbivores (in large numbers) to reintroduce natural dynamics into the plant and animal communities. These two talks were very nicely linked, both arguing that the standard conservation practices in use in Europe and the USA promote a very non-natural stable state of the ecosystem: these practices strive to freeze communities in a particular composition of emblematic species without taking into account the added richness that dynamic systems can bring through their heterogeneity. Due to the interest raised by their talks, these two guys offered an extra question session where they answered further questions. They were pretty convincing in their arguments, and were also honest enough to admit that they were not able to predict the outcome of certain management actions taken up today. For me this is one of the big challenges of biodiversity research in the years to come and one way to tackle the challenge of biodiversity loss. This was again highlighted in another session on IPBES, where some interesting talks outlined the roles and issues that ecologists should tackle; one is clearly being able to predict future biodiversity states under various land-use and climate change scenarios. We are still very far from this.

Another great session for me was the citizen-science session, where the speakers did a great job of pointing out that science for itself, without public participation and interest, is doomed in a world of diminishing funding. One number: the European Commission wants to get 5 million Europeans to participate in such programs in the next 5 years. There are now quite a few case studies of citizen-science programs that worked, but also projects that did not. The key point here is to combine what society wants with what society needs. In the UK they already have quite a lot of experience on these issues, as was presented by Michael J. O. Pocock; they have the Biological Records Centre with a lot of information on what they do and some publications on how to develop these projects.

I also gave a talk on my own (little) research; it went alright, it could have been better but I was not too disappointed with myself. I presented some Structural Equation Models testing various hypotheses about the links between changes in plant diversity and variation in arthropod herbivore and carnivore species richness and biomass. This is based on work that has already been done in the US but that we want to extend to our system/communities (see this and this paper).

Then on the last day we went to the Harz National Park, a very foggy place but very interesting due to the impacts of human activities on the forest communities found there. Since it was a big mining area from the 16th century onwards, the forests were completely cleared several times to provide enough timber for the mines and fuel for energy; from the 17th century the natural forest types, mainly made of beech trees, were replaced by spruce, which grows faster. Also, due to the heavy demand for water from these mining activities, an impressive network of canals and dikes was built, affecting the water flow. Finally, humans also tried to exploit the peat as a fuel source to replace the wood that was still growing, destroying quite a few bogs of the region along the way. Fortunately the wet climatic conditions in the area prevented the full-scale development of this trade and it was abandoned after a few decades. After the reunification of the two Germanys it was decided to let natural processes govern the communities there; the dead wood was therefore left in the forest, allowing an explosion of the bark beetle populations, decimating the spruce forests and leaving some space for new tree species to come back (hazel …). The managers also let the water table gradually rise in some places, and walking there you see dying stands of spruce that will soon (in the course of the next 50 years) be turned into bogs. The walk was cold and wet but it was again very inspiring to think how human practices and management decisions have such a huge impact on natural communities, the issue being that we cannot predict exactly what will happen if we take a particular decision, but also that there can be conflicts between how society sees nature conservation (forest everywhere with nice birds) and how natural processes shape these habitats (ecological succession …).

Hopefully the next conference for me will be the BES/SFE meeting in my home country, in Lille in December!

Plotting SEMs in R using semPlot

This is a short post presenting the great package semPlot developed by Sacha Epskamp (check out his nice website: http://sachaepskamp.com/) to make nice plots from your SEMs. SEMs are a modelling tool that allows the researcher to investigate complex relationships between variables; you can find many links to free tutorials here: http://www.structuralequations.org/. Here I present a few tricks to plot SEMs in R that I have been using, but if you look at ?semPaths or ?qgraph you will see that there are many, many options to tune your graphs that I will not present here.

#loading the libraries
library(semPlot)
library(lavaan)
library(MASS) #for the mvrnorm function used below
library(clusterGeneration) #this is to generate a positive definite covariance matrix
#simulate some data
set.seed(1222)
sig<-genPositiveDefMat("onion",dim=5,eta=4)$Sigma #the covariance matrix
mus<-c(10,5,120,35,6) #the vector of the means
data<-as.data.frame(mvrnorm(100,mu=mus,Sigma=sig)) #the dataset
names(data)<-c("CO2","Temp","Nitro","Biom","Rich") #giving it some names
#building an SEM with a latent variable
m<-'Abiot =~ CO2 + Temp + Nitro
Biom ~ Abiot
Rich ~ Abiot + Biom'
m.fit<-sem(m,data)

#the plot
#basic version, the what argument specifies what should be plotted, here we choose to look at the standardized path coefficients
semPaths(m.fit,what="std",layout="circle")

(Figure: default semPaths output, circle layout with standardized path coefficients)

This is just a basic version of the plot; many things can be changed. I will focus on the layout of the graph, the labels going into the nodes and the different groups of variables.

 

#define the label that will go into the nodes
lbls<-c("CO2\nconcentration","Temperature","Nitrogen\ncontent","Plant\nbiomass","Plant\nrichness","Abiotic\nenvironment")
#define the groups
grps<-list(Abiotic=c("CO2","Temp","Nitro","Abiot"),Plant=c("Biom","Rich"))
#define the layout
ly<-matrix(c(-0.5,-0.5,0,-0.5,0.5,-0.5,0,0.5,-0.5,0.5,0,0),ncol=2,byrow=TRUE)
#new plot
semPaths(m.fit,what="std",layout=ly,residuals=FALSE,nCharNodes=0,groups=grps,color=c("brown","green"),nodeLabels=lbls,sizeMan=8,posCol=c("blue","red"),edge.label.cex=1.5,legend=FALSE)
text(0.9,0.9,labels="Some text about\nthe model or\nthe weather in Indonesia")

 

(Figure: customised semPaths plot with home-made layout, coloured node groups and custom labels)

In this new plot I used the layout argument to specify my home-made layout; the plot lives in a (-1,1)(-1,1) space and the position of each node can be specified using a 2-column matrix containing the X and Y positions. To find out the order of the nodes and the edges one can do something like:

 

semPaths(m.fit,what="std",nodeLabels=letters[1:6],edgeLabels=1:12,edge.label.cex=1.5,fade=FALSE)

 

(Figure: semPaths plot with the nodes labelled a-f and the edges numbered 1-12)

Using this knowledge we can define our own node labels using the nodeLabels argument (the \n adds a line break in the label). Finally, the groups argument needs a list with character vectors of the different groups; the nodes belonging to each group can get a particular colour defined by color. sizeMan controls the size of the nodes and posCol the colour of the edges; when two colours are given the first is used for the positive edges and the second for the negative ones.

Again, just by looking at the help pages of semPaths you will see many more ways to tailor your graphs to your needs. Happy plotting.

Using bootMer to do model comparison in R

Setting the right random effect part in mixed effects models can be tricky in many applied situations. I will not talk here about choosing whether a grouping variable (sites, individuals …) should be included as a fixed or as a random term; please see Gelman and Hill (2006) and Zuur et al (2009) for more information. Here I will present the use of the bootMer function in the package lme4 to compare two models with different random effect specifications and decide whether one model does a (significantly) better job at fitting the data. The standard way to compare two models is to derive the likelihood ratio test (LRT) value and, since this should follow a chi-square distribution, derive a p-value corresponding to the probability of observing such an extreme LRT under the null hypothesis that both models perform equally well. This approach works relatively well for GLMs, but for (G)LMMs several problems arise, mainly due to boundary effects (the null hypothesis being in this case that the variance of the random effects is 0), see Bolker et al (2009). One way to do model comparison in (G)LMMs is to derive bootstrapped likelihood values from the two competing models and to draw confidence intervals around the observed values to decide whether one model performs better than the other. Below is some code with simulated data (a cleaner version with more graphs can be found here: http://rpubs.com/hughes/22059):

library(lme4)
library(arm)
library(RColorBrewer)

##### work on model comparison using bootMer ##### simulate some data and fit a
##### random intercept model to them
x <- runif(100, 0, 10)
# the grouping variable
site <- gl(n = 10, k = 10)
# the random intercept effect, the simulated standard deviation around the
# intercept is 1
rnd <- rnorm(10, 0, 1)
# the simulated response variable, note that the fixed effect coefficients
# are 1 for the intercept and 3 for the slope. Also the simulated residuals
# will have a standard deviation of one
y <- rep(1 + rnd, each = 10) + 3 * x + rnorm(100, 0, 1)
# fit the model using Maximum Likelihood to be able to use the LRT
m1 <- lmer(y ~ x + (1 | site), REML = FALSE)

# simulate to generate credible intervals
simu <- sim(m1, n.sims = 1000)
# a new model matrix with ordered and equally spaced predictor values
new.x <- model.matrix(~x, data = data.frame(x = seq(0, 10, length.out = 100)))
new.y <- matrix(ncol = 1000, nrow = 100)
# get the predicted response values for each 1000 simulations of the fixed
# effect model parameters
new.y <- apply(simu@fixef, 1, function(x) new.x %*% x)
# compute the lower/upper quantile
lower <- apply(new.y, 1, function(x) quantile(x, prob = 0.025))
upper <- apply(new.y, 1, function(x) quantile(x, prob = 0.975))
median <- apply(new.y, 1, function(x) quantile(x, prob = 0.5))

# nice plot
pal <- brewer.pal(10, "RdYlBu")
plot(y ~ x, col = rep(pal, each = 10), pch = 16)
lines(new.x[, 2], median, col = "blue", lwd = 2)
lines(new.x[, 2], lower, col = "red", lwd = 2, lty = 2)
lines(new.x[, 2], upper, col = "red", lwd = 2, lty = 2)

(Figure: simulated data coloured by site with the median regression line and 95% credible interval)

# fit a second model with a random slope effect
m2 <- lmer(y ~ x + (x | site), REML = FALSE)

# using bootMer to compute 100 bootstrapped log-likelihood
b1 <- bootMer(m1, FUN = function(x) as.numeric(logLik(x)), nsim = 100)
b2 <- bootMer(m2, FUN = function(x) as.numeric(logLik(x)), nsim = 100)

# the observed LRT value
lrt <- as.numeric(-2 * logLik(m1) + 2 * logLik(m2))
# the 100 bootstrapped LRT
lrt.b <- -2 * b1$t + 2 * b2$t
# plot
quant <- quantile(lrt.b, probs = c(0.025, 0.975))
plot(1, lrt, xlab = "", ylab = "Likelihood ratio test", xaxt = "n", ylim = c(quant[1] + 
    1, quant[2] + 1))
abline(h = 0, lty = 2, lwd = 2, col = "red")
segments(1, quant[1], 1, quant[2], lend = 1)

In the above example the 95% CI of the bootstrapped LRT crosses the 0 line, which means that neither model fits the data better than the other. In this case the rule would be to use the simplest model (the one with the fewest parameters), which is the random-intercept model.

Let's look at another example:

# now simulate data from random intercept/ slope
rnd.slope <- rnorm(10, 0, 0.5)
y <- rep(1 + rnd, each = 10) + rep(3 + rnd.slope, each = 10) * x + rnorm(100, 
    0, 1)

# the new models
m3 <- lmer(y ~ x + (x | site), REML = FALSE)
m4 <- lmer(y ~ x + (1 | site), REML = FALSE)

# LRT the observed values
lrt <- -2 * logLik(m4) + 2 * logLik(m3)
# the bootstrap
b3 <- bootMer(m3, FUN = function(x) as.numeric(logLik(x)), nsim = 100)
b4 <- bootMer(m4, FUN = function(x) as.numeric(logLik(x)), nsim = 100)

# the 100 bootstrapped LRT
lrt.b <- -2 * b4$t + 2 * b3$t

# the nice plot
quant <- quantile(lrt.b, probs = c(0.025, 0.975))
plot(1, lrt, xlab = "", ylab = "Likelihood ratio test", xaxt = "n", ylim = c(0, 
    quant[2] + 1))
abline(h = 0, lty = 2, lwd = 2, col = "red")
segments(1, quant[1], 1, quant[2], lend = 1)

In this second example the random intercept/slope model fits the data much better than the random intercept model; this random effect structure should be kept. As mentioned in Bolker et al (2009), whether the LRT is relevant depends on the design and on the interest that is put on the random terms. In cases where the random terms arise from the particular design of the study (sites, blocks …) and are considered a "nuisance", they may be included in the models without testing for the increase in fit that their inclusion provides. In cases where the random term effects are of interest (individual sampling units …), using the LRT is a sensible way to detect and interpret the effect of the random terms. The function PBmodcomp in the package pbkrtest allows one to do all of the preceding in just one line, with various ways to test for the significance of the likelihood ratio (thanks to Ben Bolker for his comment).
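For reference, here is a minimal sketch of what that one-line alternative could look like for the second pair of models; the number of simulations is kept small purely for illustration:

library(pbkrtest)
PBmodcomp(m3, m4, nsim = 100) #parametric bootstrap comparison of the random slope model (m3) against the random intercept model (m4)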

Biblio:
Bolker, B. M., Brooks, M. E., Clark, C. J., Geange, S. W., Poulsen, J. R., Stevens, M. H. H., & White, J. S. S. (2009). Generalized linear mixed models: a practical guide for ecology and evolution. Trends in ecology & evolution, 24(3), 127-135.
Gelman, A., & Hill, J. (2006). Data analysis using regression and multilevel/hierarchical models. Cambridge University Press.
Zuur, A., Ieno, E. N., Walker, N., Saveliev, A. A., & Smith, G. M. (2009). Mixed effects models and extensions in ecology with R. Springer.

Regular expression and associated functions in R

When working with strings, regular expressions are an extremely powerful tool to look for specific patterns. In informatics a string is several characters put together; this can be words, sentences, or DNA code. Regular expressions were developed in the fifties (thanks to Jeff Newmiller for his comment) by computer scientists (https://en.wikipedia.org/wiki/Regular_expression) and I discovered them using Perl (http://www.perl.org/). They have since been implemented in various other programming languages due to their nice functionalities. Here I will present the functions in R that use regular expressions and some general uses of regular expressions in ecology. As always a cleaner version of the post is available at: http://rpubs.com/Lionel/19068

#regular expression functions, there are 7 functions that can use regular expressions
?grep

These functions look for a certain pattern in the provided string(s), which may be a vector of strings.

# example using grep
words <- c("abc", "Cde", "Truc", "Machin")  #a vector of string
grep("c", words)  #looking for 'C' in each element of the vector, this return the index of the element in the vector containing the pattern ('C')
grep("c", words, value = TRUE)  #same but return the element of the vector containing the pattern
# by default grep is case-sensitive, can be turned to case-insensitive
grep("C", words, ignore.case = TRUE)
# example using grepl
grepl("c", words)  #return a vector of logical indicating if the pattern was found in the elements of the vector
# example using sub and gsub, these two functions replace the pattern with a
# replacement value specified by the user
species <- c("Rattus_domesticus", "Formica_rufa", "Germanium_pratense_germanica")
sub("_", " ", species)  #sub will only replace the first occurence of the pattern in the string
gsub("_", " ", species)  #gsub will replace all occurences
# example using regexpr and gregexpr
species <- c("Onthophagus_vacca", "Copris_hispanus", "Carabus_hispanus_hispanus")
regexpr("hisp", species)  #regexpr return the position in the string of the first occurence of the pattern as well as the length of the pattern matched
gregexpr("hisp", species)  #gregexpr will return the position of all matched occurence of the pattern
regexec("hisp", species)  #similar as regexpr but with a different output formatting

As seen in the few examples above, when we have a clear idea of the pattern we want to match, we can just use it as the pattern argument in these functions. However, sometimes we do not know the exact form of the pattern, or we want to match several closely related strings at once; this is where regular expressions come into play. They are an abstract way to represent the different elements (letters, digits, spaces …) present in strings.

# the regular expression help page in R
?regex
# regular expression with 5 different strings
words <- c("Bonjour", "Bienvenue", "Au revoir", "A la bonne heure", "2 heures")
#'\\w' will match any word character (letters, digits and the underscore) present in the strings
grep("\\w", words)  #there are word characters in all elements of the vector
# now if we want to keep only elements containing a run of at least 7 word characters we use:
grep("\\w{7}", words)
# we can also match digits using '\\d', space '\\s' and their negation:
# '\\W','\\D','\\S' could you guess what the coming regular expression
# will match?
grep("\\w{2}\\W\\w+", words, value = TRUE)
# by placing '^' we match from the beginning of the string, similarly $ will
# match the end
grep("^\\w{2}\\W\\w+$", words, value = TRUE)
# a last one using '\\d'
grep("\\d\\W\\w+", words, value = TRUE)

A few comments: using {n} will match an item exactly n times, {n,} n or more times, and {n,m} at least n times but no more than m times. We can also use + to match an item one or more times and * to match it zero or more times. Have a look at ?regex where everything is explained at length.
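To make these quantifiers a bit more concrete, here are a couple of extra calls reusing the words vector defined above (a small illustrative sketch):

grep("n{2}", words, value = TRUE) #a literal 'n' repeated exactly twice, matching the 'nn' in 'bonne'
grep("\\w{9,}", words, value = TRUE) #nine or more word characters in a row, only 'Bienvenue' qualifies
grep("^\\w+$", words, value = TRUE) #one or more word characters spanning the whole string, so no spaces allowed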

Now, I use regular expressions most of the time to format labels or species names; this is where gsub in combination with regular expressions becomes very handy. For example:

# We have three labels with a plot ID BX followed by a genus species
# information with 3 letters each (Poatri = Poa trivialis), we would like to
# have the first letter for the genus and the species as upper-case
species <- c("B1_Poatri", "B2_PlaLan", "B3_lAtPRA")
# in sub and gsub we can put part of the pattern between parentheses to call
# it in the replacement argument
gsub(".{5}(\\w{1}).*", "\\1", species)
# now we use the argument perl=TRUE to use the \\U and \\L special
# symbols that set the following letters to upper and lower-case
# respectively
gsub("(.{3})(\\w{1})(\\w{2})(\\w{1})(\\w{2})", "\\1\\U\\2\\L\\3\\U\\4\\L\\5", 
    species, perl = TRUE)

Here ends this first overview of regular expressions in R; I use them quite often for formatting strings when I don't want to spend hours with Calc. There are many subtleties not covered here (UTF-8 mode, perl mode …), but this information should be enough to get everyone started.

Why I will go and vote on May 25

Between May 22 and May 25 Europeans will elect their representatives to the European Parliament. This election is often marked by a lack of interest, a poor understanding of what is at stake and therefore a very high abstention rate; in the last election only 40% of the French turned out to vote. I will explain here why I think this ballot is important and deserves my attention.

 


The European Union guarantees peace

Almost 100 years ago Europe sank into the murderous madness of the First World War, destroying a continent that had been prosperous until then. Remembering the European history of the last century reinforces the feeling that the European Union is essential to continue the reconciliation between peoples and to leave behind the barbarity and racism of the last century. The European Union cannot develop without the support of its population, and voting in the only European elections that currently exist is the best way to show that support.

 

This ballot is exceptional

During the coming term the powers of the European Parliament will be extended, and the legislative weight of the European Commission (a political institution composed of members appointed by the national governments) and of the European Parliament will gradually be put on an equal footing. In other words, the European Parliament now has real political weight in the Union, which means that the vote of the European population will have an impact on European policy through its elected members.

Another point, more symbolic but just as important: the president of the next European Commission will be elected by the Parliament, which means that the next majority party will be able to send one of its members to the head of the Union. Previously the president of the Commission was appointed by the various European heads of government; now one of the key representatives of Europe will be elected by universal suffrage!

 

Europe is an opportunity for the future

Faced with current problems such as climate change, the crisis in Ukraine and the economic crisis, solutions and actions at the national level have a limited reach and are not very effective. When 28 states representing more than 400 million people commit to an action, it carries much greater political and diplomatic weight. I therefore think we must move towards stronger European integration to meet the challenges of the coming decades. Voting in the European elections shows all politicians that the population cares about these issues and is confident that Europe will be better placed to solve them than the national governments.

A few references:

So these are my main reasons for going to the polling station on May 25 and expressing my confidence in Europe. What about you? Do you trust Europe? What would you change if you had the power to?

Importing 100 years of climate change into R

This is a flashback post: I was working on species distribution shifts over the last 40 years last summer, and recently Rémi Genevest contacted me asking how I managed to import the CRU TS 1.2 dataset into R. As always a more readable version of the code can be found here.

At that time I used some not very elegant code involving SpatialPixels and SpatialGridDataFrame. Scrolling back to the question I asked on the R-sig-geo mailing list back then, I stumbled across the answer from Robert Hijmans that I did not take into account at the time. Now, one year later, I found that his answer was going in the right direction and made some heavy changes to the code.

#reading in CRU files into R
library(raster)

#for the CRU TS 1.2 download the .zip at http://www.cru.uea.ac.uk/cru/data/hrg/timm/grid/CRU_TS_1_2.html

#the raster we get at the end, the data are monthly for all the years between 1901 and 2000
temp<-brick(nrows=228,ncols=258,xmn=-11,xmx=32,ymn=34,ymx=72,nl=1200,crs=CRS("+proj=longlat +datum=WGS84"))

#example using the temperature
all_dat<-scan("/home/lionel/Documents/Master/CRU/obs.1901-2000.tmp",skip=5,what="list")

#now turn the data into a matrix format with every line corresponding to a raster cell and the first two columns the column and row number of the cell
xs<-all_dat[seq(2,37465029,1203)]
xs<-gsub(",","",xs)
xs<-as.numeric(xs)
ys<-as.numeric(all_dat[seq(3,37465029,1203)])
mat<-matrix(c(xs,ys),ncol=2,byrow=FALSE)
#now add the temperature data from these cells for all month all year
numb<-matrix(4:1203,ncol=1)
numb<-apply(numb,1,function(x) seq(x[1],37465029,1203))
mat<-cbind(mat,apply(numb,2,function(x) as.numeric(all_dat[x])))

#reverse the row numbers since they are numbered from bottom to top in CRU and from top to bottom in rasters
ys_inv<-ys-((ys-113.5)-1)*2
mat[,2]<-ys_inv

#get the cell numbers of each box defined in the CRU dataset
ce<-cellFromRowCol(temp,rownr=mat[,2],colnr=mat[,1])
#attribute to these cells the temperature values
values(temp)[ce,]<-mat[,3:1202]
#divide by 10 to get the temperature in degree celsius
values(temp)<-values(temp)/10
#put names to the layers
month<-c("Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec")
years<-1901:2000
names(temp)<-paste(rep(month,times=100),rep(years,each=12),sep="_")

#the winter mean temperature between 1914 and 1918
winter_1418<-calc(temp[[which(names(temp)%in%paste(rep(c("Dec","Jan","Feb"),times=5),rep(1914:1918,each=3),sep="_"))]],mean)
plot(winter_1418)

(Figure: map of the mean winter temperature 1914-1918)

#the standard deviation in temperature for the years 1901 and 2000
sd_100<-stack(calc(temp[[grep("1901",names(temp))]],sd),calc(temp[[grep("2000",names(temp))]],sd))
plot(sd_100)

(Figure: maps of the within-year standard deviation in monthly temperature for 1901 and 2000)

The only mathematical magic involved here is changing the row numbers. From this huge dataset we can then do lots of neat things: for example we can see how cold the soldiers of the First World War were (first raster plot), or we can look at changes in the standard deviation in temperature between the years 1901 and 2000, after one century of climate change.
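In case the row inversion above looks cryptic, it is simply the usual top-to-bottom flip for a grid of 228 rows; a quick check of the equivalence, reusing the ys vector and the temp raster from the code above:

#ys - ((ys - 113.5) - 1)*2 simplifies to 229 - ys, which is nrow(temp) + 1 - ys
all.equal(ys - ((ys - 113.5) - 1)*2, nrow(temp) + 1 - ys)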

If you use such data in your work do not forget to cite the owners: Mitchell, T.D., Carter, T.R., Jones, P.D., Hulme,M., New, M., 2003: A comprehensive set of high-resolution grids of monthly climate for Europe and the globe: the observed record (1901-2000) and 16 scenarios (2001-2100). Journal of Climate: submitted

And if you know of a similar dataset (monthly values over Europe) at a finer spatial resolution, please contact me!

 

 
