4 data wrangling tasks in R for advanced beginners

With great power comes not only great responsibility, but often great complexity — and that sure can be the case with R. The open-source R Project for Statistical Computing, a programming language and environment, offers immense capabilities to investigate, manipulate and analyze data. But because of its sometimes complicated syntax, beginners may find it challenging to improve their skills after learning some basics.

If you’re not even at the stage where you feel comfortable doing rudimentary tasks in R, we recommend you head right over to Computerworld’s Beginner’s Guide to R. But if you’ve got some basics down and want to take another step in your R skills development — or just want to see how to do one of these four tasks in R — please read on.

I’ve created a sample data set with three years of revenue and profit data from Apple, Google and Microsoft, looking at how the companies performed shortly after the 2008-09 “Great Recession.” (The source of the data was the companies themselves; “fy” means fiscal year.) If you’d like to follow along, you can type (or copy and paste) this into your R terminal window:

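# Fiscal year for each row (three years per company)
fy <- c(2010,2011,2012,2010,2011,2012,2010,2011,2012)
# Company name for each row
company <- c("Apple","Apple","Apple","Google","Google","Google","Microsoft","Microsoft","Microsoft")
# Revenue and profit figures, in the same row order as fy and company
revenue <- c(65225,108249,156508,29321,37905,50175,62484,69943,73723)
profit <- c(14013,25922,41733,8505,9737,10737,18760,23150,16978)
# Combine the four vectors into one data frame, one column per vector
companiesData <- data.frame(fy, company, revenue, profit)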

The code above will create a data frame like the one below, stored in a variable named “companiesData”:

    fy   company revenue profit
1 2010     Apple   65225  14013
2 2011     Apple  108249  25922
3 2012     Apple  156508  41733
4 2010    Google   29321   8505
5 2011    Google   37905   9737
6 2012    Google   50175  10737
7 2010 Microsoft   62484  18760
8 2011 Microsoft   69943  23150
9 2012 Microsoft   73723  16978

(R adds its own row numbers if you don’t include row names.)
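As a minimal sketch of what those automatic row numbers look like, the snippet below rebuilds the same sample data (using rep() as a shorthand that is not in the original code) and asks R for the row names it assigned. This is only an illustration of the default behavior, not a required step.

```r
# Rebuild the article's sample data; rep() is used here only to save typing
fy <- rep(c(2010, 2011, 2012), times = 3)
company <- rep(c("Apple", "Google", "Microsoft"), each = 3)
revenue <- c(65225, 108249, 156508, 29321, 37905, 50175, 62484, 69943, 73723)
profit <- c(14013, 25922, 41733, 8505, 9737, 10737, 18760, 23150, 16978)
companiesData <- data.frame(fy, company, revenue, profit)

# No row names were supplied, so R labels the rows "1" through "9"
rowLabels <- rownames(companiesData)
print(rowLabels)

# The number of rows matches the number of labels
print(nrow(companiesData))
```

The labels come back as character strings rather than numbers, which is why they can later be replaced with more descriptive names if you want.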

If you run the str() function on the data frame to see its structure, you’ll see that the year is being treated as a number and not as a year or factor.
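Here is a small sketch of that check, repeating the sample data so it runs on its own. The class() and is.factor() calls are an extra confirmation added for illustration. Note that the company column may show up as a factor in R versions before 4.0 and as character in 4.0 and later, so your str() output can differ slightly on that line.

```r
# Recreate the sample data frame from the article
fy <- c(2010,2011,2012,2010,2011,2012,2010,2011,2012)
company <- c("Apple","Apple","Apple","Google","Google","Google","Microsoft","Microsoft","Microsoft")
revenue <- c(65225,108249,156508,29321,37905,50175,62484,69943,73723)
profit <- c(14013,25922,41733,8505,9737,10737,18760,23150,16978)
companiesData <- data.frame(fy, company, revenue, profit)

# Print the structure: one line per column with its type and first values
str(companiesData)

# Confirm how R is storing the fiscal year column
fyClass <- class(companiesData$fy)
print(fyClass)                        # "numeric"
print(is.factor(companiesData$fy))    # FALSE
```

Because fy was built from plain numbers with c(), R has no way to know those values represent years, so it stores them as ordinary numeric data.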
