A Simple Way to Gather all Coronavirus Related Data with R


Simplicity is key.

Federico Riveroll

*Note: I recently published the Python version of this tutorial, which you can find here.

The propagation of the coronavirus throughout the world is alarming. As of today (April 13, 2020), there are over 2 million confirmed cases of coronavirus worldwide.

[Image: Johns Hopkins COVID-19 dashboard map]

The objective of this article is to gather the data needed for research and to gain proactive visibility into COVID-19 by pulling all the relevant data into a single R dataframe.

First, install the required packages:
install.packages("openblender")
install.packages("ggplot2")
install.packages("dplyr")
install.packages("reshape")

The CSSE at Johns Hopkins is doing the amazing job of uploading the daily data here. However, it is wildly untidy and spread across many different datasets, so I tidied it and uploaded it as a single OpenBlender dataset:

Let's pull the data into an R dataframe by running this script:

# Load libraries
library(openblender)
library(dplyr)
library(ggplot2)
library(reshape)

# Copy your token here
token_key <- "YOUR_TOKEN_HERE"

action <- "API_getObservationsFromDataset"
# ANCHOR: 'COVID19 Confirmed Cases'
parameters <- list(
  token = token_key,
  date_filter = list(start_date = "2020-01-01", end_date = "2020-03-10"),
  id_dataset = "5e7a0d5d9516296cb86c6263",
  consumption_confirmation = "on"
)
df <- openblender::call(action, parameters)$sample
head(df)

So now we have the number of confirmed cases, deaths, and recoveries, aggregated by day and by location, for countries of interest such as France, Italy, South Korea, Iran, Spain, and Germany.
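
To visualize the curves, here is a minimal sketch; note that the column names 'country' and 'confirmed' are assumptions, so check names(df) and adjust to the actual schema:

# Sketch: confirmed cases over time, one line per country.
# NOTE: 'country' and 'confirmed' are assumed column names -- verify with names(df).
df$date <- as.Date(as.POSIXct(unlist(df$timestamp), origin = "1970-01-01", tz = "UTC"))
ggplot(df, aes(x = date, y = confirmed, colour = country)) +
  geom_line() +
  labs(x = "date", y = "confirmed cases")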

Here we can see the outbreak of confirmed cases in Iran, Italy, and South Korea, with Spain, France, and Germany starting to rise.

Next, we'll gather COVID-related news and texts from these sources: ABC News, the Wall Street Journal, CNN News, and USA Today's Twitter account (you can look for other sources).

So let’s fetch the data.

action <- "API_getOpenTextData"
parameters <- list(
  token = token_key,
  date_filter = list(start_date = "2020-01-01", end_date = "2020-03-10"),
  text_filter_search = list("covid", "coronavirus"),
  aggregate_in_time_interval = list(time_interval_size = 60 * 60 * 24),
  consumption_confirmation = "on",
  sources = list(
    # Wall Street Journal
    list(id_dataset = "5e2ef74e9516294390e810a9", features = list("text")),
    # ABC News Headlines
    list(id_dataset = "5d8848e59516294231c59581", features = list("headline", "title")),
    # USA Today Twitter
    list(id_dataset = "5e32fd289516291e346c1726", features = list("text")),
    # CNN News
    list(id_dataset = "5d571b9e9516293a12ad4f5c", features = list("headline", "title"))
  )
)
df_news <- openblender::call(action, parameters)$sample

Above, we specified the following:

- a date range from January 1 through March 10, 2020
- a text filter to keep only items mentioning "covid" or "coronavirus"
- aggregation into intervals of one day (60 * 60 * 24 seconds)
- the four news sources, with the text features to extract from each

head(df_news)

Every observation is the aggregation of news by day. For each interval we have source, which is the concatenation of all the news into a single string, and source_lst, which is a list of the individual news items in that interval.

The timestamp (returned as a Unix timestamp) means the news happened in the interval from the prior timestamp up to, but strictly before, the current one.
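
A quick way to inspect this structure:

# Peek at the first interval: concatenated text vs. the list of individual items
df_news$source[1]
df_news$source_lst[[1]]
# Convert the Unix timestamps into readable dates
as.POSIXct(unlist(df_news$timestamp), origin = "1970-01-01", tz = "UTC")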

Now let's get the mention counts for some countries of interest.

# Count, per day, how many news items mention each country.
# (Very sorry for the numerous loops.)
countries <- c("china", "iran", "korea", "italy", "france", "germany", "spain")
for (country in countries) {
  lst_count <- c()
  for (row in 1:nrow(df_news)) {
    news <- df_news[row, "source_lst"][[1]]
    count_mentions <- 0
    if (length(news) > 0) {
      for (row_n in 1:length(news)) {
        news_text <- news[row_n]
        if (length(news_text) > 0 && grepl(country, news_text)) {
          count_mentions <- count_mentions + 1
        }
      }
    }
    lst_count[length(lst_count) + 1] <- count_mentions
  }
  df_news[paste0("count_", country)] <- lst_count
}
# Now let's plot the data.
df_news <- df_news %>% select("timestamp", "count_china", "count_iran", "count_korea",
                              "count_italy", "count_france", "count_germany", "count_spain")
df_news_melted <- melt(df_news, id.var = "timestamp")
ggplot(df_news_melted, aes(unlist(timestamp), value, colour = variable)) +
  geom_smooth(method = "loess", se = FALSE) +
  scale_colour_manual(values = c("#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")) +
  labs(x = "timestamp", y = "mentions per day")

News mentions of China increased early and are now on a decreasing slope.

Next, let's gather financial data. For this, we can start from the Dow Jones dataset and blend several others onto it, such as exchange rates (Yen, Euro, Pound), material prices (crude oil, corn, platinum, tin), or stocks (Coca-Cola, Dow Jones).

action <- "API_getObservationsFromDataset"
# ANCHOR: 'DowJones'
# BLENDS: 'Valor contra Yen', 'Valor contra Euro', 'Valor contra Libra',
#         'Crude Oil Price', 'Corn Price', 'Platinum price', 'CocaCola Price'

# Every blend uses the same daily average aggregation, so define it once.
daily_avg <- list(output = "avg", empty_intervals = "impute", time_interval_size = 86400)
make_blend <- function(id) {
  list(id_blend = id,
       restriction = "None",
       aggregate_in_time_interval = daily_avg,
       blend_type = "ts",
       drop_features = list())
}

parameters <- list(
  token = token_key,
  id_user = "5ca62dafe213e54e846509ea",
  id_dataset = "5d4c14cd9516290b01c7d673",
  consumption_confirmation = "on",
  aggregate_in_time_interval = daily_avg,
  # Blend ids, assumed to follow the BLENDS order listed above:
  blends = list(
    make_blend("5d2495169516290b5fd2cee3"),  # USD vs. Yen
    make_blend("5d4b3af1951629707cc1116b"),  # USD vs. Euro
    make_blend("5d4b3be1951629707cc11341"),  # USD vs. Pound
    make_blend("5d4c80bf9516290b01c8f6f9"),  # Crude Oil Price
    make_blend("5d4c23b39516290b01c7feea"),  # Corn Price
    make_blend("5d4ca1049516290b02fee837"),  # Platinum Price
    make_blend("5d4c72399516290b02fe7359")   # CocaCola Price
  ),
  date_filter = list(start_date = "2020-01-01T06:00:00.000Z",
                     end_date = "2020-03-10T06:00:00.000Z")
)
df <- openblender::call(action, parameters)$sample

We just pulled a dataset with all the data blended by time as columns: 24 observations (one for each day) and 53 variables.

head(df)
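
You can confirm the shape directly:

dim(df)  # observations (rows) x variables (columns)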

So now we have a single dataset of daily price observations blended by time. If we want to compare the variables, we'd better normalize them to the 0-1 range so that we can better appreciate the patterns:

# Let's compress the numeric columns into the [0, 1] domain.
nums <- unlist(lapply(df, is.numeric))
df_nums <- df[, nums]

# Min-max scaling: e.g. range01(c(2, 4, 6)) returns 0.0, 0.5, 1.0
range01 <- function(x) { (x - min(x)) / (max(x) - min(x)) }

df_compress <- df_nums["timestamp"]
df_compress["timestamp"] <- df["timestamp"]
df_compress["PLATINUM_PRICE_price"] <- range01(df_nums["PLATINUM_PRICE_price"])
df_compress["DOW_JONES_price"] <- range01(df_nums["open"])
df_compress["CRUDE_OIL_PRICE_price"] <- range01(df_nums["CRUDE_OIL_PRICE_price"])
df_compress["CORN_PRICE_price"] <- range01(df_nums["CORN_PRICE_price"])
df_compress["COCACOLA_PRICE_price"] <- range01(df_nums["COCACOLA_PRICE_price"])
df_compress["POUND_VS_USD"] <- range01(df_nums["VALOR_CONTRA_LIBRA_nivel"])
df_compress["EURO_VS_USD"] <- range01(df_nums["VALOR_CONTRA_EURO_nivel"])

df_compress_melted <- melt(df_compress, id.var = "timestamp")

# Now let's plot.
ggplot(df_compress_melted, aes(timestamp, value, colour = variable)) +
  geom_smooth(method = "loess", se = FALSE) +
  scale_colour_manual(values = c("#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")) +
  labs(x = "timestamp", y = "normalized value")

We can see the euro and the pound rising against the USD (which more likely reflects the USD plunging along with the other indicators).

Now we will align the COVID-19 confirmed cases, the coronavirus news, and the economic indicators into one single dataset blended by time.

To blend the data, let's upload the datasets we created to OpenBlender. First the news dataframe:

action <- "API_createDataset"
parameters <- list(
  token = token_key,
  name = "Coronavirus News",
  description = "YOUR_DATASET_DESCRIPTION",
  visibility = "private",
  select_as_timestamp = "timestamp",
  tags = list(),
  insert_observations = "on",
  dataframe = df_news
)
openblender::call(action, parameters)

Now the financial indicators:

action <- "API_createDataset"
parameters <- list(
  token = token_key,
  name = "Financial Indicators",
  description = "YOUR_DATASET_DESCRIPTION",
  visibility = "private",
  select_as_timestamp = "timestamp",
  tags = list(),
  insert_observations = "on",
  dataframe = df_compress
)
openblender::call(action, parameters)

And now we just pull the initial COVID-19 dataset and blend the new datasets onto it by placing the id_dataset of each dataset we just created into the 'id_blend' fields below.
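
If you prefer not to copy the ids by hand, you can try capturing them from the API_createDataset response; this is only a sketch, and the 'id_dataset' field name is an assumption, so inspect the returned object with str() to confirm:

# Hypothetical sketch: capture the id of a newly created dataset
# (instead of calling API_createDataset without storing the response).
# 'id_dataset' is an assumed response field -- verify with str(resp).
resp <- openblender::call(action, parameters)
id_financial_dataset <- resp$id_dataset  # paste this into the 'id_blend' field below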

action <- "API_getObservationsFromDataset"
# ANCHOR: 'Daily data of cases COVID-19'
# BLENDS: 'Financial Indicators for COVID', 'COVID NEWS'
parameters <- list(
  token = token_key,
  id_dataset = "5e73d928951629333c04a9c6",
  consumption_confirmation = "on",
  blends = list(
    list(id_blend = "ID_DATASET_FINANCIAL", blend_type = "ts", drop_features = list()),
    list(id_blend = "ID_DATASET_COVID_NEWS", blend_type = "ts", drop_features = list())
  ),
  date_filter = list(start_date = "2020-01-01T06:00:00.000Z",
                     end_date = "2020-03-10T06:00:00.000Z")
)
df <- openblender::call(action, parameters)$sample

And now we have a single dataset with all the daily information blended by time!

There is an enormous range of directions to go from here.
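
For instance, a quick first pass is to check pairwise correlations across the blended variables (a sketch over whatever numeric columns your final dataframe contains):

# Pairwise correlations across all numeric columns of the blended dataset
num_cols <- df[, unlist(lapply(df, is.numeric))]
round(cor(num_cols, use = "pairwise.complete.obs"), 2)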

Please post further findings.
