Just A Data Geek...: 2015

Sunday, December 6, 2015

Continuing Research on U.S. Gasoline and Crude Oil Prices, Part Three

Continuing on with the work in my previous two posts [1] [2], I will explore the possibility that the data exhibit a unit root by using the urca package in R. Having already established that the data exhibit significant signs of autocorrelation, checking for unit root (with and without a drift) is another step in the process of working with time series data -- and financial data falls into this category.

For each Augmented Dickey-Fuller (ADF) test looking for unit root, done using the ur.df() function within the urca package, I am including a graph of that variable as a visual reference:

1. The averaged price of U.S. Conventional Gasoline prices

> gasADFtest <- summary(ur.df(avgConvGas,

                              type = "drift",

                              selectlags = "BIC"))

> gasADFtest

############################################### 
# Augmented Dickey-Fuller Test Unit Root Test # 
############################################### 

Test regression drift 


Call:
lm(formula = z.diff ~ z.lag.1 + 1 + z.diff.lag)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.08395 -0.10528  0.01245  0.13314  0.31021 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.24809    0.08949   2.772  0.00675 ** 
z.lag.1     -0.10241    0.03706  -2.764  0.00692 ** 
z.diff.lag   0.39773    0.09582   4.151 7.45e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1941 on 91 degrees of freedom
Multiple R-squared:  0.1881, Adjusted R-squared:  0.1703 
F-statistic: 10.54 on 2 and 91 DF,  p-value: 7.616e-05


Value of test-statistic is: -2.7637 3.8806 

Critical values for test thestatistics:

      1pct  5pct 10pct
tau2 -3.51 -2.89 -2.58
phi1  6.70  4.71  3.86

2. The West Texas Intermediate crude oil spot price

> wtiADFtest <- summary(ur.df(wti,
type = "drift",
selectlags = "BIC"))

> wtiADFtest

############################################### 
# Augmented Dickey-Fuller Test Unit Root Test # 
############################################### 

Test regression drift 


Call:
lm(formula = z.diff ~ z.lag.1 + 1 + z.diff.lag)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.47603 -0.08486  0.01266  0.10034  0.34917 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.19902    0.07044   2.825  0.00580 ** 
z.lag.1     -0.09557    0.03394  -2.816  0.00596 ** 
z.diff.lag   0.47096    0.09296   5.066 2.11e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.147 on 91 degrees of freedom
Multiple R-squared:  0.2432, Adjusted R-squared:  0.2265 
F-statistic: 14.62 on 2 and 91 DF,  p-value: 3.123e-06


Value of test-statistic is: -2.8159 4.0264 

Critical values for test statistics: 
      1pct  5pct 10pct
tau2 -3.51 -2.89 -2.58
phi1  6.70  4.71  3.86

3. The Brent crude oil spot price

> brentADFtest <- summary(ur.df(brent,
type = "drift",
selectlags = "BIC"))

> brentADFtest

############################################### 
# Augmented Dickey-Fuller Test Unit Root Test # 
############################################### 

Test regression drift 


Call:
lm(formula = z.diff ~ z.lag.1 + 1 + z.diff.lag)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.42331 -0.07683  0.02460  0.10103  0.35226 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.14981    0.06196   2.418   0.0176 *  
z.lag.1     -0.06661    0.02777  -2.399   0.0185 *  
z.diff.lag   0.48010    0.09208   5.214 1.15e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.147 on 91 degrees of freedom
Multiple R-squared:  0.2468, Adjusted R-squared:  0.2302 
F-statistic: 14.91 on 2 and 91 DF,  p-value: 2.509e-06


Value of test-statistic is: -2.399 2.9484 

Critical values for test statistics: 
      1pct  5pct 10pct
tau2 -3.51 -2.89 -2.58
phi1  6.70  4.71  3.86

Based on reviewing the visualized graphs, I chose to use the type = "drift" option in the ur.df() function. I do not believe that any strong signs of a trend exist in the data, but I do believe that it is appropriate to treat it for a drift term. I also used the Bayesian information criteria in order to select the lags used with the selectlags = "BIC" option -- which selected a lag of one (1) for the ADF test.

For each variable, the ADF test shows that we can not reject the null hypothesis that the variable has a unit root. At the bottom of each test is the ADF test statistic and the ADF test critical values. In all cases, we can not reject the null hypothesis at the 5-percent significance level, labeled 5pct, and only for the average gasoline price and the WTI crude price are they significant at the 10-percent (10pct) significance level. For the purposes of my research and this blog post, I am going to assume that each variable exhibits has a unit root. The regressions in each test shows the significance of the intercept, lagged term, and the drift component, all of which are statistically significant within the 5-percent significance level.

From here, I will continue my research using the first difference of each variable. The plots below are for each variable, now in a first-difference form.:

These first-differenced variables make the data stationary about a mean of zero (or near to it, which can be checked by summarizing the variables with summary()) and will aid in performing an accurate causal analysis, which I will perform in my next post. I plan to conclude this series with my next blog post and make a final update to my GitHub repository for this work as well as list the works I used in the course of my original research, when this was my thesis project.

Wednesday, November 4, 2015

Continuing Research on U.S Gasoline and Crude Oil Prices, Part Two

Exploratory Analysis, cont'd.

In my previous post, I began to explore the relationship between the spot prices of West Texas Intermediate (WTI) crude oil and Brent crude oil with the average of the spot prices of conventional gasoline in the New York and Gulf hubs.

To continue with my analysis, the question of whether or not there was a statistically significant difference between the dollar-per-gallon price of WTI and Brent crude oil should be addressed. The first thing I did was create a boxplot of the two price variables:

You can see that the range of the prices are about the same, and the means appear to be only slightly different. Just to be sure, however, I calculated the confidence interval between the two means using the assumption that the variances were unequal (which they are) and then testing the hypothesis that the difference between the mean prices is zero:

Lower C.I., 95% | Upper C.I., 95% 
   -0.008423387 |     0.280491840

t-stat, 189.208df | comparative t-stat | comparative p-value
            1.973 |           1.857953 |          0.03236876

So, you can see that the confidence intervals of the difference of the means includes zero, ranging from just more and negative (-) $0.01 to $0.28. The hypothesis test allows me to have confidence in my decision to not reject the null hypothesis -- which appear to not be statistically significantly different from one another.

However, we know that the two prices are viewed as significantly different in the market, so further exploration is necessary. First is to check for randomness because it is a basic assumption in statistical modelling; and, if the data are not random, than any statistical tests comparing them may produce questionable results. For this task, I looked at the autocorrelation and partial autocorrelation of the WTI and Brent crude oil spot prices and the averaged conventional gasoline spot price:

The top row of plots are the autocorrelation plots, which suggest that there is a moderate autocorrelation between the variables and their respective lagged values. The data do not exhibit randomness and we can infer that the current month's price has a correlation with its own lagged values.

The second row of plots are the partial autocorrelation of the variables in this analysis. This can be used to find the appropriate lag order for the modeling to be done in the next step of my analysis. In this case, it appears that the first and second lags are statistically significant for all of the variables (wti, brent, and avgConvGas) and only WTI data has a lagged value beyond that, and that only the third lagged value.

So, it would appear that an appropriate course of action would be to use an autoregressive model, AR(2), for all three variables and an additional model for WTI modeled as an AR(3).

----------------------------------------------------------------------------------------

I'm going to conclude this entry here. I will continue in my analysis in my next post, which will include moving on to actually modelling the data.

EDIT (05 Nov 2015): I uploaded my R script and dataset to my GitHub account so that it is available for everyone. I will be updating it as I go along, with the final submission sent when I conclude my work in this blog series.

Sunday, September 27, 2015

Continuing Research on U.S. Gasoline and Crude Oil Prices, Part One

Preface

The next several posts will contain a brief version of the exploratory analysis that I had performed for my final research project (thesis) while I was enrolled in a Masters program in Applied Economics. The overall goal was to quantify the causal relationship between U.S. gasoline prices and spot crude oil prices -- both West Texas Intermediate (WTI) and Brent.

All of my data comes from the U.S. Dept. of Energy's Energy Information Agency (EIA). A lot of the initial transformation of the data, such as picking the time frame and choosing which variables to include, was done in LibreOffice Calc. Originally, the statistical analysis had been done in Stata12, but since migrating myself to R I have done most of my work over again.

The data and R script will be made available in my GitHub account, a link is available on the right-hand side of the page.

Introduction

Crude oil is a heavily traded commodity that garners a lot of attention. The question that I ultimately would like to answer is: What is the relationship between crude oil and U.S. gasoline prices, and is it quantifiable in some manner?

First, I chose an eight year time period of data from the EIA's publicly available data, beginning in July 2006 and ending in June 2014. The first part of my analysis focuses primarily on WTI and Brent crude oil spot prices and the simple average of the U.S. Conventional Gasoline prices from the New York and Gulf hubs. (i.e.: (Conv. Gas NY + Conv. Gas Gulf) / 2)

Exploratory Analysis

Foregoing summary statistics, I will begin by visualizing the data:

From this simple line plot, it looks like the average Conventional Gasoline price tracks closer to the Brent crude oil price than the WTI price. So this graph is followed up with the correlation between the variables, using the cor() function in R:

                 wti     brent avgConvGas
wti        1.0000000 0.9385743  0.9172266
brent      0.9385743 1.0000000  0.9718169
avgConvGas 0.9172266 0.9718169  1.0000000

From the table above you can see that the two crude oil prices are highly correlated (orange highlighting) with each other, which is not a surprise given the global nature of the crude oil market. We also see that the two market crude oil prices are highly correlated with the gasoline prices -- again, not a surprise since gasoline is derived from crude oil via the refining process -- but U.S. Conventional Gasoline prices are more correlated with the Brent crude oil price (blue highlighting) than the WTI crude oil price (green highlighting).

...

I'm going to stop here. I will continue my analysis in Part Two of this post. Stay tuned!

Monday, August 17, 2015

Alcoa vs. Aluminum

For your reading pleasure, I am submitting a brief analysis of Alcoa's stock price -- as listed on the NY Stock Exchange (NYSE) -- and the London Mercantile Exchange's price for aluminum.

Before I do, I wanted to let whomever may be reading that I have updated my blog a bit with links to other blogs and sites that I frequent, as well as a link to my GitHub account.

... and, my GitHub account now has a repository for the blog that contains my R scripts and data. The README.md file explains the pertinent details.

As for my blog post, all of my data comes from Quandl. If you haven't done so, you can sign up for an account for free. This will give you easier access to their data.

I decided to go with a year-to-date approach for my analysis. Alcoa's stock price hasn't been all that great this year, as you can see here:

... it's lost roughly one-third of its value over the course of the year. So, I looked up the settlement prices (the "cash" or spot price) of aluminum on the LME:

Visually, there is a noticeable correlation. After controlling the two datasets for commonly shared dates -- U.S. markets and UK markets have some operating differences -- I was able to transform the data to make them comparable. The correlation between the datasets is roughly 85%.

At first this may appear natural, but the relationship would make more sense if it were the opposite. One would expect the Alcoa share price to go up as the price of aluminum goes down, since their costs would go down. To see if the law of supply and demand were working with that assumption, I then downloaded the stock quantities of aluminum that LME has warehoused around the world.

There is a noticeable decrease in quantities of aluminum (both "primary", or pure, and alloy) being warehoused. These quantities are closing quantities, defined as the on-hand quantity at the close of business day after deliveries in and out of inventory. Closing quantities includes open interest quantities and cancelled interest quantities.

The next question that comes to mind is: How is this important?

Simply answered with: It may be a sign that demand for aluminum may be falling... and that may be a sign of weakness in the manufacturing sector. Aluminum is a pretty common metal and is widely used in manufacturing, so decreasing quantities and prices may be the result of decreased demand.

To see if this hypothesis is indeed true, I will have to do further research focusing on ore production as well as other companies with operations similar to Alcoa.

Sunday, July 26, 2015

Using RPubs To Publish Work

I need to apologize to anyone who is actually reading my blog. I had every intention of publishing a follow up post about Greece -- comparing its 10-year bond rates with those of the other "troubled" EU countries of Spain, Italy, and Portugal -- but life got in the way a bit.

This post will be a brief explanation of my current course. I'm currently in the Reproducible Research course in the Data Science specialization certificate offered through Coursera.

One of the assignments in this course has me working with a particularly messy dataset. After downloading the data, cleaning it up, and performing some exploratory analysis, I published my findings to RPubs.com.

Here is a link to my ~~report~~. It's nothing fancy, but I'm proud of it.

This was created using the knitr package that integrates nicely with RStudio. It allows you to create markdown documents that are easily viewable in HTML, as well as publishing them to RPubs, which is maintained by the folks who created and maintain RStudio.

knitr can also create documents in .doc and .pdf format as well. I recommend trying it if you're in a data analysis role in any compacity.

For now, I'm going to leave off here.

Update 11 December 2015: Due to RPubs documents being public, and not wanting to tempt others who are taking the Data Science specialization through Coursera to copy my work, I have removed the report from RPubs. I apologize for the inconvenience.

Friday, July 10, 2015

Exploring Greek 10-year Bond Rates

I want to start this blog post with a few notes:

I received some feedback and updated the third graph in my previous post to include a label for the y-axis.
I plan to make my R scripts, data, etc. available through my GitHub account at a later date
I will be adding links to blogs that I follow and websites that I find useful and visit often.
I am hoping to post up something new every other week, but I am not making any promises on creating a schedule for publishing.

This post is the result of a conversation I was having with a coworker. We were discussing the continued financial woes of Greece when he looked up the Greek 10-year bond rate. It was not extraordinarily high, giving it recent history. So, I decided to look up the data through Quandl and see what was available. I had free access to a premium data set, and pulled the data straight into R using the "Quandl" package.

First, here is a historical look at Greek 10-year bond rates:

This plot was made using the "ggplot2" package. Current rates (around the 10% range) are not particularly bad when compared to 2012 and 2013. It did appear to have a steep increase from the beginning of 2015 to the last observed data point.

Then, I created another plot using a narrower time frame to focus on more recent events following the 2008 financial crisis:

These plots (and their data) made me wonder if the market was really reacting to the problems in Greece.

The caveat to these plots, however, is that the last captured observation was June 25, 2015. Quite a lot has happened since then, so I went to Investing.com and pulled data for the trading days of June 25, 2015 through the close of business July 10, 2015:

We can see that there was a jump in the Greek 10-year bond rate that corresponds to the referendum that took place, and the market's reaction to the referendum's outcome.

I took this data and then combined it the data I retrieved from Quandl and created this final plot:

I started working on this blog on Tuesday, and was hoping that Quandl would have updated data -- I am slightly hesitant to append one (very small) data set to another that come from two different sources. Investing.com's data is not formatted for download. I had to manually collect the data from their site in order to create that data set.

Regardless, it leads to a few questions:

If rates are a sign of risk, how worried was the market in the Greek referendum? (The spike is less than half of the larger spike in 2012.)
How will this weekend's continued negotiations affect Greek bond rates?
Like everyone watching events unfold, what would happen if Greece left/were removed from the EU? Or stopped using the Euro and reintroduced the Drachma?

Continued research could answer those questions.

Sunday, June 28, 2015

First Post

I'm trying to blog in order to exercise my R skills and as an outlet for my own interest in data analysis... or data science, since that's the new "buzz term".

I'm using R (currently, version 3.2.1) and RStudio (currently, version 0.99.441) on an "aging" Acer Aspire TimelineX laptop running Linux Mint 17 'Qiana' for its operating system (OS).

So, for my first post, I'm analysing data that I had previously looked at using Stata12. Since migrating to R, I'm finding that the graphical output is much better. Please note that the following graphs are using the base graphics package.

My first plot is Weekly U.S. Retail Gasoline Prices:

Followed by Weekly U.S. Total Stocks:

These are exploratory graphs. I am just viewing the data. The price data will probably show a strong correlation with oil prices (which I will show in a later post). The stock quantities graph looked interesting, so I created a more detailed graph using the 'ggplot2' graphics package:

There was an interesting decrease in the quantities of gasoline stocks that began around 1992 through 1996. Quantities of gasoline stocks did not show an increase until 2007. This may be worth exploring further.