% Exercise 19 -- file 'PCAsup.m'
% Principal Component Analysis of LakeSuperiorAirTemp
% IMPORT FILE 'LakeSuperiorAirTemperature.dat.txt' AND
% RENAME THE RESULTING VARIABLE TO 'LakeSuperiorAirTemp'.

% The file contains hourly measurements of air temperature above Lake
% Superior at four locations:
%  Station-1: Lat 47.08 Lon -90.73 
%  Station-2: Lat 47.87 Lon -89.31 
%  Station-3: Lat 48.22 Lon -88.37 
%  Station-4: Lat 47.18 Lon -87.22 

% First, copy the data into separate variables
aa = LakeSuperiorAirTemp(:,1);
bb = LakeSuperiorAirTemp(:,2);
cc = LakeSuperiorAirTemp(:,3);
dd = LakeSuperiorAirTemp(:,4);

% Remove the NaN's via interpolation 
% In this example, I'm using my dumb-simple interpolation.
va = sergeiinterpolate(aa);
vb = sergeiinterpolate(bb);
vc = sergeiinterpolate(cc);
vd = sergeiinterpolate(dd);

plot([va vb vc vd])
%%
% ---- BEFORE YOU BEGIN THE ANALYSIS -----
% Look at the data. Think about how the PCA analysis could be useful here. 
% Ask yourself a series of questions. 
% What factors do you think may generate variance in this data? (Come up with at least 3)
% (Variance here is anything that departs from the average temperature for the dataset.) 
% How many of these factors are independent?
%%
% Now let's proceed. Choose a record with least sinister gaps. 
% You can later play with these numbers
% mybegin=7000; myend=22000;
% mybegin=3000; myend=10000;
mybegin=8000; myend=10000;
va = va(mybegin:myend);
vb = vb(mybegin:myend);
vc = vc(mybegin:myend);
vd = vd(mybegin:myend);
plot([va vb vc vd])
%%
% Make a scatter plot to see the dispersion of data
% (don't have a 4-dimensional space handy, have to settle for 3D) 
plot3(va,vb,vc,'.');grid on
% Rotate the plot and investigate the dimensionality of the data.
%%
% Subtract off the mean from each vector.  
% If there was a sensible and easy trend, I'd subtract that too.
% Stuff them all into a n row x 4 column matrix
mymatrix = [va-mean(va), vb-mean(vb), vc-mean(vc), vd-mean(vd)];
size(mymatrix)
sr=mymatrix;        % here, 'sr' is the variable used in the analysis

% Create labels
categories={'St1' 'St2' 'St3' 'St4'};
%%
% Get a quick impression about the data
boxplot(mymatrix,'orientation','horizontal','labels',categories)
% You could also use 'plot' to compare pairs of variables,

%%
% Get a quick idea of correlations
corr(mymatrix)
%%
% ---- SKIP THIS STEP ON THE FIRST RUN ----
% If the four datasets were very different, we'd need to standardize their
% variance. Since they are not, on the first pass it's possible to retain
% the original scale and units. You can run this cell later and compare the
% results.
% Divide the data by the corresponding standard deviations
stdr = std(mymatrix);
sr = mymatrix./repmat(stdr,max(size(mymatrix)),1);
% The standardized rankings are now in variable 'sr'
boxplot(sr,'orientation','horizontal','labels',categories)
% --------------------------------------------
%%
% Now find the principal components!
[coefs,scores,variances,t2] = princomp(sr);
%%
% First, look at vectors of principal component coefficients
coefs
% The first column is the first principal component
% Note the weights (loadings) for each station
%%
% Component scores (variable 'scores') are the original data that have been
% mapped into the new variables, i.e. projected on principal components.
% Projection on the first two (most significant) principal components:
plot(scores(:,1),scores(:,2),'+')
xlabel('1st Principal Component');
ylabel('2nd Principal Component');
%%
% Using the 'variances' output, calculate the percent of variance
% in the data explained by each principal component
percent_explained = 100*variances/sum(variances)
pareto(percent_explained)
xlabel('Principal Component')
ylabel('Variance Explained (%)')
%%
% Visualize the results of the principal component analysis
biplot(coefs(:,1:2), 'scores',scores(:,1:2),... 
'varlabels',categories);
% Each of the variables is represented in this plot by a vector, and
% the direction and length of the vector indicates how each variable 
% contributes to the two principal components in the plot.
% What can you tell about the nature of the first PC?
%%
% Let's visualize the principal components in another way.
% Plot the data along the principal components against time
hold off
plot(scores)
%hold on
%plot(-sr(:,1),'black')
% Doesn't this look like the first PC is seasonal variation?
% Of course, you knew this from the beginning, right?
% Is it only seasonal?
%%
% But what is the second PC? Daily cycle? Latitude?
% Use Google Earth or Google Maps to find out the locations of the four stations.
% (Both accept comma-separated (Lat,Lon) pairs.
% Compare the latitudes against the biplot. Is it now clearer?
% Let's compare the magnitudes of the second and third components 
plot(scores(:,2:3))
%%
% Hm, about the same... But can they really be reflections of the same 
% underlying thing? Well, no! They are orthogonal by construction, remember?

corr(scores)

%%
% Plot the second and third PCs on a biplot.
biplot(coefs(:,2:3), 'scores',scores(:,2:3),... 
'varlabels',categories);
% Does this give you a clue? Compare to the Google Earth map.
% (This is the part that I find most amazing about this analysis.)
% Explain what you see.

% Additional food for thought: Could daily cycle be a separate PC? 
% If it were, how would the contributions from the four factors look like? 
% For example, could any two of them have different signs? 
% Or would all four stations contribute to the component in the same way? 
% If so, wouldn't the daily component be included in the first PC? 
% Check if you can see the daily cycle in the first principal component. 
% We have 2000 hours (data points) in the series, that's 83 days.
% Compare the first PC against the actual temperature data.

% Repeat the analysis using another part of the series (or a longer
% series). Do you get the same results?