
Problem with Canonical Correspondence Analysis. Every environmental variables in - (Jun/15/2014 )

Hi, I'm trying to run a CCA in PAST3 on two phylotypes across 7 samples to analyze their relationship with 8 to 12 environmental variables. These phylotypes come from a certain molecular marker. One problem is that the PCR primers I used also amplified two types of sequences totally unrelated to my molecular marker, yet they make up about 50% of the total clones. If I use only the correct phylotypes in the CCA, all the variables fall on one axis, which makes the analysis useless. If I add the two unrelated "phylotypes", I get a CCA plot that makes sense, but I doubt it is correct or representative of the relationship between the environmental variables and the "correct" marker phylotypes. Can I really use the 2 correct phylotypes of my gene in combination with the 2 unrelated-to-my-gene "phylotypes"?

 

If not, I'm looking for alternative software or multivariate analyses. I'm currently running a PCA on the environmental variables together with the Simpson index, so the Simpson index appears as a variable in the biplot, but I have no idea whether I'm doing it right. Please help me.

-Procyon-

Apologies if I’m way off the mark here.

With two phylotypes and 12 variables I would have first looked at fitting a binary (logistic) regression to the first two or three principal components from the environmental variables.

-DRT-

I'm going to do this, but for the moment I have the following question: how does a CCA treat missing values? Are they ignored or imputed?

 

How should the amount of missing values determine whether a certain variable is selected?

 

Some of my data are unavailable: certain variables were measured at most of the sampling sites but not all, and I don't know whether I should leave those variables out. How do I explain a CCA where a variable's data are unavailable for 2 of 7 sampling points?

 

Is there a way I can calculate the effects of missing values in a CCA or PCA?

 

What about the CCA using the additional 2 unrelated-to-my-gene "phylotypes"??

-Procyon-

DRT on Mon Jun 16 22:00:36 2014 said:

Apologies if I’m way off the mark here.

With two phylotypes and 12 variables I would have first looked at fitting a binary (logistic) regression to the first two or three principal components from the environmental variables.

 

How do I interpret and explain that analysis??

-Procyon-

Procyon on Fri Jun 20 22:00:15 2014 said:

 


How do I interpret and explain that analysis??

 

 

One of the reasons I like logistic regression over the multivariate methods is that interpretation isn’t much different from ordinary regression. You will end up with a set of parameters and their errors/significance, which translate to the intercept and slopes that most people are familiar with. If you are lucky there will be two parameters, maybe with an interaction term, which predominate. This will allow you to illustrate the solution as a simple scatter plot with a line through your data (curved if there is an interaction term) corresponding to the midpoint of the logistic equation, i.e. a predicted probability of 0.5. One side of the line is probably one phylotype, and the further a point lies from the line, the greater that probability.
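As a rough illustration of the interpretation above, here is a minimal sketch in plain Python. Everything in it is an assumption made for the example: the PC scores and the 0/1 labels are invented, `fit_logistic`, `predict_proba` and `pc2_on_midline` are hypothetical helper names, and the small L2 penalty is only there to keep the coefficients finite with so few samples. It is not what PAST3 or any particular package does, and with real data a statistics package would also give you the standard errors.

```python
import math

def fit_logistic(X, y, lr=0.1, n_iter=5000, l2=0.01):
    """Fit p = 1 / (1 + exp(-(b0 + b1*x1 + b2*x2 + ...))) by gradient descent.

    The small L2 penalty (an assumption of this sketch) is applied to the
    slopes only, so they stay finite even with very few samples.
    """
    n, k = len(X), len(X[0])
    beta = [0.0] * (k + 1)  # beta[0] is the intercept
    for _ in range(n_iter):
        grad = [0.0] * (k + 1)
        for xi, yi in zip(X, y):
            z = beta[0] + sum(b * v for b, v in zip(beta[1:], xi))
            p = 1.0 / (1.0 + math.exp(-z))
            err = p - yi
            grad[0] += err
            for j, v in enumerate(xi):
                grad[j + 1] += err * v
        for j in range(k + 1):
            penalty = l2 * beta[j] if j > 0 else 0.0
            beta[j] -= lr * (grad[j] / n + penalty)
    return beta

def predict_proba(beta, x):
    """Predicted probability of class 1 at the point x."""
    z = beta[0] + sum(b * v for b, v in zip(beta[1:], x))
    return 1.0 / (1.0 + math.exp(-z))

def pc2_on_midline(beta, pc1):
    """PC2 value where the predicted probability is exactly 0.5 (no interaction term)."""
    return -(beta[0] + beta[1] * pc1) / beta[2]

# Invented scores of 7 samples on the first two environmental PCs.
scores = [(-1.8, 0.4), (-1.1, -0.7), (-0.4, 0.9), (0.1, -0.2),
          (0.6, 0.5), (1.2, -0.9), (1.4, 0.0)]
# Invented labels: 1 if phylotype A dominates the sample, 0 if phylotype B does.
dominant = [0, 0, 0, 1, 0, 1, 1]

beta = fit_logistic(scores, dominant)
print("intercept, slope PC1, slope PC2:", beta)
print("P(phylotype A) at PC1=2, PC2=0:", predict_proba(beta, (2.0, 0.0)))
```

Plotting the samples on PC1 against PC2 and drawing `pc2_on_midline` across the range of PC1 would give the scatter plot with its dividing line. Without an interaction term that line is straight.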

 

I see the CCA coming into its own when you add the other two “phylotypes”, since with only two phylotypes there is just one ordination axis to extract, but I wouldn’t like to comment on the scientific validity of including these. No harm in looking though; if the data are there, use them.

 

The missing values problem is a tricky one (I assume they are genuinely missing, not just 0, i.e. zero-inflated), and you may need to seek out guidance because the answer could depend on exactly what is missing. Usually the simplest solution is to delete the ‘row’, but with only 7 sampling points this is going to take out a big proportion of your data. If the distribution around the missing data points is normal, I would be tempted to run a series of CCAs, each time inserting an appropriately distributed random number into the missing data points. This would give you a sense of how dependent the results of your analysis are upon the values of the missing data.
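The repeated-imputation idea could be sketched roughly as follows, again in plain Python. The pH values and relative abundances are invented, `imputation_sensitivity` and `pearson` are hypothetical names, and the normal distribution is fitted to the observed values of that one variable, which is an assumption. To keep the sketch self-contained, a simple Pearson correlation stands in for the CCA. In practice you would pass in a function that reruns your ordination and returns whatever summary you care about (an eigenvalue, a biplot score).

```python
import math
import random
import statistics

def pearson(x, y):
    """Pearson correlation, used here only as a stand-in for the CCA step."""
    mx, my = statistics.mean(x), statistics.mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den

def imputation_sensitivity(values, response, statistic, n_draws=200, seed=0):
    """Fill each missing value (None) with a random normal draw and recompute
    `statistic` each time. The mean and SD come from the observed values."""
    rng = random.Random(seed)
    observed = [v for v in values if v is not None]
    mu, sd = statistics.mean(observed), statistics.stdev(observed)
    results = []
    for _ in range(n_draws):
        filled = [rng.gauss(mu, sd) if v is None else v for v in values]
        results.append(statistic(filled, response))
    return results

# Invented example: pH missing at 2 of 7 sites, plus the relative
# abundance of one phylotype at the same 7 sites.
ph = [7.1, None, 6.8, 7.4, None, 6.9, 7.6]
rel_abund = [0.20, 0.35, 0.15, 0.55, 0.40, 0.25, 0.70]

draws = imputation_sensitivity(ph, rel_abund, pearson)
print("range of the statistic over imputations:", min(draws), max(draws))
```

A narrow range suggests the result does not hinge on the two missing values. A wide range, or a change of sign, suggests the variable is better left out or the missing sites dropped.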

 

good luck

-DRT-