Add calculate_cng_indices#97
Conversation
| if model == "factors": | ||
| data -= np.linalg.pinv(np.diag(np.diag(np.linalg.pinv(data)))) | ||
| # TODO: Should this line be here? | ||
| data = covariance_to_correlation(data) |
There was a problem hiding this comment.
This line comes from the eigenComputes function. When the data passed in are correlations and cor == TRUE, it calls eigen(cov2cor(corFa(x))). However, it doesn't do this if the variables/observations themselves are passed, in which case it calls eigen(corFA(cov(x))). This produces very different results! I think this is only needed if covariances are passed in rather than correlations, but since this implementation calculates the correlations I think this line should be removed. Agreed?
There was a problem hiding this comment.
So, are we assuming the input value here will always be eigenForm == "data"? If so, I think you're right that this should be removed, but I may be misunderstanding.
There was a problem hiding this comment.
Yeah, I think you're right. I was getting confused by the case where dataType == correlation && cor == FALSE, but I think in that case the data passed in should actually be a covariance matrix, not a correlation matrix.
62823a7 to
9143d0e
Compare
|
|
||
| values = np.sort(np.linalg.eigvals(data))[::-1] | ||
|
|
||
| num_variables = len(data) |
There was a problem hiding this comment.
Won't len(data) give you the number of observations, rather than the number of variables, assuming data is a numpy array or pandas data frame?
Unfortunately, this doesn't translate exactly from R:
> df <- data.frame(A = c(1, 2, 3), B = c(1, 2, 3))
> length(df)
[1] 2
>>> import pandas as pd
>>> df = pd.DataFrame({'A': [1, 2, 3], 'B': [1, 2, 3]})
>>> len(df)
3
>>> len(df.values)
3
Also, should we move the line below to the beginning?
There was a problem hiding this comment.
At this line of the code data refers to the correlation matrix, which is square. data.shape[0] == data.shape[1]
| The eigenvalues and CNG indices of the dataset | ||
| """ | ||
| data = corr(data) | ||
| if model == "factors": |
There was a problem hiding this comment.
If we're just going to have "factors", do we need the model argument? Does it make sense to exclude this for now?
There was a problem hiding this comment.
if model == "components", the adjustment underneath is not needed. In other words, only taking the correlation of the data is sufficient. We might want to add elif model != "components": raise ValueError
| if model == "factors": | ||
| data -= np.linalg.pinv(np.diag(np.diag(np.linalg.pinv(data)))) | ||
| # TODO: Should this line be here? | ||
| data = covariance_to_correlation(data) |
There was a problem hiding this comment.
So, are we assuming the input value here will always be eigenForm == "data"? If so, I think you're right that this should be removed, but I may be misunderstanding.
|
Thanks so much for submitting this PR! I had a few comments/questions. Let me know if any of them don't make sense, or if you just want me to submit a PR to merge into this branch. |
I implemented this function while working on a paper and decided it belongs in this excellent project. I will add tests (I have been comparing to output from R), but there is a major potential issue in the code which I will mark out with inline comments.