I have a cancer data set that has the date of diagnosis, the date of death (if dead), and some patient features that I want to build and test a model on. I plan to randomly divide the data set into 50 percent training and 50 percent testing. I'm looking to predict 2-year survival.
The problem is, I am not sure how to approach this when I don't have any context about the deaths: I don't know whether the deaths were from cancer or from unrelated causes. I also don't know whether I should include the patients who are not dead, because I don't know how long they have been followed.
I really just don’t know how to approach the problem overall with respect to censoring. Can this be done?
What if I just used linear regression? How much could the model's accuracy realistically differ from that of a survival model that accounts for censoring?
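For what it's worth, censoring doesn't force you to drop the living patients: the Kaplan-Meier estimator uses each censored patient's follow-up time up to the moment they leave the risk set. A minimal sketch with made-up numbers (plain Python; in practice a library such as lifelines or scikit-survival would do this):

```python
def km_survival_at(t_query, durations, observed):
    """Kaplan-Meier estimate of S(t_query) under right censoring.

    durations: follow-up time in years (time to death, or time to last contact)
    observed:  1 if the death was observed, 0 if the patient was censored
    """
    pairs = sorted(zip(durations, observed))
    n_at_risk = len(pairs)
    surv = 1.0
    for t, death in pairs:
        if t > t_query:
            break
        if death:  # observed death at t: multiply in the conditional survival
            surv *= (n_at_risk - 1) / n_at_risk
        n_at_risk -= 1  # deaths AND censorings both leave the risk set
    return surv

# toy data (hypothetical): deaths at 0.5, 1.5, 2.5y; censoring at 1.0, 1.8, 3.0y
durations = [0.5, 1.0, 1.5, 1.8, 2.5, 3.0]
observed  = [1,   0,   1,   0,   1,   0]

print(km_survival_at(2.0, durations, observed))  # KM estimate of S(2) -> 0.625

# naive estimate that assumes every censored patient survived to 2 years
deaths_by_2y = sum(1 for t, d in zip(durations, observed) if d and t <= 2.0)
print(1 - deaths_by_2y / len(durations))
```

Here the KM estimate of 2-year survival (0.625) differs from the naive count (about 0.667), because the naive version silently assumes the two patients censored before 2 years both survived; with heavier censoring the gap grows, which is the bias a regression that ignores censoring inherits.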
I'm checking the stability of a sample under 5 different conditions: acidic, basic, oxidative, neutral solution, and solid state. I'm trying to force it to degrade so I can see what new impurities arise.
For those unfamiliar with chromatography: basically I'll run the sample on the HPLC machine to get a chromatogram that shows me its peak area. Over time, if there is degradation, the peak area should decrease, and if new impurities form, new peaks should start popping up. I need to make sure that any new impurities are all separated well enough from the main product peak. This will tell me whether the chromatographic method I've built is "stability indicating".
I have 6 data points of the peak purity at the initial condition, run on different days. My assumption is that I can average these purities, treat the result as the "true purity" of the material, and construct an interval (alpha = 0.05) that would be considered "not degraded". Then I take the samples that underwent a forced degradation condition, record the peak purity, and see whether it falls outside that interval.
My question is: should it be 2-sided or 1-sided? Since I'm looking at degradation, I think 1-sided makes sense. But theoretically the peak could also grow, if there happens to be an impurity hiding underneath it, adding to its peak area. I'm not sure which is the correct choice.
So now I'll have two conditions against which I can check whether the method is stability indicating. First, if my sample degrades and I can see that all the peaks are well separated, then I can conclude it's stability indicating. Second, if my sample's peak purity actually increases, then it means there is a peak hiding underneath that I cannot detect, and therefore my method is not stability indicating.
Are these valid conclusions?
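As a sketch of the interval arithmetic in the setup above (all numbers hypothetical, t critical values hard-coded for df = 5): since each stressed sample contributes a single new measurement, a prediction interval for one future observation is the natural comparison, and it is wider than a confidence interval for the mean of the 6 baseline runs.

```python
import math
import statistics

# hypothetical baseline peak purities (%) from 6 injections on different days
baseline = [99.1, 99.3, 99.0, 99.2, 99.4, 99.2]
n = len(baseline)
xbar = statistics.mean(baseline)
s = statistics.stdev(baseline)  # sample standard deviation (n - 1 denominator)

# half-width factor for predicting ONE future measurement,
# which is what a single stressed-sample purity is
se_pred = s * math.sqrt(1 + 1 / n)

t_one_sided = 2.015  # t quantile at 0.95,  df = 5
t_two_sided = 2.571  # t quantile at 0.975, df = 5

lower_1s = xbar - t_one_sided * se_pred  # 1-sided: flags degradation only
lower_2s = xbar - t_two_sided * se_pred  # 2-sided: also flags peak growth
upper_2s = xbar + t_two_sided * se_pred

print(f"1-sided: purity below {lower_1s:.2f} -> degraded")
print(f"2-sided: purity outside [{lower_2s:.2f}, {upper_2s:.2f}] -> changed")
```

Note the trade-off this makes concrete: for the same alpha, the 1-sided lower limit sits closer to the mean (more power against degradation), while the 2-sided interval is the one that can also catch the "peak grows because an impurity hides underneath" case you describe.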
Can anyone explain how to connect or bridge the ideas between sample size vs sampling plans?
For sample size, I believe this is defined as sample sizes for statistical tests (e.g. t-test, ANOVA) and sample sizes for population proportions.
For sampling plans, I believe this is related to OC curves and RQL/AQL/LTPD and you end up with a sample size.
I don't quite understand "sampling plans" and am trying to connect my understanding of "sample size" to them, so that I can bridge the 2 concepts together.
Can anyone help explain the similarities or differences between the two?
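One concrete bridge: the OC curve of a single-sampling plan (n, c) is just the power curve of a one-sided binomial test on the lot's defect fraction, so choosing n and c to hit a producer's risk at the AQL and a consumer's risk at the RQL/LTPD is the same kind of calculation as sizing a test for a given alpha and power. A minimal sketch (plan numbers are hypothetical):

```python
from math import comb

def prob_accept(n, c, p):
    """P(accept lot) under a single-sampling plan (n, c):
    accept if at most c defectives are found in a random sample of n.
    This is the binomial CDF at c, i.e. one point on the OC curve."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(c + 1))

# hypothetical plan: inspect n = 50 units, accept if at most c = 1 defective
for p in (0.01, 0.05, 0.10):  # candidate true lot fractions defective
    print(f"p = {p:.2f}: P(accept) = {prob_accept(50, 1, p):.3f}")
```

Sweeping p traces out the whole OC curve; a "sample size calculation" here means searching over (n, c) until P(accept) is high (e.g. 0.95) at the AQL and low (e.g. 0.10) at the RQL, exactly analogous to picking n in a t-test to control alpha and beta.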
I am struggling to understand how to identify variables as confounders or effect modifiers. I do not know when to even suspect a variable of being one or the other.
Fundamental question: is every variable either an effect modifier (important to our analysis) or a confounding factor (not helping our analysis)? Or is there another classification of variable that is neither an effect modifier nor a confounder?
All help is appreciated!
I am currently trying to wrap my head around HMC. So far I understand that we use an integration scheme to traverse the subspace of points with equal energy as indicated by the Hamiltonian. The initial draw of momentum acts as "jitter" for the energy of the current state, so that we traverse areas with different energies. The integrator is typically a symplectic integrator, e.g. leapfrog, run for N steps. Since the integration is not perfect, we have to use a Metropolis acceptance check to decide whether we keep or discard the sample.
So far I see people discuss using N on the order of 5-10. But I have trouble understanding why we choose that number of leapfrog steps rather than simply sub-sampling the Markov chain: perform a single step of leapfrog integration, check whether it would be accepted or not, repeat, and afterwards pick every N-th sample. From the point of view of the acceptance check it should not make much of a difference, because if we made an error so large in the first step that we would not accept the sample, what would make us think that this error would become smaller later? So it must have something to do with sampling the momentum at every step? Does this destroy some important property?
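The difference the momentum resampling makes can be shown empirically. Below is a minimal sketch (plain Python, standard normal target, step sizes and sample counts are my own choices): with N leapfrog steps per proposal the chain makes long coherent moves along a trajectory, while resampling the momentum after every single step (the sub-sampling idea above) turns the dynamics into a diffusive random walk; the lag-1 autocorrelation makes this visible.

```python
import math
import random

def hmc(n_samples, n_leapfrog, eps, seed=0):
    """Minimal HMC targeting a standard normal: U(q) = q^2/2, grad U(q) = q."""
    rng = random.Random(seed)
    q = 0.0
    samples = []
    for _ in range(n_samples):
        p = rng.gauss(0.0, 1.0)            # fresh momentum = new energy level
        h_old = 0.5 * q * q + 0.5 * p * p  # H = U(q) + K(p)
        q_new, p_new = q, p
        # leapfrog: N steps along an (approximately) constant-energy trajectory
        p_new -= 0.5 * eps * q_new
        for step in range(n_leapfrog):
            q_new += eps * p_new
            if step != n_leapfrog - 1:
                p_new -= eps * q_new
        p_new -= 0.5 * eps * q_new
        # ONE Metropolis check for the whole trajectory corrects the
        # accumulated integration error (exact dynamics would give h_new == h_old)
        h_new = 0.5 * q_new * q_new + 0.5 * p_new * p_new
        if rng.random() < math.exp(min(0.0, h_old - h_new)):
            q = q_new
        samples.append(q)
    return samples

def lag1_autocorr(xs):
    m = sum(xs) / len(xs)
    num = sum((a - m) * (b - m) for a, b in zip(xs, xs[1:]))
    den = sum((a - m) ** 2 for a in xs)
    return num / den

# long trajectories (N = 10) vs single-step proposals with the momentum
# resampled every iteration (N = 1), both with the same step size
long_traj = hmc(5000, 10, 0.2)
single    = hmc(5000, 1, 0.2)
print(lag1_autocorr(long_traj), lag1_autocorr(single))
```

Intuitively: within one trajectory the momentum is kept, so consecutive leapfrog steps all push q in a consistent direction and the proposal moves a distance of roughly N * eps; resampling the momentum each step erases that direction, so the chain only drifts by about eps * sqrt(N) over the same number of steps. The accepted error per proposal is comparable in both schemes, which is why the acceptance check alone doesn't reveal the difference.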