In Mitchell’s Machine Learning textbook, the maximum likelihood hypothesis is defined as argmax P(D|h) over all h in the hypothesis space H, where D is the set of all observed target values d and P is a probability. When the target values are no longer assumed to be noise-free, each one is modeled as d = f(x) + e, where e is a continuous random variable with an associated probability density function and f(x) is the true (noise-free) target value. This means that each observed target value d is now a continuous random variable.

When generalizing the maximum likelihood hypothesis to these continuous variables, Mitchell replaces argmax P(D|h) with argmax p(D|h), where p is the probability density function (I write p here rather than f to avoid a clash with the target function f). How is this valid? First of all, a probability density is not the same thing as a probability. He then proceeds to say this new formula is the same as

argmax p(d1|h) * p(d2|h) * … * p(dn|h) over all h in H

since the target values d1…dn are mutually independent given h. But if the probability density were the same thing as the probability, every term in this product would be zero, since the probability of any particular continuous value occurring is exactly zero.
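To make the distinction concrete, here is a minimal Python sketch of the density-based selection being asked about. This is my own illustration, not Mitchell's: the data points, the Gaussian noise assumption for e, and the two candidate hypotheses are all hypothetical. The point is that the quantity maximized is a product of density values, which is strictly positive even though the probability of observing any exact d_i is zero:

```python
import math

def gaussian_pdf(x, mean, sigma):
    # Probability *density* of Normal(mean, sigma) at x. This is a density
    # value, not a probability: it is nonzero at individual points and can
    # even exceed 1, whereas P(d = x) for a continuous d is always 0.
    return math.exp(-0.5 * ((x - mean) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def log_likelihood(h, xs, ds, sigma):
    # log of product_i p(d_i | h) = sum_i log p(d_i | h), assuming the
    # d_i are independent given h and e ~ Normal(0, sigma).
    return sum(math.log(gaussian_pdf(d, h(x), sigma)) for x, d in zip(xs, ds))

# Hypothetical noisy observations of d = 2x + e.
xs = [0.0, 1.0, 2.0, 3.0]
ds = [0.1, 2.2, 3.9, 6.1]

# Two hypothetical candidate hypotheses.
hypotheses = {"h1: d = 2x": lambda x: 2 * x,
              "h2: d = x":  lambda x: x}

# The maximum likelihood hypothesis: argmax over h of product_i p(d_i | h),
# computed in log space for numerical stability.
best = max(hypotheses, key=lambda name: log_likelihood(hypotheses[name], xs, ds, sigma=1.0))
print(best)
```

Maximizing the log of the product rather than the product itself changes nothing about the argmax, and it sidesteps underflow when many small density values are multiplied together.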