Testing a Hypothesis with BROOM
In data analytics and data science, we often need to test a hypothesis. This is usually a tiresome and confusing process, particularly for those new to the craft. Although formulating a hypothesis (and the corresponding null hypothesis) isn't that hard, testing it and drawing reliable conclusions isn't exactly easy. For starters, you need to define a reasonable alpha threshold, then check whether the variances of the two samples match (this determines which variant of the test applies), and (are you still with me?) then run the statistical test. In this example, I'm referring to the T-test, which is the most popular one and the more robust option for the task at hand. Yet even that test assumes the distributions of the two samples are Normal (Gaussian), something that may not always hold true. Of course, even if the distributions are not Normal, you can still use the test, but someone may attack you on this point, especially if that person is looking for weaknesses in your analysis.
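To make the classical routine above a bit more concrete, here is a minimal Python sketch of its number-crunching part. It uses Welch's variant of the T-test, which does not assume equal variances, and relies only on the standard library. The function names (welch_t, variance_ratio) and the sample values are made up for illustration. Turning the t-statistic into a p-value that you can compare against alpha requires the CDF of the t-distribution, which the standard library does not provide, so the sketch stops at the statistic and the degrees of freedom.

```python
import math
import statistics


def variance_ratio(a, b):
    """Ratio of the larger sample variance to the smaller one.

    A value far from 1 suggests the equal-variance assumption is shaky.
    """
    va, vb = statistics.variance(a), statistics.variance(b)
    return max(va, vb) / min(va, vb)


def welch_t(a, b):
    """Return (t, df) for Welch's unequal-variance t-test."""
    na, nb = len(a), len(b)
    va, vb = statistics.variance(a), statistics.variance(b)  # sample variances
    se2 = va / na + vb / nb  # squared standard error of the mean difference
    t = (statistics.fmean(a) - statistics.fmean(b)) / math.sqrt(se2)
    # Welch-Satterthwaite approximation of the degrees of freedom
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df


# Illustrative values only, not real measurements
sample_a = [1, 2, 3, 4, 5]
sample_b = [2, 3, 4, 5, 6]

t_stat, dof = welch_t(sample_a, sample_b)
print(variance_ratio(sample_a, sample_b), t_stat, dof)
```

Even in this stripped-down form, you still have to pick alpha, look up (or compute) the critical value for the given degrees of freedom, and defend the Normality assumption yourself.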
What if there was a way to run a test like that without all that frustration? Well, there are a few such ways, one of which is the V-test, which is part of the BROOM framework (see the corresponding article for more details on it). The V-test (I was going for Z-test but that name was taken!) takes the two samples as inputs and yields two things: the v heuristic (the equivalent of the t-statistic) and a true-false value indicating whether the two samples are different enough. That verdict relies on a threshold, "th", an optional parameter that plays a role similar to alpha in the T-test but doesn't have some arbitrary value like 0.05 or 0.01. It's just 0 unless you make it something else. The threshold is the maximum membership that a difference of 0 is allowed to have in the distribution of the variable d, which is defined as the (actual) differences of the two samples. So, if that difference of 0 (which relates to the null hypothesis) is way beyond the distribution of d (i.e., it has a membership of 0), then you are in the clear. Simple as that!
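For readers who think better in code, here is a rough Python sketch of the decision rule described above. To be clear, this is not the BROOM implementation. It assumes that d consists of all pairwise differences between the two samples, and it uses a made-up triangular membership function (1 at the median of d, falling to 0 at its minimum and maximum). The v heuristic itself is not reproduced, since its formula isn't given here. The names zero_membership and differ are hypothetical.

```python
import statistics
from itertools import product


def zero_membership(x, y):
    """Membership of the value 0 in the distribution of d (hypothetical).

    ASSUMPTION: d is the set of all pairwise differences a - b, and the
    membership function is triangular, peaking at the median of d.
    """
    d = [a - b for a, b in product(x, y)]
    lo, mid, hi = min(d), statistics.median(d), max(d)
    if not lo < 0 < hi:
        return 0.0  # 0 lies on or beyond the edge of d's range
    if mid >= 0:
        return (0 - lo) / (mid - lo)  # rising side of the triangle
    return (hi - 0) / (hi - mid)  # falling side of the triangle


def differ(x, y, th=0.0):
    """Return (membership of 0, verdict) using the threshold th."""
    m = zero_membership(x, y)
    return m, m <= th


# Illustrative values only
print(differ([10, 11, 12], [1, 2, 3]))  # clearly separated samples
print(differ([1, 2, 3], [1, 2, 3]))  # identical samples
```

With th left at 0, the samples count as different only when 0 has no membership at all in d. Raising th makes the verdict more lenient, much like raising alpha does in the T-test.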
OK, maybe not that simple, since what happens if you have outliers? Well, that’s where the (optional) “pop” parameter comes into play. This is another true-false variable, and its name stands for "possibility of problematic data points" (i.e., outliers). If that parameter is set to true, the V-test function makes sure that all outliers are taken care of, so that they don't distort your analysis. This is done efficiently and methodically, so you don't have to worry about it at all.
What if the two variables contain a lot of data points though? Well, there is another (optional) parameter for that too: ss, which stands for sample size (default value is 100). So, if you want this test to finish before you have to go home (or to bed, if you are already home), you can set it to a value that makes sense to you for the sampling process that ensues. Note that this is a deterministic kind of sample, a kind of summarization of the data. This way, even if you run the test several times on the same data, the corresponding samples, and therefore the results, are going to be the same every time.
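To illustrate what a deterministic sample could look like, here is a minimal Python sketch. It is an assumption on my part rather than the sampling method BROOM actually uses: it sorts the data and picks ss evenly spaced order statistics, so the sample summarizes the whole range of the variable and never changes between runs. The function name deterministic_sample is hypothetical.

```python
def deterministic_sample(values, ss=100):
    """Pick ss evenly spaced order statistics from the data.

    ASSUMPTION: this is just one way to build a repeatable summary of
    the data; no random numbers are involved, so the output is the same
    on every run and does not depend on the order of the input.
    """
    ordered = sorted(values)
    n = len(ordered)
    if n <= ss:
        return ordered  # nothing to reduce
    if ss == 1:
        return [ordered[n // 2]]  # a single middle element
    step = (n - 1) / (ss - 1)  # spacing between the chosen positions
    return [ordered[round(i * step)] for i in range(ss)]


# Illustrative values only
data = [5, 3, 1, 4, 2, 9, 7, 8, 6, 0, 10]
print(deterministic_sample(data, ss=3))
```

Because the sample always includes the smallest and the largest values, the extremes of the data survive the summarization, which is something a random sample cannot guarantee.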
The V-test is designed to work with matrices as inputs too, as long as they have the same number of variables. This way, you can easily compare two datasets of the same dimensionality, without having to go through the variables one by one. It’s hard to overstate the convenience of this if you have large datasets that you wish to analyze.
I can go on talking about the V-test and its merits until the cows come home. However, I’ll stop here as I don’t want to monopolize your time explaining this to death (something many Stats people tend to do). Feel free to let me know your thoughts on this in the comments section below. Cheers!