Zacharias 🐝 Voulgaris

1 year ago · 2 min. reading time

Testing a Hypothesis with BROOM

In data analytics and data science, we often need to test a hypothesis. This is usually a tiresome and confusing process, particularly for those new to the craft. Although formulating a hypothesis (and the corresponding null hypothesis) isn't that hard, testing it and drawing reliable conclusions isn't exactly easy. For starters, you need to define a reasonable alpha threshold, then check whether the variances of the two samples match, and (are you still with me?) then run the statistical test. In this example, I'm referring to the T-test, which is the most popular one and among the more robust options for the task at hand. Yet even that test requires the distributions of the two samples to be Normal (Gaussian), which may not always hold. Of course, even if the distributions are not Normal, you can still use the test, but someone may attack you on this point, especially if that person is looking for weaknesses in your analysis.
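For reference, the conventional workflow described above can be sketched in Python with SciPy. This is only an illustration of the classic T-test routine, not part of BROOM; the function name `classic_t_test`, the choice of Levene's test for the variance check, and the synthetic data are all assumptions made for the example.

```python
import numpy as np
from scipy import stats

def classic_t_test(x, y, alpha=0.05):
    """Illustrative helper: variance check, then a two-sample t-test."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)

    # Step 1: check whether the variances match (Levene's test is one option).
    _, p_var = stats.levene(x, y)
    equal_var = p_var > alpha

    # Step 2: Student's t-test if the variances match, Welch's variant otherwise.
    t_stat, p_value = stats.ttest_ind(x, y, equal_var=equal_var)

    # Step 3: compare the p-value against the alpha threshold.
    return t_stat, p_value, bool(p_value < alpha)

# Synthetic samples whose means differ by one standard deviation.
rng = np.random.default_rng(42)
a = rng.normal(loc=0.0, scale=1.0, size=200)
b = rng.normal(loc=1.0, scale=1.0, size=200)

t_stat, p_value, different = classic_t_test(a, b)
print(f"t = {t_stat:.2f}, p = {p_value:.3g}, different = {different}")
```

Note that the sketch does not check Normality, which is exactly the kind of gap a critical reader could point at.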

What if there were a way to run a test like that without all that frustration? Well, there are a few such ways, one of which is the V-test, which is part of the BROOM framework (see the corresponding article for more details on it). The V-test (I was going for Z-test, but that name was taken!) takes the two samples as inputs and yields two things: the v heuristic (the equivalent of the t-statistic) and a true-false value indicating whether the two samples are different enough or not. The threshold "th" is an optional parameter, like alpha in the T-test, but it doesn't take some arbitrary value like 0.05 or 0.01. It's just 0 unless you make it something else. The threshold is the maximum membership that the difference may have in relation to the distribution of the variable d, which is defined as the (actual) differences between the two samples. So, if a difference of 0 (which corresponds to the null hypothesis) lies way beyond the distribution of d (i.e., it has a membership of 0), then you are in the clear. Simple as that!

OK, maybe not that simple, since what happens if you have outliers? Well, that's where the (optional) "pop" parameter comes into play. This is another true-false variable, and it has to do with the "possibility of problematic data points" (i.e., outliers). If that parameter is set to true, the V-test function makes sure that all outliers are taken care of, so that they don't distort your analysis. This is done efficiently and methodically, so you don't have to worry about it at all.

What if the two variables contain a lot of data points, though? Well, there is another (optional) parameter for that too: ss, which stands for sample size (default value is 100). So, if you want this test to finish before you have to go home (or to bed, if you are already home), you can set it to a value that makes sense to you for the sampling process that ensues. Note that this is a deterministic kind of sample, a kind of summarization of the data. This way, even if you run the test several times on the same data, the corresponding samples, and therefore the results, are going to be the same every time.
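To make the idea of a deterministic sample more concrete, here is a minimal Python sketch. The article does not describe how BROOM summarizes the data, so this is an assumption: the hypothetical function `deterministic_sample` simply picks evenly spaced order statistics from the sorted data, which is one way to get the same sample on every run.

```python
import numpy as np

def deterministic_sample(x, ss=100):
    """Hypothetical helper: pick ss evenly spaced order statistics."""
    x = np.sort(np.asarray(x, dtype=float))
    if x.size <= ss:
        return x  # nothing to summarize
    # Evenly spaced positions across the sorted data, from min to max.
    idx = np.linspace(0, x.size - 1, num=ss).round().astype(int)
    return x[idx]

# The same data in a different order gives the same sample.
data = np.arange(1000, dtype=float)
shuffled = np.random.default_rng(1).permutation(data)

sample_1 = deterministic_sample(data, ss=100)
sample_2 = deterministic_sample(shuffled, ss=100)
print(np.array_equal(sample_1, sample_2))
```

Because no random draw is involved, repeated runs on the same data cannot disagree, which is the property described above.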

The V-test is designed to work with matrices as inputs too, as long as they have the same number of variables. This way, you can easily compare two datasets of the same dimensionality without having to go through the variables one by one. It's hard to overestimate the convenience of this if you have large datasets that you wish to analyze.
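As a point of comparison, the same kind of column-wise comparison can be sketched with the classic T-test in SciPy. This is not the V-test, just an illustration of comparing two datasets with the same number of variables in a single call; the synthetic matrices `A` and `B` and the 0.05 cutoff are assumptions made for the example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Two datasets with the same 3 variables (columns) but different row counts.
A = rng.normal(size=(300, 3))
B = rng.normal(size=(250, 3))
B[:, 1] += 1.0  # shift only the second variable

# axis=0 runs one Welch t-test per column in a single call.
t_stats, p_values = stats.ttest_ind(A, B, axis=0, equal_var=False)
different = p_values < 0.05

for j, (t, p, d) in enumerate(zip(t_stats, p_values, different)):
    print(f"variable {j}: t = {t:.2f}, p = {p:.3g}, different = {d}")
```

Only the number of columns has to match; the two datasets are free to have different numbers of rows.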

I can go on talking about the V-test and its merits until the cows come home. However, I’ll stop here as I don’t want to monopolize your time explaining this to death (something many Stats people tend to do). Feel free to let me know your thoughts on this in the comments section below. Cheers!

