Zacharias 🐝 Voulgaris

2 years ago · 2 min. reading time · ~10 ·

Testing a Hypothesis with BROOM


In data analytics and data science, we often need to test a hypothesis. This is usually a tiresome and confusing process, particularly for those new to the craft. Although formulating a hypothesis (and the corresponding null hypothesis) isn't that hard, testing it and drawing reliable conclusions isn't exactly easy. For starters, you need to define a reasonable alpha threshold, then check whether the variances of the two samples match, and (are you still with me?) then run the statistical test. In this example, I'm referring to the T-test, which is the most popular test and among the more robust ones for the task at hand. Yet even that test assumes that the distributions of the two samples are Normal (Gaussian), something that may not always hold true. Of course, even if the distributions are not Normal, you can still use the test, but someone may attack you on this point, especially if that person is looking for weaknesses in your analysis.
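To make those classical steps concrete, here is a minimal Python sketch using only the standard library. It assumes the unequal-variance (Welch) flavor of the T-test, and the sample values are made up for illustration. It stops at the test statistic; turning that into a p-value to compare against alpha requires the t distribution, which a stats library such as SciPy provides.

```python
import math
import statistics

def welch_t(x, y):
    """Return (t statistic, approximate degrees of freedom) for Welch's t-test."""
    nx, ny = len(x), len(y)
    vx, vy = statistics.variance(x), statistics.variance(y)  # sample variances
    se2 = vx / nx + vy / ny  # squared standard error of the mean difference
    t = (statistics.mean(x) - statistics.mean(y)) / math.sqrt(se2)
    # Welch-Satterthwaite approximation of the degrees of freedom
    df = se2 ** 2 / ((vx / nx) ** 2 / (nx - 1) + (vy / ny) ** 2 / (ny - 1))
    return t, df

alpha = 0.05  # step 1: pick the significance threshold up front

# Illustrative (made-up) samples
x = [5.1, 4.9, 5.6, 5.8, 6.0, 5.2]
y = [6.3, 6.8, 7.1, 6.5, 6.9, 7.4]

# Step 2: a rough look at whether the two variances are comparable
variance_ratio = statistics.variance(x) / statistics.variance(y)

# Step 3: the test statistic; the p-value to compare against alpha would
# come from the t distribution with df degrees of freedom
t, df = welch_t(x, y)
print(f"variance ratio = {variance_ratio:.2f}, t = {t:.2f}, df = {df:.1f}")
```

Even in this stripped-down form there are three separate decisions to make (alpha, equal or unequal variances, which test), which is the frustration the rest of this post is about.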

What if there was a way to run a test like that without all that frustration? Well, there are a few such ways, one of which is the V-test, which is part of the BROOM framework (see the corresponding article for more details on it). The V-test (I was going for Z-test, but that name was taken!) takes the two samples as inputs and yields two things: the v heuristic (the t-statistic equivalent) and a true-false value indicating whether the two samples are different enough. That true-false value depends on a threshold, "th", which is an optional parameter like alpha in the T-test, but it doesn't have some arbitrary value like 0.05 or 0.01. It's just 0 unless you make it something else. The threshold is the maximum membership that a difference of 0 is allowed to have in the distribution of the variable d, which is defined as the (actual) differences of the two samples. So, if that difference of 0 (which relates to the null hypothesis) is way beyond the distribution of d (i.e., it has a membership of 0), then you are in the clear. Simple as that!
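The membership idea can be illustrated with a toy Python sketch. To be clear, this is not the BROOM V-test: the function names and the membership formula are assumptions invented for illustration, and the v heuristic itself is not reproduced here. The sketch only shows how a threshold of 0 on "the membership of 0 in the distribution of d" can act as a decision rule.

```python
def zero_membership(x, y):
    """Toy membership of 0 in the distribution of pairwise differences d.

    NOT the BROOM implementation: a hypothetical stand-in. Returns 0.0 when
    0 lies outside the range of d, and approaches 1.0 when 0 sits near the
    median of d.
    """
    d = [a - b for a in x for b in y]  # all (actual) pairwise differences
    below = sum(1 for v in d if v <= 0) / len(d)
    above = sum(1 for v in d if v >= 0) / len(d)
    return min(1.0, 2 * min(below, above))

def toy_difference_test(x, y, th=0.0):
    """Return (membership, different); different is True when membership <= th."""
    m = zero_membership(x, y)
    return m, m <= th

# Well-separated samples: 0 is outside the range of d, so membership is 0
print(toy_difference_test([1, 2, 3], [10, 11, 12]))

# Overlapping samples: 0 is inside the range of d, so they are not flagged
print(toy_difference_test([1, 2, 3], [2, 3, 4]))
```

With the default `th=0.0`, this toy rule only flags the samples as different when no pairwise difference reaches 0; raising `th` makes it more permissive.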

OK, maybe not that simple, since what happens if you have outliers? Well, that’s where the (optional) “pop” parameter comes into play. This is another true-false variable that has to do with the "possibility of problematic data points" (i.e., outliers). If that parameter is set to true, the V-test function makes sure that all outliers are taken care of, so that they don't distort your analysis. This is done efficiently and methodically, so you don't have to worry about it at all.

What if the two variables contain a lot of data points though? Well, there is another (optional) parameter for that too: ss, which stands for sample size (default value is 100). So, if you want this test to finish before you have to go home (or to bed, if you are already home), you can set it to a value that makes sense to you for the sampling process that ensues. Note that this is a deterministic kind of sample, a kind of summarization of the data. This way, even if you run this test several times on the same data, the corresponding samples, and therefore the results, are going to be the same every time.
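One way to picture a deterministic sample is to take evenly spaced order statistics from the sorted data. The Python sketch below does that; it is an assumption about what such a summarization could look like, not BROOM's actual sampling method, and the function name is hypothetical. Only the `ss` parameter name and its default of 100 come from the description above.

```python
def deterministic_sample(values, ss=100):
    """Pick up to ss evenly spaced order statistics from values (ss >= 1).

    Hypothetical stand-in for a deterministic summary of the data; the
    result depends only on the values themselves, never on a random seed.
    """
    ordered = sorted(values)
    n = len(ordered)
    if n <= ss:
        return ordered  # nothing to summarize
    if ss == 1:
        return [ordered[n // 2]]  # a single middle value
    # Evenly spaced positions from the minimum (0) to the maximum (n - 1)
    return [ordered[round(i * (n - 1) / (ss - 1))] for i in range(ss)]

data = list(range(1000))
sample = deterministic_sample(data, ss=5)
print(sample)

# The input order does not matter, so repeated runs give the same sample
print(deterministic_sample(list(reversed(data)), ss=5) == sample)
```

Because nothing here is random, running the downstream test twice on the same data necessarily yields the same outcome.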

The V-test is designed to work with matrices as inputs too, as long as they have the same number of variables. This way, you can easily compare two datasets of the same dimensionality, without having to take each variable one by one. It’s hard to overstate the convenience of this if you have large datasets that you wish to analyze.

I can go on talking about the V-test and its merits until the cows come home. However, I’ll stop here as I don’t want to monopolize your time explaining this to death (something many Stats people tend to do). Feel free to let me know your thoughts on this in the comments section below. Cheers!
