Zacharias 🐝 Voulgaris

2 years ago · 2 min. reading time

Testing a Hypothesis with BROOM

Source: pixabay.com

In data analytics and data science, we often need to test a hypothesis. This is usually a tiresome and confusing process, particularly for those new to the craft. Although formulating a hypothesis (and the corresponding null hypothesis) isn't that hard, testing it and drawing reliable conclusions isn't exactly easy. For starters, you need to define a reasonable alpha threshold, then check whether the variances of the two samples match, and (are you still with me?) then run the statistical test. In this example, I'm referring to the T-test, which is the most popular test and the more robust one for the task at hand. Yet even that test assumes that the distributions of the two samples are Normal (Gaussian), something that doesn't always hold true. Of course, even if the distributions are not Normal, you can still use the test, but someone may attack you on this point, especially if that person is looking for weaknesses in your analysis.
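
To make those steps concrete, here is a minimal Python sketch of the classical route, using only the standard library. The function name and the `max_var_ratio` rule of thumb for "do the variances match" are assumptions made for illustration (a formal variance test could be used instead), and the sketch assumes neither sample is constant.

```python
from math import sqrt
from statistics import mean, variance


def classical_t_statistic(x, y, max_var_ratio=4.0):
    """Return (t, df, pooled) for two independent samples.

    max_var_ratio is a hypothetical rule-of-thumb cutoff: if the larger
    sample variance exceeds the smaller one by more than this factor,
    the variances are treated as unequal and Welch's form is used.
    """
    nx, ny = len(x), len(y)
    vx, vy = variance(x), variance(y)  # sample variances (n - 1 denominator)
    pooled = max(vx, vy) <= max_var_ratio * min(vx, vy)
    if pooled:
        # Student's t-test with a pooled variance estimate
        sp2 = ((nx - 1) * vx + (ny - 1) * vy) / (nx + ny - 2)
        se = sqrt(sp2 * (1 / nx + 1 / ny))
        df = nx + ny - 2
    else:
        # Welch's t-test with Welch-Satterthwaite degrees of freedom
        a, b = vx / nx, vy / ny
        se = sqrt(a + b)
        df = (a + b) ** 2 / (a ** 2 / (nx - 1) + b ** 2 / (ny - 1))
    t = (mean(x) - mean(y)) / se
    return t, df, pooled
```

To finish the test, you would look up the p-value of `t` in a t-distribution with `df` degrees of freedom (from a table or a stats library) and compare it against your alpha.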

What if there were a way to run a test like that without all that frustration? Well, there are a few such ways, one of which is the V-test, which is part of the BROOM framework (see the corresponding article for more details on it). The V-test (I was going for Z-test but that name was taken!) takes the two samples as inputs and yields two things: the v heuristic (the t-statistic equivalent) and a true-false value stating whether the two samples are different enough or not. That verdict depends on a threshold, "th", which is an optional parameter, like the alpha one in the T-test, but without some arbitrary value like 0.05 or 0.01. It's just 0 unless you make it something else. This threshold is the maximum membership that a difference of 0 is allowed to have in the distribution of the variable d, which is defined as the (actual) differences of the two samples. So, if that difference of 0 (which relates to the null hypothesis) is way beyond the distribution of d (i.e., it has a membership of 0), then you are in the clear. Simple as that!
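
To give a feel for the "membership of 0 in the distribution of d" idea, here is a small Python sketch. It is an illustrative stand-in and not the BROOM V-test: the function name, the use of all pairwise differences for d, and the way membership is computed are assumptions of mine, and the sketch does not compute the v heuristic.

```python
from itertools import product


def zero_membership_test(x, y, th=0.0):
    """Illustrative stand-in, NOT the BROOM V-test.

    d holds every pairwise difference between the two samples. The
    "membership" of 0 is taken here (an assumption) as twice the share
    of d lying on the less populated side of 0, capped at 1. It is 0
    when 0 falls outside the range of d and 1 when 0 sits at its median.
    """
    d = [a - b for a, b in product(x, y)]
    at_or_below = sum(v <= 0 for v in d) / len(d)
    at_or_above = sum(v >= 0 for v in d) / len(d)
    membership = min(1.0, 2 * min(at_or_below, at_or_above))
    # The samples count as "different enough" when the membership of 0
    # does not exceed the threshold (0 by default).
    different = membership <= th
    return membership, different
```

With the default `th=0`, the samples are flagged as different only when every pairwise difference has the same sign, i.e., when 0 lies completely outside d.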

OK, maybe not that simple, since what happens if you have outliers? Well, that’s where the (optional) “pop” parameter comes into play. This is another true-false variable that has to do with the "possibility of problematic data points" (i.e., outliers). If that parameter is set to true, then the V-test function makes sure that all outliers are taken care of, so that they don't distort your analysis. This is done efficiently and methodically, so you don't have to worry about it at all.
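
The article doesn't say how the V-test deals with outliers, so the following Python sketch only shows one common way to do such a clean-up step by hand: Tukey's IQR fences. The function name and the `k` multiplier are assumptions for illustration, not part of BROOM.

```python
from statistics import quantiles


def drop_outliers_iqr(values, k=1.5):
    """Keep only the points inside Tukey's fences.

    Assumption: this is a generic outlier filter, not what the
    "pop" parameter does internally.
    """
    q1, _, q3 = quantiles(values, n=4)  # quartiles of the sample
    spread = q3 - q1                    # interquartile range
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if low <= v <= high]
```

For example, `drop_outliers_iqr([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])` drops the 100 and keeps the rest.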

What if the two variables contain a lot of data points though? Well, there is another (optional) parameter for that too: ss, which stands for sample size (default value is 100). So, if you want this test to finish before you have to go home (or to bed, if you are already home), you can set it to a value that makes sense to you for the sampling process that ensues. Note that this is a deterministic kind of sample, a kind of summarization of the data. This way, even if you run this test several times on the same data, the corresponding samples, and therefore the results, are going to be the same every time.
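
One simple way to get a deterministic, summary-like sample is to sort the data and pick evenly spaced points. The Python sketch below does just that. It is an assumption about what such a sample could look like, not the sampling scheme BROOM actually uses; only the `ss` name and its default of 100 come from the description above.

```python
def deterministic_sample(values, ss=100):
    """Return up to ss evenly spaced order statistics of values.

    No randomness is involved, so the same data always yields the
    same sample, regardless of the order of the input.
    """
    if ss < 1:
        raise ValueError("ss must be at least 1")
    ordered = sorted(values)
    n = len(ordered)
    if n <= ss:
        return ordered
    if ss == 1:
        return [ordered[n // 2]]
    step = (n - 1) / (ss - 1)  # spacing between picked positions
    return [ordered[round(i * step)] for i in range(ss)]
```

Because the picks always include the minimum and the maximum, the sample keeps the range of the data, which is worth keeping in mind if outliers are present.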

The V-test is designed to work with matrices as inputs too, as long as they have the same number of variables. This way, you can easily compare two datasets of the same dimensionality without having to go through the variables one by one. It’s hard to overestimate the convenience of this if you have large datasets that you wish to analyze.
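
If your test of choice only accepts single variables, the column-by-column comparison is easy to wrap yourself. Here is a minimal Python sketch, where `compare_columns` and `mean_gap` are hypothetical names and the datasets are plain lists of rows.

```python
from statistics import mean


def compare_columns(a, b, test):
    """Apply a two-sample test to each pair of matching columns.

    a and b are lists of rows; they may differ in the number of rows
    but must have the same number of variables (columns).
    """
    if len(a[0]) != len(b[0]):
        raise ValueError("datasets must have the same number of variables")
    cols_a = zip(*a)  # transpose rows into columns
    cols_b = zip(*b)
    return [test(list(ca), list(cb)) for ca, cb in zip(cols_a, cols_b)]


def mean_gap(x, y):
    """Toy stand-in for a real test: the difference of the means."""
    return mean(x) - mean(y)
```

Calling `compare_columns(a, b, mean_gap)` returns one result per variable. Any function that takes two samples can be plugged in instead of `mean_gap`.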

I can go on talking about the V-test and its merits until the cows come home. However, I’ll stop here as I don’t want to monopolize your time explaining this to death (something many Stats people tend to do). Feel free to let me know your thoughts on this in the comments section below. Cheers!

#Normal #Hypothesis #Gaussian #Stats #BROOM
