A/B testing original design vs. random template bought from ThemeForest (candyjapan.com)
142 points by janpio on March 15, 2019 | 44 comments



    During this time there were a total of 67 new
    subscriptions. Of these 58% (39) came from the
    new design and 42% (28) came from the old design.
    Looks like the new one is a clear winner.
Is it? This seems like too small a sample to settle on a clear winner.

Using R's prop.test, I get a p-value of 0.22. (Type "prop.test(39, 67)" to calculate it.)

I think this means that in a world where it makes no difference which design is used, you would get a result as significant as this 22% of the time.

An alternative is the Adjusted Wald method. You can try it online here:

https://measuringu.com/wald/

This gives confidence intervals which also range from "could be better" to "could be worse", even when you reduce the confidence level from the typical 95% to 90%.

    a quick check with an A/B testing calculator
    even says that this result has significance
    (~90% likely)
Which calculator was that?


Using the techniques described on my blog [0], which are ideal for KPIs like conversion rates and small sample sizes (since no Gaussian approximation is made), I get a p-value of 0.177, which is not significant. The observed treatment effect is a 36.8% lift in conversion rate, but a confidence interval on this effect has endpoints -10.7% and +97.4%. Anything in that range would be considered consistent with the observed result at a 0.05 significance threshold.

With 6000 total impressions and a 50/50 split, the experiment is only able to reliably detect a 74% lift in conversion rate (with power = 80%).

If you want to rigorously determine the impact, decide what effect size you hope to see. Use a power calculator to decide the sample size needed to detect that effect size. Administer the test, waiting to acquire the planned sample size. When analyzing, be sure to compute a p-value and a confidence interval on the treatment effect.
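
A rough sketch of that kind of power calculation in R, assuming a baseline conversion rate of about 1% and 3000 impressions per arm (both are assumptions; the exact detectable lift depends on the baseline and test you plug in):

    # Smallest treatment-arm conversion rate detectable with 80% power,
    # given 3000 impressions per arm and a ~1% baseline (assumed figures):
    power.prop.test(n = 3000, p1 = 0.01, power = 0.80, sig.level = 0.05)

    # Planning ahead instead: impressions per arm needed to detect a 20%
    # relative lift (1.0% -> 1.2%) with 80% power:
    power.prop.test(p1 = 0.01, p2 = 0.012, power = 0.80, sig.level = 0.05)

The second call comes out to tens of thousands of impressions per arm, which lines up with the point that small tweaks take a very long time to test at this traffic level.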

[0] https://www.adventuresinwhy.com/post/ab-testing-random-sampl...


That's not how you use prop.test. What you've tested using that invocation is the null hypothesis that the underlying probability of 39/67 is 0.5.

If you want to perform a test of a difference of two proportions, you need to do:

prop.test(c(39, 67), c(total_group_a_impressions, total_group_b_impressions))

I don't have experience with A/B testing, so I'm not sure if this is typically or best handled using this particular statistical test.

Edit: The first parameter should be c(39, 28), meaning the total conversions in each group. I have no excuse beyond being tired.

Edit 2: To clarify, I think he should still use the two-sample form of prop.test, especially since we did not know at the time of his posting that the sample sizes are equal.
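
For reference, a sketch of that two-sample call with the corrected counts, taking the roughly 3000 impressions per arm mentioned elsewhere in the thread as an assumption:

    # Two-sample test of equal conversion rates; the 3000 impressions per arm
    # are assumed, not measured.
    prop.test(x = c(39, 28), n = c(3000, 3000))
    # X-squared ~ 1.51, p-value ~ 0.22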


    What you've tested using that invocation is the
    null hypothesis that the underlying probability
    of 39/67 is 0.5.
Isn't that equivalent to my interpretation of the test result? "In a world where it makes no difference which design is used, you would get a result as significant as this 22% of the time".

    If you want to perform a test of a difference of two proportions, you need to do:
    prop.test(c(39, 67), c(total_group_a_impressions, total_group_b_impressions))
Do you mean c(39,28)? Because group_a had 39 hits and group_b had 28. Doing so with the group sizes Bemmu stated (3000/3000) also gives me a p-value of 0.22.

As long as the group sizes are equal, the test is not very sensitive to the sizes.


I think there is a difference in the approaches, given that the Chi-squared test statistic for the two-sample version is ~1.52, while for your one-sample version it is ~1.81. If group size doesn't matter and if you're justified in adding up successes as you have, I'd expect the test statistics to be nearly the same.

Edit: I'd expect them to be nearly the same since the Chi-squared distributions would be parameterized similarly, so if we have similar results, we should see similar test statistics. Maybe my reasoning here is incorrect though!
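
One possible source of the gap, sketched in R (again assuming 3000 impressions per arm): prop.test applies Yates' continuity correction by default, so a corrected two-sample statistic will sit below an uncorrected one-sample statistic. With the correction switched off in both, they come out close:

    prop.test(39, 67, correct = FALSE)$statistic                   # ~1.81
    prop.test(c(39, 28), c(3000, 3000), correct = FALSE)$statistic # ~1.83
    prop.test(c(39, 28), c(3000, 3000))$statistic                  # ~1.51 with Yates' correction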


I used http://www.abtestcalculator.com/ and entered 3000 participants -> 28 conversions and 3000 participants -> 39 conversions.

I neglected to record how many views each version had, but it should be at least 3000, since the conversion rate is about 0.5-1%.


The resource provided uses a very naive approach to determining the outcome of an A/B test. It's not accurate, given the very small numbers.


Yes, I don't have much data to work with, and was also surprised that the calculator considered this significant. But even without significance, I assume it still makes sense to go with the winner?


Does the calculator really use the word "significant"? I don't see it. I am not sure how to interpret the language it uses.

As for going with the winner: Yes, if the test result (39/28) is the only information you have and there are only two choices (go with winner / go with loser) then it makes sense to go with the winner.


The statement I see on that page is "There is a 91% chance that Variation A has a higher conversion rate".

I am not sure how to interpret that. We would have to dive into the GitHub repo and figure out which test it performs I guess.


It looks like the difference of two beta distributions based on the visualization.

So, assuming a uniform prior and updating with 39/3000 and 28/3000 conversions, the difference between the two posterior distributions is greater than zero 91% of the time. It's only guaranteed to be above zero at about the 80% credible interval, and since we started with an uninformed prior that'd be about p = .2?
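
A quick Monte Carlo sketch of that calculation in R, assuming Beta(1,1) priors and 3000 impressions per arm:

    # Posterior draws for each arm's conversion rate under uniform priors;
    # the per-arm impression counts are assumed.
    set.seed(1)
    a <- rbeta(1e6, 1 + 39, 1 + 3000 - 39)
    b <- rbeta(1e6, 1 + 28, 1 + 3000 - 28)
    mean(a > b)  # ~0.91, matching the calculator's "91% chance A is higher"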

I'm open to correction here.


You get 91% if you put a uniform prior on the proportion coming from each alternative.


Looks like I misunderstood the result, I will change the post to reflect this.


If you have Google Analytics on your site, you can use the Unique Pageviews metric for each of the two page variations and use that as the denominator, instead of arbitrarily assigning 3000 views to both.


If you had 3000 visits, shouldn't that be 1500 -> 28 and 1500 -> 39? (assuming you're doing a uniform split of both groups)


I don't think that's the correct use of prop.test for this question. When you give it two numbers (success, total), it tests against the null that chance of success is 50%.

Here, we want to test whether p(success|cond) differs across conditions, not whether p(cond|success) is 50%.

This distinction is important because when p(success|cond) for some cond is low, its variance is also very small, but prop.test(39,67) doesn't reflect this. That could be 67 successes out of a small sample (and high chance of success), or out of a huge sample (and low chance of success).

Edit: whoops, I didn't notice other comments point out this issue


They said the following in the beginning though:

> For example if you want to test a tweak that results in 5% more conversions, you need about 3000 sales to detect it! For Candy Japan this would mean waiting for about 10 years for the test to complete.

But they still want to do something to try to improve sales. That seems reasonable, even if not scientific.


This is not the right approach to take then. There are lots of other approaches to decision making outside of hypothesis testing, use them! This is not an appropriate use of hypothesis testing and can very much lead you toward making the WRONG decision.

For example, with such small numbers, there isn't much value in aggregate statistics. It would take a day or two to go through each one individually and see what happened, and you'd probably learn way more about your customers.


Whatever test you do, you would need to know the total number of visitors in each group, right?

And unless I missed it the article doesn't state those numbers.

Intuitively, the numbers you quoted would be more significant the bigger the test and control groups are.


That's not very intuitive to me. Let's do some limit analysis: imagine the groups were one million sessions each, but the conversions in the groups were only one and two people respectively. Wouldn't this look like the result of random chance?

The conversion rate is basically one in a million in both cases.


It's definitely not just the design itself. The CTA (your email submit) of the new landing page is better positioned throughout. You have also added urgency ("time until next shipment", "subscribe now before time runs out"). You can also see up to 3 previous candy boxes now, whereas before, you'd have to click through. All of those contribute to the conversion rate of your new landing page.

Your old landing page actually contains some great elements that are missing from the new one, like the reviews, the explanation anime video and the "new tastes only available in Japan", which seems a great feature to me. I'd definitely try and add those back in, and split test that new version against the current winner.


This. Design != Content. Oftentimes, a simplified CTA and value proposition can have the same effect as a full-blown redesign.


Both pages would probably convert better if there was an explanation of what happens when I put in my email address. There's also a confusing use of the word 'mailbox', which I'd use for email, but also for physical mail. The page asks for an email immediately after saying "candy surprises in your mailbox" - if you assume no previous knowledge of what this site offers, it's pretty confusing. And if it's confusing, you're less likely to get people to sign up.

Without having a proper look at and understanding the site, I'm not sure whether that email is a leadgen form, or start of the sign up process, but either way you could probably increase conversion with some better explanatory copy.

Interesting read though! Thanks for sharing!


I've always struggled to find good wording to separate physical mailboxes from electronic ones. Maybe "shipped to you twice a month" would be less vague than "in your mailbox twice a month"?


Yep, or you could describe the physical product and make sure it's clear what people are putting their email address in for.

Like "we ship a box of surprising Japanese candy to your house every month" then explain why you need the email.

If it was my lander I'd experiment with showing prices before you ask for an email. Just from my experience, people are reluctant to put in an email without knowing what it's for.


Using the word "door" would be best I think.


Perhaps "delivered to you twice a month" would be even better


Seems ambiguous - delivered where? My e-mail?


Delivered to your door


From my experience, having less information on a landing page increases the email capture rate. Keep it simple with a single message all above the fold and it can improve the conversion rate by over 50%.

At the same time, it doesn't make a difference either way in actual sales. At least nothing big enough to be measured with significance.

In the end, landing page optimization matters less than most people expect. It is better to work on other parts of the website.


I think it depends. You can have great success with longform landers with the right product. But I agree that you should cover as much as possible as simply as possible as early as possible.

I definitely disagree that landing pages make no difference to sales and LP optimisation matters "less than most people expect". I think that wholly depends on the product (or goal) and the acquisition strategy.


I think the actual product images are more tangible to the customer compared to the illustrations. My gut feeling is that the new design and repositioning of the elements on the page don't make a huge difference aside from drawing attention to the product images.


I don't think this is necessarily the right conclusion.

Imagine a friend with a cafe, they get the opportunity to get another cafe, same area, same clientele. With the new cafe they decide to just get some franchise, a Subway type of thing. People are familiar with it and flock there in relative droves. The original 'ma and pa' cafe is therefore deemed to be not as good as a 'template'.

The other option to the franchise could have been to have gone with the 'ma and pa' offering, same deal.

But those are not the only choices. You could actually create something new, rather than holding on to what you have or chucking it in and going with the 'franchise'.

With website design it is very easy for people to chuck it in and go with a Themeforest effort, Shopify, Squarespace and the like.

But a lot of 'new ingredients' have come along with CSS Grid, semantic markup and much else that the themes, 'serviced websites' and the like are not up to speed with yet and show no signs of wanting to implement.

It is also possible to build out more than just a landing page from scratch in two weeks with the new tools. This involves learning instead of botching someone else's floats, divs and margin hacks. We have Cargo Cult programming and in my opinion this story is just another example of this.

I would recommend anyone else in this position to start content first, not buying a theme and then taking the photos needed to push in the theme because the boxes are there already. This is a backwards, 'design led' process and there are plenty of good reasons to go content first, then structure it properly, spend a day or two with CSS grid, then put the existing branding on there. A static HTML page is a good start, a working prototype instead of a PDF mockup.

Then, instead of the A/B testing, ask an honest person who knows a thing or two and won't bullshit you. They will ask 'why have you done that' questions which will get the content in shape. Thereafter, once you have a go at doing it you can spot what you like elsewhere and learn from it rather than cargo-cult-copy-paste it to never be confident of anything.


How many successful businesses have you launched doing things “the right way”?

The point is there are many paths to success. Maybe candy japan didn’t do things the way you think they should be done (it’s frankly rude to refer to template hacking as cargo cult programming; we’re talking design here, not programming). But despite their “primitive approach”, they’ve been successful in multiple ways.

OP has taught me quite a bit with their experience, whereas your negative tone has taught me nothing except you believe you know how things should be done.

Honestly, you sound like a know-it-all.


I get your point, but have you ever found a theme that hasn't come along with an extraordinary amount of bloat, technical debt and featuritis?

Have you ever worked for an agency where a 'theme from Theme Forest' was deemed professionally acceptable?

The fundamentals have changed for web development but nobody gets off the hamster wheel to learn them. The web doesn't have to be an obstacle course of hacks, polyfills, libraries, frameworks and other crutches any more. It also doesn't have to be full of fancy build tools, CSS compilers and the rest, which makes web development accessible for the first time in 20 years.

I preferred to do my own homework when I was at school, not copy what everyone else thought the answers were, changing details around.

There is an intimidating thing going on with web development and its increasing specialisation. In this atmosphere, no one can hope to create a web page on their own. But so many people - a decade or two ago - got started by writing actual HTML, not assuming it is all too hard and that you have to hack someone else's work. The industry is lacking these people now and is becoming less diverse.

Frontend development isn't a creative medium if people are just using frameworks from yesteryear and taking on technical debt from themes whilst busying themselves with the latest buzzword bingo. Things like content, accessibility and document structure matter, it is no good just going for a visual design and working backwards to the inevitable 'div soup'. Something has to change.


You make a principled point that I respect. I'm generally trying to get shit done. The two aren't always at odds with each other, but often are.

Personally, I hate web design and layout, so I want to do as little of it as possible. But I know I need a good looking site to be taken seriously. Templates have helped me out and generally last long enough that I can afford to hire a “real” designer.

Also, a beautiful template costs about $50. That's a 2-3 order-of-magnitude difference in cost.


Willing to bet that with some improvements to the original design the results would've been even better.

For example:

- Your navigation at the top is illegible. There's very little contrast and it's hard to read. Make it clearer, with higher contrast.

- Your main CTA ("Japanese candy surprises in your mailbox twice a month." [Email]) is also hard to read because of the orange and the custom font. Swap out the orange for a more vibrant color (maybe the orange color of the button) and use a nicer font that's easier to read.

- MAKE THE FONT SIZE BIGGER

- Add in some of the elements you added in the new design (when the next box is going out, etc.)

Your original design had an air of authenticity and honesty to it. The bought template seems like a copycat.

Keep (and improve!) the original!


Small sample size aside, they used the wrong KPI (imho). This needs to be tested over time: which one retains better? Focusing on conversions is a fool's errand. You might convert better, but if churn is higher you could end up at a net loss. That's not ideal, obviously.

I think also, you'd have to look at referrals (if you have them). Perhaps a lower subscription rate actually led to more sales because (for some reason) those subscribers like to tell their friends.

The analysis here is too shallow.


I always love this person's posts. Great informative and fun reads... Keep it up.


Japanese websites amaze me. The colors and density always seem so extreme.


Can you A/B test how many of your blog readers prefer 0 memes, 1-2 memes, 5-10 memes, and 10+ memes? After the first few I stopped reading...


Are you using WooCommerce? Care to share how you pinned users to either the old or new template?


The author lost credibility by describing statistical significance as unnecessary for making informed decisions.

In this context, that’s just not correct.


No, he's correct. This is a decision theory problem. Controlling false positives or false negatives at some arbitrary significance level like 0.05 is completely and utterly irrelevant. All of the debates above about which proportion test to use are irrelevant because none of them answer the question about which decision (keep current template or revert) maximizes utility (revenue). The hypothetical long-run rejection rate of a particular test in a world in which there is no difference between templates (which is a priori known to be false), has little or nothing to do with that.

Here, he has no strong prior probability of the new version being bad, the switching costs are already sunk, the evidence increases the posterior probability of the new version being better, and further experimentation involves a high opportunity cost while being unlikely to yield a result indicating the new version is noticeably worse. Keeping the new template probably is the correct decision.
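
A minimal sketch of that decision-theoretic framing in R, using the same uniform-prior Beta posteriors as earlier in the thread (the 3000 impressions per arm and the simple loss function are assumptions for illustration):

    # Expected loss, in conversion-rate terms, of each decision.
    # "Keep new" only loses when the old rate is actually higher, and vice versa.
    set.seed(1)
    p_new <- rbeta(1e6, 1 + 39, 1 + 3000 - 39)
    p_old <- rbeta(1e6, 1 + 28, 1 + 3000 - 28)
    mean(pmax(p_old - p_new, 0))  # expected loss of keeping the new template
    mean(pmax(p_new - p_old, 0))  # expected loss of reverting to the old one

If the expected loss of keeping the new template is much smaller than the expected loss of reverting, keeping it is the better bet under this framing, regardless of whether the difference clears a significance threshold.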



