I love the article on hyperloglog! It's really quite good to read even if you're not interested in algorithms. I always liked number theory, and I think it's very interesting that you can estimate how many uniques there are from the longest run of zeroes you see in a hash.
I suppose this could be broken by injecting a unique visitor id that would hash to something with an absurd number of zeroes? That's assuming that the user has control over their user id and that I'm understanding the algorithm correctly.
You are correct, but HyperLogLog keeps many buckets, each tracking the longest run of zeros it has seen, precisely to dampen the effect of outliers. I recently studied these probabilistic algorithms and put together a notebook with code and plots showing their performance: https://github.com/lucasschmidtc/Probabilistic-Algorithms/bl...
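For anyone who wants to see the bucketing idea concretely, here is a minimal HyperLogLog sketch in Python. The hash choice, bucket count, and the omission of the small/large-range bias corrections are all simplifications for illustration:

    import hashlib

    def hll_estimate(items, b=10):
        # b bits of the hash pick a bucket; each bucket remembers the longest
        # run of leading zeros (expressed as the position of the first 1-bit)
        # seen among the remaining bits.
        m = 1 << b
        registers = [0] * m
        for item in items:
            h = int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big")
            bucket = h >> (64 - b)
            rest = h & ((1 << (64 - b)) - 1)
            rank = (64 - b) - rest.bit_length() + 1
            registers[bucket] = max(registers[bucket], rank)
        # Harmonic mean of the per-bucket estimates, with the standard alpha constant.
        alpha = 0.7213 / (1 + 1.079 / m)
        return alpha * m * m / sum(2.0 ** -r for r in registers)

    # Should land within a few percent of 100000:
    print(hll_estimate(f"user-{i}" for i in range(100000)))

Averaging over many buckets is what keeps a single unlucky (or malicious) hash with a huge run of zeros from blowing up the estimate.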
Thanks for the write up, Lucas. It was very intuitive and I learnt a lot.
I noticed that you used 5000 buckets to store the frequencies of 7000 non-unique words in the section on 'Counting Bloom Filters'. How is that better than using 7000 buckets and a uniformly distributed hash function, which would maintain the frequencies perfectly? I would expect a real-world implementation to use an order of magnitude fewer buckets to actually save memory.
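(For anyone following along, a counting Bloom filter is basically k hash functions incrementing shared counters, and you read a frequency back as the minimum over an item's counters. A toy sketch, with sizes and hashing chosen purely for illustration:)

    import hashlib

    class CountingBloomFilter:
        # k hash functions each increment one counter per added item; querying
        # takes the minimum of an item's counters, so collisions can only
        # over-count, never under-count.
        def __init__(self, size=5000, k=3):
            self.size, self.k = size, k
            self.counters = [0] * size

        def _indexes(self, item):
            for i in range(self.k):
                digest = hashlib.sha1(f"{i}:{item}".encode()).digest()
                yield int.from_bytes(digest[:8], "big") % self.size

        def add(self, item):
            for idx in self._indexes(item):
                self.counters[idx] += 1

        def count(self, item):
            return min(self.counters[idx] for idx in self._indexes(item))

    cbf = CountingBloomFilter()
    for word in ["the", "cat", "sat", "on", "the", "mat"]:
        cbf.add(word)
    print(cbf.count("the"))  # 2, unless collisions inflate it

Part of the answer to the 5000-vs-7000 question is that a plain uniform hash into 7000 buckets would still collide; maintaining the frequencies perfectly would need a perfect hash (i.e. knowing the vocabulary up front), whereas the counter array keeps memory fixed no matter how many distinct items appear, at the cost of occasional over-counting.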
For outliers that occur by random chance, lucasschm's reply explains how the buckets handle them.
The usual trick for preventing folks _maliciously_ sending in outliers is to use a hash with a secret key, such as SipHash, so that folks on the outside can't trivially figure out which inputs will lead to hashes with a lot of leading zeroes.
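A minimal sketch of that keyed-hashing idea in Python (blake2b stands in for SipHash here, since hashlib doesn't expose SipHash with a caller-chosen key; the key handling is purely illustrative):

    import hashlib, os

    SECRET_KEY = os.urandom(16)  # kept server-side; storing/rotating it is out of scope here

    def keyed_hash(visitor_id: str) -> int:
        # Without the key, an outsider can't craft visitor ids whose hashes
        # have an absurd number of leading zeros.
        digest = hashlib.blake2b(visitor_id.encode(), key=SECRET_KEY, digest_size=8).digest()
        return int.from_bytes(digest, "big")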
The German Tank Problem guesses the size of a set, given a limited sample and successive serial numbers. If they had randomized the serial numbers it wouldn't have worked.
HyperLogLog is different because you have the entire population (not just a sample), and it's a multiset (the same element can appear more than once). Getting the size of a (non-multi) set is easy: you just keep a counter and increment it for each element, so it only takes enough memory to hold the counter. Counting the distinct members of a multiset takes a lot more memory because you have to remember whether you've already seen a particular element or not.
The tl;dr is that the German Tank Problem is about estimating a size when you have imperfect information, while HyperLogLog gives you an estimate when you have perfect information but an exact calculation is too expensive.
"We want to better communicate the scale of Reddit to our users."
If that's true, why did they hide vote numbers on comments and posts? It used to say "xxx upvotes, xxx downvotes"; now it just shows a single number and hides the breakdown.
That would be easy to test, though. If you wanted to know whether your bot was effective, just post to unpopular subreddits, have the bot vote on those submissions, then check back the next day. If the votes weren't counted, your bot is being ignored and you'd move on to changing your IP address, building your next bot, or the like.
It's a poor method of deterring bots written by people with very little experience writing such bots. Botting reddit is really easy; you'll just want lots of IPs, so something like Luminati.
Reddit is definitely not a paragon of anti-bot engineering, I think most of those skills exist in adtech.
Suppose there's a comment with 12 upvotes and 207 downvotes. Now suppose you're reddit and you want to make this comment seem more popular than it actually is.
You could slowly remove the downvotes, but attentive people refreshing the page will see the number shrinking and become suspicious, because real users never suddenly do a mass retraction of their votes.
You could slowly add a bunch of upvotes, but then people will wonder why a comment that sat at a consistent 12/207 popularity ratio for the past hour suddenly overcame that and became the most popular comment in the thread. People will suspect a coordinated raid took place.
Both approaches raise too much suspicion. The safest approach is to turn off the ability to view the separate upvote/downvote values altogether and use a simple easing function to artificially increase the comment's total score over time. When no one can see the upvote/downvote ratio or the volume of vote activity over time, they lose the ability to judge whether manipulation is taking place.
You need an excuse for the change, so don't forget to also come up with a spurious narrative about how it was supposedly done to fight bots.
Except that vote fuzzing was always a thing, so when you'd see 1000 upvotes and 100 downvotes, that could be off by an enormous amount (I think a reddit admin said by a factor of 5x or 10x in extreme cases, i.e. front-page posts).
...because it's almost certainly a lie. It's because they want to better communicate the scale of Reddit to their customers - the companies on the other side of the links that they're driving traffic to.
It's way easier to say "Hey, we're giving you XYZ traffic, give us ABC dollars," when you have the figures in front of you rather than just upvote/downvote numbers.
Counting views/impressions in combination with Apache Kafka sounds like the ideal use case for a stream processor like Apache Flink. It supports very large state, which can be managed off-heap. This should enable you to count the exact number of unique views in real time with exactly-once semantics. Here is a blog post on large-scale counting with more details. It also includes a comparison with other streaming technologies like Samza and Spark: https://data-artisans.com/blog/counting-in-streams-a-hierarc...
So how do they determine whether a user has viewed a post already? I would think that unique counting is accomplished using the hyperloglog counter, but the article says that this decision is made by the Nazar system, which doesn't use the hyperloglog counter in Redis.
Why can't they just associate a list of viewed posts with each user, or list of users that viewed a post with each post, and check that? I don't get why this needs any consideration.
They addressed your second point in the article. On a popular post, you would be storing several megabytes of data to capture/relate each unique user that visited. That gets expensive at scale. HLL takes that down to a few kilobytes, less than 1% of the original size.
For your first suggestion, you would have to do a very expensive lookup. You couldn't cache it effectively due to the requirement of near-real-time stats. You could improve lookup time using columnar storage, but the performance and memory usage would be nowhere near as nice as with HLL.
I've had a "phases of computing" article percolating for a while to this end. Problems aren't just harder at scale, but they actively change their observable properties because of the stressors involved and where they crop up.
"This was a product decision. Currently view counts are purely cosmetic, but we did not want to rule out the possibility of them being used in ranking in the future. As such, building in some degree of abuse protection made sense (e.g. someone can't just sit on a page refreshing to make the view number go up). I am fully expecting us to tweak this time window (and the duplication heuristics in general) in future, especially as the way that users interact with content will change as Reddit evolves."
Because it's a more valuable data point to those who care about overall audience and reach. Someone visiting repeatedly might be evidence of an engaged user, but things like ads would have diminishing returns.
Wouldn't it have been easier to simply increment a counter for each visit and then set a short-lived cookie in the browser for that post?
And put the spam detection system before the counter increment
Indeed it can easily be deleted; that's why I said to put the spam detection before the counter.
I assume most of the visits are from normal users, not spammers, so if a normal user has the cookie for the post set then it means it's a page refresh, so don't increase the counter.
Identifying sophisticated spammers accurately is more complicated though. You can't rely on any client side info (user agent, cookies, browser history, screen resolution, OS, etc) because they can all be modified. You can't rely on IP address either, because there are public hotspots used by genuine users also. I think their spam detector is more complicated than this and they have to use it for HLL also.
So, for the genuine users, a counter increased based on the cookie mechanism would've worked just fine.
I think the main problem is not concurrent updates to the counter; after all, updating the HLL must also be done concurrently.
The primary motivation on their part for using HLL is given in the intro.
>In order to maintain an exact count in real time we would need to know whether or not a specific user visited the post before. To know that information, we would need to store the set of users who had previously visited each post, and then check that set every time we processed a new view on a post. A naive implementation of this solution would be to store the unique user set as a hash table in memory, with the post ID as the key.
>This approach works well for less trafficked posts, but is very difficult to scale once a post becomes popular and the number of viewers rapidly increases. Several popular posts have over one million unique viewers! On posts like these, it becomes extremely taxing on both memory and CPU to store all the IDs and do frequent lookups into the set to see if someone has already visited before.
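In other words, the naive version described in the quoted intro looks roughly like this (per-post sets of user IDs), and its memory grows linearly with unique viewers. A minimal sketch, not reddit's actual code:

    # Naive exact counting: one set of user IDs per post. A post with a
    # million unique viewers costs megabytes of memory just for this
    # bookkeeping, which is what the HLL avoids.
    views = {}  # post_id -> set of user_ids

    def record_view(post_id, user_id):
        views.setdefault(post_id, set()).add(user_id)

    def unique_views(post_id):
        return len(views.get(post_id, ()))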
Writes are atomic in redis because redis is single-threaded, so you are bounded by how fast redis can write. If you try to write any faster than redis can handle, you'll get queueing or errors.
The wonderful thing about HyperLogLogs is that you can split the counter across N servers and "merge" the registers later, in case you want an architecture that shards the same counter over multiple servers. But sharding directly by resource looks simpler, actually...
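The merge really is that simple: with register arrays like the ones in the HLL sketch above, combining two shards is just an element-wise max (a sketch, not tied to any particular HLL implementation):

    def merge_registers(regs_a, regs_b):
        # The merged registers estimate the union of the two input streams,
        # exactly as if a single counter had seen every element.
        return [max(a, b) for a, b in zip(regs_a, regs_b)]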
Yes, that was my idea as well. They must have some sort of cache system for serving basic user metadata at scale when a page is loaded, and they could add a time-expiring list of the post IDs viewed by a user to do detection on a per-user basis on the backend.
I think they want to break it into different services for (whatever) reasons.
Running a counter across a sharded in memory cache implementation (like Redis) for something like post views should be ok, but they might have some weird rule-sets for spam detection that are slow.
I'm not following how your solution handles the unique visitor counting. You have to know when to increment the counter; that's why they used the HLL.
Ah, sorry if it wasn't clear. By storing an expiring list of post IDs alongside the cached user metadata, you could filter out the posts that were already visited by that user.
Weird thing I have been seeing on Reddit is comment upvotes being off-by-one periodically on page refreshes. Reload, you get 3. Reload again, you get 4. Again, you get 3. Seems like a replication issue?
> Weird thing I have been seeing on Reddit is comment upvotes being off-by-one periodically on page refreshes. Reload, you get 3. Reload again, you get 4. Reload, you get 3. Seems like a replication issue?
This is done on purpose [1], to prevent bots from calculating exact post/comment scores.
You could have your bot upvote, then remove its vote, and monitor from another browser whether the score changes. Having the number be fuzzed makes it harder to confirm whether your bot has been caught and its votes discarded.
Fuzzing the votes like this is cheap for reddit, but makes it more expensive for the spammer.
The best spam filter is tricked by a hand-crafted, personalized email that is well targeted. However, the cost of these is insane compared to mailing the same thing to hundreds or thousands of users. So all any spam filter ever does is make spamming less lucrative for the spammer, and fuzzing does that.
But you're multiplying it by some number for every single comment/reply. Spammers work through large numbers... make each one a little bit harder, and it adds up.
A votebot operator suspects that the bot is being detected and filtered out of the final scores. Because the scores fluctuate, it's harder to determine whether your vote had an impact.
But that doesn't "prevent spambots etc."; it merely prevents people from easily and instantly figuring out whether their bots have been detected yet. It doesn't stop them from spamming votes, or from making their bots more elaborate regardless of whether they've been detected. I don't know all the motivations for spam bots or what makes the people who write them tick, but I'd figure for a significant number the crucial bit is having an impact, not measuring it. Most spam is fire and forget, after all.
The point isn't to prevent them, which is nearly impossible, but to get them to waste enough resources on non-visible actions regularly enough that it isn't economically viable to keep trying to spam the site.
Just curious if this is a stab at Cassandra, or whether use of Cassandra would automatically imply eventual consistency or something else that would appear in this way?
Cassandra as it's often used can imply eventual consistency (e.g. counter incrs with CL.ONE), but "eventual" in this case would be in the range of tens of ms.
That said, reddit's upvote counters in particular are stored in Postgres, not Cassandra
I have two related questions:
1. I assume the process which reads from Cassandra and puts it back into Redis is parallelized, if not distributed. How do you ensure correctness? Implementing 2PC seems like extreme overhead. Or do you lock in Redis?
2. What database is used to actually store the view counts? Cassandra's counters are, afaik, not very reliable...
1. Redis is atomic, so we use the SETNX operation to ensure that only one write succeeds.
2. We have HLLs in Redis, so we just issue a PFCOUNT and store the result of that in Cassandra as an integer value. We don't use counters in Cassandra.
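A rough sketch of what that flow can look like with redis-py (key names, TTLs, and the flush locking are illustrative, not reddit's actual implementation):

    import redis

    r = redis.Redis()  # assumes a local Redis instance

    def record_view(post_id, user_id):
        # PFADD updates the post's HLL; Redis applies each command atomically.
        r.pfadd(f"views:hll:{post_id}", user_id)

    def flush_count(post_id):
        # PFCOUNT returns the approximate unique-viewer count, which would
        # then be persisted (e.g. to Cassandra) as a plain integer.
        return r.pfcount(f"views:hll:{post_id}")

    post_id = "abc123"  # hypothetical post ID
    # SETNX-style lock so only one consumer performs the flush:
    if r.set(f"lock:flush:{post_id}", "1", nx=True, ex=30):
        print(flush_count(post_id))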
Slightly OT, but I wish reddit would use traditional forum-style replies to push threads up, instead of the positive feedback loop of votes, where opinions that agree with the majority get upvotes, which give views, which give proportionally more upvotes.
Traditional forum-style sorting just sorts for controversy: posts that generate heated discussion will continue to push their way to the top. This is okay, but it means that non-controversial stories will not make it to the top, and surfacing those is honestly one of the largest benefits of news aggregator sites.
How about, here and on reddit, being able to mark threads you're interested in, and having a page where you can see those sorted by last reply. On HN, make that page refresh every 15 or 60 minutes or whatever. Heck, once every 24 hours would be enough... sometimes I just want to talk about the things that interest me, with the people that are interested in them. I would love to be able to think on something for a few days, or to familiarize myself with a subject before mouthing off.
As they are now, reddit and HN require you to be there when something is "current", and that's ultimately not that much better than TV. Yeah, sometimes you can get something out of it, but it's nowhere near as useful or deep. Even worse, there is a tendency to bring out semi-related or unrelated stuff when something slightly similar is discussed, instead of putting things exactly where they "belong", where they would add to a useful corpus. [which is exactly what I'm doing right now, and am doing too often... but I honestly would prefer the alternative which doesn't exist yet]
Just think about it, we have practically infinite storage, there's certainly plenty of expertise in reach of HN -- yet discussions get constantly restrained because our attention is limited. But it's limited because all we get are these flat lists, X items per page, next to no means to curate or organize anything. Imagine how much worse it would be without people who sometimes link to old but very relevant and insightful stories or individual comments. I'm very grateful to them, but to take them for granted, to rely on them for "structure" is totally stone age to me. Give us tools, give us transparency, let us figure out things even if they might be hard - stop trying to hand hold people you might underestimate! We don't need you as much as you need us.
I completely agree that a large problem with news aggregator sites is that you must be there right when the news hits. Eventually in the HN algo, time always dominates score and the discussion is killed no matter how interesting it was.
To be honest, I think that a change in behavior needs to come from the website, top down, rather than allowing the community to opt in to an alternate scheme that will never hit the popularity threshold necessary to work. Like a sibling mentioned, Reddit has "newest", but because it's not on by default, no one bothers to use it.
I don't think that it's our attention spans that cause our browsing behaviors on aggregator sites - I think it's purely derived from the UI/UX of the site, and that if the site kept interesting conversations up longer, people would participate and have more than the surface-level reflexive remarks we see here. This is not just a hypothesis - we can see the alternate behavior borne out on forum sites. I think it's a huge shame that HN has opted for this model (especially when it wants to be "the better news aggregator"). Think how many great conversations never happened here because the algorithm prematurely killed the thread, due to lack of upvotes or something else.
Another failing of the UI/UX on some social news sites is not being able to CC other posts. If there are notifications, alerting only the parent of new activity only does so much.
While sdrothrock is correct that these sites don't have much incentive to produce quality discussion, I do agree with you that there's value in seeing what gets votes as opposed to what gathers replies (not that they are necessarily mutually exclusive, nor free of their own problems). I wonder if it's worth trying a system that marries the two elements. Take a conventional BBS, for example. They often have a reputation system that in the best case encourages quality contributions and avoids posts consisting only of "good post, I agree." However, they can't be used to indicate from the index which discussions are worth reading. You could try highlighting the top-voted comments[1] beneath or side by side with their respective links to the OP. Then you can infer what kind of discussion is taking place. Is it a bunch of metajokes, or did someone just unload a lot of expert knowledge?
For news aggregators to facilitate enduring discussion is trickier as the nature of news is to remain current. Maybe if a link would otherwise decay but discussion is ongoing, the news entry on the index could take a backseat to its most 'active' threads. Indented new lines underneath the post that link to those threads, "The conversation is still going. Click to expand" or something like that.
[1] Perhaps highlighting only a number of posts proportional to the total number of comments or only the posts above a minimum score that is proportional to the total number of comments (just so you don't get highlighted for being one of the first responders). It's also probably important to not be able to vote on the post from a preview, so people have to click into the thread and hopefully read it and possibly contribute.
> As they are now, reddit and HN require you to be there when something is "current", and that's ultimately not that much better than TV.
One thing I don't think people have addressed is that that's the goal for them; it's not necessarily to be "useful" to the end user. If they can make users feel like they have to be on reddit 24/7 to stay "current" and not miss anything, then they will. Facebook does the same thing.
And with reddit gold you can even have it highlight all new replies regardless of sort order. This can be really useful for viewing older discussions and seeing what's been added.
It might be conceivable on smaller subreddits, but on the big ones it would basically just drown everything into a sea of low-quality threads. I got the feeling that the vast majority of threads never ever make it to the "front page" of their subreddit.