That could potentially indicate a database infrastructure problem. Eventually consistent databases can issue responses that appear to travel backwards in time. And [1] says this:
> Coinbase uses MongoDB for their primary datastore for their web app, api requests, etc. Coinbase is a decentralized, digital currency that is changing the world of payments.
As much as I love MongoDB, it has way too many issues to use as a primary data store for financial transactions. I hope they have backups and have tested their recovery process. Something tells me they're dealing with data corruption/loss that wiped out their master and slaves without a backup. Perhaps, if they've got decent logging, they can piece things back together from the logs.
I barely trust MongoDB with my personal projects. It's shot me in the foot enough given they're not a very stable vendor (case in point, see the 2.4.0 replication bug -- we didn't get hit by that thankfully).
We've had issues where databases end up on divergent paths; MongoDB will then keep up to 300 MB of the writes it can't reconcile in a rollback directory, and beyond that you're hosed.
It's absolutely insane if they are using Mongo as their source of truth (and not say some kind of caching layer). If there is one thing that should be ACID, it's financial transactions.
I wish I had seen that they use MongoDB before using the site.
My account has data inconsistency issues. They are letting me double-sell coins, which makes me wonder whether the first sale went through (at $70). Also, I have experienced up to 48-hour delays in sending BTC transactions out from my Coinbase wallet. These sound like Mongo problems, and they wouldn't be the first to have their Mongo databases fail under load. I am taking screenshots of my major transactions to ensure that they are not lost. Hopefully they have the logs to get everything into the correct state eventually.
But it does not appear to be used for the financial transaction component, so its use should not be able to cause inconsistencies in account balances, etc.
"Now I've contracted hemorragic e-coli from cleaning cow stalls and I'm bleeding out my asshole. I'll be dead soon, but that is a welcome relief. I will never have to witness the collapse of the world economy because NoSQL radicals talked financial institutions into abandoning perfectly good datastores because they didn't support distributed fucking map/reduce."
base+extension@domain type mail addresses are incredibly useful for legitimate people. For starters, they can be used to track who is leaking your personal information out.
So I hope this alarmist note does not prompt anyone into banning the "+extension" email addresses.
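Supporting "+extension" addresses is trivial, which makes banning them all the more pointless. A minimal Python sketch of pulling the tracking tag out of a plus address (the helper name and regex are mine, and the pattern is deliberately naive; real address validation is much hairier):

```python
import re

# Hypothetical helper: split a plus-addressed email into base, tag, domain.
# Naive pattern for illustration only.
ADDR = re.compile(r"^([^@+]+)(?:\+([^@]+))?@(.+)$")

def split_plus_address(addr):
    m = ADDR.match(addr)
    if not m:
        raise ValueError("not a recognizable address")
    base, tag, domain = m.groups()
    return base, tag, domain

# A leaked tag tells you which service shared your address:
# split_plus_address("jane+newsletter@example.com")
#   -> ("jane", "newsletter", "example.com")
```

Mail delivery ignores everything after the "+", so the base mailbox receives the message either way.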
Nice article, whose central message ("don't bundle your tools") is widely applicable to domains other than monitoring. Small, specialized tools that can be combined are the very essence of the Unix philosophy.
The problem is particularly acute in monitoring because you tend to need to monitor many disparate systems and subsystems in a very fine-grained manner. Monitoring needs to be reliable and must not impact performance no matter where it runs, which usually means everywhere. The Unix philosophy is definitely useful on a wider basis, but monitoring could be its poster child.
My guess is that this one came from a post a while ago where someone at Youporn wrote about how they used Redis. Obviously not for the videos - the article writer clearly didn't read that part very thoroughly, or didn't understand it.
Redis can store binary data -- and YouPorn's Redis cluster apparently handles 300K queries per second. Those queries obviously aren't all page views (the site only peaks at 4000 PVs per second).
Why can't you store video in a database? YouPorn says that Redis is its primary data store.
I think you are. That just means you hit MySQL; it doesn't necessarily imply that the data itself is served from MySQL. Filesystems are just fine for this task, and, as was already mentioned, most of the data is in the CDN anyway.
The point here is that this particular cargo cult around bcrypt (one subscribed to by some really loud people) has a shaky foundation and does not deserve its reputation. He's offering alternatives that have been better studied.
So, by all means, subscribe to a cargo cult for crypto. But pick the cult carefully.
He's correct that if you've selected bcrypt for key derivation, there's a good chance you could be doing better (for one, its output is only 184 bits long, insufficient for AES-256), whereas PBKDF2 lets you choose the output length.
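For illustration, here's what that flexibility looks like with the PBKDF2 in Python's standard library (the password, salt handling, and iteration count are placeholders, not a tuning recommendation):

```python
import hashlib
import os

# Illustrative only: derive a 256-bit key for AES-256, something
# bcrypt's fixed 184-bit output can't provide directly.
password = b"correct horse battery staple"
salt = os.urandom(16)
key = hashlib.pbkdf2_hmac("sha256", password, salt, 100_000, dklen=32)
assert len(key) == 32  # exactly the key size AES-256 wants
```

The `dklen` parameter is the whole point here: ask for 32 bytes, get 32 bytes.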
However, the point of the bcrypt argument is not that bcrypt is the best algorithm for certain things, but that it's (at a minimum) about four orders of magnitude better than most people's "secure" password storage algorithm: sha1(password). Because it requires both a salt and a work factor, even a dictionary attack is wildly impractical unless there's a massive flaw discovered in the algorithm.
If developers are going to be trained to pick a specific algorithm for password storage, I'd much prefer bcrypt (no known flaws, many benefits) over sha1 or md5 (designed to be fast for checksumming, salt not required). Might PBKDF2 be a better choice still? Very possibly; I haven't done enough research to answer intelligently, and since this is crypto, I won't guess.
My real point here? The article attacks bcrypt as a key derivation algorithm, but I've never seen someone suggest it to be used in such an application. Even the post that started what you may call the bcrypt movement (http://codahale.com/how-to-safely-store-a-password/) is linked in the article, and it's titled "How to safely store a password". It is NOT titled "How to safely derive encryption keys".
But just the existence of multiple cargo cults causes damage, because it leads people to assume that "the experts are divided." Which will lead some people to go with the wrong batch of experts, and others to just throw up their hands in confusion and store their passwords in plain text because it's too hard for them to figure out which group of experts is right.
This is a case where unanimity in the message is important. If all the experts say "use A," people will take that to mean there's no debate about the merits of A over B and C, and use A. If some say "use A" while others say "use B" or "use C", some fraction of listeners will give up and use nothing at all.
What a sensible change. Kudos to the git team. Most other teams would be wary of a non-backwards compatible change on such a critical path for a widely used tool.
I simply summarized your article. There is nothing wrong with not being up to handling C++'s error messages. I don't feel up to that task myself most of the time.
But your arguments for Go rang hollow. I'd urge you to go through the "pro-Go" arguments point by point and describe why, say, they apply to Go but not to Python 3.0. And I'm not a Python zealot by any means; I mention it as a comparison point mostly because it's very commonly used.
There are undoubtedly lots of great reasons to use Go, but your article did not enunciate them in a way that would win people over.
"why, say, they apply to Go but not to Python 3.0"
That's sort of a terrible example, since Python is ill-suited to many (might I dare say a clear majority of?) production environments. The field of general-purpose production languages is actually pretty narrow. I somewhat agree with your point that the reasons were on the superficial side. Nevertheless, people are dissatisfied with the C/C++/Java trio, and efforts to replace them have so far failed to stick (D comes to mind).
If you're unhappy with Python as a baseline comparison, pick anything else you're happy with that other people understand and compare to that.
It's not like we're setting an impossibly high bar here. Just put forward a reasonable argument for why Go is cool. "Closures like salt shakers" will not win anyone over.
A tip: if you have, say, 4 cores, then using 4000 threads will most likely be slow due to lots of context switches. I say "most likely" because it depends on the details, but it's a safe guess.
That is part of the problem: if you have 4 cores, your program should be using 4 OS threads. Your programming language's runtime should take care of distributing your 4000 lightweight/green threads across the 4 actual threads.
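Even without runtime-managed green threads you can approximate the idea by hand. A Python sketch (pool size and task count are illustrative) that drains 4000 small tasks through 4 worker threads instead of spawning 4000 OS threads:

```python
from concurrent.futures import ThreadPoolExecutor

# A fixed pool of 4 workers (one per core) drains 4000 small tasks,
# instead of paying context-switch costs for 4000 OS threads.
# CPython's GIL means this mostly helps I/O-bound work, but the
# scheduling shape is the same.
def work(n):
    return n * n

with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(work, range(4000)))

print(len(results))  # 4000
```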
This is what e.g. Haskell does, and Go as well, I think.
> This is what e.g. Haskell does, and Go as well, I think.
This is what Erlang does by default; GHC >= 6.12 will do it when run with `+RTS -N -RTS`; and Go requires explicitly setting GOMAXPROCS: the runtime defaults to single-threaded (as far as I know, GOMAXPROCS still hasn't been retired), and there is no way to have it auto-detect the core count.
Are you processing data using 4000 threads? Do you have access to a cluster? Otherwise it seems counter-productive. BTW, Python has great facilities (IPython.parallel) for parallel and distributed computing. It is also pretty good at number crunching using NumPy.
Goroutines are multiplexed onto a set of threads defined by the Go program. They're not each a full thread; it's more like Stackless Python's version of threads.
http://www.stackless.com/
I don't generally wait for GHC more than a few seconds at a time, except when I'm bootstrapping a new installation. Then I wait some minutes for "cabal install".
Compilation times are not bad enough to be a problem.
It's hard to expect a really detailed analysis from someone still in high school. On the other hand, it is good to see what the various languages look like from the fresh perspective of someone with relatively few preconceptions from years of using C/C++/Python/whatever. After all, if he has lots of problems writing something in C++, but far fewer writing it in Go, it probably means many of us had at some point to internalize lots of knowledge that's not directly relevant to the problem being solved, but to the intricacies of C++ or whatever. If the next generation of programmers can avoid this, that's a huge step ahead in what problems we will be able to solve.
Please ignore people's age; it makes people in high school much happier. From someone in high school (or the UK equivalent): the age anonymity of the internet is one of its greatest strengths.
I think he'll still have to learn a fair few tricks to get around some features of Go. Things like type assertions, float32 vs float64, (wait for it) the lack of generics, and no distinction between stack and heap aren't common in other similar languages, and just getting to know the standard library is a huge part of being productive in a language. C could be seen as better in that respect; the core language is _very_ simple, which can't really be said for Go, though the advantages of Go probably outweigh the advantages of C for many people.
Wow, I had no idea that the OP was in high school. Kudos to him! His blog entry is close enough to other software dev blogs in quality that I think we can leave his age, identity and personality aside, and talk about what it takes to win people over to a new language.
Wow.
1. Have you written or debugged C sockets code? Because if you have, I doubt you would be dinging him on not wanting to do that.
2. He doesn't want to handle C/C++. I would say that most people who write application code are with him.
3. Yeah... I'm pretty sure Python, an interpreted language, is slower than Go, a compiled language. Even with the progress PyPy is making, I doubt it's going to beat Go.
The author needs no age defense. He makes a ton of valid points for why you should try out Go.
PyPy is an optimizing JIT compiler; 6g/8g is not an optimizing compiler. I'm pretty sure one could construct examples in which PyPy beats Go for this reason (try something that relies on loop-invariant code motion, for example).
Additionally, PyPy has many garbage collection algorithms, while Go has a stop-the-world mark-and-sweep collector.
PyPy is an amazing project, but it is my understanding that the Python language makes certain guarantees (particularly around thread safety) that will hamper the speed of any implementation for a long time.
Go is still really young, yet it's plenty fast. There's plenty of room for it to get much faster.
You don't say exactly what you're referring to, but I assume it's the GIL. The GIL exists because Python threads share a single global namespace, and synchronizing atomically on hash tables for module lookup would be way too slow. If you create isolated contexts (like goroutines, but without shared state), the GIL won't bite you (edit: the degenerate cases that folks like David Beazley have described notwithstanding, but those aren't part of the language semantics and have more to do with unfortunate edge cases arising from the way Python's runtime interacts with the OS scheduler).
The problems with making Python fast mostly have to do with its complicated semantics, particularly around things like name lookup; the GIL doesn't have much to do with it.
Interestingly, I predict Go will have a much tougher time here, unless a form of goroutine is created that can't share any state at all. PyPy has been able to do a lot of garbage collection work precisely because it doesn't have to do concurrent GC. Unfortunately, Go crossed that bridge and can't really go back at this point.
Not to downplay these criticisms or to imply that Python isn't a great choice for many of these requirements, but the things you've mentioned about Go's compiler and garbage collector are (as I understand the intent of the language designers) simply the current state of things, and I believe both are known (and planned) targets for future improvement.
So, a developer likes that list of things and it fits their project. Your comment offers nothing as a counter to choosing Go in such a situation. Your comment literally brings nothing to the discussion except a tl;dr for anyone who chose not to read it (if you didn't, go read it).
I mean, if I want to write a cross-platform WebSocket signaling server for PeerConnection, what would you recommend that will let me write a statically typed server in 80 lines of code that's all standard libraries? (edit: WebSockets was moved to a `go install`able package recently)
The baseline was a substantial test corpus that we scaled several orders of magnitude over a series of runs, all meant to simulate typical clip size with typical word frequencies. The 10-100x gains had two contributions, each in the 3-10x range. We tested it against the standard search, which incorrectly performed pagination because of the misordering of sort versus slice. We also tested it on two types of map/reduce jobs that correctly implemented the sort and slice (and had been in production).
Ideally, we would have kept the data around to give a fuller report. But the truth is we did this over 9 months ago and didn't save the data. After informally sharing the impact with a lot of people, we heard a lot of encouragement to share the techniques.
And you're right, this is not hard core science nor engineering. But it is a good tip, which you can take or leave.
I'm still not seeing you report something that says "an operation that took X ms now takes Y ms." Note the absolute numbers whose units are in seconds. That's what I'd like to see for a baseline.
And I won't even comment on the "we did this 9 months ago and did not save the data" part. That kind of stuff would not fly in medicine or science or most traditional engineering fields. Why should you be exempt?
To be clear, operations in our test corpus that took X ms on average took 100X ms. Is that the statement you are looking for?
One thing that I think you're missing is that while we experienced a 100x gain for our application, our findings aren't strictly empirical in the sense that the gain is always 100x. In many applications it will be more. The insight in the blog post is analytical, not empirical. Specifically, most people (I think) would assume a text index to be document-partitioned. In Riak it's not. On a fundamental level this means that AND operations take time O(max(|A|,|B|)) instead of O(min(|A|,|B|)), where |A| and |B| are the sizes of the results that match query A and B, respectively. If you pick words at random from a power-law distribution (which is typical of all natural languages), you will more often than not see many orders of magnitude of difference between |A| and |B|. If you sample queries from real query streams, you will see the same. For some intuition, just think about a query that has a common tag (like "funny", used in the example) and something restricted (like a particular user). The first will typically be an understood fraction of the size of your corpus, while the second will have a size that is more like a constant.
Putting this all together, the savings that we find from moving the intersection where we did will only grow on a relative basis as we get more users and clips. This hack costs 2x the storage size but always gives O(min(|A|,|B|)) complexity instead of O(max(|A|,|B|)). The 100x win is just shorthand for saying that those two set sizes typically varied by two orders of magnitude because of power-law distributions. But in academia, we refer to that as "really fucking big".
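In code, the whole trick is just choosing which set you iterate. A toy Python sketch (the names and set sizes are made up for illustration) of the O(min(|A|,|B|)) intersection:

```python
def intersect(a, b):
    # Iterate the smaller posting set and probe the larger one, so the
    # AND costs O(min(|A|,|B|)) lookups rather than O(max(|A|,|B|)).
    small, large = (a, b) if len(a) <= len(b) else (b, a)
    return {doc for doc in small if doc in large}

funny = set(range(1_000_000))   # common tag: a huge posting set
by_user = {7, 42, 999_983}      # one user's clips: a tiny posting set
matches = intersect(funny, by_user)  # three probes, not a million
```

With power-law word frequencies, the small side of the AND is almost always orders of magnitude smaller, which is exactly where the win comes from.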
The "presort" option that we added will scale differently, but it has an equally dramatic impact, because pagination is effectively broken in Riak search if you do not use relevance sort. In this case, the O(.) argument is murkier because one version is done entirely within the index (which is usually in RAM) and the other has to hit the disk. In formal engineering circles, we refer to this type of boost as "unfucking real".
So, we could have measured things with more precision, but this was such an obvious win that it was almost pointless to calibrate it further.
Hi TPSReport - you realize that "gwf" is Gary William Flake, who wrote an award-winning book on complexity that is used in colleges worldwide, and who has run R&D for companies like Microsoft, Yahoo, and Overture, right? I'm sorry to say this, but personally I'm finding your comments on this post, as well as on other posts, negative and lacking value. It would be great and helpful if you have any constructive/positive ideas, tips, or experience to share to add value to the community. I'm sharing this in a positive, constructive spirit. Thanks.
Let me reiterate my constructive suggestion that the OP provide actual data. I can see how the author wrote a pop book, as he is quite verbose, but any engineer can tell you that he is also quite evasive on technical questions. "What's the baseline?" should not pose a tough question for anyone writing about performance improvements.
[1] http://www.mongodb.org/about/production-deployments/