That could potentially indicate a database infrastructure problem. Eventually consistent databases can issue responses that appear to travel backwards in time. And [1] says this:
> Coinbase uses MongoDB for their primary datastore for their web app, api requests, etc. Coinbase is a decentralized, digital currency that is changing the world of payments.
As much as I love MongoDB, it has way too many issues to use as a primary data store for financial transactions. I hope they have backups and have tested their recovery process. Something tells me they're dealing with data corruption/loss that wiped out their master and slaves without a backup. Perhaps, if they've got decent logging, they can piece things back together from the logs.
I barely trust MongoDB with my personal projects. It's shot me in the foot enough given they're not a very stable vendor (case in point, see the 2.4.0 replication bug -- we didn't get hit by that thankfully).
We've had issues where databases end up on divergent paths; MongoDB will then keep up to 300 MB of the writes it can't reconcile in a rollback directory, and beyond that you're hosed.
It's absolutely insane if they are using Mongo as their source of truth (and not say some kind of caching layer). If there is one thing that should be ACID, it's financial transactions.
I wish I had seen that they use MongoDB before using the site.
My account has data inconsistency issues. They are letting me double-sell coins, which makes me wonder whether the first sale went through (at $70). Also, I have experienced up to 48-hour delays in sending BTC transactions out from my Coinbase wallet. These sound like Mongo problems, and they wouldn't be the first to have their Mongo databases fail under load. I am taking screenshots of my major transactions to ensure that they are not lost. Hopefully they have the logs to get everything into the correct state eventually.
But it does not appear to be used for the financial transaction component, so its use should not be able to cause inconsistencies in account balances, etc.
"Now I've contracted hemorragic e-coli from cleaning cow stalls and I'm bleeding out my asshole. I'll be dead soon, but that is a welcome relief. I will never have to witness the collapse of the world economy because NoSQL radicals talked financial institutions into abandoning perfectly good datastores because they didn't support distributed fucking map/reduce."
base+extension@domain type mail addresses are incredibly useful for legitimate people. For starters, they can be used to track who is leaking your personal information out.
So I hope this alarmist note does not prompt anyone into banning the "+extension" email addresses.
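Supporting "+extension" addresses is trivial, which makes banning them all the more pointless. A minimal Python sketch of pulling the tracking tag out of a plus address (the helper name and regex are mine, and the pattern is deliberately naive; real address validation is much hairier):

```python
import re

# Hypothetical helper: split a plus-addressed email into base, tag, domain.
# Naive pattern for illustration only.
ADDR = re.compile(r"^([^@+]+)(?:\+([^@]+))?@(.+)$")

def split_plus_address(addr):
    m = ADDR.match(addr)
    if not m:
        raise ValueError("not a recognizable address")
    base, tag, domain = m.groups()
    return base, tag, domain

# A leaked tag tells you which service shared your address:
# split_plus_address("jane+newsletter@example.com")
#   -> ("jane", "newsletter", "example.com")
```

Mail delivery ignores everything after the "+", so the base mailbox receives the message either way.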
Nice article, whose central message ("don't bundle your tools") is widely applicable to domains other than monitoring. Small, specialized tools that can be combined are the very essence of the Unix philosophy.
The problem is particularly acute in monitoring because you tend to need to monitor many disparate systems and subsystems in a very fine-grained manner. Monitoring needs to be reliable and must not impact performance no matter where it runs, which usually means everywhere. The Unix philosophy is definitely useful on a wider basis, but monitoring could be its poster child.
My guess is that this one came from a post a while ago where someone at Youporn wrote about how they used Redis. Obviously not for the videos - the article writer clearly didn't read that part very thoroughly, or didn't understand it.
Redis can store binary data -- and YouPorn's Redis cluster apparently handles 300K queries per second. Those queries obviously aren't all page views (the site only peaks at 4000 PVs per second).
Why can't you store video in a database? YouPorn says that Redis is its primary data store.
I think you are. That just means you hit MySQL; it doesn't necessarily imply that the data itself is served from MySQL. Filesystems are just fine for this task, and, as was already mentioned, most of the data is in the CDN anyway.
The point here is that this particular cargo cult around bcrypt (one subscribed to by some really loud people) has a shaky foundation and does not deserve its reputation. He's offering alternatives that have been better studied.
So, by all means, subscribe to a cargo cult for crypto. But pick the cult carefully.
He's correct that if you've selected bcrypt for key derivation, there's a good chance you could be doing better (for one, its output is only 184 bits long, insufficient for AES-256), whereas PBKDF2 lets you choose the output length.
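For illustration, here's what that flexibility looks like with the PBKDF2 in Python's standard library (the password, salt handling, and iteration count are placeholders, not a tuning recommendation):

```python
import hashlib
import os

# Illustrative only: derive a 256-bit key for AES-256, something
# bcrypt's fixed 184-bit output can't provide directly.
password = b"correct horse battery staple"
salt = os.urandom(16)
key = hashlib.pbkdf2_hmac("sha256", password, salt, 100_000, dklen=32)
assert len(key) == 32  # exactly the key size AES-256 wants
```

The `dklen` parameter is the whole point here: ask for 32 bytes, get 32 bytes.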
However, the point of the bcrypt argument is not that bcrypt is the best algorithm for certain things, but that it's (at a minimum) about four orders of magnitude better than most people's "secure" password storage algorithm: sha1(password). Because it requires both a salt and a work factor, even a dictionary attack is wildly impractical unless there's a massive flaw discovered in the algorithm.
If developers are going to be trained to pick a specific algorithm for password storage, I'd much prefer bcrypt (no known flaws, many benefits) over sha1 or md5 (designed to be fast for checksumming, salt not required). Might PBKDF2 be a better choice still? Very possibly; I haven't done enough research to answer intelligently, and since this is crypto, I won't guess.
My real point here? The article attacks bcrypt as a key derivation algorithm, but I've never seen someone suggest it to be used in such an application. Even the post that started what you may call the bcrypt movement (http://codahale.com/how-to-safely-store-a-password/) is linked in the article, and it's titled "How to safely store a password". It is NOT titled "How to safely derive encryption keys".
But just the existence of multiple cargo cults causes damage, because it leads people to assume that "the experts are divided." Which will lead some people to go with the wrong batch of experts, and others to just throw up their hands in confusion and store their passwords in plain text because it's too hard for them to figure out which group of experts is right.
This is a case where unanimity in the message is important. If all the experts say "use A," people will take that to mean there's no debate about the merits of A over B and C, and use A. If some say "use A" while others say "use B" or "use C", some fraction of listeners will give up and use nothing at all.
What a sensible change. Kudos to the git team. Most other teams would be wary of a non-backwards compatible change on such a critical path for a widely used tool.
I simply summarized your article. There is nothing wrong with not being up to handling C++'s error messages. I don't feel up to that task myself most of the time.
But your arguments for Go rang hollow. I'd urge you to go through the "pro-Go" arguments point by point and describe why, say, they apply to Go but not to Python 3.0. And I'm not a Python zealot by any means; I mention it as a comparison point mostly because it's very commonly used.
There are undoubtedly lots of great reasons to use Go, but your article did not enunciate them in a way that would win people over.
"why, say, they apply to Go but not to Python 3.0"
That's sort of a terrible example, since Python is ill-suited to many (might I dare say a clear majority of?) production environments. The field of general-purpose production languages is actually pretty narrow. I somewhat agree with your point that the reasons were on the superficial side. Nevertheless, people are dissatisfied with the C/C++/Java trio, and efforts to replace them have so far failed to stick (D comes to mind).
If you're unhappy with Python as a baseline comparison, pick anything else you're happy with that other people understand and compare to that.
It's not like we're setting an impossibly high bar here. Just put forward a reasonable argument for why Go is cool. "Closures like salt shakers" will not win anyone over.
A tip: if you have, say, 4 cores, then using 4000 threads will most likely be slow due to lots of context switches. I say "most likely" because it depends on the details, but it's a safe guess.
That is part of the problem: if you have 4 cores, your program should be using 4 OS threads. Your programming language's runtime should take care of distributing your 4000 lightweight/green threads across the 4 actual threads.
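Even without runtime-managed green threads you can approximate the idea by hand. A Python sketch (pool size and task count are illustrative) that drains 4000 small tasks through 4 worker threads instead of spawning 4000 OS threads:

```python
from concurrent.futures import ThreadPoolExecutor

# A fixed pool of 4 workers (one per core) drains 4000 small tasks,
# instead of paying context-switch costs for 4000 OS threads.
# CPython's GIL means this mostly helps I/O-bound work, but the
# scheduling shape is the same.
def work(n):
    return n * n

with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(work, range(4000)))

print(len(results))  # 4000
```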
This is what e.g. Haskell does, and Go as well, I think.
> This is what e.g. Haskell does, and Go as well, I think.
This is what Erlang does by default; GHC >= 6.12 will do it when run with `+RTS -N -RTS`; and Go requires explicitly setting GOMAXPROCS: the runtime defaults to single-threaded (as far as I know, GOMAXPROCS still hasn't been retired), and there is no way to have it auto-detect the core count.
Are you processing data using 4000 threads? Do you have access to a cluster? Otherwise it seems counter-productive. BTW, Python has great facilities (IPython.parallel) for parallel and distributed computing. It is also pretty good at number crunching using NumPy.
Goroutines are multiplexed onto a set of threads defined by the Go program. They're not each a full thread; it's more like Stackless Python's version of threads.
http://www.stackless.com/
I don't generally wait for GHC more than a few seconds at a time, except when I'm bootstrapping a new installation. Then I wait some minutes for "cabal install".
Compilation times are not bad enough to be a problem.
It's hard to expect a really detailed analysis from someone still in high school. On the other hand, it is good to see what the various languages look like from the fresh perspective of someone with relatively few preconceptions from years of using C/C++/Python/whatever. After all, if he has lots of problems writing something in C++, but far fewer writing it in Go, it probably means many of us had at some point to internalize lots of knowledge that's not directly relevant to the problem being solved, but to the intricacies of C++ or whatever. If the next generation of programmers can avoid this, that's a huge step ahead in what problems we will be able to solve.
Please ignore people's age; it makes people in high school much happier. From someone in high school (or the UK equivalent): the age anonymity of the internet is one of its greatest strengths.
I think he'll still have to learn a fair few tricks to get around some features of Go. Things like type assertions, float32 vs float64, (wait for it) the lack of generics, and no distinction between stack and heap aren't common in other similar languages, and just getting to know the standard library is a huge part of being productive in a language. C could be seen as better in that respect; the core language is _very_ simple, which can't really be said for Go, though the advantages of Go probably outweigh the advantages of C for many people.
Wow, I had no idea that the OP was in high school. Kudos to him! His blog entry is close enough to other software dev blogs in quality that I think we can leave his age, identity and personality aside, and talk about what it takes to win people over to a new language.
Wow.
1. Have you written or debugged C sockets code? Because if you have, I doubt you would be dinging him on not wanting to do that.
2. He doesn't want to handle C/C++. I would say that most people who write application code are with him.
3. Yeah... I'm pretty sure Python, an interpreted language, is slower than Go, a compiled language. Even with the progress PyPy is making, I doubt it's going to beat Go.
The author needs no age defense. He makes a ton of valid points for why you should try out Go.
PyPy is an optimizing JIT compiler; 6g/8g is not an optimizing compiler. I'm pretty sure one could construct examples in which PyPy beats Go for this reason (try something that relies on loop-invariant code motion, for example).
Additionally, PyPy has many garbage collection algorithms, while Go has a stop-the-world mark-and-sweep collector.
PyPy is an amazing project, but it is my understanding that the Python language makes certain guarantees (particularly around thread safety) that will hamper the speed of any implementation for a long time.
Go is still really young, yet it's plenty fast. There's plenty of room for it to get much faster.
You don't say exactly what you're referring to, but I assume it's the GIL. The GIL exists because Python threads share a single global namespace, and synchronizing atomically on hash tables for module lookup would be way too slow. If you create isolated contexts (like goroutines, but without shared state), the GIL won't bite you (edit: the degenerate cases that folks like David Beazley have described notwithstanding, but those aren't part of the language semantics and have more to do with unfortunate edge cases arising from the way Python's runtime interacts with the OS scheduler).
The problems with making Python fast mostly have to do with its complicated semantics, particularly around things like name lookup; the GIL doesn't have much to do with it.
Interestingly, I predict Go will have a much tougher time here, unless a form of goroutine is created that can't share any state at all. PyPy has been able to do a lot of garbage collection work precisely because it doesn't have to do concurrent GC. Unfortunately, Go crossed that bridge and can't really go back at this point.
Not to downplay these criticisms or to imply that Python isn't a great choice for many of these requirements, but the things you've mentioned about Go's compiler and garbage collector are (as I understand the intent of the language designers) simply the current state of things, and I believe both are known (and planned) targets for future improvement.
So, a developer likes that list of things and it fits their project. Your comment offers nothing as a counter to choosing Go in such a situation. Your comment literally brings nothing to the discussion except a tl;dr for anyone who chose not to read it (if you didn't, go read it).
I mean, if I want to write a cross-platform WebSocket signaling server for PeerConnection, what would you recommend that will let me write a statically typed server in 80 lines of code that's all standard libraries? (edit: WebSockets was moved to a `go install`able package recently)
The baseline was a substantial test corpus that we scaled several orders of magnitude over a series of runs, all meant to simulate typical clip size with typical word frequencies. The 10-100x gains had two contributions, each in the 3-10x range. We tested it against the standard search, which incorrectly performed pagination because of the misordering of sort versus slice. We also tested it on two types of map/reduce jobs that correctly implemented the sort and slice (and had been in production).
Ideally, we would have kept the data around to give a fuller report. But the truth is we did this over 9 months ago and didn't save the data. After informally sharing the impact with a lot of people, we heard a lot of encouragement to share the techniques.
And you're right, this is not hard core science nor engineering. But it is a good tip, which you can take or leave.
I'm still not seeing you report something that says "an operation that took X ms now takes Y ms." Note the absolute numbers whose units are in seconds. That's what I'd like to see for a baseline.
And I won't even comment on the "we did this 9 months ago and did not save the data" part. That kind of stuff would not fly in medicine or science or most traditional engineering fields. Why should you be exempt?
To be clear, operations in our test corpus that took X ms on average took 100X ms. Is that the statement you are looking for?
One thing that I think you're missing is that while we experienced a 100x gain for our application, our findings aren't strictly empirical in the sense that the gain is always 100x. In many applications it will be more. The insight in the blog post is analytical, not empirical. Specifically, most people (I think) would assume a text index to be document-partitioned. In Riak it's not. On a fundamental level this means that AND operations take time O(max(|A|,|B|)) instead of O(min(|A|,|B|)), where |A| and |B| are the sizes of the results that match query A and B, respectively. If you pick words at random from a power-law distribution (which is typical of all natural languages), you will more often than not see many orders of magnitude of difference between |A| and |B|. If you sample queries from real query streams, you will see the same. For some intuition, just think about a query that has a common tag (like "funny", used in the example) and something restricted (like a particular user). The first will typically be an understood fraction of the size of your corpus, while the second will have a size that is more like a constant.
Putting this all together, the savings that we find from moving the intersection where we did will only grow on a relative basis as we get more users and clips. This hack costs 2x the storage size but always gives O(min(|A|,|B|)) complexity instead of O(max(|A|,|B|)). The 100x win is just shorthand for saying that those two set sizes typically varied by two orders of magnitude because of power-law distributions. But in academia, we refer to that as "really fucking big".
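In code, the whole trick is just choosing which set you iterate. A toy Python sketch (the names and set sizes are made up for illustration) of the O(min(|A|,|B|)) intersection:

```python
def intersect(a, b):
    # Iterate the smaller posting set and probe the larger one, so the
    # AND costs O(min(|A|,|B|)) lookups rather than O(max(|A|,|B|)).
    small, large = (a, b) if len(a) <= len(b) else (b, a)
    return {doc for doc in small if doc in large}

funny = set(range(1_000_000))   # common tag: a huge posting set
by_user = {7, 42, 999_983}      # one user's clips: a tiny posting set
matches = intersect(funny, by_user)  # three probes, not a million
```

With power-law word frequencies, the small side of the AND is almost always orders of magnitude smaller, which is exactly where the win comes from.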
The "presort" option that we added will scale differently, but it has an equally dramatic impact, because pagination is effectively broken in Riak search if you do not use relevance sort. In this case, the O(.) argument is murkier because one version is done entirely within the index (which is usually in RAM) and the other has to hit the disk. In formal engineering circles, we refer to this type of boost as "unfucking real".
So, we could have measured things with more precision, but this was such an obvious win that it was almost pointless to calibrate it further.
Hi TPSReport - you realize that "gwf" is Gary William Flake, who wrote an award-winning book on complexity that is used in colleges worldwide, and who has run R&D for companies like Microsoft, Yahoo, and Overture, right? I'm sorry to say this, but personally I'm finding your comments on this post, as well as on other posts, negative and lacking value. It would be great and helpful if you have any constructive/positive ideas, tips, or experience to share to add value to the community. I'm sharing this in a positive, constructive spirit. Thanks.
Let me reiterate my constructive suggestion that the OP provide actual data. I can see how the author wrote a pop book, as he is quite verbose, but any engineer can tell you that he is also quite evasive on technical questions. "What's the baseline?" should not pose a tough question for anyone writing about performance improvements.
[1] http://www.mongodb.org/about/production-deployments/