HN2new | past | comments | ask | show | jobs | submitlogin

> Spanner’s external consistency invariant is that for any two transactions, T1 and T2 (even if on opposite sides of the globe):if T2 starts to commit after T1 finishes committing, then the timestamp for T2 is greater than the timestamp for T1.

This is impressive. But what exact point in time is when 'T2 starts to commit'? Is it when the user pushes the submit button on their device? When the request reaches a Google server? When it reaches a Spanner server within Google?

The point I'm getting at is given the uncertain 'delay' between the time the user interacts with the system and the 'time of commit', is time based linearizability a useful property? Could it be made simpler with actor based linearizability? Or revision based?



Ordinarily (and as indicated by your expectation), a timestamp records the time at which an event occurred. In most databases, the tranasction time is either the start of the transaction or the time at which the transaction coordinator initiates commit.

In spanner, there is a twist. The timestamp is decided by its transaction coordinator. Notice the word "decided". It means that the coordinator picks/computes a timestamp, not merely record an event's real time. The timestamp given to a transaction is the largest of: (1) timestamps from the various participants (other servers that may have been touched by that transaction) (2) its own latest clock time, and is strictly greater than any previously committed transactions.

(Bonus material:) Having picked this timestamp, it is possible that it (the timestamp) is greater than the coordinating machine's clock. A spanner server does not reflect the transaction's modifications to concurrent queries until the server's clock is past the transaction's commit timestamp. This is called "commit wait".


Thanks for the details! BTW, this sounds similar to the pseudotime ideas from David Reed (http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.126...), where you 'compute' a transaction timestamp and use it for ordering.

My question was more about how, of if, this 'feature' is usable externally? Guarantees made in terms of synthesized time are theoretically not useful to the callers, who are in a different 'real' time. I.e. when the doc says 'X happens after Y', it's talking about synthesized time, how does this translate in terms of 'real' time?

E.g. is the synthesized time returned to the caller, which can then be uses to make subsequent queries?


In pseudotime, a timestamp is a <<siteid, clock time at that site>>. The clock time only has meaning within that site; just because t1 < t2 does not mean that an event at t1 actually happened before t2, because the second server may have a higher id, or may be running faster. While the mechanism permits all servers to unambiguously order the two events in the same way, the actual time cannot be taken too seriously unless they are well outside the uncertainty interval of the clock sync algorithm. But a client does not know what the uncertainty interval is, so one can only make an educated guess.

Google's spanner makes that uncertainty a first-class entity; all timestamps are (equivalent to) a pair of <<start,end>>, where the actual real time is guaranteed to be somewhere within that interval. There is no siteid to disambiguate timestamps. The time is globally meaningful, which makes it trivial to make time-oriented queries, or instant global snapshots ("give me the values of these 20 objects at 10:02 am"). In that sense, the times are not really "synthesized". The coordinator picks a time to associate with the transaction, and that time is meaningful globally.


You can make time-oriented queries on snapshots using pseudo-time as well - given that all transactions are ordered. Timestamps are roughly ordered by real time, because NTP will synchronize to a few ms accuracy. IIUC within the margin of error, transactions will be arbitrarily ordered, but they'll be consistent - i.e. all readers will see the exact same order of previously applied transactions.

I'm trying to identify use cases where Spanner works better. Lets say, in Spanner, t1.end < t2.start for a pair of transactions - this still does not mean that t1 'happened' before t2, does it? All it means is that the t1 request arrived at a spanner server before the t2 arrived at a spanner server. But what does it mean about the event itself (going back to my original example of users interacting with their device)? Is ordering transactions by their arrival time on Spanner servers a more useful external property than an arbitrarily chosen (but consistent) order based on site id and millisecond level resolution?


Oh yes, that's precisely Spanner's guarantee. If t1.end < t2.start, it is _defined_ as t1 < t2, and t1 definitely happened before t2.

t1's effects will always be observable before t2's to any observer.

The notion of arrival time isn't important really. It is quite possible that the second one arrived at a server before the first one, but the first one's participants responded earlier. At that point, t1's commit time is picked. If t2 is also committed at the same server, then t2's time will be selected to be later than t1 (non-overlapping intervals).

If t2 is committed at a different server, and if t1 just happens to be < than t2, then one can be guaranteed that causality is preserved. There is no way t2 could have affected t1.

This latter property (external consistency) cannot be guaranteed by pseudotime. In PT it is possible to have t2 affect t1 in an external way (without involving the db, say by making a phone call), and yet get a smaller timestamp. From the db's point of view, t2 happened first, but that's not what really happened in reality, in an externally observable sense.


Oh I think I see what you mean. The specific scenario is:

1. External actor A1 sends a request t1 to Spanner

2. Spanner responds with a message m1 confirming t1 is committed

3. Based on receiving m1, A1 sends a private message to external actor A2

4. A2 then initiates a request t2, which affects a set of objects completely disjoint from the set affected by t1. t2 is committed.

Now Spanner guarantees that the timestamp for t2 > timestamp for t1. If the set of objects affected by t1 and t2 overlap then PT and Spanner provide the same guarantees (each object can only move 'forward' in PT). But if they affect disjoint sets of objects, then Spanner provides stronger guarantees. Is that correct?


Yes, correct.


I would think that is simply the statement that the ordering of timestamps preserves causality? What the "actual point in time" of anything is isn't really relevant--the question is whether you can show that a given transaction started after another committed, and if so, then you are guaranteed that timestamps assigned to those transactions preserve that ordering.


The way the timestamp is chosen depends on properties of the transaction. In the worst cases, servers must wait until the true time interval has passed an older commit before assigning a timestamp.


I agree that this is the best line in the paper. I interpret it to mean that while a second transaction is in progress, if a commit log comes in it's an extremely cheap operation to check the timestamp and ignore it.


There is no global time and no simultaneity. What is this notion that T2 is greater than T1 when they are separated by some significant distance? The sentence doesn't even make sense.


We synthesize idealized ephemeris time via a cohort of atomic clocks. Google's TrueTime tracks this along with the local uncertainty. This allows Spanner to explicitly wait out uncertainty windows, establishing a single externally consistent order to all possible observers, even in the presence of hidden communication channels.

This is the whole point of the system, and what the article covers.

What you're saying about time isn't even exactly true in the general case where we need account for relativity. While observers of clocks in different locations may not agree on clocks alone, they will agree on the total spacetime interval.

In any case, accounting for relativity is only necessary in high precision radio frequency systems operating between ground and orbit like GPS. And even then we can still track the frequency and phase offsets of the oscillators in real time and establish their relationship to idealized ephemeris time.

So yes, the sentences in the article DO make sense.

EDIT: at least the ones concerning time and consistent orders due. The words about effectively CA are ... not great.


They may agree on the spacetime interval but that is of no consequence, since they will still not agree on the ordering. Establishing a single externally consistent order of events for all possible observers is physically impossible, unless the spacetime interval separating them is timelike, a condition that is rarely true of distributed transactions that are happening independently: e.g. Abel puts in money in Spain, Beth removes it in Australia -- do you charge an overdraft fee/interest? (N.B., Abel and Beth may be names of HFT algorithms.)

"Spanner for causal transactions" would be more accurate.

P.S. Not sure what you mean by "hidden communication channels."


> Establishing a single externally consistent order of events for all possible observers is physically impossible

No, it is not. It does however require communication.

Additionally, what's true in an absolute physical sense, is somewhat removed from what's practical engineered product. For example, we cannot literally make a physical part that is perfectly 1 meter in length. This has not stopped us from making everything from microchips to spacecraft.

> Abel puts in money in Spain, Beth removes it in Australia -- do you charge an overdraft fee/interest? (N.B., Abel and Beth may be names of HFT algorithms.)

Spanner handles this by using two phase commit over sets of independent consensus replication groups. By using TrueTime's ability to provide absolute intervals, it can wait out uncertainty windows to enforce the total order.

> P.S. Not sure what you mean by "hidden communication channels."

Schemes such as lamport and vector clocks require all messages exchanged to include clock data. If two clients establish their own communication channel using some other protocol, they may not agree on the same event order, because the system has no way to know it needs to advance the logical clocks based on this hidden communication.

Spanner in combination with TrueTime avoids this limitation. External observers will see the same commit order in real time, regardless of how they communicate.

Please read and understand the papers to get what you're missing.


The paper is not talking about time, in the abstract, or simultaneity / ordering in the abstract (which indeed don't exist), but rather with reference to "timestamps". "Timestamps" are not abstract time, and as such may be ordered, compared, be simultaneous, etc.

Your comment is irrelevant and pointless since the paper is talking about timestamps.


No amount of sophistry will help the matter at hand. Spanner considers a true spacelike interval to be some kind of fuzzy clock measurement error (or "uncertainty"). Call it timestamp all you want, whatever shall you stamp when you should have one node on Earth and another on Mars?


> whatever shall you stamp when you should have one node on Earth and another on Mars?

Again, if you were familiar with the content of the papers, you could answer this yourself.

How it would work is establishing a phase locked loop between the oscillator on earth and the oscillator on mars, and then running a TrueTime style interval protocol. The downside of "Mars Spanner" is the very long round trip time would make the two phase commit protocol unacceptably long latency. However, the basic properties would still hold.

I don't know how else to explain it to you: a communicating distributed system can establish a single, canonical, total order on events. This is not a physical impossibility.


"Can establish a ... total order on events" is somewhat vacuous, the point is how it corresponds to the actual observed ordering. Otherwise you can just label arbitrarily, 1, 2, 3, ... that's also an ordering. I'm saying there is no such correspondence that is globally correct. It doesn't matter what protocol you run. You cannot synchronize two clocks that are apart. You can only synchronize them when they are in the same place.

I perfectly understand that with communication you can establish (read: impose) any arbitrary ordering you want and get nodes to agree to that. That's not really a useful resolution except in a narrow sense of "sometimes we really don't care about strict observed ordering," which may or may not be true.

The larger point is, Spanner doesn't solve any distributed database problem. It simply notices that on the timescale that today's workload appears to operate at, all the nodes can still be approximated as being collocated. There is, after all, a clock inside a computer as well, and this just extends it a little further.


> You cannot synchronize two clocks that are apart. You can only synchronize them when they are in the same place.

You are categorically mistaken. Please read the literature. I won't be replying further.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: