phihag_ · on Feb 12, 2018

Yes, any operation is easy in 2^(2^n). For instance, take addition of two 128-bit numbers x and y (seen as 64-bit int arrays) on a 64-bit big-endian CPU:

  sum[1] = x[1] + y[1]
  sum[0] = x[0] + y[0] + carry from previous operation

In contrast, if you'd use 96 bits, you couldn't just use 64 bit integer operations. Instead, you'd have to cast a lot:

  sum[4..11] = *((int64*) x) + *((int64*) y)
  sum[0..3] = (int32) ( (int64) *((int32*) x) + (int64) *((int32*) y) + carry)

So you'd read 32-bit values into 64-bit registers, set the top 32 bits to zero, perform the addition, and then write out a 32-bit value again.

It gets much worse if your CPU architecture does not support the addition to 2^(2^n); if you were to use 100 bits, you'd have to AND the values with a bitmask, and write out single bytes.

So 128 is far easier to implement, faster on many CPU architectures, plus you get the peace of mind that your code works for a long time. For instance, let's assume the lower bound of 9 months per doubling (which is unrealistic as described in this article), then you're going to hit:

  50 bits (baseline from article): 2004
  64 bits: 2014
  80 bits: 2026
  92 bits: 2035
  100 bits: 2041
  128 bits: 2062

Now, what's the expected lifetime of a long-term storage system? It's well-known that the US nuclear force uses 8 inch floppy disks. Those were designed around 1970. So a lifetime of roughly 50 years is to be expected. For ZFS, that would be 2054. By this (admittedly very conservative) calculation, 128 bits is only barely more than required.
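
The dates in the table follow from simple arithmetic, which can be sketched in C as below. This is only an illustration: the function name `year_exhausted` is made up for this example, and it assumes the 50-bit baseline in 2004 and one extra bit every 9 months, with integer division rounding down to a whole year (so results may differ by a year from hand-rounded figures).

```c
#include <stdio.h>

/* Year in which `bits` of address space would be reached, assuming
 * 50 bits in 2004 and one doubling (one extra bit) every 9 months.
 * 9 months = 3/4 year; integer division rounds down. */
int year_exhausted(int bits)
{
    return 2004 + (bits - 50) * 3 / 4;
}

/* Prints the year for a few candidate widths. */
void print_table(void)
{
    const int widths[] = {64, 80, 92, 100, 128};
    for (size_t i = 0; i < sizeof widths / sizeof widths[0]; ++i)
        printf("%3d bits: %d\n", widths[i], year_exhausted(widths[i]));
}
```

Calling `print_table()` from `main` prints one line per width.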

tzs · on Feb 13, 2018

Don't 64-bit CPUs usually have efficient instructions for operating on narrower values?

For instance, consider this C code for adding two 96-bit numbers on a 64-bit machine (ignoring carry for now):

  #include <stdint.h>

  extern void mark(void);

  int sum(uint64_t * a, uint64_t * b, uint64_t * c)
  {
      mark();
      *c++ = *a++ + *b++;
      mark();
      *(uint32_t *)c = *(uint32_t *)a + *(uint32_t *)b;
      mark();
      return 17;
  }

The purpose of the mark() function is to make it easier to see the code for the additions in the assembly output from the compiler. Here is what "cc -S -O3" (whatever cc comes with MacOS High Sierra) produces for my 64-bit Intel Core i5 for the parts that actually do the math:

  callq   _mark
  movq    (%rbx), %rax
  addq    (%r15), %rax
  movq    %rax, (%r14)
  callq   _mark
  movl    8(%rbx), %eax
  addl    8(%r15), %eax
  movl    %eax, 8(%r14)
  callq   _mark

I'm not too familiar with x86-64 assembly, but I am assuming that this could be made to handle carry by changing the "addl" to whatever the 32-bit version of adding with carry is.

Taking out the (uint32_t * ) casts to turn the C code from 96-bit adding into 128-bit adding generates assembly code that only differs in that both movl instructions become movq instructions, and addl becomes addq.

So, if you were writing in C it looks like a 96-bit add would be a little uglier than a 128-bit add because of the casts but isn't slower or bigger under the hood. But note that this is assuming accessing the 96-bit number as an array of variable sized parts. It's that assumption that introduces the need for ugly casts.

If a struct is used, then there is no need for casts:

  #include <stdint.h>

  typedef struct {
      uint64_t low;
      uint32_t high;
  } addr;

  extern void mark(void);

  int sum(addr * a, addr * b, addr * c)
  {
      mark();
      c->low = a->low + b->low;
      mark();
      c->high = a->high + b->high;
      mark();
      return 17;
  }

This generates the same code as the earlier version.

(I still have no idea how to handle the carry in C, or at least no idea that is not ridiculously inefficient. When I've implemented big integer libraries I've either used a type for my "digits" that is smaller than the native integer size so that I could detect a carry by a simple AND, or I've handled low level addition in assembly).
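
For reference, one commonly used portable idiom relies on unsigned wraparound: after an unsigned addition, the result is smaller than either operand exactly when the addition overflowed. The sketch below applies that to a 64+32-bit layout. The names `addr96` and `addr96_add` are hypothetical, and whether a given compiler turns the comparison into an add-with-carry instruction is not guaranteed.

```c
#include <stdint.h>

/* Hypothetical 96-bit value: 64 low bits plus 32 high bits. */
typedef struct {
    uint64_t low;
    uint32_t high;
} addr96;

/* Adds a and b modulo 2^96. Unsigned addition wraps on overflow,
 * so the low sum is smaller than an operand exactly when it carried. */
addr96 addr96_add(addr96 a, addr96 b)
{
    addr96 r;
    r.low = a.low + b.low;
    uint32_t carry = r.low < a.low;   /* 1 if the low half overflowed */
    r.high = a.high + b.high + carry; /* wraps modulo 2^32 */
    return r;
}
```

The same comparison trick extends to more limbs, although with a carry-in the check has to account for both additions in each step.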

smitherfield · on Feb 13, 2018

1. Accesses through pointers type-punned to something other than `(un)signed char` are undefined behavior.

  uint64_t n = 0xdeadbeef;

  uint32_t foo = (uint32_t)n; // OK

  uint32_t *bar = (uint32_t*)&n; // "OK" but useless
  foo = *bar; // undefined behavior!!!

  uint8_t *baz = (uint8_t*)&n;
  uint8_t byte = *baz; // OK, uint8_t is `unsigned char`

  // Same-size integral types are OK
  const volatile long long *p = (const volatile long long*)&n;
  const volatile long long cvll = *p; // well-defined

2. Structs are aligned to the member with the strictest alignment requirement, so a struct of a `uint64_t` and a `uint32_t` will be aligned on an 8-byte boundary, meaning its size will be 128 bits.
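
Both points can be illustrated with a short C sketch. The names `addr_default`, `first_four_bytes` and `addr_default_size` are invented for this example. `memcpy` is shown as a well-defined substitute for the pointer cast, not as what the code above does, and the size of 16 bytes assumes a typical 64-bit ABI where `uint64_t` is 8-byte aligned.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical struct mirroring the one discussed above. */
struct addr_default {
    uint64_t low;
    uint32_t high;
};

/* Reads the first four bytes of n as a uint32_t without violating
 * strict aliasing; which half of n this is depends on endianness. */
uint32_t first_four_bytes(uint64_t n)
{
    uint32_t out;
    memcpy(&out, &n, sizeof out);
    return out;
}

/* Size including tail padding; 16 on typical 64-bit ABIs where
 * uint64_t requires 8-byte alignment. */
size_t addr_default_size(void)
{
    return sizeof(struct addr_default);
}
```

Compilers generally turn a fixed-size `memcpy` like this into a plain load, so it usually costs nothing over the cast.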

tzs · on Feb 13, 2018

> Structs are aligned to the member with the strictest alignment requirement, so a struct of a `uint64_t` and a `uint32_t` will be aligned on an 8-byte boundary, meaning its size will be 128 bits.

Don't most C compilers support a pragma to control this? "#pragma pack(4)" for clang and gcc, I believe.

Given this (where I've made it add two arrays of 96-bit integers to make it easier to figure out the sizes in the assembly):

  #include <stdint.h>

  #pragma pack(4)
  struct block_addr {
      uint64_t low;
      uint32_t high;
  };

  int sum(struct block_addr * a, struct block_addr * b, struct block_addr * c)
  {
      for (int i = 0; i < 8; ++i)
      {
          c->low = a->low + b->low;
          c++->high = a++->high + b++->high;
      }
      return 17;
  }

here is the code for the loop body, which the compiler unrolled to make it even easier to see how the structure is laid out:

  movq    (%rbx), %rax
  addq    (%r15), %rax
  movq    %rax, (%r14)
  movl    8(%rbx), %eax
  addl    8(%r15), %eax
  movl    %eax, 8(%r14)
  
  movq    12(%rbx), %rax
  addq    12(%r15), %rax
  movq    %rax, 12(%r14)
  movl    20(%rbx), %eax
  addl    20(%r15), %eax
  movl    %eax, 20(%r14)
  
  movq    24(%rbx), %rax
  addq    24(%r15), %rax
  movq    %rax, 24(%r14)
  movl    32(%rbx), %eax
  addl    32(%r15), %eax
  movl    %eax, 32(%r14)
  
  ...
  
  movq    84(%rbx), %rax
  addq    84(%r15), %rax
  movq    %rax, 84(%r14)
  movl    92(%rbx), %eax
  addl    92(%r15), %eax
  movl    %eax, 92(%r14)

(Some white space added, and the middle cut out). The 96-bit integers are now only taking up 96 bits.

smitherfield · on Feb 13, 2018

Packed structs are possible, to be sure, but inhibit numerous optimizations, such as (relevant to this case) the use of vector instructions and vector registers.

Changing the loop to 4 iterations for compactness' sake, (aligned) structs of two u64s generate the following, vectorized code:

https://godbolt.org/g/jB4jki

  vmovdqu (%rsi), %xmm0
  vpaddq  (%rdi), %xmm0, %xmm0
  vmovdqu %xmm0, (%rdx)
  vmovdqu 16(%rsi), %xmm0
  vpaddq  16(%rdi), %xmm0, %xmm0
  vmovdqu %xmm0, 16(%rdx)
  vmovdqu 32(%rsi), %xmm0
  vpaddq  32(%rdi), %xmm0, %xmm0
  vmovdqu %xmm0, 32(%rdx)
  vmovdqu 48(%rsi), %xmm0
  vpaddq  48(%rdi), %xmm0, %xmm0
  vmovdqu %xmm0, 48(%rdx)
  retq

And if the pointer arguments are declared `restrict`, the loop can be vectorized even more aggressively:

  vmovdqu64       (%rsi), %zmm0
  vpaddq  (%rdi), %zmm0, %zmm0
  vmovdqu64       %zmm0, (%rdx)
  vzeroupper
  retq

Either of which is much more efficient than the code generated for unaligned, packed 96-bit structs:

  movq    (%rsi), %rax
  addq    (%rdi), %rax
  movq    %rax, (%rdx)
  movl    8(%rsi), %eax
  addl    8(%rdi), %eax
  movl    %eax, 8(%rdx)
  movq    16(%rsi), %rax
  addq    16(%rdi), %rax
  movq    %rax, 16(%rdx)
  movl    24(%rsi), %eax
  addl    24(%rdi), %eax
  movl    %eax, 24(%rdx)
  movq    32(%rsi), %rax
  addq    32(%rdi), %rax
  movq    %rax, 32(%rdx)
  movl    40(%rsi), %eax
  addl    40(%rdi), %eax
  movl    %eax, 40(%rdx)
  movq    48(%rsi), %rax
  addq    48(%rdi), %rax
  movq    %rax, 48(%rdx)
  movl    56(%rsi), %eax
  addl    56(%rdi), %eax
  movl    %eax, 56(%rdx)
  retq

A smaller cost is that in non-vector code, using a 64-bit register (rax) in 32-bit mode (eax) is wasting half of the register.

IIRC, unaligned loads and stores will also, at the hardware level, stall the pipeline and inhibit out-of-order execution.

smitherfield · on Feb 13, 2018

Oops, I used `#pragma pack` incorrectly in my code, but it doesn't change the codegen for the 96-bit structs other than offsets. Also `restrict` is only needed on the output argument to enable full vectorization of the 128-bit structs.

New link: https://godbolt.org/g/8uGn4h

jepler · on Feb 13, 2018

See my little test program at https://godbolt.org/g/53SAMq

I believe this program properly handles carry from the low to high part.

The 96- and 128-bit code have the same number of instructions, but the 128-bit code has more instruction bytes due to "REX prefixes" (i.e., 32-bit register add is 3 bytes of opcode, 64-bit register add is 4)

mmozeiko · on Feb 13, 2018

Here's a better variant: https://godbolt.org/g/xRtr4i

jepler · on Feb 13, 2018

mmozeiko · on Feb 13, 2018

You can do it like this: https://godbolt.org/g/r6WruQ

garmaine · on Feb 12, 2018

On the other hand, they could have used 96 bits of block pointer and 32 bits of metadata in a sort of tagged reference or capability system, instead of shuffling around a bunch of high-order zero bytes forever.

Samis2001 · on Feb 13, 2018

Assuming the tagged reference or capability system was built, wouldn't it need software to take advantage of it? If it's not actively used, no real point having it over more block pointer space - and I doubt significant amounts of software would use such a filesystem-specific feature.

kstrauser · on Feb 13, 2018

Is there any advantage in a CPU to doing 32-bit math instead of 128-bit? My first guess is that this would make pointer operations much slower.

garmaine · on Feb 19, 2018

Most CPUs do not support 128-bit integer math. They would do 64-bit integer ops with carry. In most architectures that would be no different in code size from a 64-bit op followed by a 32-bit op.

phamilton · on Feb 13, 2018

Very complex compilers and/or CISC decoders on superscalar processors could theoretically rewrite some 128-bit to 32-bit and run the computations concurrently with other 128-bit computations.

phihag_ · on Jan 23, 2018

Why don't you get a passport, government ID, or driver's license? I'm honestly interested, if that's not too personal to ask.

myf01d · on Jan 23, 2018

I showed them my government ID but they refused, understandably because it's in Arabic language. They asked for an English passport. But since I didn't have one, I couldn't continue the registration. It's really frustrating. Why isn't my credit card enough to verify my identity? Almost all top cloud/vps providers don't ask such questions.

simplyinfinity · on Jan 23, 2018

Because of stolen credit cards? They are minimising potential damages like this. Malicious actors are less likely to give their ID to send spam and ID cards are harder to steal than Credit cards :)

Dolores12 · on Jan 23, 2018

Be assured malicious actors got plenty of stolen ids.

foepys · on Jan 23, 2018

They afaik had some problems with botnets using their cheap infrastructure in the early to mid 2000s. Might have something to do with it.

zyx321 · on Jan 23, 2018

Maybe you could get them to accept a notarized translation (Beglaubigte Übersetzung) of your ID. It's gonna add about 50€ up-front cost, so I don't know if that's worth the price or the hassle at your project scope.

myf01d · on Jan 23, 2018

No, thanks :D. 50 euros are about 1000 EGP. I could issue 10 passports with that amount of money :D

merb · on Jan 23, 2018

not in germany. you would only get one which is only usable for 5 years.

anc84 · on Jan 23, 2018

Probably German laws.

lemagedurage · on Jan 23, 2018

It feels intrusive to be honest.

phihag_ · on Sept 25, 2016

He's not using Google CDN, but Google Project Shield: https://projectshield.withgoogle.com/public/ . Project Shield is a free service.

phihag_ · on Nov 24, 2014

We're simply making use of Python's ability to load a module from a zip file [0]. Therefore, the generation[1] is just zipping up all the files and prepending a shebang.

[0] http://bugs.python.org/issue1739468
[1] https://github.com/rg3/youtube-dl/blob/640743233389714dda8a3...

pyre · on Nov 24, 2014

This:

  youtube-dl: youtube_dl/*.py youtube_dl/*/*.py
  	zip --quiet youtube-dl youtube_dl/*.py youtube_dl/*/*.py
  	zip --quiet --junk-paths youtube-dl youtube_dl/__main__.py
  	echo '#!$(PYTHON)' > youtube-dl
  	cat youtube-dl.zip >> youtube-dl
  	rm youtube-dl.zip
  	chmod a+x youtube-dl

Might be less confusing if you append '.zip' in the first two commands:

  	zip --quiet youtube-dl.zip youtube_dl/*.py youtube_dl/*/*.py
  	zip --quiet --junk-paths youtube-dl.zip youtube_dl/__main__.py

When you echo the shebang overwriting the file, I was thrown off. I'm thinking, "Why did you just zip all those contents into the file to just throw them out?" Then I see the `cat` line, and it makes sense that the `zip` command appends the .zip to the end of the file.

phihag_ · on Nov 23, 2014

youtube-dl can do that out of the box, try

youtube-dl --username pimlottc :ytwatchlater

If you get a problem, please file a bug report at https://yt-dl.org/bug . Thanks!

phihag_ · on Nov 23, 2014

Sorry! The problem is that our userbase is split about wanting the playlist or the video. You can create a file ~/.config/youtube-dl.conf with the content --no-playlist so that you don't have to type it out every time.

dredmorbius · on Nov 24, 2014

Right. My point was that the mention prompted me to read TFM and find the fix.

I've just done the config file setting you mentioned.

phihag_ · on Nov 23, 2014

The problem is that this only works for some YouTube videos (for example it will fail for basically all VEVO videos), not to mention maintainability issues.

101914 · on Nov 23, 2014

I had to look up what "VEVO" was. A joint venture of several major record labels and Google launched in 2009.

Personally I have no need for "VEVO" videos. Nor do I ever encounter VEVO youtube urls posted to websites, like HN. I wonder why?

As for maintainability, I beg to differ. The raison d'etre for this script arose out of frustration that early YouTube download solutions, e.g. gawk scripts, clive, etc., kept breaking whenever something at YouTube changed. I got tired of waiting for these programs to be fixed, if that ever happened.

I can fix this 164 line script faster if YouTube changes something than waiting for a third party to fix something they developed that is far more complex. Moreover, it does not rely on Python. Is there something wrong with DIY?

I see someone posted a link in this thread to another 208 line script, yget, that uses sed and awk. This further demonstrates the relative simplicity of downloading YouTube videos.

pbhjpbhj · on Nov 23, 2014

>Personally I do not watch "VEVO" videos but I am curious what they are. //

Go to youtube.com, you may need to scroll but most unlikely, bam! "VEVO" video with x-million views: it's a music video promotion brand.

Actually it's not globally promoted so outside of Western Europe and USA I'd guess you don't get VEVO vids so much?

According to https://www.youtube.com/watch?v=5zs1ClgqhLw their 100th most viewed video has 200 million views. Top 10 are all above 600 million.

They're quite a big brand.

101914 · on Nov 24, 2014

Interesting.

An alternative to goofing around on the youtube.com web site, scrolling constantly and getting hit with advertising and endless lists of "related" videos is to search and retrieve youtube urls from the command line via gdata.youtube.com.

phihag_ · on Nov 23, 2014

Please don't pass in -citw [0]! I have personally run youtube-dl on android, works fine. (Disclaimer though: I am the current lead developer, so may have missed a pitfall or two).

[0] https://github.com/rg3/youtube-dl/blob/master/README.md#do-i...

phihag_ · on Nov 23, 2014

Hi, I'm the current lead developer. We update extremely frequently because our release model is different from other software; there is usually little fear of regressions (fingers crossed), and lots of tiny features (i.e. small fixes or support for new sites) that are immediately useful for our users. We've had the experience that almost all users prefer it that way, so we try to enable every reporter to get the newest version by simply updating instead of having to check out the git repository.

As @fillipo said above, there is little if any pushback from video sites. Most of the time, they update their interface (we've gotten better in anticipating minor changes) and something breaks. The recent string of YouTube breaks (for some videos, mostly music videos - general video is unaffected) is caused by the complexity of their new player system, which forces us to behave more and more like a full-fledged webbrowser. But I think we usually manage to get out a fix and a new release within a couple of hours, so after a small youtube-dl -U (Caveats do apply[0]) you should be all set again. Sorry!

[0] https://yt-dl.org/update

e12e · on Nov 24, 2014

I'm really grateful for this tool: both to the creator, and those that keep making sure it works. I have a swath of my life tied up in a couple of youtube playlists, and every now and then videos disappear (presumably due to DMCA-requests) -- and it's always annoying. With youtube-dl, I can simply download my lists, and then I can be confident that whatever obscure (or not) tune or cover version I found some years ago, will be in my archives (I've yet to automate this -- but my goal isn't really to archive all the things -- I generally add few songs at a time).

Anyway, if you didn't write this tool (and update it) -- I'd have to do it myself. And I'd rather not do anything myself ;-)

phihag_ · on Nov 23, 2014

Thank you for all your contributions! Can you update that article to use video_id = self._match_id(url) and _VALID_URL = 'https?://...' though? We've also added a fair bit of "official" documentation at https://github.com/rg3/youtube-dl/blob/master/README.md#addi... .