
The Estimator and Dataset API docs are so bad that I'm planning on starting a blog with posts on running MNIST and a few other common datasets so others don't have to suffer.



[Disclosure: I designed and wrote most of the docs for TensorFlow’s Dataset API.]

I’m sorry to hear that you’ve not had a pleasant experience with tf.data. One of the doc-related criticisms we’ve heard is that they aim for broad coverage, rather than being examples you can drop into your project and run straight away. We’re trying to address that with more tutorials and blog posts, and it’d be great if you started a blog on that topic to help out the community!

If there are other areas where we could improve, I’d be delighted to hear suggestions (and accept PRs).


No offense, but TensorFlow's Dataset API documentation also sucks. Combined with bad API design (which could actually be used as a case study of bad design in classrooms), it's a disaster in the making. For example, shuffle() takes a mysterious argument. Why? It's not explained in the docs, except that it should be greater than the number of items in the dataset. Why can't shuffle() just be shuffle(), and why do I now have to remember to pass the correct parameter for the rest of my life? Whatever. I still don't get what exactly repeat() does. Does it rewind back to the start when you reach past the end? Why do you need it? Why not just stick to epochs? Why make things complicated with steps vs. epochs anyway? The docs give zero clue. Then there is a whole bunch of mysteriously named, unexplained methods like make_one_shot_iterator() or from_tensor_slices(). Why is make_one_shot_iterator() not just iterator()? Why do I have to rebuild the dataset using from_tensor_slices()? The docs are written from the point of view of "take all this code calling mysteriously designed APIs, copy-paste it, and don't bother too much about understanding what those APIs really do". It really sucks.
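For reference, this is roughly the boilerplate I mean (my own sketch against the TF 1.x-era API, with toy arrays standing in for real data; the comments reflect my best understanding of what each call does):

    import numpy as np
    import tensorflow as tf

    # Toy in-memory arrays standing in for a real dataset like MNIST.
    features = np.random.rand(1000, 28, 28).astype(np.float32)
    labels = np.random.randint(0, 10, size=1000).astype(np.int32)

    dataset = (
        tf.data.Dataset.from_tensor_slices((features, labels))  # one element per row of the arrays
        .shuffle(buffer_size=1000)  # the mysterious argument: how many elements to buffer when shuffling
        .repeat()                   # start over when the data runs out, instead of raising OutOfRangeError
        .batch(32)
    )

    iterator = dataset.make_one_shot_iterator()  # initialized automatically, cannot be reset
    images, targets = iterator.get_next()        # ops that yield the next batch each time they are run

    with tf.Session() as sess:
        x, y = sess.run([images, targets])
        print(x.shape, y.shape)  # (32, 28, 28) (32,)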


IMO, shuffle is something they actually did really well. Unlike PyTorch datasets, TF allows streaming unbounded data. For something like this to work with shuffle, it must buffer some data before passing it down the pipeline. You specify how much in the argument.

This may not seem useful for conventional training, where you usually work with a fixed number of samples you know beforehand. But there are cases where that isn't true (for instance, some special cases of augmentation); the streaming part is useful there, but then you must use this buffering trick.
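A minimal sketch of what I mean, assuming a Python generator as a stand-in for an unbounded source:

    import tensorflow as tf

    def stream():
        # Stand-in for an unbounded source, e.g. records arriving over time.
        i = 0
        while True:
            yield i
            i += 1

    dataset = (
        tf.data.Dataset.from_generator(stream, output_types=tf.int64)
        .shuffle(buffer_size=100)  # keep only 100 elements in memory; each output is drawn at random from that window
        .batch(8)
    )

    next_batch = dataset.make_one_shot_iterator().get_next()

    with tf.Session() as sess:
        print(sess.run(next_batch))  # 8 values, shuffled within the 100-element window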

But I agree the API naming is not stellar, or at least it should come with better documentation.


I think tf.data is amazing; it's far, far better than the previous queue and string_input_producer style approach.

More than documentation, I would argue that TF, and tf.data especially, lacks a tracing tool that would let a user quickly debug how data is being transformed and whether there are any obvious ways to speed things up. E.g. image_load -> cast -> resize vs. image_load -> resize -> cast have different behavior, which led to hard-to-identify bugs. And tf.data's prefetch ends up being key to improving speed, yet it is not documented; the only way I actually found out about it was by reading your tf.data presentation.
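For anyone else hunting for it, the pattern I ended up with looks roughly like this (my own sketch; the file names and target size are just placeholders):

    import tensorflow as tf

    paths = tf.constant(["img0.jpg", "img1.jpg"])  # placeholder file names
    labels = tf.constant([0, 1])

    def load_and_preprocess(path, label):
        image = tf.read_file(path)
        image = tf.image.decode_jpeg(image, channels=3)          # uint8, shape [h, w, 3]
        image = tf.image.convert_image_dtype(image, tf.float32)  # cast first, scaling to [0, 1]
        image = tf.image.resize_images(image, [224, 224])        # then resize; swapping these two steps gives slightly different pixels
        return image, label

    dataset = (
        tf.data.Dataset.from_tensor_slices((paths, labels))
        .map(load_and_preprocess, num_parallel_calls=4)
        .batch(32)
        .prefetch(1)  # prepare the next batch while the current one is being consumed by the training step
    )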


I guess you misread me. The Dataset API itself is somewhat fine, much better than queues, for instance. However, it's not clear from the documentation how to do more complex stuff, or how to integrate it with the rest of the TF stack, especially the new Estimator API.
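What I eventually pieced together for the Estimator case looks roughly like this (my own sketch with toy arrays; I have no idea whether it's the intended way): an input_fn that builds the Dataset and returns its tensors.

    import numpy as np
    import tensorflow as tf

    features = np.random.rand(1000, 784).astype(np.float32)  # toy stand-in for MNIST images
    labels = np.random.randint(0, 10, size=1000).astype(np.int32)

    def train_input_fn():
        # Estimators call this inside their own graph, so the Dataset has to be built here.
        dataset = tf.data.Dataset.from_tensor_slices(({"x": features}, labels))
        dataset = dataset.shuffle(1000).repeat().batch(128)
        return dataset.make_one_shot_iterator().get_next()  # (features dict, labels) tensors

    estimator = tf.estimator.DNNClassifier(
        feature_columns=[tf.feature_column.numeric_column("x", shape=[784])],
        hidden_units=[256, 64],
        n_classes=10,
    )
    estimator.train(input_fn=train_input_fn, steps=1000)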



