I've been using fullwidth characters, since they're simple to convert to and easy to read. This page, of course, has the advantage of more variety in its characters.
Fullwidth characters have a fun habit of breaking most things, even more so than the rest of Unicode.
Since they are only used in CJK languages, the majority of programmers are unaware of them. The separate code points for half-width and full-width forms mean that you need a decent Unicode library to normalize them; otherwise a user could spoof another user, e.g. by registering "ａｄｍｉｎ" alongside "admin".
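A minimal Ruby sketch of the defense, assuming NFKC compatibility normalization (which folds fullwidth letters back to their ASCII equivalents) before comparing names:

    # "ａｄｍｉｎ" is spelled with fullwidth letters (U+FF41 onward),
    # so it compares unequal to "admin" unless normalized first.
    fake = "ａｄｍｉｎ"
    fake == "admin"                           # => false
    fake.unicode_normalize(:nfkc) == "admin"  # => true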
One cool feature is that it lets you write monospaced text even if you do not control the font, since there are fullwidth code points for most programming symbols.
Isn't the whole point of lorem ipsum that it doesn't make sense to the person reviewing the document? That way someone proofing a design will focus on the page elements, white space, and text flow instead of getting distracted by the text.
Not very usable. If you really need to support Unicode on your website, you need to make sure that the proper characters are shown. This tool does generate filler text, but would you know that the character set and HTML code page are set up correctly if you don't know which glyphs are supposed to be shown to the user? To give an example, the tool might generate the character č; if the wrong code page is used (say ISO-8859-1 instead of UTF-8), its two UTF-8 bytes show up as Ä plus a stray control character, and with two mismatched single-byte code pages you'd get a plausible-looking but wrong letter such as ł or ć. You might not even notice the difference. In some languages, such a subtle change in one character can turn text into swearing or just gibberish.
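That failure mode is easy to reproduce; here is a Ruby sketch that reinterprets the UTF-8 bytes of č as ISO-8859-1:

    # č is U+010D; its UTF-8 encoding is the two bytes C4 8D.
    # Read as ISO-8859-1, those bytes become Ä plus a control char.
    s = "č"
    s.bytes.map { |b| format("%02X", b) }              # => ["C4", "8D"]
    s.dup.force_encoding("ISO-8859-1").encode("UTF-8") # => "Ä\u008D"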
I'd rather see a website with fixed text that shows some often-used Unicode characters, text you can validate (i.e. that shows exactly the same on your website). Adding a JPG or PNG picture of what it should look like would be a plus. By "often used" I mean Latin with accents, Cyrillic, Greek, Hebrew, Arabic, Chinese, Japanese, etc. I'm sure a nice "Bacon Ipsum" with one paragraph from each of these would be quite usable. Although some are written right-to-left, so maybe two different sets would be better: one for RTL and another for LTR languages.
It would be useful to have control over the output encoding (i.e. generate an octet stream in UTF-8, UTF-16BE, or UTF-16LE). Generation of surrogate pairs would also be needed for encodings that use them, since many applications get this wrong (even in Java, the number of code points in a java.lang.String is not necessarily the same as the length of the String, because String.length() counts UTF-16 code units...).
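The surrogate-pair point is easy to demonstrate; a Ruby sketch that counts UTF-16 code units by encoding and halving the byte count:

    # U+1D11E (musical symbol G clef) lies outside the BMP, so
    # UTF-16 spends two code units (a surrogate pair) on it.
    s = "a\u{1D11E}"
    s.length                           # => 2 code points
    s.encode("UTF-16BE").bytesize / 2  # => 3 UTF-16 code units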
You could even add options for "badly behaved" UTF-8 (e.g. overlong encodings, deliberate sync errors, etc.) to stress-test the other end.
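One classic bad case, sketched in Ruby: the overlong two-byte encoding of "/" (C0 AF), which a strict decoder must reject:

    # 0x2F ("/") must be encoded as a single byte; the two-byte
    # form C0 AF is an overlong encoding and thus invalid UTF-8.
    bad = "\xC0\xAF".b.force_encoding("UTF-8")
    bad.valid_encoding?  # => false
    bad.scrub("?")       # => "??" (invalid bytes replaced)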
Do random accents help verify that you're handling Unicode correctly? You don't know what it's supposed to look like, so how do you know you're displaying it correctly?
Random accents do help to make sure you're seeing proper encoding. The output of this site will look like junk if you encode it as UTF-8 and try to view it with any other encoding.
Try it and see. Copy the output, paste it into a text document, and save it as UTF-8. In Chrome, open that document and, under the View menu, choose a different encoding.
Did you try it? For example, "Internationalization" becomes "Intèrnátìonàlïzâtiòn". If your code did anything bad to the Unicode (e.g. decided code points are one byte, or stripped the high bit), it would be very obvious.
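You can also reproduce the junk without a browser; here is a Ruby sketch of the classic UTF-8-read-as-Latin-1 round trip:

    # Each accented letter is two UTF-8 bytes; reinterpreting them as
    # ISO-8859-1 turns every accent into the familiar Ã-garbage.
    s = "Intèrnátìonàlïzâtiòn"
    s.dup.force_encoding("ISO-8859-1").encode("UTF-8")
    # => "IntÃ¨rnÃ¡tÃ¬onÃ lÃ¯zÃ¢tiÃ²n" (the blank after the fourth Ã is U+00A0)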
For my testing I also go to the Wikipedia home page and copy the text in the middle of the page that lists how many articles there are in various languages. This is great because it uses a wide variety of code points, including ones greater than 0xFFFF.
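A tiny Ruby check, assuming text holds the copied sample (the variable name is just for illustration), confirms the sample really exercises code points above 0xFFFF:

    # True if any character lies outside the Basic Multilingual Plane.
    text.each_char.any? { |c| c.ord > 0xFFFF }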
I also see that "Intèrnátìonàlïzâtiòn" line-breaks in the middle instead of being treated as a single word. Is that correct? It may be important for checking your layout. There are also no Asian scripts, nor taller/shorter characters that might overlap if your line spacing is too small.
Fullwidth conversion code in Ruby, a minimal sketch of the usual trick: printable ASCII (0x21..0x7E) sits at a fixed offset of 0xFEE0 from its fullwidth forms (U+FF01..U+FF5E), and the space maps to the ideographic space U+3000 (the to_fullwidth helper name below is just for illustration).
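    # Map printable ASCII to the corresponding fullwidth code point
    # (U+FF01..U+FF5E); the space becomes the ideographic space U+3000.
    def to_fullwidth(str)
      str.each_char.map { |c|
        case c.ord
        when 0x20       then 0x3000.chr(Encoding::UTF_8)
        when 0x21..0x7E then (c.ord + 0xFEE0).chr(Encoding::UTF_8)
        else c
        end
      }.join
    end

    puts to_fullwidth("lorem ipsum();")  # => "ｌｏｒｅｍ　ｉｐｓｕｍ（）；"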