I've been using fullwidth characters, since they're simple to convert to and easy to read. This page, of course, has the advantage of more variety in its characters.
Fullwidth characters have a fun habit of breaking most things, even more so than the rest of Unicode.
Since they are only used in CJK languages, the majority of programmers are unaware of them. The separate code points for half-width and full-width forms mean that you need a decent Unicode library to normalize them; otherwise a user could spoof another user, e.g. by registering "ａｄｍｉｎ" alongside "admin".
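A minimal Ruby sketch of the defense, assuming NFKC compatibility normalization (which folds fullwidth letters back to their ASCII equivalents) before comparing names:

    # "ａｄｍｉｎ" is spelled with fullwidth letters (U+FF41 onward),
    # so it compares unequal to "admin" unless normalized first.
    fake = "ａｄｍｉｎ"
    fake == "admin"                           # => false
    fake.unicode_normalize(:nfkc) == "admin"  # => true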
One cool feature is that it lets you write monospaced text even if you do not control the font, since there are fullwidth code points for most programming symbols.
Isn't the whole point of lorem ipsum that it doesn't make sense to the person reviewing the document? That way someone proofing a design will focus on the page elements, white space, and text flow instead of getting distracted by the text.
Not very usable. If you really need to support Unicode on your website, you need to make sure that the proper characters are shown. This tool does generate filler text, but would you know that the character set and HTML code page are set up correctly if you don't know which glyphs are supposed to be shown to the user? To give an example, the tool might generate the character č; if the wrong code page is used (say ISO-8859-1 instead of UTF-8), its two UTF-8 bytes show up as Ä plus a stray control character, and with two mismatched single-byte code pages you'd get a plausible-looking but wrong letter such as ł or ć. You might not even notice the difference. In some languages, such a subtle change in one character can turn text into swearing or just gibberish.
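That failure mode is easy to reproduce; here is a Ruby sketch that reinterprets the UTF-8 bytes of č as ISO-8859-1:

    # č is U+010D; its UTF-8 encoding is the two bytes C4 8D.
    # Read as ISO-8859-1, those bytes become Ä plus a control char.
    s = "č"
    s.bytes.map { |b| format("%02X", b) }              # => ["C4", "8D"]
    s.dup.force_encoding("ISO-8859-1").encode("UTF-8") # => "Ä\u008D"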
I'd rather see a website with fixed text that shows some often-used Unicode characters, text you can validate (i.e. that shows exactly the same on your website). Adding a JPG or PNG picture of what it should look like would be a plus. By "often used" I mean Latin with accents, Cyrillic, Greek, Hebrew, Arabic, Chinese, Japanese, etc. I'm sure a nice "Bacon Ipsum" with one paragraph from each of these would be quite usable. Although some are written right-to-left, so maybe two different sets would be better: one for RTL and another for LTR languages.
It would be useful to have control over the output encoding (i.e. generate an octet stream in UTF-8, UTF-16BE, or UTF-16LE). Generation of surrogate pairs would also be needed for encodings that use them, since many applications get this wrong (even in Java, the number of code points in a java.lang.String is not necessarily the same as the length of the String, because String.length() counts UTF-16 code units...).
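The surrogate-pair point is easy to demonstrate; a Ruby sketch that counts UTF-16 code units by encoding and halving the byte count:

    # U+1D11E (musical symbol G clef) lies outside the BMP, so
    # UTF-16 spends two code units (a surrogate pair) on it.
    s = "a\u{1D11E}"
    s.length                           # => 2 code points
    s.encode("UTF-16BE").bytesize / 2  # => 3 UTF-16 code units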
You could even add options for "badly behaved" UTF-8 (e.g. overlong encodings, deliberate sync errors, etc.) to stress-test the other end.
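One classic bad case, sketched in Ruby: the overlong two-byte encoding of "/" (C0 AF), which a strict decoder must reject:

    # 0x2F ("/") must be encoded as a single byte; the two-byte
    # form C0 AF is an overlong encoding and thus invalid UTF-8.
    bad = "\xC0\xAF".b.force_encoding("UTF-8")
    bad.valid_encoding?  # => false
    bad.scrub("?")       # => "??" (invalid bytes replaced)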
Do random accents help verify that you're handling Unicode correctly? You don't know what it's supposed to look like, so how do you know you're displaying it correctly?
Random accents do help to make sure you're seeing proper encoding. The output of this site will look like junk if you encode it as UTF-8 and try to view it with any other encoding.
Try it and see. Copy the output, paste it into a text document, and save it as UTF-8. In Chrome, open that document and, under the View menu, choose a different encoding.
Did you try it? For example, "Internationalization" becomes "Intèrnátìonàlïzâtiòn". If your code did anything bad to the Unicode (e.g. decided code points are one byte, or stripped the high bit), it would be very obvious.
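You can also reproduce the junk without a browser; here is a Ruby sketch of the classic UTF-8-read-as-Latin-1 round trip:

    # Each accented letter is two UTF-8 bytes; reinterpreting them as
    # ISO-8859-1 turns every accent into the familiar Ã-garbage.
    s = "Intèrnátìonàlïzâtiòn"
    s.dup.force_encoding("ISO-8859-1").encode("UTF-8")
    # => "IntÃ¨rnÃ¡tÃ¬onÃ lÃ¯zÃ¢tiÃ²n" (the blank after the fourth Ã is U+00A0)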
For my testing I also go to the Wikipedia home page and copy the text in the middle of the page that lists how many articles there are in various languages. This is great because it uses a wide variety of code points, including ones greater than 0xFFFF.
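A tiny Ruby check, assuming text holds the copied sample (the variable name is just for illustration), confirms the sample really exercises code points above 0xFFFF:

    # True if any character lies outside the Basic Multilingual Plane.
    text.each_char.any? { |c| c.ord > 0xFFFF }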
I also see that "Intèrnátìonàlïzâtiòn" line-breaks in the middle instead of being treated as a single word. Is that correct? It may be important for checking your layout. There are also no Asian scripts, nor taller/shorter characters that might overlap if your line spacing is too small.
Fullwidth conversion code in Ruby, a minimal sketch of the usual trick: printable ASCII (0x21..0x7E) sits at a fixed offset of 0xFEE0 from its fullwidth forms (U+FF01..U+FF5E), and the space maps to the ideographic space U+3000 (the to_fullwidth helper name below is just for illustration).
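    # Map printable ASCII to the corresponding fullwidth code point
    # (U+FF01..U+FF5E); the space becomes the ideographic space U+3000.
    def to_fullwidth(str)
      str.each_char.map { |c|
        case c.ord
        when 0x20       then 0x3000.chr(Encoding::UTF_8)
        when 0x21..0x7E then (c.ord + 0xFEE0).chr(Encoding::UTF_8)
        else c
        end
      }.join
    end

    puts to_fullwidth("lorem ipsum();")  # => "ｌｏｒｅｍ　ｉｐｓｕｍ（）；"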