DjVu, an open PDF alternative (wikipedia.org)
143 points by jacquesm on June 30, 2018 | 98 comments


Back when I was a graduate student, I developed a massive library of DjVu scans. Now, I find myself using Cloud Convert or djvu2pdf to make PDFs out of them on a slow, ongoing basis. Hardly anybody has DjVu readers on their machines now, especially with the advent of mobile.

Actually, now that I think about it, the slide started all the way back in 2010 when I couldn't find a decent DjVu viewer for the iPad.


People like to say using DjVu gives you a 10x space saving over PDF for scanned documents. That's not exactly true. If you are comparing djvu to your scanner's default PDF output then yeah, maybe the PDF is 10x bigger. But I've scanned enough documents to know how to properly optimize and tune the resulting PDFs. Things like JBIG2 are a huge improvement over default scanner output for text. And the resulting PDF is no more than 2x the djvu file. The portability gain is well worth it.


> Things like JBIG2 are a huge improvement over default scanner output for text.

Just be careful that it doesn't silently change the text, because JBIG2 compressors can be lossy that way:

https://news.ycombinator.com/item?id=6156238


There's another gain for DJVU here: the speed gain. For scanned documents, my DJVU viewer lets me page much faster through a DJVU document than a PDF viewer does through a PDF. At least that's been my experience on my old, slow laptop, where speed really matters.

Also, I don't always scan myself. Sometimes I'll get a PDF generated by someone else, which contains scanned images. Then I'll often convert it to DJVU for speed and space savings.


I find it depends on the PDF and also on the software. On Mac OS PDFs are very fast.


Any tool to batch tune/compress? Asking for a friend.


Can the PDF be compressed with, say, 7z?


PDF internally contains compression (though your particular PDF may not use it). Running a well-optimized PDF through 7z is like compressing a movie with 7z: you get no space reduction.

One interesting trick I've tried, however, is to first remove all internal compression from PDF streams, and then apply LZMA to the resulting PDF. This can be a good way to compress PDFs losslessly.
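A rough way to see why this can work, sketched in Python with only the standard library. The page streams are synthetic stand-ins (the `make_streams` data is hypothetical, not a real PDF), each deflated separately the way a PDF's Flate filter would; a real experiment would first need a PDF tool to rewrite the file with uncompressed streams.

```python
import lzma
import random
import zlib

WORDS = [f"w{n:02d}" for n in range(64)]  # hypothetical vocabulary


def make_streams(pages=40, shared_words=1500, unique_words=50):
    """Build synthetic 'page streams' that share a large common block,
    standing in for material repeated across a PDF."""
    rng = random.Random(0)
    shared = " ".join(rng.choice(WORDS) for _ in range(shared_words))
    streams = []
    for page in range(pages):
        unique = " ".join(rng.choice(WORDS) for _ in range(unique_words))
        streams.append(f"% page {page}\n{unique}\n{shared}\n".encode())
    return streams


def compare(streams):
    raw = b"".join(streams)
    # What a PDF does internally: every stream is deflated on its own.
    flate = b"".join(zlib.compress(s, 9) for s in streams)
    return {
        "raw": len(raw),
        "flate_per_stream": len(flate),
        # A 7z-style pass over the already-compressed data.
        "lzma_over_flate": len(lzma.compress(flate)),
        # The trick: undo the internal compression, then LZMA everything.
        "lzma_over_raw": len(lzma.compress(raw)),
    }


sizes = compare(make_streams())
for name, size in sizes.items():
    print(f"{name:>18}: {size} bytes")
```

On data like this, the last figure should land far below the per-stream Flate total, because LZMA can reference repeats across streams while each Flate stream only sees itself. PDFs dominated by scanned images are unlikely to benefit much, since those streams typically use image codecs rather than Flate.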


> first remove all internal compression from PDF streams, and then apply LZMA to the resulting PDF

And this out-compresses PDF's internal compression?

Could PDF be updated in future to support superior compression algorithms?


Yes it's theoretically quite possible. PDF is designed with extensibility in mind and it's very easy (technology-wise) to introduce new compression formats. The only difficulty is whether collectively we have the willpower to implement this in enough products and ship a new ISO standard.

A less ambitious alternative might be to use something like zopfli. It is totally backwards compatible so it can be used today to compress PDFs better without sacrificing compatibility.
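The backwards-compatibility point can be illustrated without zopfli itself. Zopfli is not in Python's standard library, so zlib's effort levels stand in for it here (an assumption for the sake of a self-contained sketch): a Deflate decoder does not care how hard the encoder worked.

```python
import zlib

# Hypothetical sample standing in for an uncompressed PDF content stream.
data = b"BT /F1 12 Tf 72 720 Td (Hello, DjVu vs PDF) Tj ET\n" * 500

fast = zlib.compress(data, 1)  # low-effort encoder
best = zlib.compress(data, 9)  # high-effort encoder, same output format

# One and the same decoder reads both outputs.
assert zlib.decompress(fast) == data
assert zlib.decompress(best) == data

print(len(data), len(fast), len(best))
```

A stronger encoder only changes how the compressed bits are chosen, not the format, which is why existing PDF readers would need no changes.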


Yes, it can be compressed (I just tried simple /usr/bin/zip w/o flags on a 10 MB PDF and got a ~10% file saving), but then it has to be uncompressed before usage. It's better to use some kind of transparent mechanism from within the OS (e.g. filesystem or compact.exe/CompactGUI [1]).

[1] https://github.com/ImminentFate/CompactGUI


Neat, I didn't know Windows 10 has an improved compression system.

I presume it only works with NTFS? Surprised the page doesn't say.


Yep, NTFS-only. The GUI is just a frontend for the Windows command compact.exe which does mention NTFS-only [1] [2]

[1] https://www.petri.com/compress-files-with-compact-exe

[2] https://stackoverflow.com/questions/7928840/how-to-use-compa...


Having just googled it, I'm surprised to see that ext4 doesn't support compression.

I presume that's for simplicity?


I suppose, yes. It's also an older FS that largely prioritizes backwards compatibility. You could ask Theodore Ts'o, the principal author of the extX filesystems. I suppose you already read this one [1]

[1] https://serverfault.com/questions/617648/transparent-compres...


I hadn't, thanks for the link.

I'm not particularly knowledgeable about filesystems, or indeed of deep Linux matters in general, but the idea of layering ZFS and ext4 strikes me as the bad kind of 'cute' solution. If you want compression, ext4 surely isn't the place to start.


It seems to be a front-end for compact.exe, which exposes the on-the-fly compression feature of NTFS.


Indeed, I hadn't meant to imply otherwise.

I like the tool though. Couldn't get compact.exe to work, it reported compressing 0 of the files in the directory. CompactGUI worked fine though. Maybe it invokes compact.exe once per file, rather than across the directory.


What’s the point of this in 2018 - now that PDF is an open ISO standard?


I've been using it for years. For scanned materials it's simply better: files are smaller, rendering is snappy, and if you're into old books its background/foreground model really pays off.

Quite strange this scepticism here about good old djvu. Better fetch a book from the Internet Archive in both formats and check for yourselves why it's a thing.


There's just no need for it.

50 years from now, you are not going to have any trouble finding a .PDF viewer, but DjVu is a different question entirely.


Tell me how I can remove the background of a yellowed old book I have to read or print for my research side-project with a PDF file. I don't understand dismissive generalisations about a tool coming from people that don't use it. For the purpose it was designed for it does the job better than a format designed for an entirely different thing.

Frankly I don't expect having to read PDF files 50 years from now, but if I were thinking about preservation, the timeproof archiving medium still is acid-free paper –better yet, vellum– and probably India ink. As for the need of any technology, beyond stone tools and fire nobody really needs anything, it's just convenient.


> Tell me how I can remove the background of a yellowed old book I have to read or print for my research side-project with a PDF file.

I think you may be confusing tools with file formats. There are any number of ways to do this with .PDFs, none of which I've personally had to explore.

> Frankly I don't expect having to read PDF files 50 years from now, but if I were thinking about preservation, the timeproof archiving medium still is acid-free paper –better yet, vellum– and probably India ink. As for the need of any technology, beyond stone tools and fire nobody really needs anything, it's just convenient.

Meanwhile, back on this planet...


I can just open this djvu file with a viewer and display the foreground, done. I can focus on reading it, which is the whole point of me accessing this file. I don't have to process a bloated PDF version with other tools that you don't even know, and neither do I.

Exactly, on this planet we can still read, for instance, books printed by Gutenberg, yet try to convert some Word files rotting on a low-density Macintosh floppy from 25 years ago. Paper beats everything else because it's a passive medium that lasts centuries and comes with all the hardware you need. A computer file needs a whole ecosystem and needs to be copied around often.

Will we even use files in 2068? I don't know. Maybe in 2047 PDFs will be out of mainstream support because 99% of customers will be using the next better format or workflow.


> Exactly, on this planet we can still read, for instance, books printed by Gutenberg, yet try to convert some Word files rotting on a low-density Macintosh floppy from 25 years ago.

Meanwhile, a .PDF from the same era would still be perfectly readable. There's a lesson there.


There was no way to create a pdf file with a Mac back in 1993. It didn't take off until Acrobat 3 a few years later, and then you couldn't make decent pdfs unless you bought Acrobat 4 and used Distiller. You'd typically share your doc (or whatever) files, and if you wanted portability you'd use PostScript. Which still is a better format if you have to make changes at the very last stage before printing.

Ordinary users just couldn't make pdfs without paying for additional publishing software until some ten years ago.

The lesson is that some formats become de facto standards, then people forget how that happened (and that it could happen again), and ignore better technical solutions for specific jobs.


Djvu certainly may fall out of use some day, but I think you are greatly underestimating how popular the format is if you really think there won't be programs around to read it. It's not an obscure format, and there are widely used open source libraries to do so.


Every Linux distribution comes bundled with a DjVu viewer by default. I don't think it will be a problem.


> 50 years from now, you are not going to have any trouble finding a .PDF viewer

That is a very bold statement. 50 years is a long time even in human years. It's an eternity in technology years.


Some people forget that 50 years ago punched cards were still the ubiquitous format and people were only just switching to magnetic tape.


The .PDF format is already 25 years old. What do you expect to happen in another 50 to render it obsolete?

Specifically, what will kill off the .PDF format while leaving DjVu untouched?


Maybe:

- Moving to HTML and signed HTML file.

- JPEG > PDF on mobile, so people send the former over the latter.


Is there a standard for signing HTML files yet? I've not seen one where the signature and content could be in the same HTML file.


Nope, I am just thinking of using GPG to sign and/or protect the file content (like with any other file).


And the adoption of signed, presumably-uncompressed HTML files will leave DjVu in a dominant position because...?


It would just remove PDF from the dominant position. DjVu would still be mostly ignored by most.


Your chances of things working would be much better with PDF as well, given that the PDF/A version is sometimes used.


From my own playing around figuring how I wanted to archive scanned documents, it's both smaller and much quicker to load.

But going out on a limb with a format doesn't instill confidence for the long term. It's great there's one "perfect" implementation that seems to just work, but the perceived lack of an ecosystem is still worrying. Not that looking through the source of say libjpeg instills confidence either! I was also considering JP2 for nicer-looking artifacts, but couldn't bring myself to go that route.

I eventually just increased my size budget (because disappearing a ream into 5GB is still damn useful!), and decided to just store lossless FLIF at 300dpi. I statically compiled the flif binary, do a test decode to make sure the bits round trip, and store checksums of the decoder binary and raw raster alongside.
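That verify-then-record step might look roughly like the following. This is a sketch, not the commenter's actual script: zlib stands in for the FLIF encoder (an assumption, to keep the example self-contained), and `archive_record` and the sample byte strings are hypothetical.

```python
import hashlib
import json
import zlib


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def archive_record(raw_raster: bytes, decoder_binary: bytes) -> dict:
    """Encode, test-decode, and return checksums to store alongside."""
    encoded = zlib.compress(raw_raster, 9)  # stand-in for the flif encode
    decoded = zlib.decompress(encoded)      # the test decode
    if decoded != raw_raster:
        raise ValueError("round trip failed; keep the original scan")
    return {
        "raster_sha256": sha256_hex(raw_raster),
        "decoder_sha256": sha256_hex(decoder_binary),
        "encoded_bytes": len(encoded),
    }


# Hypothetical inputs: a tiny fake raster and fake decoder bytes.
record = archive_record(b"\x00\xff" * 1024, b"fake-decoder-binary")
print(json.dumps(record, indent=2))
```

Storing the decoder's checksum next to the raster's means a future reader can confirm both that the right binary is being used and that it still reproduces the original pixels.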

I then stuff this into a zip container (aka .cbz), along with some really poor jpg thumbnails so present evince can view it. I still need to write the transcoder to .djvu - when I'm no longer buried under piles of paper!


I've been scanning a lot of books and switched over a while ago - you get books which can be 10x smaller, and which load and display faster too. And if you desperately need a PDF for some reason, they can be converted back easily.


i don't know about djvu format, but being in the process of writing a pdf parser atm i can tell you there's definitely some room for a simpler file format than pdf. The size of the specification is daunting, and i still encounter weird behaviors for some documents on some major platform's reader ( and now that i've read part of the spec i definitely understand why).

Something a little more up to date and more focused on today's technologies would be really nice.


Interesting, are you starting from scratch or building on poppler or something like it? I am writing a PDF parser too and am building on top of poppler to do the heavy lifting.


I’m building something on iOS to extract text only from pdf along with some basic rendering features (text position and font info, something that big commercial libraries don’t provide, curiously). So i’m using apple core foundation lib to do the basic parsing and use a project called pdfkitten as a starting point for font to unicode decoding and text positioning. I also plan to add obvious heuristics for "words" building later on as well as a global index for all the rendered blocks (all done in swift). Images, forms, etc, are completely out of scope.

It’s a very specific and ( hopefully) narrow need and is the only reason i’m starting this project in the first place.


This is a little off-topic but you seem like a good person to ask:

Do you know a way to crop a PDF? Basically, if I have a PDF of a single letter-size page with a 6x9 label on it, and I'd like to crop it down to that 6x9 label only, is there a way to do this? I've been looking into this with, for instance, the poppler library, and haven't found anything.


i suppose you mean you want to do that with code, and not using a tool ( because my first search would be for command line tool or automation software using a combination of pdf to jpg then crop then export in pdf).

You could parse the pdf for that bit of text then simply recreate a new one with the correct size. But it depends a lot on how that label is printed (using one block for the whole label or one per letter). Basically it's a lot of work or not depending on the number of cases you need to deal with.

I know python has a pdfminer library, you may want to try that


Yep, I'm looking for a way to do that with automation somehow, either code or even a command-line tool, just not manually with a GUI.

My issue with JPG conversion is the lossiness factor introduced there; I'm trying to avoid that.

There should only be one case: a single label on a letter-size page surrounded by a lot of whitespace. I'll check out that miner library; thanks!!
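For that single case, the crop rectangle is simple arithmetic, since PDF page coordinates are measured in points (1/72 inch), typically from the lower-left corner. The sketch below assumes the label is centered on the page, which the thread does not state, and `crop_box` is a hypothetical helper. Setting a page's CropBox to such a rectangle with a PDF library leaves the page content itself untouched, so nothing is rasterized or recompressed.

```python
POINTS_PER_INCH = 72  # PDF default user-space unit


def crop_box(page_w_in, page_h_in, label_w_in, label_h_in):
    """Return (llx, lly, urx, ury) in points for a label assumed to be
    centered on the page."""
    page_w = page_w_in * POINTS_PER_INCH
    page_h = page_h_in * POINTS_PER_INCH
    label_w = label_w_in * POINTS_PER_INCH
    label_h = label_h_in * POINTS_PER_INCH
    llx = (page_w - label_w) / 2
    lly = (page_h - label_h) / 2
    return (llx, lly, llx + label_w, lly + label_h)


# Letter page (8.5 x 11 in) with a 6 x 9 in label.
print(crop_box(8.5, 11, 6, 9))  # (90.0, 72.0, 522.0, 720.0)
```

If the label sits elsewhere on the page, the lower-left corner would have to be measured or detected instead of computed from centering.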


Thank you, I missed your answer when it first was given.


DjVu files are often a lot smaller than the corresponding PDFs. I have a lot of math texts in electronic form and the difference is pretty dramatic.


It is much better optimized for scanned documents than PDF.


I'd avoid using this. This is a format that's lossy re: text, so text and numbers can get silently rewritten; it reuses a letter image, placing it throughout a page wherever it's recognized. A low-resolution scan or bad OCR results in silently corrupted text.


I've been reading quite a few DjVu encoded documents and have yet to see a single occurrence of what you describe. If there are such errors they will be visually indistinguishable from the original, and a good OCR package would likely identify the right letter given the context.

Do you have some concrete examples of corrupted texts? I'd very much like them to use as test cases in a document processing pipeline.


This document caused me a fair bit of consternation transcribing it on a site that prefers DjVu files to PDF:

DjVU: https://ia800204.us.archive.org/28/items/cu31924026442156/cu...

PDF: https://ia800204.us.archive.org/28/items/cu31924026442156/cu...

I don't have an easy way to find all of the errors I observed but typical ones are on page 58 at the end of the third line, 'Germanic' with a little bit of bleed between the serifs at the bottom of 'n' gets transformed into 'Germauic', and on page 420 at the end of line 7, 'compass' with a small dot in the middle of the 'o' becomes 'cempass'.

FWIW I've transcribed a bunch of other DjVu files produced by the Internet Archive and not noticed anything like this.


Excellent. Thank you very much for providing this, it will make my life a lot easier (or, initially: a lot harder :) ).

If I can reliably flag problem cases then at least I will know which files are going to have a lot of manual work.


> have yet to see a single occurrence of what you describe

How would you know? Maybe if a document had an unusual misspelling. But how would you know if DjVu munged a 3 into an 8?


But this is an entirely hypothetical weakness. Without at least one example of it happening in the wild, it may as well not exist.


Let me assure you, this does exist in the wild. I spotted it many times, especially in documents scanned 15+ years ago, when space and bandwidth were an issue and people did anything to reduce file size, at the cost of quality. It was rare enough not to make the format unusable, but it happened more than once, enough to make you worry: "was it 3 sp. or 8 sp.?"

It also seems that the problem happens more often with Cyrillic glyphs, which have fewer descending/ascending elements than typical Latin ones.


With DjVu or with ocr'd documents in general?


While of course it's an issue with any scanned docs, the DjVu compression method makes it harsher, as, say, you can end up with _all_ '8's in a document replaced with '3' (or, say, 'h's with 'n's), while keeping a seemingly decent appearance so that you don't suspect that something is wrong.


I'd love to see some concrete examples.

I totally understand how this could theoretically happen, but what matters is whether it actually happens in the documents that I'm working with. A test case where it happens for sure would make all the difference in determining whether the documents I'm working with are at risk, and if so, whether I can detect which ones are at risk (that would be half the battle won).

It would have to be a case where a human would see a 3 or an 8 (or a similar transposition with other glyphs) resulting in a document that is corrupted in such a way that afterwards the human would see the (invalid) alternative.

Even a single instance of this happening would be very relevant.


Well, I know that there are examples in some of the 8k+ djvu files on my hard disk (mostly old textbooks), as I learned about this problem many years ago from live experience. But I couldn't invent a method to find an example deliberately, so you can only take my word that they exist.

To clarify, it's not about OCR errors, where an advanced OCR engine could use a dictionary, language autorecognition etc. It's about DjVu-compressed text scans without an OCR layer. The compression method relies only on glyph similarity. The compression algorithm could not only mistake a '3' with small gaps left for an '8', it would also replace this dirty '3' with the reference '8' image, so that a human reader thinks they are seeing a scan, i.e. an image of the scanned page, while in fact they see an 'edited' image that does not correspond to the actual page.
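A toy model of that failure mode, in Python. This is not the JB2 algorithm, only an illustration of dictionary-based glyph matching with a similarity threshold; the 5x7 bitmaps, `encode`, `decode`, and the `max_diff` parameter are all made up for the example.

```python
EIGHT = (".###.", "#...#", "#...#", ".###.", "#...#", "#...#", ".###.")
THREE = (".###.", "....#", "....#", ".###.", "....#", "....#", ".###.")


def pixel_diff(a, b):
    """Number of pixels that differ between two same-sized glyphs."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def encode(glyphs, max_diff):
    """Store each distinct shape once; reuse a stored shape whenever a
    new glyph is within max_diff pixels of it."""
    dictionary, indices = [], []
    for glyph in glyphs:
        for i, known in enumerate(dictionary):
            if pixel_diff(glyph, known) <= max_diff:
                indices.append(i)
                break
        else:
            dictionary.append(glyph)
            indices.append(len(dictionary) - 1)
    return dictionary, indices


def decode(dictionary, indices):
    return [dictionary[i] for i in indices]


page = [EIGHT, THREE, EIGHT]
strict = decode(*encode(page, max_diff=0))  # exact matches only
sloppy = decode(*encode(page, max_diff=4))  # threshold too generous
print(strict == page)      # True
print(sloppy[1] == EIGHT)  # True: the 3 now renders as a clean 8
```

The decoded page from the sloppy run looks perfectly sharp, which is exactly why the substitution is hard to notice without the original.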


> so you can only take my word that they exist.

I take your word for it. I'm asking because I need a test case for technical reasons, not because I don't believe you.

> To clarify, it's not about OCR errors, where an advanced OCR engine could use a dictionary, language autorecognition etc. It's about DjVu-compressed text scans without an OCR layer. The compression method relies only on glyph similarity.

Yes, I totally get that. It's about DjVu's compressor replacing the image of one character with the image of another.


IIRC Xerox revoked lots of their copiers which used the same compression algorithm internally after this was demonstrated by a document with lots of numbers. Google it, it's not that hard to find (I'm on a phone now so it's not very convenient for me atm)


> Let me assure you, this does exist in the wild.

You would obtain more credibility by actually providing such an example.


> But how would you know if DjVu munged a 3 into an 8?

By comparing it with the original, surely.


reaperducer's point was that jacquesm almost certainly hasn't done so, and so can't be entirely confident of not having been the victim of corrupt text.


In my case I don't actually have the originals so it would be hard to impossible to verify on this dataset but if there are concrete examples of such errors then I could use that to identify documents that are more at risk of such errors.


Do you have any sources for this? This is very interesting.

Particularly because Xerox photocopiers (yes, self-contained photocopiers) had exactly the same kinds of issues due to image compression engine glitches (https://news.ycombinator.com/item?id=6156238, https://news.ycombinator.com/item?id=9584172)


It's not sourced, but the Wikipedia page says it uses something called JB2 compression which is susceptible to this: https://en.wikipedia.org/wiki/DjVu#Compression


JBIG2 was the problem algorithm with the photocopiers.

JBIG2 was based upon JB2.


CJB2 from DjVuLibre has lossless mode and it's enabled by default.


The interesting thing is Yann LeCun’s involvement.


On my MacBook Air 2013 I still have to convert scanned pdf to djvu in order to read it; otherwise the rendering with poppler is too slow.


I feel like DjVu was never sufficiently better than PDF for me to bother using it. There are PDF readers for almost literally any device imaginable, and often they are included with your choice of operating system. I feel like DjVu would have more of an audience if it weren't for the fact that PDF is an open standard now.


10 years ago, my main computer was a Pentium III I had cobbled together from parts at a thrift store. Djvu was the only format for high resolution scans that would load without rendering the computer unusable.


"...divides a single image into many different images, then compresses them separately." Wouldn't it be better to go the other way and backpack-algo the pages into a consolidated texture atlas based on edge similarity, and then run compression on the result?


From what I understand, it segments the image into text and a background image, does something like JBIG2 on the text (building a global dictionary of glyphs and placing them on the page), and does a lossy compression on the background image.

There aren't a lot of viewers out there for DjVu and the encoding side is patent encumbered, so I'm not interested in the format.

You can get pretty close with JBIG2+Jpeg2k in a PDF file, I believe archive.org does this, but I don't know of an open source encoder that does it and sometimes PDF viewers don't decode jbig2/jpg2k efficiently.


How many programs make the pun about déjà vu?

DJV View http://djv.sourceforge.net/

WinDjView https://windjview.sourceforge.io/


Unfortunately, some library sites still rely on a DjVu plugin, and with Firefox dropping support for native plugins, those materials became inaccessible. Especially because each download is dynamic and generated by the plugin itself from multiple smaller files.

Other than that, DjVu is a great format.

For those who wonder how it's better than PDF for scanned texts - DjVu uses different compression for background and actual text, thus saving tons of space.


What about the DjVu.js extension?

But worst case scenario you could run something like Pale Moon or WaterFox only for accessing such documents.

Also I know how slow institutional entities can be to adapt, but they will catch up eventually.


I haven't tried DjVu.js - I'll take a look, thanks!


No comments here addressing why PDF is still so slow at rendering?


I use SumatraPDF. It’s a lightweight PDF viewer and it loads really fast. Especially compared to Adobe.


Are you sure the problem is not Acrobat Reader? I haven't used it for more than 10 years but I remember it being slow.

Okular (provided by KDE), for example, is blazingly fast. Or at least, I haven't noticed any slowness.


Sadly there are not good enough DjVu readers. I use iOS 11.4


Yeah. I use Kybook on iOS.


I too use the same, but I found it on AppStore after searching really hard. Most of them were asking to pay.


DjVu rocks. It's been around a long time too. It's a shame it hasn't been picked up more.


Browsers should support DjVu natively. It may help with wider adoption of the format.


This may be an acceptable format for some electronic versions of scanned books, but it's not a "PDF Alternative".

Also PDF is an open format.


PDF wasn't an open format when DjVu was first created.

Now that it is, there's probably no reason to use DjVu.


PDF is great for electronically created documents. DjVu is far superior for scanned documents, though. It compresses much smaller and DjVu viewers are far more performant at scrolling through large scanned documents.


even compared to MuPDF? PDF.js is well known to be very slow, and even poppler is pretty slow compared to MuPDF.


I still prefer Adobe Acrobat Reader despite everything, and absolutely hate PDF.js because it seems to take a few seconds to render a page when jumping around, whereas Adobe Acrobat Reader is pretty much interactive. Never mind the fact that pdf.js still has problems rendering quite a few documents I encounter, while Acrobat is basically the de facto reference implementation for PDFs.

btw if you quickly want some random large scanned PDFs for testing, archive.org is a goldmine. For example here is a scanned 1200-page electronics handbook: https://archive.org/details/NationalSemiconductorLinearAppli...

They also do DjVu, so you can see the size difference there. For that particular book, it's 58M for the PDF vs 31M for DjVu. So basically half the size.


I use mupdf all the time for PDFs. It's much faster than PDF.js, but still much, much slower than djvu for scanned documents.


In what circumstances does it actually matter to be able to get better compression than CCITT Group 4? For example, when is size such a deciding factor that it makes sense to forgo the compatibility of CCITT Group 4 in PDF and to use something else like JBIG2 in PDF or DjVu [assuming the latter actually is smaller; I didn't measure]?


no one mentioning EPUB? https://en.wikipedia.org/wiki/EPUB


That's a very different beast. Not page aware.


That is also exactly its strength. Though EPUB 3 now also allows enforcing paged content, intended for comic books.


It's not a strength if you need that old book that has no electronic version and can only be scanned. AFAIK EPUB 3 allows for SVG layouts that can be, I imagine, very sophisticated; still, it doesn't address the old book case.
