Back when I was a graduate student, I built up a massive library of DjVu scans. Now I find myself using Cloud Convert or djvu2pdf to turn them into PDFs on a slow, ongoing basis. Hardly anybody has a DjVu reader on their machine now, especially with the advent of mobile.
Actually, now that I think about it, the slide started all the way back in 2010 when I couldn't find a decent DjVu viewer for the iPad.
People like to say using DjVu gives you a 10x space saving over PDF for scanned documents. That's not exactly true. If you are comparing DjVu to your scanner's default PDF output then yeah, maybe the PDF is 10x bigger. But I've scanned enough documents to know how to properly optimize and tune the resulting PDFs. Things like JBIG2 are a huge improvement over default scanner output for text, and the resulting PDF is no more than 2x the size of the DjVu file. The portability gain is well worth it.
There's another gain for DJVU here: the speed gain. For scanned documents, my DJVU viewer lets me page much faster through a DJVU document than a PDF viewer does through a PDF. At least that's been my experience on my old, slow laptop, where speed really matters.
Also, I don't always scan myself. Sometimes I'll get a PDF generated by someone else, which contains scanned images. Then I'll often convert it to DJVU for speed and space savings.
PDF internally contains compression (though your particular PDF may not use it). Running a well optimized PDF over 7z is like compressing a movie with 7z: you get no space reductions.
One interesting trick I've tried, however, is to first remove all internal compression from PDF streams, and then apply LZMA to the resulting PDF. This can be a good way to compress PDFs losslessly.
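To make the trick above concrete, here is a minimal sketch using only Python's standard library. It does not parse real PDFs: the "streams" are synthetic stand-ins (an assumption for illustration) built from a pool of chunks that recur across streams, roughly the way fonts or boilerplate recur across a document. Compressing each stream separately with Deflate hides that cross-stream repetition from a later LZMA pass, while LZMA over the uncompressed streams can exploit it.

```python
import lzma
import random
import string
import zlib

rng = random.Random(0)  # fixed seed so the synthetic data is reproducible
alphabet = string.ascii_letters + string.digits

# Hypothetical stand-in for PDF content: 40 chunks that recur across streams.
pool = ["".join(rng.choices(alphabet, k=200)).encode("ascii") for _ in range(40)]
# Each "stream" is 15 distinct chunks from the pool, in a random order.
streams = [b"".join(rng.sample(pool, 15)) for _ in range(20)]

# (a) Every stream Deflate-compressed on its own, then LZMA over the whole file.
deflated = b"".join(zlib.compress(s, 9) for s in streams)
lzma_over_deflated = lzma.compress(deflated)

# (b) Internal compression removed first, then LZMA over the raw streams.
raw = b"".join(streams)
lzma_over_raw = lzma.compress(raw)

for label, blob in [
    ("raw", raw),
    ("deflate per stream", deflated),
    ("lzma over deflated", lzma_over_deflated),
    ("lzma over raw", lzma_over_raw),
]:
    print(f"{label:>20}: {len(blob)} bytes")
```

On this synthetic data the second LZMA pass barely shrinks the already-deflated bytes, while LZMA over the raw streams ends up far smaller. How much a real PDF gains depends entirely on how much its streams have in common, and image streams that use lossy filters can't be handled this way at all.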
Yes it's theoretically quite possible. PDF is designed with extensibility in mind and it's very easy (technology-wise) to introduce new compression formats. The only difficulty is whether collectively we have the willpower to implement this in enough products and ship a new ISO standard.
A less ambitious alternative might be to use something like zopfli. It is totally backwards compatible so it can be used today to compress PDFs better without sacrificing compatibility.
Yes it can be compressed (I just tried plain /usr/bin/zip w/o flags on a 10 MB PDF and saved ~10%), but then it has to be uncompressed before use. It's better to use some kind of transparent mechanism from within the OS (e.g. the filesystem or compact.exe/CompactGUI [1]).
I suppose, yes. It's also an older FS which largely has backwards compatibility. You could ask Theodore Ts'o, the principal author of extXfs. I suppose you already read this one [1]
I'm not particularly knowledgeable about filesystems, or indeed of deep Linux matters in general, but the idea of layering ZFS and ext4 strikes me as the bad kind of 'cute' solution. If you want compression, ext4 surely isn't the place to start.
I like the tool though. Couldn't get compact.exe to work, it reported compressing 0 of the files in the directory. CompactGUI worked fine though. Maybe it invokes compact.exe once per file, rather than across the directory.
I've been using it for years. For scanned materials it's simply better: files are smaller, rendering is snappy, and if you're into old books its background/foreground model really pays off.
Quite strange this scepticism here about good old djvu. Better fetch a book from the Internet Archive in both formats and check for yourselves why it's a thing.
Tell me how I can remove the background of a yellowed old book I have to read or print for my research side-project with a PDF file. I don't understand dismissive generalisations about a tool coming from people that don't use it. For the purpose it was designed for it does the job better than a format designed for an entirely different thing.
Frankly I don't expect having to read PDF files 50 years from now, but if I were thinking about preservation, the timeproof archiving medium still is acid-free paper –better yet, vellum– and probably India ink. As for the need of any technology, beyond stone tools and fire nobody really needs anything, it's just convenient.
> Tell me how I can remove the background of a yellowed old book I have to read or print for my research side-project with a PDF file.
I think you may be confusing tools with file formats. There are any number of ways to do this with .PDFs, none of which I've personally had to explore.
> Frankly I don't expect having to read PDF files 50 years from now, but if I were thinking about preservation, the timeproof archiving medium still is acid-free paper –better yet, vellum– and probably India ink. As for the need of any technology, beyond stone tools and fire nobody really needs anything, it's just convenient.
I can just open this djvu file with a viewer and display the foreground, done. I can focus on reading it, which is the whole point of me accessing this file. I don't have to process a bloated PDF version with other tools that you don't even know about, and neither do I.
Exactly, on this planet we can still read, for instance, books printed by Gutenberg, yet try to convert some Word files rotting on a low-density Macintosh floppy from 25 years ago. Paper beats everything else because it's a passive medium that lasts centuries and comes with all the hardware you need. A computer file needs a whole ecosystem and needs to be copied around often.
Will we even use files in 2068? I don't know. Maybe in 2047 PDFs will be out of mainstream support because 99% of customers will be using the next better format or workflow.
> Exactly, on this planet we can still read, for instance, books printed by Gutenberg, yet try to convert some Word files rotting on a low-density Macintosh floppy from 25 years ago.
Meanwhile, a .PDF from the same era would still be perfectly readable. There's a lesson there.
There was no way to create a pdf file with a Mac back in 1993. It didn't take off until Acrobat 3 a few years later, and then you couldn't make decent pdfs unless you bought Acrobat 4 and used Distiller. You'd typically share your doc (or whatever) files, and if you wanted portability you'd use PostScript. Which still is a better format if you have to make changes at the very last stage before printing.
Ordinary users just couldn't make pdfs without paying for additional publishing software until some ten years ago.
The lesson is that some formats become de facto standards, then people forget how that happened (and that it could happen again), and ignore better technical solutions for specific jobs.
Djvu certainly may fall out of use some day, but I think you are greatly underestimating how popular the format is if you really think there won't be programs around to read it. It's not an obscure format, and there are widely used open source libraries to do so.
From my own playing around figuring how I wanted to archive scanned documents, it's both smaller and much quicker to load.
But going out on a limb with a format doesn't instill confidence for long term. It's great there's one "perfect" implementation that seems to just work, but perceived lack of an ecosystem is still worrying. Not that looking through the source of say libjpeg instills confidence either! I was also considering JP2 for nicer-looking artifacts, but couldn't bring myself to go that route.
I eventually just increased my size budget (because disappearing a ream into 5GB is still damn useful!), and decided to just store lossless FLIF at 300dpi. I statically compiled the flif binary, do a test decode to make sure the bits round trip, and store checksums of the decoder binary and raw raster alongside.
I then stuff this into a zip container (aka .cbz), along with some really poor jpg thumbnails so that present-day evince can view it. I still need to write the transcoder to .djvu - when I'm no longer buried under piles of paper!
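The workflow above (test decode, checksums, zip container) can be sketched roughly as follows. This is an illustration, not the poster's actual script: zlib stands in for the FLIF encoder so the example runs with the standard library alone, and the file names, the `SHA256SUMS` manifest and the placeholder decoder bytes are all assumptions.

```python
import hashlib
import io
import zipfile
import zlib


def sha256_hex(data: bytes) -> str:
    """Hex SHA-256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


def archive_page(raster: bytes, decoder_binary: bytes) -> bytes:
    """Return the bytes of a zip (.cbz-style) archive holding one page."""
    encoded = zlib.compress(raster, 9)  # stand-in for the lossless encoder
    # Test decode: refuse to archive anything that does not round-trip bit for bit.
    if zlib.decompress(encoded) != raster:
        raise ValueError("lossless round trip failed")
    # Checksums of the raw raster and of the decoder, stored alongside the page.
    manifest = (
        f"{sha256_hex(raster)}  page0001.raw\n"
        f"{sha256_hex(decoder_binary)}  decoder\n"
    )
    buf = io.BytesIO()
    # ZIP_STORED: the payload is already compressed, so the zip is just a container.
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_STORED) as zf:
        zf.writestr("page0001.z", encoded)
        zf.writestr("SHA256SUMS", manifest)
    return buf.getvalue()


raster = bytes(range(256)) * 64  # fake raster data for the example
decoder = b"placeholder for the statically compiled decoder"
cbz = archive_page(raster, decoder)
print(len(raster), "raw bytes ->", len(cbz), "archived bytes")
```

Storing the checksum of the raw raster rather than of the encoded file means the check still works after a later transcode to some other lossless format.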
I've been scanning a lot of books and switched over a while ago - you get books which can be 10x smaller, and which load and display faster too. And if you desperately need a PDF for some reason, they can be converted back easily.
i don't know about djvu format, but being in the process of writing a pdf parser atm i can tell you there's definitely some room for a simpler file format than pdf. The size of the specification is daunting, and i still encounter weird behaviors for some documents on some major platform's reader ( and now that i've read part of the spec i definitely understand why).
Something a little more up to date and more focused on today's technologies would be really nice.
Interesting, are you starting from scratch or building on poppler or something like it? I am writing a PDF parser too and am building on top of poppler to do the heavy lifting.
I’m building something on ios to extract text only from pdf along with some basic rendering features ( text position and font info, something that big commercial libraries don’t provide curiously ). So i’m using apple core foundation lib to do the basic parsing and use a project called pdfkitten as a starting point for font to unicode decoding and text positioning. I also plan to add obvious heuristics for « words » building later on as well as a global index for all the rendered blocks ( all done in swift). Images, forms, etc, are completely out of the scope.
It’s a very specific and ( hopefully) narrow need and is the only reason i’m starting this project in the first place.
This is a little off-topic but you seem like a good person to ask:
Do you know a way to crop a PDF? Basically, if I have a PDF of a single letter-size page with a 6x9 label on it, and I'd like to crop it down to that 6x9 label only, is there a way to do this? I've been looking into this with, for instance, the poppler library, and haven't found anything.
i suppose you mean you want to do that with code, and not using a tool ( because my first search would be for command line tool or automation software using a combination of pdf to jpg then crop then export in pdf).
You could parse the pdf for that bit of text then simply recreate a new one with the correct size. But it depends a lot on how that label is printed ( using one block for the whole label or one per letter). Basically it's a lot of work or not depending on the number of cases you need to deal with.
I know Python has a PDFMiner library, you may want to try that
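For the cropping question, one approach that avoids re-rendering is to set the page's /CropBox, which tells viewers which rectangle of the page to show. PDF page boxes are expressed in points (1/72 inch) measured from the lower-left corner. The sketch below only does the arithmetic; the label's position on the letter page (1 inch in from the left and bottom) is an assumption, and writing the box into the file is left to whatever library you use (pypdf, for example, exposes the page boxes).

```python
POINTS_PER_INCH = 72  # default PDF user-space unit is 1/72 inch


def crop_box(x_in, y_in, width_in, height_in, page_w_in=8.5, page_h_in=11.0):
    """Return [llx, lly, urx, ury] in points for a region given in inches.

    x_in / y_in are the offsets of the region's lower-left corner from the
    page's lower-left corner. The default page size is US letter.
    """
    if x_in < 0 or y_in < 0:
        raise ValueError("offsets must be non-negative")
    if x_in + width_in > page_w_in or y_in + height_in > page_h_in:
        raise ValueError("region does not fit on the page")
    return [
        x_in * POINTS_PER_INCH,
        y_in * POINTS_PER_INCH,
        (x_in + width_in) * POINTS_PER_INCH,
        (y_in + height_in) * POINTS_PER_INCH,
    ]


def format_crop_box(box):
    """Render the box the way it appears in a PDF page dictionary."""
    return "/CropBox [" + " ".join(f"{v:g}" for v in box) + "]"


# Assumed placement: a 6x9 inch label, 1 inch from the left and bottom edges.
box = crop_box(1.0, 1.0, 6.0, 9.0)
print(format_crop_box(box))
```

Note that a crop box only hides the rest of the page; the content outside it is still in the file. If it has to be truly gone, the page needs to be rebuilt as described above.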
I'd avoid using this. This is a format that's lossy re: text, so text and numbers can get silently rewritten; it reuses a letter image, placing it throughout a page wherever it's recognized. A low resolution scan or bad ocr results in silently corrupted text.
I've been reading quite a few DjVu encoded documents and have yet to see a single occurrence of what you describe. If there are such errors they will be visually indistinguishable from the original, and a good OCR package would likely identify the right letter given the context.
Do you have some concrete examples of corrupted texts? I'd very much like them to use as test cases in a document processing pipeline.
I don't have an easy way to find all of the errors I observed but typical ones are on page 58 at the end of the third line, 'Germanic' with a little bit of bleed between the serifs at the bottom of 'n' gets transformed into 'Germauic', and on page 420 at the end of line 7, 'compass' with a small dot in the middle of the 'o' becomes 'cempass'.
FWIW I've transcribed a bunch of other DjVu files produced by the Internet Archive and not noticed anything like this.
Let me assure you, this does exist in the wild. I spotted it many times, especially in documents scanned 15+ years ago, when space and bandwidth were an issue and people did anything to reduce file size, at the cost of quality. It was rare enough to not make the format unusable, but not just once, making you worry "was it 3 sp. or 8 sp.?"
It also seems that the problem happens more often with Cyrillic glyphs, which have fewer descending/ascending elements than typical Latin ones.
While of course it's an issue with any scanned docs, the DjVu compression method makes it harsher, as, say, you can end up with _all_ '8's in a document replaced with '3's (or, say, 'h's with 'n's), while keeping a seemingly decent appearance so that you don't suspect that something is wrong.
I totally understand how this could theoretically happen, but it all hinges on whether it actually happens in the documents I'm working with. A test case where it happens for sure would make all the difference in determining whether my documents are at risk and, if so, whether I can detect which ones are at risk (that would be half the battle won).
It would have to be a case where a human would see a 3 or an 8 (or a similar transposition with other glyphs) resulting in a document that is corrupted in such a way that afterwards the human would see the (invalid) alternative.
Even a single instance of this happening would be very relevant.
Well I know that there are examples in some of 8k+ djvu files on my hard disk (mostly old textbooks), as I learned about this problem many years ago from live experience. But I couldn't invent a method to find an example deliberately, so you can only take my word that they exist.
To clarify, it's not about OCR errors, where advanced OCR engine could use dictionary, language autorecognition etc.
It's about djvu-compressed text scans without an OCR layer. The compression method relies only on glyph similarity. The compression algorithm could not only mistake a '3' for an '8' with small gaps left, it would replace this dirty '3' with the reference image of the '8', so that the human reader thinks they are seeing a scan, i.e. an image of the scanned page, while in fact they are seeing an 'edited' image that does not correspond to the actual page.
I take your word, there is a technical reason behind this, not that I don't believe you.
> To clarify, it's not about OCR errors, where advanced OCR engine could use dictionary, language autorecognition etc. It's about djvu-compressed text scans without OCR layer. The compression method relies only on glyphs similarity.
Yes, I totally get that. It's about DjVu's compressor replacing the image of one character with the image of another.
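The mechanism being discussed can be shown with a toy sketch. This is not DjVu's JB2 coder or JBIG2, just a deliberately naive symbol-dictionary encoder that reuses a stored glyph whenever a new glyph is within a pixel-difference threshold. The 5x7 bitmaps, the threshold of 2 and the first-match rule are all assumptions chosen to make the failure easy to see.

```python
EIGHT = (".###.", "#...#", "#...#", ".###.", "#...#", "#...#", ".###.")
THREE = (".###.", "....#", "....#", ".###.", "....#", "....#", ".###.")
# A '3' with two stray ink pixels on the left edge (scan noise or bleed).
SMUDGED_THREE = (".###.", "#...#", "....#", ".###.", "#...#", "....#", ".###.")


def hamming(a, b):
    """Number of pixels that differ between two same-sized glyph bitmaps."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def encode(glyphs, max_distance):
    """Naive pattern matching: reuse the first dictionary symbol that is close enough."""
    dictionary, indices = [], []
    for glyph in glyphs:
        for i, symbol in enumerate(dictionary):
            if hamming(glyph, symbol) <= max_distance:
                indices.append(i)  # glyph is replaced by the stored symbol
                break
        else:
            dictionary.append(glyph)  # no match: store it as a new symbol
            indices.append(len(dictionary) - 1)
    return dictionary, indices


def decode(dictionary, indices):
    """Rebuild the page by stamping dictionary symbols at each position."""
    return [dictionary[i] for i in indices]


page = [EIGHT, SMUDGED_THREE, THREE]
lossy = decode(*encode(page, max_distance=2))
exact = decode(*encode(page, max_distance=0))

for row in zip(*lossy):
    print("  ".join(row))
```

With the threshold at 2, the smudged '3' comes back as a crisp, clean '8': the output looks like a perfectly good scan and is wrong. With the threshold at 0 every glyph is stored exactly and nothing is substituted, at the cost of a bigger dictionary.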
IIRC Xerox revoked lots of their copiers which used the same compression algorithm internally after this was demonstrated by a document with lots of numbers. Google it, it's not that hard to find (I'm on a phone now so it's not very convenient for me atm)
reaperducer's point was that jacquesm almost certainly hasn't done so, and so can't be entirely confident of not having been the victim of corrupt text.
In my case I don't actually have the originals so it would be hard to impossible to verify on this dataset but if there are concrete examples of such errors then I could use that to identify documents that are more at risk of such errors.
I feel like DjVu was never sufficiently better than PDF for me to bother using it. There are PDF readers for almost literally any device imaginable, and often they are included with your choice of operating system. I feel like DjVu would have more of an audience if it weren't for the fact that PDF is an open standard now.
10 years ago, my main computer was a pentium iii I had cobbled together from parts at a thrift store. Djvu was the only format for high resolution scans that would load without rendering the computer unusable.
"...divides a single image into many different images, then compresses them separately." Wouldn't it be better to go the other way and backpack-algo the pages into a consolidated texture atlas based on edge similarity, and then run compression on the result?
From what I understand, it segments the image into text and a background image, does something like JBIG2 on the text (building a global dictionary of glyphs and placing them on the page), and does a lossy compression on the background image.
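A rough sketch of the layering described above, on a tiny grayscale "page". This is only a toy under stated assumptions: a fixed brightness threshold stands in for the real segmenter, and plain 2x subsampling stands in for the lossy background coder (actual DjVu uses a far more sophisticated separation and a wavelet coder for the background). The pixel values are made up.

```python
def segment(page, threshold=100):
    """Split a grayscale page into a 1-bit text mask and a background layer."""
    mask = [[1 if px < threshold else 0 for px in row] for row in page]
    paper = [px for row, mrow in zip(page, mask) for px, m in zip(row, mrow) if not m]
    fill = sum(paper) // len(paper) if paper else 255
    # Text pixels in the background are filled with the average paper colour,
    # so the background stays smooth and survives heavy lossy compression.
    background = [
        [fill if m else px for px, m in zip(row, mrow)]
        for row, mrow in zip(page, mask)
    ]
    return mask, background


def downsample(img, factor=2):
    """Crude lossy step: keep every factor-th pixel in both directions."""
    return [row[::factor] for row in img[::factor]]


def reconstruct(mask, small_bg, factor=2, ink=0):
    """Upsample the background and paint the full-resolution mask over it."""
    return [
        [ink if m else small_bg[y // factor][x // factor] for x, m in enumerate(mrow)]
        for y, mrow in enumerate(mask)
    ]


# Yellowed paper (~200) with a few dark ink pixels (~25).
page = [
    [200, 205, 210, 205],
    [200, 20, 30, 205],
    [205, 25, 210, 200],
    [210, 205, 200, 200],
]
mask, background = segment(page)
small = downsample(background)
restored = reconstruct(mask, small)
for row in restored:
    print(row)
```

The point of the split is that the text mask stays at full resolution while the background can be stored coarsely, and a viewer can also choose to draw the mask alone on a plain white page.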
There aren't a lot of viewers out there for DjVu and the encoding side is patent encumbered, so I'm not interested in the format.
You can get pretty close with JBIG2+Jpeg2k in a PDF file, I believe archive.org does this, but I don't know of an open source encoder that does it and sometimes PDF viewers don't decode jbig2/jpg2k efficiently.
Unfortunately some library sites still rely on a DjVu plugin, and with Firefox dropping support for native plugins, those materials became inaccessible. Especially because a single download is dynamic and generated by the plugin itself from multiple smaller files.
Other than that, DjVu is a great format.
For those who wonder how it's better than PDF for scanned texts - DjVu uses different compression for background and actual text, thus saving tons of space.
PDF is great for electronically created documents. DjVu is far superior for scanned documents, though. It compresses much smaller and DjVu viewers are far more performant at scrolling through large scanned documents.
I still prefer Adobe Acrobat Reader despite everything, and absolutely hate PDF.js because it seems to take a few seconds to render a page when jumping around, whereas Adobe Acrobat Reader is pretty much interactive. Never mind the fact that pdf.js still has problems rendering quite a few documents I encounter, while Acrobat is basically the de-facto reference implementation for PDFs.
They also do DjVu, so you can see the size difference there. For that particular book, it's 58M for the PDF vs 31M for DjVu. So basically half the size.
In what circumstances does it actually matter to be able to get better compression than CCITT Group 4? For example, when is size such a deciding factor that it makes sense to forgo the compatibility of CCITT Group 4 in PDF and to use something else like JBIG2 in PDF or DjVu [assuming the latter actually is smaller; I didn't measure]?
It's not a strength if you need that old book that has no electronic version and can only be scanned. AFAIK EPUB 3 allows for SVG layouts that can be, I imagine, very sophisticated; still, it doesn't address the old book case.