john pfeiffer

Pdf from command line pdftohtml text merge html docs

sudo apt-get install pdftohtml

pdftohtml filename.pdf

sudo apt-get install lynx OR sudo apt-get install elinks

lynx filename.html OR elinks filename.html

OR for just text

pdftotext filename.pdf

less filename.txt
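The "lynx OR elinks OR just text" choice above can be scripted. This is only a rough sketch: first_available and view_pdf are made-up helper names, and it assumes pdftohtml writes filename.html next to filename.pdf (as in the commands above). The "-" argument tells pdftotext to write to stdout instead of a .txt file.

```shell
# first_available (hypothetical helper): print the first command
# from the argument list that is actually installed.
first_available() {
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            printf '%s\n' "$tool"
            return 0
        fi
    done
    return 1
}

# view_pdf (hypothetical helper): convert to html and open it in
# lynx or elinks; if neither is installed, page the plain text.
view_pdf() {
    pdf=$1
    base=${pdf%.pdf}    # assumes the name ends in .pdf
    if browser=$(first_available lynx elinks); then
        pdftohtml "$pdf" && "$browser" "$base.html"
    else
        pdftotext "$pdf" - | less
    fi
}
```

Usage would be: view_pdf filename.pdf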


With an open source pdf reader and its utilities we can easily move from pdf to html (e.g. when you have a pdf that won't reflow properly on your handheld device)

sudo apt-get install xpdf

this should install not only the xpdf reader but also the xpdf-utils

xpdf filename.pdf //find the first and last page you want by browsing the document

pdftohtml -c -f 157 -l 299 -nodrm filename.pdf output.html

-c = complex = export images to image files
-f = first page
-l = last page
-nodrm = remove any digital rights management stuffs

note the output file name is optional - by default it will output the source filename-page#.html
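If you export page ranges often, the command above can be wrapped in a small function that refuses an obviously wrong range before pdftohtml runs. A minimal sketch: is_number, check_range and export_pages are hypothetical names, and the flags are exactly the ones listed above.

```shell
# is_number (hypothetical helper): succeed only for a non-empty
# string made of digits.
is_number() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;
        *) return 0 ;;
    esac
}

# check_range (hypothetical helper): FIRST and LAST must be numbers,
# FIRST at least 1 and not greater than LAST.
check_range() {
    is_number "$1" && is_number "$2" && [ "$1" -ge 1 ] && [ "$1" -le "$2" ]
}

# export_pages (hypothetical helper)
# usage: export_pages FIRST LAST INPUT.pdf [OUTPUT.html]
export_pages() {
    if ! check_range "$1" "$2"; then
        echo "bad page range: $1-$2" >&2
        return 1
    fi
    first=$1
    last=$2
    shift 2
    # "$@" now holds INPUT.pdf and the optional OUTPUT.html
    pdftohtml -c -f "$first" -l "$last" -nodrm "$@"
}
```

Usage would be: export_pages 157 299 filename.pdf output.html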

"pdftotext" is a similar tool

"pdftohtml -h" shows the help

UNFORTUNATELY it makes each page a single html document... (about the same size as the original pdf)

There's a nice "index" feature, filename_ind.html, and an "outline" (like a table of contents)

So, manually merging them from the bash command line isn't fun and makes a big file but it works...

cd /path/to/output/html/and/images
mkdir merge
cat output_filename-*.html >> output_filename_merged.html
mv output_filename_merged.html merge
mv *.png merge

now your "merge" directory is a self contained single file of the pdf

//unfortunately the above seems to run into some funny html formatting problems (weird large text)
//i've found it was due to the last exported page... so by rm the last page number and then running the
//above command it works very well...
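One more thing to watch with a plain cat over a wildcard: the shell sorts the matches as text, so page 10 lands before page 2. Here is a sketch that merges in numeric page order and leaves out the last exported page (the one that caused the weird large text for me). merge_pages is a made-up name, and it assumes the page files are named PREFIX-NUMBER.html as described above, with no newlines in the file names.

```shell
# merge_pages (hypothetical helper)
# usage: merge_pages PREFIX MERGED.html
merge_pages() {
    prefix=$1
    merged=$2
    : > "$merged"                        # start with an empty output file
    for f in "$prefix"-*.html; do
        [ -e "$f" ] || continue          # the wildcard matched nothing
        n=${f##*-}                       # e.g. "157.html"
        n=${n%.html}                     # e.g. "157"
        case "$n" in
            ''|*[!0-9]*) continue ;;     # skip files that are not numbered pages
        esac
        printf '%s\t%s\n' "$n" "$f"      # page number, tab, file name
    done | sort -n | sed '$d' | cut -f2- |
    while IFS= read -r f; do
        cat "$f" >> "$merged"
    done
}
```

Here sort -n puts the pages in numeric order, sed '$d' drops the highest page number, and cut -f2- keeps just the file name. Usage would be: merge_pages output_filename output_filename_merged.html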

Another option is wget, a gnu utility that is especially made for "downloading" htmls and doing stuff with them...

//wget recursive only 1 level
wget -r -l1 -k -O output_merged.html http://localhost/output_filename_ind.html

The above won't work unless you have a web server installed OR you could upload all of them to a website (easy way to share a huge pdf for someone to read remotely)...
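If you don't want to install a full web server just for this, Python's built-in one is probably enough. A rough sketch under a few assumptions: python3 (3.7 or newer, for --directory) and wget are installed, port 8000 is free, and index_url and serve_and_fetch are hypothetical helper names. Note the URL gets a port number, unlike the plain http://localhost/ one above.

```shell
# index_url (hypothetical helper): build the localhost URL of the
# _ind.html index page that pdftohtml wrote.
# usage: index_url PREFIX [PORT]
index_url() {
    printf 'http://localhost:%s/%s_ind.html\n' "${2:-8000}" "$1"
}

# serve_and_fetch (hypothetical helper): serve DIR with Python's
# built-in web server, run the same wget command, then stop the server.
# usage: serve_and_fetch DIR PREFIX [PORT]
serve_and_fetch() {
    dir=$1
    prefix=$2
    port=${3:-8000}
    python3 -m http.server "$port" --directory "$dir" >/dev/null 2>&1 &
    server_pid=$!
    sleep 1    # crude wait for the server to start listening
    wget -r -l1 -k -O output_merged.html "$(index_url "$prefix" "$port")"
    status=$?
    kill "$server_pid"
    return "$status"
}
```

Usage would be: serve_and_fetch /path/to/output/html/and/images output_filename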

Finally, yet another alternative is to download the windows pdf2html gui (which includes the pdftohtml.exe)
http://www.divshare.com/download/4115853-20e

You'll also need WINE as this is a windows app...

The whole pdf2html directory is "portable" BUT you need ghostscript (gswin32c.exe)
http://pages.cs.wisc.edu/~ghost/

(in my case I've a dual boot, so after mounting the windows partition I could copy ghostscript over)

ntfs-3g /dev/hda1 /mnt/windows
cp -a /mnt/windows/Program\ Files/gs/gs8.63 /home/username/pdf2html

wine /home/username/pdf2html/pdf2htmlgui.exe

The above will prompt you for the location of the (windows) pdftohtml.exe (which should be in the .39 subdir) AND the gswin32c.exe (in pdf2html/gs8.63/bin subdir)

I chose "complex" which means images too!

Note that after you run it, your wine CLI will display a bunch of output (e.g. page1 page2 etc.) and will finally seem to hang - it's still running but it's processing the images... wait until the pdf2htmlgui reappears!

The above will produce exactly the same output as the linux xpdf util "pdftohtml -c"...



Published

Sep 19, 2012

Category

linux
