[Dcmlib] [Fwd: pdf parser for generating XML like document]

Tue Oct 25 21:04:58 CEST 2005

No, PDF has no concept of tables, as such.  It's just commands to select
fonts and draw text, and some other commands to draw horizontal lines,
etc.

I don't know of any easy way to convert PDF to XML for the sort of
application you're working on, sorry.

- Derek

-------- Original Message --------
Subject: pdf parser for generating XML like document
Date: Sun, 23 Oct 2005 17:31:56 -0400

Hello,

	I did search for a mailing list on the following web site:
http://www.foolabs.com/xpdf/

	and since I could not find one, I am writing to you directly.

	I have the following problem. DICOM is a file format that is specified
by NEMA at:

http://medical.nema.org/dicom/2004.html

	In particular, consider this document (1):
http://medical.nema.org/dicom/2004/04_06PU.PDF

  The spec is huge. Therefore I am using pdftotext plus a Python script to
generate a custom output. You can find everything here:

The Python script
(it basically takes as input the output of `pdftotext -raw -nopgbrk`):
http://cvs.creatis.insa-lyon.fr/viewcvs/viewcvs.cgi/gdcm/Dicts/ParseDict.py

And here is the cleaned-up output (Python script plus hand editing):
http://cvs.creatis.insa-lyon.fr/viewcvs/viewcvs.cgi/gdcm/Dicts/dicomV3.dic
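
For illustration, the kind of line-oriented parsing such a script performs
can be sketched as follows. This is not the actual ParseDict.py: the names
ENTRY_RE and parse_dictionary, the regular expression, and the assumed input
layout (one "(gggg,eeee) Name VR VM" entry per line) are all hypothetical.
Real pdftotext output often splits an entry across several lines, which is
where the hand editing comes in.

```python
import re

# Assumed layout of one dictionary entry on a single line (hypothetical):
#   (0008,0018) SOP Instance UID UI 1
# i.e. tag, free-text name, two-letter VR, then VM such as 1, 1-n, 2-2n.
ENTRY_RE = re.compile(
    r"^\((?P<group>[0-9A-Fa-fx]{4}),(?P<element>[0-9A-Fa-fx]{4})\)\s+"
    r"(?P<name>.+?)\s+"
    r"(?P<vr>[A-Z]{2})\s+"
    r"(?P<vm>\d+(?:-(?:\d*n|\d+))?)\s*$"
)


def parse_dictionary(text):
    """Return (group, element, vr, vm, name) tuples for matching lines.

    Lines that do not look like a complete entry (page headers, wrapped
    names, footnotes) are skipped silently and still need manual review.
    """
    entries = []
    for line in text.splitlines():
        match = ENTRY_RE.match(line.strip())
        if match:
            entries.append(
                (
                    match.group("group"),
                    match.group("element"),
                    match.group("vr"),
                    match.group("vm"),
                    match.group("name"),
                )
            )
    return entries


if __name__ == "__main__":
    sample = (
        "Some page header\n"
        "(0008,0018) SOP Instance UID UI 1\n"
        "(0020,0032) Image Position (Patient) DS 3\n"
    )
    for entry in parse_dictionary(sample):
        print(entry)
```

Because PDF text extraction gives no table boundaries, a pattern like this
can only recognize rows that happen to survive on one line; anything else
falls through and has to be fixed by hand.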

This is very difficult to maintain, as a new spec is released every year.

	Therefore I was wondering if you could give me some advice on how to
parse the PDF document (1). Is there some table start/end marker in the
PDF file that I can use? Is there any API in the PDF library that would
allow me to generate an XML-like description of the PDF in a neutral way?

Thanks so much for your time,
Mathieu
PS: If such a mailing list exists, forgive me, and please point me to it so
that I can ask this question there.