CSS selectors for PDF elements?

Is there something like this?

  • What problem are you trying to solve through this approach?

    PDF has no concept of element selectors, and I haven't come across any ready-made implementation of one.

    However, some things in a structured PDF (one generated by software, as opposed to a PDF of a scanned page) can be searched. For example, if you want all text objects, you can look for sequences like "BT..Tf..Tj..ET", which are the operators for begin text, set text font, show text, and end text.
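
    As a rough illustration of that kind of search, here is a minimal Python sketch. It assumes the content stream has already been decompressed and uses only simple literal strings; SAMPLE_STREAM and find_text_objects are made-up names for this example, and a regex will misfire if "ET" happens to appear inside a string.

```python
import re

# Hypothetical, already-decompressed content stream (real streams are
# usually Flate-compressed and must be decoded first).
SAMPLE_STREAM = b"""
BT
/F1 12 Tf
72 720 Td
(Hello) Tj
ET
0 0 m 100 0 l S
BT
/F2 9 Tf
72 700 Td
(World) Tj
ET
"""

TEXT_OBJECT = re.compile(rb"\bBT\b(.*?)\bET\b", re.DOTALL)  # BT ... ET
FONT = re.compile(rb"/(\S+)\s+([\d.]+)\s+Tf")               # /Name size Tf
SHOW = re.compile(rb"\((.*?)(?<!\\)\)\s*Tj", re.DOTALL)     # (string) Tj


def find_text_objects(stream: bytes):
    """Return font name, font size, and shown strings for each BT..ET block."""
    results = []
    for block in TEXT_OBJECT.finditer(stream):
        body = block.group(1)
        font = FONT.search(body)
        results.append({
            "font": font.group(1).decode("ascii") if font else None,
            "size": float(font.group(2)) if font else None,
            "strings": [s.decode("latin-1") for s in SHOW.findall(body)],
        })
    return results


for text_object in find_text_objects(SAMPLE_STREAM):
    print(text_object)
```

    This only covers the Tj operator; text shown with other operators (TJ arrays, for instance) would need extra patterns.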

    But not all objects are easily queryable. For example, I've seen tables that give no indication that they are tables. Instead, in the PDF, they are encoded as a series of move and stroke operators interleaved with text operators. You'd have to know that such a pattern represents a table and find the table by searching for that pattern.
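
    To give an idea of what searching for such a pattern could look like, here is a hedged Python sketch. It assumes the table rules are drawn as individual "x y m x y l S" segments in a decompressed stream, which is only one of many ways a producer might draw them; the names GRID, ruled_lines, and looks_like_table are invented for this example.

```python
import re

NUM = r"(-?\d+(?:\.\d+)?)"
# One straight segment: "x0 y0 m x1 y1 l S" (move, line, stroke).
SEGMENT = re.compile(rf"{NUM}\s+{NUM}\s+m\s+{NUM}\s+{NUM}\s+l\s+S\b")

# Hypothetical stream: a 2x2 grid drawn with three horizontal and three
# vertical rules, followed by one text object.
GRID = """
0 0 m 200 0 l S
0 20 m 200 20 l S
0 40 m 200 40 l S
0 0 m 0 40 l S
100 0 m 100 40 l S
200 0 m 200 40 l S
BT /F1 10 Tf 5 25 Td (Name) Tj ET
"""


def ruled_lines(stream: str):
    """Collect the distinct y positions of horizontal rules and x positions of vertical rules."""
    ys, xs = set(), set()
    for match in SEGMENT.finditer(stream):
        x0, y0, x1, y1 = map(float, match.groups())
        if y0 == y1:
            ys.add(y0)
        elif x0 == x1:
            xs.add(x0)
    return ys, xs


def looks_like_table(stream: str, min_rows: int = 2, min_cols: int = 2) -> bool:
    """Heuristic: R rows need R+1 horizontal rules, C columns need C+1 vertical rules."""
    ys, xs = ruled_lines(stream)
    return len(ys) >= min_rows + 1 and len(xs) >= min_cols + 1


print(looks_like_table(GRID))
```

    A heuristic like this is tied to one producer's drawing habits: rules drawn with the re (rectangle) operator, or tables with no rules at all, would slip past it.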

    I don't know what you are trying to solve, but assuming selectors are the right solution, the approach I'd choose is:

    1. Use a good framework like PDFBox and its COSObject object model to reverse engineer simple PDFs while keeping the PDF spec [1] close at hand. This way you can understand what the operators and patterns are.

    2. Use a framework like JXPath to build an arbitrary XPath-like query interface over PDFBox's object model.

    It's easier if all the PDFs are produced by the same program, and far harder if you want to process any PDF in the wild.
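
    The query-layer idea in step 2 can be sketched in a few lines. This is not PDFBox or JXPath: it is a Python stand-in that treats the document as nested dicts and lists, and DOC and select are hypothetical names used only for illustration.

```python
# Hypothetical stand-in for a parsed PDF object tree. A real one would come
# from a library's object model rather than hand-written dicts.
DOC = {
    "Root": {
        "Type": "Catalog",
        "Pages": {
            "Type": "Pages",
            "Count": 2,
            "Kids": [
                {"Type": "Page", "MediaBox": [0, 0, 612, 792],
                 "Contents": "BT (Page one) Tj ET"},
                {"Type": "Page", "MediaBox": [0, 0, 612, 792],
                 "Contents": "BT (Page two) Tj ET"},
            ],
        },
    }
}


def select(node, path):
    """Follow a slash-separated path; '*' matches every dict value or list item."""
    current = [node]
    for step in path.strip("/").split("/"):
        matched = []
        for item in current:
            if isinstance(item, dict):
                if step == "*":
                    matched.extend(item.values())
                elif step in item:
                    matched.append(item[step])
            elif isinstance(item, list):
                if step == "*":
                    matched.extend(item)
                elif step.isdigit() and int(step) < len(item):
                    matched.append(item[int(step)])
        current = matched
    return current


print(select(DOC, "/Root/Pages/Kids/*/Contents"))
```

    Note that this queries the document's object tree, not its visual layout, so it finds pages and streams but still not "tables" or "headings".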

    Alternately, perhaps you can convert PDF to HTML, and then run the selector on that HTML.
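
    If you go the HTML route, the selector side is the easy part. Below is a minimal Python sketch using only the standard library; the HTML is a made-up example of what a converter might emit (real converter output varies a lot), and ClassSelector and select_text are hypothetical names. It handles only a "tag.class" style selector.

```python
from html.parser import HTMLParser

# Hypothetical converter output: absolutely positioned paragraphs with classes.
HTML = """
<div class="page">
  <p class="heading" style="position:absolute;top:72px;left:72px">Invoice</p>
  <p class="body" style="position:absolute;top:100px;left:72px">Total: 42</p>
  <p class="heading" style="position:absolute;top:140px;left:72px">Terms</p>
</div>
"""


class ClassSelector(HTMLParser):
    """Collect the text of every <tag> element carrying the given class."""

    def __init__(self, tag, css_class):
        super().__init__()
        self.tag = tag
        self.css_class = css_class
        self.depth = 0      # > 0 while inside a matching element
        self.matches = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.depth:
            # Track nested elements of the same tag so we close at the right one.
            self.depth += tag == self.tag
        elif tag == self.tag and self.css_class in classes:
            self.depth = 1
            self.matches.append("")

    def handle_endtag(self, tag):
        if self.depth and tag == self.tag:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.matches[-1] += data


def select_text(html, tag, css_class):
    parser = ClassSelector(tag, css_class)
    parser.feed(html)
    parser.close()
    return [text.strip() for text in parser.matches]


print(select_text(HTML, "p", "heading"))
```

    Whether the converted HTML has useful classes or structure at all depends entirely on the converter and the source PDF.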

    [1]: https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PD...

  • It seems unlikely. PDF files are generated by a huge number of programs, each doing it differently. There's also almost no semantic information in the PDF format.

    Every time I have to extract information from PDFs that isn't simply text or pixel graphics, I'm basically starting from scratch.

  • The source code of a PDF looks very different from its rendered output (there is no parent/child/sibling correlation between elements). Most of it is literally absolute coordinates. You cannot have CSS-like selectors on that.