Inside an EPUB File: Structure, Files, and How It All Works
An EPUB is not a mystery format — it's a ZIP archive containing a specific set of files. Once you understand the structure, you can fix broken EPUBs, build converters, or create files from scratch. Here's what's inside every EPUB.
The EPUB ZIP Structure
mybook.epub (ZIP archive)
├── mimetype                 ← must be first, uncompressed
├── META-INF/
│   └── container.xml        ← points to the OPF file
└── OEBPS/                   (or any folder name)
    ├── content.opf          ← package document (manifest + spine)
    ├── toc.ncx              ← EPUB 2 navigation (NCX)
    ├── nav.xhtml            ← EPUB 3 navigation (NAV)
    ├── chapter01.xhtml      ← content files
    ├── chapter02.xhtml
    ├── css/
    │   └── styles.css
    └── images/
        ├── cover.jpg
        └── figure1.png
The mimetype File
The first file in the ZIP must be named mimetype, stored without compression, and contain exactly:
application/epub+zip
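No trailing newline, no BOM, no surrounding spaces. Because the entry comes first and is stored uncompressed, its contents sit as plain bytes near the start of the file, so e-readers and validators can identify an EPUB without parsing the full archive.
Creating an EPUB with Python: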
import zipfile

# Assumes the source files exist at these paths relative to the current directory
with zipfile.ZipFile('book.epub', 'w') as z:
    # mimetype MUST be first and uncompressed
    z.writestr(zipfile.ZipInfo('mimetype'), 'application/epub+zip',
               compress_type=zipfile.ZIP_STORED)
    # All other files can be compressed
    z.write('META-INF/container.xml', compress_type=zipfile.ZIP_DEFLATED)
    z.write('OEBPS/content.opf', compress_type=zipfile.ZIP_DEFLATED)
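To confirm that an existing file follows the mimetype rules, a small check like the one below can help. It is a minimal sketch, not a replacement for a full validator; the function name check_mimetype and the wording of the messages are illustrative choices, not part of any standard tool.

```python
import zipfile

EPUB_MIMETYPE = b"application/epub+zip"

def check_mimetype(epub):
    """Return a list of problems with the mimetype entry (empty if none).

    `epub` may be a path or a binary file-like object.
    """
    problems = []
    with zipfile.ZipFile(epub) as z:
        entries = z.infolist()  # order as recorded in the archive's central directory
        if not entries or entries[0].filename != "mimetype":
            problems.append("mimetype is not the first entry in the archive")
            return problems
        first = entries[0]
        if first.compress_type != zipfile.ZIP_STORED:
            problems.append("mimetype is compressed")
        # Compare raw bytes so a trailing newline or BOM is caught
        if z.read(first) != EPUB_MIMETYPE:
            problems.append("mimetype content is not exactly 'application/epub+zip'")
    return problems
```

Calling check_mimetype('book.epub') on a file built as shown above should return an empty list. The entry order comes from the central directory, which normally matches the physical order for archives written sequentially, so treat this as a first pass.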
META-INF/container.xml
This file tells the reading system where to find the OPF package document:
<?xml version="1.0" encoding="UTF-8"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="OEBPS/content.opf"
              media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>
The full-path is relative to the root of the ZIP. The OEBPS folder name is conventional but not required — you can use any folder name or put the OPF in the root.
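Because the OPF can live anywhere, code that reads EPUBs should look it up through container.xml rather than hard-coding OEBPS/content.opf. The following is a minimal sketch using only the standard library; find_opf_path is a hypothetical helper name, and it returns only the first rootfile it finds.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by META-INF/container.xml
CONTAINER_NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}

def find_opf_path(epub):
    """Return the full-path of the first <rootfile> in container.xml.

    `epub` may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(epub) as z:
        root = ET.fromstring(z.read("META-INF/container.xml"))
    rootfile = root.find("c:rootfiles/c:rootfile", CONTAINER_NS)
    if rootfile is None:
        raise ValueError("container.xml has no <rootfile> element")
    return rootfile.get("full-path")
```

For the container.xml shown above, find_opf_path('book.epub') would return 'OEBPS/content.opf'.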
The OPF Package Document (content.opf)
The OPF file is the heart of an EPUB. It has four sections:
- <metadata> — Dublin Core metadata (title, author, language, identifier)
- <manifest> — lists every file in the publication with its id, href, and media-type
- <spine> — defines the reading order by referencing manifest item ids
- <guide> — EPUB 2 landmark references (optional, replaced by NAV landmarks in EPUB 3)
<manifest>
  <item id="ch1" href="chapter01.xhtml" media-type="application/xhtml+xml"/>
  <item id="ch2" href="chapter02.xhtml" media-type="application/xhtml+xml"/>
  <item id="css" href="css/styles.css" media-type="text/css"/>
  <item id="cover-img" href="images/cover.jpg" media-type="image/jpeg"
        properties="cover-image"/>
  <item id="nav" href="nav.xhtml" media-type="application/xhtml+xml"
        properties="nav"/>
  <!-- The spine's toc attribute must reference this item's id -->
  <item id="ncx" href="toc.ncx" media-type="application/x-dtbncx+xml"/>
</manifest>
<spine toc="ncx">
  <itemref idref="nav" linear="no"/>
  <itemref idref="ch1"/>
  <itemref idref="ch2"/>
</spine>
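The manifest and spine work together: the spine lists ids, and the manifest maps each id to a file. A rough sketch of resolving the reading order is shown below. It assumes the OPF declares the standard package namespace (http://www.idpf.org/2007/opf) on its root <package> element, which the excerpt above omits; reading_order is a hypothetical helper name, and the sketch does not decode percent-encoded hrefs.

```python
import posixpath
import xml.etree.ElementTree as ET

OPF_NS = {"opf": "http://www.idpf.org/2007/opf"}

def reading_order(opf_xml, opf_path="OEBPS/content.opf"):
    """Return archive paths of linear spine items, in reading order."""
    root = ET.fromstring(opf_xml)
    # Map manifest ids to hrefs (hrefs are relative to the OPF file)
    manifest = {
        item.get("id"): item.get("href")
        for item in root.findall("opf:manifest/opf:item", OPF_NS)
    }
    base = posixpath.dirname(opf_path)
    order = []
    for ref in root.findall("opf:spine/opf:itemref", OPF_NS):
        if ref.get("linear", "yes") == "no":
            continue  # skip auxiliary content such as a non-linear nav page
        href = manifest[ref.get("idref")]  # KeyError signals a dangling idref
        order.append(posixpath.normpath(posixpath.join(base, href)))
    return order
```

Passing opf_path matters because manifest hrefs are resolved against the OPF's own folder, not the ZIP root.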
Content Files — XHTML, Not HTML
Chapter files must be valid XHTML — XML-conformant HTML. Key differences from HTML5:
- Should begin with an XML declaration; the doctype is optional in EPUB 3 (<!DOCTYPE html>), while EPUB 2 files typically carry the XHTML 1.1 doctype
- All tags must be closed: <br/> not <br>
- Attribute values must be quoted
- Case-sensitive: use lowercase element names
- The namespace declaration is required: xmlns="http://www.w3.org/1999/xhtml"
<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:epub="http://www.idpf.org/2007/ops">
  <head>
    <title>Chapter 1</title>
    <!-- href is relative to this file, which sits next to the css/ folder -->
    <link rel="stylesheet" type="text/css" href="css/styles.css"/>
  </head>
  <body>
    <section epub:type="chapter">
      <h1>Chapter 1: Introduction</h1>
      <p>Text here.</p>
    </section>
  </body>
</html>
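Most of the rules above come down to XML well-formedness, so a strict XML parser will catch unclosed tags and unquoted attributes. The sketch below uses Python's standard library parser and also checks the root namespace; check_xhtml is a hypothetical helper, and it covers only these two checks, not full EPUB content validation.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

def check_xhtml(source):
    """Return None if `source` parses as XML with an XHTML <html> root,
    otherwise a short description of the problem."""
    try:
        root = ET.fromstring(source)
    except ET.ParseError as err:
        # Raised for unclosed tags, unquoted attributes, mismatched case, etc.
        return f"not well-formed XML: {err}"
    if root.tag != f"{{{XHTML_NS}}}html":
        return f"root element is {root.tag!r}, expected <html> in the XHTML namespace"
    return None
```

Feeding it the bytes of each chapter file gives a quick first pass before running a full validator such as EPUBCheck.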
Inspecting an EPUB
# List contents without extracting
unzip -l book.epub
# Extract to a folder
unzip book.epub -d book_extracted/
# View OPF
unzip -p book.epub OEBPS/content.opf | xmllint --format -
EPUBs from PDF Conversion
When toolkit.bot converts a PDF to EPUB, it generates all required files: mimetype, container.xml, content.opf, toc.ncx, nav.xhtml, chapter XHTML files, embedded images, and a stylesheet. The output passes EPUBCheck validation and includes EPUB Accessibility 1.1 metadata.