How to Extract Text from an EPUB — Plain Text, Markdown, and JSON
Need to extract readable text from an EPUB for NLP processing, search indexing, content migration, or plain reading? Here are the best tools for EPUB text extraction in different output formats.
Method 1: Pandoc — Best for Markdown and Structured Text
# Extract to plain text (set the writer explicitly instead of relying on the .txt extension)
pandoc book.epub -t plain -o book.txt
# Extract to Markdown (preserves headings, bold, lists)
pandoc book.epub -o book.md
# Extract all text to stdout
pandoc book.epub -t plain
# Markdown without hard line wrapping (each paragraph stays on one line)
pandoc book.epub -t markdown --wrap=none -o book.md
Pandoc is the best choice when you need to preserve document structure. Markdown output retains # headings, **bold**, bullet lists, and table structure.
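If you need to run the same conversion from a script, for example over a folder of books, the commands above can be wrapped in a few lines of Python. This is a minimal sketch, not an official Pandoc API: the function names build_pandoc_command and convert_epub are made up for illustration, and it assumes the pandoc executable is on your PATH.

```python
import shutil
import subprocess


def build_pandoc_command(epub_path, output_path, fmt="markdown", wrap="none"):
    """Build the argument list for one Pandoc run (hypothetical helper).

    Passing a list to subprocess avoids shell quoting problems with
    file names that contain spaces.
    """
    return ["pandoc", epub_path, "-t", fmt, f"--wrap={wrap}", "-o", output_path]


def convert_epub(epub_path, output_path, fmt="markdown"):
    """Run Pandoc; raises CalledProcessError if the conversion fails."""
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc was not found on PATH")
    subprocess.run(build_pandoc_command(epub_path, output_path, fmt), check=True)
```

Calling convert_epub("book.epub", "book.md") would run the same command as the last example above, and passing fmt="plain" switches to plain text output.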
Method 2: Python — Unzip and Parse HTML
An EPUB is a ZIP archive containing XHTML files. You can extract text directly with Python:
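import posixpath
import zipfile
from urllib.parse import unquote

from bs4 import BeautifulSoup  # pip install beautifulsoup4 lxml


def extract_epub_text(epub_path):
    """Return the text of every XHTML file listed in the EPUB's OPF manifest."""
    texts = []
    with zipfile.ZipFile(epub_path, 'r') as z:
        # The OPF package file lists the book's content documents.
        opf_files = [f for f in z.namelist() if f.endswith('.opf')]
        for opf in opf_files:
            # ZIP member names always use forward slashes, so use posixpath
            # rather than os.path (which would insert backslashes on Windows).
            opf_dir = posixpath.dirname(opf)
            with z.open(opf) as f:
                soup = BeautifulSoup(f.read(), 'xml')  # the 'xml' parser needs lxml
            # Note: this follows manifest order, which may differ from the
            # reading order defined by the OPF spine.
            items = soup.find_all('item', {'media-type': 'application/xhtml+xml'})
            for item in items:
                # hrefs are relative to the OPF file and may be percent-encoded.
                href = unquote(item.get('href', ''))
                path = posixpath.normpath(posixpath.join(opf_dir, href))
                try:
                    with z.open(path) as cf:
                        csoup = BeautifulSoup(cf.read(), 'html.parser')
                except KeyError:
                    continue  # listed in the manifest but missing from the archive
                texts.append(csoup.get_text(separator='\n'))
    return '\n'.join(texts)


print(extract_epub_text('book.epub'))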
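For JSON output, the same unzip-and-parse idea can produce one record per chapter instead of a single string. The sketch below is one possible layout, not a standard format: the function name epub_to_records and the record fields index, href, and text are illustrative choices. It uses only the standard library and follows the OPF spine so records come out in reading order. It assumes a well-formed, DRM-free EPUB with UTF-8 content files.

```python
import json
import posixpath
import zipfile
import xml.etree.ElementTree as ET
from html.parser import HTMLParser
from urllib.parse import unquote

OPF_NS = {"opf": "http://www.idpf.org/2007/opf"}
CONTAINER_NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}


class _TextCollector(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def epub_to_records(epub_path):
    """Return one {"index", "href", "text"} dict per spine item, in reading order."""
    records = []
    with zipfile.ZipFile(epub_path) as z:
        # META-INF/container.xml points at the OPF package document.
        container = ET.fromstring(z.read("META-INF/container.xml"))
        opf_path = container.find(".//c:rootfile", CONTAINER_NS).get("full-path")
        opf_dir = posixpath.dirname(opf_path)
        package = ET.fromstring(z.read(opf_path))

        # Map manifest ids to hrefs, then walk the spine for reading order.
        manifest = {
            item.get("id"): item.get("href")
            for item in package.findall("opf:manifest/opf:item", OPF_NS)
        }
        for itemref in package.findall("opf:spine/opf:itemref", OPF_NS):
            href = manifest.get(itemref.get("idref"))
            if href is None:
                continue
            member = posixpath.normpath(posixpath.join(opf_dir, unquote(href)))
            try:
                markup = z.read(member).decode("utf-8", errors="replace")
            except KeyError:
                continue  # spine entry with no matching file in the archive
            collector = _TextCollector()
            collector.feed(markup)
            collector.close()
            records.append({
                "index": len(records),
                "href": href,
                "text": "\n".join(collector.parts),
            })
    return records
```

To write the result to disk, something like json.dump(epub_to_records("book.epub"), open("book.json", "w", encoding="utf-8"), ensure_ascii=False, indent=2) should work. Keeping the href in each record makes it easy to trace a search hit or an NLP result back to its source file in the book.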
Method 3: epub2txt (Command Line, Simple)
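epub2txt is a minimal command-line tool focused on plain text extraction. More than one project uses this name, so check epub2txt --help to see which options your installed version supports: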
# Install
pip install epub2txt
# Extract
epub2txt book.epub > book.txt
# With line wrapping off
epub2txt --width 0 book.epub > book.txt
Method 4: Calibre's ebook-convert
# To plain text
ebook-convert book.epub book.txt
# To Markdown-formatted text (the output file still uses the .txt extension)
ebook-convert book.epub book.txt --txt-output-formatting markdown
Output Format Comparison
| Tool | Plain TXT | Markdown | JSON | Best use |
|---|---|---|---|---|
| Pandoc | Yes | Yes (best quality) | No | Content editing, publishing |
| Python/BeautifulSoup | Yes | Custom | Yes (custom) | NLP pipelines, search indexing |
| epub2txt | Yes | No | No | Quick text dump |
| Calibre | Yes | Partial | No | Desktop workflow |
Extracting Text from a PDF via EPUB
When the source is a PDF, especially a scanned or multi-column one, converting it to EPUB first can give better text ordering than extracting from the PDF directly:
- Convert PDF to EPUB with toolkit.bot (OCR + column reordering included)
- Extract EPUB text with Pandoc:
pandoc book.epub -t plain -o book.txt
Direct PDF text extraction tools often produce scrambled text on multi-column layouts. The EPUB conversion step reorders the content correctly before extraction.