How do I extract text from an EPUB file?

The easiest method is Pandoc: pandoc book.epub -t plain outputs all text to stdout. For Markdown with structure preserved: pandoc book.epub -o book.md. For scripted extraction in Python, unzip the EPUB and parse the XHTML content files with BeautifulSoup.

Can I convert an EPUB to plain text?

Yes. Multiple tools do this: Pandoc (pandoc book.epub -t plain), Calibre command line (ebook-convert book.epub book.txt), or the epub2txt Python package. Pandoc gives the best structure preservation; epub2txt is the simplest for a quick text dump.

How do I convert EPUB to Markdown?

Use Pandoc: pandoc book.epub -o book.md --wrap=none. This preserves heading hierarchy as # headings, bold/italic, lists, and tables in Markdown syntax. Use --wrap=none to avoid line wrapping that breaks prose paragraphs.

How do I extract EPUB text programmatically in Python?

An EPUB is a ZIP archive. Use zipfile to open it, find the OPF package file, locate the content XHTML files from the manifest, and parse each with BeautifulSoup. Call element.get_text() on the body to extract clean text. This approach gives full control for NLP pipelines or search indexing.

Why does direct PDF text extraction produce scrambled text for multi-column PDFs?

Many PDF text extractors emit characters in the order they are stored in the file, or rebuild lines by vertical position across the full page width. On a multi-column page, either approach can splice a line from the left column onto a line from the right column, so the output interleaves the columns. Converting to EPUB first with a tool that has layout detection (like toolkit.bot) reorders the text correctly before extraction.

How to Extract Text from an EPUB — Plain Text, Markdown, and JSON

June 11, 2026 · 6 min read

Need to extract readable text from an EPUB for NLP processing, search indexing, content migration, or plain reading? Here are the best tools for EPUB text extraction in different output formats.

Method 1: Pandoc — Best for Markdown and Structured Text

# Extract to plain text (-t plain is needed: Pandoc treats a .txt output file as Markdown by default)
pandoc book.epub -t plain -o book.txt

# Extract to Markdown (preserves headings, bold, lists)
pandoc book.epub -o book.md

# Extract all text to stdout
pandoc book.epub -t plain

# Markdown without hard line wrapping (each paragraph stays on one line)
pandoc book.epub -t markdown --wrap=none -o book.md

Pandoc is the best choice when you need to preserve document structure. Markdown output retains # headings, **bold**, bullet lists, and table structure.
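If you have a whole folder of EPUBs, the commands above can be wrapped in a short script. This is a minimal sketch, assuming pandoc is installed and on your PATH; the function names pandoc_command and convert_folder are made up for this example, and the only Pandoc options used are the ones already shown above (-t, --wrap=none, -o).

```python
import subprocess
from pathlib import Path


def pandoc_command(epub_path, out_format="markdown"):
    """Build the Pandoc argument list for one EPUB.

    out_format is passed to -t; 'markdown' and 'plain' are the two
    formats used in this article.
    """
    epub_path = Path(epub_path)
    suffix = ".md" if out_format == "markdown" else ".txt"
    out_path = epub_path.with_suffix(suffix)
    return [
        "pandoc", str(epub_path),
        "-t", out_format,
        "--wrap=none",
        "-o", str(out_path),
    ]


def convert_folder(folder, out_format="markdown"):
    """Run Pandoc on every .epub file in folder (requires pandoc on PATH)."""
    for epub_path in sorted(Path(folder).glob("*.epub")):
        # check=True raises CalledProcessError if Pandoc exits with an error
        subprocess.run(pandoc_command(epub_path, out_format), check=True)
```

Calling convert_folder("books") would write book.md next to each book.epub in a folder named books (a placeholder name). Passing the format explicitly with -t avoids relying on Pandoc guessing it from the file extension.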

Method 2: Python — Unzip and Parse HTML

An EPUB is a ZIP archive containing XHTML files. You can extract text directly with Python:

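import posixpath
import zipfile
from urllib.parse import unquote

from bs4 import BeautifulSoup


def extract_epub_text(epub_path):
    texts = []
    with zipfile.ZipFile(epub_path, 'r') as z:
        # Find the OPF package file(s); the manifest lists the content files
        opf_files = [f for f in z.namelist() if f.endswith('.opf')]
        for opf in opf_files:
            # Manifest hrefs are relative to the OPF's directory.
            # ZIP member names always use '/', so use posixpath, not os.path.
            opf_dir = posixpath.dirname(opf)
            with z.open(opf) as f:
                # The 'xml' parser requires lxml to be installed
                soup = BeautifulSoup(f.read(), 'xml')
            # Note: this follows manifest order, which is usually but not
            # always the reading order (the spine defines reading order)
            items = soup.find_all('item', {'media-type': 'application/xhtml+xml'})
            for item in items:
                href = unquote(item.get('href', ''))
                path = posixpath.normpath(posixpath.join(opf_dir, href))
                try:
                    with z.open(path) as cf:
                        csoup = BeautifulSoup(cf.read(), 'html.parser')
                        texts.append(csoup.get_text(separator='\n'))
                except KeyError:
                    # Skip manifest entries that are missing from the archive
                    pass
    return '\n\n'.join(texts)


print(extract_epub_text('book.epub'))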
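For JSON output, the same unzip-and-parse idea can return one record per content file instead of one long string. The sketch below is one possible layout, not a standard: the record keys href and text, and the names epub_to_records and epub_to_json, are choices made for this example. It uses only the standard library, locates the OPF through META-INF/container.xml, and follows the spine so that records come out in reading order.

```python
import json
import posixpath
import zipfile
import xml.etree.ElementTree as ET
from html.parser import HTMLParser
from urllib.parse import unquote

NS = {
    "c": "urn:oasis:names:tc:opendocument:xmlns:container",
    "opf": "http://www.idpf.org/2007/opf",
}


class _TextCollector(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())


def epub_to_records(epub_path):
    """Return [{'href': ..., 'text': ...}, ...] in spine (reading) order."""
    records = []
    with zipfile.ZipFile(epub_path) as z:
        names = set(z.namelist())
        # META-INF/container.xml points at the OPF package file
        container = ET.fromstring(z.read("META-INF/container.xml"))
        opf_path = container.find(".//c:rootfile", NS).get("full-path")
        opf_dir = posixpath.dirname(opf_path)
        package = ET.fromstring(z.read(opf_path))

        # Map manifest ids to hrefs, then walk the spine for reading order
        manifest = {
            item.get("id"): item.get("href")
            for item in package.findall(".//opf:manifest/opf:item", NS)
        }
        for itemref in package.findall(".//opf:spine/opf:itemref", NS):
            href = manifest.get(itemref.get("idref"))
            if href is None:
                continue
            path = posixpath.normpath(posixpath.join(opf_dir, unquote(href)))
            if path not in names:
                continue  # manifest entry missing from the archive
            collector = _TextCollector()
            collector.feed(z.read(path).decode("utf-8", errors="replace"))
            collector.close()  # flush any text still buffered by the parser
            records.append({"href": href, "text": "\n".join(collector.parts)})
    return records


def epub_to_json(epub_path):
    """Serialize the records as pretty-printed JSON, keeping non-ASCII text."""
    return json.dumps(epub_to_records(epub_path), ensure_ascii=False, indent=2)
```

Keeping one record per content file makes it easy to index chapters separately or attach metadata later. If you already depend on BeautifulSoup, you could swap it in for the small HTMLParser subclass.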

Method 3: epub2txt (Command Line, Simple)

epub2txt is a minimal command-line tool focused on plain text extraction:

# Install
pip install epub2txt

# Extract
epub2txt book.epub > book.txt

# With line wrapping off
epub2txt --width 0 book.epub > book.txt

Method 4: Calibre's ebook-convert

# To plain text
ebook-convert book.epub book.txt

# To Markdown-formatted text (the output is still a .txt file)
ebook-convert book.epub book.txt --txt-output-formatting markdown

Output Format Comparison

Tool	Plain TXT	Markdown	JSON	Best use
Pandoc	Yes	Yes (best quality)	No	Content editing, publishing
Python/BeautifulSoup	Yes	Custom	Yes (custom)	NLP pipelines, search indexing
epub2txt	Yes	No	No	Quick text dump
Calibre	Yes	Partial	No	Desktop workflow

Extracting Text from a PDF via EPUB

For PDFs, especially scanned or multi-column ones, converting to EPUB first can give better text ordering, provided the converter performs OCR and layout detection:

1. Convert PDF to EPUB with toolkit.bot (OCR + column reordering included)
2. Extract EPUB text with Pandoc: pandoc book.epub -t plain -o book.txt

Direct PDF text extraction tools often produce scrambled text on multi-column layouts. The EPUB conversion step reorders the content correctly before extraction.
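To see why reading order matters, here is a toy model of a two-column page. It is only an illustration under made-up assumptions: the coordinates are invented, the gutter position is given by hand, and the names naive_order and column_order are hypothetical. It does not describe how toolkit.bot or any particular extractor works; real layout detection has to find the columns itself.

```python
from typing import List, Tuple

# Each tuple is (x, y, text): the left edge and vertical position of one
# line of text. Coordinates are invented for illustration; y grows downward.
Line = Tuple[float, float, str]

PAGE: List[Line] = [
    (50, 100, "Left column, line 1."),
    (320, 100, "Right column, line 1."),
    (50, 120, "Left column, line 2."),
    (320, 120, "Right column, line 2."),
]


def naive_order(lines: List[Line]) -> List[str]:
    """Read strictly top-to-bottom, then left-to-right across the full page."""
    return [text for _, _, text in sorted(lines, key=lambda l: (l[1], l[0]))]


def column_order(lines: List[Line], gutter_x: float) -> List[str]:
    """Read everything left of the gutter first, then everything right of it."""
    left = sorted((l for l in lines if l[0] < gutter_x), key=lambda l: l[1])
    right = sorted((l for l in lines if l[0] >= gutter_x), key=lambda l: l[1])
    return [text for _, _, text in left + right]


print(naive_order(PAGE))                 # columns interleaved
print(column_order(PAGE, gutter_x=300))  # left column, then right column
```

The naive version alternates between the two columns on every line, which is the scrambled output described above. The column-aware version keeps each column together.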
