Installing and Using MinerU with Docker
Introduction
MinerU is a powerful tool that converts various document formats (PDF, PPT, PPTX, DOC, DOCX, PNG, JPG) into machine-readable formats such as Markdown and JSON. This tutorial will guide you through installing MinerU using Docker and demonstrate how to use it effectively.
Prerequisites
- Docker installed on your system
- Basic knowledge of terminal/command line
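A quick way to confirm the prerequisites is to check that the required commands are on your PATH. The sketch below is a minimal example; the function names `require_cmd` and `check_prereqs` are hypothetical helpers, not part of Docker or MinerU, and `git` is included only because the tutorial clones a repository.

```shell
# Return 0 if the named command is on PATH, non-zero otherwise.
require_cmd() {
    command -v "$1" >/dev/null 2>&1
}

# Print one line per tool used in this tutorial; return 1 if any is missing.
check_prereqs() {
    missing=0
    for tool in docker git; do
        if require_cmd "$tool"; then
            echo "found: $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}
```

Running `check_prereqs` prints one line per tool. If Docker is found, `docker info` additionally confirms that the Docker daemon is reachable.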
Installation
MinerU officially recommends a GPU with at least 16GB of VRAM for optimal performance. This tutorial instead sets up a CPU-only version that runs on systems without a dedicated GPU.
Step 1: Clone the MinerU Repository
git clone https://github.com/opendatalab/MinerU.git
cd MinerU
Step 2: Create a CPU-Compatible Dockerfile
Create a file named Dockerfile.cpu in the docker/global directory with the following content:
# Use the official Ubuntu base image
FROM ubuntu:22.04
# Set environment variables to non-interactive to avoid prompts during installation
ENV DEBIAN_FRONTEND=noninteractive
# Update the package list and install necessary packages
RUN apt-get update && \
    apt-get install -y \
        software-properties-common && \
    add-apt-repository ppa:deadsnakes/ppa && \
    apt-get update && \
    apt-get install -y \
        python3.10 \
        python3.10-venv \
        python3.10-distutils \
        python3-pip \
        wget \
        git \
        libgl1 \
        libreoffice \
        fonts-noto-cjk \
        fonts-wqy-zenhei \
        fonts-wqy-microhei \
        ttf-mscorefonts-installer \
        fontconfig \
        libglib2.0-0 \
        libxrender1 \
        libsm6 \
        libxext6 \
        poppler-utils \
    && rm -rf /var/lib/apt/lists/*
# Set Python 3.10 as the default python3
RUN update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.10 1
# Create a virtual environment for MinerU
RUN python3 -m venv /opt/mineru_venv
# Copy the configuration file template and install magic-pdf latest
RUN /bin/bash -c "wget https://github.com/opendatalab/MinerU/raw/master/magic-pdf.template.json && \
    cp magic-pdf.template.json /root/magic-pdf.json && \
    source /opt/mineru_venv/bin/activate && \
    pip3 install --upgrade pip && \
    pip3 install -U magic-pdf[full]"
# Download models and update the configuration file
# (this step runs with the system Python; it does not activate the virtual environment)
RUN /bin/bash -c "pip3 install huggingface_hub && \
    wget https://github.com/opendatalab/MinerU/raw/master/scripts/download_models_hf.py -O download_models.py && \
    python3 download_models.py"
# Set the entry point to activate the virtual environment and run the command line tool
ENTRYPOINT ["/bin/bash", "-c", "source /opt/mineru_venv/bin/activate && exec \"$@\"", "--"]
Step 3: Build the Docker Image
cd docker/global
docker build -t mineru-cpu:latest -f Dockerfile.cpu .
This process may take some time as it downloads and installs all necessary dependencies and models.
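After the build finishes, it can be worth checking that the image exists and that the command line tool starts inside it. The following is a minimal sketch: `verify_mineru_image` is a hypothetical helper name, and it assumes the image tag `mineru-cpu:latest` used in this tutorial and that the installed `magic-pdf` supports the `--version` flag.

```shell
# Hypothetical helper: succeed only if the image exists locally and the
# magic-pdf CLI inside it starts and reports its version.
verify_mineru_image() {
    image="${1:-mineru-cpu:latest}"
    if ! docker image inspect "$image" >/dev/null 2>&1; then
        echo "image not found: $image" >&2
        return 1
    fi
    docker run --rm "$image" magic-pdf --version
}
```

Calling `verify_mineru_image` with no arguments checks the default tag; pass another tag as the first argument if you built the image under a different name.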
Using MinerU
Once the Docker image is built, you can use MinerU to convert various document formats to Markdown and JSON.
Basic Usage
To convert a document, use the following command structure:
docker run --rm -v $(pwd):/data mineru-cpu:latest magic-pdf -p /data/your_document.pdf -o /data/output_directory
This command:
- Mounts your current directory to /data inside the container
- Processes your_document.pdf from your current directory
- Saves the output to output_directory in your current directory
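Because the container only sees the mounted directory, the document must live inside it. One way to avoid changing directories by hand is a small wrapper that mounts the document's own directory. This is a sketch under stated assumptions: `mineru_convert` is a hypothetical function name, and the output directory name `output` is chosen only for illustration.

```shell
# Hypothetical wrapper: convert one document, writing results into an
# "output" directory next to it.
# Usage: mineru_convert path/to/file.pdf [extra magic-pdf options...]
mineru_convert() {
    input="$1"
    shift
    # Resolve the document's directory to an absolute path for the bind mount.
    dir=$(cd "$(dirname "$input")" && pwd) || return 1
    name=$(basename "$input")
    docker run --rm -v "$dir:/data" mineru-cpu:latest \
        magic-pdf -p "/data/$name" -o /data/output "$@"
}
```

For example, `mineru_convert ~/papers/report.pdf` would mount `~/papers` as /data and write results to `~/papers/output`. Any further arguments are passed through to `magic-pdf` unchanged.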
Conversion Methods
MinerU offers three different methods for document conversion:
- OCR Mode: Uses optical character recognition to extract information from documents
docker run --rm -v $(pwd):/data mineru-cpu:latest magic-pdf -p /data/your_document.pdf -o /data/output_directory -m ocr
- Text Mode: Better for text-based PDFs, more efficient than OCR
docker run --rm -v $(pwd):/data mineru-cpu:latest magic-pdf -p /data/your_document.pdf -o /data/output_directory -m txt
- Auto Mode: Automatically selects the best method (default)
docker run --rm -v $(pwd):/data mineru-cpu:latest magic-pdf -p /data/your_document.pdf -o /data/output_directory -m auto
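If it is unclear which mode suits a document, one option is to run all three and compare the results side by side. The sketch below assumes the document sits in the current directory; `compare_modes` and the `output_<mode>` directory names are hypothetical and chosen only for illustration.

```shell
# Run the same document through each parsing mode,
# using one output directory per mode.
compare_modes() {
    doc="$1"   # file name relative to the current directory
    for mode in ocr txt auto; do
        docker run --rm -v "$(pwd):/data" mineru-cpu:latest \
            magic-pdf -p "/data/$doc" -o "/data/output_$mode" -m "$mode" \
            || echo "mode $mode failed for $doc" >&2
    done
}
```

After `compare_modes your_document.pdf`, the Markdown in each `output_<mode>` directory can be compared with a tool such as `diff`. On CPU this runs the full conversion three times, so it is best tried on a short document.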
Language Support
To improve OCR accuracy for documents in specific languages, use the language parameter:
docker run --rm -v $(pwd):/data mineru-cpu:latest magic-pdf -p /data/your_document.pdf -o /data/output_directory -l en
Replace en with the appropriate language code (e.g., fr for French or de for German).
Processing Specific Pages
To convert only specific pages of a document:
docker run --rm -v $(pwd):/data mineru-cpu:latest magic-pdf -p /data/your_document.pdf -o /data/output_directory -s 0 -e 5
Page numbers are zero-based, so this processes pages 0 through 5 inclusive, i.e. the first six pages of the document.
Debug Mode
For detailed debugging information during conversion:
docker run --rm -v $(pwd):/data mineru-cpu:latest magic-pdf -p /data/your_document.pdf -o /data/output_directory -d true
Output Files
After processing, MinerU generates several output files:
- Markdown (.md): A human-readable version of the document with preserved formatting
- JSON (.json): A structured representation of the document content
- Images folder: Contains extracted images from the document
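To see what a run actually produced, you can list the Markdown and JSON files under the output directory. This is a minimal sketch: `list_outputs` is a hypothetical helper, and it searches recursively because the exact nesting and file names inside the output directory may differ between MinerU versions.

```shell
# List all Markdown and JSON files below an output directory, sorted by path.
list_outputs() {
    out_dir="${1:-output_directory}"
    find "$out_dir" -type f \( -name '*.md' -o -name '*.json' \) | sort
}
```

For example, `list_outputs output_directory` prints one path per line, which can then be fed to other tools.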
Advanced Usage
Processing Multiple Files
To process all PDF files in a directory:
for file in *.pdf; do
    [ -e "$file" ] || continue  # skip the literal "*.pdf" when nothing matches
    docker run --rm -v "$(pwd)":/data mineru-cpu:latest magic-pdf -p "/data/$file" -o "/data/output_${file%.pdf}"
done
Integration with Other Tools
The generated Markdown and JSON files can be easily integrated with other tools and workflows:
- Use the Markdown files with static site generators
- Process the JSON files with data analysis tools
- Import the structured content into knowledge bases or LLM systems
Troubleshooting
Common Issues
- Memory Issues: If you encounter memory errors, try processing fewer pages at a time using the -s and -e options
- Font Issues: For documents with special fonts, ensure the necessary font packages are installed in the Docker image
- Performance: CPU-only mode will be slower than GPU-accelerated processing. For large documents, consider using a system with GPU support
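For the memory issue in particular, the page-range options can be applied systematically by splitting a document into fixed-size chunks. The sketch below is one possible approach, not an official MinerU feature: `page_ranges` and `convert_in_chunks` are hypothetical helpers, the page count is supplied by the caller, and each chunk is written to its own `output_<start>_<end>` directory.

```shell
# Print "start end" pairs (zero-based, inclusive) that cover `total` pages
# in chunks of at most `chunk` pages.
page_ranges() {
    total="$1"
    chunk="$2"
    start=0
    while [ "$start" -lt "$total" ]; do
        end=$((start + chunk - 1))
        if [ "$end" -ge "$total" ]; then
            end=$((total - 1))
        fi
        echo "$start $end"
        start=$((end + 1))
    done
}

# Convert a document chunk by chunk; each chunk gets its own output directory.
convert_in_chunks() {
    doc="$1"        # file name relative to the current directory
    doc_pages="$2"  # total number of pages in the document
    per_run="$3"    # maximum pages per container run
    page_ranges "$doc_pages" "$per_run" | while read -r s e; do
        # </dev/null keeps docker from consuming the remaining page ranges.
        docker run --rm -v "$(pwd):/data" mineru-cpu:latest \
            magic-pdf -p "/data/$doc" -o "/data/output_${s}_${e}" \
            -s "$s" -e "$e" </dev/null
    done
}
```

For example, `convert_in_chunks your_document.pdf 10 4` would process pages 0-3, 4-7, and 8-9 in three separate runs. The page count can be read with `pdfinfo`, which is provided by the `poppler-utils` package installed in the image. Merging the per-chunk Markdown files is left to the reader.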
Conclusion
MinerU is a powerful tool for converting documents to machine-readable formats. With Docker, you can easily set up and use MinerU without worrying about complex dependencies or installation issues. This approach is particularly useful for scientific papers and technical documents with complex formulas and symbols.
For more information, visit the official MinerU documentation.