Dhruvil Patel dhruvilp

Here's a complete end-to-end script for fine-tuning the MXFP4-quantized MoE gpt-oss-20b model on your 4×A10G setup (24 GB each, 96 GB total). It uses QLoRA for memory efficiency while accounting for the model's MXFP4 quantization.

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Fine-tune MXFP4-quantized MoE GPT-oss-20B with QLoRA
Hardware: 4× NVIDIA A10G (24GB VRAM each)
Key Tech: bitsandbytes (4-bit quantization), PEFT (QLoRA), FlashAttention-2, DeepSpeed ZeRO-3
"""

Granite Docling Document Converter

A high-performance, parallel-processing library for converting documents to Markdown, JSON, and DocTags using the Granite Docling model. No FastAPI, Flask, or web frameworks required - pure Python library with sync and async support.

🚀 Features

No Web Framework Required: Pure Python library - use it directly in your code
Parallel Processing: Process large PDFs with multiple workers for maximum speed
Async Support: Full async/await support for non-blocking operations
Multiple Output Formats: Convert to Markdown, JSON, DocTags
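
The parallel and async features listed above can be pictured with a small sketch. This is not the library's actual API, which is not shown here: `convert_page`, `convert_pages_parallel`, and `convert_pages_async` are hypothetical names, and the per-page conversion is a placeholder standing in for a real model call.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def convert_page(page_text: str) -> str:
    """Hypothetical stand-in for a per-page conversion call; returns Markdown."""
    return f"## Page\n\n{page_text.strip()}"


def convert_pages_parallel(
    pages: List[str],
    convert: Callable[[str], str] = convert_page,
    max_workers: int = 4,
) -> List[str]:
    """Convert pages with a pool of worker threads, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(convert, pages))


async def convert_pages_async(
    pages: List[str],
    convert: Callable[[str], str] = convert_page,
    max_workers: int = 4,
) -> List[str]:
    """Async wrapper: run the blocking pool in a thread, off the event loop."""
    return await asyncio.to_thread(convert_pages_parallel, pages, convert, max_workers)
```

Synchronous callers would use `convert_pages_parallel(pages)` directly, while async callers would `await convert_pages_async(pages)`. A thread pool is assumed here for simplicity; a process pool may suit CPU-bound conversion better.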

[
  {
    "content": "reasoning language: English\n\nYou are an intelligent assistant that can answer customer service queries",
    "role": "system",
    "thinking": null
  },
  {
    "content": "Can you provide me with a list of the top-rated series currently on Netflix?",
    "role": "user",

	# Use an AWS Deep Learning Container (DLC) as a base or a vLLM specific image
	# Ensure the base image has the necessary CUDA drivers and PyTorch
	# Or pin a specific version that matches your CUDA
	FROM vllm/vllm-openai:latest

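	# Copy the pre-downloaded model weights into the container image.
	# COPY sources are resolved relative to the build context, so the weights
	# must sit under mnt/models/ inside the context directory.
	COPY /mnt/models/granite-docling-258M /app/local_model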

	WORKDIR /app

	# The entrypoint command will use the local directory path for the --model argument

	# train.py
	# Run with: accelerate launch --num_processes 4 train.py
	# Make sure `accelerate config` is set up for DDP; otherwise accelerate falls back to its default settings.

	import os
	import torch
	from datasets import load_dataset
	from peft import LoraConfig, get_peft_model, TaskType
	from transformers import (
	    AutoModelForCausalLM,

	import torch
	from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
	import time


	model_path = './gpt-oss-model-local'

	quantization_config = BitsAndBytesConfig(
	    load_in_4bit=True,
	    bnb_4bit_use_double_quant=True,

	import json
	import pandas as pd
	from langchain.text_splitter import RecursiveCharacterTextSplitter
	from openai import OpenAI
	import os
	from typing import List, Dict
	import random

	class SyntheticDataGenerator:
	    def __init__(self, api_key: str = None, model: str = "gpt-4"):

	import asyncio
	import json
	import os
	from base64 import b64decode
	from typing import List, Dict, Optional, Any
	from pydantic import BaseModel, Field

	from crawl4ai import (
	    AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode,
	    JsonCssExtractionStrategy, LLMExtractionStrategy, LLMConfig,

	Input a message to start chatting with deepseek-ai/DeepSeek-V3-0324.
	How can I convert an app running on a Tomcat 8 (Catalina) server to a Spring Boot app with JDK 17?

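	import os

	from together import Together

	# Read the API key from the environment rather than an undefined global.
	client = Together(api_key=os.environ["TOGETHER_API_KEY"])

	question = "Which is larger 9.9 or 9.11?"

	# Stop at the closing think tag so only the model's reasoning is returned.
	thought = client.chat.completions.create(
	    model="deepseek-ai/DeepSeek-R1",
	    messages=[{"role": "user", "content": question}],
	    stop=["</think>"],
	)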
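
Because generation stops at `</think>`, the returned text should hold only the model's reasoning. A minimal sketch of cleaning that text follows. It assumes the content may begin with a literal `<think>` tag, which depends on the provider and model; `extract_thought` is a hypothetical helper name, not part of the Together SDK.

```python
def extract_thought(content: str, open_tag: str = "<think>") -> str:
    """Return the reasoning text with a leading open tag and outer whitespace removed."""
    text = content.strip()
    # Drop the opening tag only when it is actually present (assumption: it may not be).
    if text.startswith(open_tag):
        text = text[len(open_tag):]
    return text.strip()
```

With the response above, this would be called as `extract_thought(thought.choices[0].message.content)`.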