
@qunash
qunash / grpo_qwen-0-5b_single_t4.ipynb
Last active October 15, 2025 03:21
@ChrisHayduk
ChrisHayduk / merge_qlora_with_quantized_model.py
Last active September 27, 2025 08:22
Merging QLoRA weights with quantized model
"""
The code below combines approaches published by both @eugene-yh and @jinyongyoo on GitHub.
Thanks for the contributions, guys!
"""
import torch
import peft
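The preview stops at the imports; the merge itself folds the low-rank LoRA update back into the base weights. A minimal sketch of that step, assuming the adapter is simply reattached to an fp16 reload of the base model via peft's standard merge path (the gist's own approach may differ; the model id and adapter path below are placeholders):

import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

# Placeholders; substitute your own base model and trained QLoRA adapter.
base_model_id = "meta-llama/Llama-2-7b-hf"
adapter_path = "path/to/qlora-adapter"

# Reload the base in fp16 (not 4-bit), attach the adapter, then fold
# W' = W + (alpha/r) * B @ A into the dense weights and save.
base = AutoModelForCausalLM.from_pretrained(base_model_id, torch_dtype=torch.float16)
merged = PeftModel.from_pretrained(base, adapter_path).merge_and_unload()
merged.save_pretrained("merged-model")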
@mjpost
mjpost / trim_fairseq_model.py
Created May 15, 2020 14:37
Removes the Adam optimizer state from fairseq models, greatly reducing their size
#!/usr/bin/env python3
"""
This script takes a trained fairseq model and discards the Adam optimizer state,
which is not needed at test time. This can reduce model size by ~70%.
Original author: Brian Thompson
"""
from fairseq import checkpoint_utils
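Only the import line survives in this preview. A minimal sketch of the trimming step, assuming the checkpoint is a torch-serialized dict whose optimizer state sits under the "last_optimizer_state" key (the key name is an assumption here, not taken from the gist):

#!/usr/bin/env python3
# Sketch: drop the Adam optimizer state from a fairseq checkpoint.
# Assumes a torch.save'd dict with the moments under "last_optimizer_state".
import sys
import torch

ckpt = torch.load(sys.argv[1], map_location="cpu")
ckpt.pop("last_optimizer_state", None)  # the Adam moments; unused at test time
torch.save(ckpt, sys.argv[2])

Usage: python trim_sketch.py checkpoint_best.pt checkpoint_trimmed.pt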
@yuchenlin
yuchenlin / masked_word_prediction_bert.py
Last active November 5, 2024 14:23
A simple example script for predicting masked words in a sentence using BERT.
import torch
from transformers import BertTokenizer, BertModel, BertForMaskedLM
import logging
logging.basicConfig(level=logging.INFO)  # OPTIONAL
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased')
model.eval()
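The preview cuts off right after loading the model; a short continuation sketch of the prediction step the description promises (the sentence below is an illustrative example, not necessarily the gist's):

# Mask one token, run BertForMaskedLM, and read off the top candidates.
text = "[CLS] I want to [MASK] the car because it is cheap . [SEP]"
tokens = tokenizer.tokenize(text)
masked_index = tokens.index('[MASK]')
input_ids = torch.tensor([tokenizer.convert_tokens_to_ids(tokens)])

with torch.no_grad():
    logits = model(input_ids)[0]  # (1, seq_len, vocab_size)

top5 = torch.topk(logits[0, masked_index], k=5).indices.tolist()
print(tokenizer.convert_ids_to_tokens(top5))  # top-5 candidate fills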

CLI script for extracting plain text out of a raw Wikipedia dump. This is an xml.bz2 file provided by MediaWiki, named like wiki--pages-articles.xml.bz2 or wiki-latest-pages-articles.xml.bz2 (e.g. the ~14 GB https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2).

It streams through all the XML articles using multiple cores (#cores - 1 by default), decompressing on the fly and extracting plain-text sections from each article.

For each extracted article, it prints the title, the section names, and the plain-text section contents, one JSON object per line (JSON Lines format).

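A hypothetical consumer of that output, assuming each line is a JSON object with "title", "section_titles", and "section_texts" fields (the field names and the file name are assumptions based on the description above):

# Read the script's JSON-lines output back, one article per line.
import json

with open("articles.jsonl", encoding="utf-8") as f:
    for line in f:
        article = json.loads(line)
        print(article["title"], "-", len(article["section_titles"]), "sections")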
@erickrf
erickrf / tokenizer.py
Last active March 5, 2023 05:12
Portuguese tokenizer
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
from nltk.tokenize import RegexpTokenizer
import argparse
import os
"""
Script for tokenizing Portuguese text according to the Universal Dependencies
(UD) tokenization standards. This script was not created by the UD team; it was
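The preview ends mid-docstring. As a rough illustration of the general approach (the pattern below is illustrative only, not the gist's actual rules, which follow the UD standards):

# Illustrative only: a basic nltk RegexpTokenizer that separates words
# from punctuation. The real script implements the fuller UD rules.
from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r"\w+|[^\w\s]")
print(tokenizer.tokenize("Ele não sabia que horas eram."))
# -> ['Ele', 'não', 'sabia', 'que', 'horas', 'eram', '.']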