Skip to content

Instantly share code, notes, and snippets.

View minhpqn's full-sized avatar

Pham Quang Nhat Minh minhpqn

View GitHub Profile
@minhpqn
minhpqn / LLMs.md
Created January 4, 2023 08:28 — forked from yoavg/LLMs.md

Some remarks on Large Language Models

Yoav Goldberg, January 2023

Audience: I assume you heard of chatGPT, maybe played with it a little, and was imressed by it (or tried very hard not to be). And that you also heard that it is "a large language model". And maybe that it "solved natural language understanding". Here is a short personal perspective of my thoughts of this (and similar) models, and where we stand with respect to language understanding.

Intro

Around 2014-2017, right within the rise of neural-network based methods for NLP, I was giving a semi-academic-semi-popsci lecture, revolving around the story that achieving perfect language modeling is equivalent to being as intelligent as a human. Somewhere around the same time I was also asked in an academic panel "what would you do if you were given infinite compute and no need to worry about labour costs" to which I cockily responded "I would train a really huge language model, just to show that it doesn't solve everything!". We

@minhpqn
minhpqn / video-subtitles-via-whisper.py
Created September 27, 2022 04:40 — forked from rasbt/video-subtitles-via-whisper.py
Script that creates subtitles (closed captions) for all MP4 video files in your current directory
# Sebastian Raschka 09/24/2022
# Create a new conda environment and packages
# conda create -n whisper python=3.9
# conda activate whisper
# conda install mlxtend -c conda-forge
# Install ffmpeg
# macOS & homebrew
# brew install ffmpeg
# Ubuntu

Quick Start

sudo curl https://gist.github.com/pankaj28843/3ad78df6290b5ba931c1/raw/soffice.sh > /usr/local/bin/soffice && sudo chmod +x /usr/local/bin/soffice

Create an bash script at /usr/local/bin/soffice with following content

#!/bin/bash

# Need to do this because symlink won't work
@minhpqn
minhpqn / tmux-cheatsheet.markdown
Created December 29, 2021 09:08 — forked from MohamedAlaa/tmux-cheatsheet.markdown
tmux shortcuts & cheatsheet

tmux shortcuts & cheatsheet

start new:

tmux

start new with session name:

tmux new -s myname
@minhpqn
minhpqn / quantize.py
Created July 8, 2020 08:55 — forked from aleju/quantize.py
Simple quantization function for python
def quantize(val, to_values):
"""Quantize a value with regards to a set of allowed values.
Examples:
quantize(49.513, [0, 45, 90]) -> 45
quantize(43, [0, 10, 20, 30]) -> 30
Note: function doesn't assume to_values to be sorted and
iterates over all values (i.e. is rather slow).
@minhpqn
minhpqn / gradient_accumulation.py
Created June 15, 2020 16:07 — forked from thomwolf/gradient_accumulation.py
PyTorch gradient accumulation training loop
model.zero_grad() # Reset gradients tensors
for i, (inputs, labels) in enumerate(training_set):
predictions = model(inputs) # Forward pass
loss = loss_function(predictions, labels) # Compute loss function
loss = loss / accumulation_steps # Normalize our loss (if averaged)
loss.backward() # Backward pass
if (i+1) % accumulation_steps == 0: # Wait for several backward steps
optimizer.step() # Now we can do an optimizer step
model.zero_grad() # Reset gradients tensors
if (i+1) % evaluation_steps == 0: # Evaluate the model when we...
#! /usr/bin/env python
# -*- coding: utf-8 -*-
"""This module's docstring summary line.
This is a multi-line docstring. Paragraphs are separated with blank lines.
Lines conform to 79-column limit.
Module and packages names should be short, lower_case_with_underscores.
Notice that this in not PEP8-cheatsheet.py
@minhpqn
minhpqn / regex-japanese.txt
Created April 13, 2020 15:04 — forked from terrancesnyder/regex-japanese.txt
Regex for Japanese
Regex for matching ALL Japanese common & uncommon Kanji (4e00 – 9fcf) ~ The Big Kahuna!
([一-龯])
Regex for matching Hirgana or Katakana
([ぁ-んァ-ン])
Regex for matching Non-Hirgana or Non-Katakana
([^ぁ-んァ-ン])
Regex for matching Hirgana or Katakana or basic punctuation (、。’)
@minhpqn
minhpqn / download_glue_data.py
Created November 1, 2019 01:33 — forked from W4ngatang/download_glue_data.py
Script for downloading data of the GLUE benchmark (gluebenchmark.com)
''' Script for downloading all GLUE data.
Note: for legal reasons, we are unable to host MRPC.
You can either use the version hosted by the SentEval team, which is already tokenized,
or you can download the original data from (https://download.microsoft.com/download/D/4/6/D46FF87A-F6B9-4252-AA8B-3604ED519838/MSRParaphraseCorpus.msi) and extract the data from it manually.
For Windows users, you can run the .msi file. For Mac and Linux users, consider an external library such as 'cabextract' (see below for an example).
You should then rename and place specific files in a folder (see below for an example).
mkdir MRPC
cabextract MSRParaphraseCorpus.msi -d MRPC
@minhpqn
minhpqn / rnn_minibatch.py
Created April 12, 2019 22:44 — forked from ritchie46/rnn_minibatch.py
minibatches in pytorch
"""
How to do minibatches for RNNs in pytorch
Assume we feed characters to the model and predict the language of the words.
"""
def prepare_batch(x, y):
# determine the maximum word length per batch and zero pad the tensors
n_max = max([a.shape[0] for a in x])