Pham Quang Nhat Minh minhpqn

Some remarks on Large Language Models

Yoav Goldberg, January 2023

Audience: I assume you have heard of ChatGPT, maybe played with it a little, and were impressed by it (or tried very hard not to be). And that you have also heard that it is "a large language model". And maybe that it "solved natural language understanding". Here is a short personal perspective on this (and similar) models, and on where we stand with respect to language understanding.

Intro

Around 2014-2017, right within the rise of neural-network based methods for NLP, I was giving a semi-academic-semi-popsci lecture, revolving around the story that achieving perfect language modeling is equivalent to being as intelligent as a human. Somewhere around the same time I was also asked in an academic panel "what would you do if you were given infinite compute and no need to worry about labour costs" to which I cockily responded "I would train a really huge language model, just to show that it doesn't solve everything!". We

Quick Start

curl -L https://gist.github.com/pankaj28843/3ad78df6290b5ba931c1/raw/soffice.sh | sudo tee /usr/local/bin/soffice > /dev/null && sudo chmod +x /usr/local/bin/soffice

Create a bash script at `/usr/local/bin/soffice` with the following content

#!/bin/bash

# Need to do this because symlink won't work

tmux shortcuts & cheatsheet

start new:

tmux

start new with session name:

tmux new -s myname

	# Sebastian Raschka 09/24/2022
	# Create a new conda environment and packages
	# conda create -n whisper python=3.9
	# conda activate whisper
	# conda install mlxtend -c conda-forge

	# Install ffmpeg
	# macOS & homebrew
	# brew install ffmpeg
	# Ubuntu

	def quantize(val, to_values):
	    """Quantize a value with regards to a set of allowed values.

	    Examples:
	        quantize(49.513, [0, 45, 90]) -> 45
	        quantize(43, [0, 10, 20, 30]) -> 30

	    Note: function doesn't assume to_values to be sorted and
	    iterates over all values (i.e. is rather slow).
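
The function body is cut off in this preview, so the following is only a minimal sketch of one implementation that matches the docstring (nearest allowed value, linear scan, no sort assumed). The empty-list check and the tie-breaking rule are assumptions, not something the original states.

```python
def quantize(val, to_values):
    """Return the element of to_values closest to val (linear scan)."""
    if not to_values:
        # Assumption: an empty set of allowed values is treated as an error.
        raise ValueError("to_values must not be empty")
    # min() with a key visits every candidate, so no sort order is needed;
    # on a tie, the candidate that appears first in to_values wins.
    return min(to_values, key=lambda candidate: abs(candidate - val))


print(quantize(49.513, [0, 45, 90]))   # 45
print(quantize(43, [0, 10, 20, 30]))   # 30
```

Both calls reproduce the examples given in the docstring.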

	model.zero_grad() # Reset gradients tensors
	for i, (inputs, labels) in enumerate(training_set):
	    predictions = model(inputs) # Forward pass
	    loss = loss_function(predictions, labels) # Compute loss function
	    loss = loss / accumulation_steps # Normalize our loss (if averaged)
	    loss.backward() # Backward pass
	    if (i+1) % accumulation_steps == 0: # Wait for several backward steps
	        optimizer.step() # Now we can do an optimizer step
	        model.zero_grad() # Reset gradients tensors
	        if (i+1) % evaluation_steps == 0: # Evaluate the model when we...
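
The loop above relies on the fact that summing the gradients of `loss / accumulation_steps` over several equally sized micro-batches gives the same gradient as one large batch with an averaged loss. The sketch below checks this in plain Python, with no PyTorch. The one-parameter model `w * x`, the data, and the helper names are all hypothetical and chosen only to keep the arithmetic visible.

```python
def mean_squared_error_grad(w, batch):
    """Gradient w.r.t. w of the mean of (w*x - y)**2 over a batch of (x, y) pairs."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def train_with_accumulation(w, micro_batches, accumulation_steps, lr):
    """Plain SGD that updates w once per `accumulation_steps` micro-batches."""
    grad = 0.0                                   # plays the role of model.zero_grad()
    for i, batch in enumerate(micro_batches):
        # Dividing by accumulation_steps mirrors `loss = loss / accumulation_steps`.
        grad += mean_squared_error_grad(w, batch) / accumulation_steps
        if (i + 1) % accumulation_steps == 0:
            w -= lr * grad                       # plays the role of optimizer.step()
            grad = 0.0                           # reset the accumulated gradient
    return w


data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
micro_batches = [data[:2], data[2:]]             # two micro-batches of equal size

w_accumulated = train_with_accumulation(0.0, micro_batches, accumulation_steps=2, lr=0.01)
w_full_batch = 0.0 - 0.01 * mean_squared_error_grad(0.0, data)
print(abs(w_accumulated - w_full_batch) < 1e-12)
```

The equivalence holds exactly only when the micro-batches have the same size and the loss is a mean over samples, which is the case the "(if averaged)" comment refers to.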

	#! /usr/bin/env python
	# -*- coding: utf-8 -*-
	"""This module's docstring summary line.

	This is a multi-line docstring. Paragraphs are separated with blank lines.
	Lines conform to 79-column limit.

	Module and package names should be short, lower_case_with_underscores.
	Notice that this is not PEP8-cheatsheet.py

	Regex for matching ALL Japanese common & uncommon Kanji (4e00 – 9fcf) ~ The Big Kahuna!
	([一-龯])

	Regex for matching Hiragana or Katakana
	([ぁ-んァ-ン])

	Regex for matching Non-Hiragana or Non-Katakana
	([^ぁ-んァ-ン])
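
The three patterns listed so far can be tried out with Python's `re` module, as in the short sketch below. The constant names and the sample string are illustrative assumptions. The patterns themselves are copied unchanged from the lines above.

```python
import re

KANJI = re.compile(r"([一-龯])")          # kanji range, as listed above
KANA = re.compile(r"([ぁ-んァ-ン])")       # hiragana or katakana
NON_KANA = re.compile(r"([^ぁ-んァ-ン])")  # anything that is not hiragana or katakana

text = "日本語のテキスト"
print(KANJI.findall(text))      # ['日', '本', '語']
print(KANA.findall(text))       # ['の', 'テ', 'キ', 'ス', 'ト']
print(NON_KANA.findall(text))   # ['日', '本', '語']
```

Because each pattern has a single capturing group, `findall` returns the matched characters directly.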

	Regex for matching Hiragana or Katakana or basic punctuation (、。’)

	''' Script for downloading all GLUE data.

	Note: for legal reasons, we are unable to host MRPC.
	You can either use the version hosted by the SentEval team, which is already tokenized,
	or you can download the original data from (https://download.microsoft.com/download/D/4/6/D46FF87A-F6B9-4252-AA8B-3604ED519838/MSRParaphraseCorpus.msi) and extract the data from it manually.
	For Windows users, you can run the .msi file. For Mac and Linux users, consider an external library such as 'cabextract' (see below for an example).
	You should then rename and place specific files in a folder (see below for an example).

	mkdir MRPC
	cabextract MSRParaphraseCorpus.msi -d MRPC

	"""
	How to do minibatches for RNNs in pytorch

	Assume we feed characters to the model and predict the language of the words.
	"""


	def prepare_batch(x, y):
	    # determine the maximum word length per batch and zero pad the tensors
	    n_max = max([a.shape[0] for a in x])
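
The rest of `prepare_batch` is not shown in this preview. As a rough illustration of the zero-padding step its comment describes, here is a sketch that uses plain Python lists instead of tensors. The function name `pad_batch`, its return values, and the sample indices are assumptions and not the original code.

```python
def pad_batch(sequences, pad_value=0):
    """Right-pad every sequence with pad_value up to the longest length in the batch."""
    n_max = max(len(seq) for seq in sequences)   # longest sequence in this batch
    lengths = [len(seq) for seq in sequences]    # true lengths, useful for masking later
    padded = [list(seq) + [pad_value] * (n_max - len(seq)) for seq in sequences]
    return padded, lengths


# Hypothetical character indices for three words of different lengths.
batch = [[5, 2, 9], [7], [3, 8]]
padded, lengths = pad_batch(batch)
print(padded)    # [[5, 2, 9], [7, 0, 0], [3, 8, 0]]
print(lengths)   # [3, 1, 2]
```

Keeping the true lengths alongside the padded batch lets a later step ignore the padding positions, for example when packing sequences for an RNN.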