# [Building a dataset of Python versions with regular expressions](https://www.dataschool.io/web-scraping-with-regex/)

[Python Documentation by Version](https://www.python.org/doc/versions/)

In [1]:
import requests

In [2]:
# Fetch the page listing every released version of the Python documentation
r = requests.get('https://www.python.org/doc/versions/')

In [3]:
# These character offsets match the page as fetched here and will shift if the page changes
print(r.text[21646:22424])

<h1>Python Documentation by Version</h1>
<p>Some previous versions of the documentation remain available
online.  Use the list below to select a version to view.</p>
<p>For unreleased (in development) documentation, see
<a class="reference internal" href="#in-development-versions">In Development Versions</a>.</p>
<ul class="simple">
<li><a class="reference external" href="https://docs.python.org/release/3.11.2/">Python 3.11.2</a>, documentation released on 8 February 2023.</li>
<li><a class="reference external" href="https://docs.python.org/release/3.11.1/">Python 3.11.1</a>, documentation released on 6 December 2022.</li>
<li><a class="reference external" href="https://docs.python.org/release/3.11.0/">Python 3.11.0</a>, documentation released on 24 October 2022.</li>


## Extracting the dates

In [4]:
import re

In [5]:
# Day (one or more digits), month name (word characters), four-digit year
dates = re.findall(r'\d+ \w+ \d{4}', r.text)
dates[0:3]

['8 February 2023', '6 December 2022', '24 October 2022']
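
To see why this pattern picks out only the dates, it may help to run it offline on a small sample. The sketch below assumes the three list items from the excerpt above are representative of the page; `sample_html` and `sample_dates` are illustrative names, not part of the original notebook.

```python
import re

# Three list items copied from the page excerpt above (a stand-in for r.text)
sample_html = (
    '<li><a class="reference external" href="https://docs.python.org/release/3.11.2/">'
    'Python 3.11.2</a>, documentation released on 8 February 2023.</li>\n'
    '<li><a class="reference external" href="https://docs.python.org/release/3.11.1/">'
    'Python 3.11.1</a>, documentation released on 6 December 2022.</li>\n'
    '<li><a class="reference external" href="https://docs.python.org/release/3.11.0/">'
    'Python 3.11.0</a>, documentation released on 24 October 2022.</li>\n'
)

# \d+   one or more digits (the day)
# \w+   one or more word characters (the month name)
# \d{4} exactly four digits (the year)
# The literal spaces between the parts keep version numbers such as 3.11.2 from matching.
sample_dates = re.findall(r'\d+ \w+ \d{4}', sample_html)
print(sample_dates)
```

The version numbers in the URLs and link text are skipped because none of them is followed by a space, a word, another space, and four digits.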

## Extracting the version numbers

In [6]:
# Capture the text after "Python " that starts with a digit, up to the next "<"
versions = re.findall(r'Python (\d.+?)<', r.text)
versions[0:3]

['3.11.2', '3.11.1', '3.11.0']
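
In the pattern above, the `.` is unescaped, so it matches any character, and `+?` makes the match lazy so that it stops at the first `<`. The sketch below compares it with a stricter variant on sample markup. The stricter pattern is an assumption added for illustration, not something the original notebook uses, and `sample_html`, `lazy`, and `strict` are illustrative names.

```python
import re

# Three list items copied from the page excerpt above (a stand-in for r.text)
sample_html = (
    '<li><a class="reference external" href="https://docs.python.org/release/3.11.2/">'
    'Python 3.11.2</a>, documentation released on 8 February 2023.</li>\n'
    '<li><a class="reference external" href="https://docs.python.org/release/3.11.1/">'
    'Python 3.11.1</a>, documentation released on 6 December 2022.</li>\n'
    '<li><a class="reference external" href="https://docs.python.org/release/3.11.0/">'
    'Python 3.11.0</a>, documentation released on 24 October 2022.</li>\n'
)

# Original pattern: a digit, then as few characters as possible before the next "<"
lazy = re.findall(r'Python (\d.+?)<', sample_html)

# Stricter variant (illustrative): a digit followed only by word characters and literal dots
strict = re.findall(r'Python (\d[\w.]*)<', sample_html)

print(lazy)
print(strict)
```

On this sample both patterns return the same list. The stricter form would reject a match that contained spaces or markup, while still accepting suffixed versions such as `1.5.1p1`.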

## Creating the dataset

In [7]:
import pandas as pd

In [8]:
pd.DataFrame(zip(versions, dates), columns=['Version', 'Date'])

| | Version | Date |
|---|---|---|
| 0 | 3.11.2 | 8 February 2023 |
| 1 | 3.11.1 | 6 December 2022 |
| 2 | 3.11.0 | 24 October 2022 |
| 3 | 3.10.10 | 8 February 2023 |
| 4 | 3.10.9 | 6 December 2022 |
| ... | ... | ... |
| 184 | 1.5.2 | 30 April 1999 |
| 185 | 1.5.1p1 | 6 August 1998 |
| 186 | 1.5.1 | 14 April 1998 |
| 187 | 1.5 | 17 February 1998 |
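
Because `zip` stops at the shorter of its inputs, a mismatch between the two `findall` results would go unnoticed, and the dates above are still plain strings. The following is a minimal standard-library sketch of a length check plus date parsing, using the first five rows of the table as stand-in data. The `rows` name and the `'%d %B %Y'` format string are assumptions for illustration, and `%B` expects English month names under the default locale.

```python
from datetime import datetime

# First five rows of the table above (stand-ins for the full versions and dates lists)
versions = ['3.11.2', '3.11.1', '3.11.0', '3.10.10', '3.10.9']
dates = ['8 February 2023', '6 December 2022', '24 October 2022',
         '8 February 2023', '6 December 2022']

# zip() silently truncates to the shorter list, so confirm the lengths agree first
if len(versions) != len(dates):
    raise ValueError(f'{len(versions)} versions but {len(dates)} dates')

# Parse strings such as "8 February 2023" into date objects
rows = [(v, datetime.strptime(d, '%d %B %Y').date()) for v, d in zip(versions, dates)]

# Sort by release date, oldest first (the sort is stable, so ties keep their page order)
rows.sort(key=lambda row: row[1])
for version, released in rows:
    print(version, released.isoformat())
```

Once parsed, the dates sort chronologically rather than alphabetically, and the same list of tuples can be passed to `pd.DataFrame` with the `columns` argument used above.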
