# VGMdb HTML Data Extractors

> **Status:** Active
> **Category:** Data Extraction
> **Language:** Python 3
> **Script files:** `pullcatnumbersfromcollection.py`, `script to pull http addresses.py`

## Purpose

Two standalone scripts that extract structured data from saved VGMdb HTML pages: one pulls catalog numbers from a collection page, the other pulls and deduplicates all album URLs.

## Requirements

### Dependencies

Both scripts use the Python standard library only — no installation required.

```bash
# No pip install needed
# Requires: re, csv (stdlib)
```

## Input

Both scripts read from a file named `input.txt` placed in the same directory as the script.

| Item | Description | How to obtain |
|------|-------------|---------------|
| `input.txt` | Raw HTML source saved from a VGMdb collection or album page | In the browser, use Save As (plain text or HTML), or view the page source, copy all of it, and paste it into the file |

## Output

### `pullcatnumbersfromcollection.py`

| Item | Description |
|------|-------------|
| `outputcatnumbers.csv` | One catalog number per row, with a header row (`Extracted Text`) |

### `script to pull http addresses.py`

| Item | Description |
|------|-------------|
| `vgmdb_urls.csv` | Deduplicated list of VGMdb album URLs, one per row, with a header row (`VGMdb URL`) |

## Usage

### Extract catalog numbers from a collection page

```bash
python pullcatnumbersfromcollection.py
```

### Extract album URLs from any VGMdb page

```bash
python "script to pull http addresses.py"
```

## How Each Script Works

### `pullcatnumbersfromcollection.py`

Searches the HTML for `<span class="catalog album-game">...</span>` tags and writes the catalog number found in each one to its own CSV row. It is intended for VGMdb collection pages, where albums are listed with their catalog numbers visible in the page source.
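The extraction described above can be sketched roughly as follows. This is a minimal illustration, not the script's actual code: the regex, the function names `extract_catalog_numbers` and `write_csv`, and the whitespace stripping are all assumptions. Only the span class, the output filename, and the `Extracted Text` header come from this document.

```python
import csv
import re

# Assumed pattern: captures the text inside each catalog span.
# The real script may use a different expression.
CATALOG_SPAN = re.compile(
    r'<span class="catalog album-game">(.*?)</span>', re.DOTALL
)


def extract_catalog_numbers(html):
    """Return the text of every catalog span, in page order."""
    return [match.strip() for match in CATALOG_SPAN.findall(html)]


def write_csv(rows, path, header="Extracted Text"):
    """Write a header row followed by one value per row."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow([header])
        writer.writerows([row] for row in rows)
```

To mirror the script's behavior, read `input.txt`, pass its contents to `extract_catalog_numbers`, and hand the result to `write_csv` with `outputcatnumbers.csv` as the path. Duplicates are kept here, since the description above does not mention deduplication for catalog numbers.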

### `script to pull http addresses.py`

Uses a regex to find every string matching `http://vgmdb.net/album/<numeric_id>` anywhere in the file, deduplicates the matches with a set, then writes them to CSV. Although VGMdb is now served over HTTPS, the script matches `http://` URLs because that was the form present in the page source at the time of writing.
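As a rough sketch of the approach described above, the match-then-deduplicate step might look like this. The exact regex and the names `ALBUM_URL`, `extract_album_urls`, and `save_urls` are assumptions made for illustration; the output filename and the `VGMdb URL` header are taken from this document.

```python
import csv
import re

# Assumed pattern: plain http:// album URLs with a numeric ID.
ALBUM_URL = re.compile(r"http://vgmdb\.net/album/\d+")


def extract_album_urls(html):
    """Return the unique album URLs found anywhere in the text."""
    return set(ALBUM_URL.findall(html))


def save_urls(urls, path="vgmdb_urls.csv"):
    """Write the URLs under a single-column header row."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["VGMdb URL"])
        writer.writerows([url] for url in urls)
```

Because the pattern starts with `http://`, a link written as `https://vgmdb.net/album/123` does not match it.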

## Examples

```bash
# Step 1: Save your VGMdb collection page source to input.txt
# (In your browser: Ctrl+S or View Source → copy all → paste into input.txt)

# Step 2a: Extract catalog numbers
python pullcatnumbersfromcollection.py
# Output: Exported 142 items to outputcatnumbers.csv

# Step 2b: Extract album URLs
python "script to pull http addresses.py"
# Output: Saved 710 URLs to vgmdb_urls.csv
```

## Notes

- Both scripts hardcode the input filename as `input.txt` and output filenames as described above. To use different filenames, edit the `input_file` / `output_file` variables at the top of each script.
- `script to pull http addresses.py` only matches `http://` (not `https://`) URLs. Any `https://` album links in the saved page source are skipped, so a page that contains only HTTPS links yields no matches.
- Deduplication in the URL script uses an unordered `set`, so output row order is not guaranteed to be stable between runs.
- The `hold/` subdirectory contains archived CSV outputs from previous runs and is not used by the scripts.
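
If the HTTPS and row-order caveats above become a problem, one possible adjustment is sketched below. It is not part of the current scripts: the `https?` pattern and the function name `extract_album_urls_ordered` are hypothetical, and `dict.fromkeys` is swapped in for the `set` so that first-seen order is preserved.

```python
import re

# Hypothetical variant: accepts both http:// and https:// album URLs.
ALBUM_URL_ANY_SCHEME = re.compile(r"https?://vgmdb\.net/album/\d+")


def extract_album_urls_ordered(html):
    """Return unique album URLs in the order they first appear."""
    matches = ALBUM_URL_ANY_SCHEME.findall(html)
    # dict keys are unique and keep insertion order (Python 3.7+).
    return list(dict.fromkeys(matches))
```

In this sketch the `http://` and `https://` forms of the same album are treated as distinct strings, so both would appear in the output unless the scheme is normalized first.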

## Related Scripts

- [VGMdb Nicotine Shares](vgmdb-nicotineshares.md) — extracts catalog numbers from Nicotine+ JSON exports rather than HTML
- [VGMdb Redbook Check](vgmdb-redbookcheck.md) — takes VGMdb search URLs as input and checks each one via HTTP
