# Grab URLs

> **Status:** Active
> **Category:** VGMdb Scrapers
> **Language:** Python 3
> **Script file:** `collectionurls.py`

## Purpose

Extracts all VGMdb album URLs from a locally saved HTML file (e.g., a saved VGMdb collection page), deduplicates them, and exports the result to an Excel spreadsheet. A minimal, dependency-light alternative to the Docker-based live scraper for one-off or offline URL harvesting.

## Requirements

### Dependencies

```bash
pip install pandas openpyxl
```

| Package | Purpose |
|---------|---------|
| `pandas` | Building the DataFrame and writing to Excel |
| `openpyxl` | Excel file writer backend used by pandas |

## Input

| Item | Description | Example |
|------|-------------|---------|
| `input.html` | A saved HTML file containing VGMdb album links; must be in the same directory as the script | Save a VGMdb collection page via browser "Save Page As" |

The script looks specifically for URLs matching the pattern `http://vgmdb.net/album/<numeric ID>`.

### How to get `input.html`

1. Open your VGMdb collection page (or any VGMdb page containing album links) in your browser.
2. Use **File > Save Page As** (or Ctrl+S) and save as "Web Page, Complete" or "Web Page, HTML Only".
3. Rename or move the saved `.html` file to `input.html` in the same directory as `collectionurls.py`.

## Output

| Item | Description |
|------|-------------|
| `vgmdb_links.xlsx` | Excel file with a single column `URL` containing all deduplicated VGMdb album URLs found in the HTML |

## Usage

```bash
# Place input.html in the same directory as the script, then:
python collectionurls.py
```

### Options / Arguments

None. All configuration is implicit: `input.html` → `vgmdb_links.xlsx`.

## Examples

```bash
# Standard workflow
cp ~/Downloads/collection.html ./input.html
python collectionurls.py
# Output: Extracted 847 links to vgmdb_links.xlsx

# Quick check of how many links were found
python collectionurls.py && echo "Done"
```

## Notes

- The regex pattern used is `http://vgmdb\.net/album/\d+` (HTTP, not HTTPS). VGMdb collection pages typically embed `http://` URLs in their HTML even if the page is served over HTTPS. If no URLs are found, check whether your saved HTML uses `https://` links and update the regex in `collectionurls.py` line 9 accordingly:
  ```python
  urls = re.findall(r'https?://vgmdb\.net/album/\d+', html)
  ```
- Deduplication preserves first-occurrence order (`dict.fromkeys`).
- The script reads the entire HTML file into memory. Very large pages (500+ albums with full page assets saved) may use significant RAM; "HTML Only" save mode is recommended.
- `input.html` must be in the same directory the script is run from. There is no command-line argument to specify an alternative path; edit line 5 of `collectionurls.py` if you need a different path.
- The output file `vgmdb_links.xlsx` is always written to the current working directory.

## Related Scripts

- [docker-vgmdb-scraper](docker-vgmdb-scraper.md) — performs a live authenticated scrape of a VGMdb collection without needing to save the page manually; also exports to Excel
- [extract-urls-final](extract-urls-final.md) — extracts VGMdb URLs from `.url` Internet Shortcut files in a music library rather than from a saved HTML page
