# VGMDB Restructure — Two-Step URL Extractor + Album Scraper

> **Status:** Active
> **Category:** File Organization
> **Language:** Python 3
> **Script files:** `Step1 (place in directory).py`, `Step2.py`

## Purpose

A two-step pipeline that extracts VGMdb URLs from Windows `.url` shortcut files scattered across a music library, then scrapes album metadata for each URL and saves the results to a CSV. Used to build a structured dataset of album information from an existing collection.

## Requirements

### Dependencies

Step 1 has no external dependencies (standard library only).

Step 2 requires:

```bash
pip install requests beautifulsoup4 tqdm
```

## Input

| Item | Description | Example |
|------|-------------|---------|
| Music library folder | Directory tree containing Windows `.url` shortcut files | `D:\Music\FLAC\` |
| `.url` files | Windows internet shortcut files with a `URL=https://vgmdb.net/album/...` line | `Album info.url` |
| `urls.txt` (Step 2) | Output of Step 1; one VGMdb URL per line | Produced automatically |

## Output

| Item | Description |
|------|-------------|
| `urls.txt` | One VGMdb URL per line, written to the hardcoded output path (see Notes) |
| `vgmdb_data_YYYY-MM-DD_HHMM.csv` | Album metadata CSV with one row per URL |

### CSV columns

| Column | Description |
|--------|-------------|
| URL | VGMdb album page URL |
| Album ID | Numeric ID extracted from the URL |
| Album Title | English title scraped from the page |
| Release Year | Four-digit year from the Release Date field |
| Catalog Number | Catalog number (truncated before "Other" if multiple are listed) |
| Publish Format | e.g. `Commercial`, `Doujin/Indie` |
| Category | e.g. `Game`, `Animation` |
| Classification | e.g. `Original Soundtrack`, `Vocal` |
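
Three of these columns are derived from other text rather than copied as-is. The sketch below shows one way those derivations could work. The helper names, regular expressions, and sample values are illustrative assumptions, not the code used in `Step2.py`.

```python
import re


def album_id_from_url(url):
    """Return the numeric ID that follows /album/ in the URL, or 'Unknown'."""
    match = re.search(r"/album/(\d+)", url)
    return match.group(1) if match else "Unknown"


def release_year(release_date):
    """Return the first standalone four-digit number in the Release Date text."""
    match = re.search(r"\b(\d{4})\b", release_date)
    return match.group(1) if match else "Unknown"


def first_catalog_number(raw):
    """Keep only the text before the word 'Other' (assumed page layout)."""
    return raw.split("Other", 1)[0].strip() or "Unknown"
```

For example, `album_id_from_url("https://vgmdb.net/album/12345")` yields `"12345"`, and `first_catalog_number("ABCD-1234 Other: ABCD-1235")` yields `"ABCD-1234"` (both values are placeholders).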

## Usage

### Step 1 — Extract URLs

Copy `Step1 (place in directory).py` into the root of your music library and run it from there. The script walks the entire tree, finds every `.url` file, extracts each `URL=` value, and writes the values to `urls.txt`.

```bash
# Place the script inside your music library root, then run it
cd "D:\Music\FLAC"
python "Step1 (place in directory).py"
```
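
The walk-and-extract logic described above can be sketched as follows. This is a minimal illustration using only the standard library. The function names are hypothetical, and details such as the file encoding and the case-insensitive `URL=` match are assumptions rather than a copy of the actual script.

```python
import os


def extract_url(shortcut_path):
    """Return the value of the URL= line in a .url file, or None if absent.

    A typical shortcut file looks like:
        [InternetShortcut]
        URL=https://vgmdb.net/album/12345
    """
    with open(shortcut_path, "r", encoding="utf-8", errors="ignore") as handle:
        for line in handle:
            line = line.strip()
            if line.upper().startswith("URL="):
                return line[4:].strip()
    return None


def collect_urls(root):
    """Walk root recursively and gather the URL from every .url file found."""
    urls = []
    for folder, _dirs, files in os.walk(root):
        for name in files:
            if name.lower().endswith(".url"):
                url = extract_url(os.path.join(folder, name))
                if url:
                    urls.append(url)
    return urls
```

Writing the result is then a matter of joining the list with newlines and saving it to the output path. This sketch does not filter out shortcuts that point to sites other than VGMdb.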

### Step 2 — Scrape album metadata

Run `Step2.py` from the same directory that contains `urls.txt`. It reads the file, scrapes each VGMdb page with a 2-second delay between requests, and writes a timestamped CSV.

```bash
python Step2.py
```

Progress is shown via a `tqdm` progress bar. Failed requests produce a row of `Unknown` values rather than stopping the run.
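
The delay-and-fallback behaviour can be sketched roughly as below. The `scrape_one` callable stands in for whatever function fetches and parses a single page. It, `scrape_all`, and the `FIELDS` list are hypothetical names chosen for this illustration. Keeping the URL in a failed row is also an assumption.

```python
import time

# Metadata columns that fall back to "Unknown" when a page cannot be scraped.
FIELDS = ["Album Title", "Release Year", "Catalog Number",
          "Publish Format", "Category", "Classification"]


def scrape_all(urls, scrape_one, delay_seconds=2.0, sleep=time.sleep):
    """Scrape each URL in turn, pausing between requests.

    scrape_one(url) must return a dict of metadata fields. Any exception
    it raises is swallowed and replaced with a row of "Unknown" values.
    """
    rows = []
    for index, url in enumerate(urls):
        try:
            row = dict(scrape_one(url))
        except Exception:
            row = {field: "Unknown" for field in FIELDS}
        row["URL"] = url  # assumption: keep the URL so failed rows are traceable
        rows.append(row)
        if index < len(urls) - 1:
            sleep(delay_seconds)  # be polite to the server between requests
    return rows
```

Passing `sleep` as a parameter keeps the loop testable without real waiting. Wrapping `urls` in `tqdm(...)` before iterating would add the progress bar.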

## Examples

```bash
# Full two-step workflow

# Step 1: extract URLs from your library
cd "D:\Music\FLAC"
python "Step1 (place in directory).py"
# Output: C:\scripts\vgmdb-restructure\urls.txt  (hardcoded — see Notes)

# Step 2: scrape metadata
cd "C:\scripts\vgmdb-restructure"
python Step2.py
# Output: vgmdb_data_2026-05-01_1430.csv
```

## Notes

- **Hardcoded output path (Step 1):** The URL list is written to `C:\scripts\vgmdb-restructure\urls.txt` regardless of where Step 1 is run from. Edit the `output_file` variable in the script before using it on a different machine:
  ```python
  output_file = r"C:\scripts\vgmdb-restructure\urls.txt"
  ```
- Step 2 looks for `urls.txt` in the **current working directory**, not the hardcoded path. Either run Step 2 from the folder that Step 1 wrote to, or copy or move `urls.txt` into the folder you run it from.
- The 2-second delay between requests is intentional to avoid overloading VGMdb. Do not remove it for large batches.
- If a page cannot be fetched, the row is recorded with all fields set to `Unknown` and processing continues.
- Multiple CSV output files accumulate in the directory across runs (each gets a unique timestamp). Old files are not overwritten.
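
The timestamped file name follows the `vgmdb_data_YYYY-MM-DD_HHMM.csv` pattern. A minimal sketch of how such a name can be built is shown below. The function name `output_name` is hypothetical and the actual script may construct the name differently.

```python
from datetime import datetime


def output_name(now=None):
    """Build a CSV file name from the given time (defaults to the current time)."""
    now = now or datetime.now()
    return now.strftime("vgmdb_data_%Y-%m-%d_%H%M.csv")
```

Because this pattern has minute resolution, two runs started within the same minute would produce the same name.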

## Related Scripts

- [vgmdb-restructure-combined](vgmdb-restructure-combined.md) — single-script version that also records the local folder path alongside each URL
- [vgmdb-url-rename-combo-python](vgmdb-url-rename-combo-python.md) — creates `.url` shortcut files inside folders (the files that Step 1 reads)