# VGMdb Reprint Crawler — ChatGPT-Assisted Version

> ⚠️ **DEPRECATED** — This is an experimental ChatGPT-assisted rewrite of the reprint finder. It has been superseded by the final production version. Documented for reference only.

> **Status:** Deprecated
> **Category:** Deprecated
> **Language:** Python 3
> **Script files:** `v3.py`, `v4.py`

## Purpose

Crawls VGMdb album pages to find reprint relationships between albums, outputting results to a timestamped Excel file. This version was developed with ChatGPT assistance and introduced improved logging and a cleaner architecture compared to the earlier deprecated crawlers. `v4.py` is the more accurate of the two versions.

## Versions

| File | Notes |
|------|-------|
| `v3.py` | Original ChatGPT-assisted version — searches for "printing of" / "reprint of" text anywhere on the page |
| `v4.py` | Improved — restricts Scenario 2 to the **Catalog Number row only**, significantly reducing false positives. Recommended over v3. |

## Requirements

### Dependencies

```bash
pip install requests beautifulsoup4 openpyxl
```

Note: this version uses `openpyxl` directly rather than `pandas`, so its dependency list is leaner than that of the earlier deprecated crawlers.

## Input

| Item | Description | Example |
|------|-------------|---------|
| `urls.txt` | One VGMdb album URL per line, in the same directory as the script | `https://vgmdb.net/album/1234` |

## Output

| Item | Description |
|------|-------------|
| `results_YYYYMMDD_HHMMSS.xlsx` | Timestamped Excel file with the three columns listed under Output Columns |
| `crawler.log` | Append-mode log file written alongside the script |
| `urls.txt` (updated) | Successfully processed URLs are removed; failed ones remain for re-run |

### Output Columns

| Column | Description |
|--------|-------------|
| `Source URL` | The input album page URL |
| `Extracted URL` | A related album URL found on that page (blank if none found; `RUN AGAIN` if the page could not be fetched) |
| `Reprint Link (Scenario 2)` | Populated when the link was found via Scenario 2 (the Catalog Number row in v4); same value as Extracted URL in that case |
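The three-column layout can be sketched with `openpyxl` as follows. This is a minimal illustration, not the code from `v3.py` or `v4.py`: the helper names `results_filename` and `build_workbook`, the `(source, extracted, via_scenario2)` row shape, and the sample URLs are all assumptions made for the example.

```python
from datetime import datetime

from openpyxl import Workbook

HEADERS = ["Source URL", "Extracted URL", "Reprint Link (Scenario 2)"]

# Hypothetical sample rows: (source URL, extracted URL, found via Scenario 2?)
SAMPLE_ROWS = [
    ("https://vgmdb.net/album/1234", "https://vgmdb.net/album/5678", False),
    ("https://vgmdb.net/album/1234", "https://vgmdb.net/album/9012", True),
    ("https://vgmdb.net/album/4321", None, False),  # nothing found
]


def results_filename(now=None):
    """Build a name of the form results_YYYYMMDD_HHMMSS.xlsx."""
    now = now or datetime.now()
    return now.strftime("results_%Y%m%d_%H%M%S.xlsx")


def build_workbook(rows):
    """Fill a workbook with the header row plus one row per result."""
    wb = Workbook()
    ws = wb.active
    ws.append(HEADERS)
    for source_url, extracted_url, via_scenario2 in rows:
        # The third column repeats the extracted URL only for Scenario 2 hits.
        reprint_link = extracted_url if via_scenario2 else None
        ws.append([source_url, extracted_url, reprint_link])
    return wb


wb = build_workbook(SAMPLE_ROWS)
for row in wb.active.iter_rows(values_only=True):
    print(row)
```

Calling `wb.save(results_filename())` would then write the workbook next to the script under a timestamped name.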

## How It Works

### Scenario 1 — Other Printings Block
Looks for a `<td>` element containing the text "Other Printings". If one is found, every album link in the table that encloses that cell is extracted.
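A minimal sketch of this lookup with BeautifulSoup is shown below. The function name `find_other_printings`, the sample HTML, and the `/album/` href filter are assumptions for illustration; the real VGMdb markup and the actual scripts may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup; the real VGMdb page layout may differ.
SAMPLE_HTML = """
<table>
  <tr><td>Other Printings</td></tr>
  <tr><td><a href="https://vgmdb.net/album/5678">Reprint A</a></td></tr>
  <tr><td><a href="https://vgmdb.net/album/9012">Reprint B</a></td></tr>
</table>
<table>
  <tr><td><a href="https://vgmdb.net/album/1111">Unrelated</a></td></tr>
</table>
"""


def find_other_printings(html):
    """Return album links from the table holding an 'Other Printings' cell."""
    soup = BeautifulSoup(html, "html.parser")
    # Find the first <td> whose text contains the label.
    label_cell = next(
        (td for td in soup.find_all("td") if "Other Printings" in td.get_text()),
        None,
    )
    if label_cell is None:
        return []
    table = label_cell.find_parent("table")
    if table is None:
        return []
    # Keep only links that look like album pages (assumed URL shape).
    return [
        a["href"]
        for a in table.find_all("a", href=True)
        if "/album/" in a["href"]
    ]


links = find_other_printings(SAMPLE_HTML)
print(links)
```

Links in the second table are ignored because only the table that contains the label cell is searched.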

### Scenario 2 — Catalog Number Row (v3 vs v4)

- **v3:** Scans the full page for any text containing `printing of` or `reprint of` and extracts the adjacent album links. This is broad but prone to false positives.
- **v4:** Locates the `<span>` labelled "Catalog Number", then checks only the adjacent data cell for the same phrases, which makes it much more precise.
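The difference between the two strategies can be sketched as follows. The helper names (`scan_whole_page`, `scan_catalog_row`, `album_links`), the regular expression, and the sample markup are assumptions for illustration and are not copied from `v3.py` or `v4.py`.

```python
import re

from bs4 import BeautifulSoup

PATTERN = re.compile(r"printing of|reprint of", re.IGNORECASE)

# Hypothetical markup: one genuine reprint note in the Catalog Number row
# and one unrelated mention elsewhere on the page.
SAMPLE_HTML = """
<table>
  <tr>
    <td><span class="label">Catalog Number</span></td>
    <td>ABCD-0002 (reprint of <a href="https://vgmdb.net/album/1234">ABCD-0001</a>)</td>
  </tr>
</table>
<div class="notes">
  A comment: this is not a reprint of
  <a href="https://vgmdb.net/album/9999">some other album</a>.
</div>
"""


def album_links(tag):
    """Collect hrefs under a tag that look like album pages (assumed shape)."""
    return [a["href"] for a in tag.find_all("a", href=True) if "/album/" in a["href"]]


def scan_whole_page(html):
    """v3-style: match the phrase anywhere, take links from the enclosing element."""
    soup = BeautifulSoup(html, "html.parser")
    found = []
    for text_node in soup.find_all(string=PATTERN):
        for href in album_links(text_node.parent):
            if href not in found:
                found.append(href)
    return found


def scan_catalog_row(html):
    """v4-style: only inspect the cell next to the 'Catalog Number' label."""
    soup = BeautifulSoup(html, "html.parser")
    label = soup.find("span", string=re.compile("Catalog Number"))
    label_cell = label.find_parent("td") if label else None
    data_cell = label_cell.find_next_sibling("td") if label_cell else None
    if data_cell is None or not PATTERN.search(data_cell.get_text()):
        return []
    return album_links(data_cell)


print("whole page:", scan_whole_page(SAMPLE_HTML))
print("catalog row:", scan_catalog_row(SAMPLE_HTML))
```

On this sample the whole-page scan also picks up the link from the unrelated comment, which is the kind of false positive the row-restricted check avoids.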

## Usage

```bash
# Recommended — use v4 for better accuracy
python v4.py

# Original version
python v3.py
```

## Examples

```bash
# Create urls.txt with target album pages
echo "https://vgmdb.net/album/1234" > urls.txt
echo "https://vgmdb.net/album/5678" >> urls.txt

# Run the crawler
python v4.py
# → results_20260501_120000.xlsx
# → crawler.log updated
# → processed URLs removed from urls.txt
```

## Logging

Both scripts write to `crawler.log` in append mode. Each run adds timestamped entries for:
- Each URL processed (with progress counter)
- Each extracted link (labelled Scenario1 or Scenario2)
- Any HTTP errors (the failed URL is recorded with `RUN AGAIN` in the Extracted URL column of the Excel output)
- Final save path and remaining URL count
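An append-mode logger producing entries of this kind could be set up roughly as below. This is a sketch under assumptions: the logger name, the format string, and the message wording are illustrative, and the example writes to a temporary directory rather than next to the script.

```python
import logging
import os
import tempfile


def setup_logger(log_path):
    """Return a logger that appends timestamped lines to log_path."""
    logger = logging.getLogger("crawler")
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid stacking handlers on repeated calls
        handler = logging.FileHandler(log_path, mode="a", encoding="utf-8")
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    return logger


# A temporary directory keeps the example from touching a real crawler.log.
log_path = os.path.join(tempfile.mkdtemp(), "crawler.log")
logger = setup_logger(log_path)

urls = ["https://vgmdb.net/album/1234", "https://vgmdb.net/album/5678"]
for index, url in enumerate(urls, start=1):
    logger.info("[%d/%d] Processing %s", index, len(urls), url)

logger.info("Scenario1 link: %s", "https://vgmdb.net/album/9012")
logger.error("HTTP error for %s, marked RUN AGAIN", urls[1])
logger.info("Saved %s, %d URL(s) remaining", "results_20260501_120000.xlsx", 1)

with open(log_path, encoding="utf-8") as log_file:
    print(log_file.read())
```

Because the handler opens the file with `mode="a"`, a second run adds to the existing log instead of overwriting it.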

## Notes

- A 1-second delay is applied between requests to avoid rate-limiting.
- If a URL fails to fetch, it is written to the Excel file with `"RUN AGAIN"` in the Extracted URL column and is **not** removed from `urls.txt`, allowing for a re-run.
- The `backups/` and `masters/` subdirectories contain historical Excel output files from September–November 2025 runs.
- Several sample output `.xlsx` files from September 2025 runs are present at the root of the folder.
- The `urls.txt` file included in the repository root appears to be from a previous run and may not contain valid URLs.
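The delay and re-run behaviour described in the first two notes can be sketched as a small loop. The function `process_urls`, its injectable `fetch` callable, and the simulated failure are assumptions for illustration; the actual scripts fetch pages with `requests`.

```python
import time


def process_urls(urls, fetch, delay=1.0):
    """Fetch each URL in turn; return (rows, remaining).

    rows      -> (source URL, extracted URL) pairs for the spreadsheet
    remaining -> URLs that failed and should stay in urls.txt
    """
    rows, remaining = [], []
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay)  # pause between requests
        try:
            links = fetch(url)
        except OSError:
            # requests.RequestException is a subclass of OSError.
            rows.append((url, "RUN AGAIN"))
            remaining.append(url)
            continue
        if links:
            rows.extend((url, link) for link in links)
        else:
            rows.append((url, ""))  # page fetched, nothing found
    return rows, remaining


def fake_fetch(url):
    """Stand-in for a real HTTP fetch; fails for one URL on purpose."""
    if url.endswith("/5678"):
        raise ConnectionError("simulated network failure")
    return ["https://vgmdb.net/album/9012"]


rows, remaining = process_urls(
    ["https://vgmdb.net/album/1234", "https://vgmdb.net/album/5678"],
    fake_fetch,
    delay=0,  # no waiting in the example
)
print(rows)
print(remaining)
```

Writing `remaining` back to `urls.txt` after the run leaves only the failed URLs in the file, ready for the next attempt.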

## Related Scripts

- [VGMdb Reprint Finder (Deprecated Crawlers)](vgmdbcrawl-reprintfinder-deprecated.md) — the earlier generation of scripts this rewrites
- [XX Other Scripts Archive](xx-other-scripts.md) — earlier collection scraper versions
