# VGMdb Reprint Finder — Deprecated Crawlers

> ⚠️ **DEPRECATED** — These scripts are superseded by `vgmdb-v5-reprintfinder` and then [`vgmdb-reprintfinder-final`](../vgmdb-scrapers/vgmdb-reprintfinder-final.md). Documented for reference only.

> **Status:** Deprecated
> **Category:** Deprecated
> **Language:** Python 3
> **Script files:** `vgmdb_webcrawler_updatedv5.py`, `vgmdb_webcrawler_updatedmoreaccurate.py`, `vgmdb-reprintfinder-final.py`

## Purpose

These scripts crawl VGMdb album pages to identify reprint relationships between albums: cases where one release is marked as an "alternate printing of" or "bootleg printing of" another release, or is linked to another release through a shared catalog number. Results are exported to a timestamped Excel file.

## Version History

Three scripts are present in the `vgmdbcrawl-reprintfinder-deprecated/` directory, representing the evolution of the crawler:

| File | Description |
|------|-------------|
| `vgmdb_webcrawler_updatedv5.py` | v5 — two-scenario extraction (alternate printing text + catalog number pattern matching). Two-column output: `Extracted_URL`, `Source_URL`. |
| `vgmdb_webcrawler_updatedmoreaccurate.py` | "More Accurate" — same two scenarios but with stricter adjacent-text matching for Scenario 1 (only looks within the immediate parent element, not sibling elements). Two-column output. |
| `vgmdb-reprintfinder-final.py` | "Final" (still in this deprecated folder) — three-column output adding `Run_Next_Time` column; Scenario 1 matches (alternate/bootleg printing) are flagged for re-crawling. |

All three scripts share the same core architecture:

- Read URLs from `urls.txt` (one per line)
- Process each URL with a 1-second rate limit delay
- Try Scenario 1 (text match), then Scenario 2 (catalog number pattern) per page
- Write results to a timestamped `.xlsx` file
- Remove successfully processed URLs from `urls.txt`
- Highlight cross-matching URLs in yellow in the Excel output
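
The shared flow above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the scripts' actual code: `read_urls`, `crawl`, and the `process` callback are hypothetical names, and the real scripts fetch pages with `requests` and write the rows to Excel rather than returning them.

```python
import time
from pathlib import Path
from typing import Callable, List, Tuple

Row = Tuple[str, str]  # (Extracted_URL, Source_URL)


def read_urls(path: Path) -> List[str]:
    """Return the non-empty, stripped lines of the URL list."""
    text = path.read_text(encoding="utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]


def crawl(path: Path, process: Callable[[str], List[Row]], delay: float = 1.0) -> List[Row]:
    """Process every URL in `path`, then rewrite the file with only the failures."""
    urls = read_urls(path)
    rows: List[Row] = []
    failed: List[str] = []
    for i, url in enumerate(urls):
        try:
            rows.extend(process(url))
        except Exception as exc:  # keep the URL in the file for a later re-run
            print(f"failed: {url} ({exc})")
            failed.append(url)
        if i < len(urls) - 1:
            time.sleep(delay)  # rate limit between requests
    # Successfully processed URLs are dropped; failed ones remain.
    path.write_text("".join(f"{u}\n" for u in failed), encoding="utf-8")
    return rows
```

Passing the per-page logic in as `process` keeps the loop independent of which scenario code is used, which is one plausible way the three variants could share the same skeleton.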

## Extraction Strategies

### Scenario 1 — Text Match
Searches the page for the strings `"alternate printing of"` or `"bootleg printing of"` and extracts the adjacent album link. The "more accurate" version restricts the search to the immediate parent element only.
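
As a rough illustration of the text-match idea, the sketch below uses the standard library's `html.parser` instead of BeautifulSoup (a deliberate swap to keep the example dependency-free). It takes the first `/album/` link that appears after a trigger phrase anywhere later in the page, which is a simplification: it does not model the parent-element restriction of the "more accurate" variant. `PrintingLinkFinder` and `find_printing_link` are hypothetical names.

```python
from html.parser import HTMLParser
from typing import Optional

PHRASES = ("alternate printing of", "bootleg printing of")


class PrintingLinkFinder(HTMLParser):
    """Capture the first /album/ link that follows a trigger phrase."""

    def __init__(self) -> None:
        super().__init__()
        self.armed = False  # True once a trigger phrase has been seen
        self.match: Optional[str] = None

    def handle_data(self, data: str) -> None:
        if self.match is None and any(p in data.lower() for p in PHRASES):
            self.armed = True

    def handle_starttag(self, tag, attrs) -> None:
        if tag == "a" and self.armed and self.match is None:
            href = dict(attrs).get("href") or ""
            if "/album/" in href:
                self.match = href


def find_printing_link(html: str) -> Optional[str]:
    """Return the href of the album link after the phrase, or None."""
    finder = PrintingLinkFinder()
    finder.feed(html)
    finder.close()
    return finder.match
```

A link that appears before the phrase is ignored, so an unrelated album link earlier on the page does not count as a match.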

### Scenario 2 — Catalog Number Pattern
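Scans all links on the page for hrefs containing `/album/` whose link text matches a catalog number pattern (`[A-Z]{2,}\d+`, `CD`, `Vinyl`, or a 4-digit year). Because alternatives such as `CD`, `Vinyl`, and a bare year can also appear in ordinary link text, this scenario carries a higher false-positive risk than Scenario 1.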
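
The pattern check can be sketched with `re` alone. The combined regex below is an assumption assembled from the alternatives listed above; whether the original scripts use `search` or a full match, and how they express the year, is not documented here. `catalog_links` is a hypothetical helper that takes already-extracted `(href, text)` pairs.

```python
import re
from typing import Iterable, List, Tuple

# Assumed combination of the documented alternatives.
CATALOG_PATTERN = re.compile(r"[A-Z]{2,}\d+|CD|Vinyl|\b\d{4}\b")


def catalog_links(links: Iterable[Tuple[str, str]]) -> List[str]:
    """Return hrefs of /album/ links whose text matches the pattern."""
    return [
        href
        for href, text in links
        if "/album/" in href and CATALOG_PATTERN.search(text)
    ]


links = [
    ("/album/10", "KDSD10"),               # catalog-style text: kept
    ("/album/11", "Original Soundtrack"),  # no pattern match: dropped
    ("/artist/5", "KDSD10"),               # not an album link: dropped
    ("/album/12", "Reissue 2004"),         # kept only because of the year
    ("/album/13", "Vinyl edition"),        # kept only because of "Vinyl"
]
print(catalog_links(links))
```

The last two sample links show why the broad alternatives invite false positives: neither text is a catalog number, yet both pass.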

## Requirements

### Dependencies

```bash
pip install requests beautifulsoup4 pandas openpyxl
```

## Input

| Item | Description | Example |
|------|-------------|---------|
| `urls.txt` | Text file in the same directory as the script, with one VGMdb album URL per line | `https://vgmdb.net/album/1234` |

## Output

| Item | Description |
|------|-------------|
| `vgmdbcollection_YYYYMMDD_HHMMSS.xlsx` | Excel file with extracted URL pairs; matching URLs highlighted in yellow |
| `urls.txt` (updated) | Successfully processed URLs are removed; failed ones remain for re-run |
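
The timestamped file name in the table above can be produced with `datetime.strftime`. This is a small sketch; `output_filename` is a hypothetical helper, and the scripts may build the name differently.

```python
from datetime import datetime
from typing import Optional


def output_filename(now: Optional[datetime] = None) -> str:
    """Build a name of the form vgmdbcollection_YYYYMMDD_HHMMSS.xlsx."""
    now = now or datetime.now()
    return now.strftime("vgmdbcollection_%Y%m%d_%H%M%S.xlsx")
```

Accepting `now` as a parameter keeps the helper easy to test with a fixed timestamp.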

### Output Columns

| Script | Column A | Column B | Column C |
|--------|----------|----------|----------|
| v5 / more-accurate | `Extracted_URL` | `Source_URL` | — |
| final (3-col) | `Extracted_URL` | `Source_URL` | `Run_Next_Time` |
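
The difference between the two layouts can be illustrated with a small row builder. This is a sketch under a loud assumption: the source does not say what the `Run_Next_Time` cell contains, so the example assumes it repeats the extracted URL for Scenario 1 matches and is empty otherwise. `build_row` and its parameters are hypothetical.

```python
from typing import Dict


def build_row(
    extracted_url: str,
    source_url: str,
    scenario: int,
    three_column: bool = False,
) -> Dict[str, str]:
    """Build one output row; `scenario` is 1 (text match) or 2 (catalog pattern)."""
    row = {"Extracted_URL": extracted_url, "Source_URL": source_url}
    if three_column:
        # Assumption: flagged rows repeat the extracted URL; others stay empty.
        row["Run_Next_Time"] = extracted_url if scenario == 1 else ""
    return row
```

A list of such dictionaries can be passed directly to `pandas.DataFrame(...)` before export.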

## Usage

```bash
# Place urls.txt in the same directory as the script, then:
python vgmdb_webcrawler_updatedmoreaccurate.py

# Or the v5 version
python vgmdb_webcrawler_updatedv5.py

# Or the three-column "final" version
python vgmdb-reprintfinder-final.py
```

Each script pauses at the end and waits for Enter before closing (Windows-friendly behaviour).

## Notes

- All scripts normalise URLs to HTTPS before writing to Excel.
- The `junk versions.7z` archive in this folder contains earlier experimental builds.
- Three sample output `.xlsx` files from September–November 2025 are present in the folder.
- The "more accurate" variant was developed specifically to reduce false positives from Scenario 2's broad catalog number matching.
- Supersession path: these deprecated scripts → `vgmdb-v5-reprintfinder` → `vgmdb-reprintfinder-final` (the ChatGPT-assisted rewrite documented separately).
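
The HTTPS normalisation mentioned in the first note could look like the sketch below, which uses `urllib.parse` from the standard library. `to_https` is a hypothetical name, and the scripts may instead use a simple string replacement.

```python
from urllib.parse import urlsplit, urlunsplit


def to_https(url: str) -> str:
    """Rewrite an http:// URL to https://; leave other URLs unchanged."""
    parts = urlsplit(url.strip())
    if parts.scheme == "http":
        parts = parts._replace(scheme="https")
    return urlunsplit(parts)
```

Normalising before writing means the same album is not recorded under two different schemes, which matters when comparing URLs across columns.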

## Related Scripts

- [VGMdb Reprint Crawler (ChatGPT)](vgmdbcrawl-reprint-chatgpt.md) — experimental ChatGPT-assisted rewrite, also deprecated
- [XX Other Scripts Archive](xx-other-scripts.md) — earlier collection scraper versions
