# Reprint Finder v5

> **Status:** Superseded — use [Reprint Finder v6](reprintfinder-v6.md) for new work
> **Category:** VGMdb Scrapers
> **Language:** Python 3
> **Script file:** `v5.py`

## Purpose

Recursively crawls VGMdb album pages breadth-first to discover complete reprint chains, starting from a list of seed URLs in a plain text file. Outputs a styled, timestamped Excel workbook with a Summary sheet and a Relationships edge-list sheet. Superseded by v6, which adds persistent cross-session history, directory scanning for `.url` files, and an appended persistent workbook.

## Requirements

### Dependencies

```bash
pip install requests beautifulsoup4 openpyxl
```

| Package | Purpose |
|---------|---------|
| `requests` | HTTP requests to VGMdb |
| `beautifulsoup4` | HTML parsing |
| `openpyxl` | Writing the Excel workbook |

### Authentication

This script requires an authenticated VGMdb session. You **must** edit the two constants at the top of `v5.py` before running:

```python
USER_AGENT = r"Mozilla/5.0 ..."   # your browser's User-Agent string
COOKIE     = r"..."               # your vgmdb.net session cookie string
```

While logged in to vgmdb.net, open your browser's DevTools, select any request to vgmdb.net in the Network tab, and copy the `User-Agent` and `Cookie` request header values.
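
As a rough illustration of how the two constants might be used, the sketch below turns them into HTTP request headers. The helper `build_headers` is hypothetical and is not taken from `v5.py`; only the constant names `USER_AGENT` and `COOKIE` come from the script.

```python
# Hypothetical helper; only USER_AGENT and COOKIE are names from v5.py.
USER_AGENT = r"Mozilla/5.0 ..."   # placeholder, replace with your own value
COOKIE = r"..."                   # placeholder, replace with your own value


def build_headers(user_agent: str, cookie: str) -> dict:
    """Return HTTP headers carrying the browser identity and session cookie."""
    if not user_agent.strip() or not cookie.strip():
        raise ValueError("USER_AGENT and COOKIE must both be set")
    return {"User-Agent": user_agent, "Cookie": cookie}


headers = build_headers(USER_AGENT, COOKIE)
# With the requests package, this dict could then be passed as:
#   requests.get(url, headers=headers, timeout=30)
```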

## Input

| Item | Description | Example |
|------|-------------|---------|
| `urls.txt` | One VGMdb album URL per line; used as crawl seeds | `https://vgmdb.net/album/1234` |
| `urls_file` (optional arg) | Path to an alternative URL list file | `/path/to/my_seeds.txt` |

Example `urls.txt`:

```
https://vgmdb.net/album/1234
https://vgmdb.net/album/5678
```
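
A seed file in this format could be read along the following lines. This is a minimal sketch, not the code from `v5.py`: the function name `read_seed_urls` is hypothetical, and skipping blank lines and duplicate URLs is an assumption about sensible behaviour rather than something this page documents.

```python
from pathlib import Path
from typing import List


def read_seed_urls(path: str) -> List[str]:
    """Return the stripped, non-empty lines of the seed file in file order."""
    seen = set()
    urls = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        url = line.strip()
        # Assumption: blank lines and repeated URLs are ignored.
        if url and url not in seen:
            seen.add(url)
            urls.append(url)
    return urls
```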

## Output

| Item | Description |
|------|-------------|
| `results_YYYYMMDD_HHMMSS.xlsx` | Timestamped Excel workbook created in the current directory |
| `crawler.log` | Append-only crawl log written to the current directory |
| `urls.txt` (overwritten) | After the run, any failed URLs are written back to the input file for easy re-running |
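
The timestamped workbook name and the write-back of failed URLs can be sketched with the standard library alone. Both function names below are hypothetical. What `v5.py` does when no URL failed is not stated here; this sketch simply leaves the input file empty in that case.

```python
from datetime import datetime
from pathlib import Path
from typing import Iterable


def results_filename(now: datetime) -> str:
    """Build a name of the form results_YYYYMMDD_HHMMSS.xlsx."""
    return now.strftime("results_%Y%m%d_%H%M%S.xlsx")


def write_back_failed(urls_file: str, failed: Iterable[str]) -> None:
    """Overwrite the input file so it holds only the URLs that failed."""
    text = "".join(f"{url}\n" for url in failed)
    Path(urls_file).write_text(text, encoding="utf-8")
```

Calling `results_filename(datetime.now())` once at start-up gives each run its own output file.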

### Excel Sheets

| Sheet | Contents |
|-------|---------|
| **Summary** | One row per album crawled — URL, Was Seed?, Had Error?, count of Other Printings, count of Reprint Of, plus one column per individual link. Seed albums highlighted green; discovered reprints highlighted yellow; errors highlighted orange. |
| **Relationships** | Edge list: Source Album → Target Album with relationship type (Other Printing / Reprint Of) and seed/crawled flags |
| **Errors** | Present only if fetch failures occurred; lists failed URLs with a note to add them back to `urls.txt` |
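
To make the Relationships edge list concrete, the sketch below flattens per-album link lists into rows. The input shape, the function name `relationship_rows`, and the exact meaning of the two boolean flags (source is a seed, target was itself crawled) are assumptions for illustration; the real column layout in `v5.py` may differ.

```python
from typing import Dict, List, Set, Tuple

# Maps the link-list names used on this page to the sheet's relationship labels.
LABELS = {"other_prints": "Other Printing", "reprint_of": "Reprint Of"}

Row = Tuple[str, str, str, bool, bool]


def relationship_rows(crawled: Dict[str, Dict[str, List[str]]],
                      seeds: Set[str]) -> List[Row]:
    """Return (source, target, type, source_is_seed, target_was_crawled) rows."""
    rows = []
    for source, links in crawled.items():
        for key, label in LABELS.items():
            for target in links.get(key, []):
                rows.append(
                    (source, target, label, source in seeds, target in crawled)
                )
    return rows
```

Each tuple could then be appended to a worksheet as one row.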

## Usage

```bash
# Read urls.txt in the current directory
python v5.py

# Read a specific file
python v5.py /path/to/seeds.txt
```

### Options / Arguments

| Argument | Description | Default |
|----------|-------------|---------|
| `urls_file` | Path to the plain-text URL list | `urls.txt` in current directory |
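
The optional positional argument can be handled with a one-line fallback. This is a sketch of the behaviour described in the table, not the script's actual argument parsing; `resolve_urls_file` and `DEFAULT_URLS_FILE` are hypothetical names.

```python
from typing import List

DEFAULT_URLS_FILE = "urls.txt"  # resolved relative to the current directory


def resolve_urls_file(argv: List[str]) -> str:
    """Return the first positional argument, or the default when none is given."""
    return argv[1] if len(argv) > 1 else DEFAULT_URLS_FILE
```

In a script this would typically be called as `resolve_urls_file(sys.argv)`.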

## Examples

```bash
# Standard usage with urls.txt in the same folder
python v5.py

# Use a custom seed file
python v5.py my_albums.txt

# After a run with errors, re-run to retry failed URLs
# (failed URLs are automatically written back to urls.txt)
python v5.py
```

While running, a compact curses UI (~1/3 of terminal height) shows:
- Path to the URL file
- Queued, Crawled, and Error counts
- Currently-processing URL
- Live scrolling per-album log (green = success, red = error)
- Progress bar

Press any key to exit after the "Complete" message appears.

## How It Works

For each URL taken from the queue (seed URLs first), the crawler:

1. Fetches the album page.
2. Looks for an **"Other Printings"** table block and collects all album links found there.
3. Looks for a **"printing of" / "reprint of"** anchor in the Catalog Number row and collects those links.
4. Adds newly discovered `reprint_of` links to the BFS queue so that they are crawled in turn.
5. Records `other_prints` links without crawling them (they are siblings, not ancestors/descendants).

This continues until no new URLs are found.
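
The traversal described above can be sketched as a standard breadth-first search. To keep the example self-contained, page fetching and HTML parsing are replaced by a caller-supplied `fetch_links` function, which is a hypothetical stand-in that returns the `(other_prints, reprint_of)` link lists for one album or raises on failure. The function name `crawl` and the return shape are likewise assumptions, not the structure of `v5.py`.

```python
from collections import deque
from typing import Callable, Dict, Iterable, List, Tuple

Links = Tuple[List[str], List[str]]  # (other_prints, reprint_of)


def crawl(seeds: Iterable[str],
          fetch_links: Callable[[str], Links]) -> Tuple[Dict[str, Links], List[str]]:
    """Breadth-first crawl that follows only reprint_of links."""
    queue = deque()
    seen = set()
    for url in seeds:
        if url not in seen:
            seen.add(url)
            queue.append(url)

    results: Dict[str, Links] = {}
    errors: List[str] = []
    while queue:
        url = queue.popleft()          # FIFO order gives breadth-first traversal
        try:
            other_prints, reprint_of = fetch_links(url)
        except Exception:
            errors.append(url)         # remembered so it can be retried later
            continue
        results[url] = (other_prints, reprint_of)
        for target in reprint_of:      # other_prints are recorded, not queued
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return results, errors
```

The `seen` set is what guarantees termination: a URL is queued at most once, even when two albums point at each other.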

## Notes

- **Hardcoded credentials:** `USER_AGENT` and `COOKIE` (lines 38–39 of `v5.py`) must be replaced with your own values before running. The values shipped in the file come from a real session, not dummy text, but they will expire.
- **Rate limit:** 3-second delay between every HTTP request.
- **No persistent history:** Unlike v6, v5 has no memory between runs. Every run crawls all seed URLs fresh. Failed URLs are written back to `urls.txt` so they can be retried on the next run.
- **Fresh output file per run:** Each run produces a new `results_YYYYMMDD_HHMMSS.xlsx`; old files are never modified.
- **curses UI:** Requires a proper TTY. Use `screen` or `tmux` if running in a non-interactive context.
- **Crawler log:** `crawler.log` is opened in append mode; it will grow across multiple runs. Delete or rotate manually as needed.
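
The 3-second rate limit noted above could be enforced in several ways; `v5.py` may simply sleep for a fixed 3 seconds after each request. The sketch below uses a slightly different technique, spacing request start times at least `delay` seconds apart, so time already spent on parsing counts toward the wait. The class name `Throttle` and its injectable `clock` and `sleep` parameters are hypothetical.

```python
import time
from typing import Callable, Optional


class Throttle:
    """Block until at least `delay` seconds have passed since the last call."""

    def __init__(self, delay: float = 3.0,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self.delay = delay
        self._clock = clock
        self._sleep = sleep
        self._last: Optional[float] = None  # time of the previous request

    def wait(self) -> None:
        now = self._clock()
        if self._last is not None:
            remaining = self.delay - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling `throttle.wait()` immediately before each HTTP request keeps the crawl within the limit; the first call never sleeps.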

## Related Scripts

- [Reprint Finder v6](reprintfinder-v6.md) — current version; adds persistent history, `.url` file scanning, and an appended persistent workbook
- [extract-urls-final](extract-urls-final.md) — generates a URL list from `.url` shortcut files in a music library, suitable for use as `urls.txt`
