# Reprint Finder v6

> **Status:** Active
> **Category:** VGMdb Scrapers
> **Language:** Python 3
> **Script file:** `reprint-crawler-absolute-final.py`

## Purpose

Crawls VGMdb album pages breadth-first, following reprint links from each seed album until the complete reprint chain has been discovered. Maintains a persistent cross-session history of every URL ever checked, so re-runs skip already-crawled albums. Results are appended to a persistent `results.xlsx` workbook with four sheets: Summary, Relationships, Seen History, and Errors.

This is the current production version. See [Reprint Finder v5](reprintfinder-v5.md) for the predecessor.
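
The breadth-first traversal described above can be sketched roughly as follows. This is an illustration only, not the script's actual code: `crawl_reprint_chain` and `fetch_links` are hypothetical names, and `fetch_links` stands in for whatever function downloads an album page and returns the reprint URLs found on it.

```python
from collections import deque
from typing import Callable, Iterable, List, Set, Tuple


def crawl_reprint_chain(
    seeds: Iterable[str],
    fetch_links: Callable[[str], List[str]],  # hypothetical page fetcher
    already_crawled: Set[str],
) -> List[Tuple[str, str]]:
    """Walk reprint links breadth-first and return (source, target) edges."""
    queue = deque(seeds)
    queued = set(queue)
    edges: List[Tuple[str, str]] = []
    while queue:
        url = queue.popleft()
        if url in already_crawled:
            continue  # skip albums fetched earlier in this or a previous run
        already_crawled.add(url)
        for target in fetch_links(url):
            edges.append((url, target))
            # Enqueue each newly discovered album exactly once.
            if target not in queued and target not in already_crawled:
                queued.add(target)
                queue.append(target)
    return edges
```

Passing the fetcher in as a parameter keeps the traversal testable without network access; in a real crawler the rate-limit delay would live inside that fetcher.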

## Requirements

### Dependencies

```bash
pip install requests beautifulsoup4 openpyxl
```

| Package | Purpose |
|---------|---------|
| `requests` | HTTP requests to VGMdb |
| `beautifulsoup4` | HTML parsing |
| `openpyxl` | Writing and appending to the Excel workbook |

### Authentication

This script requires an authenticated VGMdb session. You **must** edit the two constants at the top of the file before running:

```python
USER_AGENT = r"Mozilla/5.0 ..."   # your browser's User-Agent string
COOKIE     = r"..."               # your vgmdb.net session cookie string
```

Copy these values from your browser's DevTools (Network tab, any request to vgmdb.net, copy the `User-Agent` and `Cookie` request headers).
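
As a rough sketch of how the two constants could be turned into request headers, the helper below validates them and builds the header dictionary. `build_request_headers` is a hypothetical name and the placeholder check is an assumption; the script may handle this differently.

```python
def build_request_headers(user_agent: str, cookie: str) -> dict:
    """Return HTTP headers for VGMdb requests, rejecting unedited placeholders."""
    for name, value in (("USER_AGENT", user_agent), ("COOKIE", cookie)):
        # Assumption: an empty string or a bare "..." means the constant was never edited.
        if value.strip() in ("", "..."):
            raise ValueError(f"{name} is empty or still a placeholder; edit it before running")
    return {"User-Agent": user_agent.strip(), "Cookie": cookie.strip()}
```

With `requests`, the resulting dictionary can be applied once to a session via `session.headers.update(...)` on a `requests.Session`, so every later request carries both headers.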

## Input

| Item | Description | Example |
|------|-------------|---------|
| `*.url` files | Windows Internet Shortcut files anywhere under the search directory | `Chrono Trigger OST.url` |
| `path` (optional positional argument) | Directory to scan recursively for `.url` files; defaults to `.` | `/mnt/music/library` |

The script scans the given directory recursively for all `*.url` files, extracts the VGMdb album URL from each, and uses those as crawl seeds. It does **not** read a plain `urls.txt` file (that is the v5 behavior).
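
The seed-collection step can be approximated as shown below. This is a minimal sketch under stated assumptions: the function names are hypothetical, each shortcut is assumed to contain a `URL=` line as Windows Internet Shortcut files normally do, and the `vgmdb.net/album/` filter is a guess at how non-album links might be excluded.

```python
from pathlib import Path
from typing import List, Optional


def read_shortcut_url(path: Path) -> Optional[str]:
    """Return the target of a Windows .url shortcut, or None if it has no URL= line."""
    text = path.read_text(encoding="utf-8", errors="replace")
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.upper().startswith("URL="):
            return stripped[4:].strip()
    return None


def collect_seed_urls(search_dir: str = ".") -> List[str]:
    """Recursively gather unique VGMdb album URLs from *.url files."""
    seeds: List[str] = []
    for shortcut in sorted(Path(search_dir).rglob("*.url")):
        url = read_shortcut_url(shortcut)
        # Assumption: only album pages are useful as crawl seeds.
        if url and "vgmdb.net/album/" in url and url not in seeds:
            seeds.append(url)
    return seeds
```

Note that `rglob("*.url")` matches case-sensitively on most Linux filesystems, so shortcuts named `*.URL` would need an extra pattern there.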

## Output

| Item | Description |
|------|-------------|
| `results.xlsx` | Persistent Excel workbook in the log directory; new data is appended on each run |
| `seen_urls.json` | JSON history file tracking every URL ever crawled or discovered |
| `crawler.log` | Append-only text log of all crawl activity |

All output files are written to `P:\VGMPP-LOGS\Reprints` (a Windows mapped drive). If `P:` is not mounted, or the script runs on a non-Windows system, it falls back to `./VGMPP-LOGS/Reprints` under the current directory.
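
One way the fallback could work is sketched below. The names `resolve_log_dir`, `PRIMARY_LOG_DIR`, and `FALLBACK_LOG_DIR` are hypothetical, and the drive check is an assumption about how "mounted" might be detected; the script's own logic may differ.

```python
import ntpath
import os
from pathlib import Path

PRIMARY_LOG_DIR = r"P:\VGMPP-LOGS\Reprints"
FALLBACK_LOG_DIR = os.path.join(".", "VGMPP-LOGS", "Reprints")


def resolve_log_dir(primary: str = PRIMARY_LOG_DIR,
                    fallback: str = FALLBACK_LOG_DIR) -> Path:
    """Pick the mapped-drive log directory if usable, else the local fallback."""
    # ntpath parses Windows-style paths consistently on every platform.
    drive, _ = ntpath.splitdrive(primary)
    drive_mounted = os.name == "nt" and bool(drive) and os.path.isdir(drive + "\\")
    chosen = Path(primary if drive_mounted else fallback)
    chosen.mkdir(parents=True, exist_ok=True)  # create the directory on first use
    return chosen
```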

### Excel Sheets

| Sheet | Contents |
|-------|---------|
| **Summary** | One row per album — URL, directory source, status, first-seen timestamp, counts of Other Printings and Reprint Of links. Seed albums highlighted green; discovered reprints highlighted yellow. Hidden columns hold the raw reprint URLs. Separator rows divide album families. |
| **Relationships** | Edge list: one row per directed relationship (Source Album → Target Album) with relationship type and seed/crawled flags |
| **Seen History** | Every URL ever recorded (both crawled sources and discovered variants) with ISO timestamps; rebuilt fresh on each run |
| **Errors** | URLs that failed to fetch; appended each run if any failures occurred |

## Usage

```bash
# Scan current directory for .url files and crawl
python reprint-crawler-absolute-final.py

# Scan a specific directory
python reprint-crawler-absolute-final.py /mnt/music/vgm-library
```

### Options / Arguments

| Argument | Description | Default |
|----------|-------------|---------|
| `path` | Directory to scan recursively for `*.url` files | `.` (current directory) |

## Examples

```bash
# Basic usage — scan current directory
python reprint-crawler-absolute-final.py

# Point at your music library root
python reprint-crawler-absolute-final.py "D:\Music\VGM"

# On Linux/Mac with a mounted network share
python reprint-crawler-absolute-final.py /mnt/synology/VSPLAY
```

While running, a compact curses UI (~1/3 of terminal height) shows:
- Current directory being scanned
- Log directory path
- History count, crawled count, skipped count, error count
- Live scrolling per-album log (green = success, red = error, magenta = skipped)
- Progress bar

Press any key to exit after the "Complete" message appears.

## Notes

- **Hardcoded credentials:** `USER_AGENT` and `COOKIE` (lines 43–44 of the script) must be replaced with your own values before running. The values shipped in the file came from a real session and will expire, so do not rely on them.
- **Rate limit:** The script waits 3 seconds between consecutive HTTP requests.
- **Skip logic:** URLs already recorded in `crawled_sources` in `seen_urls.json` are skipped without re-fetching. URLs only in `known_variants` (discovered as reprints but never fetched as a primary source) are crawled normally so their own reprint links can be captured.
- **Log directory:** Hardcoded to `P:\VGMPP-LOGS\Reprints`. On non-Windows systems or if `P:` is not mounted, the fallback `./VGMPP-LOGS/Reprints` is used automatically.
- **Persistent workbook:** `results.xlsx` is appended to, not overwritten. New family groups are added below existing data each run. Duplicate seeds (already present in the Summary sheet) are skipped.
- **Migration:** If you have a `seen_urls.json` from an older flat format (URL → timestamp), v6 automatically migrates it by treating all existing entries as `crawled_sources`.
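- **curses UI:** The terminal UI requires a proper TTY. Running via `nohup` or in a non-interactive shell will fail; use `screen` or `tmux` if needed. On Windows, standard Python builds do not include the `curses` module, so a third-party package such as `windows-curses` may also be required.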
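
The skip logic and migration notes above can be illustrated with a short sketch. The key names `crawled_sources` and `known_variants` come from this document; the function names are hypothetical, and storing each group as a URL-to-timestamp mapping is an assumption about the file layout.

```python
import json
from pathlib import Path


def load_history(path: Path) -> dict:
    """Load seen_urls.json, migrating the older flat URL -> timestamp format."""
    if not path.exists():
        return {"crawled_sources": {}, "known_variants": {}}
    data = json.loads(path.read_text(encoding="utf-8"))
    if "crawled_sources" not in data and "known_variants" not in data:
        # Old flat format: treat every existing entry as an already-crawled source.
        return {"crawled_sources": dict(data), "known_variants": {}}
    data.setdefault("crawled_sources", {})
    data.setdefault("known_variants", {})
    return data


def should_crawl(url: str, history: dict) -> bool:
    """Only URLs already fetched as a primary source are skipped."""
    return url not in history["crawled_sources"]
```

A URL that appears only under `known_variants` still returns `True` from `should_crawl`, which matches the behavior described in the skip-logic note.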

## Related Scripts

- [Reprint Finder v5](reprintfinder-v5.md) — predecessor; reads a plain `urls.txt` instead of scanning for `.url` files; no persistent history; produces a fresh timestamped `.xlsx` per run
- [extract-urls-final](extract-urls-final.md) — tool for generating a list of VGMdb URLs from `.url` shortcut files in your library (useful for building input for other scrapers)
- [VGMPP Cookie Tagger](vgmpp-cookie.md) — uses a similar cookie-auth approach to write FLAC tags based on VGMdb metadata
