# Extract URLs Final

> **Status:** Active
> **Category:** VGMdb Scrapers
> **Language:** Python 3
> **Script file:** `extract_urls_final.py`

## Purpose

Recursively scans a directory tree for Windows `.url` Internet Shortcut files, extracts the URL from each one, and writes the URLs found, one per line, to `output.txt`. While running, it displays a live progress UI in the terminal using Python's `curses` library. No third-party packages are required on platforms where Python ships `curses` (Linux, macOS).

## Requirements

### Dependencies

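Standard library only; no `pip install` is needed on Linux or macOS. On Windows, the standard CPython installer does not bundle `curses`, so the UI needs a third-party port such as `windows-curses`.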

```
os, sys, time, curses, configparser, pathlib, collections
```

## Input

| Item | Description | Example |
|------|-------------|---------|
| `search_dir` (optional arg) | Directory to scan recursively for `*.url` files; defaults to `.` | `/mnt/music/VGM` |
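
The recursive scan could be sketched as follows. This is a minimal illustration, not the script's actual code: the function name `find_url_files` is hypothetical, and the case-insensitive extension match is an assumption.

```python
import os
from pathlib import Path
from typing import Iterator


def find_url_files(search_dir: str = ".") -> Iterator[Path]:
    """Yield every *.url file under search_dir, directory by directory."""
    for dirpath, dirnames, filenames in os.walk(search_dir):
        # Sorting dirnames in place makes os.walk descend in a stable order.
        dirnames.sort()
        for name in sorted(filenames):
            # Assumption: match the extension case-insensitively,
            # so "Album.URL" is picked up as well.
            if name.lower().endswith(".url"):
                yield Path(dirpath) / name
```

Calling `list(find_url_files("/mnt/music/VGM"))` would return the shortcut files in alphabetical order within each directory.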

`.url` files are Windows Internet Shortcut files. Each file must contain a URL in one of these formats:

**INI-style (standard):**
```ini
[InternetShortcut]
URL=https://vgmdb.net/album/1234
```

**Raw URL line:**
```
https://vgmdb.net/album/1234
```

Both formats are handled automatically.
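
One way to handle both formats is to try `configparser` first and fall back to a line scan. The sketch below is an assumption about how this could be done, not the script's confirmed implementation; the function name `extract_url` and the `utf-8-sig` encoding choice are illustrative.

```python
import configparser
from pathlib import Path
from typing import Optional


def extract_url(path: Path) -> Optional[str]:
    """Return the URL stored in a .url file, or None if none is found."""
    # utf-8-sig strips a leading BOM if one is present (assumed encoding).
    text = path.read_text(encoding="utf-8-sig", errors="replace")

    # First attempt: INI-style [InternetShortcut] section with a URL key.
    # interpolation=None keeps "%" characters in URLs from being interpreted.
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    try:
        parser.read_string(text)
        url = parser.get("InternetShortcut", "URL", fallback=None)
        if url:
            return url.strip()
    except configparser.Error:
        pass  # Not valid INI; fall through to the raw-line scan.

    # Second attempt: the first line that looks like a bare http(s) URL.
    for line in text.splitlines():
        line = line.strip()
        if line.lower().startswith(("http://", "https://")):
            return line
    return None
```

A file that matches neither format yields `None`, which corresponds to the "skipped" case in the stats.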

## Output

| Item | Description |
|------|-------------|
| `output.txt` | One URL per line, written to the **current working directory** (not the search directory). An existing file is overwritten. |

## Usage

```bash
# Scan current directory
python extract_urls_final.py

# Scan a specific directory
python extract_urls_final.py /path/to/music/library
```

### Options / Arguments

| Argument | Description | Default |
|----------|-------------|---------|
| `search_dir` | Directory path to scan recursively | `.` (current working directory) |

Note: `output.txt` is always written to the current working directory, regardless of where `search_dir` points.
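
The relationship between the optional argument and the output location can be illustrated with a short sketch. The helper name `resolve_paths` is hypothetical and the code is an assumption about the behaviour described above, not an excerpt from the script.

```python
from pathlib import Path
from typing import List, Tuple


def resolve_paths(argv: List[str]) -> Tuple[Path, Path]:
    """Return (search_dir, output_file) for a given argument vector."""
    # argv[0] is the script name; the optional first argument is search_dir.
    search_dir = Path(argv[1]) if len(argv) > 1 else Path(".")
    # The output file is anchored to the current working directory,
    # independent of search_dir.
    output_file = Path.cwd() / "output.txt"
    return search_dir, output_file
```

With `sys.argv` passed in, running from `/home/user` against `/mnt/music/VGM` would scan the latter but write `/home/user/output.txt`.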

## Examples

```bash
# Extract all VGMdb URLs from a music library into output.txt
python extract_urls_final.py /mnt/synology/VSPLAY

# Scan the current directory
cd /mnt/music/VGM
python extract_urls_final.py

# On Windows (the standard installer lacks curses; a port such as windows-curses is needed)
python extract_urls_final.py "D:\Music\VGM"
```

## Terminal UI

While running, the script displays a compact coloured panel (~1/3 of the terminal height) showing:

- **Path:** the directory being scanned
- **Stats:** total files found, URLs extracted, files skipped (no URL found)
- **Scrolling log:** most recent files processed, colour-coded green (URL found) or red (no URL)
- **Progress bar:** percentage complete
- **Status line:** current file count vs. total

Press any key to exit after the "Done" message appears and `output.txt` has been saved.
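
The progress bar itself is simple to compute. The following sketch shows one way to turn a file count into a bar string that a `curses` window could then draw; the function name `render_progress`, the bar characters, and the default width are all assumptions rather than details of the script.

```python
def render_progress(done: int, total: int, width: int = 30) -> str:
    """Return a text progress bar for `done` out of `total` files."""
    # Guard against an empty scan so we never divide by zero,
    # and clamp so the bar never overflows its width.
    fraction = min(done / total, 1.0) if total else 1.0
    filled = int(fraction * width)
    bar = "#" * filled + "." * (width - filled)
    return f"[{bar}] {fraction * 100:3.0f}%"
```

Keeping the rendering in a pure function like this makes it testable without a terminal; only the final draw call needs `curses`.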

## Notes

- `output.txt` is written to the **current working directory**, not the scanned directory. If you want the file somewhere else, `cd` to that directory before running the script, or move the file afterwards.
- The script does not deduplicate URLs. If multiple `.url` files contain the same URL, it will appear multiple times in `output.txt`. Use `sort -u output.txt > deduped.txt` if you need unique URLs (note that this also reorders the lines alphabetically).
- Files are processed in sorted (alphabetical) order within each directory.
- A 20 ms artificial delay is added per file to keep the UI readable. On very large libraries this adds up: for 5000 files, 5000 × 0.02 s = 100 s, or about 1.7 minutes of UI time on top of disk I/O.
- If a `.url` file exists but contains no parseable URL (empty, corrupt, or non-standard format), it is counted as "Skipped" in the stats and logged in red.
- The `curses` UI requires a real terminal (TTY). Piping output or running via `nohup` without a terminal will cause the script to crash when `curses` starts. Use `screen` or `tmux` for remote sessions.
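
Where `sort -u` is unavailable (for example on Windows) or the original order matters, the deduplication mentioned above can be done in Python. This is a minimal sketch; the helper names `unique_in_order` and `dedupe_file` are hypothetical and not part of the script.

```python
from pathlib import Path
from typing import Iterable, List


def unique_in_order(urls: Iterable[str]) -> List[str]:
    """Drop repeated URLs while keeping the first occurrence of each."""
    # dict preserves insertion order (Python 3.7+), so its keys are the
    # URLs in first-seen order.
    return list(dict.fromkeys(urls))


def dedupe_file(src: Path, dst: Path) -> int:
    """Write the unique, non-empty lines of src to dst; return how many."""
    lines = [ln.strip() for ln in src.read_text(encoding="utf-8").splitlines()]
    unique = unique_in_order(ln for ln in lines if ln)
    dst.write_text("\n".join(unique) + ("\n" if unique else ""), encoding="utf-8")
    return len(unique)
```

For example, `dedupe_file(Path("output.txt"), Path("deduped.txt"))` would write the unique URLs in the order the scan found them.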

## Related Scripts

- [Reprint Finder v6](reprintfinder-v6.md) — can directly scan for `.url` files itself; `extract_urls_final.py` is useful when you need the URL list as a plain text file for other tools
- [VGMPP Cookie Tagger](vgmpp-cookie.md) — also walks a directory tree looking for `.url` files; `extract_urls_final.py` is a lightweight audit tool to see what albums have shortcut files before running the tagger
- [graburls](graburls.md) — alternative URL extraction from a saved HTML file rather than from `.url` shortcut files
