# VGMdb Nicotine+ Share Catalogue Extractor

> **Status:** Active
> **Category:** Data Extraction
> **Language:** Python 3
> **Script files:** `allyears.py`, `extracall.py`, `extract2021.py`

## Purpose

Three scripts that parse a Nicotine+ (Soulseek client) shared directory JSON export and extract catalogue numbers embedded in folder path names. The scripts differ in scope and output: a deduplicated CSV with folder paths across all years, a flat text list across all years, or a flat text list restricted to one year's folders.

## Requirements

### Dependencies

All scripts use the Python standard library only — no installation required.

```bash
# No pip install needed
# Requires: json, re, csv (stdlib)
```

## Input

All three scripts expect a file named `usershare_export.json` in the working directory.

| Item | Description | Example |
|------|-------------|---------|
| `usershare_export.json` | JSON export of a Nicotine+ shared directory listing | Exported from Nicotine+ via Shares → Export |

### JSON Structure

The file is expected to be a JSON array of entries. Each entry is itself a list where the first element (`entry[0]`) is a folder path string. Catalogue numbers are embedded in the path using bracket notation:

```
\\Music\\2021\\[LACA-9266~7] Artist - Album Title
\\Music\\2020\\[ANZX-15352] Artist - Album Title
```
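As a rough illustration of that structure, the sketch below parses a small hand-written sample and collects the folder paths. The sample data, the empty second element of each entry, and the name `folder_paths` are assumptions made for the example, and it assumes the doubled backslashes shown above are JSON escapes. The real export may carry more fields per entry.

```python
import json

# Hypothetical two-entry sample; a real run would instead use
# json.load() on the usershare_export.json file.
sample = r'''
[
  ["\\Music\\2021\\[LACA-9266~7] Artist - Album Title", []],
  ["\\Music\\2020\\[ANZX-15352] Artist - Album Title", []]
]
'''

entries = json.loads(sample)

# Keep only entries that are non-empty lists whose first element is a string.
folder_paths = [
    entry[0]
    for entry in entries
    if isinstance(entry, list) and entry and isinstance(entry[0], str)
]

for path in folder_paths:
    print(path)
```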

### Catalogue Number Pattern

All scripts match this regex: `\[([A-Z]{2,5}-\d{3,5}(?:~\d+)?)\]`

| Part | Meaning |
|------|---------|
| `[A-Z]{2,5}` | 2–5 uppercase letter label prefix |
| `-` | Literal hyphen |
| `\d{3,5}` | 3–5 digit number |
| `(?:~\d+)?` | Optional `~` followed by one or more digits, the range suffix used for multi-disc sets (e.g. `~7`) |

Examples matched: `LACA-9266~7`, `ANZX-15352`, `CPCA-10105`
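The pattern can be tried in isolation with a short snippet such as the one below. The sample paths and the names `CATALOGUE_RE` and `matches` are illustrative only; the third path is a made-up non-matching case (lowercase prefix, too few digits).

```python
import re

# Same pattern as documented above, compiled once for reuse.
CATALOGUE_RE = re.compile(r"\[([A-Z]{2,5}-\d{3,5}(?:~\d+)?)\]")

paths = [
    r"\Music\2021\[LACA-9266~7] Artist - Album Title",
    r"\Music\2020\[ANZX-15352] Artist - Album Title",
    r"\Music\2020\[ab-12] Not A Catalogue Number",
]

# findall returns the captured group, i.e. the number without brackets.
matches = [CATALOGUE_RE.findall(path) for path in paths]

for found in matches:
    print(found)
# ['LACA-9266~7']
# ['ANZX-15352']
# []
```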

## Output

| Script | Output File | Format |
|--------|-------------|--------|
| `allyears.py` | `catalogue_numbers_unique.csv` | CSV with columns: `Catalogue Number`, `Folder Path` |
| `extracall.py` | `catalogue_numbers_all.txt` | Plain text, one catalogue number per line |
| `extract2021.py` | `catalogue_numbers_2020.txt` | Plain text, one catalogue number per line |

## Scripts

### `allyears.py`

Scans every entry in the JSON, extracts all matching catalogue numbers, and deduplicates them. For each unique catalogue number, it stores the **first** folder path in which the number appeared, with the bracketed catalogue number stripped from the path. The output is a CSV encoded as UTF-8 with a BOM, sorted alphabetically by catalogue number.

**Use this when:** you want a deduplicated catalogue list with folder context, suitable for spreadsheet work.
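The behaviour described above could be sketched roughly as follows. This is not the script's actual code: the function names `first_paths` and `write_csv`, the sample entries, and the exact way the brackets are stripped (a regex substitution followed by trimming the ends) are assumptions for illustration. The demo writes into a temporary directory rather than the working directory.

```python
import csv
import os
import re
import tempfile

CATALOGUE_RE = re.compile(r"\[([A-Z]{2,5}-\d{3,5}(?:~\d+)?)\]")

def first_paths(entries):
    """Map each catalogue number to the first folder path it appears in."""
    seen = {}
    for entry in entries:
        path = entry[0]
        for number in CATALOGUE_RE.findall(path):
            if number not in seen:
                # Assumed stripping: drop bracketed numbers, trim the ends.
                seen[number] = CATALOGUE_RE.sub("", path).strip()
    return seen

def write_csv(mapping, out_path):
    # utf-8-sig writes a BOM first; newline="" is what the csv module expects.
    with open(out_path, "w", newline="", encoding="utf-8-sig") as handle:
        writer = csv.writer(handle)
        writer.writerow(["Catalogue Number", "Folder Path"])
        for number in sorted(mapping):
            writer.writerow([number, mapping[number]])

# Hypothetical entries; the third repeats a catalogue number already seen.
entries = [
    [r"\Music\2021\[LACA-9266~7] Artist - Album Title"],
    [r"\Music\2020\[ANZX-15352] Artist - Album Title"],
    [r"\Backup\[LACA-9266~7] Duplicate Copy"],
]

mapping = first_paths(entries)
with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "catalogue_numbers_unique.csv")
    write_csv(mapping, out_path)
    # Read the file back (utf-8-sig drops the BOM) to inspect the rows.
    with open(out_path, newline="", encoding="utf-8-sig") as handle:
        rows = list(csv.reader(handle))

print(rows)
```

Because only the first path is kept, the `\Backup\` copy of `LACA-9266~7` does not appear in the output.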

### `extracall.py`

Scans every entry, extracts all matches, deduplicates with a set, and writes sorted unique catalogue numbers to a plain text file.

**Use this when:** you want a clean flat list of all catalogue numbers across the entire share for further processing.
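A minimal sketch of the set-based deduplication, under the same caveats: `unique_numbers`, the sample paths, and the trailing newline in the output text are assumptions, not taken from the script itself.

```python
import re

CATALOGUE_RE = re.compile(r"\[([A-Z]{2,5}-\d{3,5}(?:~\d+)?)\]")

def unique_numbers(paths):
    """Collect every match into a set, then return the numbers sorted."""
    found = set()
    for path in paths:
        found.update(CATALOGUE_RE.findall(path))
    return sorted(found)

# Hypothetical folder paths, with one catalogue number repeated.
paths = [
    r"\Music\2021\[LACA-9266~7] Artist - Album Title",
    r"\Music\2020\[ANZX-15352] Artist - Album Title",
    r"\Backup\[LACA-9266~7] Duplicate Copy",
]

numbers = unique_numbers(paths)

# One catalogue number per line, as in catalogue_numbers_all.txt.
text = "\n".join(numbers) + "\n"
print(text, end="")
# ANZX-15352
# LACA-9266~7
```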

### `extract2021.py`

Same logic as `extracall.py` but only processes entries whose path string contains the literal text `2020`. Despite the script name containing "2021", the filter and output filename both reference 2020 — this appears to be a naming inconsistency in the original script.

**Use this when:** you want catalogue numbers restricted to a specific year's folder entries.
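The year filter can be pictured as a plain substring test on the path, as sketched below. The function `numbers_for_year` and its `year` parameter are hypothetical (the script hardcodes the value), and the sample paths are invented. A substring test also accepts any path where `2020` occurs elsewhere, for example inside a catalogue number, which the third sample path shows.

```python
import re

CATALOGUE_RE = re.compile(r"\[([A-Z]{2,5}-\d{3,5}(?:~\d+)?)\]")

def numbers_for_year(paths, year):
    """Return sorted unique catalogue numbers from paths containing `year`."""
    needle = str(year)
    found = set()
    for path in paths:
        # Plain substring check, not a match on a folder name component.
        if needle in path:
            found.update(CATALOGUE_RE.findall(path))
    return sorted(found)

paths = [
    r"\Music\2021\[LACA-9266~7] Artist - Album Title",
    r"\Music\2020\[ANZX-15352] Artist - Album Title",
    # Sits in a 2019 folder, but "2020" appears inside the number 12020.
    r"\Music\2019\[ABCD-12020] Artist - Album Title",
]

result = numbers_for_year(paths, 2020)
print(result)
# ['ABCD-12020', 'ANZX-15352']
```

If that looseness matters, comparing against the individual folder names in the path would be a stricter alternative.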

## Usage

```bash
# Deduplicated CSV with folder paths (all years)
python allyears.py

# Flat text file, all years
python extracall.py

# Flat text file, 2020 folders only
python extract2021.py
```

## Examples

```bash
# Place export next to the scripts, then run:
python allyears.py
# Output: Extracted 1842 unique catalogue number(s) with folder paths.
# File:   catalogue_numbers_unique.csv

python extracall.py
# Output: Extracted 1842 catalogue number(s) from all entries.
# File:   catalogue_numbers_all.txt

python extract2021.py
# Output: Extracted 214 catalogue number(s) from 2021.
# File:   catalogue_numbers_2020.txt
```

## Notes

- All three scripts hardcode the input filename as `usershare_export.json`. To use a different filename, edit the `json_file` variable at the top of each script.
- `allyears.py` outputs with UTF-8 BOM encoding (`utf-8-sig`) so it opens correctly in Microsoft Excel without encoding issues.
- `extract2021.py` is named after 2021 but filters for `2020` in the path and outputs to `catalogue_numbers_2020.txt`. This is a known inconsistency — verify the year filter if reusing this script for other years.
- The `test/` subdirectory contains development scratch files (`blah.py`, `New Text Document.py`) and a sample `usershare_export.json` used during testing. These are not part of the production workflow.
- The `New Text Document.py` file in the root of the directory is also a development scratch file.

## Related Scripts

- [VGMdb Redbook Dummy](vgmdb-redbookdummy.md) — also parses a share JSON export, but creates a dummy file structure instead of extracting catalogue numbers
- [VGMdb HTML Data Extractors](extract.md) — extracts catalogue numbers from VGMdb HTML pages instead of share exports
