Recover music files

Recently I had to reorganize my family's music collection.

Since some files were recovered from a corrupted drive, not all of them were correctly tagged or named (or even valid audio files). I also merged several sources: multiple computers, external drives, the recovered files, CDs, and a couple of MP3 players.

As a result, some tracks appear multiple times, possibly in different formats or under different names.

As the total size of the recovered files was more than 100GB, fixing the files one by one was not feasible, especially because I had no idea what music (which albums, artists, and so on) was part of the collection.

How I got there

I got in this messy situation because:

  • I did not have a reliable automated backup solution for big archives, so I had to recover part of the library from a corrupted drive

  • I used opinionated music players like iTunes at least once, and they renamed and retagged many files

  • I duplicated a subset of my library, as some (portable) players did not support my audio files or needed smaller files

  • I changed the metadata of some audio files at some point

  • I never defined a clear approach for collecting audio files, so they were saved on different devices with different conventions

  • I had no strategy for synchronizing the copies of all family members, and no central location with the complete audio collection

I finally decided to bring some order to this situation. This is what I did.

Remove empty files and directories and duplicates

First I removed all empty files and folders with find, as it would make little sense to process them in any way:

find ~/Music/recovered -empty -delete
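Since -delete is irreversible, a dry run can be useful first. The following is a minimal Python sketch that only lists zero-byte files and directories without entries; the function name find_empty is made up for this example. Unlike find with -delete, it does not report directories that would only become empty after their contents are removed.

```python
#!/usr/bin/env python3
"""Dry run: list empty files and empty directories without deleting anything."""
import os
import sys


def find_empty(root):
    """Yield zero-byte regular files and directories without entries under root."""
    for dirpath, dirnames, filenames in os.walk(root):
        if not dirnames and not filenames:
            yield dirpath  # directory with no entries at all
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Ignore symlinks; only real zero-byte files count.
            if (os.path.isfile(path) and not os.path.islink(path)
                    and os.path.getsize(path) == 0):
                yield path


if __name__ == "__main__":
    for path in find_empty(sys.argv[1] if len(sys.argv) > 1 else "."):
        print(path)
```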

I also removed all duplicate files with fdupes:

fdupes --recurse --delete ~/Music/recovered

Files count as duplicates only if their content is byte-for-byte identical. The name and location of the file are irrelevant.
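To illustrate what "identical content, regardless of name and location" means, here is a small Python sketch that groups files by a SHA-256 digest of their content. It is not how fdupes is implemented, only an approximation of the idea; file_digest and identical_groups are names chosen for this example.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks to limit memory use."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def identical_groups(root):
    """Return lists of paths under root whose content has the same digest."""
    by_digest = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path):
                # Only the content goes into the key, never the name or folder.
                by_digest[file_digest(path)].append(path)
    return [sorted(paths) for paths in by_digest.values() if len(paths) > 1]
```

Two files with different names in different folders end up in the same group as long as their bytes match, which is exactly why the name of the file that gets kept has to be chosen by hand.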

The process is interactive, and I paid particular attention to always remove files with meaningless names, like "ASKDFJ.mp4", or files that were located under "Unknown Artist" and "Unknown Album", preferring well-named files, like "01 - song title.mp3", or files located under an artist or album name (hoping that the location would be correct).

There are other types of duplicates. For example, "01 - song title.mp3" and "01 - song title (1).mp3" or "01 - song title-1.mp3" or "01 - song title.mp4" are probably duplicates, even if the files are not identical.
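Such name-based candidates could be spotted with a sketch like the one below. It is not what I ended up using, and the suffix pattern is an assumption based only on the examples above: a trailing " (1)" or "-1" and the file extension are dropped before comparing. A title that legitimately ends in "-9" would be misdetected, so the result is a list of candidates, not of confirmed duplicates.

```python
import os
import re
from collections import defaultdict

# Assumed copy markers at the end of a name: " (1)", "(2)", "-1", ...
_COPY_SUFFIX = re.compile(r"(\s*\(\d+\)|-\d+)$")


def normalized_title(path):
    """Return the file name without extension and without a trailing copy marker."""
    stem = os.path.splitext(os.path.basename(path))[0]
    return _COPY_SUFFIX.sub("", stem).strip().lower()


def probable_duplicates(paths):
    """Group paths that share a folder and a normalized title."""
    groups = defaultdict(list)
    for path in paths:
        groups[(os.path.dirname(path), normalized_title(path))].append(path)
    return [group for group in groups.values() if len(group) > 1]
```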

I do not know of any program like fdupes that would find such files and ask me which ones to remove (and note that not all of those files are necessarily duplicates!), so I had to put something together.

I used fpcalc to calculate the fingerprint of every audio file. As this process takes some time, I saved all file names and their corresponding fingerprints to a text file.

So I created the following shell script, fpcalcfilename:

#!/bin/sh

# Print "FINGERPRINT=<fingerprint> <filename>" for every file given as an argument.
for i in "$@"; do
  fingerprint="$(fpcalc "$i" | grep '^FINGERPRINT=')"
  printf '%s %s\n' "$fingerprint" "$i"
done

And then called it from find:

find ~/Music/recovered -type f -exec fpcalcfilename {} \; > a.out

After that, I had a file where every line looks like this (assuming no filename contains a newline character):

FINGERPRINT=<string without whitespaces> ./path/to/file.ext

As I wanted to process only the duplicate files, i.e. files with the same fingerprint, I reached for awk:

awk 'NR==FNR { dup[$0]; next; } $1 in dup' <(awk '{print $1}' a.out | sort | uniq -d) a.out > a.out2

This reduced the file size from 32MB to 20MB, so a big win. But those 20MB are still 4200 files to process. Going through every line and copy-pasting the filename to delete is still a lot of repetitive work.
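For readers less familiar with awk, the filter above can be sketched in Python as follows. The helper names first_field and keep_repeated are made up for this example; the file names a.out and a.out2 are the ones used above.

```python
from collections import Counter


def first_field(line):
    """Return the first whitespace-separated field, like $1 in awk."""
    fields = line.split(None, 1)
    return fields[0] if fields else ""


def keep_repeated(lines):
    """Keep only the lines whose first field (the fingerprint) occurs more than once."""
    lines = list(lines)
    counts = Counter(first_field(line) for line in lines)
    return [line for line in lines if counts[first_field(line)] > 1]


if __name__ == "__main__":
    with open("a.out") as src:
        repeated = keep_repeated(src)
    with open("a.out2", "w") as dst:
        dst.writelines(repeated)
```

The order of the lines is preserved, so files with the same fingerprint are not necessarily adjacent in the output.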

Something like fdupes, which would ask which file to keep, would be faster and less error-prone.

So I needed to:

  • read every line

  • list all unique fingerprints

  • for every fingerprint, associate all filenames

  • ask which file to keep

  • delete the others

As I felt there was too much logic, I avoided bash and preferred a more structured language. In this case, I decided to use Python:

#!/usr/bin/env python3
"""Interactively delete files that share an audio fingerprint (input: a.out2)."""

import os

# Map every fingerprint to the list of files that share it.
fingerprintmap = {}
with open("a.out2") as f:
    for line in f:  # FIXME: assumes no filename contains a newline
        # Strip only the newline, so trailing spaces in filenames survive.
        fingerprint, _sep, filename = line.rstrip("\n").partition(" ")
        fingerprintmap.setdefault(fingerprint, []).append(filename)

for filenames in fingerprintmap.values():
    # Ignore files that were already deleted in a previous, interrupted run.
    queriedfilenames = [name for name in filenames if os.path.isfile(name)]

    if len(queriedfilenames) < 2:
        continue  # only one file left, nothing to choose

    print("\nFollowing files are considered duplicates: ")
    for index, filename in enumerate(queriedfilenames):
        print("\t", index, " ", filename)
    print()
    tokeep = input("Enter number of file to keep, or skip (s): ")

    if tokeep == "s":
        continue
    # Drop the chosen file from the list, then delete everything that is left.
    queriedfilenames.pop(int(tokeep))
    for filename in queriedfilenames:
        os.remove(filename)

You might notice the lack of documentation, tests, best practices, error handling, and sanity checks (e.g. checking for links, duplicate entries, …). As the script was meant only for this task, I did not want to spend too much time on it.

The script also checks that each file still exists, in case I had to stop and restart it.

While removing files, I realized that without fpcalc or something similar, I would never have found some of the duplicates so easily.

For example:

Following files are considered duplicates:
         0   ./Howard Shore/The Lord of the Rings_ The Fellowship of the Ring Extended Edition/1-05 The Treasure of Isengard.mp3
         1   ./Lord of the Rings Soundtrack/Unknown Album/COIS.mp3
         2   ./Lord of the Rings Soundtrack/Unknown Album/Lord of the Rings Soundtrack - Pippin's Song.mp3

[...]

Following files are considered duplicates:
         0   ./Unknown Artist/Unknown Album/GNBY.mp3
         1   ./Unknown Artist/Unknown Album/Unknown Artist - CGHR.mp3

[...]

Following files are considered duplicates:
         0   ./Brian May/Driven by You/01 Driven By You.mp3
         1   ./The Queen/Unknown Album/06 Track 6.mp3

[...]

Of course, there could have been false positives. That is why I also implemented a skip option, which I used in a couple of cases.

The script is not safe in the sense described in https://rmlint.readthedocs.io/en/latest/cautions.html, so use this code with care.

There is surely value in having a generic fdupes function that can take an arbitrary hashing/fingerprinting function, but it was out of scope for my needs.
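A minimal sketch of what such a generic function could look like is shown below. The name group_duplicates and its key parameter are hypothetical; key can be any function that maps a path to a comparable value, for example a content hash or an audio fingerprint.

```python
from collections import defaultdict


def group_duplicates(paths, key):
    """Group paths by key(path) and return only the groups with several members.

    Paths for which key raises OSError or returns None are ignored.
    """
    groups = defaultdict(list)
    for path in paths:
        try:
            value = key(path)
        except OSError:
            continue  # unreadable or vanished file
        if value is not None:
            groups[value].append(path)
    return [group for group in groups.values() if len(group) > 1]
```

Called with key=os.path.getsize it gives a cheap first pass by file size; with a function that runs fpcalc on the file and returns the fingerprint, it would reproduce the grouping used in my script.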

Process files from the biggest folder

As I had good experiences with Picard and MusicBrainz, I mainly used them to add missing metadata to all my files.

Most files were already sorted correctly in a folder structure that resembled artist name/album name/cd track. title.ext.

I used Picard's ability to rename and move files to change the folder structure to album artist name/album name.

As there were a lot of folders, and letting Picard parse 100+ GB of mixed data at once is confusing, the only way to bring some order was to process the music collection piece by piece.

I used ncdu to find the biggest folders and started with those. This gave the best results, as they contained complete releases and gave me an idea of where the songs in smaller folders should go. Except for the "Various", "Various Artist", "Unknown Artist", and "Unknown Album" folders, I opened them one by one with Picard.
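The "biggest folder first" ordering can also be produced without ncdu. The sketch below is only an approximation: it sums apparent file sizes, while ncdu can also report disk usage, and the function names are invented for this example.

```python
import os


def directory_sizes(root):
    """Return {subdirectory name: total bytes of regular files below it}."""
    sizes = {}
    with os.scandir(root) as entries:
        for entry in entries:
            if not entry.is_dir(follow_symlinks=False):
                continue
            total = 0
            for dirpath, _dirnames, filenames in os.walk(entry.path):
                for name in filenames:
                    path = os.path.join(dirpath, name)
                    if os.path.isfile(path) and not os.path.islink(path):
                        total += os.path.getsize(path)
            sizes[entry.name] = total
    return sizes


def largest_first(sizes):
    """Return (name, size) pairs sorted from biggest to smallest."""
    return sorted(sizes.items(), key=lambda item: item[1], reverse=True)
```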

Many audio files had enough meta information for Picard to find the corresponding metadata, but for most of them I had to improvise, i.e. take into account the folder name and file name if they made any sense, and which tracks of the same artist or album I already had, and thus where it made the most sense to place a file. In some cases, there was no hint of what a file could be. Thanks to https://acoustid.org/ (integrated into Picard) I was able to get at least a title and an artist.

Note to future self: Submit all fingerprints of CDs or other tracks where I can prove that the information is correct, as this service has been a lifesaver for recovering a lot of files.

I instructed Picard to remove all existing metadata, as many files had tags created by iTunes or other music players (or coming from the music store where they were bought). I had no interest in this information and wanted to clean my files of this mess too.

Notes on a couple of tags…​

I noticed too late that I was probably deleting too much metadata, for example:

  • cddb

  • musicbrainzcdid

  • lyrics

The first two tags are useful in case I want to find out which CD release a track came from. They are also useful for music players when fetching metadata. Since I am embedding most metadata directly in the audio files, it is probably (hopefully) not that important.

I have probably never used the embedded lyrics, but they might be nice to have. Unfortunately, Picard does not search for lyrics (nor is there a plugin for it, AFAIK), so I may add them again later with a separate program.

Just in case, I have since instructed Picard to leave those tags alone, especially because I might add the lyrics in the future.

Reduce number of file types

While iterating over the shared library, I noticed there are primarily 4 audio file types: .flac, .mp3, .mp4 and .m4a. I explicitly decided not to reencode anything (yet), I just wanted to sort everything (or as much as possible) out.

But mp4 is a generic video and audio format.

AFAIK none of the .mp4 files had any video in it, but just to be sure I did not want to simply rename them to .m4a, which is the convention for denoting .mp4 files that contain only audio.

In fact, file gives different outputs for .mp4 and .m4a:

file.m4a: ISO Media, Apple iTunes ALAC/AAC-LC (.M4A) Audio
file.mp4: ISO Media, MP4 Base Media v1 [IS0 14496-12:2003]

So I used ffprobe to extract the information about the audio:

ffprobe file.mp4 2>&1 | grep Audio
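Instead of grepping the human-readable output, ffprobe can also be asked for JSON. The sketch below is not what I ran at the time; it assumes ffprobe is on the PATH, and the helper names are made up. Note that embedded cover art may be reported as a video stream, so a "not audio only" result still deserves a manual look.

```python
import json
import subprocess


def probe_streams(path):
    """Run ffprobe on path and return its JSON stream description as a string."""
    result = subprocess.run(
        ["ffprobe", "-v", "error",
         "-show_entries", "stream=codec_type,codec_name",
         "-of", "json", path],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def codec_types(ffprobe_json):
    """Return the codec_type of every stream listed in the ffprobe JSON output."""
    streams = json.loads(ffprobe_json).get("streams", [])
    return [stream.get("codec_type") for stream in streams]


def is_audio_only(ffprobe_json):
    """True if there is at least one stream and all streams are audio."""
    types = codec_types(ffprobe_json)
    return bool(types) and all(t == "audio" for t in types)
```

A typical call would be is_audio_only(probe_streams("file.mp4")).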

And then used ffmpeg to copy only the audio, without changing the encoding:

> ffmpeg -i file.mp4 -vn -acodec copy file.m4a
ffmpeg version 3.4.6-0ubuntu0.18.04.1 Copyright (c) 2000-2019 the FFmpeg developers
  ...
Input #0, mov,mp4,m4a,3gp,3g2,mj2, from 'file.mp4':
  Metadata:
    major_brand     : isom
    minor_version   : 512
    compatible_brands: isomiso2avc1mp41
  Duration: 03:31:24.60, start: 0.000000, bitrate: 33 kb/s
    Stream #0:0(und): Audio: aac (LC) (mp4a / 0x6134706D), 22050 Hz, mono, fltp, 30 kb/s (default)
    Metadata:
      handler_name    : VideoHandler
Output #0, ipod, to 'file.m4a':
  Metadata:
    major_brand     : isom
    minor_version   : 512
    compatible_brands: isomiso2avc1mp41
    encoder         : Lavf57.83.100
    Stream #0:0(und): Audio: aac (LC) (mp4a / 0x6134706D), 22050 Hz, mono, fltp, 30 kb/s (default)
    Metadata:
      handler_name    : VideoHandler
Stream mapping:
  Stream #0:0 -> #0:0 (copy)
Press [q] to stop, [?] for help
size=   48754kB time=03:31:24.55 bitrate=  31.5kbits/s speed=1.96e+04x
video:0kB audio:47682kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: 2.248258%

This also reduced the file size a little on all files.
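To apply the same ffmpeg call to a whole folder, a small wrapper like the following could be used. It is a sketch under a few assumptions: the output is written next to the input with the extension changed to .m4a, and the -n option (never overwrite an existing output file) is my addition to the command shown above. By default it only prints the commands.

```python
import os
import shlex
import sys


def mp4_files(root):
    """Yield every path under root whose name ends in .mp4 (case-insensitive)."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if name.lower().endswith(".mp4"):
                yield os.path.join(dirpath, name)


def remux_command(src):
    """Build the ffmpeg command that copies the audio stream of src into an .m4a file."""
    dst = os.path.splitext(src)[0] + ".m4a"
    return ["ffmpeg", "-n", "-i", src, "-vn", "-acodec", "copy", dst]


if __name__ == "__main__":
    # Dry run: print the commands so they can be reviewed before running them.
    for path in mp4_files(sys.argv[1] if len(sys.argv) > 1 else "."):
        print(shlex.join(remux_command(path)))
```

The original .mp4 files are left in place, so they can be removed once the new files have been checked.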

Conclusion

The library is much better organized than before.

I surely made multiple mistakes without noticing, like tagging some songs wrongly, leaving some duplicates, or removing/overwriting some files(!) or tags that I should not have.

But compared to before:

  • the size of the recovered library shrank from approx. 120GB to 60GB (a 60GB difference!)

  • tracks are consistently tagged, named, and located

  • cover art is saved as separate files

  • some invalid files (empty audio files) were pruned

  • one file format (mp4) was removed

I believe the result is worth the lost and incorrectly tagged files, as neither I nor anyone else would ever have noticed such issues in the old library.

Further improvements and considerations

Of course, I also wanted to avoid getting into the same situation again, considering how many days I needed to process all the files (and I have not processed them all yet).

The first and most important step is to define a backup strategy.

I do not have an ideal backup strategy yet, especially because it needs to work both on GNU/Linux and Windows. In the meantime, I’m using/testing git-annex with a remote server and an external drive. It has all the features I like and works on Windows too.

Unfortunately, Windows and FAT32 do not support symlinks very well. Hopefully, thanks to WSL2, Windows support will improve, while for FAT32 nothing will change, which makes git-annex hard to use on an external SD card (for example with an Android device) or on generic music players. git-annex is not an alternative to a serious automated backup solution, but thanks to the structured duplication(!) of data on different drives it can be used as a backup strategy.

Then I wanted to avoid unnecessarily duplicating/reencoding the data, as I had to do before to play my audio files on different devices, which led to multiple copies of the same file in different formats and qualities.

As of today, it is fortunately possible to play nearly all audio files unchanged on Android devices. Thanks to Rockbox I can also play my .ogg files unchanged on my old music player.

The synchronization process, if possible, is a simple diff between my ~/Music folder and the folder on the target device.
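Such a diff could be sketched as below. It compares relative file names only, not content or timestamps, and the function names are invented for this example; tools like rsync or git-annex do considerably more.

```python
import os


def relative_files(root):
    """Return the set of file paths under root, relative to root."""
    found = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            found.add(os.path.relpath(os.path.join(dirpath, name), root))
    return found


def sync_plan(source, target):
    """Return (files missing on the target, files present only on the target)."""
    src, dst = relative_files(source), relative_files(target)
    return sorted(src - dst), sorted(dst - src)
```

The first list contains what needs to be copied to the device, the second what could be removed from it.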