Designing better file organization around tags, not hierarchies (2017)
Hello everyone, thank you for all the comments. Seeing this on the HN front page caught me by surprise. In the past year I shared this article publicly (Reddit) and privately (with tech-savvy acquaintances) for comment, and the general sentiment I received was that these ideas were not ready to be read by a mass audience. The article is way too long and pulls in many disparate ideas; it explains both why traditional features are problematic and how new features would work better. In the end, it is unclear what a real implementation would look like, and what concrete benefits and annoyances would come out of real-world usage. I was hoping to build an ugly prototype before asking for feedback.
Regarding the comments on this HN thread, it seems the general discussion is around tagging. This is indeed the title of the article and the main idea that motivated my exploration, but I believe the other ideas are just as important. I explored notions like no-filenames, strong preference for hash addressing and references, location independence, immutability, backups and deduplication, preference for external (non-embedded) file metadata, first-class media libraries, and more.
I think the debate about tagging is quite adequate, and would be happy to hear comments about the other features/non-features, and whether all the ideas fit or don't fit cohesively as a system.
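To make the hash-addressing, deduplication and external-metadata ideas slightly more concrete, here is a minimal sketch of a content-addressed store. It is only an illustration under stated assumptions: `ContentStore`, `put` and `get` are hypothetical names, everything lives in memory, and SHA-256 is picked arbitrarily as the hash.

```python
import hashlib


class ContentStore:
    """Toy in-memory store: blobs are addressed by the SHA-256 of their bytes.

    Hypothetical illustration only; a real system would persist blobs on disk.
    """

    def __init__(self):
        self.blobs = {}      # hex digest -> bytes, never overwritten once stored
        self.metadata = {}   # hex digest -> set of tags, kept outside the blob

    def put(self, data, tags=()):
        digest = hashlib.sha256(data).hexdigest()
        # Identical content hashes to the same key, so it is stored only once.
        self.blobs.setdefault(digest, data)
        self.metadata.setdefault(digest, set()).update(tags)
        return digest

    def get(self, digest):
        return self.blobs[digest]


store = ContentStore()
a = store.put(b"holiday photo bytes", tags={"paris", "2017"})
b = store.put(b"holiday photo bytes", tags={"family"})
```

Saving the same bytes twice yields the same address, so `a == b` and only one blob is kept, while the tags from both calls accumulate in the external metadata.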
This is good thinking, and mostly overlaps with what I've tried to do a few times, so I would love to see it finally happen somehow.
For example, the Newton storage system I worked on at Apple 1990-1996 was based on separating organization from storage so we could have multiple (tag and/or hierarchy) organization systems. That eventually became the "soup" system in the shipping Newton OS, where objects were retrieved by content rather than hierarchy.
More recently, I spent some painful years working on Microsoft's WinFS, which had a ton of overlap with the principles here, and demonstrates just how hard it is to go from some nice principles that all seem like the Right Thing to an actual successful adopted implementation of the principles.
Have you considered a file system organized as a timeline that _also_ supports tagging?
I find that one key concept that isn't first-class is _when_ the file was modified. Rather than a file-and-folder physical analogy for the file system UI, I think a timeline-oriented UI could present some advantages for the way that humans actually think and work. Tags would be a helpful orthogonal organization scheme, but I don't think they work as a primary UI for navigation.
This is great work, though! I love the compilation of various other works, the references, and the way you've dug into the details!
>But fundamentally, there is a mismatch between the narrowness of hierarchies and the rich structure of human knowledge, and the proposed system will not presuppose the features of HFSes.
This hits the nail on the head!
All the file systems I have had to work with are fine as engineering tools. By that I mean that using them as an engineer works just fine; their own implementation is off topic.
As a user, though?
What the hell!
I don't want to go to c:/users/me/documents/talks/stockholm2018/draft3
I just want to open my document!
I really hope that someday we expose a document-based filesystem to the user.
The underlying implementation does not matter; we can always add a layer on top of the hierarchical file system.
I just want to be able to display:
- all the games installed on my system
- all the pictures
- all of my text documents
- all of my pictures of Paris
etc.
Every time I read an article about attempts at non-hierarchical filesystems, I try to figure out how I'd take the huge piles of stuff I generate when I'm drawing (and publishing) a graphic novel and reorganize it under tags. It's never pretty.
Like, okay, sure, I tag everything with the name of the project, that's a no-brainer. But if I just do that then I get the hundreds of files I generate (one per page) mixed up with everything else - web-res renderings of each page, model sheets (and their source files), promotional material, stuff sent to publishers to try and convince them to deal with that part of the process, and the huge mass of files I generate for each book I print (which can be more than one for a single multi-year project): source files tweaked for print, print-res renderings thereof, files for the kickstarter for each book... So I tag all of these attributes too, and imagining putting all these tags on a file as I save it sure is a lot of fun, even if I imagine some sort of save requestor that keeps a list of all my previously-used tags, including ways to filter those - I don't care about any of the tags attached to my music collection or my collection of cartoon porn or my programming projects when I'm working on my comics projects, for instance, so I'd want to quickly narrow it down to just tags found in my art projects, and...
Ultimately it just starts to look like a hierarchical structure in my mind, except for the fact that I'm interacting with it by some kind of tag-filtering file browser on top of a huge filesystem that mixes everything together in a non-human-browseable structure.
Tags are great BUT
It's pretty important to realize that a file's position is merely its default tag (and you can tag it further with many different types of systems, like extended attributes, as I think both Gnome and KDE have used at times).
Without that default tag you have a mess.
It's also important to note that tags have a very high maintenance cost of their own.
Duplicate, inconsistently applied and redundant tags are an aggressive cancer in any of these systems.
No, you can't just ignore them, as they make it more and more difficult to accomplish even basic viewing/scanning over files, for the system and the user.
Many, many users have trouble even doing basic maintenance on their file locations (that default tag), which makes a tag-based system even more prone to failure.
I spent last year ruminating on this (independently, though I find it interesting that the article was published around the same time I was voicing my ideas on this to a friend!) and toying with a few prototypes. This year I committed myself to hacking on a proper implementation of it (named 'libkoios' and 'koios'), using extended filesystem attributes. What I found interesting is that while there was a lot of prior work on tagging systems, none of them use the extended attributes system, which to me feels like a waste. However, there are problems with ext(2,3,4)'s implementation of file tags that make it difficult to store a lot of data without compression (I'm storing one bit per tag, which allows fast masking and comparison operations per file), so I guess that is understandable.
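For readers wondering what "one bit per tag" buys you, here is a rough sketch of the idea. It is not taken from libkoios; `TagRegistry`, `has_all`, `has_any` and `encode` are hypothetical names, and the step of actually writing the encoded bytes into an extended attribute is left out.

```python
class TagRegistry:
    """Assigns each tag name a fixed bit position (hypothetical scheme)."""

    def __init__(self):
        self.bit_of = {}  # tag name -> bit index, in order of first use

    def mask(self, tags):
        """Return an integer with one bit set for each given tag."""
        m = 0
        for tag in tags:
            bit = self.bit_of.setdefault(tag, len(self.bit_of))
            m |= 1 << bit
        return m


def has_all(file_mask, query_mask):
    """True if the file carries every tag in the query (a single AND)."""
    return (file_mask & query_mask) == query_mask


def has_any(file_mask, query_mask):
    """True if the file carries at least one tag in the query."""
    return (file_mask & query_mask) != 0


def encode(mask):
    """Pack a mask into bytes, e.g. to store as an attribute value."""
    return mask.to_bytes(max(1, (mask.bit_length() + 7) // 8), "little")


reg = TagRegistry()
photo = reg.mask(["paris", "2017", "photo"])
memo = reg.mask(["2017", "work"])
query = reg.mask(["paris", "photo"])
```

Here `has_all(photo, query)` is true and `has_all(memo, query)` is false, each decided with a single bitwise AND. The size problem mentioned above also shows up: the encoded mask grows by one byte for every eight tags in the registry, whether or not the file uses them.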
I believe that for image-based systems there is 8ch's /hydrus/ (probably the only good thing to come out of the chan-networks). One upshot of there being existing network sharing systems for tags is that it should be possible to scrape them when autotagging things (nobody, NOBODY, wants to manually tag hundreds of photo memes, which is the main foreseeable problem with file tagging).
I never personally used it, but I've heard the BeFS was designed to have significant non-hierarchical use cases:
https://en.wikipedia.org/wiki/Be_File_System:
> [BeFS] includes support for extended file attributes (metadata), with indexing and querying characteristics to provide functionality similar to that of a relational database.
IIRC, this was pretty hyped at the time, but they had to back away from it. I don't know if it was because the concept was too unfamiliar to people used to the hierarchical paradigm or because it didn't work as well in practice as it was imagined.
There's also a book about it written by its designer and now freely available: http://www.nobius.org/dbg/practical-file-system-design.pdf
Tagging is the first step, but how do you know if you don't have overlapping or duplicate tags, say country and folk music? If you need something that fits both categories, you eventually start designing taxonomies and eventually ontologies, there's just no end to it. I think tagging is a sensible, lightweight approach, but it has limitations...
I find it hard to remember if Google Docs originally had tags instead of folders. Or if I just imagined it. I can’t find it through googling.
I would rather use tags than folders, but can’t find good support in an operating system.
Google used to be the closest, since you could use the search bar as a command line and search queries as tags. There are no folders. But they have changed: now they try to guess what you’re looking for rather than use what you type.
OSX has tags, but their search is slow and inaccurate.
The closest I’ve come is using Gmail as an organizer, with inbox infinity rather than inbox zero: nothing organized other than with tags, and search to find anything.
Google Drive used to be based around tags, not hierarchies. It was wonderful. Then as it matured and catered to more and more 'normal' people it introduced the concept of folders. The folders were initially tags 'really' - the same file could exist in multiple folders at the same time. But they made that harder and harder, and now I think it's hierarchical folders through and through.
I miss the old days.
I found this to be a great collection of insights!
My criticisms:
Organizing your files and digital “stuff” has very little to do with the “rich structure of human knowledge,” to me, any more than organizing your kitchen or garage is an exercise in philosophy. The goal should be as usable a system as possible, full stop. Now, the actual content of the article is extremely practically oriented, so I have no beef with that. I just think people get carried away with the idea that storing a file is “representing knowledge,” and it takes them in weird directions like trying to create elaborate universal ontologies. The question is, is it easier or harder to find your files, and save your files?
Whenever a phrase like “representing knowledge” or “augmenting intelligence” comes up, it’s like everyone gets a boner, and then moves on to something unrelated, like (hopefully) usability.
Mutability: Everything changes. The only way to have immutable facts is to have timestamps. Image hosting sites, message boards, etc are misleading examples of file storage because they are really means of publishing. When you publish something, and people link to it, there’s a case for thinking of immutability as the default, though even then, most things that can be published can be retracted or edited. This comment can be edited after I publish it. I think true immutability as a default, for files as opposed to time-stamped facts, only makes sense in a very narrow domain.
I've used a self-hosted Booru for all of my image sorting. It took me roughly 2 months to upload and tag 53,000 images and another week of cleaning up rarely used or redundant tags. Since I use it for things other than the usual Artist/Series/Character hierarchies, my Artist/Series/Character fields refer to Topic/Subject/Details.
For example, visualizations of various algorithms would be filed under:
Topic: Computer Science / Subjects: Algorithms, Visualizations / Data: {Algorithm Name}
By personal restriction - something may only be filed under one topic, no more than three subjects, and can have as much data as is relevant. It gets stored under what I believe to be the primary subject.
The #1 problem is I add files to my filesystem without uploading and tagging them to the Booru. Also, since the only open-source Booru software I could find is quite dated/buggy, I'm often fighting the Booru for how I use it. Now that I think about it, this might be a good problem for me to solve myself.
ALL available metadata should be exposed and forced into tags, categories and folders.
Move everything file-like into the file system. Make emails into folders with tagged files. Link Torrents to their files and folders. Treat zips like folders. etc
Bit more on date range sliders, colors, files as tags and 3d models here:
https://steemit.com/filesystem/@gaby-de-wilde/how-a-file-sys...
An approach I find interesting is the [Perkeep](https://perkeep.org/) or Google Drive model for post hierarchy.
Storing all files as objects and then indexing them.
An interesting indexer for images would be one that groups objects by recognized faces or by EXIF data (camera model, GPS location, lens, date, etc.). Google Drive does this.
Perkeep can deal with tags, span devices, deal with permissions. Check out HN user @bradfitz
Every time I use a tagging-based system, I become more convinced that tags are what I want for almost all things, not just files.
I've wanted something along these lines for a long time as well. I have trouble drawing hard lines and distinctions (this is pervasive; things like having a "favorite" anything, or the desire to debate what genres a song or movie fall into, are rather alien to me). This makes picking "one" place for something difficult. Because these are fine/fuzzy distinctions for me, it's also tricky to reason my way back to where I would have put something.
The biggest directory in my document hierarchy is "flotsam".
I think part of the problem is that organizational tactics/schema/heuristics aren't global. We need an array of safe, high-quality tools with good system support/interfaces, and the knowledge to reason about how and when to use which. Patterns.
A stack is probably a fine way to think about organizing mail or clothes. It's probably less useful for deciding where furniture or paintings should go. A filesystem that made sense for organizing source code is probably not the best tool for organizing a movie collection or a lifetime of personal documents. Genre apparently seems like a great way to organize most of the world's movie, book, and music store/sections, but I (unless I can get someone to check the store's inventory system) never know whether what I'm looking for is out of stock or just hiding in the taxonomical hinterlands.
Search can help. Tags can help. Hierarchy can help. Metadata can help.
Just a braindump of what I don't like about tags instead of directories; no need to repeat the advantages (I agree with some of them):
- lack of identity. Somewhat watered down in presence of hardlinks/symlinks, but still much closer to identity than a tag cloud.
- does a file without a tag exist? It is conceptually very clear how removal of the path identity causes the actual bytes to go back into the free storage pool (wiggle a bit for hardlinks, but still pretty clear), it would be quite weird however to have files stick around based solely on secondary tags like "blue".
- lack of a consistent threshold for tagging: tags are binary, but relevance is not. If some files are tagged close to a full text index while others are tagged following a more minimalistic approach, the combined soup will not be very useful.
- too powerful for convenience: file creation in a hierarchical filesystem usually happens with a somewhat meaningful default: PWD, an app-wide default or an app-specific last used folder. The default sets some of the information you might put into tags, and this information is easily corrected if it was wrong, with a single operation that might be as easy as dragging to a different folder. "Is the default folder the right one for this file?" is easily answered and corrected; a default tag cloud, however, would require a full mental scan to check for applicability to the new file. Every attempt at making those defaults more clever would just force even more scrutiny onto the user.
https://web.archive.org/web/20070927003401/http://www.namesy...
Hans Reiser had some good ideas.
And by the way, there is a good analogy with the www: originally we had just the addresses (somewhat hierarchical), then we had the hierarchical catalogue of Yahoo, and then it quickly became too much for that, and we now rely on search.
I 100% agree that a tag based file system is better than a hierarchical or folder based system. The problem is that people seem to fall into one of two camps - too confused by how to make tags work, or too enamored with organizing things into folders.
A way to see folders & tags is to treat folders as tags, but with the restriction that an item may belong to only one folder. The same happens in biology, where scientists try to classify animals into one folder; a cat is put into the mammals folder (which has multiple hierarchies of subfolders). This classification system might be much more effective if it used tags instead, where items may belong to multiple folders (or categories). This solves problems like the platypus, which belongs to multiple "biological folders".
It seems that humans have a hard time reasoning with the tag concept. I'm not sure where this comes from; is it decades of working with the folder/subfolder idiom in Windows, which most people grew up with? Is it the resemblance to the physical world, where we also put documents into one folder and one folder only? Or is it our intuition to simplify matters, so that it seems simpler to have each item belong to exactly one container? I don't know; most likely, it's all the reasons above plus a few that I didn't mention.
Some relevant past discussions on similar articles: https://news.ycombinator.com/item?id=14537650 ; https://news.ycombinator.com/item?id=15492795
I'm going to really enjoy piecing apart the incredible detail put into this article.
https://tmsu.org/ This is nice.
I wouldn't really call a hierarchical FS a DAG (directed acyclic graph) because of the flaws with links you've already called out. It's not a true DAG.
Are there graph-structure filesystems?
Tagging has always seemed confusing to me because it seems like a degenerate case of a graph if your tags can't contain other tags. A graph that is only two levels deep doesn't have the flexibility of a real DAG. I'm having trouble visualizing the true correspondence between tagging and a graph structure, but I think they're pretty much the same thing if you can tag tags. Does that sound right?
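One way to picture the "tag the tags" correspondence is the small sketch below, where a query for a tag also matches files whose tags are themselves (transitively) tagged with it. The `parents` mapping, the `expand` and `find` helpers and the example file names are all hypothetical.

```python
def expand(tags, parents):
    """Return the given tags plus every tag reachable via tag-of-tag edges."""
    seen = set()
    stack = list(tags)
    while stack:
        tag = stack.pop()
        if tag in seen:  # already visited; this also guards against cycles
            continue
        seen.add(tag)
        stack.extend(parents.get(tag, ()))
    return seen


# Tags that are themselves tagged: "talk" is tagged "writing" and "work", etc.
parents = {
    "draft": {"writing"},
    "talk": {"writing", "work"},
    "writing": {"documents"},
}

# Files carry only their most specific tags.
files = {
    "stockholm2018.odp": {"talk"},
    "novel.txt": {"draft"},
    "cat.jpg": {"photo"},
}


def find(tag):
    """Names of files whose expanded tag set contains the given tag."""
    return sorted(n for n, tags in files.items() if tag in expand(tags, parents))
```

With this data, `find("documents")` returns both text files even though neither is tagged "documents" directly, and `find("work")` returns only the talk. Because "talk" has two parents, the structure is a DAG rather than a tree.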
Finding easy fast ways to navigate graphs in various UXs (shell, file explorers, etc) is an interesting challenge. Deletion is tricky.
I think this is fundamentally NOT how humans remember things - I think we are masters of "geospatial" memory compared to abstract unrelated concepts, and "geospatial" memory is probably organised in hierarchies.
Worse, it's next to impossible to efficiently explore a large tag-cloud compared to a hierarchical structure, which means it's much harder to learn about things organised in a tag-cloud compared to hierarchical tree, or a graph that is mostly tree-like.
As an example; it totally breaks the xkcd techsupport cheat sheet - you end up in the "click one at random" branch basically all the time. https://xkcd.com/627/
Obviously, tagging things is good too, but file systems and computers should emphasise the tree (a better one than the Windows file structure, though) rather than inventing a confusing cloud/fog of unrelated things.
The pain with tags is the overhead of generating them and semantic drift. Likely the best solution is simply search (some smart semantic search) with ad hoc tags to help.
Just use tagspaces (https://github.com/tagspaces/tagspaces).
I believe many developers and designers have been annoyed by hierarchies in filesystems.
But I wanted to comment on this: > there is a mismatch between the narrowness of hierarchies and the rich structure of human knowledge
Absolutely true, this is exactly what I think annoys us more than anything; it shows us how limited hierarchies are. But at the same time, I think it's very relevant to keep in mind that our knowledge and the mental relationships we can find between ideas are very hard to make explicit and complete, like you would ideally want in a tag-based filesystem. I feel serious tagging, if done manually, is quite expensive if we want it to be really useful (surely, we can also consider complementary automatic tagging, like AI). Hierarchies, on the other hand, might not be very expressive, but they are very simple to use in "most" cases. So I would say we're still far from getting the best of both worlds. The problem to "solve" is information organization/structuring, and not even humans handle that ideally (we are more like faulty search engines with random inputs, prone to forget XD).
About the other ideas, I think they are all interesting, agree a lot with hashes usage and no-filenames, not so convinced about metadata, but haven't really thought enough about it. I don't think we can talk about the ideas fitting cohesively or not yet (but hey, I don't even think links in HFS are cohesive from any perspective), we would have to see more formal proposals for implementation and interface. This said, I hope we see more work along these lines in the future, it's a very worthy field to explore! Maybe start small, testing some of the ideas, we get a lot of design insight when we are working on the implementation.
My pet peeve with tagging systems in general, but especially community-based tagging is false negatives. If I search using a tag, there's no guarantee at all that it will display ALL the items that qualify for it.
I'll use my favorite porn site as an example, without going into any specifics and especially linking. I just skimmed the HN Guidelines and I don't think I'm breaking any.
Suppose I try the tag #bigtits. It is highly unlikely that I will get all the pictures with women who have especially large breasts. It's because no one will review all the images and verify if the tag #bigtits applies to them. That would be very time-consuming even for the most motivated individual who uses both hands for typing. So if I were into that particular fetish, I would need to try #bigtits, then #busty, then #nicerack, #slimandbusty, #ygwbt... because each tag has its proponents, and there's definitely overlap between them. You could - and I've seen non-porn sites doing that - use a program for automatic tagging, but then in my opinion you are defeating the purpose of tagging, which is grouping things by interesting categories. Machine-generated tags tend to be lifeless.
As I've said it is a pet peeve of mine, and I will likely start a project or two to implement my fixes for a web framework or a static blog generator. I mean that I should have confidence that a tag has been considered for all content in the collection. Program-assisted tags can help, such as keeping track of what tags existed at the point when a picture was added.
Then there are almost identical tags. #cat vs #cats, #tortoise vs #turtle, #color vs #colour.
Overall, in practice, I think tagging, as usually implemented, is the most overrated feature of the Web 2.0 era.
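The "keep track of what tags existed when an item was added" idea could look roughly like the sketch below. It is only one possible reading of that fix; the `Collection` class and its fields are hypothetical names, and the example items reuse the near-duplicate tags mentioned above.

```python
class Collection:
    """Tracks, per item, which tags existed when the item was added."""

    def __init__(self):
        self.vocabulary = set()  # every tag seen so far
        self.applied = {}        # item -> tags actually applied to it
        self.considered = {}     # item -> tags that existed when it was added

    def add(self, item, tags):
        self.vocabulary |= set(tags)
        self.applied[item] = set(tags)
        # Snapshot the vocabulary: these are the tags the tagger could have used.
        self.considered[item] = set(self.vocabulary)

    def search(self, tag):
        """Return (items with the tag, items that never had a chance to get it)."""
        hits = sorted(i for i, t in self.applied.items() if tag in t)
        unreviewed = sorted(i for i, c in self.considered.items() if tag not in c)
        return hits, unreviewed


c = Collection()
c.add("a.jpg", {"cat"})
c.add("b.jpg", {"cat", "tortoise"})
c.add("c.jpg", {"turtle"})
```

A search for "turtle" then returns `c.jpg` as a hit and flags `a.jpg` and `b.jpg` as unreviewed, because the tag did not exist when they were added. That at least makes the possible false negatives visible.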
I purchased the Sony Digital Paper system (DPT-RP1) and it has possibly the most ill-conceived file system design possible. All files are stored in a flat directory on the device (i.e. one long list).
Users on the Sony community site are frequently looking for updates to the software. I'm curious which file organization solution, hierarchy or tags, would be easier to implement?
I'm a big believer in tags. I tried to make a tag-based tool that merely relies on directory names as tags, so ~/t/.tag1/.tag2/tag3/your-file-or-directory and moves your files around so that the tag directories are always organized by tag counts.
I had the idea there would be a series of tmv tcd tls commands that work with the tag directory structure.
https://github.com/foucist/tagmv
Warning - The regexp I'm using is likely broken; I suspect directories with .git/ or other dot-directories in them cause issues. It sometimes causes the .git/ innards to be moved out into the project directory or something like that. I never got around to fixing it.
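The "organized by tag counts" layout can be sketched as follows. This is not taken from the tagmv source: `tag_paths` is a hypothetical helper, the example file names are made up, ties are broken alphabetically as an assumption, and the dot-prefix convention from the path above is not reproduced.

```python
from collections import Counter
from pathlib import PurePosixPath


def tag_paths(items, root="t"):
    """Place each item under its tags, most frequently used tag first."""
    counts = Counter(tag for tags in items.values() for tag in tags)
    paths = {}
    for name, tags in items.items():
        # Higher count first; ties broken alphabetically (an assumption).
        ordered = sorted(tags, key=lambda t: (-counts[t], t))
        paths[name] = str(PurePosixPath(root, *ordered, name))
    return paths


items = {
    "page01.psd": {"comic", "source"},
    "page01.png": {"comic", "web"},
    "flyer.pdf": {"comic", "promo", "web"},
}
paths = tag_paths(items)
```

Since "comic" is the most common tag it becomes the top-level directory for all three items, and `flyer.pdf` ends up at `t/comic/web/promo/flyer.pdf`. Adding or removing a tag elsewhere can change the counts, which is why files have to be moved around to keep the layout consistent.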
OFTN OSWG started work on a tag-based filesystem called TPFS[1] back in 2011. The Haskell source code might be useful to anyone interested in developing platforms that use tag-based filesystems.
As I recently inherited a huge, well-tagged music collection (tens of terabytes of files) I am very interested in this. Is there something like it, also supporting .cue files and also storing the original filenames and structure? A mediaplayer agnostic way to access this treasure trove would be the best.
Since 1 June 2012, I've been taking notes in unicode text files, which contain (occasional or adjacent) lines starting with 'nb ' and then a list of tags. I wrote a simple tool ("nb") in Inferno's shell (thanks to Robert J. Ennis for the port to Plan 9's rc), to (1) search for given keywords in per-directory index files pointed to by the global index, (2) index all of the nb lines in files in the current directory, and (3) if necessary, append, to a global index file, a reference to the index file in the current directory.
https://github.com/catenate/notabene
I've found that I'm comfortable with the eventual consistency this offers, in exchange for fast lookups when I want something (as opposed to indexing first, and/or indexing globally, and so waiting for indexing to get a result). This distributed-file approach also allows me to add tags to a variety of files: local files, or networked file-system files, or sshfs-mounted files, or Dropboxed files, or files under version control, or files with varying text formats; and find tags across all of them and across all the time I've been indexing.
It runs in linear time with respect to the number of tags I've entered, plus the time to read and process the global index, so obviously there are many ways I could improve the time performance (as an easy example, I could permute the index to list all the tags in alphabetical order, and next to each tag list the files that contain that tag).
I also wrote other tools, since the layout is so simple: for example, "nbdoc", to catenate the actual contents of the references returned by the primary tool (nb); and "so" (second-order), to return all the tags which appear in any nb line with the given tag(s).
I've also found that it's not easy for me to remember what tags I might have used in the past, or how I was thinking about something, so I try to use the conjunction of several tags to narrow down search results, rather than try to remember one specific tag (this seems to correspond to the observation that it can be difficult to remember exactly where in a hierarchy you put something).
The modular approach, of per-directory indexes referenced in a global file, also makes it easy for me to combine work-specific notes, with public notes, with private notes, all in the same global index file, at work; but only have the same public and private notes at home.
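For anyone who wants to try the idea without Inferno or Plan 9, here is a rough Python sketch of the same scheme. It is not a port of the actual tool: the function names are hypothetical, tags are assumed to be whitespace-separated, a conjunction is assumed to match within a single nb line, and the per-directory and global index files are collapsed into one in-memory dictionary.

```python
def nb_lines(text):
    """Yield the tag set of every line that starts with 'nb '."""
    for line in text.splitlines():
        if line.startswith("nb "):
            yield set(line[3:].split())


def build_index(files):
    """Map each file name to the list of tag sets found on its nb lines."""
    return {name: list(nb_lines(text)) for name, text in files.items()}


def search(index, *tags):
    """Files with at least one nb line containing all the given tags."""
    wanted = set(tags)
    return sorted(n for n, lines in index.items()
                  if any(wanted <= line for line in lines))


def second_order(index, *tags):
    """All other tags that share an nb line with the given tags."""
    wanted = set(tags)
    found = set()
    for lines in index.values():
        for line in lines:
            if wanted <= line:
                found |= line - wanted
    return sorted(found)


files = {
    "notes/fs.txt": "nb filesystem tags\nTagging beats folders sometimes.\nnb plan9 shell\n",
    "notes/mail.txt": "nb mail tags search\nInbox infinity.\n",
}
index = build_index(files)
```

Here `search(index, "tags", "search")` narrows the result to `notes/mail.txt`, and `second_order(index, "tags")` lists the tags that have been used alongside "tags", which helps when you cannot remember the exact tag you chose.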
I wonder if Microsoft would be willing to take another shot at WinFS? It would have met most of the requirements. But the project bogged down and never shipped.
git-annex might have some interesting thoughts about this: https://git-annex.branchable.com/tips/metadata_driven_views/
Have you seen "Collaborative Creation of Communal Hierarchical Taxonomies in Social Tagging Systems" ? It is very relevant and if tags could be accurately machine generated we could derive organizational hierarchies from just tags.
> When people realize they need to classify a file in more than one way, they will start to use shortcuts/links to try to solve the problem. (Windows has shortcuts, Unix has soft/symbolic links, and Mac has aliases. This is a ubiquitous feature, but transferring shortcuts across existing platforms is very hard.) This sounds like a reasonable solution, but will face trouble in all but the simplest use cases.
They rule out this solution because it's not perfect, but surely if the idea had real merit this would be a serviceable test bed?
I think we have hierarchies because it's human nature to create hierarchies to make sense of the world; we try to force them into places where they don't make sense. We see it in biology, we see it in organisations, and we see it in code; I'm sure most of us here have worked with examples of OO hierarchies that made no sense.
I wrote a paper about the same topic while in Uni about 15 years ago, and also developed a proof of concept 'filesystem' with a file explorer that uses tags. Too bad it isn't the standard in any OS yet.
I have thought about this a bit and I think if similarity hashes (probably LSH forest) were used, automatic tagging could occur with a preset of hashes.
For simple use cases, folders beat everything else.
When you have a large number of files in deep paths, tags should be the way to go; you will need a database to manage them, for portability across OSes etc.
With tags we need to isolate how-to-store-the-files from how-to-organize-them-for-easy-access; tags can be used to build a virtual folder hierarchy, for example.
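A virtual folder hierarchy built from tags can be sketched in a few lines. This is just an illustration: `list_virtual_dir` and the example files are hypothetical, and the tag table is a plain dictionary standing in for the database.

```python
def list_virtual_dir(files, path):
    """List a virtual directory whose path components are tags.

    Returns (subdirectories, file names): the files carry every tag in the
    path, and the subdirectories are the remaining tags on those files.
    """
    wanted = {part for part in path.split("/") if part}
    matches = {name: tags for name, tags in files.items() if wanted <= tags}
    subdirs = sorted(set().union(*matches.values()) - wanted)
    return subdirs, sorted(matches)


files = {
    "a.jpg": {"photo", "paris"},
    "b.jpg": {"photo", "family"},
    "talk.odp": {"work"},
}
```

Listing `"photo"` shows `a.jpg` and `b.jpg` with the virtual subfolders `family` and `paris`, and `"photo/paris"` and `"paris/photo"` lead to the same file. How the bytes are stored never enters the picture.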
ReiserFS flashbacks.
Didn't Gmail do this and then backtrack into a hybrid? Tagging is good for semantic searching and computer categorising, but it seems us fleshsacks are wed to our physical metaphors and groupings.
You can pry HFS from my dead hands!
Seriously, graphs are hard, and the risk of losing data is real.