Xophmeister 7 days ago

A number of people have commented that they’ve never experienced a filename with `\n`, or other weird characters, in the wild. On the contrary, I’ve seen this many many times, to the point that I’m defensive about how I write scripts, to compensate.

In my case, these were files created by some other program that contained a bug, where the filename was accidentally set to something like the file contents, say. These other programs were often written by researchers, rather than engineers, who (reasonably so) cared more about their research than functional correctness. They were also not incentivised to clean up the dodgy files if they fixed the bug (a big if!), or may not even have been equipped to do so.

You could argue that this is an edge case, but shit happens. Often.

  • ykonstant 6 days ago

    Not an edge case at all; to add another source of such file names, academics often copy-paste or manually type PDF titles into file names, resulting in files with newlines. Sometimes their machines use ancient encodings which, together with non-ASCII author names and "helpful" encoding converters, result in garbage file names.

    I've mentioned these situations before in this forum; the last time I did, I got the absurd reply "I don't care what your academic colleagues do" from someone who supposedly had sysadmin experience. So yeah, all you can do is write correct code and accept the fact that many others will refuse to.

  • SkyPuncher 7 days ago

    Anytime files, file systems, or binary files become involved I immediately assume everything is going to be harder and longer than estimated.

    There are simply so many dumb things that can happen. A library might handle 99% of cases correctly, but that 1% case basically requires an entire rewrite.

    • Log_out_ 5 days ago

      Of the filenames. Got FileRenamer CRON job, go and kill filenames. Don't solve edge cases, blunt edges at the bleeding edge. I'm not gonna rewrite my software to accommodate design errors of the ancients.

1vuio0pswjnm7 8 days ago

You shouldn't parse the output of ls(1) (wooledge.org) 149 points by tosh on Dec 31, 2021 | past | 148 comments

https://news.ycombinator.com/item?id=29747034

Parsing the output of ls is an antipattern (wooledge.org) 2 points by goranmoomin on Sept 6, 2021 | past

https://news.ycombinator.com/item?id=28435532

Why you shouldn't parse the output of ls(1) (wooledge.org) 2 points by O1111OOO on Dec 24, 2018 | past | 2 comments

https://news.ycombinator.com/item?id=18753980

Parsing ls (wooledge.org) 3 points by tambourine_man on Jan 31, 2016 | past

https://news.ycombinator.com/item?id=11007601

Why you shouldn't parse the output of ls(1) (wooledge.org) 2 points by rosser on Jan 20, 2015 | past

https://news.ycombinator.com/item?id=8914841

Why you shouldn't parse the output of ls (wooledge.org) 31 points by dgellow on July 6, 2014 | past | 18 comments

https://news.ycombinator.com/item?id=7994720

donatj 8 days ago

I have been writing shell scripts for over twenty years and don't think I have ever encountered a file name with a newline. I have hit file names starting with a dash - though being interpreted as flags at least a few times. Every command should support the -- path separation feature a lot of newer stuff like git does.

  • userbinator 8 days ago

    > Every command should support the -- path separation feature a lot of newer stuff like git does.

    That's a POSIX standard feature:

    https://pubs.opengroup.org/onlinepubs/009696799/basedefs/xbd...

    The argument -- should be accepted as a delimiter indicating the end of options. Any following arguments should be treated as operands, even if they begin with the '-' character. The -- argument should not be used as an option or as an operand.
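
A minimal sketch of the receiving side of that guideline, using Python's standard `argparse`, which already treats everything after `--` as an operand. The `demo` program name, the `-f` flag, and the `parse_paths` helper are made up for illustration.

```python
import argparse


def parse_paths(argv):
    """Parse a hypothetical rm-like command line: one flag plus file operands."""
    parser = argparse.ArgumentParser(prog="demo")
    parser.add_argument("-f", "--force", action="store_true")
    parser.add_argument("paths", nargs="*")
    return parser.parse_args(argv)


# Everything after "--" is an operand, even if it begins with "-".
args = parse_paths(["--", "-f", "--force"])
print(args.force, args.paths)
```

The calling side is the mirror image: put `--` before any filenames you did not choose yourself, so a file called `-f` cannot be read as a flag.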

  • arwineap 8 days ago

    I don't place newlines in my filenames, but I also prefer my shell scripts to be portable.

    Learning and understanding the bash oddities is key for me because I use it so often. It's the same reason I quote every expansion; sure, my text likely doesn't have spaces, but maybe it does, and I'd rather use the extra quote characters to ensure I don't have to troubleshoot language grammar bugs in my code.
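
The same habit carries over when a script is driven from Python. A rough sketch, assuming a filename with a space and a newline in it: passing an argument list avoids word splitting entirely, and `shlex.quote` is the fallback when a shell command string cannot be avoided. The filename and the `cat` command line are illustrative only; nothing here actually runs `cat`.

```python
import shlex
import subprocess
import sys

name = "my file\nwith a newline.txt"

# Preferred: pass an argument list, so no shell ever re-splits the name.
child = [sys.executable, "-c", "import sys; sys.stdout.write(sys.argv[1])", name]
echoed = subprocess.run(child, capture_output=True, text=True, check=True).stdout

# If a shell command string is unavoidable, quote every interpolated value.
quoted = shlex.quote(name)
round_trip = shlex.split(f"cat -- {quoted}")
```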

    • frou_dh 7 days ago

      I stopped having to proactively learn most gotchas by just having the spectacularly useful shellcheck linter automatically run on save.

userbinator 8 days ago

I think that if an attacker can control filenames fully, this is the least of your worries. Otherwise this article could've just as well been titled "why you shouldn't create filenames with newlines", but that makes too much sense.

Has anyone actually found a good use for filenames containing newlines?

  • nine_k 8 days ago

    I wonder what's the point to support newlines as a valid character in file names. And, for that matter, any other ASCII characters in the 0x00..0x1f range.

    Of course, file names are not characters but bytes, and this allows filesystems to transparently support various filename encodings, which are handled by locale, not by filesystem. On the other hand, there are few practical encodings that involve bytes that are control characters in ASCII. Obviously not UTF-8, not 8-bit stuff like ISO-8859-*, nor the widespread Japanese, Korean, or Chinese encodings, as far as I'm aware. UCS-2 filenames would be impossible though, but I suspect nobody uses UCS-2 as a locale in practice (even ironically).

    I think adding a mount flag that would disallow non-printable ASCII bytes in filenames could improve security, and is also really trivial to implement. I cannot be the first person to suggest it. I wonder why it was never implemented.

    • userbinator 8 days ago

      > I wonder what's the point to support newlines as a valid character in file names. And, for that matter, any other ASCII characters in the 0x00..0x1f range.

      0 is reserved for the terminator so you can't have a null byte, but I think instead of looking at this from the perspective of "what's the point to support", you should think of it as "why add additional code just for restriction, when we can consider that an upper-layer concern". The same goes for case sensitivity --- it makes the FS code in the kernel more generic and simpler.

      Contrast this with the other extreme, filesystems that have to effectively implement a subset of Unicode because they're case-insensitive beyond ASCII. Even Apple abandoned that idea over time: https://news.ycombinator.com/item?id=13953800

      • coldtea 7 days ago

        >you should think of it as "why add additional code just for restriction, when we can consider that an upper-layer concern".

        Because that way you solve the problem at the root?

        Of course that's not the "not my problem" Jersey-style approach.

        But tons of our current problems are because of that Jersey-style approach.

    • coldtea 7 days ago

      >Of course, file names are not characters but bytes

      Depends on the filesystem. Some expect them to be characters.

  • dwheeler 8 days ago

    > Has anyone actually found a good use for filenames containing newlines?

    They're great for attacking systems, because so many programs aren't prepared for them. Any program that uses filenames as keys (and there are many) can have this vulnerability if the input validation has a weakness.

    I don't think there are any legitimate uses. They also make it unnecessarily hard to write robust shell scripts.
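
To make the "filenames as keys" failure concrete, here is a small sketch with made-up names. It assumes a program that stores a list of filenames one per line; the second name contains a newline, so it comes back as two records. JSON is used only as one example of an encoding that escapes the delimiter.

```python
import json

names = ["report.txt", "notes\n/etc/passwd"]  # second name contains a newline

# Fragile: newline is both the record delimiter and a legal filename byte.
fragile = "\n".join(names).split("\n")

# Safer: an encoding that escapes the delimiter keeps each record intact.
robust = json.loads(json.dumps(names))

print(fragile)
print(robust)
```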

    • userbinator 8 days ago

      > Any program that uses filenames as keys (and there are many) can have this vulnerability if the input validation has a weakness.

      That's why things like upload sites usually ignore the provided file name and generate their own (unique) one, or else limit it to a very safe subset like [0-9A-Za-z_]
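
A sketch of that policy, not taken from any particular upload site: keep the client's name only if it consists entirely of the safe subset above, otherwise generate a fresh one. The `stored_name` helper is hypothetical.

```python
import re
import uuid

SAFE = re.compile(r"[0-9A-Za-z_]+")  # the subset mentioned above


def stored_name(client_name: str) -> str:
    """Return a server-side name; unsafe client names are replaced, never repaired."""
    # fullmatch() matters: match() with a trailing "$" would accept "name\n".
    if SAFE.fullmatch(client_name):
        return client_name
    return uuid.uuid4().hex


kept = stored_name("report_2024")
replaced = stored_name("evil\nname")
```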

      • o11c 7 days ago

        The POSIX portable filename character set is:

          A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
          a b c d e f g h i j k l m n o p q r s t u v w x y z
          0 1 2 3 4 5 6 7 8 9 . _ -
        
        Additionally:

        * The slash `/` is portable as a separator.

        * The filename should not be empty

        * The filenames `.` and `..` are special. (in many contexts it's reasonable to exclude all components that start with a dot)

        * The filename should not start with a `-`

        * The entire path should be no more than 256 bytes and no component should be more than 14 bytes (widely ignored since we assume modern filesystems).
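
The rules above fit in a few lines. This is a sketch for a single path component (so the `/` separator and the 256-byte whole-path limit are out of scope); `is_portable_component` and its `max_len` parameter are names invented here, with the 14-byte limit left adjustable since it is widely ignored.

```python
import re

PORTABLE = re.compile(r"[A-Za-z0-9._-]+")  # letters, digits, dot, underscore, hyphen


def is_portable_component(name: str, max_len: int = 14) -> bool:
    """Check one path component against the portable filename rules listed above."""
    if name in (".", ".."):
        return False                      # special entries
    if name.startswith("-"):
        return False                      # would look like an option
    if PORTABLE.fullmatch(name) is None:
        return False                      # empty, or contains another character
    return len(name) <= max_len           # pattern is ASCII-only, so chars == bytes
```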

    • IshKebab 7 days ago

      Shell makes it unnecessarily hard to write robust shell scripts. I don't think filenames are to blame here.

      I mean it's not an issue in Python for example.

      • dwheeler 7 days ago

        It is an issue for Python, though this specific issue isn't the main issue for Python.

        More generally, it's even harder to routinely call programs while connecting them with pipes and ensuring that the programs run in parallel. It's a pain to routinely call other programs and get results in Python. It takes many many lines of Python to do tasks that are clearer one-liners in shell... and vice versa.

        Python and shell are different; each is good at different things.

        • IshKebab 7 days ago

          It's not an issue in Python. You would use https://docs.python.org/3/library/os.html#os.listdir in Python which returns a proper list of strings, not a string with in-band delimiters.
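
A quick illustration of the `os.listdir` point, assuming a filesystem that permits newlines in names (typical on Linux and macOS, not on Windows). The two filenames are invented for the demo.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as d:
    for name in ("plain.txt", "two\nlines.txt"):
        with open(os.path.join(d, name), "w"):
            pass

    # One list element per directory entry, newline and all.
    entries = sorted(os.listdir(d))
    # Each element can be handed straight back to the OS as a path.
    all_found = all(os.path.isfile(os.path.join(d, e)) for e in entries)
```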

          > Python and shell are different; each are good at different things.

          I agree there. Shell is good for interactive use and throwaway scripts that you don't dare share with anyone.

          Python is good for scripts that you want to actually work without supervision. (Though Python is just an example; I would recommend Deno if you are actually scripting stuff.)

  • inopinatus 8 days ago

    > Has anyone actually found a good use for filenames containing newlines?

    Yes - in tests for code that handles filenames, shell or otherwise.

    • mminer237 8 days ago

      We should also require all files to include metadata of their names translated into Klingon so we can properly test if programs are verifying the translation correctly.

      • inopinatus 8 days ago

        "Klingons do not release software. Klingon software escapes, leaving a bloody trail of design engineers and quality assurance people in its path."

        This certainly clarifies that input validation & robustness is a security matter

        • blooalien 8 days ago

          Well, this comment straight up wins my "favorite comment of the day" award. Take my upvote as your prize. :)

    • kelnos 8 days ago

      I think that falls under "no".

      • II2II 8 days ago

        Seeing as newlines may appear in file names, it is a good use. Otherwise something as mundane as a bug in program A (which produces a file name with a newline) has a potential to expose a bug in program B (which fails to interpret a file name with a newline).

        • tyre 8 days ago

          At some point it feels fine that code fails. If you’re writing shell scripts against file names that you don’t control and someone puts a new line, that’s on them. They’re asking for a bad time.

          • inopinatus 8 days ago

            Many shell scripts we write run as root, or as a deployment user (which is often privileged in terms of its access to business-critical code, packages, data stores, user uploads). Or they may be supplied for use as utilities by folks with other domains of expertise. We should never plan to enable the possibility of uncontrolled outcomes due to bad/hostile third-party input; moralising blame after an incident isn't a productive activity.

            • Filligree 8 days ago

              Maybe the kernel should be fixed to refuse filenames with newlines in them, if there are no other use cases…

              • inopinatus 8 days ago

                The option (and it would have to be an option) to do so doesn't sound like an unreasonable defense-in-depth measure, but given the myriad Unix and Unix-like variants in the world, and the possibility that your code may not be running on such a configured target system[1], we still end up having to have adequately defensive basic coding standards.

                Honestly, though, it's one reason I write a lot of my own scripts in zsh these days. Zsh has its own quirks, and I don't touch oh-my-zsh with a barge pole, but it's a darn sight quicker to reason about expansions in zsh code compared to bash, especially if arrays are in the mix. Unfortunately zsh is not necessarily available everywhere and certainly remains outside the base install of many *nix distros.

                [1] we can prove the existence of at least one such device, by construction: it's the CI/CD host on which tests run, because those tests themselves may specifically and intentionally be for safe handling of such filenames, and therefore rely on being able to instantiate them

              • SAI_Peregrinus 5 days ago

                Rather, the POSIX specification should be fixed to exclude `\n` from filenames, instead of just `/` and `\0`.

    • fshbbdssbbgdd 8 days ago

      Somebody does it once and now we all have to handle it forever!

    • chipdart 8 days ago

      > Yes - in tests for code that handles filenames, shell or otherwise.

      That sounds like a variant of the old broken window fallacy.

  • gruez 8 days ago

    >I think that if an attacker can control filenames fully, this is the least of your worries.

    There's a lot of ways an arbitrarily named file can get on your system. You could be cloning a random git repo, extracting some sort of archive, or plugging in a usb drive.

    • IshKebab 7 days ago

      Yes and more to the point I don't expect an attacker to be able to own my system just by having a particular filename on it.

      I mean... I do expect that on Linux because shitty Bash scripts are so common. But that is a problem that should be solved.

  • Joker_vD 8 days ago

    > Has anyone actually found a good use for filenames containing newlines?

    Yes. They are great for more legible names for document files on the Desktop, or even for using as post-it notes, also on Desktop. Of course, it only works for people who don't start fullscreen tmux session as the first thing they do after the boot and then never leave it.

  • bbwbsb 8 days ago

    Whether it matters if attackers control filenames depends on the threat model. I can think of a few where it doesn't matter. One example: you provide managed web application hosting and have some scripts that check something about the files so they get handled properly.

    If you have a good reason to create a file, and the value you want to identify it by contains a newline, then that is a good reason to create a filename containing new lines. Filenames are arbitrary bytes excluding null. Regardless at scale, eventually such a file will end up being created. Encoding bugs, libraries, user creation, messed up copy/paste, etc. File names may not even be utf8.
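
For the "may not even be utf8" case, Python on POSIX systems round-trips arbitrary filename bytes through `str` using the `surrogateescape` error handler. A small sketch with an invented name in Latin-1; it assumes a POSIX platform, since Windows handles filename encoding differently.

```python
import os

raw = b"caf\xe9\nmenu.txt"   # Latin-1 bytes plus a newline; not valid UTF-8

as_str = os.fsdecode(raw)    # undecodable bytes become lone surrogate code points
back = os.fsencode(as_str)   # and are restored exactly on the way out
```

Passing a `bytes` path to `os.listdir` sidesteps the decoding step altogether and returns the names as `bytes`.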

    `ls | cmd` is also just an sh code smell. Bash is big on discipline and suffering and not so much on correctness (like C and forth) so code smells matter more. When I worked with/maintained large bash/sh scripts, one `ls | cmd` and I would reread everything the person wrote because it was very likely it was horribly broken. Another one is `set -euo pipefail`, which people love to set but often don't understand.

    A lot of sh/bash, maybe most of it, is written by people that don't take it seriously as a language, and it shows. Also basically everything on stack overflow about it is wrong, and probably everything chatgpt says. I remember it being uniquely bad in that regard. To the point that whenever someone says a neat bash fact, it is better to assume they are part of an elaborate conspiracy to get you to write catastrophic bugs until verified independently. Anything short of that level of paranoia and you eventually do the steam thing and delete everyone's files.

    The article could have been titled "why you shouldn't use sh".

  • elric 8 days ago

    > if an attacker can control filenames fully, this is the least of your worries

    Sure, but practicing defense in depth is often a good idea. Maybe today's version of your application sanitizes user input correctly, but tomorrow's version might suddenly allow funky characters. If your super duper bash backup script doesn't get updated, you might suddenly have a problem.

  • colechristensen 8 days ago

    This is one of the ways you mitigate the “if they already have this you’re screwed” thing.

    The reason it’s not the least of your worries is if somebody can control a file like that the next step is some kind of escalation and this right here is one of those escalations.

  • tedunangst 8 days ago

    Users can typically create any filename they like in /tmp, which causes trouble when the nightly cleanup script runs and tries to delete the old ones.
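
A sketch of a cleanup pass that never parses a listing, so odd names are just names. It is not hardened against the races a real /tmp cleaner has to worry about; `remove_older_than` and the demo filenames are invented, and the demo assumes a filesystem that allows newlines in names.

```python
import os
import stat
import tempfile
import time
from pathlib import Path


def remove_older_than(directory: Path, max_age_seconds: float) -> list[str]:
    """Delete regular files older than max_age_seconds; return the removed names."""
    cutoff = time.time() - max_age_seconds
    removed = []
    for entry in directory.iterdir():
        info = entry.lstat()  # lstat: judge a symlink itself, never its target
        if stat.S_ISREG(info.st_mode) and info.st_mtime < cutoff:
            entry.unlink()
            removed.append(entry.name)
    return removed


with tempfile.TemporaryDirectory() as tmp:
    old = Path(tmp) / "old\nreport.txt"
    new = Path(tmp) / "fresh.txt"
    old.write_text("stale")
    new.write_text("recent")
    os.utime(old, (0, 0))  # pretend it was last touched in 1970
    removed = remove_older_than(Path(tmp), max_age_seconds=24 * 3600)
    survivors = sorted(p.name for p in Path(tmp).iterdir())
```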

    • chipdart 8 days ago

      > Users can typically create any filename they like in /tmp, which causes trouble when the nightly cleanup script runs and tries to delete the old ones.

      I don't think that's how /tmp works. Aren't those files garbage collected either at boot time or after X days of creation? I mean, if your program creates a file in /tmp and keeps it open for days then you clearly screwed up as you're misusing the whole concept of a temporary file. For that there are better places in the file system hierarchy.

      • pastage 8 days ago

        It all depends, next Debian will mount tmp in ram, LWN has a great article about it as usual. They talk about the problem with clean up, e.g. tmux sessions are stored in tmp.

        https://lwn.net/Articles/975565/

        • chipdart 8 days ago

          I don't think your reply holds any relevance. All it discusses is the file system used by a distro for its /tmp. That's irrelevant for this discussion. The file system is still wiped on reboot and old files are garbage collected after X days of creation, which is handled by the likes of systemd and upstart. The article you quoted reiterates exactly the points I made.

    • mozman 8 days ago

      find -print0 | xargs -0
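
The consuming end of that pipeline is simple because NUL is the one byte a path can never contain. A sketch in Python: `split_nul` and `find_paths` are names made up here, and `find_paths` assumes a `find` that supports `-print0` (GNU and BSD both do).

```python
import subprocess


def split_nul(data: bytes) -> list[bytes]:
    """Split NUL-terminated records, as emitted by `find -print0`."""
    # The trailing terminator leaves one empty chunk; paths are never empty.
    return [record for record in data.split(b"\0") if record]


def find_paths(root: str) -> list[bytes]:
    """Run find under root and return every path as raw bytes."""
    out = subprocess.run(["find", root, "-print0"], capture_output=True, check=True)
    return split_nul(out.stdout)


sample = b"./a\0./b\nc\0"
parsed = split_nul(sample)
```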

fire_lake 7 days ago

This is why I won’t use bash for anything remotely complicated. It’s just too full of edge cases. Any general application language is better: JavaScript, F#, heck even Python.

  • jessekv 7 days ago

    Python scripting based only on the standard library is underrated. There is no pip hell, it's readable and maintainable by the most junior team members, it's largely cross platform. This is the "batteries included" dream.

    • fire_lake 6 days ago

      What do you do once you need a dependency?

      • twalla 6 days ago

        Historically? Vendor them because if you're doing this you likely already have a centralized "scripts" repo.

0xml 8 days ago

That's why I kind of like PowerShell, object-oriented instead of string-oriented

  • csydas 8 days ago

    I get what you’re saying but the other weird limitations of powershell as well as microsoft’s strangely abusive security relationship with powershell makes it a non-starter for me, or at least makes me hesitate a ton before putting effort into a ps script since even in the same AD orgs which in theory have the same gpos applied across all machines, it’s a crapshoot if the ps script will even execute without having to trick windows into allowing it to run. object oriented nature of ps is also a curse as much as blessing as simple text parsing or iterating over files on a system are very slow compared to bash, never minding that from my experience it’s a crapshoot on how well a given ps module handles strings and how that will be reflected in the script.

    (e.g., one company’s ps module doesn’t handle square brackets ][ well and for an inexplicable reason the normal ps escaping does not work as advertised, requiring more backticks than usual and a different number depending on if you “” or ‘’ the input string. As best i can tell its joint fault of ps bug and poor module design but had no luck convincing the vendor to adjust their design because “it will break existing scripts”, which is a valid concern but it also means writing automation for their product is a headache since you’re going to see ][ in production environments as is a visually obvious delimiter when humans are reading it)

  • pastage 8 days ago

    It seems like a good idea but breaks down, a shell is text for good reasons.

    Psh is ok in Windows land on Microsoft tools, but IMO you might as well use Python if you want those features. I run the same bash script on five platforms; while the power of bash disappears when you do not have all the 3rd-party tools and a Unix kernel, it is still easy to write and fast to execute.

    That said the idea of power shell is great I use it from bash on Windows for very limited things. As a shell to integrate stuff not so much.

    • SAI_Peregrinus 5 days ago

      POSIX Shell is text for good reasons, but file names aren't (necessarily) text. So POSIX shells aren't very well-suited to dealing with file names, requiring arcane hackery to be portable & safe. It'd be easier with any sort of type system to distinguish file names from text, object orientation isn't necessary at all.

  • g15jv2dp 8 days ago

    After using powershell for a while, I just cannot get back into "unix" tools. Crazy shell syntax/parsing, inadequate formats...

fortran77 8 days ago

This is why PowerShell is so great. Commands return objects with a structure so you can do what you want with the output from Get-Childitem safely.

Genbox 7 days ago

People don't consider adversarial inputs when building scripts, which is one of the main reasons why I started a company to build an incident response platform in 2018.

I was handling security incidents on a large scale with tools that attackers would purposely circumvent with special filenames/inputs. Instead of fighting against the grain, I decided to build a platform with a heavy focus on correctness. Any deviation from the data specification would stick out as a sore thumb and immediately detect anti-forensics and other tricks.

Today, I have a vast compendium of anti-forensics tricks, such as commands not getting written to history, files that cannot be deleted or copied, and much more.

Suffice to say that if you are parsing the output of any tool, you are vulnerable to a whole slew of adversarial techniques.

SoftTalker 8 days ago

Use find(1) with -print0

  • sambazi 8 days ago

    it's even posix complaint!111

INTPenis 8 days ago

I hung around #bash@freenode (before libera) for a few years and picked up so many bashisms that I now can't look at any script without finding some nitpick. Especially scripts made by enterprise organisations to install proprietary software.

hprotagonist 8 days ago

unless, of course, you use (gnu)

  ls -D

and the application for which it was specifically earmarked. I am consistently amazed at how hard it is to mess dired up.

  • SassyBird 8 days ago

    When would you use it?

    Emacs dired works fine on other platforms, which don’t have GNU ls on them, so I’m guessing by default it doesn’t run ls on its own.

    • hprotagonist 8 days ago

      If your ls program supports the ‘--dired’ option, Dired automatically passes it that option; this causes ls to emit special escape sequences for certain unusual file names, without which Dired will not be able to parse those names. The first time you run Dired in an Emacs session, it checks whether ls supports the ‘--dired’ option by calling it once with that option. If the exit code is 0, Dired will subsequently use the ‘--dired’ option; otherwise it will not. You can inhibit this check by customizing the variable dired-use-ls-dired. The value unspecified (the default) means to perform the check; any other non-nil value means to use the ‘--dired’ option; and nil means not to use the ‘--dired’ option.
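
The probe described there, run `ls --dired` once and look at the exit status, is easy to reproduce outside Emacs. This is only a sketch of the idea in Python, not how Dired itself is implemented; `ls_supports_dired` is an invented name.

```python
import subprocess


def ls_supports_dired(ls: str = "ls") -> bool:
    """Return True if running `ls --dired` exits with status 0."""
    try:
        proc = subprocess.run(
            [ls, "--dired"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            check=False,
        )
    except OSError:
        return False  # the program could not be started at all
    return proc.returncode == 0


supported = ls_supports_dired()
```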

      On MS-Windows and MS-DOS systems, and also on some remote systems, Emacs emulates ls. See Emulation of ls on MS-Windows, for options and peculiarities of this emulation.

      https://www.gnu.org/software/emacs/manual/html_node/emacs/Di...

      https://www.gnu.org/software/emacs/manual/html_node/emacs/ls...

      • oefrha 8 days ago

        Geez I didn't realize dired relies on shelling out to ls. Why the fuck would you go through all the trouble of

        - shelling out to an unreliable program and doing all the parsing work;

        - adding a special mode for yourself to another piece of software, which may or may not be present;

        - adding a user customization point;

        - adding an emulation of said software when it's not present;

        when you are in a real programming language, and all you need to get the clean data directly is readdir(3) and stat(3) (or their Windows equivalents)? This is unbelievable.
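
For comparison, this is roughly what the direct route looks like in a scripting language. `os.scandir` reads directory entries and stats them without any text to parse; `list_dir` and the demo names are invented, and the demo assumes a filesystem that allows newlines in names.

```python
import os
import tempfile


def list_dir(path: str) -> list[tuple[str, int, bool]]:
    """Return sorted (name, size, is_dir) tuples for each entry, with no ls involved."""
    rows = []
    with os.scandir(path) as it:
        for entry in it:
            info = entry.stat(follow_symlinks=False)
            rows.append((entry.name, info.st_size, entry.is_dir(follow_symlinks=False)))
    return sorted(rows)


with tempfile.TemporaryDirectory() as d:
    with open(os.path.join(d, "a\nb"), "w") as f:
        f.write("abc")
    os.mkdir(os.path.join(d, "sub"))
    rows = list_dir(d)
```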

        • mickeyp 7 days ago

          It's not unbelievable if you know and use Emacs and dired.

          dired will use `ls' if it is available. You can also control the arguments passed to `ls' if you want to customize sort order and suchlike. You cannot trivially do this if it's hardcoded as syscalls.

          Emacs's buffer concept specifically works around the idea of taking the output of common coreutils commands and annotating it so that it is interactive in Emacs. That means you can combine `find' + `ls' and show a collated list of files matching the things you want. That is very powerful.

          Oh, and you can swap out `find' or `ls' with anything that emits a similar format (or customize Emacs to interpret it differently).

          Now your readdir, stat, etc. equivalents require significant development effort.

        • TeMPOraL 8 days ago

          `ls` is faster and more portable in Unix-land, fallback to emulation on non-Unix platforms is easier than asking users to install a port of half of Unix just to have `ls`, and user customization points are what Emacs is made of and for in the first place.

          • oefrha 7 days ago

            Shelling out to `ls` (exec + opendir + readdir + stat + pipe IO) is not faster than opendir + readdir + stat. `ls --dired` is definitely not portable. readdir and stat (the libc functions) are portable, work across POSIX and other Unix-likes. And the customization point here doesn’t provide any benefit for the user, it serves to let the user shoulder the responsibility for its own design defect (has to guess whether the `ls` supports --dired or not).

        • hprotagonist 8 days ago

          dired predates emacs, and was originally itself a standalone tool.

          • oefrha 7 days ago

            Sure, but the dired mode of Emacs has been first party forever, and nobody thought to improve it?

    • User23 8 days ago

      Dired will warn in that case.

        ls does not support --dired; see `dired-use-ls-dired' for more details.

cornel_io 8 days ago

Simple solution: if you ever run into this, blog about it and mark it as a freak occurrence, because it is. Don't ever in real work worry about stupid edge cases like this or you're wasting your employer's money. I'd reprimand an engineer for spending more than an hour on this, it serves nobody unless you're working on extremely widely used library code.

  • TeMPOraL 8 days ago

    This is why the Unix practice of being so thoroughly stringly-typed is bad, especially when combined with advice like yours. At every stage, someone is going to cut corners and not parse just that one "freak occurrence". Then someone wants to wire a bunch of things together and they start to fail, as with so many corners cut on the way, you're bound to hit one.

    And it all starts with stringly typed data and asking everyone to make half-baked parsers and serializers on the fly.

djha-skin 8 days ago

Seriously though, has anyone encountered a file with a new line in its name? I have never encountered this "problem" in the wild.

  • makeitdouble 8 days ago

    I'd assume you can also hit that case if your string library mishandles the encoding of the file name and it becomes a new line where there was none.

    Then auto-generating files from another source will also do that, in particular from HTML fields, where SQL and sheer security-related sanitization will obviously be done, but cosmetic stuff like new lines can be left untouched.

  • rurban 8 days ago

    I often create them by accident in emacs with save-as; somehow a newline is often too easily appended.

kjkjadksj 7 days ago

The more realistic reason is that ls is very slow in directories with a lot of files compared to find.

mtoner23 7 days ago

No one makes files with new line characters so I think I'll be fine

m3kw9 8 days ago

Can’t wait for the iOS 18 feature where it summarizes the page for you with a tap

  • sambazi 8 days ago

    microsoft has you covered

    • m3kw9 7 days ago

      It's not built into Safari as a single tap. MS Edge on iOS is the worst experience I’ve used on a mobile device.

jmclnx 7 days ago

I disagree, the whole purpose of UNIX is to feed output of one command into another command.

These file name issues all come from one source, Microsoft. POSIX should be updated to forbid file names with any of these characters:

* space

* NULL

* New Line (as others said, I never ran across a file with a \n in it)

  • cxr 7 days ago

    I suspect you're not going to find as much support as you seem to think you will in forbidding filenames from containing spaces (under the mistaken belief that it's a Microsoft thing... wat).

  • ziml77 7 days ago

    How do these issues come from MS exactly?