Introduction
A new syscall interface needs to be developed in order to allow
programs to access named streams and/or extended attributes attached to
files. Several existing filesystems support one or the other: NTFS, HFS
and HFS+ support named streams, while BeFS, HPFS and XFS support extended
attributes. While it is possible for a filesystem that supports named
streams to support extended attributes, the reverse is not possible.
Therefore, two different interfaces need to be provided, one for named
streams and one for extended attributes. Individual filesystems may
support either or both.
Named Streams
The first abstraction we make in order to develop a
named-streams interface is the concept of a namespace. A namespace,
within the context of a filesystem, is a tuple space consisting of
(name, inode) pairs. The user must be able to get the names contained
within a given namespace, in order to use them in path names. Two
namespaces are defined in our implementation of named streams, as seen
below.
#define NAMESPACE_DIRECTORY 0
#define NAMESPACE_NAMED_STREAMS 1
Fig 1 – Namespace identifiers
NAMESPACE_DIRECTORY is the traditional UNIX directory namespace
for an inode. Only objects marked as directories can have any entries
in this namespace. This preserves the traditional UNIX filesystem
semantics. On the other hand, both files and directories, and in
general, any special node, can have entries in the
NAMESPACE_NAMED_STREAMS namespace.
Most file-access system calls that work within or modify the
namespaces of inodes require a path. There is one exception,
getdents(), which we will cover later. For the remaining functions,
however, we can preserve the calling conventions of these functions by
only changing the interpretation of the path passed. Each namespace now
has one or more separating character strings associated with it. These
separators bracket names. The separator preceding a name determines the
namespace in which a name is contained. The separator for the
traditional UNIX namespace is ‘/’, and its meaning is preserved in our
system. We also define the additional separator ‘:’, which is a
separator for the named streams namespace. The use of ‘/’ would be
ambiguous if streams were allowed on directories.
We also introduce a separation in the interpretation of delimiters for
the path parser. Traditionally, a ‘/’ as the leading character in a
path means that a path should be absolute, i.e., interpreted as relative
to the root directory. This behavior is not the expected behavior for
‘:’. Instead, the path with a leading colon is commonly expected to be
relative to the current directory. In order to refine this distinction,
a NAMESPACE_DELIMITER_ROOTING flag can be specified for a delimiter,
which when set causes the delimiter to behave as ‘/’ does currently.
When not set, any path beginning with the delimiter is understood to be
relative to the current path. In practice, “/” is rooted and “:” is
not.
In the case of multiple separators, we take the last separator
that we see as the one defining the context in which the next name is
interpreted. For example, /var/gdm/:0-lock would be interpreted as the
named stream 0-lock on the directory gdm. This namespace parsing should
not break existing programs that use colons in their filenames, such as
X11.
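The separator rules above can be illustrated with a minimal sketch of a path splitter. It assumes that every ‘/’ and ‘:’ is treated as a separator and that the last separator in a run selects the namespace of the following name. The names ns_component, ns_is_separator and ns_split_path are hypothetical and are not part of the proposed interface.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NAMESPACE_DIRECTORY      0
#define NAMESPACE_NAMED_STREAMS  1

/* One parsed path component: the namespace it lives in and its name. */
struct ns_component {
    int  ns;        /* NAMESPACE_DIRECTORY or NAMESPACE_NAMED_STREAMS */
    char name[64];  /* NUL-terminated, truncated if longer */
};

/* Hypothetical helper: true if c is one of the two separators. */
static int ns_is_separator(char c)
{
    return c == '/' || c == ':';
}

/*
 * Split path into at most max components. In a run of separators the
 * last one decides the namespace of the name that follows. *rooted is
 * set to 1 only when the path starts with '/', the rooting delimiter.
 * Returns the number of components written to out.
 */
static size_t ns_split_path(const char *path, struct ns_component *out,
                            size_t max, int *rooted)
{
    size_t n = 0;
    int ns = NAMESPACE_DIRECTORY;  /* a leading bare name is a directory entry */
    const char *p = path;

    assert(path != NULL && out != NULL && rooted != NULL);
    *rooted = (*p == '/');
    while (*p != '\0' && n < max) {
        /* Consume a run of separators, remembering only the last one. */
        while (ns_is_separator(*p)) {
            ns = (*p == ':') ? NAMESPACE_NAMED_STREAMS : NAMESPACE_DIRECTORY;
            p++;
        }
        if (*p == '\0')
            break;
        /* Measure the name up to the next separator or end of string. */
        size_t len = 0;
        while (p[len] != '\0' && !ns_is_separator(p[len]))
            len++;
        size_t copy = len < sizeof(out[n].name) - 1
                    ? len : sizeof(out[n].name) - 1;
        memcpy(out[n].name, p, copy);
        out[n].name[copy] = '\0';
        out[n].ns = ns;
        n++;
        p += len;
    }
    return n;
}
```

Under these assumptions, /var/gdm/:0-lock splits into the directory entries var and gdm followed by the named stream 0-lock, and a path such as :notes is not rooted.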
The getdents() function serves the purpose of enumerating the
contents of directories. Since this function has no concept of
namespace, we must define a new, more general system call,
enumerate_namespace() (Fig. 2). Its calling convention is similar to
that of getdents(), with the addition of the nsflags parameter. The
lower 8 bits of nsflags specify the namespace to enumerate. We reserve
the upper 24 bits for miscellaneous flags dealing with namespaces. We
do not yet define any of these flags.
int enumerate_namespace ( unsigned int fd,
unsigned int nsflags,
struct dirent *ep,
unsigned int count );
Fig 2 – enumerate_namespace
An enumerate_namespace() operation is otherwise identical to the
operation of getdents(), which allows getdents() to simply call
enumerate_namespace() with nsflags set to 0. We also define the
function getsents(), with the same calling conventions as getdents(),
which instead enumerates the named streams of an object by passing 1 as
the value of nsflags (NAMESPACE_NAMED_STREAMS) to
enumerate_namespace(). getsents() does not have to be implemented as a
system call, but instead can be implemented as a C library routine. In
contrast, getdents() must be left as a system call, in order to ensure
backward compatibility.
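A minimal sketch of the two wrappers follows. Since enumerate_namespace() is only proposed here, the version below is a stub that records the nsflags it receives and enumerates nothing. The names compat_getdents and lib_getsents are hypothetical, chosen to avoid clashing with any C library symbol, as is the last_nsflags variable.

```c
#include <assert.h>
#include <stddef.h>

#define NAMESPACE_DIRECTORY      0
#define NAMESPACE_NAMED_STREAMS  1
#define NSFLAGS_NAMESPACE_MASK   0xffu  /* lower 8 bits: namespace id */

struct dirent;  /* opaque in this sketch; the stub never dereferences it */

/* Records the last nsflags seen, so the wrappers can be checked. */
static unsigned int last_nsflags;

/* Stub standing in for the proposed system call. */
static int enumerate_namespace(unsigned int fd, unsigned int nsflags,
                               struct dirent *ep, unsigned int count)
{
    (void)fd; (void)ep; (void)count;
    /* No upper-24-bit flags are defined yet, so none may be set. */
    assert((nsflags & ~NSFLAGS_NAMESPACE_MASK) == 0);
    last_nsflags = nsflags;
    return 0;  /* report no entries */
}

/* getdents() equivalent: enumerate the directory namespace. */
static int compat_getdents(unsigned int fd, struct dirent *ep,
                           unsigned int count)
{
    return enumerate_namespace(fd, NAMESPACE_DIRECTORY, ep, count);
}

/* getsents() equivalent: enumerate the named-streams namespace. */
static int lib_getsents(unsigned int fd, struct dirent *ep,
                        unsigned int count)
{
    return enumerate_namespace(fd, NAMESPACE_NAMED_STREAMS, ep, count);
}
```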
The Virtual Filesystem Implementation of Named Streams
In order to implement named streams, the traditional UNIX VFS
needs to be extended. A primary goal is to maintain compatibility with
existing filesystems and user applications, and to keep the
implementation in the kernel as simple as possible. One of the problems
with the traditional UNIX VFS is that it confuses namespace operations
with operations on the inode itself. If we introduce the concept of a
namespace, as follows, we can minimize the impact on the kernel, and
migrate towards an extensible VFS interface for UNIX-like operating
systems. We will use the Linux 2.2.16 filesystem structures as an
example, but the same changes can easily be applied to most UNIX variants.
In current UNIX implementations, each inode has one namespace,
but only directory inodes have a non-empty namespace. In our
implementation, each inode can have multiple namespaces, and each
namespace is independent of the others. This allows the creation (and
access) of hierarchies of named streams attached to both files and
directories.
In order to implement this, we need to extend the VFS layer to
allow multiple namespaces per inode. Since names are now contained in
different namespaces, we add a new field, unsigned int d_nsflags,
to the dentry struct. This field records which namespace contains the
name, and also reserves space in the dentry for additional
namespace-related flags. We reserve the lower 8 bits of this field
for namespace identifiers, and the upper 24 for flags (corresponding to
the nsflags parameter of enumerate_namespace), set to 0 by default.
Adding this flag requires changes to any code which attempts to
determine the equivalence of entries, or to add and remove entries, in a
namespace. Two entries which contain the same name are only equivalent
if the lower 8 bits of the d_nsflags field are equal, and the remaining
flag bits are either equal, or unequal and equivalent by some
superseding specification. In order to add an element, the namespace
into which it should be added must be specified. The VFS interface of
Linux 2.3/2.4 passes dentry structure pointers to all functions which do
name lookups, allowing the function to check the d_nsflags member of
the dentry structure, and query the appropriate on-disk structures for
the namespace. Path parsing must also now support namespace switching
on the appropriate delimiters.
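The bit layout of d_nsflags and the equivalence rule can be sketched as follows. The structure mini_dentry is a reduced stand-in for the kernel dentry, and the mask and function names are hypothetical. Because no flags and no superseding specification are defined yet, this sketch assumes the flag bits must simply be equal.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NS_ID_MASK     0x000000ffu  /* lower 8 bits: namespace identifier */
#define NS_FLAGS_MASK  0xffffff00u  /* upper 24 bits: reserved flags */

/* Reduced stand-in for struct dentry: only the fields used here. */
struct mini_dentry {
    const char  *d_name;
    unsigned int d_nsflags;
};

static unsigned int ns_id(unsigned int nsflags)
{
    return nsflags & NS_ID_MASK;
}

static unsigned int ns_flag_bits(unsigned int nsflags)
{
    return nsflags & NS_FLAGS_MASK;
}

/* Returns 1 if the two entries name the same object, 0 otherwise. */
static int dentry_equivalent(const struct mini_dentry *a,
                             const struct mini_dentry *b)
{
    assert(a != NULL && b != NULL);
    if (strcmp(a->d_name, b->d_name) != 0)
        return 0;
    if (ns_id(a->d_nsflags) != ns_id(b->d_nsflags))
        return 0;
    /* Assumption: with no flags defined, flag bits must match exactly. */
    return ns_flag_bits(a->d_nsflags) == ns_flag_bits(b->d_nsflags);
}
```

With this rule, a directory entry named lock and a named stream named lock on the same inode are distinct entries.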
One problematic area is the readdir function in the
file_operations structure. It does not provide a way in which to
specify the namespace which needs enumeration, and keeps track of
enumeration state information for the directory namespace only. A
function must be added to the file_operations structure which can act as
a counterpart to the enumerate_namespace() system call, and preserve
separate state for each supported namespace. We also call this function
enumerate_namespace(), and it is illustrated in Fig 3.
typedef int (*filldir_t)(void *, const char *, int, off_t, ino_t);
typedef filldir_t fillnsenum_t;
struct file_operations {
    …
    int (*readdir) (struct file *, void *, filldir_t);
    …
    int (*enumerate_namespace) (struct file *, unsigned int,
                                void *, fillnsenum_t);
};
Fig 3 – file_operations declarations
To derive this function from readdir, we add an additional unsigned int
parameter, nsflags, and change the type of the function pointer passed
from filldir_t to fillnsenum_t (fill namespace enumeration). nsflags
preserves its meaning from the
enumerate_namespace() system-call counterpart. fillnsenum_t is
currently equivalent to its cousin, filldir_t, but allows for
extensibility and independence from the earlier interface. As the
declarations in Fig 3 show, fillnsenum_t can be implemented as an alias of
filldir_t for now. The enumerate_namespace() system call will call the
new enumerate_namespace() member of file_operations, rather than readdir.
In order to maintain backward compatibility with older
filesystems that do not support alternate namespaces in either their
current implementations or earlier versions of their on-disk formats, we
add a new flag to the s_flags member of the super_block struct,
S_NAMESPACES. The absence of this flag tells the VFS to interpret
paths in the traditional manner. The VFS must also ignore namespace
separators, and must not pass a dentry whose d_nsflags is non-zero to
any function on an inode or file belonging to that super_block.
The VFS will also call readdir instead of enumerate_namespace. This
preserves the old semantics expected by existing filesystems.
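The compatibility check can be sketched as a small decision function. The bit value chosen for S_NAMESPACES, the mini_sb structure and the ns_dispatch names are all hypothetical. Rejecting a non-directory request on a legacy filesystem is also an assumption of this sketch, since the text only requires that such a request never reaches the filesystem.

```c
#include <assert.h>
#include <stddef.h>

#define S_NAMESPACES 0x8000u  /* hypothetical bit value for the new flag */

/* Reduced stand-in for struct super_block. */
struct mini_sb {
    unsigned long s_flags;
};

enum ns_dispatch {
    NS_USE_READDIR,    /* legacy filesystem: call readdir */
    NS_USE_ENUMERATE,  /* namespace-aware: call enumerate_namespace */
    NS_REJECT          /* legacy filesystem asked for another namespace */
};

/* Decide which file_operations member the VFS should call. */
static enum ns_dispatch vfs_pick_enumerator(const struct mini_sb *sb,
                                            unsigned int nsflags)
{
    assert(sb != NULL);
    if (sb->s_flags & S_NAMESPACES)
        return NS_USE_ENUMERATE;
    /* Without the flag, only nsflags == 0 may be passed down. */
    if (nsflags != 0)
        return NS_REJECT;
    return NS_USE_READDIR;
}
```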
Extended Attributes
Actual storage methods for Extended Attributes are a filesystem
implementation issue, but a standard interface for accessing them must
be provided. Attributes are typically accessed as name-value pairs, and
are set atomically. Existing filesystems do not support hierarchical
Extended Attributes.
Example: BeFS
BeOS provides the following functions to access Extended
Attributes:
ssize_t fs_read_attr ( int fd,
const char *attribute,
uint32 type,
off_t pos,
void *buf,
size_t count );
ssize_t fs_write_attr ( int fd,
const char *attribute,
uint32 type,
off_t pos,
const void *buf,
size_t count );
int fs_remove_attr ( int fd,
const char *attr );
DIR * fs_open_attr_dir ( const char *path );
DIR * fs_fopen_attr_dir ( int fd );
int fs_close_attr_dir ( DIR *dirp );
struct dirent *fs_read_attr_dir ( DIR *dirp );
void fs_rewind_attr_dir ( DIR *dirp ) ;
int fs_stat_attr ( int fd,
const char *name,
struct attr_info *ai );
Fig 4 – BeOS fs_attr.h declarations
Although BeOS enumerates Extended Attributes as though they were
in a directory, it is not possible to obtain a file handle to an
extended attribute, and they are not hierarchical. Notice that
fs_read_attr() and fs_write_attr() accept an offset, a buffer and a
length; however, BeOS does not currently support writing at any offset
other than zero. A write replaces any existing data (i.e., it writes and
then truncates). A read will read the specified number of bytes starting
at the indicated position. The type field is only a hint, so that
reading programs can assume the attribute is of, for example,
B_STRING_TYPE.
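The read and write semantics described above can be modelled with a small in-memory mock. This is not the BeOS implementation and does not call the BeOS API. The names mock_attr, mock_write_attr and mock_read_attr are hypothetical, and the fixed 256-byte capacity is an arbitrary choice for the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* In-memory stand-in for a single attribute value. */
struct mock_attr {
    unsigned char data[256];
    size_t        size;
};

/* A write replaces the whole value; only offset zero is accepted. */
static long mock_write_attr(struct mock_attr *a, long pos,
                            const void *buf, size_t count)
{
    assert(a != NULL && buf != NULL);
    if (pos != 0 || count > sizeof(a->data))
        return -1;
    memcpy(a->data, buf, count);
    a->size = count;  /* old data beyond count is discarded */
    return (long)count;
}

/* A read copies up to count bytes starting at pos. */
static long mock_read_attr(const struct mock_attr *a, long pos,
                           void *buf, size_t count)
{
    assert(a != NULL && buf != NULL);
    if (pos < 0 || (size_t)pos > a->size)
        return -1;
    size_t avail = a->size - (size_t)pos;
    size_t n = count < avail ? count : avail;
    memcpy(buf, a->data + pos, n);
    return (long)n;
}
```

Writing a short value over a longer one leaves only the short value, while a read may start at any position inside the stored value.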
Example: XFS
SGI’s XFS supports extended attributes with a similar API.
Interestingly, the XFS Attribute Manager stores “specialized structures” in
a second “data fork” provided by the XFS Space Manager. Attribute
Manager functionality is accessible via “extended vnode calls.” XFS
attributes must be small — a few kilobytes or less. This is in contrast
to BeOS and NTFS, which allow arbitrarily-sized attributes or streams
respectively. If a user has permission to read a file, then all
attributes on that file can also be read. When a file is unlinked, all
attributes are unlinked as well. XFS has a concept of “Root vs. Non-Root
namespaces,” to facilitate segregation of root-modifiable and
user-modifiable attributes. Accessor functions must specify the
namespace they wish to use. XFS supports the following general vnode
operations:
vop_attribute_list – return a list of all attribute names in the way
that getdents() works.
vop_attribute_get – read an attribute and value.
vop_attribute_set – write (possibly creating) an attribute and value.
vop_attribute_create – create an attribute and value, fail if attribute
already exists.
vop_attribute_remove – remove an attribute.
vop_attribute_multi – take a list of attribute operations and loop
across them all.
XFS has pairs of functions; one operates on a path, the other on a file
descriptor.
#define ATTR_DONTFOLLOW 0x01 /* do not follow symlinks */
#define ATTR_NOCREATE 0x02 /* don’t create on set op */
#define ATTR_FILESYSTEM 0x10 /* incl filesystem attrs */
#define ATTR_USER 0x20 /* incl user attrs */
#define ATTR_ROOT 0x40 /* incl root-only attrs */
int attr_list ( char *path,
struct attr_list_struct *list,
int len,
int flags );
int attr_listf ( int fd,
struct attr_list_struct *list,
int len,
int flags );
int attr_get ( char *path,
char *attrname,
char *value,
int *len,
int flags );
int attr_getf ( int fd,
char *attrname,
char *value,
int *len,
int flags );
int attr_set ( char *path,
char *attrname,
char *value,
int len,
int flags );
int attr_setf ( int fd,
char *attrname,
char *value,
int len,
int flags );
int attr_create ( char *path,
char *attrname,
char *value,
int len,
int flags );
int attr_createf ( int fd,
char *attrname,
char *value,
int len,
int flags );
int attr_remove ( char *path,
char *attrname,
int flags );
int attr_removef ( int fd,
char *attrname,
int flags );
struct attr_multi_op
{
    int operation;   /* set/create/remove operation code */
    char *attrname;  /* the attribute name to operate on */
    char *value;     /* the attribute value to use/set */
    int *len;        /* the max/used length of the value */
    int flags;       /* flags for this sub-operation */
    int error;       /* error for this sub-operation */
};
#define ATTR_OP_GET 0x1 /* do an attr_get() */
#define ATTR_OP_SET 0x2 /* do an attr_set() */
#define ATTR_OP_CREATE 0x3 /* do an attr_create() */
#define ATTR_OP_REMOVE 0x4 /* do an attr_remove() */
int attr_multi ( char *path,
struct attr_multi_op *args,
int count,
int flags );
int attr_multif ( int fd,
struct attr_multi_op *args,
int count,
int flags );
BeOS allows attributes to be read at any offset, but XFS allows
reading only the entire attribute. In this way, the BeOS API is more
generalized. Because BeOS allows large attributes, but XFS does not, the
difference is understandable. An OS wanting to support both will need to
follow the BeOS example.
In fact, the BeOS and XFS APIs are complementary and can be
combined. Where BeOS requires a file descriptor to be passed, XFS will
take a file descriptor or a path. And whereas XFS requires the entire
attribute to be written or read at once, BeOS allows byte ranges of
attributes to be read. Both require that only entire attributes are
written. XFS allows multiple attributes to be operated on at a time,
while BeOS requires enumeration and iteration by the programmer.
Inside of the VFS, a minimum number of operations can be
exported and still achieve all the functionality. Others can be
provided by a userland library that utilizes the exported functions.
The three most common operations read, write or delete
an attribute. The VFS already contains the functions vop_getattr and
vop_setattr for the manipulation of bit (e.g. “sticky”) or integer (e.g.
UID) metadata. To avoid a name conflict with these functions and to
set the new operations clearly apart, the new functions should have a
unique prefix: vop_eattr. The functions vop_eattr_get and vop_eattr_set would
read and write the data of a given extended attribute for a given
inode.
Minimum VFS Support
vop_eattr_get – read an EA
vop_eattr_set – set an EA
vop_eattr_remove – remove an EA
vop_eattr_list – list the EAs like vop_readdir would a directory.
Optional Support
vop_eattr_create – Create an EA or error if it exists.
vop_eattr_multi – perform a sequence of EA vops atomically.
vop_eattr_rename – change the name of an EA
vop_eattr_serialize – export all the EAs as a stream of entries.
All of the vop_eattr_* operations are, of course, atomic
operations, including the optional vop_eattr_multi. The
vop_eattr_create operation is present to provide a more orthogonal
interface between files and EAs. The vop_eattr_multi operation would
clearly be useful, and is best implemented in the kernel itself, although
it is hardly required for EAs to be functional. A vop_eattr_lookup may also be
necessary. The optional items can always be implemented in the ‘C’
library by using the existing vops, although a lack of orthogonality
with file vops and potential race conditions may make it more reasonable
to include them in the kernel.
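The race condition mentioned above can be seen in a minimal sketch of a library-level create built from a lookup followed by a set. The in-memory table stands in for the filesystem. All names here (ea_slot, ea_find, eattr_set, eattr_create) and the size limits are hypothetical, and the error codes are only one plausible choice.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define EA_SLOTS     8
#define EA_NAME_MAX  32
#define EA_VALUE_MAX 64

struct ea_slot {
    int    used;
    char   name[EA_NAME_MAX];
    char   value[EA_VALUE_MAX];
    size_t len;
};

/* In-memory stand-in for the attributes of one inode. */
static struct ea_slot ea_table[EA_SLOTS];

/* Returns the slot holding name, or NULL if it does not exist. */
static struct ea_slot *ea_find(const char *name)
{
    assert(name != NULL);
    for (size_t i = 0; i < EA_SLOTS; i++)
        if (ea_table[i].used && strcmp(ea_table[i].name, name) == 0)
            return &ea_table[i];
    return NULL;
}

/* Stand-in for a set operation: create the attribute or overwrite it. */
static int eattr_set(const char *name, const void *value, size_t len)
{
    struct ea_slot *s = ea_find(name);

    if (strlen(name) >= EA_NAME_MAX || len > EA_VALUE_MAX)
        return -EINVAL;
    for (size_t i = 0; s == NULL && i < EA_SLOTS; i++)
        if (!ea_table[i].used)
            s = &ea_table[i];
    if (s == NULL)
        return -ENOSPC;
    s->used = 1;
    strcpy(s->name, name);
    memcpy(s->value, value, len);
    s->len = len;
    return 0;
}

/* Library-level create: check, then set. The two steps are not atomic. */
static int eattr_create(const char *name, const void *value, size_t len)
{
    if (ea_find(name) != NULL)
        return -EEXIST;
    /* Another caller could create the attribute at this point. */
    return eattr_set(name, value, len);
}
```

Between the lookup and the set, a second caller can create the same attribute, in which case this create silently overwrites it. An in-kernel vop_eattr_create could close that window.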
At the ‘C’ level, the XFS implementation is more consistent, but
BeOS’s inclusion of a type is clearly needed for effective
multi-platform support. The XFS method is recommended, with the addition
of a set of type flags and normalized data, so that
libraries and applications can avoid endian issues when using integer
and float attributes. (How does NFSv4 handle this?)
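As an illustration of what normalization could mean for an integer attribute, the sketch below stores a 32-bit value in a fixed byte order so that it reads back identically on any host. The choice of big-endian order, the ea_type tags and the function names are assumptions of this sketch. The text does not specify a set of types or a byte order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical type tags; the text only says that some set is needed. */
enum ea_type {
    EA_TYPE_RAW    = 0,  /* uninterpreted bytes */
    EA_TYPE_UINT32 = 1   /* 32-bit unsigned integer in normalized order */
};

/* Encode v into four bytes, most significant byte first. */
static void ea_put_u32(unsigned char out[4], uint32_t v)
{
    assert(out != NULL);
    out[0] = (unsigned char)(v >> 24);
    out[1] = (unsigned char)(v >> 16);
    out[2] = (unsigned char)(v >> 8);
    out[3] = (unsigned char)v;
}

/* Decode four bytes written by ea_put_u32, on any host byte order. */
static uint32_t ea_get_u32(const unsigned char in[4])
{
    assert(in != NULL);
    return ((uint32_t)in[0] << 24) | ((uint32_t)in[1] << 16) |
           ((uint32_t)in[2] << 8)  |  (uint32_t)in[3];
}
```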
Supporting Extended Attributes on Streams-Capable Filesystems
Filesystems have the option of supporting Named Streams,
supporting Extended Attributes, or supporting neither or both. For
filesystems that already support named streams, it is possible to create
an Extended Attributes interface on top of the existing named streams
functionality. There are two ways of doing it: a one-to-one mapping of
Attributes to Streams, or providing structured storage inside a single
alternate stream. MacOS takes the latter approach. HFS supports up to
256 numbered streams, where the data fork is stream zero, and the
resource fork is stored in stream 255. A userspace library provides
structured storage inside of stream 255. No other streams are
accessible. Similarly, Services for Macintosh on Windows NT simply
exports streams-capable filespace to Macintoshes using AFP, and the Macs
themselves actually handle storing name-value pairs inside the resource
stream.
Mapping EAs to named streams individually can be done in several
places. If a filesystem supports both Named Streams and Extended
Attributes interfaces, it can simply write the data provided in a
set_attr() call to a named stream with the same name as the attribute.
It can also provide structured storage inside a single named stream, or
in a separate space altogether. The last solution makes streams
inaccessible as attributes and vice-versa and would require support by
the underlying filesystem. If a stackable filesystem is used, it can
perform all the same functions as the filesystem itself in the previous
description, except that it cannot use additional private storage; it
must map onto existing filesystem capabilities. In other words, it can
either map EAs one-to-one to streams or provide structured storage.
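The one-to-one mapping can be sketched as a function that derives the stream path for an attribute by appending the ‘:’ separator and the attribute name to the file path. The function name ea_stream_path is hypothetical. Rejecting attribute names that contain a separator is an assumption made here so that the resulting path cannot be reparsed into extra components.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Build the path of the named stream that backs attribute attr on the
 * file at path. Returns 0 on success, -1 if the name is unusable or
 * the result does not fit in out.
 */
static int ea_stream_path(char *out, size_t outlen,
                          const char *path, const char *attr)
{
    assert(out != NULL && path != NULL && attr != NULL);
    /* An empty name, or one containing a separator, cannot be mapped. */
    if (attr[0] == '\0' || strchr(attr, '/') != NULL ||
        strchr(attr, ':') != NULL)
        return -1;
    int n = snprintf(out, outlen, "%s:%s", path, attr);
    if (n < 0 || (size_t)n >= outlen)
        return -1;
    return 0;
}
```

A set_attr() call for the attribute author on /home/u/report.txt would then write to the stream /home/u/report.txt:author.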
Accessing named streams both as EAs (via mapping) and as named
streams simultaneously provides the opportunity for namespace
collisions. The upside is that the data is accessible through both
interfaces. This dual-access scenario is not unlike using read(),
write() and mmap() on the same file simultaneously — unless you’re
careful, you will end up leaving the data in an inconsistent state.
Conceivably, if the filesystem supports both named streams and EAs
itself, it could provide more consistency guarantees than if the EAs to
streams mapping is done externally. Of course, if the filesystem
supports both, it can simply store them in separate areas if it wishes,
which provides the best guarantee of consistency. The underlying
filesystem can assist by providing a locking mechanism to aid
serialization of accesses to streams when they are being accessed by an
attr*() function. Manipulations of EAs must be atomic operations.
Providing structured storage inside a single stream will
probably save some space compared with a one-to-one EA-to-stream
mapping, since it avoids potentially allocating one inode per
attribute. The space savings come at the expense of the extra code and
complexity needed to provide the structured storage. If a single stream
contains all extended attributes using some type of structured storage,
then that stream can simply be copied to another file intact (as when a
file is copied), unless it crosses filesystem boundaries onto a
filesystem that does not understand the chosen structured storage format
(this concern is mitigated by performing EA mapping in a stackable
filesystem module). It also reduces the possible name collisions to one.
However, filesystem damage may more easily cause loss of all extended
attributes if they are kept in structured storage than if they are each
mapped to their own stream. Mapping each EA to a unique stream will
allow programs that do not understand EAs, but can use the named streams
namespace, to interoperate with EA-only aware programs.
A stackable filesystem would conceivably allow support of EAs
and/or streams on a filesystem that is not aware of them natively, ext2
for instance, by mapping streams to a native structure such as a
“.streams” directory in each directory. However, such a discussion is
outside the scope of this paper, and moot in any case, as Linux does not
support stackable filesystems. In their absence, the VFS can perform
transparent mapping of EAs to streams, and thus be able to provide EA
functionality on streams-aware filesystems without the extra complexity
of providing structured storage.
The userspace library that exports the attr_*() calls to
applications can also perform the same function without support from the
VFS or filesystem, by performing the EA emulation itself on filesystems
that lack native support.