Domain Specific Language

Strings in Rust

During the last 20 years I have used a number of garbage collected and reference counted programming languages. All of them have a single type for representing strings. Rust has two types of strings that can be stored in three different ways.

I want to briefly illustrate how Rust's strings interact with the heap, the stack, and the data segment of your binary, and briefly explain what those things are.

Java, Swift, TypeScript, and Go all have string types that make it irrelevant if they're stored on the stack or on the heap. A Java string is always heap allocated, while a Go string may be stack or heap allocated. The point is that you don't need to know: the way you use the type doesn't change.

Rust doesn't work quite that way. In general, the programmer needs to choose between types that store data either on the stack or on the heap. The choice is generally speaking a trade-off between speed and versatility.

Scope

Please note that this post does not touch upon the topics of UTF-8, UTF-8 validity, and so on. Neither does it talk about lifetimes, except for 'static, which is not explained further. This post basically glosses over everything that is not needed to understand String or str.

We need to take a small detour.

Stack and heap

This is a conceptual view of the stack and heap parts of program memory. This is probably virtual memory if you're on a laptop computer, or physical memory if you're on small embedded hardware without an operating system.

                  A view of program memory in a made up 16 bit computer 
 low memory addresses   ======================================= addr[0x00F0] 
                        |               HEAP                  |
                        | typically grows towards higher addr |
                        |                 |                   |
                        |                 ˇ                   |
                        |/////////////////////////////////////|
                        |                 ^                   |
                        |                 |                   |
                        | typically grows towards lower addr  |
                        |               STACK                 |
 high memory addresses  ======================================= addr[0xFFFF] 

The stack and heap memory is structured in a way that they grow towards each other. In my examples I use a made up 16 bit memory architecture. That means that pointers are 16 bits and registers are 16 bits as well.

                        PROGRAM                      MEMORY
use std::str;                           =====================================
fn main() {                             |             HEAP                  |
    // A stack allocated i16            |                                   |
+-- let x: i16 = 5;                     |                                   |
|                                       | addr[0x00F0] <-----+              |
|   // A heap allocated i16             | type: i16          |              |
|   let y: Box<i16> = Box::new(5);      | value: 5           |              |
|       |                               |                    |              |
|       |                               |////////////////////|//////////////|
|       |                               | addr[0xFFE0]       |              |
|       |                               | type: Box<i16>     |              |
|       +-------------------------------> value: 0x00F0 -----+              |
|                                       |                                   |
|                                       | addr[0xFFF0]                      |
|                                       | type: i16                         |
+---------------------------------------> value: 5                          |
                                        |             STACK                 |
}                                       =====================================

The program above illustrates the difference between putting something on the stack and putting it on the heap. A value on the stack is immediately available to the function, since in some sense the current stack frame is the function. When something is put on the heap, we say that it's boxed. In this case the Box is essentially just a stack-allocated struct that internally holds a pointer to the heap, where the actual data is stored. This is illustrated by the -> arrows above.
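The diagram can also be written as a small runnable program. This is only a sketch: the function name stack_and_heap is made up for illustration, and on a real 64-bit machine the Box pointer is 8 bytes rather than the 2 bytes of the made-up 16 bit computer.

```rust
use std::mem::size_of;

/// Creates one i16 on the stack and one on the heap, then returns both values.
/// The name `stack_and_heap` is hypothetical, chosen for this sketch.
fn stack_and_heap() -> (i16, i16) {
    // A stack allocated i16
    let x: i16 = 5;

    // A heap allocated i16; `y` itself is a pointer that lives on the stack
    let y: Box<i16> = Box::new(5);

    // Dereferencing the Box follows the pointer into the heap
    (x, *y)
}

fn main() {
    let (x, y) = stack_and_heap();
    println!("x = {}, y = {}", x, y);

    // The Box is pointer sized, no matter what it points to
    println!("size of i16:      {} bytes", size_of::<i16>());
    println!("size of Box<i16>: {} bytes", size_of::<Box<i16>>());
}
```

The heap allocation is freed automatically when the Box goes out of scope at the end of the function.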

Three types of string storage

A string in Rust can be stored in one of three ways: on the heap, in the data segment of the binary, or on the stack.

We need one more detour.

Application binary segments

The segments in a binary are often illustrated as in the diagram below. Please note that this is the layout after loading the binary file from disk into main memory: the stack and heap segments aren't stored in the binary on disk.

The text segment actually contains the compiled machine code, and the data segment contains application data found by the compiler while compiling.

                                     SEGMENTS
      low mem =============================================================
              |                       .text                               |
              |                contains machine code                      |
              =============================================================
              |                        .data                              |
              | contains data known to the app binary at compile time     |
              =============================================================
 addr[0x00F0] |                         HEAP                              |
              | contains dynamically allocated data created at run time   |
              | * data that lives "much" longer than one function         |
              | * data that is too big for the stack                      |
              | * data that must live behind a pointer, i.e. unknown size |
              |////////////////////////////////////////////////////////////
              |                         STACK                             |
              | contains memory allocated by functions at run time        |
              | * data with a known size                                  |
 addr[0xFFFF] | * data that mostly does not outlive the function itself   |
     high mem =============================================================

Take a string literal such as let x: &'static str = "abc". In this case, the string "abc" has been found by the compiler and put in the data segment. When our function is run, the x variable points directly into the data segment, at the address where "abc" starts. Since the &str also stores a length, the program will only read three bytes when run.
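To see the pointer and the length for yourself, here is a minimal sketch. The helper name parts is made up for this example; as_ptr and len are the standard methods on str.

```rust
/// Returns the two things a &str stores: a pointer to the first byte and a length.
/// The name `parts` is hypothetical, chosen for this sketch.
fn parts(s: &str) -> (*const u8, usize) {
    (s.as_ptr(), s.len())
}

fn main() {
    // The bytes of "abc" are baked into the binary at compile time
    let x: &'static str = "abc";

    let (ptr, len) = parts(x);
    println!("x points to {:p} and has a length of {} bytes", ptr, len);
}
```

The printed address will differ between machines and runs, but the length is always 3.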

Two types of strings

String

The first of the two types is String, which is the heap-allocated, growable string type. Growable means that, unlike a Java String, which is immutable, it can be changed in place as long as there is enough room left. If there isn't enough room, it will automatically reallocate a larger buffer.

A String may be owned, something like let x: String = String::from("abc"), or referenced, as in fn takes_string_ref(x: &mut String).

String is implemented as a wrapper around Vec<u8>.
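A small sketch of an owned, growable String. The function name append_def is made up for illustration; push_str, capacity, and into_bytes are standard String methods.

```rust
/// Appends to a String through a mutable reference.
/// The name `append_def` is hypothetical, chosen for this sketch.
fn append_def(s: &mut String) {
    // If the current buffer is too small, String reallocates a bigger one
    s.push_str("def");
}

fn main() {
    // An owned String
    let mut x: String = String::from("abc");

    // A referenced String
    append_def(&mut x);
    println!("{} (len {}, capacity {})", x, x.len(), x.capacity());

    // Consuming the String hands back the bytes it was wrapping
    let bytes: Vec<u8> = x.into_bytes();
    println!("{:?}", bytes);
}
```

The exact capacity after growing is an implementation detail, so the only thing to rely on is that it is at least as large as the length.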

str

A str is simple, but also very, very complicated. You can either accept the standard explanation without further questioning it, or you can read my take below 👇.

A str is a Dynamically Sized Type. Furthermore, it is a primitive type: unlike String, it isn't defined as a struct in the standard library, since it's built into the compiler.

str is called a "string slice". Unlike most other types, you can NOT hold a bare str value directly. It is most often seen with its buddy Mr. Ampersand, as in &str. Other possibilities are Rc<str> and Box<str>. The use case for Box<str> is that it doesn't contain the capacity field of String, so it takes up less memory.
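The size difference between String and Box<str> can be checked with size_of. This is a sketch: the function name to_boxed is made up, and the sizes printed are what current compilers produce rather than a documented guarantee.

```rust
use std::mem::size_of;

/// Converts an owned String into a Box<str>, which has no capacity field.
/// The name `to_boxed` is hypothetical, chosen for this sketch.
fn to_boxed(s: String) -> Box<str> {
    s.into_boxed_str()
}

fn main() {
    let boxed: Box<str> = to_boxed(String::from("abc"));
    println!("{}", boxed);

    // String: pointer + length + capacity
    println!("String:   {} bytes", size_of::<String>());
    // Box<str>: pointer + length
    println!("Box<str>: {} bytes", size_of::<Box<str>>());
}
```

The price for the smaller handle is that a Box<str> can't grow in place the way a String can.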

Just like a &i64 is a reference to an actual i64, an &str is a reference. But a reference to what? An i64 can be stack allocated, so the &i64 is a pointer to some other place in the stack. A Box<i64> is heap allocated, and you can get a &i64 reference to that one, too.

But the &i64 isn't just a pointer: it has an implicit size too. Since the type is 64 bits, the compiler knows how much data to read when following the reference. But what size is the str behind an &str?

Aside

Since the Rust 2018 edition, references to trait objects, which have an unknown compile-time size, are written with the dyn keyword: &dyn MyTrait (omitting dyn became a hard error in the 2021 edition). Such a reference carries a pointer to a vtable (a table of function pointers) that the program uses at run time to find the functions of the actual underlying struct. dyn MyTrait is a Dynamically Sized Type too, and it also has to be pointed to by a & reference, or be boxed somehow: e.g. Box<dyn MyTrait>.

But what about &str? A pointer to a string isn't enough: the computer must know how many bytes of data to read. Fortunately, a &str contains the length too, just as a &[u8] reference knows how many bytes to read behind the pointer.

I think there are nice similarities between how the lack of a known compile time size of a str forces the runtime code to store the runtime length together with the pointer to the actual data, and how references to trait objects need to store a pointer to a vtable to work properly. They're both Dynamically Sized Types too.
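The similarity shows up in the size of the references. In this sketch, the struct Thing and the functions name and call are made up for illustration; MyTrait is the placeholder trait from the aside.

```rust
use std::mem::size_of;

// A placeholder trait, as in the aside above
trait MyTrait {
    fn name(&self) -> String;
}

// `Thing` is a hypothetical struct, chosen for this sketch
struct Thing;

impl MyTrait for Thing {
    fn name(&self) -> String {
        String::from("thing")
    }
}

/// Calls through the vtable that travels with the reference.
fn call(t: &dyn MyTrait) -> String {
    t.name()
}

fn main() {
    let word = size_of::<usize>();

    // A reference to a sized type is one pointer wide
    println!("&i64:         {} word(s)", size_of::<&i64>() / word);
    // References to Dynamically Sized Types carry one extra word
    println!("&str:         {} word(s)", size_of::<&str>() / word); // pointer + length
    println!("&[u8]:        {} word(s)", size_of::<&[u8]>() / word); // pointer + length
    println!("&dyn MyTrait: {} word(s)", size_of::<&dyn MyTrait>() / word); // pointer + vtable

    println!("{}", call(&Thing));
}
```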

So a &str is basically type str { pointer: *const u8, len: usize }. Maybe it would have been less confusing if &str was presented another way? What about &str[u8; ?]. No that's terrible, never write that again.

The way str is presented by the standard documentation leads me to believe that the & in &str is the actual pointer, and that the str part is just a placeholder for len: usize and an implicit data type u8. But that's maybe wrong, probably?

My personal take is that str could have been a standard library type, or a struct instead, and used without it being a reference "&". That way the pointer field could have been seen in code, and all would have been well. But since Rust is Rust, and & means shared reference, all the standard rules around lifetimes and sharing kick in. That results in an overall nicer experience. However, I find the lack of a deeper explanation or what-if explorations unsatisfying.

Three types of string storage, again

Heap

The standard library String is always heap allocated, but it can interact with &str in two ways:

  1. Anything that takes a &str can take a reference to a String and it will just work.
  2. We can get a &str sub-slice of a String by doing &my_string[1..3], which for the String "abcd" would be "bc", since the end of the range is exclusive.

Neither (1) nor (2) above needs to allocate any extra memory except for the size of the pointer and the length.
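Both interactions can be sketched in a few lines. The function names byte_len and middle are made up for this example.

```rust
/// Accepts any string slice.
/// The name `byte_len` is hypothetical, chosen for this sketch.
fn byte_len(s: &str) -> usize {
    s.len()
}

/// Returns the sub-slice covering bytes 1 and 2 (the end index 3 is exclusive).
/// Panics if the input is shorter than 3 bytes or if an index is not on a
/// character boundary.
fn middle(s: &str) -> &str {
    &s[1..3]
}

fn main() {
    let my_string: String = String::from("abcd");

    // (1) A &String is converted to a &str automatically
    let n = byte_len(&my_string);

    // (2) The sub-slice borrows from the String's existing heap buffer
    let sub: &str = middle(&my_string);

    println!("len = {}, sub = {}", n, sub);
}
```

The sub-slice points one byte into the buffer that my_string already owns, so no string data is copied.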

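Here we create an empty String, and our handle to it, called x, is on the stack. Note that String::new() doesn't actually allocate any heap memory until data is added, so the heap side of the diagram is conceptual: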

                 PROGRAM                                MEMORY
use std::str;                            ========================================
fn main() {                              |              HEAP                    |
    // An empty String                   | addr[0x00F0] <----+                  |
    let x: String = String::new();       | type: *const u8   |                  |
        |                                | value: [empty]    |                  |
        |                                |                   |                  |
        |                                |///////////////////|//////////////////|
        |                                |                   |                  |
        |                                | addr[0xFFF0]      |                  |
        |                                | type: String      |                  |
        +------------------------------- > value: 0x00F0 + len: 0 + capacity: 0 |
                                         |              STACK                   |
}                                        ========================================

There's a bit of lying going on above, since String itself doesn't hold the pointer: it's a wrapper around Vec<u8>, and the Vec holds the actual pointer.
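The len and capacity fields from the diagram can be inspected directly. A minimal sketch, where the helper name len_and_capacity is made up for illustration:

```rust
/// Returns the length and capacity stored in the String handle.
/// It takes &String rather than &str because capacity() only exists on String.
/// The name `len_and_capacity` is hypothetical, chosen for this sketch.
fn len_and_capacity(s: &String) -> (usize, usize) {
    (s.len(), s.capacity())
}

fn main() {
    // An empty String: no buffer has been allocated yet
    let mut x: String = String::new();
    println!("empty:      {:?}", len_and_capacity(&x));

    // Adding data forces an allocation on the heap
    x.push_str("abc");
    println!("after push: {:?}", len_and_capacity(&x));
}
```

The capacity after the push is chosen by the implementation, so it may be larger than 3.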

Data segment

As previously mentioned, all string literals, e.g. let x = "hello", will have the type &'static str, and they are stored in the data segment of the application binary.

              PROGRAM                                     SEGMENTS
                                            ======================================= low mem
fn main() {                                 |              .text                  |
    // A str "string" with the value "abc"  =======================================
    // which is stored in the data segment  |              .data                  |
    let x: &'static str = "abc"; ----+      | addr[0x0008] <----------+           |
        |                            |      | type: *const u8         |           |
        |                            +------> value: "abc"            |           |
        |                                   ==========================|============
        |                                   |               HEAP      |           |
        |                                   |                         |           | addr[0x00F0]
        |                                   |                         |           |
        |                                   |/////////////////////////|///////////|
        |                                   |                         |           |
        |                                   |                         |           |
        |                                   |                         |           |
        |                                   | addr[0xFFF0]            |           |
        | // x is stored on the stack       | type: &'static str      |           | addr[0xFFFF]
        +-----------------------------------> value: 0x0008 + len: 3 -+           |
         // and contains a ptr and a len    |               STACK                 |
}                                           ======================================= high mem

Stack

It's possible to store string data on the stack. One way is to create an array of u8 and then get a &str slice pointing into that array.

This is stolen from the str documentation:

                        PROGRAM                                      MEMORY
use std::str;                                           =================================
fn main() {                                             | HEAP (unused in this example) |
                                                        |///////////////////////////////|
    let sparkle_heart: [u8; 4] = [240, 159, 146, 150];  | addr[0xFFE0] <----------+     |
                                  |                     | type: *const u8         |     |
                                  +---------------------> value: [bytes...]       |     |
                                                        |                         |     |
+-- let sparkle_heart = str::from_utf8(&sparkle_heart)  | addr[0xFFF0]            |     |
|       .unwrap();                                      | type: &str              |     |
|-------------------------------------------------------> value: 0xFFE0 + len: 4 -+     |
                                                        |             STACK             |
}                                                       =================================
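Without the diagram, the same idea looks roughly like this. The helper name as_text is made up for this sketch; str::from_utf8 is the standard library function used in the diagram.

```rust
use std::str;

/// Views a byte slice as a &str without copying it.
/// Panics if the bytes are not valid UTF-8.
/// The name `as_text` is hypothetical, chosen for this sketch.
fn as_text(bytes: &[u8]) -> &str {
    str::from_utf8(bytes).expect("bytes should be valid UTF-8")
}

fn main() {
    // Four bytes in an array on the stack
    let bytes: [u8; 4] = [240, 159, 146, 150];

    // A &str (pointer + length) that points at those same four bytes
    let sparkle_heart: &str = as_text(&bytes);

    println!("{} is {} bytes long", sparkle_heart, sparkle_heart.len());
}
```

No heap allocation happens here: the &str simply borrows the array.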

Stack strings and Hybrid strings

For stack allocated strings, or for hybrid stack/heap implementations, there are a number of crates available. Use your favorite search engine to find them.

A helpful tip

The reference variety of String, &String, should be avoided in favor of &str unless there is a need for a "String out parameter". A "String out parameter", or &mut String, can be used when a currently owned String needs to be updated by a receiving function, without having to move it into, and then out of, that function.
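A sketch of the two kinds of parameters. The function names count_bytes and append_suffix are made up for illustration.

```rust
/// Read-only access: a &str parameter accepts a String, a literal, or a sub-slice.
/// The name `count_bytes` is hypothetical, chosen for this sketch.
fn count_bytes(s: &str) -> usize {
    s.len()
}

/// A "String out parameter": the caller keeps ownership and the function
/// updates the String in place.
/// The name `append_suffix` is hypothetical, chosen for this sketch.
fn append_suffix(out: &mut String) {
    out.push_str("def");
}

fn main() {
    let mut owned: String = String::from("abc");

    // Both calls work with the same function
    println!("{}", count_bytes(&owned));
    println!("{}", count_bytes("a literal"));

    // The String is updated without being moved into and out of the function
    append_suffix(&mut owned);
    println!("{}", owned);
}
```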

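In short:

  1. Take a &str when a function only needs to read string data.
  2. Take a &mut String when a function needs to update a String that the caller owns.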

Final words

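We learned that Rust has two string types, String and str, and that string data can be stored in three places: on the heap (String), in the data segment (string literals of type &'static str), and on the stack (a &str pointing into a byte array).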