String encoding

Strings in Rust are always UTF-8 encoded. Rust source files must be UTF-8 encoded as well; if they are not, the compiler reports an error.
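A minimal sketch of what the UTF-8 guarantee means in practice, using only the standard library. The helper name `describe` is made up for illustration. It shows that string lengths are counted in bytes rather than characters, and that byte sequences which are not valid UTF-8 are rejected instead of becoming a string.

```rust
use std::str;

/// Hypothetical helper: returns (byte length, char count) of a string slice.
fn describe(s: &str) -> (usize, usize) {
    (s.len(), s.chars().count())
}

fn main() {
    // "道" is a single character but occupies three UTF-8 bytes.
    assert_eq!(describe("道"), (3, 1));
    assert_eq!("道".as_bytes(), [0xE9u8, 0x81, 0x93]);

    // 0xE9 announces a three-byte sequence; cut short, it is not valid UTF-8.
    assert!(str::from_utf8(&[0xE9u8, 0x81]).is_err());

    // String::from_utf8 performs the same check for owned byte vectors.
    assert!(String::from_utf8(vec![0xFFu8]).is_err());
}
```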

use std::str;

fn main() {
    let tao = str::from_utf8(&[0xE9u8, 0x81u8, 0x93u8]).unwrap(); // UTF-8 bytes to str
    assert_eq!("道", tao);
    assert_eq!("道", String::from("\u{9053}"));
    let unicode_x = 0x9053; // Unicode code point
    let utf_x_hex = 0xe98193; // UTF-8 encoding of the code point, written in hexadecimal
    let utf_x_bin = 0b111010011000000110010011; // the same UTF-8 encoding, written in binary
    println!("unicode_x: {:b}", unicode_x);
    println!("utf_x_hex: {:b}", utf_x_hex);
    println!("utf_x_bin: 0b{:b}", utf_x_bin);
}

Output:

unicode_x: 1001000001010011
utf_x_hex: 111010011000000110010011
utf_x_bin: 0b111010011000000110010011
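The code point 0x9053 and the bytes 0xE9 0x81 0x93 are related by the UTF-8 three-byte layout `1110xxxx 10xxxxxx 10xxxxxx`, which covers U+0800 through U+FFFF. The sketch below works through that bit arithmetic by hand. The function `encode_3byte` is a hypothetical helper written for this illustration, not a standard library API; real code would normally use `char::encode_utf8`.

```rust
/// Hypothetical helper: encodes a code point in U+0800..=U+FFFF as three UTF-8 bytes.
/// Returns None outside that range and for surrogates (U+D800..=U+DFFF).
fn encode_3byte(cp: u32) -> Option<[u8; 3]> {
    if !(0x0800..=0xFFFF).contains(&cp) || (0xD800..=0xDFFF).contains(&cp) {
        return None;
    }
    Some([
        0xE0 | (cp >> 12) as u8,         // 1110xxxx: top 4 bits of the code point
        0x80 | ((cp >> 6) & 0x3F) as u8, // 10xxxxxx: middle 6 bits
        0x80 | (cp & 0x3F) as u8,        // 10xxxxxx: low 6 bits
    ])
}

fn main() {
    let bytes = encode_3byte(0x9053).expect("U+9053 is in the three-byte range");
    assert_eq!(bytes, [0xE9, 0x81, 0x93]);
    // The standard library decodes these bytes back to the same character.
    assert_eq!(std::str::from_utf8(&bytes), Ok("道"));
}
```

Splitting 0x9053 (binary 1001 000001 010011) into groups of 4, 6, and 6 bits and prefixing them with 1110, 10, and 10 gives exactly the 24-bit pattern printed above.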

Characters

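Rust uses char to represent a single character. The integer value of a char corresponds one-to-one with a Unicode scalar value. So that it can hold any Unicode scalar value, Rust specifies that every char occupies 4 bytes.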
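A short sketch of these two properties, assuming only the standard library; `is_scalar_value` is a hypothetical helper name. It checks the fixed 4-byte size of char, contrasts it with the variable-length UTF-8 encoding, and shows that not every u32 is a valid scalar value.

```rust
use std::mem::size_of;

/// Hypothetical helper: true if `n` is a Unicode scalar value, i.e. convertible to char.
fn is_scalar_value(n: u32) -> bool {
    std::char::from_u32(n).is_some()
}

fn main() {
    // Every char is 4 bytes wide, regardless of the character it holds.
    assert_eq!(size_of::<char>(), 4);

    // The UTF-8 encoding of a char, by contrast, is variable length.
    assert_eq!('A'.len_utf8(), 1);
    assert_eq!('道'.len_utf8(), 3);

    // Surrogates and values above U+10FFFF are not scalar values.
    assert!(is_scalar_value(0x9053));
    assert!(!is_scalar_value(0xD800));
    assert!(!is_scalar_value(0x110000));
}
```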

fn main() {
    let tao = '道';
    let tao_u32 = tao as u32;
    assert_eq!(36947, tao_u32);
    println!("U+{:x}", tao_u32); // U+9053
    println!("{}", tao.escape_unicode()); // \u{9053}
    assert_eq!(char::from(65), 'A');
    assert_eq!(std::char::from_u32(0x9053), Some('道'));
    assert_eq!(std::char::from_u32(36947), Some('道'));
}

Output:

U+9053
\u{9053}

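fn main() {
    let mut b = [0; 3]; // buffer for the encoded bytes; '道' needs 3 bytes in UTF-8
    let tao = '道';
    let tao_str = tao.encode_utf8(&mut b); // writes the UTF-8 bytes into b and returns them as a str
    assert_eq!("道", tao_str);
    assert_eq!(3, tao.len_utf8()); // number of bytes the char takes when encoded as UTF-8
}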