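Extracting Japanese text from CatSystem2 .cst files

Usage

python script.py <folder>

The code is released under the MIT License.

This write-up was produced with ds and GPT-5.3, so treat it as a rough guide.

The full source code is at the bottom 😘

Original code author: Ephylm411
Reference: https://zhuanlan.zhihu.com/p/623697843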

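1. Background: the .cst file structure

According to the analysis by the original author, Ephylm411, a .cst file is structured as follows.

Outer wrapper (compression layer)

File header (16 bytes)
- 8 bytes: fixed magic string CatScene
- 4 bytes: size of the compressed data
- 4 bytes: size of the decompressed data
Data body: zlib-compressed data; decompressing it yields the .bin file.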
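
The outer wrapper described above can be sketched as a round trip. This is a minimal illustration, not the tool's own code: the names `wrap_cst` and `unwrap_cst` are hypothetical, and the explicit little-endian format (`<`) is an assumption (the source at the bottom uses native byte order, which is little-endian on typical PCs).

```python
import struct
import zlib

MAGIC = b"CatScene"  # 8-byte magic string from the header description


def wrap_cst(payload: bytes) -> bytes:
    """Compress payload and prepend the 16-byte outer header."""
    body = zlib.compress(payload)
    # Assumption: both size fields are little-endian uint32 values.
    return MAGIC + struct.pack("<II", len(body), len(payload)) + body


def unwrap_cst(data: bytes) -> bytes:
    """Validate the outer header and return the decompressed payload."""
    magic, packed_size, unpacked_size = struct.unpack_from("<8sII", data)
    if magic != MAGIC:
        raise ValueError("not a CatScene file")
    body = data[16:]
    if len(body) != packed_size:
        raise ValueError("compressed size mismatch")
    payload = zlib.decompress(body)
    if len(payload) != unpacked_size:
        raise ValueError("decompressed size mismatch")
    return payload


sample = wrap_cst(b"hello scene")
print(len(sample), unwrap_cst(sample))
```

Checking both size fields catches truncated or hand-edited files before any inner parsing starts.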

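Inner structure (the decompressed `.bin` file)

File header (16 bytes): four uint32 values
- h0: total length of the data body (excluding the 16-byte header)
- h1: number of statement groups (each group contains several statements)
- h2: start of the offset table, measured from the end of the header (that is, from the start of the data body)
- h3: start of the statement block, measured from the same origin
Data body: three parts, in order
1. Statement table (length = h2)
  8 bytes per group, two uint32 values each:
  - d10: number of statements in this group
  - d11: global index of the group's first statement
2. Offset table (length = h3 - h2)
  One uint32 (4 bytes) per statement. The number of entries equals the total number of statements (the sum of d10 over all groups), not the number of groups.
  Each offset is the start position of the corresponding statement, relative to the beginning of the statement block.
3. Statement block (length = h0 - h3)
  Format of each statement:
  - [0]: always 0x01 (statement start marker)
  - [1]: statement type (0x20 dialogue text, 0x21 character name, 0x02 wait for input, 0x30 control command, etc.)
  - [2..]: content, Shift-JIS encoded and terminated by 0x00

Key relations:

h1 * 8 = h2 (statement table size = 8 bytes per group × number of groups; this is what the code checks after stripping the header)
sum of d10 over all groups = number of offset table entries = number of statements in the statement block
Each statement runs from its 0x01 marker to the first 0x00 terminator.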
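
To make these relations concrete, the sketch below builds a tiny synthetic .bin and checks the same constraints that the integrity checks in the source code enforce (h0 equals the body length, h1 * 8 equals h2, and the statement total times 4 equals h3 - h2, all measured after the 16-byte header). The function names and the little-endian layout are assumptions made for illustration.

```python
import struct


def build_bin(group_sizes, statements):
    """Assemble a .bin image from group sizes and raw statement bytes."""
    table = b""
    first_index = 0
    for count in group_sizes:
        table += struct.pack("<II", count, first_index)
        first_index += count
    offsets, block = b"", b""
    for statement in statements:
        offsets += struct.pack("<I", len(block))
        block += statement
    h1 = len(group_sizes)
    h2 = len(table)            # offset table starts right after the statement table
    h3 = h2 + len(offsets)     # statement block starts right after the offset table
    h0 = h3 + len(block)       # body length, header excluded
    return struct.pack("<4I", h0, h1, h2, h3) + table + offsets + block


def check_relations(data):
    """Return the three structural checks as a dict of booleans."""
    h0, h1, h2, h3 = struct.unpack_from("<4I", data)
    body = data[16:]
    total = sum(count for count, _first in struct.iter_unpack("<II", body[:h2]))
    return {
        "body_length_ok": h0 == len(body),
        "table_size_ok": h1 * 8 == h2,
        "offset_count_ok": total * 4 == h3 - h2,
    }


statements = [b"\x01\x21Name\x00", b"\x01\x20Hello\x00", b"\x01\x02\x00"]
data = build_bin([2, 1], statements)
print(check_relations(data))
```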

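2. Code walkthrough

2.1 Helper functions

cst2bin(datcst) / bin2cst(datbin)
Handle the outer compression layer. cst2bin checks the magic string and both size fields, then returns the zlib-decompressed binary data; bin2cst compresses the data and prepends a fresh header.
warn() / printWarnings()
Collect and print non-fatal warnings (for example inaccessible data regions or unknown type codes).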

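2.2 Core class `FormatCST`

Parses the internal structure of the .bin data and provides iteration, modification, and repacking.

`__init__(self, fc)`

Reads the .cst from the file object fc and decompresses it into the bin data.
Parses the bin header to obtain h0, h1, h2, h3.
Splits the data body into three parts: b1 (statement table), b2 (offset table), b3 (statement block).
Verifies integrity:
- The second value of each statement-table group must equal the running total of the preceding statement counts, so the start indices are contiguous.
- The total statement count must match the number of offset table entries.
Parses the statement block: locates each statement through the offset table, checks the 0x01 marker and the type code, scans to the terminating 0x00, and stores the statement in the list self.d3 (the complete bytes of each statement, including the leading 0x01 and the trailing 0x00).
Regions that no offset reaches are reported as warnings; an unknown type code is also a warning, not an error.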
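
The statement-block walk can be sketched as follows. It is a simplified reading of the steps above, assuming little-endian offsets; `split_statements` and `KNOWN_TYPES` are illustrative names, and `bytes.find` stands in for the byte-by-byte terminator scan in the source.

```python
import struct

KNOWN_TYPES = {0x02, 0x20, 0x21, 0x30}


def split_statements(offset_table: bytes, block: bytes):
    """Cut the statement block into statements; return (statements, notes)."""
    statements, notes = [], []
    pos = 0  # end of the previously parsed statement
    for (start,) in struct.iter_unpack("<I", offset_table):
        if start > pos:
            notes.append(f"unreached bytes at 0x{pos:08X}")
        elif start < pos:
            raise ValueError(f"overlapping statement at 0x{start:08X}")
        if block[start:start + 1] != b"\x01":
            raise ValueError(f"missing 0x01 marker at 0x{start:08X}")
        # The terminator search starts after the marker and type bytes.
        end = block.find(b"\x00", start + 2)
        if end < 0:
            raise ValueError("statement is not terminated")
        if block[start + 1] not in KNOWN_TYPES:
            notes.append(f"unknown type 0x{block[start + 1]:02X}")
        statements.append(block[start:end + 1])
        pos = end + 1
    if pos < len(block):
        notes.append(f"unreached bytes at 0x{pos:08X}")
    return statements, notes


block = b"\x01\x21Name\x00" + b"\xff\xff" + b"\x01\x20Hi\x00"
offsets = struct.pack("<2I", 0, 9)
print(split_statements(offsets, block))
```

The two stray 0xFF bytes in the example sit between the statements, so they are reported in the notes and then skipped, matching the warning behaviour described above.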

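`iter()` / `next(skp=True)` / `get(skp=True)`

iter() resets the iterator.
next(skp) advances to the next statement. With skp=True it skips statements that do not need translation and stops only at types 0x20 and 0x21 and at 0x30 commands that begin with "scene ". Once a bare fselect command has been seen, every following statement is kept, because those are the choice entries; the fselect command itself is skipped.
get(skp) returns the current statement. With skp=True it returns only the content, without the two leading bytes and the trailing 0x00; otherwise it returns the content wrapped in visible markers such as <\x01><\x20>...<\x00> for debugging.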
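
The skp=True filter can be expressed as a small generator. This is an illustrative restatement of the rule, not the class method itself; `make_statement` and `translatable_indices` are hypothetical helper names.

```python
def make_statement(kind: int, content: bytes) -> bytes:
    """Build one raw statement: 0x01 marker, type byte, content, 0x00."""
    return bytes([0x01, kind]) + content + b"\x00"


def translatable_indices(statements):
    """Yield the indices that the skp=True filter would stop at."""
    after_fselect = False
    for index, statement in enumerate(statements):
        kind, content = statement[1], statement[2:-1]
        if after_fselect:
            yield index                    # everything after fselect is kept
        elif kind in (0x20, 0x21):
            yield index                    # dialogue text or character name
        elif kind == 0x30 and content.startswith(b"scene "):
            yield index                    # scene title command
        elif kind == 0x30 and content == b"fselect":
            after_fselect = True           # the command itself is skipped


script = [
    make_statement(0x30, b"bg 1"),
    make_statement(0x21, b"Name"),
    make_statement(0x20, b"Hi"),
    make_statement(0x02, b""),
    make_statement(0x30, b"scene Title"),
    make_statement(0x30, b"fselect"),
    make_statement(0x30, b"1 choice A"),
]
print(list(translatable_indices(script)))
```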

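`rep(bn)`

Replaces the content of the current statement with the byte string bn, keeping the first two bytes and re-appending the trailing 0x00.

`pac()`

Rebuilds the offset table and the statement block, recomputes h0, and returns the new .bin data (without the outer compression layer).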
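
A minimal sketch of the replace-then-repack step, assuming little-endian fields and an unchanged statement table. `replace_content` and `rebuild` are hypothetical names that mirror what rep() and pac() are described as doing.

```python
import struct


def replace_content(statement: bytes, new_content: bytes) -> bytes:
    """Keep the 0x01 marker and type byte, swap the content, re-terminate."""
    return statement[:2] + new_content + b"\x00"


def rebuild(statement_table: bytes, statements) -> bytes:
    """Recompute every offset and the header for the given statements."""
    offsets, block = bytearray(), bytearray()
    for statement in statements:
        offsets += struct.pack("<I", len(block))  # offset relative to block start
        block += statement
    h1 = len(statement_table) // 8
    h2 = len(statement_table)
    h3 = h2 + len(offsets)
    h0 = h3 + len(block)
    header = struct.pack("<4I", h0, h1, h2, h3)
    return header + statement_table + bytes(offsets) + bytes(block)


table = struct.pack("<II", 2, 0)  # one group holding two statements
statements = [b"\x01\x21Name\x00", b"\x01\x20Hello\x00"]
statements[1] = replace_content(statements[1], b"A longer line")
image = rebuild(table, statements)
print(struct.unpack_from("<4I", image))
```

Because every offset is recomputed from the actual statement lengths, the resulting image stays internally consistent even when a replacement is longer or shorter than the original.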

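2.3 Text encoding

SCENE_ENCODING = 'cp932': the game stores text as Shift-JIS (CP932).
TXT_ENCODING = 'utf-8-sig': exported text files use UTF-8 with BOM, which Chinese-language editors handle well.
scene_bytes_to_text() / text_to_scene_bytes() convert between bytes and text using SCENE_ENCODING.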
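
The two encodings meet only at the Python str level, as this short sketch shows. The helper names `scene_to_file_bytes` and `file_bytes_to_scene` are illustrative; the codec names are the ones the source uses.

```python
SCENE_ENCODING = "cp932"      # bytes stored inside the .cst
TXT_ENCODING = "utf-8-sig"    # bytes stored in the exported .txt


def scene_to_file_bytes(raw: bytes) -> bytes:
    """CP932 scene bytes -> UTF-8-with-BOM file bytes."""
    return raw.decode(SCENE_ENCODING).encode(TXT_ENCODING)


def file_bytes_to_scene(data: bytes) -> bytes:
    """UTF-8-with-BOM file bytes -> CP932 scene bytes (the BOM is dropped)."""
    return data.decode(TXT_ENCODING).encode(SCENE_ENCODING)


raw = "こんにちは".encode(SCENE_ENCODING)
exported = scene_to_file_bytes(raw)
print(len(raw), len(exported), file_bytes_to_scene(exported) == raw)
```

The byte count of the same text differs between the two encodings, which is why any length check has to be done on the CP932 form.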

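2.4 Single-file functions

depacst_file(sc, sb): decompresses a .cst into a .bin.
unpacst_file(sc, st, skp): extracts text into a .txt. The skp parameter controls whether only translatable text is exported.
repacst_file(sc, st, sd): repacks a new .cst from the edited .txt.

2.5 Batch functions

extract_dir(src_dir, skp): processes every .cst in a directory and writes the text to its scene_txt subdirectory.
depacst_dir(src_dir): decompresses every .cst in a directory into its scene_bin subdirectory.
unpacst(skp) / depacst() / repacst(): operate on the default directories scene_cst, scene_txt, scene_bin, and scene_dst.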

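2.6 Command-line parsing and main flow

Several invocation forms are supported:

python script.py [tag] [path]

tag (defaults to 0 when omitted):
- 0: extract text (skipping non-translatable items)
- 1: repack
- 2: decompress to .bin
- 3: extract every statement (including control commands, nothing skipped)
path: a single .cst file or a directory containing .cst files. When omitted, the default directory scene_cst is used. A directory path supports tags 0, 2, and 3 only.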
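
The optional-tag rule can be sketched as a pure function over the argument list, which makes it easy to try out. `parse_cli` is a hypothetical name; it takes the arguments after the script name rather than reading sys.argv directly.

```python
def parse_cli(args):
    """Return (tag, path) for the arguments that follow the script name."""
    tag, path = 0, None
    if args:
        if args[0] in ("0", "1", "2", "3"):
            tag = int(args[0])
            if len(args) >= 2:
                path = args[1]
        else:
            # No leading tag: the first argument is taken as the path.
            path = args[0]
    return tag, path


for example in ([], ["1"], ["3", "a.cst"], ["a.cst"]):
    print(example, "->", parse_cli(example))
```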

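3. Usage examples

3.1 Preparing the environment

Put the script in the game directory (or any directory), create a folder named scene_cst, and place the .cst files to process in it.

3.2 Extracting text

python script.py 0

Or for a single file:

python script.py 0 script.cst

In batch mode the extracted text is saved in the scene_txt folder; in single-file mode it is saved next to the .cst file. The file name is the original name plus .txt (for example 01.cst.txt), encoded as UTF-8 with BOM.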

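3.3 Translating the text

Open the .txt file in a text editor (such as Notepad++ or VS Code) and edit each line.
Notes:

Do not change the number of lines (each translated line replaces one original line).
Keep each new line within the byte length of the original statement, measured after encoding to CP932, not in the UTF-8 of the .txt file. In CP932 a full-width Japanese character or kanji takes 2 bytes; Chinese characters that have no CP932 mapping cannot be encoded at all and make repacking fail.
Special control commands (type 0x30) normally need no changes, but the content of scene commands and of the entries after fselect can be translated.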
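
Both constraints can be checked before repacking. The sketch below is one possible pre-check, not part of the tool: `check_line` is a hypothetical helper that compares CP932 byte lengths and reports characters that CP932 cannot represent.

```python
def check_line(original: str, translated: str, encoding: str = "cp932"):
    """Return a list of problems for one translated line (empty if none)."""
    try:
        new_length = len(translated.encode(encoding))
    except UnicodeEncodeError as exc:
        bad = translated[exc.start:exc.end]
        return [f"cannot encode {bad!r} in {encoding}"]
    old_length = len(original.encode(encoding))
    if new_length > old_length:
        return [f"{new_length} bytes exceeds the original {old_length} bytes"]
    return []


print(check_line("こんにちは", "Hello"))
print(check_line("はい", "Hello"))
print(check_line("こんにちは", "Hi \U0001F618"))
```

Running it over the paired original and translated lines gives a list of lines to shorten or rephrase before the repack step is attempted.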

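3.4 Repacking

python script.py 1

Or for a single file:

python script.py 1 script.cst

The program reads the original .cst and its matching .txt and writes a new .cst file: into the scene_dst folder under the original name in batch mode, or next to the source with the extension replaced by .new.cst (script.new.cst) in single-file mode.

3.5 Other operations

Decompress to .bin: python script.py 2
Extract every statement (including control commands): python script.py 3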

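4. Notes

Length limit
This write-up assumes that the game engine allocates a fixed-size buffer for each statement (determined by the original Japanese length), so a replacement longer than the original could crash the game when it is read. The code itself neither verifies nor enforces this: rep() replaces the content without checking its length, and pac() recomputes every offset, so a longer line still yields a structurally valid file. Until tested on the target game, treat the original byte length as a conservative limit and measure it in CP932, where full-width characters take 2 bytes. The 3-bytes-per-character size of UTF-8 applies only to the .txt file and does not affect the packed result.
Encoding
The game uses CP932 (Shift-JIS) internally; the exported text uses UTF-8 with BOM only for ease of editing. When repacking, the program converts the text back to CP932 and aborts that file with a "Text Encode Error" naming the line if a character has no CP932 mapping (certain symbols and many simplified Chinese characters).
Special statements
- 0x02 statements have no content and need no translation.
- 0x30 control commands should generally be left alone, except that the content of scene commands needs translation; with skp=True the program keeps them.
- fselect choice entries are usually at the end of the file and can be translated; with skp=True every statement after the fselect command is exported.
Warnings and errors
- The program prints warnings (inaccessible data regions, unknown type codes, and so on); these do not affect basic functionality.
- If the text has fewer or more lines than there are statements, a warning is printed but the new file is still generated: statements without a matching line keep their original text, and extra lines are ignored.
Directory layout
When no path is given, a scene_cst folder must exist in the working directory. Output directories are created automatically in batch mode. A file or directory path can also be given directly.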
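
The line-count behaviour under "Warnings and errors" can be summarised in a few lines. `pair_lines` is a hypothetical helper that only reproduces the counting rule; the warning texts are the ones the source code emits.

```python
def pair_lines(statement_count: int, lines):
    """Return (lines actually used, warnings) for a repack attempt."""
    used = min(statement_count, len(lines))
    notes = []
    if len(lines) < statement_count:
        # Remaining statements keep their original text.
        notes.append("Lack of Text")
    elif len(lines) > statement_count:
        # Surplus lines are ignored.
        notes.append(f"Unused Text Lines: {len(lines) - used}")
    return used, notes


print(pair_lines(3, ["a", "b"]))
print(pair_lines(2, ["a", "b", "c"]))
```

Either warning usually means a line was added or deleted by accident during editing, which shifts every later translation onto the wrong statement.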

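5. Summary

This tool is a localization aid for games built on the CatSystem2 engine (such as NEKOPARA); it extracts and replaces the text in their scripts. Its core is parsing the layered .cst structure correctly and keeping the offset table consistent with the statement block. The code hard-codes some assumptions (such as the meaning of the statement types) but works well on the games it has been tried on. The main difficulties in localizing such games are the text length limit and the encoding conversion, so both need attention.

The source code follows: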

# Original code author: Ephylm411
# Reference: https://zhuanlan.zhihu.com/p/623697843
# Improvements: Kinotern and GPT-5.3 Codex
import os, sys
import struct
import zlib


warnings = []

def warn(value):
    warnings.append(value)

def clearWarnings():
    global warnings
    warnings = []

def printWarnings(sc):
    global warnings
    for wi in warnings:
        print('[WARNING]', sc, wi)
    clearWarnings()


def cst2bin(datcst):
    # CST outer wrapper: magic + compressed size + decompressed size + zlib data.
    tag, sizcst, sizbin = struct.unpack_from('8sII', datcst)
    if tag != b'CatScene':
        raise Exception('Label Mismatch')
    datcst = datcst[16:]
    if sizcst != len(datcst):
        raise Exception('Size Ante Decompress Mismatch')
    datbin = zlib.decompress(datcst)
    if sizbin != len(datbin):
        raise Exception('Size Post Decompress Mismatch')
    return datbin

def bin2cst(datbin):
    # Re-wrap the raw scene binary as a CST.
    datcst = zlib.compress(datbin)
    return b'CatScene' + struct.pack('II', len(datcst), len(datbin)) + datcst


class FormatCST:
    def __init__(self, fc):
        # Decode and split the CST internal data:
        # - b1: group mapping table (the statement table)
        # - b2: statement offset table
        # - b3: statement content block
        b = cst2bin(fc.read())
        (h0, self.h1, self.h2, self.h3), b = struct.unpack_from('4I', b), b[16:]
        self.b1, b2, b3 = b[:self.h2], b[self.h2:self.h3], b[self.h3:]
        if h0 != len(b) or self.h1 * 8 != self.h2 or (self.h3 - self.h2) % 4 != 0:
            raise Exception('Integrity Constraint 0 Violated')

        it = struct.iter_unpack('II', self.b1)
        flag = True
        self.n1 = 0
        while flag:
            try:
                d10, d11 = next(it)
                if d11 != self.n1:
                    flag = False
                self.n1 += d10
            except StopIteration:
                break
        if not flag or self.n1 * 4 != self.h3 - self.h2:
            raise Exception('Integrity Constraint 1 Violated')

        it = struct.iter_unpack('I', b2)
        d2 = []
        while True:
            try:
                d2.append(*next(it))
            except StopIteration:
                break
        if self.n1 != len(d2):
            raise Exception('Integrity Constraint 2 Violated')
        
        ofs = 0
        self.d3 = []
        for i in range(self.n1):
            if ofs < d2[i]:
                warn('Unaccessible Fragment Offset 0x{0:08X}'.format(ofs))
                ofs = d2[i]
            if ofs > d2[i]:
                raise Exception('Overflow Offset 0x{0:08X}'.format(ofs))
            try:
                d30, d31, d32 = struct.unpack_from('3B', b3, ofs)
            except Exception:
                raise Exception('Content Truncated')
            if d30 != 0x01:
                raise Exception('Invalid Offset 0x{0:08X}'.format(ofs))
            if d31 not in (0x02, 0x20, 0x21, 0x30):
                warn('Unknown Code 0x01{1:02X} Offset 0x{0:08X}'.format(ofs, d31))
            ofs += 3
            while d32 != 0x00:
                try:
                    d32, = struct.unpack_from('B', b3, ofs)
                except Exception:
                    raise Exception('Content Truncated')
                ofs += 1
            self.d3.append(b3[d2[i] : ofs])
        if ofs < len(b3):
            warn('Unaccessible Fragment Offset 0x{0:08X}'.format(ofs))
        
    def iter(self):
        # Reset the iteration state.
        self.idx = -1
        self.fslc = False

    def next(self, skp = True):
        # Advance to the next statement. With skp=True, stop only at visible text items.
        self.idx += 1
        if skp:
            while self.idx < self.n1:
                if self.fslc:
                    break
                d31, = struct.unpack_from('B', self.d3[self.idx], 1)
                if d31 in (0x20, 0x21):
                    break
                if d31 == 0x30 and self.d3[self.idx][2:8] == b'scene\x20':
                    break
                if d31 == 0x30 and self.d3[self.idx][2:] == b'fselect\x00':
                    self.fslc = True
                self.idx += 1
        if self.idx >= self.n1:
            raise StopIteration
    
    def get(self, skp = True):
        # Return the text body of the current statement (wrapped in visible markers when skp is False).
        if skp:
            return self.d3[self.idx][2:-1]
        else:
            return b'<\\x01><\\x' + bytes('{0:02X}'.format(self.d3[self.idx][1]), encoding = 'utf-8') + b'>' + self.d3[self.idx][2:-1] + b'<\\x00>'
    
    def rep(self, bn):
        # Replace the text of the current statement, keeping the two-byte prefix and the trailing null byte.
        self.d3[self.idx] = self.d3[self.idx][:2] + bn + b'\x00'

    def pac(self):
        # Rebuild the offset table and the content block.
        b2, b3 = b'', b''
        ofs = 0
        for i in range(self.n1):
            b2 += struct.pack('I', ofs)
            b3 += self.d3[i]
            ofs += len(self.d3[i])
        b0 = struct.pack('4I', self.h3 + ofs, self.h1, self.h2, self.h3)
        return b0 + self.b1 + b2 + b3


pathcst = 'scene_cst'
pathbin = 'scene_bin'
pathtxt = 'scene_txt'
pathdst = 'scene_dst'
# Encoding of the game script text (NEKOPARA commonly uses CP932 / Shift-JIS)
SCENE_ENCODING = 'cp932'
# Encoding of exported text: UTF-8 with BOM (utf-8-sig)
TXT_ENCODING = 'utf-8-sig'


def scene_bytes_to_text(bn):
    return bn.decode(SCENE_ENCODING)


def text_to_scene_bytes(st):
    return st.encode(SCENE_ENCODING)


def cst_to_txt_name(sc_name):
    # Uniform text naming: 01.cst -> 01.cst.txt
    return sc_name + '.txt'


def read_txt_lines(st):
    # Try UTF-8 with BOM first; fall back to CP932 for legacy text files.
    try:
        with open(st, 'r', encoding = TXT_ENCODING, newline = '') as f:
            return f.read().splitlines()
    except UnicodeDecodeError:
        warn('TXT is not UTF-8; fell back to reading it as CP932')
        with open(st, 'r', encoding = SCENE_ENCODING, newline = '') as f:
            return f.read().splitlines()


def depacst_file(sc, sb = None):
    # Single-file mode: .cst -> .bin
    clearWarnings()
    if sb is None:
        sb = os.path.splitext(sc)[0] + '.bin'

    f = open(sc, 'rb')
    try:
        b = cst2bin(f.read())
    finally:
        f.close()

    f = open(sb, 'wb')
    f.write(b)
    f.close()

    printWarnings(os.path.basename(sc))
    return sb


def unpacst_file(sc, st = None, skp = True):
    # Single-file mode: .cst -> .txt
    clearWarnings()
    if st is None:
        st = cst_to_txt_name(sc)

    f = open(sc, 'rb')
    try:
        c = FormatCST(f)
    finally:
        f.close()

    f = open(st, 'w', encoding = TXT_ENCODING, newline = '\r\n')
    c.iter()
    while True:
        try:
            c.next(skp)
            f.write(scene_bytes_to_text(c.get(skp)))
            f.write('\n')
        except StopIteration:
            break
    f.close()

    printWarnings(os.path.basename(sc))
    return st


def repacst_file(sc, st = None, sd = None):
    # Single-file mode: .cst + .txt -> .new.cst
    clearWarnings()
    if st is None:
        st = cst_to_txt_name(sc)
        # Backward compatibility with the old naming: 01.txt
        if not os.path.exists(st):
            st_old = os.path.splitext(sc)[0] + '.txt'
            if os.path.exists(st_old):
                warn('Found a TXT with the old naming; consider renaming it to *.cst.txt')
                st = st_old
    if sd is None:
        sd = os.path.splitext(sc)[0] + '.new.cst'

    f = open(sc, 'rb')
    try:
        c = FormatCST(f)
    finally:
        f.close()

    lines = read_txt_lines(st)
    c.iter()
    li = 0
    while True:
        try:
            c.next()
        except StopIteration:
            break
        if li >= len(lines):
            warn('Lack of Text')
            break
        try:
            bn = text_to_scene_bytes(lines[li])
        except UnicodeEncodeError as e:
            raise Exception('Text Encode Error Line {0}: {1}'.format(li + 1, e))
        c.rep(bn)
        li += 1
    if li < len(lines):
        warn('Unused Text Lines: {0}'.format(len(lines) - li))

    f = open(sd, 'wb')
    f.write(bin2cst(c.pac()))
    f.close()

    printWarnings(os.path.basename(sc))
    return sd


def parse_arg():
    # Command-line usage:
    # - python 1.py
    # - python 1.py 0|1|2|3
    # - python 1.py xxx.cst
    # - python 1.py 0|1|2|3 xxx.cst
    # - python 1.py folder
    # - python 1.py 0|2|3 folder
    tag = 0
    src = None
    if len(sys.argv) >= 2:
        if sys.argv[1] in ('0', '1', '2', '3'):
            tag = int(sys.argv[1])
            if len(sys.argv) >= 3:
                src = sys.argv[2]
        else:
            src = sys.argv[1]
    return tag, src


def resolve_src(src):
    # Resolve the input path: either a direct path or a path relative to pathcst.
    if src is None:
        return (None, None)
    if os.path.isfile(src):
        return ('file', src)
    if os.path.isdir(src):
        return ('dir', src)
    cst = os.path.join(pathcst, src)
    if os.path.isfile(cst):
        return ('file', cst)
    cst_dir = os.path.join(pathcst, src)
    if os.path.isdir(cst_dir):
        return ('dir', cst_dir)
    return (None, None)


def extract_dir(src_dir, skp = True):
    # Batch-extract the .cst files in a directory into <dir>/scene_txt.
    lis = os.listdir(src_dir)
    csts = [name for name in lis if name.endswith('.cst')]
    dst_dir = os.path.join(src_dir, 'scene_txt')
    if not os.path.exists(dst_dir):
        os.makedirs(dst_dir)
    s0, s1 = 0, len(csts)
    for sc in csts:
        scp = os.path.join(src_dir, sc)
        stp = os.path.join(dst_dir, cst_to_txt_name(sc))
        try:
            unpacst_file(scp, stp, skp)
            s0 += 1
        except Exception as e:
            print('[ERROR]', sc, e)
    return (s0, s1)


def depacst_dir(src_dir):
    # Batch-decompress the .cst files in a directory into <dir>/scene_bin.
    lis = os.listdir(src_dir)
    csts = [name for name in lis if name.endswith('.cst')]
    dst_dir = os.path.join(src_dir, 'scene_bin')
    if not os.path.exists(dst_dir):
        os.makedirs(dst_dir)
    s0, s1 = 0, len(csts)
    for sc in csts:
        scp = os.path.join(src_dir, sc)
        sbp = os.path.join(dst_dir, os.path.splitext(sc)[0] + '.bin')
        try:
            depacst_file(scp, sbp)
            s0 += 1
        except Exception as e:
            print('[ERROR]', sc, e)
    return (s0, s1)


def depacst():
    liscst = os.listdir(pathcst)
    if not os.path.exists(pathbin):
        os.makedirs(pathbin)
    s0, s1 = 0, 0
    for sc in liscst:
        if not sc.endswith('.cst'):
            continue
        s1 += 1
        sp = os.path.join(pathcst, sc)
        sb = os.path.join(pathbin, sc[:-3] + 'bin')
        try:
            depacst_file(sp, sb)
        except Exception as e:
            print('[ERROR]', sc, e)
            continue

        s0 += 1
    return (s0, s1)


def unpacst(skp = True):
    liscst = os.listdir(pathcst)
    if not os.path.exists(pathtxt):
        os.makedirs(pathtxt)
    s0, s1 = 0, 0
    for sc in liscst:
        if not sc.endswith('.cst'):
            continue
        s1 += 1
        sp = os.path.join(pathcst, sc)
        st = os.path.join(pathtxt, cst_to_txt_name(sc))
        try:
            unpacst_file(sp, st, skp)
        except Exception as e:
            print('[ERROR]', sc, e)
            continue

        s0 += 1
    return (s0, s1)


def repacst():
    liscst = os.listdir(pathcst)
    listxt = os.listdir(pathtxt)
    if not os.path.exists(pathdst):
        os.makedirs(pathdst)
    s0, s1 = 0, 0
    for st in listxt:
        if not st.endswith('.txt'):
            continue
        if st.endswith('.cst.txt'):
            sc = st[:-4]
        else:
            # Backward compatibility with the old naming: 01.txt -> 01.cst
            sc = st[:-4] + '.cst'
        if sc not in liscst:
            print('[WARNING] Original CST File Missing: ' + sc)
            continue
        s1 += 1
        sp = os.path.join(pathcst, sc)
        stp = os.path.join(pathtxt, st)
        sdp = os.path.join(pathdst, sc)
        try:
            repacst_file(sp, stp, sdp)
        except Exception as e:
            print('[ERROR]', sc, e)
            continue

        s0 += 1
    return (s0, s1)


if __name__ == '__main__':
    tag, src = parse_arg()
    if tag not in (0, 1, 2, 3):
        print('Invalid Parameter')
        sys.exit()

    srct, src = resolve_src(src)
    if src is not None and srct == 'file':
        try:
            if tag == 0:
                unpacst_file(src)
            if tag == 1:
                repacst_file(src)
            if tag == 2:
                depacst_file(src)
            if tag == 3:
                unpacst_file(src, skp = False)
            print('1 / 1 completed')
        except Exception as e:
            print('[ERROR]', os.path.basename(src), e)
            sys.exit(1)
        sys.exit()

    if src is not None and srct == 'dir':
        if tag == 0:
            s0, s1 = extract_dir(src)
        elif tag == 2:
            s0, s1 = depacst_dir(src)
        elif tag == 3:
            s0, s1 = extract_dir(src, False)
        else:
            print('[ERROR] Directory mode only supports tag 0/2/3')
            sys.exit(1)
        print('%d / %d completed' % (s0, s1))
        sys.exit()

    if len(sys.argv) >= 2 and sys.argv[1] not in ('0', '1', '2', '3'):
        print('[ERROR] CST Path Missing:', sys.argv[1])
        sys.exit(1)

    if tag == 0:
        s0, s1 = unpacst()
    if tag == 1:
        s0, s1 = repacst()
    if tag == 2:
        s0, s1 = depacst()
    if tag == 3:
        s0, s1 = unpacst(False)
    print('%d / %d completed' % (s0, s1))
