The split command

Original post date: 2023-07-16

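Recently a downstream team kept reporting that the files I deliver contain garbled characters, and they had pinned it down to one specific record.

The file holds about 2.5 million records, and one of them was supposedly bad. At several hundred megabytes, it is not something you can search in Notepad.

I tried Notepad++, and it could not open the file at all.

So the only option was to split the file first, open the relevant piece in Notepad++, and turn on "Show All Characters" to prove to the downstream team that the data our big data team provides contains no garbled characters.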
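For what it's worth, the same check can be done on the command line. The following is only a rough sketch, assuming GNU coreutils. The file big.txt, the part_ prefix, and the line number 1234 are made-up stand-ins for the real file and the record the downstream team reported.

```shell
# Hypothetical setup: a small stand-in for the real 2.5-million-line file.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
seq 1 2500 > big.txt            # 2500 lines, one number per line

target=1234                     # line number reported as broken (assumed)
per_chunk=1000                  # lines per piece

# Split into numbered pieces: part_000, part_001, part_002
split -l "$per_chunk" -d -a 3 big.txt part_

# Work out which piece holds the target line and its offset inside that piece.
chunk_index=$(( (target - 1) / per_chunk ))
offset=$(( (target - 1) % per_chunk + 1 ))
chunk_file=$(printf 'part_%03d' "$chunk_index")

# Print that one line with non-printing characters made visible (GNU cat -A).
sed -n "${offset}p" "$chunk_file" | cat -A
```

If only one known line needs checking, sed -n '1234p' big.txt | cat -A does the job without splitting at all. Splitting is mainly useful when you want to open the surrounding records in an editor.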

$ split --help
Usage: split [OPTION]… [INPUT [PREFIX]]
Output fixed-size pieces of INPUT to PREFIXaa, PREFIXab, …; default
size is 1000 lines, and default PREFIX is 'x'. With no INPUT, or when INPUT
is -, read standard input.

Mandatory arguments to long options are mandatory for short options too.
-a, --suffix-length=N generate suffixes of length N (default 2)
--additional-suffix=SUFFIX append an additional SUFFIX to file names
-b, --bytes=SIZE put SIZE bytes per output file
-C, --line-bytes=SIZE put at most SIZE bytes of lines per output file
-d, --numeric-suffixes[=FROM] use numeric suffixes instead of alphabetic;
FROM changes the start value (default 0)
-e, --elide-empty-files do not generate empty output files with '-n'
--filter=COMMAND write to shell COMMAND; file name is $FILE
-l, --lines=NUMBER put NUMBER lines per output file
-n, --number=CHUNKS generate CHUNKS output files; see explanation below
-u, --unbuffered immediately copy input to output with '-n r/…'
--verbose print a diagnostic just before each
output file is opened
--help display this help and exit
--version output version information and exit

SIZE is an integer and optional unit (example: 10M is 10*1024*1024). Units
are K, M, G, T, P, E, Z, Y (powers of 1024) or KB, MB, … (powers of 1000).

CHUNKS may be:
N split into N files based on size of input
K/N output Kth of N to stdout
l/N split into N files without splitting lines
l/K/N output Kth of N to stdout without splitting lines
r/N like 'l' but use round robin distribution
r/K/N likewise but only output Kth of N to stdout

GNU coreutils online help: http://www.gnu.org/software/coreutils/
For complete documentation, run: info coreutils 'split invocation'

If that looks confusing, don't worry. Let's go through the options that matter.

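-b: the size of each output file, in bytes (unit suffixes such as K and M are accepted).

-C: the maximum number of bytes per output file, filled with whole lines. Unlike -b, it does not cut a line in the middle as long as the line fits within SIZE.

-d: use numeric suffixes instead of alphabetic ones.

-l: the number of lines in each output file.

PREFIX: the prefix used for the output file names (default 'x').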
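The difference between -b and -C is easiest to see on a tiny file. This is a minimal sketch, assuming GNU split. The file sample.txt and the prefixes b_ and c_ are invented for the demo.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Five lines of 8 bytes each (7 characters plus a newline) = 40 bytes.
printf 'line-%02d\n' 1 2 3 4 5 > sample.txt

# -b 20: exactly 20 bytes per piece, so b_aa ends partway through line 3.
split -b 20 sample.txt b_

# -C 20: at most 20 bytes per piece, but only whole lines,
# so each piece holds two 8-byte lines and the last piece holds one.
split -C 20 sample.txt c_

# Compare the byte counts of the two sets of pieces.
wc -c b_a? c_a?
```

For line-oriented data such as exported records, -C (or -l) is usually the safer choice, because every piece can still be read line by line.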

1. Use split to cut the 100 KB file date.file into 10 KB pieces:

split -b 10k date.file

ls
Result:
date.file xaa xab xac xad xae xaf xag xah xai xaj
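To try this without a real data file, something like the following should reproduce the listing. It is a sketch that fills a stand-in date.file with 100 KiB of zero bytes in a temporary directory. The content is arbitrary, since only the size matters here.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Stand-in for date.file: exactly 100 KiB (102400 bytes).
head -c 102400 /dev/zero > date.file

# 10k means 10 * 1024 bytes, so this yields ten pieces with the default prefix 'x'.
split -b 10k date.file

ls
piece_count=$(ls x?? | wc -l)
echo "pieces: $piece_count"
```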

2. By default the pieces get alphabetic suffixes. To get numeric suffixes, use -d. The suffix length can be set with -a N (here N is 3):

split -b 10k -d -a 3 date.file

ls
Result:
date.file x000 x001 x002 x003 x004 x005 x006 x007 x008 x009

3. Give the pieces a custom file name prefix by passing PREFIX after the input file:

split -b 10k -d -a 3 date.file split_file

ls
Result:
date.file split_file000 split_file001 split_file002 split_file003 split_file004 split_file005 split_file006 split_file007 split_file008 split_file009
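Two details are easy to get wrong here: -a needs a number as its argument, and the usage line puts options first, then INPUT, then PREFIX. A small sketch of the numeric-suffix and prefix variant, again assuming GNU split and a throwaway 100 KiB date.file:

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
head -c 102400 /dev/zero > date.file

# Options first, then INPUT (date.file), then PREFIX (split_file).
# -d: numeric suffixes, -a 3: three characters per suffix.
split -b 10k -d -a 3 date.file split_file

ls
first_piece=$(ls split_file* | head -n 1)
last_piece=$(ls split_file* | tail -n 1)
echo "$first_piece .. $last_piece"
```

If -a is left without a number, split takes the next word as the suffix length and reports an error, so it is worth always writing the length explicitly.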

4. Use -l to split by line count, for example into pieces of 1000 lines each:

split -l 1000 date.file
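The -l option needs the line count as its argument. A quick way to check the result is to count the lines in each piece. This sketch uses a made-up 2500-line date.file generated with seq:

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Stand-in input: 2500 lines.
seq 1 2500 > date.file

# 1000 lines per piece: two full pieces and one with the remaining 500 lines.
split -l 1000 date.file

wc -l xa?
```

Since 1000 lines is also the default size according to the help text, plain split date.file should behave the same way here.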

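What about merging the files back together?

On Linux, use cat:
For example, cat 1.wav 2.wav 3.wav > all.wav joins 1.wav, 2.wav and 3.wav byte for byte into all.wav.
Mind the order of 1.wav 2.wav 3.wav: all.wav is assembled in exactly that order. The same approach restores a file from the pieces split produced, e.g. cat xa* > date.file.merged.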
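A split followed by cat should give back the original file bit for bit, and cmp can confirm that. This is a minimal round-trip sketch. The names date.file, piece_ and restored.file are chosen just for this demo.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Random content, so that any reordering of pieces would be detected.
head -c 102400 /dev/urandom > date.file

split -b 10k date.file piece_

# The shell expands piece_* in sorted order (piece_aa, piece_ab, ...),
# which is the order in which split wrote the pieces.
cat piece_* > restored.file

# cmp prints nothing and returns 0 when the two files are identical.
cmp date.file restored.file && echo "identical"
```

This is why the suffixes matter: as long as the pieces sort in the order they were written, a simple glob is enough to put them back together.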
