Converting File Encodings with Python


Created: 2019-05-27 17:14


Converting file encodings
The chardet library
Converting file encodings on the Linux command line
Batch-updating image host URLs
References

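While updating my image host URLs, I found that the folder contained files in two encodings, GBK and UTF-8, so they all needed to be converted to UTF-8.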

Converting file encodings

import os
import codecs
import chardet


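def convert(file_name, in_code="GBK", out_code="UTF-8"):
    """
    Convert a file in place from one encoding to another (GBK to UTF-8 by default).
    :param file_name: path of the file
    :param in_code:  encoding of the input file
    :param out_code: encoding of the output file
    :return: None
    """

    try:
        # Read the whole file and close it before reopening it for writing
        with codecs.open(file_name, 'r', in_code) as f_in:
            new_content = f_in.read()
        with codecs.open(file_name, 'w', out_code) as f_out:
            f_out.write(new_content)
    except (IOError, UnicodeError) as err:
        # UnicodeError covers a wrong in_code; the read fails before anything is written
        print("Conversion error in {0}: {1}".format(file_name, err))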


path = r'C:\my\temp'

for file in os.listdir(path):
    file_name = os.path.join(path, file)
    if os.path.isdir(file_name):
        continue
    with open(file_name, "rb") as f:
        data = f.read()
    result = chardet.detect(data)  # dict with 'encoding' and 'confidence' keys
    encoding = result['encoding']
    confidence = result['confidence']
    if encoding is None:  # nothing could be detected, e.g. an empty file
        continue
    if encoding.lower() != 'utf-8':
        if confidence < 0.9:  # the detection may well be wrong
            print(file_name)
        convert(file_name, encoding, 'UTF-8')
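
Because the script above overwrites each file in place, an interrupted write could leave a file half converted. One possible safeguard is to write to a temporary file and swap it in only after the write succeeds. This is a minimal sketch, not part of the original script: convert_atomic is a hypothetical name, and it uses the built-in open instead of codecs.open.

```python
import os
import tempfile


def convert_atomic(file_name, in_code="GBK", out_code="UTF-8"):
    """Re-encode file_name, replacing it only after a complete, successful write."""
    # Decode everything first; a decoding error here leaves the file untouched.
    # newline="" keeps the original line endings as they are.
    with open(file_name, "r", encoding=in_code, newline="") as f_in:
        content = f_in.read()

    # Put the temporary file in the same directory so os.replace stays on one filesystem.
    directory = os.path.dirname(os.path.abspath(file_name))
    fd, tmp_name = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding=out_code, newline="") as f_out:
            f_out.write(content)
        os.replace(tmp_name, file_name)  # swap the finished file into place
    except Exception:
        os.remove(tmp_name)  # remove the partial temporary file
        raise
```

Unlike convert, this version lets exceptions propagate, so the calling loop can decide whether to skip the file or stop. Note that mkstemp creates the file with restrictive permissions, which the replaced file then inherits.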

The chardet library

https://www.jianshu.com/p/d73c0017158c

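Install: pip install chardet

Documentation: https://chardet.readthedocs.io/en/latest/usage.html

>>> import chardet
>>> # the sample text means "too little text is likely to be inaccurate"
>>> data = '文本太少很可能不准确'.encode('gbk')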
>>> chardet.detect(data)
{'encoding': 'GB2312', 'confidence': 0.99, 'language': 'Chinese'}

with open('test1.txt', 'rb') as f:
    result = chardet.detect(f.read())
print(result)
# {'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}
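
Note that the first example reports GB2312 even though the bytes were encoded as GBK. GB2312 is a subset of GBK, so decoding with the reported name can fail on characters that only GBK contains. A common workaround, sketched here as an assumption rather than anything chardet does itself, is to decode with the superset GB18030 whenever GB2312 or GBK is reported. The helper names are hypothetical.

```python
def widen_encoding(detected):
    """Map a detected GB2312/GBK result to the superset GB18030; pass others through."""
    if detected is not None and detected.lower() in ("gb2312", "gbk"):
        return "gb18030"
    return detected


def decode_detected(data, detected):
    """Decode bytes using the widened form of a (non-None) detected encoding."""
    return data.decode(widen_encoding(detected))


# The character below exists in GBK but not in GB2312,
# so a strict GB2312 decode of these bytes would raise UnicodeDecodeError.
sample = "镕".encode("gbk")
print(decode_detected(sample, "GB2312"))
```

In the conversion script, the same idea would mean passing widen_encoding(encoding) to convert instead of the raw detection result.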

Detecting the encoding of large files

from chardet.universaldetector import UniversalDetector

detector = UniversalDetector()
# Iterate over the file object itself: readlines() would load the whole file into memory
with open(r'C:\my\temp\1.txt', 'rb') as bigdata:
    for line in bigdata:
        detector.feed(line)
        if detector.done:  # the detector is confident enough, stop reading
            break

detector.close()
print(detector.result)
# {'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}
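
Feeding line by line works well for ordinary text, but a file with very long lines or no newlines at all would still be read in large pieces. Since feed accepts any block of bytes, fixed-size chunks are an alternative. The generator below is a small sketch using only the standard library; iter_chunks and the 64 KiB default are my own choices, not part of chardet.

```python
def iter_chunks(path, chunk_size=64 * 1024):
    """Yield successive blocks of at most chunk_size bytes from a file."""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:  # empty bytes means end of file
                break
            yield chunk


# Hypothetical usage with a UniversalDetector instance named detector:
# for chunk in iter_chunks(r'C:\my\temp\1.txt'):
#     detector.feed(chunk)
#     if detector.done:
#         break
# detector.close()
```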

To detect the encoding of several large files, a single UniversalDetector object can be reused. Just call detector.reset() before starting on each file.

import os
from chardet.universaldetector import UniversalDetector

detector = UniversalDetector()
dirs = os.listdir(r'C:\my\temp')
for name in dirs:
    path = os.path.join(r'C:\my\temp', name)
    if not os.path.isfile(path):  # skip subdirectories
        continue
    detector.reset()  # clear the state left over from the previous file
    with open(path, 'rb') as f:
        for line in f:
            detector.feed(line)
            if detector.done:
                break
    detector.close()
    print(detector.result)

# {'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}
# {'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}
# {'encoding': 'GB2312', 'confidence': 0.99, 'language': 'Chinese'}
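
When a folder is known to contain only UTF-8 and GBK files, as in my case, a detection result can be cross-checked with a plain trial decode. This is a different heuristic from chardet, shown only as a sketch: it assumes those two encodings are the only candidates, and guess_utf8_or_gbk is a hypothetical helper name. UTF-8 is tried first because GBK bytes rarely form valid UTF-8, although that is not strictly impossible for very short inputs.

```python
def guess_utf8_or_gbk(data):
    """Return 'utf-8' or 'gb18030', whichever strict decode succeeds first, else None."""
    for candidate in ("utf-8", "gb18030"):
        try:
            data.decode(candidate)  # strict mode raises on any invalid byte sequence
        except UnicodeDecodeError:
            continue
        return candidate
    return None


print(guess_utf8_or_gbk("中文".encode("utf-8")))
print(guess_utf8_or_gbk("中文".encode("gbk")))
```

A file where this check and chardet disagree is a good candidate for manual inspection before converting.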

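Converting file encodings on the Linux command line

The iconv command converts files between encodings. When you come across a GBK-encoded file that needs to become UTF-8, this command does the job directly.

Usage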

# List the encodings iconv supports
iconv --list

# Syntax for converting a file
iconv -f from_encoding -t to_encoding filename -o newfile

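Options:

-f: from, the source encoding
-t: to, the target encoding
-c: drop invalid characters from the output
-s: --silent, suppress warnings
-o: optional output file; without it the converted text is written to standard output. iconv does not convert the file in place, so the source file is kept as long as newfile is a different path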

Example

# Inspect the file
$ file test
test: UTF-8 Unicode text

# Convert
$ iconv -f utf8 -t gbk test -o test.gbk

# Result
$ file test*
test:          UTF-8 Unicode text
test.gbk:     ISO-8859 text
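
On a machine without iconv, such as the Windows box used in the rest of this post, roughly the same conversion can be done in Python. The function below is a sketch under my own naming (iconv_like, skip_invalid); it writes to a separate output file like iconv -o, and errors="ignore" only approximates what -c does.

```python
def iconv_like(src, dst, from_code, to_code, skip_invalid=False):
    """Re-encode src into a new file dst, streaming in fixed-size text blocks."""
    # "ignore" drops characters that cannot be decoded or encoded, roughly like iconv -c
    errors = "ignore" if skip_invalid else "strict"
    # newline="" keeps line endings exactly as they appear in the source file
    with open(src, "r", encoding=from_code, errors=errors, newline="") as f_in, \
         open(dst, "w", encoding=to_code, errors=errors, newline="") as f_out:
        while True:
            block = f_in.read(64 * 1024)  # counted in characters in text mode
            if not block:
                break
            f_out.write(block)
```

For the example above, the equivalent call would be iconv_like('test', 'test.gbk', 'utf-8', 'gbk').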

Batch-updating image host URLs

import os

path = r"C:\my\temp\_posts"
for i in os.listdir(path):
    file_path = os.path.join(path, i)
    if os.path.isfile(file_path):
        with open(file_path, 'r', encoding='utf-8') as a:
            content = a.read()  # named content, since str would shadow the builtin
        new_content = content.replace('http://*.*.*.cn/', 'https://*.*.*.*.com/')
        if content != new_content:
            # the file is closed by now, so it is safe to reopen it for writing
            with open(file_path, 'wt', encoding='utf-8') as b:
                b.write(new_content)
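
Before rewriting a whole folder of posts it can help to know how many links each file contains. The variant below is a sketch with hypothetical function names (replace_in_file, replace_in_dir) that returns the number of replacements per file, so the result can be reviewed afterwards. The URLs in the usage comment are placeholders, not my real image host.

```python
import os


def replace_in_file(file_path, old, new, encoding="utf-8"):
    """Replace old with new in one file; return the number of replacements made."""
    with open(file_path, "r", encoding=encoding, newline="") as f:
        content = f.read()
    count = content.count(old)  # non-overlapping matches, the same ones replace() changes
    if count:  # rewrite only files that actually contain the old URL
        with open(file_path, "w", encoding=encoding, newline="") as f:
            f.write(content.replace(old, new))
    return count


def replace_in_dir(path, old, new):
    """Apply replace_in_file to every regular file directly inside path."""
    changed = {}
    for name in os.listdir(path):
        file_path = os.path.join(path, name)
        if os.path.isfile(file_path):
            count = replace_in_file(file_path, old, new)
            if count:
                changed[name] = count
    return changed


# Hypothetical usage:
# print(replace_in_dir(r"C:\my\temp\_posts", "http://old.example.com/", "https://new.example.com/"))
```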

References

https://www.jianshu.com/p/d5030db5da0e
https://www.jianshu.com/p/d73c0017158c

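Please credit the source when reposting. You are welcome to verify the references cited in this post and to point out anything that is wrong or unclear, either in the comments below or by email to bin07280@qq.com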