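dk.exe automatically fills in the school's daily health report.
It is evidently a simple application produced by packaging a Python program. Only the .exe file was available at first, so the goal here is to recover the source code from that .exe.
The available tools include pyinstxtractor.py, archive_viewer.py, and uncompyle6. The first two are single-file scripts that work directly on the .exe and unpack it into library files and .pyc or .pyc-like files. uncompyle6 decompiles a .pyc back to source in one step; its documentation lists support for bytecode from Python 1.0 through 3.8, so files built with newer interpreters need a different decompiler.
Source of pyinstxtractor.py: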
#!/usr/bin/python
"""
PyInstaller Extractor v1.8 (Supports pyinstaller 3.2, 3.1, 3.0, 2.1, 2.0)
Author : Extreme Coders
E-mail : extremecoders(at)hotmail(dot)com
Web : https://0xec.blogspot.com
Date : 28-April-2017
Url : https://sourceforge.net/projects/pyinstallerextractor/
For any suggestions, leave a comment on
https://forum.tuts4you.com/topic/34455-pyinstaller-extractor/
This script extracts a pyinstaller generated executable file.
Pyinstaller installation is not needed. The script has it all.
For best results, it is recommended to run this script in the
same version of python as was used to create the executable.
This is just to prevent unmarshalling errors(if any) while
extracting the PYZ archive.
Usage : Just copy this script to the directory where your exe resides
and run the script with the exe file name as a parameter
C:\path\to\exe\>python pyinstxtractor.py <filename>
$ /path/to/exe/python pyinstxtractor.py <filename>
Licensed under GNU General Public License (GPL) v3.
You are free to modify this source.
CHANGELOG
================================================
Version 1.1 (Jan 28, 2014)
-------------------------------------------------
- First Release
- Supports only pyinstaller 2.0
Version 1.2 (Sept 12, 2015)
-------------------------------------------------
- Added support for pyinstaller 2.1 and 3.0 dev
- Cleaned up code
- Script is now more verbose
- Executable extracted within a dedicated sub-directory
(Support for pyinstaller 3.0 dev is experimental)
Version 1.3 (Dec 12, 2015)
-------------------------------------------------
- Added support for pyinstaller 3.0 final
- Script is compatible with both python 2.x & 3.x (Thanks to Moritz Kroll @ Avira Operations GmbH & Co. KG)
Version 1.4 (Jan 19, 2016)
-------------------------------------------------
- Fixed a bug when writing pyc files >= version 3.3 (Thanks to Daniello Alto: https://github.com/Djamana)
Version 1.5 (March 1, 2016)
-------------------------------------------------
- Added support for pyinstaller 3.1 (Thanks to Berwyn Hoyt for reporting)
Version 1.6 (Sept 5, 2016)
-------------------------------------------------
- Added support for pyinstaller 3.2
- Extractor will use a random name while extracting unnamed files.
- For encrypted pyz archives it will dump the contents as is. Previously, the tool would fail.
Version 1.7 (March 13, 2017)
-------------------------------------------------
- Made the script compatible with python 2.6 (Thanks to Ross for reporting)
Version 1.8 (April 28, 2017)
-------------------------------------------------
- Support for sub-directories in .pyz files (Thanks to Moritz Kroll @ Avira Operations GmbH & Co. KG)
"""
"""
Author: In Ming Loh
Email: inming.loh@countercept.com
Changes have been made to Version 1.8 (April 28, 2017).
CHANGELOG
================================================
- Function extractFiles(self, custom_dir=None) has been modfied to allow custom output directory.
"""
import os
import struct
import marshal
import zlib
import sys
import imp
import types
from uuid import uuid4 as uniquename
class CTOCEntry:
    def __init__(self, position, cmprsdDataSize, uncmprsdDataSize, cmprsFlag, typeCmprsData, name):
        self.position = position
        self.cmprsdDataSize = cmprsdDataSize
        self.uncmprsdDataSize = uncmprsdDataSize
        self.cmprsFlag = cmprsFlag
        self.typeCmprsData = typeCmprsData
        self.name = name

class PyInstArchive:
    PYINST20_COOKIE_SIZE = 24           # For pyinstaller 2.0
    PYINST21_COOKIE_SIZE = 24 + 64      # For pyinstaller 2.1+
    MAGIC = b'MEI\014\013\012\013\016'  # Magic number which identifies pyinstaller

    def __init__(self, path):
        self.filePath = path

    def open(self):
        try:
            self.fPtr = open(self.filePath, 'rb')
            self.fileSize = os.stat(self.filePath).st_size
        except:
            print('[*] Error: Could not open {0}'.format(self.filePath))
            return False
        return True

    def close(self):
        try:
            self.fPtr.close()
        except:
            pass

    def checkFile(self):
        print('[*] Processing {0}'.format(self.filePath))
        # Check if it is a 2.0 archive
        self.fPtr.seek(self.fileSize - self.PYINST20_COOKIE_SIZE, os.SEEK_SET)
        magicFromFile = self.fPtr.read(len(self.MAGIC))
        if magicFromFile == self.MAGIC:
            self.pyinstVer = 20 # pyinstaller 2.0
            print('[*] Pyinstaller version: 2.0')
            return True
        # Check for pyinstaller 2.1+ before bailing out
        self.fPtr.seek(self.fileSize - self.PYINST21_COOKIE_SIZE, os.SEEK_SET)
        magicFromFile = self.fPtr.read(len(self.MAGIC))
        if magicFromFile == self.MAGIC:
            print('[*] Pyinstaller version: 2.1+')
            self.pyinstVer = 21 # pyinstaller 2.1+
            return True
        print('[*] Error : Unsupported pyinstaller version or not a pyinstaller archive')
        return False

    def getCArchiveInfo(self):
        try:
            if self.pyinstVer == 20:
                self.fPtr.seek(self.fileSize - self.PYINST20_COOKIE_SIZE, os.SEEK_SET)
                # Read CArchive cookie
                (magic, lengthofPackage, toc, tocLen, self.pyver) = \
                struct.unpack('!8siiii', self.fPtr.read(self.PYINST20_COOKIE_SIZE))
            elif self.pyinstVer == 21:
                self.fPtr.seek(self.fileSize - self.PYINST21_COOKIE_SIZE, os.SEEK_SET)
                # Read CArchive cookie
                (magic, lengthofPackage, toc, tocLen, self.pyver, pylibname) = \
                struct.unpack('!8siiii64s', self.fPtr.read(self.PYINST21_COOKIE_SIZE))
        except:
            print('[*] Error : The file is not a pyinstaller archive')
            return False
        print('[*] Python version: {0}'.format(self.pyver))
        # Overlay is the data appended at the end of the PE
        self.overlaySize = lengthofPackage
        self.overlayPos = self.fileSize - self.overlaySize
        self.tableOfContentsPos = self.overlayPos + toc
        self.tableOfContentsSize = tocLen
        print('[*] Length of package: {0} bytes'.format(self.overlaySize))
        return True

    def parseTOC(self):
        # Go to the table of contents
        self.fPtr.seek(self.tableOfContentsPos, os.SEEK_SET)
        self.tocList = []
        parsedLen = 0
        # Parse table of contents
        while parsedLen < self.tableOfContentsSize:
            (entrySize, ) = struct.unpack('!i', self.fPtr.read(4))
            nameLen = struct.calcsize('!iiiiBc')
            (entryPos, cmprsdDataSize, uncmprsdDataSize, cmprsFlag, typeCmprsData, name) = \
            struct.unpack( \
                '!iiiBc{0}s'.format(entrySize - nameLen), \
                self.fPtr.read(entrySize - 4))
            name = name.decode('utf-8').rstrip('\0')
            if len(name) == 0:
                name = str(uniquename())
                print('[!] Warning: Found an unamed file in CArchive. Using random name {0}'.format(name))
            self.tocList.append( \
                                CTOCEntry( \
                                    self.overlayPos + entryPos, \
                                    cmprsdDataSize, \
                                    uncmprsdDataSize, \
                                    cmprsFlag, \
                                    typeCmprsData, \
                                    name \
                                ))
            parsedLen += entrySize
        print('[*] Found {0} files in CArchive'.format(len(self.tocList)))

    def extractFiles(self, custom_dir=None):
        print('[*] Beginning extraction...please standby')
        if custom_dir is None:
            extractionDir = os.path.join(os.getcwd(), os.path.basename(self.filePath) + '_extracted')
            if not os.path.exists(extractionDir):
                os.mkdir(extractionDir)
            os.chdir(extractionDir)
        else:
            if not os.path.exists(custom_dir):
                os.makedirs(custom_dir)
            os.chdir(custom_dir)
        for entry in self.tocList:
            basePath = os.path.dirname(entry.name)
            if basePath != '':
                # Check if path exists, create if not
                if not os.path.exists(basePath):
                    os.makedirs(basePath)
            self.fPtr.seek(entry.position, os.SEEK_SET)
            data = self.fPtr.read(entry.cmprsdDataSize)
            if entry.cmprsFlag == 1:
                data = zlib.decompress(data)
                # Malware may tamper with the uncompressed size
                # Comment out the assertion in such a case
                assert len(data) == entry.uncmprsdDataSize # Sanity Check
            with open(entry.name, 'wb') as f:
                f.write(data)
            if entry.typeCmprsData == b'z':
                self._extractPyz(entry.name)

    def _extractPyz(self, name):
        dirName = name + '_extracted'
        # Create a directory for the contents of the pyz
        if not os.path.exists(dirName):
            os.mkdir(dirName)
        with open(name, 'rb') as f:
            pyzMagic = f.read(4)
            assert pyzMagic == b'PYZ\0' # Sanity Check
            pycHeader = f.read(4) # Python magic value
            if imp.get_magic() != pycHeader:
                print('[!] Warning: The script is running in a different python version than the one used to build the executable')
                print(' Run this script in Python{0} to prevent extraction errors(if any) during unmarshalling'.format(self.pyver))
            (tocPosition, ) = struct.unpack('!i', f.read(4))
            f.seek(tocPosition, os.SEEK_SET)
            try:
                toc = marshal.load(f)
            except:
                print('[!] Unmarshalling FAILED. Cannot extract {0}. Extracting remaining files.'.format(name))
                return
            print('[*] Found {0} files in PYZ archive'.format(len(toc)))
            # From pyinstaller 3.1+ toc is a list of tuples
            if type(toc) == list:
                toc = dict(toc)
            for key in toc.keys():
                (ispkg, pos, length) = toc[key]
                f.seek(pos, os.SEEK_SET)
                fileName = key
                try:
                    # for Python > 3.3 some keys are bytes object some are str object
                    fileName = key.decode('utf-8')
                except:
                    pass
                # Make sure destination directory exists, ensuring we keep inside dirName
                destName = os.path.join(dirName, fileName.replace("..", "__"))
                destDirName = os.path.dirname(destName)
                if not os.path.exists(destDirName):
                    os.makedirs(destDirName)
                try:
                    data = f.read(length)
                    data = zlib.decompress(data)
                except:
                    print('[!] Error: Failed to decompress {0}, probably encrypted. Extracting as is.'.format(fileName))
                    open(destName + '.pyc.encrypted', 'wb').write(data)
                    continue
                with open(destName + '.pyc', 'wb') as pycFile:
                    pycFile.write(pycHeader) # Write pyc magic
                    pycFile.write(b'\0' * 4) # Write timestamp
                    if self.pyver >= 33:
                        pycFile.write(b'\0' * 4) # Size parameter added in Python 3.3
                    pycFile.write(data)

def main():
    if len(sys.argv) < 2:
        print('[*] Usage: pyinstxtractor.py <filename>')
    else:
        arch = PyInstArchive(sys.argv[1])
        if arch.open():
            if arch.checkFile():
                if arch.getCArchiveInfo():
                    arch.parseTOC()
                    arch.extractFiles()
                    arch.close()
                    print('[*] Successfully extracted pyinstaller archive: {0}'.format(sys.argv[1]))
                    print('')
                    print('You can now use a python decompiler on the pyc files within the extracted directory')
                    return
            arch.close()

if __name__ == '__main__':
    main()
Source of archive_viewer.py:
#-----------------------------------------------------------------------------
# Copyright (c) 2013-2021, PyInstaller Development Team.
#
# Distributed under the terms of the GNU General Public License (version 2
# or later) with exception for distributing the bootloader.
#
# The full license is in the file COPYING.txt, distributed with this software.
#
# SPDX-License-Identifier: (GPL-2.0-or-later WITH Bootloader-exception)
#-----------------------------------------------------------------------------
"""
Viewer for archives packaged by archive.py
"""
import argparse
import os
import pprint
import sys
import tempfile
import zlib
from PyInstaller.loader import pyimod02_archive
from PyInstaller.archive.readers import CArchiveReader, NotAnArchiveError
from PyInstaller.compat import stdin_input
import PyInstaller.log
stack = []
cleanup = []
def main(name, brief, debug, rec_debug, **unused_options):
    global stack
    if not os.path.isfile(name):
        print(name, "is an invalid file name!", file=sys.stderr)
        return 1
    arch = get_archive(name)
    stack.append((name, arch))
    if debug or brief:
        show_log(arch, rec_debug, brief)
        raise SystemExit(0)
    else:
        show(name, arch)
    while 1:
        try:
            toks = stdin_input('? ').split(None, 1)
        except EOFError:
            # Ctrl-D
            print(file=sys.stderr) # Clear line.
            break
        if not toks:
            usage()
            continue
        if len(toks) == 1:
            cmd = toks[0]
            arg = ''
        else:
            cmd, arg = toks
        cmd = cmd.upper()
        if cmd == 'U':
            if len(stack) > 1:
                arch = stack[-1][1]
                arch.lib.close()
                del stack[-1]
            name, arch = stack[-1]
            show(name, arch)
        elif cmd == 'O':
            if not arg:
                arg = stdin_input('open name? ')
            arg = arg.strip()
            try:
                arch = get_archive(arg)
            except NotAnArchiveError as e:
                print(e, file=sys.stderr)
                continue
            if arch is None:
                print(arg, "not found", file=sys.stderr)
                continue
            stack.append((arg, arch))
            show(arg, arch)
        elif cmd == 'X':
            if not arg:
                arg = stdin_input('extract name? ')
            arg = arg.strip()
            data = get_data(arg, arch)
            if data is None:
                print("Not found", file=sys.stderr)
                continue
            filename = stdin_input('to filename? ')
            if not filename:
                print(repr(data))
            else:
                with open(filename, 'wb') as fp:
                    fp.write(data)
        elif cmd == 'Q':
            break
        else:
            usage()
    do_cleanup()

def do_cleanup():
    global stack, cleanup
    for (name, arch) in stack:
        arch.lib.close()
    stack = []
    for filename in cleanup:
        try:
            os.remove(filename)
        except Exception as e:
            print("couldn't delete", filename, e.args, file=sys.stderr)
    cleanup = []

def usage():
    print("U: go Up one level", file=sys.stderr)
    print("O <name>: open embedded archive name", file=sys.stderr)
    print("X <name>: extract name", file=sys.stderr)
    print("Q: quit", file=sys.stderr)

def get_archive(name):
    if not stack:
        if name[-4:].lower() == '.pyz':
            return ZlibArchive(name)
        return CArchiveReader(name)
    parent = stack[-1][1]
    try:
        return parent.openEmbedded(name)
    except KeyError:
        return None
    except (ValueError, RuntimeError):
        ndx = parent.toc.find(name)
        dpos, dlen, ulen, flag, typcd, name = parent.toc[ndx]
        x, data = parent.extract(ndx)
        tempfilename = tempfile.mktemp()
        cleanup.append(tempfilename)
        with open(tempfilename, 'wb') as fp:
            fp.write(data)
        if typcd == 'z':
            return ZlibArchive(tempfilename)
        else:
            return CArchiveReader(tempfilename)

def get_data(name, arch):
    if isinstance(arch.toc, dict):
        (ispkg, pos, length) = arch.toc.get(name, (0, None, 0))
        if pos is None:
            return None
        with arch.lib:
            arch.lib.seek(arch.start + pos)
            return zlib.decompress(arch.lib.read(length))
    ndx = arch.toc.find(name)
    dpos, dlen, ulen, flag, typcd, name = arch.toc[ndx]
    x, data = arch.extract(ndx)
    return data

def show(name, arch):
    if isinstance(arch.toc, dict):
        print(" Name: (ispkg, pos, len)")
        toc = arch.toc
    else:
        print(" pos, length, uncompressed, iscompressed, type, name")
        toc = arch.toc.data
    pprint.pprint(toc)

def get_content(arch, recursive, brief, output):
    if isinstance(arch.toc, dict):
        toc = arch.toc
        if brief:
            for name, _ in toc.items():
                output.append(name)
        else:
            output.append(toc)
    else:
        toc = arch.toc.data
        for el in toc:
            if brief:
                output.append(el[5])
            else:
                output.append(el)
            if recursive:
                if el[4] in ('z', 'a'):
                    get_content(get_archive(el[5]), recursive, brief, output)
                    stack.pop()

def show_log(arch, recursive, brief):
    output = []
    get_content(arch, recursive, brief, output)
    # first print all TOCs
    for out in output:
        if isinstance(out, dict):
            pprint.pprint(out)
    # then print the other entries
    pprint.pprint([out for out in output if not isinstance(out, dict)])

def get_archive_content(filename):
    """
    Get a list of the (recursive) content of archive `filename`.
    This function is primary meant to be used by runtests.
    """
    archive = get_archive(filename)
    stack.append((filename, archive))
    output = []
    get_content(archive, recursive=True, brief=True, output=output)
    do_cleanup()
    return output

class ZlibArchive(pyimod02_archive.ZlibArchiveReader):
    def checkmagic(self):
        """ Overridable.
        Check to see if the file object self.lib actually has a file
        we understand.
        """
        self.lib.seek(self.start) # default - magic is at start of file.
        if self.lib.read(len(self.MAGIC)) != self.MAGIC:
            raise RuntimeError("%s is not a valid %s archive file"
                               % (self.path, self.__class__.__name__))
        if self.lib.read(len(self.pymagic)) != self.pymagic:
            print("Warning: pyz is from a different Python version",
                  file=sys.stderr)
        self.lib.read(4)

def run():
    parser = argparse.ArgumentParser()
    parser.add_argument('-l', '--log',
                        default=False,
                        action='store_true',
                        dest='debug',
                        help='Print an archive log (default: %(default)s)')
    parser.add_argument('-r', '--recursive',
                        default=False,
                        action='store_true',
                        dest='rec_debug',
                        help='Recursively print an archive log (default: %(default)s). '
                        'Can be combined with -r')
    parser.add_argument('-b', '--brief',
                        default=False,
                        action='store_true',
                        dest='brief',
                        help='Print only file name. (default: %(default)s). '
                        'Can be combined with -r')
    PyInstaller.log.__add_options(parser)
    parser.add_argument('name', metavar='pyi_archive',
                        help="pyinstaller archive to show content of")
    args = parser.parse_args()
    PyInstaller.log.__process_options(parser, args)
    try:
        raise SystemExit(main(**vars(args)))
    except KeyboardInterrupt:
        raise SystemExit("Aborted by user request.")

if __name__ == '__main__':
    run()
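With the Python source in hand, the standard-library py_compile module compiles a .py file into a .pyc file, which offers the source some protection. With the tools above, however, restoring a .pyc or .exe to .py becomes far easier.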
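As a minimal sketch of the compile step mentioned above, the snippet below writes a one-line script to a temporary directory and compiles it with py_compile. The file names hello.py and hello.pyc are hypothetical and used only for illustration.

```python
import pathlib
import py_compile
import tempfile

# Hypothetical file names, used only for this illustration.
workdir = pathlib.Path(tempfile.mkdtemp())
source = workdir / "hello.py"
source.write_text("print('Hello, world')\n", encoding="utf-8")

# cfile chooses the output path explicitly; without it the .pyc
# is written under a __pycache__ directory next to the source.
compiled = py_compile.compile(str(source), cfile=str(workdir / "hello.pyc"))
pyc_size = pathlib.Path(compiled).stat().st_size
print(compiled, pyc_size, "bytes")
```

The resulting file starts with the header described later in this article, followed by the marshalled code object.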
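Run the following command in cmd:
python pyinstxtractor.py dk.exe
This produces the first-stage extracted files.
When it finishes, a dk.exe_extracted folder appears. The hope is that the tool yields usable .pyc files, but the folder does not contain them directly: most of the files are libraries of no interest here. The files worth attention are the few without an extension.
This version of pyinstxtractor.py has a shortcoming: it does not write these files in the correct format, so they must be repaired by hand. Simply renaming the dk file (adding a .pyc suffix) is not enough for decompilation, because the extension is not the real problem. What matters is the magic number at the head of the dk file, and each Python version has its own magic number.
Since the .pyc file is manipulated at the binary level, its structure needs to be examined.
The struct file contains the header of a valid .pyc file, whereas the dk file is a .pyc with its header missing. Copying the first row of struct (as displayed in a hex editor, typically 16 bytes, which is the header length used by Python 3.7 and later) to the front of the dk file is enough.
This gives a valid dk_true.pyc file.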
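The manual hex-editor step can also be scripted. The sketch below is an illustration under one loud assumption: the header is 16 bytes long, which holds for Python 3.7 and later; Python 3.3 to 3.6 would need 12 and Python 2 would need 8. The helper names restore_pyc and restore_pyc_file are hypothetical.

```python
from pathlib import Path

HEADER_SIZE = 16  # assumption: Python 3.7+ layout (magic, flags, mtime, size)

def restore_pyc(headerless: bytes, reference: bytes,
                header_size: int = HEADER_SIZE) -> bytes:
    """Prefix the headerless payload with the header of a reference file."""
    if len(reference) < header_size:
        raise ValueError("reference file is shorter than the header")
    return reference[:header_size] + headerless

def restore_pyc_file(body_path, reference_path, out_path,
                     header_size: int = HEADER_SIZE):
    """File-based wrapper around restore_pyc; returns the output path."""
    data = restore_pyc(Path(body_path).read_bytes(),
                       Path(reference_path).read_bytes(),
                       header_size)
    Path(out_path).write_bytes(data)
    return out_path

# Usage inside dk.exe_extracted, with the file names from this article:
# restore_pyc_file("dk", "struct", "dk_true.pyc")
```

Borrowing the header from struct works because both files were compiled by the same interpreter, so the magic number matches.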
Now uncompyle6 can be used. Run the following command in cmd:
uncompyle6 -o dk_true.py dk_true.pyc
This yields the source file.
The recovered source code:
import tkinter as tk, tkinter.messagebox, pickle, requests, re, json
session = requests.Session()
def gui():
    window = tk.Tk()
    window.title('便捷化打卡系统')
    screenWidth = window.winfo_screenwidth()
    screenHeight = window.winfo_screenheight()
    width = 300
    height = 200
    left = (screenWidth - width) / 2
    top = (screenHeight - height) / 2
    window.geometry('%dx%d+%d+%d' % (width, height, left, top))
    tk.Label(window, text='学号', font=('Arial', 14)).place(x=10, y=10)
    tk.Label(window, text='密码', font=('Arial', 14)).place(x=10, y=80)
    var_usr_name = tk.StringVar()
    entry_usr_name = tk.Entry(window, textvariable=var_usr_name, font=('Arial', 14))
    entry_usr_name.place(x=60, y=10)
    var_usr_pwd = tk.StringVar()
    entry_usr_pwd = tk.Entry(window, textvariable=var_usr_pwd, font=('Arial', 14), show='*')
    entry_usr_pwd.place(x=60, y=80)

    def usr_login():
        global password
        global username
        username = var_usr_name.get()
        password = var_usr_pwd.get()
        r = check()
        if r['m'] == '操作成功':
            json_data = post()
            if '今天已经填报了' in json_data['m']:
                tkinter.messagebox.showinfo(title='打卡系统', message='已经填报过了噢!')
            elif '操作成功' in json_data['m']:
                tkinter.messagebox.showerror(title='打卡系统', message='今日填报成功!')
        else:
            tkinter.messagebox.showerror(title='打卡系统', message='账户/密码有误')

    btn_login = tk.Button(window, text='打卡', command=usr_login)
    btn_login.place(x=110, y=150)
    window.mainloop()

def check():
    url = 'https://wfw.scu.edu.cn/a_scu/api/sso/check'
    data = {'username':username,
            'password':password,
            'redirect':'https://wfw.scu.edu.cn/ncov/wap/default/index'}
    header = {'Referer':'https://wfw.scu.edu.cn/site/polymerization/polymerizationLogin?redirect=https%3A%2F%2Fwfw.scu.edu.cn%2Fncov%2Fwap%2Fdefault%2Findex&from=wap',
              'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3754.400 QQBrowser/10.5.4034.400',
              'Host':'wfw.scu.edu.cn',
              'Origin':'https://wfw.scu.edu.cn'}
    r = session.post(url, data=data, headers=header, timeout=3).json()
    return r

def data_get() -> dict:
    url_for_id = 'https://wfw.scu.edu.cn/ncov/wap/default/index'
    header = {'Referer':'https://wfw.scu.edu.cn/site/polymerization/polymerizationLogin?redirect=https%3A%2F%2Fwfw.scu.edu.cn%2Fncov%2Fwap%2Fdefault%2Findex&from=wap',
              'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3754.400 QQBrowser/10.5.4034.400',
              'Host':'wfw.scu.edu.cn',
              'Origin':'https://wfw.scu.edu.cn'}
    r2 = session.get(url_for_id, headers=header).text
    x = re.findall('.*?oldInfo: (.*),.*?', r2)
    data = eval(x[0])
    return data

def post() -> json:
    headers = {'Host':'wfw.scu.edu.cn',
               'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3754.400 QQBrowser/10.5.4034.400',
               'Accept':'application/json,text/javascript,*/*;q=0.01',
               'Accept-Language':'zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2',
               'Accept-Encoding':'gzip,deflate,br',
               'Content-Type':'application/x-www-form-urlencoded;',
               'X-Requested-With':'XMLHttpRequest',
               'Content-Length':'2082',
               'Origin':'https://wfw.scu.edu.cn',
               'Connection':'keep-alive',
               'Referer':'https://wfw.scu.edu.cn/ncov/wap/default/index'}
    data = data_get()
    r1 = session.post('https://wfw.scu.edu.cn/ncov/wap/default/save', headers=headers, data=data)
    return r1.json()

if __name__ == '__main__':
    gui()
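A .pyc file is an intermediate file produced when Python compiles source code. It is binary and can be executed directly by the Python virtual machine, which makes it central to both compiling and decompiling Python.
The result of compiling Python code is a PyCodeObject. A PyCodeObject can be loaded by the virtual machine and run directly, and the .pyc file is the form in which the PyCodeObject is saved on disk.
A .pyc file is the combination of a header and a PyCodeObject. It contains, in order, the magic number, the source file information, and the serialized PyCodeObject.
The most notable point about the format is that each Python version writes a different magic number. Taking Python 2.7 as an example, a 2-byte value is written in little-endian order and followed by "\r\n", giving the 4-byte .pyc magic number: MAGIC_NUMBER = (62211).to_bytes(2, 'little') + b'\r\n'
The first 32 bytes of a .pyc generated by Python 2.7 (the first 4 bytes are 03f3 0d0a):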
00000000: 03f3 0d0a b9c7 895e 6300 0000 0000 0000
00000010: 0003 0000 0040 0000 0073 1f00 0000 6400
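The magic number formula can be checked directly. In the sketch below, 62211 is the Python 2.7 value quoted above, while 3379 and 3394 are the values for Python 3.6 and 3.7 (they match the 330d and 420d prefixes of the dumps shown later in this article). The helper name pyc_magic is hypothetical.

```python
import importlib.util

def pyc_magic(version_number: int) -> bytes:
    """Build the 4-byte magic: 2-byte little-endian number plus CR LF."""
    return version_number.to_bytes(2, "little") + b"\r\n"

KNOWN = {"2.7": 62211, "3.6": 3379, "3.7": 3394}
for version, number in KNOWN.items():
    print(version, pyc_magic(number).hex())

# The magic number of the interpreter running this snippet.
print("current:", importlib.util.MAGIC_NUMBER.hex())
```

Comparing the first 4 bytes of an unknown .pyc against such a table is a quick way to guess which interpreter produced it.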
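The source file information that follows the magic number varies considerably between Python versions. In the Python 2 series it holds only 4 bytes: the modification time of the source file as a Unix timestamp with one-second precision, written in little-endian order. For example, (1586087865).to_bytes(4, 'little').hex() gives b9c7895e.
Later versions such as Python 3.5 and 3.6 (the size field was introduced in Python 3.3) append 4 more bytes after the time to hold the size of the source file in bytes, also in little-endian order. For an 87-byte source file this field is 5700 0000, and together with the modification time the section reads b9c7 895e 5700 0000. The first 32 bytes of a .pyc generated by Python 3.6: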
00000000: 330d 0d0a b9c7 895e 5700 0000 e300 0000
00000010: 0000 0000 0000 0000 0003 0000 0040 0000
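Starting with Python 3.7, hash-based .pyc files are supported: besides checking the timestamp to decide whether the source has changed, Python can also check a hash. To support this, the source information section gains 4 more bytes, so it now takes 12 bytes in total. Hash checking is off by default (it can be enabled by calling the compile function of py_compile with invalidation_mode=PycInvalidationMode.CHECKED_HASH). When it is off, the first 4 bytes are 0000 0000 and the following 8 bytes are the source modification time and size, as in earlier versions such as Python 3.6. When it is on, the first 4 bytes are 0100 0000 or 0300 0000 and the following 8 bytes are the hash of the source file.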
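The 16-byte header layout described above can be decoded with struct. This is a sketch for the Python 3.7+ layout only; the function name parse_pyc_header is hypothetical, and the sample bytes are the header values used in this article (magic 420d 0d0a, timestamp b9c7 895e, size 5700 0000).

```python
import struct
from datetime import datetime, timezone

def parse_pyc_header(header: bytes) -> dict:
    """Decode a 16-byte Python 3.7+ .pyc header."""
    magic, flags = struct.unpack("<4sI", header[:8])
    info = {"magic": magic.hex(), "flags": flags}
    if flags & 0x01:
        # Hash-based .pyc: bit 1 says whether the source is re-checked.
        info["check_source"] = bool(flags & 0x02)
        info["source_hash"] = header[8:16].hex()
    else:
        # Timestamp-based .pyc: mtime and size, both little-endian.
        mtime, size = struct.unpack("<II", header[8:16])
        info["mtime"] = datetime.fromtimestamp(mtime, tz=timezone.utc).isoformat()
        info["source_size"] = size
    return info

sample = bytes.fromhex("420d0d0a 00000000 b9c7895e 57000000")
info = parse_pyc_header(sample)
print(info)
```

For a real file, pass the first 16 bytes read in binary mode.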
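PyCodeObject is a struct defined in Include/code.h of the Python source. Its data is serialized with Python's marshal module and stored in the .pyc file. Because the contents of PyCodeObject differ between versions, .pyc files produced by different Python versions are not fully interchangeable.
The marshal module implements serialization for a number of basic Python objects (PyObject). When a PyObject is serialized, one byte is written first to identify its type. The type codes are listed below (PyCodeObject corresponds to TYPE_CODE, so its first byte is 0x63, the character 'c'):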
// Python/marshal.c
// ......
#define TYPE_NULL '0'
#define TYPE_NONE 'N'
#define TYPE_FALSE 'F'
#define TYPE_TRUE 'T'
#define TYPE_STOPITER 'S'
#define TYPE_ELLIPSIS '.'
#define TYPE_INT 'i'
/* TYPE_INT64 is not generated anymore.
Supported for backward compatibility only. */
#define TYPE_INT64 'I'
#define TYPE_FLOAT 'f'
#define TYPE_BINARY_FLOAT 'g'
#define TYPE_COMPLEX 'x'
#define TYPE_BINARY_COMPLEX 'y'
#define TYPE_LONG 'l'
#define TYPE_STRING 's'
#define TYPE_INTERNED 't'
#define TYPE_REF 'r'
#define TYPE_TUPLE '('
#define TYPE_LIST '['
#define TYPE_DICT '{'
#define TYPE_CODE 'c'
#define TYPE_UNICODE 'u'
#define TYPE_UNKNOWN '?'
#define TYPE_SET '<'
#define TYPE_FROZENSET '>'
#define FLAG_REF '\x80' /* with a type, add obj to index */
// The following are supported in Python 3.5 and later
#define TYPE_ASCII 'a'
#define TYPE_ASCII_INTERNED 'A'
#define TYPE_SMALL_TUPLE ')'
#define TYPE_SHORT_ASCII 'z'
#define TYPE_SHORT_ASCII_INTERNED 'Z'
// ......
The first 32 bytes of a .pyc generated by Python 3.7:
00000000: 420d 0d0a 0000 0000 b9c7 895e 5700 0000
00000010: e300 0000 0000 0000 0000 0000 0003 0000
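The 17th byte (the first byte of the PyCodeObject) is 0xe3. This is because the first byte of a serialized PyObject can also carry a flag (#define FLAG_REF '\x80'), so the byte is 0x63 | 0x80 = 0xe3. FLAG_REF means the object is added to a reference list; when the same object appears again it is not serialized a second time but fetched through TYPE_REF, which can be seen as an optimization of Python's serialization.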
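The flag arithmetic is easy to verify. The sketch below splits a marshal type byte into its type character and the FLAG_REF bit; split_type_byte is a hypothetical helper. Whether the flag is set on a freshly marshalled object depends on the interpreter, so only the type character should be relied on.

```python
import marshal

FLAG_REF = 0x80  # same value as the FLAG_REF define in Python/marshal.c

def split_type_byte(first_byte: int):
    """Return (type character, whether FLAG_REF is set) for a type byte."""
    return chr(first_byte & 0x7F), bool(first_byte & FLAG_REF)

# The byte seen in the Python 3.7 dump above.
print(split_type_byte(0xE3))

# The same check on a code object marshalled by the running interpreter.
code = compile("x = 1", "<demo>", "exec")
type_char, has_ref = split_type_byte(marshal.dumps(code)[0])
print(type_char, has_ref)
```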
In general a PyCodeObject has the following fields and data types:
/* Bytecode object */
typedef struct {
    PyObject_HEAD
    int co_argcount;            /* #arguments, except *args */
    int co_posonlyargcount;     /* #positional only arguments */
    int co_kwonlyargcount;      /* #keyword only arguments */
    int co_nlocals;             /* #local variables */
    int co_stacksize;           /* #entries needed for evaluation stack */
    int co_flags;               /* CO_..., see below */
    int co_firstlineno;         /* first source line number */
    PyObject *co_code;          /* instruction opcodes */
    PyObject *co_consts;        /* list (constants used) */
    PyObject *co_names;         /* list of strings (names used) */
    PyObject *co_varnames;      /* tuple of strings (local variable names) */
    PyObject *co_freevars;      /* tuple of strings (free variable names) */
    PyObject *co_cellvars;      /* tuple of strings (cell variable names) */
    /* The rest aren't used in either hash or comparisons, except for co_name,
       used in both. This is done to preserve the name and line number
       for tracebacks and debuggers; otherwise, constant de-duplication
       would collapse identical functions/lambdas defined on different lines.
    */
    Py_ssize_t *co_cell2arg;    /* Maps cell vars which are arguments. */
    PyObject *co_filename;      /* unicode (where it was loaded from) */
    PyObject *co_name;          /* unicode (name, for reference) */
    PyObject *co_lnotab;        /* string (encoding addr<->lineno mapping) See
                                   Objects/lnotab_notes.txt for details. */
    // ......
} PyCodeObject;
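Every field plays a role when the virtual machine executes a .pyc, but not all of them have to be written to the file. The part of marshal that serializes a PyCodeObject is shown below (the excerpt predates Python 3.8, so it does not write co_posonlyargcount):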
// ......
else if (PyCode_Check(v)) {
    PyCodeObject *co = (PyCodeObject *)v;
    W_TYPE(TYPE_CODE, p);
    w_long(co->co_argcount, p);
    w_long(co->co_kwonlyargcount, p);
    w_long(co->co_nlocals, p);
    w_long(co->co_stacksize, p);
    w_long(co->co_flags, p);
    w_object(co->co_code, p);
    w_object(co->co_consts, p);
    w_object(co->co_names, p);
    w_object(co->co_varnames, p);
    w_object(co->co_freevars, p);
    w_object(co->co_cellvars, p);
    w_object(co->co_filename, p);
    w_object(co->co_name, p);
    w_long(co->co_firstlineno, p);
    w_object(co->co_lnotab, p);
}
// ......
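Python uses marshal.dump to turn a PyCodeObject into the corresponding binary layout. Each field appears in the binary file as follows:
Field                 Type    Description
TYPE_CODE             byte    marks a PyCodeObject
co_argcount           long    these fields map to the members of the PyCodeObject struct
co_nlocals            long
co_stacksize          long
co_flags              long
TYPE_STRING           byte    string marker; this string is the PyCodeObject's co_code
co_code size          long
co_code value         bytes
TYPE_LIST             byte    marks a list
co_consts size        long    number of elements in the co_consts list
TYPE_INT              byte    co_consts[0] is an integer
co_consts[0]          long
TYPE_STRING           byte    co_consts[1] is a string
co_consts[1] size     long
co_consts[1] value    bytes
TYPE_CODE             byte    co_consts[2] is another PyCodeObject; its code may be a function or a class
co_consts[2]          …
Here byte means the field occupies exactly 1 byte, long means 4 bytes, and bytes means one or more bytes. Note that every field of the PyCodeObject and its value appear in the binary file in a fixed order.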
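The type-byte, size, value pattern in the layout above can be observed on small objects. The sketch below asks marshal for format version 2, an assumption made so that FLAG_REF (added in later format versions) does not alter the type bytes; dump_v2 is a hypothetical helper.

```python
import marshal

def dump_v2(obj) -> bytes:
    """Serialize with marshal format version 2, which predates FLAG_REF."""
    return marshal.dumps(obj, 2)

for obj in (7, b"ab", [7, b"ab"]):
    raw = dump_v2(obj)
    # First byte is the type code; the rest is size and/or value.
    print(repr(obj), chr(raw[0]), raw[1:].hex())
```

The integer is written as 'i' plus a 4-byte value, the byte string as 's' plus a 4-byte size and the data, and the list as '[' plus a 4-byte element count followed by each element in turn.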
Python's opcodes determine the execution flow of a program. They are stored in co_code of the PyCodeObject as a PyObject of type TYPE_STRING.
The opcode sequence inside a Python 3.7 .pyc:
00000000: 420d 0d0a 0000 0000 b9c7 895e 5700 0000
00000010: e300 0000 0000 0000 0000 0000 0003 0000
00000020: 0040 0000 0073 1e00 0000 6500 6400 8301
00000030: 0100 6401 6402 8400 5a01 6501 6403 6404
00000040: 8302 0100 6405 5300 2906 7a0c 4865 6c6c
00000050: 6f2c 2077 6f72 6c64 6302 0000 0000 0000
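Offsets 0x2a-0x47 hold the serialized opcode sequence (from 6500 6400 through 6405 5300). The byte at offset 0x25, 0x73, is TYPE_STRING, and the 4 bytes at offsets 0x26-0x29 give the length of the object: 1e00 0000 is 30 in little-endian order.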
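These offsets can be confirmed against the dump itself. The sketch below rebuilds the 96 bytes shown above and slices out the opcode string; the offset 0x25 is taken from this particular dump and is not a general constant.

```python
import struct

ROWS = [
    "420d 0d0a 0000 0000 b9c7 895e 5700 0000",
    "e300 0000 0000 0000 0000 0000 0003 0000",
    "0040 0000 0073 1e00 0000 6500 6400 8301",
    "0100 6401 6402 8400 5a01 6501 6403 6404",
    "8302 0100 6405 5300 2906 7a0c 4865 6c6c",
    "6f2c 2077 6f72 6c64 6302 0000 0000 0000",
]
dump = bytes.fromhex(" ".join(ROWS))

TYPE_STRING = ord("s")  # 0x73
type_offset = 0x25      # position of the TYPE_STRING byte in this dump
if dump[type_offset] != TYPE_STRING:
    raise ValueError("expected TYPE_STRING at offset 0x25")

# 4-byte little-endian length, then the opcode bytes themselves.
(length,) = struct.unpack_from("<I", dump, type_offset + 1)
start = type_offset + 5
opcodes = dump[start:start + length]
print(hex(start), hex(start + length - 1), length)
print(opcodes.hex())
```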
Include/opcode.h in the Python source defines the opcodes. HAVE_ARGUMENT is the dividing line: every opcode greater than or equal to HAVE_ARGUMENT takes exactly one argument, and every opcode below it takes none.
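The dis module exposes the same boundary as dis.HAVE_ARGUMENT, so the rule can be checked from Python. The numeric values printed below depend on the interpreter version; takes_argument is a hypothetical helper.

```python
import dis

def takes_argument(opname: str) -> bool:
    """True when the opcode number is at or above HAVE_ARGUMENT."""
    return dis.opmap[opname] >= dis.HAVE_ARGUMENT

print("HAVE_ARGUMENT =", dis.HAVE_ARGUMENT)
for name in ("POP_TOP", "LOAD_CONST"):
    print(name, dis.opmap[name], takes_argument(name))
```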
CPython implementation detail: Bytecode is an implementation detail of the CPython interpreter. No guarantees are made that bytecode will not be added, removed, or changed between versions of Python. Use of this module should not be considered to work across Python VMs or Python releases.
Changed in version 3.6: Use 2 bytes for each instruction. Previously the number of bytes varied by instruction.
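Python does not guarantee opcode compatibility between versions, which is one more reason .pyc files are not compatible across Python versions.
Python 3.6 brought a major change: every instruction is 2 bytes long whether or not the opcode takes an argument. The opcode occupies 1 byte; if it takes an argument, the other byte holds it, and if it does not, the other byte is ignored and is normally 0x00. An opcode therefore has at most one argument, usually an index or an offset.
Before Python 3.6, an instruction with an argument was 3 bytes long: opcode, argv_low, argv_high. The opcode takes 1 byte and the argument 2 bytes, again in little-endian order. For example, the Python 2.7 instruction 6401 00 means opcode LOAD_CONST with argument 1.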
LOAD_CONST(consti)Pushes co_consts[consti] onto the stack.
That is, it takes the object at index 1 of the co_consts tuple (counting from 0, so co_consts[1]) and pushes it onto the top of the stack.
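The two instruction layouts can be contrasted with a pair of small decoders. This is a simplified sketch: it hard-codes HAVE_ARGUMENT as 90 (its value in both Python 2.7 and 3.7) and ignores EXTENDED_ARG; the function names are hypothetical.

```python
HAVE_ARGUMENT = 90  # value in both Python 2.7 and Python 3.7

def decode_legacy(code: bytes):
    """Pre-3.6 layout: 1 byte, or 3 bytes (opcode, arg low, arg high)."""
    out, i = [], 0
    while i < len(code):
        op = code[i]
        if op >= HAVE_ARGUMENT:
            arg = code[i + 1] | (code[i + 2] << 8)  # little-endian argument
            out.append((i, op, arg))
            i += 3
        else:
            out.append((i, op, None))
            i += 1
    return out

def decode_wordcode(code: bytes):
    """3.6+ layout: always 2 bytes; the second byte is unused without an argument."""
    out = []
    for i in range(0, len(code), 2):
        op, arg = code[i], code[i + 1]
        out.append((i, op, arg if op >= HAVE_ARGUMENT else None))
    return out

# LOAD_CONST 1 followed by RETURN_VALUE in each layout.
print(decode_legacy(bytes.fromhex("640100 53")))
print(decode_wordcode(bytes.fromhex("6401 5300")))
```

Each tuple is (offset, opcode, argument), so both calls report LOAD_CONST (0x64) with argument 1, at different offsets for the following instruction.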
The standard-library dis and marshal modules help inspect the opcode sequence. Two classic versions, Python 2.7 and Python 3.7, serve as examples below.
Given this source:
print('Hello, world')

def fff(a,b):
    c = a + b
    return c & 0xffff

fff(34,67)
>>> import dis, marshal
>>> f=open('t.pyc', 'rb').read()
>>> co=marshal.loads(f[8:]) # In Python 2.7 the PyCodeObject starts at offset 8 of the .pyc file
>>> dis.dis(co)
  1           0 LOAD_CONST               0 ('Hello, world')
              3 PRINT_ITEM
              4 PRINT_NEWLINE

  3           5 LOAD_CONST               1 (<code object fff at 0x10a1c9630, file "t.py", line 3>)
              8 MAKE_FUNCTION            0
             11 STORE_NAME               0 (fff)

  7          14 LOAD_NAME                0 (fff)
             17 LOAD_CONST               2 (34)
             20 LOAD_CONST               3 (67)
             23 CALL_FUNCTION            2
             26 POP_TOP
             27 LOAD_CONST               4 (None)
             30 RETURN_VALUE
>>> co.co_names
('fff',)
>>> co.co_consts
('Hello, world', <code object fff at ..., file ".../t.py", line 3>, 34, 67, None)
Hex instruction (Python 2.7)   Line   Offset and instruction   Argument
64 00 00                       1      0 LOAD_CONST             0 ('Hello, world')
47                                    3 PRINT_ITEM
48                                    4 PRINT_NEWLINE
64 01 00                       3      5 LOAD_CONST             1 (<code object fff>)
84 00 00                              8 MAKE_FUNCTION          0
5a 00 00                              11 STORE_NAME            0 (fff)
65 00 00                       7      14 LOAD_NAME             0 (fff)
64 02 00                              17 LOAD_CONST            2 (34)
64 03 00                              20 LOAD_CONST            3 (67)
83 02 00                              23 CALL_FUNCTION         2
01                                    26 POP_TOP
64 04 00                              27 LOAD_CONST            4 (None)
53                                    30 RETURN_VALUE
>>> import dis, marshal
>>> f=open('t.pyc', 'rb').read()
>>> co=marshal.loads(f[16:]) # In Python 3.7 the PyCodeObject starts at offset 16 of the .pyc file
>>> dis.dis(co)
  1           0 LOAD_NAME                0 (print)
              2 LOAD_CONST               0 ('Hello, world')
              4 CALL_FUNCTION            1
              6 POP_TOP

  3           8 LOAD_CONST               1 (<code object fff at ..., line 3>)
             10 LOAD_CONST               2 ('fff')
             12 MAKE_FUNCTION            0
             14 STORE_NAME               1 (fff)

  7          16 LOAD_NAME                1 (fff)
             18 LOAD_CONST               3 (34)
             20 LOAD_CONST               4 (67)
             22 CALL_FUNCTION            2
             24 POP_TOP
             26 LOAD_CONST               5 (None)
             28 RETURN_VALUE
>>> co.co_names
('print', 'fff')
>>> co.co_name
'<module>'
>>> co.co_consts
('Hello, world', <code object fff at ..., file ".../t.py", line 3>, 'fff', 34, 67, None)
Hex instruction (Python 3.7)   Line   Offset and instruction   Argument
65 00                          1      0 LOAD_NAME              0 (print)
64 00                                 2 LOAD_CONST             0 ('Hello, world')
83 01                                 4 CALL_FUNCTION          1
01 00                                 6 POP_TOP
64 01                          3      8 LOAD_CONST             1 (<code object fff>)
64 02                                 10 LOAD_CONST            2 ('fff')
84 00                                 12 MAKE_FUNCTION         0
5a 01                                 14 STORE_NAME            1 (fff)
65 01                          7      16 LOAD_NAME             1 (fff)
64 03                                 18 LOAD_CONST            3 (34)
64 04                                 20 LOAD_CONST            4 (67)
83 02                                 22 CALL_FUNCTION         2
01 00                                 24 POP_TOP
64 05                                 26 LOAD_CONST            5 (None)
53 00                                 28 RETURN_VALUE
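The same inspection can be repeated on whatever interpreter is at hand without writing a .pyc at all. The sketch below compiles the sample source in memory and lists its instructions; the opcodes and constants printed will differ from the 2.7 and 3.7 listings above on newer Python versions.

```python
import dis

SOURCE = """print('Hello, world')

def fff(a, b):
    c = a + b
    return c & 0xffff

fff(34, 67)
"""

# compile() returns the module-level code object directly.
co = compile(SOURCE, "t.py", "exec")
for ins in dis.get_instructions(co):
    print(ins.offset, ins.opname, ins.arg, ins.argrepr)
print(co.co_names)
```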
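The main difficulty in handling .pyc files lies in version differences and in their structure and internal logic. The .exe handled here was a bare program with no protection and required no deobfuscation, so the result came easily. When a .pyc has been obfuscated or otherwise protected, careful analysis is needed, recovering the source becomes much harder, and it may not be possible at all.
Recovering the source was fortunate, but the greater gain is a deeper understanding of the structure and role of .pyc files.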