Decompiling the dk.exe Auto-Reporting Program
Date: 2023-07-12

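dk.exe automatically fills in the school's daily health report.

It is evidently a small application produced by packaging a Python program. Only the .exe file was available at the outset, so the goal here is to recover the source code from it.

The available tools include pyinstxtractor.py, archive_viewer.py, and uncompyle6. The first two are single-file scripts that work directly on the .exe and unpack library files, .pyc files, or .pyc-like files. uncompyle6 decompiles bytecode in one step and covers a wide range of Python versions (its documentation lists bytecode from Python 1.0 through 3.8), though not every release.

Source of pyinstxtractor.py: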

#!/usr/bin/python

"""
PyInstaller Extractor v1.8 (Supports pyinstaller 3.2, 3.1, 3.0, 2.1, 2.0)
Author : Extreme Coders
E-mail : extremecoders(at)hotmail(dot)com
Web    : https://0xec.blogspot.com
Date   : 28-April-2017
Url    : https://sourceforge.net/projects/pyinstallerextractor/
For any suggestions, leave a comment on
https://forum.tuts4you.com/topic/34455-pyinstaller-extractor/
This script extracts a pyinstaller generated executable file.
Pyinstaller installation is not needed. The script has it all.
For best results, it is recommended to run this script in the
same version of python as was used to create the executable.
This is just to prevent unmarshalling errors(if any) while
extracting the PYZ archive.
Usage : Just copy this script to the directory where your exe resides
        and run the script with the exe file name as a parameter
C:\path\to\exe\>python pyinstxtractor.py <filename>
$ /path/to/exe/python pyinstxtractor.py <filename>
Licensed under GNU General Public License (GPL) v3.
You are free to modify this source.
CHANGELOG
================================================
Version 1.1 (Jan 28, 2014)
-------------------------------------------------
- First Release
- Supports only pyinstaller 2.0
Version 1.2 (Sept 12, 2015)
-------------------------------------------------
- Added support for pyinstaller 2.1 and 3.0 dev
- Cleaned up code
- Script is now more verbose
- Executable extracted within a dedicated sub-directory
(Support for pyinstaller 3.0 dev is experimental)
Version 1.3 (Dec 12, 2015)
-------------------------------------------------
- Added support for pyinstaller 3.0 final
- Script is compatible with both python 2.x & 3.x (Thanks to Moritz Kroll @ Avira Operations GmbH & Co. KG)
Version 1.4 (Jan 19, 2016)
-------------------------------------------------
- Fixed a bug when writing pyc files >= version 3.3 (Thanks to Daniello Alto: https://github.com/Djamana)
Version 1.5 (March 1, 2016)
-------------------------------------------------
- Added support for pyinstaller 3.1 (Thanks to Berwyn Hoyt for reporting)
Version 1.6 (Sept 5, 2016)
-------------------------------------------------
- Added support for pyinstaller 3.2
- Extractor will use a random name while extracting unnamed files.
- For encrypted pyz archives it will dump the contents as is. Previously, the tool would fail.
Version 1.7 (March 13, 2017)
-------------------------------------------------
- Made the script compatible with python 2.6 (Thanks to Ross for reporting)
Version 1.8 (April 28, 2017)
-------------------------------------------------
- Support for sub-directories in .pyz files (Thanks to Moritz Kroll @ Avira Operations GmbH & Co. KG)
"""

"""
Author: In Ming Loh
Email: inming.loh@countercept.com
Changes have been made to Version 1.8 (April 28, 2017).
CHANGELOG
================================================
- Function extractFiles(self, custom_dir=None) has been modfied to allow custom output directory.
"""

import os
import struct
import marshal
import zlib
import sys
import imp
import types
from uuid import uuid4 as uniquename

class CTOCEntry:
    def __init__(self, position, cmprsdDataSize, uncmprsdDataSize, cmprsFlag, typeCmprsData, name):
        self.position = position
        self.cmprsdDataSize = cmprsdDataSize
        self.uncmprsdDataSize = uncmprsdDataSize
        self.cmprsFlag = cmprsFlag
        self.typeCmprsData = typeCmprsData
        self.name = name

class PyInstArchive:
    PYINST20_COOKIE_SIZE = 24           # For pyinstaller 2.0
    PYINST21_COOKIE_SIZE = 24 + 64      # For pyinstaller 2.1+
    MAGIC = b'MEI\014\013\012\013\016'  # Magic number which identifies pyinstaller

    def __init__(self, path):
        self.filePath = path

    def open(self):
        try:
            self.fPtr = open(self.filePath, 'rb')
            self.fileSize = os.stat(self.filePath).st_size
        except:
            print('[*] Error: Could not open {0}'.format(self.filePath))
            return False
        return True

    def close(self):
        try:
            self.fPtr.close()
        except:
            pass

    def checkFile(self):
        print('[*] Processing {0}'.format(self.filePath))
        # Check if it is a 2.0 archive
        self.fPtr.seek(self.fileSize - self.PYINST20_COOKIE_SIZE, os.SEEK_SET)
        magicFromFile = self.fPtr.read(len(self.MAGIC))

        if magicFromFile == self.MAGIC:
            self.pyinstVer = 20     # pyinstaller 2.0
            print('[*] Pyinstaller version: 2.0')
            return True

        # Check for pyinstaller 2.1+ before bailing out
        self.fPtr.seek(self.fileSize - self.PYINST21_COOKIE_SIZE, os.SEEK_SET)
        magicFromFile = self.fPtr.read(len(self.MAGIC))

        if magicFromFile == self.MAGIC:
            print('[*] Pyinstaller version: 2.1+')
            self.pyinstVer = 21     # pyinstaller 2.1+
            return True

        print('[*] Error : Unsupported pyinstaller version or not a pyinstaller archive')
        return False

    def getCArchiveInfo(self):
        try:
            if self.pyinstVer == 20:
                self.fPtr.seek(self.fileSize - self.PYINST20_COOKIE_SIZE, os.SEEK_SET)

                # Read CArchive cookie
                (magic, lengthofPackage, toc, tocLen, self.pyver) = \
                struct.unpack('!8siiii', self.fPtr.read(self.PYINST20_COOKIE_SIZE))

            elif self.pyinstVer == 21:
                self.fPtr.seek(self.fileSize - self.PYINST21_COOKIE_SIZE, os.SEEK_SET)

                # Read CArchive cookie
                (magic, lengthofPackage, toc, tocLen, self.pyver, pylibname) = \
                struct.unpack('!8siiii64s', self.fPtr.read(self.PYINST21_COOKIE_SIZE))

        except:
            print('[*] Error : The file is not a pyinstaller archive')
            return False

        print('[*] Python version: {0}'.format(self.pyver))

        # Overlay is the data appended at the end of the PE
        self.overlaySize = lengthofPackage
        self.overlayPos = self.fileSize - self.overlaySize
        self.tableOfContentsPos = self.overlayPos + toc
        self.tableOfContentsSize = tocLen

        print('[*] Length of package: {0} bytes'.format(self.overlaySize))
        return True

    def parseTOC(self):
        # Go to the table of contents
        self.fPtr.seek(self.tableOfContentsPos, os.SEEK_SET)

        self.tocList = []
        parsedLen = 0

        # Parse table of contents
        while parsedLen < self.tableOfContentsSize:
            (entrySize, ) = struct.unpack('!i', self.fPtr.read(4))
            nameLen = struct.calcsize('!iiiiBc')

            (entryPos, cmprsdDataSize, uncmprsdDataSize, cmprsFlag, typeCmprsData, name) = \
            struct.unpack( \
                '!iiiBc{0}s'.format(entrySize - nameLen), \
                self.fPtr.read(entrySize - 4))

            name = name.decode('utf-8').rstrip('\0')
            if len(name) == 0:
                name = str(uniquename())
                print('[!] Warning: Found an unamed file in CArchive. Using random name {0}'.format(name))

            self.tocList.append( \
                                CTOCEntry(                      \
                                    self.overlayPos + entryPos, \
                                    cmprsdDataSize,             \
                                    uncmprsdDataSize,           \
                                    cmprsFlag,                  \
                                    typeCmprsData,              \
                                    name                        \
                                ))

            parsedLen += entrySize
        print('[*] Found {0} files in CArchive'.format(len(self.tocList)))

    def extractFiles(self, custom_dir=None):
        print('[*] Beginning extraction...please standby')
        if custom_dir is None:
            extractionDir = os.path.join(os.getcwd(), os.path.basename(self.filePath) + '_extracted')

            if not os.path.exists(extractionDir):
                os.mkdir(extractionDir)

            os.chdir(extractionDir)
        else:
            if not os.path.exists(custom_dir):
                os.makedirs(custom_dir)
            os.chdir(custom_dir)

        for entry in self.tocList:
            basePath = os.path.dirname(entry.name)
            if basePath != '':
                # Check if path exists, create if not
                if not os.path.exists(basePath):
                    os.makedirs(basePath)

            self.fPtr.seek(entry.position, os.SEEK_SET)
            data = self.fPtr.read(entry.cmprsdDataSize)

            if entry.cmprsFlag == 1:
                data = zlib.decompress(data)
                # Malware may tamper with the uncompressed size
                # Comment out the assertion in such a case
                assert len(data) == entry.uncmprsdDataSize # Sanity Check

            with open(entry.name, 'wb') as f:
                f.write(data)

            if entry.typeCmprsData == b'z':
                self._extractPyz(entry.name)

    def _extractPyz(self, name):
        dirName =  name + '_extracted'
        # Create a directory for the contents of the pyz
        if not os.path.exists(dirName):
            os.mkdir(dirName)

        with open(name, 'rb') as f:
            pyzMagic = f.read(4)
            assert pyzMagic == b'PYZ\0' # Sanity Check

            pycHeader = f.read(4) # Python magic value

            if imp.get_magic() != pycHeader:
                print('[!] Warning: The script is running in a different python version than the one used to build the executable')
                print('    Run this script in Python{0} to prevent extraction errors(if any) during unmarshalling'.format(self.pyver))

            (tocPosition, ) = struct.unpack('!i', f.read(4))
            f.seek(tocPosition, os.SEEK_SET)

            try:
                toc = marshal.load(f)
            except:
                print('[!] Unmarshalling FAILED. Cannot extract {0}. Extracting remaining files.'.format(name))
                return

            print('[*] Found {0} files in PYZ archive'.format(len(toc)))

            # From pyinstaller 3.1+ toc is a list of tuples
            if type(toc) == list:
                toc = dict(toc)

            for key in toc.keys():
                (ispkg, pos, length) = toc[key]
                f.seek(pos, os.SEEK_SET)

                fileName = key
                try:
                    # for Python > 3.3 some keys are bytes object some are str object
                    fileName = key.decode('utf-8')
                except:
                    pass

                # Make sure destination directory exists, ensuring we keep inside dirName
                destName = os.path.join(dirName, fileName.replace("..", "__"))
                destDirName = os.path.dirname(destName)
                if not os.path.exists(destDirName):
                    os.makedirs(destDirName)

                try:
                    data = f.read(length)
                    data = zlib.decompress(data)
                except:
                    print('[!] Error: Failed to decompress {0}, probably encrypted. Extracting as is.'.format(fileName))
                    open(destName + '.pyc.encrypted', 'wb').write(data)
                    continue

                with open(destName + '.pyc', 'wb') as pycFile:
                    pycFile.write(pycHeader)      # Write pyc magic
                    pycFile.write(b'\0' * 4)      # Write timestamp
                    if self.pyver >= 33:
                        pycFile.write(b'\0' * 4)  # Size parameter added in Python 3.3
                    pycFile.write(data)

def main():
    if len(sys.argv) < 2:
        print('[*] Usage: pyinstxtractor.py <filename>')

    else:
        arch = PyInstArchive(sys.argv[1])
        if arch.open():
            if arch.checkFile():
                if arch.getCArchiveInfo():
                    arch.parseTOC()
                    arch.extractFiles()
                    arch.close()
                    print('[*] Successfully extracted pyinstaller archive: {0}'.format(sys.argv[1]))
                    print('')
                    print('You can now use a python decompiler on the pyc files within the extracted directory')
                    return

            arch.close()

if __name__ == '__main__':
    main()

Source of archive_viewer.py:

#-----------------------------------------------------------------------------
# Copyright (c) 2013-2021, PyInstaller Development Team.
#
# Distributed under the terms of the GNU General Public License (version 2
# or later) with exception for distributing the bootloader.
#
# The full license is in the file COPYING.txt, distributed with this software.
#
# SPDX-License-Identifier: (GPL-2.0-or-later WITH Bootloader-exception)
#-----------------------------------------------------------------------------

"""
Viewer for archives packaged by archive.py
"""

import argparse
import os
import pprint
import sys
import tempfile
import zlib

from PyInstaller.loader import pyimod02_archive
from PyInstaller.archive.readers import CArchiveReader, NotAnArchiveError
from PyInstaller.compat import stdin_input
import PyInstaller.log

stack = []
cleanup = []

def main(name, brief, debug, rec_debug, **unused_options):

    global stack

    if not os.path.isfile(name):
        print(name, "is an invalid file name!", file=sys.stderr)
        return 1

    arch = get_archive(name)
    stack.append((name, arch))
    if debug or brief:
        show_log(arch, rec_debug, brief)
        raise SystemExit(0)
    else:
        show(name, arch)

    while 1:
        try:
            toks = stdin_input('? ').split(None, 1)
        except EOFError:
            # Ctrl-D
            print(file=sys.stderr)  # Clear line.
            break
        if not toks:
            usage()
            continue
        if len(toks) == 1:
            cmd = toks[0]
            arg = ''
        else:
            cmd, arg = toks
        cmd = cmd.upper()
        if cmd == 'U':
            if len(stack) > 1:
                arch = stack[-1][1]
                arch.lib.close()
                del stack[-1]
            name, arch = stack[-1]
            show(name, arch)
        elif cmd == 'O':
            if not arg:
                arg = stdin_input('open name? ')
            arg = arg.strip()
            try:
                arch = get_archive(arg)
            except NotAnArchiveError as e:
                print(e, file=sys.stderr)
                continue
            if arch is None:
                print(arg, "not found", file=sys.stderr)
                continue
            stack.append((arg, arch))
            show(arg, arch)
        elif cmd == 'X':
            if not arg:
                arg = stdin_input('extract name? ')
            arg = arg.strip()
            data = get_data(arg, arch)
            if data is None:
                print("Not found", file=sys.stderr)
                continue
            filename = stdin_input('to filename? ')
            if not filename:
                print(repr(data))
            else:
                with open(filename, 'wb') as fp:
                    fp.write(data)
        elif cmd == 'Q':
            break
        else:
            usage()
    do_cleanup()

def do_cleanup():
    global stack, cleanup
    for (name, arch) in stack:
        arch.lib.close()
    stack = []
    for filename in cleanup:
        try:
            os.remove(filename)
        except Exception as e:
            print("couldn't delete", filename, e.args, file=sys.stderr)
    cleanup = []

def usage():
    print("U: go Up one level", file=sys.stderr)
    print("O <name>: open embedded archive name", file=sys.stderr)
    print("X <name>: extract name", file=sys.stderr)
    print("Q: quit", file=sys.stderr)

def get_archive(name):
    if not stack:
        if name[-4:].lower() == '.pyz':
            return ZlibArchive(name)
        return CArchiveReader(name)
    parent = stack[-1][1]
    try:
        return parent.openEmbedded(name)
    except KeyError:
        return None
    except (ValueError, RuntimeError):
        ndx = parent.toc.find(name)
        dpos, dlen, ulen, flag, typcd, name = parent.toc[ndx]
        x, data = parent.extract(ndx)
        tempfilename = tempfile.mktemp()
        cleanup.append(tempfilename)
        with open(tempfilename, 'wb') as fp:
            fp.write(data)
        if typcd == 'z':
            return ZlibArchive(tempfilename)
        else:
            return CArchiveReader(tempfilename)

def get_data(name, arch):
    if isinstance(arch.toc, dict):
        (ispkg, pos, length) = arch.toc.get(name, (0, None, 0))
        if pos is None:
            return None
        with arch.lib:
            arch.lib.seek(arch.start + pos)
            return zlib.decompress(arch.lib.read(length))
    ndx = arch.toc.find(name)
    dpos, dlen, ulen, flag, typcd, name = arch.toc[ndx]
    x, data = arch.extract(ndx)
    return data

def show(name, arch):
    if isinstance(arch.toc, dict):
        print(" Name: (ispkg, pos, len)")
        toc = arch.toc
    else:
        print(" pos, length, uncompressed, iscompressed, type, name")
        toc = arch.toc.data
    pprint.pprint(toc)

def get_content(arch, recursive, brief, output):
    if isinstance(arch.toc, dict):
        toc = arch.toc
        if brief:
            for name, _ in toc.items():
                output.append(name)
        else:
            output.append(toc)
    else:
        toc = arch.toc.data
        for el in toc:
            if brief:
                output.append(el[5])
            else:
                output.append(el)
            if recursive:
                if el[4] in ('z', 'a'):
                    get_content(get_archive(el[5]), recursive, brief, output)
                    stack.pop()

def show_log(arch, recursive, brief):
    output = []
    get_content(arch, recursive, brief, output)
    # first print all TOCs
    for out in output:
        if isinstance(out, dict):
            pprint.pprint(out)
    # then print the other entries
    pprint.pprint([out for out in output if not isinstance(out, dict)])

def get_archive_content(filename):
    """
    Get a list of the (recursive) content of archive `filename`.
    This function is primary meant to be used by runtests.
    """
    archive = get_archive(filename)
    stack.append((filename, archive))
    output = []
    get_content(archive, recursive=True, brief=True, output=output)
    do_cleanup()
    return output

class ZlibArchive(pyimod02_archive.ZlibArchiveReader):

    def checkmagic(self):
        """ Overridable.
            Check to see if the file object self.lib actually has a file
            we understand.
        """
        self.lib.seek(self.start)  # default - magic is at start of file.
        if self.lib.read(len(self.MAGIC)) != self.MAGIC:
            raise RuntimeError("%s is not a valid %s archive file"
                               % (self.path, self.__class__.__name__))
        if self.lib.read(len(self.pymagic)) != self.pymagic:
            print("Warning: pyz is from a different Python version",
                  file=sys.stderr)
        self.lib.read(4)

def run():
    parser = argparse.ArgumentParser()
    parser.add_argument('-l', '--log',
                        default=False,
                        action='store_true',
                        dest='debug',
                        help='Print an archive log (default: %(default)s)')
    parser.add_argument('-r', '--recursive',
                        default=False,
                        action='store_true',
                        dest='rec_debug',
                        help='Recursively print an archive log (default: %(default)s). '
                        'Can be combined with -r')
    parser.add_argument('-b', '--brief',
                        default=False,
                        action='store_true',
                        dest='brief',
                        help='Print only file name. (default: %(default)s). '
                        'Can be combined with -r')
    PyInstaller.log.__add_options(parser)
    parser.add_argument('name', metavar='pyi_archive',
                        help="pyinstaller archive to show content of")

    args = parser.parse_args()
    PyInstaller.log.__process_options(parser, args)

    try:
        raise SystemExit(main(**vars(args)))
    except KeyboardInterrupt:
        raise SystemExit("Aborted by user request.")

if __name__ == '__main__':
    run()

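When the Python source is available, the built-in py_compile module compiles a .py file into a .pyc file, which offers some protection for the source. With the tools above, however, recovering a .py file from a .pyc or an .exe becomes much easier.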
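The compile step mentioned above can be sketched as follows. This is a minimal illustration, not part of the original workflow: the file name demo.py and the helper compile_to_pyc are hypothetical, and the comparison against importlib.util.MAGIC_NUMBER only shows that the first 4 bytes of the result identify the interpreter that produced it.

```python
import importlib.util
import os
import py_compile
import tempfile


def compile_to_pyc(source_text: str, workdir: str) -> str:
    """Write source_text to demo.py inside workdir, compile it, return the .pyc path."""
    src_path = os.path.join(workdir, "demo.py")   # hypothetical file name
    pyc_path = os.path.join(workdir, "demo.pyc")
    with open(src_path, "w", encoding="utf-8") as fh:
        fh.write(source_text)
    # doraise=True raises py_compile.PyCompileError instead of printing to stderr
    py_compile.compile(src_path, cfile=pyc_path, doraise=True)
    return pyc_path


with tempfile.TemporaryDirectory() as tmp:
    pyc = compile_to_pyc("print('Hello, world')\n", tmp)
    with open(pyc, "rb") as fh:
        head = fh.read(4)
    # The first 4 bytes are the magic number of the running interpreter.
    same_magic = head == importlib.util.MAGIC_NUMBER
    print(os.path.basename(pyc), head.hex(), same_magic)
```

The .pyc produced this way still contains all names and constants of the program, which is why it can be decompiled.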

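Run the command python pyinstxtractor.py dk.exe in cmd to obtain the first-stage extracted files.

When it finishes, a dk.exe_extracted folder appears. The hope was to get usable .pyc files, but the folder does not contain them directly. Most of the files are libraries with no value here; the files worth attention are the few without an extension.

This version of pyinstxtractor.py is imperfect: it does not write the entry script in a valid format, so a manual fix is needed. Simply renaming the dk file (adding a .pyc suffix by hand) does not make it decompilable, because the extension is not the real problem. What matters is the magic number at the head of the dk file, and each Python version has its own magic number.

Because the .pyc file is handled at the binary level, it is worth analyzing its file structure (covered in the second half of this article).

The struct file contains the header of a valid .pyc file, whereas the dk file is a .pyc file with the header missing. Copying the first row of struct as shown in a hex editor (the first 16 bytes in a typical view, which is the whole header for Python 3.7 and later; older versions use a shorter header) to the front of dk is enough.

This produces a valid dk_true.pyc file.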
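The manual hex-editor fix can also be scripted. The sketch below is an illustration under assumptions: the helper names restore_pyc and restore_pyc_file are hypothetical, and header_len=16 assumes the executable was built with Python 3.7 or later (use 12 for Python 3.3 to 3.6 and 8 for older versions). The demo runs on synthetic bytes in a temporary directory rather than on the real struct and dk files.

```python
import os
import tempfile


def restore_pyc(header_source: bytes, headerless: bytes, header_len: int = 16) -> bytes:
    """Return headerless with the first header_len bytes of header_source prepended."""
    header = header_source[:header_len]
    if headerless[:4] == header[:4]:
        # The magic number is already present, so leave the data unchanged.
        return headerless
    return header + headerless


def restore_pyc_file(struct_path: str, dk_path: str, out_path: str, header_len: int = 16) -> None:
    """File-based wrapper: borrow the header of struct_path for dk_path."""
    with open(struct_path, "rb") as fh:
        donor = fh.read()
    with open(dk_path, "rb") as fh:
        body = fh.read()
    with open(out_path, "wb") as fh:
        fh.write(restore_pyc(donor, body, header_len))


# Demo on synthetic data: a 16-byte Python 3.7 style header and a fake body.
demo_header = bytes.fromhex("420d0d0a00000000b9c7895e57000000")
demo_body = b"\xe3" + b"\x00" * 8
with tempfile.TemporaryDirectory() as tmp:
    paths = [os.path.join(tmp, name) for name in ("struct", "dk", "dk_true.pyc")]
    with open(paths[0], "wb") as fh:
        fh.write(demo_header + b"\xe3donor-code")
    with open(paths[1], "wb") as fh:
        fh.write(demo_body)
    restore_pyc_file(*paths)
    with open(paths[2], "rb") as fh:
        fixed = fh.read()
print(fixed[:16].hex())
```

On the real files the call would be restore_pyc_file('struct', 'dk', 'dk_true.pyc') from inside the dk.exe_extracted folder. The timestamp and size copied from struct do not describe dk, but a decompiler only needs the magic number to pick the right bytecode version.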

Then decompile with uncompyle6: the command uncompyle6 -o dk_true.py dk_true.pyc produces the source file.

Recovered source code:

import tkinter as tk, tkinter.messagebox, pickle, requests, re, json
session = requests.Session()

def gui():
    window = tk.Tk()
    window.title('便捷化打卡系统')
    screenWidth = window.winfo_screenwidth()
    screenHeight = window.winfo_screenheight()
    width = 300
    height = 200
    left = (screenWidth - width) / 2
    top = (screenHeight - height) / 2
    window.geometry('%dx%d+%d+%d' % (width, height, left, top))
    tk.Label(window, text='学号', font=('Arial', 14)).place(x=10, y=10)
    tk.Label(window, text='密码', font=('Arial', 14)).place(x=10, y=80)
    var_usr_name = tk.StringVar()
    entry_usr_name = tk.Entry(window, textvariable=var_usr_name, font=('Arial', 14))
    entry_usr_name.place(x=60, y=10)
    var_usr_pwd = tk.StringVar()
    entry_usr_pwd = tk.Entry(window, textvariable=var_usr_pwd, font=('Arial', 14), show='*')
    entry_usr_pwd.place(x=60, y=80)

    def usr_login():
        global password
        global username
        username = var_usr_name.get()
        password = var_usr_pwd.get()
        r = check()
        if r['m'] == '操作成功':
            json_data = post()
            if '今天已经填报了' in json_data['m']:
                tkinter.messagebox.showinfo(title='打卡系统', message='已经填报过了噢!')
            elif '操作成功' in json_data['m']:
                tkinter.messagebox.showerror(title='打卡系统', message='今日填报成功!')
        else:
            tkinter.messagebox.showerror(title='打卡系统', message='账户/密码有误')

    btn_login = tk.Button(window, text='打卡', command=usr_login)
    btn_login.place(x=110, y=150)
    window.mainloop()

def check():
    url = 'https://wfw.scu.edu.cn/a_scu/api/sso/check'
    data = {'username':username,
     'password':password,
     'redirect':'https://wfw.scu.edu.cn/ncov/wap/default/index'}
    header = {'Referer':'https://wfw.scu.edu.cn/site/polymerization/polymerizationLogin?redirect=https%3A%2F%2Fwfw.scu.edu.cn%2Fncov%2Fwap%2Fdefault%2Findex&from=wap',
     'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3754.400 QQBrowser/10.5.4034.400',
     'Host':'wfw.scu.edu.cn',
     'Origin':'https://wfw.scu.edu.cn'}
    r = session.post(url, data=data, headers=header, timeout=3).json()
    return r

def data_get() -> dict:
    url_for_id = 'https://wfw.scu.edu.cn/ncov/wap/default/index'
    header = {'Referer':'https://wfw.scu.edu.cn/site/polymerization/polymerizationLogin?redirect=https%3A%2F%2Fwfw.scu.edu.cn%2Fncov%2Fwap%2Fdefault%2Findex&from=wap',
     'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3754.400 QQBrowser/10.5.4034.400',
     'Host':'wfw.scu.edu.cn',
     'Origin':'https://wfw.scu.edu.cn'}
    r2 = session.get(url_for_id, headers=header).text
    x = re.findall('.*?oldInfo: (.*),.*?', r2)
    data = eval(x[0])
    return data

def post() -> json:
    headers = {'Host':'wfw.scu.edu.cn',
     'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3754.400 QQBrowser/10.5.4034.400',
     'Accept':'application/json,text/javascript,*/*;q=0.01',
     'Accept-Language':'zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2',
     'Accept-Encoding':'gzip,deflate,br',
     'Content-Type':'application/x-www-form-urlencoded;',
     'X-Requested-With':'XMLHttpRequest',
     'Content-Length':'2082',
     'Origin':'https://wfw.scu.edu.cn',
     'Connection':'keep-alive',
     'Referer':'https://wfw.scu.edu.cn/ncov/wap/default/index'}
    data = data_get()
    r1 = session.post('https://wfw.scu.edu.cn/ncov/wap/default/save', headers=headers, data=data)
    return r1.json()

if __name__ == '__main__':
    gui()

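.pyc files are intermediate files produced when Python compiles source code. They are binary and can be loaded and executed directly by the Python virtual machine, which makes them central to both compiling and decompiling Python.

The result of compiling Python code is a PyCodeObject. The virtual machine can load a PyCodeObject and run it directly, and a .pyc file is simply the on-disk form of a PyCodeObject.

A .pyc file is a PyCodeObject plus header information:

  • a 4-byte magic number
  • 12 bytes of source file information (varies by version)
  • the serialized PyCodeObject

Magic number

The most notable property of the .pyc format is that each Python version writes a different magic number. Taking Python 2.7 as an example, a version-specific integer is written as two little-endian bytes and followed by "\r\n", giving a 4-byte magic number: MAGIC_NUMBER = (62211).to_bytes(2, 'little') + b'\r\n'

The first 32 bytes of a .pyc file generated by Python 2.7 (the first 4 bytes are 03f3 0d0a):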

00000000: 03f3 0d0a b9c7 895e 6300 0000 0000 0000
00000010: 0003 0000 0040 0000 0073 1f00 0000 6400
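The construction of the magic number can be checked with a few lines of Python 3. In this sketch the helper names magic_from_int and int_from_magic are hypothetical; 62211 comes from the text above, and the values 3379 and 3394 are what the first two bytes of the Python 3.6 and 3.7 dumps in this article (330d and 420d) decode to.

```python
def magic_from_int(number: int) -> bytes:
    """Build a 4-byte .pyc magic: 2 little-endian bytes followed by CR LF."""
    return number.to_bytes(2, "little") + b"\r\n"


def int_from_magic(magic: bytes) -> int:
    """Recover the version-specific integer from a 4-byte magic number."""
    if len(magic) != 4 or magic[2:] != b"\r\n":
        raise ValueError("not a .pyc magic number")
    return int.from_bytes(magic[:2], "little")


# Values taken from the hex dumps in this article.
for label, number in (("Python 2.7", 62211), ("Python 3.6", 3379), ("Python 3.7", 3394)):
    print(label, number, magic_from_int(number).hex())
```

Comparing the first 4 bytes of an unknown .pyc against such a table is how a decompiler decides which bytecode version it is looking at.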

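Source file information

This part differs considerably between Python versions. In Python 2, only 4 bytes carry information: the modification time of the source file as a Unix timestamp with one-second resolution, written little-endian. For example: (1586087865).to_bytes(4, 'little').hex() -> b9c7 895e

Later versions (since Python 3.3, including Python 3.5 and 3.6) add 4 more bytes after the timestamp for the size of the source file in bytes, also written little-endian. If the source file is 87 bytes long, 5700 0000 is written, and together with the modification time the field becomes b9c7 895e 5700 0000. The first 32 bytes of a .pyc file generated by Python 3.6: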

00000000: 330d 0d0a b9c7 895e 5700 0000 e300 0000
00000010: 0000 0000 0000 0000 0003 0000 0040 0000

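Starting with Python 3.7, hash-based .pyc files are supported: Python can decide whether a source file has changed by checking a hash as well as by checking the timestamp. To support this, the source information part grows by 4 bytes, so it now takes 12 bytes in total. Hash checking is not enabled by default (it can be enabled by calling the compile function of the py_compile module with invalidation_mode=PycInvalidationMode.CHECKED_HASH). When it is disabled, the first 4 bytes are 0000 0000 and the following 8 bytes are the modification time and size of the source file, as in earlier versions such as Python 3.6. When it is enabled, the first 4 bytes are 0100 0000 (unchecked hash) or 0300 0000 (checked hash), and the following 8 bytes are the hash of the source file.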
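The 16-byte layout described above (magic number, flags, then either timestamp and size or a hash) can be decoded with struct. This is a sketch for Python 3.7 and later headers only; the function name parse_pyc_header and the dictionary keys are hypothetical, and the sample bytes are the first 16 bytes of the Python 3.7 dump shown in this article.

```python
import struct


def parse_pyc_header(data: bytes) -> dict:
    """Decode the 16-byte header of a Python 3.7+ .pyc file."""
    if len(data) < 16:
        raise ValueError("need at least 16 bytes")
    (flags,) = struct.unpack_from("<I", data, 4)
    info = {"magic": data[:4], "flags": flags}
    if flags & 0x01:
        # Hash-based .pyc: 8 bytes of source hash, bit 1 says whether it is checked.
        info["source_hash"] = data[8:16]
        info["check_source"] = bool(flags & 0x02)
    else:
        # Timestamp-based .pyc: modification time and source size, little-endian.
        mtime, size = struct.unpack_from("<II", data, 8)
        info["mtime"] = mtime
        info["source_size"] = size
    return info


sample = bytes.fromhex("420d0d0a" "00000000" "b9c7895e" "57000000")
header = parse_pyc_header(sample)
print(header)
```

For the sample, the flags field is 0, so the last 8 bytes decode to the timestamp 1586087865 and the size 87 used as examples in the text.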

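PyCodeObject

PyCodeObject is a struct defined in Include/code.h in the Python source. Its data is serialized with Python's marshal module and stored in the .pyc file. The contents of PyCodeObject differ between versions, which is one reason .pyc files produced by different Python versions are not fully interchangeable.

The marshal module implements serialization for a set of basic Python objects (PyObject). When a PyObject is serialized, one byte is written first to identify its type. The type codes are listed below (PyCodeObject corresponds to TYPE_CODE, so its first byte is 0x63, the ASCII code of 'c'):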

// Python/marshal.c
// ......
#define TYPE_NULL               '0'
#define TYPE_NONE               'N'
#define TYPE_FALSE              'F'
#define TYPE_TRUE               'T'
#define TYPE_STOPITER           'S'
#define TYPE_ELLIPSIS           '.'
#define TYPE_INT                'i'
/* TYPE_INT64 is not generated anymore.
   Supported for backward compatibility only. */
#define TYPE_INT64              'I'
#define TYPE_FLOAT              'f'
#define TYPE_BINARY_FLOAT       'g'
#define TYPE_COMPLEX            'x'
#define TYPE_BINARY_COMPLEX     'y'
#define TYPE_LONG               'l'
#define TYPE_STRING             's'
#define TYPE_INTERNED           't'
#define TYPE_REF                'r'
#define TYPE_TUPLE              '('
#define TYPE_LIST               '['
#define TYPE_DICT               '{'
#define TYPE_CODE               'c'
#define TYPE_UNICODE            'u'
#define TYPE_UNKNOWN            '?'
#define TYPE_SET                '<'
#define TYPE_FROZENSET          '>'
#define FLAG_REF                '\x80' /* with a type, add obj to index */

// The following are supported in Python 3.5 and later
#define TYPE_ASCII              'a'
#define TYPE_ASCII_INTERNED     'A'
#define TYPE_SMALL_TUPLE        ')'
#define TYPE_SHORT_ASCII        'z'
#define TYPE_SHORT_ASCII_INTERNED 'Z'
// ......

The first 32 bytes of a .pyc file generated by Python 3.7:

00000000: 420d 0d0a 0000 0000 b9c7 895e 5700 0000
00000010: e300 0000 0000 0000 0000 0000 0003 0000

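The 17th byte (offset 0x10, the first byte of the PyCodeObject) is 0xe3. This is because the first byte of a serialized PyObject can also carry a flag (#define FLAG_REF '\x80'), so the byte is 0x63 | 0x80 -> 0xe3. FLAG_REF means the object is added to a reference list; when the same object appears again it is not serialized a second time, and a TYPE_REF entry points back to it instead. This can be seen as an optimization of Python's serialization.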
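The bit arithmetic behind 0xe3 can be written out directly. In this small sketch the constants mirror the #define lines quoted from Python/marshal.c, while the helper split_type_byte is a hypothetical name introduced for illustration.

```python
FLAG_REF = 0x80          # high bit: object is added to the reference list
TYPE_CODE = ord("c")     # 0x63


def split_type_byte(value: int) -> tuple:
    """Split a marshal type byte into (type character, FLAG_REF set?)."""
    return chr(value & 0x7F), bool(value & FLAG_REF)


first_byte = 0xE3  # byte at offset 0x10 of the Python 3.7 dump
print(hex(TYPE_CODE | FLAG_REF), split_type_byte(first_byte))
```

Masking with 0x7F strips the flag and leaves the plain type code, which is how a reader of the format recognises a code object whether or not the flag is present.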

In general a PyCodeObject has the following attributes and data types:

/* Bytecode object */
typedef struct {
    PyObject_HEAD
    int co_argcount;            /* #arguments, except *args */
    int co_posonlyargcount;     /* #positional only arguments */
    int co_kwonlyargcount;      /* #keyword only arguments */
    int co_nlocals;             /* #local variables */
    int co_stacksize;           /* #entries needed for evaluation stack */
    int co_flags;               /* CO_..., see below */
    int co_firstlineno;         /* first source line number */
    PyObject *co_code;          /* instruction opcodes */
    PyObject *co_consts;        /* list (constants used) */
    PyObject *co_names;         /* list of strings (names used) */
    PyObject *co_varnames;      /* tuple of strings (local variable names) */
    PyObject *co_freevars;      /* tuple of strings (free variable names) */
    PyObject *co_cellvars;      /* tuple of strings (cell variable names) */
    /* The rest aren't used in either hash or comparisons, except for co_name,
       used in both. This is done to preserve the name and line number
       for tracebacks and debuggers; otherwise, constant de-duplication
       would collapse identical functions/lambdas defined on different lines.
    */
    Py_ssize_t *co_cell2arg;    /* Maps cell vars which are arguments. */
    PyObject *co_filename;      /* unicode (where it was loaded from) */
    PyObject *co_name;          /* unicode (name, for reference) */
    PyObject *co_lnotab;        /* string (encoding addr<->lineno mapping) See
                                   Objects/lnotab_notes.txt for details. */
    //  ......
} PyCodeObject;

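Every attribute plays a role when the virtual machine runs the .pyc file, but not all of them are written to the file. The part of marshal that serializes a PyCodeObject: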

// ......
    else if (PyCode_Check(v)) {
        PyCodeObject *co = (PyCodeObject *)v;
        W_TYPE(TYPE_CODE, p);
        w_long(co->co_argcount, p);
        w_long(co->co_kwonlyargcount, p);
        w_long(co->co_nlocals, p);
        w_long(co->co_stacksize, p);
        w_long(co->co_flags, p);
        w_object(co->co_code, p);
        w_object(co->co_consts, p);
        w_object(co->co_names, p);
        w_object(co->co_varnames, p);
        w_object(co->co_freevars, p);
        w_object(co->co_cellvars, p);
        w_object(co->co_filename, p);
        w_object(co->co_name, p);
        w_long(co->co_firstlineno, p);
        w_object(co->co_lnotab, p);
    }
// ......
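The write order in this snippet can be checked against real bytes. The sketch below decodes the type byte and the five w_long fields from offsets 0x10 to 0x25 of the Python 3.7 .pyc dump used in this article; the variable names are hypothetical, and the layout only holds for versions whose marshal code matches the snippet above (later versions add co_posonlyargcount).

```python
import struct

# Bytes 0x10-0x25 of the Python 3.7 dump: type byte, five 4-byte integers,
# then the type byte of the next object (co_code).
PREFIX = bytes.fromhex(
    "e3"          # TYPE_CODE | FLAG_REF
    "00000000"    # co_argcount
    "00000000"    # co_kwonlyargcount
    "00000000"    # co_nlocals
    "03000000"    # co_stacksize
    "40000000"    # co_flags
    "73"          # TYPE_STRING, start of co_code
)

# '<' selects little-endian with no padding: 1 + 5 * 4 + 1 = 22 bytes.
fields = struct.unpack("<B5iB", PREFIX)
type_byte, argcount, kwonlyargcount, nlocals, stacksize, flags, next_type = fields
print(hex(type_byte), argcount, kwonlyargcount, nlocals, stacksize, hex(flags), chr(next_type))
```

The decoded values (no arguments, no locals, a stack depth of 3) are what one would expect for module-level code such as the small test script used later in the article.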

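Python uses marshal.dump to convert a PyCodeObject into the corresponding binary layout. Each field is laid out in the file as follows:

Field               Type    Description
TYPE_CODE           byte    Marks this object as a PyCodeObject
co_argcount         long    Field of the PyCodeObject struct
co_nlocals          long    Field of the PyCodeObject struct
co_stacksize        long    Field of the PyCodeObject struct
co_flags            long    Field of the PyCodeObject struct
TYPE_STRING         byte    String marker; this string is co_code
co_code size        long
co_code value       bytes
TYPE_LIST           byte    Marks a list
co_consts size      long    Number of elements in the co_consts list
TYPE_INT            byte    co_consts[0] is an integer
co_consts[0]        long
TYPE_STRING         byte    co_consts[1] is a string
co_consts[1] size   long
co_consts[1] value  bytes
TYPE_CODE           byte    co_consts[2] is itself a PyCodeObject; its code may be a function or a class
co_consts[2]

Here byte means 1 byte, long means 4 bytes, and bytes means one or more bytes. Note that every attribute of a PyCodeObject and its value appear in the binary file in a fixed order.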

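co_code in PyCodeObject

Python opcodes determine the execution flow of a program. They are stored in the co_code field of PyCodeObject as a PyObject of type TYPE_STRING.

A Python 3.7 .pyc file containing the opcode sequence: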

00000000: 420d 0d0a 0000 0000 b9c7 895e 5700 0000
00000010: e300 0000 0000 0000 0000 0000 0003 0000
00000020: 0040 0000 0073 1e00 0000 6500 6400 8301
00000030: 0100 6401 6402 8400 5a01 6501 6403 6404
00000040: 8302 0100 6405 5300 2906 7a0c 4865 6c6c
00000050: 6f2c 2077 6f72 6c64 6302 0000 0000 0000

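Offsets 0x2a-0x47 hold the serialized opcode sequence (from 6500 6400 through 6405 5300). The byte at offset 0x25, 0x73, is TYPE_STRING, and the bytes at offsets 0x26-0x29 give the length of the object: 1e00 0000 is 30 in little-endian form.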
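The offsets above can be verified by slicing the dump in Python. The sketch below reads the length field, extracts co_code and splits it into 2-byte instructions. The small OPNAMES table is an assumption: it lists the opcode numbers as defined for CPython 3.7 and should be checked against Include/opcode.h of that release before relying on it.

```python
import struct

DUMP = """
420d 0d0a 0000 0000 b9c7 895e 5700 0000
e300 0000 0000 0000 0000 0000 0003 0000
0040 0000 0073 1e00 0000 6500 6400 8301
0100 6401 6402 8400 5a01 6501 6403 6404
8302 0100 6405 5300 2906 7a0c 4865 6c6c
6f2c 2077 6f72 6c64 6302 0000 0000 0000
"""
data = bytes.fromhex("".join(DUMP.split()))

type_byte = data[0x25]                           # 0x73, TYPE_STRING
(length,) = struct.unpack_from("<I", data, 0x26)  # little-endian length
co_code = data[0x2A:0x2A + length]

# Assumed CPython 3.7 opcode numbers (only those that occur in this dump).
OPNAMES = {0x01: "POP_TOP", 0x53: "RETURN_VALUE", 0x5A: "STORE_NAME",
           0x64: "LOAD_CONST", 0x65: "LOAD_NAME",
           0x83: "CALL_FUNCTION", 0x84: "MAKE_FUNCTION"}

# Since Python 3.6 every instruction is an (opcode, argument) byte pair.
pairs = [(co_code[i], co_code[i + 1]) for i in range(0, len(co_code), 2)]
for offset, (op, arg) in zip(range(0, len(co_code), 2), pairs):
    print(f"{offset:3d} {OPNAMES.get(op, hex(op)):<14} {arg}")
```

The listing has 15 instructions and follows the shape of the test script shown later: a call to print, a function definition stored under a name, and a call to that function.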

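opcode

A set of opcodes is defined in Include/opcode.h in the Python source. HAVE_ARGUMENT is the dividing line: every opcode greater than or equal to HAVE_ARGUMENT takes exactly one argument, and every opcode below HAVE_ARGUMENT takes none.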

CPython implementation detail: Bytecode is an implementation detail of the CPython interpreter. No guarantees are made that bytecode will not be added, removed, or changed between versions of Python. Use of this module should not be considered to work across Python VMs or Python releases.

Changed in version 3.6: Use 2 bytes for each instruction. Previously the number of bytes varied by instruction.

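Python does not guarantee opcode compatibility between versions, which is one reason why .pyc files are incompatible across Python versions.

Python 3.6 introduced a major change: every instruction is 2 bytes long whether or not its opcode takes an argument. The opcode occupies 1 byte. If the opcode takes an argument, the other byte holds it; if not, the other byte is ignored and is usually 0x00. In effect an opcode has at most one argument, an index or offset.

Before Python 3.6, an instruction whose opcode takes an argument was 3 bytes long: opcode, argv_low, argv_high. The opcode occupies 1 byte and the argument 2 bytes, also stored in little-endian order. For example, the Python 2.7 instruction 64 01 00 has the opcode LOAD_CONST and the argument 1.

LOAD_CONST(consti): Pushes co_consts[consti] onto the stack.

That is, the object at index 1 of the co_consts tuple (counting from 0, so co_consts[1]) is pushed onto the top of the stack.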
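
The two instruction layouts can be compared with a small decoder. This is a sketch only: decode is a hypothetical helper, the default have_argument=90 is the HAVE_ARGUMENT value of Python 2.7 and 3.7 (newer releases renumber opcodes), and EXTENDED_ARG prefixes are not handled.

```python
def decode(code, wordcode, have_argument=90):
    """Yield (offset, opcode, arg) triples from raw bytecode.

    wordcode=True  -> Python 3.6+ layout: always 2 bytes per instruction.
    wordcode=False -> older layout: 1 byte, or 3 bytes with an argument.
    have_argument=90 is an assumption matching Python 2.7 and 3.7.
    """
    code = bytearray(code)
    i = 0
    while i < len(code):
        op = code[i]
        if wordcode:
            # The second byte is the argument, or padding if unused.
            arg = code[i + 1] if op >= have_argument else None
            yield i, op, arg
            i += 2
        elif op >= have_argument:
            # Two argument bytes, low byte first (little-endian).
            arg = code[i + 1] | (code[i + 2] << 8)
            yield i, op, arg
            i += 3
        else:
            yield i, op, None
            i += 1


# LOAD_CONST 1 followed by RETURN_VALUE, in both layouts.
old_style = list(decode(bytes.fromhex("64010053"), wordcode=False))
new_style = list(decode(bytes.fromhex("64015300"), wordcode=True))
print(old_style)  # [(0, 100, 1), (3, 83, None)]
print(new_style)  # [(0, 100, 1), (2, 83, None)]
```

The opcode and argument are the same in both cases; only the instruction offsets differ, which is why jump targets and line-number tables cannot be reused across the two layouts.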

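Viewing opcodes

Python's built-in dis and marshal modules can be used to view the opcode sequence. Two classic versions, Python 2.7 and Python 3.7, serve as examples below.

Take the following source code: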

print('Hello, world')

def fff(a, b):
    c = a + b
    return c & 0xffff

fff(34, 67)

Python 2.7
>>> import dis, marshal
>>> f=open('t.pyc', 'rb').read()
>>> co=marshal.loads(f[8:]) # In Python 2.7 the PyCodeObject starts at offset 8 of the .pyc file
>>> dis.dis(co)
  1           0 LOAD_CONST               0 ('Hello, world')
              3 PRINT_ITEM
              4 PRINT_NEWLINE

  3           5 LOAD_CONST               1 (<code object fff at 0x10a1c9630, file "t.py", line 3>)
              8 MAKE_FUNCTION            0
             11 STORE_NAME               0 (fff)

  7          14 LOAD_NAME                0 (fff)
             17 LOAD_CONST               2 (34)
             20 LOAD_CONST               3 (67)
             23 CALL_FUNCTION            2
             26 POP_TOP
             27 LOAD_CONST               4 (None)
             30 RETURN_VALUE
>>> co.co_names
('fff',)
>>> co.co_consts
('Hello, world', <code object fff at ..., file ".../t.py", line 3>, 34, 67, None)

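| Hex instruction | Line | Offset and instruction | Argument |
| --- | --- | --- | --- |
| 64 00 00 | 1 | 0 LOAD_CONST | 0 ('Hello, world') |
| 47 | | 3 PRINT_ITEM | |
| 48 | | 4 PRINT_NEWLINE | |
| 64 01 00 | 3 | 5 LOAD_CONST | 1 (<code object fff ...>) |
| 84 00 00 | | 8 MAKE_FUNCTION | 0 |
| 5a 00 00 | | 11 STORE_NAME | 0 (fff) |
| 65 00 00 | 7 | 14 LOAD_NAME | 0 (fff) |
| 64 02 00 | | 17 LOAD_CONST | 2 (34) |
| 64 03 00 | | 20 LOAD_CONST | 3 (67) |
| 83 02 00 | | 23 CALL_FUNCTION | 2 |
| 01 | | 26 POP_TOP | |
| 64 04 00 | | 27 LOAD_CONST | 4 (None) |
| 53 | | 30 RETURN_VALUE | |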

Python 3.7
>>> import dis, marshal
>>> f=open('t.pyc', 'rb').read()
>>> co=marshal.loads(f[16:]) # In Python 3.7 the PyCodeObject starts at offset 16 of the .pyc file
>>> dis.dis(co)
  1           0 LOAD_NAME                0 (print)
              2 LOAD_CONST               0 ('Hello, world')
              4 CALL_FUNCTION            1
              6 POP_TOP

  3           8 LOAD_CONST               1 (<code object fff at ..., line 3>)
             10 LOAD_CONST               2 ('fff')
             12 MAKE_FUNCTION            0
             14 STORE_NAME               1 (fff)

  7          16 LOAD_NAME                1 (fff)
             18 LOAD_CONST               3 (34)
             20 LOAD_CONST               4 (67)
             22 CALL_FUNCTION            2
             24 POP_TOP
             26 LOAD_CONST               5 (None)
             28 RETURN_VALUE
>>> co.co_names
('print', 'fff')
>>> co.co_name
'<module>'
>>> co.co_consts
('Hello, world', <code object fff at ..., file ".../t.py", line 3>, 'fff', 34, 67, None)

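| Hex instruction | Line | Offset and instruction | Argument |
| --- | --- | --- | --- |
| 65 00 | 1 | 0 LOAD_NAME | 0 (print) |
| 64 00 | | 2 LOAD_CONST | 0 ('Hello, world') |
| 83 01 | | 4 CALL_FUNCTION | 1 |
| 01 00 | | 6 POP_TOP | |
| 64 01 | 3 | 8 LOAD_CONST | 1 (<code object fff ...>) |
| 64 02 | | 10 LOAD_CONST | 2 ('fff') |
| 84 00 | | 12 MAKE_FUNCTION | 0 |
| 5a 01 | | 14 STORE_NAME | 1 (fff) |
| 65 01 | 7 | 16 LOAD_NAME | 1 (fff) |
| 64 03 | | 18 LOAD_CONST | 3 (34) |
| 64 04 | | 20 LOAD_CONST | 4 (67) |
| 83 02 | | 22 CALL_FUNCTION | 2 |
| 01 00 | | 24 POP_TOP | |
| 64 05 | | 26 LOAD_CONST | 5 (None) |
| 53 00 | | 28 RETURN_VALUE | |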
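
The two sessions above skip a different number of header bytes (8 for Python 2.7, 16 for Python 3.7). The sketch below wraps that choice in a helper so the experiment can be repeated on whichever Python 3 interpreter is at hand. pyc_header_size and load_pyc are hypothetical helper names, and the 12-byte case for Python 3.3 to 3.6 is an addition not covered in the text above. marshal.loads only understands .pyc files produced by the same interpreter version, so the script compiles its own t.pyc in a temporary directory.

```python
import dis
import marshal
import os
import py_compile
import sys
import tempfile


def pyc_header_size(version):
    """Bytes to skip before the marshalled code object in a .pyc file."""
    if version >= (3, 7):
        return 16  # magic, flags, then mtime + size (or an 8-byte hash)
    if version >= (3, 3):
        return 12  # magic, mtime, source size
    return 8       # magic, mtime


def load_pyc(path, version=sys.version_info[:2]):
    """Read a .pyc written by the running interpreter and return its code object."""
    with open(path, "rb") as f:
        data = f.read()
    return marshal.loads(data[pyc_header_size(version):])


SOURCE = (
    "print('Hello, world')\n"
    "\n"
    "def fff(a, b):\n"
    "    c = a + b\n"
    "    return c & 0xffff\n"
    "\n"
    "fff(34, 67)\n"
)

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "t.py")
    with open(src, "w") as f:
        f.write(SOURCE)
    # Compile t.py to t.pyc with the running interpreter.
    pyc = py_compile.compile(src, cfile=os.path.join(tmp, "t.pyc"), doraise=True)
    co = load_pyc(pyc)

# The listing differs between Python versions; compare it with the tables above.
dis.dis(co)
```

The exact instructions printed depend on the interpreter, which is the version incompatibility described earlier seen in practice.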

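The main difficulty in handling .pyc files lies in version differences and in the file structure and its logical relationships. The .exe handled here is a bare program with no protection at all, and no deobfuscation was involved, so the result was easy to obtain. When a .pyc has been obfuscated or otherwise protected, careful analysis is required, the difficulty rises markedly, and a result may not be obtainable at all.

Recovering the source code is certainly welcome, but the more important outcome is a deeper understanding of the structure and purpose of .pyc files.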
