[Python] String Encoding and How Python Handles Strings

Teus 2023. 12. 6. 20:53

Hello, this is Teus.
This post briefly summarizes character encoding, then looks at how Python handles characters and where encoding fits in.

1. String Encoding

It is a cliché by now, but a computer processes all data as binary 0s and 1s.
A number is therefore stored by converting it from decimal to binary.
(e.g. the number 22 -> binary 10110)
For a char, on the other hand, there are many different ways to map the char to binary.
This char -> binary conversion is called encoding: each char is coded as the specific binary number assigned to it.
Well-known character encodings include ASCII and the Unicode encodings (UTF-8, UTF-16, ...).
ASCII uses 7 bits, so it converts between binary and char using a table of 2^7 = 128 characters.
ASCII Wikipedia
This is why, in C, printing a char as an integer prints the ASCII code of that char.

#include <stdio.h>

int main()
{
    char t = 'B';
    printf("t's char is : %c\n", t);
    printf("t's ascii is : %d\n", t);
    return 0;
}
/*
t's char is : B
t's ascii is : 66
*/
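
For comparison, the same lookup can be sketched in Python with the built-in ord() and chr(). This snippet is an illustration added alongside the C example, not part of it; the variable names simply mirror the C code.

```python
# ord() gives the integer code of a character, chr() goes the other way.
t = "B"
code = ord(t)

print(f"t's char is : {t}")
print(f"t's ascii is : {code}")
print(format(code, "08b"))  # the same value written as 8 binary digits

# Round trip: the integer maps back to the original character.
assert chr(code) == t
```

Unlike C, Python has no separate char type, so the conversion between a character and its integer code is always explicit.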

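However, ASCII cannot represent more than 128 characters, so Unicode was introduced to cover scripts beyond English.
Unicode assigns a number (a code point) to every character, and encodings such as UTF-8 and UTF-16 define how that number is stored as bytes. UTF-8 stores an ASCII character as a single byte with the same value as its ASCII code, and uses 2 to 4 bytes for everything else, depending on how large the code point is.
(UTF-8 encodes in 8-bit units, 1 to 4 of them per character; UTF-16 encodes in 16-bit units, 1 or 2 of them per character.)
Unicode Wikipedia
In any case, to represent a char, the computer always ends up encoding it into some binary number by one method or another and storing that.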
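
To make the variable-width idea concrete, here is a small sketch that encodes four characters with three Unicode encodings and compares the byte counts. The sample characters and the choice of the "-le" codecs (which skip the byte order mark) are my own picks for illustration.

```python
# One character from each size class: ASCII, Latin-1, Hangul, emoji.
samples = ["B", "\u00e9", "\ub2e4", "\U0001f600"]
encodings = ["utf-8", "utf-16-le", "utf-32-le"]  # -le variants add no BOM

sizes = {}
for ch in samples:
    # Number of bytes each encoding needs for this single character.
    sizes[ch] = {enc: len(ch.encode(enc)) for enc in encodings}
    print(f"U+{ord(ch):04X}", sizes[ch])

# An ASCII character keeps its ASCII value as a single UTF-8 byte.
assert "B".encode("utf-8") == bytes([66])
```

UTF-8 grows from 1 to 4 bytes as the code point gets larger, UTF-16 uses 2 or 4 bytes, and UTF-32 always uses 4.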

2. String Handling and Encoding in Python

2_1. Class str

If you are new to Python, you will almost always represent characters with 'class str'.
Python str Class Official Docs

As the official documentation states, class str handles every char as Unicode.
Below is the source code of PyUnicodeObject as defined in CPython.

Source: https://github.com/python/cpython/blob/3f2dd0a7c0b1a5112f2164dce78fcfaa0c4b39c7/Include/cpython/unicodeobject.h#L52

typedef struct {
    /* There are 4 forms of Unicode strings:
       - compact ascii:
       - compact:
       - legacy string, not ready:
       - legacy string, ready:
       Compact strings use only one memory block (structure + characters),
       whereas legacy strings use one block for the structure and one block
       for characters.

       Legacy strings are created by PyUnicode_FromUnicode() and
       PyUnicode_FromStringAndSize(NULL, size) functions. They become ready
       when PyUnicode_READY() is called.
    */
    PyObject_HEAD
    Py_ssize_t length;          /* Number of code points in the string */
    Py_hash_t hash;             /* Hash value; -1 if not set */
    struct {
        unsigned int interned:2;
        unsigned int kind:3;
        /* Compact is with respect to the allocation scheme. Compact unicode
        objects only require one memory block while non-compact objects use
        one block for the PyUnicodeObject struct and another for its data
        buffer. */
        unsigned int compact:1;
        unsigned int ascii:1;
        unsigned int ready:1;
        unsigned int :24;
    } state;
    // wide-character (wchar_t) copy of the string, for values a plain char cannot hold
    wchar_t *wstr;              /* wchar_t representation (null-terminated) */
} PyASCIIObject;


typedef struct {
    PyASCIIObject _base;
    Py_ssize_t utf8_length;     /* Number of bytes in utf8, excluding the
                                 * terminating \0. */
    char *utf8;                 /* UTF-8 representation (null-terminated) */
    Py_ssize_t wstr_length;     /* Number of code points in wstr, possible
                                 * surrogates count as two code points. */
} PyCompactUnicodeObject;


typedef struct {
    PyCompactUnicodeObject _base;
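    // Non-compact: the character data lives in a separate buffer that this union points to
    // Compact: the character data sits right after the PyASCIIObject / PyCompactUnicodeObject header, so this union is not used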
    union {
        void *any;
        Py_UCS1 *latin1;
        Py_UCS2 *ucs2;
        Py_UCS4 *ucs4;
    } data;                     /* Canonical, smallest-form Unicode buffer */
} PyUnicodeObject;

Among the various constructors for PyUnicode, _PyUnicode_FromASCII gives a good hint of how PyUnicodeObject manages its data.

Source: https://github.com/python/cpython/blob/3f2dd0a7c0b1a5112f2164dce78fcfaa0c4b39c7/Objects/unicodeobject.c#L1926

PyObject*
_PyUnicode_FromASCII(const char *buffer, Py_ssize_t size)
{
    const unsigned char *s = (const unsigned char *)buffer;
    PyObject *unicode;
    if (size == 1) {
#ifdef Py_DEBUG
        assert((unsigned char)s[0] < 128);
#endif
        return get_latin1_char(s[0]);
    }
    // The unicode object is built from ASCII input,
    // so the max char is fixed at 127
    unicode = PyUnicode_New(size, 127);
    if (!unicode)
        return NULL;
    // Get the data pointer of the new unicode object
    // through the PyUnicode_1BYTE_DATA API,
    // then memcpy the contents of s into it
    memcpy(PyUnicode_1BYTE_DATA(unicode), s, size);
    assert(_PyUnicode_CheckConsistency(unicode, 1));
    return unicode;
}

// API for accessing the data of a PyUnicode object
#define PyUnicode_1BYTE_DATA(op) ((Py_UCS1*)PyUnicode_DATA(op))
#define _PyUnicode_COMPACT_DATA(op)                     \
    (PyUnicode_IS_ASCII(op) ?                   \
     ((void*)((PyASCIIObject*)(op) + 1)) :              \
     ((void*)((PyCompactUnicodeObject*)(op) + 1)))

#define _PyUnicode_NONCOMPACT_DATA(op)                  \
    (assert(((PyUnicodeObject*)(op))->data.any),        \
     ((((PyUnicodeObject *)(op))->data.any)))

#define PyUnicode_DATA(op) \
    (assert(PyUnicode_Check(op)), \
     PyUnicode_IS_COMPACT(op) ? _PyUnicode_COMPACT_DATA(op) :   \
     _PyUnicode_NONCOMPACT_DATA(op))


static PyObject*
get_latin1_char(Py_UCS1 ch)
{
    struct _Py_unicode_state *state = get_unicode_state();

    PyObject *unicode = state->latin1[ch];
    if (unicode) {
        Py_INCREF(unicode);
        return unicode;
    }

    unicode = PyUnicode_New(1, ch);
    if (!unicode) {
        return NULL;
    }

    PyUnicode_1BYTE_DATA(unicode)[0] = ch;
    assert(_PyUnicode_CheckConsistency(unicode, 1));

    Py_INCREF(unicode);
    state->latin1[ch] = unicode;
    return unicode;
}

From this we can see that PyUnicodeObject stores and manages its data as follows.

Non-compact type (legacy string): the char data is stored in the buffer pointed to by PyUnicodeObject.data.any
Compact type: the object is cast to PyASCIIObject or PyCompactUnicodeObject, and the memory immediately after that struct is cast to a void pointer and used directly as the data
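
The compact layout can be observed from Python itself. The sketch below is CPython-specific and relies on sys.getsizeof reporting the header plus the character buffer; bytes_per_char is a hypothetical helper I made up for this illustration, and the exact header sizes vary between CPython versions, so only the per-character growth is measured.

```python
import sys


def bytes_per_char(ch: str, n: int = 1000) -> int:
    """Growth in object size per extra character, measured on CPython."""
    # Doubling the length adds n characters; the header size cancels out.
    return (sys.getsizeof(ch * (2 * n)) - sys.getsizeof(ch * n)) // n


widths = {
    "ascii 'a'": bytes_per_char("a"),
    "latin-1 U+00E9": bytes_per_char("\u00e9"),
    "hangul U+B2E4": bytes_per_char("\ub2e4"),
    "emoji U+1F600": bytes_per_char("\U0001f600"),
}
for label, width in widths.items():
    print(label, "->", width, "byte(s) per character")

# The fixed header differs too: a pure-ASCII string only needs PyASCIIObject,
# while other compact strings need the larger PyCompactUnicodeObject.
print(sys.getsizeof("a"), sys.getsizeof("\u00e9"))
```

The widths correspond to the Py_UCS1, Py_UCS2 and Py_UCS4 members of the data union: CPython picks the narrowest unit that can hold the largest code point in the string.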

2_2. Class bytes

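Now let's look at Python's other way of handling string data.
Python Bytes Class Official Docs
class bytes stores all data as a sequence of bytes (8 bits each).

print("다".encode())
# b'\xeb\x8b\xa4'
# Encoding '다' as UTF-8 (the default) gives the bytes eb 8b a4
print(type("다".encode()))
# <class 'bytes'>

Whatever text you encode, the result is stored as a sequence of 8-bit integers (0 to 255, i.e. two hex digits each). When a bytes object is printed, bytes that correspond to printable ASCII characters are shown as those characters and all others are shown as \x escapes.
As the definition of PyBytesObject below shows, a Python bytes object keeps its data in a C char array, that is, as raw bytes with no encoding information attached.

Source: https://github.com/python/cpython/blob/3f2dd0a7c0b1a5112f2164dce78fcfaa0c4b39c7/Include/cpython/bytesobject.h#L5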

typedef struct {
    PyObject_VAR_HEAD
    Py_hash_t ob_shash;
    // The data is stored here
    char ob_sval[1];
} PyBytesObject;
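
The "array of raw bytes" view is easy to confirm from Python. This is a minimal sketch using only built-in bytes operations; the sample string is my own.

```python
data = "a다".encode()  # UTF-8 by default

print(data)        # printable ASCII shown as-is, the rest as \x escapes
print(list(data))  # the underlying 8-bit integers
print(data.hex())  # the same bytes as two hex digits each

# Indexing a bytes object yields an int, not a 1-character string.
first = data[0]
assert isinstance(first, int) and first == ord("a")

# The bytes only become text again once an encoding is supplied.
assert data.decode("utf-8") == "a다"
```

Nothing in the bytes object records that it came from UTF-8; decoding with a different codec would either fail or produce different text.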

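Even from the source code alone, you can see that the bytes class has a simpler structure than the str class.
Thanks to this, using bytes instead of str can give better performance in the Python interpreter for some operations.
Below is an example that builds a list of 100 million str objects and a list of 100 million bytes objects, then concatenates each list with itself element-wise.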

import time

# 100 million elements per list; this needs many GB of RAM
temp1 = ['가나asbc다라' for _ in range(100000000)]
temp2 = ['가나asbc다라'.encode() for _ in range(100000000)]

# element-wise concatenation of str objects
st = time.time()
[i + j for i, j in zip(temp1, temp1)]
print("string time :", time.time() - st)

# element-wise concatenation of bytes objects
st = time.time()
[i + j for i, j in zip(temp2, temp2)]
print("bytes time :", time.time() - st)
'''
string time : 15.826701402664185
bytes time : 12.365976333618164
'''
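
If 100 million elements is too much for your machine, a scaled-down variant with timeit gives a rough version of the same comparison. This is a sketch under my own assumptions: N, the repeat counts and the helper concat_pairs are hypothetical choices, and the numbers will differ by machine and Python version, so no expected timings are shown.

```python
import timeit

N = 100_000  # far smaller than the 100 million used above
text = "가나asbc다라"
str_items = [text for _ in range(N)]
bytes_items = [text.encode() for _ in range(N)]


def concat_pairs(items):
    """Element-wise concatenation of a list with itself."""
    return [a + b for a, b in zip(items, items)]


# Take the best of several runs to reduce noise from other processes.
str_time = min(timeit.repeat(lambda: concat_pairs(str_items), number=5, repeat=3))
bytes_time = min(timeit.repeat(lambda: concat_pairs(bytes_items), number=5, repeat=3))

print(f"str   : {str_time:.4f} s")
print(f"bytes : {bytes_time:.4f} s")
```

Treat the result as a rough indication only; the gap depends heavily on string length and content.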

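3. Is There Ever a Reason to Use Bytes in Python?

Yes, there is.

The Parquet file format, now widely used for compressed storage of big data, stores object data as bytes. (I'll cover Parquet itself some other time if I get the chance.)
So when you work with .parquet files rather than .csv, you naturally run into bytes.
If you then decode the bytes back to str before using them, you pay for the decoding time and also for the slower str operations afterwards.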
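
As a rough illustration of working on bytes without decoding, the sketch below uses a hand-made value in place of one read from a file; the variable names are hypothetical and no Parquet library is involved.

```python
raw = "가나asbc다라".encode("utf-8")  # stand-in for a value read from a file

# Many str methods have bytes counterparts, so no decode is needed for these.
joined = raw + b"," + raw
parts = joined.split(b",")
has_ascii = b"asbc" in raw

# Lengths count bytes, not characters, so the two views disagree.
n_bytes = len(raw)
n_chars = len(raw.decode("utf-8"))
print(n_bytes, n_chars)

# Decoding is still required before mixing with str: str + bytes raises TypeError.
try:
    "prefix" + raw
except TypeError as exc:
    print("TypeError:", exc)
```

The trade-off is that anything character-aware, such as slicing by character or case conversion of non-ASCII text, still needs a decode first.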

So, to get the best performance out of Python, it helps to understand bytes: using bytes instead of str can relieve bottlenecks that show up in string operations.
I hope this was helpful!

Thank you!
