6 분 소요


판다스 데이터프레임과 시리즈

시리즈 다루기 ― 응용

In [1]:
import pandas as pd
In [2]:
df = pd.read_csv('data/scientists.csv')
df.head(2)
Out [2]:
Name Born Died Age Occupation
0 Rosaline Franklin 1920-07-25 1958-04-16 37 Chemist
1 William Gosset 1876-06-13 1937-10-16 61 Statistician
In [3]:
df['Age'].max()
Out [3]:
90
  • 불린(Boolean) 추출
    조건에 맞는 데이터 추출
In [4]:
df[df['Age'] > df['Age'].mean()]
Out [4]:
Name Born Died Age Occupation
1 William Gosset 1876-06-13 1937-10-16 61 Statistician
2 Florence Nightingale 1820-05-12 1910-08-13 90 Nurse
3 Marie Curie 1867-11-07 1934-07-04 66 Chemist
7 Johann Gauss 1777-04-30 1855-02-23 77 Mathematician
In [5]:
df['Age'].mean()
Out [5]:
59.125
In [6]:
# 조건식만 보기
df['Age'] > df['Age'].mean()
Out [6]:
0    False
1     True
2     True
3     True
4    False
5    False
6    False
7     True
Name: Age, dtype: bool
In [7]:
cond = [False, True, True, False, False, False, True, True]
In [8]:
df[cond]
Out [8]:
Name Born Died Age Occupation
1 William Gosset 1876-06-13 1937-10-16 61 Statistician
2 Florence Nightingale 1820-05-12 1910-08-13 90 Nurse
6 Alan Turing 1912-06-23 1954-06-07 41 Computer Scientist
7 Johann Gauss 1777-04-30 1855-02-23 77 Mathematician
In [9]:
# 직업이 Chemist인 행 추출
df[df['Occupation'] == 'Chemist']
Out [9]:
Name Born Died Age Occupation
0 Rosaline Franklin 1920-07-25 1958-04-16 37 Chemist
3 Marie Curie 1867-11-07 1934-07-04 66 Chemist
In [10]:
# 직업이 Chemist를 제외하고 행 추출
df[df['Occupation'] != 'Chemist']
Out [10]:
Name Born Died Age Occupation
1 William Gosset 1876-06-13 1937-10-16 61 Statistician
2 Florence Nightingale 1820-05-12 1910-08-13 90 Nurse
4 Rachel Carson 1907-05-27 1964-04-14 56 Biologist
5 John Snow 1813-03-15 1858-06-16 45 Physician
6 Alan Turing 1912-06-23 1954-06-07 41 Computer Scientist
7 Johann Gauss 1777-04-30 1855-02-23 77 Mathematician
  • 시리즈(Seires)와 브로드캐스팅(Broadcasting)
In [11]:
# 스칼라(Scalar)와 브로드캐스팅
df['Age'] + 100
Out [11]:
0    137
1    161
2    190
3    166
4    156
5    145
6    141
7    177
Name: Age, dtype: int64
In [12]:
df['Age'] + pd.Series([1, 100]) # 나머지 값들은 NaN으로 되어 계산됨
# NaN과의 계산은 NaN이 된다.
Out [12]:
0     38.0
1    161.0
2      NaN
3      NaN
4      NaN
5      NaN
6      NaN
7      NaN
dtype: float64
In [13]:
df['Age'].sort_index() # 많이 쓰는 옵션: axis, ascending, inplace, ignore_index
# ascending : default는 True, 오름차순 ; 내림차순으로 하려면 False
# inplace : default는 False; 원본에 적용하려면 True
Out [13]:
0    37
1    61
2    90
3    66
4    56
5    45
6    41
7    77
Name: Age, dtype: int64
In [14]:
df.sort_index(ascending=False)
Out [14]:
Name Born Died Age Occupation
7 Johann Gauss 1777-04-30 1855-02-23 77 Mathematician
6 Alan Turing 1912-06-23 1954-06-07 41 Computer Scientist
5 John Snow 1813-03-15 1858-06-16 45 Physician
4 Rachel Carson 1907-05-27 1964-04-14 56 Biologist
3 Marie Curie 1867-11-07 1934-07-04 66 Chemist
2 Florence Nightingale 1820-05-12 1910-08-13 90 Nurse
1 William Gosset 1876-06-13 1937-10-16 61 Statistician
0 Rosaline Franklin 1920-07-25 1958-04-16 37 Chemist
In [15]:
df.sort_values(by='Age', ascending=False)
Out [15]:
Name Born Died Age Occupation
2 Florence Nightingale 1820-05-12 1910-08-13 90 Nurse
7 Johann Gauss 1777-04-30 1855-02-23 77 Mathematician
3 Marie Curie 1867-11-07 1934-07-04 66 Chemist
1 William Gosset 1876-06-13 1937-10-16 61 Statistician
4 Rachel Carson 1907-05-27 1964-04-14 56 Biologist
5 John Snow 1813-03-15 1858-06-16 45 Physician
6 Alan Turing 1912-06-23 1954-06-07 41 Computer Scientist
0 Rosaline Franklin 1920-07-25 1958-04-16 37 Chemist

데이터프레임 다루기

  • 불린 추출과 브로드캐스팅
In [16]:
df[df['Age'] > df['Age'].mean()]
Out [16]:
Name Born Died Age Occupation
1 William Gosset 1876-06-13 1937-10-16 61 Statistician
2 Florence Nightingale 1820-05-12 1910-08-13 90 Nurse
3 Marie Curie 1867-11-07 1934-07-04 66 Chemist
7 Johann Gauss 1777-04-30 1855-02-23 77 Mathematician

시리즈와 데이터프레임의 데이터 처리하기

  • 열의 자료형 바꾸기와 새로운 열 추가
In [17]:
df.info()
Out [17]:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8 entries, 0 to 7
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Name        8 non-null      object
 1   Born        8 non-null      object
 2   Died        8 non-null      object
 3   Age         8 non-null      int64 
 4   Occupation  8 non-null      object
dtypes: int64(1), object(4)
memory usage: 448.0+ bytes

날짜계산을 위해 object타입을 datatime타입으로 변환해야 한다.

In [18]:
pd.to_datetime(df['Born']) # 옵션: arg-변환할 값, format-형식지정(형태가 다르면 반드시 지정),
    # errors-default는 raise(오류시 에러메시지 보내고 종료), ignore(오류 시 그냥 종료), coerce(문제값 NaT로 대체)
# to_datetime : 날짜로 변환
# to_numeric : 숫자로 변환
# tp_timedelta : 날짜간격을 객체로 변환
Out [18]:
0   1920-07-25
1   1876-06-13
2   1820-05-12
3   1867-11-07
4   1907-05-27
5   1813-03-15
6   1912-06-23
7   1777-04-30
Name: Born, dtype: datetime64[ns]
In [19]:
pd.to_datetime(df['Born'], format='%Y-%m-%d')
Out [19]:
0   1920-07-25
1   1876-06-13
2   1820-05-12
3   1867-11-07
4   1907-05-27
5   1813-03-15
6   1912-06-23
7   1777-04-30
Name: Born, dtype: datetime64[ns]
In [20]:
df.columns
Out [20]:
Index(['Name', 'Born', 'Died', 'Age', 'Occupation'], dtype='object')
In [21]:
# 데이터프레임 추가
df['Born_dt'] = pd.to_datetime(df['Born'], format='%Y-%m-%d')
# 없으면 추가, 있으면 덮어씀
In [22]:
df.columns
Out [22]:
Index(['Name', 'Born', 'Died', 'Age', 'Occupation', 'Born_dt'], dtype='object')
In [23]:
df.head(2)
Out [23]:
Name Born Died Age Occupation Born_dt
0 Rosaline Franklin 1920-07-25 1958-04-16 37 Chemist 1920-07-25
1 William Gosset 1876-06-13 1937-10-16 61 Statistician 1876-06-13
In [24]:
df.info()
Out [24]:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8 entries, 0 to 7
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   Name        8 non-null      object        
 1   Born        8 non-null      object        
 2   Died        8 non-null      object        
 3   Age         8 non-null      int64         
 4   Occupation  8 non-null      object        
 5   Born_dt     8 non-null      datetime64[ns]
dtypes: datetime64[ns](1), int64(1), object(4)
memory usage: 512.0+ bytes

In [25]:
# 데이터프레임 추가
df['Died_dt'] = pd.to_datetime(df['Died'], format='%Y-%m-%d')
In [26]:
df.head(2)
Out [26]:
Name Born Died Age Occupation Born_dt Died_dt
0 Rosaline Franklin 1920-07-25 1958-04-16 37 Chemist 1920-07-25 1958-04-16
1 William Gosset 1876-06-13 1937-10-16 61 Statistician 1876-06-13 1937-10-16
In [27]:
# datetime 타입의 연산
df['Died_dt'] - df['Born_dt']
Out [27]:
0   13779 days
1   22404 days
2   32964 days
3   24345 days
4   20777 days
5   16529 days
6   15324 days
7   28422 days
dtype: timedelta64[ns]
  • 시리즈, 데이터프레임의 데이터 섞기
In [28]:
import random
In [29]:
random.seed(42) # seed값을 고정해 같은값 생성
random.shuffle(df['Age']) # 해당 column만 섞어준다. 
Out [29]:
C:\Users\user\anaconda3\lib\random.py:362: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  x[i], x[j] = x[j], x[i]

In [30]:
df
Out [30]:
Name Born Died Age Occupation Born_dt Died_dt
0 Rosaline Franklin 1920-07-25 1958-04-16 66 Chemist 1920-07-25 1958-04-16
1 William Gosset 1876-06-13 1937-10-16 56 Statistician 1876-06-13 1937-10-16
2 Florence Nightingale 1820-05-12 1910-08-13 41 Nurse 1820-05-12 1910-08-13
3 Marie Curie 1867-11-07 1934-07-04 77 Chemist 1867-11-07 1934-07-04
4 Rachel Carson 1907-05-27 1964-04-14 90 Biologist 1907-05-27 1964-04-14
5 John Snow 1813-03-15 1858-06-16 45 Physician 1813-03-15 1858-06-16
6 Alan Turing 1912-06-23 1954-06-07 37 Computer Scientist 1912-06-23 1954-06-07
7 Johann Gauss 1777-04-30 1855-02-23 61 Mathematician 1777-04-30 1855-02-23
  • column 삭제
In [31]:
df.columns
Out [31]:
Index(['Name', 'Born', 'Died', 'Age', 'Occupation', 'Born_dt', 'Died_dt'], dtype='object')
In [32]:
df.drop(columns='Age', inplace=True)
# df.drop(['Age'], axis=1)
# 옵션의 lable과 axis는 같이 사용해야 하므로 index나 columns를 단독으로 사용하는게 편하다
# inplace 옵션을 True로 해야 원본에 적용된다
# df = df.drop(columns='Age') 도 가능하다
In [33]:
df
Out [33]:
Name Born Died Occupation Born_dt Died_dt
0 Rosaline Franklin 1920-07-25 1958-04-16 Chemist 1920-07-25 1958-04-16
1 William Gosset 1876-06-13 1937-10-16 Statistician 1876-06-13 1937-10-16
2 Florence Nightingale 1820-05-12 1910-08-13 Nurse 1820-05-12 1910-08-13
3 Marie Curie 1867-11-07 1934-07-04 Chemist 1867-11-07 1934-07-04
4 Rachel Carson 1907-05-27 1964-04-14 Biologist 1907-05-27 1964-04-14
5 John Snow 1813-03-15 1858-06-16 Physician 1813-03-15 1858-06-16
6 Alan Turing 1912-06-23 1954-06-07 Computer Scientist 1912-06-23 1954-06-07
7 Johann Gauss 1777-04-30 1855-02-23 Mathematician 1777-04-30 1855-02-23

데이터 저장하고 불러오기

  • 데이터를 피클, CSV, TSV 파일로 저장하고 불러오기
In [34]:
df
Out [34]:
Name Born Died Occupation Born_dt Died_dt
0 Rosaline Franklin 1920-07-25 1958-04-16 Chemist 1920-07-25 1958-04-16
1 William Gosset 1876-06-13 1937-10-16 Statistician 1876-06-13 1937-10-16
2 Florence Nightingale 1820-05-12 1910-08-13 Nurse 1820-05-12 1910-08-13
3 Marie Curie 1867-11-07 1934-07-04 Chemist 1867-11-07 1934-07-04
4 Rachel Carson 1907-05-27 1964-04-14 Biologist 1907-05-27 1964-04-14
5 John Snow 1813-03-15 1858-06-16 Physician 1813-03-15 1858-06-16
6 Alan Turing 1912-06-23 1954-06-07 Computer Scientist 1912-06-23 1954-06-07
7 Johann Gauss 1777-04-30 1855-02-23 Mathematician 1777-04-30 1855-02-23
In [35]:
import pickle
In [36]:
f = open('data.pickle', 'wb')
pickle.dump(df, f)
# dump는 파일로 저장, dumps 파일이 아닌것 저장
f.close()
In [37]:
f = open('data.pickle', 'rb')
data = pickle.load(f)
f.close()
In [38]:
data
Out [38]:
Name Born Died Occupation Born_dt Died_dt
0 Rosaline Franklin 1920-07-25 1958-04-16 Chemist 1920-07-25 1958-04-16
1 William Gosset 1876-06-13 1937-10-16 Statistician 1876-06-13 1937-10-16
2 Florence Nightingale 1820-05-12 1910-08-13 Nurse 1820-05-12 1910-08-13
3 Marie Curie 1867-11-07 1934-07-04 Chemist 1867-11-07 1934-07-04
4 Rachel Carson 1907-05-27 1964-04-14 Biologist 1907-05-27 1964-04-14
5 John Snow 1813-03-15 1858-06-16 Physician 1813-03-15 1858-06-16
6 Alan Turing 1912-06-23 1954-06-07 Computer Scientist 1912-06-23 1954-06-07
7 Johann Gauss 1777-04-30 1855-02-23 Mathematician 1777-04-30 1855-02-23

파일객체를 만들지 않고 저장할 경우 pandas의 to_pickle, read_pickle를 사용해도 괜찮다. 하지만 이 메서드들은 시리즈와 데이터프레임에만 적용되므로 다른 모든 객체에 대해 적용하려면 pickle 모듈을 사용하는것이 좋다

  • 텍스트(json)으로 저장
In [39]:
import json

해당 형태를 만족해야 저장이 가능하다.

In [40]:
f = open('data.json', 'w')
json.dump(df['Name'].to_list(), f)
f.close()
In [41]:
f = open('data.json', 'r')
data = json.load(f)
f.close()
In [42]:
data
Out [42]:
['Rosaline Franklin',
 'William Gosset',
 'Florence Nightingale',
 'Marie Curie',
 'Rachel Carson',
 'John Snow',
 'Alan Turing',
 'Johann Gauss']

pandas에 있는 to_pickle, to_load, to_json는 그 대상이 시리즈와 데이터프레임이다.
tsv는 sep옵션에 tab임을 지정하고 확장자명은 csv로 저장해야 한다.

  • 엑셀파일 저장하고 불러오기
In [43]:
import openpyxl
In [44]:
# !pip install xlwt # cell에서 설치(입력값 넣기는 안됨)
In [45]:
# 구버전 엑셀파일(xls) 지원
import xlwt
In [46]:
df
Out [46]:
Name Born Died Occupation Born_dt Died_dt
0 Rosaline Franklin 1920-07-25 1958-04-16 Chemist 1920-07-25 1958-04-16
1 William Gosset 1876-06-13 1937-10-16 Statistician 1876-06-13 1937-10-16
2 Florence Nightingale 1820-05-12 1910-08-13 Nurse 1820-05-12 1910-08-13
3 Marie Curie 1867-11-07 1934-07-04 Chemist 1867-11-07 1934-07-04
4 Rachel Carson 1907-05-27 1964-04-14 Biologist 1907-05-27 1964-04-14
5 John Snow 1813-03-15 1858-06-16 Physician 1813-03-15 1858-06-16
6 Alan Turing 1912-06-23 1954-06-07 Computer Scientist 1912-06-23 1954-06-07
7 Johann Gauss 1777-04-30 1855-02-23 Mathematician 1777-04-30 1855-02-23
In [47]:
df.to_excel('df_excel.xls')
Out [47]:
C:\Users\user\AppData\Local\Temp\ipykernel_3236\1042641208.py:1: FutureWarning: As the xlwt package is no longer maintained, the xlwt engine will be removed in a future version of pandas. This is the only engine in pandas that supports writing in the xls format. Install openpyxl and write to an xlsx file instead. You can set the option io.excel.xls.writer to 'xlwt' to silence this warning. While this option is deprecated and will also raise a warning, it can be globally set and the warning suppressed.
  df.to_excel('df_excel.xls')

In [48]:
df.to_excel('df_excel.xlsx')
In [49]:
pd.read_excel('df_excel.xlsx') # 옵션: 파일명, 시트이름, 헤더값 등..
Out [49]:
Unnamed: 0 Name Born Died Occupation Born_dt Died_dt
0 0 Rosaline Franklin 1920-07-25 1958-04-16 Chemist 1920-07-25 1958-04-16
1 1 William Gosset 1876-06-13 1937-10-16 Statistician 1876-06-12 1937-10-16
2 2 Florence Nightingale 1820-05-12 1910-08-13 Nurse 1820-05-11 1910-08-13
3 3 Marie Curie 1867-11-07 1934-07-04 Chemist 1867-11-06 1934-07-04
4 4 Rachel Carson 1907-05-27 1964-04-14 Biologist 1907-05-27 1964-04-14
5 5 John Snow 1813-03-15 1858-06-16 Physician 1813-03-14 1858-06-15
6 6 Alan Turing 1912-06-23 1954-06-07 Computer Scientist 1912-06-23 1954-06-07
7 7 Johann Gauss 1777-04-30 1855-02-23 Mathematician 1777-04-29 1855-02-22

Reference

  • 이 포스트는 SeSAC 인공지능 자연어처리, 컴퓨터비전 기술을 활용한 응용 SW 개발자 양성 과정 - 심선조 강사님의 강의를 정리한 내용입니다.

댓글남기기