Pandas (3) - DataFrame과 Series (2)

2022년 11월 16일 6 분 소요

판다스 데이터프레임과 시리즈

시리즈 다루기 ― 응용

In [1]:

import pandas as pd

In [2]:

df = pd.read_csv('data/scientists.csv')
df.head(2)

Out [2]:

	Name	Born	Died	Age	Occupation
0	Rosaline Franklin	1920-07-25	1958-04-16	37	Chemist
1	William Gosset	1876-06-13	1937-10-16	61	Statistician

In [3]:

df['Age'].max()

Out [3]:

불린(Boolean) 추출

조건에 맞는 데이터 추출

In [4]:

df[df['Age'] > df['Age'].mean()]

Out [4]:

	Name	Born	Died	Age	Occupation
1	William Gosset	1876-06-13	1937-10-16	61	Statistician
2	Florence Nightingale	1820-05-12	1910-08-13	90	Nurse
3	Marie Curie	1867-11-07	1934-07-04	66	Chemist
7	Johann Gauss	1777-04-30	1855-02-23	77	Mathematician

In [5]:

df['Age'].mean()

Out [5]:

59.125

In [6]:

# 조건식만 보기
df['Age'] > df['Age'].mean()

Out [6]:

  False
   True
   True
   True
  False
  False
  False
   True
Name: Age, dtype: bool

In [7]:

cond = [False, True, True, False, False, False, True, True]

In [8]:

df[cond]

Out [8]:

	Name	Born	Died	Age	Occupation
1	William Gosset	1876-06-13	1937-10-16	61	Statistician
2	Florence Nightingale	1820-05-12	1910-08-13	90	Nurse
6	Alan Turing	1912-06-23	1954-06-07	41	Computer Scientist
7	Johann Gauss	1777-04-30	1855-02-23	77	Mathematician

In [9]:

# 직업이 Chemist인 행 추출
df[df['Occupation'] == 'Chemist']

Out [9]:

	Name	Born	Died	Age	Occupation
0	Rosaline Franklin	1920-07-25	1958-04-16	37	Chemist
3	Marie Curie	1867-11-07	1934-07-04	66	Chemist

In [10]:

# 직업이 Chemist를 제외하고 행 추출
df[df['Occupation'] != 'Chemist']

Out [10]:

	Name	Born	Died	Age	Occupation
1	William Gosset	1876-06-13	1937-10-16	61	Statistician
2	Florence Nightingale	1820-05-12	1910-08-13	90	Nurse
4	Rachel Carson	1907-05-27	1964-04-14	56	Biologist
5	John Snow	1813-03-15	1858-06-16	45	Physician
6	Alan Turing	1912-06-23	1954-06-07	41	Computer Scientist
7	Johann Gauss	1777-04-30	1855-02-23	77	Mathematician

시리즈(Seires)와 브로드캐스팅(Broadcasting)

In [11]:

# 스칼라(Scalar)와 브로드캐스팅
df['Age'] + 100

Out [11]:

  137
  161
  190
  166
  156
  145
  141
  177
Name: Age, dtype: int64

In [12]:

df['Age'] + pd.Series([1, 100]) # 나머지 값들은 NaN으로 되어 계산됨
# NaN과의 계산은 NaN이 된다.

Out [12]:

   38.0
  161.0
    NaN
    NaN
    NaN
    NaN
    NaN
    NaN
dtype: float64

In [13]:

df['Age'].sort_index() # 많이 쓰는 옵션: axis, ascending, inplace, ignore_index
# ascending : default는 True, 오름차순 ; 내림차순으로 하려면 False
# inplace : default는 False; 원본에 적용하려면 True

Out [13]:

  37
  61
  90
  66
  56
  45
  41
  77
Name: Age, dtype: int64

In [14]:

df.sort_index(ascending=False)

Out [14]:

	Name	Born	Died	Age	Occupation
7	Johann Gauss	1777-04-30	1855-02-23	77	Mathematician
6	Alan Turing	1912-06-23	1954-06-07	41	Computer Scientist
5	John Snow	1813-03-15	1858-06-16	45	Physician
4	Rachel Carson	1907-05-27	1964-04-14	56	Biologist
3	Marie Curie	1867-11-07	1934-07-04	66	Chemist
2	Florence Nightingale	1820-05-12	1910-08-13	90	Nurse
1	William Gosset	1876-06-13	1937-10-16	61	Statistician
0	Rosaline Franklin	1920-07-25	1958-04-16	37	Chemist

In [15]:

df.sort_values(by='Age', ascending=False)

Out [15]:

	Name	Born	Died	Age	Occupation
2	Florence Nightingale	1820-05-12	1910-08-13	90	Nurse
7	Johann Gauss	1777-04-30	1855-02-23	77	Mathematician
3	Marie Curie	1867-11-07	1934-07-04	66	Chemist
1	William Gosset	1876-06-13	1937-10-16	61	Statistician
4	Rachel Carson	1907-05-27	1964-04-14	56	Biologist
5	John Snow	1813-03-15	1858-06-16	45	Physician
6	Alan Turing	1912-06-23	1954-06-07	41	Computer Scientist
0	Rosaline Franklin	1920-07-25	1958-04-16	37	Chemist

데이터프레임 다루기

불린 추출과 브로드캐스팅

In [16]:

df[df['Age'] > df['Age'].mean()]

Out [16]:

	Name	Born	Died	Age	Occupation
1	William Gosset	1876-06-13	1937-10-16	61	Statistician
2	Florence Nightingale	1820-05-12	1910-08-13	90	Nurse
3	Marie Curie	1867-11-07	1934-07-04	66	Chemist
7	Johann Gauss	1777-04-30	1855-02-23	77	Mathematician

시리즈와 데이터프레임의 데이터 처리하기

열의 자료형 바꾸기와 새로운 열 추가

In [17]:

df.info()

Out [17]:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8 entries, 0 to 7
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Name        8 non-null      object
 1   Born        8 non-null      object
 2   Died        8 non-null      object
 3   Age         8 non-null      int64 
 4   Occupation  8 non-null      object
dtypes: int64(1), object(4)
memory usage: 448.0+ bytes

날짜계산을 위해 object타입을 datatime타입으로 변환해야 한다.

In [18]:

pd.to_datetime(df['Born']) # 옵션: arg-변환할 값, format-형식지정(형태가 다르면 반드시 지정),
    # errors-default는 raise(오류시 에러메시지 보내고 종료), ignore(오류 시 그냥 종료), coerce(문제값 NaT로 대체)
# to_datetime : 날짜로 변환
# to_numeric : 숫자로 변환
# tp_timedelta : 날짜간격을 객체로 변환

Out [18]:

 1920-07-25
 1876-06-13
 1820-05-12
 1867-11-07
 1907-05-27
 1813-03-15
 1912-06-23
 1777-04-30
Name: Born, dtype: datetime64[ns]

In [19]:

pd.to_datetime(df['Born'], format='%Y-%m-%d')

Out [19]:

 1920-07-25
 1876-06-13
 1820-05-12
 1867-11-07
 1907-05-27
 1813-03-15
 1912-06-23
 1777-04-30
Name: Born, dtype: datetime64[ns]

In [20]:

df.columns

Out [20]:

Index(['Name', 'Born', 'Died', 'Age', 'Occupation'], dtype='object')

In [21]:

# 데이터프레임 추가
df['Born_dt'] = pd.to_datetime(df['Born'], format='%Y-%m-%d')
# 없으면 추가, 있으면 덮어씀

In [22]:

df.columns

Out [22]:

Index(['Name', 'Born', 'Died', 'Age', 'Occupation', 'Born_dt'], dtype='object')

In [23]:

df.head(2)

Out [23]:

	Name	Born	Died	Age	Occupation	Born_dt
0	Rosaline Franklin	1920-07-25	1958-04-16	37	Chemist	1920-07-25
1	William Gosset	1876-06-13	1937-10-16	61	Statistician	1876-06-13

In [24]:

df.info()

Out [24]:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8 entries, 0 to 7
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   Name        8 non-null      object        
 1   Born        8 non-null      object        
 2   Died        8 non-null      object        
 3   Age         8 non-null      int64         
 4   Occupation  8 non-null      object        
 5   Born_dt     8 non-null      datetime64[ns]
dtypes: datetime64[ns](1), int64(1), object(4)
memory usage: 512.0+ bytes

In [25]:

# 데이터프레임 추가
df['Died_dt'] = pd.to_datetime(df['Died'], format='%Y-%m-%d')

In [26]:

df.head(2)

Out [26]:

	Name	Born	Died	Age	Occupation	Born_dt	Died_dt
0	Rosaline Franklin	1920-07-25	1958-04-16	37	Chemist	1920-07-25	1958-04-16
1	William Gosset	1876-06-13	1937-10-16	61	Statistician	1876-06-13	1937-10-16

In [27]:

# datetime 타입의 연산
df['Died_dt'] - df['Born_dt']

Out [27]:

 13779 days
 22404 days
 32964 days
 24345 days
 20777 days
 16529 days
 15324 days
 28422 days
dtype: timedelta64[ns]

시리즈, 데이터프레임의 데이터 섞기

In [28]:

import random

In [29]:

random.seed(42) # seed값을 고정해 같은값 생성
random.shuffle(df['Age']) # 해당 column만 섞어준다. 

Out [29]:

C:\Users\user\anaconda3\lib\random.py:362: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  x[i], x[j] = x[j], x[i]

In [30]:

df

Out [30]:

	Name	Born	Died	Age	Occupation	Born_dt	Died_dt
0	Rosaline Franklin	1920-07-25	1958-04-16	66	Chemist	1920-07-25	1958-04-16
1	William Gosset	1876-06-13	1937-10-16	56	Statistician	1876-06-13	1937-10-16
2	Florence Nightingale	1820-05-12	1910-08-13	41	Nurse	1820-05-12	1910-08-13
3	Marie Curie	1867-11-07	1934-07-04	77	Chemist	1867-11-07	1934-07-04
4	Rachel Carson	1907-05-27	1964-04-14	90	Biologist	1907-05-27	1964-04-14
5	John Snow	1813-03-15	1858-06-16	45	Physician	1813-03-15	1858-06-16
6	Alan Turing	1912-06-23	1954-06-07	37	Computer Scientist	1912-06-23	1954-06-07
7	Johann Gauss	1777-04-30	1855-02-23	61	Mathematician	1777-04-30	1855-02-23

column 삭제

In [31]:

df.columns

Out [31]:

Index(['Name', 'Born', 'Died', 'Age', 'Occupation', 'Born_dt', 'Died_dt'], dtype='object')

In [32]:

df.drop(columns='Age', inplace=True)
# df.drop(['Age'], axis=1)
# 옵션의 lable과 axis는 같이 사용해야 하므로 index나 columns를 단독으로 사용하는게 편하다
# inplace 옵션을 True로 해야 원본에 적용된다
# df = df.drop(columns='Age') 도 가능하다

In [33]:

df

Out [33]:

	Name	Born	Died	Occupation	Born_dt	Died_dt
0	Rosaline Franklin	1920-07-25	1958-04-16	Chemist	1920-07-25	1958-04-16
1	William Gosset	1876-06-13	1937-10-16	Statistician	1876-06-13	1937-10-16
2	Florence Nightingale	1820-05-12	1910-08-13	Nurse	1820-05-12	1910-08-13
3	Marie Curie	1867-11-07	1934-07-04	Chemist	1867-11-07	1934-07-04
4	Rachel Carson	1907-05-27	1964-04-14	Biologist	1907-05-27	1964-04-14
5	John Snow	1813-03-15	1858-06-16	Physician	1813-03-15	1858-06-16
6	Alan Turing	1912-06-23	1954-06-07	Computer Scientist	1912-06-23	1954-06-07
7	Johann Gauss	1777-04-30	1855-02-23	Mathematician	1777-04-30	1855-02-23

데이터 저장하고 불러오기

데이터를 피클, CSV, TSV 파일로 저장하고 불러오기

In [34]:

df

Out [34]:

	Name	Born	Died	Occupation	Born_dt	Died_dt
0	Rosaline Franklin	1920-07-25	1958-04-16	Chemist	1920-07-25	1958-04-16
1	William Gosset	1876-06-13	1937-10-16	Statistician	1876-06-13	1937-10-16
2	Florence Nightingale	1820-05-12	1910-08-13	Nurse	1820-05-12	1910-08-13
3	Marie Curie	1867-11-07	1934-07-04	Chemist	1867-11-07	1934-07-04
4	Rachel Carson	1907-05-27	1964-04-14	Biologist	1907-05-27	1964-04-14
5	John Snow	1813-03-15	1858-06-16	Physician	1813-03-15	1858-06-16
6	Alan Turing	1912-06-23	1954-06-07	Computer Scientist	1912-06-23	1954-06-07
7	Johann Gauss	1777-04-30	1855-02-23	Mathematician	1777-04-30	1855-02-23

In [35]:

import pickle

In [36]:

f = open('data.pickle', 'wb')
pickle.dump(df, f)
# dump는 파일로 저장, dumps 파일이 아닌것 저장
f.close()

In [37]:

f = open('data.pickle', 'rb')
data = pickle.load(f)
f.close()

In [38]:

data

Out [38]:

	Name	Born	Died	Occupation	Born_dt	Died_dt
0	Rosaline Franklin	1920-07-25	1958-04-16	Chemist	1920-07-25	1958-04-16
1	William Gosset	1876-06-13	1937-10-16	Statistician	1876-06-13	1937-10-16
2	Florence Nightingale	1820-05-12	1910-08-13	Nurse	1820-05-12	1910-08-13
3	Marie Curie	1867-11-07	1934-07-04	Chemist	1867-11-07	1934-07-04
4	Rachel Carson	1907-05-27	1964-04-14	Biologist	1907-05-27	1964-04-14
5	John Snow	1813-03-15	1858-06-16	Physician	1813-03-15	1858-06-16
6	Alan Turing	1912-06-23	1954-06-07	Computer Scientist	1912-06-23	1954-06-07
7	Johann Gauss	1777-04-30	1855-02-23	Mathematician	1777-04-30	1855-02-23

파일객체를 만들지 않고 저장할 경우 pandas의 to_pickle, read_pickle를 사용해도 괜찮다. 하지만 이 메서드들은 시리즈와 데이터프레임에만 적용되므로 다른 모든 객체에 대해 적용하려면 pickle 모듈을 사용하는것이 좋다

텍스트(json)으로 저장

In [39]:

import json

해당 형태를 만족해야 저장이 가능하다.

In [40]:

f = open('data.json', 'w')
json.dump(df['Name'].to_list(), f)
f.close()

In [41]:

f = open('data.json', 'r')
data = json.load(f)
f.close()

In [42]:

data

Out [42]:

['Rosaline Franklin',
 'William Gosset',
 'Florence Nightingale',
 'Marie Curie',
 'Rachel Carson',
 'John Snow',
 'Alan Turing',
 'Johann Gauss']

pandas에 있는 to_pickle, to_load, to_json는 그 대상이 시리즈와 데이터프레임이다.
tsv는 sep옵션에 tab임을 지정하고 확장자명은 csv로 저장해야 한다.

엑셀파일 저장하고 불러오기

In [43]:

import openpyxl

In [44]:

# !pip install xlwt # cell에서 설치(입력값 넣기는 안됨)

In [45]:

# 구버전 엑셀파일(xls) 지원
import xlwt

In [46]:

df

Out [46]:

	Name	Born	Died	Occupation	Born_dt	Died_dt
0	Rosaline Franklin	1920-07-25	1958-04-16	Chemist	1920-07-25	1958-04-16
1	William Gosset	1876-06-13	1937-10-16	Statistician	1876-06-13	1937-10-16
2	Florence Nightingale	1820-05-12	1910-08-13	Nurse	1820-05-12	1910-08-13
3	Marie Curie	1867-11-07	1934-07-04	Chemist	1867-11-07	1934-07-04
4	Rachel Carson	1907-05-27	1964-04-14	Biologist	1907-05-27	1964-04-14
5	John Snow	1813-03-15	1858-06-16	Physician	1813-03-15	1858-06-16
6	Alan Turing	1912-06-23	1954-06-07	Computer Scientist	1912-06-23	1954-06-07
7	Johann Gauss	1777-04-30	1855-02-23	Mathematician	1777-04-30	1855-02-23

In [47]:

df.to_excel('df_excel.xls')

Out [47]:

C:\Users\user\AppData\Local\Temp\ipykernel_3236\1042641208.py:1: FutureWarning: As the xlwt package is no longer maintained, the xlwt engine will be removed in a future version of pandas. This is the only engine in pandas that supports writing in the xls format. Install openpyxl and write to an xlsx file instead. You can set the option io.excel.xls.writer to 'xlwt' to silence this warning. While this option is deprecated and will also raise a warning, it can be globally set and the warning suppressed.
  df.to_excel('df_excel.xls')

In [48]:

df.to_excel('df_excel.xlsx')

In [49]:

pd.read_excel('df_excel.xlsx') # 옵션: 파일명, 시트이름, 헤더값 등..

Out [49]:

	Unnamed: 0	Name	Born	Died	Occupation	Born_dt	Died_dt
0	0	Rosaline Franklin	1920-07-25	1958-04-16	Chemist	1920-07-25	1958-04-16
1	1	William Gosset	1876-06-13	1937-10-16	Statistician	1876-06-12	1937-10-16
2	2	Florence Nightingale	1820-05-12	1910-08-13	Nurse	1820-05-11	1910-08-13
3	3	Marie Curie	1867-11-07	1934-07-04	Chemist	1867-11-06	1934-07-04
4	4	Rachel Carson	1907-05-27	1964-04-14	Biologist	1907-05-27	1964-04-14
5	5	John Snow	1813-03-15	1858-06-16	Physician	1813-03-14	1858-06-15
6	6	Alan Turing	1912-06-23	1954-06-07	Computer Scientist	1912-06-23	1954-06-07
7	7	Johann Gauss	1777-04-30	1855-02-23	Mathematician	1777-04-29	1855-02-22

Reference

이 포스트는 SeSAC 인공지능 자연어처리, 컴퓨터비전 기술을 활용한 응용 SW 개발자 양성 과정 - 심선조 강사님의 강의를 정리한 내용입니다.

Twitter Facebook LinkedIn

Kwak Jisub

Pandas (3) - DataFrame과 Series (2)

판다스 데이터프레임과 시리즈

시리즈 다루기 ― 응용

데이터프레임 다루기

시리즈와 데이터프레임의 데이터 처리하기

데이터 저장하고 불러오기

Reference

공유하기

댓글남기기

참고

python max 함수의 매개변수

python round 함수의 문제

python set 자료형과 메소드

MySQL GROUP BY와 MAX