판다스 데이터프레임과 시리즈
시리즈 다루기 ― 응용
In [1]:
In [2]:
Out [2]:
|
Name |
Born |
Died |
Age |
Occupation |
0 |
Rosaline Franklin |
1920-07-25 |
1958-04-16 |
37 |
Chemist |
1 |
William Gosset |
1876-06-13 |
1937-10-16 |
61 |
Statistician |
In [3]:
Out [3]:
-
- 불린(Boolean) 추출
- 조건에 맞는 데이터 추출
In [4]:
Out [4]:
|
Name |
Born |
Died |
Age |
Occupation |
1 |
William Gosset |
1876-06-13 |
1937-10-16 |
61 |
Statistician |
2 |
Florence Nightingale |
1820-05-12 |
1910-08-13 |
90 |
Nurse |
3 |
Marie Curie |
1867-11-07 |
1934-07-04 |
66 |
Chemist |
7 |
Johann Gauss |
1777-04-30 |
1855-02-23 |
77 |
Mathematician |
In [5]:
Out [5]:
In [6]:
Out [6]:
0 False
1 True
2 True
3 True
4 False
5 False
6 False
7 True
Name: Age, dtype: bool
In [7]:
In [8]:
Out [8]:
|
Name |
Born |
Died |
Age |
Occupation |
1 |
William Gosset |
1876-06-13 |
1937-10-16 |
61 |
Statistician |
2 |
Florence Nightingale |
1820-05-12 |
1910-08-13 |
90 |
Nurse |
6 |
Alan Turing |
1912-06-23 |
1954-06-07 |
41 |
Computer Scientist |
7 |
Johann Gauss |
1777-04-30 |
1855-02-23 |
77 |
Mathematician |
In [9]:
Out [9]:
|
Name |
Born |
Died |
Age |
Occupation |
0 |
Rosaline Franklin |
1920-07-25 |
1958-04-16 |
37 |
Chemist |
3 |
Marie Curie |
1867-11-07 |
1934-07-04 |
66 |
Chemist |
In [10]:
Out [10]:
|
Name |
Born |
Died |
Age |
Occupation |
1 |
William Gosset |
1876-06-13 |
1937-10-16 |
61 |
Statistician |
2 |
Florence Nightingale |
1820-05-12 |
1910-08-13 |
90 |
Nurse |
4 |
Rachel Carson |
1907-05-27 |
1964-04-14 |
56 |
Biologist |
5 |
John Snow |
1813-03-15 |
1858-06-16 |
45 |
Physician |
6 |
Alan Turing |
1912-06-23 |
1954-06-07 |
41 |
Computer Scientist |
7 |
Johann Gauss |
1777-04-30 |
1855-02-23 |
77 |
Mathematician |
- 시리즈(Seires)와 브로드캐스팅(Broadcasting)
In [11]:
Out [11]:
0 137
1 161
2 190
3 166
4 156
5 145
6 141
7 177
Name: Age, dtype: int64
In [12]:
Out [12]:
0 38.0
1 161.0
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
dtype: float64
In [13]:
Out [13]:
0 37
1 61
2 90
3 66
4 56
5 45
6 41
7 77
Name: Age, dtype: int64
In [14]:
Out [14]:
|
Name |
Born |
Died |
Age |
Occupation |
7 |
Johann Gauss |
1777-04-30 |
1855-02-23 |
77 |
Mathematician |
6 |
Alan Turing |
1912-06-23 |
1954-06-07 |
41 |
Computer Scientist |
5 |
John Snow |
1813-03-15 |
1858-06-16 |
45 |
Physician |
4 |
Rachel Carson |
1907-05-27 |
1964-04-14 |
56 |
Biologist |
3 |
Marie Curie |
1867-11-07 |
1934-07-04 |
66 |
Chemist |
2 |
Florence Nightingale |
1820-05-12 |
1910-08-13 |
90 |
Nurse |
1 |
William Gosset |
1876-06-13 |
1937-10-16 |
61 |
Statistician |
0 |
Rosaline Franklin |
1920-07-25 |
1958-04-16 |
37 |
Chemist |
In [15]:
Out [15]:
|
Name |
Born |
Died |
Age |
Occupation |
2 |
Florence Nightingale |
1820-05-12 |
1910-08-13 |
90 |
Nurse |
7 |
Johann Gauss |
1777-04-30 |
1855-02-23 |
77 |
Mathematician |
3 |
Marie Curie |
1867-11-07 |
1934-07-04 |
66 |
Chemist |
1 |
William Gosset |
1876-06-13 |
1937-10-16 |
61 |
Statistician |
4 |
Rachel Carson |
1907-05-27 |
1964-04-14 |
56 |
Biologist |
5 |
John Snow |
1813-03-15 |
1858-06-16 |
45 |
Physician |
6 |
Alan Turing |
1912-06-23 |
1954-06-07 |
41 |
Computer Scientist |
0 |
Rosaline Franklin |
1920-07-25 |
1958-04-16 |
37 |
Chemist |
데이터프레임 다루기
In [16]:
Out [16]:
|
Name |
Born |
Died |
Age |
Occupation |
1 |
William Gosset |
1876-06-13 |
1937-10-16 |
61 |
Statistician |
2 |
Florence Nightingale |
1820-05-12 |
1910-08-13 |
90 |
Nurse |
3 |
Marie Curie |
1867-11-07 |
1934-07-04 |
66 |
Chemist |
7 |
Johann Gauss |
1777-04-30 |
1855-02-23 |
77 |
Mathematician |
시리즈와 데이터프레임의 데이터 처리하기
In [17]:
Out [17]:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8 entries, 0 to 7
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Name 8 non-null object
1 Born 8 non-null object
2 Died 8 non-null object
3 Age 8 non-null int64
4 Occupation 8 non-null object
dtypes: int64(1), object(4)
memory usage: 448.0+ bytes
날짜계산을 위해 object타입을 datatime타입으로 변환해야 한다.
In [18]:
Out [18]:
0 1920-07-25
1 1876-06-13
2 1820-05-12
3 1867-11-07
4 1907-05-27
5 1813-03-15
6 1912-06-23
7 1777-04-30
Name: Born, dtype: datetime64[ns]
In [19]:
Out [19]:
0 1920-07-25
1 1876-06-13
2 1820-05-12
3 1867-11-07
4 1907-05-27
5 1813-03-15
6 1912-06-23
7 1777-04-30
Name: Born, dtype: datetime64[ns]
In [20]:
Out [20]:
Index(['Name', 'Born', 'Died', 'Age', 'Occupation'], dtype='object')
In [21]:
In [22]:
Out [22]:
Index(['Name', 'Born', 'Died', 'Age', 'Occupation', 'Born_dt'], dtype='object')
In [23]:
Out [23]:
|
Name |
Born |
Died |
Age |
Occupation |
Born_dt |
0 |
Rosaline Franklin |
1920-07-25 |
1958-04-16 |
37 |
Chemist |
1920-07-25 |
1 |
William Gosset |
1876-06-13 |
1937-10-16 |
61 |
Statistician |
1876-06-13 |
In [24]:
Out [24]:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8 entries, 0 to 7
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Name 8 non-null object
1 Born 8 non-null object
2 Died 8 non-null object
3 Age 8 non-null int64
4 Occupation 8 non-null object
5 Born_dt 8 non-null datetime64[ns]
dtypes: datetime64[ns](1), int64(1), object(4)
memory usage: 512.0+ bytes
In [25]:
In [26]:
Out [26]:
|
Name |
Born |
Died |
Age |
Occupation |
Born_dt |
Died_dt |
0 |
Rosaline Franklin |
1920-07-25 |
1958-04-16 |
37 |
Chemist |
1920-07-25 |
1958-04-16 |
1 |
William Gosset |
1876-06-13 |
1937-10-16 |
61 |
Statistician |
1876-06-13 |
1937-10-16 |
In [27]:
Out [27]:
0 13779 days
1 22404 days
2 32964 days
3 24345 days
4 20777 days
5 16529 days
6 15324 days
7 28422 days
dtype: timedelta64[ns]
In [28]:
In [29]:
Out [29]:
C:\Users\user\anaconda3\lib\random.py:362: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
x[i], x[j] = x[j], x[i]
In [30]:
Out [30]:
|
Name |
Born |
Died |
Age |
Occupation |
Born_dt |
Died_dt |
0 |
Rosaline Franklin |
1920-07-25 |
1958-04-16 |
66 |
Chemist |
1920-07-25 |
1958-04-16 |
1 |
William Gosset |
1876-06-13 |
1937-10-16 |
56 |
Statistician |
1876-06-13 |
1937-10-16 |
2 |
Florence Nightingale |
1820-05-12 |
1910-08-13 |
41 |
Nurse |
1820-05-12 |
1910-08-13 |
3 |
Marie Curie |
1867-11-07 |
1934-07-04 |
77 |
Chemist |
1867-11-07 |
1934-07-04 |
4 |
Rachel Carson |
1907-05-27 |
1964-04-14 |
90 |
Biologist |
1907-05-27 |
1964-04-14 |
5 |
John Snow |
1813-03-15 |
1858-06-16 |
45 |
Physician |
1813-03-15 |
1858-06-16 |
6 |
Alan Turing |
1912-06-23 |
1954-06-07 |
37 |
Computer Scientist |
1912-06-23 |
1954-06-07 |
7 |
Johann Gauss |
1777-04-30 |
1855-02-23 |
61 |
Mathematician |
1777-04-30 |
1855-02-23 |
In [31]:
Out [31]:
Index(['Name', 'Born', 'Died', 'Age', 'Occupation', 'Born_dt', 'Died_dt'], dtype='object')
In [32]:
In [33]:
Out [33]:
|
Name |
Born |
Died |
Occupation |
Born_dt |
Died_dt |
0 |
Rosaline Franklin |
1920-07-25 |
1958-04-16 |
Chemist |
1920-07-25 |
1958-04-16 |
1 |
William Gosset |
1876-06-13 |
1937-10-16 |
Statistician |
1876-06-13 |
1937-10-16 |
2 |
Florence Nightingale |
1820-05-12 |
1910-08-13 |
Nurse |
1820-05-12 |
1910-08-13 |
3 |
Marie Curie |
1867-11-07 |
1934-07-04 |
Chemist |
1867-11-07 |
1934-07-04 |
4 |
Rachel Carson |
1907-05-27 |
1964-04-14 |
Biologist |
1907-05-27 |
1964-04-14 |
5 |
John Snow |
1813-03-15 |
1858-06-16 |
Physician |
1813-03-15 |
1858-06-16 |
6 |
Alan Turing |
1912-06-23 |
1954-06-07 |
Computer Scientist |
1912-06-23 |
1954-06-07 |
7 |
Johann Gauss |
1777-04-30 |
1855-02-23 |
Mathematician |
1777-04-30 |
1855-02-23 |
데이터 저장하고 불러오기
- 데이터를 피클, CSV, TSV 파일로 저장하고 불러오기
In [34]:
Out [34]:
|
Name |
Born |
Died |
Occupation |
Born_dt |
Died_dt |
0 |
Rosaline Franklin |
1920-07-25 |
1958-04-16 |
Chemist |
1920-07-25 |
1958-04-16 |
1 |
William Gosset |
1876-06-13 |
1937-10-16 |
Statistician |
1876-06-13 |
1937-10-16 |
2 |
Florence Nightingale |
1820-05-12 |
1910-08-13 |
Nurse |
1820-05-12 |
1910-08-13 |
3 |
Marie Curie |
1867-11-07 |
1934-07-04 |
Chemist |
1867-11-07 |
1934-07-04 |
4 |
Rachel Carson |
1907-05-27 |
1964-04-14 |
Biologist |
1907-05-27 |
1964-04-14 |
5 |
John Snow |
1813-03-15 |
1858-06-16 |
Physician |
1813-03-15 |
1858-06-16 |
6 |
Alan Turing |
1912-06-23 |
1954-06-07 |
Computer Scientist |
1912-06-23 |
1954-06-07 |
7 |
Johann Gauss |
1777-04-30 |
1855-02-23 |
Mathematician |
1777-04-30 |
1855-02-23 |
In [35]:
In [36]:
In [37]:
In [38]:
Out [38]:
|
Name |
Born |
Died |
Occupation |
Born_dt |
Died_dt |
0 |
Rosaline Franklin |
1920-07-25 |
1958-04-16 |
Chemist |
1920-07-25 |
1958-04-16 |
1 |
William Gosset |
1876-06-13 |
1937-10-16 |
Statistician |
1876-06-13 |
1937-10-16 |
2 |
Florence Nightingale |
1820-05-12 |
1910-08-13 |
Nurse |
1820-05-12 |
1910-08-13 |
3 |
Marie Curie |
1867-11-07 |
1934-07-04 |
Chemist |
1867-11-07 |
1934-07-04 |
4 |
Rachel Carson |
1907-05-27 |
1964-04-14 |
Biologist |
1907-05-27 |
1964-04-14 |
5 |
John Snow |
1813-03-15 |
1858-06-16 |
Physician |
1813-03-15 |
1858-06-16 |
6 |
Alan Turing |
1912-06-23 |
1954-06-07 |
Computer Scientist |
1912-06-23 |
1954-06-07 |
7 |
Johann Gauss |
1777-04-30 |
1855-02-23 |
Mathematician |
1777-04-30 |
1855-02-23 |
파일객체를 만들지 않고 저장할 경우 pandas의 to_pickle, read_pickle를 사용해도 괜찮다.
하지만 이 메서드들은 시리즈와 데이터프레임에만 적용되므로 다른 모든 객체에 대해 적용하려면 pickle 모듈을 사용하는것이 좋다
In [39]:
해당 형태를 만족해야 저장이 가능하다.
In [40]:
In [41]:
In [42]:
Out [42]:
['Rosaline Franklin',
'William Gosset',
'Florence Nightingale',
'Marie Curie',
'Rachel Carson',
'John Snow',
'Alan Turing',
'Johann Gauss']
pandas에 있는 to_pickle, to_load, to_json는 그 대상이 시리즈와 데이터프레임이다.
tsv는 sep옵션에 tab임을 지정하고 확장자명은 csv로 저장해야 한다.
In [43]:
In [44]:
In [45]:
In [46]:
Out [46]:
|
Name |
Born |
Died |
Occupation |
Born_dt |
Died_dt |
0 |
Rosaline Franklin |
1920-07-25 |
1958-04-16 |
Chemist |
1920-07-25 |
1958-04-16 |
1 |
William Gosset |
1876-06-13 |
1937-10-16 |
Statistician |
1876-06-13 |
1937-10-16 |
2 |
Florence Nightingale |
1820-05-12 |
1910-08-13 |
Nurse |
1820-05-12 |
1910-08-13 |
3 |
Marie Curie |
1867-11-07 |
1934-07-04 |
Chemist |
1867-11-07 |
1934-07-04 |
4 |
Rachel Carson |
1907-05-27 |
1964-04-14 |
Biologist |
1907-05-27 |
1964-04-14 |
5 |
John Snow |
1813-03-15 |
1858-06-16 |
Physician |
1813-03-15 |
1858-06-16 |
6 |
Alan Turing |
1912-06-23 |
1954-06-07 |
Computer Scientist |
1912-06-23 |
1954-06-07 |
7 |
Johann Gauss |
1777-04-30 |
1855-02-23 |
Mathematician |
1777-04-30 |
1855-02-23 |
In [47]:
Out [47]:
C:\Users\user\AppData\Local\Temp\ipykernel_3236\1042641208.py:1: FutureWarning: As the xlwt package is no longer maintained, the xlwt engine will be removed in a future version of pandas. This is the only engine in pandas that supports writing in the xls format. Install openpyxl and write to an xlsx file instead. You can set the option io.excel.xls.writer to 'xlwt' to silence this warning. While this option is deprecated and will also raise a warning, it can be globally set and the warning suppressed.
df.to_excel('df_excel.xls')
In [48]:
In [49]:
Out [49]:
|
Unnamed: 0 |
Name |
Born |
Died |
Occupation |
Born_dt |
Died_dt |
0 |
0 |
Rosaline Franklin |
1920-07-25 |
1958-04-16 |
Chemist |
1920-07-25 |
1958-04-16 |
1 |
1 |
William Gosset |
1876-06-13 |
1937-10-16 |
Statistician |
1876-06-12 |
1937-10-16 |
2 |
2 |
Florence Nightingale |
1820-05-12 |
1910-08-13 |
Nurse |
1820-05-11 |
1910-08-13 |
3 |
3 |
Marie Curie |
1867-11-07 |
1934-07-04 |
Chemist |
1867-11-06 |
1934-07-04 |
4 |
4 |
Rachel Carson |
1907-05-27 |
1964-04-14 |
Biologist |
1907-05-27 |
1964-04-14 |
5 |
5 |
John Snow |
1813-03-15 |
1858-06-16 |
Physician |
1813-03-14 |
1858-06-15 |
6 |
6 |
Alan Turing |
1912-06-23 |
1954-06-07 |
Computer Scientist |
1912-06-23 |
1954-06-07 |
7 |
7 |
Johann Gauss |
1777-04-30 |
1855-02-23 |
Mathematician |
1777-04-29 |
1855-02-22 |
Reference
- 이 포스트는 SeSAC 인공지능 자연어처리, 컴퓨터비전 기술을 활용한 응용 SW 개발자 양성 과정 - 심선조 강사님의 강의를 정리한 내용입니다.
댓글남기기