TIL (Today I Learned)

[2025/01/02] 내일배움캠프 (Naeil Baeum Camp) QA/QC Cohort 1 - Day 12

essay2892 2025. 1. 2. 19:56

Data preprocessing practice

Let's practice preprocessing with the iris dataset.

import seaborn as sns

import pandas as pd

# Load the dataset

iris_data = sns.load_dataset('iris')

Q1. Select the rows where the 'species' column is 'setosa'

setosa = iris_data.loc[iris_data['species'] == 'setosa']

setosa

Q2. Select rows 10 through 20 and columns 1 through 3 (by position)

subs = iris_data.iloc[10:21, 1:4]

subs
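The slice bounds in Q2 look off by one because iloc is position-based and excludes the end position, while loc is label-based and includes the end label. A minimal sketch on a made-up frame (the `toy` DataFrame and its column names are assumptions for illustration, not the iris data):

```python
import pandas as pd

# Toy frame with a default 0..29 integer index (hypothetical data, not iris)
toy = pd.DataFrame({
    'a': range(30),
    'b': range(30, 60),
    'c': range(60, 90),
    'd': range(90, 120),
})

# iloc is position-based and end-exclusive: 10:21 covers positions 10..20
by_position = toy.iloc[10:21, 1:4]

# loc is label-based and end-inclusive: 10:20 covers labels 10..20
by_label = toy.loc[10:20, 'b':'d']

print(by_position.shape)
print(by_position.equals(by_label))
```

Both selections return the same 11 rows and 3 columns here only because the index labels happen to match the positions.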

Let's practice preprocessing with the tips dataset.

import seaborn as sns

# Load the dataset

tips_data = sns.load_dataset('tips')

Q1. Select only the rows where total_bill is 30 or more

high30 = tips_data.loc[tips_data['total_bill'] >= 30]

high30

Q2. Group the data by sex ('sex') and compute the average tip (tip)

tips_data[['sex','tip']].groupby('sex').mean()

Q3. Group the data by 'day' and 'time' and compute the sum of the total bill (total_bill)

tips_data[['day','time','total_bill']].groupby(['day', 'time']).sum()

Q4. Build a new DataFrame with the average tip (tip) for each day based on the 'day' column, then merge it into the original tips dataset

day_tip = tips_data[['day','tip']].groupby('day').mean().reset_index()

day_tip.columns = ['day','avg_tip']

day_tip

tips_data = pd.merge(tips_data, day_tip, on='day', how='left')

tips_data
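As a cross-check on the Q4 approach, the same per-group average can be attached without a merge by using groupby with transform. This is only a sketch on a made-up miniature of the tips data (the `mini` frame and its values are assumptions):

```python
import pandas as pd

# Hypothetical miniature of the tips data
mini = pd.DataFrame({
    'day': ['Thur', 'Thur', 'Fri', 'Fri', 'Sat'],
    'tip': [2.0, 4.0, 1.0, 3.0, 5.0],
})

# Approach from Q4: aggregate, rename, then left-merge back
day_tip = mini.groupby('day')['tip'].mean().reset_index()
day_tip.columns = ['day', 'avg_tip']
merged = pd.merge(mini, day_tip, on='day', how='left')

# Alternative: transform returns one value per original row, so no merge is needed
mini['avg_tip'] = mini.groupby('day')['tip'].transform('mean')

print(merged)
print(mini)
```

The merge version is handy when the aggregated table is useful on its own; transform is shorter when only the extra column is needed.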

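Preprocessing & Visualization, Week 4

Data visualization is a tool that supports decision-making.

It plays an important role in making people aware of the effect and impact of an action and in persuading them.

Present analysis results together with visuals of the expected effect to make them far more persuasive.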

import pandas as pd

import matplotlib.pyplot as plt

x = [1,2,3,4,5]

y = [2,4,6,8,10]

plt.plot(x, y)

plt.xlabel('X-axis')

plt.ylabel('Y-axis')

plt.title('Example')

plt.show()

df = pd.DataFrame({

'A': [1, 2, 3, 4, 5],

'B': [5, 4, 3, 2, 1]

})

df.plot(x='A', y='B')

df.plot(x='A', y='B', color='green', linestyle='--', marker = 'o')

ax = df.plot(x='A', y='B', color='red', linestyle='--', marker = 'o')

ax.legend(['Data Series'])

plt.show()

ax = df.plot(x='A', y='B', color='red', linestyle='--', marker = 'o')

ax.legend(['Data Series'])

ax.set_xlabel('X-axis')

ax.set_ylabel('Y-axis')

ax.set_title('Title')

ax.text(3, 3, 'Some Text', fontsize = 15)

ax.text(2, 2, 'Some Text2', fontsize = 10)

plt.show()

plt.figure(figsize=(18,6))

x = [1,2,3,4,5]

y = [2,4,6,8,10]

plt.plot(x, y)

plt.show()

fig, ax = plt.subplots(figsize = (18,6))

ax = df.plot(x='A', y='B', color='red', linestyle='--', marker = 'o', ax = ax)

ax.legend(['Data Series'])

ax.set_xlabel('X-axis')

ax.set_ylabel('Y-axis')

ax.set_title('Title')

ax.text(3, 3, 'Some Text', fontsize = 15)

ax.text(2, 2, 'Some Text2', fontsize = 10)

plt.show()

import seaborn as sns

data = sns.load_dataset('flights')

data

data_grouped = data[['year','passengers']].groupby('year').sum().reset_index()

plt.plot(data_grouped['year'], data_grouped['passengers'])

plt.xlabel('Year')

plt.ylabel('Passengers')

plt.show()

df = pd.DataFrame({

'도시': ['서울', '부산', '대구', '인천'],

'인구': [990, 250, 250, 290]

})

plt.rcParams['font.family'] ='Malgun Gothic'

plt.rcParams['axes.unicode_minus'] =False

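# Added because Korean labels were rendering as broken glyphs ('Malgun Gothic' ships with Windows; other systems need a different Korean font)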

plt.bar(df['도시'], df['인구'])

plt.xlabel('도시')

plt.ylabel('인구')

plt.title('도시별 인구 수')

plt.show()

import numpy as np

data = np.random.randn(1000)

plt.hist(data, bins=30)

plt.xlabel('Value')

plt.ylabel('Frequency')

plt.title('Histogram')

plt.show()

* bins: the number of equal-width intervals the value range is split into
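To see what bins does without drawing anything, the same binning can be reproduced with np.histogram. A small sketch, assuming a fixed seed purely so the run is repeatable:

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed (an assumption, for repeatability)
values = rng.standard_normal(1000)

# bins=30 splits [min, max] into 30 equal-width intervals
counts, edges = np.histogram(values, bins=30)

print(len(counts), len(edges))   # 30 counts need 31 edges
print(counts.sum())              # every value lands in exactly one bin
```

With bins given as an integer, plt.hist and np.histogram both split the observed range into equal-width intervals, so the counts printed here correspond to the bar heights of a 30-bin histogram of the same values.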

sizes = [30, 20, 25, 15, 10]

labels = ['A', 'B', 'C', 'D', 'E']

plt.pie(sizes, labels=labels)

plt.title('Pie Chart')

plt.show()

iris = sns.load_dataset('iris')

species = iris['species'].unique()

sepal_lengths_list = [iris[iris['species'] == s]['sepal_length'].tolist() for s in species]

plt.boxplot(sepal_lengths_list, tick_labels = species)  # tick_labels replaces labels, which is deprecated since Matplotlib 3.9

plt.xlabel('Species')

plt.ylabel('Sepal length')

plt.title('Box Plot')

plt.show()

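* The box spans the 25th to 75th percentiles (Q1 to Q3)

The whiskers extend to the smallest and largest values within 1.5 × IQR of the box; these equal the minimum and maximum only when there are no outliers, which are drawn as separate points

The line inside the box is the median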
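To make the box plot notes concrete, the quartiles and whisker ends can be computed by hand. This sketch assumes Matplotlib's default whisker setting (whis=1.5) and uses a made-up sample with one obvious outlier:

```python
import numpy as np

# Hypothetical sample with one obvious outlier
sample = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])

q1, median, q3 = np.percentile(sample, [25, 50, 75])
iqr = q3 - q1

# With whis=1.5 the whiskers stop at the most extreme data points
# that are still within 1.5 * IQR of the box
lower_whisker = sample[sample >= q1 - 1.5 * iqr].min()
upper_whisker = sample[sample <= q3 + 1.5 * iqr].max()

# Anything beyond the whiskers is drawn as an individual point
fliers = sample[(sample < lower_whisker) | (sample > upper_whisker)]

print(q1, median, q3)
print(lower_whisker, upper_whisker, fliers)
```

Here the upper whisker ends at 9 rather than at the maximum of 30, because 30 lies more than 1.5 × IQR above Q3.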

plt.scatter(iris['sepal_length'], iris['sepal_width'])

plt.xlabel('Sepal length')

plt.ylabel('Sepal width')

plt.show()

iris.corr(numeric_only=True)

* Shows the pairwise correlations between the columns

species is a string column, so corr cannot include it; numeric_only=True restricts the calculation to the numeric columns
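A tiny frame shows the effect of numeric_only more clearly than iris does. The `frame` below and its values are made up for illustration:

```python
import pandas as pd

# Hypothetical frame mixing numeric and string columns
frame = pd.DataFrame({
    'x': [1, 2, 3, 4],
    'y': [2, 4, 6, 8],              # moves exactly with x
    'z': [4, 3, 2, 1],              # moves exactly against x
    'label': ['a', 'b', 'a', 'b'],  # string column, skipped by numeric_only=True
})

corr = frame.corr(numeric_only=True)
print(corr)
```

The result contains only x, y, and z, with a coefficient of 1 between x and y and -1 between x and z.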

Editing every occurrence of the same text at once

Ctrl + Shift + L

Ctrl + F2

Data visualization practice problems

import pandas as pd

import numpy as np

from datetime import datetime, timedelta

import random

# Number of samples

num_samples = 1000

# Set the random seed

np.random.seed(42)

# Generate random data

user_ids = np.arange(1, num_samples + 1)

purchase_dates = [datetime(2023, 1, 1) + timedelta(days=np.random.randint(0, 60)) for _ in range(num_samples)]

product_ids = np.random.randint(100, 200, size=num_samples)

categories = np.random.choice(['Electronics', 'Books', 'Clothing', 'Home', 'Toys'], size=num_samples)

prices = np.round(np.random.uniform(5, 300, size=num_samples), 2)

quantities = np.random.randint(1, 6, size=num_samples)

total_spent = prices * quantities

ages = np.random.randint(18, 65, size=num_samples)

genders = np.random.choice(['M', 'F'], size=num_samples)

locations = np.random.choice(['New York', 'Los Angeles', 'Chicago', 'San Francisco', 'Houston', 'Dallas', 'Seattle', 'Austin', 'Miami', 'Boston'], size=num_samples)

membership_levels = np.random.choice(['Bronze', 'Silver', 'Gold', 'Platinum'], size=num_samples)

ad_spends = np.round(np.random.uniform(5, 50, size=num_samples), 2)

visit_durations = np.random.randint(10, 120, size=num_samples)

# Column data for the DataFrame

data = {

'user_id': user_ids,

'purchase_date': purchase_dates,

'product_id': product_ids,

'category': categories,

'price': prices,

'quantity': quantities,

'total_spent': total_spent,

'age': ages,

'gender': genders,

'location': locations,

'membership_level': membership_levels,

'ad_spend': ad_spends,

'visit_duration': visit_durations

}

# Build the DataFrame

df = pd.DataFrame(data)

# Add missing values

nan_indices = np.random.choice(df.index, size=50, replace=False)

df.loc[nan_indices, 'price'] = np.nan

df.loc[nan_indices[:25], 'quantity'] = np.nan

# Add duplicated rows

duplicate_indices = np.random.choice(df.index, size=20, replace=False)

duplicates = df.loc[duplicate_indices]

df = pd.concat([df, duplicates], ignore_index=True)

# Add outliers

outlier_indices = np.random.choice(df.index, size=10, replace=False)

df.loc[outlier_indices, 'price'] = df['price'] * 10

df.loc[outlier_indices, 'total_spent'] = df['total_spent'] * 10

# Save as a CSV file

df.to_csv('./user_purchase_data.csv', index=False)

Preprocessing practice problems (the ones I couldn't solve yesterday)

3. Find and remove duplicated purchase records. Treat rows with the same user_id, purchase_date, and product_id as duplicates.

df.duplicated(subset=['user_id', 'purchase_date', 'product_id'])

df = df.drop_duplicates(subset=['user_id', 'purchase_date', 'product_id'])
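Before dropping anything it helps to count how many rows will be removed. A minimal sketch on a made-up `orders` frame (its values are assumptions, not the practice data):

```python
import pandas as pd

# Hypothetical purchases; rows 0 and 3 share the same three key values
orders = pd.DataFrame({
    'user_id': [1, 2, 3, 1],
    'purchase_date': ['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-01'],
    'product_id': [100, 101, 102, 100],
    'price': [10.0, 20.0, 30.0, 10.0],
})
keys = ['user_id', 'purchase_date', 'product_id']

# duplicated() marks every repeat after the first occurrence (keep='first' by default)
n_dupes = orders.duplicated(subset=keys).sum()

# drop_duplicates() keeps that first occurrence and removes the repeats
deduped = orders.drop_duplicates(subset=keys).reset_index(drop=True)

print(n_dupes)
print(deduped)
```

Resetting the index afterwards is optional; it just avoids gaps in the row labels.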

4. The price column contains outliers. Use the IQR (interquartile range) method to find and remove them.

Q3 = df['price'].quantile(0.75)

Q1 = df['price'].quantile(0.25)

IQR = Q3 - Q1

high_price = df['price'] > Q3 + 1.5 *IQR

low_price = df['price'] < Q1 - 1.5 * IQR

a = df[high_price].index

b = df[low_price].index

df.drop(a,inplace=True) # inplace=True: apply the result to df directly

df.drop(b,inplace=True)
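The same IQR filter can also be written as a single boolean mask instead of two drop calls. This is a sketch on made-up prices, not the author's solution; it keeps rows with a missing price, matching the behavior of the comparison-based version, since comparisons with NaN are False:

```python
import numpy as np
import pandas as pd

# Hypothetical prices with one extreme value and one missing value
prices = pd.DataFrame({'price': [10.0, 12.0, 11.0, 13.0, 12.5, 500.0, np.nan]})

q1 = prices['price'].quantile(0.25)
q3 = prices['price'].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Keep in-range rows; also keep NaN rows so this step does not
# silently double as missing-value handling
in_range = prices['price'].between(lower, upper) | prices['price'].isna()
filtered = prices[in_range]

print(lower, upper)
print(filtered)
```

Only the 500.0 row falls outside the bounds, so six of the seven rows remain.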

5. Apply Min-Max normalization to the total_spent column to rescale it to values between 0 and 1.

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()

df['total_spent_normalized'] = scaler.fit_transform(df[['total_spent']])
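Min-Max normalization is just (x - min) / (max - min), so the scaler's result can be sanity-checked by hand. A sketch with made-up values, using plain pandas rather than scikit-learn:

```python
import pandas as pd

spent = pd.Series([50.0, 100.0, 150.0, 250.0])  # hypothetical values

# Min-Max scaling: the minimum maps to 0, the maximum maps to 1
normalized = (spent - spent.min()) / (spent.max() - spent.min())

print(normalized.tolist())
```

Because the minimum and maximum set the scale, a single extreme value squeezes everything else toward 0, which is one reason to remove outliers before this step.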

Data visualization problems (solved with the preprocessed data)

1. Visualize the distribution of product prices in the price column as a box plot, grouped by category.

import seaborn as sns

import matplotlib.pyplot as plt

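categories = df['category'].unique()

# One price series per category; dropna() guards against empty boxes if any missing prices remain

plt.boxplot([df.loc[df['category'] == c, 'price'].dropna() for c in categories], tick_labels=categories)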

plt.title('Price Distribution by Category')

plt.show()

2. Use the age and total_spent columns to visualize the relationship between user age and total spending as a scatter plot.

plt.scatter(df['age'],df['total_spent'], color = 'brown', edgecolors= 'black')

plt.xlabel('Age')

plt.ylabel('Total spent')

plt.title('Age vs Total Spent')

plt.show()

3. Analyze the correlations among all numeric columns (price, quantity, total_spent, age, ad_spend, visit_duration) and visualize them with a heatmap.

correlation = df[['price', 'quantity', 'total_spent', 'age', 'ad_spend', 'visit_duration']].corr()

plt.rcParams['font.family'] ='Malgun Gothic'

plt.rcParams['axes.unicode_minus'] =False

sns.heatmap(correlation, annot=True, cmap='Pastel2', fmt='.2f')

plt.title('변수간 상관관계')

plt.xticks(rotation = 45)

plt.show()

4. Draw a histogram of the age column to visualize the distribution of user ages.

plt.hist(df['age'], bins = 20, color = 'indigo', edgecolor = 'white')

plt.title('Age Distribution')

plt.xlabel('Age')

plt.ylabel('Frequency')

plt.show()

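5. Use the membership_level column to visualize total spending per membership level as a bar chart.

# Sum total_spent per level first; passing the raw rows to plt.bar would draw one overlapping bar per row

level_total = df.groupby('membership_level')['total_spent'].sum()

plt.bar(level_total.index, level_total.values, color = 'royalblue')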

plt.title('Total Spent by Membership Level')

plt.xlabel('Membership Level')

plt.ylabel('Total Spent')

plt.show()

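Lost what I had written up for the data literacy lecture

Worked on code kata (https://essay2892.tistory.com/40)

Data preprocessing & visualization homework (https://essay2892.tistory.com/41)

10 minutes to pandas (https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html)

Worked through everything before the Stack section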
