[2025/01/03]내일배움캠프 QA/QC 1기

TIL(Today I Learned)

[2025/01/03]내일배움캠프 QA/QC 1기 - 13일차

essay2892 2025. 1. 3. 20:00

코드카타(https://essay2892.tistory.com/46)

Library 개인 과제(https://essay2892.tistory.com/47)

10분 Pandas(https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html)

Stack 이후부터 끝까지

실습문제로 더 익숙해지기 - 한 걸음 더!

import pandas as pd

import numpy as np

from datetime import datetime, timedelta

import random

import matplotlib.pyplot as plt

import seaborn as sns

# 데이터 크기 설정

num_samples = 1000

# 랜덤 시드 설정

np.random.seed(42)

# 랜덤 데이터 생성

user_ids = np.arange(1, num_samples + 1)

purchase_dates = [datetime(2023, 1, 1) + timedelta(days=np.random.randint(0, 60)) for _ in range(num_samples)]

product_ids = np.random.randint(100, 200, size=num_samples)

categories = np.random.choice(['Electronics', 'Books', 'Clothing', 'Home', 'Toys'], size=num_samples)

prices = np.round(np.random.uniform(5, 300, size=num_samples), 2)

quantities = np.random.randint(1, 6, size=num_samples)

total_spent = prices * quantities

ages = np.random.randint(18, 65, size=num_samples)

genders = np.random.choice(['M', 'F'], size=num_samples)

locations = np.random.choice(['New York', 'Los Angeles', 'Chicago', 'San Francisco', 'Houston', 'Dallas', 'Seattle', 'Austin', 'Miami', 'Boston'], size=num_samples)

membership_levels = np.random.choice(['Bronze', 'Silver', 'Gold', 'Platinum'], size=num_samples)

ad_spends = np.round(np.random.uniform(5, 50, size=num_samples), 2)

visit_durations = np.random.randint(10, 120, size=num_samples)

# 데이터프레임 생성

data = {

'user_id': user_ids,

'purchase_date': purchase_dates,

'product_id': product_ids,

'category': categories,

'price': prices,

'quantity': quantities,

'total_spent': total_spent,

'age': ages,

'gender': genders,

'location': locations,

'membership_level': membership_levels,

'ad_spend': ad_spends,

'visit_duration': visit_durations

}

# 데이터프레임 완성

df = pd.DataFrame(data)

# 결측치 추가

nan_indices = np.random.choice(df.index, size=50, replace=False)

df.loc[nan_indices, 'price'] = np.nan

df.loc[nan_indices[:25], 'quantity'] = np.nan

# 중복 데이터 추가

duplicate_indices = np.random.choice(df.index, size=20, replace=False)

duplicates = df.loc[duplicate_indices]

df = pd.concat([df, duplicates], ignore_index=True)

# 아웃라이어 추가

outlier_indices = np.random.choice(df.index, size=10, replace=False)

df.loc[outlier_indices, 'price'] = df['price'] * 10

df.loc[outlier_indices, 'total_spent'] = df['total_spent'] * 10

# CSV 파일로 저장

df.to_csv('./user_purchase_data.csv', index=False)

1. total_spent 컬럼에 대해 회원 등급별 (membership_level)로 Box Plot을 작성하고, 이상치를 분석하세요. 이상치가 있는 데이터를 별도로 추출하여 outliers.csv로 저장하세요.

df = pd.read_csv('user_purchase_data.csv')

plt.boxplot([df[df['membership_level'] == membership_level]['total_spent']\

for membership_level in df['membership_level'].unique()],

tick_labels = df['membership_level'].unique())

plt.xlabel('Membership level')

plt.ylabel('Total spent')

plt.title('Outliers Detection')

plt.show()

Q3 = df['total_spent'].quantile(0.75)

Q1 = df['total_spent'].quantile(0.25)

IQR = Q3 - Q1

upper_out = df['total_spent'] > Q3 + 1.5 * IQR

lower_out = df['total_spent'] < Q1 - 1.5 * IQR

outliers = df[upper_out | lower_out]

outliers.to_csv('temp/outliers.csv', index = False)

2. ad_spend와 total_spent 컬럼을 사용하여 Scatter Plot을 작성하고, 두 변수 간의 관계를 분석하세요. 광고비 지출이 총 지출 금액에 미치는 영향을 분석하세요.

df.drop(df[upper_out].index, inplace= True)

df.drop(df[lower_out].index, inplace= True) # 이상치 제거 후 진행

plt.scatter(df['ad_spend'], df['total_spent'], edgecolors= 'black')

plt.xlabel('Ad spent')

plt.ylabel('Total spent')

plt.title('Ad spend VS Total Spent')

plt.show()

df[['ad_spend', 'total_spent']].corr()

data = pd.read_csv('user_purchase_data.csv')

data[['ad_spend','total_spent']].corr() # 이상치 존재하는 경우

* 따라서 광고비 지출은 총 지출 금액에 별다른 영향을 미치지 않는다.

3. 모든 수치형 컬럼 간의 상관관계를 계산하고, 어떤 변수들이 높은 상관관계를 가지는지 분석하세요.

data[['price', 'quantity', 'total_spent', 'age', 'ad_spend', 'visit_duration']].corr()

correlation_matrix = data.corr(numeric_only=True)

correlation_matrix[correlation_matrix.abs() > 0.7]

4. price 컬럼과 total_spent 컬럼의 아웃라이어를 식별하세요. IQR 방법을 사용하세요.

def find_outliers(column):

Q3 = df[column].quantile(0.75)

Q1 = df[column].quantile(0.25)

IQR = Q3 - Q1

upper_outliers = df[column] > Q3 + 1.5 * IQR

lower_outliers = df[column] < Q1 - 1.5 * IQR

return df[upper_outliers | lower_outliers]

price_outliers = find_outliers('price')

total_spent_ouliers = find_outliers('total_spent')

price_outliers

total_spent_outliers

추가 자료

pd.cut(array, bins, labels) - 데이터 구간 나누기. bins는 배열 또는 양의 정수

DataFrame.fillna(value=None, method=None, axis=None, inplace=False, limit=None, downcast=None)

'TIL(Today I Learned)' 카테고리의 다른 글

[2025/01/07]내일배움캠프 QA/QC 1기 - 15일차 (0)	2025.01.07
[2025/01/06]내일배움캠프 QA/QC 1기 - 14일차 (0)	2025.01.06
[2025/01/02]내일배움캠프 QA/QC 1기 - 12일차 (0)	2025.01.02
[2025/01/01]내일배움캠프 QA/QC 1기 - 자습 (0)	2025.01.01
[2024/12/31]내일배움캠프 QA/QC 1기 - 11일차 (0)	2024.12.31

현재글[2025/01/03]내일배움캠프 QA/QC 1기 - 13일차

essay2892 님의 블로그

essay2892 님의 블로그 입니다.

Today :
Yesterday :

essay2892 님의 블로그