How to Work with a Pandas DataFrame (3/3)

黃偉誠

2022-09-14 10:50:51

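Wow, before we knew it, we have reached the final part of this series. This post focuses on how to use Groupby and how to save and export data, and it ends with a few small tasks.

___

# Prerequisites

1. Python 3 and the pandas module are installed, or you can use another environment (Google Colab, Jupyter Notebook, etc.).
2. You know basic Python operations (for example, how to declare a variable and assign it a value).

# Pandas concepts & operations covered

- How to **Groupby** **DataFrame** data
- How to save and export **DataFrame** data
- Small tasks & expected results

**Disclaimer**

There are many different ways to reach the same result in the scenarios below. If you have a better suggestion, I would be very happy to hear it ^^

### How to Groupby DataFrame data

This feature is very handy when you need summary statistics.

Suppose we want to know how many people there are at each age. Before counting, we can add a new column ⇒ count

```
import pandas as pd

df = pd.read_csv("file_name.csv")
print("All data")
print(df)
print()

print("Number of people at each age")
df['count'] = 1  # helper column: every row counts as one person
print(df.groupby(['Age']).count()['count'])
```

Result after running

![image](wehelp-storage://8480ae8d132953224c8e824d5322e03a)

At this point you might say that knowing the number of people at each age is still not enough to explain the data. So let's look at the age distribution within each language instead.

```
import pandas as pd

df = pd.read_csv("file_name.csv")
print("All data")
print(df)
print()

print("Age distribution within each language")
df['count'] = 1
# group by language first, then by age
print(df.groupby(['Languages','Age']).count()['count'])
```

Result after running

![image](wehelp-storage://8f044fba7a7e99939ad89d40a1d9c80b)

Ideally, of course, you should remove the rows that contain empty values before doing the **Groupby**.

Finally, we can also store the count for each language in a new column. This time we will use `transform()` (if you are interested in `transform()`, see the official documentation ⇒ [https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.transform.html](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.transform.html))

```
import pandas as pd

df = pd.read_csv("file_name.csv")
print("All data")
print(df)
print()

print("Count for each language")
df['count'] = 1
df['group_count'] = df['count'].groupby(df['Languages']).transform('count') # new column
print(df.sort_values(['group_count'], ascending=False)) # re-sort, largest group first
```

Result after running

![image](wehelp-storage://26c8ebcac2a5b0f130bf4a88c2d0f022)

### How to save and export **DataFrame** data

YA ~ we have finally reached the last topic: saving or exporting the data we have cleaned up.

The simplest way is to save the cleaned data in CSV format.

Suppose you are already inside the folder where you want to save the data.

```
import pandas as pd

df = pd.read_csv("file_name.csv")
df['count'] = 1
df['group_count'] = df['count'].groupby(df['Languages']).transform('count')

# index=False leaves the row index out of the file
df.to_csv("new_file_name.csv", index=False)
```

Result before running

![image](wehelp-storage://0d76a795a51b0ddb3074167299b82b8b)

Result after running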
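![image](wehelp-storage://ff4d2b81b07fac56c1cffc8dad377574)

Another situation is that the folder you are in is not where you want to save the file.

Suppose you want to save the data to the Desktop.

```
import pandas as pd
from pathlib import Path

df = pd.read_csv("file_name.csv")
df['count'] = 1
df['group_count'] = df['count'].groupby(df['Languages']).transform('count')

filepath = Path('~/Desktop/new_file_name.csv')  # target file on the Desktop
df.to_csv(filepath, index=False)
```

Now you can check the Desktop for a file named "new_file_name.csv".

Result after running

![image](wehelp-storage://fb7539aa02560b0f745fda343de8885c)

### Small tasks & expected results

You can download the data from this link: https://drive.google.com/file/d/1lEMrkNGREhSsQpW4BhG14X8fFXWhvNyK/view?usp=sharing

Task 1: The manager wants to know how many rows there are for each unique "url_id".

Expected result

![image](wehelp-storage://72fe1dfae3d00a5a5c3e687b32602268)

Task 2: Sort the result of Task 1 in ascending order.

Expected result

![image](wehelp-storage://a3f37884fb81ea2f5edcb154f6ce03d2)

Task 3: After looking at the Task 1 data, the manager wonders whether it is also possible to know the number of "keyword_id" values within each unique "url_id".

Expected result

![image](wehelp-storage://67cc9f312cd9f7794608a2becb8d6281)

Task 4 (optional):

Step one: Pick any "url_id"; this gives you n rows that share that "url_id".

Step two: Using the n rows from the previous step, find the n corresponding "keyword_id" values.

Step three: Using the n "keyword_id" values from step two, find the other rows that have the same "keyword_id".

Step four: Using the "keyword_id" values from step three, find the corresponding "url_id" values.

Then keep repeating this logic until no new "keyword_id" or "url_id" can be found.

You may not fully follow the description above yet, but that's fine. The picture below illustrates the logic of this task.

![image](wehelp-storage://2913928d12b52e2085ea5d16033b819c)

Color order: Green ⇒ Pink ⇒ Yellow

Keep searching downward with this logic until there are no new results. For example, for the "url_id" => "a66590f921a0abf2bf8454ffbd5b2ef8" in the picture above, no new rows with the same "url_id" can be found, so the search ends there.

Expected result

![image](wehelp-storage://a35ba663cd68866f5f4c3885482281bd)

yay!! 🎉 Congratulations 🥳 You now have a basic understanding of pandas. If you want to sharpen your skills further, see the official Pandas documentation ⇒ [https://pandas.pydata.org/docs/index.html](https://pandas.pydata.org/docs/index.html)

Let's recap what this post covered!

### Summary

This post focused on how to use Groupby to group data and count the rows in each group, and it also covered how to export the data you need in CSV format. Last of all, it offered a few small tasks for practice. I hope you learned something.

If you have any questions, feel free to *leave a comment* or *email* me. Thanks ^^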
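
As a recap in code, here is a minimal sketch of the counting steps from this post. It is one possible variation, not the only way to do it: it uses `size()` so the helper `count` column is not needed, and it writes the CSV to an in-memory buffer instead of a file so it can run anywhere. The sample rows and the `Name` column are made up for illustration; only `Age` and `Languages` come from the examples in this post.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for "file_name.csv".
# The rows and the "Name" column are assumptions made for this sketch.
df = pd.DataFrame({
    "Name": ["Amy", "Ben", "Cat", "Dan", "Eve"],
    "Age": [25, 30, 25, 30, 25],
    "Languages": ["Python", "Python", "Java", "Python", "Java"],
})

# Number of rows at each age; size() counts rows per group directly.
age_counts = df.groupby("Age").size()

# Number of rows for each (language, age) pair.
lang_age_counts = df.groupby(["Languages", "Age"]).size()

# Attach each row's language-group count as a new column,
# then sort so the largest group comes first.
df["group_count"] = df.groupby("Languages")["Name"].transform("count")
df_sorted = df.sort_values("group_count", ascending=False)

# Export as CSV text; passing a file name instead of the buffer
# would write to disk as in the examples in this post.
buffer = io.StringIO()
df_sorted.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

print(age_counts)
print(lang_age_counts)
print(csv_text)
```

One difference to keep in mind: `size()` counts every row in a group, while `count()` only counts non-empty values, so the two can disagree when the data contains empty values.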

python pandas
