{"nbformat":4,"nbformat_minor":0,"metadata":{"colab":{"name":"Prognoosimine.ipynb","provenance":[],"collapsed_sections":[],"authorship_tag":"ABX9TyNEFqUurjiIF4DNhLf+1ASR"},"kernelspec":{"name":"python3","display_name":"Python 3"},"language_info":{"name":"python"}},"cells":[{"cell_type":"markdown","metadata":{"id":"tNYjrdWTExHd"},"source":["# Setting the goal and the performance metric\n","The first step of any artificial intelligence project should be setting the goal.\n","\n","Our goal is to build **a model that predicts a company's sales revenue from its number of employees and field of activity**.\n","\n","As the performance metric we use the mean absolute error, i.e. the number of euros by which the model's prediction is off on average. "]},{"cell_type":"markdown","metadata":{"id":"6JKdQQWvVpP6"},"source":["# Importing the tools\n","For prediction we use the scikit-learn library, which contains a large number of machine learning algorithms and tools for handling data.\n","https://scikit-learn.org/stable/ "]},{"cell_type":"code","metadata":{"id":"aEofWhfoXRn9"},"source":["# Import the required libraries\n","import pandas as pd\n","import sklearn as sk"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"-zbrup7lh38h"},"source":["# Importing the data\n","In the example below we import a .csv file with data from the Estonian open data portal (Avaandmete portaal) on companies' sales revenue and taxes paid. 
The data has been saved to GitHub beforehand."]},{"cell_type":"code","metadata":{"id":"Jk9MfsgBh13q","colab":{"base_uri":"https://localhost:8080/","height":825},"executionInfo":{"status":"ok","timestamp":1639408147748,"user_tz":-120,"elapsed":1229,"user":{"displayName":"Kristjan Eljand","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14Gip4UgEVFXL2Q-oKuIBr2JP5268uFbs3Bf7aNhj=s64","userId":"12801037868987527469"}},"outputId":"0b7c965a-ef66-487c-ab10-8035ee653b17"},"source":["raw_data_url = 'https://raw.githubusercontent.com/kristjan-eljand/andmeteadus_on_popp/main/maksud_2021_iii_kvartal.csv'\n","raw_data = pd.read_csv(raw_data_url)\n","raw_data"],"execution_count":3,"outputs":[{"output_type":"execute_result","data":{"text/html":["
<div>\n",
"<table>\n",
"  <thead>\n",
"    <tr><th></th><th>reg_code</th><th>name</th><th>type</th><th>vat_registry</th><th>emtak</th><th>county</th><th>national_tax</th><th>employee_tax</th><th>revenue</th><th>employees</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>10000018</td><td>AMSERV AUTO AKTSIASELTS</td><td>Äriühing</td><td>jah</td><td>HULGI- JA JAEKAUBANDUS; MOOTORSÕIDUKITE JA MOO...</td><td>Harju ( Tallinn )</td><td>1375329.0</td><td>801558.0</td><td>21587224.0</td><td>191.0</td></tr>\n",
"    <tr><th>1</th><td>10000024</td><td>EESTI RAAMAT, OÜ</td><td>Äriühing</td><td>jah</td><td>INFO JA SIDE</td><td>Harju ( Tallinn )</td><td>26769.0</td><td>21439.0</td><td>144389.0</td><td>12.0</td></tr>\n",
"    <tr><th>2</th><td>10000062</td><td>ALDO KOPPEL</td><td>FIE</td><td>jah</td><td>PÕLLUMAJANDUS, METSAMAJANDUS JA KALAPÜÜK</td><td>Ida-Viru ( Lüganuse vald )</td><td>3982.0</td><td>0.0</td><td>27514.0</td><td>NaN</td></tr>\n",
"    <tr><th>3</th><td>10000127</td><td>ARAVETE APTEEK, TÜ</td><td>Äriühing</td><td>jah</td><td>HULGI- JA JAEKAUBANDUS; MOOTORSÕIDUKITE JA MOO...</td><td>Järva ( Järva vald )</td><td>7138.0</td><td>5784.0</td><td>158317.0</td><td>2.0</td></tr>\n",
"    <tr><th>4</th><td>10000165</td><td>KIVIÕLI KAUBAHOOV, AS</td><td>Äriühing</td><td>jah</td><td>HULGI- JA JAEKAUBANDUS; MOOTORSÕIDUKITE JA MOO...</td><td>Ida-Viru ( Lüganuse vald )</td><td>216776.0</td><td>94754.0</td><td>1582858.0</td><td>39.0</td></tr>\n",
"    <tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>\n",
"    <tr><th>147252</th><td>BB000770</td><td>AMAZON EU SARL</td><td>Mitteresident</td><td>jah</td><td>HULGI- JA JAEKAUBANDUS; MOOTORSÕIDUKITE JA MOO...</td><td>NaN</td><td>235295.0</td><td>0.0</td><td>1230805.0</td><td>NaN</td></tr>\n",
"    <tr><th>147253</th><td>KK106568</td><td>SEGERS FABRIKER AB</td><td>Mitteresident</td><td>jah</td><td>TÖÖTLEV TÖÖSTUS</td><td>NaN</td><td>0.0</td><td>0.0</td><td>1251075.0</td><td>NaN</td></tr>\n",
"    <tr><th>147254</th><td>MM000001</td><td>KONSULTTITOIMISTO SEPPO HOFFREN OY CONSULTANCY</td><td>Mitteresident</td><td>ei</td><td>NaN</td><td>NaN</td><td>2614.0</td><td>2834.0</td><td>NaN</td><td>1.0</td></tr>\n",
"    <tr><th>147255</th><td>MM000047</td><td>WHITE BEACH GOLF OY</td><td>Mitteresident</td><td>ei</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>1.0</td></tr>\n",
"    <tr><th>147256</th><td>QQ000003</td><td>RAUMASTER OY</td><td>Mitteresident</td><td>jah</td><td>HULGI- JA JAEKAUBANDUS; MOOTORSÕIDUKITE JA MOO...</td><td>NaN</td><td>15695.0</td><td>16580.0</td><td>64996.0</td><td>3.0</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"<p>147257 rows × 10 columns</p>\n",
"</div>
\n",""],"text/plain":[" reg_code ... employees\n","0 10000018 ... 191.0\n","1 10000024 ... 12.0\n","2 10000062 ... NaN\n","3 10000127 ... 2.0\n","4 10000165 ... 39.0\n","... ... ... ...\n","147252 BB000770 ... NaN\n","147253 KK106568 ... NaN\n","147254 MM000001 ... 1.0\n","147255 MM000047 ... 1.0\n","147256 QQ000003 ... 3.0\n","\n","[147257 rows x 10 columns]"]},"metadata":{},"execution_count":3}]},{"cell_type":"markdown","metadata":{"id":"4W5PaZgeIOXe"},"source":["# Data preprocessing\n","## Selecting the required variables\n","In this example we need to keep three variables: the field of activity (`emtak`), the sales revenue (`revenue`) and the number of employees (`employees`)."]},{"cell_type":"code","metadata":{"colab":{"base_uri":"https://localhost:8080/","height":423},"id":"u9ccimjSIlGj","executionInfo":{"status":"ok","timestamp":1639408168874,"user_tz":-120,"elapsed":283,"user":{"displayName":"Kristjan Eljand","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14Gip4UgEVFXL2Q-oKuIBr2JP5268uFbs3Bf7aNhj=s64","userId":"12801037868987527469"}},"outputId":"4b097ab3-49d5-431f-da17-d2a93fc139f0"},"source":["selected_data = raw_data[['emtak', 'employees', 'revenue']]\n","selected_data"],"execution_count":4,"outputs":[{"output_type":"execute_result","data":{"text/html":["
<div>\n",
"<table>\n",
"  <thead>\n",
"    <tr><th></th><th>emtak</th><th>employees</th><th>revenue</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>HULGI- JA JAEKAUBANDUS; MOOTORSÕIDUKITE JA MOO...</td><td>191.0</td><td>21587224.0</td></tr>\n",
"    <tr><th>1</th><td>INFO JA SIDE</td><td>12.0</td><td>144389.0</td></tr>\n",
"    <tr><th>2</th><td>PÕLLUMAJANDUS, METSAMAJANDUS JA KALAPÜÜK</td><td>NaN</td><td>27514.0</td></tr>\n",
"    <tr><th>3</th><td>HULGI- JA JAEKAUBANDUS; MOOTORSÕIDUKITE JA MOO...</td><td>2.0</td><td>158317.0</td></tr>\n",
"    <tr><th>4</th><td>HULGI- JA JAEKAUBANDUS; MOOTORSÕIDUKITE JA MOO...</td><td>39.0</td><td>1582858.0</td></tr>\n",
"    <tr><th>...</th><td>...</td><td>...</td><td>...</td></tr>\n",
"    <tr><th>147252</th><td>HULGI- JA JAEKAUBANDUS; MOOTORSÕIDUKITE JA MOO...</td><td>NaN</td><td>1230805.0</td></tr>\n",
"    <tr><th>147253</th><td>TÖÖTLEV TÖÖSTUS</td><td>NaN</td><td>1251075.0</td></tr>\n",
"    <tr><th>147254</th><td>NaN</td><td>1.0</td><td>NaN</td></tr>\n",
"    <tr><th>147255</th><td>NaN</td><td>1.0</td><td>NaN</td></tr>\n",
"    <tr><th>147256</th><td>HULGI- JA JAEKAUBANDUS; MOOTORSÕIDUKITE JA MOO...</td><td>3.0</td><td>64996.0</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"<p>147257 rows × 3 columns</p>\n",
"</div>
\n",""],"text/plain":[" emtak ... revenue\n","0 HULGI- JA JAEKAUBANDUS; MOOTORSÕIDUKITE JA MOO... ... 21587224.0\n","1 INFO JA SIDE ... 144389.0\n","2 PÕLLUMAJANDUS, METSAMAJANDUS JA KALAPÜÜK ... 27514.0\n","3 HULGI- JA JAEKAUBANDUS; MOOTORSÕIDUKITE JA MOO... ... 158317.0\n","4 HULGI- JA JAEKAUBANDUS; MOOTORSÕIDUKITE JA MOO... ... 1582858.0\n","... ... ... ...\n","147252 HULGI- JA JAEKAUBANDUS; MOOTORSÕIDUKITE JA MOO... ... 1230805.0\n","147253 TÖÖTLEV TÖÖSTUS ... 1251075.0\n","147254 NaN ... NaN\n","147255 NaN ... NaN\n","147256 HULGI- JA JAEKAUBANDUS; MOOTORSÕIDUKITE JA MOO... ... 64996.0\n","\n","[147257 rows x 3 columns]"]},"metadata":{},"execution_count":4}]},{"cell_type":"markdown","metadata":{"id":"jzT2D3KKI3_r"},"source":["## For simplicity we keep only three fields of activity: \"INFO JA SIDE\" (information and communication), \"TÖÖTLEV TÖÖSTUS\" (manufacturing) and \"EHITUS\" (construction)"]},{"cell_type":"code","metadata":{"colab":{"base_uri":"https://localhost:8080/","height":423},"id":"oCHTUbV4JD9B","executionInfo":{"status":"ok","timestamp":1639408213245,"user_tz":-120,"elapsed":284,"user":{"displayName":"Kristjan Eljand","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14Gip4UgEVFXL2Q-oKuIBr2JP5268uFbs3Bf7aNhj=s64","userId":"12801037868987527469"}},"outputId":"0f97a816-e83c-474f-877a-ddf7185a9d23"},"source":["emtak_to_keep = ['INFO JA SIDE', 'TÖÖTLEV TÖÖSTUS', 'EHITUS']\n","rows_to_keep = selected_data.emtak.isin(emtak_to_keep)\n","filtered_data = selected_data.loc[rows_to_keep,:]\n","filtered_data"],"execution_count":5,"outputs":[{"output_type":"execute_result","data":{"text/html":["
<div>\n",
"<table>\n",
"  <thead>\n",
"    <tr><th></th><th>emtak</th><th>employees</th><th>revenue</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>1</th><td>INFO JA SIDE</td><td>12.0</td><td>144389.0</td></tr>\n",
"    <tr><th>5</th><td>TÖÖTLEV TÖÖSTUS</td><td>9.0</td><td>33043.0</td></tr>\n",
"    <tr><th>8</th><td>TÖÖTLEV TÖÖSTUS</td><td>128.0</td><td>5924007.0</td></tr>\n",
"    <tr><th>9</th><td>TÖÖTLEV TÖÖSTUS</td><td>6.0</td><td>25659.0</td></tr>\n",
"    <tr><th>10</th><td>EHITUS</td><td>8.0</td><td>141853.0</td></tr>\n",
"    <tr><th>...</th><td>...</td><td>...</td><td>...</td></tr>\n",
"    <tr><th>147242</th><td>TÖÖTLEV TÖÖSTUS</td><td>NaN</td><td>31950.0</td></tr>\n",
"    <tr><th>147244</th><td>TÖÖTLEV TÖÖSTUS</td><td>NaN</td><td>66935.0</td></tr>\n",
"    <tr><th>147246</th><td>INFO JA SIDE</td><td>2.0</td><td>28218.0</td></tr>\n",
"    <tr><th>147250</th><td>TÖÖTLEV TÖÖSTUS</td><td>NaN</td><td>0.0</td></tr>\n",
"    <tr><th>147253</th><td>TÖÖTLEV TÖÖSTUS</td><td>NaN</td><td>1251075.0</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"<p>35735 rows × 3 columns</p>\n",
"</div>
\n",""],"text/plain":[" emtak employees revenue\n","1 INFO JA SIDE 12.0 144389.0\n","5 TÖÖTLEV TÖÖSTUS 9.0 33043.0\n","8 TÖÖTLEV TÖÖSTUS 128.0 5924007.0\n","9 TÖÖTLEV TÖÖSTUS 6.0 25659.0\n","10 EHITUS 8.0 141853.0\n","... ... ... ...\n","147242 TÖÖTLEV TÖÖSTUS NaN 31950.0\n","147244 TÖÖTLEV TÖÖSTUS NaN 66935.0\n","147246 INFO JA SIDE 2.0 28218.0\n","147250 TÖÖTLEV TÖÖSTUS NaN 0.0\n","147253 TÖÖTLEV TÖÖSTUS NaN 1251075.0\n","\n","[35735 rows x 3 columns]"]},"metadata":{},"execution_count":5}]},{"cell_type":"markdown","metadata":{"id":"76YkoqYTJqcy"},"source":["## Removing missing data\n","The table above already shows that our dataset contains missing values (NaN).\n","We remove these rows from the dataset (NB: if data were scarce, the missing values should be replaced with something instead of being removed; that is not a problem for us here)."]},{"cell_type":"code","metadata":{"colab":{"base_uri":"https://localhost:8080/","height":423},"id":"rNLR8kBGKF8C","executionInfo":{"status":"ok","timestamp":1639408244266,"user_tz":-120,"elapsed":320,"user":{"displayName":"Kristjan Eljand","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14Gip4UgEVFXL2Q-oKuIBr2JP5268uFbs3Bf7aNhj=s64","userId":"12801037868987527469"}},"outputId":"7ac6c6b2-8eba-45fd-a8cf-8ea23f4dcf2d"},"source":["clean_data = filtered_data.dropna()\n","clean_data"],"execution_count":6,"outputs":[{"output_type":"execute_result","data":{"text/html":["
<div>\n",
"<table>\n",
"  <thead>\n",
"    <tr><th></th><th>emtak</th><th>employees</th><th>revenue</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>1</th><td>INFO JA SIDE</td><td>12.0</td><td>144389.0</td></tr>\n",
"    <tr><th>5</th><td>TÖÖTLEV TÖÖSTUS</td><td>9.0</td><td>33043.0</td></tr>\n",
"    <tr><th>8</th><td>TÖÖTLEV TÖÖSTUS</td><td>128.0</td><td>5924007.0</td></tr>\n",
"    <tr><th>9</th><td>TÖÖTLEV TÖÖSTUS</td><td>6.0</td><td>25659.0</td></tr>\n",
"    <tr><th>10</th><td>EHITUS</td><td>8.0</td><td>141853.0</td></tr>\n",
"    <tr><th>...</th><td>...</td><td>...</td><td>...</td></tr>\n",
"    <tr><th>147076</th><td>INFO JA SIDE</td><td>3.0</td><td>143609.0</td></tr>\n",
"    <tr><th>147079</th><td>INFO JA SIDE</td><td>6.0</td><td>22225.0</td></tr>\n",
"    <tr><th>147121</th><td>INFO JA SIDE</td><td>2.0</td><td>0.0</td></tr>\n",
"    <tr><th>147220</th><td>TÖÖTLEV TÖÖSTUS</td><td>3.0</td><td>4059.0</td></tr>\n",
"    <tr><th>147246</th><td>INFO JA SIDE</td><td>2.0</td><td>28218.0</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"<p>17450 rows × 3 columns</p>\n",
"</div>
\n",""],"text/plain":[" emtak employees revenue\n","1 INFO JA SIDE 12.0 144389.0\n","5 TÖÖTLEV TÖÖSTUS 9.0 33043.0\n","8 TÖÖTLEV TÖÖSTUS 128.0 5924007.0\n","9 TÖÖTLEV TÖÖSTUS 6.0 25659.0\n","10 EHITUS 8.0 141853.0\n","... ... ... ...\n","147076 INFO JA SIDE 3.0 143609.0\n","147079 INFO JA SIDE 6.0 22225.0\n","147121 INFO JA SIDE 2.0 0.0\n","147220 TÖÖTLEV TÖÖSTUS 3.0 4059.0\n","147246 INFO JA SIDE 2.0 28218.0\n","\n","[17450 rows x 3 columns]"]},"metadata":{},"execution_count":6}]},{"cell_type":"markdown","metadata":{"id":"n5Wp_k3dLSb-"},"source":["## Reviewing the data"]},{"cell_type":"markdown","metadata":{"id":"KP8ZGyaELZhP"},"source":["### Examining the overall distribution of the data"]},{"cell_type":"code","metadata":{"colab":{"base_uri":"https://localhost:8080/","height":394},"id":"iWp4HGBPKvZl","executionInfo":{"status":"ok","timestamp":1639408270530,"user_tz":-120,"elapsed":286,"user":{"displayName":"Kristjan Eljand","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14Gip4UgEVFXL2Q-oKuIBr2JP5268uFbs3Bf7aNhj=s64","userId":"12801037868987527469"}},"outputId":"d8b6f822-471a-457e-d909-72b0ae1d79b5"},"source":["clean_data.describe(include='all')"],"execution_count":7,"outputs":[{"output_type":"execute_result","data":{"text/html":["
<div>\n",
"<table>\n",
"  <thead>\n",
"    <tr><th></th><th>emtak</th><th>employees</th><th>revenue</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>count</th><td>17450</td><td>17450.000000</td><td>1.745000e+04</td></tr>\n",
"    <tr><th>unique</th><td>3</td><td>NaN</td><td>NaN</td></tr>\n",
"    <tr><th>top</th><td>EHITUS</td><td>NaN</td><td>NaN</td></tr>\n",
"    <tr><th>freq</th><td>9012</td><td>NaN</td><td>NaN</td></tr>\n",
"    <tr><th>mean</th><td>NaN</td><td>10.262521</td><td>4.121827e+05</td></tr>\n",
"    <tr><th>std</th><td>NaN</td><td>40.649160</td><td>2.764708e+06</td></tr>\n",
"    <tr><th>min</th><td>NaN</td><td>1.000000</td><td>-1.331100e+04</td></tr>\n",
"    <tr><th>25%</th><td>NaN</td><td>1.000000</td><td>1.120875e+04</td></tr>\n",
"    <tr><th>50%</th><td>NaN</td><td>3.000000</td><td>3.466550e+04</td></tr>\n",
"    <tr><th>75%</th><td>NaN</td><td>7.000000</td><td>1.281268e+05</td></tr>\n",
"    <tr><th>max</th><td>NaN</td><td>1809.000000</td><td>2.020712e+08</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
\n",""],"text/plain":[" emtak employees revenue\n","count 17450 17450.000000 1.745000e+04\n","unique 3 NaN NaN\n","top EHITUS NaN NaN\n","freq 9012 NaN NaN\n","mean NaN 10.262521 4.121827e+05\n","std NaN 40.649160 2.764708e+06\n","min NaN 1.000000 -1.331100e+04\n","25% NaN 1.000000 1.120875e+04\n","50% NaN 3.000000 3.466550e+04\n","75% NaN 7.000000 1.281268e+05\n","max NaN 1809.000000 2.020712e+08"]},"metadata":{},"execution_count":7}]},{"cell_type":"markdown","metadata":{"id":"oAORx6UkLc27"},"source":["### Distribution by field of activity"]},{"cell_type":"code","metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"rt3uSVquLgaE","executionInfo":{"status":"ok","timestamp":1639408290843,"user_tz":-120,"elapsed":298,"user":{"displayName":"Kristjan Eljand","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14Gip4UgEVFXL2Q-oKuIBr2JP5268uFbs3Bf7aNhj=s64","userId":"12801037868987527469"}},"outputId":"33076c21-d041-400a-9a07-bf30ffc69e3f"},"source":["clean_data.emtak.value_counts()"],"execution_count":8,"outputs":[{"output_type":"execute_result","data":{"text/plain":["EHITUS 9012\n","TÖÖTLEV TÖÖSTUS 5566\n","INFO JA SIDE 2872\n","Name: emtak, dtype: int64"]},"metadata":{},"execution_count":8}]},{"cell_type":"markdown","metadata":{"id":"ObbIoocPMZ2v"},"source":["### Correlation between the number of employees and sales revenue"]},{"cell_type":"code","metadata":{"colab":{"base_uri":"https://localhost:8080/","height":112},"id":"fsy_yMoSNEfm","executionInfo":{"status":"ok","timestamp":1639408328348,"user_tz":-120,"elapsed":279,"user":{"displayName":"Kristjan Eljand","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14Gip4UgEVFXL2Q-oKuIBr2JP5268uFbs3Bf7aNhj=s64","userId":"12801037868987527469"}},"outputId":"9b0ee746-9925-42ef-8e20-0a3cc37ab501"},"source":["# Select the numeric columns explicitly: recent pandas versions raise an error\n","# when corr() encounters the text column 'emtak'\n","clean_data[['employees', 'revenue']].corr()"],"execution_count":9,"outputs":[{"output_type":"execute_result","data":{"text/html":["
<div>\n",
"<table>\n",
"  <thead>\n",
"    <tr><th></th><th>employees</th><th>revenue</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>employees</th><td>1.000000</td><td>0.793414</td></tr>\n",
"    <tr><th>revenue</th><td>0.793414</td><td>1.000000</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
\n",""],"text/plain":[" employees revenue\n","employees 1.000000 0.793414\n","revenue 0.793414 1.000000"]},"metadata":{},"execution_count":9}]},{"cell_type":"markdown","metadata":{"id":"HR21FF8gL85W"},"source":["# Preparing the data for machine learning"]},{"cell_type":"markdown","metadata":{"id":"u672gTPMNc0X"},"source":["## Converting all variables to numeric\n","We often want to use variables in a model that are not numeric (e.g. the field of activity). Machine learning models, however, generally require all variables to be numeric.\n","\n","The solution is to add new binary (0 or 1) variables to the dataset: one variable for each distinct value of the non-numeric variable. For example, a construction company gets the value 1 in the column `emtak_EHITUS` and 0 in the other two `emtak_` columns.\n","\n","This can be done in several ways, but since our example uses a pandas DataFrame, we can create these 0/1 variables with the one-line command `pd.get_dummies`."]},{"cell_type":"code","metadata":{"colab":{"base_uri":"https://localhost:8080/","height":423},"id":"S80UC6eUQPRE","executionInfo":{"status":"ok","timestamp":1639408415004,"user_tz":-120,"elapsed":342,"user":{"displayName":"Kristjan Eljand","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14Gip4UgEVFXL2Q-oKuIBr2JP5268uFbs3Bf7aNhj=s64","userId":"12801037868987527469"}},"outputId":"a812e8b2-6757-4563-8751-fde2fbf6af51"},"source":["# Convert the text column 'emtak' into numeric columns.\n","numeric_data = pd.get_dummies(data=clean_data, columns=['emtak'])\n","numeric_data"],"execution_count":10,"outputs":[{"output_type":"execute_result","data":{"text/html":[""],"text/plain":[" employees ... emtak_TÖÖTLEV TÖÖSTUS\n","employees 1.00 ... 0.14\n","revenue 0.79 ... 0.11\n","emtak_EHITUS -0.13 ... -0.71\n","emtak_INFO JA SIDE -0.01 ... -0.30\n","emtak_TÖÖTLEV TÖÖSTUS 0.14 ... 1.00\n","\n","[5 rows x 5 columns]"]},"metadata":{},"execution_count":11}]},{"cell_type":"markdown","metadata":{"id":"f-APegEfSHRE"},"source":["### Visualizing the correlation matrix"]},{"cell_type":"code","metadata":{"colab":{"base_uri":"https://localhost:8080/","height":540},"id":"YGhsJcp1SKTy","executionInfo":{"status":"ok","timestamp":1639408502583,"user_tz":-120,"elapsed":1029,"user":{"displayName":"Kristjan Eljand","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14Gip4UgEVFXL2Q-oKuIBr2JP5268uFbs3Bf7aNhj=s64","userId":"12801037868987527469"}},"outputId":"1d31c965-fbb0-423b-a42e-ae7b83c7d41a"},"source":["import seaborn as sn\n","import matplotlib.pyplot as plt\n","\n","# Correlation matrix of all numeric variables, rounded to two decimals\n","correlation_matrix = numeric_data.corr().round(2)\n","\n","plt.figure(dpi=100)\n","sn.heatmap(correlation_matrix, annot=True)\n","plt.show()"],"execution_count":12,"outputs":[{"output_type":"display_data","data":{"image/png":"\n","text/plain":["[Figure: heatmap of the correlation matrix]"]},"metadata":{"needs_background":"light"}}]},{"cell_type":"markdown","metadata":{"id":"Eh3Pt-TEVEpk"},"source":["## Splitting the data into training and test sets\n","Training and test sets can be formed in many different ways. In essence we want to split our dataset in two: 80% is kept for training and 20% for testing.\n","\n","In the example below we use the scikit-learn function `train_test_split`, which creates all four datasets (training and test variables, training and test results) in one call."]},{"cell_type":"code","metadata":{"id":"2PwJgT90VJy2","executionInfo":{"status":"ok","timestamp":1639408644857,"user_tz":-120,"elapsed":424,"user":{"displayName":"Kristjan Eljand","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14Gip4UgEVFXL2Q-oKuIBr2JP5268uFbs3Bf7aNhj=s64","userId":"12801037868987527469"}}},"source":["from sklearn.model_selection import train_test_split\n","\n","# Make a copy of our dataset\n","ml_data = numeric_data.copy()\n","\n","# Remove the sales revenue, i.e. the variable to be predicted ('result'),\n","# from the variables used for prediction ('variables')\n","result = ml_data.pop('revenue')\n","variables = ml_data\n","\n","# Create the training and test datasets\n","train_variables, test_variables, train_result, test_result = train_test_split(variables, result, test_size=0.2, random_state=123)"],"execution_count":13,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"GBfOgbVae9o4"},"source":["## Scaling the data\n","At the moment the variables in our dataset are on different \"scales\". 
Many machine learning algorithms, however, work better when the scales are similar (e.g. when all variables lie between 0 and 1)."]},{"cell_type":"code","metadata":{"id":"OtnaT7l4KLus","executionInfo":{"status":"ok","timestamp":1639408701352,"user_tz":-120,"elapsed":297,"user":{"displayName":"Kristjan Eljand","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14Gip4UgEVFXL2Q-oKuIBr2JP5268uFbs3Bf7aNhj=s64","userId":"12801037868987527469"}}},"source":["from sklearn.preprocessing import MinMaxScaler\n","# Create scaled versions of the variables. The scaler is fitted on the training data only\n","# and then reused for the test data, so both sets are scaled with the same minimum and maximum.\n","scaler = MinMaxScaler()\n","train_variables_minmax = scaler.fit_transform(train_variables)\n","test_variables_minmax = scaler.transform(test_variables)"],"execution_count":14,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"KTPtFRmEUc5x"},"source":["# Prediction"]},{"cell_type":"markdown","metadata":{"id":"dqsM5l_N3m1y"},"source":["## Prediction with a linear model\n","A linear regression model is one of the simplest models. Its result has the form y = b + n·x1 + m·x2 + ..., where y is the value we predict (here the sales revenue), x1, x2, ... are the variables (in our model the number of employees and the field-of-activity indicators), n, m, ... are the parameters of the variables and b is the intercept.\n","\n","A nice property of scikit-learn is that, although it offers many different models, training them works *almost* the same way every time: \n","1. import the model,\n","2. use the `fit(train_variables, train_result)` method for training,\n","3. 
use the `predict(test_variables)` method for prediction."]},{"cell_type":"code","metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"x4Bab_2vgsSs","executionInfo":{"status":"ok","timestamp":1639408803158,"user_tz":-120,"elapsed":275,"user":{"displayName":"Kristjan Eljand","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14Gip4UgEVFXL2Q-oKuIBr2JP5268uFbs3Bf7aNhj=s64","userId":"12801037868987527469"}},"outputId":"9d58c997-285f-4145-e556-e82e455c3e24"},"source":["# Import the linear regression model\n","from sklearn.linear_model import LinearRegression\n","\n","# Train the model on the data set aside for training\n","model = LinearRegression()\n","model.fit(X=train_variables, y=train_result)\n","\n","# Make a prediction on the test data\n","linear_prediction = model.predict(X=test_variables)\n","linear_prediction"],"execution_count":15,"outputs":[{"output_type":"execute_result","data":{"text/plain":["array([ -25931.88302618, 15150399.35135626, -101267.14605191, ...,\n"," 411571.11630774, 73302.92730663, 266599.03530726])"]},"metadata":{},"execution_count":15}]},{"cell_type":"markdown","metadata":{"id":"lG8TYHBM87Ni"},"source":["### Our model is just a mathematical formula\n","We use the model's `coef_` attribute to print the parameter values of the variables and `intercept_` to print the intercept. With the values printed below, a construction company (`emtak_EHITUS` = 1) with one employee gets the prediction -98505 + 48324 · 1 + 24249 ≈ -25932 €."]},{"cell_type":"code","metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"YXOBJ_TU9Lzl","executionInfo":{"status":"ok","timestamp":1639408827738,"user_tz":-120,"elapsed":285,"user":{"displayName":"Kristjan Eljand","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14Gip4UgEVFXL2Q-oKuIBr2JP5268uFbs3Bf7aNhj=s64","userId":"12801037868987527469"}},"outputId":"5efb2d71-3d54-4308-c4e9-b94242aef6ea"},"source":["# Parameters of the variables, in the column order of train_variables\n","print(\"Variable parameters:\", [int(x) for x in model.coef_])\n","\n","# Intercept: the expected sales revenue\n","# when all variables are zero\n","print(\"Intercept:\", int(model.intercept_))"],"execution_count":16,"outputs":[{"output_type":"stream","name":"stdout","text":["Variable parameters: [48324, 24249, -51085, 26836]\n","Intercept: -98505\n"]}]},{"cell_type":"markdown","metadata":{"id":"hX7hmg8Z5rW4"},"source":["### Adding the predicted values next to the actual values"]},{"cell_type":"code","metadata":{"colab":{"base_uri":"https://localhost:8080/","height":476},"id":"jbqdSS1_5yze","executionInfo":{"status":"ok","timestamp":1639408882760,"user_tz":-120,"elapsed":318,"user":{"displayName":"Kristjan Eljand","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14Gip4UgEVFXL2Q-oKuIBr2JP5268uFbs3Bf7aNhj=s64","userId":"12801037868987527469"}},"outputId":"188b1c2c-0222-4813-eda2-34d3aa6949d6"},"source":["def make_evaluation_table(predictions):\n","    # Put the predictions into a pandas Series aligned with the test data\n","    prediction_series = pd.Series(data=predictions, index=test_result.index, name=\"prediction\")\n","\n","    # Add the prediction column\n","    evaluate_results = pd.concat([test_variables, test_result, prediction_series], axis=1)\n","\n","    # Add the name column from the original dataset\n","    evaluate_results = pd.merge(raw_data['name'], evaluate_results, left_index=True, right_index=True)\n","    return evaluate_results\n","\n","linear_eval_table = make_evaluation_table(linear_prediction)\n","linear_eval_table"],"execution_count":17,"outputs":[{"output_type":"execute_result","data":{"text/html":["
<div>\n",
"<table>\n",
"  <thead>\n",
"    <tr><th></th><th>name</th><th>employees</th><th>emtak_EHITUS</th><th>emtak_INFO JA SIDE</th><th>emtak_TÖÖTLEV TÖÖSTUS</th><th>revenue</th><th>prediction</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>36</th><td>ESTIKO-PLASTAR, AS</td><td>167.0</td><td>0</td><td>0</td><td>1</td><td>15609277.0</td><td>7.998443e+06</td></tr>\n",
"    <tr><th>45</th><td>CE TEHNIKA OÜ</td><td>2.0</td><td>0</td><td>0</td><td>1</td><td>8647.0</td><td>2.497890e+04</td></tr>\n",
"    <tr><th>86</th><td>SAVEKATE, OÜ</td><td>50.0</td><td>1</td><td>0</td><td>0</td><td>3189996.0</td><td>2.341945e+06</td></tr>\n",
"    <tr><th>91</th><td>SKILINE, OÜ</td><td>1.0</td><td>1</td><td>0</td><td>0</td><td>5200.0</td><td>-2.593188e+04</td></tr>\n",
"    <tr><th>102</th><td>ANNINET-V, OÜ</td><td>7.0</td><td>0</td><td>0</td><td>1</td><td>51856.0</td><td>2.665990e+05</td></tr>\n",
"    <tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>\n",
"    <tr><th>144394</th><td>MITTETULUNDUSÜHING HIIUMAA KINO</td><td>1.0</td><td>0</td><td>1</td><td>0</td><td>8550.0</td><td>-1.012671e+05</td></tr>\n",
"    <tr><th>145438</th><td>MTÜ NORDIC INSTITUTE FOR INTEROPERABILITY SOLU...</td><td>4.0</td><td>0</td><td>1</td><td>0</td><td>286342.0</td><td>4.370493e+04</td></tr>\n",
"    <tr><th>146740</th><td>IURIDICUM, SIHTASUTUS</td><td>6.0</td><td>0</td><td>1</td><td>0</td><td>31961.0</td><td>1.403530e+05</td></tr>\n",
"    <tr><th>146826</th><td>KULTUURILEHT, SIHTASUTUS</td><td>104.0</td><td>0</td><td>1</td><td>0</td><td>114214.0</td><td>4.876108e+06</td></tr>\n",
"    <tr><th>147079</th><td>EMAKEELE SIHTASUTUS</td><td>6.0</td><td>0</td><td>1</td><td>0</td><td>22225.0</td><td>1.403530e+05</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"<p>3490 rows × 7 columns</p>\n",
"</div>
\n",""],"text/plain":[" name ... prediction\n","36 ESTIKO-PLASTAR, AS ... 7.998443e+06\n","45 CE TEHNIKA OÜ ... 2.497890e+04\n","86 SAVEKATE, OÜ ... 2.341945e+06\n","91 SKILINE, OÜ ... -2.593188e+04\n","102 ANNINET-V, OÜ ... 2.665990e+05\n","... ... ... ...\n","144394 MITTETULUNDUSÜHING HIIUMAA KINO ... -1.012671e+05\n","145438 MTÜ NORDIC INSTITUTE FOR INTEROPERABILITY SOLU... ... 4.370493e+04\n","146740 IURIDICUM, SIHTASUTUS ... 1.403530e+05\n","146826 KULTUURILEHT, SIHTASUTUS ... 4.876108e+06\n","147079 EMAKEELE SIHTASUTUS ... 1.403530e+05\n","\n","[3490 rows x 7 columns]"]},"metadata":{},"execution_count":17}]},{"cell_type":"markdown","metadata":{"id":"qUYAwasT-OGW"},"source":["### Plotting the prediction against the actual data"]},{"cell_type":"code","metadata":{"id":"OSJ6xcqqjkyg","executionInfo":{"status":"ok","timestamp":1639408937220,"user_tz":-120,"elapsed":5986,"user":{"displayName":"Kristjan Eljand","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14Gip4UgEVFXL2Q-oKuIBr2JP5268uFbs3Bf7aNhj=s64","userId":"12801037868987527469"}}},"source":["!pip install -q plotly\n","import plotly.express as px"],"execution_count":18,"outputs":[]},{"cell_type":"code","metadata":{"colab":{"base_uri":"https://localhost:8080/","height":542},"id":"YzUnzqf4jok_","executionInfo":{"status":"ok","timestamp":1639409007626,"user_tz":-120,"elapsed":1444,"user":{"displayName":"Kristjan Eljand","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14Gip4UgEVFXL2Q-oKuIBr2JP5268uFbs3Bf7aNhj=s64","userId":"12801037868987527469"}},"outputId":"c2052bce-194c-4b7f-8815-646d4a954b95"},"source":["def plot_pred_and_actual(evaluation_table, model_name):\n","    # Take the companies ranked 2 to 120 by actual revenue (the single largest one is left out)\n","    plot_data = evaluation_table.sort_values(by=\"revenue\", ascending=False).iloc[1:120,]\n","    plot_data = plot_data[['name','revenue', 'prediction']]\n","    # Long format: one row per company and value type ('revenue' or 'prediction')\n","    plot_data = plot_data.melt(id_vars='name', value_vars=['revenue', 'prediction'], var_name='type')\n","\n","    # x is set explicitly so that both lines share the same x positions in every Plotly version\n","    fig = px.line(plot_data,\n","                  x='name',\n","                  y='value',\n","                  color='type',\n","                  title=f\"Predicted vs actual sales revenue ({model_name})\")\n","    fig.show()\n","\n","plot_pred_and_actual(linear_eval_table, \"Linear model\")"],"execution_count":19,"outputs":[{"output_type":"display_data","data":{"text/html":["
[Interactive Plotly line chart: predicted vs actual sales revenue, linear model]
"]},"metadata":{}}]},{"cell_type":"markdown","metadata":{"id":"i6buZJnr4Ubl"},"source":["### Statistical evaluation of the results\n","We evaluate the results with two metrics: \n","* the mean absolute error,\n","* the R2 score, which shows how large a share of the variance of the predicted variable our prediction explains."]},{"cell_type":"code","metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"hRZRe9qx4ZNn","executionInfo":{"status":"ok","timestamp":1639409096231,"user_tz":-120,"elapsed":308,"user":{"displayName":"Kristjan Eljand","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14Gip4UgEVFXL2Q-oKuIBr2JP5268uFbs3Bf7aNhj=s64","userId":"12801037868987527469"}},"outputId":"d62890c0-359d-4f49-a025-37b1556a9d02"},"source":["from sklearn.metrics import mean_absolute_error, r2_score\n","\n","def result_stats_printer(test_result, prediction):\n","    # Compute the mean absolute error\n","    mae = mean_absolute_error(test_result, prediction)\n","    print(f\"The model's mean absolute error is {mae:.0f} €\")\n","\n","    # Compute the R2 statistic\n","    r2 = r2_score(test_result, prediction)\n","    print(f\"Our variables explain {r2*100:.0f}% of the variance in sales revenue.\")\n","\n","result_stats_printer(test_result, linear_prediction)"],"execution_count":20,"outputs":[{"output_type":"stream","name":"stdout","text":["The model's mean absolute error is 359035 €\n","Our variables explain 62% of the variance in sales revenue.\n"]}]},{"cell_type":"markdown","metadata":{"id":"AXjHrtzj_u3W"},"source":["## Trying a decision tree model as well"]},{"cell_type":"code","metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"HhdMOHNIm8P9","executionInfo":{"status":"ok","timestamp":1639409148301,"user_tz":-120,"elapsed":311,"user":{"displayName":"Kristjan Eljand","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14Gip4UgEVFXL2Q-oKuIBr2JP5268uFbs3Bf7aNhj=s64","userId":"12801037868987527469"}},"outputId":"38f0e653-c471-4517-cbb4-89bc3aa8e304"},"source":["from sklearn.tree import DecisionTreeRegressor\n","\n","# Train the model on the data set aside for training\n","model = DecisionTreeRegressor(max_depth=5)\n","model.fit(X=train_variables, y=train_result)\n","\n","# Make a prediction on the test data\n","tree_prediction = model.predict(X=test_variables)\n","\n","# Print the results with the function defined earlier\n","result_stats_printer(test_result, tree_prediction)\n"],"execution_count":21,"outputs":[{"output_type":"stream","name":"stdout","text":["The model's mean absolute error is 318467 €\n","Our variables explain 73% of the variance in sales revenue.\n"]}]},{"cell_type":"markdown","metadata":{"id":"eNifj6rO-05S"},"source":["### Visualizing the results\n"]},{"cell_type":"code","metadata":{"colab":{"base_uri":"https://localhost:8080/","height":542},"id":"cCWMpc7W_JGj","executionInfo":{"status":"ok","timestamp":1639409169144,"user_tz":-120,"elapsed":753,"user":{"displayName":"Kristjan Eljand","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14Gip4UgEVFXL2Q-oKuIBr2JP5268uFbs3Bf7aNhj=s64","userId":"12801037868987527469"}},"outputId":"8044c414-db4b-4460-827f-316a2b734b9a"},"source":["tree_eval_table = make_evaluation_table(tree_prediction)\n","plot_pred_and_actual(tree_eval_table, \"Decision tree\")"],"execution_count":22,"outputs":[{"output_type":"display_data","data":{"text/html":["\n","\n","\n","