# Hugging Face - the home of pretrained text-processing models
[Hugging Face](https://huggingface.co/) is a company that provides access to pretrained natural language processing (and also computer vision) models. The models can be used either through the hosted [Inference API](https://api-inference.huggingface.co/docs/python/html/index.html) or through the [Transformers](https://huggingface.co/transformers/) library.


# The Transformers library
Install the library with `!pip install transformers`.

The library downloads pretrained language models. The simplest way to use a specific model is through the `pipeline()` function.

The Transformers library supports the following tasks:
1. Sentiment analysis: is a text positive or negative?
2. Text generation: provide the beginning of a text and the model writes a continuation.
3. Named entity recognition (NER): provide an input sentence and the model tags all words that refer to people or places.
4. Question answering: give the model a text that contains the answer, ask a question, and extract the answer.
5. Fill-mask: fill the gaps in a text with fitting words.
6. Summarization: provide a long text and ask for a short summary of it.


We will go through these tasks one by one. Let's start by installing the library and importing the `pipeline` function.

In [None]:
!pip install -q transformers
from transformers import pipeline

## Sentiment analysis
As the first task we look at sentiment analysis (labeling a text as positive or negative).

The first time we call `pipeline()` with a new model, the pretrained model is downloaded and stored in a local cache on the computer.

### Creating a sentiment classifier
By default, the model `distilbert-base-uncased-finetuned-sst-2-english` is downloaded.

In [None]:
# 0. Create a sentiment analysis classifier
classifier = pipeline('sentiment-analysis')

### Sentiment analysis on single sentences

In [3]:
# EXAMPLE 1
# Input text
input_text = "We wanted the best, but it turned out the way it always did."

# Classify the text
result = classifier(input_text)[0]

# Print the result
print(f"Our text is: {result['label']}, with probability: {round(result['score'], 2)}")

Our text is: NEGATIVE, with probability: 0.96
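
To make the `score` in the output above more concrete: for a two-class model like this one, the value is, as far as we understand, a softmax probability computed from the model's raw outputs (logits). The sketch below illustrates only that last step in plain Python. The logits are made-up numbers chosen for illustration, and `softmax` is our own helper, not a function from the Transformers library.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

labels = ["NEGATIVE", "POSITIVE"]
logits = [2.1, -1.1]  # invented values, not real model output

probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)  # index of the largest probability
prediction = {"label": labels[best], "score": probs[best]}

print(f"Our text is: {prediction['label']}, with probability: {round(prediction['score'], 2)}")
```

The dictionary built here has the same two keys (`label` and `score`) that the examples in this section read from the pipeline result.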


In [4]:
# EXAMPLE 2
# Input text
input_text = "The best argument against Democracy is a five-minute conversation with the average voter."

# Classify the text
result = classifier(input_text)[0]

# Print the result
print(f"Our text is: {result['label']}, with probability: {round(result['score'], 2)}")

Our text is: POSITIVE, with probability: 0.99


### Sentiment analysis on several sentences at once

In [5]:
# Pass a list of texts
input_text_list = [
    "We wanted the best, but it turned out the same as always",
    "The best argument against Democracy is a five-minute conversation with the average voter.",
]

# Use the classifier exactly as before
results = classifier(input_text_list)

# Print the results
for result in results:
    print(f"Our text is: {result['label']}, with probability: {round(result['score'], 2)}")

Our text is: NEGATIVE, with probability: 0.99
Our text is: POSITIVE, with probability: 0.99


### Sentiment analysis with a model of our own choice
We can also choose which model to use for sentiment analysis by passing the name of one of the [text-classification models](https://huggingface.co/models?pipeline_tag=text-classification) directly to `pipeline()`.

In the example below we choose a model that can label a text as positive, negative, or neutral (the default model has no neutral class).


In [7]:
# Create a classifier, this time also passing the model name
classifier = pipeline('sentiment-analysis', model='cardiffnlp/twitter-roberta-base-sentiment')

In [8]:
# This model returns three different labels; give them readable names.
# According to the model card: 0 = negative, 1 = neutral, 2 = positive.
sildid = {
    "LABEL_0": "Negative",
    "LABEL_1": "Neutral",
    "LABEL_2": "Positive"
}

# Classify text 1
input_text = "We wanted the best, but it turned out the same as always"
print("Input text:", input_text)
result = classifier(input_text)[0]
print(f"Our text is: {sildid[result['label']]}, with probability: {round(result['score'], 2)}\n\n")

# Classify text 2
input_text = "The best argument against Democracy is a five-minute conversation with the average voter."
print("Input text:", input_text)
result = classifier(input_text)[0]
print(f"Our text is: {sildid[result['label']]}, with probability: {round(result['score'], 2)}")

Input text: We wanted the best, but it turned out the same as always
Our text is: Negative, with probability: 0.54


Input text: The best argument against Democracy is a five-minute conversation with the average voter.
Our text is: Neutral, with probability: 0.5
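
Because this model reports generic names such as `LABEL_0`, a small helper can translate them after the fact. The following is a minimal sketch: `relabel` is a hypothetical helper, the mapping follows the model card (0 = negative, 1 = neutral, 2 = positive), and the input dictionaries only imitate the pipeline's output format. `LABEL_9` is an invented label that shows the fallback for unknown names.

```python
def relabel(results, mapping):
    """Return copies of the results with readable labels; unknown labels stay unchanged."""
    return [{**r, "label": mapping.get(r["label"], r["label"])} for r in results]

# Mapping taken from the model card of cardiffnlp/twitter-roberta-base-sentiment
readable = {"LABEL_0": "Negative", "LABEL_1": "Neutral", "LABEL_2": "Positive"}

# Hand-written dictionaries in the same shape as the pipeline output
raw_results = [
    {"label": "LABEL_0", "score": 0.54},
    {"label": "LABEL_1", "score": 0.5},
    {"label": "LABEL_9", "score": 0.1},  # invented label, not produced by the model
]

for item in relabel(raw_results, readable):
    print(f"{item['label']}: {item['score']}")
```

Using `mapping.get` with a fallback avoids a `KeyError` if a model ever returns a label that is missing from the dictionary.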


## Question answering
The `question-answering` pipeline lets us provide a text that contains the answer, ask a question about that text, and extract the answer.

In [9]:
classifier = pipeline("question-answering")

No model was supplied, defaulted to distilbert-base-cased-distilled-squad (https://huggingface.co/distilbert-base-cased-distilled-squad)


In [10]:
# Input text
input_text = """It has been estimated that more than 80% of enterprise data is unstructured 
(emails, documents, job descriptions, etc.). On the other hand, most of the decisions 
are based purely on structured data."""

# Question
question = "what are most of the decisions based on?"

# Use the model to answer the question
result = classifier(question=question, context=input_text)
print("Input text:", input_text, "\n")
print("Question:", question, "\n")
print(f"Answer to the question: '{result['answer']}', score: {round(result['score'], 4)}")

Input text: It has been estimated that more than 80% of enterprise data is unstructured 
(emails, documents, job descriptions, etc.). On the other hand, most of the decisions 
are based purely on structured data. 

Question: what are most of the decisions based on? 

Answer to the question: 'structured data', score: 0.8724
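
Extractive question-answering models of this kind typically score every token as a possible start and as a possible end of the answer, and then return the best-scoring span. The sketch below illustrates that idea under simplifying assumptions: the tokens are whitespace-separated words, the scores are invented, and `best_span` is a hypothetical helper rather than the pipeline's actual implementation.

```python
def best_span(start_scores, end_scores, max_answer_len=15):
    """Find (start, end) with start <= end that maximizes start score + end score."""
    best = (0, 0)
    best_score = float("-inf")
    for i, start_score in enumerate(start_scores):
        # Only consider ends at or after the start, within the length limit
        for j in range(i, min(i + max_answer_len, len(end_scores))):
            score = start_score + end_scores[j]
            if score > best_score:
                best_score = score
                best = (i, j)
    return best, best_score

tokens = ["decisions", "are", "based", "purely", "on", "structured", "data", "."]
# Invented scores: one value per token for "answer starts here" / "answer ends here"
start_scores = [0.1, 0.0, 0.2, 0.5, 0.1, 3.0, 0.4, 0.0]
end_scores = [0.0, 0.1, 0.1, 0.2, 0.1, 0.6, 3.5, 0.3]

(span_start, span_end), span_score = best_span(start_scores, end_scores)
answer = " ".join(tokens[span_start:span_end + 1])
print(answer)  # structured data
```

Requiring the end to come at or after the start is what keeps the extracted answer a contiguous piece of the input text.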


## Fill-mask
In the fill-mask task, the goal is to find suitable words to replace the missing (masked) words in a text.

In [11]:
# create the fill-mask pipeline
classifier = pipeline('fill-mask')

No model was supplied, defaulted to distilroberta-base (https://huggingface.co/distilroberta-base)


In [12]:
# pprint is a standard-library module that prints output a bit more readably
from pprint import pprint

# the mask token marks the gap in the text
mask_token = classifier.tokenizer.mask_token

# input text
input_text = f"A pre-trained machine learning model has been previously {mask_token} on a large dataset"

# print the two most likely fillers
pprint(classifier(input_text, top_k=2))

[{'score': 0.3762722611427307,
  'sequence': 'A pre-trained machine learning model has been previously tested '
              'on a large dataset',
  'token': 4776,
  'token_str': ' tested'},
 {'score': 0.12328524142503738,
  'sequence': 'A pre-trained machine learning model has been previously '
              'trained on a large dataset',
  'token': 5389,
  'token_str': ' trained'}]
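
The `top_k=2` argument keeps only the two most probable fillers for the gap. A minimal sketch of that selection step in plain Python follows. The scores for ` tested` and ` trained` are rounded from the output above, the other two candidates and their scores are invented, `<mask>` is assumed to be the mask token of this RoBERTa-style model, and `top_k` is our own helper, not the library's code.

```python
import heapq

def top_k(candidates, k):
    """Return the k (token, score) pairs with the highest scores, best first."""
    return heapq.nlargest(k, candidates.items(), key=lambda item: item[1])

MASK = "<mask>"  # assumed mask token for this sketch
template = f"A pre-trained machine learning model has been previously {MASK} on a large dataset"

candidates = {
    " tested": 0.376,   # rounded from the output above
    " trained": 0.123,  # rounded from the output above
    " built": 0.05,     # invented
    " run": 0.02,       # invented
}

best_two = top_k(candidates, 2)
# Insert each candidate into the gap (the leading space is part of the token)
sequences = [template.replace(MASK, token.strip()) for token, _ in best_two]

for (token, score), sequence in zip(best_two, sequences):
    print(f"{score:.3f} {sequence}")
```

The tokens carry a leading space because the tokenizer treats the space before a word as part of the token, which is also visible in `token_str` in the output above.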


## Text generation
Text generation is the task of predicting the next words given an input text.

In [13]:
# Create a text-generation pipeline with a model of our own choice, 'distilgpt2'
classifier = pipeline('text-generation', model='distilgpt2')

In [15]:
# Two input texts (the second one is commented out)
input_text = 'Nobody should use vaccines because'
#input_text = 'Learning natural language processing is useful because'

# Let the model generate a continuation of the input text.
# max_length=100 caps the total length (prompt plus continuation) at 100 tokens, not words.
result = classifier(input_text, max_length=100, num_return_sequences=1)
text = result[0]['generated_text']

# Print the result
print("Generated text:\n")
pprint(text)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generated text:

('Nobody should use vaccines because it affects their offspring,\u202b he‡s '
 'not a science-fiction figure, but a medical scientist and a public health '
 'advocate.\n'
 '\n'
 '\n'
 'Advertisement\n'
 '\n'
 '\n'
 '\u200fThe debate over vaccines is complicated. Many scientists agree that '
 'vaccines cause autism, which is a human disorder causing autism, and that '
 'people with a disease can cause autism.\n'
 'But the truth, with some estimates made up by children, is that some people '
 'with autism have autism. That\u202a')
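
Text generation works autoregressively: the model predicts one next token, the token is appended to the input, and the process repeats until a length limit or an end-of-text token is reached. The sketch below shows this loop with greedy decoding. A tiny lookup table (`toy_table`, invented for this example) stands in for the language model, and `generate` is a hypothetical helper, not the Transformers API. As with `max_length` in the cell above, the limit counts the prompt tokens too.

```python
EOS = "<eos>"  # hypothetical end-of-text marker for this toy example

# Toy "language model": maps the last token to the next one
toy_table = {
    "learning": "is",
    "is": "useful",
    "useful": "because",
    "because": EOS,
}

def toy_next_token(tokens):
    """Predict the next token from the last token only (a real model uses the whole context)."""
    return toy_table.get(tokens[-1], EOS)

def generate(prompt_tokens, next_token, max_length):
    """Greedy autoregressive decoding: append one predicted token at a time."""
    tokens = list(prompt_tokens)
    while len(tokens) < max_length:  # max_length includes the prompt
        predicted = next_token(tokens)
        if predicted == EOS:
            break
        tokens.append(predicted)
    return tokens

short = generate(["learning"], toy_next_token, max_length=3)
full = generate(["learning"], toy_next_token, max_length=100)
print(short)  # ['learning', 'is', 'useful']
print(full)   # ['learning', 'is', 'useful', 'because']
```

The first call stops because it hits the length limit, the second because the toy model emits the end marker. A real model picks the next token from a probability distribution over its whole vocabulary, which is why its output can drift into fluent but false statements like the ones above.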


## Named entity recognition
Named entity recognition models try to extract people, locations, and organizations from texts.

In [16]:
# ner stands for named entity recognition
classifier = pipeline('ner')

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english)


In [17]:
# Input text
input_text = """
This training is organized by the Ministry of economic affairs 
and communication that is located in Tallinn. The training is 
carried out by Andmeteadus OÜ and the main trainer is Kristjan Eljand.
"""

# Try to find names and organizations
result = classifier(input_text)

# Display the result
result

[{'end': 43,
  'entity': 'I-ORG',
  'index': 7,
  'score': 0.53924775,
  'start': 35,
  'word': 'Ministry'},
 {'end': 63,
  'entity': 'I-ORG',
  'index': 10,
  'score': 0.5610883,
  'start': 56,
  'word': 'affairs'},
 {'end': 68,
  'entity': 'I-ORG',
  'index': 11,
  'score': 0.8030855,
  'start': 65,
  'word': 'and'},
 {'end': 82,
  'entity': 'I-ORG',
  'index': 12,
  'score': 0.76436216,
  'start': 69,
  'word': 'communication'},
 {'end': 109,
  'entity': 'I-LOC',
  'index': 17,
  'score': 0.9929943,
  'start': 102,
  'word': 'Tallinn'},
 {'end': 146,
  'entity': 'I-ORG',
  'index': 25,
  'score': 0.9987657,
  'start': 143,
  'word': 'And'},
 {'end': 149,
  'entity': 'I-ORG',
  'index': 26,
  'score': 0.99567723,
  'start': 146,
  'word': '##met'},
 {'end': 152,
  'entity': 'I-ORG',
  'index': 27,
  'score': 0.99760324,
  'start': 149,
  'word': '##ead'},
 {'end': 154,
  'entity': 'I-ORG',
  'index': 28,
  'score': 0.99557674,
  'start': 152,
  'word': '##us'},
 {'end': 156,
  'entit

### Printing the results in a readable form

In [18]:
# Human-readable meanings of the classes
# (the CoNLL-03 tag for miscellaneous entities is MISC)
classes_est = {
    "O": "Not a name",
    "B-MISC": "Start of a miscellaneous entity right after another miscellaneous entity",
    "I-MISC": "Miscellaneous entity",
    "B-PER": "Start of a person's name right after another person's name",
    "I-PER": "Person",
    "B-ORG": "Start of an organization name right after another organization name",
    "I-ORG": "Organization",
    "B-LOC": "Start of a location name right after another location",
    "I-LOC": "Location"
    }

# Replace the entity keys with readable values
output = []
for ent in result:
    ent_key = ent['entity']
    ent_readable = classes_est[ent_key]
    ent['entity'] = ent_readable
    output.append(ent)

# Group the individual tokens that belong together
output = classifier.group_entities(output)

# Print the results as a pandas DataFrame
import pandas as pd
pd.DataFrame(output)

|   | entity_group | score | word | start | end |
|---|---|---|---|---|---|
| 0 | Organization | 0.666946 | Ministry affairs and communication | 35 | 82 |
| 1 | Location | 0.992994 | Tallinn | 102 | 109 |
| 2 | Organization | 0.997205 | Andmeteadus OÜ | 143 | 157 |
| 3 | Person | 0.995302 | Kristjan Eljand | 182 | 197 |
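
The raw output earlier in this section split `Andmeteadus` into the word pieces `And`, `##met`, `##ead`, and `##us`; the `##` prefix marks a piece that continues the previous token. A rough sketch of how such pieces can be joined back together follows. `merge_wordpieces` is a hypothetical helper written for this illustration (averaging the scores of the pieces is an assumption), and it is not how `group_entities` is implemented.

```python
def merge_wordpieces(tokens):
    """Join WordPiece continuations ('##...') onto the preceding token."""
    merged = []
    for tok in tokens:
        if tok["word"].startswith("##") and merged:
            previous = merged[-1]
            previous["word"] += tok["word"][2:]  # drop the '##' marker
            previous["end"] = tok["end"]         # extend the character span
            previous["scores"].append(tok["score"])
        else:
            merged.append({
                "word": tok["word"],
                "entity": tok["entity"],
                "start": tok["start"],
                "end": tok["end"],
                "scores": [tok["score"]],
            })
    # Replace the list of piece scores with their mean (an assumption of this sketch)
    return [
        {
            "word": m["word"],
            "entity": m["entity"],
            "start": m["start"],
            "end": m["end"],
            "score": sum(m["scores"]) / len(m["scores"]),
        }
        for m in merged
    ]

# Four tokens copied from the raw NER output above
pieces = [
    {"word": "And", "entity": "I-ORG", "score": 0.9987657, "start": 143, "end": 146},
    {"word": "##met", "entity": "I-ORG", "score": 0.99567723, "start": 146, "end": 149},
    {"word": "##ead", "entity": "I-ORG", "score": 0.99760324, "start": 149, "end": 152},
    {"word": "##us", "entity": "I-ORG", "score": 0.99557674, "start": 152, "end": 154},
]

entities = merge_wordpieces(pieces)
print(entities[0]["word"], entities[0]["start"], entities[0]["end"])  # Andmeteadus 143 154
```

This only rejoins sub-word pieces; combining separate words such as `Andmeteadus` and `OÜ` into one entity is a further grouping step that this sketch leaves out.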


## Text summarization
The goal of summarization is to take a long text and turn it into a shorter summary.

In [19]:
# Initialize the summarization pipeline
classifier = pipeline("summarization")

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 (https://huggingface.co/sshleifer/distilbart-cnn-12-6)


In [20]:
# Input text
input_text = r"""
We operate in the markets for electricity and gas sales in the Baltic States, 
Finland and Poland, as well as on the international market for liquid fuels. 
We create energy solutions from the production of electricity, 
heat and fuels to innovative sales, 
client services and energy related additional services. 
Our ambition is to offer our clients useful and convenient energy solutions 
and to produce energy ourselves, in an increasingly environment conserving way, 
as thus we will make our contribution into making the world cleaner.

The aim of Eesti Energia is to ensure the profitability of the group 
regardless of the shocks on the world economy. 
Eesti Energia, in the last five years, has created value for the state 
amounting to 1.3 billion Euros. We believe that by helping our clients 
use energy as wisely as possible and offering simple and useful services, 
we will lay the foundation for the long-term competitiveness, 
profitability and dividend payment capacity of Eesti Energia.
"""

# Ask for a summary of at most 100 tokens (max_length counts tokens, not words)
result = classifier(input_text, max_length=100)
pprint(f"Summary of the text: {result[0]['summary_text']}")

('Summary of the text:  Eesti Energia creates energy solutions from the '
 'production of electricity, heat and fuels to innovative sales, client '
 'services and energy related additional services . In the last five years, '
 'the group has created value for the state amounting to 1.3 billion Euros . '
 'The aim of the group is to ensure the profitability of the business '
 'regardless of the shocks on the world economy .')
