Natural Language Processing Basics in 10 Minutes

Welcome, my name is Luis, and this time I bring you a new post.

You are here because, like me, you want to learn natural language processing as quickly as possible.

Let's get started

The first thing we need to do is install some dependencies.

2. Download an IDE or install Jupyter Notebook.

To install Jupyter Notebook, open your cmd (terminal) and type pip install notebook. After that, type jupyter notebook to run it. Your notebook opens at http://127.0.0.1:8888/ (the full URL printed in the terminal includes an access token).

3. Install packages

pip install nltk

NLTK: a Python library that we can use to perform common NLP tasks (stemming, lemmatization, etc.)

Before learning anything else, let's first understand NLP.

Natural language refers to the way we humans communicate with each other, and processing basically means transforming that data into a form a computer can work with.

We can say that NLP (natural language processing) is a way of helping computers communicate with humans in their own language.

It is one of the broadest fields of research because there is a huge amount of data available, and a large share of that data is text.

So, when this much data is available, we need techniques to process it and retrieve useful information from it.

Now that we understand what NLP is, let's go through each topic one by one.

1. Tokenization

Tokenization is the process of splitting a text into tokens.

It is mainly of two types:

Word tokenizer (splits the text into words).
Sentence tokenizer (splits the text into sentences).

import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
nltk.download('punkt')  ## Tokenizer models, needed once (newer NLTK releases may ask for 'punkt_tab')
example_text = "Hello there, how are you doing today? The weather is great today. The sky is blue. python is awsome"
print(sent_tokenize(example_text))
print(word_tokenize(example_text))

In the code above, we first import nltk. Next, we import our tokenizers, sent_tokenize and word_tokenize, from nltk.tokenize. Then, to use a tokenizer on a text, we just pass the text as a parameter to the tokenizer.

The output will look like this:

##sent_tokenize (Separated by sentence)
['Hello there, how are you doing today?', 'The weather is great today.', 'The sky is blue.', 'python is awsome']

##word_tokenize (Separated by words)
['Hello', 'there', ',', 'how', 'are', 'you', 'doing', 'today', '?', 'The', 'weather', 'is', 'great', 'today', '.', 'The', 'sky', 'is', 'blue', '.', 'python', 'is', 'awsome']
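
To make the idea of splitting text into tokens more concrete, here is a minimal sketch that uses only Python's re module. It is a simplified illustration, not how NLTK works internally: the function names simple_sent_tokenize and simple_word_tokenize are made up for this example, and the naive sentence rule will break on abbreviations such as "Mr.".

```python
import re

def simple_sent_tokenize(text):
    """Split after '.', '?' or '!' when followed by whitespace (naive rule)."""
    return [s for s in re.split(r"(?<=[.?!])\s+", text.strip()) if s]

def simple_word_tokenize(text):
    """Return runs of word characters and single punctuation marks as tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

example_text = "Hello there, how are you doing today? The weather is great today."
sentences = simple_sent_tokenize(example_text)
words = simple_word_tokenize(sentences[0])
print(sentences)
print(words)
```

For real text, the NLTK tokenizers shown earlier are the better choice, since they handle many cases this sketch ignores.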

2. Stopwords

In general, stopwords are the words in any language that do not add much meaning to a sentence. In NLP, stopwords are words that are not important for analyzing the data.

Example: he, she, hi, etc.

Our main task is to remove all the stopwords from the text before processing it further.

The NLTK list shown here contains 179 English stopwords (the exact count can vary between NLTK versions). Using NLTK, we can print all of them.

We only need to import stopwords from nltk.corpus (the corpus has to be downloaded once with nltk.download('stopwords')).

from nltk.corpus import stopwords
print(stopwords.words('english'))

######################
######OUTPUT##########
######################
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 
"you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 
'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 
'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 
'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 
'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 
'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 
'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 
'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 
'with', 'about', 'against', 'between', 'into', 'through', 'during', 
'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 
'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 
'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 
'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 
'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 
'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 
'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 
'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 
'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', 
"haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', 
"mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', 
"shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 
'wouldn', "wouldn't"]

To remove the stopwords from a particular text:

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
text = 'he is a good boy. he is very good in coding'
text = word_tokenize(text)
stop_words = set(stopwords.words('english'))  ## Build the set once for fast lookups
text_with_no_stopwords = [word for word in text if word not in stop_words]
print(text_with_no_stopwords)

##########OUTPUT##########
['good', 'boy', '.', 'good', 'coding']
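
One detail worth noting is that the NLTK stopword list is all lowercase, so a capitalized token such as "He" would slip through a plain membership test. The sketch below shows one way to handle that. It uses a small hand-picked stopword set (SAMPLE_STOPWORDS, a hypothetical subset chosen for this example) so that it runs without any downloads, and remove_stopwords is an illustrative helper, not an NLTK function.

```python
# Hypothetical, hand-picked subset of English stopwords, for illustration only.
SAMPLE_STOPWORDS = {"he", "is", "a", "very", "in"}

def remove_stopwords(tokens, stop_words=SAMPLE_STOPWORDS):
    """Keep only the tokens whose lowercase form is not in the stopword set."""
    return [tok for tok in tokens if tok.lower() not in stop_words]

tokens = ["He", "is", "a", "good", "boy", ".",
          "he", "is", "very", "good", "in", "coding"]
filtered = remove_stopwords(tokens)
print(filtered)
```

In practice you would pass set(stopwords.words('english')) as stop_words instead of the sample set.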

3. Stemming

Stemming is the process of reducing a word to its word stem, the part to which suffixes and prefixes attach, or to the root form of the word, known as the lemma.

In simple words, stemming strips endings such as plurals and verb inflections from a word.

Example:

loved → love, learning → learn

In Python, we can implement stemming using PorterStemmer, which we can import from nltk.stem.

One thing to remember about stemming is that it works best on single words.

from nltk.stem import PorterStemmer
ps = PorterStemmer()    ## Creating an object for porterstemmer
example_words = ['earn',"earning","earned","earns"]  ##Example words

for w in example_words:
    print(ps.stem(w))    ##Using ps object stemming the word

##########OUTPUT##########
earn
earn
earn
earn

Here we can see that earning, earned, and earns are all stemmed to their root word, earn.
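
To see why a stem is not always a real word, it can help to look at a toy suffix stripper. The sketch below is a deliberately naive illustration and not the Porter algorithm, which applies many more rules. The SUFFIXES tuple, the toy_stem function, and the min_stem_len parameter are all assumptions made up for this example.

```python
# Hypothetical suffix list; the real Porter stemmer uses far more rules.
SUFFIXES = ("ing", "ed", "s")

def toy_stem(word, min_stem_len=3):
    """Strip the first matching suffix if enough of the word remains."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

stems = [toy_stem(w) for w in ["earn", "earning", "earned", "earns"]]
print(stems)
print(toy_stem("loved"))  # the result is a stem, not necessarily a dictionary word
```

Blindly cutting suffixes can leave fragments that are not dictionary words, which is the gap lemmatization tries to close.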

4. Lemmatization

Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove only the inflectional endings and to return the base or dictionary form of a word, which is known as the lemma.

In simple words, lemmatization does the same job as stemming; the difference is that lemmatization returns a meaningful word.

Example:

Stemming

history → histori

Lemmatization

history → history

It is mostly used when designing chatbots, Q&A bots, text prediction, etc.

from nltk.stem import PorterStemmer, WordNetLemmatizer
lemmatizer = WordNetLemmatizer()  ## Create object for lemmatizer (needs nltk.download('wordnet') once)
ps = PorterStemmer()  ## Stemmer, used only for comparison
example_words = ['history','formality','changes']
print('----Lemmatizer-----')
for w in example_words:
    print(lemmatizer.lemmatize(w))
print('-----Stemming------')
for w in example_words:
    print(ps.stem(w))

#########OUTPUT############
----Lemmatizer-----
history
formality
change
-----Stemming------
histori
formal
chang
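
The key difference from stemming is that lemmatization looks words up in a vocabulary instead of cutting suffixes. The sketch below illustrates that idea with a tiny hand-written lookup table. LEMMA_TABLE and toy_lemmatize are hypothetical names invented for this example, and a real lemmatizer such as WordNetLemmatizer relies on a full lexical database rather than a five-entry dictionary.

```python
# Hypothetical miniature vocabulary mapping inflected forms to their lemmas.
LEMMA_TABLE = {
    "changes": "change",
    "changed": "change",
    "histories": "history",
    "went": "go",
    "mice": "mouse",
}

def toy_lemmatize(word):
    """Return the dictionary form if the word is known, else the word itself."""
    return LEMMA_TABLE.get(word.lower(), word)

lemmas = [toy_lemmatize(w) for w in ["history", "changes", "went", "mice"]]
print(lemmas)
```

Irregular forms such as "went" or "mice" show why a vocabulary is needed: no suffix rule would map them to "go" or "mouse". Note also that WordNetLemmatizer.lemmatize accepts an optional pos argument and treats words as nouns by default.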

5. WordNet

WordNet is a lexical database, that is, a dictionary for the English language, designed specifically for natural language processing.

We can use WordNet to find synonyms and antonyms.

In Python, we can import wordnet from nltk.corpus.

Code to find the synonyms and antonyms of a given word:

from nltk.corpus import wordnet

synonyms = []   ## Creating an empty list for all the synonyms
antonyms = []   ## Creating an empty list for all the antonyms
for syn in wordnet.synsets("happy"):  ## Every synset (sense) of the given word
    for lemma in syn.lemmas():        ## Every lemma that belongs to that synset
        synonyms.append(lemma.name())  ## Collecting the synonym
        if lemma.antonyms():
            antonyms.append(lemma.antonyms()[0].name())  ## Collecting the first antonym
print(set(synonyms)) ## Converting them into sets for unique values
print(set(antonyms))

#########OUTPUT##########
{'felicitous', 'well-chosen', 'happy', 'glad'}
{'unhappy'}

6. Part of Speech Tagging

It is the process of converting a sentence, given as a list of words, into a list of tuples, where each tuple has the form (word, tag). The tag is a part-of-speech tag and indicates whether the word is a noun, adjective, verb, etc.

Part-of-speech tag list

CC coordinating conjunction
CD cardinal digit
DT determiner
EX existential there (like: “there is” … think of it like “there”)
FW foreign word
IN preposition/subordinating conjunction
JJ adjective ‘big’
JJR adjective, comparative ‘bigger’
JJS adjective, superlative ‘biggest’
LS list marker 1)
MD modal could, will
NN noun, singular ‘desk’
NNS noun plural ‘desks’
NNP proper noun, singular ‘Harrison’
NNPS proper noun, plural ‘Americans’
PDT predeterminer ‘all the kids’
POS possessive ending parent’s
PRP personal pronoun I, he, she
PRP$ possessive pronoun my, his, hers
RB adverb very, silently,
RBR adverb, comparative better
RBS adverb, superlative best
RP particle give up
TO to go ‘to’ the store.
UH interjection errrrrrrrm
VB verb, base form take
VBD verb, past tense took
VBG verb, gerund/present participle taking
VBN verb, past participle taken
VBP verb, sing. present, non-3rd person take
VBZ verb, 3rd person sing. present takes
WDT wh-determiner which
WP wh-pronoun who, what
WP$ possessive wh-pronoun whose
WRB wh-adverb where, when

In Python, we can do POS tagging using nltk.pos_tag.

import nltk
nltk.download('averaged_perceptron_tagger')

sample_text = '''
An sincerity so extremity he additions. Her yet there truth merit. 
Mrs all projecting favourable now unpleasing. Son law garden chatty temper. 
Oh children provided to mr elegance marriage strongly. 
Off can admiration prosperous now devonshire diminution law.
'''

from nltk.tokenize import word_tokenize
words = word_tokenize(sample_text)

print(nltk.pos_tag(words))

################OUTPUT############
[('An', 'DT'), ('sincerity', 'NN'), ('so', 'RB'), ('extremity', 
'NN'), ('he', 'PRP'), ('additions', 'VBZ'), ('.', '.'), ('Her', 
'PRP$'), ('yet', 'RB'), ('there', 'EX'), ('truth', 'NN'), ('merit', 
'NN'), ('.', '.'), ('Mrs', 'NNP'), ('all', 'DT'), ('projecting', 
'VBG'), ('favourable', 'JJ'), ('now', 'RB'), ('unpleasing', 'VBG'), 
('.', '.'), ('Son', 'NNP'), ('law', 'NN'), ('garden', 'NN'), 
('chatty', 'JJ'), ('temper', 'NN'), ('.', '.'), ('Oh', 'UH'), 
('children', 'NNS'), ('provided', 'VBD'), ('to', 'TO'), ('mr', 'VB'), 
('elegance', 'NN'), ('marriage', 'NN'), ('strongly', 'RB'), ('.', 
'.'), ('Off', 'CC'), ('can', 'MD'), ('admiration', 'VB'), 
('prosperous', 'JJ'), ('now', 'RB'), ('devonshire', 'VBP'), 
('diminution', 'NN'), ('law', 'NN'), ('.', '.')]
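
Because the result is an ordinary list of (word, tag) tuples, it can be processed with plain Python. As a small sketch, the code below counts the tags and pulls out the nouns. The tagged list is copied from the first sentence of the output above so that the example runs without NLTK, and the variable names are only illustrative.

```python
from collections import Counter

# First sentence of the tagger output shown above, copied by hand.
tagged = [("An", "DT"), ("sincerity", "NN"), ("so", "RB"), ("extremity", "NN"),
          ("he", "PRP"), ("additions", "VBZ"), (".", ".")]

# Count how often each tag occurs.
tag_counts = Counter(tag for _, tag in tagged)

# All noun tags (NN, NNS, NNP, NNPS) start with "NN".
nouns = [word for word, tag in tagged if tag.startswith("NN")]

print(tag_counts.most_common(1))
print(nouns)
```

The same pattern works for other tag families, for example tag.startswith("VB") for verbs or tag.startswith("JJ") for adjectives.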

7. Bag of Words

So far we have covered tokenizing, stemming, and lemmatizing. All of these are part of text cleaning. After cleaning the text, we need to convert it into some kind of numerical representation, called vectors, so that we can feed the data to a machine learning model for further processing.

To convert the data into vectors, we use some predefined libraries in Python.

Let's see how the vector representation works.

sent1 = he is a good boy
sent2 = she is a good girl
sent3 = boy and girl are good
       |
       |
After removal of stopwords, lemmatization or stemming
sent1 = good boy
sent2 = good girl
sent3 = boy girl good
       | ### Now we calculate the frequency of each word by
       |     counting its occurrences across all sentences
word  frequency
good     3
boy      2
girl     2
       | ## Then, for each sentence, we assign 1 or 0 to every word:
       | ## 1 if the word is present in the sentence and 0 if it is not
       f1    f2    f3
       girl  good  boy
sent1    0    1    1
sent2    1    1    0
sent3    1    1    1

### After this we pass the vector form to the machine learning model

The process above can be carried out with CountVectorizer in Python, which we can import from sklearn.feature_extraction.text.

CODE to implement CountVectorizer in Python

import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

sent = pd.DataFrame(['he is a good boy', 'she is a good girl', 'boy and girl are good'], columns=['text'])
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))  ## Build the set once, not once per word
corpus = []
for sentence in sent['text']:
    words = word_tokenize(sentence)
    texts = [lemmatizer.lemmatize(word) for word in words if word not in stop_words]
    corpus.append(' '.join(texts))
print(corpus)   ## Cleaned data

from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()  ## Creating object for CountVectorizer
X = cv.fit_transform(corpus).toarray()
X  ## Vectorized form

############OUTPUT##############
['good boy', 'good girl', 'boy girl good']
array([[1, 0, 1],
    [0, 1, 1],
    [1, 1, 1]], dtype=int64)
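
To see what is being computed, the same matrix can be reproduced with plain Python. This is a minimal sketch and not scikit-learn's implementation: it assumes the text is already cleaned and split on whitespace, and it orders the columns alphabetically (boy, girl, good), which matches the column order of the array above.

```python
# Cleaned sentences, as printed in the output above.
corpus = ["good boy", "good girl", "boy girl good"]

# Vocabulary: every distinct word, sorted alphabetically.
vocabulary = sorted({word for doc in corpus for word in doc.split()})

# One row per sentence, one column per vocabulary word, values are counts.
vectors = [[doc.split().count(term) for term in vocabulary] for doc in corpus]

print(vocabulary)
print(vectors)
```

CountVectorizer does more than this sketch: by default it also lowercases the text and ignores single-character tokens, and it returns a sparse matrix, which is why the code above calls toarray().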

Congratulations, you now know the basics of NLP.

Thanks for reading, and don't forget to share your thoughts below.
