I mean, yeah, ratings greater than or equal to 4 are going to be considered as positive, all right. So we'll run this function first, yeah, and then we are going to apply this calculate_sentiment function directly on the data. This is the best part of the Python pandas library: we can just write a single function and apply it directly using the apply function, which is available in pandas, and convert the data according to our requirement. What I am doing here is creating one more new column called sentiment inside the reviews data frame and then applying calculate_sentiment on it. What this calculate_sentiment will do is check the rating: if the rating is less than 4 it will mark it as negative, and if it is greater than or equal to 4 it will mark it as positive. So let us run this, and now if we check the contents of our reviews data set, we have one more column called sentiment added at the end. If the review is negative it will be 0, and if it is positive it will be 1.
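Here is a minimal sketch of the idea described above, assuming the data frame is called reviews, the rating column is called rating, and the new column is called sentiment (those names are assumptions, not taken from the video):

```python
import pandas as pd

def calculate_sentiment(rating):
    # Ratings of 4 and above are treated as positive (1), anything below 4 as negative (0)
    return 1 if rating >= 4 else 0

# reviews is the data frame built in the earlier steps; apply() runs the
# function on every rating and stores the result in a new sentiment column
reviews["sentiment"] = reviews["rating"].apply(calculate_sentiment)
```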

For the safer side, what I have done here is store this complete reviews data frame into a CSV file, so in case you want to make use of the same file for training in the future, you don't have to rerun the previous steps; all of the previous steps can be avoided and you can directly use this data to train the model. You do this with reviews.to_csv, and a CSV file will be created in the folder where we are running the code, where we are running the Jupyter notebook. If I just open this, the complete data has been stored inside a CSV file and all the records are available here. We can directly use this by making use of the pandas read_csv function and read the data directly, instead of doing all the parsing with BeautifulSoup and all of the previous steps. Okay, so now we are going to do the actual part of building a machine learning model.
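A rough sketch of this save-and-reload step, with the file name chosen only as an example:

```python
# Save the processed reviews so the scraping and parsing steps
# do not have to be repeated later
reviews.to_csv("reviews.csv", index=False)

# Later, the same data can be loaded back directly with read_csv
reviews = pd.read_csv("reviews.csv")
```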

Machine learning models are basically supervised and unsupervised learning models. They train on the data by identifying the patterns in it, learn from that training what is inside the data, and based on that, when we give a new data sample as input, they will classify it according to our requirements. I'm not going to explain the machine learning models as such in detail, but this is the concept you can remember, and I will tell you which libraries contain all the different machine learning models and how you can use them in NLP. Okay, so now we are going to store the reviews and the sentiment. We are not going to use the ratings column here; we are just going to use the review column and the sentiment column, and we are going to store them into two different arrays: the text variable will store all the reviews and the y variable will store all the sentiments. Once we store them, if I just type text here and check what the type of this text is, it will be a pandas Series; similarly, y also will be stored as an array, a Series. Okay, so now the next step.
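Something like the following, assuming the columns are named review and sentiment, would produce the two Series described here:

```python
# Keep only the feature (review text) and the label (sentiment)
text = reviews["review"]
y = reviews["sentiment"]

print(type(text))  # <class 'pandas.core.series.Series'>
print(type(y))     # <class 'pandas.core.series.Series'>
```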

We are going to split the data into train and test at a ratio of 70 to 30. What do we mean by this, and why do we need to split the data? Now that we have our complete data set with 992 records, and all 992 are reviews, we need to feed an input to the machine learning model from which it will learn: looking at the patterns of the data, at what kind of text is used in each review and how many times it has been used, and based on that, how a review is classified as a positive or negative sentence. This is what the machine learning model is going to learn from our data. In order to do that, we first need to train the model and then test the model. For training the model we are going to use a library called scikit-learn, and within that we are going to use a module called model_selection.

We'll import the function train_test_split, so we don't have to split the data into 70 and 30 manually. Given a random seed, the data will be picked from the data set at random and split into the train and test sets you just saw, at whatever ratio we give as input here. The test size we have mentioned here is 0.33, which means 33% of the data is going to be test data and the remaining 67% is going to be train data. So let us run this, and once we run it, let us explore what is inside the train and test data. We can directly run this and see what is inside text_train. If I just run it and look at the index on the left side, it is not in a proper sequence, because it is a random split; it will not be 0, 1, 2, 3, etc., it is taken on a random basis. Similarly, if you look at the test data, it is also split randomly, so the data is split randomly in the test set as well.
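A minimal sketch of this split, with random_state chosen arbitrarily just to make the example repeatable:

```python
from sklearn.model_selection import train_test_split

# test_size=0.33 keeps 33% of the rows for testing, roughly a 70/30 split
text_train, text_test, y_train, y_test = train_test_split(
    text, y, test_size=0.33, random_state=42
)

# The indices of text_train appear out of order because rows are sampled randomly
print(text_train.head())
```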

Okay, now let us explore the content: how many records are inside the train set, and how many of them are positive and how many are negative. So let us just look at it. The unique values inside y_train are 0 and 1, where 0 is for negative and 1 is for positive. Similarly, the unique values inside y_test are also 0 and 1, so we have both the positive and negative sentiments split across both the train and test data. This is very important, because we need a proper distribution of the classes inside the data; only then will we be able to train the model effectively. Okay, so the number of samples per sentiment we can see by running this: we have 136 negative sentiments and 528 positive sentiments in the train data, and similarly in the test data we have 68 negative sentiments and 260 positive sentiments. So at the ratio of 70 to 30, with the available data, the split has turned out pretty decent. Let us proceed further and extract the features from this data. If you remember, I was explaining about the bag of words model, where we create a corpus of the unique words used across the data set, and based on that corpus we create the features in the data by building a sparse n-by-m matrix out of it.
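The class distribution can be checked with something like this (value_counts is one way to get the per-sentiment counts mentioned above):

```python
import numpy as np

# Both classes should be present in both splits
print(np.unique(y_train))  # [0 1]
print(np.unique(y_test))   # [0 1]

# Number of samples per sentiment in each split
print(y_train.value_counts())  # e.g. 1: 528, 0: 136
print(y_test.value_counts())   # e.g. 1: 260, 0: 68
```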

So another technique which we can use here is TF-IDF. TF-IDF is known as term frequency inverse document frequency. So what do we mean by term frequency inverse document frequency? If you look at the formula here, we calculate the TF (term frequency) first and then multiply it with log((n + 1) / (n_w + 1)) + 1. To put this in simple terms, TF means the number of times term t appears in a document divided by the total number of terms in the document. The IDF, the inverse document frequency, is the log of the total number of documents divided by the number of documents with the term t in them. This will basically create a weight for each word. So for instance, if I am using the word happy, there will be a weightage for the word happy in my review.
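As a rough illustration of the formula described above (this is the smoothed variant that scikit-learn's TfidfVectorizer uses by default; the helper names here are my own):

```python
import math

def tf(term, document_tokens):
    # number of times the term appears divided by total terms in the document
    return document_tokens.count(term) / len(document_tokens)

def idf(term, all_documents):
    n = len(all_documents)                                # total number of documents
    n_w = sum(1 for doc in all_documents if term in doc)  # documents containing the term
    return math.log((n + 1) / (n_w + 1)) + 1

def tf_idf(term, document_tokens, all_documents):
    # weight of a term in one document
    return tf(term, document_tokens) * idf(term, all_documents)
```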

If I'm using the word dissatisfied, there will be a weightage for it; if I'm using the word good, there will be a weightage for it, depending on this TF-IDF formula. So do we need to write this particular function from scratch? No, we don't have to write it. We again have it inside scikit-learn, in the feature extraction text module, and we can import TfidfVectorizer to convert our data. So let us run this. What this will do is the following: in the bag of words model I mentioned that we will create a sparse matrix of the data, where the number of reviews will be on the n axis and the number of features (unique words) will be on the m axis, and we fill it with the number of occurrences, so it is an n-by-m sparse matrix. The same thing will happen in TF-IDF as well. So let us see that. Here we are going to use the TfidfVectorizer, and we have parameters like the minimum document frequency and the norm for normalization. For whatever parameters we have here, you can just select the function, press Shift + Tab, and then click on the plus sign to see the explanation for each parameter.
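A minimal sketch of this vectorization step; the min_df and norm values here are only example settings, not necessarily the ones used in the video:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(min_df=5, norm="l2")

# Learn the vocabulary from the training reviews and transform both splits
# into sparse n-by-m TF-IDF matrices (n reviews, m unique words)
X_train = vectorizer.fit_transform(text_train)
X_test = vectorizer.transform(text_test)

print(X_train.shape)  # (number of training reviews, vocabulary size)
```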
