What it will do is parse the HTML tags, find all the reviews, and store them inside an array. So, using this function, BeautifulSoup, with the HTML and html.parser, I am going to parse the files. This will take a moment. Once we get a number displayed here, it means that the code has executed successfully. It is still running, so we'll have to wait.

Okay, some of you are asking how we would know that ellipsis_text is part of the HTML. Let me just open a random HTML file and view the page source. I'm going to search for ellipsis_text. If you see here, under the class ellipsis_text we have all the reviews. The same tag and the same class have been used across all the HTML files to store the reviews, and the itemprop for this particular review is description.

So whenever the attribute itemprop with the value description and the attribute class with the value ellipsis_text are both present, we find all those values in the code and parse them here. Now the parsing has executed.

For those who are asking whether Beautiful Soup is a Python library: yes, it is a Python library. And Jupyter is open source software. You need to download Anaconda, and you can download the latest version. Jupyter is basically an interactive interface where you can run your raw Python code and look at the results then and there, rather than running it from the back end.

So what have we done now? We have parsed all the reviews into this reviews array. If you look at reviews, we got all the reviews populated here. I'm just going to delete it. Similarly, we are going to populate all the ratings. A rating has the attribute itemprop, and ratingValue is the value of that attribute. So we are going to parse all the ratings and store them inside the ratings array.
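As a rough illustration of the parsing step just described, here is a minimal sketch that runs on a small inline HTML string rather than the real review files. The markup, the tag names (p and span), and the sample texts are assumptions made up for the example. The ratingValue spelling is also an assumption based on the spoken "rating value". Only the class ellipsis_text and the itemprop description come from the walkthrough.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for one downloaded review page (not the real markup).
html = """
<div>
  <span itemprop="ratingValue">5.0</span>
  <p class="ellipsis_text" itemprop="description">Great phone.</p>
  <span itemprop="ratingValue">3.0</span>
  <p class="ellipsis_text" itemprop="description">Average battery.</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Every tag that carries both the review class and the description itemprop.
reviews = soup.find_all(attrs={"class": "ellipsis_text", "itemprop": "description"})

# Every tag whose itemprop marks a rating value (attribute value assumed).
ratings = soup.find_all(attrs={"itemprop": "ratingValue"})

print(len(reviews), len(ratings))
```

At this point both lists still hold whole tags, not plain text, which is why a cleaning step is needed afterwards.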

We can also check how the ratings are displayed here. Here we got all the ratings: 5.0, 3.0, 4.0, etc.

Okay, so the question: is a GPU needed, or can we use a CPU? We can make use of the CPU for running this. As for the libraries, we need to download all of them. You can download a library using pip install followed by the library name. For Beautiful Soup, the name of the library is bs4.

Now we are going to add a library called re. What this library does is handle regular expressions in the text. We have a small function here to clean the HTML. If you look at this function, we are doing a compilation, re.compile, of this pattern: less than, dot, star, question mark, greater than, that is, <.*?>. This is how an HTML tag usually looks, isn't it? So we are going to take everything that sits inside an HTML tag and remove it.
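The tag-stripping idea can be sketched with the standard re module as below. The function name clean_html, the constant name, and the sample string are assumptions for illustration. The pattern <.*?> is the one read out above.

```python
import re

# Non-greedy pattern: "<", then as few characters as possible, then ">".
TAG_PATTERN = re.compile(r"<.*?>")


def clean_html(raw_html):
    """Return the input as text with anything that looks like an HTML tag removed."""
    return TAG_PATTERN.sub("", str(raw_html))


# Hypothetical rating tag, as it might look before cleaning.
sample = '<span itemprop="ratingValue">5.0</span>'
print(clean_html(sample))
```

The question mark matters: without it the pattern would be greedy and would swallow everything from the first "<" to the last ">", including the text in between. Note that the dot does not match a newline by default, so a tag that spans several lines would not be removed by this sketch.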

After this cleaning, we'll get only the ratings and the reviews as our raw data output. So we are cleaning the HTML here. If you look at this, there is an array called rates. Inside it we are going to store all the ratings: we pass the ratings, clean all the rating content, and take only the values. 5.0, 4.0, 1.0, 2.0: these are the values that will get stored inside rates. If you look at rates now, previously, when we were looking at the ratings, we had HTML tags with the ratings in between them. Now it has been cleaned up and we have just the rating values alone.

Similarly, we are going to do the same cleaning of the HTML on the reviews. Once we clean the HTML tags in the reviews, our reviews look like this, so we get the text directly. If there are any other special characters, like the \n character we can see here, the newline character, we need to clean those as well. Let us see how we are going to clean that.

Now we are going to import pandas. pandas is a library that is included by default with Anaconda, and it is a great library for handling all kinds of data. What this library can do is convert the data into a table format, which is called a DataFrame.

We can do our further processing and everything on that table. So what we are going to do now is create a data frame called reviews. pd.DataFrame is the function we are using to create a data frame. Inside that we have reviews as one column and ratings as the other column. For the reviews column we are going to store the reviews array which we created in the previous step, and for the ratings column we are going to store the rates array which we created in step 16 here. Let us run this and see what happens.

Now that I have executed reviews, let us see what is inside it. We got two columns: one is ratings and the other one is reviews. But we still did not handle the \n special character in the previous HTML cleaning step, so we need to handle that as well.

Yes, Satya, we are going to convert this string into a float. I'll explain it in the next steps.
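Building the two-column table might look roughly like the sketch below. The sample values and the list name reviews_text are made up for the example. The column names ratings and reviews and the data frame name reviews follow the walkthrough.

```python
import pandas as pd

# Hypothetical cleaned values standing in for the real arrays.
reviews_text = ["Great phone.\n", "Average battery.\n", "Great phone.\n"]
rates = ["5.0", "3.0", "5.0"]

# One column per list; both lists must have the same length.
reviews = pd.DataFrame({"ratings": rates, "reviews": reviews_text})

print(reviews.shape)
print(reviews.columns.tolist())
```

The ratings are still strings at this stage, because they were cut out of the HTML as text.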

Okay, so first we'll replace all these \n characters; we'll remove them from the data. Next we'll drop all the duplicates. If there are any duplicate records inside the data set, we don't need them for our model building process, so we are just going to drop them using this function called drop_duplicates. Then let us count the number of reviews: it is 992 reviews in total. Previously, when we did a count, it was 996, which means there were four duplicate records which got removed after dropping the duplicates.

Now let us look at how our data looks: reviews.head. When you use head and give a number inside it, you can see that many records of your data set. So now we are looking at the first five records with ratings and reviews. In order to do further manipulations on the data, first we need to see whether we are using the correct data types in our data.
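The clean-up steps above can be sketched on a tiny made-up data frame as follows. The sample rows are assumptions. Replacing each newline with a space and then trimming, rather than deleting it outright, is also a choice made for this example so that words on either side of a line break do not run together.

```python
import pandas as pd

# Hypothetical data with newline characters and one duplicated record.
reviews = pd.DataFrame({
    "ratings": ["5.0", "3.0", "5.0"],
    "reviews": ["Great phone.\n", "Average\nbattery.", "Great phone.\n"],
})

# Replace newline characters with a space, then trim leftover whitespace.
reviews["reviews"] = (
    reviews["reviews"].str.replace("\n", " ", regex=False).str.strip()
)

# Drop rows that are exact duplicates of an earlier row.
reviews = reviews.drop_duplicates()

print(len(reviews))      # number of records left after de-duplication
print(reviews.head(5))   # first five records (fewer if the table is smaller)
print(reviews.dtypes)    # data type of each column
```

Comparing the count before and after drop_duplicates is a quick way to see how many duplicate records were removed.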

I am looking at reviews.dtypes; this is an attribute which will give you the data types. The first column, ratings, is of type object, but it is actually a number; as we can see, it is a float number. And when we look at the second column, reviews, it is text, so it being an object is fine because it is still handled as a string. First let us convert ratings to type float. We're just going to convert the data type and then look at the data types again. So now we have ratings as float and reviews as object.

Okay, so when we see the process flow here, we have completed loading the HTML data, parsing it using Beautiful Soup, and converting the data into a data frame. Calculating the sentiment is the next step we are going to look at. How are we going to calculate the sentiment? We are not going to use any complex steps here; we'll just keep it simple. We are going to look at the ratings which are less than 4.0. If the rating is less than 4.0, that is, a rating from 1 to 3.5, it will be considered negative sentiment, and all the ratings which are greater than or equal to 4 are going to be considered.
