Problem Statement

As a data scientist for the marketing division at Reddit, I need to find the most predictive keywords and/or phrases to accurately classify posts from the dating advice and relationship advice subreddits, so we can use them to determine which advertisements should populate on each page. Because this is a classification problem, we'll use Logistic Regression and Naive Bayes models. Misclassifications in this case are fairly harmless, so I will use the accuracy score and set a baseline of 63.3% to rate success. Using TfidfVectorization, I'll get the feature importances to determine which words have the highest predictive power for the target variable. If successful, this model could be used to target other pages that have a similar frequency of the same words and phrases.

Data Collection

See the dating-advice-scrape and relationship-advice-scrape notebooks for this part.

After turning all of the scrapes into DataFrames, I saved them as CSVs, which can be found in the dataset folder of this repo.
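A minimal sketch of the scraping step, assuming Reddit's public JSON listing endpoint and the subreddit/file names shown here (the actual notebooks may differ):

```python
import time

import pandas as pd
import requests


def scrape_subreddit(subreddit, pages=10):
    """Pull successive pages of posts from Reddit's public JSON listing."""
    url = f"https://www.reddit.com/r/{subreddit}/new.json"
    headers = {"User-Agent": "subreddit-classifier-scraper"}
    posts, after = [], None

    for _ in range(pages):
        res = requests.get(url, headers=headers, params={"after": after, "limit": 100})
        res.raise_for_status()
        data = res.json()["data"]
        posts.extend(child["data"] for child in data["children"])
        after = data["after"]
        if after is None:
            break
        time.sleep(1)  # be polite to the API

    return pd.DataFrame(posts)


# Save each scrape as a CSV in the dataset folder (paths are assumptions)
scrape_subreddit("dating_advice").to_csv("dataset/dating_advice.csv", index=False)
scrape_subreddit("relationship_advice").to_csv("dataset/relationship_advice.csv", index=False)
```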

Data Cleaning and EDA

  • dropped rows with a null selftext column because those rows are useless to me.
  • combined the title and selftext columns into one new column called all_text.
  • examined the distributions of word counts for the title and selftext columns per post and compared the two subreddit pages (see the sketch below).
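A minimal sketch of these cleaning steps, assuming the CSVs saved above and Reddit's title/selftext field names (the notebook code may differ):

```python
import pandas as pd

df = pd.concat(
    [
        pd.read_csv("dataset/dating_advice.csv").assign(subreddit="dating_advice"),
        pd.read_csv("dataset/relationship_advice.csv").assign(subreddit="relationship_advice"),
    ],
    ignore_index=True,
)

# Drop rows with a null selftext -- they carry no body text to learn from
df = df.dropna(subset=["selftext"])

# Combine title and selftext into a single all_text column
df["all_text"] = df["title"] + " " + df["selftext"]

# Compare word-count distributions per post across the two subreddits
df["title_word_count"] = df["title"].str.split().str.len()
df["selftext_word_count"] = df["selftext"].str.split().str.len()
print(df.groupby("subreddit")[["title_word_count", "selftext_word_count"]].describe())
```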

Preprocessing and Modeling

Found the baseline accuracy score of 0.633, which means that if I always predict the most frequently occurring class, I will be correct 63.3% of the time.
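A minimal sketch of that baseline calculation, reusing the combined DataFrame (df and its subreddit column are assumptions from the sketches above):

```python
# Baseline accuracy: always predict the majority class
baseline = df["subreddit"].value_counts(normalize=True).max()
print(f"Baseline accuracy: {baseline:.3f}")  # reported as 0.633 on this data
```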

First attempt: logistic regression model with default CountVectorizer parameters. train score: 99 | test 75 | cross val 74. Second attempt: tried CountVectorizer with stemming preprocessing on the first set of scrapes; pretty bad score with high variance. Train 99%, test 72%.
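A minimal sketch of those first two attempts, assuming a scikit-learn Pipeline and NLTK's PorterStemmer for the stemming variant (the exact setup in the notebooks may differ):

```python
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline

X, y = df["all_text"], df["subreddit"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# First attempt: default CountVectorizer + logistic regression
pipe = Pipeline([
    ("cvec", CountVectorizer()),
    ("logreg", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_train, y_train)
print(pipe.score(X_train, y_train), pipe.score(X_test, y_test))
print(cross_val_score(pipe, X_train, y_train, cv=5).mean())

# Second attempt: stem every token before vectorizing
stemmer = PorterStemmer()
base_analyzer = CountVectorizer().build_analyzer()

def stemmed_analyzer(doc):
    return [stemmer.stem(token) for token in base_analyzer(doc)]

stem_pipe = Pipeline([
    ("cvec", CountVectorizer(analyzer=stemmed_analyzer)),
    ("logreg", LogisticRegression(max_iter=1000)),
])
stem_pipe.fit(X_train, y_train)
print(stem_pipe.score(X_train, y_train), stem_pipe.score(X_test, y_test))
```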

  • tried to decrease max features and the score got a lot worse
  • tried lemmatizer preprocessing instead and the test score went up to 74% (see the sketch below)
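A minimal sketch of the lemmatizer variant, assuming NLTK's WordNetLemmatizer plugged into the same analyzer pattern as above (an assumption, not the notebooks' exact code):

```python
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

lemmatizer = WordNetLemmatizer()  # requires nltk.download("wordnet")
base_analyzer = CountVectorizer().build_analyzer()

def lemmatized_analyzer(doc):
    return [lemmatizer.lemmatize(token) for token in base_analyzer(doc)]

lem_pipe = Pipeline([
    ("cvec", CountVectorizer(analyzer=lemmatized_analyzer)),
    ("logreg", LogisticRegression(max_iter=1000)),
])
lem_pipe.fit(X_train, y_train)
print(lem_pipe.score(X_test, y_test))
```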

Simply increasing the amount of data and stratifying y in my train/test split increased my CountVectorizer test score to 81 and cross val to 80. Adding 2 parameters to my CountVectorizer helped a lot: a min_df of 3 and an ngram_range of (1,2) increased my test score to 83.2 and cross val to 82.3. However, these scores still showed high variance.
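A minimal sketch of those two changes, stratified splitting plus the extra CountVectorizer parameters (same assumed names as above):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline

# Stratify on y so both subreddits appear in the same proportion in train and test
X_train, X_test, y_train, y_test = train_test_split(
    df["all_text"], df["subreddit"], stratify=df["subreddit"], random_state=42
)

pipe = Pipeline([
    ("cvec", CountVectorizer(min_df=3, ngram_range=(1, 2))),
    ("logreg", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_train, y_train)
print(pipe.score(X_test, y_test))
print(cross_val_score(pipe, X_train, y_train, cv=5).mean())
```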

I believe Tfidf worked best to reduce my overfitting/variance issue because I customized the stop words to take out the ones that were really too frequent to be predictive. This was a success, but with more time I probably could have tweaked them even more to improve all scores. Looking at both single words and words in groups of two (bigrams) was the best param that gridsearch recommended; however, most of my top predictive terms ended up being unigrams. My initial list of features had plenty of gibberish terms and typos. Setting the minimum number of times a word was required to show up to 2 helped get rid of those. Gridsearch also suggested a 90% max_df rate, which helped to eliminate oversaturated terms as well. Finally, setting max features to 5000 cut my columns down to about a quarter of what they had been, to focus on only the most frequently used terms of what was left.
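A minimal sketch of that tuning step, reusing the stratified split above and assuming a TfidfVectorizer pipeline, a small custom stop-word list, and a GridSearchCV over the parameters mentioned (the specific stop words and grid values are illustrative assumptions):

```python
import pandas as pd
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Custom stop words: the built-in English list plus terms too frequent to be predictive
custom_stop_words = list(ENGLISH_STOP_WORDS) + ["relationship", "boyfriend", "girlfriend"]  # illustrative

pipe = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words=custom_stop_words)),
    ("logreg", LogisticRegression(max_iter=1000)),
])

params = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "tfidf__min_df": [2, 3],
    "tfidf__max_df": [0.9, 0.95],
    "tfidf__max_features": [5000, 10000],
}
gs = GridSearchCV(pipe, params, cv=5)
gs.fit(X_train, y_train)
print(gs.best_params_, gs.score(X_test, y_test))

# Most predictive terms: largest-magnitude logistic regression coefficients
best = gs.best_estimator_
coefs = pd.Series(
    best.named_steps["logreg"].coef_[0],
    index=best.named_steps["tfidf"].get_feature_names_out(),
).sort_values()
print(coefs.head(10))  # strongest pull toward one subreddit
print(coefs.tail(10))  # strongest pull toward the other
```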

Summary and Recommendations

Even though I would like to have higher train and test scores, I was able to successfully lower the variance, and there are definitely a few terms that have high predictive power, so I think the model is ready to launch as a test. The same keywords could be used to find other potentially lucrative pages if advertising engagement increases. I found it interesting that taking out the overly used terms helped with overfitting but brought the accuracy score down. I think there is probably still room to play around with the parameters of the Tfidf Vectorizer to see if different stop words produce a different or better result.

About

Used Reddit’s API, the requests library, and BeautifulSoup to scrape posts from two subreddits, Dating Advice & Relationship Advice, and trained a binary classification model to predict which subreddit a given post originated from.
