Week 2

As week 2 is coming to a close, I feel like we’ve made a lot of progress. Julia and I had to read through and summarize a total of eleven papers this week, and I feel like we more or less understand what we’re going to be working on. The ARC method is exciting, because it’s essentially a new idea that’s understudied, so any experience we have with it will be important to its development.

In addition to our reading, we’ve done some early work for Fernando. We’ve familiarized ourselves with the transcriptions of a few interviews relevant to our line of research, and are in the process of transcribing another interview (which Fernando assigned us mostly so we would understand the process if we need to transcribe again in the future).

I’ve looked over a dataset of ~400 forum posts that had previously been qualitatively tagged, and added word counts to the Excel sheet. This required a little bit of wrangling, because some of the posts were missing spaces after punctuation. I fixed this by using search and replace to change “.” to “. ”, and then, to fix the places that now had two spaces after a period, replacing “.  ” with “. ”. There were a couple of other errors in a similar vein throughout the document, which I assume came from copying and pasting from the forums. After cleaning all of that up, I found a series of Excel commands that can calculate the number of words in a given cell, and copied and pasted that formula into each cell that was supposed to contain a word count. I then calculated the average word counts for each individual forum, each type of forum, and all forums combined on a separate sheet at Fernando’s request.
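If this cleanup ever needs to be repeated, it could probably be scripted rather than done by hand. The following is only a rough sketch in Python, not what I actually did in Excel: the function names and the (forum, text) input format are made up for illustration, and the punctuation rule is a heuristic (it only adds a space after a period when a capital letter follows, so it will still mishandle things like abbreviations).

```python
import re
from statistics import mean


def fix_spacing(text):
    """Insert missing spaces after punctuation and collapse repeated spaces."""
    # Sentence-ending punctuation directly followed by a capital letter.
    text = re.sub(r"([.!?])(?=[A-Z])", r"\1 ", text)
    # Commas, semicolons and colons directly followed by any letter.
    text = re.sub(r"([,;:])(?=[A-Za-z])", r"\1 ", text)
    # Collapse any run of two or more spaces into a single space.
    text = re.sub(r" {2,}", " ", text)
    return text.strip()


def word_count(text):
    """Count whitespace-separated words after fixing the spacing."""
    return len(fix_spacing(text).split())


def average_word_counts(posts):
    """Average word count per forum.

    `posts` is assumed to be a list of (forum_name, post_text) pairs.
    """
    counts_by_forum = {}
    for forum, text in posts:
        counts_by_forum.setdefault(forum, []).append(word_count(text))
    return {forum: mean(counts) for forum, counts in counts_by_forum.items()}
```

Counting words by splitting on whitespace is the reason the missing spaces matter in the first place: without the fix, “sad.Please” would be counted as one word instead of two.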

Fernando mentioned that he would like to automate the collection and tagging of this data, so that a program could pull posts from similar websites and add qualitative tags such as “Seeking emotional support indirectly” or “Seeking information directly”. I think the data we currently have might work well as a training set for a machine learning algorithm that could perform such a task. Given that this algorithm would largely exist to process large amounts of data quickly, I could see allowing the machine to have a “difficult to categorize” option. The idea is that the researchers would intervene in the small percentage of cases where the machine’s estimated probability of a correct guess falls below a certain threshold. I will speak to Dr. Natarajan about it when he has the time (I’m sure there’s literature where someone has used this sort of approach before, I’m just not knowledgeable enough to find it easily).
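To make the “difficult to categorize” idea a bit more concrete, here is a minimal sketch in Python. It assumes that some classifier (not chosen yet) has already produced a probability for each tag on each post. The function names, the input format, and the 0.8 threshold are all placeholders I made up for illustration, not anything we have decided on.

```python
REVIEW_LABEL = "Difficult to categorize"


def assign_tag(tag_probabilities, threshold=0.8):
    """Return the most likely tag, or REVIEW_LABEL if the model is unsure.

    `tag_probabilities` is assumed to map each tag name to the model's
    estimated probability for that tag. The default threshold is arbitrary.
    """
    best_tag = max(tag_probabilities, key=tag_probabilities.get)
    if tag_probabilities[best_tag] < threshold:
        return REVIEW_LABEL
    return best_tag


def triage(predictions, threshold=0.8):
    """Split posts into automatically tagged ones and ones needing review.

    `predictions` is assumed to map a post ID to its tag probabilities.
    """
    auto_tagged = {}
    needs_review = []
    for post_id, tag_probabilities in predictions.items():
        tag = assign_tag(tag_probabilities, threshold)
        if tag == REVIEW_LABEL:
            needs_review.append(post_id)
        else:
            auto_tagged[post_id] = tag
    return auto_tagged, needs_review
```

Raising the threshold would send more posts to the researchers and presumably make the automatic tags more trustworthy, so the right value would depend on how much manual review time is available.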

I’m also not sure where applying this approach would fit into our work over the summer. It’s kind of a separate idea from our use of the ARC method, so it may well not fit into that paper. Right now I feel very motivated, and like I could jump between two related projects if need be, but depending on the amount of effort that needs to be put into the forum data, I may be eating my words on that in a couple of weeks’ time. Overall, I’m hoping to really start diving into the problem solving early next week, now that I feel more or less aware of where our research fits.