STAT 7008 - Assignment 2
Due date: 31 Oct 2018
Question 1 (hashtag analysis)
1. The file tweets1.json corresponds to tweets received before a Presidential
debate in 2016 and all other data files correspond to tweets received
immediately after the same Presidential debate. Please download the
files from the following link:
Please write code to read the data files tweets1.json to tweets5.json
and combine tweets2.json to tweets5.json into a single file, named
tweets2.json. Determine the number of tweets in tweets1.json and
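A minimal sketch of the reading and combining step. It assumes each file stores one JSON-encoded tweet per line, which is not stated in this assignment, and the names parse_tweets, load_tweets and combine_files are illustrative only.

```python
import json


def parse_tweets(lines):
    """Parse an iterable of text lines into a list of tweets.

    Assumption: one JSON object per line. Blank or malformed lines are skipped.
    """
    tweets = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            tweets.append(json.loads(line))
        except json.JSONDecodeError:
            continue  # not valid JSON, ignore this line
    return tweets


def load_tweets(path):
    """Read one tweets file from disk."""
    with open(path, encoding="utf-8") as handle:
        return parse_tweets(handle)


def combine_files(paths, out_path):
    """Write the tweets of all input files to out_path; return their count."""
    # Read everything first: out_path may also be one of the inputs
    # (the combined file here reuses the name tweets2.json).
    tweets = []
    for path in paths:
        tweets.extend(load_tweets(path))
    with open(out_path, "w", encoding="utf-8") as out:
        for tweet in tweets:
            out.write(json.dumps(tweet) + "\n")
    return len(tweets)
```

The number of tweets in a file is then len(load_tweets(path)). All inputs are read into memory before writing, since the output name coincides with one of the input names.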
2. In order to clean the tweets in each file with a focus on extracting
hashtags, we observe that 'retweeted_status' is another tweet within
a tweet. We select tweets using the following criteria:
- A non-empty list of hashtags, either in 'entities' or in its
- There is a timestamp.
- There is a legitimate location.
- Extract hashtags that were written in English or convert hashtags
that were written partially in English (ignore non-English
Write a function to return a dictionary of acceptable tweets, locations
and hashtags for tweets1.json and tweets2.json respectively.
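The selection criteria could be sketched as follows. Several points are assumptions rather than part of the assignment: hashtags are looked up under 'entities' of the tweet and of its 'retweeted_status', a "legitimate" location is taken to be a non-empty 'location' string under 'user', and the English part of a hashtag is taken to be its ASCII letters and digits. The function names are hypothetical.

```python
import re


def english_part(tag):
    """Keep only ASCII letters and digits, lower-cased (may return '')."""
    return re.sub(r"[^A-Za-z0-9]", "", tag).lower()


def hashtags_of(tweet):
    """Collect hashtag texts from the tweet and its nested 'retweeted_status'."""
    found = []
    for source in (tweet, tweet.get("retweeted_status") or {}):
        for item in (source.get("entities") or {}).get("hashtags") or []:
            found.append(item.get("text", ""))
    return found


def clean_tweets(tweets):
    """Return a dict of accepted tweets with their locations and hashtags.

    The three lists are aligned: entry i of each list belongs to tweet i.
    """
    result = {"tweets": [], "locations": [], "hashtags": []}
    for tweet in tweets:
        tags = [t for t in map(english_part, hashtags_of(tweet)) if t]
        tags = list(dict.fromkeys(tags))  # drop duplicates, keep order
        location = ((tweet.get("user") or {}).get("location") or "").strip()
        if not tags or not location or not tweet.get("created_at"):
            continue  # fails at least one of the criteria
        result["tweets"].append(tweet)
        result["locations"].append(location)
        result["hashtags"].append(tags)
    return result
```

Duplicates are removed per tweet because a retweet usually repeats the hashtags of the original tweet, which would otherwise be counted twice.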
3. Write a function to extract the top n tweeted hashtags of a given
hashtag list. Use the function to find the top n tweeted hashtags of
tweets1.json and tweets2.json respectively.
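One possible approach uses collections.Counter from the standard library. The function name top_hashtags is illustrative, and the input is assumed to be a flat list of hashtag strings.

```python
from collections import Counter


def top_hashtags(hashtags, n):
    """Return the n most frequent hashtags as (hashtag, count) pairs."""
    return Counter(hashtags).most_common(n)


print(top_hashtags(["a", "b", "a", "c", "a", "b"], 2))
```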
4. Write a function to return a data frame which contains the top n
tweeted hashtags of a given hashtag list. The columns in the returned
data frame are 'hashtag' and 'freq'.
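A short sketch of such a function, assuming pandas is available and the input is a flat list of hashtag strings. The name top_hashtags_frame is hypothetical.

```python
from collections import Counter

import pandas as pd


def top_hashtags_frame(hashtags, n):
    """Return a data frame with columns 'hashtag' and 'freq' for the top n."""
    counts = Counter(hashtags).most_common(n)
    return pd.DataFrame(counts, columns=["hashtag", "freq"])


frame = top_hashtags_frame(["a", "b", "a", "c", "a", "b"], 2)
```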
5. Use the function to produce a horizontal bar chart of the top n
tweeted hashtags of tweets1.json and tweets2.json respectively.
6. Find the max time and min time of tweets1.json and
tweets2.json respectively.
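A horizontal bar chart can be drawn with matplotlib's Axes.barh. The sketch below assumes a data frame with the columns 'hashtag' and 'freq'. The data are made up, and the function name plot_top_hashtags is illustrative.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
import pandas as pd


def plot_top_hashtags(frame, title):
    """Draw a horizontal bar chart from a frame with 'hashtag' and 'freq'."""
    ordered = frame.sort_values("freq")  # ascending, so the largest bar is on top
    fig, ax = plt.subplots(figsize=(8, 0.4 * len(ordered) + 1))
    ax.barh(ordered["hashtag"], ordered["freq"])
    ax.set_xlabel("freq")
    ax.set_ylabel("hashtag")
    ax.set_title(title)
    fig.tight_layout()
    return fig, ax


# Made-up example data
example = pd.DataFrame({"hashtag": ["debate", "vote", "news"], "freq": [5, 3, 1]})
fig, ax = plot_top_hashtags(example, "Top hashtags (made-up data)")
```

The figure can then be written to disk with fig.savefig or shown with plt.show() under an interactive backend.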
7. Divide each interval (min time, max time) into 10 equally spaced
periods.
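Ten equally spaced periods need eleven edges. A minimal sketch with pandas.date_range, using made-up timestamps and the hypothetical helper name period_edges:

```python
import pandas as pd


def period_edges(min_time, max_time, periods=10):
    """Return periods + 1 equally spaced timestamps from min_time to max_time."""
    return pd.date_range(start=min_time, end=max_time, periods=periods + 1)


# Made-up creation times
times = pd.to_datetime(
    ["2016-09-26 20:00:00", "2016-09-26 20:30:00", "2016-09-26 21:40:00"]
)
edges = period_edges(times.min(), times.max())
```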
8. For a given collection of tweets, write a function to return a data frame
with two columns, hashtags and their time of creation. Use the
function to produce data frames for tweets1.json and
tweets2.json. Using pandas.cut or another method, create a third column
'level' in each data frame that bins the time of creation into the
corresponding periods obtained in part 7.
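The following sketch builds the two-column frame and adds the 'level' column with pandas.cut. It assumes each cleaned tweet is a dict with a 'created_at' string in Twitter's usual layout and a 'hashtags' list of strings. That structure, the column name 'time' and the function names are assumptions made for illustration.

```python
import pandas as pd

# Layout of Twitter's 'created_at', e.g. "Mon Sep 26 20:00:00 +0000 2016"
TIME_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def hashtag_time_frame(tweets):
    """One row per hashtag, with the creation time of its tweet (naive UTC)."""
    rows = []
    for tweet in tweets:
        created = pd.to_datetime(tweet["created_at"], format=TIME_FORMAT, utc=True)
        created = created.tz_convert(None)  # drop the timezone, keep UTC clock time
        for tag in tweet["hashtags"]:
            rows.append({"hashtag": tag, "time": created})
    return pd.DataFrame(rows, columns=["hashtag", "time"])


def add_level(frame, periods=10):
    """Add a 'level' column (1..periods) by cutting 'time' into equal periods.

    Requires at least two distinct times, otherwise the bin edges coincide.
    """
    edges = pd.date_range(
        start=frame["time"].min(), end=frame["time"].max(), periods=periods + 1
    )
    out = frame.copy()
    out["level"] = pd.cut(
        out["time"],
        bins=edges,
        labels=list(range(1, periods + 1)),
        include_lowest=True,  # otherwise the earliest time falls outside the bins
    )
    return out


# Made-up tweets
tweets = [
    {"created_at": "Mon Sep 26 20:00:00 +0000 2016", "hashtags": ["debate", "trump"]},
    {"created_at": "Mon Sep 26 20:30:00 +0000 2016", "hashtags": ["vote"]},
    {"created_at": "Mon Sep 26 21:40:00 +0000 2016", "hashtags": ["trump"]},
]
leveled = add_level(hashtag_time_frame(tweets))
```

Bins are closed on the right, so a time that falls exactly on an inner edge belongs to the earlier period.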
9. Using pandas.pivot or another method, create a numpy array or a pandas
data frame whose rows are the time periods defined in part 7 and whose
columns are hashtags. The entry for the ith time period and jth hashtag
is the number of occurrences of the jth hashtag in the ith time period.
Fill entries without data with zero. Do this for tweets1.json and
tweets2.json respectively.
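As one alternative to pandas.pivot, pandas.crosstab counts the occurrences directly. This sketch assumes a frame with a 'hashtag' column and an integer 'level' column numbered from 1. The data are made up and the function name is hypothetical.

```python
import pandas as pd


def period_hashtag_table(frame, periods=10):
    """Rows: periods 1..periods. Columns: hashtags. Entries: occurrence counts."""
    table = pd.crosstab(frame["level"], frame["hashtag"])
    # Periods without any hashtag are missing from the crosstab; add them as zeros
    return table.reindex(range(1, periods + 1), fill_value=0)


# Made-up data
frame = pd.DataFrame(
    {"hashtag": ["trump", "trump", "vote", "trump"], "level": [1, 1, 3, 10]}
)
table = period_hashtag_table(frame)
```

A single entry is then read with table.loc[period, hashtag], and table.to_numpy() gives the numpy array form.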
10. Following part 9, what is the number of occurrences of hashtag 'trump'
in the sixth period in tweets1.json? What is the number of
occurrences of hashtag 'trump' in the eighth period in tweets2.json?
11. Using the tables obtained in part 9, we can also find the total number
of occurrences for each hashtag. Rank these hashtags in decreasing
order and obtain a time plot for the top 20 hashtags in a single graph.
Rescale the size of the graph so that it is neither too small nor too large.
Do this for both tweets1.json and tweets2.json.
12. The file zip_codes_states.csv contains the city, state, county,
latitude and longitude of US locations. Read the file.
13. Select only the tweets in tweets1.json and tweets2.json whose locations
appear in zip_codes_states.csv. Also remove the location 'london'.
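A sketch of the location filter. The CSV text below is a small stand-in for zip_codes_states.csv: its column names and rows are assumed for illustration and may differ from the real file. Matching is done on lower-cased city names, which is also an assumption.

```python
import io

import pandas as pd

# Stand-in for zip_codes_states.csv (illustrative columns and values)
CSV_TEXT = """zip_code,latitude,longitude,city,state,county
10001,40.75,-73.99,New York,NY,New York
10002,40.71,-73.98,New York,NY,New York
60601,41.88,-87.62,Chicago,IL,Cook
40741,37.13,-84.08,London,KY,Laurel
"""

zips = pd.read_csv(io.StringIO(CSV_TEXT))


def select_locations(locations, zips, exclude=("london",)):
    """Keep the locations whose lower-cased name is a city in zips."""
    known = set(zips["city"].dropna().str.lower()) - set(exclude)
    cleaned = (loc.strip().lower() for loc in locations)
    return [loc for loc in cleaned if loc in known]


kept = select_locations(["New York", " chicago ", "London", "Paris", "NEW YORK"], zips)
```

With the real file, the stand-in would be replaced by pd.read_csv("zip_codes_states.csv").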
14. Find the top 20 tweeted locations in tweets1.json and
tweets2.json respectively.
15. Since there are multiple (lon, lat) pairs for each location, write a
function to return the average lon and the average lat of a given
location. Use the function to generate the average lon and the average
lat for every location in tweets1.json and tweets2.json.
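Averaging the coordinates of one location could look like this. The frame below is made up, and the column names 'city', 'latitude' and 'longitude' are assumed to match the CSV file.

```python
import pandas as pd

# Made-up coordinates for illustration
zips = pd.DataFrame(
    {
        "city": ["New York", "New York", "Chicago"],
        "latitude": [40.0, 41.0, 42.0],
        "longitude": [-74.0, -73.0, -88.0],
    }
)


def average_position(location, zips):
    """Return (average lon, average lat) over all rows whose city matches.

    Returns None when the location does not occur in zips.
    """
    rows = zips[zips["city"].str.lower() == location.lower()]
    if rows.empty:
        return None
    return rows["longitude"].mean(), rows["latitude"].mean()
```

For many locations at once, zips.groupby("city")[["longitude", "latitude"]].mean() computes all averages in a single pass.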
16. Combine tweets1.json and tweets2.json. Then, create data frames
which contain locations, counts, longitude and latitude in tweets1.json
and tweets2.json.
17. Using the shapefile of US states st99_d00 and the help of the website
produce the following
18. (Optional)
Using polygon patches and the help of the website
produce the following
Question 3 (extract hurricane paths)
The website http://weather.unisys.com provides hurricane paths data from
1850. We work to extract hurricane paths for a given year.
1. Since the link that contains the hurricane information varies by year and
the information is spread over multiple pages, we need to know the
starting page and the total number of pages for a given year. What is
the appropriate starting page for year = '2017'?
2. To answer the second part (the total number of pages), we try a large
number as the number of pages for a given year. Using an appropriate
number, write a function to extract all links, each of which holds
information on a hurricane in '2017'.
3. Some of the collected links point to summaries of hurricanes and do
not lead to the required tables. Remove those links.
4. Each valid hurricane link contains four pieces of information:
- Date
- Hurricane classification
- Hurricane name
- A table of hurricane positions over dates
Since all of this information is contained in a text file provided on the
webpage that the link points to, write a function to download and read
the text file without saving it to a local directory (at this moment,
you don’t need to convert the data to another format).
5. With the downloaded contents, write a function to convert the
contents to a list of dictionaries. Each dictionary in the list contains the
following keys: Date, Category of the hurricane, Name of the hurricane
and a table of information for the hurricane path. Convert the Date in
each dictionary to a datetime object. Since the recorded times for the
hurricane paths use Z-time, convert them to datetime objects with the
help of http://www.theweatherprediction.com/basic/ztime/.
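Z-time (Zulu time) is UTC, so the conversion amounts to building a timezone-aware datetime in UTC. The sketch below assumes the time strings have the layout 'MM/DD/HHZ', for example '09/05/18Z' for 18:00 UTC on 5 September. This layout is an assumption about the downloaded tables, and the year has to be supplied separately. The name parse_ztime is hypothetical.

```python
from datetime import datetime, timezone


def parse_ztime(text, year):
    """Convert 'MM/DD/HHZ' (assumed layout) to a timezone-aware UTC datetime."""
    stamp = text.strip().upper().rstrip("Z")
    month, day, hour = (int(part) for part in stamp.split("/"))
    return datetime(year, month, day, hour, tzinfo=timezone.utc)


moment = parse_ztime("09/05/18Z", 2017)
```

A local time can be derived afterwards with moment.astimezone(...) if needed.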
6. We find some missing data in the Wind column of some tables. Since
the classification of a hurricane at a given moment can be found in the
Status column of the same table and the classification also relates to
the wind speed at that moment, use the classification to impute the
missing wind data. You may want to read the following website
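One simple way to use the classification is to fill each missing wind value with the mean wind of the rows that share the same Status. This is a swapped-in, data-driven technique and not necessarily the intended one; a fixed table of wind ranges per classification would be an alternative. The column names 'Wind' and 'Status' follow the text above, while the status labels and numbers below are made up.

```python
import pandas as pd


def impute_wind(table):
    """Fill missing 'Wind' values with the mean wind of rows sharing the 'Status'."""
    out = table.copy()
    # Non-numeric placeholders such as '-' become NaN
    out["Wind"] = pd.to_numeric(out["Wind"], errors="coerce")
    status_mean = out.groupby("Status")["Wind"].transform("mean")
    out["Wind"] = out["Wind"].fillna(status_mean)
    return out


# Made-up table for illustration
table = pd.DataFrame(
    {
        "Status": ["TROPICAL STORM", "TROPICAL STORM", "TROPICAL STORM",
                   "HURRICANE-1", "HURRICANE-1"],
        "Wind": [40, "-", 50, 70, None],
    }
)
filled = impute_wind(table)
```

A status with no observed wind at all stays missing under this approach and would need a fallback value.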
7. Plot the hurricane paths of year '2017', sized by the wind speed and
colored by the classification status.
If you can produce your graph in a creative way, bonus marks will be
8. (Optional)
Convert the above functions into functions of year so that, when the
year changes, you can easily generate a plot of the hurricane paths
for that year.


