Today I want to share some notes on data analysis with Python and a set of libraries and tools, together with the output each example produces.
First of all, I'm not a Python developer, so my code might seem a bit clumsy. But I'm trying to improve, so please
be kind to me :)
In this section I'm going to play a little with Python, pandas and some public datasets. If you are interested,
read on.
1) Install Python. For data analysis it's better to install EPD Free (the Enthought Python Distribution). Go here and get it for your system: https://www.enthought.com/
2) Then install pandas. Go here https://pypi.python.org/pypi/pandas, get
the appropriate version and install it.
Verify that everything is installed and running:
> ipython
> import pandas
If there are no errors, we can go further.
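If you prefer a check you can keep in a script, here is a small sketch. The helper name installed_version is my own invention, not something pandas provides, and it simply assumes that a library exposes a __version__ attribute (pandas and numpy do).

```python
import importlib


def installed_version(module_name):
    """Return a module's version string, or None if it cannot be imported."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    # Not every module defines __version__, so fall back to a placeholder.
    return getattr(module, "__version__", "unknown")


# numpy is listed too because pandas depends on it.
for name in ("pandas", "numpy"):
    print(name, installed_version(name))
```

A None in the output means the import failed and the library still has to be installed.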
First of all, import pandas and load the data. The data (the ml-1m directory, which is the MovieLens 1M dataset) is split across 3 files.
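import pandas as pd
from pandas import DataFrame

# the .dat files have no header row and use '::' as the field separator;
# a multi-character separator needs the Python parsing engine
unames = ['user_id', 'gender', 'age', 'occupation', 'zip']
users = pd.read_table('ml-1m/users.dat', sep='::', header=None, names=unames, engine='python')

rnames = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_table('ml-1m/ratings.dat', sep='::', header=None, names=rnames, engine='python')

mnames = ['movie_id', 'title', 'genres']
movies = pd.read_table('ml-1m/movies.dat', sep='::', header=None, names=mnames, engine='python')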
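If you don't have the files at hand yet, you can try the same loading call on a few lines of text held in memory. This is only a sketch: the three rows below are made up by me in the same '::' layout and are not taken from the real users.dat.

```python
import io

import pandas as pd

# Made-up rows in the 'user_id::gender::age::occupation::zip' layout.
sample = "1::F::25::3::10001\n2::M::35::7::10002\n3::F::18::4::10003\n"

unames = ['user_id', 'gender', 'age', 'occupation', 'zip']

# A multi-character separator is only handled by the Python parsing engine.
users = pd.read_table(io.StringIO(sample), sep='::', header=None,
                      names=unames, engine='python')

print(users.dtypes)
print(users)
```

The names list supplies the column labels, which is needed because the file has no header row of its own.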
Then merge the three different files:
merged_data = DataFrame(pd.merge(pd.merge(users, ratings), movies))
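The merge call above names no key columns, so pandas joins on the columns the two frames have in common (user_id first, then movie_id). Here is a small sketch with made-up frames that spells the keys out and checks that both ways give the same result. The tiny tables and the titles 'Movie A' and 'Movie B' are invented for illustration.

```python
import pandas as pd

# Tiny made-up tables with the same key columns as the real files.
users = pd.DataFrame({"user_id": [1, 2], "gender": ["F", "M"], "age": [25, 35]})
ratings = pd.DataFrame({"user_id": [1, 1, 2], "movie_id": [10, 20, 10], "rating": [5, 3, 4]})
movies = pd.DataFrame({"movie_id": [10, 20], "title": ["Movie A", "Movie B"]})

# Keys inferred from the shared column names, as in the post.
implicit = pd.merge(pd.merge(users, ratings), movies)

# The same joins with the keys written out.
explicit = users.merge(ratings, on="user_id").merge(movies, on="movie_id")

print(explicit)
print(implicit.equals(explicit))
```

Writing the keys explicitly is a bit longer, but it protects you if two tables ever share a column name you did not mean to join on.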
And then we are ready to play with it.
# number of ratings for each rating value
print(merged_data.groupby('rating').size()[:5])

# first 15 titles sorted by name
print(sorted(merged_data.title.unique())[:15])

# mean rating by age and gender
# (older pandas releases spelled these arguments rows= and cols=)
print(merged_data.pivot_table('rating', index='age', columns='gender', aggfunc='mean'))
rating
1     56174
2    107557
3    261197
4    348971
5    226310
['$1,000,000 Duck (1971)', "'Night Mother (1986)", "'Til There Was You (1997)", "'burbs, The (1989)", '...And Justice for All (1979)', '1-900 (1994)', '10 Things I Hate About You (1999)', '101 Dalmatians (1961)', '101 Dalmatians (1996)', '12 Angry Men (1957)', '13th Warrior, The (1999)', '187 (1997)', '2 Days in the Valley (1996)', '20 Dates (1998)', '20,000 Leagues Under the Sea (1954)']
gender         F         M
age
1       3.616291  3.517461
18      3.453145  3.525476
25      3.606700  3.526780
35      3.659653  3.604434
45      3.663044  3.627942
50      3.797110  3.687098
56      3.915534  3.720327
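To see what the grouping and the pivot table actually compute, it helps to run them on data small enough to check by hand. The sketch below uses five made-up ratings (not rows from the dataset) and also builds the same table with groupby and unstack, which is one way to think about what pivot_table does here.

```python
import pandas as pd

# Five made-up ratings, small enough to verify by hand.
df = pd.DataFrame({
    "age": [18, 18, 18, 25, 25],
    "gender": ["F", "F", "M", "F", "M"],
    "rating": [4, 2, 2, 5, 3],
})

# How many times each rating value occurs.
counts = df.groupby("rating").size()

# Mean rating with ages as rows and genders as columns.
pivot = df.pivot_table(values="rating", index="age", columns="gender", aggfunc="mean")

# The same table built step by step: group, average, then move gender into columns.
by_hand = df.groupby(["age", "gender"])["rating"].mean().unstack()

print(counts)
print(pivot)
```

For example, the two ratings from 18-year-old women are 4 and 2, so that cell of the table is their mean, 3.0.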
I should say it's a really powerful tool for rapid prototyping, when you need to slice data and calculate some
statistics. Python and pandas make that easy. I'm going to go deeper into learning pandas.
Check out the git repository for updates.
I should say that the examples are mostly taken from the "Python for Data Analysis" book by Wes McKinney. So
if you are interested, get the book.