Saturday, April 6, 2013

Data Analysis: Python and Pandas

Today I want to share some notes on data analysis with Python and a set of libraries and tools.
First of all, I'm not a Python developer, so my code might seem a bit clumsy. But I'm trying to improve, so please be kind to me :)

In this section I'm going to play a little with Python, pandas and some public datasets. If you are interested, read on.

First of all we need to install the prerequisites: Python and pandas.

1) Install Python. For data analysis it's better to install EPD Free. Get it for your system here: https://www.enthought.com/

2) Then install pandas. Go to https://pypi.python.org/pypi/pandas, get the appropriate version and install it.

Verify that everything is installed and running. Start IPython and import pandas:

> ipython
> import pandas

If there are no errors, we can go further.
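
A slightly more informative check, as a rough sketch: print the pandas version that the interpreter picked up and build a tiny DataFrame to confirm the library works end to end. The values in the frame are made up for illustration and are not part of any dataset.

```python
import pandas as pd

# Report which pandas version was picked up by the interpreter.
print(pd.__version__)

# Build a tiny frame from made-up values to confirm pandas actually works.
check = pd.DataFrame({'a': [1, 2, 3], 'b': [4.0, 5.0, 6.0]})
print(check.shape)
```

If both lines print without an error, the installation is usable for the steps in this post.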

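First of all, import pandas and load the data. The data (the MovieLens 1M dataset, unpacked into the ml-1m directory) is split across 3 files.

import pandas as pd

# engine='python' is used because '::' is a multi-character separator
unames = ['user_id', 'gender', 'age', 'occupation', 'zip']
users = pd.read_table('ml-1m/users.dat', sep='::', header=None, names=unames, engine='python')

rnames = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_table('ml-1m/ratings.dat', sep='::', header=None, names=rnames, engine='python')

# movie titles contain non-ASCII characters, so state the encoding explicitly
mnames = ['movie_id', 'title', 'genres']
movies = pd.read_table('ml-1m/movies.dat', sep='::', header=None, names=mnames, engine='python', encoding='latin-1')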

Then merge the three files into one table:
            # pd.merge already returns a DataFrame, so no extra wrapping is needed
            merged_data = pd.merge(pd.merge(users, ratings), movies)
        
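
To see what the merge is doing, here is a small sketch on made-up frames (the rows and the titles 'Movie A' and 'Movie B' are invented for illustration). When no key is given, pd.merge joins on the columns the two frames have in common, so the implicit form used in this post should match a version with the keys spelled out.

```python
import pandas as pd

# Tiny made-up stand-ins for the three files.
users = pd.DataFrame({'user_id': [1, 2], 'gender': ['F', 'M']})
ratings = pd.DataFrame({'user_id': [1, 1, 2],
                        'movie_id': [10, 20, 10],
                        'rating': [5, 3, 4]})
movies = pd.DataFrame({'movie_id': [10, 20],
                       'title': ['Movie A', 'Movie B']})

# Implicit keys: pandas joins on the shared column names.
implicit = pd.merge(pd.merge(users, ratings), movies)

# Explicit keys: the same joins with the key columns named.
explicit = pd.merge(pd.merge(users, ratings, on='user_id'), movies, on='movie_id')

print(explicit)
```

Naming the keys is a bit more typing, but it makes the join easier to read and guards against two frames accidentally sharing an unrelated column name.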

And then we are ready to play with it. 

            # number of ratings for each rating value
            print(merged_data.groupby('rating').size()[:5])
            
            # first 15 titles in alphabetical order
            print(sorted(merged_data.title.unique())[:15])
            
            # mean rating by age and gender
            # (older pandas releases used rows=/cols= instead of index=/columns=)
            print(merged_data.pivot_table('rating', index='age', columns='gender', aggfunc='mean'))

        
This gives the following output:
rating
1          56174
2         107557
3         261197
4         348971
5         226310
['$1,000,000 Duck (1971)', "'Night Mother (1986)", "'Til There Was You (1997)", "'burbs, The (1989)", '...And Justice for All (1979)', '1-900 (1994)', '10 Things I Hate About You (1999)', '101 Dalmatians (1961)', '101 Dalmatians (1996)', '12 Angry Men (1957)', '13th Warrior, The (1999)', '187 (1997)', '2 Days in the Valley (1996)', '20 Dates (1998)', '20,000 Leagues Under the Sea (1954)']
gender         F         M
age                       
1       3.616291  3.517461
18      3.453145  3.525476
25      3.606700  3.526780
35      3.659653  3.604434
45      3.663044  3.627942
50      3.797110  3.687098
56      3.915534  3.720327
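
The pivot table above is, as far as I understand it, a shortcut for grouping by two columns and then moving one of them into the column headers. A minimal sketch on made-up data (the ages, genders and ratings below are invented, not taken from the dataset):

```python
import pandas as pd

# Made-up ratings, just enough rows to have every age/gender combination.
data = pd.DataFrame({
    'age': [18, 18, 25, 25, 25],
    'gender': ['F', 'M', 'F', 'M', 'M'],
    'rating': [4, 3, 5, 2, 4],
})

# Group by both columns, average the ratings, then turn gender into columns.
by_group = data.groupby(['age', 'gender'])['rating'].mean().unstack('gender')

# The same table built with pivot_table.
pivoted = data.pivot_table(values='rating', index='age', columns='gender', aggfunc='mean')

print(by_group)
print(pivoted)
```

Both tables should have one row per age and one column per gender, with the mean rating in each cell.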
        
I should say it's a really powerful tool for rapid prototyping, when you need to slice data and calculate some statistics. Python and pandas make that easy. I'm going to go deeper into learning pandas.
Check out the git repository for updates. 



I should say that the examples are mostly taken from the "Python for Data Analysis" book by Wes McKinney. So if you are interested, get the book.
