What do the fields in the MovieLens dataset's ratings.dat mean?

The MovieLens 1M Dataset
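GroupLens Research (http://www.grouplens.org/node/73) provides a collection of movie ratings contributed by MovieLens users from the late 1990s to the early 2000s. The data include movie ratings, movie metadata (genres and year), and demographic data about the users (age, zip code, gender, and occupation). Such data are often of interest when building recommendation systems based on machine learning algorithms. While I will not cover machine learning techniques in detail in this book, I will show you how to slice and dice datasets like this into the form you need.
The MovieLens 1M dataset contains 1 million ratings of 4,000 movies from 6,000 users. It is spread across three tables: ratings, user information, and movie information. After extracting the data from the zip file, each table can be loaded into a pandas DataFrame object with pandas.read_table. Because the '::' separator is longer than one character, engine='python' is passed explicitly:
import pandas as pd
unames = ['user_id', 'gender', 'age', 'occupation', 'zip']
users = pd.read_table('ml-1m/users.dat', sep='::', header=None, names=unames, engine='python')
rnames = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_table('ml-1m/ratings.dat', sep='::', header=None, names=rnames, engine='python')
mnames = ['movie_id', 'title', 'genres']
movies = pd.read_table('ml-1m/movies.dat', sep='::', header=None, names=mnames, engine='python')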
Using Python's slice syntax, you can verify that everything loaded correctly by looking at the first few rows of each DataFrame:
In [334]: users[:5]
   user_id gender  age  occupation    zip
0        1      F    1          10  48067
1        2      M   56          16  70072
2        3      M   25          15  55117
3        4      M   45           7  02460
4        5      M   25          20  55455
In [335]: ratings[:5]
   user_id  movie_id  rating  timestamp
0        1      1193       5
1        1       661       3
2        1       914       3
3        1      3408       4
4        1      2355       5
In [336]: movies[:5]
   movie_id                               title                        genres
0         1                    Toy Story (1995)   Animation|Children's|Comedy
1         2                      Jumanji (1995)  Adventure|Children's|Fantasy
2         3             Grumpier Old Men (1995)                Comedy|Romance
3         4            Waiting to Exhale (1995)                  Comedy|Drama
4         5  Father of the Bride Part II (1995)                        Comedy
In [337]: ratings
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000209 entries, 0 to 1000208
Data columns:
user_id      1000209  non-null values
movie_id     1000209  non-null values
rating       1000209  non-null values
timestamp    1000209  non-null values
dtypes: int64(4)
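For reference, each line of ratings.dat has the form UserID::MovieID::Rating::Timestamp, which is where the four column names come from. According to the dataset's README, ratings are whole stars from 1 to 5 and the timestamp is given in seconds since the Unix epoch. The following is a minimal sketch of parsing such lines and converting the timestamp; the two sample lines and the rated_at column name are made up for illustration and are not records from the dataset:

```python
import io
import pandas as pd

# Two illustrative lines in the ratings.dat layout (not real records).
sample = io.StringIO("1::10::5::978307200\n2::10::3::978307260\n")

rnames = ['user_id', 'movie_id', 'rating', 'timestamp']
# A multi-character separator needs the Python parsing engine.
sample_ratings = pd.read_csv(sample, sep='::', header=None,
                             names=rnames, engine='python')

# Interpret the timestamp as seconds since 1970-01-01 UTC.
sample_ratings['rated_at'] = pd.to_datetime(sample_ratings['timestamp'], unit='s')
print(sample_ratings)
```

The same to_datetime call should work on the full ratings DataFrame if you want to analyze ratings over time.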
Note that ages and occupations are coded as integers; see the dataset's README file for what the codes mean. Analyzing data spread across three tables is not a simple task. Suppose you wanted to compute the mean rating of a particular movie by gender and age; this is much easier with all of the data merged into a single table. Using pandas's merge function, we first merge ratings with users and then merge that result with movies. pandas infers which columns to use as the merge (or join) keys from the overlapping column names:
In [338]: data = pd.merge(pd.merge(ratings, users), movies)
In [339]: data
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000209 entries, 0 to 1000208
Data columns:
user_id       1000209  non-null values
movie_id      1000209  non-null values
rating        1000209  non-null values
timestamp     1000209  non-null values
gender        1000209  non-null values
age           1000209  non-null values
occupation    1000209  non-null values
zip           1000209  non-null values
title         1000209  non-null values
genres        1000209  non-null values
dtypes: int64(6), object(4)
In [340]: data.iloc[0]
user_id                                 1
movie_id                                1
rating                                  5
timestamp
gender                                  F
age                                     1
occupation                             10
zip                                 48067
title                    Toy Story (1995)
genres        Animation|Children's|Comedy
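To make the key inference concrete, here is a small sketch on made-up stand-ins for the three tables (ratings_s, users_s and movies_s are hypothetical names, and the rows are invented). It compares the implicit merge used in the text with a version that names the join columns explicitly:

```python
import pandas as pd

# Tiny made-up stand-ins for the three tables.
ratings_s = pd.DataFrame({'user_id': [1, 1, 2],
                          'movie_id': [10, 20, 10],
                          'rating': [5, 3, 4]})
users_s = pd.DataFrame({'user_id': [1, 2], 'gender': ['F', 'M']})
movies_s = pd.DataFrame({'movie_id': [10, 20], 'title': ['Movie A', 'Movie B']})

# Implicit keys: pandas joins on the column names the two frames share.
implicit = pd.merge(pd.merge(ratings_s, users_s), movies_s)

# Explicit keys: same result, but the join columns are spelled out.
explicit = ratings_s.merge(users_s, on='user_id').merge(movies_s, on='movie_id')

print(explicit)
```

Spelling out on= is a reasonable habit once tables share more than one column name, since the implicit form would then join on all of them.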
Now, with a little familiarity with pandas, aggregating the ratings by one or more user or movie attributes is straightforward. To get the mean rating of each movie grouped by gender, we can use the pivot_table method:
In [341]: mean_ratings = data.pivot_table('rating', index='title',
   ....:                                  columns='gender', aggfunc='mean')
In [342]: mean_ratings[:5]
gender                                F         M
$1,000,000 Duck (1971)         3..761905
'Night Mother (1986)           3..352941
'Til There Was You (1997)      2..733333
'burbs, The (1989)             2..962085
...And Justice for All (1979)  3..689024
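As a rough illustration of what pivot_table does here, the sketch below runs it on a few invented rows (data_s is a hypothetical stand-in for the merged table) and checks it against the equivalent groupby followed by unstack. Note that current pandas names the row and column arguments index and columns:

```python
import pandas as pd

# Invented rows standing in for the merged ratings table.
data_s = pd.DataFrame({
    'title': ['Movie A', 'Movie A', 'Movie A', 'Movie B', 'Movie B'],
    'gender': ['F', 'F', 'M', 'F', 'M'],
    'rating': [4, 5, 3, 2, 4],
})

# One row per title, one column per gender, cells hold the mean rating.
pivoted = data_s.pivot_table('rating', index='title',
                             columns='gender', aggfunc='mean')

# The same table built by hand: group, average, then move gender to columns.
grouped = data_s.groupby(['title', 'gender'])['rating'].mean().unstack('gender')

print(pivoted)
```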
This produced another DataFrame containing mean ratings, with movie titles as row labels and gender as column labels. Next, I am going to filter out movies that received fewer than 250 ratings (a completely arbitrary number). To do this, I first group the data by title and use size() to get a Series holding the group size for each title:
In [343]: ratings_by_title = data.groupby('title').size()
In [344]: ratings_by_title[:10]
Out[344]:
$1,000,000 Duck (1971)                37
'Night Mother (1986)                  70
'Til There Was You (1997)             52
'burbs, The (1989)                   303
...And Justice for All (1979)        199
1-900 (1994)                           2
10 Things I Hate About You (1999)    700
101 Dalmatians (1961)                565
101 Dalmatians (1996)                364
12 Angry Men (1957)                  616
In [345]: active_titles = ratings_by_title.index[ratings_by_title >= 250]
In [346]: active_titles
Index(['burbs, The (1989), 10 Things I Hate About You (1999),
       101 Dalmatians (1961), ..., Young Sherlock Holmes (1985),
       Zero Effect (1998), eXistenZ (1999)], dtype=object)
This index holds the titles of the movies with at least 250 ratings, which we can then use to select the corresponding rows from mean_ratings above:
In [347]: mean_ratings = mean_ratings.loc[active_titles]
In [348]: mean_ratings
<class 'pandas.core.frame.DataFrame'>
Index: 1216 entries, 'burbs, The (1989) to eXistenZ (1999)
Data columns:
F    1216  non-null values
M    1216  non-null values
dtypes: float64(2)
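The count-then-filter pattern can be tried on a handful of invented rows. In this sketch data_s, min_ratings and the movie names are all hypothetical, and a threshold of 2 stands in for the 250 used in the text:

```python
import pandas as pd

# Invented rows: Movie A has 3 ratings, Movie B has 1, Movie C has 2.
data_s = pd.DataFrame({
    'title': ['Movie A'] * 3 + ['Movie B'] * 1 + ['Movie C'] * 2,
    'rating': [4, 5, 3, 2, 4, 5],
})
min_ratings = 2  # stand-in for the 250 used in the text

# Number of ratings per title.
counts = data_s.groupby('title').size()

# A boolean mask on the Series selects the matching index labels.
active = counts.index[counts >= min_ratings]

# Use those labels to pick rows out of another object indexed by title.
mean_by_title = data_s.groupby('title')['rating'].mean()
print(mean_by_title.loc[active])
```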
To see the top films among female viewers, we can sort by the F column in descending order:
In [350]: top_female_ratings = mean_ratings.sort_values(by='F', ascending=False)
In [351]: top_female_ratings[:10]
Out[351]:
gender                                                         F         M
Close Shave, A (1995)                                   4..473795
Wrong Trousers, The (1993)                              4..478261
Sunset Blvd. (a.k.a. Sunset Boulevard) (1950)           4..464589
Wallace & Gromit: The Best of Aardman Animation (1996)  4..385075
Schindler's List (1993)                                 4..491415
Shawshank Redemption, The (1994)                        4..560625
Grand Day Out, A (1992)                                 4..293255
To Kill a Mockingbird (1962)                            4..372611
Creature Comforts (1990)                                4..272277
Usual Suspects, The (1995)                              4..518248
This article is translated from Python for Data Analysis (O'Reilly). It covers the following:
1. Computing the mean of a column by group (the mean rating of each movie by gender) with pivot_table()
2. Filtering (selecting the titles with at least 250 ratings)
3. Adding a column
4. Computing the standard deviation
import pandas as pd
import numpy as np
# Load the data (this script sits in the same directory as the data files)
unames = ['user_id', 'gender', 'age', 'occupation', 'zip']
users = pd.read_table('users.dat', sep='::', header=None, names=unames, engine='python')
rnames = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_table('ratings.dat', sep='::', header=None, names=rnames, engine='python')
mnames = ['movie_id', 'title', 'genres']
movies = pd.read_table('movies.dat', sep='::', header=None, names=mnames, engine='python')
data = pd.merge(pd.merge(ratings, users), movies)
# Older pandas versions spelled this call as:
# mean_ratings = data.pivot_table('rating', rows='title', cols='gender', aggfunc='mean')
meant_ratings = pd.pivot_table(data, values='rating',
                               index=['title'], columns=['gender'],
                               aggfunc=np.mean)
Result of pivot_table:
$1,000,000 Duck (1971)
'Night Mother (1986)
'Til There Was You (1997)
'burbs, The (1989)
...And Justice for All (1979)
Note the index, the columns, and the values being aggregated.
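# Count the ratings under each title with size()
ratings_by_title = data.groupby('title').size()
# Select the titles with at least 250 ratings
active_titles = ratings_by_title.index[ratings_by_title >= 250]
# Select the mean ratings for those titles
# (.loc selects rows by label; the .ix indexer from older pandas has been removed)
mean_ratings = meant_ratings.loc[active_titles]
# Pick out the highest-rated movies by sorting on the F column in descending order
# (sort_values replaces the older sort_index(by=...))
top_female_ratings = mean_ratings.sort_values(by='F', ascending=False)
# Measure the gap between male and female ratings by adding a column to mean_ratings
mean_ratings['diff'] = mean_ratings['M'] - mean_ratings['F']
sorted_by_diff = mean_ratings.sort_values(by='diff')
# Reverse the order and take the first 15 rows
sorted_by_diff[::-1][:15]
# Standard deviation of the ratings, grouped by title
rating_std_by_title = data.groupby('title')['rating'].std()
# Keep the titles with at least 250 ratings and sort in descending order
# (a Series has a single column of values, so sort_values needs no `by` argument;
# it replaces the older Series.order method)
rating_std_by_title = rating_std_by_title.loc[active_titles]
rating_std_by_title.sort_values(ascending=False)[:10]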
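The last two steps (ranking by the male/female gap and by rating spread) can be checked on a few invented numbers. In this sketch mean_s and ratings_s are hypothetical stand-ins, and sort_values is used because it is the method available in current pandas:

```python
import pandas as pd

# Invented mean ratings by gender for three titles.
mean_s = pd.DataFrame({'F': [4.5, 3.0, 2.0], 'M': [4.0, 3.5, 3.5]},
                      index=['Movie A', 'Movie B', 'Movie C'])

# Positive diff: rated higher by men; negative diff: rated higher by women.
mean_s['diff'] = mean_s['M'] - mean_s['F']
preferred_by_men = mean_s.sort_values(by='diff', ascending=False)

# Invented individual ratings: everyone agrees on Movie A, not on Movie B.
ratings_s = pd.DataFrame({'title': ['Movie A'] * 3 + ['Movie B'] * 3,
                          'rating': [5, 5, 5, 1, 3, 5]})

# Sample standard deviation per title, most divisive title first.
std_by_title = (ratings_s.groupby('title')['rating']
                .std()
                .sort_values(ascending=False))

print(preferred_by_men)
print(std_by_title)
```

The gap in means says which group liked a movie more, while the standard deviation says how much viewers disagreed regardless of gender, so the two rankings answer different questions.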
From RecSysWiki
MovieLens is a recommender system and virtual community website that recommends films based on user-provided ratings.
Three different datasets from the MovieLens system have been released by the GroupLens research group:
MovieLens 100k, containing 100,000 ratings
MovieLens 1M, containing about 1,000,000 ratings
MovieLens 10M, containing about 10,000,000 ratings, plus tagging information
All datasets also contain item and user attributes, in particular:
the movies' IMDb keys, allowing easy access to more movie attributes using IMDb's plain text data files,
movie release dates and genres
user age, gender, postal code, and occupation (not for MovieLens 10M)
All 3 MovieLens datasets can be used free of charge for research purposes.
The use of the datasets must be acknowledged, and copies of resulting publications must be sent to GroupLens.
Redistribution without explicit permission is not allowed.
All 3 datasets also contain timestamps.
In the following, we focus on the differences between the 3 variants.
MovieLens 100k, the smallest dataset, contains one split for 5-fold cross-validation and two splits with exactly 10 ratings per user, where the test sets are disjoint.
It was collected from September 19th, 1997 to April 22nd, 1998.
The rating file is tab-separated. The other data files are separated by vertical bars (|).
MovieLens 1M contains ratings by users who joined the platform in the year 2000.
All files are separated by double colons (::).
MovieLens 10M, the largest dataset (10,000,054 ratings), contains scripts for generating the same splits as the ones for the 100k variant.
Additionally, there is a file with tagging events.
The file format is identical to MovieLens 1M.
In contrast to the two smaller sets, which have integral ratings from 1 to 5 stars, MovieLens 10M has ratings from 0.5 to 5, with a step size of 0.5.
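The format differences matter mostly at load time. The sketch below reads a tab-separated sample (the 100k rating file layout) and a double-colon sample with half-star values (the 10M style), using invented lines rather than real records; the column names are the ones used earlier in this page, not names defined by the datasets:

```python
import io
import pandas as pd

cols = ['user_id', 'movie_id', 'rating', 'timestamp']

# Illustrative lines only, not real records.
tab_sample = io.StringIO("1\t10\t4\t880000000\n2\t10\t5\t880000060\n")
colon_sample = io.StringIO("1::10::4.5::1200000000\n2::10::0.5::1200000060\n")

# Tab-separated rating file, as in MovieLens 100k.
ratings_100k = pd.read_csv(tab_sample, sep='\t', header=None, names=cols)

# Double-colon separated file, as in MovieLens 1M and 10M.
ratings_10m = pd.read_csv(colon_sample, sep='::', header=None,
                          names=cols, engine='python')

# Whole-star ratings parse as integers, half-star ratings as floats.
print(ratings_100k['rating'].dtype, ratings_10m['rating'].dtype)
```

Code that assumes integer ratings (for example, using the rating as an array index) would need adjusting before it is pointed at the 10M data.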
J. Herlocker, J. Konstan, A. Borchers, J. Riedl: An Algorithmic Framework for Performing Collaborative Filtering. Proceedings of the 1999 Conference on Research and Development in Information Retrieval. 1999.
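A sample Python program for loading and processing MovieLens data
MovieLens is a publicly available data package for training and testing recommender systems.
Before you can test your own recommender system, you need to load this test data into your program and convert it into whatever format your program expects.
The Python source follows (the paths are the ones on my machine; change them to your own):
#!/usr/bin/python
# This program first extracts the movie ids and movie titles from u.item in the 100k package
# and builds a movie dictionary, movies; it then extracts the user id, movie id and
# corresponding rating from u.data and builds a dictionary of each user's ratings.
def loadMovieLens_100k(path='C:/Users/Administrator/Desktop/ci_code/ml-100k'):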
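The body of that loader is not included here, so the following is only a sketch of how such a function might look, not the original author's code. It assumes the documented 100k layouts: u.item is pipe-separated with the movie id and title in the first two fields (read as latin-1), and u.data is tab-separated as user id, movie id, rating, timestamp. The name load_movielens_100k and the shape of the returned dictionary are assumptions made for illustration:

```python
import os
import tempfile

def load_movielens_100k(path):
    """Return {user_id: {movie_title: rating}} from u.item and u.data under `path`."""
    movies = {}
    # u.item is pipe-separated; the first two fields are movie id and title.
    with open(os.path.join(path, 'u.item'), encoding='latin-1') as f:
        for line in f:
            movie_id, title = line.rstrip('\n').split('|')[:2]
            movies[movie_id] = title

    prefs = {}
    # u.data is tab-separated: user id, movie id, rating, timestamp.
    with open(os.path.join(path, 'u.data')) as f:
        for line in f:
            user_id, movie_id, rating, _timestamp = line.rstrip('\n').split('\t')
            prefs.setdefault(user_id, {})[movies[movie_id]] = float(rating)
    return prefs

# Usage with two tiny made-up files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, 'u.item'), 'w', encoding='latin-1') as f:
        f.write("1|Movie A (1995)|01-Jan-1995||http://example.invalid/a|0|1\n")
        f.write("2|Movie B (1996)|01-Jan-1996||http://example.invalid/b|1|0\n")
    with open(os.path.join(tmp, 'u.data'), 'w') as f:
        f.write("7\t1\t4\t880000000\n7\t2\t2\t880000060\n8\t1\t5\t880000120\n")
    prefs = load_movielens_100k(tmp)

print(prefs)
```

Keying the inner dictionary by title keeps the output readable; keying by movie id instead would be safer if two movies ever shared a title.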