Overview


The task of the WISE 2015 Challenge is to match user identifiers across different websites. This task underlies many applications, such as opinion mining, online advertising, and information recommendation. Because data from different sources differ in structure and contain different kinds of content, the task is challenging and has no standard solution.

We provide data crawled from three websites in China, namely Renren, 36.cn, and Zhaopin, together with a small number of known mappings between user identifiers. Attendees are required to map the identifiers of unlabeled users that are believed to belong to the same person. The task is open: attendees may use not only the data provided here but also any other openly accessible data.

Data


All data are stored in plain text and each record in every data set is stored in one row. Fields in one row are separated by "\t".

Two example rows are shown below (data schema: Appendix.pdf):

1      1360536135      大学      2002-01-01      南京财经大学         南京财经大学(2002年)

2      1360536135      大学      2010-01-01      中国科学技术大学   MBA
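As a rough illustration of the row format described above, the following Python sketch splits each line on the tab character. The function name `read_records` and the in-memory sample are assumptions made for illustration, and the sketch assumes the gaps between fields in the two example rows are tab characters. It does not interpret the fields; their meaning is defined in the data schema.

```python
import io


def read_records(stream):
    """Yield each non-empty line as a list of string fields split on tabs."""
    for line in stream:
        line = line.rstrip("\r\n")  # drop the line terminator only
        if line:
            yield line.split("\t")


# The two example rows, assuming the gaps between fields are tab characters.
sample = io.StringIO(
    "1\t1360536135\t大学\t2002-01-01\t南京财经大学\t南京财经大学(2002年)\n"
    "2\t1360536135\t大学\t2010-01-01\t中国科学技术大学\tMBA\n"
)

records = list(read_records(sample))
for fields in records:
    # Print the number of fields and the second field of each row.
    print(len(fields), fields[1])
```

For a file on disk, the same function could be given a file object opened in text mode. The correct text encoding is not stated in this document and would need to be checked against the data.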


Data are crawled from the following three websites:

  • Renren (http://www.renren.com/):

    Renren is one of the most popular social networking services in China. Users' public profiles from the website are provided.

  • 36.cn (http://www.36.cn/):

    36.cn is an online job-hunting service in China. Users' open resumes from it are provided.

  • Zhaopin (http://www.zhaopin.com/):

    Zhaopin is another online job-hunting service in China. Users' open resumes are provided.

The combination of a user identifier and a source identifier is used as the identifier of a record in this challenge.


We provide a training data set (train.zip) with known matches stored in the file answer.zip. We provide another testing data set (test.zip) on which attendees of the challenge should find matching pairs of user identifiers.

Data can be accessed from any one of the following three links:

Task


Given the data set, attendees are required to find solutions to automatically match user identifiers from different data sources representing the same person.

For example, there is a user profile record user_id1 in src_id_1, with the form:

==============================================================

{"name":"Steven Paul Jobs", ..., "company":"Apple", ...}.

==============================================================

And there is another user profile record user_id2 in src_id_2, with the form:

==============================================================

{"name":"Steve Jobs", ..., "work_for":"Apple", ...}.

==============================================================

Note that the field names in the above example are added for convenience; the records in the data set we provide do not carry field names.

You are required to find these two records and label them as a match, in the form of:

==============================================================

user_id1:source1     user_id2:source2

==============================================================

Evaluation


Results will be evaluated based on ground truth labeled by experts. Precision and recall are considered in evaluation. The labeled data will be published online after the announcement of winners.
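The exact scoring procedure is not specified here. As a minimal sketch under the standard definitions (precision is the share of submitted matches that are correct, recall is the share of ground-truth matches that were found), the two measures could be computed as below. The function name `precision_recall`, the identifiers in the example, and the choice to treat a match as an unordered pair are all assumptions.

```python
def precision_recall(predicted, truth):
    """Return (precision, recall) for two collections of matched pairs.

    Each pair is treated as unordered, so ("a", "b") equals ("b", "a").
    This is an assumption; the official evaluation may differ.
    """
    pred = {frozenset(pair) for pair in predicted}
    gold = {frozenset(pair) for pair in truth}
    correct = len(pred & gold)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical identifiers in the "user_id:source" form.
predicted = [("u1:A", "u2:B"), ("u3:A", "u9:B")]
truth = [("u2:B", "u1:A"), ("u3:A", "u4:B"), ("u5:A", "u6:B")]

p, r = precision_recall(predicted, truth)
print(f"precision={p:.3f} recall={r:.3f}")
```

In this made-up example one of the two submitted pairs is correct and one of the three true pairs is found.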

Timeline


  • September 3, 2015: Challenge announcement

  • October 5, 2015: Result submission

  • November 1, 2015: Winner announcement

Awards


Reports from winners and runners-up may be recommended to prestigious journals. Students on these teams will be rewarded with internships at Ping An Technology (Shenzhen) Co.,Ltd.

Contact


Email: wisechallenge2015@gmail.com

Co-Chairs:

  • Weining Qian, East China Normal University

  • Qiulin Yu, Ping An Technology (Shenzhen) Co.,Ltd.

Guideline


Attendees should submit both results and a report to the WISE 2015 Challenge contact email address: wisechallenge2015@gmail.com

Attendees are expected to submit the results of a solution in a single file, with each line representing one matched person between source A and source B in the following format (with a "\t" between the two identifiers):

==============================================================

user_id1:source_A     user_id2:source_B

==============================================================
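The line format above could be produced with a short Python sketch such as the following. The helper names `format_match` and `write_matches`, the user identifier `77`, and the source labels are hypothetical and used only for illustration.

```python
import io


def format_match(user_a, source_a, user_b, source_b):
    """Build one result line: two 'user_id:source' identifiers joined by a tab."""
    return f"{user_a}:{source_a}\t{user_b}:{source_b}"


def write_matches(matches, stream):
    """Write one match per line to an open text stream."""
    for match in matches:
        stream.write(format_match(*match) + "\n")


# Hypothetical match written to an in-memory buffer instead of a real file.
buffer = io.StringIO()
write_matches([("1360536135", "source_A", "77", "source_B")], buffer)
print(buffer.getvalue(), end="")
```

Passing a file object opened for writing in place of the buffer would yield the single result file to submit.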

A report describing the solution should be submitted by email together with the results. The report should explain in detail how the attendees completed the challenge tasks and should summarize the results.