Tutorial: MTurk using Python in Jupyter Notebook

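IPython notebooks are a powerful tool for data scientists, who use the Jupyter Notebook interface to analyze data and train machine learning models. In this tutorial, we'll explain how to use MTurk to annotate training data, all from within the Jupyter application. To get started quickly, you can download the notebook and HTML for this tutorial.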

Jupyter Notebook setup

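Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations, and narrative text. It was born out of the IPython project, and while it is still most commonly used with Python, it supports over 40 languages including R, Julia, and Scala.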

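The easiest way to get started with Jupyter is to install the Anaconda distribution of Python 3. It includes Python, Jupyter Notebook, and a number of other useful packages. If you prefer not to install Anaconda, you can instead follow the Jupyter installation instructions for pip.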

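Once you've installed Jupyter, simply run the command jupyter notebook from a terminal or command prompt in the directory where you want to store your notebooks and assets. This will launch Jupyter on your computer and open a web browser window.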

If you open Jupyter in the same directory where you stored the sample notebook and HTML, you'll see those files in the list.

Within each notebook you'll have the ability to define cells. When working in Python, the most common cell types are code and markdown. Code cells are where you place the Python code you want to run, and markdown cells are used for providing descriptions of the code and the steps you are taking.

To run a code cell, select the cell and then select the Run Cell button or press Control-Enter to execute it. Any output from the code in that cell will appear below the cell.

Account setup

If you haven't already, you'll need to set up MTurk and AWS accounts that are linked together in order to use MTurk with Python. The MTurk account will be used to post tasks to the MTurk crowd, and the AWS account will be used to connect to MTurk via the API and to provide access to any additional AWS resources that are needed to execute your task.

  1. If you don't have an AWS account already, visit https://aws.amazon.com and create an account you can use for your projects.
  2. If you don't have an MTurk Requester account already, visit https://requester.mturk.com and create a new account.

After you've set up your accounts, you'll need to link them together. While logged into both the root of your AWS account and your MTurk account, visit https://requester.mturk.com/developer to link them together.

Configure your profile

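To call MTurk, you'll need to configure your computer with a profile that has the right credentials. To get started, create a new AWS IAM User or select an existing one you plan to use. Add the AmazonMechanicalTurkFullAccess policy to your user. Then select the Security Credentials tab and create a new access key, copying the Access Key ID and Secret Access Key for future use.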

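The easiest way to configure your computer with this account is to install the AWS Command Line Interface (CLI). You can install it by typing pip install awscli at the command line. After it is installed, you can run aws configure --profile mturk to configure an mturk profile on your computer that will be used when calling the API. When prompted, provide the Access Key ID and Secret Access Key you captured above. For the region, you can enter "us-east-1" and leave the output format as "None".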

pip install awscli
aws configure --profile mturk
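If you'd like to confirm from Python that the profile was written, a minimal sketch along these lines may help. It assumes the AWS CLI stored the keys in its default shared credentials file (~/.aws/credentials); the helper name has_profile is made up for this illustration and is not part of any AWS library.

```python
import configparser
from pathlib import Path


def has_profile(credentials_path, profile="mturk"):
    """Return True if the credentials file has a complete entry for `profile`.

    Hypothetical helper for illustration only.
    """
    # Disable interpolation so unusual characters in keys are never expanded.
    parser = configparser.ConfigParser(interpolation=None)
    # ConfigParser.read() silently skips files that do not exist.
    parser.read(credentials_path)
    return (
        parser.has_section(profile)
        and parser.has_option(profile, "aws_access_key_id")
        and parser.has_option(profile, "aws_secret_access_key")
    )


# Assumption: the CLI wrote to its default location.
default_path = Path.home() / ".aws" / "credentials"
print(has_profile(default_path))
```

If this prints False, re-running aws configure --profile mturk is usually the quickest fix.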

More libraries to install

As a final step, we'll install the boto3 and xmltodict libraries. The boto3 package is an easy-to-use Python library for accessing AWS. The xmltodict library makes it much easier to work with the XML data returned by MTurk.

pip install boto3
pip install xmltodict
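Before opening a notebook, it can be handy to check that both packages are visible to the Python interpreter Jupyter will use. The sketch below relies only on the standard library; missing_packages is a hypothetical helper name chosen for this example.

```python
import importlib.util


def missing_packages(names):
    """Return the subset of top-level package names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# An empty list means both libraries are importable in this environment.
print(missing_packages(["boto3", "xmltodict"]))
```

Running this inside a notebook cell checks the notebook's own kernel, which is not always the same interpreter that pip installed into.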

Getting started in Jupyter

Now that we've installed all of the necessary tools, we can get started with our first notebook. Start by creating a new Python 3 notebook in Jupyter or opening the sample notebook you downloaded earlier.

First we'll import the boto3, xmltodict, and json packages.

import boto3
import xmltodict
import json

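Next, we'll create an MTurk client to use for making requests. MTurk has two environments you can work in. The production environment is for posting tasks that you want Workers to complete. The sandbox is a testing environment you can use to test your tasks before posting them to Workers. There is no cost to use the sandbox, but because it is only for testing, items posted there will not be completed unless you complete them yourself.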

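The following code sets up a client for the appropriate environment based on the value of create_hits_in_production. Note that in the session variable below, we reference the mturk profile we created earlier. If you are using the default profile, the profile_name argument can be left out.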

create_hits_in_production = False
environments = {
    "production": {
        "endpoint": "https://mturk-requester.us-east-1.amazonaws.com",
        "preview": "https://www.mturk.com/mturk/preview"
    },
    "sandbox": {
        "endpoint": "https://mturk-requester-sandbox.us-east-1.amazonaws.com",
        "preview": "https://workersandbox.mturk.com/mturk/preview"
    },
}
# Pick the environment; the sandbox is used unless production is requested
mturk_environment = environments["production"] if create_hits_in_production else environments["sandbox"]
session = boto3.Session(profile_name='mturk')
client = session.client(
    service_name='mturk',
    region_name='us-east-1',
    endpoint_url=mturk_environment['endpoint'],
)
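Because the only difference between testing and spending real money is which endpoint gets picked, some readers may want a small guard around that choice. The following is a sketch under that assumption; choose_environment and require_sandbox are hypothetical helpers, not part of boto3, and only the endpoint URLs come from the tutorial.

```python
ENVIRONMENTS = {
    "production": {
        "endpoint": "https://mturk-requester.us-east-1.amazonaws.com",
    },
    "sandbox": {
        "endpoint": "https://mturk-requester-sandbox.us-east-1.amazonaws.com",
    },
}


def choose_environment(in_production):
    """Return the environment settings for production or the sandbox."""
    return ENVIRONMENTS["production" if in_production else "sandbox"]


def require_sandbox(environment):
    """Raise if the chosen endpoint is not the sandbox one (illustrative guard)."""
    if "sandbox" not in environment["endpoint"]:
        raise RuntimeError("Endpoint is not the sandbox: " + environment["endpoint"])
    return environment


# While developing, fail fast if production was selected by mistake.
environment = require_sandbox(choose_environment(in_production=False))
print(environment["endpoint"])
```

Dropping the require_sandbox call is then a deliberate step you take when you are ready to publish to real Workers.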

Once you've created your client, you can check that it's set up correctly by getting your current MTurk account balance. Note that in the Sandbox environment your balance is always $10,000.

print(client.get_account_balance()['AvailableBalance'])

Define your task

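For this project, we're going to collect sentiment ratings for a set of tweets that we plan to train a model to evaluate. We'll create an MTurk Human Intelligence Task (HIT) for each tweet and assign each one to five Workers so that we can correct for bias and quality.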

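tweets = ['in science class right now... urgh... stupid project..',
          'hmmm what to have for breaky?... honey on toast',
          'Working at home x',
          'Headed out of town for a few days. Will miss my girls']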

To submit your tasks to MTurk, you'll need to create an HTML template that will be displayed to Workers for each item. You can either use the HTML template you downloaded earlier or create a template within Jupyter by selecting New->Text File from the Jupyter Home page.

We can rename this file to motionquestion.html and copy in the HTML from the example to get started.

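You may have noticed that in the middle of this HTML there is a reference to a variable called content. We're going to replace this ${content} variable with each tweet when we publish our tasks.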

${content}

Now we can read this HTML from the file and wrap it in the HTMLQuestion XML that MTurk requires. We'll use this question_xml variable later on.

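html_layout = open('./motionquestion.html', 'r').read()
question_xml = """<HTMLQuestion xmlns="http://mechanicalturk.amazonaws.com/AWSMechanicalTurkDataSchemas/2011-11-11/HTMLQuestion.xsd">
        <HTMLContent><![CDATA[{}]]></HTMLContent>
        <FrameHeight>650</FrameHeight>
        </HTMLQuestion>"""
question_xml = question_xml.format(html_layout)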
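Tweets can contain characters such as & or < that are meaningful in HTML, so it may be worth escaping them before they are substituted for ${content}. The sketch below is one way to do that and to confirm the wrapped question is well-formed XML. The build_question helper, the html.escape step, and the CDATA check are additions for illustration and are not part of the original notebook; the frame height of 650 matches the template above.

```python
import html
import xml.etree.ElementTree as ET

HTML_QUESTION_XSD = (
    "http://mechanicalturk.amazonaws.com/"
    "AWSMechanicalTurkDataSchemas/2011-11-11/HTMLQuestion.xsd"
)

QUESTION_TEMPLATE = (
    '<HTMLQuestion xmlns="{xsd}">'
    "<HTMLContent><![CDATA[{body}]]></HTMLContent>"
    "<FrameHeight>{height}</FrameHeight>"
    "</HTMLQuestion>"
)


def build_question(html_layout, tweet, frame_height=650):
    """Substitute an HTML-escaped tweet into the layout and wrap it for MTurk."""
    body = html_layout.replace("${content}", html.escape(tweet))
    # A literal "]]>" would end the CDATA section early and corrupt the XML.
    if "]]>" in body:
        raise ValueError("HTML body must not contain ']]>'")
    return QUESTION_TEMPLATE.format(
        xsd=HTML_QUESTION_XSD, body=body, height=frame_height
    )


# Hypothetical one-line layout standing in for the real HTML template.
sample_layout = "<html><body><p>${content}</p></body></html>"
question = build_question(sample_layout, "cats & dogs <3")

# Parsing succeeds only if the wrapper is well-formed XML.
root = ET.fromstring(question)
print(root[0].text)
```

The returned string can be passed as the Question argument in place of a plain replace() result.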

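In the final setup step, we'll define the attributes that will be applied to each HIT. As noted above, we'll have five Workers review each tweet. We specify this in the definition below, along with parameters indicating that the HIT will be available on worker.mturk.com for no more than an hour and that Workers have up to ten minutes to respond to each item. Each response carries a reward of $0.05, so the total Worker reward for each HIT is $0.25 plus $0.05 in MTurk fees. An appropriate title, description, and keywords are also provided to let Workers know what is involved in this task.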

TaskAttributes = {
    'MaxAssignments': 5,
    # How long the task will be available on MTurk (1 hour)
    'LifetimeInSeconds': 60*60,
    # How long Workers have to complete each item (10 minutes)
    'AssignmentDurationInSeconds': 60*10,
    # The reward you will offer Workers for each response
    'Reward': '0.05',
    'Title': 'Provide sentiment for a Tweet',
    'Keywords': 'sentiment, tweet',
    'Description': 'Rate the sentiment of a tweet on a scale of 1 to 10.'
}
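To make the cost arithmetic explicit, here is a small worked example. The 20% fee rate is an assumption inferred from the figures in the text ($0.05 of fees on $0.25 of rewards); MTurk's current pricing may differ, so treat the rate as a parameter to verify. The hit_cost helper is hypothetical.

```python
from decimal import Decimal


def hit_cost(reward, max_assignments, fee_rate="0.20"):
    """Return (worker_total, fee, total) for one HIT.

    fee_rate is an assumed value inferred from the tutorial's numbers.
    """
    worker_total = Decimal(reward) * max_assignments
    fee = worker_total * Decimal(fee_rate)
    return worker_total, fee, worker_total + fee


# One HIT: five assignments at $0.05 each.
worker_total, fee, total = hit_cost("0.05", 5)

# The tutorial publishes four tweets, so four HITs in total.
batch_total = total * 4
print(worker_total, fee, total, batch_total)
```

Decimal is used instead of float so that small currency amounts add up exactly.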

Create your tasks

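Now we're ready to publish these tasks to MTurk so that Workers can view and complete them. First we'll create a results array to contain information about each HIT we submit. We'll also create a variable to contain the ID of the HIT Type that is generated for this task. Then we'll loop through each tweet in our set and create a HIT using the attributes defined earlier and the question_xml we created. Note that we're replacing the content variable with the tweet. We'll append the tweet and the generated HIT ID to the results variable. The final step is to print a link to the HITs so that they can be viewed on https://worker.mturk.com or https://workersandbox.mturk.com.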

results = []
hit_type_id = ''
for tweet in tweets:
    response = client.create_hit(
        **TaskAttributes,
        Question=question_xml.replace('${content}', tweet)
    )
    hit_type_id = response['HIT']['HITTypeId']
    results.append({
        'tweet': tweet,
        'hit_id': response['HIT']['HITId']
    })

print("You can view the HITs here:")
print(mturk_environment['preview'] + "?groupId={}".format(hit_type_id))
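If you want to exercise the publishing loop without contacting MTurk at all, one option is to pass in a stand-in client. The sketch below assumes only that create_hit returns a dictionary with HIT, HITId, and HITTypeId keys, as the code above relies on. DryRunClient and publish are hypothetical names and are not part of boto3.

```python
import itertools


class DryRunClient:
    """Hypothetical stand-in that mimics the shape of a create_hit response."""

    def __init__(self):
        self._counter = itertools.count(1)
        self.requests = []  # keeps every request so it can be inspected later

    def create_hit(self, **kwargs):
        self.requests.append(kwargs)
        number = next(self._counter)
        return {
            "HIT": {
                "HITId": "DRYRUN-HIT-{}".format(number),
                "HITTypeId": "DRYRUN-TYPE",
            }
        }


def publish(client, tweets, task_attributes, question_xml):
    """Create one HIT per tweet and return (results, hit_type_id)."""
    results = []
    hit_type_id = ""
    for tweet in tweets:
        response = client.create_hit(
            **task_attributes,
            Question=question_xml.replace("${content}", tweet),
        )
        hit_type_id = response["HIT"]["HITTypeId"]
        results.append({"tweet": tweet, "hit_id": response["HIT"]["HITId"]})
    return results, hit_type_id


dry_client = DryRunClient()
dry_results, dry_type_id = publish(
    dry_client,
    ["first tweet", "second tweet"],
    {"MaxAssignments": 5, "Reward": "0.05"},
    "<p>${content}</p>",
)
print(dry_results)
```

Passing the real boto3 client to the same function would then perform the actual publish.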

Getting results

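Depending on the task, results will be available within minutes to hours. In Jupyter, we can run the following to retrieve the status of each HIT and the responses that Workers have provided. Because we're simply updating the results array, we can run this as frequently as we wish until the HITs are completed to get the status of our tasks.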

For each item in the results array, we perform the following steps:

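  1. Get the current status of the HIT and store it in the results array.
  2. Get a list of the Assignments that have been completed for each item and store the count of Assignments completed into the results array.
  3. Loop through each Assignment and capture its details and results into a list of answers.
  4. Approve each Assignment so that the $0.05 reward will be distributed to Workers.
  5. Store the answers in the results array and compute the average response for the item.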

for item in results:

    # Get the status of the HIT
    hit = client.get_hit(HITId=item['hit_id'])
    item['status'] = hit['HIT']['HITStatus']
    # Get a list of the Assignments that have been submitted
    assignmentsList = client.list_assignments_for_hit(
        HITId=item['hit_id'],
        AssignmentStatuses=['Submitted', 'Approved'],
        MaxResults=10
    )
    assignments = assignmentsList['Assignments']
    item['assignments_submitted_count'] = len(assignments)
    answers = []
    for assignment in assignments:

        # Retrieve the attributes for each Assignment
        worker_id = assignment['WorkerId']
        assignment_id = assignment['AssignmentId']

        # Retrieve the value submitted by the Worker from the XML
        answer_dict = xmltodict.parse(assignment['Answer'])
        answer = answer_dict['QuestionFormAnswers']['Answer']['FreeText']
        answers.append(int(answer))

        # Approve the Assignment (if it hasn't been already)
        if assignment['AssignmentStatus'] == 'Submitted':
            client.approve_assignment(
                AssignmentId=assignment_id,
                OverrideRejection=False
            )

    # Add the answers that have been retrieved for this item
    item['answers'] = answers
    if len(answers) > 0:
        item['avg_answer'] = sum(answers)/len(answers)
print(json.dumps(results, indent=2))
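To see what the answer-parsing step is working with, the sketch below parses a sample answer document. It swaps xmltodict for the standard library's xml.etree.ElementTree so it runs without extra packages. The sample XML and the sentiment field name are invented for illustration, and real responses depend on the field names in your HTML form. Unlike the single-field lookup above, this version collects every Answer element, which may help if your form has more than one input.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample of what a Worker's submitted answer might look like.
SAMPLE_ANSWER = (
    '<QuestionFormAnswers xmlns="http://mechanicalturk.amazonaws.com/'
    'AWSMechanicalTurkDataSchemas/2005-10-01/QuestionFormAnswers.xsd">'
    "<Answer>"
    "<QuestionIdentifier>sentiment</QuestionIdentifier>"
    "<FreeText>7</FreeText>"
    "</Answer>"
    "</QuestionFormAnswers>"
)


def _local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def parse_free_text(answer_xml):
    """Return a dict mapping each question identifier to its free-text value."""
    root = ET.fromstring(answer_xml)
    parsed = {}
    for answer in root:
        if _local_name(answer.tag) != "Answer":
            continue
        fields = {_local_name(child.tag): (child.text or "") for child in answer}
        if "QuestionIdentifier" in fields and "FreeText" in fields:
            parsed[fields["QuestionIdentifier"]] = fields["FreeText"]
    return parsed


parsed_answer = parse_free_text(SAMPLE_ANSWER)
rating = int(parsed_answer["sentiment"])
print(parsed_answer, rating)
```

Converting with int() will raise ValueError if a Worker submits non-numeric text, so wrapping that conversion in a try/except may be worthwhile for free-form inputs.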

At the end of the task we'll get a completed array of results that you can then use to train a model or process further.

[
  {
    "tweet": "in science class right now... urgh... stupid project..",
    "hit_id": "3SU800BH87Y9GXNK9PUO0R56NK3UQ6",
    "status": "Reviewable",
    "assignments_submitted_count": 5,
    "answers": [
      2,
      2,
      1,
      4,
      2
    ],
    "avg_answer": 2.2
  },
  {
    "tweet": "hmmm what to have for breaky?... honey on toast",
    "hit_id": "3YLPJ8OXX9JU8WUHGXYYHV35LA2X4J",
    "status": "Reviewable",
    "assignments_submitted_count": 5,
    "answers": [
      7,
      5,
      8,
      7,
      6
    ],
    "avg_answer": 6.6
  },
  {
    "tweet": "Working at home x",
    "hit_id": "3VZYA8PITP447PS6RQS1D9M8BZD502",
    "status": "Reviewable",
    "assignments_submitted_count": 5,
    "answers": [
      7,
      5,
      8,
      5,
      5
    ],
    "avg_answer": 6.0
  },
  {
    "tweet": "Headed out of town for a few days. Will miss my girls",
    "hit_id": "3ULIZ0H1VB4EEDON9W1RE5IGRJ515",
    "status": "Reviewable",
    "assignments_submitted_count": 5,
    "answers": [
      8,
      4,
      5,
      5,
      6
    ],
    "avg_answer": 5.6
  }
]
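With only five ratings per tweet, a single outlier can move the average noticeably, so it may be useful to look at the median and the spread alongside it before using the numbers as training labels. The sketch below uses two of the answer lists from the sample output above; the summarize helper is hypothetical, and the median and standard deviation are additions that the original notebook does not compute.

```python
import statistics


def summarize(answers):
    """Return the mean, median, and population standard deviation of ratings."""
    return {
        "mean": statistics.mean(answers),
        "median": statistics.median(answers),
        "spread": statistics.pstdev(answers),
    }


# Answer lists copied from the first two items in the sample results.
first = summarize([2, 2, 1, 4, 2])
second = summarize([7, 5, 8, 7, 6])
print(first)
print(second)
```

A large spread suggests Workers disagreed about a tweet, which could be a signal to collect more ratings or to leave that item out of the training set.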