Monday, June 14, 2010

Weak quantitative standards in linguistics research? The debate between Gibson/Fedorenko and Sprouse/Almeida

The following is an exchange regarding the nature of linguistic data between Ted Gibson and Evelina Fedorenko in one corner and Jon Sprouse and Diogo Almeida in the other. The exchange was sparked by (i) the Gibson and Fedorenko TICS paper and (ii) an unpublished response to that paper by Jon Sprouse and Diogo Almeida. David provided some preview commentary on the issue here. The exchange below took place over e-mail over the course of a few days. Those involved have given me permission to post it here. I have deleted the separate posts that previously contained pieces of this debate. In a few days I will post a poll to find out what people think. Enjoy...

TED:
You observe that one particular comparison from the linguistics/syntax literature gives a stronger effect than one particular comparison from the psycholinguistics literature (Gibson & Thomas, 1999). From this observation you conclude that the effects linguists are interested in are larger than the effects psycholinguists are interested in.

This is fallacious reasoning. You sampled one example from each of the two literatures and concluded that the literatures are interested in different effect sizes. You need to take a large random sample from each literature in order to draw that conclusion.

Note that it is a tautology to show that you can find two comparisons with different effect sizes: this on its own doesn't demonstrate anything. By choosing different comparisons, I could show you an effect-size comparison that goes the other way. For example:

The "syntax" comparison: 2wh vs. 3wh, where 2wh is typically thought to be worse than 3wh.

1. a. 2wh: Peter tried to remember who carried what.
b. 3wh: Peter tried to remember who carried what when.

The "psycholinguistics" comparison, where center-embedding is typically thought to be worse than right-branching:

2. a. Center-embedding: The ancient manuscript that the graduate student who the new card catalog had confused a great deal was studying in the library was missing a page.
b. Right-branching: The new card catalog confused a great deal the graduate student who was studying in the library the ancient manuscript that was missing a page.

Clearly the effect size in the comparison in (2) is going to be much higher than in (1). I don't think we want to draw the conclusion opposite to the one you drew in your paper.

Indeed the 3wh vs. 2wh comparison (a "syntax" question) is such a small effect as not to even be measurable (which is the point of Clifton et al (2006) and Fedorenko & Gibson (2010)). This is contrary to what has been assumed in the syntax literature (and which was the actual point of our TiCS letter).



JON:
Hi Ted,

Thanks for the comments. It is interesting to note that your comments apply equally well to your own TiCS letter and the longer manuscript that it advertises. I am sure there is a more diplomatic way to do this, but in the interest of brevity, I am going to use your own words to make the point:

You observe that one particular comparison from the linguistics/syntax literature is difficult to replicate with naive subjects in a survey experiment. From this observation, you conclude that the effects reported by linguists are suspect, and that the resulting theories are unsound.

This is fallacious reasoning. You sampled one example from a paper that has 70-odd data points in it (Kayne 1983), and a literature that has thousands, and concluded that this one replication failure means the literature is suspect. You need to do a large random sample from the literature to make the conclusion that you make.

Note that it is a tautology to show that you can find replication failures: this on its own doesn't demonstrate anything. I can show you many such replication failures in all domains of cognitive science. These are never interpreted as a death-knell for a theory or a methodology, so why is this one replication failure such a big problem for linguistic theory and linguistic methodology?

For the record, the point of our letter was to be constructive -- we were trying to figure out how it is that you could claim so much from a single replication failure, especially given that several researchers have reported running hundreds of constructions in quantitative experiments (e.g., Sam Featherston, Colin Phillips) that corroborate linguists' informal judgments. I don't really care if the effect sizes of the two literatures are always of a different magnitude or not (indeed, it is theories, not effect sizes, that determine the importance of an effect). What I do care about is your claim that a single replication failure is more important than the hundreds of (unpublishable!) replications that we've found. Linguists are serious people, and we take these empirical questions seriously... but we haven't found any evidence of a serious problem with linguistic data.

My guess is that you, like many of us on this email, believe that linguistic theory faces some challenges, especially in how to integrate it with theories of real-time processing. Unfortunately, these problems are not the result of bad data (that would be easy to fix). The problem is that science is hard: complex representational theories are hard to integrate with theories of real-time processing, and that is not going to be solved by attaching numbers to judgments.

-Jon


TED AND EV:
[Sprouse quote:]"This is fallacious reasoning. You sampled one example from a paper that has 70-odd data points in it (Kayne 1983), and a literature that has thousands, and concluded that this one 复制 failure means the literature is suspect. You need to do a large random sample from the literature to make the conclusion that you make.

Note that it is a tautology to show that you can find replication failures: this on its own doesn't demonstrate anything. I can show you many such replication failures in all domains of cognitive science. These are never interpreted as a death-knell for a theory or a methodology, so why is this one replication failure such a big problem for linguistic theory and linguistic methodology?"[End Sprouse quote]


We think it is misleading to refer to quantitative evaluations of claims from the syntax literature as "replications" or "replication failures". A replication presupposes the existence of a prior quantitative evaluation (an experiment, a corpus result, etc., i.e. data evaluating the theoretical claim). The claims from the syntax/semantics literature have mostly not been evaluated in a rigorous way. So it's misleading to talk about a "replication failure" when there was only an observation to start with.

In the cases that we allude to in the TiCS letter and in another, longer, paper ("The need for quantitative methods in syntax", currently in submission at another journal; an in-revision version is available here), the quantitative experiments that have been performed do not support the patterns claimed in the original papers. The worry is that there may be many more such cases in the literature, and that makes it difficult to interpret the theoretical claims.

Second, we did not find just one example. We have documented several, most of them observed by others. See our longer paper. We believe there are many more.

In any case, we do not think that all or most of the judgments in the literature are wrong. Suppose that 90% of the judgments are right or approximately right. The problem is knowing which 90% to build the theory on. If a paper has 70 relevant examples in it (as in the example paper Sprouse mentions), that means roughly 63 of them are correct. But which 63? 70 choose 63 is about 1.2 billion. That's a lot of potentially different theories. To be rigorous, why not do the experiment? As we observe in the longer paper, with the arrival of Amazon's Mechanical Turk it has never been easier to run such experiments, especially for English. (Furthermore, many new large corpora are now available – from different languages – that can be used to evaluate hypotheses about syntax and semantics.)
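
A quick sketch of the combinatorial arithmetic behind that figure, assuming one checks it with Python's math.comb (the 70 and 63 are simply the numbers from the paragraph above):

    from math import comb

    # Number of ways to choose which 63 of the 70 reported judgments are the correct ones.
    print(comb(70, 63))  # 1198774720, i.e. roughly 1.2 billion candidate sets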

[Sprouse quote:]"What I do care about is your claim that a single 复制 failure is more important than the hundreds of (unpublishable!) 复制s that we've found. Linguists are serious people, and we take these empirical questions seriously... but we haven't found any evidence of a serious problem with linguistic data."


Aside from the problem with using the term "replication" in this context (discussed above), our experience in evaluating claims from the syntax/semantics literature differs from Sprouse's. When we run quantitative experiments to evaluate claims from the syntax/semantics literature, we typically do not find exactly the pattern of judgments reported by the researchers who originally made the claim. The experimental results are almost always more nuanced, so we learn a lot from the quantitative experiments (e.g., effect sizes across different comparisons, relative patterns across different constructions, overall variability, etc.).

[Sprouse quote:] "My guess is that you, like many of us on this email, believe that linguistic theory faces some challenges, especially in how to integrate it with theories of real-time processing. Unfortunately, these problems are not the result of bad data (that would be easy to fix). The problem is that science is hard: complex representational theories are hard to integrate with theories of real-time processing, and that is not going to be solved by attaching numbers to judgments."


We have never claimed that running quantitative experiments will solve every interesting question about language. But we do think it is a prerequisite, and that running quantitative experiments will solve some of the problems. So we don't see how these fields could be hurt by that kind of rigor.

Regretfully,

Ted Gibson & Ev Fedorenko


DIOGO:
Hi Ted, hi Ev (and hi everyone),

Thanks for the comments on our unpublished letter, and for pointing us to your longer article under review.

Jon has already touched upon most of the issues I was going to bring up. However, there is still at least one important point that I would like to raise here in response to some of your comments in your last e-mail, which are also made in your TICS letter and the longer manuscript you provided us with. Namely, I think you profoundly mischaracterize the way linguists work when you say things like:

"the prevalent method in these fields involves evaluating a single sentence/meaning pair, typically an acceptability judgment performed by just the author of the paper, possibly supplemented by an informal poll of colleagues." (from the TICS letter)

"...research in syntax, which typically involves only a single experimental trial." (from p. 7)

"The claims from the syntax/semantics literature have mostly not been evaluated in a rigorous way. So it's misleading to talk about a "replication failure" when there was only an observation to start with." (from last e-mail)


Nothing could be further from the truth. It is simply inaccurate to claim that linguists have lower methodological standards than other cognitive scientists simply because linguists do not routinely run formal acceptability judgments. Linguists test their theories in the exact same way other scientists do: By running experiments for which they (a) carefully construct relevant materials, (b) collect and examine the resulting data, (c) seek systematic replication and (d) present the results to the scrutiny of their peers. There is no "extra" rigour that comes from being able to run inferential statistics beyond what you get from thoughtfully evaluating theories, and systematically investigating the data that bear upon them (in the case of linguistics, through repeated single-subject informal experiments that any native speaker can run).

When linguists evaluate contrasts between two (or more) sentence types, they normally run several different examples in their heads, they look for potential confounds, and consult other colleagues (and sometimes naive participants), who evaluate the sentence types in the same fashion. The fact that this whole set of procedures (aka, experiments) is conducted informally does not mean it is not conducted carefully and systematically. I cannot stress this enough: The notion that (a) linguists routinely test their theories with only one specific pair of tokens at a time, (b) proceed to publish papers based on the evaluation of this single data point, and (c) that the results of this single subject/token experiment receive no serious or systematic scrutiny by other linguists, is entirely without basis in reality (eg, see Marantz 2005 and Culicover and Jackendoff's response to your TICS letter).

The only difference between linguists and other scientists is that, in order to assess the internal validity of their experiments (and they are experiments), linguists tend not to rely on inferential statistical methods. One likely reason for this is that linguists are usually looking at rather large contrasts, where it does not take many trials to become confident in one's intuitions. Incidentally, in these cases the statistics would not require many trials either: if a linguist were to run a sign test, five trials all going in the same direction would already guarantee a statistically significant result at the .05 level (and linguists typically evaluate many more tokens than that, by the way).
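
A quick sketch of that sign-test arithmetic, assuming one were to code it in Python with scipy (the five-trial case is the hypothetical example described above, not data from this exchange):

    from scipy.stats import binomtest

    # One-tailed sign test: all 5 informal trials go in the predicted direction,
    # against a chance rate of 0.5 per trial.
    result = binomtest(k=5, n=5, p=0.5, alternative="greater")
    print(result.pvalue)  # 0.5**5 = 0.03125, below the .05 level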

But what happens in the case where the hypothesized contrast is not that obvious? In these cases, linguists would do what any scientist does when confronted with unclear results: they would try to replicate the informal experiment (eg, by asking colleagues/naive subjects to evaluate instances of contrasts of the relevant type), or would seek alternative ways of testing the same question (eg, by running a formal acceptability judgment survey). It is understandable why linguists have historically preferred to take the first course of action: Replicating informal experiments is faster and cheaper, and systematic replication (the gold standard of scientific experimentation) provides the basis for the external validity of the results.

Sincerely,
Diogo

ps: I also think you are overstating the case that formal acceptability judgment experiments routinely reach different conclusions from established contrasts in the linguistic literature, and you are overinterpreting the implications of the handful of replication failures that you cited. I won't go into detail here in the interest of brevity, but I would be happy to share my thoughts in a future e-mail if you are interested.


TED:
Dear Diogo:

Thanks for your thoughtful response to my earlier emails. Let me jump right to the point:

You said:
Linguists test their theories in the exact same way other scientists do: By running experiments for which they (a) carefully construct relevant materials, (b) collect and examine the resulting data, (c) seek systematic replication and (d) present the results to the scrutiny of their peers. There is no "extra" rigour that comes from being able to run inferential statistics beyond what you get from thoughtfully evaluating theories, and systematically investigating the data that bear upon them (in the case of linguistics, through repeated single-subject informal experiments that any native speaker can run).

... The fact that this whole set of procedures (aka, experiments) is conducted informally does not mean it is not conducted carefully and systematically.


I am sorry to be so blunt, but this is just incorrect. There *is* extra rigor from (a) constructing multiple examples of the targeted phenomena, which are normed for nuisance factors; and (b) evaluating the materials on a naive experimental population.

Both points are important, but the second is the one that I find many language researchers underestimate. The problem with not evaluating hypotheses on naive populations (with materials that control for nuisance factors, etc.) is that researchers and their friends have unconscious cognitive biases that make their judgments of these materials close to worthless. (That sounds harsh, but unfortunately it's true.) I know this first-hand. As we discuss in the longer paper (see pp. 16-20), if you read my PhD thesis, there are plenty of judgments that turn out to be incorrect, probably because of the cognitive biases of me and the people I asked. Here is one example of such a faulty judgment from the thesis: it was claimed that nested relative clause structures are much more complex to understand when they modify the subject, as in (1), than when they modify the object, as in (2) (Gibson, 1991, examples (342b) and (351b), pp. 145-147):

(1) The man that the woman that the dog bit likes eats fish.
(2) I saw the man that the woman that the dog bit likes.

That is, (1) was claimed to be harder to process than (2). When I was doing this research, I asked lots of people, and they all agreed. I constructed all sorts of similar versions. The many people I asked consistently agreed that (1) was worse than (2).

But if you run the experiment on naive subjects, with plenty of fillers and so on, it turns out that there is no such effect. I have run this comparison maybe five times, and I have never found any difference. Both get ratings that are not very good (relative to lots of other things), but there is never a difference between the two structures in the predicted direction.

The problem here is most likely cognitive bias. I had a theory that predicted a difference, and all of my informants had a similar theory (basically, that more nesting leads to greater processing difficulty, as suggested by Miller & Chomsky (1963) and Chomsky & Miller (1963)). So we used the theory to arrive at the judgments that the theory predicted.

If you read the literature on cognitive biases, this is a standard effect. To quote from our longer paper:

"In Evans, Barston & Pollard's experiments (1983; cf. other kinds of confirmation bias, first demonstrated by Wason in 1960; see Nickerson, 1998, for an overview of many similar kinds of cognitive bias experiments), people were asked to judge the acceptability of logical arguments. Although the experimental participants were sometimes able to use logical operations in judging the acceptability of the arguments presented to them, they were most strongly influenced by their knowledge of the real-world plausibility of the conclusions, independent of the validity of the arguments. They therefore often made judgment errors, because they were unknowingly biased by the likelihood of events in the real world.

More generally, when people are asked to judge the acceptability of linguistic examples or of arguments, they appear unable to ignore potentially relevant sources of information, such as world knowledge or theoretical assumptions. For example, the theoretical assumption that structures with more open linguistic dependencies are more complex to understand than structures with fewer open dependencies may lead experimental participants to judge examples with more open dependencies as more complex..."

One of the main points of our paper is that being careful and thinking hard is simply not enough. It is just not rigorous enough to avoid the effects of unconscious cognitive biases. To be rigorous, you really need a quantitative evaluation over naive subjects: a corpus analysis or a controlled experiment.

You said:
The only difference between linguists and other scientists is that, in order to assess the internal validity of their experiments (and they are experiments), linguists tend not to rely on inferential statistical methods. One likely reason for this is that linguists are usually looking at rather large contrasts, where it does not take many trials to become confident in one's intuitions. Incidentally, in these cases the statistics would not require many trials either: if a linguist were to run a sign test, five trials all going in the same direction would already guarantee a statistically significant result at the .05 level (and linguists typically evaluate many more tokens than that, by the way).


The point I made in response to Jon's earlier comments still holds. If you want to claim that the effect sizes linguists tend to examine are larger than the effect sizes psycholinguists examine, then you need to demonstrate that. You can't just state it and expect everyone else to accept your assumption. I personally very much doubt that it is true. I have read hundreds of syntax/semantics papers, and in most of them there are plenty of questionable judgments, which are probably comparisons with small or null effects.

Ted Gibson


DIOGO AND JON:
Dear Ted,

Thanks for your reply. This is a joint response from Jon and me.

[Gibson quote:] Let me jump right to the point: "I am sorry to be so blunt, but this is just incorrect. There *is* extra rigor from (a) constructing multiple examples of the targeted phenomena, which are normed for nuisance factors;"


We fully agree with the need for multiple items. In fact, we just told you that linguists routinely evaluate multiple instances of any proposed contrast between sentence types. On this point there is simply no difference between linguists and psycholinguists (cf. Marantz 2005).

We completely disagree with the priority you assign to results from naive participants. There is no particular reason to appoint the average of 30-odd college students as the arbiter of truth. Merely finding a discrepancy between what a researcher thought would happen and the experimental results from a group of naive subjects is not particularly informative, especially if it is of the "failure to replicate" kind. There are several reasons why an unexpected null result could have nothing to do with "cognitive bias":

(1) Lack of statistical power

For example, in Gibson and Thomas (1999), you claim that, contrary to the originally motivating intuition, you did not find (b) rated better than (a):

a. *The ancient manuscript that the graduate student who the new card catalog had confused a great deal was studying in the library was missing a page.

b. The ancient manuscript that the graduate student who the new card catalog had confused a great deal was missing a page.

In our unpublished letter (see figure), we show that this is most likely an issue of power, because the effect is definitely there (it's just small and requires a large sample to have a moderate chance of being detected).

(2) Problems with the experimental design, for example:

(i) The task used in the experiment is not necessarily sensitive to the manipulation

For example, why do you assume that an acceptability task should be equally sensitive to every kind of processing difficulty? It could well be that acceptability judgments are simply not the right measure to use here.

(ii) The design or task used in the experiment is not optimal for revealing the effect of interest

Data from Sprouse & Cunningham (under review, p. 23; figure 8 attached) suggest that the contrast between (a) and (b) above can be detected, with a sample half the size of Gibson & Thomas's (1999), using a magnitude estimation task with a low-acceptability reference sentence (but not at all when a high-acceptability reference sentence is used).

None of these explanations invoke cognitive biases. We don't necessarily disagree that cognitive biases are a potential problem. We just think that before you invoke them as an explanation, (1) you need positive evidence, and (2) a failure to replicate the results of an informal experiment in a formal experiment is not positive evidence. In fact, had you assigned the kind of priority to formal experimental results with naive participants that you seem to advocate in your previous e-mail, you would have been misled by the Gibson & Thomas (1999) data, and would have concluded the contrast is not real. In fact you yourself explained the result by appealing to the offline nature of the test, and not to cognitive biases, so why should cognitive bias be the null hypothesis for linguistics?

Moreover, we can also find the exact opposite pattern: a clear experimental effect that confirms the original expectation, but that the experimenters take as evidence against it. Take Wasow & Arnold (2005). In your longer manuscript, you write:

"Wasow & Arnold (2005) discuss three additional cases: the first having to do with extraction from datives; the second with the syntactic flexibility of idioms; and the third with the acceptability of certain heavy-NP shift items discussed in Chomsky (1955/1975). The first of these seems especially relevant to Phillips' request. In this example of faulty judgment data, Fillmore (1965) observed that sentences like (1) are ungrammatical:

(1) Who did you give this book?

Langendoen, Kalish-Landon & Dore (1973) tested this hypothesis in two experiments, and found that many participants ("at least one-fifth") fully accepted the grammaticality of sentences of this type. Wasow & Arnold note that this result has had little impact on the syntactic literature." (pp. 13-4)

And it shouldn't. If only one fifth of the sample in Langendoen et al. (1973) failed to show the expected contrast, the results are not problematic at all. In fact, they are actually highly significant, and overwhelmingly support the original proposal: A simple one-tailed sign test here would give you a p-value of 1.752e-09 and a 95% CI for the probability of finding the result in the predicted direction of (0.7-1). Let me stress this again: what the experiment is actually telling you is that the results support the linguist's informal experiments, not the contrary, as Wasow & Arnold seem to think.

The same goes for Wasow & Arnold's (2005) own acceptability experiment. They decided to test Chomsky's (1955) intuitions about the interaction between verb particles and the complexity of the object NP. They tested the following paradigm, in which the objects in (a-b) are considered more complex than the objects in (c-d).

a. The children took everything we said in. (1.8)
b. The children took in everything we said. (3.3)
c. The children took our instructions in. (2.8)
d. The children took in our instructions. (3.4)

Chomsky's intuition was that (c) sounds more natural than (a), and that (b) and (d) should be equally acceptable. That is exactly what Wasow & Arnold (2005) found (see the mean acceptability on a 4-point scale next to each condition above), with highly significant results. The results also replicated in another of their conditions, omitted here for brevity. Nevertheless, Wasow & Arnold (2005) claim:

"There was considerable variation in the responses. In particular, although the split examples with complex NPs received mean scores far lower than any of the others (that is, the results support Chomsky's intuition), 17% of the responses to such sentences were scores of 3 or 4. That is, roughly one time in six, participants judged such examples to be no worse than awkward."

Once again, the results are highly significant and support, rather than undermine, the linguist's original intuition, yet Wasow & Arnold (and your paper, which cites them approvingly) seem to conclude the opposite of what the experimental data they present actually show.

So where is this extra rigor that one gets by simply running formal acceptability judgments? It just seems to us that simply running a formal acceptability experiment with naive participants does not preclude one from being misled by one's results any more than what happens in the case of informal experiments.

Sincerely,
Diogo & Jon


TED AND EV:
Dear Diogo & Jon:

The point is *not* that quantitative evidence can solve every problem in language research. The point is just that having quantitative data is a necessary, but not sufficient, condition. That's all.

Without the quantitative evidence you just have a researcher's potentially biased judgment. I don't think that that's good enough. It's not very hard to do an experiment to evaluate one's research question, so one should do the experiment. One is *never* worse off after doing the experiment. You might find that the issue is complex and harder to address than you thought before doing the experiment. But even that is useful information.

I don't have anything more to say on this for now. Some day, I would be happy to debate you in a public forum if you like.

Best wishes,

Ted (& Ev)


DIOGO:
Dear Ted,

Let me just add a few remarks in response to your last e-mail, and then I don't think I have anything more to say on the matter either. Thanks for engaging with us in this discussion.

[Gibson quote:] "The point is *not* that quantitative evidence can solve every problem in language research. The point is just that having quantitative data is a necessary, but not sufficient, condition. That's all."


And the point Jon and I are trying to make is that having quantitative data for linguistic research, while potentially useful, is not always necessary. The implication of your claim is also far from uncontroversial: it implies that linguistics, where quantitative methods are not widely used, fails to live up to a "necessary" scientific standard. We think this is both false and misguided.

[Gibson quote:] "Without the quantitative evidence you just have a researcher's potentially biased judgment. I don't think that that's good enough."


Here's the thing: a published judgment contrast in the linguistic literature, especially if it is a theoretically important one, has been replicated hundreds of times in informal experiments. When the contrast is uncontroversial, it will keep being replicated nicely and will attract no further attention. However, when the contrast is a little shaky, linguists are keenly aware of it, and weigh the theory it supports (or rejects) accordingly. Finally, when the contrast is not really replicable, it is actually challenged, because that is the one thing linguists do: they try to test their theories, and if some part of the theory is empirically weak, it will be challenged. I highly doubt that cognitive biases could play any significant role in this systematic replication process.

Now, here's where this methodology is potentially problematic: if there is a judgment contrast from a language for which there are very few professional linguists who are also native speakers, and for which access to naive native speakers is limited. In this case, it is possible that a published judgment contrast will go unreplicated, and if faulty, could lead to unsound conclusions. In these cases, I totally agree that having quantitative data is probably necessary. But note that the problem here is not the lack of quantitative data to begin with; the problem is the lack of systematic replication. Quantitative data only serves as a way around this problem.

[Gibson quote:] "It's not very hard to do an experiment to evaluate one's research question, so one should do the experiment."


The point is that linguists DO the experiment. They just do it informally.

[Gibson quote:] "One is *never* worse off after doing the experiment. You might find that the issue is complex and harder to address than you thought before doing the experiment. But even that is useful information."


The question is not whether or not one is worse off after doing the formal experiment. The question is whether or not one is necessarily better off.

There is a very clear cost in running a formal experiment versus an informal experiment. Formal experiments with naive participants take time (IRB approval, advertising on campus, having subjects come to the lab to take the survey, or setting up a web interface so they can do it from home, etc.), and potentially money (if you don't have a volunteer subject pool, or if you use things like Amazon's Mechanical Turk). If you want linguists to adopt this mode of inquiry as "necessary", you have to show them that they would be better off doing so. That is the part where it is really not clear that they would be.

You can try to show this in two ways: You can show linguists that (1) they get interesting, previously unavailable data, or (2) show them that they are being misled by their informal data-gathering methods and that running the formal experiment really does fix that. Because otherwise, what is the point? If linguists just confirm over and over again that they keep getting the same results from naive participants as they get with their informal methods (and this is what linguists like Jon, Sam Featherston, Colin Phillips and others keep telling you happens), then why should they bother going through a much slower and much more costly method that does not give them any more information than their quick, informal, but highly replicable method does?

Best wishes,
Diogo

10 comments:

Shuichi Yabe said...

Diogo's position (and perhaps Jon's too) seems a bit inconsistent to me.

On the one hand, he says: "But what happens in the case where the hypothesized contrast is not that obvious? In these cases, linguists would do what any scientist does when confronted with unclear results: they would try to replicate the informal experiment (eg, by asking colleagues/naive subjects to evaluate instances of contrasts of the relevant type), or would seek alternative ways of testing the same question (eg, by running a formal acceptability judgment survey)."

But on the other hand, Diogo and Jon say "So where is this extra rigor that one gets by simply running formal acceptability judgments?", suggesting that there is no such extra rigor.

These two statements seem to me to contradict each other.

Shuichi Yabe

Diogo said...

Hi Shuichi,

I don't think there is any contradiction there. In cases where the results from an informal acceptability judgment are not so clear, linguists can try (i) replication (which they do), or (ii) a different way of getting at the same question, which can include looking at different data that bears on the same question (which they also do) or looking at the same contrast from a different angle (eg, doing a formal experiment, which they sometimes also do). It is not clear that any of these strategies is inherently better than the others. It's an empirical question, and so far, there is little evidence to claim that formal acceptability judgment experiments are the only way to go.

Shuichi Yabe said...

Diogo,

Thanks for the explanation.

In the original context of the second quote above, you're comparing Wasow & Arnold's formal experimentation with Chomsky's reliance on his own intuition, so I took you to be saying that doing formal experiments doesn't add anything beyond the "method" of relying on one's own intuition. I'm glad to learn that that wasn't your intention.

Brian Barton said...

(Part 1 of 2)

Since I don't do language research, this debate has essentially no personal stakes for me, which leaves me with no dog in the fight. That lets me think about the whole thing abstractly, so I will lay out my thoughts in those abstract terms.

The argument seems to boil down to this: one side says that, in order to obtain scientifically valid results, one must take every reasonable step to ensure that all variables other than the experimental variable are held constant. Yes, this is not always achievable, but it is a goal to strive for, and it allows others to examine and replicate the findings with as few potential alternative explanations as possible.

Scientists on all sides generally accept the position argued for here. Having TA'd for a psychology research methods class, this is exactly the sort of thing I try to drill into my students' heads, and I have given some version of that paragraph many times.

The other side says: agreed, this is indeed something to strive for, but "informal" experiments, meaning a very small sample consisting of the researcher and perhaps a few colleagues, ruling out possible alternative explanatory variables only approximately, get the job done for linguistics. In short, linguistics is a special case.

The argument in favor on this side has two basic components: one, that in many cases there is little extra benefit to running "formal" experiments over "informal" ones, and two, that the costs of "formal" experiments outweigh that marginal benefit. The benefits, the argument goes, only really matter when the effects are small, while the costs are time and money (seeking IRB approval, paying subjects, database fees, buying testing equipment, and so on).

I think these are fair characterizations of the positions, stripped of the details of the individual papers, the stakes involved, or any personal sense that one's own methods are under attack. I think this abstract picture of things may be illuminating here. In the end, this debate comes down to what kind of scientist you want to be. Are thought experiments enough? Is narrowing the range of possible conclusions (for example, reducing but not eliminating experimenter bias) enough? Or is it necessary to do everything reasonably possible to rule out alternative explanations?

Brian Barton said...

(Part 2 of 2)

For me, the question becomes "what is 'every reasonable measure'?", and I think that is where I differ from Diogo and Jon. I don't think the costs of getting IRB approval, recruiting naive subjects (paid or unpaid), paying for testing equipment, and so on are all that high. To put it in perspective, an acceptable rate for a behavioral experiment is $10 an hour. And if linguistic experiments typically have large effects that don't require many trials, then most experimental procedures should take about an hour, with a small number of subjects, say 10. To run 10 subjects, four computers that can run Matlab at $500 each, plus a Matlab license, is reasonable. That's a startup cost of about $2500. Time costs vary with many factors, so I won't try to estimate them, but I think it is reasonable to conservatively assume that 3 months is enough to program an experiment, buy the computers, and get it past the IRB. So that's 3 months of lead time, $2500 in startup money, plus $10 per participant-hour of experiment for 10 participants. Even if my numbers are off by a wide margin, this is a long way from very expensive or impossibly time-consuming, and I have only been considering the cost of a "formal" experiment over an "informal" one. That's far less than many of the campus-wide awards that universities hand out every year.

Let's compare that to work requiring major equipment, such as magnetic resonance imaging (assuming, say, that the lab joins a campus that already has a scanner, so we're not considering the costs of buying a scanner, maintaining it, etc., which largely fall on the imaging center — just the lab's usage fees). Here, scanner use runs about $500 an hour, plus all the same costs I listed in the previous example (actually higher, because processing the data requires better computers than running the experiment does, but let's set that aside for now). These costs are enormous, they add up fast, and they have a big influence on the research (you'd better have your design nailed down and know what effect to expect: no "let's see what happens if I try this"), but we live with them because of the advantages.

Full disclosure: I work in a functional MRI lab, but I previously worked in a behavioral lab, so I have dealt with both of these cost levels, which is almost certainly why I chose these examples. That said, I personally think the cost of running a "formal" experiment rather than an "informal" one, and being able to rule out experimenter bias and other potential confounding variables, is a price worth paying (aside: the point here is not whether I am biased, which I would like to think I am not, but that others can conclude that the chance of my biases influencing my results has been minimized). I am not in a position to decide whether that is right in the special case of linguistics (maybe funding is much harder to get than what I am familiar with, or maybe slowing yourself down by the time it takes to get IRB approval would put you at a competitive disadvantage relative to other linguists), but at this abstract level, I think "formal" experiments fall into the category of "every reasonable measure."

Greg Hickok said...

Although this discussion may represent the longest post on record on TB, I think it is an interesting issue and well worth discussing. Certainly, many psychologists view linguistic research as less rigorous than fields that regularly calculate t-values. Ted and Ev's urging, if linguists followed it, would have linguistics doing more experiments, which from the outside looks like better science. (On the other hand, it unfortunately rests on the mistaken assumption that linguistics is not already a science.) All participants agree that for many of the judgments linguistic theory depends on, experiments certainly should be done. Jon is at the forefront of linguists doing exactly that.

But a caution about quantitative rigor: none of us has the time or resources to quantify every detail of an experiment. For example, those of you who run language experiments, do you quantify your subjects' level of language proficiency (or reading skill)? Or do you trust them when they say they are native speakers (skilled readers)? (Mostly, we trust them.) Have you ever thrown out a subject because, even though they said they were a native speaker, they had a heavy accent? (Yes.) Was that decision based on quantitative data or on the experimenter's intuition? (Intuition.)

More to the point: when you design a rigorous quantitative psycholinguistic experiment that uses reading times to measure the effect of x on sentence processing, do you pre-test all of your stimuli to make sure that a random sample of 30 subjects judges every stimulus to be grammatical? Or do you use your own linguistic intuition to decide whether your stimuli are part of your subjects' language? You use, and trust, your own intuition! Linguists are doing the same thing.

Some things simply don't need to be tested experimentally: "plant" is ambiguous, the Necker cube is bistable, "Who did you see Mary?" is ungrammatical, while "Who did you see Mary with?" is perfectly fine.

In the real world, with limited resources, we really do need to make rational decisions about which aspects of an experiment to quantify. If we occasionally make the wrong choice, we can rest assured that a reviewer will point out the error, and then we will run Experiment 3B. That is how science works.

In my view, the more dangerous situation (in terms of slowing progress) is weak science hiding behind quantitative rigor.

JS said...

An important issue came up in the course of this exchange, but was not really dealt with. At one point, Diogo made the following reply to an argument from Ted & Ev's manuscript:

Ted & Ev's argument: "...Langendoen, Kalish-Landon & Dore (1973) tested this hypothesis in two experiments, and found that many participants ("at least one-fifth") fully accepted the grammaticality of sentences of this type. Wasow & Arnold note that this result has had little impact on the syntactic literature."

Diogo's response: "And it shouldn't. If only one fifth of the sample in Langendoen et al. (1973) failed to show the expected contrast, the results are not problematic at all. In fact, they are actually highly significant, and overwhelmingly support the original proposal: A simple one-tailed sign test here would give you a p-value of 1.752e-09 and a 95% CI for the probability of finding the result in the predicted direction of (0.7-1). Let me stress this again: what the experiment is actually telling you is that the results support the linguist's informal experiments, not the contrary, as Wasow & Arnold seem to think."

The implication here is that there is no distinction between different degrees of acceptability; there is only a binary choice between grammatical and ungrammatical. Whether 20% of subjects judge a sentence acceptable or 1% do makes no theoretical difference. It seems to me that this black-and-white approach has become the norm when it comes to grammatical theory. It also helps justify the use of single observations. If linguistic theories made more probabilistic predictions about grammaticality, there would be more motivation to collect large samples from naive subjects.

Diogo and Jon said...

Thanks for your input, JS. I think there are two issues here that need to be discussed separately.

The first is your assumption that linguists ignore differences in degree of acceptability. This just isn't true. Throughout the history of linguistic theory, arguments have been made on the basis of effect sizes. For example, in the 1980s there was a classic (and notorious) distinction between Subjacency violations and ECP violations, both of which arise in what appear to be exactly the same structural configurations, and both of which lead to extreme unacceptability. Part of the evidence for the distinction was that ECP violations were consistently judged to be worse than Subjacency violations (e.g., Huang, 1982; Lasnik and Saito, 1984; Chomsky, 1986).

The second is your assumption that all of the variability in acceptability judgments should be accounted for by the theory of grammar. This is a strong claim to make, which you can see by applying it to psycholinguistics rather than linguistics. In the lexical access literature, repetition priming (reaction times to words in a lexical decision task are faster on the second presentation than the first) is as close to a blow-out effect as you are going to get. We just looked at some old repetition priming data that Diogo has. If we look at the individual subject data, it turns out that 4 out of 21 subjects show negative priming (ie, they are slower for the second presentation when compared to the first). In other words, about 20% of the sample shows the opposite effect from what is theoretically predicted. Should psycholinguists interested in repetition priming be required to account for the 20% of the data that goes in the wrong direction?

Most psycholinguists would say no (or at least not necessarily). The theory of lexical access is not a theory of all reaction times. It is one factor that influences reaction times, but we assume there are many other factors that could contribute to this variability. Reaction times are just one way of revealing facts about the mechanisms of lexical access. The same goes for the theory of grammar. The theory of grammar is not a theory of acceptability judgments. We assume that grammar is a major predictor of acceptability judgments, but a complete theory of acceptability judgments would include many other factors. Acceptability judgments are just one way of revealing facts about grammatical representations. Just as with reaction times, whether any given instance of variability in acceptability judgments should be covered by the theory is an empirical question that can't be decided merely from the existence of the variability. As you might expect, this is an active research question in linguistics, with several different proposals on the table (see, e.g., the work of Sam Featherston and Frank Keller).

Colin Phillips said...

It is hard to resist the temptation to quote Henry Gleitman: "If an experiment is not worth doing, it is not worth doing well."

Let us not forget that one of the things that is at stake is how best to make use of scarce resources. Almost all of us are using money that comes from students' tuition or from taxpayers' pockets, and when we run experiments we are typically spending the valuable time of the young researchers who carry out those experiments. We have a responsibility to make good use of these resources. Much of the time we are also competing for scarce resources by writing grant proposals that argue that our research is a priority and good value for money.

Some linguistic facts are so obvious that testing them experimentally would be a waste of resources. For example, testing whether English speakers find "John didn't leave" more acceptable than "John left not". Other phenomena are more subtle and call for more fine-grained testing. (But in those cases, one might experimentally confirm a subtle difference and then use that difference to make perfectly reasonable categorical judgments.) And then there are many other cases in between, where one simply has to make a value judgment and, in good conscience, figure out whether fancier testing is worthwhile. There is nothing unusual about this.

I might add that in discussions such as this one, one often encounters the complaint that experimental psychologists would take linguists more seriously if only linguists spent more time doing the things that experimental psychologists spend their time doing. This claim does not survive closer scrutiny. If one controls for methodological practice by looking at linguists and psychologists who use very similar experimental methods in their research, and asks whether this removes the cross-field skepticism or even disdain, it becomes clear that the mistrust persists, alas. Similarly, if one looks within one scientific community for the individuals who are most valued by researchers in the other field, it becomes clear that the best predictor of influence is shared interests, shared questions, or shared conclusions, not shared methods. So the notion that linguists' arguments would get a more serious hearing if they dressed them up with a few simple judgment experiments does not, I think, hold up.

Philip Hofmeister said...

I'm sorry I found this interesting conversation so late---I found both perspectives illuminating. But I wanted to add several points in defense of what Ted & Ev have advocated.

First of all, I find the claim that linguists, when making informal judgments, control for nuisance variables and use other scientific methods for controlling for personal bias a bit utopian. I've only worked in linguistics for 10 years, but judging from my experience with scores of syntacticians and semanticists, this is a mischaracterization of perhaps 95% of cases. Most often, linguists come up with one or two examples and run them by a few close friends or colleagues. So, if what Diogo and Jon describe actually happened, that would be another story, but I don't think it does. What's more, there is no record of the data, nor any accompanying notes on how the data were gathered, what variation there was, etc. That makes it hard for future researchers to evaluate, and it is another difference between linguistics and standard practice in other fields.

Second, with regard to "replicability", I think Jon and Diogo have slightly misrepresented how linguistic research advances. In particular, when a researcher publishes a paper listing judgments for some set of stimuli, those judgments are taken as truth, unless strong evidence (in the form of corpus examples, formal experiments, etc.) proves them otherwise. Submitting a paper with simply differing judgments from some prior author does not fly in the linguistics world. In this sense, I wouldn't call repetition of these judgments replication, so much as citing or referencing someone else's findings.

Finally, the strength of Ted & Ev's point can be seen in the publication of erroneous conclusions and data in the syntax literature. To give one example, Haegeman's widely used 1997 syntax textbook contains some incredibly questionable judgments. Some sentences judged ungrammatical have serious confounds (e.g., the sentence can't have a coherent meaning), yet the judgments assigned to those sentences carry far-reaching theoretical significance. In many other such cases, judgments about a particular construction vary from researcher to researcher, and drastically different theories are built on them. For example, in work on the structure of "either . . . or" disjunctions, I have found that some linguists' judgments about the possible positions of "either" differ from those of a group of naive subjects. As Ted says, I'm sure there are many more cases like this.

My two cents . . .